Building Advanced Multi-Endpoint Machine Learning APIs with LitServe
In this tutorial, we delve into LitServe, a lightweight framework designed for deploying machine learning models as APIs with minimal effort. We will build and test multiple endpoints that showcase functionalities such as text generation, batching, streaming, multi-task processing, and caching—all running locally without relying on external APIs. By the end of this tutorial, you will have a clear understanding of how to design scalable and flexible ML serving pipelines suitable for production-level applications.
Setup and Installation
We begin by setting up our environment on Google Colab, installing all required dependencies, including LitServe, PyTorch, and Transformers:
!pip install litserve torch transformers -q
Next, we import the necessary libraries:
import litserve as ls
import torch
from transformers import pipeline
import time
from typing import List
Creating Text Generation and Sentiment Analysis APIs
We create two APIs using the LitServe framework:
Text Generator API
class TextGeneratorAPI(ls.LitAPI):
    def setup(self, device):
        # Load a small GPT-2 variant; LitServe may pass "cuda:0", so match by prefix
        # and fall back to CPU when no GPU is available
        self.model = pipeline(
            "text-generation",
            model="distilgpt2",
            device=0 if str(device).startswith("cuda") and torch.cuda.is_available() else -1,
        )
        self.device = device

    def decode_request(self, request):
        return request["prompt"]

    def predict(self, prompt):
        result = self.model(prompt, max_length=100, num_return_sequences=1, temperature=0.8, do_sample=True)
        return result[0]["generated_text"]

    def encode_response(self, output):
        return {"generated_text": output, "model": "distilgpt2"}
Batched Sentiment Analysis API
class BatchedSentimentAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline(
            "sentiment-analysis",
            model="distilbert-base-uncased-finetuned-sst-2-english",
            device=0 if str(device).startswith("cuda") and torch.cuda.is_available() else -1,
        )

    def decode_request(self, request):
        return request["text"]

    def batch(self, inputs: List[str]) -> List[str]:
        # Requests that arrive together are passed through as a single list
        return inputs

    def predict(self, batch: List[str]):
        # The pipeline accepts a list natively, so the whole batch runs in one call
        results = self.model(batch)
        return results

    def unbatch(self, output):
        # One result per request, returned in the original order
        return output

    def encode_response(self, output):
        return {"label": output["label"], "score": float(output["score"]), "batched": True}
Implementing Streaming and Multi-Task APIs
Streaming Text Generation API
class StreamingTextAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline(
            "text-generation",
            model="distilgpt2",
            device=0 if str(device).startswith("cuda") and torch.cuda.is_available() else -1,
        )

    def decode_request(self, request):
        return request["prompt"]

    def predict(self, prompt):
        # For demonstration, stream a fixed word list rather than real model output;
        # the short sleep mimics per-token generation latency
        words = ["Once", "upon", "a", "time", "in", "a", "digital", "world"]
        for word in words:
            time.sleep(0.1)
            yield word + " "

    def encode_response(self, output):
        # Forward each yielded chunk to the client as it is produced
        for token in output:
            yield {"token": token}
Multi-Task API
class MultiTaskAPI(ls.LitAPI):
    def setup(self, device):
        # Two lightweight pipelines share one endpoint; both are kept on CPU
        self.sentiment = pipeline("sentiment-analysis", device=-1)
        self.summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-6-6", device=-1)
        self.device = device

    def decode_request(self, request):
        return {"task": request.get("task", "sentiment"), "text": request["text"]}

    def predict(self, inputs):
        task = inputs["task"]
        text = inputs["text"]
        if task == "sentiment":
            result = self.sentiment(text)[0]
            return {"task": "sentiment", "result": result}
        elif task == "summarize":
            # Very short inputs are returned unchanged rather than summarized
            if len(text.split()) < 30:
                return {"task": "summarize", "result": {"summary_text": text}}
            result = self.summarizer(text, max_length=50, min_length=10)[0]
            return {"task": "summarize", "result": result}
        else:
            return {"task": "unknown", "error": "Unsupported task"}

    def encode_response(self, output):
        return output
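Once MultiTaskAPI is served, a single endpoint handles both tasks; the "task" field in the request body selects the branch inside predict. A hypothetical client call, assuming the server is running locally on LitServe's default port and /predict route:
import requests

# Hypothetical client calls against a running MultiTaskAPI server.
sentiment = requests.post(
    "http://localhost:8000/predict",
    json={"task": "sentiment", "text": "LitServe makes model serving straightforward."},
).json()

summary = requests.post(
    "http://localhost:8000/predict",
    json={"task": "summarize",
          "text": "LitServe is a lightweight serving framework that lets you define setup, "
                  "decode_request, predict, and encode_response hooks to expose machine "
                  "learning models as scalable HTTP endpoints with minimal boilerplate, "
                  "which is exactly the pattern this tutorial builds on."},
).json()

print(sentiment)
print(summary)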
Implementing Caching in APIs
class CachedAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline("sentiment-analysis", device=-1)
        self.cache = {}      # maps input text to its model result
        self.hits = 0
        self.misses = 0

    def decode_request(self, request):
        return request["text"]

    def predict(self, text):
        # Return the cached result when the exact same text has been seen before
        if text in self.cache:
            self.hits += 1
            return self.cache[text], True
        self.misses += 1
        result = self.model(text)[0]
        self.cache[text] = result
        return result, False

    def encode_response(self, output):
        result, from_cache = output
        return {
            "label": result["label"],
            "score": float(result["score"]),
            "from_cache": from_cache,
            "cache_stats": {"hits": self.hits, "misses": self.misses},
        }
Testing APIs Locally
def test_apis_locally():
    print("=" * 70)
    print("Testing APIs Locally (No Server)")
    print("=" * 70)

    # Text generation
    api1 = TextGeneratorAPI()
    api1.setup("cpu")
    decoded = api1.decode_request({"prompt": "Artificial intelligence will"})
    result = api1.predict(decoded)
    encoded = api1.encode_response(result)
    print(f"✓ Result: {encoded['generated_text'][:100]}...")

    # Batched sentiment analysis
    api2 = BatchedSentimentAPI()
    api2.setup("cpu")
    texts = ["I love Python!", "This is terrible.", "Neutral statement."]
    decoded_batch = [api2.decode_request({"text": t}) for t in texts]
    batched = api2.batch(decoded_batch)
    results = api2.predict(batched)
    unbatched = api2.unbatch(results)
    for i, r in enumerate(unbatched):
        encoded = api2.encode_response(r)
        print(f"✓ '{texts[i]}' -> {encoded['label']} ({encoded['score']:.2f})")

    # Multi-task routing
    api3 = MultiTaskAPI()
    api3.setup("cpu")
    decoded = api3.decode_request({"task": "sentiment", "text": "Amazing tutorial!"})
    result = api3.predict(decoded)
    print(f"✓ Sentiment: {result['result']}")

    # Caching: the same text is sent three times, so the last two requests hit the cache
    api4 = CachedAPI()
    api4.setup("cpu")
    test_text = "LitServe is awesome!"
    for i in range(3):
        decoded = api4.decode_request({"text": test_text})
        result = api4.predict(decoded)
        encoded = api4.encode_response(result)
        print(f"✓ Request {i+1}: {encoded['label']} (cached: {encoded['from_cache']})")

    print("=" * 70)
    print("All tests completed successfully!")
    print("=" * 70)

test_apis_locally()
Conclusion
In this tutorial, we created and tested a diverse set of APIs that showcase the versatility of LitServe. We worked through text generation, batched sentiment analysis, streaming, multi-task routing, and caching, demonstrating how LitServe integrates with Hugging Face pipelines. By simplifying the model deployment workflow, LitServe lets you serve ML systems with just a few lines of Python while retaining flexibility and performance.