Building Advanced Multi-Endpoint Machine Learning APIs with LitServe
In this tutorial, we delve into LitServe, a lightweight framework designed for deploying machine learning models as APIs with minimal effort. We will build and test multiple endpoints that showcase functionalities such as text generation, batching, streaming, multi-task processing, and caching—all running locally without relying on external APIs. By the end of this tutorial, you will have a clear understanding of how to design scalable and flexible ML serving pipelines suitable for production-level applications.
Setup and Installation
We begin by setting up our environment on Google Colab, installing all required dependencies, including LitServe, PyTorch, and Transformers:
!pip install litserve torch transformers -q
Next, we import the necessary libraries:
import litserve as ls
import torch
from transformers import pipeline
import time
from typing import List
Creating Text Generation and Sentiment Analysis APIs
We create two APIs using the LitServe framework:
Text Generator API
class TextGeneratorAPI(ls.LitAPI):
    def setup(self, device):
        # Load a small GPT-2 variant; LitServe may pass "cuda:0", so match by prefix
        # and fall back to CPU when no GPU is available
        self.model = pipeline(
            "text-generation",
            model="distilgpt2",
            device=0 if str(device).startswith("cuda") and torch.cuda.is_available() else -1,
        )
        self.device = device

    def decode_request(self, request):
        return request["prompt"]

    def predict(self, prompt):
        result = self.model(prompt, max_length=100, num_return_sequences=1, temperature=0.8, do_sample=True)
        return result[0]["generated_text"]

    def encode_response(self, output):
        return {"generated_text": output, "model": "distilgpt2"}
Batched Sentiment Analysis API
class BatchedSentimentAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline(
            "sentiment-analysis",
            model="distilbert-base-uncased-finetuned-sst-2-english",
            device=0 if str(device).startswith("cuda") and torch.cuda.is_available() else -1,
        )

    def decode_request(self, request):
        return request["text"]

    def batch(self, inputs: List[str]) -> List[str]:
        # Requests that arrive together are passed through as a single list
        return inputs

    def predict(self, batch: List[str]):
        # The pipeline accepts a list natively, so the whole batch runs in one call
        results = self.model(batch)
        return results

    def unbatch(self, output):
        # One result per request, returned in the original order
        return output

    def encode_response(self, output):
        return {"label": output["label"], "score": float(output["score"]), "batched": True}
Implementing Streaming and Multi-Task APIs
Streaming Text Generation API
class StreamingTextAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline(
            "text-generation",
            model="distilgpt2",
            device=0 if str(device).startswith("cuda") and torch.cuda.is_available() else -1,
        )

    def decode_request(self, request):
        return request["prompt"]

    def predict(self, prompt):
        # For demonstration, stream a fixed word list rather than real model output;
        # the short sleep mimics per-token generation latency
        words = ["Once", "upon", "a", "time", "in", "a", "digital", "world"]
        for word in words:
            time.sleep(0.1)
            yield word + " "

    def encode_response(self, output):
        # Forward each yielded chunk to the client as it is produced
        for token in output:
            yield {"token": token}
Multi-Task API
class MultiTaskAPI(ls.LitAPI):
    def setup(self, device):
        # Two lightweight pipelines share one endpoint; both are kept on CPU
        self.sentiment = pipeline("sentiment-analysis", device=-1)
        self.summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-6-6", device=-1)
        self.device = device

    def decode_request(self, request):
        return {"task": request.get("task", "sentiment"), "text": request["text"]}

    def predict(self, inputs):
        task = inputs["task"]
        text = inputs["text"]
        if task == "sentiment":
            result = self.sentiment(text)[0]
            return {"task": "sentiment", "result": result}
        elif task == "summarize":
            # Very short inputs are returned unchanged rather than summarized
            if len(text.split()) < 30:
                return {"task": "summarize", "result": {"summary_text": text}}
            result = self.summarizer(text, max_length=50, min_length=10)[0]
            return {"task": "summarize", "result": result}
        else:
            return {"task": "unknown", "error": "Unsupported task"}

    def encode_response(self, output):
        return output
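Once MultiTaskAPI is served, a single endpoint handles both tasks; the "task" field in the request body selects the branch inside predict. A hypothetical client call, assuming the server is running locally on LitServe's default port and /predict route:
import requests

# Hypothetical client calls against a running MultiTaskAPI server.
sentiment = requests.post(
    "http://localhost:8000/predict",
    json={"task": "sentiment", "text": "LitServe makes model serving straightforward."},
).json()

summary = requests.post(
    "http://localhost:8000/predict",
    json={"task": "summarize",
          "text": "LitServe is a lightweight serving framework that lets you define setup, "
                  "decode_request, predict, and encode_response hooks to expose machine "
                  "learning models as scalable HTTP endpoints with minimal boilerplate, "
                  "which is exactly the pattern this tutorial builds on."},
).json()

print(sentiment)
print(summary)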
Implementing Caching in APIs
class CachedAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline("sentiment-analysis", device=-1)
        self.cache = {}      # maps input text to its model result
        self.hits = 0
        self.misses = 0

    def decode_request(self, request):
        return request["text"]

    def predict(self, text):
        # Return the cached result when the exact same text has been seen before
        if text in self.cache:
            self.hits += 1
            return self.cache[text], True
        self.misses += 1
        result = self.model(text)[0]
        self.cache[text] = result
        return result, False

    def encode_response(self, output):
        result, from_cache = output
        return {
            "label": result["label"],
            "score": float(result["score"]),
            "from_cache": from_cache,
            "cache_stats": {"hits": self.hits, "misses": self.misses},
        }
Testing APIs Locally
def test_apis_locally():
    print("=" * 70)
    print("Testing APIs Locally (No Server)")
    print("=" * 70)

    # Text generation
    api1 = TextGeneratorAPI()
    api1.setup("cpu")
    decoded = api1.decode_request({"prompt": "Artificial intelligence will"})
    result = api1.predict(decoded)
    encoded = api1.encode_response(result)
    print(f"✓ Result: {encoded['generated_text'][:100]}...")

    # Batched sentiment analysis
    api2 = BatchedSentimentAPI()
    api2.setup("cpu")
    texts = ["I love Python!", "This is terrible.", "Neutral statement."]
    decoded_batch = [api2.decode_request({"text": t}) for t in texts]
    batched = api2.batch(decoded_batch)
    results = api2.predict(batched)
    unbatched = api2.unbatch(results)
    for i, r in enumerate(unbatched):
        encoded = api2.encode_response(r)
        print(f"✓ '{texts[i]}' -> {encoded['label']} ({encoded['score']:.2f})")

    # Multi-task routing
    api3 = MultiTaskAPI()
    api3.setup("cpu")
    decoded = api3.decode_request({"task": "sentiment", "text": "Amazing tutorial!"})
    result = api3.predict(decoded)
    print(f"✓ Sentiment: {result['result']}")

    # Caching: the same text is sent three times, so the last two requests hit the cache
    api4 = CachedAPI()
    api4.setup("cpu")
    test_text = "LitServe is awesome!"
    for i in range(3):
        decoded = api4.decode_request({"text": test_text})
        result = api4.predict(decoded)
        encoded = api4.encode_response(result)
        print(f"✓ Request {i+1}: {encoded['label']} (cached: {encoded['from_cache']})")

    print("=" * 70)
    print("All tests completed successfully!")
    print("=" * 70)

test_apis_locally()
Conclusion
In this tutorial, we created and tested a diverse set of APIs that showcase the versatility of LitServe. We worked through text generation, batched sentiment analysis, streaming, multi-task routing, and caching, demonstrating how LitServe integrates with Hugging Face pipelines. By simplifying the model deployment workflow, LitServe lets you serve ML systems with just a few lines of Python while retaining flexibility and performance.