Comparing the Top 6 Inference Runtimes for LLM Serving in 2025
Production deployments of large language models (LLMs) are increasingly constrained by how efficiently tokens can be served under real traffic. The implementation details that matter most are how the runtime batches requests, overlaps prefill and decode, and manages the KV cache. Different engines make different trade-offs on these axes, and those trade-offs show up directly in tokens per second, P50/P99 latency, and GPU memory usage.
Overview of Inference Runtimes
This article compares six inference runtimes that frequently appear in production stacks:
- vLLM
- TensorRT-LLM
- Hugging Face Text Generation Inference (TGI v3)
- LMDeploy
- SGLang
- DeepSpeed Inference / ZeRO Inference
1. vLLM
Design
vLLM utilizes PagedAttention, partitioning the KV cache into fixed-size blocks and employing an indirection layer for efficient access. This design results in:
- Low KV fragmentation (reported <4% waste vs 60–80% in naïve allocators)
- High GPU utilization with continuous batching
- Support for prefix sharing and KV reuse at the block level
Performance
In the PagedAttention paper's benchmarks, vLLM delivers 14–24× higher throughput than Hugging Face Transformers and 2.2–3.5× higher than early TGI for LLaMA models on NVIDIA GPUs.
Where it fits
vLLM is suitable as a default high-performance engine for general LLM serving with good throughput and hardware flexibility.
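As a concrete sketch, the snippet below uses vLLM's offline Python API with prefix caching enabled; the model name, memory fraction, and sampling settings are placeholders to adapt to your hardware.

```python
# Minimal vLLM sketch: offline batched generation with prefix caching enabled.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder: any HF-format causal LM
    gpu_memory_utilization=0.90,               # fraction of VRAM vLLM may claim for weights + KV blocks
    enable_prefix_caching=True,                # reuse KV blocks for shared prompt prefixes
)

params = SamplingParams(temperature=0.7, max_tokens=256)
prompts = [
    "Summarize the trade-offs of paged KV caches.",
    "Explain continuous batching in two sentences.",
]

# generate() schedules both prompts in a single continuously batched run.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```

The same engine can be exposed as an OpenAI-compatible HTTP server for online serving; the offline API above is just the shortest way to see PagedAttention and continuous batching at work.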
2. TensorRT-LLM
Design
TensorRT-LLM is a compilation-based engine built on NVIDIA TensorRT: it generates fused kernels specialized per model and shape. Key KV cache features include:
- Paged KV cache
- Quantized KV cache (INT8, FP8)
- Circular buffer KV cache
Performance
When compiled for a specific model, TensorRT-LLM delivers very low single-request latency on NVIDIA GPUs, and its build options can be tuned toward either low time to first token (TTFT) or high throughput.
Where it fits
Best for latency-critical workloads in NVIDIA environments where teams can invest in model-specific tuning.
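Recent TensorRT-LLM releases also expose a high-level Python LLM API that builds the engine on first use; the sketch below assumes that API and a placeholder model, and exact names may differ across versions.

```python
# Hedged sketch of TensorRT-LLM's high-level LLM API (recent releases;
# import paths and argument names vary by version).
from tensorrt_llm import LLM, SamplingParams

# On first use this compiles a TensorRT engine for the given model,
# which is where the per-model, per-shape kernel fusion happens.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model

outputs = llm.generate(
    ["Explain why compiled attention kernels reduce time to first token."],
    SamplingParams(max_tokens=128, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```

Production deployments more typically pre-build engines with the trtllm-build CLI and serve them behind Triton Inference Server; the snippet only shows where the compilation step fits.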
3. Hugging Face TGI v3
Design
TGI v3 features a Rust-based server with continuous batching and multiple backends, including the default PyTorch-based backend and TensorRT-LLM. It introduces a long-context pipeline with:
- Chunked prefill for long inputs
- Prefix KV caching for long conversation histories
Performance
In Hugging Face's own benchmarks, TGI v3 fits roughly 3× more prompt tokens in the same GPU memory and runs up to 13× faster than vLLM on long prompts, though these figures come from specific setups.
Where it fits
Ideal for production stacks already using Hugging Face, particularly for chat applications with long histories.
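Because TGI is consumed as a server (typically the official Docker image), a minimal client sketch looks like the following; the endpoint URL and generation parameters are examples.

```python
# Sketch of a client call against a running TGI v3 endpoint
# (e.g. the official Docker image serving on localhost:8080).
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# Long chat histories benefit from TGI's prefix KV caching: repeated turns
# share the cached prefill of everything before the newest user message.
response = client.text_generation(
    "Continue this conversation helpfully:\nUser: What is chunked prefill?\nAssistant:",
    max_new_tokens=200,
    temperature=0.7,
)
print(response)
```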
4. LMDeploy
Design
LMDeploy is a serving and quantization toolkit from the InternLM ecosystem. Its TurboMind engine provides:
- High-performance CUDA kernels for NVIDIA GPUs with persistent (continuous) batching
- A blocked KV cache with optional weight and KV quantization
- Dynamic split and fuse of attention blocks across requests
Performance
In the project's own benchmarks, LMDeploy delivers up to 1.8× higher request throughput than vLLM, with the gap widest under high concurrency.
Where it fits
Best for NVIDIA-centric deployments aiming for maximum throughput.
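A minimal sketch of LMDeploy's pipeline API with the TurboMind backend might look like this; the model name and the cache and quantization settings are illustrative.

```python
# Minimal LMDeploy sketch using the TurboMind backend.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_cfg = TurbomindEngineConfig(
    cache_max_entry_count=0.8,  # fraction of free GPU memory reserved for the blocked KV cache
    quant_policy=8,             # 8 enables INT8 KV cache quantization (0 disables it)
)

pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=engine_cfg)
responses = pipe(["What does blocked KV caching buy you under high concurrency?"])
print(responses[0].text)
```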
5. SGLang
Design
SGLang combines a domain-specific language for writing structured LLM programs with a runtime built around RadixAttention, which keeps KV prefixes in a radix tree so that shared prefixes are automatically reused across requests.
Performance
In the SGLang paper's evaluation, it reaches up to 6.4× higher throughput and 3.7× lower latency on structured workloads with high KV hit rates.
Where it fits
Suitable for agentic systems and applications where KV reuse is critical.
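A small sketch of SGLang's frontend DSL against a locally launched server is shown below; the endpoint URL, system prompt, and generation settings are assumptions.

```python
# Sketch of SGLang's frontend DSL talking to a running SGLang server
# (e.g. launched with `python -m sglang.launch_server --model-path <model>`).
import sglang as sgl

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def qa(s, question):
    # The shared system prefix is cached once in the radix tree and reused
    # across all calls, which is where the KV-reuse wins come from.
    s += sgl.system("You are a concise technical assistant.")
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=128))

state = qa.run(question="Why does prefix reuse lower time to first token?")
print(state["answer"])
```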
6. DeepSpeed Inference / ZeRO Inference
Design
DeepSpeed pairs optimized transformer inference kernels with ZeRO-style offloading of weights to CPU memory or NVMe, so that models much larger than GPU memory can still be served.
Performance
In Microsoft's published ZeRO-Inference examples, full CPU offload reaches around 43 tokens per second, outperforming partial-offload setups because freeing GPU memory allows much larger batches.
Where it fits
Best for offline or batch inference scenarios where model size is a priority over latency.
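The sketch below shows DeepSpeed's kernel-injection path for inference; the model name is a placeholder, and a true ZeRO-Inference setup would instead use a ZeRO stage-3 config with parameter offload, which is omitted here for brevity.

```python
# Hedged sketch of DeepSpeed Inference with kernel injection.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-13b"  # placeholder large model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Replace eligible transformer blocks with DeepSpeed's fused inference kernels.
engine = deepspeed.init_inference(
    model,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

inputs = tokenizer("Offloading lets small GPUs run big models because", return_tensors="pt").to("cuda")
print(tokenizer.decode(engine.module.generate(**inputs, max_new_tokens=64)[0]))
```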
Comparison Summary
| Runtime | Main Design Idea | Relative Strength | KV Strategy | Typical Use Case |
|---|---|---|---|---|
| vLLM | PagedAttention, continuous batching | High tokens per second at a given TTFT | Paged KV blocks, FP8 KV support | General-purpose GPU serving, multi-hardware |
| TensorRT-LLM | Compiled kernels on NVIDIA + KV reuse | Very low latency and high throughput on NVIDIA | Paged, quantized KV, reuse and offload | NVIDIA-only, latency-sensitive |
| TGI v3 | HF serving layer with long-prompt path | Strong long-prompt performance, integrated stack | Paged KV, chunked prefill, prefix caching | HF-centric APIs, long chat histories |
| LMDeploy | TurboMind kernels, blocked KV, quant | Up to 1.8× vLLM throughput in vendor tests | Blocked KV cache, weight and KV quant | NVIDIA deployments focused on raw throughput |
| SGLang | RadixAttention and structured programs | Up to 6.4× throughput and 3.7× lower latency on structured workloads | Radix-tree KV reuse over prefixes | Agents, RAG, high prefix reuse |
| DeepSpeed | GPU/CPU/NVMe offload for huge models | Enables large models on small GPUs; throughput-oriented | Offloaded weights and sometimes KV | Very large models, offline or low QPS |
Choosing a Runtime in Practice
When selecting a runtime for a production system, the following patterns cover most cases (a small engine-agnostic benchmark sketch follows the list):
- For a strong default engine with minimal custom work: Start with vLLM for good throughput and solid KV handling.
- If committed to NVIDIA and needing tight control over latency: Use TensorRT-LLM with model-specific tuning.
- If using Hugging Face and focused on long chats: Opt for TGI v3 for its effective long prompt pipeline.
- For maximum throughput with quantized models: Choose LMDeploy with TurboMind.
- If building agents or heavy RAG systems: Utilize SGLang for high KV reuse.
- If running very large models on limited GPUs: Consider DeepSpeed Inference / ZeRO Inference, trading per-token latency for the ability to fit the model and batch aggressively.
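Whichever pattern applies, it is worth validating the choice against your own traffic. The sketch below is an engine-agnostic smoke test against an OpenAI-compatible endpoint, which all six runtimes can expose; the URL, model name, request count, and prompt are placeholders, and because requests run sequentially it measures single-stream latency rather than loaded throughput.

```python
# Engine-agnostic sanity benchmark against any OpenAI-compatible endpoint.
import time
import statistics
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint
latencies = []
completion_tokens = 0
start = time.perf_counter()

for _ in range(32):
    t0 = time.perf_counter()
    resp = client.chat.completions.create(
        model="served-model-name",  # placeholder: whatever the server registered
        messages=[{"role": "user", "content": "Give three KV cache optimizations."}],
        max_tokens=128,
    )
    latencies.append(time.perf_counter() - t0)
    completion_tokens += resp.usage.completion_tokens

elapsed = time.perf_counter() - start
print(f"p50 latency: {statistics.median(latencies):.2f}s")
print(f"p99 latency: {sorted(latencies)[int(0.99 * len(latencies))]:.2f}s")  # rough with 32 samples
print(f"throughput : {completion_tokens / elapsed:.1f} output tokens/s")
```

For a realistic picture, run the same loop concurrently at your expected request rate and with your own prompt and output lengths, since batching behavior is exactly what separates these engines.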
The common thread is that KV cache management is central to LLM serving. The most effective runtimes treat the KV cache as a first-class data structure and optimize it through paging, quantization, reuse, and offloading.
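A rough sizing sketch shows why. The parameters below approximate a Llama-3.1-8B-class model with grouped-query attention in FP16 and are assumptions to swap for your own architecture.

```python
# Back-of-the-envelope KV cache sizing:
#   bytes per token = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_element
def kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token()    # 131072 bytes ≈ 128 KiB per token
per_request = per_token * 8192      # an 8K-token context
print(f"{per_token / 1024:.0f} KiB per token, "
      f"{per_request / 1024**3:.2f} GiB per 8K-token request")
```

At roughly 1 GiB of KV cache per 8K-token request in FP16, how a runtime pages, quantizes, and reuses that cache directly determines how many concurrent requests a single GPU can hold.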