Comparing the Top 6 Inference Runtimes for LLM Serving in 2025
Production deployments of large language models (LLMs) are increasingly constrained by how efficiently tokens can be served under real traffic. The implementation details that matter most are how the runtime batches requests, overlaps prefill and decode, and manages the KV cache. Different engines make different trade-offs on these axes, and those trade-offs show up directly in tokens per second, P50/P99 latency, and GPU memory usage.
Overview of Inference Runtimes
This article compares six inference runtimes that frequently appear in production stacks:
- vLLM
- TensorRT-LLM
- Hugging Face Text Generation Inference (TGI v3)
- LMDeploy
- SGLang
- DeepSpeed Inference / ZeRO Inference
1. vLLM
Design
vLLM utilizes PagedAttention, partitioning the KV cache into fixed-size blocks and employing an indirection layer for efficient access. This design results in:
- Low KV fragmentation (reported <4% waste vs 60–80% in naïve allocators)
- High GPU utilization with continuous batching
- Support for prefix sharing and KV reuse at the block level
Performance
In the PagedAttention paper's benchmarks, vLLM delivers 14–24× higher throughput than Hugging Face Transformers and 2.2–3.5× higher than early TGI for LLaMA models on NVIDIA GPUs.
Where it fits
vLLM is suitable as a default high-performance engine for general LLM serving with good throughput and hardware flexibility.
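As a concrete sketch, the snippet below uses vLLM's offline Python API with prefix caching enabled; the model name, memory fraction, and sampling settings are placeholders to adapt to your hardware.

```python
# Minimal vLLM sketch: offline batched generation with prefix caching enabled.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder: any HF-format causal LM
    gpu_memory_utilization=0.90,               # fraction of VRAM vLLM may claim for weights + KV blocks
    enable_prefix_caching=True,                # reuse KV blocks for shared prompt prefixes
)

params = SamplingParams(temperature=0.7, max_tokens=256)
prompts = [
    "Summarize the trade-offs of paged KV caches.",
    "Explain continuous batching in two sentences.",
]

# generate() schedules both prompts in a single continuously batched run.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```

The same engine can be exposed as an OpenAI-compatible HTTP server for online serving; the offline API above is just the shortest way to see PagedAttention and continuous batching at work.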
2. TensorRT-LLM
Design
TensorRT-LLM is a compilation-based engine built on NVIDIA TensorRT: it generates fused kernels specialized per model and shape. Key KV cache features include:
- Paged KV cache
- Quantized KV cache (INT8, FP8)
- Circular buffer KV cache
Performance
When compiled for a specific model, TensorRT-LLM delivers very low single-request latency on NVIDIA GPUs, and its build options can be tuned toward either low time to first token (TTFT) or high throughput.
Where it fits
Best for latency-critical workloads in NVIDIA environments where teams can invest in model-specific tuning.
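Recent TensorRT-LLM releases also expose a high-level Python LLM API that builds the engine on first use; the sketch below assumes that API and a placeholder model, and exact names may differ across versions.

```python
# Hedged sketch of TensorRT-LLM's high-level LLM API (recent releases;
# import paths and argument names vary by version).
from tensorrt_llm import LLM, SamplingParams

# On first use this compiles a TensorRT engine for the given model,
# which is where the per-model, per-shape kernel fusion happens.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model

outputs = llm.generate(
    ["Explain why compiled attention kernels reduce time to first token."],
    SamplingParams(max_tokens=128, temperature=0.2),
)
print(outputs[0].outputs[0].text)
```

Production deployments more typically pre-build engines with the trtllm-build CLI and serve them behind Triton Inference Server; the snippet only shows where the compilation step fits.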
3. Hugging Face TGI v3
Design
TGI v3 features a Rust-based server with continuous batching and multiple backends, including the default PyTorch-based backend and TensorRT-LLM. It introduces a long-context pipeline with:
- Chunked prefill for long inputs
- Prefix KV caching for long conversation histories
Performance
In Hugging Face's own benchmarks, TGI v3 fits roughly 3× more prompt tokens in the same GPU memory and runs up to 13× faster than vLLM on long prompts, though these figures come from specific setups.
Where it fits
Ideal for production stacks already using Hugging Face, particularly for chat applications with long histories.
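Because TGI is consumed as a server (typically the official Docker image), a minimal client sketch looks like the following; the endpoint URL and generation parameters are examples.

```python
# Sketch of a client call against a running TGI v3 endpoint
# (e.g. the official Docker image serving on localhost:8080).
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# Long chat histories benefit from TGI's prefix KV caching: repeated turns
# share the cached prefill of everything before the newest user message.
response = client.text_generation(
    "Continue this conversation helpfully:\nUser: What is chunked prefill?\nAssistant:",
    max_new_tokens=200,
    temperature=0.7,
)
print(response)
```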
4. LMDeploy
Design
LMDeploy is a serving and quantization toolkit from the InternLM ecosystem. Its TurboMind engine provides:
- High-performance CUDA kernels for NVIDIA GPUs with persistent (continuous) batching
- A blocked KV cache with optional weight and KV quantization
- Dynamic split and fuse of attention blocks across requests
Performance
In the project's own benchmarks, LMDeploy delivers up to 1.8× higher request throughput than vLLM, with the gap widest under high concurrency.
Where it fits
Best for NVIDIA-centric deployments aiming for maximum throughput.
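A minimal sketch of LMDeploy's pipeline API with the TurboMind backend might look like this; the model name and the cache and quantization settings are illustrative.

```python
# Minimal LMDeploy sketch using the TurboMind backend.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_cfg = TurbomindEngineConfig(
    cache_max_entry_count=0.8,  # fraction of free GPU memory reserved for the blocked KV cache
    quant_policy=8,             # 8 enables INT8 KV cache quantization (0 disables it)
)

pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=engine_cfg)
responses = pipe(["What does blocked KV caching buy you under high concurrency?"])
print(responses[0].text)
```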
5. SGLang
Design
SGLang combines a domain-specific language for writing structured LLM programs with a runtime built around RadixAttention, which keeps KV prefixes in a radix tree so that shared prefixes are automatically reused across requests.
Performance
In the SGLang paper's evaluation, it reaches up to 6.4× higher throughput and 3.7× lower latency on structured workloads with high KV hit rates.
Where it fits
Suitable for agentic systems and applications where KV reuse is critical.
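A small sketch of SGLang's frontend DSL against a locally launched server is shown below; the endpoint URL, system prompt, and generation settings are assumptions.

```python
# Sketch of SGLang's frontend DSL talking to a running SGLang server
# (e.g. launched with `python -m sglang.launch_server --model-path <model>`).
import sglang as sgl

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def qa(s, question):
    # The shared system prefix is cached once in the radix tree and reused
    # across all calls, which is where the KV-reuse wins come from.
    s += sgl.system("You are a concise technical assistant.")
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=128))

state = qa.run(question="Why does prefix reuse lower time to first token?")
print(state["answer"])
```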
6. DeepSpeed Inference / ZeRO Inference
Design
DeepSpeed pairs optimized transformer inference kernels with ZeRO-style offloading of weights to CPU memory or NVMe, so that models much larger than GPU memory can still be served.
Performance
In Microsoft's published ZeRO-Inference examples, full CPU offload reaches around 43 tokens per second, outperforming partial-offload setups because freeing GPU memory allows much larger batches.
Where it fits
Best for offline or batch inference scenarios where model size is a priority over latency.
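The sketch below shows DeepSpeed's kernel-injection path for inference; the model name is a placeholder, and a true ZeRO-Inference setup would instead use a ZeRO stage-3 config with parameter offload, which is omitted here for brevity.

```python
# Hedged sketch of DeepSpeed Inference with kernel injection.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-13b"  # placeholder large model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Replace eligible transformer blocks with DeepSpeed's fused inference kernels.
engine = deepspeed.init_inference(
    model,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

inputs = tokenizer("Offloading lets small GPUs run big models because", return_tensors="pt").to("cuda")
print(tokenizer.decode(engine.module.generate(**inputs, max_new_tokens=64)[0]))
```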
Comparison Summary
| Runtime | Main Design Idea | Relative Strength | KV Strategy | Typical Use Case |
|---|---|---|---|---|
| vLLM | PagedAttention, continuous batching | High tokens per second at a given TTFT | Paged KV blocks, FP8 KV support | General-purpose GPU serving, multi-hardware |
| TensorRT-LLM | Compiled kernels on NVIDIA + KV reuse | Very low latency and high throughput on NVIDIA | Paged, quantized KV, reuse and offload | NVIDIA-only, latency-sensitive |
| TGI v3 | HF serving layer with long-prompt path | Strong long-prompt performance, integrated stack | Paged KV, chunked prefill, prefix caching | HF-centric APIs, long chat histories |
| LMDeploy | TurboMind kernels, blocked KV, quant | Up to 1.8× vLLM throughput in vendor tests | Blocked KV cache, weight and KV quant | NVIDIA deployments focused on raw throughput |
| SGLang | RadixAttention and structured programs | Up to 6.4× throughput and 3.7× lower latency on structured workloads | Radix-tree KV reuse over prefixes | Agents, RAG, high prefix reuse |
| DeepSpeed | GPU/CPU/NVMe offload for huge models | Enables large models on small GPUs; throughput-oriented | Offloaded weights and sometimes KV | Very large models, offline or low QPS |
Choosing a Runtime in Practice
When selecting a runtime for a production system, the following patterns cover most cases (a small engine-agnostic benchmark sketch follows the list):
- For a strong default engine with minimal custom work: Start with vLLM for good throughput and solid KV handling.
- If committed to NVIDIA and needing tight control over latency: Use TensorRT-LLM with model-specific tuning.
- If using Hugging Face and focused on long chats: Opt for TGI v3 for its effective long prompt pipeline.
- For maximum throughput with quantized models: Choose LMDeploy with TurboMind.
- If building agents or heavy RAG systems: Utilize SGLang for high KV reuse.
- If running very large models on limited GPUs: Consider DeepSpeed Inference / ZeRO Inference, trading per-token latency for the ability to fit the model and batch aggressively.
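Whichever pattern applies, it is worth validating the choice against your own traffic. The sketch below is an engine-agnostic smoke test against an OpenAI-compatible endpoint, which all six runtimes can expose; the URL, model name, request count, and prompt are placeholders, and because requests run sequentially it measures single-stream latency rather than loaded throughput.

```python
# Engine-agnostic sanity benchmark against any OpenAI-compatible endpoint.
import time
import statistics
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint
latencies = []
completion_tokens = 0
start = time.perf_counter()

for _ in range(32):
    t0 = time.perf_counter()
    resp = client.chat.completions.create(
        model="served-model-name",  # placeholder: whatever the server registered
        messages=[{"role": "user", "content": "Give three KV cache optimizations."}],
        max_tokens=128,
    )
    latencies.append(time.perf_counter() - t0)
    completion_tokens += resp.usage.completion_tokens

elapsed = time.perf_counter() - start
print(f"p50 latency: {statistics.median(latencies):.2f}s")
print(f"p99 latency: {sorted(latencies)[int(0.99 * len(latencies))]:.2f}s")  # rough with 32 samples
print(f"throughput : {completion_tokens / elapsed:.1f} output tokens/s")
```

For a realistic picture, run the same loop concurrently at your expected request rate and with your own prompt and output lengths, since batching behavior is exactly what separates these engines.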
The common thread is that KV cache management is central to LLM serving. The most effective runtimes treat the KV cache as a first-class data structure and optimize it through paging, quantization, reuse, and offloading.
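A rough sizing sketch shows why. The parameters below approximate a Llama-3.1-8B-class model with grouped-query attention in FP16 and are assumptions to swap for your own architecture.

```python
# Back-of-the-envelope KV cache sizing:
#   bytes per token = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_element
def kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token()    # 131072 bytes ≈ 128 KiB per token
per_request = per_token * 8192      # an 8K-token context
print(f"{per_token / 1024:.0f} KiB per token, "
      f"{per_request / 1024**3:.2f} GiB per 8K-token request")
```

At roughly 1 GiB of KV cache per 8K-token request in FP16, how a runtime pages, quantizes, and reuses that cache directly determines how many concurrent requests a single GPU can hold.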