NVIDIA Researchers Introduce Dynamic Memory Sparsification (DMS) for 8× KV Cache Compression in Transformer LLMs
As demand for reasoning-heavy tasks grows, large language models (LLMs) are expected to generate longer sequences or parallel chains of reasoning. However, inference-time performance is limited not only by the number of tokens produced but also by the memory footprint of the key–value (KV) cache. In a recent study, researchers from NVIDIA and the University of Edinburgh present Dynamic Memory Sparsification (DMS), a data-efficient, retrofit-friendly method that compresses the KV cache and enables inference-time hyper-scaling without compromising model accuracy.
The Bottleneck: KV Cache in Transformer Inference
Transformer-based models such as GPT, LLaMA, and Qwen use a KV cache to store past token representations for autoregressive generation. The cache grows linearly with sequence length and with the number of parallel sequences (e.g., beams or batched generations), consuming significant GPU memory and slowing inference due to frequent memory access.
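To make the scale of the problem concrete, here is a rough back-of-the-envelope estimate of KV cache size. The formula and the model configuration below are illustrative assumptions, not figures from the paper:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, dtype_bytes=2):
    """Rough KV-cache size: keys + values, stored per layer, head, and token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

# Illustrative 7B-class config (32 layers, 32 KV heads, head_dim 128) at a 32k context in fp16
print(f"{kv_cache_bytes(32, 32, 128, 32_768) / 2**30:.1f} GiB per sequence")  # ~16.0 GiB
```

Even at moderate context lengths, the cache can rival the size of the model weights themselves, which is why compressing it matters.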
Current techniques for optimizing the KV cache either rely on training-free heuristics, such as attention weight-based token eviction, or require extensive post-training retrofits like Dynamic Memory Compression (DMC). Both approaches have notable drawbacks: the former can negatively affect accuracy, while the latter is computationally intensive.
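For intuition, a training-free eviction heuristic of the kind mentioned above can be sketched as follows. This is a simplified illustration in the spirit of methods like TOVA, not any specific implementation:

```python
import torch

def topk_token_eviction(attn_weights, budget):
    """Training-free heuristic sketch: keep only the `budget` KV-cache tokens
    with the highest accumulated attention mass.

    attn_weights: [heads, queries, kv_len] softmax attention weights from recent steps.
    Returns sorted indices of the tokens to retain in the KV cache.
    """
    scores = attn_weights.sum(dim=(0, 1))                       # accumulate over heads and queries
    keep = torch.topk(scores, k=min(budget, scores.numel())).indices
    return torch.sort(keep).values

# Toy usage: keep 4 of 16 cached tokens based on 8 heads x 2 recent queries
weights = torch.softmax(torch.randn(8, 2, 16), dim=-1)
print(topk_token_eviction(weights, budget=4))
```

Because such heuristics never train the model to cope with missing tokens, aggressive budgets can discard context the model still needs, which is the accuracy drawback noted above.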
Dynamic Memory Sparsification (DMS): Compression Without Compromise
DMS tackles these challenges with a hybrid approach: it sparsifies the KV cache similarly to conventional pruning methods but does so with minimal training overhead (~1,000 steps) and delayed eviction, allowing tokens to be temporarily retained even after being marked for removal. This design preserves essential context information and prevents sudden drops in accuracy.
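A toy illustration of delayed eviction is sketched below: a token marked for removal at step t remains visible to attention for a fixed grace window before it is actually dropped. The mask construction and window semantics here are our simplification, not the paper's exact formulation:

```python
import torch

def delayed_eviction_mask(evict_step, num_steps, window):
    """Build a [decoding_step, key_token] visibility mask under delayed eviction (toy sketch).

    evict_step[i] is the decoding step at which token i was marked for eviction;
    the token stays attendable for `window` further steps before it is dropped.
    """
    n = evict_step.shape[0]
    steps = torch.arange(num_steps).unsqueeze(1)            # [num_steps, 1]
    marked = evict_step.unsqueeze(0)                        # [1, n]
    exists = steps >= torch.arange(n).unsqueeze(0)          # token must already have been generated
    alive = steps <= marked + window                        # not yet past its grace window
    return exists & alive

# Token 2 is marked for eviction at step 3 with window=2, so it stays visible through step 5
mask = delayed_eviction_mask(torch.tensor([10, 10, 3, 10]), num_steps=8, window=2)
print(mask.int())
```

The grace window is what distinguishes this from immediate eviction: the model keeps reading a doomed token for a few more steps, which softens the impact of each removal.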
The core concept is to make eviction decisions differentiable during training using a Gumbel-sigmoid-based sampling mechanism. Tokens predicted for future eviction remain available for a sliding window duration before being discarded, enabling the model to effectively utilize their informational value.
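The Gumbel-sigmoid trick can be sketched as follows. The temperature, thresholding, and straight-through estimator shown here are standard choices and stand in for whatever exact parameterization the paper uses:

```python
import torch

def gumbel_sigmoid(logits, temperature=0.5, hard=True):
    """Differentiable ~Bernoulli sample via the Gumbel-sigmoid trick (illustrative)."""
    u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
    noise = torch.log(u) - torch.log1p(-u)                  # logistic noise
    soft = torch.sigmoid((logits + noise) / temperature)
    if not hard:
        return soft
    hard_sample = (soft > 0.5).float()
    return hard_sample + soft - soft.detach()               # straight-through estimator

# Toy usage: per-token eviction logits for an 8-token prefix; 1.0 = marked for eviction
logits = torch.randn(8, requires_grad=True)
print(gumbel_sigmoid(logits))
```

Because the hard decisions still pass gradients through their soft relaxation, the model can learn which tokens are safe to evict during the short retrofit phase.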
Efficient Retrofitting with Minimal Data
Unlike DMC, which requires thousands of training steps and intricate gradient-based optimization, DMS introduces no additional parameters per attention head and needs far less training. Instead, it repurposes a small part of the attention mechanism (a single neuron) to predict evictions, making it suitable for retrofitting existing models without altering their architecture.
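As a hypothetical sketch of this parameter-free retrofitting, one could reuse an existing channel of each head's key vector as the eviction logit. Which neuron is repurposed is an assumption made here purely for illustration:

```python
import torch

def eviction_logits_from_keys(keys):
    """Hypothetical sketch: reuse one existing channel of each head's key vector
    as an eviction logit, so no new parameters are added per attention head.
    (The choice of channel 0 is an illustrative assumption, not the paper's recipe.)

    keys: [batch, heads, seq_len, head_dim]
    Returns eviction logits of shape [batch, heads, seq_len].
    """
    return keys[..., 0]

# Toy usage: 1 sequence, 4 heads, 6 cached tokens, head_dim 64
logits = eviction_logits_from_keys(torch.randn(1, 4, 6, 64))
print(logits.shape)   # torch.Size([1, 4, 6])
```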
Empirical results indicate that with as few as 1K training steps, DMS can achieve 8× KV cache compression, maintaining or even enhancing model performance across reasoning tasks.
Benchmark Results: Scaling Performance Without Scaling Cost
The research team evaluated DMS on reasoning-heavy benchmarks such as:
- AIME 2024 (advanced math)
- MATH 500 (mathematical problem solving)
- GPQA Diamond (hard science QA)
- LiveCodeBench (code generation)
Across model sizes (Qwen-R1 1.5B, 7B, and 32B), DMS improved exact-match performance by 9.1 points on AIME, 7.6 on GPQA, and 9.6 on LiveCodeBench, all within the same memory and compute budgets.
Compared to leading baselines like Quest and TOVA, DMS consistently outperformed them in both KV cache read efficiency (runtime proxy) and peak memory usage, achieving improved Pareto frontiers.
General-Purpose Utility
DMS also demonstrates efficacy in non-reasoning tasks. On short-context benchmarks such as MMLU, GSM8K, and HellaSwag, DMS maintained performance at compression ratios up to 4× with minimal degradation (~3.5 points). For long-context tasks like Needle-in-a-Haystack and Variable Tracking, DMS even exceeded the performance of vanilla models, suggesting its potential to alleviate issues like information over-squashing in lengthy sequences.
Conclusion
In summary, Dynamic Memory Sparsification (DMS) offers a practical and scalable solution for enhancing the inference-time efficiency of Transformer-based language models. By intelligently compressing the KV cache with minimal retraining, DMS allows models to reason over longer sequences or in parallel without increasing runtime or memory demands. Its consistent improvements across a variety of reasoning and general-purpose tasks emphasize its versatility and effectiveness. As LLMs become more prevalent in resource-constrained environments, DMS provides a compelling path forward—balancing compression, accuracy, and ease of integration for real-world inference workloads.
Check out the paper. All credit for this research goes to the researchers of this project.