Cache-to-Cache (C2C): Direct Semantic Communication Between Large Language Models via KV-Cache Fusion

Understanding the Target Audience

The target audience for the Cache-to-Cache (C2C) communication paradigm primarily consists of AI researchers, data scientists, and business managers involved in AI deployment. They are typically looking to improve the efficiency and effectiveness of systems that combine multiple large language models (LLMs).

  • Pain Points:
    • Latency issues in multi-LLM systems due to text-based communication.
    • Semantic loss during the transfer of information between models.
    • Complexity in integrating multiple models with varying architectures and sizes.
  • Goals:
    • To improve the accuracy and speed of LLM interactions.
    • To explore innovative communication methods that reduce reliance on natural language.
    • To leverage the strengths of different models in a seamless manner.
  • Interests:
    • Advancements in AI communication techniques.
    • Research findings that can be applied to real-world business problems.
    • Case studies demonstrating successful LLM collaborations.
  • Communication Preferences:
    • Technical documentation and peer-reviewed research papers.
    • Webinars and workshops focused on practical applications of AI.
    • Engagement through professional networks and forums.

Overview of Cache-to-Cache (C2C)

Can large language models collaborate without sending a single token of text? A team of researchers from Tsinghua University, Infinigence AI, The Chinese University of Hong Kong, Shanghai AI Laboratory, and Shanghai Jiao Tong University assert that this is possible. Cache-to-Cache (C2C) is a new communication paradigm where large language models exchange information through their KV-Cache rather than through generated text.

The Bottleneck of Text Communication

Current multi-LLM systems primarily rely on text for communication, where one model generates text that another model interprets. This approach incurs three significant costs:

  • Internal activations are compressed into short natural language messages, resulting in the loss of semantic signals in the KV-Cache.
  • Natural language is inherently ambiguous, leading to potential misinterpretations of structural signals.
  • Token-by-token decoding increases latency during lengthy analytical exchanges.

Oracle Experiments: Testing KV-Cache as a Communication Medium

The research team conducted two oracle-style experiments to evaluate the effectiveness of KV-Cache as a communication medium.

Cache Enrichment Oracle

The team compared three setups on multiple-choice benchmarks:

  • Direct: Prefill on the question only.
  • Few Shot: Prefill on exemplars plus question, resulting in a longer cache.
  • Oracle: Prefill on exemplars plus question, then discard the exemplar segment, keeping only the question-aligned slice of the cache.

Results showed that the Oracle setup improved accuracy from 58.42% to 62.34%, while the Few Shot approach reached 63.39%. This indicates that enriching the question-aligned KV-Cache improves performance without increasing the sequence length.
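To make the Oracle setup concrete, the sketch below prefils exemplars plus the question with Hugging Face transformers and then keeps only the question-aligned slice of the KV-Cache. The model name, the toy exemplars, and the slicing details are illustrative assumptions, not the authors' code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"  # illustrative choice of a small causal LM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

exemplars = "Q: 2 + 2 = ? A: 4\nQ: 3 + 5 = ? A: 8\n"  # few-shot prefix
question = "Q: 7 + 6 = ? A:"

# Few Shot / Oracle prefill: run exemplars + question through the model once.
full_ids = tok(exemplars + question, return_tensors="pt").input_ids
with torch.no_grad():
    out = model(full_ids, use_cache=True)

# Number of question tokens (approximate: tokenization at the boundary may differ slightly).
n_q = tok(question, return_tensors="pt").input_ids.shape[1]

# Convert to the legacy tuple layout if a Cache object is returned.
pkv = out.past_key_values
layers = pkv.to_legacy_cache() if hasattr(pkv, "to_legacy_cache") else pkv

# Oracle: discard the exemplar segment, keep only the question-aligned slice.
question_cache = tuple(
    (k[:, :, -n_q:, :], v[:, :, -n_q:, :])  # [batch, heads, seq, head_dim]
    for k, v in layers
)
# question_cache is as long as a Direct prefill of the question alone,
# but each entry was computed in the context of the exemplars.
```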

Cache Transformation Oracle

The second experiment tested whether the KV-Cache of one model could be transformed into the cache space of another. A three-layer MLP was trained to map KV-Cache from Qwen3 4B to Qwen3 0.6B, demonstrating that the transformed cache lies within the target model's cache manifold.
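A minimal sketch of such a translator is shown below: a three-layer MLP that projects flattened Sharer KV states into the Receiver's width. The layer widths, activation, and the 2560 → 1024 dimensions (roughly matching the Qwen3 4B and Qwen3 0.6B hidden sizes) are assumptions for illustration; with grouped-query attention the actual per-layer KV dimensions are smaller.

```python
import torch
import torch.nn as nn

class CacheTranslator(nn.Module):
    """Three-layer MLP that maps Sharer KV vectors into the Receiver's cache space
    (a sketch; widths and activation are assumptions, not the paper's exact design)."""
    def __init__(self, sharer_dim: int, receiver_dim: int, hidden_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sharer_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, receiver_dim),
        )

    def forward(self, sharer_kv: torch.Tensor) -> torch.Tensor:
        # sharer_kv: [batch, seq, sharer_dim] flattened key or value states
        return self.net(sharer_kv)

# Illustrative use: map 4B-sized states into 0.6B-sized cache space.
translator = CacheTranslator(sharer_dim=2560, receiver_dim=1024)
dummy_keys = torch.randn(1, 32, 2560)   # one layer's keys for a 32-token prompt
mapped_keys = translator(dummy_keys)    # [1, 32, 1024], in the target cache space
```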

C2C: Direct Semantic Communication through KV-Cache

Based on the oracle experiments, the researchers defined Cache-to-Cache communication between a Sharer and a Receiver model. During prefill, both models read the same input and produce layer-wise KV-Cache. For each Receiver layer, C2C selects a mapped Sharer layer and applies a C2C Fuser to produce a fused cache, which the Receiver then uses during decoding.
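Because Sharer and Receiver generally have different depths, each Receiver layer needs a Sharer layer to read from. The snippet below shows one simple proportional alignment; the paper's actual selection strategy may differ, and the layer counts are only examples.

```python
def map_layers(num_receiver_layers: int, num_sharer_layers: int) -> list[int]:
    """Proportionally assign a Sharer layer index to every Receiver layer
    (an illustrative strategy, not necessarily the one used in the paper)."""
    if num_receiver_layers == 1:
        return [num_sharer_layers - 1]
    return [
        round(r * (num_sharer_layers - 1) / (num_receiver_layers - 1))
        for r in range(num_receiver_layers)
    ]

# Example: a 28-layer Receiver reading from a 36-layer Sharer.
print(map_layers(28, 36))  # [0, 1, 3, 4, ..., 35]
```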

C2C Fuser Architecture

The C2C Fuser follows a residual integration principle and consists of three modules:

  • Projection Module: Concatenates Sharer and Receiver KV-Cache vectors, then applies a projection layer followed by a feature fusion layer.
  • Dynamic Weighting Module: Modulates heads based on input, allowing some attention heads to rely more on Sharer information.
  • Learnable Gate: Adds a per-layer gate that decides whether to inject Sharer context into that layer.

This architecture allows the Receiver to selectively absorb Sharer semantics without destabilizing its own representation.
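The sketch below puts the three modules together for a single layer: concatenation plus projection and fusion, an input-conditioned per-head weight, and a learnable gate applied as a residual update to the Receiver's own cache. Head counts, widths, and the exact gating form are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class C2CFuserLayer(nn.Module):
    """Per-layer fuser sketch: projection + dynamic head weighting + learnable gate."""
    def __init__(self, num_heads: int, head_dim: int):
        super().__init__()
        d = num_heads * head_dim
        self.proj = nn.Linear(2 * d, d)              # projection over concatenated caches
        self.fuse = nn.Linear(d, d)                  # feature fusion layer
        self.head_weight = nn.Linear(d, num_heads)   # input-conditioned per-head weights
        self.gate = nn.Parameter(torch.zeros(1))     # per-layer gate logit
        self.num_heads, self.head_dim = num_heads, head_dim

    def forward(self, receiver_kv: torch.Tensor, sharer_kv: torch.Tensor) -> torch.Tensor:
        # Both inputs: [batch, seq, num_heads * head_dim] flattened key or value states.
        x = torch.cat([receiver_kv, sharer_kv], dim=-1)
        fused = self.fuse(F.silu(self.proj(x)))
        # Dynamic weighting: each head decides how much Sharer signal to absorb.
        w = torch.sigmoid(self.head_weight(receiver_kv))       # [batch, seq, heads]
        w = w.repeat_interleave(self.head_dim, dim=-1)         # broadcast over head_dim
        # Learnable gate: residual injection into the Receiver's cache.
        g = torch.sigmoid(self.gate)
        return receiver_kv + g * (w * fused)

# Usage with illustrative sizes: fuse one layer's caches.
fuser = C2CFuserLayer(num_heads=16, head_dim=64)
recv = torch.randn(1, 32, 16 * 64)
shar = torch.randn(1, 32, 16 * 64)
fused_cache = fuser(recv, shar)   # same shape as receiver_kv
```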

Performance Results: C2C vs. Text Communication

Across various Sharer-Receiver combinations built from Qwen2.5, Qwen3, Llama3.2, and Gemma3, C2C consistently improves Receiver accuracy and reduces latency:

  • C2C achieves approximately 8.5% to 10.5% higher average accuracy than individual models.
  • C2C outperforms text communication by about 3.0% to 5.0% on average.
  • C2C delivers around 2x average speedup in latency compared to text-based collaboration.

For instance, using Qwen3 0.6B as the Receiver and Qwen2.5 0.5B as the Sharer, the Receiver alone reaches 35.53%, text-to-text communication reaches 41.03%, and C2C achieves 42.92%. The average time per query for text-to-text is 1.52 units, while C2C stays close to the single model at 0.40.

Key Takeaways

  • Cache-to-Cache communication allows a Sharer model to send information to a Receiver model directly via KV-Cache, eliminating the token bottleneck and reducing semantic loss in multi-model systems.
  • Oracle studies confirm that enriching the question-aligned slice of the cache improves accuracy at constant sequence length and that KV-Cache from a larger model can be mapped into a smaller model’s cache space.
  • The C2C Fuser architecture effectively combines Sharer and Receiver caches, enabling selective absorption of semantics while maintaining stability.
  • Consistent accuracy and latency gains are observed across various model pairs, making C2C a significant advancement in multi-LLM communication.

Further Exploration

For more details, check out the full research paper. Explore our GitHub Page for tutorials, code, and notebooks. Follow us on Twitter, join our 100k+ ML SubReddit, and subscribe to our Newsletter. Also, connect with us on Telegram.
