Tencent Open Sources Hunyuan-A13B: A 13B Active Parameter MoE Model with Dual-Mode Reasoning and 256K Context

Understanding the Target Audience for Tencent’s Hunyuan-A13B

The target audience for Tencent’s Hunyuan-A13B model primarily consists of AI researchers, data scientists, and business managers in tech-driven industries. These professionals are often engaged in developing AI solutions, optimizing workflows, and enhancing decision-making processes through advanced technologies.

Pain Points

  • Need for efficient AI models that balance performance and computational costs.
  • Challenges in deploying large language models in real-time applications.
  • Desire for models that can handle long-context tasks effectively.

Goals

  • To leverage AI for improved operational efficiency and decision-making.
  • To explore open-source solutions that allow for customization and experimentation.
  • To stay ahead in the competitive landscape by utilizing state-of-the-art AI technologies.

Interests

  • Advancements in AI model architectures, particularly in sparse Mixture-of-Experts (MoE) designs.
  • Applications of AI in various domains, including natural language processing and agentic reasoning.
  • Open-source tools and frameworks that facilitate research and development.

Communication Preferences

  • Preference for technical documentation and peer-reviewed research articles.
  • Interest in case studies and real-world applications of AI technologies.
  • Engagement through professional networks and platforms like GitHub and Hugging Face.

Tencent Open Sources Hunyuan-A13B: A 13B Active Parameter MoE Model

Tencent’s Hunyuan team has introduced Hunyuan-A13B, an open-source large language model built on a sparse Mixture-of-Experts (MoE) architecture. The model consists of 80 billion total parameters, with only 13 billion active during inference, ensuring a balance between performance and computational cost. It supports Grouped Query Attention (GQA), a context length of 256K, and a dual-mode reasoning framework that toggles between fast and slow thinking.

Architecture: Sparse MoE with 13B Active Parameters

Hunyuan-A13B features a fine-grained MoE design comprising 1 shared expert and 64 non-shared experts, with 8 experts activated per token. This architecture, validated by scaling experiments, delivers consistent performance while keeping inference costs low. The model spans 32 layers, uses SwiGLU activations, and has a vocabulary size of 128K. It integrates GQA for improved memory efficiency during long-context inference. A minimal sketch of this routing pattern follows below.
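
To make the routing pattern concrete, here is a minimal, illustrative PyTorch sketch of a layer with one always-on shared expert and 64 routed SwiGLU experts, of which the top 8 (by router score) are activated per token. The dimensions and the linear router are placeholders, not Tencent's implementation:

```python
# Illustrative sketch only (not Tencent's code): 1 shared expert that every
# token passes through, plus 64 routed SwiGLU experts with top-8 routing.
# d_model and d_ff are placeholder sizes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class SparseMoELayer(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, n_experts=64, top_k=8):
        super().__init__()
        self.shared = SwiGLU(d_model, d_ff)                 # always active
        self.experts = nn.ModuleList(SwiGLU(d_model, d_ff) for _ in range(n_experts))
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x):                                   # x: (tokens, d_model)
        out = self.shared(x)                                # shared-expert path
        weights, idx = self.router(x).topk(self.top_k, -1)  # pick 8 of 64 per token
        weights = F.softmax(weights, dim=-1)                # normalize selected scores
        for k in range(self.top_k):
            for e in idx[:, k].unique():                    # group tokens by expert
                mask = idx[:, k] == e
                out[mask] = out[mask] + weights[mask, k, None] * self.experts[int(e)](x[mask])
        return out

print(SparseMoELayer()(torch.randn(4, 1024)).shape)  # torch.Size([4, 1024])
```

Only the 8 selected experts run per token, which is why the 80B-parameter model incurs roughly 13B parameters of compute at inference time.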

The model’s MoE setup is complemented by an optimized training curriculum: a 20-trillion-token pretraining phase, followed by fast annealing and long-context adaptation. This final phase scales the context window first to 32K and then to 256K tokens using NTK-aware positional encoding, ensuring stable performance at large sequence lengths.
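
As a rough illustration of the idea behind NTK-aware scaling (the exact RoPE base and schedule Hunyuan-A13B uses are not given here; `base` and `scale` below are placeholders), the rotary base is stretched so that low-frequency bands interpolate across longer sequences while high-frequency bands stay nearly intact:

```python
# Hedged sketch of NTK-aware RoPE scaling, the technique named for
# extending the context window (e.g., to 32K and then 256K tokens).
import torch

def ntk_rope_freqs(head_dim: int, base: float = 10000.0, scale: float = 8.0):
    # NTK-aware scaling: stretch the RoPE base by scale^(d / (d - 2)) so the
    # usable context grows roughly by `scale` without retraining from scratch.
    ntk_base = base * scale ** (head_dim / (head_dim - 2))
    return 1.0 / ntk_base ** (torch.arange(0, head_dim, 2).float() / head_dim)

freqs = ntk_rope_freqs(head_dim=128, scale=8.0)  # ~8x longer usable context
```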

Dual-Mode Reasoning: Fast and Slow Thinking

A notable feature of Hunyuan-A13B is its dual-mode Chain-of-Thought (CoT) capability. It supports both a low-latency fast-thinking mode for routine queries and a more elaborate slow-thinking mode for multi-step reasoning. Users can toggle between these modes using a simple tag system: /no think for fast inference and /think for reflective reasoning. This flexibility allows users to adapt computational costs based on task complexity.
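
A minimal usage sketch with Hugging Face transformers follows; the repo name "tencent/Hunyuan-A13B-Instruct" and the exact way the chat template consumes the tags are assumptions to verify against the model card:

```python
# Illustrative only: the post says a simple tag toggles the reasoning mode.
# Model id and tag handling are assumptions, not confirmed API details.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tencent/Hunyuan-A13B-Instruct"  # assumed HF repo name
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, device_map="auto"
)

def ask(question: str, slow: bool) -> str:
    # "/think" requests the slow, reflective CoT mode; "/no think" the fast one.
    prefix = "/think" if slow else "/no think"
    messages = [{"role": "user", "content": f"{prefix} {question}"}]
    inputs = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=512)
    return tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True)

print(ask("What is 17 * 24?", slow=False))              # fast path, routine query
print(ask("Plan a 3-step data pipeline.", slow=True))   # slow, multi-step reasoning
```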

Post-Training: Reinforcement Learning with Task-Specific Reward Models

The post-training pipeline of Hunyuan-A13B includes multi-stage supervised fine-tuning (SFT) and reinforcement learning (RL) across both reasoning-specific and general tasks. The RL stages incorporate outcome-based rewards and tool-specific feedback, including sandbox execution environments for code and rule-based checks for agents.
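
As a hedged sketch of what an outcome-based code reward can look like (this is not Tencent's pipeline; a production sandbox adds process isolation, resource limits, and richer grading), one might execute a candidate solution against unit tests and return a binary pass/fail score:

```python
# Toy outcome-based reward: run the model's code answer plus its unit tests
# in a subprocess and reward only a clean exit. Illustrative, not Tencent's.
import subprocess
import sys
import tempfile

def code_reward(candidate_code: str, test_code: str, timeout_s: int = 5) -> float:
    program = candidate_code + "\n\n" + test_code
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout_s)
        return 1.0 if proc.returncode == 0 else 0.0  # outcome only: pass/fail
    except subprocess.TimeoutExpired:
        return 0.0

print(code_reward("def add(a, b):\n    return a + b", "assert add(2, 3) == 5"))  # 1.0
```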

In the agent training phase, the team synthesized diverse tool-use scenarios with planner, checker, and tool roles, generating over 20,000 format combinations. This reinforced Hunyuan-A13B’s ability to execute real-world workflows such as spreadsheet processing, information search, and structured reasoning.
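
The exact schema behind those 20,000+ format combinations is not public; purely for illustration, one synthesized trajectory combining planner, checker, and tool roles might look like this (all field names and values below are hypothetical):

```python
# Hypothetical tool-use trajectory layout, for illustration only.
trajectory = [
    {"role": "user",      "content": "Total revenue in Q3 from sales.xlsx?"},
    {"role": "planner",   "content": "1) load spreadsheet 2) sum the Q3 revenue column"},
    {"role": "tool",      "name": "spreadsheet_reader",
     "arguments": {"file": "sales.xlsx", "column": "revenue", "quarter": "Q3"}},
    {"role": "tool",      "name": "spreadsheet_reader", "result": 1_284_500},
    {"role": "checker",   "content": "Positive number, matches the column; accept."},
    {"role": "assistant", "content": "Q3 revenue totals 1,284,500."},
]
```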

Evaluation: State-of-the-Art Agentic Performance

Hunyuan-A13B demonstrates strong benchmark results across various NLP tasks:

  • On MATH, CMATH, and GPQA, it scores on par with or above larger dense and MoE models.
  • It surpasses Qwen3-A22B and DeepSeek R1 in logical reasoning (BBH: 89.1; ZebraLogic: 84.7).
  • In coding, it holds its own with 83.9 on MBPP and 69.3 on MultiPL-E.
  • For agent tasks, it leads on BFCL-v3 (78.3) and ComplexFuncBench (61.2), validating its tool-usage capabilities.
  • Long-context comprehension is another highlight. On PenguinScrolls, it scores 87.7—just shy of Gemini 2.5 Pro. On RULER, it maintains high performance (73.9) even at 64K–128K context, outperforming larger models like Qwen3-A22B and DeepSeek R1 in context resilience.

Inference Optimization and Deployment

Hunyuan-A13B is fully integrated with popular inference frameworks such as vLLM, SGLang, and TensorRT-LLM. It supports precision formats including W16A16, W8A8, and FP8 KV cache, along with features like Auto Prefix Caching and Chunk Prefill. It achieves up to 1981.99 tokens/sec throughput at batch size 32 (2048-token input, 14336-token output), making it practical for real-time applications.
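
A minimal serving sketch with vLLM follows; the repo name and flag values are assumptions (quantization and KV-cache dtype support may differ by release), so consult the model card before deploying:

```python
# Hedged vLLM sketch under the integration described above; values assumed.
from vllm import LLM, SamplingParams

llm = LLM(
    model="tencent/Hunyuan-A13B-Instruct",  # assumed HF repo name
    trust_remote_code=True,
    kv_cache_dtype="fp8",        # FP8 KV cache, as cited above
    enable_prefix_caching=True,  # Auto Prefix Caching
    max_model_len=262144,        # 256K-token context window
)

params = SamplingParams(temperature=0.7, max_tokens=1024)
out = llm.generate(["/no think Summarize MoE routing in two sentences."], params)
print(out[0].outputs[0].text)
```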

Open Source and Industry Relevance

Available on Hugging Face and GitHub, Hunyuan-A13B is released with permissive open-source licensing. It is engineered for efficient research and production use, particularly in latency-sensitive environments and long-context tasks.

By combining MoE scalability, agentic reasoning, and open-source accessibility, Tencent’s Hunyuan-A13B offers a compelling alternative to heavyweight LLMs, enabling broader experimentation and deployment without sacrificing capability.

Check out the Paper. All credit for this research goes to the researchers of this project.