
Moonshot AI Releases Kimi K2 Thinking: An Impressive Thinking Model that can Execute up to 200–300 Sequential Tool Calls without Human Interference

Understanding the Target Audience

The target audience for Kimi K2 Thinking includes AI researchers, business managers, and decision-makers in tech companies focused on AI integration. Their pain points often involve:

  • Difficulty in implementing AI systems that require minimal human intervention.
  • The need for reliable AI models that can manage complex tasks over extended periods.
  • Concerns about the efficiency and scalability of AI solutions in business environments.

Their goals include:

  • Finding advanced AI models that can enhance productivity and decision-making.
  • Leveraging AI for tasks requiring deep reasoning and sequential decision-making.

Their interests span the latest advancements in AI technology, practical business applications, and tools that make AI deployment straightforward. They prefer concise, data-driven content that highlights practical implementations and technical specifications.

What is Kimi K2 Thinking?

Kimi K2 Thinking is the latest version of Moonshot’s open-source thinking model, designed as a thinking agent capable of step-by-step reasoning while dynamically invoking tools during inference. This model interleaves chains of thought with function calls, allowing it to read, think, call a tool, and repeat this process for hundreds of steps.

The model achieves state-of-the-art scores on Humanity’s Last Exam and BrowseComp, maintaining coherent behavior across approximately 200 to 300 sequential tool calls without human interference. It is released as an open weights model featuring a 256K token context window and native INT4 inference, which reduces latency and GPU memory usage while preserving benchmark performance.
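The read, think, call a tool, repeat cycle can be sketched as a minimal agent driver. This is an illustrative outline only; `model` and `tools` are hypothetical stand-ins, not Moonshot's actual API:

```python
import json

def run_agent(model, tools, prompt, max_steps=300):
    """Minimal agentic loop: the model reasons, optionally requests a tool,
    and the tool result is fed back until a final answer appears."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_steps):
        reply = model(messages)  # assumed to return a dict, possibly with "tool_call"
        messages.append(reply)
        call = reply.get("tool_call")
        if call is None:
            return reply["content"]  # no tool requested: final answer
        # Execute the requested tool and append its result for the next step.
        result = tools[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "content": json.dumps(result)})
    return None  # step budget exhausted without a final answer
```

The `max_steps` budget is the analogue of K2 Thinking's 200 to 300 sequential tool calls: the loop keeps interleaving reasoning and tool results until the model stops asking for tools.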

Architecture and Technical Specifications

Kimi K2 Thinking employs the Kimi K2 Mixture of Experts (MoE) architecture. Key specifications include:

  • Total parameters: 1T
  • Activated parameters per token: 32B
  • Layers: 61 (including 1 dense layer)
  • Experts: 384 (8 experts selected per token, 1 shared expert)
  • Attention heads: 64
  • Attention hidden dimension: 7168
  • MoE hidden dimension: 2048 per expert
  • Vocabulary size: 160K tokens
  • Context length: 256K

The attention mechanism utilizes Multi-head Latent Attention, and the activation function is SwiGLU.
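The 8-of-384 expert selection above is standard top-k MoE routing: a router scores every expert, the top k are softmax-weighted and combined, and the shared expert runs on every token. A toy sketch under those assumptions (function names and scalar "experts" are illustrative, not the model's implementation):

```python
import math

def top_k_route(router_logits, k=8):
    """Select the k highest-scoring experts and softmax-normalize
    their gate weights (K2 picks 8 of 384 experts per token)."""
    topk = sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in topk)           # for numerical stability
    exps = [math.exp(router_logits[i] - m) for i in topk]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(topk, exps)]

def moe_output(x, experts, router_logits, shared_expert, k=8):
    """Weighted sum of the routed experts, plus the always-on shared expert."""
    routed = sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))
    return routed + shared_expert(x)
```

Because only 8 of 384 experts fire per token, roughly 32B of the 1T parameters are active on any forward pass, which is what keeps inference cost far below that of a dense trillion-parameter model.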

Test Time Scaling and Long Horizon Thinking

Kimi K2 Thinking is optimized for test time scaling, expanding its reasoning length and tool call depth when facing more challenging tasks. Performance benchmarks include:

  • Humanity’s Last Exam (no tools): 23.9
  • Humanity’s Last Exam (with tools): 44.9
  • Humanity’s Last Exam (heavy setting): 51.0
  • AIME25 with Python: 99.1
  • HMMT25 with Python: 95.1
  • IMO AnswerBench: 78.6
  • GPQA: 84.5

Benchmarks in Agentic Search and Coding

In agentic search tasks, Kimi K2 Thinking achieved the following scores:

  • BrowseComp: 60.2
  • BrowseComp ZH: 62.3
  • Seal 0: 56.3
  • FinSearchComp T3: 47.4
  • Frames: 87.0

For general knowledge benchmarks:

  • MMLU Pro: 84.6
  • MMLU Redux: 94.4
  • Longform Writing: 73.8
  • HealthBench: 58.0

In coding tasks, it scored:

  • SWE bench Verified with tools: 71.3
  • SWE bench Multilingual with tools: 61.1
  • Multi SWE bench with tools: 41.9
  • SciCode: 44.8
  • LiveCodeBenchV6: 83.1
  • OJ Bench (C++ setting): 48.7
  • Terminal Bench with simulated tools: 47.1

Native INT4 Quantization and Deployment

Kimi K2 Thinking is trained as a native INT4 model, utilizing Quantization Aware Training during the post-training stage. This supports INT4 inference, enabling approximately a 2x generation speed improvement in low latency mode while maintaining state-of-the-art performance. All benchmark scores reported are under INT4 precision.
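Native INT4 means the weights are stored as 4-bit integers (range −8 to 7) plus a scale factor, and quantization-aware training lets the model adapt to that rounding before release. A minimal sketch of symmetric INT4 round-trip quantization; the per-group scheme here is illustrative, not the model's actual recipe:

```python
def quantize_int4(weights):
    """Symmetric INT4: map floats onto integers in [-8, 7]
    using one scale per weight group."""
    scale = max(abs(w) for w in weights) / 7 or 1.0  # guard against all-zero group
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_int4(q, scale):
    """Recover approximate float weights for use in matmuls."""
    return [v * scale for v in q]
```

Two INT4 values pack into a single byte, so weight memory drops roughly 4x versus FP16; the reported ~2x generation speedup in low-latency mode follows because decoding is dominated by reading weights from GPU memory.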

Key Takeaways

Kimi K2 Thinking is an open weights thinking agent that extends the Kimi K2 Mixture of Experts architecture with explicit long horizon reasoning and tool use. The model features:

  • A trillion-parameter MoE design with 32B activated parameters per token.
  • A 256K context window.
  • Native INT4 model with Quantization Aware Training, enabling faster inference while preserving performance.

K2 Thinking is optimized for test time scaling, capable of executing hundreds of sequential tool calls within a single task. It shows competitive performance across various reasoning, agentic search, and coding tasks, demonstrating the advantages of a thinking-oriented AI model.

Explore Further

For more information on Kimi K2 Thinking, check out the Model Weights and Technical Details, and visit the GitHub Page for tutorials, code, and notebooks.