xAI Launches Grok-4-Fast: Unified Reasoning and Non-Reasoning Model with 2M-Token Context
xAI has introduced Grok-4-Fast, a cost-optimized successor to Grok-4 that integrates “reasoning” and “non-reasoning” behaviors into a single model. The model targets high-throughput search, coding, and Q&A applications, featuring a 2M-token context window and native tool-use reinforcement learning (RL) that decides when to browse the web, execute code, or call other tools.
Architecture Overview
Previous iterations of Grok utilized separate models for long-chain “reasoning” and short “non-reasoning” responses. Grok-4-Fast’s unified weight space minimizes end-to-end latency and token usage by steering behavior through system prompts. This is particularly relevant for real-time applications such as search, assistive agents, and interactive coding, where switching models can increase both latency and costs.
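To make the prompt-steering idea concrete, here is an illustrative sketch of two OpenAI-style chat-completion payloads aimed at one unified model, where only the system prompt toggles between deliberate and direct behavior. The model name and the prompt wordings are assumptions for illustration, not documented xAI API values.

```python
# Sketch: one set of weights, two behaviors, selected by the system prompt.
# "grok-4-fast" and the prompt texts below are hypothetical placeholders.

def build_request(user_msg: str, reasoning: bool) -> dict:
    """Build a chat-completion payload; the system prompt steers behavior."""
    system = (
        "Think step by step before answering."   # hypothetical reasoning prompt
        if reasoning
        else "Answer directly and concisely."    # hypothetical fast-path prompt
    )
    return {
        "model": "grok-4-fast",  # assumed unified model name
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_msg},
        ],
    }

fast = build_request("What is 2+2?", reasoning=False)
deep = build_request("Prove there are infinitely many primes.", reasoning=True)
# Same weights serve both requests; no model switch, so no extra routing latency.
print(fast["model"] == deep["model"])  # -> True
```

Because both requests hit the same weights, an application can flip between modes per query without the cold-start or routing overhead of swapping models.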
Search and Agentic Use Cases
Grok-4-Fast was trained end-to-end using tool-use reinforcement learning, demonstrating significant improvements on search-centric benchmarks:
- BrowseComp: 44.9%
- SimpleQA: 95.0%
- Reka Research: 66.0%
- BrowseComp-zh (Chinese variant): 51.2%
In private testing on LMArena, Grok-4-Fast (codename “menlo”) achieved the top rank in the Search Arena with an Elo score of 1163, while the text variant (codename “tahoe”) ranked #8 in the Text Arena, comparable to Grok-4-0709.
Performance and Efficiency
Grok-4-Fast has demonstrated frontier-class performance on both internal and public benchmarks, achieving pass@1 results of:
- AIME 2025: 92.0% (no tools)
- HMMT 2025: 93.3% (no tools)
- GPQA Diamond: 85.7%
- LiveCodeBench (Jan–May): 80.0%
Notably, it uses approximately 40% fewer “thinking” tokens on average than Grok-4, a property xAI describes as “intelligence density.” Once the lower token count and the new per-token pricing are both factored in, this efficiency translates into a ~98% reduction in the cost of matching Grok-4’s benchmark performance.
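The ~98% figure can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes a Grok-4 output rate of $15 per 1M tokens (an assumption; Grok-4’s pricing is not stated in this article) alongside the announced Grok-4-Fast rate of $0.50 per 1M output tokens in the <128k tier:

```python
# Rough check of the claimed ~98% cost reduction for equivalent performance.
GROK4_OUT = 15.00       # $/1M output tokens -- ASSUMED Grok-4 rate
GROK4_FAST_OUT = 0.50   # $/1M output tokens (<128k tier, from the announcement)
TOKEN_RATIO = 0.60      # Grok-4-Fast uses ~40% fewer "thinking" tokens

# Cost ratio = (fewer tokens) x (cheaper rate), relative to Grok-4.
cost_ratio = (TOKEN_RATIO * GROK4_FAST_OUT) / GROK4_OUT
print(f"{(1 - cost_ratio):.0%} cheaper")  # -> "98% cheaper"
```

Under these assumptions the token savings and the price cut compound multiplicatively, which is how a 40% token reduction and a ~30x cheaper output rate combine into a ~98% overall saving.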
Deployment and Pricing
The Grok-4-Fast model is available to all users in Grok’s Fast and Auto modes across web and mobile platforms. Auto mode selects Grok-4-Fast for complex queries, improving response speed without sacrificing quality. For the first time, free users can access xAI’s latest model tier. Developers can choose between two SKUs—grok-4-fast-reasoning and grok-4-fast-non-reasoning—both featuring a 2M context window. Pricing for the xAI API is structured as follows:
- $0.20 / 1M input tokens (<128k)
- $0.40 / 1M input tokens (≥128k)
- $0.50 / 1M output tokens (<128k)
- $1.00 / 1M output tokens (≥128k)
- $0.05 / 1M cached input tokens
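A minimal cost calculator for the tiers above, assuming (as with similar tiered APIs, an assumption not confirmed by the announcement) that the higher rate applies to the whole request once the input context reaches 128k tokens, and that cached input tokens are billed at the flat cached rate instead of the input rate:

```python
# Sketch of the published Grok-4-Fast price tiers; rates in $ per 1M tokens.
# Tier-boundary behavior and cached-token accounting are assumptions.

def request_cost(input_tokens: int, output_tokens: int,
                 cached_tokens: int = 0) -> float:
    long_ctx = input_tokens >= 128_000          # assumed tier boundary rule
    in_rate = 0.40 if long_ctx else 0.20        # $/1M input tokens
    out_rate = 1.00 if long_ctx else 0.50       # $/1M output tokens
    billed_input = input_tokens - cached_tokens  # cached tokens billed at $0.05/M
    return (billed_input * in_rate
            + cached_tokens * 0.05
            + output_tokens * out_rate) / 1_000_000

# e.g. 10k input tokens (2k of them cached) plus 1k output tokens:
print(f"${request_cost(10_000, 1_000, cached_tokens=2_000):.6f}")  # -> $0.002200
```

At these rates, even long-context requests stay in fractions of a cent, which is consistent with the article’s framing of Grok-4-Fast as a high-throughput, cost-optimized SKU.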
Key Technical Takeaways
- Unified model with 2M context: Grok-4-Fast employs a single weight space for both reasoning and non-reasoning tasks, prompt-steered, with a 2,000,000-token window.
- Scalable pricing: API pricing begins at $0.20/M input and $0.50/M output, with cached input priced at $0.05/M.
- Efficiency improvements: xAI claims ~40% fewer “thinking” tokens while maintaining comparable accuracy to Grok-4, resulting in a ~98% lower cost for equivalent performance on benchmarks.
- Benchmark performance: Reported pass@1 results include AIME-2025 at 92.0%, HMMT-2025 at 93.3%, GPQA-Diamond at 85.7%, and LiveCodeBench at 80.0%.
- Designed for agentic/search applications: Post-training with tool-use RL, Grok-4-Fast is optimized for browsing and search workflows, with documented metrics supporting its capabilities.
Grok-4-Fast sets a new standard for cost-efficient intelligence, combining advanced capabilities into a single, prompt-steerable model. It is available for free on various platforms, including iOS and Android apps.
For more technical details, visit the official xAI announcement.