Alibaba Releases Tongyi DeepResearch: A 30B-Parameter Open-Source Agentic LLM Optimized for Long-Horizon Research
Alibaba’s Tongyi Lab has open-sourced Tongyi DeepResearch-30B-A3B, an agent-specialized large language model built for long-horizon, deep information-seeking tasks with web tools. The model uses a mixture-of-experts (MoE) design with approximately 30.5B total parameters and roughly 3–3.3B active per token, enabling high throughput while preserving strong reasoning performance. It targets multi-turn research workflows—searching, browsing, extracting, cross-checking, and synthesizing evidence—under ReAct-style tool use, with a heavier test-time scaling mode. The release includes weights (Apache-2.0), inference scripts, and evaluation utilities.
What the Benchmarks Show
Tongyi DeepResearch reports state-of-the-art results on agentic search suites frequently used to test “deep research” agents:
- Humanity’s Last Exam (HLE): 32.9
- BrowseComp: 43.4 (EN) and 46.7 (ZH)
- xbench-DeepSearch: 75
The team reports that the system is on par with OpenAI-style deep-research agents and consistently outperforms existing proprietary and open-source agents across these tasks.
Architecture and Inference Profile
Tongyi DeepResearch employs MoE routing (Qwen3-MoE lineage) with approximately 30.5B total and approximately 3.3B active parameters, offering the cost envelope of a small dense model while retaining specialist capacity. Its context length is 128K tokens, suitable for long, tool-augmented browsing sessions and iterative synthesis.
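The cost-envelope claim above comes down to the fraction of parameters touched per token. A quick back-of-envelope check, using the reported approximate parameter counts:

```python
# Fraction of parameters active per token for the reported MoE
# configuration (~30.5B total, ~3.3B routed/active). The exact
# per-token count varies with routing; these are the headline figures.
total_params = 30.5e9
active_params = 3.3e9

active_fraction = active_params / total_params
print(f"{active_fraction:.1%} of parameters active per token")  # ~10.8%
```

In other words, each token pays roughly the FLOP cost of a ~3B dense model while the router can draw on ~30B of specialist capacity.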
It features dual inference modes:
- ReAct (native) for direct evaluation of intrinsic reasoning and tool use
- IterResearch “Heavy” mode for test-time scaling with structured multi-round synthesis/reconstruction of context to reduce noise accumulation
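The native ReAct mode is the classic Thought → Action → Observation cycle. The following minimal sketch shows the control flow; the tool names (`search`), the scripted-response harness, and the `Action:`/`Final Answer:` parsing convention are illustrative assumptions, not the repository's actual API:

```python
def react_loop(question, llm, tools, max_turns=16):
    """Minimal ReAct-style loop: the model alternates reasoning steps and
    tool calls until it emits a final answer or exhausts the turn budget."""
    transcript = [f"Question: {question}"]
    for _ in range(max_turns):
        step = llm("\n".join(transcript))          # model proposes next step
        transcript.append(step)
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer:").strip()
        if step.startswith("Action:"):
            # Convention assumed here: "Action: <tool_name> <argument>"
            name, _, arg = step.removeprefix("Action:").strip().partition(" ")
            observation = tools[name](arg)         # e.g. web search, page visit
            transcript.append(f"Observation: {observation}")
    return None  # turn budget exhausted without an answer


# Usage with a scripted stand-in for the model and a stub search tool:
scripted = iter(["Action: search capital of France", "Final Answer: Paris"])
answer = react_loop(
    "What is the capital of France?",
    llm=lambda prompt: next(scripted),
    tools={"search": lambda q: "Paris is the capital of France."},
)
print(answer)  # Paris
```

Because the full transcript grows every turn, long rollouts in this mode accumulate raw observations — which is the failure mode the Heavy mode is designed to address.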
Training Pipeline: Synthetic Data + On-Policy RL
Tongyi DeepResearch is trained end-to-end as an agent, not merely as a chat LLM, using a fully automated, scalable data engine:
- Agentic continual pre-training (CPT): Large-scale synthetic trajectories built from curated corpora, historical tool traces, and graph-structured knowledge to teach retrieval, browsing, and multi-source fusion.
- Agentic SFT cold-start: Trajectories in ReAct and IterResearch formats for schema-consistent planning and tool use.
- On-policy RL with Group Relative Policy Optimization (GRPO), token-level policy gradients, leave-one-out advantage estimation, and negative-sample filtering to stabilize learning in non-stationary web environments.
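The leave-one-out advantage estimation in the RL stage can be sketched compactly: each trajectory in a rollout group is scored against the mean reward of the *other* trajectories in its group, so no learned value baseline is needed. The filtering threshold below is a hypothetical illustration of negative-sample filtering, not the team's published criterion:

```python
from typing import List, Tuple


def leave_one_out_advantages(rewards: List[float]) -> List[float]:
    """Group-relative advantage with a leave-one-out baseline: each
    trajectory's reward minus the mean reward of its group-mates."""
    n = len(rewards)
    assert n > 1, "need at least two rollouts per group"
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


def filter_negatives(samples: list, advantages: List[float],
                     floor: float = -1.0) -> List[Tuple[object, float]]:
    """Selective negative-sample filtering (hypothetical criterion):
    drop trajectories whose advantage falls below a floor, keeping
    mildly negative ones so the policy still sees contrastive signal."""
    return [(s, a) for s, a in zip(samples, advantages) if a >= floor]


# A group of three rollouts: two succeeded (reward 1.0), one failed.
advs = leave_one_out_advantages([1.0, 0.0, 1.0])
print(advs)  # [0.5, -1.0, 0.5]
```

Note that the advantages within a group sum to zero, which is what keeps the update centered as reward scales drift in a non-stationary web environment.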
Role in Document and Web Research Workflows
Deep-research tasks emphasize four capabilities:
- Long-horizon planning
- Iterative retrieval and verification across sources
- Evidence tracking with low hallucination rates
- Synthesis under large contexts
The IterResearch rollout restructures context each round, retaining only essential artifacts to mitigate context bloat and error propagation. The ReAct baseline demonstrates that the behaviors are learned rather than prompt-engineered. The reported scores on HLE and BrowseComp suggest improved robustness on multi-hop, tool-mediated queries where prior agents often over-fit to prompt patterns or saturate at low depths.
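The round-by-round reconstruction can be sketched as a loop that carries only a compact central report between rounds, discarding earlier raw observations. All function names here are illustrative assumptions about the paradigm, not the released code's interfaces:

```python
def iter_research(question, plan_fn, retrieve_fn, synthesize_fn, rounds=4):
    """IterResearch-style rollout sketch: each round plans from the
    current report, fetches fresh evidence, and rebuilds the workspace
    from (question, report, evidence) alone — raw observations from
    earlier rounds are deliberately dropped to limit noise accumulation."""
    report = ""                                # evolving central report
    for _ in range(rounds):
        query = plan_fn(question, report)      # decide what is still missing
        if query is None:                      # planner signals completion
            break
        evidence = retrieve_fn(query)          # fresh evidence for this round
        report = synthesize_fn(question, report, evidence)
    return report


# Toy usage: one retrieval round, then the planner declares completion.
state = {"round": 0}
def toy_plan(q, report):
    state["round"] += 1
    return "open question" if state["round"] == 1 else None

final = iter_research("Q", toy_plan, lambda q: "[evidence]",
                      lambda q, r, e: r + e)
print(final)  # [evidence]
```

The key contrast with plain ReAct is that the context handed to the model each round is bounded by the report size rather than growing with the full transcript.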
Key Features of Tongyi DeepResearch-30B-A3B
- MoE efficiency at scale: Approximately 30.5B total parameters with approximately 3.0–3.3B activated per token (Qwen3-MoE lineage), enabling small-model inference cost with large-model capacity.
- 128K context window: Long-horizon rollouts with evidence accumulation for multi-step web research.
- Dual inference paradigms: Native ReAct for intrinsic tool-use evaluation and IterResearch “Heavy” (test-time scaling) for deeper multi-round synthesis.
- Automated agentic data engine: Fully automated synthesis pipeline powering agentic continual pre-training (CPT), supervised fine-tuning (SFT), and RL.
- On-policy RL with GRPO: Group Relative Policy Optimization with token-level policy gradients, leave-one-out advantage estimation, and selective negative-sample filtering for stability.
Summary
Tongyi DeepResearch-30B-A3B packages a MoE (~30B total, ~3B active) architecture, 128K context, dual ReAct/IterResearch rollouts, and an automated agentic data and GRPO RL pipeline into a reproducible open-source stack. For teams building long-horizon research agents, it offers a practical balance of inference cost and capability with reported strong performance on deep-research benchmarks, particularly in scenarios where precision and reliability are critical.
The model weights are available on Hugging Face, with code and evaluation utilities on the project's GitHub page.