Ant Group Releases Ling 2.0: A Reasoning-First MoE Language Model Series
Ling 2.0 is aimed primarily at AI researchers, data scientists, and technology leaders who work with advanced language models and their enterprise applications. A common pain point for this audience is the need for models that deliver high-quality output, and stronger reasoning in particular, without excessive computational cost. Their interests center on innovations in AI architecture, scalability, and practical business applications, and they tend to prefer concise, data-driven, technically detailed communication over marketing rhetoric.
Overview of Ling 2.0
Ling 2.0 is a reasoning-first language model family developed by the Inclusion AI team at Ant Group and built on the principle that every activation should contribute to reasoning capability. The series scales from 16 billion (16B) to 1 trillion (1T) total parameters while keeping per-token computation nearly constant. It includes the following models, whose activated shares are worked out in the short sketch after the list:
- Ling mini 2.0: 16B total parameters with 1.4B activated
- Ling flash 2.0: 100B total parameters with 6.1B activated
- Ling 1T: 1T total parameters with approximately 50B activated per token
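Taken together, the listed figures show how small the activated slice stays as total capacity grows. The snippet below simply reproduces that arithmetic from the numbers above; it uses no information beyond the list and is purely illustrative.

```python
# Activated-vs-total parameter shares for the three Ling 2.0 models,
# computed directly from the figures listed above (in billions).
models = {
    "Ling mini 2.0":  {"total": 16,   "activated": 1.4},
    "Ling flash 2.0": {"total": 100,  "activated": 6.1},
    "Ling 1T":        {"total": 1000, "activated": 50},
}

for name, p in models.items():
    share = p["activated"] / p["total"]
    print(f'{name:<15} total={p["total"]:>5}B  activated={p["activated"]:>5}B  share={share:.1%}')
```

These shares sit above the roughly 1/32 expert-level activation discussed later, largely because always-on components (attention layers, embeddings, the shared expert) are counted in the activated total; the point of the series is that the activated slice stays small even as total capacity grows from 16B to 1T.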
Sparse MoE as the Central Design
The Ling 2.0 models use a sparse Mixture of Experts (MoE) layer. Each MoE layer contains 256 routed experts plus one shared expert, and the router selects 8 of the 256 routed experts for every token, an activation of roughly 3.5 percent. Because each token runs through only a small slice of the network while drawing on a much larger parameter pool, the design is reported to be about seven times more efficient than an equivalent dense model.
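To make the routing concrete, here is a minimal PyTorch sketch of a sparse MoE layer with the shape described above: 256 routed experts plus one always-on shared expert, with each token dispatched to its top 8 routed experts. The hidden dimensions, gating details, and the absence of load-balancing losses are simplifications for illustration; this is not Ling 2.0's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Toy sparse MoE block: 256 routed experts + 1 shared expert, top-8 routing."""

    def __init__(self, d_model=128, d_ff=256, n_experts=256, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):                          # x: (num_tokens, d_model)
        logits = self.router(x)                    # (num_tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # normalize over the 8 chosen experts
        out = self.shared_expert(x)                # shared expert sees every token
        for slot in range(self.top_k):             # add each token's routed experts
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                out[mask] = out[mask] + weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

tokens = torch.randn(4, 128)                       # 4 tokens, d_model = 128
print(SparseMoELayer()(tokens).shape)              # -> torch.Size([4, 128])
```

With 8 of 256 routed experts running per token, only about 3 percent of the routed-expert parameters participate in any single forward pass, which is the mechanism behind the activation figures quoted above.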
Key Technical Specifications
Model Architecture
The architecture is chosen using the Ling Scaling Laws, obtained through a procedure the team calls the Ling Wind Tunnel, which makes it possible to predict loss, activation, and expert balance across model sizes at low cost. The resulting activation ratio of roughly 1/32 is held constant across the series.
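The article describes the Ling Wind Tunnel only at a high level, so the sketch below is a generic illustration of what fitting and extrapolating a scaling law looks like: measure loss on small, cheap runs, fit a power law, and project to a larger budget. The data points, functional form, and coefficients are invented for illustration; they are not the Ling Scaling Laws themselves.

```python
import numpy as np

# Synthetic "wind tunnel" measurements: eval loss from a few small, cheap runs.
compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])   # training FLOPs (made up)
loss    = np.array([2.91, 2.74, 2.58, 2.45, 2.33])   # eval loss (made up)

# Fit log(loss) = b * log(C) + log(a), i.e. loss ≈ a * C**b with b < 0.
b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
a = np.exp(log_a)

def predicted_loss(flops):
    return a * flops ** b

print(f"fitted law: loss ≈ {a:.2f} * C^{b:.3f}")
print(f"extrapolated loss at 1e21 FLOPs: {predicted_loss(1e21):.3f}")
```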
Pre-training
Ling 2.0 is pre-trained on more than 20 trillion (20T) tokens, with the share of reasoning-heavy sources such as mathematics and code gradually increased to nearly half of the corpus. The training pipeline extends the context length in stages, reaching 128K tokens while preserving quality.
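As a concrete way to read that description, the snippet below lays out a hypothetical staged schedule in which the math-and-code share rises toward roughly half of the mix and the context window is extended to 128K. The stage boundaries, per-stage token counts, and exact percentages are assumptions for illustration; only the 20T-token total, the roughly-half reasoning share, and the 128K context target come from the text above.

```python
# Hypothetical staged schedule reflecting the two trends described above:
# a growing share of reasoning-heavy data and progressive context extension.
stages = [
    {"name": "stage 1", "tokens_T": 12, "context_len": 4_096,   "reasoning_share": 0.30},
    {"name": "stage 2", "tokens_T": 6,  "context_len": 32_768,  "reasoning_share": 0.40},
    {"name": "stage 3", "tokens_T": 2,  "context_len": 131_072, "reasoning_share": 0.48},
]

total_tokens = sum(s["tokens_T"] for s in stages)
print(f"total pre-training tokens: {total_tokens}T")
for s in stages:
    other = 1.0 - s["reasoning_share"]
    print(f'{s["name"]}: ctx={s["context_len"]:>7,}  '
          f'math+code={s["reasoning_share"]:.0%}  other={other:.0%}')
```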
Post-training
Alignment is split into two passes, one for capability and one for preference. The process includes Decoupled Fine-Tuning and evolutionary Chain-of-Thought (CoT) stages, which strengthen the model's ability to switch between quick responses and deeper reasoning.
Infrastructure
Ling 2.0 is trained natively in FP8, which improves hardware utilization and efficiency. The system combines heterogeneous pipeline parallelism with other systems techniques to make training practical at the trillion-parameter scale.
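The FP8 point is about storing and multiplying tensors in 8-bit floating point with scaling factors that preserve dynamic range. The NumPy sketch below simulates a per-tensor-scaled, E4M3-style round trip to show the kind of quantization error involved; it is a didactic approximation, not Ling 2.0's FP8 kernels or its parallel training stack.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def fake_fp8(x):
    """Quantize to a coarse E4M3-like grid: scale into range per tensor,
    keep about 4 significant binary digits, then rescale back."""
    scale = E4M3_MAX / np.max(np.abs(x))
    y = x * scale
    exp = np.floor(np.log2(np.abs(y) + 1e-30))      # exponent of each value
    y = np.round(y / 2 ** (exp - 3)) * 2 ** (exp - 3)  # crude mantissa truncation
    return y / scale

w = np.random.randn(1024, 1024).astype(np.float32)
err = np.abs(fake_fp8(w) - w).mean() / np.abs(w).mean()
print(f"mean relative error after fake-FP8 round trip: {err:.3%}")
```

In a real FP8 training stack the scales are tracked per tensor or per block and updated as statistics change; the point here is only that an 8-bit format plus scaling keeps the error small enough to be workable.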
Evaluation Results
Models in the Ling 2.0 series deliver competitive quality while maintaining low per-token compute. For instance:
- Ling mini 2.0 matches the quality of 7B to 8B dense models while generating more than 300 tokens per second on simple QA tasks (a back-of-the-envelope comparison follows this list).
- Ling flash 2.0 keeps the same activation ratio and extends the recipe to the 100B total-parameter class with 6.1B parameters activated per token.
- Ling 1T demonstrates the design at full scale, pairing trillion-parameter capacity with long-context reasoning.
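A rough way to see why a 1.4B-activated model can be compared with 7B to 8B dense models is the common rule of thumb that a forward pass costs about 2 FLOPs per activated parameter per token. The numbers below are back-of-the-envelope estimates based on that approximation and the figures in this article; the 7.5B dense baseline is an assumed midpoint, and actual decoding speed also depends on memory bandwidth, batching, and implementation.

```python
# Back-of-the-envelope per-token compute, using the common ~2 FLOPs per
# activated parameter approximation. The dense baseline size is assumed.
FLOPS_PER_PARAM = 2

ling_mini_activated = 1.4e9   # activated parameters per token (from the article)
dense_baseline      = 7.5e9   # a 7B-8B dense model, taken as 7.5B for illustration

ling_flops  = FLOPS_PER_PARAM * ling_mini_activated
dense_flops = FLOPS_PER_PARAM * dense_baseline

print(f"Ling mini 2.0 per-token forward FLOPs ≈ {ling_flops:.1e}")
print(f"7B-8B dense per-token forward FLOPs  ≈ {dense_flops:.1e}")
print(f"ratio ≈ {dense_flops / ling_flops:.1f}x less compute per token")
```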
Conclusion
Ling 2.0 presents a complete sparse MoE stack built around a fixed 1/32 activation ratio. Its design and training choices point to organizing reasoning capability around fixed sparsity rather than ever-denser computation.
For further information, refer to the original research paper.