
Alibaba Qwen Team Just Released FP8 Builds of Qwen3-Next-80B-A3B (Instruct & Thinking), Bringing 80B/3B-Active Hybrid-MoE to Commodity GPUs


Alibaba’s Qwen team has released FP8-quantized checkpoints for its new Qwen3-Next-80B-A3B models in two post-training variants—Instruct and Thinking. These models are designed for high-throughput inference with ultra-long context and Mixture-of-Experts (MoE) efficiency. The FP8 repositories mirror the BF16 releases but package “fine-grained FP8” weights (block size 128) and deployment notes for sglang and vLLM nightly builds. Benchmarks remain those of the original BF16 models; FP8 is provided for convenience and performance, not as a separate evaluation run.

What’s in the A3B Stack

Qwen3-Next-80B-A3B is a hybrid architecture that combines Gated DeltaNet (a linear/conv-style attention surrogate) with Gated Attention, interleaved with an ultra-sparse MoE. The 80B total parameter budget activates approximately 3B parameters per token via 512 experts (10 routed + 1 shared). The model consists of 48 layers arranged into 12 blocks: 3×(Gated DeltaNet → MoE) followed by 1×(Gated Attention → MoE). The native context is 262,144 tokens, validated up to approximately 1,010,000 tokens using RoPE scaling (YaRN). The hidden size is 2048; attention uses 16 Q heads and 2 KV heads at head dimension 256; DeltaNet uses 32 V and 16 QK linear heads at head dimension 128.
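As a rough illustration of that layout (a sketch based only on the counts above, not Qwen's reference code), the 48 layers can be enumerated as 12 repeating blocks of three Gated DeltaNet layers plus one Gated Attention layer, each followed by a sparse MoE feed-forward:

```python
# Sketch of the reported Qwen3-Next-80B-A3B layer layout. The layer names and
# counts come from the model-card summary above; this is not the actual model code.
NUM_BLOCKS = 12
PATTERN = ["gated_deltanet"] * 3 + ["gated_attention"]  # each mixer is followed by an MoE FFN

layers = []
for _ in range(NUM_BLOCKS):
    for mixer in PATTERN:
        layers.append({"mixer": mixer, "ffn": "sparse_moe_512_experts_10_routed_plus_1_shared"})

assert len(layers) == 48
print(sum(l["mixer"] == "gated_attention" for l in layers))  # 12 full-attention layers
```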

The Qwen team reports that the 80B-A3B base model outperforms Qwen3-32B on downstream tasks at roughly 10% of its training cost and delivers around 10× higher inference throughput beyond 32K context, driven by the low MoE activation ratio and multi-token prediction (MTP). The Instruct variant is non-reasoning (it emits no <think> blocks), while the Thinking variant enforces reasoning traces by default and is optimized for complex problems.
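To make the MTP throughput argument concrete, here is a minimal greedy speculative-decoding loop; draft and verify are hypothetical stand-ins (the draft would come from the MTP head, the verification from the full model), not Qwen's serving API:

```python
# Minimal greedy speculative-decoding sketch: accept the longest prefix of the
# drafted tokens that matches the full model's greedy choices, so each full
# forward pass can commit more than one token when drafts are accurate.
def speculative_step(draft, verify, prefix, k=4):
    proposal = draft(prefix, k)            # k cheap draft tokens (e.g., from an MTP head)
    checked = verify(prefix, proposal)     # full model's greedy token at each drafted position
    accepted = []
    for drafted, target in zip(proposal, checked):
        if drafted != target:              # first mismatch: keep the full model's token, stop
            accepted.append(target)
            break
        accepted.append(drafted)
    return prefix + accepted
```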

FP8 Releases: What Actually Changed

The FP8 model cards indicate that the quantization is “fine-grained FP8” with block size 128. Deployment differs slightly from BF16: both sglang and vLLM require current main/nightly builds, with example commands provided for 256K context and optional MTP. The Thinking FP8 card also recommends a reasoning parser flag (e.g., --reasoning-parser deepseek-r1 in sglang, --reasoning-parser deepseek_r1 in vLLM). These releases retain Apache-2.0 licensing.
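The following toy sketch shows what per-block FP8 scaling looks like in principle. It quantizes 128-element blocks along the last dimension with one scale per block; the released checkpoints may instead use 2-D (128×128) weight blocks and different kernels, so treat this purely as an illustration of the idea:

```python
import torch

BLOCK = 128
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

def quantize_blockwise(w: torch.Tensor):
    """Quantize a [rows, cols] BF16/FP32 weight to FP8 with one scale per 128-wide block."""
    rows, cols = w.shape
    w_blocks = w.reshape(rows, cols // BLOCK, BLOCK)
    scale = w_blocks.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    q = (w_blocks / scale).to(torch.float8_e4m3fn)
    return q.reshape(rows, cols), scale.squeeze(-1)  # FP8 weights + per-block scales

def dequantize_blockwise(q: torch.Tensor, scale: torch.Tensor):
    rows, cols = q.shape
    q_blocks = q.to(torch.float32).reshape(rows, cols // BLOCK, BLOCK)
    return (q_blocks * scale.unsqueeze(-1)).reshape(rows, cols)

w = torch.randn(4, 256)
q, s = quantize_blockwise(w)
print((dequantize_blockwise(q, s) - w).abs().max())  # small reconstruction error
```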

Benchmarks (Reported on BF16 Weights)

The Instruct FP8 card reproduces Qwen’s BF16 comparison table, placing Qwen3-Next-80B-A3B-Instruct on par with Qwen3-235B-A22B-Instruct-2507 on several knowledge, reasoning, and coding benchmarks, and ahead on long-context workloads (up to 256K). The Thinking FP8 card lists AIME’25, HMMT’25, MMLU-Pro/Redux, and LiveCodeBench v6, where Qwen3-Next-80B-A3B-Thinking surpasses earlier Qwen3 Thinking releases (30B A3B-2507, 32B) and claims wins over Gemini-2.5-Flash-Thinking on multiple benchmarks.

Training and Post-Training Signals

The series is trained on approximately 15T tokens before post-training. Qwen highlights stability additions (zero-centered, weight-decayed layer norm, among others) and uses GSPO in RL post-training for the Thinking model to handle the combination of hybrid attention and high-sparsity MoE. MTP is used both to speed up inference and to strengthen the pretraining signal.

Why FP8 Matters

On modern accelerators, FP8 activations and weights reduce memory bandwidth pressure and resident footprint compared to BF16, allowing larger batch sizes or longer sequences at similar latency. Since A3B routes only approximately 3B parameters per token, the combination of FP8 and MoE sparsity compounds throughput gains in long-context regimes, particularly when paired with speculative decoding via MTP as exposed in the serving flags. However, quantization interacts with routing and attention variants; real-world acceptance rates for speculative decoding and end-task accuracy can vary with engine and kernel implementations. Hence, Qwen advises using current sglang/vLLM and tuning speculative settings.
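A back-of-envelope calculation makes the footprint argument concrete. Assuming 80B total parameters and ignoring KV cache, activations, and the small overhead of per-block FP8 scales:

```python
# Rough weight-only footprint comparison (assumptions: 80B params, 2 bytes/param
# for BF16, 1 byte/param for FP8; excludes KV cache, activations, and scales).
params = 80e9
bf16_gb = params * 2 / 1e9
fp8_gb = params * 1 / 1e9
print(f"BF16 weights: ~{bf16_gb:.0f} GB, FP8 weights: ~{fp8_gb:.0f} GB")
# -> roughly 160 GB vs 80 GB: the difference between heavy multi-GPU sharding
#    and fitting the weights on far fewer commodity-class accelerators.
```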

Summary

Qwen’s FP8 releases make the 80B/3B-active A3B stack practical to serve at 256K context on mainstream engines, preserving the hybrid-MoE design and MTP path for high throughput. The model cards retain benchmarks from BF16, so teams should validate FP8 accuracy and latency on their own stacks, especially with reasoning parsers and speculative settings. The net outcome is lower memory-bandwidth pressure and higher concurrency without architectural regressions, positioned for long-context production workloads.
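For that validation, a quick smoke test against the served endpoint is often enough to start: both sglang and vLLM expose an OpenAI-compatible API. The base URL, port, and model name below are assumptions; substitute whatever your serving command actually reports:

```python
import time
from openai import OpenAI  # both vLLM and sglang serve an OpenAI-compatible endpoint

# Assumed local endpoint and model id -- adjust to your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct-FP8",
    messages=[{"role": "user", "content": "Summarize the plot of Hamlet in two sentences."}],
    max_tokens=128,
)
elapsed = time.perf_counter() - start
print(resp.choices[0].message.content)
print(f"latency: {elapsed:.2f}s")
```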
