Xiaomi Released MiMo-Audio, a 7B Speech Language Model Trained on 100M+ Hours with High-Fidelity Discrete Tokens
Understanding the Target Audience
The target audience for Xiaomi’s MiMo-Audio includes AI researchers, developers, and business leaders in the tech industry. These individuals are typically engaged in the fields of machine learning, natural language processing, and audio technology. Their pain points often revolve around the challenges of integrating high-quality speech recognition and synthesis into applications, as well as the need for efficient models that can handle diverse audio tasks.
Goals for this audience include:
- Implementing advanced speech technologies in products and services
- Improving user experience through natural language interactions
- Reducing development time and costs associated with speech-related projects
Interests include the latest advancements in AI models, practical applications of speech technology, and tools that facilitate research and development. Communication preferences lean towards technical documentation, peer-reviewed studies, and detailed product specifications.
Overview of MiMo-Audio
Xiaomi’s MiMo team has introduced MiMo-Audio, a 7-billion-parameter audio-language model that operates on a single next-token objective over interleaved text and discretized speech, scaling pretraining beyond 100 million hours of audio.
Key Innovations
MiMo-Audio distinguishes itself with a bespoke RVQ (residual vector quantization) tokenizer designed for both semantic fidelity and high-quality reconstruction. The tokenizer operates at 25 Hz and produces 8 RVQ layers (≈200 tokens/s), giving the language model (LM) access to "lossless" speech representations for autoregressive modeling alongside text.
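To make the residual part of RVQ concrete, here is a minimal, illustrative sketch of how an 8-layer residual quantizer assigns one code per layer to a single 25 Hz frame; the codebook shapes and vector dimensions are placeholders for illustration, not details of Xiaomi's actual tokenizer:

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Quantize one 25 Hz frame with residual vector quantization.

    frame:      (d,) latent vector for a single audio frame
    codebooks:  list of (K, d) arrays, one codebook per RVQ layer
    Returns one code index per layer; the residual shrinks layer by layer,
    so later layers capture finer acoustic detail (prosody, timbre).
    """
    residual = frame.copy()
    codes = []
    for cb in codebooks:                         # 8 layers in MiMo-Audio's tokenizer
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))              # nearest codebook entry
        codes.append(idx)
        residual = residual - cb[idx]            # quantize what is left over
    return codes

# 25 frames/s * 8 layers -> ~200 discrete tokens per second of audio
```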
Architecture
The architecture consists of a patch encoder, a 7B LLM, and a patch decoder. To address the audio/text rate mismatch, the system packs four timesteps per patch for LM consumption (downsampling from 25 Hz to 6.25 Hz) and reconstructs full-rate RVQ streams with a causal patch decoder. A delayed multi-layer RVQ generation scheme is employed to stabilize synthesis and respect inter-layer dependencies. All components are trained under a single next-token objective.
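A minimal sketch of the patching step under those rates (4 timesteps per patch, 8 RVQ layers per timestep); the tensor layout is assumed purely for illustration and is not taken from MiMo-Audio's implementation:

```python
import numpy as np

def patchify(rvq_tokens, patch_size=4):
    """Group consecutive timesteps so the LM runs at a lower rate.

    rvq_tokens: (T, 8) int array of RVQ codes at 25 Hz (T timesteps, 8 layers)
    Returns (T // patch_size, patch_size * 8): each row is one patch covering
    4 timesteps, so the LM consumes patches at 25 / 4 = 6.25 per second.
    """
    T, layers = rvq_tokens.shape
    T = (T // patch_size) * patch_size           # drop a ragged tail, if any
    return rvq_tokens[:T].reshape(T // patch_size, patch_size * layers)

# 10 seconds of audio: 250 timesteps at 25 Hz -> 62 patches at 6.25 Hz
tokens = np.random.randint(0, 1024, size=(250, 8))
print(patchify(tokens).shape)   # (62, 32)
```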
Training Phases
Training occurs in two major phases:
- An “understanding” stage that optimizes text-token loss over interleaved speech-text corpora
- A joint "understanding + generation" stage that activates audio losses for speech continuation, speech-to-text (S2T) and text-to-speech (T2S) tasks, and instruction-style data
The report highlights a compute/data threshold where few-shot behavior begins to emerge, similar to trends observed in large text-only language models.
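Conceptually, the two stages share a single next-token loss and differ mainly in which token positions are supervised. The sketch below illustrates that idea with a stage-dependent mask; the function name, tensor shapes, and masking logic are assumptions for illustration, not MiMo-Audio's actual training code:

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, targets, is_text, stage):
    """Single next-token objective with stage-dependent masking.

    logits:  (B, T, V) model predictions over the joint text+audio vocabulary
    targets: (B, T)    next-token labels for interleaved text/speech sequences
    is_text: (B, T)    True where the target token is a text token
    stage:   "understanding"            -> loss on text tokens only
             "understanding+generation" -> loss on text and audio tokens
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)

    mask = is_text if stage == "understanding" else torch.ones_like(is_text)
    mask = mask.float()
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```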
Performance Benchmarks
MiMo-Audio has been evaluated on speech reasoning suites (e.g., SpeechMMLU) and broad audio understanding benchmarks (e.g., MMAU), achieving strong scores across various tasks and significantly reducing the “modality gap” between text-only and speech-in/speech-out settings. Xiaomi has also released MiMo-Audio-Eval, a public toolkit for reproducing these results.
Importance of MiMo-Audio
The model’s design is intentionally straightforward, avoiding multi-head task towers or bespoke ASR/TTS objectives during pretraining. The key engineering innovations include:
- A tokenizer that preserves prosody and speaker identity
- Patchification to manage sequence lengths effectively
- Delayed RVQ decoding to maintain quality during generation
These design choices enable few-shot speech-to-speech editing and robust speech continuation with minimal task-specific fine-tuning, making it a valuable tool for teams developing spoken agents.
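The delayed RVQ scheme can be pictured as a delay pattern in which each codebook layer is emitted one step later than the layer above it, similar to patterns used in prior codec language models; the sketch below is a generic illustration, and MiMo-Audio's exact offsets are not specified here:

```python
import numpy as np

PAD = -1  # placeholder code for positions not yet generated

def apply_delay_pattern(codes):
    """Shift each RVQ layer one step later than the layer above it.

    codes: (T, L) RVQ codes for T timesteps and L layers.
    Returns (T + L - 1, L): at generation step t the model only emits
    layer k once layers 0..k-1 for that timestep already exist, so
    inter-layer dependencies are respected during autoregressive decoding.
    """
    T, L = codes.shape
    out = np.full((T + L - 1, L), PAD, dtype=codes.dtype)
    for k in range(L):
        out[k:k + T, k] = codes[:, k]
    return out

codes = np.arange(12).reshape(3, 4)      # toy example: 3 timesteps, 4 layers
print(apply_delay_pattern(codes))
```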
Technical Takeaways
- High-Fidelity Tokenization: MiMo-Audio employs a custom RVQ tokenizer operating at 25 Hz with 8 active codebooks, ensuring speech tokens maintain prosody, timbre, and speaker identity.
- Patchified Sequence Modeling: The model reduces sequence length by grouping 4 timesteps into one patch (25 Hz to 6.25 Hz), allowing efficient handling of long speech without losing detail; a back-of-the-envelope example follows this list.
- Unified Next-Token Objective: MiMo-Audio trains under a single next-token prediction loss across interleaved text and audio, simplifying architecture while supporting multi-task generalization.
- Emergent Few-Shot Abilities: Few-shot behaviors such as speech continuation, voice conversion, emotion transfer, and speech translation emerge once training exceeds a large-scale data threshold (approximately 100 million hours).
- Benchmark Leadership: MiMo-Audio achieves state-of-the-art scores on SpeechMMLU (S2S 69.1, T2S 71.5) and MMAU (66.0 overall), narrowing the modality gap between text and speech settings to just 3.4 points.
- Open Ecosystem Release: Xiaomi provides the tokenizer, 7B checkpoints (base and instruct), MiMo-Audio-Eval toolkit, and public demos, enabling researchers and developers to explore speech-to-speech intelligence in open-source environments.
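To make the patchification numbers concrete, here is a quick back-of-the-envelope calculation using the rates quoted above (simple arithmetic for illustration, not figures reported by Xiaomi):

```python
# Rough sequence lengths for 10 minutes of audio at the quoted rates.
seconds = 10 * 60
frame_rate = 25              # tokenizer frames per second
layers = 8                   # RVQ codebooks per frame
patch_size = 4               # timesteps packed into one LM position

raw_codes = seconds * frame_rate * layers           # 120,000 discrete codes
lm_positions = seconds * frame_rate // patch_size   # 3,750 patches at 6.25 Hz
print(raw_codes, lm_positions)
```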
Conclusion
MiMo-Audio illustrates that high-fidelity, RVQ-based "lossless" tokenization combined with patchified next-token pretraining at scale can unlock few-shot speech intelligence without the need for task-specific heads. The 7B stack (tokenizer, patch encoder, LLM, and patch decoder) bridges the audio/text rate gap (25 Hz to 6.25 Hz) while preserving prosody and speaker identity through delayed multi-layer RVQ decoding. Empirical results show the model narrows the text-speech modality gap, generalizes across various benchmarks, and supports in-context speech-to-speech editing and continuation.
For further details, check out the MiMo-Audio Demo.