Meet VoXtream: An Open-Sourced Full-Stream Zero-Shot TTS Model for Real-Time Use that Begins Speaking from the First Word
VoXtream, developed by KTH’s Speech, Music and Hearing group, addresses the latency issues faced by real-time agents, live dubbing, and simultaneous translation. Traditional streaming text-to-speech (TTS) systems often wait for a chunk of text before emitting sound, producing a noticeable delay before the voice starts. In contrast, VoXtream begins speaking after the first word, outputs audio in 80 ms frames, and achieves a first-packet latency (FPL) of 102 ms on a modern GPU with PyTorch compile.
Understanding Full-Stream TTS
Full-stream TTS differs from output-only streaming: instead of requiring the full input text before synthesis starts, it consumes text as it arrives and emits audio in real time. VoXtream implements this by ingesting a word stream and generating audio frames continuously, eliminating input-side buffering while keeping per-frame compute low. The architecture prioritizes first-word onset rather than just steady-state throughput.
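In pseudocode, the control flow looks roughly like the sketch below. The `model` interface here is hypothetical, used only to illustrate the word-in, frame-out pattern; VoXtream’s actual API may differ:

```python
# Minimal sketch of a full-stream TTS loop (hypothetical interface).
# Words arrive incrementally; 80 ms audio frames are emitted as soon as
# the model has enough phoneme context, with no input-side buffering of
# full sentences or chunks.

def full_stream_tts(word_stream, model):
    phoneme_buffer = []
    for word in word_stream:                         # text arrives word by word
        phoneme_buffer.extend(model.phonemize(word))
        # Emit frames as soon as the dynamic look-ahead allows generation;
        # the first frames can be produced after just the first word.
        while model.can_generate(phoneme_buffer):
            yield model.next_frame(phoneme_buffer)   # one 80 ms audio frame
    # Flush any remaining audio once the input stream ends.
    yield from model.flush(phoneme_buffer)
```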
How VoXtream Starts Speaking Immediately
The key innovation is a dynamic phoneme look-ahead within an incremental Phoneme Transformer (PT). This allows the system to begin generating audio as soon as the first word enters the buffer, avoiding delays associated with fixed look-ahead windows.
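The idea can be illustrated with a small attention-mask sketch. This demonstrates dynamic look-ahead in general rather than VoXtream’s actual code: each position may see up to 10 future phonemes, but generation proceeds with however many have actually arrived:

```python
import torch

# Sketch of a dynamic look-ahead mask (an illustration of the concept, not
# VoXtream's implementation). Position i may attend to all past phonemes
# plus up to LOOKAHEAD future ones, but never beyond `available`, the
# number of phonemes that have actually arrived so far.

LOOKAHEAD = 10  # maximum look-ahead reported for the Phoneme Transformer

def dynamic_lookahead_mask(seq_len: int, available: int) -> torch.Tensor:
    idx = torch.arange(seq_len)
    # Allowed: j <= i + LOOKAHEAD (bounded look-ahead) and j < available
    allowed = (idx[None, :] <= idx[:, None] + LOOKAHEAD) & (idx[None, :] < available)
    return allowed  # True where attention is permitted

# With only 3 phonemes received, generation need not wait for a full
# 10-phoneme window: the mask simply exposes whatever is available.
print(dynamic_lookahead_mask(seq_len=16, available=3)[:4, :12])
```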
Technical Architecture
VoXtream consists of a single, fully-autoregressive (AR) pipeline with three transformers:
- Phoneme Transformer (PT): decoder-only, incremental; dynamic look-ahead ≤ 10 phonemes; phonemization via g2pE at the word level.
- Temporal Transformer (TT): AR predictor over Mimi codec semantic tokens plus a duration token that encodes a monotonic phoneme-to-audio alignment.
- Depth Transformer (DT): AR generator for the remaining Mimi acoustic codebooks, conditioned on TT outputs and a ReDimNet speaker embedding for zero-shot voice prompting.
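Put together, one 80 ms frame flows through the three stages roughly as follows. Module names and signatures are illustrative, based on the description above rather than the VoXtream codebase:

```python
# Illustrative single-step flow through the three-transformer pipeline
# (hypothetical names and signatures).

def generate_frame(pt, tt, dt, phonemes, history, spk_embedding):
    # 1) Phoneme Transformer: incremental context over the phonemes seen
    #    so far, with a dynamic look-ahead of at most 10 phonemes.
    phone_ctx = pt(phonemes)

    # 2) Temporal Transformer: autoregressively predicts the next Mimi
    #    semantic token plus a duration token that advances the monotonic
    #    phoneme-to-audio alignment.
    semantic_tok, duration_tok = tt(phone_ctx, history)

    # 3) Depth Transformer: fills in the remaining Mimi acoustic codebooks
    #    for this frame, conditioned on the TT output and a ReDimNet
    #    speaker embedding for zero-shot voice prompting.
    acoustic_toks = dt(semantic_tok, spk_embedding)

    return semantic_tok, duration_tok, acoustic_toks  # one 80 ms frame
```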
Performance Metrics
The repository includes a benchmark script measuring both FPL and real-time factor (RTF). On an A100 GPU, the team reports 171 ms / 1.00 RTF without compile and 102 ms / 0.17 RTF with compile. On an RTX 3090, the figures are 205 ms / 1.19 RTF uncompiled and 123 ms / 0.19 RTF compiled.
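Both quantities are straightforward to measure. A minimal sketch of the definitions, assuming a hypothetical `stream` generator that yields 80 ms audio frames (the repository ships its own benchmark script):

```python
import time

FRAME_SEC = 0.080  # VoXtream emits audio in 80 ms frames

def benchmark(stream):
    t0 = time.perf_counter()
    n_frames = 0
    fpl = None
    for _frame in stream:              # each item is one 80 ms audio frame
        if fpl is None:
            fpl = time.perf_counter() - t0   # time to the first audio packet
        n_frames += 1
    wall = time.perf_counter() - t0
    audio_sec = n_frames * FRAME_SEC
    rtf = wall / audio_sec             # RTF < 1.0 means faster than real time
    return fpl, rtf
```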
Comparative Analysis
In evaluations against popular streaming baselines, VoXtream achieves a lower word error rate (WER) of 3.24% versus CosyVoice2’s 6.11%. Listener studies show a significant preference for VoXtream’s naturalness, while CosyVoice2 leads on speaker similarity. In compiled mode, VoXtream runs more than 5× faster than real time (RTF ≈ 0.17).
Data Utilization
VoXtream was trained on a mid-scale corpus of roughly 9k hours, comprising approximately 4.5k hours each of Emilia and HiFiTTS-2 (22 kHz subset). The team ran diarization to remove multi-speaker clips and filtered transcripts with ASR, keeping the training data clean.
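A minimal sketch of this kind of filter, for illustration only: the stand-in helpers and the WER threshold are assumptions, not the team’s actual pipeline.

```python
import jiwer  # pip install jiwer

# A clip is kept only if diarization finds a single speaker and the ASR
# transcript stays close to the reference text. `diarize_speaker_count` and
# `asr_transcribe` are hypothetical stand-ins for tools such as a pyannote
# diarization pipeline and a Whisper-style ASR model.

MAX_WER = 0.1  # assumed threshold; the actual cutoff may differ

def keep_clip(audio_path: str, reference_text: str,
              diarize_speaker_count, asr_transcribe) -> bool:
    if diarize_speaker_count(audio_path) != 1:
        return False                       # drop multi-speaker clips
    hypothesis = asr_transcribe(audio_path)
    return jiwer.wer(reference_text.lower(), hypothesis.lower()) <= MAX_WER
```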
Quality Metrics
VoXtream’s performance holds up across metrics including WER, UTMOS (a neural MOS predictor), and speaker similarity. An ablation study showed that adding the CSM Depth Transformer and the speaker encoder improves similarity without significantly degrading WER.
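Speaker similarity in such evaluations is typically the cosine similarity between speaker embeddings of the reference prompt and the generated speech. A minimal sketch, in which the embedding model and dimensionality are placeholders:

```python
import torch
import torch.nn.functional as F

# Cosine similarity between two speaker embeddings; evaluations of this
# kind usually obtain the embeddings from a speaker-verification encoder
# (e.g. an ECAPA- or WavLM-based model).

def speaker_similarity(emb_prompt: torch.Tensor, emb_generated: torch.Tensor) -> float:
    return F.cosine_similarity(emb_prompt, emb_generated, dim=-1).item()

# Example with random placeholder embeddings (192-dim is a common size):
e1, e2 = torch.randn(192), torch.randn(192)
print(f"cosine similarity: {speaker_similarity(e1, e2):.3f}")
```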
Positioning in the TTS Landscape
VoXtream’s core contribution is a latency-focused AR arrangement with duration-token alignment that preserves input-side streaming. The design trades slightly lower speaker similarity for a substantially lower FPL than chunked NAR-vocoder pipelines.
Further Resources
For more information, check out the research paper, the model on Hugging Face, and the GitHub page for tutorials, code, and notebooks.