Neuphonic Open-Sources NeuTTS Air: A 748M-Parameter On-Device Speech Language Model with Instant Voice Cloning

Understanding the Target Audience

The target audience for NeuTTS Air includes:

AI Developers: Interested in implementing advanced speech synthesis in applications.
Business Managers: Seeking innovative solutions for customer engagement and user experience.
Privacy-Conscious Users: Concerned about data security and the implications of cloud-based services.

Common pain points include:

Dependence on cloud services for TTS solutions.
Concerns over privacy and data security.
Need for high-quality, realistic voice synthesis without high computational costs.

Goals and interests involve:

Implementing on-device solutions that enhance user experience.
Exploring open-source technologies for flexibility and customization.
Reducing latency in voice applications.

Preferred communication methods include:

Technical documentation and tutorials.
Community forums and discussion groups.
Webinars and live demonstrations.

Overview of NeuTTS Air

Neuphonic has released NeuTTS Air, an open-source text-to-speech (TTS) model designed for local, real-time operation on CPUs. The model has 748M parameters, uses the Qwen2 architecture, and is distributed as GGUF quantizations (Q4/Q8), so inference can run through llama.cpp or llama-cpp-python without cloud dependencies. It is licensed under Apache-2.0 and ships with a runnable demo and examples.
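To put the Q4/Q8 options in perspective, the sketch below estimates weight storage for a 748M-parameter model. It assumes that "Q4/Q8" refers to the llama.cpp Q4_0 and Q8_0 block formats (32 weights plus one 16-bit scale per block); other quantization variants use different bits per weight. Actual GGUF file sizes will differ, since files carry metadata and may keep some tensors at higher precision. The figures also exclude the codec and runtime buffers.

```python
# Rough weight-storage estimate for a 748M-parameter model.
# Assumption: "Q4/Q8" means the llama.cpp Q4_0 and Q8_0 block formats.

PARAMS = 748_000_000  # parameter count reported for NeuTTS Air

# Both formats pack 32 weights per block plus one 16-bit scale.
BLOCK_WEIGHTS = 32
SCALE_BITS = 16


def bits_per_weight(quant_bits: int) -> float:
    """Effective bits per weight, including the per-block scale."""
    return (BLOCK_WEIGHTS * quant_bits + SCALE_BITS) / BLOCK_WEIGHTS


def weight_megabytes(params: int, quant_bits: int) -> float:
    """Approximate weight storage in MB (10**6 bytes)."""
    return params * bits_per_weight(quant_bits) / 8 / 1e6


q4_mb = weight_megabytes(PARAMS, 4)
q8_mb = weight_megabytes(PARAMS, 8)
fp16_mb = PARAMS * 16 / 8 / 1e6  # unquantized half-precision baseline

print(f"Q4_0 ~ {q4_mb:.0f} MB, Q8_0 ~ {q8_mb:.0f} MB, FP16 ~ {fp16_mb:.0f} MB")
```

Under these assumptions the Q4 weights come to roughly 0.4 GB and the Q8 weights to roughly 0.8 GB, against about 1.5 GB at half precision.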

Key Features

Realism at sub-1B scale: Human-like prosody and timbre preservation for a ~0.7B (Qwen2-class) TTS model.
On-device deployment: Suitable for laptops, phones, and Raspberry Pi-class boards.
Instant speaker cloning: Clones a voice from ~3 seconds of reference audio.
Compact LM+codec stack: Efficient representation for real-time use.

Model Architecture and Runtime Path

The architecture includes:

Backbone: Qwen 0.5B serves as a lightweight language model for speech generation; the released model is reported at 748M parameters (Qwen2 architecture).
Codec: NeuCodec for low-bitrate acoustic tokenization, targeting 0.8 kbps with 24 kHz output.
Quantization & format: Prebuilt GGUF backbones (Q4/Q8) are available, along with instructions for llama-cpp-python and an ONNX decoder path.
Dependencies: Uses espeak for phonemization; examples and a Jupyter notebook are provided for synthesis.
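The codec figures above can be made concrete with a little arithmetic. The sketch below compares the stated 0.8 kbps target with uncompressed audio at the stated 24 kHz output rate. The 16-bit mono PCM baseline is an assumption chosen for illustration, not something the release specifies.

```python
# Compare NeuCodec's stated 0.8 kbps target with uncompressed PCM.
# Assumption: the PCM baseline is 16-bit mono at 24 kHz.

CODEC_BITS_PER_SEC = 800    # 0.8 kbps, as stated for NeuCodec
SAMPLE_RATE_HZ = 24_000     # stated output sample rate
PCM_BITS_PER_SAMPLE = 16    # assumed baseline bit depth

pcm_bits_per_sec = SAMPLE_RATE_HZ * PCM_BITS_PER_SAMPLE
compression_ratio = pcm_bits_per_sec / CODEC_BITS_PER_SEC

# Storage needed for one minute of speech in each representation.
codec_bytes_per_min = CODEC_BITS_PER_SEC * 60 // 8
pcm_bytes_per_min = pcm_bits_per_sec * 60 // 8

print(f"PCM bitrate: {pcm_bits_per_sec / 1000:.0f} kbps")
print(f"Compression ratio: {compression_ratio:.0f}x")
print(f"One minute: {codec_bytes_per_min} bytes of tokens vs {pcm_bytes_per_min} bytes of PCM")
```

At 0.8 kbps the language model only has to predict about 100 bytes of acoustic tokens per second of speech, which is part of what makes CPU-only generation plausible.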

On-Device Performance Focus

NeuTTS Air is designed for real-time generation on mid-range devices and ships with CPU-first defaults. It targets local inference without a GPU, and the provided examples demonstrate a working end-to-end flow.
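"Real-time" is usually quantified as a real-time factor (RTF): wall-clock synthesis time divided by the duration of the audio produced, where values below 1.0 mean speech is generated faster than it plays back. The sketch below shows one way to measure it. The `synthesize` callable and the `fake_synthesize` stand-in are hypothetical and are not part of the NeuTTS Air API; in practice you would wrap the real synthesis call and return the length of the generated audio.

```python
import time
from typing import Callable


def real_time_factor(synthesize: Callable[[str], float], text: str) -> float:
    """Return wall-clock synthesis time divided by seconds of audio produced.

    `synthesize` is a hypothetical callable that generates speech for `text`
    and returns the duration of the generated audio in seconds.
    """
    start = time.perf_counter()
    audio_seconds = synthesize(text)
    elapsed = time.perf_counter() - start
    if audio_seconds <= 0:
        raise ValueError("synthesizer returned no audio")
    return elapsed / audio_seconds


def fake_synthesize(text: str) -> float:
    """Stand-in for a real TTS call: sleeps briefly, claims 2 s of audio."""
    time.sleep(0.05)
    return 2.0


rtf = real_time_factor(fake_synthesize, "Hello from a local model.")
verdict = "real-time" if rtf < 1.0 else "slower than real-time"
print(f"RTF = {rtf:.3f} ({verdict})")
```

Measuring RTF on the target device, rather than on a development machine, gives a more honest picture of whether a given quantization is fast enough.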

Voice Cloning Workflow

The voice cloning process requires:

A reference WAV file.
The transcript text for that reference.

The system encodes the reference to style tokens and synthesizes arbitrary text in the reference speaker’s timbre.
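Before handing the two inputs to a cloning pipeline, it can help to sanity-check them. The sketch below uses only Python's standard `wave` module to confirm that the transcript is non-empty and the clip is long enough. The 3-second minimum is an assumption drawn from the "~3 seconds" figure above, and `check_reference` and `make_tone` are illustrative helpers, not part of NeuTTS Air.

```python
import io
import math
import struct
import wave

MIN_REFERENCE_SECONDS = 3.0  # assumption, based on the "~3 seconds" figure


def wav_duration_seconds(source) -> float:
    """Return the duration of a WAV file (path or binary file-like object)."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference(source, transcript: str) -> list[str]:
    """Return a list of problems with a reference clip; empty means usable."""
    problems = []
    if not transcript.strip():
        problems.append("transcript is empty")
    if wav_duration_seconds(source) < MIN_REFERENCE_SECONDS:
        problems.append(f"clip is shorter than {MIN_REFERENCE_SECONDS} s")
    return problems


def make_tone(seconds: float, rate: int = 16_000) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV holding a 220 Hz tone (test data)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        frames = (
            struct.pack("<h", int(12000 * math.sin(2 * math.pi * 220 * n / rate)))
            for n in range(int(seconds * rate))
        )
        wav.writeframes(b"".join(frames))
    buf.seek(0)
    return buf


# A 4-second clip with a transcript passes; a 1-second clip without one fails.
print(check_reference(make_tone(4.0), "The quick brown fox."))
print(check_reference(make_tone(1.0), ""))
```

With a real recording, pass the file path in place of the in-memory buffer. A check like this does not assess audio quality or whether the transcript matches the speech.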

Privacy and Responsibility

NeuTTS Air emphasizes on-device privacy, ensuring that audio and text do not leave the machine without user approval. All generated audio includes a Perth (Perceptual Threshold) watermark to support responsible use.

Comparison with Other TTS Systems

While other open, local TTS systems exist, NeuTTS Air stands out for combining a small language model, a neural codec, instant voice cloning, and watermarking under a permissive license. Together these make it a pragmatic option for real-time, CPU-only TTS applications.

Conclusion

The combination of a ~0.7B Qwen-class backbone with GGUF quantization and NeuCodec creates a viable option for real-time, on-device TTS that maintains timbre fidelity while minimizing latency and memory usage. Apache-2.0 licensing adds deployment flexibility, on-device inference addresses privacy concerns, and built-in watermarking supports responsible use.

Additional Resources

For more information, see the model card on Hugging Face and the GitHub page for tutorials, code, and notebooks.