StepFun AI Releases Step-Audio-EditX: A New Open-Source 3B LLM-Grade Audio Editing Model Excelling at Expressive and Iterative Audio Editing
Understanding the Target Audience
The primary audience for StepFun AI’s Step-Audio-EditX includes developers, audio engineers, and researchers in the fields of artificial intelligence and audio processing. Their pain points often revolve around the limitations of existing text-to-speech (TTS) systems, particularly in terms of control over emotional expression, style, and paralinguistic features. Their goals include achieving more precise audio editing capabilities that are as intuitive as text editing. They are interested in open-source solutions that allow for customization and experimentation, and they prefer clear, technical communication that provides actionable insights and detailed specifications.
Transforming Speech Editing
StepFun AI has open-sourced Step-Audio-EditX, a 3B parameter LLM-based audio model that treats expressive speech editing as a token-level operation, closer to editing text than to waveform-level signal processing.
Why Developers Care About Controllable TTS
Most zero-shot TTS systems replicate emotion, style, accent, and timbre from a brief reference audio, producing natural-sounding output but offering limited control. Text-based style prompts work well only for in-domain voices, and a cloned voice often fails to follow the requested emotion or style. Previous attempts to disentangle these factors relied on complex architectures and additional encoders. Step-Audio-EditX instead keeps a relatively entangled representation and changes the data and post-training objectives: the model is trained on many pairs and triplets in which the text is fixed but a single attribute varies by a large margin.
Architecture Overview
Step-Audio-EditX employs a dual codebook tokenizer that maps speech into two token streams: a linguistic stream at 16.7 Hz with a 1024-entry codebook and a semantic stream at 25 Hz with a 4096-entry codebook. The two streams are interleaved at a 2:3 ratio, matching their token rates, which preserves prosody and emotional information.
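As a concrete illustration of that 2:3 interleaving, the sketch below merges the two token streams into a single sequence; the exact merge pattern and the function name are illustrative assumptions, not the released tokenizer API.

```python
# Minimal sketch of the dual-codebook interleaving: the 16.7 Hz linguistic
# stream and the 25 Hz semantic stream contribute tokens in a 2:3 pattern.
# The merge order is an assumption for illustration, not the actual
# Step-Audio-EditX tokenizer implementation.

def interleave_dual_codebook(linguistic_tokens, semantic_tokens):
    """Merge the linguistic and semantic streams in a repeating 2:3 pattern."""
    merged = []
    li, si = 0, 0
    while li < len(linguistic_tokens) or si < len(semantic_tokens):
        merged.extend(linguistic_tokens[li:li + 2])  # 2 linguistic tokens (1024-entry codebook)
        merged.extend(semantic_tokens[si:si + 3])    # 3 semantic tokens (4096-entry codebook)
        li += 2
        si += 3
    return merged
```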
The model is initialized from a text LLM and trained on a blended corpus comprising equal parts pure text and dual codebook audio tokens in chat-style prompts. The audio LLM can process text tokens, audio tokens, or both, generating dual codebook audio tokens as output. A separate audio decoder reconstructs the audio using a diffusion transformer-based flow matching module, which predicts Mel spectrograms from audio tokens, reference audio, and speaker embeddings. This module is trained on approximately 200,000 hours of high-quality speech, enhancing pronunciation and timbre similarity.
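The end-to-end flow implied by this architecture can be summarized roughly as below; every class and method name is a hypothetical placeholder rather than the released API, and the final vocoder step is an assumption based on the decoder producing Mel spectrograms rather than waveforms.

```python
# Rough inference flow implied by the architecture description. All names
# are placeholders; the vocoder stage is assumed because the flow-matching
# decoder outputs Mel spectrograms, not waveforms.

def generate_speech(text, reference_wav, tokenizer, audio_llm, decoder, vocoder):
    # 1. Encode the reference audio into dual-codebook tokens.
    ref_tokens = tokenizer.encode(reference_wav)

    # 2. The audio LLM takes text tokens, audio tokens, or both (chat-style)
    #    and generates new dual-codebook audio tokens.
    out_tokens = audio_llm.generate(prompt_audio_tokens=ref_tokens, text=text)

    # 3. The diffusion-transformer flow-matching decoder predicts a Mel
    #    spectrogram from the tokens, reference audio, and speaker embedding.
    mel = decoder.tokens_to_mel(out_tokens, reference_wav)

    # 4. A vocoder renders the final waveform from the Mel spectrogram.
    return vocoder(mel)
```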
Large Margin Synthetic Data
The key innovation in Step-Audio-EditX is large margin learning. The model undergoes post-training on triplets and quadruplets that fix the text while varying one attribute with a clear margin. For zero-shot TTS, it utilizes a high-quality in-house dataset primarily consisting of Chinese and English audio, with a smaller representation of Cantonese and Sichuanese, encompassing around 60,000 speakers and capturing a wide range of intra- and inter-speaker variations in style and emotion.
For emotion and speaking style editing, synthetic large margin triplets are built as follows: voice actors record 10-second clips for each emotion and style, and StepTTS zero-shot cloning then generates both a neutral and an emotional (or styled) rendition of the same text for the same speaker. A margin scoring model, trained on a small human-labeled dataset, rates each pair on a scale of 1 to 10, and only pairs scoring at least 6 are retained.
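A minimal sketch of that filtering step is shown below; `margin_scorer` stands in for the human-label-trained scoring model and is not part of the released code.

```python
# Keep only (text, neutral_audio, emotional_audio) triplets whose neutral
# and emotional renditions differ by a clear margin, as judged by the
# scoring model on a 1-10 scale. Names are illustrative.

MARGIN_THRESHOLD = 6

def filter_large_margin_triplets(triplets, margin_scorer):
    kept = []
    for text, neutral_audio, emotional_audio in triplets:
        score = margin_scorer(neutral_audio, emotional_audio)  # 1..10
        if score >= MARGIN_THRESHOLD:
            kept.append((text, neutral_audio, emotional_audio))
    return kept
```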
Paralinguistic editing, which includes elements like breathing and laughter, employs a semi-synthetic strategy using the NVSpeech dataset. Quadruplets are constructed where the target is the original NVSpeech audio and transcript, while the input is a cloned version with tags removed from the text, providing supervision for time-domain editing without a margin model.
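The quadruplet construction can be sketched as follows; `strip_paralinguistic_tags` and `zero_shot_clone` are hypothetical helpers used only to make the data flow concrete.

```python
# Semi-synthetic quadruplet for paralinguistic editing: the target side is
# the original NVSpeech audio and tagged transcript, the input side is a
# zero-shot clone of the same text with the tags removed. Helper names are
# assumptions, not released functions.

import re

def strip_paralinguistic_tags(text):
    """Drop inline tags such as [laughter] or [breathing] from a transcript."""
    return re.sub(r"\s+", " ", re.sub(r"\[[^\]]+\]", "", text)).strip()

def build_quadruplet(nvspeech_audio, nvspeech_text, zero_shot_clone):
    clean_text = strip_paralinguistic_tags(nvspeech_text)
    cloned_audio = zero_shot_clone(text=clean_text, reference=nvspeech_audio)
    # (input_audio, input_text, target_audio, target_text)
    return cloned_audio, clean_text, nvspeech_audio, nvspeech_text
```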
Post-Training Process
The post-training phase consists of two stages: supervised fine-tuning (SFT) followed by Proximal Policy Optimization (PPO). In SFT, system prompts define zero-shot TTS and editing tasks in a unified chat format. For TTS, the prompt waveform is encoded into dual codebook tokens, converted to string form, and included in the system prompt as speaker information. The user message contains the target text, and the model outputs new audio tokens. For editing, the user message includes original audio tokens and a natural language instruction, with the model returning edited tokens.
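The two task formats might look roughly like the examples below; the exact prompt wording and the string encoding of audio tokens are assumptions, and only the overall structure follows the description.

```python
# Illustrative chat-style training samples for the unified format. The
# prompt text and the <audio>...</audio> token encoding are assumptions;
# only the structure (speaker tokens in the system prompt, text or audio
# plus an instruction in the user turn, audio tokens in the assistant turn)
# mirrors the description above.

tts_example = {
    "system": "You are a TTS system. Speaker: <audio>1032 417 2201 ...</audio>",
    "user": "Read aloud: 'The launch is scheduled for Friday morning.'",
    "assistant": "<audio>88 904 3117 ...</audio>",  # new dual-codebook tokens
}

edit_example = {
    "system": "You are an audio editing system.",
    "user": "Audio: <audio>88 904 3117 ...</audio> "
            "Instruction: make the speaker sound more excited.",
    "assistant": "<audio>91 911 3300 ...</audio>",  # edited audio tokens
}
```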
Reinforcement learning refines instruction following, utilizing a 3B reward model initialized from the SFT checkpoint and trained with Bradley-Terry loss on large margin preference pairs. The reward is computed directly on dual codebook token sequences, and PPO training employs this reward model, a clip threshold, and a KL penalty to balance quality and adherence to the SFT policy.
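For reference, a Bradley-Terry preference loss over reward-model scores looks like the PyTorch sketch below; `reward_model` is assumed to map a batch of dual codebook token sequences to scalar scores.

```python
# Standard Bradley-Terry preference loss, as used to train the token-level
# reward model on large-margin pairs. `reward_model` is assumed to return
# one scalar score per token sequence in the batch.

import torch.nn.functional as F

def bradley_terry_loss(reward_model, chosen_tokens, rejected_tokens):
    r_chosen = reward_model(chosen_tokens)      # shape: (batch,)
    r_rejected = reward_model(rejected_tokens)  # shape: (batch,)
    # Minimize -log sigmoid(r_chosen - r_rejected): chosen should score higher.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```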
Step-Audio-Edit-Test: Evaluating Control
To assess control, the research team introduced Step-Audio-Edit-Test, using Gemini 2.5 Pro as a judge to evaluate emotion, speaking style, and paralinguistic accuracy. The benchmark includes 8 speakers from various datasets, with 50 prompts per category for both Chinese and English across five emotion categories, seven speaking styles, and ten paralinguistic labels.
Editing is evaluated iteratively, with initial zero-shot cloning followed by three rounds of editing based on text instructions. For instance, in Chinese, emotion accuracy improved from 57.0% at iteration 0 to 77.7% at iteration 3, while speaking style accuracy increased from 41.6% to 69.2%. Similar improvements were observed in English, supporting the large margin learning hypothesis.
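The iterative protocol can be written as a simple loop; `clone`, `edit`, and `judge` below are placeholders for the TTS model, the editing model, and the Gemini 2.5 Pro judge, not real APIs.

```python
# Iterative evaluation loop: iteration 0 is a zero-shot clone, iterations
# 1-3 re-edit the previous output from the same text instruction, and the
# judge scores each result. All callables are placeholders.

def iterative_edit_eval(prompt_text, instruction, clone, edit, judge, rounds=3):
    audio = clone(prompt_text)              # iteration 0: zero-shot TTS
    scores = [judge(audio, instruction)]
    for _ in range(rounds):                 # iterations 1..3: text-guided edits
        audio = edit(audio, instruction)
        scores.append(judge(audio, instruction))
    return audio, scores
```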
The same editing model was applied to four closed-source TTS systems, demonstrating that a single editing iteration with Step-Audio-EditX enhances both emotion and style accuracy, with further iterations yielding additional improvements.
Paralinguistic editing scores improved from an average of 1.91 at iteration 0 to 2.89 after a single edit, comparable to native paralinguistic synthesis in leading commercial systems.
Key Takeaways
- Step-Audio-EditX utilizes a dual codebook tokenizer and a 3B parameter audio LLM, allowing speech to be treated as discrete tokens for text-like audio editing.
- The model relies on large margin synthetic data for various attributes, avoiding the need for additional disentangling encoders.
- Supervised fine-tuning and PPO with a token-level reward model align the audio LLM to follow natural language editing instructions for TTS and editing tasks.
- The Step-Audio-Edit-Test benchmark shows significant accuracy gains across three editing iterations for emotion, style, and paralinguistic control in both Chinese and English.
- Step-Audio-EditX can enhance speech from closed-source TTS systems, and the complete stack, including code and checkpoints, is available as open source for developers.
Conclusion
Step-Audio-EditX represents a significant advancement in controllable speech synthesis, maintaining the Step-Audio tokenizer while integrating a compact 3B audio LLM. The optimization of control through large margin data and PPO, along with the introduction of the Step-Audio-Edit-Test benchmark, provides a concrete evaluation framework for emotion, speaking style, and paralinguistic control. The open-source release facilitates practical audio editing research, making audio editing more akin to text editing.
Further Resources
Check out the Paper, Repo, and Model Weights for full technical details, code, and checkpoints.