Efficient and Adaptable Speech Enhancement via Pre-trained Generative Audioencoders and Vocoders
Recent advances in speech enhancement (SE) have shifted from traditional mask or signal prediction methods toward pre-trained audio models that provide richer, more transferable features. Models such as WavLM extract meaningful audio embeddings that substantially improve SE performance. Some approaches use these embeddings to predict masks or combine them with spectral features for greater accuracy. Others explore generative techniques, employing neural vocoders to reconstruct clean speech directly from noisy embeddings. However, these methods typically either freeze the pre-trained model or require extensive fine-tuning, which limits adaptability, increases computational cost, and complicates transfer to other tasks.
Researchers at MiLM Plus, Xiaomi Inc., have introduced a lightweight and flexible SE method that builds on pre-trained models. Initially, audio embeddings are extracted from noisy speech using a frozen audioencoder. These embeddings are then refined by a small denoise encoder before being passed to a vocoder that generates clean speech. Unlike task-specific models, both the audioencoder and vocoder are pre-trained separately, making the system adaptable to related tasks such as dereverberation or separation. Experiments indicate that generative audioencoders outperform discriminative ones in both speech quality and speaker fidelity. The system is not only efficient but also surpasses a leading SE model in listening tests.
System Components
The proposed speech enhancement system comprises three main components (a minimal code sketch of the full pipeline follows the list):
- Noisy speech is processed through a pre-trained audioencoder, generating noisy audio embeddings.
- A denoise encoder refines these embeddings to produce cleaner versions.
- A vocoder converts the cleaned embeddings back into speech.
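To make the data flow concrete, here is a minimal inference sketch in PyTorch. The module names and interfaces (audioencoder, denoise_encoder, vocoder and their input/output shapes) are assumptions for illustration, not the authors' actual API; only the ordering of the three stages reflects the description above.

```python
import torch

def enhance(noisy_wav: torch.Tensor,
            audioencoder: torch.nn.Module,
            denoise_encoder: torch.nn.Module,
            vocoder: torch.nn.Module) -> torch.Tensor:
    """Enhance a batch of noisy waveforms (B, T) -> clean waveforms (B, T')."""
    # Stage 1: frozen, pre-trained audioencoder produces noisy embeddings.
    with torch.no_grad():
        noisy_emb = audioencoder(noisy_wav)     # (B, frames, dim)
    # Stage 2: lightweight denoise encoder refines the embeddings.
    clean_emb = denoise_encoder(noisy_emb)      # (B, frames, dim)
    # Stage 3: pre-trained vocoder reconstructs the waveform from embeddings.
    return vocoder(clean_emb)                   # (B, T')
```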
Both the denoise encoder and the vocoder are trained separately, relying on the same frozen, pre-trained audioencoder. During training, the denoise encoder takes noisy embeddings as input and is optimized, with a mean squared error (MSE) loss, to match the clean embeddings extracted in parallel from paired noisy and clean speech samples. The encoder employs a ViT architecture with standard activation and normalization layers.
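A hedged sketch of this training objective in PyTorch: paired noisy and clean utterances pass through the frozen audioencoder, and the denoise encoder is optimized to map noisy embeddings onto the clean ones with an MSE loss. The Transformer-encoder stand-in, the embedding dimension of 768, and the optimizer settings are placeholders, not the paper's exact ViT configuration.

```python
import torch
import torch.nn.functional as F

# Placeholder for the ViT-style denoise encoder described in the text.
denoise_encoder = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=3,
)
optimizer = torch.optim.AdamW(denoise_encoder.parameters(), lr=1e-4)

def train_step(noisy_wav, clean_wav, audioencoder):
    # The frozen audioencoder embeds both views of the paired utterance.
    with torch.no_grad():
        noisy_emb = audioencoder(noisy_wav)   # (B, frames, 768)
        clean_emb = audioencoder(clean_wav)   # (B, frames, 768)
    pred_emb = denoise_encoder(noisy_emb)     # refined ("denoised") embeddings
    loss = F.mse_loss(pred_emb, clean_emb)    # match the clean embeddings
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```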
The vocoder is trained in a self-supervised manner on clean speech alone. It learns to reconstruct speech waveforms from audio embeddings by predicting Fourier spectral coefficients, which are converted back to audio through the inverse short-time Fourier transform (ISTFT). A modified version of the Vocos framework is used so that different audioencoders can be plugged in. Training follows a Generative Adversarial Network (GAN) setup in which the generator is based on ConvNeXt and the discriminators include both multi-period and multi-resolution types, combining adversarial, reconstruction, and feature matching losses. Throughout, the audioencoder remains frozen, using weights from publicly available models.
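A minimal sketch of the ISTFT reconstruction step this describes, under assumed shapes and layer sizes: embeddings are mapped to per-frame log-magnitude and phase coefficients and inverted with torch.istft. The simple linear backbone stands in for the ConvNeXt generator, and the GAN discriminators and losses are omitted.

```python
import torch
import torch.nn as nn

class ISTFTHead(nn.Module):
    """Predict STFT magnitude and phase from embeddings, then invert with ISTFT."""

    def __init__(self, emb_dim=768, n_fft=1024, hop_length=256):
        super().__init__()
        self.n_fft, self.hop_length = n_fft, hop_length
        # Stand-in for the ConvNeXt-based generator backbone.
        self.backbone = nn.Sequential(nn.Linear(emb_dim, emb_dim), nn.GELU())
        self.out = nn.Linear(emb_dim, (n_fft // 2 + 1) * 2)  # log-magnitude + phase bins

    def forward(self, emb):                        # emb: (B, frames, emb_dim)
        h = self.out(self.backbone(emb))           # (B, frames, 2 * (n_fft//2 + 1))
        log_mag, phase = h.chunk(2, dim=-1)
        spec = torch.exp(log_mag) * torch.exp(1j * phase)   # complex spectrogram
        spec = spec.transpose(1, 2)                          # (B, freq, frames)
        return torch.istft(spec, self.n_fft, self.hop_length,
                           window=torch.hann_window(self.n_fft))
```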
Evaluation Results
Evaluation results demonstrate that generative audioencoders, such as Dasheng, consistently outperform discriminative ones. On the DNS1 dataset, Dasheng achieved a speaker similarity score of 0.881, while WavLM and Whisper scored 0.486 and 0.489, respectively. In terms of speech quality, non-intrusive metrics such as DNSMOS and NISQAv2 indicate significant improvements even with smaller denoise encoders; for instance, ViT3 reached a DNSMOS of 4.03 and a NISQAv2 score of 4.41. Subjective listening tests with 17 participants gave Dasheng a Mean Opinion Score (MOS) of 3.87, surpassing Demucs at 3.11 and LMS at 2.98, highlighting its strong perceptual performance.
Conclusion
This study presents a practical and adaptable speech enhancement system that relies on pre-trained generative audioencoders and vocoders, eliminating the need for full model fine-tuning. By denoising audio embeddings using a lightweight encoder and reconstructing speech with a pre-trained vocoder, the system achieves both computational efficiency and robust performance. Evaluations show that generative audioencoders significantly outperform discriminative ones in terms of speech quality and speaker fidelity. The compact denoise encoder maintains high perceptual quality even with fewer parameters, and subjective listening tests further confirm that this method delivers superior perceptual clarity compared to existing state-of-the-art models.
Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.