Building a Speech Enhancement and Automatic Speech Recognition (ASR) Pipeline in Python Using SpeechBrain

Understanding the Target Audience

This tutorial is aimed at data scientists, machine learning engineers, and developers working on speech processing, typically in tech companies, research institutions, or AI-focused startups. Their main pain points are noisy audio data and the difficulty of achieving high accuracy in speech recognition, and their goals include improving ASR performance and exploring open-source tools like SpeechBrain for practical implementations. Accordingly, the tutorial keeps the communication clear and concise, with a focus on technical detail and hands-on application.

Tutorial Overview

This tutorial provides a comprehensive workflow for building a speech enhancement and ASR pipeline using SpeechBrain. We will generate clean speech samples using gTTS, introduce noise to simulate real-world conditions, and apply SpeechBrain’s MetricGAN+ model for audio enhancement. Finally, we will run ASR with a language model–rescored CRDNN system and compare word error rates (WER) before and after enhancement.

Setting Up the Environment

We start by installing the required libraries and tools in our Colab environment:

!pip -q install -U speechbrain gTTS jiwer pydub librosa soundfile torchaudio
!apt -qq install -y ffmpeg >/dev/null

We then define the working directory, target sample rate, and compute device used throughout the pipeline.
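
The exact values are not shown in the excerpt; here is a minimal sketch of this setup, where the folder name speech_demo and the 16 kHz sample rate are illustrative assumptions rather than values from the original:

import torch
from pathlib import Path

# Working directory, target sample rate, and compute device used throughout.
# The folder name and 16 kHz rate are illustrative defaults, not from the original.
root = Path("speech_demo")
root.mkdir(parents=True, exist_ok=True)
sr = 16000
device = "cuda" if torch.cuda.is_available() else "cpu"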

Generating Speech Samples

We define functions to synthesize speech, add noise, and manage audio files:

from gtts import gTTS
from pydub import AudioSegment
import os

def tts_to_wav(text: str, out_wav: str, lang="en"):
    # Synthesize speech with gTTS, then convert the MP3 to a mono WAV at the target rate.
    mp3 = out_wav.replace(".wav", ".mp3")
    gTTS(text=text, lang=lang).save(mp3)
    a = AudioSegment.from_file(mp3, format="mp3").set_channels(1).set_frame_rate(sr)
    a.export(out_wav, format="wav")
    os.remove(mp3)  # drop the intermediate MP3 once the WAV is written
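
The noise-injection helper is not reproduced in the excerpt. Below is a minimal sketch that mixes white Gaussian noise into the clean file at a chosen signal-to-noise ratio, built on the librosa and soundfile packages installed earlier; the function name add_noise and the 5 dB default SNR are assumptions:

import numpy as np
import librosa
import soundfile as sf

def add_noise(in_wav: str, out_wav: str, snr_db: float = 5.0):
    # Load the clean waveform, then mix in white noise scaled to the target SNR.
    y, _ = librosa.load(in_wav, sr=sr)
    noise = np.random.randn(len(y))
    speech_power = np.mean(y ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    sf.write(out_wav, y + scale * noise, sr)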

Creating Sample Data

We generate three spoken sentences, save both clean and noisy versions, and organize them into Sample objects:

sentences = [
    "Artificial intelligence is transforming everyday life.",
    "Open source tools enable rapid research and innovation.",
    "SpeechBrain brings flexible speech pipelines to Python.",
]
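
The Sample container and the generation loop themselves are not shown above. Here is a minimal sketch consistent with the field names used later in the tutorial (text, noisy_wav, enhanced_wav); the clean_wav field and the file-naming scheme are assumptions:

from dataclasses import dataclass

@dataclass
class Sample:
    text: str
    clean_wav: str
    noisy_wav: str
    enhanced_wav: str

samples = []
for i, text in enumerate(sentences):
    clean = str(root / f"clean_{i}.wav")
    noisy = str(root / f"noisy_{i}.wav")
    enhanced = str(root / f"enhanced_{i}.wav")
    tts_to_wav(text, clean)   # synthesize the clean reference
    add_noise(clean, noisy)   # corrupt it to simulate real-world audio
    samples.append(Sample(text, clean, noisy, enhanced))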

Loading Pre-trained Models

Next, we load SpeechBrain’s pre-trained ASR and MetricGAN+ enhancement models:

from speechbrain.pretrained import EncoderDecoderASR, SpectralMaskEnhancement

# CRDNN encoder-decoder ASR with RNN language-model rescoring, trained on LibriSpeech.
asr = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",
    run_opts={"device": device},
    savedir=str(root / "pretrained_asr"),
)
# MetricGAN+ spectral-mask enhancer, trained on VoiceBank-DEMAND.
enhancer = SpectralMaskEnhancement.from_hparams(
    source="speechbrain/metricgan-plus-voicebank",
    run_opts={"device": device},
    savedir=str(root / "pretrained_enh"),
)
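
As a quick sanity check, we can transcribe one of the clean files before touching the noisy data; transcribe_file is the standard EncoderDecoderASR inference method, and the use of samples[0] here assumes the generation loop sketched above:

# Smoke test: clean synthetic audio should transcribe almost perfectly.
print(asr.transcribe_file(samples[0].clean_wav))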

Enhancing Audio and Transcribing

We create functions to enhance noisy audio and transcribe speech:

import torchaudio

def enhance_file(in_wav: str, out_wav: str):
    # Run MetricGAN+ enhancement on the noisy file and write the result to disk.
    sig = enhancer.enhance_file(in_wav)
    if sig.dim() == 1:
        sig = sig.unsqueeze(0)  # torchaudio.save expects a (channels, time) tensor
    torchaudio.save(out_wav, sig.cpu(), sr)

Evaluating Performance

We evaluate the performance of the ASR system by comparing the WER of the noisy and enhanced audio:
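
The loop below relies on an eval_pair helper that is not shown in the excerpt. A minimal sketch built on jiwer, which was installed at the start (the lowercasing normalization is an assumption, used because SpeechBrain transcripts are uppercase):

import jiwer

def eval_pair(ref_text: str, wav_path: str):
    # Transcribe the file and score it against the reference transcript.
    hyp = asr.transcribe_file(wav_path)
    return hyp, jiwer.wer(ref_text.lower(), hyp.lower())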

rows = []
for smp in samples:
    enhance_file(smp.noisy_wav, smp.enhanced_wav)
    hyp_noisy, wer_noisy = eval_pair(smp.text, smp.noisy_wav)
    hyp_enh, wer_enh = eval_pair(smp.text, smp.enhanced_wav)
    rows.append((smp.text, hyp_noisy, wer_noisy, hyp_enh, wer_enh))

Results Summary

We summarize the results, including average WERs, to demonstrate the effectiveness of the pipeline:

avg_wn = sum(wN for _,_,wN,_,_ in rows) / len(rows)
avg_we = sum(wE for _,_,_,_,wE in rows) / len(rows)
print("\n Summary:")
print(f"Avg WER (Noisy):     {avg_wn:.3f}")
print(f"Avg WER (Enhanced):  {avg_we:.3f}")

Conclusion

This tutorial illustrates how to integrate speech enhancement and ASR into a unified pipeline using SpeechBrain. By generating audio, corrupting it with noise, enhancing it, and transcribing the results, we can see directly how enhancement affects recognition accuracy in noisy conditions. The practical benefits of open-source speech technologies are evident, and the same framework can be extended to larger datasets and custom tasks.

Further Resources

Check out the FULL CODES, and feel free to explore our GitHub Page for tutorials, codes, and notebooks. Follow us on Twitter and join our ML SubReddit community.
