How to Build an Advanced Voice AI Pipeline with WhisperX for Transcription, Alignment, Analysis, and Export

In this tutorial, we walk through an advanced implementation of WhisperX, focusing on transcription, alignment, and word-level timestamps. The process covers setting up the environment, loading and preprocessing audio files, and executing the full pipeline, from transcription through alignment and analysis, while keeping memory usage low and supporting batch processing.

Understanding the Target Audience

The primary audience for this tutorial consists of professionals in AI and business management. This group includes data scientists, AI developers, project managers, and business analysts. These individuals often seek to leverage AI for business solutions, particularly in automating transcription and analysis of audio data.

Pain Points: The audience faces challenges with:

  • Processing large volumes of audio data efficiently.
  • Generating accurate transcriptions and insights from audio content.
  • Managing resource allocation and ensuring quality outputs with limited computational resources.

Goals: Their objectives typically include:

  • Implementing AI-driven solutions to save time and enhance productivity.
  • Fostering deeper understanding and analysis of audio materials.
  • Exporting results in multiple formats for various business applications.

Interests: This audience is interested in:

  • Technological advancements in AI and machine learning.
  • Practical applications of AI tools in real-world scenarios.
  • Efficient coding practices and optimization techniques.

Communication Preferences: These professionals prefer clear, concise instructions supported by visual aids such as charts or code snippets, along with comprehensive explanations of technical concepts.

Setting Up the Environment

We start by installing WhisperX along with essential libraries to prepare our working environment:

!pip install -q git+https://github.com/m-bain/whisperX.git
!pip install -q pandas matplotlib seaborn

Next, we define the configuration that controls the computation settings, along with the core imports used throughout the tutorial:

import gc
import torch
import whisperx

CONFIG = {
    "device": "cuda" if torch.cuda.is_available() else "cpu",
    "compute_type": "float16" if torch.cuda.is_available() else "int8",
    "batch_size": 16,
    "model_size": "base",
    "language": None,  # None lets WhisperX auto-detect the language
}

Our setup selects GPU acceleration with float16 precision when CUDA is available and falls back to int8 on CPU, balancing speed against memory use.
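
As a quick sanity check (a minimal snippet, not part of the original pipeline), we can print the selected settings:

print(f"Device: {CONFIG['device']}")
print(f"Compute type: {CONFIG['compute_type']}")
print(f"Model size: {CONFIG['model_size']}")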

Loading and Analyzing Audio

The next step involves downloading and analyzing a sample audio file:

def download_sample_audio():
    """Download a sample audio file for testing."""
    !wget -q -O sample.mp3 https://github.com/mozilla-extensions/speaktome/raw/master/content/cv-valid-dev/sample-000000.mp3
    print("Sample audio downloaded")
    return "sample.mp3"

Once we have the audio, we load it for analysis:

from pathlib import Path
from IPython.display import Audio, display

def load_and_analyze_audio(audio_path):
    """Load audio and display basic info."""
    audio = whisperx.load_audio(audio_path)  # 16 kHz mono waveform as a NumPy array
    duration = len(audio) / 16000
    print(f"Audio: {Path(audio_path).name}")
    print(f"Duration: {duration:.2f} seconds")
    print("Sample rate: 16000 Hz")
    display(Audio(audio_path))
    return audio, duration
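
With both helpers in place, fetching and inspecting the sample clip takes two calls (a usage sketch assuming the functions above):

audio_path = download_sample_audio()
audio, duration = load_and_analyze_audio(audio_path)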

Transcribing Audio

Next, we transcribe the audio, setting up batched inference with the selected model size:

def transcribe_audio(audio, model_size=CONFIG["model_size"], language=None):
    """Transcribe audio using WhisperX (batched inference)."""
    print("\nSTEP 1: Transcribing audio...")
    model = whisperx.load_model(
        model_size,
        CONFIG["device"],
        compute_type=CONFIG["compute_type"]
    )
    transcribe_kwargs = {"batch_size": CONFIG["batch_size"]}
    if language:
        transcribe_kwargs["language"] = language  # otherwise WhisperX auto-detects
    result = model.transcribe(audio, **transcribe_kwargs)
    # Free the model between steps to keep memory usage low
    del model
    gc.collect()
    if CONFIG["device"] == "cuda":
        torch.cuda.empty_cache()
    print("Transcription complete!")
    return result
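
A typical call on the audio loaded earlier; WhisperX returns a dictionary containing the detected language and a list of segments:

result = transcribe_audio(audio)
print(f"Detected language: {result['language']}")
print(f"Segments: {len(result['segments'])}")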

Aligning Transcription

We align the transcription to generate accurate word-level timestamps:

def align_transcription(segments, audio, language_code):
    """Align transcription for accurate word-level timestamps."""
    print("\nSTEP 2: Aligning for word-level timestamps...")
    try:
        model_a, metadata = whisperx.load_align_model(
            language_code=language_code,
            device=CONFIG["device"]
        )
        result = whisperx.align(
            segments,
            model_a,
            metadata,
            audio,
            CONFIG["device"],
            return_char_alignments=False
        )
        return result
    except Exception as e:
        # Fall back to segment-level timestamps if no alignment model is available
        print(f"Alignment failed: {str(e)}")
        return {"segments": segments, "word_segments": []}
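
Chaining the two steps, each aligned segment carries a words list with per-word start and end times (a usage sketch based on the functions above):

aligned = align_transcription(result["segments"], audio, result["language"])
word = aligned["segments"][0]["words"][0]
# note: unalignable tokens (e.g. digits) may lack timestamps
print(f"'{word['word']}' spans {word['start']:.2f}s to {word['end']:.2f}s")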

Analyzing Transcription

We generate statistics about the transcription, providing insights into the audio content:

def analyze_transcription(result):
    """Generate statistics about the transcription."""
    print("\nTRANSCRIPTION STATISTICS")
    segments = result["segments"]
    total_duration = max(seg["end"] for seg in segments) if segments else 0
    total_words = sum(len(seg.get("words", [])) for seg in segments)
    print(f"Total duration: {total_duration:.2f} seconds")
    print(f"Total words: {total_words}")
    if total_duration > 0:  # guard against empty transcriptions
        print(f"Words per minute: {(total_words / total_duration * 60):.1f}")
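
Since matplotlib is already installed, we can also visualize pacing. The sketch below is not part of the original snippets; it plots words per segment over time, assuming aligned segments with words lists:

import matplotlib.pyplot as plt

def plot_segment_pacing(result):
    """Plot words per segment over time (illustrative sketch)."""
    starts = [seg["start"] for seg in result["segments"]]
    counts = [len(seg.get("words", [])) for seg in result["segments"]]
    plt.figure(figsize=(10, 3))
    plt.bar(starts, counts, width=1.5)
    plt.xlabel("Segment start (s)")
    plt.ylabel("Words")
    plt.title("Words per segment")
    plt.tight_layout()
    plt.show()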

Displaying and Exporting Results

The results are displayed in a formatted table and then exported; the JSON export is shown here, and an SRT variant is sketched after it:

import json
import os

import pandas as pd
from IPython.display import HTML

def display_results(result, show_words=False, max_rows=50):
    """Display transcription results in a formatted table."""
    # show_words: per-word rows are omitted in this excerpt
    data = []
    for seg in result["segments"]:
        text = seg["text"].strip()
        start = f"{seg['start']:.2f}s"
        end = f"{seg['end']:.2f}s"
        data.append({"Start": start, "End": end, "Text": text})
    df = pd.DataFrame(data)
    display(HTML(df.head(max_rows).to_html(index=False)))
    return df

def export_results(result, output_dir="output", filename="transcript"):
    """Export results; the JSON export is shown here."""
    os.makedirs(output_dir, exist_ok=True)
    json_path = f"{output_dir}/{filename}.json"
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(result, f, indent=2, ensure_ascii=False)
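
The same pattern extends to subtitle formats. Below is a minimal SRT writer as an illustration; export_srt is a hypothetical helper, not a WhisperX API:

def export_srt(result, output_dir="output", filename="transcript"):
    """Write segments to an SRT subtitle file (illustrative helper)."""
    def fmt(t):
        # SRT timestamps use the form HH:MM:SS,mmm
        h, rem = divmod(t, 3600)
        m, s = divmod(rem, 60)
        return f"{int(h):02d}:{int(m):02d}:{int(s):02d},{int((s % 1) * 1000):03d}"
    os.makedirs(output_dir, exist_ok=True)
    with open(f"{output_dir}/{filename}.srt", "w", encoding="utf-8") as f:
        for i, seg in enumerate(result["segments"], 1):
            f.write(f"{i}\n{fmt(seg['start'])} --> {fmt(seg['end'])}\n{seg['text'].strip()}\n\n")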

Batch Processing and Keyword Extraction

We implement a batch processing feature to handle multiple files efficiently while also extracting keywords from the transcriptions; batch_process_files delegates to a per-file driver, sketched after the snippet below:

import re
from collections import Counter

def batch_process_files(audio_files, output_dir="batch_output"):
    """Process multiple audio files in batch."""
    for audio_path in audio_files:
        process_audio_file(audio_path)  # per-file driver, sketched below

def extract_keywords(result, top_n=10):
    """Extract the most common words from the transcription."""
    text = " ".join(seg["text"] for seg in result["segments"])
    words = re.findall(r'\b\w+\b', text.lower())
    print(f"\nTop {top_n} Keywords:")
    for word, count in Counter(words).most_common(top_n):
        print(f"  {word}: {count}")
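
batch_process_files relies on a per-file driver that is not shown in this excerpt; a minimal sketch of such a process_audio_file, assuming the functions defined earlier, might look like:

def process_audio_file(audio_path, output_dir="batch_output"):
    """Hypothetical per-file driver chaining the steps above."""
    audio, duration = load_and_analyze_audio(audio_path)
    result = transcribe_audio(audio)
    aligned = align_transcription(result["segments"], audio, result["language"])
    analyze_transcription(aligned)
    export_results(aligned, output_dir=output_dir, filename=Path(audio_path).stem)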

Conclusion

Through this guide, we have set up a complete WhisperX pipeline that transcribes audio, aligns it for precise timestamps, analyzes the content, and exports results in multiple formats—all while optimizing memory usage and supporting batch processing. This process facilitates the development of AI-driven solutions for transcription and audio analysis.

For complete code implementations, please refer to the original sources.
