How to Evaluate Voice Agents in 2025: Beyond Automatic Speech Recognition (ASR) and Word Error Rate (WER) to Task Success, Barge-In, and Hallucination-Under-Noise
Voice interactions are increasingly central to user experiences, yet voice agents are still judged largely by Automatic Speech Recognition (ASR) accuracy, typically Word Error Rate (WER). Evaluating them well demands a more holistic approach that reflects real-world performance and user satisfaction: end-to-end task success, barge-in and turn-taking behavior, and hallucination under noise.
Why WER Isn’t Enough
WER measures transcription fidelity, but it says little about the interaction quality users actually experience. Two voice agents with similar WER can produce vastly different dialog outcomes because of latency, turn-taking, and how well they recover from misunderstandings. Work on production assistants accordingly argues for assessing user satisfaction from real interaction signals rather than ASR accuracy alone.
What to Measure (and How)?
To evaluate voice agents effectively, the following metrics should be measured:
- End-to-End Task Success
Metrics: Task Success Rate (TSR), Task Completion Time (TCT), and Turns-to-Success.
Why: What ultimately matters to users is whether the task got done, and how quickly; transcription accuracy is only a means to that end.
Protocol: Define tasks with verifiable endpoints and combine automatic logs with human raters to compute the metrics (a computation sketch follows this list).
- Barge-In and Turn-Taking
Metrics: Barge-In Detection Latency, True/False Barge-In Rates, and Endpointing Latency.
Why: Users interrupt. An agent that stops speaking promptly when barged in, avoids cutting off clean turns, and responds quickly once the user finishes feels responsive; one that does not feels broken.
Protocol: Script prompts that trigger controlled interruptions and measure timings from high-precision event logs (a latency sketch follows this list).
- Hallucination-Under-Noise (HUN)
Metric: HUN Rate, which assesses the proportion of outputs that are fluent but semantically unrelated to the audio in noisy conditions.
Why: Voice models can produce "convincing nonsense," particularly in challenging audio environments. Tracking HUN is essential for understanding these failures.
Protocol: Overlay environmental noise on clean test audio and have human judges rate the semantic relatedness of each output to the source audio (a judgment-aggregation sketch follows this list).
- Instruction Following, Safety, and Robustness
Metrics: Instruction-Following Accuracy and Safety Refusal Rate.
Why: An agent that transcribes perfectly but ignores the user's instructions, or complies with unsafe requests, still fails the user.
Protocol: Utilize benchmarks like VoiceBench to cover various speech-interaction capabilities.
- Perceptual Speech Quality
Metric: Mean Opinion Score (MOS), collected with the crowdsourced ITU-T P.808 protocol.
Why: Both recognition and playback quality directly affect user satisfaction.
Protocol: Crowdsource subjective ratings of agent playback following ITU-T P.808.
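The protocols above are easier to pin down with a little code. Below is a minimal sketch for the task-success metrics; the log schema (task ID, verified success flag, timestamps, user-turn count) is an assumption for illustration, not a standard.

```python
# Minimal sketch: Task Success Rate (TSR), Task Completion Time (TCT),
# and Turns-to-Success from interaction logs. The log schema is assumed.
from dataclasses import dataclass
from statistics import mean, median

@dataclass
class DialogLog:
    task_id: str
    success: bool     # verified against the task's endpoint (e.g., booking created)
    start_ts: float   # seconds since epoch
    end_ts: float
    user_turns: int   # user turns until the endpoint was reached

def task_success_metrics(logs: list[DialogLog]) -> dict:
    if not logs:
        return {"tsr": None, "median_tct_s": None, "mean_turns_to_success": None}
    successes = [log for log in logs if log.success]
    return {
        # TSR: fraction of attempted tasks that reached their verifiable endpoint
        "tsr": len(successes) / len(logs),
        # TCT: report on successful dialogs only; median is robust to stragglers
        "median_tct_s": median(l.end_ts - l.start_ts for l in successes) if successes else None,
        # Turns-to-Success: average user turns needed on successful dialogs
        "mean_turns_to_success": mean(l.user_turns for l in successes) if successes else None,
    }
```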
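For barge-in and endpointing, a sketch of the latency and rate computations, assuming your stack logs events such as user speech onset, TTS stop, user speech end, and response start; the event names and trial fields here are illustrative.

```python
# Minimal sketch: barge-in and endpointing latencies plus true/false barge-in rates,
# computed from high-precision event timestamps (names are assumptions).
def barge_in_latency_ms(user_speech_onset: float, tts_stop: float) -> float:
    """Time from the user starting to speak over TTS until playback actually stops."""
    return (tts_stop - user_speech_onset) * 1000.0

def endpointing_latency_ms(user_speech_end: float, response_start: float) -> float:
    """Time from the user finishing an utterance until the agent begins responding."""
    return (response_start - user_speech_end) * 1000.0

def barge_in_rates(trials: list[dict]) -> dict:
    """trials: one dict per scripted trial with 'interrupted' (user actually barged in)
    and 'detected' (system cut its TTS) boolean fields."""
    true_pos = sum(1 for t in trials if t["interrupted"] and t["detected"])
    false_pos = sum(1 for t in trials if not t["interrupted"] and t["detected"])
    n_interrupts = sum(1 for t in trials if t["interrupted"]) or 1
    n_clean = sum(1 for t in trials if not t["interrupted"]) or 1
    return {
        "true_barge_in_rate": true_pos / n_interrupts,  # detected real interruptions
        "false_barge_in_rate": false_pos / n_clean,     # spurious cut-offs on clean turns
    }
```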
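And for HUN, a sketch that turns per-output human judgments into a rate, optionally split by noise condition; the three-way label set ("faithful", "minor_error", "hallucination") is an assumed annotation scheme, not a fixed standard.

```python
# Minimal sketch: HUN rate from human judgments of outputs on noisy audio.
from collections import Counter

def hun_rate(judgments: list[str]) -> float:
    """Proportion of outputs judged as fluent but semantically unrelated to the audio."""
    counts = Counter(judgments)
    total = sum(counts.values())
    return counts["hallucination"] / total if total else 0.0

def hun_by_condition(rows: list[tuple[str, str]]) -> dict:
    """rows: (condition_label, judgment) pairs, e.g. ('snr_0dB', 'hallucination'),
    so the rate can be tracked as noise conditions degrade."""
    by_cond: dict[str, list[str]] = {}
    for cond, judgment in rows:
        by_cond.setdefault(cond, []).append(judgment)
    return {cond: hun_rate(js) for cond, js in by_cond.items()}
```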
Benchmark Landscape: What Each Covers
- VoiceBench (2024): Multi-facet benchmark for LLM-based voice assistants; measures robustness across speaker, environment, and content variations.
- SLUE / SLUE Phase-2: Spoken language understanding tasks (e.g., named entity recognition, dialog acts) evaluated directly on speech, where ASR errors carry through.
- MASSIVE: 1M+ multilingual intent/slot utterances for task-oriented assistant evaluation.
- Spoken-SQuAD / HeySQuAD: Spoken question answering; tests ASR-aware comprehension and multi-accent robustness.
- DSTC: Robust dialog modeling over spoken, ASR-noisy data.
Filling the Gaps: What You Still Need to Add
Current evaluations may lack:
- Barge-In & Endpointing KPIs
- Hallucination-Under-Noise (HUN) Protocols
- On-Device Interaction Latency
- Cross-Axis Robustness Matrices (see the sketch after this list)
- Perceptual Quality for Playback
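One way to make the cross-axis robustness matrix concrete is to report the same metric over a grid of stress axes. The sketch below assumes per-trial result dicts with "noise" and "accent" fields and a boolean success flag; both the axis names and the field names are illustrative assumptions.

```python
# Minimal sketch: a cross-axis robustness matrix, e.g. TSR over noise level x accent.
from collections import defaultdict

def robustness_matrix(results: list[dict], metric: str = "success") -> dict:
    """results: one dict per trial with 'noise', 'accent', and a boolean metric field."""
    cells: dict[tuple[str, str], list[bool]] = defaultdict(list)
    for r in results:
        cells[(r["noise"], r["accent"])].append(bool(r[metric]))
    # Each cell is the mean of the metric for that (noise, accent) combination
    return {cell: sum(vals) / len(vals) for cell, vals in cells.items()}

# Example lookup: matrix[("cafe_5dB", "en-IN")] -> TSR under cafe noise, Indian-English speakers
```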
A Concrete, Reproducible Evaluation Plan
To create a comprehensive evaluation suite, follow these guidelines:
- Use VoiceBench for knowledge and safety measurements.
- Use SLUE / SLUE Phase-2 for spoken language understanding performance.
- Add MASSIVE for multilingual intent and slot coverage.
- Script controlled interruptions to collect barge-in and endpointing metrics.
- Track HUN by overlaying environmental noise on clean test audio (a mixing sketch follows this list).
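A minimal noise-overlay sketch for the HUN step, assuming 16 kHz mono float arrays (however you load them); mixing at a controlled SNR lets you plot HUN rate against noise level rather than against an ad-hoc "noisy" condition.

```python
# Minimal sketch: mix background noise into a clean utterance at a target SNR (dB).
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested speech-to-noise ratio in dB."""
    # Loop/trim the noise to match the utterance length
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scaled_noise = noise * np.sqrt(target_noise_power / noise_power)
    mixture = speech + scaled_noise
    # Guard against clipping when writing back to fixed-point audio formats
    peak = np.max(np.abs(mixture))
    return mixture / peak if peak > 1.0 else mixture
```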
Report Structure
Your evaluation report should include:
- Primary table of metrics, including TSR, barge-in latencies, HUN rate, and MOS (a table-assembly sketch follows this list).
- Stress plots showing how each metric degrades as environmental variables (e.g., SNR) change.
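An illustrative way to assemble the primary table, assuming per-system result dicts keyed by the metric names used in this post; pandas is just one convenient option, and the column list is an assumption.

```python
# Minimal sketch: collect per-system results into the report's primary metrics table.
import pandas as pd

COLUMNS = [
    "system", "tsr", "median_tct_s", "turns_to_success",
    "barge_in_latency_ms", "false_barge_in_rate", "hun_rate", "p808_mos",
]

def primary_table(rows: list[dict]) -> pd.DataFrame:
    """rows: one dict per evaluated system, keyed by the column names above."""
    return pd.DataFrame(rows, columns=COLUMNS).set_index("system")
```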
References
- VoiceBench: first multi-facet speech-interaction benchmark for LLM-based voice assistants.
- SLUE / SLUE Phase-2: spoken NER and dialog acts with sensitivity to ASR errors.
- MASSIVE: 1M+ multilingual intent/slot utterances for assistants.
- Spoken-SQuAD / HeySQuAD: spoken question answering datasets.
- User-centric evaluation in production assistants.
- Research on barge-in verification and processing.
- ASR hallucination definitions and their implications.