Text-to-Speech Quality Metrics Explained

What MOS, WER, MCD, PESQ, STOI, and UTMOS actually measure, where each one fails, and which to use in production Text-to-Speech pipelines.

Text-to-Speech quality is measurable, but the metrics landscape is confusing. Papers reach for the acronym of the month, vendors cite whichever score flatters their model, and production teams have no easy way to know which metric they should trust. This guide explains every major Text-to-Speech quality metric in plain language - what it measures, where it shines, where it breaks down, and when to reach for it in a production workflow.

Subjective metrics: Mean Opinion Score

MOS is the original Text-to-Speech quality metric, standardised in ITU-T Recommendation P.800. You play a clip to a panel of human listeners and they rate it on a 1 (bad) to 5 (excellent) scale; the average across listeners is the Mean Opinion Score. P.800 calls for a listener panel large enough for statistical reliability; in practice that commonly means at least 16-24 subjects. The strengths are obvious - it directly measures what you actually care about, which is how the audio sounds to a human. The weaknesses are also obvious: you need real humans, rating sessions are slow and expensive, and the results are sensitive to the wording of the instructions. MOS is the gold standard for publishing research and a terrible choice for nightly regression tests in a production pipeline.
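The arithmetic itself is trivial - a MOS is just the mean of the panel's ratings, usually reported with a confidence interval so readers can judge how reliable the panel size was. A minimal sketch (the ratings below are made up for illustration):

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    """Mean Opinion Score with an approximate 95% confidence interval."""
    mean = statistics.mean(ratings)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, (mean - z * sem, mean + z * sem)

# 16 hypothetical listener ratings on the P.800 1-5 scale.
ratings = [4, 5, 4, 3, 4, 4, 5, 3, 4, 4, 3, 5, 4, 4, 4, 4]
mos, (low, high) = mos_with_ci(ratings)
```

With 16 listeners the interval is already fairly tight; with 4 it would be wide enough to make most model-to-model comparisons meaningless, which is why small-panel MOS numbers in papers deserve scepticism.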

Automated MOS prediction: UTMOS and SpeechMOS

UTMOS (UTokyo-SaruLab, VoiceMOS Challenge 2022) is the best-known neural network trained to predict what human listeners would rate a clip, without needing a reference audio. Feed in audio, get a number back that correlates reasonably well with a panel MOS. Python packages like SpeechMOS bundle UTMOS and similar predictors for one-line inference. The big win is that they are fast, deterministic, and free - you can run them in CI on every build. The big loss is that they return a single score per file and do not tell you what is wrong with a bad file. A clip with a hallucinated line can still score 3.9 because most of it sounds fine. Use them for trend monitoring and regression detection, not as a gate.

Word Error Rate

WER measures how accurately the spoken audio matches the text it was supposed to say. You run speech-to-text over the generated audio and diff it against the original script. The WER is the count of substitutions, insertions, and deletions divided by the number of words in the reference script. WER is the single best metric for catching hallucinated words, dropped words, and pronunciation errors severe enough to change the transcription. It is almost useless for catching voice drift, pacing issues, or emotional flatness - a flat, drifting, boring delivery can score a perfect WER of zero because every word is technically pronounced. Use it as a hard gate on script-accurate content - audiobooks, voice agents, dubbing - and pair it with other metrics for everything else.
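Under the hood WER is a word-level edit distance. A minimal pure-Python reference implementation (case- and punctuation-sensitive as written; production pipelines normalise both texts first, and libraries like jiwer handle that for you):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance over the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, a model that drops one word from a six-word script ("the cat sat on the mat" rendered as "the cat sat on mat") scores 1/6, or about 16.7%.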

Mel Cepstral Distortion

MCD measures the spectral distance between the generated audio and a reference recording - how far apart they are in the frequency domain. It is good at catching timbre and voice-character differences, which is why the research community reaches for it in voice cloning papers. The catch is that MCD needs a reference audio to compare against. That is fine in a research benchmark where you have ground-truth human recordings, but in production Text-to-Speech there is no ground truth - the generated audio is the only version. MCD is mostly useful for internal consistency checks, like comparing a new model checkpoint against the output of the previous one.
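The standard MCD formula is a scaled Euclidean distance over mel-cepstral coefficients, averaged over frames. A sketch assuming the two cepstral sequences are already time-aligned (real pipelines first align frames with DTW and extract cepstra with a vocoder toolkit such as WORLD):

```python
import numpy as np

def mcd(ref_mcep: np.ndarray, gen_mcep: np.ndarray) -> float:
    """Mel Cepstral Distortion in dB between two time-aligned
    mel-cepstral sequences of shape (frames, coefficients).
    The 0th (energy) coefficient is conventionally excluded."""
    diff = ref_mcep[:, 1:] - gen_mcep[:, 1:]
    # Per-frame distortion, then the mean over all frames.
    per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float((10.0 / np.log(10.0)) * np.mean(per_frame))
```

Identical sequences score 0 dB; published voice-cloning results typically land in the mid-single digits, though absolute values depend heavily on the cepstral extraction settings, so MCD is only comparable within one evaluation setup.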

PESQ and STOI

PESQ is the ITU-T P.862 standard for perceived speech quality. It was designed for telephony, where the reference is always a clean studio recording and the test is how much the network path degrades it. That is not a good fit for Text-to-Speech, where the best-case output already beats telephony quality and there is no reference to compare against. STOI measures intelligibility - how understandable the speech is - and has similar constraints. Both are useful in the narrow cases you would expect (voice compression research, noisy-channel testing) and are bad choices as the primary metric for production Text-to-Speech quality assurance.

Speaker similarity and voice drift

Speaker similarity is a metric in its own right, and the voice drift it catches is a known unsolved problem in the field - even state-of-the-art models like Hume AI's TADA document "occasional cases of speaker drift during long generations." You run a speaker encoder over every file in a batch, compute how far each file sits from the batch centroid (or from a reference clip), and flag anything that sits far enough outside. Unlike MOS or WER, this metric cleanly answers the question "does file 50 sound like file 1?" - and it is the only metric that catches gradual voice drift before listeners do. No single score at the per-file level is going to tell you that file 50 is subtly different from file 1; the anomaly only exists at the batch level.
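The batch-level check can be sketched in a few lines of numpy, assuming you already have L2-normalised speaker embeddings from an encoder such as ECAPA-TDNN or Resemblyzer (the 0.15 cosine-distance threshold is an illustrative assumption, not a universal constant):

```python
import numpy as np

def flag_drift(embeddings: np.ndarray, threshold: float = 0.15) -> list[int]:
    """Flag files whose speaker embedding drifts from the batch centroid.

    embeddings: (n_files, dim) array of L2-normalised speaker embeddings.
    Returns indices whose cosine distance to the centroid exceeds the
    threshold (an assumed, per-voice tunable value).
    """
    centroid = embeddings.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    # Cosine distance = 1 - cosine similarity (rows are unit length).
    distances = 1.0 - embeddings @ centroid
    return [i for i, d in enumerate(distances) if d > threshold]
```

Using the centroid rather than file 1 as the reference matters: with a fixed reference, slow drift looks like a smooth ramp that may never cross a per-file threshold, while distance-from-centroid makes the outliers stand out.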

What every metric misses

None of the metrics above catch everything. MOS and UTMOS miss hallucinations. WER misses drift and pacing. MCD and PESQ need a reference. Speaker similarity misses bad pronunciation. Production Text-to-Speech QA is a stack of metrics, not a single number - and the stack needs to include at least one detector per failure mode that matters to your application. For the full catalogue of failure modes and which detector catches which, see our audio artifacts reference.
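In code, "a stack of metrics" is just one gate per failure mode, applied per file. A sketch of the shape such a gate takes - the detector names and thresholds here are illustrative, not any particular tool's API:

```python
def gate_file(metrics: dict[str, float]) -> list[str]:
    """Return the list of failed checks for one generated file.
    Thresholds are illustrative; tune them per application."""
    checks = {
        "wer":              lambda v: v <= 0.02,  # hard script-accuracy gate
        "speaker_distance": lambda v: v <= 0.15,  # drift vs. batch centroid
        "utmos":            lambda v: v >= 3.5,   # overall-quality floor
        "max_silence_sec":  lambda v: v <= 2.0,   # pacing / dead air
    }
    return [name for name, ok in checks.items()
            if name in metrics and not ok(metrics[name])]
```

A file passes only when the returned list is empty, which is exactly the property a single blended score cannot give you: one detector per failure mode, each with its own veto.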

Practical recommendations

If you only run one metric, run WER - it is the cheapest to set up, it is deterministic, and it catches the worst failure class (wrong words). If you run two, add speaker similarity across the batch to catch drift. If you run three, add UTMOS for trend monitoring so you notice when a model update quietly regresses overall quality. Beyond that you are building a full quality stack, at which point the pragmatic move is to use a purpose-built platform like TTSAudit that runs all of this on every batch in one pass. See our tool comparison for a full rundown of the options.
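The trend-monitoring piece of that stack can be as simple as comparing each build's mean predictor score against a baseline build, and alerting on a large enough drop (the 0.2 threshold below is an assumption to tune against your own score variance):

```python
import statistics

def regressed(baseline_scores: list[float],
              new_scores: list[float],
              max_drop: float = 0.2) -> bool:
    """True if this build's mean MOS-prediction fell more than
    max_drop below the baseline mean. max_drop is an assumed threshold."""
    return statistics.mean(baseline_scores) - statistics.mean(new_scores) > max_drop
```

This deliberately compares batch means rather than gating single files, matching the earlier advice: predictors like UTMOS are trustworthy as a trend signal, not as a per-file verdict.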

Run the full metric stack in one pass

TTSAudit runs WER, speaker similarity, drift, artifact, and pacing checks on every file in your batch. 100 free credits.

Try TTSAudit Free

Key capabilities

🔤

Script Accuracy (WER)

Automated speech-to-text diff against your source text catches hallucinations and dropped words.

🎙️

Speaker Similarity

Per-file and batch-level embedding analysis catches voice drift and content-triggered accent leakage.

📈

UTMOS Trend Monitoring

Track model quality over time and catch regressions after provider updates.

🎯

Per-Failure Detectors

One metric per failure mode - pacing, silence, artifacts, hallucination, and drift all checked separately.


Catch bad TTS files before they ship

Run a free audit on your batch - no credit card required.