Text-to-Speech quality is measurable, but the metrics landscape is confusing. Papers reach for the acronym of the month, vendors cite whichever score flatters their model, and production teams have no easy way to know which metric they should trust. This guide explains every major Text-to-Speech quality metric in plain language - what it measures, where it shines, where it breaks down, and when to reach for it in a production workflow.
Subjective metrics: Mean Opinion Score
MOS is the original Text-to-Speech quality metric, standardised in ITU-T Recommendation P.800. You play a clip to a panel of human listeners and they rate it on a 1 (bad) to 5 (excellent) scale; the average across listeners is the Mean Opinion Score. P.800 calls for a listener panel large enough for statistical reliability - in practice, commonly at least 16-24 subjects. The strengths are obvious - it directly measures what you actually care about, which is how the audio sounds to a human. The weaknesses are also obvious: you need real humans, rating sessions are slow and expensive, and the results are sensitive to the wording of the instructions. MOS is the gold standard for publishing research and a terrible choice for nightly regression tests in a production pipeline.
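The averaging itself is trivial; the useful addition is a confidence interval, which tells you whether the panel was big enough to trust the average. The helper below is an illustrative sketch, not part of P.800:

```python
import statistics

def mean_opinion_score(ratings):
    """Average listener ratings (1-5) plus a 95% confidence interval.

    A wide interval is the signal that the panel is too small to
    trust the average. Hypothetical helper, not an official method.
    """
    n = len(ratings)
    mos = statistics.mean(ratings)
    # 1.96 ~ 95% normal-approximation half-width; reasonable for n >= ~16
    half_width = 1.96 * statistics.stdev(ratings) / n ** 0.5
    return mos, (mos - half_width, mos + half_width)

# A 16-listener panel, the low end of what P.800 practice suggests
mos, ci = mean_opinion_score([4, 5, 3, 4, 4, 5, 4, 3, 4, 4, 5, 4, 3, 4, 4, 4])
```

If the interval spans more than a few tenths of a point, two model versions with "different" MOS may not be distinguishable at all.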
Automated MOS prediction: UTMOS and SpeechMOS
UTMOS is a neural network trained to predict the score a human MOS panel would assign, so you get a MOS-like number without recruiting listeners; SpeechMOS is one of the packaged wrappers that makes such predictors a one-line call. Predicted MOS is cheap and fast enough to run on every file in a batch, which makes it well suited to trend monitoring. The trade-off is that a predictor inherits the biases of its training data and, as discussed below, can miss failure modes such as hallucinated words that human raters would penalise heavily.
Word Error Rate
WER measures how accurately the generated audio matches the text it was supposed to say. You run speech-to-text over the generated audio and diff the transcript against the original script; WER is the number of substitutions, insertions, and deletions divided by the number of words in the reference. WER is the single best metric for catching hallucinated words, dropped words, and outright pronunciation errors. It is almost useless for catching voice drift, pacing issues, or emotional flatness - a flat, drifting, boring delivery can score a perfect WER of zero because every word is technically correct. Use it as a hard gate on script-accurate content - audiobooks, voice agents, dubbing - and pair it with other metrics for everything else.
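The arithmetic is a word-level edit distance. A minimal, dependency-free sketch - the function name is hypothetical, and a production pipeline would normalise punctuation, numerals, and casing far more carefully before diffing:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed as a standard word-level edit distance."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)
```

Dropping one word and substituting another in a four-word script yields a WER of 0.5, which is why the metric is usually used as a hard pass/fail gate rather than averaged.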
Mel Cepstral Distortion
MCD measures the spectral distance between the generated audio and a reference recording - how far apart they are in the frequency domain. It is good at catching timbre and voice-character differences, which is why the research community reaches for it in voice cloning papers. The catch is that MCD needs a reference audio to compare against. That is fine in a research benchmark where you have ground-truth human recordings, but in production Text-to-Speech there is no ground truth - the generated audio is the only version. MCD is mostly useful for internal consistency checks, like comparing a new model checkpoint against the output of the previous one.
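The usual formulation is MCD = (10 / ln 10) * sqrt(2 * sum over coefficients of the squared difference), averaged over frames. A NumPy sketch, assuming the two mel-cepstral sequences have already been time-aligned (real pipelines DTW-align them first) and dropping the 0th, energy-related coefficient as is conventional:

```python
import numpy as np

def mel_cepstral_distortion(ref_mcep: np.ndarray, gen_mcep: np.ndarray) -> float:
    """Frame-averaged MCD in dB between two (frames, coeffs) arrays of
    mel-cepstral coefficients. Assumes equal length and prior alignment."""
    diff = ref_mcep[:, 1:] - gen_mcep[:, 1:]
    per_frame = (10.0 / np.log(10)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```

Lower is better; for checkpoint-to-checkpoint comparisons the absolute value matters less than whether the number moves between runs.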
PESQ and STOI
PESQ (Perceptual Evaluation of Speech Quality, standardised as ITU-T P.862) and STOI (Short-Time Objective Intelligibility) come from the telephony world: both compare a degraded signal against a clean reference to score perceived quality and intelligibility, and both were designed to evaluate codecs and enhancement, not generative speech. The reference requirement is the same limitation MCD has - in production Text-to-Speech there is no ground-truth recording to compare against - so treat these as tools for transmission-chain evaluation rather than as general-purpose Text-to-Speech quality scores.
Speaker similarity and voice drift
Speaker similarity turns voice drift into a measurable quantity, and drift remains a known unsolved problem in the field - even state-of-the-art models like Hume AI's TADA document "occasional cases of speaker drift during long generations." You run a speaker encoder over every file in a batch, compute how far each file sits from the batch centroid (or from a reference clip), and flag any file that falls outside a distance threshold. Unlike MOS or WER, this metric cleanly answers the question "does file 50 sound like file 1?" - and it is the only metric that catches gradual voice drift before listeners do. No single per-file score will tell you that file 50 is subtly different from file 1; the anomaly only exists at the batch level.
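A minimal NumPy sketch of the batch-level check. The speaker encoder itself is assumed to exist upstream (any embedding model, e.g. an ECAPA-TDNN, would do), and the threshold is illustrative rather than tuned:

```python
import numpy as np

def flag_drifted_files(embeddings: np.ndarray, threshold: float = 0.25):
    """Flag files whose speaker embedding sits too far from the batch centroid.

    embeddings: (n_files, dim) array of speaker-encoder outputs, one row
    per generated file. Returns the indices of files whose cosine distance
    to the (length-normalised) centroid exceeds the threshold.
    """
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    centroid = unit.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    cosine_distance = 1.0 - unit @ centroid
    return np.flatnonzero(cosine_distance > threshold)
```

Comparing against the centroid rather than against file 1 means a single bad first file does not poison the whole batch; comparing against a reference clip instead catches batches that drift together.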
What every metric misses
None of the metrics above catch everything. MOS and UTMOS miss hallucinations. WER misses drift and pacing. MCD and PESQ need a reference. Speaker similarity misses bad pronunciation. Production Text-to-Speech QA is a stack of metrics, not a single number - and the stack needs to include at least one detector per failure mode that matters to your application. For the full catalogue of failure modes and which detector catches which, see our audio artifacts reference.
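In code, the stack can be as simple as a gate that maps each metric to the failure mode it guards. The thresholds below are illustrative placeholders you would tune per application, not recommendations:

```python
def qa_gate(metrics: dict) -> list[str]:
    """One detector per failure mode: return the list of failed checks.

    metrics is assumed to hold per-file or per-batch scores computed
    upstream; all threshold values here are hypothetical defaults.
    """
    checks = {
        "hallucination/dropped words": metrics["wer"] <= 0.05,
        "voice drift": metrics["speaker_similarity"] >= 0.80,
        "quality regression": metrics["utmos"] >= 3.5,
    }
    return [name for name, passed in checks.items() if not passed]
```

An empty return means every detector passed; anything else names exactly which failure mode to investigate, which is more actionable than a single blended score.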
Practical recommendations
If you only run one metric, run WER - it is the cheapest to set up, it is deterministic, and it catches the worst failure class (wrong words). If you run two, add speaker similarity across the batch to catch drift. If you run three, add UTMOS for trend monitoring so you notice when a model update quietly regresses overall quality. Beyond that you are building a full quality stack, at which point the pragmatic move is to use a purpose-built platform like TTSAudit that runs all of this on every batch in one pass. See our tool comparison for a full rundown of the options.
Run the full metric stack in one pass
TTSAudit runs WER, speaker similarity, drift, artifact, and pacing checks on every file in your batch. 100 free credits.
Try TTSAudit Free
Key capabilities
Script Accuracy (WER)
Automated speech-to-text diff against your source text catches hallucinations and dropped words.
Speaker Similarity
Per-file and batch-level embedding analysis catches voice drift and content-triggered accent leakage.
UTMOS Trend Monitoring
Track model quality over time and catch regressions after provider updates.
Per-Failure Detectors
One metric per failure mode - pacing, silence, artifacts, hallucination, and drift all checked separately.