The Text-to-Speech Issues Hiding in Every Batch

Six failures we see in production every week - with real audio examples. Listen, then see how we catch them.

The scale of the problem

Every text-to-speech batch has bad files. The only question is whether you catch them before your users do.

- Share of text-to-speech batches that contain files needing regeneration
- Hours it takes to manually listen to a 1K-file batch
- Share of bad files missed by spot-checking
- Cost of re-running an entire batch vs targeted regeneration

The six most common issues

Click any row to jump to its audio example and full breakdown below.

1 Audio quality

Garbled speech

Words collapse into unintelligible mumbling - a string of phonemes that don't form real words.

Listen to the example (0:05)

What’s happening

The model loses the plot mid-sentence and outputs a slur of speech-like sounds instead of actual words. It happens most often on long or unusual inputs, and the surrounding sentences can sound completely fine - so a quick spot-check rarely catches it.

How we detect it

Our audio-quality check scans every file for glitches and artifacts - including garbled speech - using a model trained to recognise the acoustic signature of non-words. Each garbled region is timestamped with a severity score so you can sort and regenerate the worst offenders.

Learn about the Audio Quality check

2 Truncation

Cut-off ending

The file stops mid-sentence. The last word (or the last few seconds) is simply missing.

Listen to the example (0:06)

What’s happening

The text-to-speech engine returns a success response but the audio ends before the script does. Your user hears the narrator get cut off in the middle of a thought. It's one of the most jarring failures because it leaves a sentence hanging and the listener knows something is wrong.

How we detect it

We compare the expected script length against the audio length, and we check the final seconds of every file for an abnormal cutoff - audio that ends mid-word instead of tapering naturally. Files with suspicious endings get flagged with a timestamp.
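The two heuristics above can be sketched in a few lines: a rough duration estimate from word count, plus a check that the final window of audio tapers into silence instead of ending at speech-level energy. Every name and threshold here (words_per_second, the 0.8 tolerance, the RMS floor) is illustrative, not the actual check:

```python
def expected_duration(script: str, words_per_second: float = 2.5) -> float:
    """Rough duration estimate from word count (typical narration pace)."""
    return len(script.split()) / words_per_second

def ends_abruptly(samples, sample_rate, window_s=0.2, floor=0.01):
    """True if the final window still carries speech-level energy,
    i.e. the audio ends mid-word instead of tapering into silence."""
    n = int(window_s * sample_rate)
    tail = samples[-n:]
    rms = (sum(x * x for x in tail) / len(tail)) ** 0.5
    return rms > floor

def flag_truncation(script, samples, sample_rate, tolerance=0.8):
    """Flag a file as truncated if it is far shorter than the script
    implies, or if it cuts off abruptly at the end."""
    actual = len(samples) / sample_rate
    too_short = actual < tolerance * expected_duration(script)
    return too_short or ends_abruptly(samples, sample_rate)
```

A real pipeline would read decoded PCM samples from the file; here they are plain Python floats to keep the sketch self-contained.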

Learn about the Script Accuracy check

3 Script accuracy

Spoken stage direction

The model reads a bracketed cue out loud instead of performing it.

Listen to the example (0:06)

Source script

People usually think this city's history is carved in stone. [scoff] Not here.

What’s happening

Text-to-speech engines sometimes accept inline cues like [scoff], [laughs], or [whispers] as acting directions. When they work, the delivery changes. When they don't, the model literally says the word 'scoff' in the middle of your sentence. The file sounds completely wrong and the listener hears a technical tag leak into the narration.

How we detect it

We transcribe each file and scan for tag tokens that shouldn't appear in the spoken output - words like 'scoff', 'laughs', 'sighs', 'pause', and common bracket artifacts. Matches are flagged with the exact timestamp in the audio.
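A minimal sketch of that scan, assuming the transcription step returns (word, start_seconds) pairs; the token list here is a small sample, not the full set we look for:

```python
# Small sample of cue words that should never be spoken aloud.
LEAKED_CUE_TOKENS = {"scoff", "laughs", "sighs", "whispers", "pause"}

def find_leaked_cues(transcript):
    """Return (word, timestamp) for every tag token spoken in the audio.

    `transcript` is a list of (word, start_seconds) pairs, as produced
    by a word-level ASR step (hypothetical format for this sketch).
    """
    hits = []
    for word, start in transcript:
        # Normalise away punctuation and leftover brackets before matching.
        if word.lower().strip(".,![]") in LEAKED_CUE_TOKENS:
            hits.append((word, start))
    return hits
```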

Learn about the Script Accuracy check

4 Model failure

Repeated phrases

A phrase, sentence, or whole paragraph gets spoken twice - sometimes on a loop.

Listen to the example (0:27)

What’s happening

The model's attention collapses and it starts repeating itself. Sometimes it's a single word stuttered twice, sometimes the same sentence read back-to-back, sometimes an entire paragraph on a loop until the file runs out. It's most common with long inputs and unusual punctuation.

How we detect it

We transcribe the audio and run a repetition scan that looks for repeated n-grams that don't appear in the source script. We also align the transcript back to the script to catch extra-long loops. Every repetition is reported with its start and end times.
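The n-gram part of that scan reduces to: collect the n-grams that repeat in the transcript, then subtract the ones the script legitimately repeats. Function names are illustrative, and a real pass would also run the transcript-to-script alignment mentioned above:

```python
from collections import Counter

def ngrams(words, n):
    """All consecutive n-word windows."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def repeated_ngrams(text, n=3):
    """N-grams that occur more than once in the text."""
    counts = Counter(ngrams(text.lower().split(), n))
    return {g for g, c in counts.items() if c > 1}

def spurious_repetitions(script, transcript, n=3):
    """Repetitions the model introduced, minus ones present in the script."""
    return repeated_ngrams(transcript, n) - repeated_ngrams(script, n)
```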

Learn about the Script Accuracy check

5 Speaker consistency

Accent drift between files

An American voice comes back Scottish on a single file - nothing in your config changed.

Listen to the example

What’s happening

Some text-to-speech models shift the narrator's accent based on what the text is about. A voice locked to an American speaker ID can come back Scottish on a sentence about Aberdeen, French on a quote from Flaubert, or Australian on a passage about the Outback. Same voice, same settings, different accent - and a listener notices in the first few words. The clip below is one file from a 13-stop tour batch generated with a single American voice - every other file in the batch matched the baseline, but this one leaked into a Scottish accent.

How we detect it

Our Speaker Consistency check builds a voice fingerprint for every file in your batch and compares them all against each other. Files whose voice drifts from the baseline - whether the cause is accent leakage, a silent fallback to a standard voice, or a provider-side model update - show up as outliers with a deviation score.
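The outlier step of that comparison can be sketched as follows, assuming one voice embedding per file from any speaker-embedding model: score each file by its cosine distance from the batch centroid, and treat the largest distances as drift candidates. The tiny 3-D vectors are toys - real voice embeddings have hundreds of dimensions:

```python
import math

def cosine_distance(a, b):
    """1 minus cosine similarity: 0 for identical direction, up to 2."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def deviation_scores(embeddings):
    """Distance of each file's voice fingerprint from the batch average."""
    dim = len(embeddings[0])
    centroid = [sum(e[i] for e in embeddings) / len(embeddings)
                for i in range(dim)]
    return [cosine_distance(e, centroid) for e in embeddings]
```

Sorting files by this score surfaces the accent-drifted outlier without listening to any of them.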

Learn about the Speaker Consistency check

6 Audio quality

Echo and reverb

The voice sounds like it's being recorded in a stairwell or an empty warehouse.

Listen to the example (0:10)

What’s happening

An unnatural echo creeps into the audio - as if the narrator suddenly walked into a different room. It usually isn't consistent across a batch, which is what makes it so disruptive: one file sounds clean, the next sounds like it's bouncing off concrete walls. Listeners notice immediately.

How we detect it

Our audio-quality check runs a spectral analysis on every file and looks for reverberation signatures that don't belong in narrated content. Affected files are flagged with a quality score so you can sort and regenerate the worst offenders.

Learn about the Audio Quality check

Why you can’t catch these manually

Every issue above sounds obvious in isolation. The problem is they’re buried inside batches of hundreds or thousands of files.

You can't listen to every file

A 500-file batch takes hours to sit through. Nobody has the time, so teams spot-check - and the bad files slip through to production.

Drift is invisible in isolation

Speaker and pacing drift happens gradually. Each file sounds fine compared to the one before it. You only notice by comparing files far apart in the batch.

Spot-checking has a terrible hit rate

Listening to 10% of a batch catches roughly 30% of the bad files. The rest ship to production and your users find them first.

Re-generating the whole batch is wasteful

When you can't tell which files are bad, you regenerate everything. 85%+ of those files were fine to begin with - pure wasted spend.

Find these issues in your own batches

100 free credits on signup. No credit card required.