You pick a voice in the Text-to-Speech API, pin the same voice ID for every file in the batch, and expect the narration to sound like the same person from start to finish. Most of the time it does. Sometimes one file comes back wrong - not a glitch, not a mispronunciation, but a completely different accent. Nothing in your settings changed. The only thing that changed was the text.
Here's a real three-file slice from an audit we ran. The full batch was thirteen short narration clips for a walking tour of Aberdeen, all generated back to back with the same American voice. Press play on each track below - you don't need any technical background to hear what happened.
An American voice went Scottish on one file
Three consecutive tracks from a 13-file walking tour of Aberdeen. Two sound like the American voice we picked. One came back Scottish - and we don't fully know why.
| File | Accent | Deviation | Status |
|---|---|---|---|
| Beach Ballroom | American, matches baseline | — | Pass |
| Linx Ice Arena | Unintended Scottish accent | +32.7% | Flagged |
| Beach Leisure Centre | American, matches baseline | — | Pass |
Beach Ballroom and Beach Leisure Centre sound like the American voice we asked for. Linx Ice Arena came back in a clear Scottish accent. Same voice ID, same model, same API call. Just a different file.
Here's the weird part. Every file in this batch is about a Scottish landmark - the whole tour is in Aberdeen. If the trigger were simply "content mentions Scotland," every file would have come back Scottish. Only one did.
We don't have a clean answer for why this particular file and not the others. Our best guess is that the voice is nudged by many different things in the text, and for reasons we can't easily pin down, this specific combination of words pulled hard enough to flip the accent. It's not random, though. You can regenerate Linx Ice Arena on its own, in a cold session, and it still comes back Scottish. The trigger lives somewhere in the text itself.
We call this accent leakage, or content-conditioned voice drift. It's different from the usual thing people mean by voice drift. The usual kind creeps in slowly over a long run - you generate a hundred files in a row and the voice gradually wanders. Accent leakage isn't about time. It's about the text pulling the voice in a direction the text shouldn't be pulling it.
Our working theory is that large speech-language models entangle the speaker and the content more tightly than their designers probably meant to. The model trained on hours of Scottish voices pronouncing Scottish place names, and learned those sounds go together. At inference time, when the voice ID says American and the text says Aberdeen, the model tries to honour both signals at once - and for certain sentences, the content wins.
We've seen the same effect triggered by other things. A direct quote from a named person. Song lyrics embedded in narration. A single Spanish or French phrase inside an English paragraph. Strong geographic adjectives. Each of those is content the model has strong associations for, and any of them can nudge the voice.
One thing that doesn't reliably prevent it is being careful with your voice settings. You can lock the voice ID, keep every parameter the same, and still get a file or two that come back wrong. The fix isn't upstream. The fix is catching the outliers downstream.
Production teams hit this constantly. Audiobook producers working through travel memoirs notice the narrator's accent shifting with the chapter setting. Game studios see it when a specific NPC line suddenly sounds like a different character. E-learning teams catch it when a lesson about French history picks up a subtle French lilt mid-module. Localization is where it gets really ugly - a multilingual voice can briefly slip into the wrong language for a single proper noun.
The good news: accent leakage is easy to detect. The affected file is genuinely a different voice by any reasonable measurement, so speaker-similarity metrics pick it up cleanly. In the sample above, Linx Ice Arena drifted 32.7 percent away from the batch average. The other two sat within a few percent. That kind of gap is trivial to flag automatically.
The detection technique is simple. Run each file through a speaker encoder, compute how close every file is to the batch average, and flag anything that sits far enough outside. Files that drifted into a different accent always land on the flagged list - no one needs to listen to them first.
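As a minimal sketch of that technique: assume you've already run each file through a speaker encoder (any embedding model will do) and have one vector per file. The deviation score below is the cosine distance from each file's embedding to the normalised batch centroid; the threshold value is illustrative, not a recommendation.

```python
import numpy as np

def deviation_scores(embeddings):
    """Cosine distance of each file's speaker embedding from the batch centroid.

    0.0 means the file points exactly along the batch average;
    larger values mean the voice has drifted further from the batch.
    """
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalise each file
    centroid = X.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    return 1.0 - X @ centroid

def flag_outliers(scores, threshold=0.15):
    """Indices of files whose deviation exceeds the (illustrative) threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]
```

With three toy embeddings where two files cluster together and one points elsewhere, `flag_outliers(deviation_scores(...))` returns only the index of the odd one out, which is exactly the "no one needs to listen first" property described above.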
You can also reduce leakage at generation time, even if you can't eliminate it. Use a model with a stricter speaker conditioning mode if it offers one. Add pronunciation hints for foreign place names so the model reacts to your phonetic spelling rather than the native one. Split multilingual scripts into per-language runs. And - this is the one that actually works at scale - audit every batch for voice consistency and regenerate the outliers one file at a time.
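The audit-and-regenerate loop from that last point can be sketched as a small driver. Everything here is hypothetical plumbing: `synthesize` stands in for whatever TTS call you use, and `score` for a batch-deviation function like the speaker-encoder check described earlier. The structure is the point: score the whole batch, then re-roll only the flagged files.

```python
from typing import Callable, Sequence

def audit_and_regenerate(
    scripts: Sequence[str],
    synthesize: Callable[[str], bytes],       # your TTS call (placeholder)
    score: Callable[[Sequence[bytes]], Sequence[float]],  # batch deviation scores
    threshold: float = 0.15,                  # illustrative cut-off
    max_passes: int = 3,
) -> list[bytes]:
    """Generate a batch, then regenerate only the outlier files, one at a time."""
    audio = [synthesize(s) for s in scripts]
    for _ in range(max_passes):
        flagged = [i for i, d in enumerate(score(audio)) if d > threshold]
        if not flagged:
            break  # whole batch is consistent; ship it
        for i in flagged:
            audio[i] = synthesize(scripts[i])  # surgical re-roll, not a full re-run
    return audio
```

Note the cap on passes: as the Linx Ice Arena example shows, some text-driven drift survives regeneration of the identical script, so a persistent outlier eventually needs a script tweak (pronunciation hints, rephrasing) rather than another roll of the dice.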
The broader lesson is that modern neural Text-to-Speech doesn't cleanly separate voice from content the way older models did. Most of the time that's fine. Some of the time it produces a file where the narrator on your American-voiced tour of Aberdeen suddenly sounds like a local. Once you know to look for it, you find it everywhere - and once you catch it automatically, it stops being a production problem and becomes a one-click regeneration.
Key capabilities
Content-Conditioned Drift Detection
Flag files where the voice shifts unexpectedly even when the voice ID and every setting stay the same. Catch accent leakage triggered by place names, quotes, or code-switched content.
Per-File Deviation Scores
Every file gets a deviation score against the batch average. Outliers like the Linx Ice Arena file above land above your threshold automatically - no listening required.
Surgical Regeneration
When one file out of a hundred has leaked into the wrong accent, regenerate just that file. Save credits and time while still shipping consistent audio.
Works Across Providers
Accent leakage happens across ElevenLabs, OpenAI, Google, Azure, and most modern neural Text-to-Speech stacks. TTSAudit analyses the audio output, so it catches the issue whatever model produced it.