Common Text-to-Speech Audio Artifacts

Every type of Text-to-Speech audio failure - clicks, metallic tones, hallucinated words, truncation, drift - with descriptions, causes, and detection guidance.

Every Text-to-Speech platform produces audio artifacts. The only question is whether you catch them before your listeners do. This is a reference guide to every common failure mode we see across production Text-to-Speech pipelines: what each one sounds like, why it happens, where it shows up most, and how to detect it. Bookmark it and come back whenever you hit a new weird thing in a batch.

Clicks and pops

Sharp transient noise, like a finger snap or a mouth click, usually sitting right at a sentence or paragraph boundary. The rest of the audio sounds fine. Most often caused by imperfect stitching between synthesised segments, DC offset at segment boundaries, or a slightly clipped onset on the first phoneme of a new utterance. Shows up most in pipelines that concatenate multiple short requests into one file, and in any provider that streams chunks without smoothing. Waveform-level amplitude spike detection catches it cleanly - any jump over a few dB between adjacent samples at a segment boundary is almost always an artifact and not speech.
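As a rough illustration of the boundary check described above, here is a minimal sketch in Python with NumPy. It assumes mono float samples scaled to [-1.0, 1.0] and a list of stitch times that your own pipeline recorded; the function name, the `boundaries_s` input, and the linear-amplitude threshold (used here in place of a dB figure) are all illustrative assumptions, not a reference implementation.

```python
import numpy as np

def find_clicks(samples, sample_rate, boundaries_s, window_ms=10.0, jump_threshold=0.3):
    """Return the boundary times (seconds) that have a large sample-to-sample jump nearby.

    samples: mono float audio scaled to [-1.0, 1.0] (assumption)
    boundaries_s: times where segments were stitched (hypothetical input)
    jump_threshold: linear amplitude step treated as a click (tune per voice)
    """
    samples = np.asarray(samples, dtype=np.float64)
    half = int(sample_rate * window_ms / 1000.0)
    flagged = []
    for t in boundaries_s:
        centre = int(round(t * sample_rate))
        lo = max(centre - half, 0)
        hi = min(centre + half, len(samples))
        if hi - lo < 2:
            continue  # boundary falls outside the audio
        # Largest step between adjacent samples inside the window.
        max_step = np.max(np.abs(np.diff(samples[lo:hi])))
        if max_step > jump_threshold:
            flagged.append(t)
    return flagged
```

Restricting the search to a short window around each known boundary keeps ordinary plosives elsewhere in the file from being flagged.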

Metallic or robotic tone

A thin, tinny, ringing quality over the whole clip. Sustained vowels and sibilants are where you notice it first. Usually caused by a server-side processing change, a new model checkpoint that compresses the high-frequency band differently, or a degraded decoder path on a low-cost tier. Gemini 2.5 Pro picked this up after the December 10, 2025 model update - developers reported new metallic artifacts and degraded expressivity on the Google AI Developer Forum - and it is still intermittent on Pro today (see our Gemini guide). Play.ht users reported similar peak-hour degradation before the platform shut down. Spectral analysis looking at unnatural harmonic regularity flags it automatically - speech should not have perfectly spaced peaks in its spectrum.

Garbled or slurred speech

Words run together, consonants blur, vowels get swallowed. Sometimes a single word is unintelligible, sometimes an entire phrase. The rest of the file is clean. This is the probabilistic model falling into a low-quality sample from the distribution, and it happens more on faster or cheaper tiers because they use smaller or more aggressively quantised models. ElevenLabs Turbo v2.5, the low-latency 32-language model, is particularly exposed - users report slurred words and garbled phrases at a noticeably higher rate than v3. The reliable detection method is to run the audio through speech-to-text and diff against the input script, which catches the garbled segments as transcription errors.
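The script-diff idea can be sketched with Python's standard difflib. This assumes you already have a transcript string from whichever speech-to-text system you use; the function names and the simple lowercase tokenizer are illustrative choices, and real scripts with numbers or abbreviations would need text normalisation first.

```python
import difflib
import re

def tokenize(text):
    """Lowercase and split into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

def script_diff(script, transcript):
    """Return (kind, script_words, transcript_words) for every mismatch.

    kind is 'replace' (likely garbled or mispronounced), 'delete' (words
    missing from the audio), or 'insert' (words not in the script).
    """
    ref = tokenize(script)
    hyp = tokenize(transcript)
    matcher = difflib.SequenceMatcher(a=ref, b=hyp, autojunk=False)
    issues = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            issues.append((tag, " ".join(ref[i1:i2]), " ".join(hyp[j1:j2])))
    return issues
```

An empty result means the transcript matched the script word for word; any other result points at the span to listen to.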

Voice character change

The speaker sounds like a different person, either mid-file or between files. No amount of re-reading helps because the voice itself shifted, not the words. This is the voice drift problem - Hume AI's TADA research documents "occasional cases of speaker drift during long generations" as an unsolved problem even in state-of-the-art models. It has a full post of its own - see Why Your Text-to-Speech Voice Changes Between Files and Why Text-to-Speech Voices Switch Accent for the deep dive. Causes include probabilistic sampling, soft speaker conditioning, long-context attention drift, and content-triggered accent leakage. Detection uses a speaker encoder to embed each file and compare against the batch centroid - outliers are the ones that drifted.
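The centroid comparison could look roughly like the sketch below. It assumes you have already produced one non-zero embedding vector per file with a speaker encoder of your choice; the function name and the 0.25 cosine-distance threshold are placeholders to tune on your own voices.

```python
import numpy as np

def drift_outliers(embeddings, max_cosine_distance=0.25):
    """Return (outlier_indices, distances) relative to the batch centroid.

    embeddings: one speaker-embedding vector per file (assumed non-zero)
    max_cosine_distance: illustrative threshold, not a calibrated value
    """
    emb = np.asarray(embeddings, dtype=np.float64)
    # Unit-normalise so the dot product is cosine similarity.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    centroid = emb.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    distances = 1.0 - emb @ centroid
    outliers = [int(i) for i in np.where(distances > max_cosine_distance)[0]]
    return outliers, distances
```

Because the centroid is computed from the batch itself, a batch in which most files drifted will hide the problem; comparing against a known-good reference embedding is a reasonable alternative.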

Hallucinated words or phrases

The audio contains words or sentences that do not exist in the input text. Sometimes it is a filler phrase at a paragraph boundary, sometimes a repeated last line, sometimes a completely invented sentence. This comes from the large-language-model class of Text-to-Speech systems, which can generate tokens that map to plausible speech but not to your script. Coqui XTTS has documented hallucinations (end-of-sentence gibberish, repetition artifacts, and language bleed-through in the GitHub discussions), Gemini's multi-speaker mode inserts phantom lines, and ElevenLabs v3 does it occasionally on long generations. Even Hume AI's TADA model - which claims zero content hallucinations as a design goal - acknowledges the problem exists across the category. Detection is script-diff: transcribe the output with Whisper and compare against the source text.

Truncated output

The audio cuts off mid-sentence or mid-word. The API responds with a 200. Nothing obvious is wrong in the logs. Gemini Text-to-Speech is particularly prone to this, returning a finishReason of "OTHER" on otherwise successful calls. ElevenLabs occasionally truncates very long inputs. The most reliable detection is a duration check on the client side - compare the expected read time of the script (a conservative 160-180 words per minute) against the actual duration of the returned audio, and flag anything shorter than 80% of expected. If you pass the original script to TTSAudit's Script Accuracy check, it will also surface the missing content as diff errors against the transcript.
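The client-side duration check is simple enough to sketch in a few lines. The version below assumes WAV output and uses the fast end of the 160-180 words-per-minute range, so the expected duration is the shortest plausible one; the function names are illustrative.

```python
import wave

def wav_duration_s(path):
    """Duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def expected_duration_s(script, words_per_minute=180.0):
    """Shortest plausible read time for the script, in seconds."""
    return len(script.split()) / words_per_minute * 60.0

def looks_truncated(script, actual_duration_s, words_per_minute=180.0, min_ratio=0.8):
    """True if the audio is shorter than min_ratio of the expected read time."""
    expected = expected_duration_s(script, words_per_minute)
    if expected == 0:
        return False  # empty script, nothing to compare
    return actual_duration_s < min_ratio * expected
```

For example, a 180-word script is expected to run about 60 seconds, so a 45-second file would be flagged and a 50-second file would pass.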

Silence gaps and missing audio

Unnaturally long pauses between words, sentences, or paragraphs, or entire stretches of silence where speech should be. Caused by paragraph-boundary handling, SSML pause tags rendering longer than intended, or the model briefly losing its place in the input. OpenAI TTS-1 has been documented inserting silence gaps in specific phrasing patterns. Detection uses a silence-duration analyser with configurable thresholds - anything longer than about 1.5 seconds inside a paragraph is almost always an artifact.
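A silence-duration analyser can be approximated with frame-level RMS, as in this sketch. It assumes mono float samples scaled to [-1.0, 1.0]; the -50 dBFS silence level and 20 ms frame size are illustrative defaults, and only the 1.5-second gap length comes from the guidance above.

```python
import numpy as np

def find_silence_gaps(samples, sample_rate, min_gap_s=1.5, silence_dbfs=-50.0, frame_ms=20.0):
    """Return (start_s, end_s) for each silent run of at least min_gap_s."""
    samples = np.asarray(samples, dtype=np.float64)
    frame_len = int(sample_rate * frame_ms / 1000.0)
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    level_db = 20.0 * np.log10(np.maximum(rms, 1e-10))  # avoid log(0)
    silent = level_db < silence_dbfs

    runs = []
    start = None
    for i, is_silent in enumerate(silent):
        if is_silent and start is None:
            start = i
        elif not is_silent and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, n_frames))  # silence runs to end of file

    frame_s = frame_len / sample_rate
    return [(s * frame_s, e * frame_s) for s, e in runs if (e - s) * frame_s >= min_gap_s]
```

Note that this version also reports long leading or trailing silence; if you only care about gaps inside a paragraph, discard runs that touch the start or end of the file.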

Pacing anomalies

The voice speeds up or slows down unnaturally. Sometimes the acceleration is gradual inside a long generation, sometimes a single file in a batch comes back noticeably faster than its neighbours. ElevenLabs shows progressive acceleration past 800-900 characters in a single request. Play.ht has had bugs where saved speed settings reset to a much faster value. Detection is a words-per-minute analyser running across rolling windows of the file - stable speech sits in a narrow range around the batch average, and anything outside that range is a pacing outlier.
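A rolling words-per-minute check might be sketched as follows. It assumes you have per-word start times from a speech-to-text system that returns word timestamps, and that you supply the batch average yourself as `baseline_wpm`; the window, hop, and 20% tolerance are illustrative values.

```python
def rolling_wpm(word_start_times_s, window_s=10.0, hop_s=5.0):
    """Return (window_start_s, words_per_minute) for each full window."""
    times = sorted(word_start_times_s)
    rates = []
    if not times:
        return rates
    start = 0.0
    # Only score windows that fit entirely before the last word.
    while start + window_s <= times[-1]:
        count = sum(1 for t in times if start <= t < start + window_s)
        rates.append((start, count / window_s * 60.0))
        start += hop_s
    return rates

def pacing_outliers(rates, baseline_wpm, tolerance=0.2):
    """Windows whose rate differs from baseline_wpm by more than tolerance (a fraction)."""
    return [(start, wpm) for start, wpm in rates
            if abs(wpm - baseline_wpm) > tolerance * baseline_wpm]
```

Long deliberate pauses lower the rate of the window they fall in, so it can help to pair this with a silence check before treating a slow window as a pacing problem.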

Pronunciation errors

Proper nouns, technical terms, and foreign words pronounced incorrectly. Multilingual models are the worst offenders because they can flip the pronunciation of a word that exists in two languages toward the wrong one. ElevenLabs multilingual v2 is known for it, and every platform struggles with domain-specific terminology. Detection is speech-to-text against the input script - outright substitutions show up as diff errors. Softer mispronunciations that the transcriber still maps to the right word generally need a human ear or a phonetic-distance model layered on top of the ASR output.

Background noise and hum

Low-level hiss, hum, or room tone that should not be in synthetic speech at all. The usual cause is a cloned voice trained from a noisy reference - the model ends up bleeding the training noise into every generation. Any voice cloned from non-studio recordings is at risk, and Play.ht's instant cloning has been particularly exposed to this. Detection uses noise-floor analysis on the silent passages of the file - a healthy Text-to-Speech output should have a noise floor around -60 dB or lower, and anything higher is an artifact.
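One way to estimate the noise floor is to average the level of the quietest frames in the file, as in this sketch. It assumes mono float samples scaled to [-1.0, 1.0], reads the -60 dB figure as dB relative to full scale, and treats the quietest 10% of frames as the silent passages; all three are assumptions, and a file with no pauses at all will measure quiet speech instead of noise.

```python
import numpy as np

def noise_floor_dbfs(samples, sample_rate, frame_ms=20.0, quiet_fraction=0.1):
    """Mean level (dB relative to full scale) of the quietest frames."""
    samples = np.asarray(samples, dtype=np.float64)
    frame_len = int(sample_rate * frame_ms / 1000.0)
    n_frames = len(samples) // frame_len
    if n_frames == 0:
        raise ValueError("audio is shorter than one frame")
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    level_db = 20.0 * np.log10(np.maximum(rms, 1e-10))  # avoid log(0)
    n_quiet = max(1, int(n_frames * quiet_fraction))
    quietest = np.sort(level_db)[:n_quiet]
    return float(np.mean(quietest))

def has_noisy_floor(samples, sample_rate, max_floor_dbfs=-60.0):
    """True if the estimated noise floor is above the allowed level."""
    return noise_floor_dbfs(samples, sample_rate) > max_floor_dbfs
```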

How to check all of these systematically

Manual listening is the only way to catch every one of these artifacts with full confidence, and it does not scale past about twenty files. Every team that ships Text-to-Speech at production volume ends up automating the detection for the nine non-subjective artifact types above, leaving only final-polish spot checks for a human. That is the workflow TTSAudit is built around: upload a batch, get a per-file report covering voice drift, audio quality artifacts, pacing anomalies, and (when you pass the original script) missing or hallucinated lines, click through to the exact location of any flag, and regenerate only the files that failed.

The meta-lesson here is that "Text-to-Speech quality" is not one thing. It is a cluster of distinct failure modes that each need a different detector. A batch that passes a single overall quality score can still have three hallucinated lines, one truncated file, and a slow pitch drift hiding inside it. Shipping polished audio at scale means checking for each failure mode individually and treating the checks as a deploy gate.

Find the artifacts you missed

Upload your Text-to-Speech batch and get a per-file report covering voice drift, audio quality artifacts, pacing, and script-diff errors. 100 free credits on signup.

Try TTSAudit Free

Key capabilities

Ten Artifact Detectors

One check per failure mode - clicks, metallic tone, garbled speech, drift, hallucination, truncation, silence, pacing, pronunciation, and noise.

Per-File Location

Every flag links to the exact timestamp in the file where the artifact occurs, so you can verify in one click.

Batch Coverage

Upload up to 500 files per batch and get every artifact class checked on every file. No spot-checking.

Any Provider

Works on output from ElevenLabs, OpenAI, Gemini, Azure, Play.ht, Amazon Polly, Murf, and any open-source Text-to-Speech model.

Catch bad TTS files before they ship

Run a free audit on your batch - no credit card required.