AI Audiobook QA Checklist

A complete quality assurance checklist for AI-narrated audiobooks - ACX technical requirements, TTS-specific failure modes, and automated detection.

The global audiobook market was estimated at $11.06 billion in 2025 and is forecast to grow at a 26.2 percent CAGR through 2030 (Grand View Research). AI Text-to-Speech has cut audiobook production costs by roughly 80 percent and compressed timelines from months to weeks. A 100,000-word novel that used to take a studio four to six weeks now runs overnight. The catch is that 100,000 words is 8 to 12 hours of audio, and if chapter 15 sounds subtly different from chapter 1, the entire production loses credibility - and ACX, the largest gateway to audiobook retail, may reject the submission before it ever reaches listeners.

Most ACX production checklists assume human narration. They do not cover the failure modes specific to AI Text-to-Speech: voice drift across chapters, pronunciation inconsistency, pacing acceleration inside long generations, hallucinated lines, truncated chapters, and the compliance-breaking quirks of generated audio. This is a TTS-specific QA checklist you can run against every batch before submission.

ACX technical requirements you must hit

ACX is strict on the underlying audio engineering regardless of whether the voice is human or synthetic. Every file must be an MP3 at 192 kbps or higher, constant bit rate, with a 44.1 kHz sample rate. Per-file RMS must fall between -23 dB and -18 dB, peak amplitude must stay below -3 dB, and the noise floor must sit at -60 dB or lower. Each file holds one chapter, and the opening credits, closing credits, and retail sample (5 minutes or less) are delivered as separate files. ACX runs automated audio scanners on submissions, so any file outside these bounds is rejected automatically.
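
As a rough pre-check, the level bounds above can be tested on decoded audio before you upload anything. The following is a minimal sketch, not ACX's own measurement: it assumes the chapter has already been decoded elsewhere into float samples in the range -1.0 to 1.0, and that you pick a stretch of silence yourself for the noise-floor reading. The function names are illustrative only.

```python
import math

# Thresholds taken from the ACX requirements listed above.
RMS_MIN_DB, RMS_MAX_DB = -23.0, -18.0
PEAK_MAX_DB = -3.0
NOISE_FLOOR_MAX_DB = -60.0


def to_db(amplitude):
    """Convert a linear amplitude (1.0 = full scale) to decibels."""
    return 20.0 * math.log10(amplitude) if amplitude > 0 else float("-inf")


def rms_db(samples):
    """RMS level of a sequence of float samples, in dB relative to full scale."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    return to_db(math.sqrt(mean_square))


def peak_db(samples):
    """Highest absolute sample value, in dB relative to full scale."""
    return to_db(max((abs(s) for s in samples), default=0.0))


def check_acx_levels(samples, silence):
    """Return a list of failure messages; an empty list means the levels pass.

    `samples` is the whole chapter. `silence` is a stretch with no speech
    (assumption: you choose it, for example the gap at the end of the file).
    """
    failures = []
    rms = rms_db(samples)
    if not RMS_MIN_DB <= rms <= RMS_MAX_DB:
        failures.append(f"RMS {rms:.1f} dB outside {RMS_MIN_DB} to {RMS_MAX_DB} dB")
    peak = peak_db(samples)
    if peak >= PEAK_MAX_DB:
        failures.append(f"peak {peak:.1f} dB not below {PEAK_MAX_DB} dB")
    floor = rms_db(silence)
    if floor > NOISE_FLOOR_MAX_DB:
        failures.append(f"noise floor {floor:.1f} dB above {NOISE_FLOOR_MAX_DB} dB")
    return failures
```

Treat a pass here as a sanity check rather than a guarantee, since ACX's scanner may window or weight its measurements differently.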

Text-to-Speech output usually passes these on the first try because generated audio has no room tone and no organic noise floor, but watch for loudness drift between chapters (Turbo-style short-form models are noisier than v3-style long-form models) and for voice cloning cases where the reference audio noise bleeds into the generated files. Normalise the final output with Auphonic or FFmpeg before submission.
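
If you script the FFmpeg route, one option is to build the same command for every chapter so the whole batch is processed identically. This is a sketch under stated assumptions: FFmpeg is installed with the `libmp3lame` encoder, and the `loudnorm` filter is used for levelling. Note that `loudnorm` targets integrated loudness (LUFS), not RMS, so the -20 target below is an assumed starting point and the output should be re-measured against the ACX RMS range. The helper names are hypothetical.

```python
import subprocess


def build_normalise_command(src, dst, target_lufs=-20.0, true_peak_db=-3.0):
    """Build an FFmpeg argument list that normalises loudness and encodes
    a 192 kbps constant-bit-rate, 44.1 kHz MP3."""
    loudnorm = f"loudnorm=I={target_lufs}:TP={true_peak_db}:LRA=11"
    return [
        "ffmpeg", "-y",            # -y overwrites an existing output file
        "-i", src,
        "-af", loudnorm,           # loudness normalisation filter
        "-ar", "44100",            # resample back to 44.1 kHz
        "-codec:a", "libmp3lame",
        "-b:a", "192k",            # fixed bit rate, so the encoder uses CBR
        dst,
    ]


def normalise(src, dst):
    """Run the command; raises CalledProcessError if FFmpeg fails."""
    subprocess.run(build_normalise_command(src, dst), check=True)
```

Keeping the command builder separate from the call that runs it makes the settings easy to review and log per chapter.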

The Text-to-Speech-specific QA checklist

1. Voice consistency across chapters. Does the voice maintain the same pitch, timbre, and pace from chapter 1 to the final chapter? Does it sound like the same narrator across a 20-chapter run? Voice drift is the single most common reason AI audiobooks feel "off" to reviewers, and it is invisible to spot-checking. See our voice drift post for the full diagnosis.

2. Pronunciation accuracy. Character names, place names, technical terms, foreign words - all pronounced consistently throughout the book. Numbers, dates, and abbreviations spoken naturally. Watch for multilingual models flipping word pronunciation into the wrong language.

3. Pacing uniformity. Reading speed consistent across chapters. Natural pauses between sentences and paragraphs. Watch for pace acceleration inside long single generations - ElevenLabs specifically has a documented acceleration bug past 800-900 characters per render. Silence gaps between paragraphs should be neither too long nor too short.

4. Audio artifact detection. Clicks, pops, static, and metallic tones. Garbled or slurred speech. Truncated output. Hallucinated words or phrases that never appeared in the source manuscript. See our artifact reference guide for the full catalogue of what to check for.

5. Emotional consistency. Dialogue matches the emotional context of the scene. Narration holds genre-appropriate tone. No sudden shifts from engaged to monotone delivery. Emotional flattening is the quietest failure mode and the one that kills listener reviews on longer books.

6. Technical compliance. RMS in the ACX -23 dB to -18 dB range. Peaks below -3 dB. Noise floor at -60 dB or lower. Consistent volume across all chapter files. Constant bit rate MP3 at 192 kbps or higher.
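
Some of the checks above reduce to simple arithmetic across the batch. As an illustration of the pacing check in item 3, the sketch below computes words per minute for each chapter and flags those that stray from the book's median. It assumes you already have a word count from the manuscript and a duration from the audio file for each chapter. The 8 percent tolerance is an assumed starting point, not an ACX rule, and the function name is illustrative.

```python
from statistics import median


def pacing_outliers(chapters, tolerance=0.08):
    """Flag chapters whose words-per-minute deviates from the book median.

    `chapters` maps a chapter name to (word_count, duration_seconds).
    Returns {chapter_name: words_per_minute} for the flagged chapters only.
    """
    wpm = {
        name: words / (seconds / 60.0)
        for name, (words, seconds) in chapters.items()
        if seconds > 0
    }
    if not wpm:
        return {}
    typical = median(wpm.values())
    if typical == 0:
        return {}
    return {
        name: round(rate, 1)
        for name, rate in wpm.items()
        if abs(rate - typical) / typical > tolerance
    }
```

The median is used instead of the mean so that one badly accelerated chapter does not shift the baseline it is being compared against.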

Manual vs automated QA

Listening to every hour of audio costs 8 to 12 hours of attentive time per book. Even then you will miss gradual voice drift because your ear adjusts, and you cannot do it for two books in a row without losing perspective. Spot-checking catches sudden failures but misses everything gradual. Automated QA is the only approach that scales past the first few books. Upload all chapter files to a batch QA tool, get a per-file anomaly report, click through to the exact location of each flag, and regenerate only what failed. Most audiobook batches end up regenerating 5 to 15 percent of the chapters and shipping the rest.
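
To give a sense of what one automated check can look like, the sketch below compares a transcript of the generated audio against the source manuscript and reports words that were added or dropped. It assumes you obtain the transcript from a speech-to-text tool of your choice, and it is not a description of how any particular QA product works. Transcription errors will show up as false positives, so treat the output as a list of places to listen to, not a verdict.

```python
import difflib
import re


def words(text):
    """Lowercase word tokens with punctuation stripped."""
    return re.findall(r"[a-z0-9']+", text.lower())


def compare_to_manuscript(manuscript, transcript):
    """Return (inserted, missing) lists of word runs.

    `inserted` holds runs heard in the audio but absent from the manuscript
    (candidate hallucinations). `missing` holds runs in the manuscript that
    were never spoken (candidate truncation or skipped lines).
    """
    source, heard = words(manuscript), words(transcript)
    matcher = difflib.SequenceMatcher(a=source, b=heard, autojunk=False)
    inserted, missing = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            inserted.append(" ".join(heard[j1:j2]))
        if tag in ("delete", "replace"):
            missing.append(" ".join(source[i1:i2]))
    return inserted, missing
```

A long `missing` run at the end of a chapter is the typical signature of truncated output.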

Common failures and how to fix them

Voice changes mid-chapter: regenerate from the paragraph before the shift, not from scratch.

Mispronounced character name: use SSML phoneme tags where supported, or pronunciation dictionaries on the provider.

Pacing too fast in long passages: break the text into shorter segments and regenerate - 600-800 character chunks are the stable range on most providers.

Silence gaps too long or short: adjust paragraph break settings in your Text-to-Speech tool.

Failed ACX RMS: normalise audio in post-production with Auphonic, FFmpeg, or Audacity.

Hallucinated line: regenerate the specific chapter, because re-rolling usually produces a different sample from the distribution.
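
The chunking fix for fast pacing is easy to automate. Here is a minimal sketch that splits a chapter into segments of at most 800 characters, breaking only between sentences so that no render starts or stops mid-sentence. The sentence splitter is deliberately naive (it breaks after `.`, `!`, or `?` followed by whitespace, so abbreviations like "Dr." will split early), and the function name and default limit are assumptions to adapt to your provider.

```python
import re


def chunk_text(text, max_chars=800):
    """Split text into chunks of at most `max_chars`, breaking only between
    sentences. A single sentence longer than `max_chars` is kept whole
    rather than cut mid-sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars or not current:
            # Fits in the current chunk, or starts a new one on its own.
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Render each chunk separately and join the audio afterwards, then re-run the pacing check on the assembled chapter.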

Which Text-to-Speech platform for audiobooks

ElevenLabs v3 has the best voice stability on long runs and the strongest voice cloning - it is the default choice for commercial audiobooks. Gemini 2.5 Pro Text-to-Speech is explicitly designed for long-form narration and is cheaper on large projects, at the cost of more intermittent artifacts (see our Gemini quality guide). Azure and OpenAI are viable but less expressive. Open-source models like XTTS are workable only if you accept a higher regeneration rate and run rigorous automated QA. Whichever you pick, treat the audit as non-optional.

Run your audiobook through TTSAudit before submission

Upload every chapter file as a batch, get a per-chapter anomaly report, regenerate only what failed. 100 free credits.

Key capabilities

Chapter-Level Analysis

Submit all chapters as a batch. Each one is individually scored for anomalies.

Drift Detection

Find where voice consistency breaks down across long audiobook sessions before ACX does.

Per-Chapter Scores

Know exactly which chapters to regenerate and why, with timestamps linking to the exact problem.

ACX Pre-Check

Catch the TTS-specific failure modes that ACX's automated scanner will flag before you submit.
