Gemini 3.1 Flash TTS launched yesterday. On paper, it is exactly what the Text-to-Speech world has been waiting for: 200+ audio tags for controlling tone and delivery, support for over 70 languages, native multi-speaker dialogue, and an Elo score of 1,211 on the Artificial Analysis leaderboard - second only to Inworld TTS 1.5 Max. For short clips, it genuinely delivers. The expressiveness on quotes and emotional cues is a step change from anything Google has shipped before.
Then you ask it to read something longer than a minute.
Press play on these two clips. They are from the same three-minute generation - same voice, same prompt, same API call. The first clip is the opening sixteen seconds. The second is the final twenty seconds of that same file.
Same file, same voice - 3 minutes apart
Listen for the loss of clarity and articulation between the opening and the ending.
| File | Score | Status |
|---|---|---|
| Opening (0:00–0:16) - Clear, expressive, natural delivery | 92 | Pass |
| Ending (3:25–3:45) - Mumbled, degraded articulation | 34 | Flagged |
We have been testing Gemini 3.1 Flash TTS across dozens of long-form generations since the preview dropped, and the pattern is consistent enough to call it a systemic issue: in roughly 90% of generations over one minute, quality degrades noticeably. By two minutes, the voice starts losing articulation. By three minutes, you are listening to something that sounds like the model is talking through a pillow - mumbled consonants, swallowed word endings, and a general loss of clarity that makes the audio genuinely hard to follow.
The degradation is not subtle. It is not the kind of drift where you need a measurement tool to notice. It is the kind where you play the file for someone who knows nothing about Text-to-Speech and they say "something is wrong with the audio." The opening of every generation sounds excellent - crisp, expressive, natural - which makes the collapse even more jarring when it arrives.
A preview model, not a production model
This is a preview model, and Google has not claimed it is production-ready. That is worth saying. But it is also worth saying that preview models get adopted by production teams constantly, especially when the benchmarks look this good. A team evaluating Text-to-Speech providers might run a thirty-second demo, hear the quality, and commit their pipeline before discovering the long-form cliff. That is the actual risk here - and it is why we are writing this now rather than waiting for GA.
What it gets right
The expressiveness improvements are real and worth acknowledging. Gemini 3.1 Flash TTS supports what Google calls audio tags - freeform markers in square brackets like [whispers], [laughs], [determination], and [enthusiasm] that you embed directly in your input text to control tone and delivery. There is no fixed list - you can write any natural-language instruction inside the brackets and the model will attempt to follow it. These work remarkably well in short-form content and are a genuine step forward for controllable Text-to-Speech. Direct quotes in narration get genuinely different delivery. Character dialogue in multi-speaker mode has personality that previous Google models could not touch.
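The tag syntax is easy to sketch. The helper below is a hypothetical convenience, not part of any Google SDK: it wraps dialogue lines with a speaker label and the bracketed freeform tags the article describes. The `Speaker: text` labeling convention for multi-speaker input is an assumption here.

```python
def tagged_line(speaker: str, tags: list[str], text: str) -> str:
    """Format one line of multi-speaker input with freeform audio tags.

    Tags are arbitrary natural-language cues in square brackets,
    e.g. [whispers] or [determination]; the model attempts to follow them.
    """
    tag_str = " ".join(f"[{t}]" for t in tags)
    parts = [f"{speaker}:", tag_str, text]
    return " ".join(p for p in parts if p)  # skip empty tag string

# Build a small multi-speaker script with per-line delivery cues.
script = "\n".join([
    tagged_line("Narrator", ["calm"], "The door creaked open."),
    tagged_line("Mira", ["whispers"], "Did you hear that?"),
    tagged_line("Joss", ["laughs"], "It's just the wind."),
])
print(script)
```

The resulting string is what you would pass as the input text; voice assignment per speaker still happens in the request configuration, not in the text itself.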
The 70+ language support is also a strong point. If you are producing multilingual content, having a single model that covers that range with decent quality on short clips is valuable. Previous Google Text-to-Speech models supported many locales too, but the expressiveness was not there - 3.1 Flash actually sounds natural across languages in a way that earlier models did not.
If your use case is short clips under sixty seconds, Gemini 3.1 Flash TTS is legitimately impressive and worth evaluating.
Pricing is the same as Gemini 2.5 Pro TTS
One thing worth noting: the pricing on Gemini 3.1 Flash TTS is identical to Gemini 2.5 Pro TTS - $1.00 per million input tokens and $20.00 per million audio output tokens. That means there is no cost advantage to switching from 2.5 to 3.1. You are paying the same rate for a model that has better expressiveness on short clips but significantly worse long-form stability. The free tier in Google AI Studio makes experimentation cheap, but for production workloads the maths does not change.
Why this matters for long-form content
For audiobooks, walking tours, long podcast segments, training modules - anything where you need consistent quality across minutes, not seconds - the model is not there yet. The 90% failure rate we observed on content over one minute is not a minor caveat. It means you would need to regenerate almost every file in a long-form batch, which defeats the purpose of using a cheap, fast model. You save nothing if nine out of ten outputs need to be thrown away.
This is not a new problem for Google's Text-to-Speech line, either. The Gemini 2.5 generation had its own long-form issues - metallic artifacts, truncation, voice drift between clips. Developers on the Google AI Developer Forum have documented these problems extensively, and some remain unresolved. The jump from 2.5 to 3.1 raised the expressiveness ceiling significantly, but the long-form stability floor may have actually gotten worse.
The per-field input limits on Gemini 3.1 Flash TTS do not help either. The Cloud TTS API caps the text field at 4,000 bytes (roughly 600 to 700 words) and the prompt field at another 4,000 bytes, which forces chunking on anything longer than a few paragraphs. Chunking has historically been where Google TTS quality falls apart - voice consistency across chunks is poor, and stitching them together into smooth audio is an unsolved problem for most production teams.
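Given that cap, any longer script has to be split before submission. A minimal byte-aware chunker might look like the sketch below; the 4,000-byte limit is taken from the article, and breaking at sentence boundaries is one common heuristic, not an official recommendation.

```python
import re

def chunk_text(text: str, max_bytes: int = 4000) -> list[str]:
    """Split text into chunks whose UTF-8 encoding stays under max_bytes,
    breaking at sentence boundaries to avoid mid-sentence stitching."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip() if current else sentence
        if len(candidate.encode("utf-8")) <= max_bytes:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = sentence  # assumes no single sentence exceeds max_bytes
    if current:
        chunks.append(current)
    return chunks
```

Note the limit is in bytes, not characters, so multi-byte scripts (relevant given the 70+ language support) hit the cap sooner; measuring the UTF-8 encoding rather than string length accounts for that.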
Stick with Gemini 2.5 for now
Our recommendation: if you are already using Gemini 2.5 Pro TTS in production, stay on it. The 2.5 model has its own issues, but it is a known quantity at this point - the failure modes are documented, workarounds exist, and the long-form stability, while imperfect, is significantly better than what 3.1 Flash is currently producing. You are paying the same price either way, so there is no reason to trade proven reliability for a preview model that degrades on anything over a minute.
If you are not yet on any Gemini model and are evaluating options, Gemini 2.5 Pro TTS is still the better starting point for production work. Use 3.1 Flash for short-form experimentation if you want to try the new emotion tags - they really are good - but do not build a production pipeline around it yet.
We will keep testing as Google iterates on this preview. The foundation is there - the short-form quality proves the model can produce excellent speech. The question is whether Google can maintain that quality across longer generations, or whether the architecture has a fundamental attention-degradation problem that requires a deeper fix.
For now, if you are evaluating Gemini 3.1 Flash TTS for production, run your test on content that matches your actual use case length. A thirty-second demo will mislead you. A three-minute test will tell you what you actually need to know.
Key capabilities
Long-Form Quality Monitoring
Detect quality degradation across the duration of a file. Catch the point where articulation drops off before your listeners do.
Per-File Quality Scores
Every file in your batch gets an individual quality score. Files where the model mumbled or lost clarity are flagged automatically.
Regenerate Only What Failed
When 90% of a batch needs regenerating, you need to know which 10% passed. Get a per-file verdict so you only redo the bad ones.
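The triage step itself is simple once per-file scores exist. A sketch, with an illustrative pass threshold and hypothetical filenames; the scores mirror the pass/flag pattern in the table above:

```python
def files_to_regenerate(scores: dict[str, int], threshold: int = 70) -> list[str]:
    """Return only the files whose quality score fell below the pass threshold."""
    return sorted(f for f, score in scores.items() if score < threshold)

batch = {"chapter_01.wav": 92, "chapter_02.wav": 34, "chapter_03.wav": 88}
print(files_to_regenerate(batch))  # only the degraded file is queued
```

The point is that regeneration cost scales with the number of flagged files, not the batch size, so a per-file verdict is what makes a 90% failure rate survivable in practice.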
Cross-Provider Comparison
Test the same script on Gemini, ElevenLabs, and OpenAI. Compare quality scores side by side to pick the right model for your content length.