How Parkbench Ensures Voice Quality, Part 3: Finding the Operating Envelope of Long-Form AI Speech

The script the model hears is not the script the user reads.

Visualization of the operating envelope of long-form AI speech synthesis

Where We Left Off

Part 1 was about catching what was missing: the truncated sentence, the dropped tail, the autoregressive model that quietly stops generating when its context window fills up.

Part 2 was about catching how the audio sounded wrong: pops, hiss, elongation, voice drift, and the false-positive tax you pay when your detectors become too eager.

This one is about a different problem:

the model is fine, the orchestration is fine, the script is fine, and the audio still glitches.

Over the last few months, we learned that long-form NeuTTS is not just a model with a quality gate around it. It has an operating envelope: a set of conditions where it reliably produces clean audio, and another where it fails in surprisingly specific ways.

Most of the work has been figuring out where those boundaries are, then reshaping the pipeline to stay inside them.

The Promise, and the Reality

When we first started experimenting with NeuTTS for Parkbench’s wellness and sleep content, the promise was real: a modern neural TTS system that runs on CPU, supports long-form expressive narration, and is far cheaper to scale than GPU-heavy voice systems.

The model can absolutely generate beautiful audio. The harder question was whether it could do so reliably across thousands of long-form generations a day.

There is a large difference between:

a short demo clip
a controlled benchmark
a single successful generation

…and a production system shipping thousands of minutes of wellness audio where users expect calmness, consistency, and stability every single time. That gap is where most of the engineering lives.

The Strange Failure Modes of Long-Form Speech

At first, the failures seemed almost random. Some generations sounded incredible. Others contained catastrophic noise spikes, clipped words, elongations, truncated endings, or bizarre repeated phrases. Meditation and sleep content seemed especially unstable.

What made this difficult to reason about was that the failures were probabilistic. A configuration could work perfectly nine times, then fail badly on the tenth. Even within a single run, nineteen chunks would be flawless and chunk twenty would suddenly render the word “fresh” as a 2.2-second glitched vowel before drifting into held-tone noise.

The deeper we investigated, the clearer it became that we were not dealing with one issue. Multiple systems were interacting:

chunking strategy
punctuation normalization
sentence boundaries
synthesis call count
fallback stitching
resource contention
pacing logic
post-processing
QC thresholds

All of them subtly influenced voice quality, often through inputs that looked identical to a human reader.

When “More Expressive” Became Less Stable

One of the more surprising discoveries was that our meditation-specific synthesis path was substantially less stable than the normal narration path.

Originally, the meditation pipeline tried to create calmer pacing directly inside synthesis itself: sentence-by-sentence vocalization, longer pauses, meditation-specific chunking, expressive pacing, ellipsis-heavy scripts.

In theory, this should have sounded more natural.

In practice, it dramatically increased the number of synthesis boundaries and therefore the number of opportunities for something to fail. A 15-minute meditation might contain dozens of tiny synthesis calls. Even if each call only had a small failure probability, the overall odds of at least one audible glitch became uncomfortably high.
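The compounding is easy to see with a little arithmetic. The sketch below assumes each synthesis call fails independently with the same probability, which is a simplification, and the 1% rate is a made-up number for illustration rather than a measured one.

```python
def p_any_glitch(per_call_failure: float, calls: int) -> float:
    """Probability that at least one of `calls` independent calls glitches."""
    return 1.0 - (1.0 - per_call_failure) ** calls


# Hypothetical 1% per-call failure rate, at three different call counts.
for calls in (5, 20, 60):
    print(f"{calls:>2} calls -> {p_any_glitch(0.01, calls):.1%} chance of a glitch")
```

Under those assumptions, five calls leave roughly a 5% chance of an audible glitch, twenty calls roughly 18%, and sixty calls roughly 45%. Fewer synthesis boundaries helps even when the per-call rate stays the same.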

The model does not need to understand meditation pacing. It only needs to generate stable speech.

So we changed the architecture. Wellness long-form now defaults to the narration TTS path, with the legacy meditation pipeline behind a rollback flag. The same calm pacing is still there, just produced somewhere else in the stack.

The catastrophic noise spikes almost disappeared. What remained were narrower, far more recoverable failures: occasional truncation, fallback seams, pacing imperfections, pronunciation edge cases. A completely different class of problem.

Separating Synthesis From Arrangement

Once wellness content moved onto the narration path, a new question appeared:

How do you get meditation pacing back without reintroducing the instability you just escaped?

The answer was to move pacing outside the model.

After NeuTTS produces the assembled voice stem, a post-synthesis stage scans for natural sentence-boundary silences using a VAD-style RMS detector and expands them to meditation pacing targets. Sentence pauses widen to roughly 3 seconds. Comma pauses widen to around 1.5 seconds, with slight jitter so the rhythm does not become mechanical.

The important detail is that the original silence is widened, not replaced. The listener simply hears a longer version of a pause that already existed.

A few implementation details turned out to matter:

Asymmetric edge guards: the leading 150 ms and trailing 1 second of every chunk are excluded from expansion so we do not stack pauses on top of the assembler’s existing inter-chunk silence.
Pre-long-silence guard: if the next region is already silent, we skip expansion. Layering pauses sounds unnatural even when each individual pause is technically correct.
Feature-flagged: a single environment flag controls the entire pass, and every expansion is persisted to metadata for auditability.
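A rough sketch of what such a pass could look like, for one chunk of mono audio held as a list of floats. Only the 150 ms and 1 second guards, the 3 second target, and the idea of jitter come from the description above. The function names, the RMS floor, the frame size, and the minimum silence length are assumptions. The "already long" check is a simplified stand-in for the pre-long-silence guard.

```python
import random


def find_silences(samples, sr, frame_ms=20, rms_floor=0.01, min_ms=120):
    """Return (start, end) sample ranges where frame RMS stays below rms_floor."""
    frame = max(1, sr * frame_ms // 1000)
    min_len = sr * min_ms // 1000
    runs, run_start = [], None
    for i in range(0, len(samples), frame):
        block = samples[i:i + frame]
        rms = (sum(x * x for x in block) / len(block)) ** 0.5
        if rms < rms_floor:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start >= min_len:
                runs.append((run_start, i))
            run_start = None
    if run_start is not None and len(samples) - run_start >= min_len:
        runs.append((run_start, len(samples)))
    return runs


def expand_pauses(samples, sr, target_s=3.0, jitter_s=0.2,
                  lead_guard_s=0.15, tail_guard_s=1.0, rng=None):
    """Widen existing in-chunk silences toward target_s; return (audio, log)."""
    rng = rng or random.Random()
    first_ok = int(lead_guard_s * sr)
    last_ok = len(samples) - int(tail_guard_s * sr)
    out, cursor, log = [], 0, []
    for start, end in find_silences(samples, sr):
        if start < first_ok or end > last_ok:
            continue  # asymmetric edge guards: leave chunk edges alone
        wanted = int((target_s + rng.uniform(-jitter_s, jitter_s)) * sr)
        extra = wanted - (end - start)
        if extra <= 0:
            continue  # pause is already long; do not layer more silence on it
        out.extend(samples[cursor:end])
        out.extend([0.0] * extra)  # widen the original silence, keep its edges
        cursor = end
        log.append({"at_s": start / sr, "added_s": extra / sr})
    out.extend(samples[cursor:])
    return out, log
```

The returned log is the kind of record that could be persisted to metadata. Comma pauses would use the same function with a smaller target such as 1.5 seconds, though how the real pipeline tells sentence silences from comma silences is not described here.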

In practice, this worked far better than trying to force meditation pacing directly through synthesis. NeuTTS handled the speech generation itself, while the surrounding system shaped the pacing and atmosphere afterward.

The Punctuation That Wasn’t Punctuation

Once the synthesis path stabilized, another class of failures came into focus:

the script the model receives is not the script the user reads.

NeuTTS, like most autoregressive TTS systems, is unusually sensitive to certain tokens. A few common punctuation patterns turned out to be acoustic landmines.

The Triple-Dot Landmine

Our meditation prose was full of ... ellipses, intended to suggest gentle pacing.

NeuTTS interpreted them as a strong prosodic cue and frequently elongated the preceding word into a sustained vowel.

“settle in...” became “settle innnnn”

“deep breath...” became “breath…hhhh”

In one real production failure, the word “fresh” turned into a 2.2-second held vowel before the model recovered. The fix was unglamorous: every run of three or more dots gets rewritten to a single period immediately before the text reaches NeuTTS. The pacing still exists upstream in the arrangement layer. The model simply never sees the dangerous token sequence.
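The rewrite itself can be as small as one regular expression. This is a minimal sketch, assuming the sanitizer is a plain string pass. The function name is hypothetical, and also catching the single-character ellipsis (…) is an assumption beyond the three-dot rule described above.

```python
import re

# Three or more dots, or the single-character ellipsis, plus any trailing
# dots. Whitespace before the run is absorbed so "wait ... listen" does not
# become "wait . listen".
_ELLIPSIS = re.compile(r"\s*(?:\.{3,}|\u2026)[.\u2026]*")


def collapse_ellipses(text: str) -> str:
    """Rewrite every ellipsis-like run to a single period."""
    return _ELLIPSIS.sub(".", text)


print(collapse_ellipses("Settle in... and take a deep breath....."))
```

Ordinary sentence-ending periods and two-dot typos are left alone, since only runs of three or more dots match.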

Dashes That Don’t Pause

LLMs love dash-delimited asides: “wait - listen”

NeuTTS treats dashes as continuation rather than punctuation, so the following word arrives unnaturally fast. The fix became content-aware:

journals and conversational content rewrite standalone dashes to an ellipsis (...)
meditation and sleep content rewrite them to commas

Meditation and sleep content gets commas rather than ellipses because an ellipsis there would inherit the long pause expansion and create dead air. Compound words like “self-care” remain untouched because only whitespace-surrounded hyphens are normalized.
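A content-aware rewrite along these lines might look like the following. The content-type labels, the function name, and the decision to treat en and em dashes the same as a spaced hyphen are all assumptions for illustration.

```python
import re

# A "standalone" dash: one or more hyphens, en dashes, or em dashes with
# whitespace on both sides. Hyphens inside compounds never match.
_STANDALONE_DASH = re.compile(r"\s+[-\u2013\u2014]+\s+")

# Hypothetical content-type labels mapped to the replacement text.
_REPLACEMENT = {
    "journal": "... ",
    "conversational": "... ",
    "meditation": ", ",
    "sleep": ", ",
}


def normalize_dashes(text: str, content_type: str) -> str:
    """Rewrite whitespace-surrounded dashes according to the content type."""
    return _STANDALONE_DASH.sub(_REPLACEMENT[content_type], text)


print(normalize_dashes("A little self-care - then rest", "sleep"))
```

Requiring whitespace on both sides is what keeps compounds such as “self-care” intact without needing a dictionary of exceptions.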

The Vowel-Elongation Interjection

A journal comment beginning with “Ahhh,” came back as a three-second sustained vowel. The sanitizer now strips elongated interjections like:

Ahhh
Ohhh
Aww
Wooow

Short, non-elongated forms like “Ah” and “Oh” are preserved.
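One way to express that rule is a pattern that only matches when a letter is actually stretched. The sketch below is an assumption about how such a sanitizer could be written. It covers just the four interjection families listed above, and the exact repetition thresholds are guesses.

```python
import re

# Elongated forms only: "Ahh"/"Ahhh", "Ohh"/"Ohhh", "Aww", "Wooow".
# Plain "Ah", "Oh", and "Wow" do not match. Trailing punctuation and
# whitespace are removed along with the interjection.
_ELONGATED = re.compile(
    r"\b(?:a+h{2,}|o+h{2,}|a+w{2,}|wo{2,}w+)\b[,.!]*\s*",
    re.IGNORECASE,
)


def strip_elongated_interjections(text: str) -> str:
    """Remove stretched interjections before the text reaches the TTS model."""
    return _ELONGATED.sub("", text)


print(strip_elongated_interjections("Ahhh, that feels better."))
```

A production version would probably also re-capitalize the sentence that follows a removed interjection. That step is omitted here to keep the example short.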

The Breath-Cue Glitch

Breath-cue failures took the longest to diagnose. NeuTTS reliably garbles the word “in” when it immediately follows “breathe” or “breath”.

“breathe in” became a robotic “innnnn”. The mitigation ended up with three layers:

The generation prompt explicitly tells the LLM to avoid phrases like “breathe in” and prefer alternatives such as “breathe deeply” or “inhale gently”.
A pre-TTS rewrite layer substitutes risky phrases before synthesis.
If a chunk contains multiple breath cues, later occurrences are rewritten to different variants because repeated cues themselves sometimes destabilize synthesis.
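The second and third layers can be sketched together as a substitution that rotates through variants within a chunk. The first two variants come from the prompt guidance above. The third variant, the function name, and the choice to handle only the exact phrase “breathe in” are assumptions.

```python
import re
from itertools import cycle

_RISKY = re.compile(r"\bbreathe\s+in\b", re.IGNORECASE)

# First two variants are from the prompt guidance; the third is hypothetical.
_VARIANTS = ("breathe deeply", "inhale gently", "draw a slow breath")


def rewrite_breath_cues(chunk: str) -> str:
    """Replace each risky cue, using a different variant for each occurrence."""
    variants = cycle(_VARIANTS)

    def swap(match: re.Match) -> str:
        variant = next(variants)
        # Keep sentence-initial capitalization intact.
        return variant.capitalize() if match.group(0)[0].isupper() else variant

    return _RISKY.sub(swap, chunk)


print(rewrite_breath_cues("Breathe in. Now breathe in again."))
```

The rotation restarts for every chunk, so repeated cues inside one chunk never resolve to the same phrase until the variant list is exhausted.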

Each change is small, but together they eliminated a measurable percentage of production failures. Not model failures, token failures.

Word-Level Surgery: Finding the Word That Failed

Once Whisper was already running for truncation verification and word-boundary detection, a more powerful capability emerged almost accidentally: we could point at a specific moment in a chunk and ask which word was being spoken.

That changed how we approached glitch recovery. Previously, a persistent amplitude spike triggered repeated retries and eventually a white-noise patch over the damaged region. It worked, but it was blunt. The newer system performs Script Repair:

QC retries fail on audio spike
Whisper generates per-word timestamps
The overlapping word is identified
A local Llama model replaces that single word with a natural synonym
One additional TTS pass runs on the rewritten chunk
QC runs again

If the repaired chunk passes, the rewritten version ships and the substitution is persisted to generation metadata. If anything fails, the system falls back to the existing white-noise patch path.
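The core lookup is an interval-overlap search over the per-word timestamps. In the sketch below, the Word type, the function names, and the injected callables (synonym_for, synthesize, passes_qc) are hypothetical stand-ins for Whisper output, the local Llama call, the TTS pass, and the QC gate. None of them are real APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    start: float  # seconds from chunk start
    end: float    # seconds from chunk start


def word_index_at(words, spike_start, spike_end):
    """Index of the word that overlaps the spike the most, or None."""
    best, best_overlap = None, 0.0
    for i, word in enumerate(words):
        overlap = min(word.end, spike_end) - max(word.start, spike_start)
        if overlap > best_overlap:
            best, best_overlap = i, overlap
    return best


def repair_once(words, spike_start, spike_end, synonym_for, synthesize, passes_qc):
    """One non-recursive repair attempt; None means use the patch path."""
    idx = word_index_at(words, spike_start, spike_end)
    if idx is None:
        return None
    rewritten = [w.text for w in words]
    rewritten[idx] = synonym_for(words[idx].text)
    text = " ".join(rewritten)
    audio = synthesize(text)  # exactly one additional TTS pass
    return (text, audio) if passes_qc(audio) else None
```

Returning None rather than retrying keeps the attempt bounded: the caller either ships the repaired chunk or falls back to the existing patch path.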

A few design choices mattered:

no recursive repair attempts
small local models only
amplitude-spike failures only
repair happens after QC confirms the failure

The conceptual shift was important: when a glitch is intermittent and tied to a specific word, the cheapest repair is often not regenerating the audio at all. It is simply changing the word.

Rewriting the Script in Flight

Script Repair handles single problematic words, but some failures operate at the chunk level instead: persistent truncation, repeated high-frequency (HF) noise bursts, or chunks that continue failing through retries, sentence fallback, clause fallback, and missing-tail append.

Some token sequences simply do not synthesize reliably on a given voice. For those cases, we built Chunk Paraphrase Repair. The flow:

Primary retries fail
Truncation Split Repair fails
A local Llama model paraphrases the entire chunk while preserving tone and meaning
A secondary QC loop runs against the rewritten text
If successful, the rewritten chunk becomes the new structural baseline

Everything is persisted to metadata for auditability.
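Structurally, this is an ordered ladder of repair stages where the first success wins and every attempt is recorded. The sketch below shows only that control flow. The stage names mirror the list above, while the function signature and the shape of the audit record are assumptions.

```python
from typing import Callable, List, Optional, Tuple

# A stage takes the chunk text and returns (text, audio) on success or None.
Attempt = Optional[Tuple[str, bytes]]
Stage = Tuple[str, Callable[[str], Attempt]]


def run_repair_ladder(text: str, stages: List[Stage]):
    """Try each stage in order; return (text, audio, audit_log)."""
    audit = []
    for name, attempt in stages:
        result = attempt(text)
        audit.append({"stage": name, "succeeded": result is not None})
        if result is not None:
            new_text, audio = result
            return new_text, audio, audit  # new_text becomes the baseline
    return None, None, audit


# Toy stages: the first two fail, the paraphrase stage succeeds.
stages = [
    ("primary_retries", lambda t: None),
    ("truncation_split_repair", lambda t: None),
    ("chunk_paraphrase_repair", lambda t: (t.replace("edge", "shore"), b"audio")),
]
print(run_repair_ladder("you arrive at the lake's edge", stages))
```

Keeping the audit list separate from the result makes it straightforward to persist every attempt, including the failed ones.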

We also built a narrower version specifically for breath-cue elongation. If retries repeatedly fail near a breath cue, only the surrounding phrasing is rewritten.

Over time, we stopped treating the script as fixed input and started treating it as another parameter the system could reshape in order to stay inside the model’s operating envelope.

The Repeated-Tail Stitching Artifact

Part 1 introduced the missing-tail append system: when Whisper detects that the end of a sentence is missing, only the missing words are regenerated and stitched onto the original audio. Fast, surgical, and far better than regenerating the entire sentence.

Then another artifact appeared. Audio that initially sounded fine would reveal itself on repeated listens:

“…you arrive at the lake’s edge … lake’s edge.”

The append had succeeded, but the original truncated audio still contained a partial version of the same phrase.

The solution became a repeated-tail detection pass. After the append completes, Whisper runs again on the assembled audio. If a repeated phrase of one to four words appears near the end, the first garbled occurrence is trimmed and a short cosine crossfade is applied across the splice.
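Both halves of that pass are small. The sketch below makes two simplifying assumptions: the repeat is detected only when the transcribed words match exactly after normalization and sit back to back at the very end, and audio is a plain list of float samples. The function names are hypothetical.

```python
import math


def find_repeated_tail(words, max_len=4):
    """Return (start_index, length) of the first copy of a doubled tail phrase."""
    norm = [w.lower().strip(".,!?\u2026\"'") for w in words]
    for n in range(min(max_len, len(norm) // 2), 0, -1):
        if norm[-n:] == norm[-2 * n:-n]:
            return len(norm) - 2 * n, n
    return None


def cosine_crossfade(a, b, fade):
    """Join a and b, blending the last `fade` samples of a into the start of b."""
    fade = min(fade, len(a), len(b))
    if fade == 0:
        return list(a) + list(b)
    mixed = []
    for i in range(fade):
        gain = 0.5 - 0.5 * math.cos(math.pi * (i + 0.5) / fade)  # rises 0 to 1
        mixed.append(a[len(a) - fade + i] * (1.0 - gain) + b[i] * gain)
    return list(a[:len(a) - fade]) + mixed + list(b[fade:])


tail = ["you", "arrive", "at", "the", "lake's", "edge", "lake's", "edge."]
print(find_repeated_tail(tail))
```

For the example transcript this reports that the two-word phrase starting at index 4 is the copy to trim. The word timestamps at that index would then give the sample range to cut before crossfading.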

Every fallback layer creates its own seam, and eventually every seam needs its own integrity check.

Discovering the Model’s Operating Envelope

The most useful discovery came from observing the system at scale. We found strong correlations between chunk structure and synthesis quality:

ellipses destabilized synthesis
many tiny chunks increased glitch probability
very large chunks increased truncation probability
medium-sized paragraph blocks were dramatically more stable than either extreme

So we started shaping prompts around the model’s behavior. Instead of fragmented meditation prose full of ellipses, we guided the LLM toward paragraph-style spoken blocks with cleaner punctuation and more predictable pacing. The generation pipeline now targets:

~25 to 35 word spoken blocks
a hard 35-word chunk cap
minimal ellipsis usage
paragraph-aligned pacing
pause expansion handled after synthesis

Two implementation details mattered more than expected.

Break-Aware Chunking

The chunker preferentially splits at: (1) sentence boundaries, (2) strong punctuation, (3) commas. Only if none exist does it perform a hard word-count cut. A 36-word sentence with a comma near its midpoint becomes two natural chunks of about 18 words each instead of a 35-word chunk plus a one-word orphan.
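A compact way to express that preference order is a tiered search over break points. In this sketch, which punctuation counts as "strong" (semicolons and colons) and choosing the candidate closest to the midpoint are assumptions. Only the three tiers, the hard-cut fallback, and the 35-word cap come from the text.

```python
import re

CAP = 35  # hard chunk cap from the pipeline targets

# Split preference, strongest first: sentence end, strong punctuation, comma.
_TIERS = (r"[.!?]$", r"[;:]$", r",$")


def split_block(words, cap=CAP):
    """Split a list of words into chunks of at most `cap` words."""
    if len(words) <= cap:
        return [words]
    middle = len(words) / 2
    for pattern in _TIERS:
        # A cut after word i keeps the punctuation with the left chunk.
        cuts = [i + 1 for i, w in enumerate(words[:-1]) if re.search(pattern, w)]
        if cuts:
            cut = min(cuts, key=lambda c: abs(c - middle))
            return split_block(words[:cut], cap) + split_block(words[cut:], cap)
    # No usable break point: fall back to a hard word-count cut.
    return [words[:cap]] + split_block(words[cap:], cap)


sentence = ["w"] * 17 + ["w,"] + ["w"] * 17 + ["w."]
print([len(chunk) for chunk in split_block(sentence)])
```

The function recurses on both halves, so a very long sentence keeps splitting at the best available punctuation until every piece fits under the cap.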

Backward-Merge Guard

Without a guard, the chunker would aggressively merge tiny trailing fragments back into previous chunks and quietly violate the cap. The guard prevents merges from ever crossing the threshold.
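The guard amounts to one extra condition on the merge. This is a minimal sketch operating on chunks as lists of words. The minimum-fragment size of four words and the function name are hypothetical, while the cap is the 35-word limit mentioned earlier.

```python
def merge_small_tails(chunks, cap=35, min_words=4):
    """Fold tiny fragments into the previous chunk, but never past the cap."""
    merged = []
    for chunk in chunks:
        fits = bool(merged) and len(merged[-1]) + len(chunk) <= cap
        if len(chunk) < min_words and fits:
            merged[-1] = merged[-1] + chunk  # safe: result stays within cap
        else:
            merged.append(chunk)  # guard: keep the fragment separate
    return merged


print([len(c) for c in merge_small_tails([["w"] * 30, ["w"] * 3])])
print([len(c) for c in merge_small_tails([["w"] * 34, ["w"] * 3])])
```

In the first call the three-word fragment merges into a 33-word chunk. In the second it stays separate, because merging would produce 37 words and cross the cap.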

It sounds like a small text-processing detail, but it materially improved generation reliability.

More broadly, we stopped treating the model like a black box and started treating it like a probabilistic system with measurable constraints.

Catching the Class You Just Found

What makes this sustainable instead of an endless game of whack-a-mole is that every fix ships with its own QC detector calibrated to catch the same class of failure if it reappears. A few examples:

the “fresh” elongation glitch led to lowering the spectral-stability detector threshold from 3.0 seconds to 2.0 seconds
breath-cue glitches produced contextual HF gates plus Whisper-driven elongation detection
repeated-tail artifacts produced their own detector based on Whisper word boundaries
Script Repair introduced its own forensic metadata fields for operator auditing

Each detector is small and specific. Together they form a QC stack that is far harder to bypass than any single large heuristic system.

Quality Emerges From the System Around the Model

Over time, Parkbench accumulated:

probabilistic QC gates
truncation detection
word-level Whisper alignment
elongation and droning detectors
HF noise burst analysis
voice drift detection
contextual QC gates
post-synthesis pause expansion
arrangement-layer fades and crossfades
script sanitization
in-flight script rewrites
repeated-tail trimming
progressive fallback systems
per-chunk forensic tooling
regeneration workflows
severity classification and alerting
daily quality digests

Most of these systems exist because of the model’s constraints. The limitations became part of the architecture.

Ironically, that probably produced a more robust system than simply replacing the model with a larger one would have. A bigger model might shift the failure distribution, but it would not eliminate the need for orchestration, recovery paths, or forensic tooling.

Over time, more and more of the actual product value ended up living in the orchestration layer around the model.

The Difference Between a Demo and a Product

There is a recurring pattern in AI where a model appears impressive in isolation, but the real engineering only emerges later:

concurrency, retries, orchestration, cold starts, edge cases, long-form stability, user expectations.

Voice systems are no different. What we learned building Parkbench is that high-quality long-form AI speech is less about maximizing expressiveness and more about balancing:

stability
pacing
chunk topology
orchestration
quality control
rewritable scripts
the constraints of the underlying model

Sometimes the most reliable solution is not the most complicated one. Often, it is the solution that respects the limits of the system, shapes the inputs around those limits, and reserves the heavy machinery for the rare cases where the simple path fails.

The next problems are already visible: prosody consistency across very long generations, smoother cross-voice handoffs, pronunciation validation for proper nouns, calmness drift detection over time.

The underlying pattern probably will not change. The model is only one part of the system. Most of the work still happens in the layers around it.

Parkbench, 2026: Your thoughts are not training data. They are part of a relationship.