Michael Stewart

Parkbench.ai: How Parkbench Ensures Voice Quality

The engineering behind making AI voices sound right, every time.

The Challenge

Parkbench generates audio clips every day: meditation guides, sleep aids, conversations, and long-form narrations, all using AI voice synthesis.

Unlike text, where a missing word is immediately visible, audio failures are subtle. A listener might hear a sentence that trails off mid-thought, a word that sounds slightly swallowed, or a pause that feels unnatural.

The core technology, NeuTTS Air, is an autoregressive transformer that generates audio token by token. Like all autoregressive models, it has a finite context window (~2048 tokens, roughly 30 seconds of audio). Push it too far and it doesn’t crash; it simply stops generating.

The audio file looks normal. It plays normally. But the last sentence is just… gone.

This is the worst kind of bug: intermittent, silent, and invisible to automated monitoring.


The Problem We Discovered

A user reported that a generated voice clip had its last word “slightly cut off.”

  • Input: 36 words (201 characters)
  • Output: 13 seconds of audio (seemingly reasonable)

But on inspection, the last 8 words were completely missing.

The model generated natural-sounding speech for the first 28 words, with proper pacing, intonation, and pauses, then quietly stopped.

Our duration-based detector (which flags audio that’s “too short” for its word count) saw:

13 seconds for 36 words → “Looks fine”

It wasn’t fine.

22% of the content was missing.


Why Simple Checks Don’t Work

Our first instinct was a duration heuristic:

If audio is shorter than expected → it’s probably truncated

This works for obvious failures, but it fundamentally cannot distinguish between:

  • A fast speaker who said everything (valid)
  • A normal speaker who stopped early (broken)

Speech rate varies:

  • Meditation → slow
  • Conversations → faster
  • Different voices → naturally different pacing

A single threshold either:

  • Misses real truncation, or
  • Flags valid audio (wasting compute)

We tried tightening it. It caught more issues, but also increased false positives.
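
To make the trade-off concrete, here is a minimal sketch of a duration heuristic applied to the reported clip. The function name and both thresholds are hypothetical, chosen only for illustration; they are not the values Parkbench uses.

```python
def looks_truncated(word_count: int, duration_s: float,
                    max_words_per_second: float = 3.5) -> bool:
    """Flag audio whose implied speech rate is implausibly fast.

    max_words_per_second is a hypothetical threshold for illustration.
    """
    if duration_s <= 0:
        return True
    return word_count / duration_s > max_words_per_second


# The reported clip: 36 input words, 13 seconds of audio (about 2.8 words/s).
loose = looks_truncated(36, 13.0)                            # not flagged
tight = looks_truncated(36, 13.0, max_words_per_second=2.5)  # flagged
print(loose, tight)
```

The loose threshold lets the truncated clip through. The tight one catches it, but it would equally flag a fast voice that really did say all 36 words in 13 seconds, because duration alone carries no information about which words were spoken.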


The Insight: Ask the Audio What It Said

Instead of guessing based on duration…

Why not just listen to the audio?

We already run Whisper (OpenAI’s open-source speech recognition model) internally.

  • ~75MB model
  • Runs on CPU
  • Transcribes ~15s audio in 1–3 seconds

So we flipped the problem:

  1. Generate audio
  2. Transcribe it back to text
  3. Compare with the original
  4. Regenerate if needed

This isn’t a heuristic. It’s content verification.


How It Works in Practice

Two-Tier Detection

We kept the duration check (it’s free and instant), and added Whisper as a second layer.

Generate audio
→ Duration check (instant)
→ Too short? → Regenerate sentence-by-sentence
→ Looks OK? → Whisper transcription (~1–3s)
→ Words missing? → Regenerate sentence-by-sentence
→ All words present? → Accept audio ✓
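
The flow above could be sketched roughly as follows. This is an illustration under assumptions, not the production code: `check_audio` and its three callable parameters are hypothetical names, and the transcription step is passed in as a function so the sketch stays independent of any particular Whisper wrapper.

```python
from typing import Callable


def check_audio(text: str, audio: bytes, duration_s: float,
                duration_ok: Callable[[str, float], bool],
                transcribe: Callable[[bytes], str],
                words_present: Callable[[str, str], bool]) -> str:
    """Return 'accept' or 'regenerate', running the cheap check first."""
    # Tier 1: duration check. Instant, so it runs on every clip.
    if not duration_ok(text, duration_s):
        return "regenerate"
    # Tier 2: transcription. Only reached when tier 1 passes.
    transcript = transcribe(audio)
    if not words_present(text, transcript):
        return "regenerate"
    return "accept"
```

Ordering the tiers this way means an obviously short clip never pays the transcription cost.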

Fuzzy Text Matching

Whisper isn’t perfect, so we don’t require exact matches.

We:

  • Normalize text (lowercase, remove punctuation)
  • Compare words in order using fuzzy matching
  • Require ≥85% word coverage
  • Explicitly verify the last 3 words

This avoids false positives while reliably catching truncation.
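
As a rough sketch of the matching rules listed above, the following uses Python's standard `difflib` for the fuzzy comparison. The 85% coverage and 3-word tail come from the list; the per-word similarity threshold (0.8), the lookahead window, and all function names are assumptions made for this example.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation (keeping apostrophes), split into words."""
    text = text.lower().replace("’", "'")
    return re.sub(r"[^\w\s']", " ", text).split()


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Fuzzy word match; the 0.8 threshold is an assumed value."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def coverage(expected: list[str], heard: list[str], lookahead: int = 3) -> float:
    """Fraction of expected words found, in order, in the transcript."""
    if not expected:
        return 1.0
    pos = found = 0
    for word in expected:
        # Only look a few words ahead so one common word cannot
        # jump the cursor far down the transcript.
        for i in range(pos, min(pos + lookahead, len(heard))):
            if similar(word, heard[i]):
                found += 1
                pos = i + 1
                break
    return found / len(expected)


def is_complete(original: str, transcript: str,
                min_coverage: float = 0.85, tail_words: int = 3) -> bool:
    """Require overall coverage AND the last few words to be present."""
    expected, heard = normalize(original), normalize(transcript)
    tail_ok = coverage(expected[-tail_words:], heard[-(tail_words + 2):]) == 1.0
    return coverage(expected, heard) >= min_coverage and tail_ok


original = "Breathe in slowly, hold for a moment, and let the tension leave your shoulders."
print(is_complete(original, "Breathe in slowly. Hold for a moment and let the tension leave your shoulder."))
print(is_complete(original, "Breathe in slowly, hold for a moment, and let the tension"))
```

The first transcript passes despite a small transcription slip ("shoulder"); the second fails on both coverage and the tail check. The separate tail check matters because a long clip can lose its last two words and still clear 85% coverage.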

Progressive Fallback

When truncation is detected, we don’t retry blindly; we reduce the input size:

  1. Full chunk (initial attempt)
  2. Sentence-level generation
  3. Clause-level splitting (commas, semicolons)

Shorter inputs = safer generation.
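
A minimal sketch of the three levels, assuming simple regex-based splitting on sentence-ending punctuation and on commas/semicolons. The function names and the exact splitting rules are illustrative assumptions; a production splitter would need to handle abbreviations, decimals, and similar cases.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_clauses(sentence: str) -> list[str]:
    """Split after ',' or ';' followed by whitespace."""
    return [c for c in re.split(r"(?<=[,;])\s+", sentence.strip()) if c]


def fallback_levels(text: str) -> list[list[str]]:
    """Progressively smaller generation units: chunk, sentences, clauses."""
    sentences = split_sentences(text)
    clauses = [c for s in sentences for c in split_clauses(s)]
    return [[text], sentences, clauses]


levels = fallback_levels("Close your eyes. Breathe in slowly, hold it; then let go.")
for units in levels:
    print(units)
```

Each level would only be tried if verification of the previous one failed, so most requests never go past the first.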


Coverage


The Privacy Angle

Everything runs locally.

  • Whisper runs on the same CPU pod
  • No audio leaves our infrastructure
  • No external APIs

This was non-negotiable.


Performance Impact

  • Whisper adds 1–3 seconds per chunk
  • Typical generation: 2+ minutes → negligible overhead

Fallback regeneration is slower, but only triggered when there’s a real issue.

Correct output on first delivery is worth the cost.


What We Learned

  1. Heuristics aren’t enough: They catch obvious issues, but not edge cases.
  2. Reuse existing tools: Whisper was already deployed. This was low effort, high impact.
  3. Intermittent bugs need deterministic solutions: Verification eliminates entire bug classes.
  4. Layer defenses: Duration → Whisper → fallback
  5. Make it toggleable: Feature flags allow instant rollback.

This pattern (generate → verify → regenerate) extends naturally to:

  • Pronunciation validation
  • Emotional tone matching
  • Multi-speaker consistency
  • Music/voice balance

The idea is simple:

Use AI to check AI before humans ever notice.


The Cold-Start Problem

Users reported clipped or garbled first words.

Cause: autoregressive models start with a “cold” hidden state.

Fix: warm-up generation

  • Generate a throwaway ~17-word phrase
  • Discard it
  • Then generate real content

Result: clean starts, no artifacts.


When Fallbacks Fail

A 39-word sentence kept truncating, even after fallback.

Root issue:

  • Clause splitting worked
  • But recombination logic merged it back
  • System returned “unsplittable” → kept bad audio

The Fix

1. Escalation logic

  • If 40-word split fails → force 20-word split

2. Missing-tail append

If truncation persists:

  • Extract missing tail words
  • Generate them separately
  • Append with ~50ms silence
  • Re-verify with Whisper
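
The tail-append steps above might look something like this. It is a sketch under assumptions: audio is represented as a plain list of float samples, the 24 kHz sample rate is a placeholder, word lists are assumed to be already normalized, and the function names are hypothetical. The ~50ms gap is the one figure taken from the text.

```python
def missing_tail(expected: list[str], heard: list[str], lookahead: int = 3) -> list[str]:
    """Return the expected words after the last one found in the transcript."""
    pos = 0
    last_matched = -1
    for idx, word in enumerate(expected):
        # Look only a few words ahead so matching stays in order.
        if word in heard[pos:pos + lookahead]:
            pos = heard.index(word, pos) + 1
            last_matched = idx
    return expected[last_matched + 1:]


def append_with_silence(head: list[float], tail: list[float],
                        sample_rate: int = 24000, gap_ms: int = 50) -> list[float]:
    """Join two clips with a short stretch of silence between them."""
    silence = [0.0] * (sample_rate * gap_ms // 1000)
    return head + silence + tail


expected = "one two three four five six".split()
heard = "one two three four".split()
print(missing_tail(expected, heard))
```

The missing words would then be synthesized on their own, joined to the existing audio with `append_with_silence`, and the combined clip sent back through Whisper.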

The Full Fallback Chain

Generate full chunk
→ Duration + Whisper
→ Truncated? → Sentence split
→ Per sentence: Duration + Whisper
→ Truncated? → Clause split (40 words)
→ Can't split? → Fine split (20 words)
→ Still truncated? → Append missing tail
→ Final Whisper verification
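
The escalation step in this chain (clause split at 40 words, then a forced 20-word split) can be illustrated with a small sketch. It assumes clauses are split on commas and semicolons and then greedily recombined up to a word limit, which reproduces the failure described above: a 39-word sentence fits under the 40-word limit, so its clauses are merged straight back into one piece. All names here are hypothetical.

```python
import re


def split_clauses(sentence: str, max_words: int) -> list[str]:
    """Split on ',' / ';' and recombine clauses up to max_words each."""
    clauses = [c for c in re.split(r"(?<=[,;])\s+", sentence.strip()) if c]
    pieces: list[str] = []
    current: list[str] = []
    for clause in clauses:
        candidate = current + clause.split()
        if current and len(candidate) > max_words:
            pieces.append(" ".join(current))
            current = clause.split()
        else:
            current = candidate
    if current:
        pieces.append(" ".join(current))
    return pieces


def split_with_escalation(sentence: str, limits: tuple[int, ...] = (40, 20)) -> list[str]:
    """Try each word limit in turn until the sentence actually splits."""
    for max_words in limits:
        pieces = split_clauses(sentence, max_words)
        if len(pieces) > 1:
            return pieces
    return [sentence]  # genuinely unsplittable at every limit


# A 39-word sentence made of three 13-word clauses.
sentence = ", ".join(" ".join(["word"] * 13) for _ in range(3)) + "."
print(len(split_clauses(sentence, 40)), len(split_with_escalation(sentence)))
```

At the 40-word limit the sentence comes back as a single piece; escalating to 20 words yields three pieces that can be generated separately.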

Head + Tail Verification

We now check both ends:

  • First 3 words (head)
  • Last 3 words (tail)

Both must pass.
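
A minimal sketch of a head and tail check, assuming both texts have already been normalized into word lists. The fuzzy threshold of 0.8 and the function names are assumptions for illustration.

```python
from difflib import SequenceMatcher


def words_match(a: list[str], b: list[str], threshold: float = 0.8) -> bool:
    """Compare two equal-length word lists position by position, fuzzily."""
    return len(a) == len(b) and all(
        SequenceMatcher(None, x, y).ratio() >= threshold for x, y in zip(a, b)
    )


def check_ends(expected: list[str], heard: list[str], n: int = 3) -> dict[str, bool]:
    """Report separately whether the first and last n words survived."""
    return {
        "head": words_match(expected[:n], heard[:n]),
        "tail": words_match(expected[-n:], heard[-n:]),
    }


expected = "let your shoulders drop and breathe out slowly".split()
heard = "your shoulders drop and breathe out slowly".split()  # first word clipped
print(check_ends(expected, heard))
```

Reporting the two ends separately makes it possible to tell a cold-start problem (head fails) from a truncation (tail fails).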


Whisper at Every Level

Whisper now runs at:

  • Chunk
  • Sentence
  • Clause
  • Final output

This gives precise diagnostics, not just final failure detection.


Full Transcription Logging

We log:

  • Original text
  • Whisper output

Used for:

  • Debugging
  • Monitoring word coverage trends
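
One possible shape for such a log entry is a single JSON line per verification. The field names, the logger name, and the simple set-based coverage figure below are assumptions made for this sketch, not the actual log schema.

```python
import json
import logging

logger = logging.getLogger("tts.verify")  # hypothetical logger name


def verification_record(original: str, transcript: str, level: str) -> dict:
    """Build a log record pairing the input text with what Whisper heard."""
    expected = original.lower().split()
    heard = set(transcript.lower().split())
    # Rough, order-insensitive coverage; enough for trend monitoring.
    covered = sum(word in heard for word in expected)
    return {
        "level": level,  # e.g. "chunk", "sentence", "clause", "final"
        "original": original,
        "transcript": transcript,
        "word_coverage": round(covered / len(expected), 3) if expected else 1.0,
    }


def log_verification(original: str, transcript: str, level: str) -> dict:
    """Emit the record as one JSON line and return it."""
    record = verification_record(original, transcript, level)
    logger.info(json.dumps(record, ensure_ascii=False))
    return record


record = log_verification("one two three four", "one two three", "sentence")
print(record["word_coverage"])
```

Keeping both texts in the same record means a coverage dip can be traced back to the exact words that went missing.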

What This Taught Us

  1. Fallbacks must be tested end-to-end
  2. Detection without correction is useless
  3. Surgical fixes beat brute force
  4. Autoregressive models need priming
  5. Verification must happen at every layer

Parkbench generates personalized AI audio including meditation, sleep, conversations, and narration.

Everything runs locally, using open-source models.

No user data ever leaves our infrastructure.

