
Voice Generation

Voice generation converts the written script into spoken audio with word-level subtitle timing.

What It Does

This stage takes a completed script and produces two outputs: an MP3 audio file of the spoken voiceover, and precise word-by-word timing data used to drive subtitle highlighting in the final video. The audio is uploaded to Supabase Storage and the timing data is stored on the content record.

How It Works

  1. Text-to-Speech — The script is sent to OpenAI’s TTS API (tts-1 model), which returns an MP3 audio file
  2. Subtitle Generation — The generated audio is sent to OpenAI’s Whisper API, which transcribes it with word-level timestamps
  3. Duration Detection — FFprobe reads the actual audio duration from the generated file
  4. Upload — The MP3 is uploaded to the content-audio Supabase Storage bucket

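The four steps above can be sketched in Python. This is a minimal illustration, not the actual implementation: it assumes the official openai SDK (an `openai.OpenAI` client passed in by the caller), and the exact ffprobe flags shown are one common way to read a duration. Storage upload is omitted since the Supabase client setup is channel-specific.

```python
import subprocess


def ffprobe_duration_cmd(path: str) -> list[str]:
    """Build an ffprobe command that prints only the container duration in seconds."""
    return [
        "ffprobe", "-v", "error",
        "-show_entries", "format=duration",
        "-of", "csv=p=0",  # bare value, no section headers
        path,
    ]


def read_duration(path: str) -> float:
    """Step 3: read the actual audio duration from the generated MP3."""
    out = subprocess.run(
        ffprobe_duration_cmd(path), capture_output=True, text=True, check=True
    )
    return float(out.stdout.strip())


def generate_voice(client, script: str, path: str, voice: str = "alloy", speed: float = 1.0):
    """Steps 1-2: TTS, then word-level transcription of the resulting audio."""
    # Step 1: text-to-speech with the tts-1 model
    speech = client.audio.speech.create(
        model="tts-1", voice=voice, speed=speed, input=script
    )
    with open(path, "wb") as f:
        f.write(speech.content)

    # Step 2: transcribe the generated audio with word-level timestamps
    with open(path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            response_format="verbose_json",
            timestamp_granularities=["word"],
        )
    return transcript
```

The functions are only defined here; a caller would run `generate_voice(...)`, then `read_duration(path)` before uploading the MP3.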
Subtitle Data Format

The Whisper API returns word-level timing that is stored in the subtitle_data field:

{
  "words": [
    {"word": "This", "start": 0.0, "end": 0.3},
    {"word": "game", "start": 0.3, "end": 0.6},
    {"word": "changed", "start": 0.6, "end": 1.0}
  ],
  "full_text": "This game changed..."
}

This timing data drives the word-highlighted subtitles in the final video. Each word is displayed at exactly the right moment, with the currently spoken word highlighted in the accent color.

Voice Options

The voice is configured per-channel in voice_config:

Voice     Character
alloy     Neutral, balanced (default)
echo      Warm, conversational
fable     Expressive, storytelling
nova      Energetic, bright
onyx      Deep, authoritative
shimmer   Clear, gentle

Speed ranges from 0.25x to 4.0x (default 1.0x). Most channels work well between 0.95x and 1.1x.

Voice Testing

You can generate a voice sample without creating content. This is useful for previewing voice and speed combinations before committing to a channel configuration. The test endpoint generates a short audio clip using sample text so you can hear how different voices sound.

Where to Find It

  • Dashboard: Channel Settings, Voice tab — preview voices and configure settings
  • Trigger: Pipeline page, “Generate Voice” button
  • API: POST /pipeline/generate-voice (production) or POST /pipeline/generate-voice-sample (testing)

Configuration

Field   Type     Default   Description
voice   string   "alloy"   OpenAI TTS voice name
speed   number   1.0       Speaking speed multiplier (0.25 to 4.0)

These are set in the voice_config JSON object on the channel record.

Dependencies

  • OPENAI_API_KEY — Required for both TTS generation and Whisper transcription
  • ffprobe (part of FFmpeg) — Required for reading audio duration from the generated file