Voice Generation
Voice generation converts the written script into spoken audio with word-level subtitle timing.
What It Does
This stage takes a completed script and produces two outputs: an MP3 audio file of the spoken voiceover, and precise word-by-word timing data used to drive subtitle highlighting in the final video. The audio is uploaded to Supabase Storage and the timing data is stored on the content record.
How It Works
- Text-to-Speech — The script is sent to OpenAI’s TTS API (tts-1 model) which returns an MP3 audio file
- Subtitle Generation — The audio is sent back to OpenAI’s Whisper API which transcribes it with word-level timestamps
- Duration Detection — FFprobe reads the actual audio duration from the generated file
- Upload — The MP3 is uploaded to the content-audio Supabase Storage bucket (see the sketch after this list)
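A condensed sketch of the four steps in TypeScript, assuming the official openai and @supabase/supabase-js Node clients. The generateVoice signature and the content table/column names are illustrative, not the actual implementation:

```typescript
import fs from "node:fs";
import os from "node:os";
import path from "node:path";
import { execFile } from "node:child_process";
import { promisify } from "node:util";
import OpenAI from "openai";
import { createClient } from "@supabase/supabase-js";

const run = promisify(execFile);
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!);

type Voice = "alloy" | "echo" | "fable" | "nova" | "onyx" | "shimmer";

async function generateVoice(contentId: string, script: string, voice: Voice = "alloy", speed = 1.0) {
  // 1. Text-to-Speech: tts-1 returns an MP3 stream
  const speech = await openai.audio.speech.create({
    model: "tts-1",
    voice,
    input: script,
    speed,
  });
  const mp3 = Buffer.from(await speech.arrayBuffer());
  const tmp = path.join(os.tmpdir(), `${contentId}.mp3`);
  fs.writeFileSync(tmp, mp3);

  // 2. Subtitle generation: Whisper transcribes with word-level timestamps
  const transcript = (await openai.audio.transcriptions.create({
    file: fs.createReadStream(tmp),
    model: "whisper-1",
    response_format: "verbose_json",
    timestamp_granularities: ["word"],
  })) as unknown as { text: string; words: { word: string; start: number; end: number }[] };
  const subtitleData = { words: transcript.words, full_text: transcript.text };

  // 3. Duration detection: ffprobe reads the real audio length
  const { stdout } = await run("ffprobe", [
    "-v", "error",
    "-show_entries", "format=duration",
    "-of", "default=noprint_wrappers=1:nokey=1",
    tmp,
  ]);
  const duration = parseFloat(stdout);

  // 4. Upload the MP3 and store the timing data on the content record
  // ("content" and "audio_duration" are assumed names for illustration)
  await supabase.storage
    .from("content-audio")
    .upload(`${contentId}.mp3`, mp3, { contentType: "audio/mpeg" });
  await supabase
    .from("content")
    .update({ subtitle_data: subtitleData, audio_duration: duration })
    .eq("id", contentId);
}
```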
Subtitle Data Format
The Whisper API returns word-level timing that is stored in the subtitle_data field:
```json
{
  "words": [
    {"word": "This", "start": 0.0, "end": 0.3},
    {"word": "game", "start": 0.3, "end": 0.6},
    {"word": "changed", "start": 0.6, "end": 1.0}
  ],
  "full_text": "This game changed..."
}
```

This timing data drives the word-highlighted subtitles in the final video. Each word is displayed at exactly the right moment, with the currently spoken word highlighted in the accent color.
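At render time, only the playback position is needed to pick the highlighted word. A minimal lookup (names are illustrative):

```typescript
interface Word {
  word: string;
  start: number;
  end: number;
}

/** Index of the word being spoken at time t (seconds), or -1 between words. */
function currentWordIndex(words: Word[], t: number): number {
  return words.findIndex((w) => t >= w.start && t < w.end);
}

// Using the example above: at t = 0.4s the highlighted word is "game"
// currentWordIndex(subtitleData.words, 0.4) === 1
```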
Voice Options
The voice is configured per-channel in voice_config:
| Voice | Character |
|---|---|
| alloy | Neutral, balanced (default) |
| echo | Warm, conversational |
| fable | Expressive, storytelling |
| nova | Energetic, bright |
| onyx | Deep, authoritative |
| shimmer | Clear, gentle |
Speed ranges from 0.25x to 4.0x (default 1.0x). Most channels work well between 0.95x and 1.1x.
Voice Testing
You can generate a voice sample without creating content. This is useful for previewing voice and speed combinations before committing to a channel configuration. The test endpoint generates a short audio clip using sample text so you can hear how different voices sound.
Where to Find It
- Dashboard: Channel Settings, Voice tab — preview voices and configure settings
- Trigger: Pipeline page, “Generate Voice” button
- API: POST /pipeline/generate-voice (production) or POST /pipeline/generate-voice-sample (testing)
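A sample call against the testing endpoint. The request body shape (voice, speed) is an assumption based on the configuration fields below, and the base URL is a placeholder:

```typescript
// Preview a voice/speed combination without creating content.
// Body fields mirror voice_config; the exact schema is an assumption.
const res = await fetch("https://your-api.example.com/pipeline/generate-voice-sample", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ voice: "nova", speed: 1.05 }),
});
const sample = Buffer.from(await res.arrayBuffer()); // short MP3 clip of the sample text
```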
Configuration
| Field | Type | Default | Description |
|---|---|---|---|
voice | string | "alloy" | OpenAI TTS voice name |
speed | number | 1.0 | Speaking speed multiplier (0.25 to 4.0) |
These are set in the voice_config JSON object on the channel record.
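For example, a hypothetical channel record's voice_config might look like:

```json
{
  "voice_config": {
    "voice": "nova",
    "speed": 1.05
  }
}
```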
Dependencies
- OPENAI_API_KEY — Required for both TTS generation and Whisper transcription
- ffprobe (part of FFmpeg) — Required for reading audio duration from the generated file