Voice Generation
Voice generation converts the written script into spoken audio with word-level subtitle timing.
What It Does
This stage takes a completed script and produces two outputs: an MP3 audio file of the spoken voiceover, and precise word-by-word timing data used to drive subtitle highlighting in the final video. The audio is uploaded to Supabase Storage and the timing data is stored on the content record.
How It Works
- Text-to-Speech — The script is sent to OpenAI’s TTS API (tts-1 model) which returns an MP3 audio file
- Subtitle Generation — The audio is sent back to OpenAI’s Whisper API which transcribes it with word-level timestamps
- Duration Detection — FFprobe reads the actual audio duration from the generated file
- Upload — The MP3 is uploaded to the content-audio Supabase Storage bucket (see the sketch after this list)
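A condensed sketch of the four steps in TypeScript, assuming the official openai and @supabase/supabase-js Node clients. The generateVoice signature and the content table/column names are illustrative, not the actual implementation:

```typescript
import fs from "node:fs";
import os from "node:os";
import path from "node:path";
import { execFile } from "node:child_process";
import { promisify } from "node:util";
import OpenAI from "openai";
import { createClient } from "@supabase/supabase-js";

const run = promisify(execFile);
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_KEY!);

type Voice = "alloy" | "echo" | "fable" | "nova" | "onyx" | "shimmer";

async function generateVoice(contentId: string, script: string, voice: Voice = "alloy", speed = 1.0) {
  // 1. Text-to-Speech: tts-1 returns an MP3 stream
  const speech = await openai.audio.speech.create({
    model: "tts-1",
    voice,
    input: script,
    speed,
  });
  const mp3 = Buffer.from(await speech.arrayBuffer());
  const tmp = path.join(os.tmpdir(), `${contentId}.mp3`);
  fs.writeFileSync(tmp, mp3);

  // 2. Subtitle generation: Whisper transcribes with word-level timestamps
  const transcript = (await openai.audio.transcriptions.create({
    file: fs.createReadStream(tmp),
    model: "whisper-1",
    response_format: "verbose_json",
    timestamp_granularities: ["word"],
  })) as unknown as { text: string; words: { word: string; start: number; end: number }[] };
  const subtitleData = { words: transcript.words, full_text: transcript.text };

  // 3. Duration detection: ffprobe reads the real audio length
  const { stdout } = await run("ffprobe", [
    "-v", "error",
    "-show_entries", "format=duration",
    "-of", "default=noprint_wrappers=1:nokey=1",
    tmp,
  ]);
  const duration = parseFloat(stdout);

  // 4. Upload the MP3 and store the timing data on the content record
  // ("content" and "audio_duration" are assumed names for illustration)
  await supabase.storage
    .from("content-audio")
    .upload(`${contentId}.mp3`, mp3, { contentType: "audio/mpeg" });
  await supabase
    .from("content")
    .update({ subtitle_data: subtitleData, audio_duration: duration })
    .eq("id", contentId);
}
```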
Subtitle Data Format
The Whisper API returns word-level timing that is stored in the subtitle_data field:
```json
{
  "words": [
    {"word": "This", "start": 0.0, "end": 0.3},
    {"word": "game", "start": 0.3, "end": 0.6},
    {"word": "changed", "start": 0.6, "end": 1.0}
  ],
  "full_text": "This game changed..."
}
```

This timing data drives the word-highlighted subtitles in the final video. Each word is displayed at exactly the right moment, with the currently spoken word highlighted in the accent color.
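At render time, only the playback position is needed to pick the highlighted word. A minimal lookup (names are illustrative):

```typescript
interface Word {
  word: string;
  start: number;
  end: number;
}

/** Index of the word being spoken at time t (seconds), or -1 between words. */
function currentWordIndex(words: Word[], t: number): number {
  return words.findIndex((w) => t >= w.start && t < w.end);
}

// Using the example above: at t = 0.4s the highlighted word is "game"
// currentWordIndex(subtitleData.words, 0.4) === 1
```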
Voice Options
The voice is configured per-channel in voice_config:
| Voice | Character |
|---|---|
| alloy | Neutral, balanced (default) |
| echo | Warm, conversational |
| fable | Expressive, storytelling |
| nova | Energetic, bright |
| onyx | Deep, authoritative |
| shimmer | Clear, gentle |
Speed ranges from 0.25x to 4.0x (default 1.0x). Most channels work well between 0.95x and 1.1x.
Voice Testing
You can generate a voice sample without creating content. This is useful for previewing voice and speed combinations before committing to a channel configuration. The test endpoint generates a short audio clip using sample text so you can hear how different voices sound.
Where to Find It
- Dashboard: Channel Settings, Voice tab — preview voices and configure settings
- Trigger: Pipeline page, “Generate Voice” button
- API: POST /pipeline/generate-voice (production) or POST /pipeline/generate-voice-sample (testing)
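A sample call against the testing endpoint. The request body shape (voice, speed) is an assumption based on the configuration fields below, and the base URL is a placeholder:

```typescript
// Preview a voice/speed combination without creating content.
// Body fields mirror voice_config; the exact schema is an assumption.
const res = await fetch("https://your-api.example.com/pipeline/generate-voice-sample", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ voice: "nova", speed: 1.05 }),
});
const sample = Buffer.from(await res.arrayBuffer()); // short MP3 clip of the sample text
```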
Configuration
| Field | Type | Default | Description |
|---|---|---|---|
voice | string | "alloy" | OpenAI TTS voice name |
speed | number | 1.0 | Speaking speed multiplier (0.25 to 4.0) |
These are set in the voice_config JSON object on the channel record.
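For example, a hypothetical channel record's voice_config might look like:

```json
{
  "voice_config": {
    "voice": "nova",
    "speed": 1.05
  }
}
```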
Dependencies
- OPENAI_API_KEY — Required for both TTS generation and Whisper transcription
- ffprobe (part of FFmpeg) — Required for reading audio duration from the generated file