Word Timestamps

Get precise timing information for every word in your transcription.

Overview

Word-level timestamps provide start and end times for each word, essential for:

  • Video subtitle generation
  • Audio/video editing alignment
  • Karaoke-style highlighting
  • Accessibility applications
  • Search and navigation within audio

Enable Word Timestamps

Use the verbose_json response format with timestamp_granularities=word:

import requests

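# Open the file in a context manager so the handle is closed after the upload
with open("audio.mp3", "rb") as audio_file:
    response = requests.post(
        "https://yourvoic.com/api/v1/stt/cipher/transcribe",
        headers={"X-API-Key": "your_api_key"},
        files={"file": audio_file},
        data={
            "model": "cipher-max",
            "response_format": "verbose_json",
            "timestamp_granularities": "word"
        }
    )

response.raise_for_status()  # Stop here on an HTTP error instead of parsing an error body
result = response.json()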

Response Format

{
    "success": true,
    "text": "Hello world, this is a test.",
    "words": [
        {"word": "Hello", "start": 0.0, "end": 0.45},
        {"word": "world,", "start": 0.52, "end": 0.89},
        {"word": "this", "start": 1.02, "end": 1.18},
        {"word": "is", "start": 1.22, "end": 1.35},
        {"word": "a", "start": 1.38, "end": 1.42},
        {"word": "test.", "start": 1.45, "end": 1.82}
    ],
    "duration": 1.82,
    "language": "en"
}
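
As a rough illustration of how the words array can drive karaoke-style highlighting and in-audio search, the sketch below looks up the word being spoken at a given playback time and finds where a term occurs. The helper names word_at and find_word are hypothetical and not part of the API, and the code assumes the words arrive sorted by start time, as in the sample response.

```python
from bisect import bisect_right

# Sample "words" array, copied from the response above.
words = [
    {"word": "Hello", "start": 0.0, "end": 0.45},
    {"word": "world,", "start": 0.52, "end": 0.89},
    {"word": "this", "start": 1.02, "end": 1.18},
    {"word": "is", "start": 1.22, "end": 1.35},
    {"word": "a", "start": 1.38, "end": 1.42},
    {"word": "test.", "start": 1.45, "end": 1.82},
]


def word_at(words, t):
    """Return the word spoken at time t, or None if t falls in a gap.

    Hypothetical helper; assumes words are sorted by start time.
    """
    starts = [w["start"] for w in words]
    i = bisect_right(starts, t) - 1  # last word starting at or before t
    if i >= 0 and t <= words[i]["end"]:
        return words[i]
    return None


def find_word(words, term):
    """Return the start time of every word matching term.

    Hypothetical helper; ignores case and surrounding punctuation.
    """
    needle = term.lower()
    return [
        w["start"]
        for w in words
        if w["word"].strip(".,!?;:\"'").lower() == needle
    ]


print(word_at(words, 0.6)["word"])  # world,
print(find_word(words, "test"))     # [1.45]
```

A time that falls between two words, such as 0.5 in this sample, returns None, which a player can treat as "nothing highlighted".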

Timestamp Granularities

Value      Description               Use Case
word       Individual word timing    Subtitles, karaoke
segment    Sentence/phrase timing    Paragraph-level alignment

Generate SRT Subtitles

def create_srt(words, max_chars=42, max_duration=4.0):
    """Group word timestamps into subtitle cues (dicts with start, end, text)"""
    subtitles = []
    current_line = []
    current_chars = 0
    line_start = None

    for word in words:
        if line_start is None:
            line_start = word['start']

        # Line length if this word is added, counting the joining space
        new_chars = current_chars + len(word['word']) + (1 if current_line else 0)
        too_long = new_chars > max_chars
        too_slow = word['end'] - line_start > max_duration

        # Only break when the line has content, so an oversized first word
        # never produces an empty cue
        if current_line and (too_long or too_slow):
            # Save current line and start a new one with this word
            subtitles.append({
                'start': line_start,
                'end': current_line[-1]['end'],
                'text': ' '.join(w['word'] for w in current_line)
            })
            current_line = [word]
            current_chars = len(word['word'])
            line_start = word['start']
        else:
            current_line.append(word)
            current_chars = new_chars

    # Don't forget the last line
    if current_line:
        subtitles.append({
            'start': line_start,
            'end': current_line[-1]['end'],
            'text': ' '.join(w['word'] for w in current_line)
        })

    return subtitles

# Use it
result = response.json()
subtitles = create_srt(result['words'])
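
Note that create_srt returns a list of cues rather than SRT text. One way to turn those cues into the contents of an .srt file is sketched below. The helper names format_srt_time and render_srt are hypothetical, and the sample cues are hand-written to match the shape create_srt returns. SRT numbers each cue from 1 and writes timestamps as HH:MM:SS,mmm with a comma before the milliseconds.

```python
def format_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def render_srt(cues):
    """Render cues (dicts with 'start', 'end', 'text') as SRT text.

    Hypothetical helper: each cue becomes a numbered block, and blocks
    are separated by a blank line.
    """
    blocks = []
    for index, cue in enumerate(cues, start=1):
        start = format_srt_time(cue["start"])
        end = format_srt_time(cue["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{cue['text']}\n")
    return "\n".join(blocks)


# Hand-written sample cues in the shape create_srt returns.
cues = [
    {"start": 0.0, "end": 0.89, "text": "Hello world,"},
    {"start": 1.02, "end": 1.82, "text": "this is a test."},
]

print(render_srt(cues))
```

To save the result, write the returned string to a file with UTF-8 encoding, for example open("audio.srt", "w", encoding="utf-8").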

Direct Subtitle Formats

To skip client-side conversion, set response_format to srt or vtt and request the subtitles directly:

# Get SRT format directly
curl -X POST "https://yourvoic.com/api/v1/stt/cipher/transcribe" \
  -H "X-API-Key: your_api_key" \
  -F "file=@video.mp4" \
  -F "model=cipher-max" \
  -F "response_format=srt"

# Get WebVTT format
curl -X POST "https://yourvoic.com/api/v1/stt/cipher/transcribe" \
  -H "X-API-Key: your_api_key" \
  -F "file=@video.mp4" \
  -F "model=cipher-max" \
  -F "response_format=vtt"