Response Formats

The Speech-to-Text API returns results in several formats: standard and verbose JSON, diarization output, SRT and WebVTT subtitles, and real-time streaming messages.

Standard JSON Response

Batch transcription requests return this format by default:

{
    "success": true,
    "text": "Hello, this is a transcription test.",
    "duration": 5.2,
    "language": "en",
    "model": "cipher-fast",
    "credits_used": 16
}
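
As a minimal sketch of consuming this response in Python, the snippet below parses the body shown above into a small dataclass. The names `Transcription` and `parse_standard_response` are hypothetical helpers, not part of the API, and the sketch assumes the response body arrives as a JSON string.

```python
import json
from dataclasses import dataclass


@dataclass
class Transcription:
    """Hypothetical container mirroring the fields of the standard response."""
    text: str
    duration: float
    language: str
    model: str
    credits_used: int


def parse_standard_response(body: str) -> Transcription:
    """Parse the default JSON body; raise ValueError if "success" is not true."""
    data = json.loads(body)
    if not data.get("success"):
        raise ValueError("transcription request did not succeed")
    return Transcription(
        text=data["text"],
        duration=data["duration"],
        language=data["language"],
        model=data["model"],
        credits_used=data["credits_used"],
    )


# Sample body copied from the example above.
SAMPLE_STANDARD = """
{
    "success": true,
    "text": "Hello, this is a transcription test.",
    "duration": 5.2,
    "language": "en",
    "model": "cipher-fast",
    "credits_used": 16
}
"""

result = parse_standard_response(SAMPLE_STANDARD)
print(result.text, result.credits_used)
```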

Verbose JSON Response

Set response_format=verbose_json to receive an extended response with word-level and segment-level timestamps and confidence scores:

{
    "success": true,
    "text": "Hello world",
    "duration": 1.5,
    "language": "en",
    "words": [
        {"word": "Hello", "start": 0.0, "end": 0.5, "confidence": 0.98},
        {"word": "world", "start": 0.6, "end": 1.0, "confidence": 0.97}
    ],
    "segments": [
        {
            "text": "Hello world",
            "start": 0.0,
            "end": 1.0,
            "confidence": 0.97
        }
    ]
}
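
One common use of the word-level data is flagging words the model was less sure about. The sketch below is illustrative only: `low_confidence_words` and its threshold are hypothetical, and it assumes the verbose response has already been decoded into a Python dict.

```python
from typing import Any


def low_confidence_words(
    response: dict[str, Any], threshold: float = 0.9
) -> list[tuple[str, float, float]]:
    """Return (word, start, end) for every word whose confidence is below threshold."""
    return [
        (w["word"], w["start"], w["end"])
        for w in response.get("words", [])
        if w["confidence"] < threshold
    ]


# Decoded form of the "words" array from the example above.
SAMPLE_VERBOSE = {
    "success": True,
    "text": "Hello world",
    "words": [
        {"word": "Hello", "start": 0.0, "end": 0.5, "confidence": 0.98},
        {"word": "world", "start": 0.6, "end": 1.0, "confidence": 0.97},
    ],
}

flagged = low_confidence_words(SAMPLE_VERBOSE, threshold=0.975)
print(flagged)
```

The right threshold depends on the audio and the use case; 0.975 is chosen here only so that the sample data produces a non-empty result.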

Diarization Response

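With speaker diarization enabled, the response adds an utterances array, in which each entry is labeled with a speaker index, and a speakers object containing the speaker count: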

{
    "success": true,
    "text": "Hello everyone. Thanks for having me.",
    "duration": 4.5,
    "utterances": [
        {
            "speaker": 0,
            "text": "Hello everyone.",
            "start": 0.0,
            "end": 1.5,
            "confidence": 0.95
        },
        {
            "speaker": 1,
            "text": "Thanks for having me.",
            "start": 2.0,
            "end": 3.8,
            "confidence": 0.93
        }
    ],
    "speakers": {
        "count": 2
    }
}
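
The utterances array maps naturally onto a speaker-labeled transcript. The following is a rough sketch, assuming the response has been decoded into a dict; `format_dialogue` is a hypothetical helper and the "Speaker N" label format is an arbitrary choice.

```python
from typing import Any


def format_dialogue(response: dict[str, Any]) -> str:
    """Render utterances as one line per turn, ordered by start time."""
    turns = sorted(response["utterances"], key=lambda u: u["start"])
    return "\n".join(
        f"[{u['start']:.1f}s] Speaker {u['speaker']}: {u['text']}" for u in turns
    )


# Decoded form of the "utterances" array from the example above.
SAMPLE_DIARIZED = {
    "utterances": [
        {"speaker": 0, "text": "Hello everyone.", "start": 0.0, "end": 1.5, "confidence": 0.95},
        {"speaker": 1, "text": "Thanks for having me.", "start": 2.0, "end": 3.8, "confidence": 0.93},
    ],
    "speakers": {"count": 2},
}

dialogue = format_dialogue(SAMPLE_DIARIZED)
print(dialogue)
# [0.0s] Speaker 0: Hello everyone.
# [2.0s] Speaker 1: Thanks for having me.
```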

SRT Format

SubRip subtitle format for video captioning:

1
00:00:00,000 --> 00:00:02,500
Hello, welcome to our video.

2
00:00:03,000 --> 00:00:05,500
Today we'll be discussing the API.
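
If the cues need to be processed rather than handed straight to a video player, the SRT text can be split back into timed entries. This is a minimal sketch that handles only well-formed output like the sample above; `to_seconds` and `parse_srt` are hypothetical helper names.

```python
import re

# Matches HH:MM:SS,mmm (SRT); a period separator is also accepted.
TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})")


def to_seconds(stamp: str) -> float:
    """Convert a subtitle timestamp such as 00:00:02,500 to seconds."""
    match = TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"bad timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(srt_text: str) -> list[tuple[float, float, str]]:
    """Return (start, end, text) for each cue; cues are separated by blank lines."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # skip anything that is not index + timing + text
        start, end = (to_seconds(part) for part in lines[1].split("-->"))
        cues.append((start, end, "\n".join(lines[2:])))
    return cues


SAMPLE_SRT = """1
00:00:00,000 --> 00:00:02,500
Hello, welcome to our video.

2
00:00:03,000 --> 00:00:05,500
Today we'll be discussing the API.
"""

cues = parse_srt(SAMPLE_SRT)
print(cues)
```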

WebVTT Format

Web Video Text Tracks format for HTML5 video:

WEBVTT

00:00:00.000 --> 00:00:02.500
Hello, welcome to our video.

00:00:03.000 --> 00:00:05.500
Today we'll be discussing the API.
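
When custom cue grouping is needed, for example merging or splitting captions derived from JSON timestamps, a WebVTT document can also be assembled client-side. The sketch below is one way to do that under simple assumptions (no cue identifiers, no styling); `vtt_timestamp` and `build_webvtt` are hypothetical names, and the API's own WebVTT output may differ in detail.

```python
def vtt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS.mmm, using integer milliseconds to avoid drift."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def build_webvtt(cues: list[tuple[float, float, str]]) -> str:
    """Build a WebVTT document from (start, end, text) tuples."""
    lines = ["WEBVTT", ""]  # required header followed by a blank line
    for start, end, text in cues:
        lines.append(f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}")
        lines.append(text)
        lines.append("")  # a blank line terminates each cue
    return "\n".join(lines)


vtt = build_webvtt([
    (0.0, 2.5, "Hello, welcome to our video."),
    (3.0, 5.5, "Today we'll be discussing the API."),
])
print(vtt)
```

Note the period before the milliseconds: WebVTT uses `00:00:02.500` where SRT uses `00:00:02,500`.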

Real-time Streaming Messages

Transcript Message

{
    "type": "transcript",
    "text": "Hello, how are you?",
    "is_final": true,
    "confidence": 0.95,
    "start": 0.0,
    "end": 1.5,
    "words": [
        {"word": "Hello", "start": 0.0, "end": 0.3},
        {"word": "how", "start": 0.4, "end": 0.5},
        {"word": "are", "start": 0.6, "end": 0.7},
        {"word": "you", "start": 0.8, "end": 1.0}
    ]
}
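
A client typically needs to keep finalized text while showing provisional text as it arrives. The sketch below assumes that messages with "is_final": false carry interim results that a later message supersedes; that behavior is inferred from the field name, not confirmed here. `TranscriptBuffer` is a hypothetical helper.

```python
import json


class TranscriptBuffer:
    """Accumulates final transcript text and tracks the latest interim text."""

    def __init__(self) -> None:
        self.final_parts: list[str] = []
        self.interim = ""

    def handle(self, raw: str) -> None:
        """Process one raw JSON message; non-transcript messages are ignored."""
        msg = json.loads(raw)
        if msg.get("type") != "transcript":
            return
        if msg.get("is_final"):
            self.final_parts.append(msg["text"])
            self.interim = ""  # assumption: a final result replaces the interim one
        else:
            self.interim = msg["text"]

    @property
    def text(self) -> str:
        """Final text so far, followed by any pending interim text."""
        parts = self.final_parts + ([self.interim] if self.interim else [])
        return " ".join(parts)


buffer = TranscriptBuffer()
buffer.handle('{"type": "transcript", "text": "Hello, how", "is_final": false}')
interim_view = buffer.text
buffer.handle('{"type": "transcript", "text": "Hello, how are you?", "is_final": true}')
final_view = buffer.text
print(interim_view, "|", final_view)
```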

Speech Events

Speech started:

{
    "type": "speech_started",
    "timestamp": 1234567890.123
}

Speech ended:

{
    "type": "speech_ended",
    "timestamp": 1234567891.456
}
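
Pairing the two events gives the length of each stretch of detected speech. The sketch below assumes the timestamp values are in seconds (they resemble Unix epoch seconds, but the unit is not stated here) and that events arrive in order; `speech_durations` is a hypothetical helper.

```python
from typing import Any


def speech_durations(events: list[dict[str, Any]]) -> list[float]:
    """Return the duration of each speech_started / speech_ended pair."""
    durations = []
    started_at = None
    for event in events:
        if event["type"] == "speech_started":
            started_at = event["timestamp"]
        elif event["type"] == "speech_ended" and started_at is not None:
            durations.append(event["timestamp"] - started_at)
            started_at = None  # wait for the next speech_started
    return durations


# The two sample events shown above.
SAMPLE_EVENTS = [
    {"type": "speech_started", "timestamp": 1234567890.123},
    {"type": "speech_ended", "timestamp": 1234567891.456},
]

durations = speech_durations(SAMPLE_EVENTS)
print([round(d, 3) for d in durations])
```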

Session Status

{
    "type": "session_status",
    "duration": 45.5,
    "credits_used": 137,
    "credits_remaining": 9863
}
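
Because every streaming message carries a type field, a client can route messages through a lookup table. This is a sketch of that pattern only: the handler functions and the `dispatch` name are hypothetical, and how messages are received (for example, over a WebSocket) is outside its scope.

```python
import json
from typing import Callable


def on_transcript(msg: dict) -> str:
    kind = "final" if msg.get("is_final") else "interim"
    return f"{kind}: {msg['text']}"


def on_speech_event(msg: dict) -> str:
    return f"{msg['type']} at {msg['timestamp']}"


def on_session_status(msg: dict) -> str:
    return f"{msg['credits_used']} credits used, {msg['credits_remaining']} remaining"


# Maps each message type shown in this section to a handler.
HANDLERS: dict[str, Callable[[dict], str]] = {
    "transcript": on_transcript,
    "speech_started": on_speech_event,
    "speech_ended": on_speech_event,
    "session_status": on_session_status,
}


def dispatch(raw: str) -> str:
    """Decode one message and route it by type; unknown types are reported, not raised."""
    msg = json.loads(raw)
    handler = HANDLERS.get(msg.get("type"))
    if handler is None:
        return f"ignored: {msg.get('type')}"
    return handler(msg)


status = dispatch(
    '{"type": "session_status", "duration": 45.5, '
    '"credits_used": 137, "credits_remaining": 9863}'
)
print(status)  # 137 credits used, 9863 remaining
```

Ignoring unknown types rather than raising keeps the client working if new message types are introduced later.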

Error Response

{
    "success": false,
    "error": {
        "code": "INVALID_FILE_FORMAT",
        "message": "Unsupported audio format. Please use MP3, WAV, M4A, or WebM."
    }
}
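
Client code can turn this structure into an exception so that failures are not silently treated as transcripts. A minimal sketch follows; `TranscriptionError` and `raise_for_error` are hypothetical names, and the sketch covers only the error shape shown above.

```python
import json


class TranscriptionError(Exception):
    """Hypothetical exception carrying the API's error code and message."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def raise_for_error(body: str) -> dict:
    """Return the decoded body, or raise TranscriptionError if success is false."""
    data = json.loads(body)
    if data.get("success") is False:
        error = data.get("error", {})
        raise TranscriptionError(error.get("code", "UNKNOWN"), error.get("message", ""))
    return data


SAMPLE_ERROR = """
{
    "success": false,
    "error": {
        "code": "INVALID_FILE_FORMAT",
        "message": "Unsupported audio format. Please use MP3, WAV, M4A, or WebM."
    }
}
"""

try:
    raise_for_error(SAMPLE_ERROR)
except TranscriptionError as exc:
    caught_code = exc.code
    print(caught_code)  # INVALID_FILE_FORMAT
```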
