Speaker Diarization
Automatically identify and distinguish between speakers in your audio recordings.
Overview
Speaker diarization answers the question "who spoke when?" by segmenting audio based on speaker identity. This is essential for meetings, interviews, podcasts, and any multi-speaker content.
Supported Models
| Model | Diarization Support | Max Speakers |
|---|---|---|
| lucid-mono | ✅ | Unlimited |
| lucid-multi | ✅ | Unlimited |
| lucid-agent | ✅ | Unlimited |
| lucid-lite | ✅ | Unlimited |
| cipher-fast | ❌ | - |
| cipher-max | ❌ | - |
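If you choose models at runtime, it can help to fail fast before uploading audio. A minimal sketch based on the table above; the `DIARIZATION_MODELS` constant and helper are illustrative, not part of the API:

```python
# Models that support diarization, per the table above (illustrative constant).
DIARIZATION_MODELS = {"lucid-mono", "lucid-multi", "lucid-agent", "lucid-lite"}

def check_diarization_support(model: str) -> None:
    """Raise before uploading audio if the model cannot diarize."""
    if model not in DIARIZATION_MODELS:
        raise ValueError(f"Model '{model}' does not support diarization")
```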
Enable Diarization
```python
import requests

# Upload an audio file with diarization enabled.
with open("meeting.mp3", "rb") as audio_file:
    response = requests.post(
        "https://yourvoic.com/api/v1/stt/lucid/transcribe",
        headers={"X-API-Key": "your_api_key"},
        files={"file": audio_file},
        data={
            "model": "lucid-mono",
            "diarize": "true",
        },
    )

result = response.json()
```
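The snippet above parses the body directly. In practice you may want to confirm the request succeeded first; a minimal sketch (the error message format is an assumption, not documented behavior):

```python
# Fail loudly on non-2xx responses before reading the body.
if not response.ok:
    raise RuntimeError(
        f"Transcription failed ({response.status_code}): {response.text}"
    )
```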
Response Format
When diarization is enabled, the response includes speaker information:
```json
{
  "success": true,
  "text": "Hello everyone, welcome to the meeting. Thanks for having me.",
  "utterances": [
    {
      "speaker": 0,
      "text": "Hello everyone, welcome to the meeting.",
      "start": 0.0,
      "end": 2.5,
      "confidence": 0.95
    },
    {
      "speaker": 1,
      "text": "Thanks for having me.",
      "start": 3.0,
      "end": 4.2,
      "confidence": 0.92
    }
  ],
  "speakers": {
    "count": 2
  }
}
```
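If you prefer typed client code, the utterance objects can be modeled with `TypedDict`. This sketch simply mirrors the fields shown above; the type name is illustrative:

```python
from typing import TypedDict

class Utterance(TypedDict):
    speaker: int       # zero-based label, assigned by order of first appearance
    text: str          # transcript for this segment
    start: float       # segment start time in seconds
    end: float         # segment end time in seconds
    confidence: float  # transcription confidence, 0.0 to 1.0
```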
Processing the Results
```python
# Print transcript with speaker labels
result = response.json()

for utterance in result.get('utterances', []):
    speaker = utterance['speaker']
    text = utterance['text']
    start = utterance['start']
    print(f"[Speaker {speaker}] ({start:.1f}s): {text}")
```
Best Practices
- Audio Quality: Clear audio with minimal background noise produces better speaker separation
- Speaker Distance: Ensure speakers are at similar distances from the microphone
- Overlapping Speech: Minimize simultaneous talking for cleaner diarization
- Consistent Voices: Works best when speakers have distinct voice characteristics
💡 Note: Speaker labels (0, 1, 2...) are assigned based on order of first appearance in the audio, not by any pre-defined identity.
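Because labels only reflect first-appearance order, mapping them to real identities is left to your application. A minimal sketch, assuming you already know the order in which participants first spoke (the `names` mapping is your own data, not returned by the API):

```python
# Application-level mapping from first-appearance labels to known participants.
names = {0: "Alice", 1: "Bob"}

for utterance in result.get('utterances', []):
    label = utterance['speaker']
    speaker_name = names.get(label, f"Speaker {label}")
    print(f"[{speaker_name}] {utterance['text']}")
```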