Speech-to-Text (STT) converts what callers say into text for your AI to understand. Burki Voice AI supports multiple STT providers—choose based on your needs for speed, language support, or enterprise features.

Provider Comparison

| Feature | Deepgram | ElevenLabs Scribe v2 | Azure Speech |
|---|---|---|---|
| Speed | Ultra-fast (~100ms) | Ultra-fast (~150ms) | Fast (~200ms) |
| Languages | 30+ | 90+ | 100+ |
| Models | Nova 2, Nova 3 | Scribe v2 Realtime | Standard, Enhanced, Neural |
| Term Boosting | Keywords, Keyterms | — | Phrase Lists |
| Best For | Phone calls, Speed | Multi-language, VAD | Enterprise, Multi-language |
| Diarization | ✓ | — | ✓ |
| Real-Time | ✓ | ✓ | ✓ |

⚡ Deepgram

Ultra-Low Latency: ~100ms response time, optimized for phone calls. Nova-3 keyterms for English, Nova-2 for multi-language.

🎙️ ElevenLabs Scribe v2

Multi-Language Excellence: ~150ms latency, 90+ languages, advanced VAD-based speech detection, 93.5% accuracy.

☁️ Azure Speech

Enterprise Scale: 100+ languages, Microsoft ecosystem integration, phrase lists for term boosting, custom speech models.

Deepgram

Deepgram is the default STT provider, optimized for speed and phone call quality.

Models

| Model | Features | Keywords | Keyterms | Best For |
|---|---|---|---|---|
| Nova-3 | Latest, keyterms support | — | ✓ | English calls, best accuracy |
| Nova-2 | Keywords support | ✓ | — | Multi-language, reliable |
| Nova | Keywords support | ✓ | — | Balanced performance |
| Enhanced | Keywords support | ✓ | — | Legacy support |
| Base | Keywords support | ✓ | — | Basic transcription |
Recommended: Use Nova-3 for English calls (supports keyterms) or Nova-2 for other languages (supports keywords).

Configuration

```json
{
  "stt_settings": {
    "provider": "deepgram",
    "model": "nova-3",
    "language": "en-US"
  }
}
```

ElevenLabs Scribe v2

ElevenLabs Scribe v2 Realtime provides ultra-low latency speech recognition with excellent multi-language support and advanced voice activity detection.

Key Features:
  • Ultra-low latency (~150ms) with 93.5% accuracy
  • 90+ languages supported
  • Advanced VAD-based commit strategy
  • Word-level timestamps support
  • Automatic language detection
Setup:
  1. Sign up at ElevenLabs
  2. Get your API key from the dashboard
  3. Configure in assistant settings
Configuration:
```json
{
  "stt_settings": {
    "provider": "elevenlabs",
    "model": "scribe_v2_realtime",
    "language": "en",
    "elevenlabs_config": {
      "commit_strategy": "vad",
      "vad_threshold": 0.4,
      "vad_silence_threshold_secs": 1.5
    }
  }
}
```

📖 Full ElevenLabs Documentation

See the complete ElevenLabs Scribe v2 guide for VAD settings, language options, and best practices.

Azure Speech

Azure Speech provides enterprise-grade speech recognition with broad language support and Microsoft ecosystem integration.

Key Features:
  • 100+ languages and regional variants
  • Phrase lists for domain-specific term boosting
  • Custom speech models for specialized vocabulary
  • Speaker diarization support
Setup:
  1. Create Azure Speech resource in Azure Portal
  2. Get your subscription key and region
  3. Configure in assistant settings
Configuration:
```json
{
  "stt_settings": {
    "provider": "azure",
    "model": "standard",
    "language": "en-US",
    "azure_config": {
      "subscription_key": "your_key",
      "region": "eastus"
    }
  }
}
```

📖 Full Azure Documentation

See the complete Azure Speech STT guide for models, languages, configuration options, and best practices.

Key Settings

  • Provider: Choose Deepgram for speed or Azure for enterprise features
  • Model: Choose based on your needs (Nova-3 for English, Standard for multi-language)
  • Language: Select from common options or enter a custom language code
  • Custom Language: Enter any supported language code (e.g., fr-FR, es-ES)
Speech Detection Timing

These settings control how the STT provider detects when someone has finished speaking. Getting these right is crucial for natural conversation flow.

Endpointing (Silence Threshold)

What it does: How long the provider waits after detecting silence before considering speech has ended.

Technical Details:
  • Measured in: Milliseconds
  • Default: 10ms (minimal endpointing for real-time applications)
  • Range: 10ms - 2000ms (recommended)
  • Config Path: stt_settings.endpointing.silence_threshold
Real Example:
  • 10ms: Very responsive (default) - might cut off slow speakers
  • 500ms: “I need help with…” → 0.5s silence → Provider says “speech ended”
  • 1000ms: More patient (good for people who pause while thinking)
When to Adjust:
  • Lower (10-100ms): For fast talkers or quick interactions (default)
  • Higher (500-1000ms): For elderly callers or complex topics
  • Much higher (1500ms+): For people with speech difficulties
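Following the config path above (stt_settings.endpointing.silence_threshold), a more patient setup for slower speakers might be sketched like this:

```json
{
  "stt_settings": {
    "endpointing": {
      "silence_threshold": 500
    }
  }
}
```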

Min Silence Duration

What it does: Internal timeout for utterance processing when the provider doesn't send speech_final (not sent to the provider API).

Technical Details:
  • Measured in: Milliseconds
  • Default: 1500ms
  • Range: 500ms - 5000ms (recommended)
  • Config Path: stt_settings.endpointing.min_silence_duration
  • Used for: Call handler utterance timeout logic when speech_final is missing
Real Example:
  • 1500ms: Wait 1.5s for speech_final, then process accumulated utterance (default)
  • 1000ms: Quicker timeout for responsive conversation
  • 2500ms: More patience for complex responses or noisy environments
When to Adjust:
  • Lower (500-1000ms): For quick, responsive interactions
  • Higher (2000-3000ms): For environments with background noise where speech_final may be unreliable
  • Match with conversation style: Shorter for rapid-fire Q&A, longer for detailed discussions
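As a sketch, a noisy-environment configuration using the config path above (stt_settings.endpointing.min_silence_duration) could look like:

```json
{
  "stt_settings": {
    "endpointing": {
      "min_silence_duration": 2500
    }
  }
}
```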

Utterance End Timeout

What it does: Maximum time the provider waits for a complete utterance before sending an UtteranceEnd event.

Technical Details:
  • Measured in: Milliseconds
  • Default: 1000ms
  • Range: 500ms - 5000ms (recommended)
  • Config Path: stt_settings.utterance_end_ms
  • API Parameter: utterance_end_ms
Real Example:
  • 1000ms: If someone starts talking but doesn’t finish within 1 second, provider sends UtteranceEnd (default)
  • 500ms: Quick timeout (might cut off long sentences)
  • 2000ms: Patient timeout (good for complex responses)
When to Adjust:
  • Lower (500-800ms): For short, quick interactions
  • Higher (1500-3000ms): For detailed conversations or forms
  • Consider your use case: Customer service vs. quick orders
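Per the config path above, utterance_end_ms sits at the top level of stt_settings. A patient-timeout sketch for detailed conversations:

```json
{
  "stt_settings": {
    "utterance_end_ms": 2000
  }
}
```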

VAD Events

What it does: Enables Voice Activity Detection (VAD) events for enhanced speech detection and UtteranceEnd events.

Technical Details:
  • Type: Boolean (true/false)
  • Default: true (enabled)
  • Config Path: stt_settings.vad_events
  • API Parameter: vad_events
Real Example:
  • true: Enhanced speech detection with UtteranceEnd events when speech_final doesn’t work (recommended)
  • false: Basic speech detection only (legacy mode)
When to Enable:
  • Always recommended: Provides better speech detection in noisy environments
  • Essential for: Background noise, poor connections, multiple speakers
  • Backup mechanism: When speech_final doesn’t trigger due to audio issues
Why It Matters: VAD events provide UtteranceEnd signals as a fallback when normal speech detection fails due to background noise or audio quality issues.

🎯 Timing Settings Quick Guide

Real-Time/Fast Conversations (Default):
  • Endpointing: 10ms, Min Silence: 1500ms, Utterance End: 1000ms, VAD Events: true
Balanced Professional:
  • Endpointing: 300ms, Min Silence: 1500ms, Utterance End: 1500ms, VAD Events: true
Patient/Elderly Callers:
  • Endpointing: 800ms, Min Silence: 2500ms, Utterance End: 2000ms, VAD Events: true
Critical: These settings work together with Call Management interruption settings. Endpointing controls provider responsiveness, Min Silence Duration controls internal timeout handling, and both affect conversation flow timing.
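As a sketch, the "Balanced Professional" preset above maps onto the config paths from the previous sections like this:

```json
{
  "stt_settings": {
    "endpointing": {
      "silence_threshold": 300,
      "min_silence_duration": 1500
    },
    "utterance_end_ms": 1500,
    "vad_events": true
  }
}
```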
Term Boosting

Keywords (Deepgram Nova-2, Nova, Enhanced, Base):
  • Boost recognition of specific words
  • Format: word:boost_factor (e.g., Deepgram:2.0, API:1.5)
  • Great for company names, technical terms
Keyterms (Deepgram Nova-3 only, English only):
  • Advanced keyword detection
  • Format: word1, word2, word3
  • More sophisticated than keywords
Phrase Lists (Azure Speech):
  • Boost recognition of specific terms
  • Format: Comma-separated list
  • Works with all Azure models and languages
Use keywords/keyterms/phrase lists for your company name, product names, and industry-specific terms to improve accuracy.
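The exact config keys for term boosting are not given above, so the field names in this sketch ("keywords", "keyterms") are assumptions; JSON has no comments, so note that only the value formats (word:boost_factor for keywords, a plain word list for keyterms) follow the conventions described above:

```json
{
  "stt_settings": {
    "provider": "deepgram",
    "model": "nova-2",
    "keywords": ["Burki:2.0", "API:1.5"]
  }
}
```

With Nova-3 you would use a keyterms list of plain words instead, since keywords and keyterms are mutually exclusive across models.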

Audio Denoising

Burki Voice AI includes RNNoise for real-time audio denoising, which removes background noise before transcription.
When to Enable:
  • Noisy environments (restaurants, offices, outdoors)
  • Poor phone connections
  • Background music or chatter
Trade-offs:
  • Slightly increases latency (~50-100ms)
  • Improves transcription accuracy in noisy conditions
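No config path for denoising is documented above, so the key in this sketch ("audio_denoising") is a hypothetical placeholder showing where such a toggle would plausibly live:

```json
{
  "stt_settings": {
    "audio_denoising": true
  }
}
```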

Troubleshooting

Speech Detection Problems:
  • AI misses words: Enable denoising or add keywords/phrase lists for important terms
  • Cuts off callers mid-sentence: Increase endpointing (10ms → 500ms) and utterance end timeout
  • Long awkward pauses: Decrease min silence duration for faster internal processing
  • Interrupts slow speakers: Increase endpointing and min silence duration
  • Misses trailing words: Enable VAD events and increase utterance end timeout
Language & Recognition:
  • Wrong language detected: Set correct language code or use “custom” option
  • Technical terms not recognized: Add them as keywords/keyterms/phrase lists with boost factors
  • Company names garbled: Add company/product names to keywords list
Audio Quality:
  • Noisy background: Enable audio denoising and make sure VAD events are enabled
  • Poor phone connection: Enable denoising and use more conservative timing settings
  • Multiple speakers: Use higher silence thresholds to avoid cross-talk issues
Provider-Specific:
  • Deepgram connection issues: Verify your Deepgram API key in Settings → Provider Keys
  • Azure authentication failed: Verify that the subscription key and region match your Speech resource in Settings → Provider Keys
Testing Strategy: Record test calls with different timing settings and listen to the conversation flow. What feels natural to you will feel natural to callers.

Best Practices

  • Start with defaults and adjust based on testing
  • Test with real calls in your target environment
  • Use term boosting (keywords/keyterms/phrase lists) for your business-specific terminology
  • Enable denoising if you expect background noise
  • Monitor call quality and adjust timing as needed
  • Choose the right provider based on your primary needs (speed vs. language support)

How STT Works with Call Management

🔗 STT + Call Management = Natural Conversations

STT Settings control when the provider detects speech has ended. Call Management Settings control how your AI responds to that detected speech. Both must work together for natural conversation flow!
The Flow:
  1. STT detects speech using your timing settings (silence threshold, VAD, etc.)
  2. Call Management decides response using interruption and timeout settings
  3. Result: Natural conversation or awkward pauses
Key Relationships:
  • STT min_silence_duration (internal timeout) should be longer than Call Management interruption_cooldown
  • Lower STT endpointing (more responsive) works well with lower Call Management interruption_threshold
  • Higher STT timing settings pair well with patient Call Management idle_timeout
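The key relationships above can be sketched as one combined configuration. The stt_settings paths come from the sections above; the call_management block, its field names (taken from the setting names mentioned here), and all its values are illustrative assumptions, shown only so that min_silence_duration (1500ms) is visibly longer than interruption_cooldown:

```json
{
  "stt_settings": {
    "endpointing": {
      "silence_threshold": 10,
      "min_silence_duration": 1500
    }
  },
  "call_management": {
    "interruption_cooldown": 1000,
    "idle_timeout": 10
  }
}
```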
Next Step: Configure Call Management settings to control conversation flow after STT detects speech.