Speech-to-Text (STT) converts what callers say into text for your AI to understand. Burki Voice AI supports multiple STT providers—choose based on your needs for speed, language support, or enterprise features.

Provider Comparison

| Feature | Deepgram | ElevenLabs Scribe v2 | Azure Speech |
|---|---|---|---|
| Speed | Ultra-fast (~100ms) | Ultra-fast (~150ms) | Fast (~200ms) |
| Languages | 30+ | 90+ | 100+ |
| Models | Nova 2, Nova 3 | Scribe v2 Realtime | Standard, Enhanced, Neural |
| Term Boosting | Keywords, Keyterms | — | Phrase Lists |
| Best For | Phone calls, Speed | Multi-language, VAD | Enterprise, Multi-language |
| Diarization | ✓ | — | ✓ |
| Real-Time | ✓ | ✓ | ✓ |

⚡ Deepgram

Ultra-Low Latency: ~100ms response time, optimized for phone calls. Nova-3 keyterms for English, Nova-2 for multi-language.

🎙️ ElevenLabs Scribe v2

Multi-Language Excellence: ~150ms latency, 90+ languages, advanced VAD-based speech detection, 93.5% accuracy.

☁️ Azure Speech

Enterprise Scale: 100+ languages, Microsoft ecosystem integration, phrase lists for term boosting, custom speech models.

Deepgram

Deepgram is the default STT provider, optimized for speed and phone call quality.

Models

| Model | Features | Keywords | Keyterms | Best For |
|---|---|---|---|---|
| Nova-3 | Latest, keyterms support | — | ✓ | English calls, best accuracy |
| Nova-2 | Keywords support | ✓ | — | Multi-language, reliable |
| Nova | Keywords support | ✓ | — | Balanced performance |
| Enhanced | Keywords support | ✓ | — | Legacy support |
| Base | Keywords support | ✓ | — | Basic transcription |
Recommended: Use Nova-3 for English calls (supports keyterms) or Nova-2 for other languages (supports keywords).

Configuration

```json
{
  "stt_settings": {
    "provider": "deepgram",
    "model": "nova-3",
    "language": "en-US"
  }
}
```

ElevenLabs Scribe v2

ElevenLabs Scribe v2 Realtime provides ultra-low latency speech recognition with excellent multi-language support and advanced voice activity detection.

Key Features:
  • Ultra-low latency (~150ms) with 93.5% accuracy
  • 90+ languages supported
  • Advanced VAD-based commit strategy
  • Word-level timestamps support
  • Automatic language detection
Setup:
  1. Sign up at ElevenLabs
  2. Get your API key from the dashboard
  3. Configure in assistant settings
Configuration:
```json
{
  "stt_settings": {
    "provider": "elevenlabs",
    "model": "scribe_v2_realtime",
    "language": "en",
    "elevenlabs_config": {
      "commit_strategy": "vad",
      "vad_threshold": 0.4,
      "vad_silence_threshold_secs": 1.5
    }
  }
}
```

📖 Full ElevenLabs Documentation

See the complete ElevenLabs Scribe v2 guide for VAD settings, language options, and best practices.

Azure Speech

Azure Speech provides enterprise-grade speech recognition with broad language support and Microsoft ecosystem integration.

Key Features:
  • 100+ languages and regional variants
  • Phrase lists for domain-specific term boosting
  • Custom speech models for specialized vocabulary
  • Speaker diarization support
Setup:
  1. Create Azure Speech resource in Azure Portal
  2. Get your subscription key and region
  3. Configure in assistant settings
Configuration:
```json
{
  "stt_settings": {
    "provider": "azure",
    "model": "standard",
    "language": "en-US",
    "azure_config": {
      "subscription_key": "your_key",
      "region": "eastus"
    }
  }
}
```

📖 Full Azure Documentation

See the complete Azure Speech STT guide for models, languages, configuration options, and best practices.

Key Settings

  • Provider: Choose Deepgram for speed or Azure for enterprise features
  • Model: Choose based on your needs (Nova-3 for English, Standard for multi-language)
  • Language: Select from common options or enter a custom language code
  • Custom Language: Enter any supported language code (e.g., fr-FR, es-ES)
Speech Detection Timing

These settings control how the STT provider detects when someone has finished speaking. Getting these right is crucial for natural conversation flow.

Endpointing (Silence Threshold)

What it does: How long the provider waits after detecting silence before considering speech has ended.

Technical Details:
  • Measured in: Milliseconds
  • Default: 10ms (minimal endpointing for real-time applications)
  • Range: 10ms - 2000ms (recommended)
  • Config Path: stt_settings.endpointing.silence_threshold
Real Example:
  • 10ms: Very responsive (default) - might cut off slow speakers
  • 500ms: “I need help with…” → 0.5s silence → Provider says “speech ended”
  • 1000ms: More patient (good for people who pause while thinking)
When to Adjust:
  • Lower (10-100ms): For fast talkers or quick interactions (default)
  • Higher (500-1000ms): For elderly callers or complex topics
  • Much higher (1500ms+): For people with speech difficulties
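Following the config path above (stt_settings.endpointing.silence_threshold), a more patient setup for slower speakers might be sketched like this:

```json
{
  "stt_settings": {
    "endpointing": {
      "silence_threshold": 500
    }
  }
}
```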

Min Silence Duration

What it does: Internal timeout for utterance processing when the provider doesn't send speech_final (not sent to the provider API).

Technical Details:
  • Measured in: Milliseconds
  • Default: 1500ms
  • Range: 500ms - 5000ms (recommended)
  • Config Path: stt_settings.endpointing.min_silence_duration
  • Used for: Call handler utterance timeout logic when speech_final is missing
Real Example:
  • 1500ms: Wait 1.5s for speech_final, then process accumulated utterance (default)
  • 1000ms: Quicker timeout for responsive conversation
  • 2500ms: More patience for complex responses or noisy environments
When to Adjust:
  • Lower (500-1000ms): For quick, responsive interactions
  • Higher (2000-3000ms): For environments with background noise where speech_final may be unreliable
  • Match with conversation style: Shorter for rapid-fire Q&A, longer for detailed discussions
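As a sketch, a noisy-environment configuration using the config path above (stt_settings.endpointing.min_silence_duration) could look like:

```json
{
  "stt_settings": {
    "endpointing": {
      "min_silence_duration": 2500
    }
  }
}
```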

Utterance End Timeout

What it does: Maximum time the provider waits for a complete utterance before sending an UtteranceEnd event.

Technical Details:
  • Measured in: Milliseconds
  • Default: 1000ms
  • Range: 500ms - 5000ms (recommended)
  • Config Path: stt_settings.utterance_end_ms
  • API Parameter: utterance_end_ms
Real Example:
  • 1000ms: If someone starts talking but doesn’t finish within 1 second, provider sends UtteranceEnd (default)
  • 500ms: Quick timeout (might cut off long sentences)
  • 2000ms: Patient timeout (good for complex responses)
When to Adjust:
  • Lower (500-800ms): For short, quick interactions
  • Higher (1500-3000ms): For detailed conversations or forms
  • Consider your use case: Customer service vs. quick orders
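Per the config path above, utterance_end_ms sits at the top level of stt_settings. A patient-timeout sketch for detailed conversations:

```json
{
  "stt_settings": {
    "utterance_end_ms": 2000
  }
}
```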

VAD Events

What it does: Enables Voice Activity Detection (VAD) events for enhanced speech detection and UtteranceEnd events.

Technical Details:
  • Type: Boolean (true/false)
  • Default: true (enabled)
  • Config Path: stt_settings.vad_events
  • API Parameter: vad_events
Real Example:
  • true: Enhanced speech detection with UtteranceEnd events when speech_final doesn’t work (recommended)
  • false: Basic speech detection only (legacy mode)
When to Enable:
  • Always recommended: Provides better speech detection in noisy environments
  • Essential for: Background noise, poor connections, multiple speakers
  • Backup mechanism: When speech_final doesn’t trigger due to audio issues
Why It Matters: VAD events provide UtteranceEnd signals as a fallback when normal speech detection fails due to background noise or audio quality issues.

🎯 Timing Settings Quick Guide

Real-Time/Fast Conversations (Default):
  • Endpointing: 10ms, Min Silence: 1500ms, Utterance End: 1000ms, VAD Events: true
Balanced Professional:
  • Endpointing: 300ms, Min Silence: 1500ms, Utterance End: 1500ms, VAD Events: true
Patient/Elderly Callers:
  • Endpointing: 800ms, Min Silence: 2500ms, Utterance End: 2000ms, VAD Events: true
Critical: These settings work together with Call Management interruption settings. Endpointing controls provider responsiveness, Min Silence Duration controls internal timeout handling, and both affect conversation flow timing.
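As a sketch, the "Balanced Professional" preset above maps onto the config paths from the previous sections like this:

```json
{
  "stt_settings": {
    "endpointing": {
      "silence_threshold": 300,
      "min_silence_duration": 1500
    },
    "utterance_end_ms": 1500,
    "vad_events": true
  }
}
```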
Term Boosting

Keywords (Deepgram Nova-2, Nova, Enhanced, Base):
  • Boost recognition of specific words
  • Format: word:boost_factor (e.g., Deepgram:2.0, API:1.5)
  • Great for company names, technical terms
Keyterms (Deepgram Nova-3 only, English only):
  • Advanced keyword detection
  • Format: word1, word2, word3
  • More sophisticated than keywords
Phrase Lists (Azure Speech):
  • Boost recognition of specific terms
  • Format: Comma-separated list
  • Works with all Azure models and languages
Use keywords/keyterms/phrase lists for your company name, product names, and industry-specific terms to improve accuracy.
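The exact config keys for term boosting are not given above, so the field names in this sketch ("keywords", "keyterms") are assumptions; JSON has no comments, so note that only the value formats (word:boost_factor for keywords, a plain word list for keyterms) follow the conventions described above:

```json
{
  "stt_settings": {
    "provider": "deepgram",
    "model": "nova-2",
    "keywords": ["Burki:2.0", "API:1.5"]
  }
}
```

With Nova-3 you would use a keyterms list of plain words instead, since keywords and keyterms are mutually exclusive across models.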

Audio Denoising

Burki Voice AI includes RNNoise for real-time audio denoising, which removes background noise before transcription.
When to Enable:
  • Noisy environments (restaurants, offices, outdoors)
  • Poor phone connections
  • Background music or chatter
Trade-offs:
  • Slightly increases latency (~50-100ms)
  • Improves transcription accuracy in noisy conditions
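No config path for denoising is documented above, so the key in this sketch ("audio_denoising") is a hypothetical placeholder showing where such a toggle would plausibly live:

```json
{
  "stt_settings": {
    "audio_denoising": true
  }
}
```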

Troubleshooting

Speech Detection Problems:
  • AI misses words: Enable denoising or add keywords/phrase lists for important terms
  • Cuts off callers mid-sentence: Increase endpointing (10ms → 500ms) and utterance end timeout
  • Long awkward pauses: Decrease min silence duration for faster internal processing
  • Interrupts slow speakers: Increase endpointing and min silence duration
  • Misses trailing words: Enable VAD events and increase utterance end timeout
Language & Recognition:
  • Wrong language detected: Set correct language code or use “custom” option
  • Technical terms not recognized: Add them as keywords/keyterms/phrase lists with boost factors
  • Company names garbled: Add company/product names to keywords list
Audio Quality:
  • Noisy background: Enable audio denoising and make sure VAD events are enabled
  • Poor phone connection: Enable denoising and use more conservative timing settings
  • Multiple speakers: Use higher silence thresholds to avoid cross-talk issues
Provider-Specific:
  • Deepgram connection issues: Verify your Deepgram API key in Settings → Provider Keys
  • Azure authentication failed: Verify that the subscription key and region match your Speech resource in Settings → Provider Keys
Testing Strategy: Record test calls with different timing settings and listen to the conversation flow. What feels natural to you will feel natural to callers.

Best Practices

  • Start with defaults and adjust based on testing
  • Test with real calls in your target environment
  • Use term boosting (keywords/keyterms/phrase lists) for your business-specific terminology
  • Enable denoising if you expect background noise
  • Monitor call quality and adjust timing as needed
  • Choose the right provider based on your primary needs (speed vs. language support)

How STT Works with Call Management

🔗 STT + Call Management = Natural Conversations

STT Settings control when the provider detects speech has ended. Call Management Settings control how your AI responds to that detected speech. Both must work together for natural conversation flow!
The Flow:
  1. STT detects speech using your timing settings (silence threshold, VAD, etc.)
  2. Call Management decides response using interruption and timeout settings
  3. Result: Natural conversation or awkward pauses
Key Relationships:
  • STT min_silence_duration (internal timeout) should be longer than Call Management interruption_cooldown
  • Lower STT endpointing (more responsive) works well with lower Call Management interruption_threshold
  • Higher STT timing settings pair well with patient Call Management idle_timeout
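The key relationships above can be sketched as one combined configuration. The stt_settings paths come from the sections above; the call_management block, its field names (taken from the setting names mentioned here), and all its values are illustrative assumptions, shown only so that min_silence_duration (1500ms) is visibly longer than interruption_cooldown:

```json
{
  "stt_settings": {
    "endpointing": {
      "silence_threshold": 10,
      "min_silence_duration": 1500
    }
  },
  "call_management": {
    "interruption_cooldown": 1000,
    "idle_timeout": 10
  }
}
```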
Next Step: Configure Call Management settings to control conversation flow after STT detects speech.