Speech-to-Text (STT) converts what callers say into text for your AI to understand. Burki Voice AI supports multiple STT providers—choose based on your needs for speed, language support, or enterprise features.
Provider Comparison
| Feature | Deepgram | ElevenLabs Scribe v2 | Azure Speech |
|---|---|---|---|
| Speed | Ultra-fast (~100ms) | Ultra-fast (~150ms) | Fast (~200ms) |
| Languages | 30+ | 90+ | 100+ |
| Models | Nova 2, Nova 3 | Scribe v2 Realtime | Standard, Enhanced, Neural |
| Term Boosting | Keywords, Keyterms | ❌ | Phrase Lists |
| Best For | Phone calls, Speed | Multi-language, VAD | Enterprise, Multi-language |
| Diarization | ✅ | ❌ | ✅ |
| Real-Time | ✅ | ✅ | ✅ |
⚡ Deepgram
Ultra-Low Latency: ~100ms response time, optimized for phone calls. Nova-3 keyterms for English, Nova-2 for multi-language.
🎙️ ElevenLabs Scribe v2
Multi-Language Excellence: ~150ms latency, 90+ languages, advanced VAD-based speech detection, 93.5% accuracy.
☁️ Azure Speech
Enterprise Scale: 100+ languages, Microsoft ecosystem integration, phrase lists for term boosting, custom speech models.
Deepgram
Deepgram is the default STT provider, optimized for speed and phone-call quality.
Models
| Model | Features | Keywords | Keyterms | Best For |
|---|---|---|---|---|
| Nova-3 | Latest, keyterms support | ❌ | ✅ | English calls, best accuracy |
| Nova-2 | Keywords support | ✅ | ❌ | Multi-language, reliable |
| Nova | Keywords support | ✅ | ❌ | Balanced performance |
| Enhanced | Keywords support | ✅ | ❌ | Legacy support |
| Base | Keywords support | ✅ | ❌ | Basic transcription |
Recommended: Use Nova-3 for English calls (supports keyterms) or Nova-2 for other languages (supports keywords).
Configuration
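As a starting point, the Deepgram settings can be sketched as a configuration dict. The `endpointing`, `utterance_end_ms`, and `vad_events` paths come from the config paths documented later on this page; the `model` and `language` key names are assumptions for illustration.

```python
# Minimal sketch of Deepgram STT settings (endpointing/utterance_end_ms/
# vad_events paths are from this page; "model" and "language" keys are assumed).
stt_settings = {
    "model": "nova-3",        # Nova-3: best English accuracy, keyterms support
    "language": "en-US",
    "endpointing": {
        "silence_threshold": 10,       # ms of silence before speech is considered ended
        "min_silence_duration": 1500,  # internal utterance timeout (ms)
    },
    "utterance_end_ms": 1000,  # provider-side UtteranceEnd timeout
    "vad_events": True,        # enable VAD-based UtteranceEnd events
}
```

For non-English calls, swapping `"model"` to `"nova-2"` and setting the matching language code follows the recommendation above.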
ElevenLabs Scribe v2
ElevenLabs Scribe v2 Configuration
ElevenLabs Scribe v2 Realtime provides ultra-low latency speech recognition with excellent multi-language support and advanced voice activity detection.
Key Features:
- Ultra-low latency (~150ms) with 93.5% accuracy
- 90+ languages supported
- Advanced VAD-based commit strategy
- Word-level timestamps support
- Automatic language detection
Setup:
- Sign up at ElevenLabs
- Get your API key from the dashboard
- Configure in assistant settings
📖 Full ElevenLabs Documentation
See the complete ElevenLabs Scribe v2 guide for VAD settings, language options, and best practices.
Azure Speech
Azure Speech Configuration
Azure Speech provides enterprise-grade speech recognition with broad language support and Microsoft ecosystem integration.
Key Features:
- 100+ languages and regional variants
- Phrase lists for domain-specific term boosting
- Custom speech models for specialized vocabulary
- Speaker diarization support
Setup:
- Create Azure Speech resource in Azure Portal
- Get your subscription key and region
- Configure in assistant settings
📖 Full Azure Documentation
See the complete Azure Speech STT guide for models, languages, configuration options, and best practices.
Key Settings
Model & Language
- Provider: Choose Deepgram for speed or Azure for enterprise features
- Model: Choose based on your needs (Nova-3 for English, Standard for multi-language)
- Language: Select from common options or enter a custom language code
- Custom Language: Enter any supported language code (e.g., `fr-FR`, `es-ES`)
Advanced Timing Controls
These settings control how the STT provider detects when someone has finished speaking. Getting these right is crucial for natural conversation flow.
Endpointing (Silence Threshold)
What it does: How long the provider waits after detecting silence before considering speech has ended.
Technical Details:
- Measured in: Milliseconds
- Default: 10ms (minimal endpointing for real-time applications)
- Range: 10ms - 2000ms (recommended)
- Config Path: `stt_settings.endpointing.silence_threshold`
Examples:
- 10ms: Very responsive (default) - might cut off slow speakers
- 500ms: "I need help with…" → 0.5s silence → provider says "speech ended"
- 1000ms: More patient (good for people who pause while thinking)
Recommendations:
- Lower (10-100ms): For fast talkers or quick interactions (default)
- Higher (500-1000ms): For elderly callers or complex topics
- Much higher (1500ms+): For people with speech difficulties
Min Silence Duration
What it does: Internal timeout for utterance processing when the provider doesn't send `speech_final` (not sent to the provider API).
Technical Details:
- Measured in: Milliseconds
- Default: 1500ms
- Range: 500ms - 5000ms (recommended)
- Config Path: `stt_settings.endpointing.min_silence_duration`
- Used for: Call handler utterance timeout logic when `speech_final` is missing
Examples:
- 1500ms: Wait 1.5s for `speech_final`, then process the accumulated utterance (default)
- 1000ms: Quicker timeout for responsive conversation
- 2500ms: More patience for complex responses or noisy environments
Recommendations:
- Lower (500-1000ms): For quick, responsive interactions
- Higher (2000-3000ms): For environments with background noise where `speech_final` may be unreliable
- Match with conversation style: Shorter for rapid-fire Q&A, longer for detailed discussions
Utterance End Timeout
What it does: Maximum time the provider waits for a complete utterance before sending an UtteranceEnd event.
Technical Details:
- Measured in: Milliseconds
- Default: 1000ms
- Range: 500ms - 5000ms (recommended)
- Config Path: `stt_settings.utterance_end_ms`
- API Parameter: `utterance_end_ms`
Examples:
- 1000ms: If someone starts talking but doesn't finish within 1 second, the provider sends UtteranceEnd (default)
- 500ms: Quick timeout (might cut off long sentences)
- 2000ms: Patient timeout (good for complex responses)
Recommendations:
- Lower (500-800ms): For short, quick interactions
- Higher (1500-3000ms): For detailed conversations or forms
- Consider your use case: Customer service vs. quick orders
VAD Events
What it does: Enables Voice Activity Detection events for enhanced speech detection and UtteranceEnd events.
Technical Details:
- Type: Boolean (true/false)
- Default: true (enabled)
- Config Path: `stt_settings.vad_events`
- API Parameter: `vad_events`
Options:
- true: Enhanced speech detection with UtteranceEnd events when `speech_final` doesn't work (recommended)
- false: Basic speech detection only (legacy mode)
Recommendations:
- Always recommended: Provides better speech detection in noisy environments
- Essential for: Background noise, poor connections, multiple speakers
- Backup mechanism: When `speech_final` doesn't trigger due to audio issues
🎯 Timing Settings Quick Guide
- Real-Time/Fast Conversations (default): Endpointing: 10ms, Min Silence: 1500ms, Utterance End: 1000ms, VAD Events: true
- Balanced Conversations: Endpointing: 300ms, Min Silence: 1500ms, Utterance End: 1500ms, VAD Events: true
- Patient/Slow Speakers: Endpointing: 800ms, Min Silence: 2500ms, Utterance End: 2000ms, VAD Events: true
Critical: These settings work together with Call Management interruption settings. Endpointing controls provider responsiveness, Min Silence Duration controls internal timeout handling, and both affect conversation flow timing.
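The three presets above can be captured as dicts and merged into an assistant's settings. The key names follow the config paths documented on this page; the preset names ("fast", "balanced", "patient") and the helper function are our own sketch, not a Burki API.

```python
# Sketch: the three timing presets from the quick guide as dicts
# (key names follow this page's config paths; preset names are ours).
TIMING_PRESETS = {
    "fast": {"silence_threshold": 10, "min_silence_duration": 1500,
             "utterance_end_ms": 1000, "vad_events": True},
    "balanced": {"silence_threshold": 300, "min_silence_duration": 1500,
                 "utterance_end_ms": 1500, "vad_events": True},
    "patient": {"silence_threshold": 800, "min_silence_duration": 2500,
                "utterance_end_ms": 2000, "vad_events": True},
}

def apply_preset(settings: dict, name: str) -> dict:
    """Merge a named timing preset into an existing stt_settings dict."""
    preset = TIMING_PRESETS[name]
    merged = dict(settings)  # leave the caller's dict untouched
    merged["endpointing"] = {
        "silence_threshold": preset["silence_threshold"],
        "min_silence_duration": preset["min_silence_duration"],
    }
    merged["utterance_end_ms"] = preset["utterance_end_ms"]
    merged["vad_events"] = preset["vad_events"]
    return merged
```

Starting from "fast" (the default) and moving toward "patient" only when test calls show cut-offs keeps latency as low as possible.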
Processing Options
Keywords & Keyterms
Keywords (Deepgram Nova-2, Nova, Enhanced, Base):
- Boost recognition of specific words
- Format: `word:boost_factor` (e.g., `Deepgram:2.0`, `API:1.5`)
- Great for company names, technical terms
Keyterms (Deepgram Nova-3):
- Advanced keyword detection
- Format: `word1, word2, word3`
- More sophisticated than keywords
Phrase Lists (Azure):
- Boost recognition of specific terms
- Format: Comma-separated list
- Works with all Azure models and languages
Use keywords/keyterms/phrase lists for your company name, product names, and industry-specific terms to improve accuracy.
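Building a boosted term list can be sketched like this. The `word:boost_factor` string format is from this page; the helper function name is ours for illustration.

```python
# Sketch: assembling term-boosting lists (the word:boost format is from
# this page; format_keywords is a hypothetical helper, not a Burki API).
def format_keywords(boosts: dict[str, float]) -> list[str]:
    """Turn {"Deepgram": 2.0} into ["Deepgram:2.0"] for Nova-2/Nova/Enhanced/Base."""
    return [f"{word}:{boost}" for word, boost in boosts.items()]

keywords = format_keywords({"Deepgram": 2.0, "Burki": 1.5})

# Nova-3 keyterms and Azure phrase lists are plain comma-separated terms instead:
keyterms = ", ".join(["Deepgram", "Burki", "STT"])
```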
Audio Denoising
Burki Voice AI includes RNNoise for real-time audio denoising, which removes background noise before transcription.
Best for:
- Noisy environments (restaurants, offices, outdoors)
- Poor phone connections
- Background music or chatter
Trade-offs:
- Slightly increases latency (~50-100ms)
- Improves transcription accuracy in noisy conditions
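Because denoising runs before transcription, its overhead adds directly to the provider's latency. A rough budget, using the figures from the comparison table above (the 75ms midpoint estimate is ours):

```python
# Rough latency-budget arithmetic: denoising sits in front of the STT
# provider, so its ~50-100ms overhead adds to the provider's own latency.
PROVIDER_LATENCY_MS = {"deepgram": 100, "elevenlabs": 150, "azure": 200}
DENOISE_OVERHEAD_MS = 75  # midpoint of the ~50-100ms range quoted above

def stt_latency(provider: str, denoising: bool) -> int:
    """Approximate end-of-speech-to-transcript latency in milliseconds."""
    return PROVIDER_LATENCY_MS[provider] + (DENOISE_OVERHEAD_MS if denoising else 0)
```

So Deepgram with denoising still lands near Azure without it, which is why enabling denoising is usually worth it in noisy environments.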
Troubleshooting
Common STT Issues & Solutions
Speech Detection Problems:
- AI misses words: Enable denoising or add keywords/phrase lists for important terms
- Cuts off callers mid-sentence: Increase endpointing (10ms → 500ms) and utterance end timeout
- Long awkward pauses: Decrease min silence duration for faster internal processing
- Interrupts slow speakers: Increase endpointing and min silence duration
- Misses trailing words: Enable VAD events and increase utterance end timeout
Accuracy Problems:
- Wrong language detected: Set the correct language code or use the "custom" option
- Technical terms not recognized: Add them as keywords/keyterms/phrase lists with boost factors
- Company names garbled: Add company/product names to the keywords list
Environment Issues:
- Noisy background: Enable audio denoising and increase VAD turnoff
- Poor phone connection: Enable denoising and use more conservative timing settings
- Multiple speakers: Use higher silence thresholds to avoid cross-talk issues
Connection Issues:
- Deepgram connection issues: Verify your Deepgram API key in Settings → Provider Keys
- Azure authentication failed: Verify that the subscription key and region match your Speech resource in Settings → Provider Keys
Testing Strategy: Record test calls with different timing settings and listen to the conversation flow. What feels natural to you will feel natural to callers.
Best Practices
- Start with defaults and adjust based on testing
- Test with real calls in your target environment
- Use term boosting (keywords/keyterms/phrase lists) for your business-specific terminology
- Enable denoising if you expect background noise
- Monitor call quality and adjust timing as needed
- Choose the right provider based on your primary needs (speed vs. language support)
How STT Works with Call Management
🔗 STT + Call Management = Natural Conversations
STT Settings control when the provider detects that speech has ended. Call Management Settings control how your AI responds to that detected speech. Both must work together for natural conversation flow!
- STT detects speech using your timing settings (silence threshold, VAD, etc.)
- Call Management decides response using interruption and timeout settings
- Result: Natural conversation or awkward pauses
- STT `min_silence_duration` (internal timeout) should be longer than Call Management `interruption_cooldown`
- Lower STT `endpointing` (more responsive) works well with a lower Call Management `interruption_threshold`
- Higher STT timing settings pair well with a patient Call Management `idle_timeout`
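The first pairing rule above can be sanity-checked automatically. This is our own sketch (the function and warning text are not part of Burki); the setting names follow this page and the Call Management docs.

```python
# Sketch: validating STT timing against Call Management settings
# (hypothetical helper; setting names follow this page).
def check_timing(stt: dict, call_mgmt: dict) -> list[str]:
    """Return warnings for mismatched STT / Call Management timing."""
    warnings = []
    # Rule from this page: the internal STT timeout should outlast the
    # interruption cooldown so utterances aren't finalized mid-cooldown.
    if stt["min_silence_duration"] <= call_mgmt["interruption_cooldown"]:
        warnings.append(
            "min_silence_duration should be longer than interruption_cooldown"
        )
    return warnings
```

Running a check like this at assistant-save time catches timing mismatches before they surface as awkward pauses on live calls.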
Next Step: Configure Call Management settings to control conversation flow after STT detects speech.