ElevenLabs Scribe v2 Realtime is an ultra-low-latency speech-to-text (STT) provider with ~150ms response time, 93.5% accuracy, and support for 90+ languages. It uses advanced Voice Activity Detection (VAD) for intelligent speech boundary detection.
Overview
| Feature | Details |
|---|---|
| Model | scribe_v2_realtime |
| Latency | ~150ms |
| Accuracy | 93.5% |
| Languages | 90+ |
| VAD Support | ✅ Advanced |
| Word Timestamps | ✅ Optional |
| Language Detection | ✅ Optional |
Quick Start
- Get API Key: Sign up at ElevenLabs and get your API key
- Add to Provider Keys: Go to Settings → Provider Keys and add your ElevenLabs API key
- Configure STT: In your assistant settings, select ElevenLabs as the STT provider
Configuration Options
Basic Settings
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | string | scribe_v2_realtime | The STT model to use |
| language | string | en | Language code (ISO 639-1 format) |
VAD Settings
Voice Activity Detection (VAD) controls how the STT provider detects when someone has finished speaking.

| Parameter | Type | Default | Range | Description |
|---|---|---|---|---|
| commit_strategy | string | vad | vad, manual | How transcripts are committed |
| vad_threshold | float | 0.4 | 0.1 - 0.9 | VAD sensitivity (lower = more sensitive) |
| vad_silence_threshold_secs | float | 1.5 | 0.3 - 3.0 | Silence duration to commit transcript |
| min_speech_duration_ms | int | 100 | 50 - 2000 | Minimum speech duration to consider |
| min_silence_duration_ms | int | 100 | 50 - 2000 | Minimum silence duration to consider |
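Because each VAD parameter has a documented range, it can help to check settings before saving them. The following is a minimal sketch, not part of any ElevenLabs SDK: `VAD_RANGES` and `validate_vad_settings` are hypothetical names, and the ranges are assumed to be inclusive.

```python
# Documented ranges from the VAD settings table (inclusive bounds assumed).
VAD_RANGES = {
    "vad_threshold": (0.1, 0.9),
    "vad_silence_threshold_secs": (0.3, 3.0),
    "min_speech_duration_ms": (50, 2000),
    "min_silence_duration_ms": (50, 2000),
}
COMMIT_STRATEGIES = {"vad", "manual"}


def validate_vad_settings(settings: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    strategy = settings.get("commit_strategy", "vad")
    if strategy not in COMMIT_STRATEGIES:
        problems.append(
            f"commit_strategy must be one of {sorted(COMMIT_STRATEGIES)}, got {strategy!r}"
        )
    for name, (low, high) in VAD_RANGES.items():
        if name not in settings:
            continue  # unset parameters are assumed to fall back to the default
        value = settings[name]
        if not low <= value <= high:
            problems.append(f"{name}={value} is outside {low}-{high}")
    return problems


print(validate_vad_settings({"vad_threshold": 1.2, "commit_strategy": "vad"}))
```

An empty result means every supplied value sits inside its documented range; parameters that are left out are simply skipped.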
Additional Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| include_timestamps | boolean | true | Include word-level timestamps |
| include_language_detection | boolean | false | Include detected language in response |
Full Configuration Example
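As a minimal sketch, the parameters from the tables above can be collected into a single settings object using their documented defaults. The name `stt_config` and the `provider` key are assumptions for illustration; the exact shape your assistant settings expect may differ.

```python
import json

# Hypothetical settings object: keys mirror the parameter tables above and
# values are the documented defaults. The "provider" key is an assumption.
stt_config = {
    "provider": "elevenlabs",
    "model": "scribe_v2_realtime",
    "language": "en",
    "commit_strategy": "vad",
    "vad_threshold": 0.4,
    "vad_silence_threshold_secs": 1.5,
    "min_speech_duration_ms": 100,
    "min_silence_duration_ms": 100,
    "include_timestamps": True,
    "include_language_detection": False,
}

# Serialize to JSON, e.g. for pasting into a settings form or API request.
print(json.dumps(stt_config, indent=2))
```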
VAD Commit Strategy
ElevenLabs Scribe v2 uses a VAD-based commit strategy by default, which automatically detects when speech has ended and commits the transcript.
VAD vs Manual Commit
VAD (Voice Activity Detection) - Recommended:
- Automatically detects speech boundaries
- Commits transcript when silence is detected
- Best for natural conversation flow
- Configurable sensitivity and timing
Manual Commit:
- You control when transcripts are committed
- Useful for specific use cases where you need precise control
- Requires handling commit signals in your application
For phone calls, VAD commit strategy is recommended as it provides the most natural conversation experience.
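To make the VAD commit behavior concrete, the toy simulation below walks over per-frame speech probabilities and commits once trailing silence reaches the silence threshold. It is an illustrative model only, not ElevenLabs' actual algorithm; the function name `find_commit_frame`, the 100 ms frame size, and the probability inputs are all assumptions.

```python
def find_commit_frame(speech_probs, vad_threshold=0.4,
                      silence_threshold_secs=1.5, frame_secs=0.1):
    """Return the index of the frame at which a transcript would be committed,
    or None if silence never lasts long enough. Toy model, not the real VAD."""
    heard_speech = False
    silent_frames = 0
    frames_needed = round(silence_threshold_secs / frame_secs)
    for i, prob in enumerate(speech_probs):
        if prob >= vad_threshold:
            heard_speech = True
            silent_frames = 0  # speech resets the silence counter
        elif heard_speech:
            silent_frames += 1
            if silent_frames >= frames_needed:
                return i
    return None


# 0.5 s of speech, a 0.5 s pause, 0.5 s more speech, then 2 s of silence.
probs = [0.9] * 5 + [0.1] * 5 + [0.9] * 5 + [0.1] * 20
print(find_commit_frame(probs))                              # 29
print(find_commit_frame(probs, silence_threshold_secs=0.5))  # 9
```

With the default 1.5 s threshold the short pause is ignored and the commit happens after the speaker has finished. With a 0.5 s threshold the commit lands inside the pause, which is how a transcript ends up cut off mid-sentence.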
VAD Tuning Guide
Adjusting VAD for Different Scenarios
Fast-Paced Conversations
- Lower silence threshold for quicker responses
- More sensitive VAD detection
Patient/Thoughtful Speakers
- Higher silence threshold to avoid cutting off
- More patience for pauses during thinking
Noisy Environments
- Higher VAD threshold to filter noise
- Longer minimum durations to avoid false triggers
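The three scenarios can be expressed as overrides on top of the documented defaults. The specific numbers below are illustrative starting points chosen within the documented ranges, not values recommended by ElevenLabs; `VAD_PRESETS` and `build_vad_settings` are hypothetical names.

```python
# Documented defaults from the VAD settings table.
DEFAULTS = {
    "vad_threshold": 0.4,
    "vad_silence_threshold_secs": 1.5,
    "min_speech_duration_ms": 100,
    "min_silence_duration_ms": 100,
}

# Illustrative overrides only; tune them against real calls.
VAD_PRESETS = {
    # Quicker responses: shorter silence window, more sensitive detection.
    "fast_paced": {"vad_threshold": 0.3, "vad_silence_threshold_secs": 0.8},
    # Thoughtful speakers: wait longer before committing.
    "patient": {"vad_silence_threshold_secs": 2.5},
    # Noisy lines: require stronger, longer speech before triggering.
    "noisy": {
        "vad_threshold": 0.6,
        "min_speech_duration_ms": 250,
        "min_silence_duration_ms": 250,
    },
}


def build_vad_settings(scenario: str) -> dict:
    """Merge a scenario's overrides onto the documented defaults."""
    return {**DEFAULTS, **VAD_PRESETS[scenario]}


print(build_vad_settings("noisy"))
```

Keeping the presets as small override dictionaries makes it obvious which parameters each scenario changes and which stay at their defaults.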
Supported Languages
ElevenLabs Scribe v2 supports 90+ languages. Use the ISO 639-1 language code:
Common Language Codes
| Language | Code | Language | Code |
|---|---|---|---|
| English | en | Spanish | es |
| French | fr | German | de |
| Italian | it | Portuguese | pt |
| Dutch | nl | Polish | pl |
| Russian | ru | Japanese | ja |
| Korean | ko | Chinese | zh |
| Arabic | ar | Hindi | hi |
| Turkish | tr | Swedish | sv |
For the full list of supported languages, visit the ElevenLabs documentation.
Word-Level Timestamps
When include_timestamps is enabled, each word in the transcript includes timing information.
Word timestamps are useful for analytics, keyword spotting, and advanced conversation analysis.
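As a sketch of keyword spotting with word timestamps, the helper below scans a list of timed words for target terms. The word shape used here (`text`, `start`, and `end` in seconds) is an assumption for illustration, as is the function name `spot_keywords`; check the actual response payload before relying on these field names.

```python
def spot_keywords(words, keywords):
    """Return (keyword, start_secs) pairs for each timed word matching a keyword."""
    targets = {k.lower() for k in keywords}
    hits = []
    for word in words:
        # Normalize case and strip trailing punctuation before matching.
        token = word["text"].lower().strip(".,!?")
        if token in targets:
            hits.append((token, word["start"]))
    return hits


# Hypothetical transcript words with timing in seconds.
sample_words = [
    {"text": "I", "start": 0.00, "end": 0.12},
    {"text": "want", "start": 0.15, "end": 0.40},
    {"text": "a", "start": 0.42, "end": 0.48},
    {"text": "refund,", "start": 0.50, "end": 0.95},
    {"text": "please.", "start": 1.00, "end": 1.40},
]
print(spot_keywords(sample_words, ["refund", "cancel"]))  # [('refund', 0.5)]
```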
Best Practices
Start with Defaults
The default VAD settings work well for most phone call scenarios. Only adjust after testing.
Test with Real Calls
Record test calls and listen to the conversation flow. Adjust VAD settings based on actual user experience.
Match Call Style
Fast customer service? Lower the silence threshold. Complex discussions? Raise it and allow more time for pauses.
Enable Denoising
For noisy environments, enable audio denoising in your STT settings alongside VAD tuning.
Troubleshooting
Common Issues
Transcripts cut off mid-sentence:
- Increase vad_silence_threshold_secs (try 2.0 seconds)
- Increase min_silence_duration_ms
Slow response after the caller stops speaking:
- Decrease vad_silence_threshold_secs (try 1.0 seconds)
- Lower vad_threshold for more sensitive detection
Background noise triggers false transcripts:
- Increase vad_threshold (try 0.6)
- Increase min_speech_duration_ms
- Enable audio denoising
Wrong language or inaccurate transcription:
- Set the correct language code explicitly
- Disable include_language_detection if not needed
Authentication or connection errors:
- Verify your ElevenLabs API key in Settings → Provider Keys
- Check your account has sufficient credits
Comparison with Other Providers
| Feature | ElevenLabs Scribe v2 | Deepgram | Azure Speech |
|---|---|---|---|
| Latency | ~150ms | ~100ms | ~200ms |
| Languages | 90+ | 30+ | 100+ |
| VAD | Advanced | Basic | Standard |
| Word Timestamps | ✅ | ✅ | ✅ |
| Term Boosting | ❌ | ✅ Keywords | ✅ Phrase Lists |
| Best For | Multi-language, VAD | Speed, English | Enterprise |
Choose ElevenLabs Scribe v2 when you need excellent multi-language support with advanced VAD capabilities. Choose Deepgram for the absolute lowest latency, or Azure for enterprise features and custom models.