ElevenLabs Scribe v2 Realtime is an ultra-low-latency speech-to-text (STT) provider with ~150ms response time, 93.5% accuracy, and support for 90+ languages. It uses advanced Voice Activity Detection (VAD) to detect where speech starts and ends.

Overview

| Feature | Details |
| --- | --- |
| Model | scribe_v2_realtime |
| Latency | ~150ms |
| Accuracy | 93.5% |
| Languages | 90+ |
| VAD Support | ✅ Advanced |
| Word Timestamps | ✅ Optional |
| Language Detection | ✅ Optional |

Quick Start

  1. Get API Key: Sign up at ElevenLabs and get your API key
  2. Add to Provider Keys: Go to Settings > Provider Keys and add your ElevenLabs API key
  3. Configure STT: In your assistant settings, select ElevenLabs as the STT provider
{
  "stt_settings": {
    "provider": "elevenlabs",
    "model": "scribe_v2_realtime",
    "language": "en"
  }
}

Configuration Options

Basic Settings

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model | string | scribe_v2_realtime | The STT model to use |
| language | string | en | Language code (ISO 639-1 format) |

VAD Settings

Voice Activity Detection (VAD) controls how the STT provider detects when someone has finished speaking.
| Parameter | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| commit_strategy | string | vad | vad, manual | How transcripts are committed |
| vad_threshold | float | 0.4 | 0.1 - 0.9 | VAD sensitivity (lower = more sensitive) |
| vad_silence_threshold_secs | float | 1.5 | 0.3 - 3.0 | Silence duration to commit transcript |
| min_speech_duration_ms | int | 100 | 50 - 2000 | Minimum speech duration to consider |
| min_silence_duration_ms | int | 100 | 50 - 2000 | Minimum silence duration to consider |
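The ranges in the table above can be checked before a configuration is saved. The following is a minimal sketch, not part of any ElevenLabs SDK: `validate_vad_settings` is a hypothetical helper, and it assumes the range bounds are inclusive, which the table does not state.

```python
# Allowed ranges copied from the VAD settings table (bounds assumed inclusive).
VAD_RANGES = {
    "vad_threshold": (0.1, 0.9),
    "vad_silence_threshold_secs": (0.3, 3.0),
    "min_speech_duration_ms": (50, 2000),
    "min_silence_duration_ms": (50, 2000),
}
COMMIT_STRATEGIES = {"vad", "manual"}


def validate_vad_settings(config: dict) -> list:
    """Return a list of problems; an empty list means the values are in range."""
    problems = []
    strategy = config.get("commit_strategy", "vad")
    if strategy not in COMMIT_STRATEGIES:
        problems.append(f"commit_strategy must be 'vad' or 'manual', got {strategy!r}")
    for name, (low, high) in VAD_RANGES.items():
        if name not in config:
            continue  # omitted parameters fall back to the documented default
        value = config[name]
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{name} must be a number, got {value!r}")
        elif not low <= value <= high:
            problems.append(f"{name}={value} is outside the range {low} - {high}")
    return problems


# The documented defaults pass; out-of-range values are reported one by one.
print(validate_vad_settings({"vad_threshold": 0.4, "vad_silence_threshold_secs": 1.5}))
print(validate_vad_settings({"vad_threshold": 1.2, "min_speech_duration_ms": 10}))
```

Returning a list of messages rather than raising on the first error lets a settings form show every out-of-range field at once.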

Additional Options

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| include_timestamps | boolean | true | Include word-level timestamps |
| include_language_detection | boolean | false | Include detected language in response |

Full Configuration Example

{
  "stt_settings": {
    "provider": "elevenlabs",
    "model": "scribe_v2_realtime",
    "language": "en",
    "elevenlabs_config": {
      "commit_strategy": "vad",
      "vad_threshold": 0.4,
      "vad_silence_threshold_secs": 1.5,
      "min_speech_duration_ms": 100,
      "min_silence_duration_ms": 100,
      "include_timestamps": true,
      "include_language_detection": false
    }
  }
}
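If you generate this configuration from code, one option is to start from the documented defaults and override only what differs. The sketch below assumes the JSON shape shown in the example above; `build_stt_settings` is a hypothetical helper name, and how the resulting JSON is submitted to your assistant is outside its scope.

```python
import json

# Defaults taken from the configuration tables on this page.
DEFAULT_ELEVENLABS_CONFIG = {
    "commit_strategy": "vad",
    "vad_threshold": 0.4,
    "vad_silence_threshold_secs": 1.5,
    "min_speech_duration_ms": 100,
    "min_silence_duration_ms": 100,
    "include_timestamps": True,
    "include_language_detection": False,
}


def build_stt_settings(language: str = "en", **overrides) -> dict:
    """Build an stt_settings payload, overriding selected defaults."""
    unknown = set(overrides) - set(DEFAULT_ELEVENLABS_CONFIG)
    if unknown:
        # Catch typos such as "vad_treshold" before they are silently ignored.
        raise ValueError(f"Unknown option(s): {sorted(unknown)}")
    return {
        "stt_settings": {
            "provider": "elevenlabs",
            "model": "scribe_v2_realtime",
            "language": language,
            "elevenlabs_config": {**DEFAULT_ELEVENLABS_CONFIG, **overrides},
        }
    }


settings = build_stt_settings(language="es", vad_silence_threshold_secs=2.0)
print(json.dumps(settings, indent=2))
```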

VAD Commit Strategy

ElevenLabs Scribe v2 uses a VAD-based commit strategy by default, which automatically detects when speech has ended and commits the transcript.
VAD (Voice Activity Detection) - Recommended:
  • Automatically detects speech boundaries
  • Commits transcript when silence is detected
  • Best for natural conversation flow
  • Configurable sensitivity and timing
Manual Commit:
  • You control when transcripts are committed
  • Useful for specific use cases where you need precise control
  • Requires handling commit signals in your application
For phone calls, VAD commit strategy is recommended as it provides the most natural conversation experience.
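To build intuition for how a silence-based commit behaves, here is a deliberately simplified toy model. It is not ElevenLabs' actual VAD algorithm: it assumes audio arrives as fixed-size frames, each with a speech probability, and it ignores min_speech_duration_ms and min_silence_duration_ms. The function name and the `frame_ms` parameter are illustrative only.

```python
def find_commit_points(frame_probs, frame_ms=100, vad_threshold=0.4,
                       vad_silence_threshold_secs=1.5):
    """Return the times (in seconds) at which this toy model would commit."""
    silence_needed_ms = round(vad_silence_threshold_secs * 1000)
    commits = []
    silence_ms = 0
    heard_speech = False
    for i, prob in enumerate(frame_probs):
        if prob >= vad_threshold:
            # Frame counts as speech: reset the silence timer.
            heard_speech = True
            silence_ms = 0
        elif heard_speech:
            # Silence after speech: commit once it has lasted long enough.
            silence_ms += frame_ms
            if silence_ms >= silence_needed_ms:
                commits.append((i + 1) * frame_ms / 1000)
                heard_speech = False
                silence_ms = 0
    return commits


# 0.5 s of speech, a 0.8 s pause, 0.5 s of speech, then 2 s of silence.
frames = [0.9] * 5 + [0.1] * 8 + [0.9] * 5 + [0.1] * 20

print(find_commit_points(frames))  # [3.3]
print(find_commit_points(frames, vad_silence_threshold_secs=0.8))  # [1.3, 2.6]
```

With the default 1.5-second silence threshold, the 0.8-second pause is treated as part of one utterance and a single commit happens at the end. Lowering the threshold to 0.8 seconds splits the same audio into two commits, which is faster but risks cutting the speaker off mid-thought.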

VAD Tuning Guide

Fast-Paced Conversations

{
  "vad_threshold": 0.3,
  "vad_silence_threshold_secs": 0.8,
  "min_speech_duration_ms": 50,
  "min_silence_duration_ms": 50
}
  • Lower silence threshold for quicker responses
  • More sensitive VAD detection

Patient/Thoughtful Speakers

{
  "vad_threshold": 0.5,
  "vad_silence_threshold_secs": 2.0,
  "min_speech_duration_ms": 150,
  "min_silence_duration_ms": 150
}
  • Higher silence threshold to avoid cutting off
  • More patience for pauses during thinking

Noisy Environments

{
  "vad_threshold": 0.6,
  "vad_silence_threshold_secs": 1.5,
  "min_speech_duration_ms": 200,
  "min_silence_duration_ms": 200
}
  • Higher VAD threshold to filter noise
  • Longer minimum durations to avoid false triggers
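The three tuning profiles above can be kept as named presets and merged into an existing configuration. This is a sketch under assumptions: the preset names and the `apply_preset` helper are made up for illustration, and only the values come from this guide.

```python
# Values copied from the tuning guide; the preset names are illustrative.
VAD_PRESETS = {
    "fast_paced": {
        "vad_threshold": 0.3,
        "vad_silence_threshold_secs": 0.8,
        "min_speech_duration_ms": 50,
        "min_silence_duration_ms": 50,
    },
    "patient": {
        "vad_threshold": 0.5,
        "vad_silence_threshold_secs": 2.0,
        "min_speech_duration_ms": 150,
        "min_silence_duration_ms": 150,
    },
    "noisy": {
        "vad_threshold": 0.6,
        "vad_silence_threshold_secs": 1.5,
        "min_speech_duration_ms": 200,
        "min_silence_duration_ms": 200,
    },
}


def apply_preset(elevenlabs_config: dict, preset_name: str) -> dict:
    """Return a copy of the config with the preset's VAD values applied."""
    if preset_name not in VAD_PRESETS:
        raise KeyError(f"Unknown preset {preset_name!r}; choose from {sorted(VAD_PRESETS)}")
    # Non-VAD keys such as include_timestamps are left untouched.
    return {**elevenlabs_config, **VAD_PRESETS[preset_name]}


base = {"commit_strategy": "vad", "include_timestamps": True}
print(apply_preset(base, "noisy"))
```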

Supported Languages

ElevenLabs Scribe v2 supports 90+ languages. Use the ISO 639-1 language code:
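| Language | Code | Language | Code |
| --- | --- | --- | --- |
| English | en | Spanish | es |
| French | fr | German | de |
| Italian | it | Portuguese | pt |
| Dutch | nl | Polish | pl |
| Russian | ru | Japanese | ja |
| Korean | ko | Chinese | zh |
| Arabic | ar | Hindi | hi |
| Turkish | tr | Swedish | sv |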
For the full list of supported languages, visit the ElevenLabs documentation.
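Language values often arrive from other systems as locale tags such as en-US. The sketch below reduces a tag to its two-letter primary subtag, on the assumption that the language setting expects a bare ISO 639-1 code as described above. `normalize_language_code` is a hypothetical helper, and it only checks the format, not whether the provider actually supports the language.

```python
import re


def normalize_language_code(tag: str) -> str:
    """Reduce a tag like 'en-US' or 'PT_br' to a two-letter lowercase code."""
    primary = re.split(r"[-_]", tag.strip())[0].lower()
    if not re.fullmatch(r"[a-z]{2}", primary):
        raise ValueError(f"{tag!r} does not start with a two-letter language code")
    return primary


print(normalize_language_code("en-US"))  # en
print(normalize_language_code("PT_br"))  # pt
```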

Word-Level Timestamps

When include_timestamps is enabled, each word in the transcript includes timing information:
{
  "text": "Hello, how can I help you today?",
  "words": [
    { "word": "Hello", "start": 0.0, "end": 0.45 },
    { "word": "how", "start": 0.52, "end": 0.68 },
    { "word": "can", "start": 0.70, "end": 0.85 },
    { "word": "I", "start": 0.87, "end": 0.92 },
    { "word": "help", "start": 0.95, "end": 1.15 },
    { "word": "you", "start": 1.18, "end": 1.32 },
    { "word": "today", "start": 1.35, "end": 1.72 }
  ]
}
Word timestamps are useful for analytics, keyword spotting, and advanced conversation analysis.
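As a small illustration of those uses, the sketch below works on a response shaped like the example above. The helper names are hypothetical, and it assumes `start` and `end` are expressed in seconds, which the example suggests but does not state.

```python
response = {
    "text": "Hello, how can I help you today?",
    "words": [
        {"word": "Hello", "start": 0.0, "end": 0.45},
        {"word": "how", "start": 0.52, "end": 0.68},
        {"word": "can", "start": 0.70, "end": 0.85},
        {"word": "I", "start": 0.87, "end": 0.92},
        {"word": "help", "start": 0.95, "end": 1.15},
        {"word": "you", "start": 1.18, "end": 1.32},
        {"word": "today", "start": 1.35, "end": 1.72},
    ],
}


def utterance_duration(words):
    """Time from the start of the first word to the end of the last."""
    return words[-1]["end"] - words[0]["start"] if words else 0.0


def find_keyword(words, keyword):
    """Start times of every occurrence of keyword (case-insensitive)."""
    target = keyword.lower()
    return [w["start"] for w in words if w["word"].strip(".,!?").lower() == target]


def longest_pause(words):
    """Largest gap between the end of one word and the start of the next."""
    gaps = [nxt["start"] - cur["end"] for cur, nxt in zip(words, words[1:])]
    return max(gaps, default=0.0)


words = response["words"]
print(utterance_duration(words))        # 1.72
print(find_keyword(words, "help"))      # [0.95]
print(round(longest_pause(words), 2))   # 0.07
```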

Best Practices

Start with Defaults

The default VAD settings work well for most phone call scenarios. Only adjust after testing.

Test with Real Calls

Record test calls and listen to the conversation flow. Adjust VAD settings based on actual user experience.

Match Call Style

For fast-paced customer service, lower vad_threshold and vad_silence_threshold_secs. For complex discussions, raise them so callers have time to pause and think.

Enable Denoising

For noisy environments, enable audio denoising in your STT settings alongside VAD tuning.

Troubleshooting

Transcripts cut off mid-sentence:
  • Increase vad_silence_threshold_secs (try 2.0 seconds)
  • Increase min_silence_duration_ms
Long pauses before AI responds:
  • Decrease vad_silence_threshold_secs (try 1.0 seconds)
  • Lower vad_threshold for more sensitive detection
Background noise triggering false transcripts:
  • Increase vad_threshold (try 0.6)
  • Increase min_speech_duration_ms
  • Enable audio denoising
Wrong language detected:
  • Set the correct language code explicitly
  • Disable include_language_detection if not needed
Connection issues:
  • Verify your ElevenLabs API key in Settings > Provider Keys
  • Check your account has sufficient credits

Comparison with Other Providers

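| Feature | ElevenLabs Scribe v2 | Deepgram | Azure Speech |
| --- | --- | --- | --- |
| Latency | ~150ms | ~100ms | ~200ms |
| Languages | 90+ | 30+ | 100+ |
| VAD | Advanced | Basic | Standard |
| Word Timestamps | | | |
| Term Boosting | | ✅ Keywords | ✅ Phrase Lists |
| Best For | Multi-language, VAD | Speed, English | Enterprise |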
Choose ElevenLabs Scribe v2 when you need excellent multi-language support with advanced VAD capabilities. Choose Deepgram for the absolute lowest latency, or Azure for enterprise features and custom models.