End of Speech (EOS) detection determines when a caller has finished their turn and the assistant should start responding. It is the single most impactful setting for perceived conversation latency — a fast EOS means the assistant responds quickly, but a premature EOS means the assistant cuts the caller off mid-sentence. EOS works downstream of VAD. While VAD detects whether the caller is actively making sound (frame-by-frame), EOS decides whether a period of silence means “I’m done talking” or “I’m pausing to think”.

Providers

Rapida supports three EOS providers, ranging from a simple silence timer to ML-powered turn detection models.
| Provider | Description |
|---|---|
| Silence-Based | Fixed silence timeout. Simple, reliable, zero compute overhead. The default. |
| Pipecat Smart Turn | Whisper-based audio model (~8 MB). Predicts turn completion directly from speech audio. |
| LiveKit Turn Detector | Language model that predicts turn completion from transcribed text. Context-aware with conversation history. |

Silence-Based EOS

The simplest approach: after the last speech activity, wait for a fixed duration of silence, then trigger end-of-speech. No ML model, no inference overhead.

Why choose Silence-Based:
  • Zero additional compute — no model to load or run
  • Predictable, deterministic behaviour — the timeout is exactly what you configure
  • Works with any language, any accent, any audio quality
  • Easiest to reason about and debug
When to use: Most deployments, especially when starting out. Silence-based EOS is the default and works well for the majority of voice AI use cases. It is the right choice when you want simplicity and predictability, or when your callers speak in short, clear turns (IVR menus, yes/no questions, appointment booking).

When it falls short: Callers who pause mid-sentence (e.g., “I’d like to book a flight to… hmm… London”) will be interrupted if the pause exceeds the timeout. This is where model-based EOS providers add value.

Parameters

| Parameter | Config Key | Default | Range | Description |
|---|---|---|---|---|
| Activity Timeout | microphone.eos.timeout | 1000 ms | 500 – 4000 ms | Duration of silence after the last speech activity before triggering end-of-speech. |
Timeout tuning guide:
  • 500 – 600 ms — Very fast. The assistant responds almost immediately when the caller pauses. Best for IVR-style interactions with short, predictable answers (“yes”, “no”, “option 2”). Callers will be cut off if they pause to think.
  • 700 – 800 ms — Fast and balanced. Good default for most conversational assistants. Short enough to feel responsive, long enough for typical sentence-internal pauses.
  • 1000 – 1500 ms — Relaxed. Gives callers time to pause and continue. Good for complex conversations where callers need to recall information (account numbers, addresses, medical details).
  • 2000 – 4000 ms — Very patient. Use for elderly callers, non-native speakers, or scenarios where callers frequently pause mid-thought. Increases perceived latency significantly.
The default when switching to Silence-Based EOS in the UI is 700 ms. The backend default (when no value is set) is 1000 ms. The 700 ms UI default is optimized for a balance between responsiveness and natural conversation flow.
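In pseudocode terms, silence-based EOS is just a resettable countdown: each burst of speech activity restarts it, and EOS fires once the configured silence has elapsed. A minimal Python sketch (class and method names are illustrative, not Rapida's actual API):

```python
import time

class SilenceEOS:
    """Minimal sketch of a silence-based EOS timer (hypothetical names)."""

    def __init__(self, timeout_ms: int = 1000):
        self.timeout_ms = timeout_ms   # corresponds to microphone.eos.timeout
        self.last_activity = None      # monotonic timestamp of last speech

    def on_speech_activity(self) -> None:
        # VAD heartbeat: the caller is still making sound, restart the countdown.
        self.last_activity = time.monotonic()

    def is_end_of_speech(self) -> bool:
        # True once the configured silence duration has elapsed since the
        # last speech activity.
        if self.last_activity is None:
            return False
        return (time.monotonic() - self.last_activity) * 1000 >= self.timeout_ms
```

Because the timer only ever compares elapsed silence against one fixed number, behaviour is fully deterministic, which is exactly why this provider is the easiest to debug.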

Pipecat Smart Turn EOS

Pipecat Smart Turn uses a Whisper-based audio model (~8 MB) to predict whether the caller has finished their turn directly from the speech audio waveform. Unlike silence-based detection, it understands prosodic cues — falling intonation, slowing speech rate, and other acoustic signals that indicate turn completion.

Why choose Pipecat Smart Turn:
  • Detects turn completion from audio features, not just silence — catches prosodic cues like falling intonation at the end of a sentence
  • ~10 ms inference time per prediction — negligible latency impact
  • Supports 23 languages out of the box
  • Small model size (~8 MB ONNX)
  • Uses a rolling audio buffer (~5 seconds) for context — doesn’t need the full conversation history
When to use: Conversations where callers frequently pause mid-sentence. Pipecat Smart Turn significantly reduces premature turn-taking compared to silence-based detection because it can distinguish between a “thinking pause” (flat or rising intonation) and a “finished speaking” pause (falling intonation, complete sentence prosody).

Best for: customer support, complex information gathering (addresses, travel bookings), multilingual deployments where pause patterns vary by language.

How it works:
  1. Audio from the caller is accumulated in a rolling buffer (max ~5 seconds at 16 kHz)
  2. When a final STT transcript arrives, the model runs inference on the buffered audio
  3. The model outputs a probability between 0 and 1 indicating likelihood of turn completion
  4. If probability >= threshold → use quick_timeout (short wait, then fire)
  5. If probability < threshold → use silence_timeout (long wait, keep listening)
  6. Interim STT transcripts reset the timer with the fallback_timeout
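Steps 3 – 6 reduce to a single timeout-selection decision. A sketch of that decision as one function, with defaults mirroring the documented parameters (the function name and signature are illustrative, not Rapida's implementation):

```python
def select_timeout(probability, is_final: bool,
                   threshold: float = 0.5,          # microphone.eos.threshold
                   quick_timeout_ms: int = 200,     # microphone.eos.quick_timeout
                   silence_timeout_ms: int = 2000,  # microphone.eos.silence_timeout
                   fallback_timeout_ms: int = 500   # microphone.eos.timeout
                   ) -> int:
    """Pick which silence timeout to arm after an STT transcript.

    `probability` is the model's turn-completion score, or None if
    inference failed.
    """
    if not is_final or probability is None:
        # Interim transcripts and inference failures fall back to plain
        # silence-based behaviour.
        return fallback_timeout_ms
    if probability >= threshold:
        return quick_timeout_ms    # model says "turn complete": short wait
    return silence_timeout_ms      # model says "still speaking": keep listening
```

Note that even a confident "turn complete" prediction still waits for the quick timeout, which is what gives callers the brief correction window described below.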

Parameters

| Parameter | Config Key | Default | Range | Description |
|---|---|---|---|---|
| Turn Completion Threshold | microphone.eos.threshold | 0.5 | 0.1 – 0.9 | Probability threshold for turn completion. When the model’s prediction exceeds this value, it considers the turn complete and uses the quick timeout. |
| Quick Timeout | microphone.eos.quick_timeout | 200 ms | 50 – 1000 ms | Short silence buffer after the model predicts “turn complete”. Gives the caller a brief window to correct themselves (“yes… actually wait”) before the assistant responds. |
| Extended Timeout | microphone.eos.silence_timeout | 2000 ms | 500 – 5000 ms | Silence duration used when the model predicts the caller is still speaking. Acts as a long patience window for mid-thought pauses. |
| Fallback Timeout | microphone.eos.timeout | 500 ms | 500 – 4000 ms | Silence timeout used for interim STT transcripts and when model inference fails. Falls back to simple silence-based behaviour. |
Turn Completion Threshold (0.1 – 0.9)

| Range | Behaviour |
|---|---|
| 0.1 – 0.3 | Aggressive. The model triggers on weak signals. Faster responses but more false triggers. Good for IVR-style interactions. |
| 0.4 – 0.6 | Balanced. Default is 0.5. The model needs moderate confidence before triggering. Best for general-purpose conversations. |
| 0.7 – 0.9 | Conservative. Only strong turn-completion signals trigger. Use when false interruptions are very costly (legal, medical). |
Quick Timeout (50 – 1000 ms)

| Range | Behaviour |
|---|---|
| 50 – 150 ms | Almost instant response after the model says “done”. Snappy but no correction window. |
| 200 – 300 ms | Default range. Brief correction window. Good balance. |
| 500 – 1000 ms | Long correction window. Use if callers frequently say “wait” or “actually”. |
Extended Timeout (500 – 5000 ms)

| Range | Behaviour |
|---|---|
| 500 – 1000 ms | Short patience. If the model keeps saying “not done”, force EOS quickly anyway. |
| 2000 ms | Default. Gives callers 2 seconds to continue after a pause. |
| 3000 – 5000 ms | Very patient. For callers who take long pauses mid-thought. |

LiveKit Turn Detector EOS

The LiveKit Turn Detector uses a language model to predict turn completion from transcribed text combined with conversation history. Unlike Pipecat (which analyzes audio), LiveKit analyzes the linguistic content of what was said to determine if the caller is done.

Why choose LiveKit Turn Detector:
  • Context-aware — uses conversation history (up to 6 turns by default) to make better predictions. If the assistant asked “What is your address?”, the model knows the caller is likely still speaking during a pause after saying “123 Main Street”
  • Text-based analysis — catches semantic cues that audio models miss. For example, “My address is 123” is clearly incomplete, regardless of intonation
  • Reduces false triggers on addresses, phone numbers, and lists — the model understands that these naturally contain pauses between segments
  • Available in two model variants: English-only (66 MB, optimized) and Multilingual (378 MB, 14 languages)
When to use: Conversations with structured data collection where callers frequently pause mid-answer. The LiveKit model excels at preventing premature turn-taking during:
  • Address dictation (“123 Main Street… apartment 4B… New York”)
  • Phone numbers (“area code 212… 555… 1234”)
  • Lists or multi-part answers
  • Complex questions requiring thought
How it works:
  1. Final STT transcripts are accumulated into the current user turn
  2. When a final transcript arrives, the model builds a chat template from conversation history + current text
  3. The model predicts an end-of-utterance probability
  4. If probability >= threshold → use quick_timeout
  5. If probability < threshold → use silence_timeout
  6. Assistant responses (from LLMResponseDonePacket) are recorded in history for context
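The history handling in steps 1, 2, and 6 can be sketched as a small context tracker (class and method names are illustrative; the real detector feeds this context into its chat template):

```python
from collections import deque

class TurnContext:
    """Sketch of LiveKit-style context tracking (hypothetical names).
    Keeps the last N turns plus the in-progress user turn."""

    def __init__(self, max_turns: int = 6):
        self.history = deque(maxlen=max_turns)  # (role, text) pairs
        self.current_user_text = []

    def on_final_transcript(self, text):
        # Accumulate the caller's turn, then build the model input:
        # prior turns plus everything spoken so far in this turn.
        self.current_user_text.append(text)
        return list(self.history) + [("user", " ".join(self.current_user_text))]

    def on_assistant_response(self, text):
        # An LLMResponseDonePacket closes the user turn and records
        # both sides of the exchange for future predictions.
        self.history.append(("user", " ".join(self.current_user_text)))
        self.history.append(("assistant", text))
        self.current_user_text = []
```

This is why a pause after “123 Main Street” is handled well: the accumulated turn plus the assistant’s preceding question both reach the model, which can see the answer is incomplete.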

Parameters

| Parameter | Config Key | Default | Range | Description |
|---|---|---|---|---|
| Model | microphone.eos.model | en | en, multilingual | Model variant. en is 66 MB and optimized for English. multilingual is 378 MB and supports Chinese, German, Dutch, English, Portuguese, Spanish, French, Italian, Japanese, Korean, Russian, Turkish, Indonesian, and Hindi. |
| Turn Completion Threshold | microphone.eos.threshold | 0.0289 | 0.001 – 0.1 | Probability threshold for turn completion. This value is much lower than Pipecat’s threshold because the LiveKit model outputs probabilities on a different scale. The default 0.0289 is the “unlikely threshold” from LiveKit’s reference configuration. |
| Quick Timeout | microphone.eos.quick_timeout | 250 ms | 50 – 500 ms | Short buffer after the model says “done” before firing EOS. Catches fast corrections. |
| Safety Timeout | microphone.eos.silence_timeout | 1500 ms | 500 – 5000 ms | Maximum silence before forcing EOS when the model keeps predicting “not done”. Acts as a safety fallback to prevent indefinite waiting. |
| Fallback Timeout | microphone.eos.timeout | 500 ms | 300 – 2000 ms | Silence timeout for interim transcripts and model inference failures. |
The LiveKit threshold range (0.001 – 0.1) is very different from Pipecat’s (0.1 – 0.9). Do not copy threshold values between providers — they use fundamentally different models with different probability distributions.
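Because the two scales are incompatible, a guard at configuration time is cheap insurance against a copied value silently misbehaving. A minimal sketch using the documented ranges (the function name is illustrative):

```python
def validate_threshold(provider: str, threshold: float) -> None:
    """Reject thresholds outside the documented range for each provider."""
    ranges = {
        "pipecat": (0.1, 0.9),    # audio model, raw probability scale
        "livekit": (0.001, 0.1),  # language model, much lower scale
    }
    lo, hi = ranges[provider]
    if not lo <= threshold <= hi:
        raise ValueError(
            f"{provider} threshold {threshold} is outside [{lo}, {hi}]; "
            "thresholds are not portable between EOS providers"
        )
```

For example, LiveKit’s default 0.0289 passes for "livekit" but raises for "pipecat", which is exactly the mistake this guard is meant to catch.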
English model (en, 66 MB)

| Aspect | Detail |
|---|---|
| Optimisation | Faster inference, smaller memory footprint |
| Accuracy | Better for English conversations than the multilingual model |
| When to use | Your callers speak English, even with accents (the STT provider handles accent recognition; LiveKit only sees the transcribed text) |
Multilingual model (multilingual, 378 MB)

| Aspect | Detail |
|---|---|
| Languages | Chinese (zh), German (de), Dutch (nl), English (en), Portuguese (pt), Spanish (es), French (fr), Italian (it), Japanese (ja), Korean (ko), Russian (ru), Turkish (tr), Indonesian (id), Hindi (hi) |
| Init time | Takes longer to load at session start (~100–200 ms extra) |
| When to use | Your callers speak non-English languages or you handle mixed-language conversations |
Threshold (0.001 – 0.1)

| Value | Behaviour |
|---|---|
| 0.01 – 0.02 | Conservative. The model needs high confidence to trigger. Fewer false interruptions but slower response. |
| 0.0289 | Default. Matches LiveKit’s “unlikely threshold” — a good balance between responsiveness and accuracy. |
| 0.05 – 0.1 | Aggressive. Triggers on weaker signals. Faster responses but more false turns, especially during pauses in structured data. |
Safety Timeout (500 – 5000 ms)

| Range | Behaviour |
|---|---|
| 500 – 1000 ms | Short safety net. If the model keeps saying “not done” for 1 second, fire anyway. Use for fast-paced conversations. |
| 1500 ms | Default. Good balance. |
| 3000 – 5000 ms | Very patient. The model gets a long time to keep predicting. Use when callers dictate very long multi-part answers. |

Choosing a provider

| Criteria | Silence-Based | Pipecat Smart Turn | LiveKit Turn Detector |
|---|---|---|---|
| Approach | Fixed silence timer | Audio model (prosody) | Language model (text + history) |
| Model size | None | ~8 MB ONNX | 66 MB (en) / 378 MB (multilingual) |
| Init time | Instant | ~20–50 ms | ~50–200 ms |
| Inference time | None | ~10 ms per prediction | ~5–15 ms per prediction |
| Context used | Silence duration only | Last ~5 s of audio | Transcribed text + conversation history |
| Languages | All (language-agnostic) | 23 languages | English or 14 languages |
| Handles mid-sentence pauses | No | Yes (prosodic cues) | Yes (semantic understanding) |
| Handles structured data | No | Partially | Yes (understands incomplete addresses, numbers) |
| Compute overhead | Zero | Low | Moderate |

Decision guide

1. Start with Silence-Based
   For most new assistants, Silence-Based EOS with a 700 ms timeout is the right starting point. It’s simple, predictable, and works well for 80% of use cases.
2. Switch to Pipecat if callers get cut off
   If your conversation logs show frequent premature turn-taking — callers being interrupted mid-sentence during natural pauses — switch to Pipecat Smart Turn. Its audio model catches prosodic cues that silence timers miss.
3. Switch to LiveKit for structured data collection
   If your assistant collects addresses, phone numbers, or multi-part answers where callers naturally pause between segments, LiveKit’s text-based model with conversation history is the strongest choice. It understands that “123 Main Street” after “What is your address?” is likely incomplete.
You can combine any EOS provider with any VAD provider. They are independent components in the voice pipeline. A common high-quality configuration is Silero VAD + LiveKit EOS or Silero VAD + Pipecat Smart Turn EOS.

How EOS providers interact with VAD

VAD and EOS work together but serve different purposes:
| Aspect | VAD | EOS |
|---|---|---|
| What it detects | “Is the caller making sound right now?” | “Has the caller finished their complete thought?” |
| Granularity | Frame-by-frame (every 10–16 ms) | Per-utterance (after each STT transcript) |
| Output | Speech onset/offset events, activity heartbeats | End-of-speech signal that triggers LLM response |
| Affected by | Audio signal quality, background noise | Pause patterns, speech content, conversation context |
The VAD continuously sends speech activity heartbeats while the caller is speaking. These heartbeats reset the EOS silence timer, preventing the EOS from firing while the caller is actively speaking. When speech stops, the VAD stops sending heartbeats, and the EOS timer begins counting down.

For the model-based EOS providers (Pipecat and LiveKit), the EOS also receives the final STT transcript. On receiving a final transcript, the model runs inference to decide whether to use the quick timeout (turn complete) or extended timeout (still speaking).
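The heartbeat/timer interaction reduces to a moving deadline: each heartbeat pushes the deadline forward, and whichever timeout the EOS provider armed determines how far. A tiny sketch (names illustrative):

```python
def eos_deadline(last_heartbeat_ms: int, armed_timeout_ms: int) -> int:
    """Absolute time at which EOS will fire, given the last VAD heartbeat
    and whichever timeout the EOS provider armed (fixed, quick, or extended)."""
    return last_heartbeat_ms + armed_timeout_ms

# While the caller speaks, fresh heartbeats keep the deadline ahead of "now":
assert eos_deadline(990, 700) > 1000
# Once heartbeats stop, the deadline passes 700 ms later and EOS fires:
assert eos_deadline(990, 700) <= 1800
```

Swapping `armed_timeout_ms` is the only thing the model-based providers change; the heartbeat-reset mechanics are identical across all three.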

Next steps