Voice Activity Detection (VAD) determines when a caller is speaking and when they are silent. It is the first stage of the voice pipeline — audio flows through VAD before reaching STT or End of Speech. A well-tuned VAD prevents background noise from triggering transcription, enables barge-in (interrupting the assistant mid-sentence), and feeds the End of Speech detector with speech activity signals. All VAD providers in Rapida process 16 kHz LINEAR16 mono audio and run locally — no network calls, no API keys, no added latency.
Providers
Rapida supports three VAD providers. Each shares the same three configuration parameters but uses a different detection model and approach.

| Provider | Description |
|---|---|
| Silero VAD | The default. Lightweight ONNX model (~2 MB) with strong accuracy across languages and noise conditions. Best general-purpose choice. |
| TEN VAD | TEN Framework library with fixed 16 ms frame hops. Slightly lower latency per frame than Silero. Good for latency-critical deployments. |
| FireRed VAD | DFSMN streaming model with Kaldi-compatible fbank features and a 4-state postprocessor. Higher accuracy on noisy or overlapping speech at the cost of slightly more compute. |
Silero VAD
Silero VAD is a pre-trained ONNX model developed by the Silero team. It is one of the most widely adopted open-source VAD models, used in production by projects like Pipecat, LiveKit, and dozens of voice AI platforms.

Why choose Silero VAD:
- Battle-tested in production across thousands of voice applications
- Small model size (~2 MB ONNX) with fast inference
- Strong accuracy across languages — trained on over 100 languages
- Works well in both clean and moderately noisy environments
- Default provider in Rapida — zero configuration needed to get started
Parameters
| Parameter | Config Key | Default | Range | Description |
|---|---|---|---|---|
| VAD Threshold | microphone.vad.threshold | 0.5 | 0.3 – 1.0 | Speech probability threshold. Each audio frame produces a probability score between 0 and 1. Frames above this threshold are classified as speech. |
| Min Silence Frames | microphone.vad.min_silence_frame | 20 | 1 – 30 | Minimum consecutive silence frames before ending a speech segment. Each frame is 10 ms, so the default of 20 = 200 ms of silence required to close a speech segment. |
| Min Speech Frames | microphone.vad.min_speech_frame | 8 | 1 – 20 | Minimum consecutive speech frames before confirming speech onset. Each frame is 10 ms, so the default of 8 = 80 ms of speech required before triggering an interruption. |
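The three parameters interact as a simple frame-counting state machine: a segment opens only after min_speech_frame consecutive frames score above the threshold, and closes only after min_silence_frame consecutive frames score below it. The following is an illustrative sketch of that interaction, assuming one probability per 10 ms frame — it is not Rapida's actual implementation:

```python
def segment_speech(probs, threshold=0.5, min_speech_frame=8, min_silence_frame=20):
    """Return (start_frame, end_frame) speech segments from per-frame probabilities."""
    segments = []
    in_speech = False
    speech_run = 0   # consecutive frames at or above threshold
    silence_run = 0  # consecutive frames below threshold
    start = 0
    for i, p in enumerate(probs):
        if p >= threshold:
            speech_run += 1
            silence_run = 0
            if not in_speech and speech_run >= min_speech_frame:
                in_speech = True
                start = i - speech_run + 1  # back-date segment to the true onset
        else:
            silence_run += 1
            speech_run = 0
            if in_speech and silence_run >= min_silence_frame:
                in_speech = False
                segments.append((start, i - silence_run + 1))
    if in_speech:
        segments.append((start, len(probs)))
    return segments

# 10 speech frames (100 ms) followed by 30 silence frames (300 ms)
# yields one segment: speech confirmed at frame 8, closed 200 ms into silence.
print(segment_speech([0.9] * 10 + [0.1] * 30))  # [(0, 10)]
```

Note how lowering min_speech_frame makes the segment open sooner (faster barge-in), while lowering min_silence_frame makes short pauses split the segment.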
TEN VAD
TEN VAD is part of the TEN Framework, an open-source real-time communication framework. It uses a native C library for frame-level speech probability scoring.

Why choose TEN VAD:
- Fixed 256-sample hop size (16 ms at 16 kHz) provides consistent, predictable frame timing
- Native C implementation — lower per-frame overhead than ONNX inference
- Lightweight with no model file to load from disk
- Good choice when you want the lowest possible per-frame processing time
Parameters
| Parameter | Config Key | Default | Range | Description |
|---|---|---|---|---|
| VAD Threshold | microphone.vad.threshold | 0.5 | 0.3 – 1.0 | Speech probability threshold. Same semantics as Silero — frames above this value are classified as speech. |
| Min Silence Frames | microphone.vad.min_silence_frame | 20 | 1 – 30 | Minimum consecutive silence frames (each ~10 ms) before ending a speech segment. Default 20 = 200 ms. |
| Min Speech Frames | microphone.vad.min_speech_frame | 8 | 1 – 20 | Minimum consecutive speech frames (each ~10 ms) before confirming speech onset. Default 8 = 80 ms. |
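TEN VAD's offset hysteresis (a 0.15 offset below the threshold, per the note below) means a frame must drop well past the onset threshold before speech ends. A minimal sketch of that behavior — not the actual TEN Framework API:

```python
HYSTERESIS_OFFSET = 0.15  # per the documented TEN VAD offset behavior

def classify_frames(probs, threshold=0.5):
    """Label each frame as speech (True) or silence (False) with offset hysteresis."""
    labels = []
    speaking = False
    for p in probs:
        if not speaking and p >= threshold:
            speaking = True   # onset: plain threshold comparison
        elif speaking and p < threshold - HYSTERESIS_OFFSET:
            speaking = False  # offset: must fall below threshold - 0.15
        labels.append(speaking)
    return labels

# Frames hovering at 0.45 stay classified as speech (0.45 >= 0.35)
# instead of toggling off and on around the 0.5 boundary.
print(classify_frames([0.6, 0.45, 0.45, 0.2]))  # [True, True, True, False]
```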
TEN VAD uses a hysteresis offset of 0.15 for speech offset detection — speech ends only when the probability drops below threshold - 0.15. This prevents rapid toggling at the threshold boundary during natural speech.

FireRed VAD
FireRed VAD is a DFSMN (Deep Feed-forward Sequential Memory Network) streaming model developed by the FireRed team at Ant Group. It uses Kaldi-compatible fbank feature extraction and CMVN normalization, with a sophisticated 4-state postprocessor for speech boundary detection.

Why choose FireRed VAD:
- Uses a 4-state machine (Silence → Possible Speech → Speech → Possible Silence) for more precise speech boundary detection
- Probability smoothing via a moving average window reduces jitter from frame-to-frame noise
- Better at handling overlapping speech and noisy environments than simpler threshold-based approaches
- Designed for streaming — processes 25 ms frames with 10 ms shifts for fine-grained detection
Parameters
| Parameter | Config Key | Default | Range | Description |
|---|---|---|---|---|
| Speech Threshold | microphone.vad.threshold | 0.5 | 0.3 – 1.0 | Probability threshold for speech detection. Applied after smoothing — the postprocessor uses a moving average window (size 5) to smooth raw probabilities before comparing against this threshold. |
| Min Silence Frames | microphone.vad.min_silence_frame | 20 | 1 – 30 | Minimum consecutive silence frames before ending a speech segment. Each frame is 10 ms. Directly controls the MinSilenceFrame in the postprocessor state machine. |
| Min Speech Frames | microphone.vad.min_speech_frame | 8 | 1 – 20 | Minimum consecutive speech frames before confirming speech onset. Each frame is 10 ms. Controls how long the postprocessor stays in the “Possible Speech” state before transitioning to confirmed “Speech”. |
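The smoothing window and 4-state machine described above can be sketched roughly as follows. This is an illustrative reconstruction based on the state names and window size (5) given in this section, not FireRed's actual code:

```python
from collections import deque

def postprocess(probs, threshold=0.5, min_speech_frame=8,
                min_silence_frame=20, window=5):
    """Emit (frame_index, event) pairs from raw per-frame probabilities."""
    state = "silence"
    window_buf = deque(maxlen=window)  # moving-average smoothing window
    count = 0                          # frames spent in a "possible" state
    events = []                        # ("speech_start" | "speech_end") events
    for i, p in enumerate(probs):
        window_buf.append(p)
        smoothed = sum(window_buf) / len(window_buf)
        is_speech = smoothed >= threshold
        if state == "silence" and is_speech:
            state, count = "possible_speech", 1
        elif state == "possible_speech":
            if is_speech:
                count += 1
                if count >= min_speech_frame:     # onset confirmed
                    state = "speech"
                    events.append((i - count + 1, "speech_start"))
            else:
                state = "silence"                 # false alarm, reset
        elif state == "speech" and not is_speech:
            state, count = "possible_silence", 1
        elif state == "possible_silence":
            if is_speech:
                state = "speech"                  # pause too short, resume
            else:
                count += 1
                if count >= min_silence_frame:    # end confirmed
                    state = "silence"
                    events.append((i - count + 1, "speech_end"))
    return events

# 20 speech frames then 40 silence frames: onset confirmed after 8 frames
# (back-dated to frame 0), end confirmed after 20 silence frames.
probs = [0.9] * 20 + [0.0] * 40
print(postprocess(probs))
```

The "possible" states are what distinguish this design from plain thresholding: a brief dip or blip never flips the output directly, it only starts a counter that can still be cancelled.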
FireRed VAD’s postprocessor internally uses a smoothed speech threshold of 0.4 (after moving average) and pads speech start by 5 frames (50 ms) to capture the onset of the utterance that triggered detection. The max speech frame limit is 2000 frames (20 seconds), after which a forced speech end is emitted.

Choosing a provider
| Criteria | Silero VAD | TEN VAD | FireRed VAD |
|---|---|---|---|
| Model size | ~2 MB ONNX | No model file (native library) | ~5 MB ONNX |
| Init time | Fast (~10–30 ms) | Very fast (~5 ms) | Fast (~10–30 ms) |
| Accuracy | High | Good | High |
| Noise robustness | Good | Moderate | Best |
| Languages | 100+ | Language-agnostic | Language-agnostic |
| Best for | General purpose | Latency-critical | Noisy environments |
How VAD parameters affect the conversation
Min Speech Frames — barge-in sensitivity
The min_speech_frame parameter controls how quickly the assistant detects that a caller has started speaking and triggers a barge-in (interrupting assistant speech).
- Lower values (1–4 frames, 10–40 ms): The assistant stops speaking almost instantly when the caller makes any sound. Good for fast-paced IVR-style interactions. Risk: short sounds like coughs or “um” trigger interruption.
- Default (8 frames, 80 ms): Balanced. Requires ~80 ms of sustained speech before interrupting the assistant. Filters out most accidental sounds.
- Higher values (12–20 frames, 120–200 ms): The assistant only stops when the caller has clearly started a full utterance. Good for avoiding false interruptions during one-way information delivery (e.g., reading terms and conditions). Risk: the caller has to speak for longer before the assistant stops.
Min Silence Frames — speech segment boundary
The min_silence_frame parameter controls how long the VAD waits in silence before considering a speech segment finished.
- Lower values (5–10 frames, 50–100 ms): Speech segments are split at short pauses. Words spoken with natural gaps between them may be split into multiple segments.
- Default (20 frames, 200 ms): Covers typical intra-sentence pauses. Most natural speech is captured as a single continuous segment.
- Higher values (25–30 frames, 250–300 ms): Only longer pauses end a segment. Good for speakers who pause frequently mid-sentence.
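Since every frame covers 10 ms, converting a latency target into a frame-count setting is simple arithmetic. A hypothetical helper (ms_to_frames is not part of any Rapida SDK):

```python
FRAME_MS = 10  # every VAD frame covers 10 ms of audio

def ms_to_frames(latency_ms: int) -> int:
    """Number of consecutive frames needed to cover latency_ms of audio."""
    return latency_ms // FRAME_MS

# Defaults: 80 ms of sustained speech to barge in,
# 200 ms of silence to close a speech segment.
print(ms_to_frames(80))   # 8  -> microphone.vad.min_speech_frame
print(ms_to_frames(200))  # 20 -> microphone.vad.min_silence_frame
```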
Next steps
End of Speech Detection
Configure how the assistant detects when a caller has finished their turn.
Create an Assistant
Set up VAD as part of the full voice pipeline configuration.