firered_vad
Source Location
How It Works
- Incoming LINEAR16 bytes are converted to `int16` samples and buffered
- Complete frames are extracted: 400 samples (25 ms) with 160-sample shift (10 ms)
- For each frame: fbank features are extracted, CMVN normalization is applied, ONNX inference produces a raw speech probability
- The postprocessor smooths probabilities (moving average, window size 5) and runs a 4-state machine:
- Speech onset is confirmed only after `MinSpeechFrame` consecutive frames above threshold; this filters out short noise bursts
- Same packet emission: `InterruptionPacket` on confirmed onset, `VadSpeechActivityPacket` heartbeats during confirmed speech
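
The buffering and framing steps above can be sketched as follows. This is a minimal illustration, not the package's actual code: the `framer` type and its `push` method are hypothetical names, and the sketch assumes little-endian LINEAR16 input at 16 kHz (the rate implied by 400 samples = 25 ms).

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	frameLen   = 400 // samples per frame: 25 ms at the implied 16 kHz
	frameShift = 160 // samples between frame starts: 10 ms
)

// framer is a hypothetical helper that buffers samples across calls.
type framer struct {
	buf []int16
}

// push decodes little-endian 16-bit PCM, appends it to the buffer, and
// returns every complete frame that is now available.
// A trailing odd byte is ignored here; a real decoder would carry it over.
func (f *framer) push(pcm []byte) [][]int16 {
	for i := 0; i+1 < len(pcm); i += 2 {
		f.buf = append(f.buf, int16(binary.LittleEndian.Uint16(pcm[i:i+2])))
	}
	var frames [][]int16
	for len(f.buf) >= frameLen {
		frame := make([]int16, frameLen)
		copy(frame, f.buf[:frameLen])
		frames = append(frames, frame)
		f.buf = f.buf[frameShift:] // keep the 240-sample overlap
	}
	return frames
}

func main() {
	f := &framer{}
	first := f.push(make([]byte, 2*400))  // exactly one frame of audio
	second := f.push(make([]byte, 2*160)) // one more shift of audio
	fmt.Println(len(first), len(second))
}
```

Because consecutive frames overlap by 240 samples, each additional 160 samples (10 ms) of audio yields one more frame once the first 400 samples have arrived.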
Parameters
| Option Key | Default | Range | Description |
|---|---|---|---|
| `microphone.vad.threshold` | 0.5 | 0.3–1.0 | Speech probability threshold (applied after smoothing) |
| `microphone.vad.min_silence_frame` | 20 | 1–30 | Silence frames before segment end (each frame = 10 ms) |
| `microphone.vad.min_speech_frame` | 8 | 1–20 | Speech frames before confirming onset (each frame = 10 ms) |
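
To illustrate how these three options interact, here is a simplified sketch. It is a two-state stand-in, not the actual 4-state machine, whose states are not described here. The `detector` type, its fields, and the `"onset"`/`"end"` return values are hypothetical. With the defaults, onset needs 8 consecutive frames (80 ms) above the threshold, and a segment ends after 20 consecutive frames (200 ms) at or below it.

```go
package main

import "fmt"

// detector is a simplified, hypothetical two-state model of the
// threshold and frame-count options. It is not the real state machine.
type detector struct {
	threshold       float64
	minSpeechFrame  int
	minSilenceFrame int
	speaking        bool
	speechRun       int // consecutive frames above threshold
	silenceRun      int // consecutive frames at or below threshold
}

// step consumes one smoothed probability and returns "onset", "end", or "".
func (d *detector) step(p float64) string {
	if p > d.threshold {
		d.speechRun++
		d.silenceRun = 0
		if !d.speaking && d.speechRun >= d.minSpeechFrame {
			d.speaking = true
			return "onset"
		}
		return ""
	}
	d.silenceRun++
	d.speechRun = 0
	if d.speaking && d.silenceRun >= d.minSilenceFrame {
		d.speaking = false
		return "end"
	}
	return ""
}

func main() {
	d := &detector{threshold: 0.5, minSpeechFrame: 8, minSilenceFrame: 20}
	var probs []float64
	for i := 0; i < 5; i++ { // short noise burst: too brief to confirm
		probs = append(probs, 0.9)
	}
	probs = append(probs, 0.1)
	for i := 0; i < 10; i++ { // sustained speech
		probs = append(probs, 0.9)
	}
	for i := 0; i < 20; i++ { // trailing silence
		probs = append(probs, 0.1)
	}
	for i, p := range probs {
		if ev := d.step(p); ev != "" {
			fmt.Printf("frame %d: %s\n", i, ev)
		}
	}
}
```

Raising `min_speech_frame` makes onset detection more robust against noise at the cost of latency; raising `min_silence_frame` tolerates longer pauses inside a segment.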
Internal Postprocessor Defaults
These are not configurable via assistant options; they are hardcoded in `DefaultPostprocessorConfig()`:
| Setting | Value | Description |
|---|---|---|
| SmoothWindowSize | 5 | Moving average window for probability smoothing |
| SpeechThreshold | 0.4 | Smoothed probability threshold (separate from the configurable threshold) |
| PadStartFrame | 5 | Frames padded at speech start to capture onset audio (50 ms) |
| MaxSpeechFrame | 2000 | Max consecutive speech frames before forced reset (20 seconds) |
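
As an illustration of `SmoothWindowSize`, the sketch below computes a moving average over the last five raw probabilities. The `smoother` type is hypothetical, and averaging over fewer values during warm-up is an assumption of this sketch rather than confirmed behavior. It shows why an isolated one-frame spike of 1.0 is suppressed: over a full window of otherwise quiet frames it averages to 0.2, below the 0.4 `SpeechThreshold`.

```go
package main

import "fmt"

const smoothWindowSize = 5 // SmoothWindowSize from the table above

// smoother is a hypothetical helper holding the most recent raw probabilities.
type smoother struct {
	recent []float64
}

// add records a raw probability and returns the mean of the current window.
// Dividing by the number of values seen so far during warm-up is an
// assumption of this sketch.
func (s *smoother) add(p float64) float64 {
	s.recent = append(s.recent, p)
	if len(s.recent) > smoothWindowSize {
		s.recent = s.recent[1:] // drop the oldest value
	}
	sum := 0.0
	for _, v := range s.recent {
		sum += v
	}
	return sum / float64(len(s.recent))
}

func main() {
	s := &smoother{}
	// Four quiet frames followed by a single-frame spike.
	for _, p := range []float64{0, 0, 0, 0, 1} {
		fmt.Printf("raw=%.1f smoothed=%.2f\n", p, s.add(p))
	}
}
```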
Model Path
| Source | Resolution |
|---|---|
| Environment variable | `FIRERED_VAD_MODEL_PATH`: absolute path to the `.onnx` file |
| Default (Docker) | `./models/firered_vad/fireredvad_stream_vad_with_cache.onnx` |
| Default (source) | `api/assistant-api/internal/vad/internal/firered_vad/models/fireredvad_stream_vad_with_cache.onnx` |
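
One way the lookup in the table above might work is sketched below. The helper names `resolveModelPath` and `fileExists` are hypothetical. The precedence shown (environment variable first, then the first default path that exists on disk) is an assumption; the table does not say how the two defaults are chosen between.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

const envKey = "FIRERED_VAD_MODEL_PATH"

// Default locations from the table above, in the order this sketch tries them.
var defaultPaths = []string{
	"./models/firered_vad/fireredvad_stream_vad_with_cache.onnx",
	"api/assistant-api/internal/vad/internal/firered_vad/models/fireredvad_stream_vad_with_cache.onnx",
}

// resolveModelPath is a hypothetical helper: the environment variable wins,
// otherwise the first default path that exists is returned.
func resolveModelPath(getenv func(string) string, exists func(string) bool) (string, error) {
	if p := getenv(envKey); p != "" {
		return p, nil
	}
	for _, p := range defaultPaths {
		if exists(p) {
			return p, nil
		}
	}
	return "", errors.New("firered_vad model not found; set " + envKey)
}

// fileExists reports whether path names an existing non-directory entry.
func fileExists(path string) bool {
	info, err := os.Stat(path)
	return err == nil && !info.IsDir()
}

func main() {
	path, err := resolveModelPath(os.Getenv, fileExists)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("model path:", path)
}
```

Passing `getenv` and `exists` as functions keeps the lookup testable without touching the real environment or filesystem.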