Speech-to-Text (STT) converts user speech into text for the assistant. STT quality affects everything downstream: the LLM only sees what the transcription provider returns.
STT is configured in the Voice Input step of a Phone Call, Web Widget, or Web App / SDK deployment.
Setup flow
Create the provider credential
Add the STT provider credential in Credentials. The deployment wizard can only select credentials that already exist.
Open the deployment voice input step
Go to Configure Assistant -> Deployments, create or edit a voice-capable deployment, then open Voice Input.
Select the model
Choose the provider-specific model. For phone calls, prefer real-time or telephony-friendly models. For browser audio, use the provider’s recommended real-time model.
Set language when required
Select the primary user language when the provider requires it. Use multilingual or automatic detection only when the provider supports it and your use case needs it.
Supported providers
| Provider | Typical use |
|---|---|
| Deepgram | Low-latency streaming transcription and telephony use cases. |
| AssemblyAI | Real-time transcription with strong conversation-oriented models. |
| Azure Cognitive Services | Enterprise Microsoft environments and multilingual deployments. |
| Google Speech Service | Google Cloud speech recognition workflows. |
| OpenAI | OpenAI transcription models for voice applications. |
| AWS Transcribe | AWS-native speech recognition. |
| Cartesia | Voice AI workflows that also use Cartesia TTS. |
| Sarvam AI | Indian language voice applications. |
| Groq | Low-latency Whisper-compatible transcription. |
| Speechmatics | Broad language coverage and accent robustness. |
| NVIDIA | NVIDIA-hosted speech models. |
| Custom STT | Your own WebSocket-compatible STT backend. |
Configuration fields
The exact fields vary by provider, but STT configuration usually includes:

| Field | What it controls |
|---|---|
| Credential | Which stored provider credential Rapida uses. |
| Model | The transcription model. This is usually the main accuracy/latency tradeoff. |
| Language | The expected user language or provider language code. |
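
The three fields above can be pictured as a small record with a pre-flight check. This is only an illustrative sketch: `SttConfig`, `validate`, and the `language_required` flag are hypothetical names chosen for this example, not part of Rapida's API or SDK.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SttConfig:
    """Hypothetical container for the three common STT fields."""

    credential: str                 # name of a stored provider credential
    model: str                      # provider-specific transcription model
    language: Optional[str] = None  # language code, when the provider needs one


def validate(config: SttConfig, language_required: bool) -> List[str]:
    """Return a list of problems; an empty list means the config looks usable."""
    problems = []
    if not config.credential.strip():
        problems.append("credential is missing")
    if not config.model.strip():
        problems.append("model is missing")
    if language_required and not config.language:
        problems.append("language is required for this provider")
    return problems
```

Calling `validate(SttConfig("my-credential", "some-model"), language_required=True)` reports the missing language, which mirrors the rule that language only needs to be set when the provider requires it.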
Choosing a provider
| Need | Recommended direction |
|---|---|
| Lowest latency | Use a provider with streaming transcription and a real-time model. |
| Phone calls | Choose a model that handles 8 kHz telephony audio well. |
| Browser microphone | Use a real-time model that performs well on cleaner wideband audio. |
| Noisy environments | Pair the provider with Noise Cancellation and stricter VAD. |
| Multilingual users | Use explicit language selection or a provider with reliable language detection. |
| Private provider | Use Custom STT. |
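
When comparing models against recorded samples, it can help to confirm whether a sample is telephony-style narrowband audio or wideband browser audio before judging accuracy. The sketch below uses only Python's standard `wave` module. The helper names and the 8 kHz cutoff are assumptions made for this example, and `make_silent_wav` exists only to produce test input.

```python
import io
import wave


def classify_wav(data: bytes) -> str:
    """Label a WAV recording as 'narrowband' (<= 8 kHz) or 'wideband'."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        rate = wav.getframerate()
    return "narrowband" if rate <= 8000 else "wideband"


def make_silent_wav(sample_rate: int, seconds: float = 0.1) -> bytes:
    """Build a short silent 16-bit mono WAV in memory, for testing only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(sample_rate * seconds))
    return buffer.getvalue()
```

A sample classified as narrowband is a reasonable stand-in for phone-call audio; a wideband sample is closer to what a browser microphone delivers.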
Channel guidance
Phone calls
Phone calls often use narrowband or compressed audio. Prefer STT models that are tested for telephony and real-time streaming. Keep RNNoise enabled for most phone deployments. Start with:

| Area | Starting point |
|---|---|
| Model | Real-time or telephony-friendly model. |
| Noise cancellation | RNNoise enabled. |
| VAD | Silero VAD with balanced threshold. |
| EOS | Pipecat Smart Turn or Silence-Based at 700-1000 ms. |
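
To make the 700-1000 ms silence window concrete, here is a simplified model of silence-based end-of-speech detection. It is a sketch under stated assumptions, not Rapida's implementation: it assumes per-frame VAD decisions arrive as booleans, and `end_of_speech_ms`, `frame_ms`, and `silence_ms` are hypothetical names.

```python
from typing import Iterable, Optional


def end_of_speech_ms(
    frames: Iterable[bool],
    frame_ms: int = 20,
    silence_ms: int = 800,
) -> Optional[int]:
    """Return the time (ms) at which the turn is considered finished, or None.

    frames: per-frame VAD decisions, True = speech, False = silence.
    The turn ends once speech has been heard and silence_ms of
    uninterrupted silence follows it.
    """
    heard_speech = False
    silent_run_ms = 0
    elapsed_ms = 0
    for is_speech in frames:
        elapsed_ms += frame_ms
        if is_speech:
            heard_speech = True
            silent_run_ms = 0  # any speech resets the silence timer
        elif heard_speech:
            silent_run_ms += frame_ms
            if silent_run_ms >= silence_ms:
                return elapsed_ms
    return None
```

With a 300 ms mid-sentence pause in the input, a `silence_ms` of 300 ends the turn at the pause, while 800 waits for the real end of the utterance. That is the tradeoff behind the starting range: shorter windows respond faster but risk answering an incomplete transcript.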
Web widget and web app
Browser microphone audio is often cleaner than phone audio, but user environments vary widely. Use a real-time model and test with laptop microphones, headsets, and mobile browsers. Start with:

| Area | Starting point |
|---|---|
| Model | Real-time model for browser audio. |
| Noise cancellation | Enabled for uncontrolled environments. |
| VAD | Silero VAD. |
| EOS | Pipecat Smart Turn for natural conversation. |
Troubleshooting
| Symptom | Likely cause | What to adjust |
|---|---|---|
| Transcripts miss quiet speech | VAD threshold too high or wrong STT model | Lower VAD threshold and test another STT model. |
| Transcripts include background noise | Noise cancellation off or VAD too sensitive | Enable RNNoise and raise VAD threshold. |
| User words are cut off at the beginning | VAD speech confirmation too strict | Lower minimum speech frames or VAD threshold. |
| Assistant responds to incomplete transcript | EOS too aggressive | Tune End of Speech Detection. |
| Multilingual users are transcribed incorrectly | Language mismatch | Set the correct language or use a multilingual-capable provider. |
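
The VAD rows in the table are easier to reason about with a toy model of speech confirmation. The sketch below assumes the VAD emits one speech probability per frame and confirms speech after a run of frames at or above the threshold. It is a simplification for illustration, not Silero VAD's actual logic, and `speech_confirmed_at` and its parameters are hypothetical names.

```python
from typing import Optional, Sequence


def speech_confirmed_at(
    probs: Sequence[float],
    threshold: float,
    min_speech_frames: int,
) -> Optional[int]:
    """Return the index of the frame at which speech is confirmed, or None.

    Speech is confirmed after min_speech_frames consecutive frames whose
    speech probability is at or above threshold.
    """
    run = 0
    for index, prob in enumerate(probs):
        run = run + 1 if prob >= threshold else 0
        if run >= min_speech_frames:
            return index
    return None
```

In this model, a higher threshold or a larger `min_speech_frames` pushes the confirmation point later, so the first syllables may be dropped, and quiet speech whose probabilities never reach the threshold is never confirmed at all. Lowering either value moves in the opposite direction, at the cost of letting more background noise through.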
Related
Voice Pipeline Overview
See how STT fits into the full audio flow.
Noise Cancellation
Clean audio before it reaches STT.
Custom STT
Connect a custom WebSocket transcription provider with DSL rules.
Voice Activity Detection
Tune when user speech starts and stops.
End of Speech Detection
Decide when the transcript is ready for the assistant to answer.