Speech-to-Text (STT) converts user speech into text for the assistant. STT quality affects everything downstream: the LLM only sees what the transcription provider returns.
STT is configured in the Voice Input step of a Phone Call, Web Widget, or Web App / SDK deployment.

Setup flow

1. Create the provider credential
   Add the STT provider credential in Credentials. The deployment wizard can only select credentials that already exist.

2. Open the deployment voice input step
   Go to Configure Assistant -> Deployments, create or edit a voice-capable deployment, then open Voice Input.

3. Choose the STT provider
   Select the provider that will transcribe user audio.

4. Select the model
   Choose the provider-specific model. For phone calls, prefer real-time or telephony-friendly models. For browser audio, use the provider's recommended real-time model.

5. Set language when required
   Select the primary user language when the provider requires it. Use multilingual or automatic detection only when the provider supports it and your use case needs it.

6. Tune advanced voice input
   Configure noise cancellation, voice activity detection (VAD), and end-of-speech detection (EOS) from Show advanced settings. These settings control what audio reaches STT and when a user turn is complete.

Supported providers

| Provider | Typical use |
| --- | --- |
| Deepgram | Low-latency streaming transcription and telephony use cases. |
| AssemblyAI | Real-time transcription with strong conversation-oriented models. |
| Azure Cognitive Services | Enterprise Microsoft environments and multilingual deployments. |
| Google Speech Service | Google Cloud speech recognition workflows. |
| OpenAI | OpenAI transcription models for voice applications. |
| AWS Transcribe | AWS-native speech recognition. |
| Cartesia | Voice AI workflows that also use Cartesia TTS. |
| Sarvam AI | Indian language voice applications. |
| Groq | Low-latency Whisper-compatible transcription. |
| Speechmatics | Broad language coverage and accent robustness. |
| NVIDIA | NVIDIA-hosted speech models. |
| Custom STT | Your own WebSocket-compatible STT backend. |

Configuration fields

The exact fields vary by provider, but STT configuration usually includes:

| Field | What it controls |
| --- | --- |
| Credential | Which stored provider credential Rapida uses. |
| Model | The transcription model. This is usually the main accuracy/latency tradeoff. |
| Language | The expected user language or provider language code. |

Some providers expose only a model because language is inferred, encoded in the model, or configured provider-side.
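
As an illustration of how these three fields relate, the sketch below models them as a small Python record with a validation helper. It is not Rapida's API: `SttConfig`, `validate`, the `language_required` flag, and the example credential and model names are all hypothetical, chosen only to show that language is optional for some providers and mandatory for others.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SttConfig:
    """Hypothetical container for the three common STT fields."""
    credential: str                  # name of a stored provider credential
    model: str                       # provider-specific transcription model
    language: Optional[str] = None   # left empty when the provider infers it


def validate(config: SttConfig, language_required: bool) -> List[str]:
    """Return human-readable problems; an empty list means the config is usable."""
    problems = []
    if not config.credential.strip():
        problems.append("credential is missing")
    if not config.model.strip():
        problems.append("model is missing")
    if language_required and not config.language:
        problems.append("language is required for this provider")
    return problems


# Placeholder names, not real credentials or models.
cfg = SttConfig(credential="my-stt-credential", model="example-realtime-model")
print(validate(cfg, language_required=False))  # []
print(validate(cfg, language_required=True))   # ['language is required for this provider']
```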

Choosing a provider

| Need | Recommended direction |
| --- | --- |
| Lowest latency | Use a provider with streaming transcription and a real-time model. |
| Phone calls | Choose a model that handles 8 kHz telephony audio well. |
| Browser microphone | Use a real-time model that performs well on cleaner wideband audio. |
| Noisy environments | Pair the provider with Noise Cancellation and stricter VAD. |
| Multilingual users | Use explicit language selection or a provider with reliable language detection. |
| Private provider | Use Custom STT. |

Do not judge STT accuracy from one setting alone. Wrong VAD, disabled noise cancellation, or aggressive EOS can produce clipped or incomplete audio that looks like an STT issue.

Channel guidance

Phone calls

Phone calls often use narrowband or compressed audio. Prefer STT models that are tested for telephony and real-time streaming. Keep RNNoise enabled for most phone deployments. Start with:

| Area | Starting point |
| --- | --- |
| Model | Real-time or telephony-friendly model. |
| Noise cancellation | RNNoise enabled. |
| VAD | Silero VAD with balanced threshold. |
| EOS | Pipecat Smart Turn or Silence-Based at 700-1000 ms. |
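
To see why a silence-based EOS window in the 700-1000 ms range is a reasonable starting point, the sketch below simulates a generic silence-based detector over per-frame speech flags. It is not Rapida's or Pipecat's implementation: the 20 ms frame length, the `split_turns` function, and the simulated utterance are assumptions made for illustration.

```python
from typing import List, Tuple

FRAME_MS = 20  # assumed frame length; real pipelines may use another size


def split_turns(speech_flags: List[bool], silence_ms: int) -> List[Tuple[int, int]]:
    """Close a turn once trailing silence lasts at least silence_ms.

    Returns (first_speech_frame, last_speech_frame) for each turn.
    """
    needed = max(1, silence_ms // FRAME_MS)
    turns = []
    start = last = None
    silent_run = 0
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            if start is None:
                start = i
            last = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= needed:
                turns.append((start, last))
                start = last = None
                silent_run = 0
    if start is not None:
        turns.append((start, last))  # audio ended while a turn was still open
    return turns


# 1 s of speech, a 500 ms thinking pause, 1 s of speech, then 1.2 s of silence.
flags = [True] * 50 + [False] * 25 + [True] * 50 + [False] * 60

print(split_turns(flags, silence_ms=300))  # [(0, 49), (75, 124)]  pause splits the turn
print(split_turns(flags, silence_ms=800))  # [(0, 124)]            one complete turn
```

With the shorter window, the assistant would receive the first half of the sentence as a finished turn, which is the "aggressive EOS" failure described on this page. The longer window waits through the pause at the cost of a slower response after the user really stops.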

Web widget and web app

Browser microphone audio is often cleaner than phone audio, but user environments vary widely. Use a real-time model and test with laptop microphones, headsets, and mobile browsers. Start with:

| Area | Starting point |
| --- | --- |
| Model | Real-time model for browser audio. |
| Noise cancellation | Enabled for uncontrolled environments. |
| VAD | Silero VAD. |
| EOS | Pipecat Smart Turn for natural conversation. |
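
When comparing test runs across channels, it can help to keep the starting points as data and record where a deployment departs from them. The sketch below restates the phone and web starting points as a plain dictionary. The key names, the `deviations` helper, and the 400 ms example value are hypothetical and are not part of any Rapida configuration format.

```python
from typing import Dict, List

# Hypothetical keys; the values restate the phone and web starting points.
STARTING_POINTS: Dict[str, Dict[str, str]] = {
    "phone": {
        "model": "real-time or telephony-friendly",
        "noise_cancellation": "RNNoise enabled",
        "vad": "Silero VAD, balanced threshold",
        "eos": "Pipecat Smart Turn or Silence-Based at 700-1000 ms",
    },
    "web": {
        "model": "real-time model for browser audio",
        "noise_cancellation": "enabled for uncontrolled environments",
        "vad": "Silero VAD",
        "eos": "Pipecat Smart Turn",
    },
}


def deviations(channel: str, actual: Dict[str, str]) -> List[str]:
    """List the areas where a deployment differs from the channel's starting point."""
    baseline = STARTING_POINTS[channel]
    return sorted(area for area, value in baseline.items() if actual.get(area) != value)


# A web deployment that was switched to a short silence-based EOS during testing.
tested = dict(STARTING_POINTS["web"], eos="Silence-Based at 400 ms")
print(deviations("web", tested))  # ['eos']
```

Noting deviations this way makes it easier to tell whether a transcription problem appeared after an STT change or after a VAD, noise cancellation, or EOS change.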

Troubleshooting

| Symptom | Likely cause | What to adjust |
| --- | --- | --- |
| Transcripts miss quiet speech | VAD threshold too high or wrong STT model | Lower VAD threshold and test another STT model. |
| Transcripts include background noise | Noise cancellation off or VAD too sensitive | Enable RNNoise and raise VAD threshold. |
| User words are cut off at the beginning | VAD speech confirmation too strict | Lower minimum speech frames or VAD threshold. |
| Assistant responds to incomplete transcript | EOS too aggressive | Tune End of Speech Detection. |
| Multilingual users are transcribed incorrectly | Language mismatch | Set the correct language or use a multilingual-capable provider. |
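
The first three rows all involve the same two VAD knobs: a probability threshold and a minimum number of consecutive speech frames. The sketch below shows, on made-up per-frame speech probabilities, how those knobs change the point at which speech is confirmed. It is a generic illustration, not Silero VAD's or Rapida's implementation; `speech_onset` and the sample values are hypothetical.

```python
from typing import List, Optional


def speech_onset(probs: List[float], threshold: float, min_speech_frames: int) -> Optional[int]:
    """Return the frame index at which speech is confirmed, or None if it never is.

    Speech is confirmed after min_speech_frames consecutive frames at or above threshold.
    """
    run = 0
    for i, p in enumerate(probs):
        run = run + 1 if p >= threshold else 0
        if run >= min_speech_frames:
            return i
    return None


# Made-up probabilities for a word that starts softly and gets louder.
rising = [0.1, 0.4, 0.55, 0.6, 0.7, 0.9, 0.9, 0.9]
# Made-up probabilities for quiet speech that never exceeds 0.6.
quiet = [0.2, 0.5, 0.6, 0.6, 0.55, 0.5, 0.3]

print(speech_onset(rising, threshold=0.5, min_speech_frames=2))  # 3
print(speech_onset(rising, threshold=0.8, min_speech_frames=3))  # 7
print(speech_onset(quiet, threshold=0.8, min_speech_frames=2))   # None
print(speech_onset(quiet, threshold=0.5, min_speech_frames=2))   # 2
```

A stricter setting confirms the rising word four frames later, which is how the start of a word can be clipped unless earlier audio is buffered, and it never confirms the quiet utterance at all. Loosening either knob has the opposite risk: short noise bursts are more likely to be confirmed as speech.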

Voice Pipeline Overview

See how STT fits into the full audio flow.

Noise Cancellation

Clean audio before it reaches STT.

Custom STT

Connect a custom WebSocket transcription provider with DSL rules.

Voice Activity Detection

Tune when user speech starts and stops.

End of Speech Detection

Decide when the transcript is ready for the assistant to answer.