Text-to-Speech (TTS) converts the assistant’s text response into spoken audio. It controls how the assistant sounds, how quickly users hear the first audio, and how clearly domain-specific terms are pronounced.
TTS is configured in the Voice Output step of a Phone Call, Web Widget, or Web App / SDK deployment.

Setup flow

1. Create the provider credential
   Add the TTS provider credential in Credentials before configuring the deployment.

2. Open the deployment voice output step
   Go to Configure Assistant -> Deployments, create or edit a voice-capable deployment, then open Voice Output.

3. Choose the TTS provider
   Select the provider that will synthesize assistant responses.

4. Choose the model
   Select the provider-specific model. Some providers optimize for lowest latency, while others optimize for voice quality.

5. Choose the voice
   Select the voice ID. For providers that support custom voices, use the custom voice ID when the field allows custom values.

6. Set the language
   Match the TTS language to the assistant’s expected response language and selected voice.

Supported providers

Provider | Typical use
ElevenLabs | Natural voices, custom voices, and brand voice workflows.
Deepgram | Low-latency streaming voice output.
Azure Cognitive Services | Enterprise Microsoft environments and broad voice catalog support.
Google Speech Service | Google Cloud text-to-speech workflows.
OpenAI | OpenAI TTS models and voices.
AWS Polly | AWS-native neural and standard voices.
Cartesia | Low-latency voice AI and expressive voice controls.
Rime | Real-time voice synthesis with provider voices.
Sarvam AI | Indian language voice output.
Resemble AI | Custom and cloned voice workflows.
Neuphonic | Low-latency conversational TTS.
MiniMax | Voice models from MiniMax.
Groq | Low-latency TTS through Groq-supported models.
Speechmatics | Speechmatics voice output.
NVIDIA | NVIDIA-hosted voice models.
Custom TTS | Your own WebSocket-compatible TTS backend.

Configuration fields

The exact fields vary by provider, but TTS configuration usually includes:
Field | What it controls
Credential | Which stored provider credential Rapida uses.
Model | The speech synthesis model.
Voice | The voice ID or custom voice identifier.
Language | The output language or locale.
Speed or emotion | Provider-specific controls for speaking style.

Advanced speech settings

Open Show advanced settings in Voice Output to tune delivery.
Setting | What it controls | Default
Ambient | Optional background ambience mixed into output audio. | none
Ambient Volume | Volume of the selected ambience. | 18
Pronunciation Dictionaries | Built-in pronunciation rules for currencies, dates, times, numbers, addresses, URLs, abbreviations, and symbols. | none
Conjunction Boundaries | Words where Rapida can add natural pause boundaries. | none
Pause Duration | Pause length at configured conjunction boundaries. | 240 ms
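
One way to keep these defaults visible in reviews or tests is a small settings object. The sketch below is hypothetical: `AdvancedSpeechSettings`, its field names, and `problems()` are illustrative and are not a Rapida SDK type. Only the default values come from the table above, and the two consistency checks are assumptions inferred from the setting descriptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AdvancedSpeechSettings:
    """Hypothetical mirror of the advanced Voice Output settings."""

    ambient: Optional[str] = None  # shown as "none" in the table
    ambient_volume: int = 18
    pronunciation_dictionaries: List[str] = field(default_factory=list)
    conjunction_boundaries: List[str] = field(default_factory=list)
    pause_duration_ms: int = 240

    def problems(self) -> List[str]:
        """Flag values that probably have no effect (assumed behavior)."""
        issues = []
        if self.ambient is None and self.ambient_volume != 18:
            issues.append("ambient_volume changed but no ambience is selected")
        if not self.conjunction_boundaries and self.pause_duration_ms != 240:
            issues.append("pause_duration_ms changed but no boundaries are set")
        if self.pause_duration_ms < 0:
            issues.append("pause_duration_ms must not be negative")
        return issues


defaults = AdvancedSpeechSettings()
tuned = AdvancedSpeechSettings(
    pronunciation_dictionaries=["currency", "url"],
    conjunction_boundaries=["and", "because"],
    pause_duration_ms=300,
)
```

Checking a configuration like this before user testing makes it easier to spot a pause duration or ambience volume that was tuned without the setting it depends on.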

Pronunciation dictionaries

Use pronunciation dictionaries when the assistant must say structured or domain-specific text clearly.
Dictionary type | Helps with
currency | Prices, amounts, and currency symbols.
date and time | Dates, appointment times, and schedules.
numeral | Account numbers, quantities, and IDs.
address | Street addresses and postal details.
url | Websites and links.
tech-abbreviation, role-abbreviation, general-abbreviation | Acronyms and abbreviations.
symbol | Symbols that should be spoken naturally.

Add pronunciation dictionaries before user testing. Mispronounced product names, acronyms, prices, dates, and addresses are easy for users to notice.
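
To make the idea concrete, the sketch below shows the kind of text rewriting a pronunciation rule performs before synthesis. It is an illustration only and does not reproduce Rapida's built-in dictionaries: the rule tables, function names, and the US dollar handling are all assumptions, and it covers only simple amounts such as `$5` or `$12.50`.

```python
import re

# Hypothetical rules for illustration; not Rapida's built-in dictionaries.
SYMBOL_WORDS = {"&": "and", "%": "percent", "@": "at"}
ABBREVIATIONS = {"API": "A P I", "CEO": "C E O"}


def expand_currency(text: str) -> str:
    """Rewrite simple dollar amounts such as "$5" or "$12.50" as words."""

    def spoken(match: re.Match) -> str:
        dollars, cents = match.group(1), match.group(2)
        phrase = f"{dollars} dollar" + ("" if dollars == "1" else "s")
        if cents and cents != "00":
            phrase += f" and {int(cents)} cent" + ("" if int(cents) == 1 else "s")
        return phrase

    return re.sub(r"\$(\d+)(?:\.(\d{2}))?", spoken, text)


def expand_abbreviations(text: str) -> str:
    """Spell out known acronyms letter by letter."""
    for short, spoken in ABBREVIATIONS.items():
        text = re.sub(rf"\b{re.escape(short)}\b", spoken, text)
    return text


def expand_symbols(text: str) -> str:
    """Replace symbols with words, then tidy the spacing."""
    for symbol, word in SYMBOL_WORDS.items():
        text = text.replace(symbol, f" {word} ")
    text = re.sub(r"\s+", " ", text).strip()
    return re.sub(r"\s+([.,!?])", r"\1", text)  # no space before punctuation


def normalize_for_speech(text: str) -> str:
    return expand_symbols(expand_abbreviations(expand_currency(text)))


example = normalize_for_speech("Our API costs $12.50 & includes 20% off.")
```

Running real assistant responses through a check like this during testing is one way to find the terms that need a dictionary or a provider-side pronunciation feature.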

Conjunction boundaries

Conjunction boundaries let Rapida add natural pauses around selected words such as "and", "but", "or", "because", and "while". This can make long responses easier to listen to.
Use them when:
  • The assistant often speaks multi-clause sentences.
  • Users need time to understand instructions.
  • TTS output feels rushed even when the voice is good.
Avoid overusing them when:
  • The assistant already speaks in very short sentences.
  • The added pauses make responses feel slow.
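
The effect can be pictured with a short sketch that marks a pause before each configured word. This is not how Rapida implements the feature: the function name is hypothetical, the SSML-style `<break>` tag is used only to make the pause visible in text, and placing the pause before the word (rather than after it) is an assumption. The 240 ms default matches the Pause Duration default listed under Advanced speech settings.

```python
import re
from typing import Sequence


def add_conjunction_pauses(
    text: str, boundaries: Sequence[str], pause_ms: int = 240
) -> str:
    """Insert a pause marker before each configured boundary word."""
    if not boundaries:
        return text  # nothing configured, leave the text untouched
    words = "|".join(re.escape(word) for word in boundaries)
    # Match the whitespace before a whole boundary word, so "and" does not
    # match inside "android" or "command".
    pattern = re.compile(rf"\s+\b({words})\b", flags=re.IGNORECASE)
    marker = f'<break time="{pause_ms}ms"/>'
    return pattern.sub(lambda m: f" {marker} {m.group(1)}", text)


sentence = "Press one for billing and stay on the line because an agent will join."
paused = add_conjunction_pauses(sentence, ["and", "because"])
```

Counting the markers in typical responses gives a rough sense of how much extra time the pauses add, which helps when deciding whether the pauses make responses feel slow.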

Choosing a provider

Need | Recommended direction
Lowest response latency | Use a streaming TTS provider and keep assistant responses short.
Natural voice quality | Use a neural or conversational voice model.
Brand voice | Use a provider that supports cloned or custom voice IDs.
Multilingual speech | Choose a voice that supports the required language, not only a model that lists it.
Phone calls | Test the voice over phone audio, not only in browser previews.
Private provider | Use Custom TTS.

Prompt guidance for better TTS

TTS quality depends on the text the LLM produces. Tune the assistant prompt for spoken responses:
  • Keep responses to one or two short sentences.
  • Avoid markdown, bullet lists, long tables, and symbols.
  • Ask one question at a time.
  • Confirm critical values slowly, especially emails, phone numbers, dates, and addresses.
  • Write instructions in natural spoken language.
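
These guidelines can also be checked automatically against sample responses during prompt tuning. The sketch below is a rough heuristic, not a Rapida feature: the function name, the two-sentence limit, and the regular expressions are assumptions, and sentence splitting on punctuation will miscount text that contains abbreviations such as "Dr.".

```python
import re
from typing import List


def spoken_response_issues(text: str, max_sentences: int = 2) -> List[str]:
    """Return a list of reasons a response may sound poor when spoken."""
    issues = []

    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if len(sentences) > max_sentences:
        issues.append(f"has more than {max_sentences} sentences")

    if text.count("?") > 1:
        issues.append("asks more than one question")

    # List markers at the start of a line, or common markdown characters.
    has_list = re.search(r"(^|\n)\s*([-*•]|\d+\.)\s+", text)
    has_markup = re.search(r"[*_#`|]", text)
    if has_list or has_markup:
        issues.append("contains markdown or list formatting")

    return issues


good = spoken_response_issues("Your appointment is on Friday. Does that work?")
bad = spoken_response_issues("**Options:**\n- Billing?\n- Support?")
```

A check like this is most useful on a batch of recorded LLM outputs, where repeated issues point to a prompt instruction that needs tightening.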

Troubleshooting

Symptom | Likely cause | What to adjust
First audio starts slowly | TTS model latency or long LLM response | Use a lower-latency TTS model and shorten assistant responses.
Voice mispronounces product names | Missing pronunciation handling | Enable pronunciation dictionaries or use the provider's custom voice or pronunciation feature.
Voice sounds rushed | Long clauses or no pauses | Add conjunction boundaries and shorten assistant responses in the prompt.
Voice sounds unnatural on phone | Voice tested only on browser audio | Test through the target phone deployment and try a clearer voice.
Wrong language or accent | Voice and language mismatch | Select a language-matched voice and provider model.
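
When diagnosing slow first audio, it helps to measure the delay instead of judging it by ear. The sketch below times how long a stream of audio chunks takes to deliver its first non-empty chunk. It assumes audio arrives as an iterable of byte chunks; `time_to_first_audio` and `simulated_stream` are hypothetical helpers, and the simulated stream stands in for a real provider connection.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def time_to_first_audio(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], int]:
    """Return (milliseconds until the first non-empty chunk, total bytes).

    The first value is None if the stream never produced audio.
    """
    start = clock()
    first_ms: Optional[float] = None
    total_bytes = 0
    for chunk in chunks:
        if chunk and first_ms is None:
            first_ms = (clock() - start) * 1000.0
        total_bytes += len(chunk)
    return first_ms, total_bytes


def simulated_stream(delay_s: float = 0.05) -> Iterator[bytes]:
    """Stand-in for a provider stream: wait, then yield two silent chunks."""
    time.sleep(delay_s)
    yield b"\x00" * 320
    yield b"\x00" * 320


if __name__ == "__main__":
    first_ms, total = time_to_first_audio(simulated_stream())
    print(f"first audio after {first_ms:.0f} ms, {total} bytes received")
```

Comparing this number across models, and across short and long assistant responses, shows whether the delay comes mainly from the TTS model or from the length of the LLM response.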

Voice Pipeline Overview

See where TTS fits into the full audio flow.

Speech-to-Text

Configure user speech transcription.

Custom TTS

Connect a custom WebSocket speech synthesis provider with DSL rules.

Phone Call Deployment

Configure required voice output for phone calls.

Web App / SDK Deployment

Configure optional voice output for custom apps.