Custom Text-to-Speech lets you connect Rapida to a WebSocket speech synthesis service that is not available as a built-in provider. You configure the provider URL, handshake headers, audio settings, and a small JSON DSL that maps assistant text packets to your provider's WebSocket protocol. Use Custom TTS when your provider can receive text over WebSocket and return audio either as binary frames or as base64 audio inside JSON frames.
Provider identifier: custom-tts
Custom TTS is an end-user configuration feature. You do not need to write a new Rapida transformer when your provider can be described with the WebSocket DSL below.
Setup flow
1. Create the Custom TTS credential: Open Credentials or Integrations > Models, choose Custom TTS, and create a credential with your WebSocket connection details.
2. Select Custom TTS in Voice Output: Open your assistant deployment, go to Voice Output, and select Custom TTS as the text-to-speech provider.
3. Set voice and audio arguments: Configure the voice ID, model, language, audio encoding, and sample rate so Rapida can interpret the provider's audio correctly.
4. Write WebSocket DSL rules: Define query parameters, request rules, and response rules so Rapida knows how to send text and receive audio.
Credential fields
| Field | Required | Description |
|---|---|---|
| apiCompatibility | Yes | Must be websocket_v1. |
| baseUrl | Yes | WebSocket URL for your TTS service, for example wss://tts.example.com/v1/speak. |
| headers | No | Header map sent during the WebSocket handshake, for example {"Authorization":"Bearer sk_..."}. |

The snake_case keys api_compatibility and base_url are equivalent to apiCompatibility and baseUrl.
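If you manage credentials from scripts, a local sanity check can catch mistakes before the WebSocket handshake fails. The Python sketch below is a hypothetical helper. check_credential is an illustrative name, not a Rapida API. It treats the snake_case keys as aliases of the camelCase ones and assumes that header values are plain strings.

```python
from urllib.parse import urlsplit


def check_credential(cred: dict) -> list:
    """Return a list of problems in a Custom TTS credential; empty means it looks valid.

    Hypothetical local check only; Rapida performs its own validation.
    """
    problems = []
    # Assumption: snake_case keys are accepted as aliases of the camelCase ones.
    compat = cred.get("apiCompatibility", cred.get("api_compatibility"))
    base_url = cred.get("baseUrl", cred.get("base_url"))

    if compat != "websocket_v1":
        problems.append("apiCompatibility must be websocket_v1")
    # WebSocket URLs use the ws:// or wss:// scheme.
    if not isinstance(base_url, str) or urlsplit(base_url).scheme not in ("ws", "wss"):
        problems.append("baseUrl must be a ws:// or wss:// URL")
    # headers is optional; when present it should map header names to string values.
    headers = cred.get("headers", {})
    if not isinstance(headers, dict) or not all(
        isinstance(name, str) and isinstance(value, str) for name, value in headers.items()
    ):
        problems.append("headers must map header names to string values")
    return problems
```

For example, check_credential({"apiCompatibility": "websocket_v1", "baseUrl": "wss://tts.example.com/v1/speak"}) returns an empty list.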
TTS arguments
| Option key | Required | Default | Description |
|---|---|---|---|
| speak.voice.id | Yes | None | Provider voice identifier. Available as voice_id in query params and config.voice.id in request rules. |
| speak.model | No | Empty | Provider model identifier. Available as model in query params and config.model in request rules. |
| speak.language | No | Empty | Provider language code. Available as language in query params and config.language in request rules. |
| speak.audio.encoding | Yes | LINEAR16 | Audio encoding expected back from the provider. Supported values: LINEAR16, MuLaw8. |
| speak.audio.sample_rate | Yes | 16000 | Audio sample rate expected back from the provider. Common values include 8000, 16000, 24000, 44100, and 48000. |
| speak.ws.query_params | No | {} | Flat JSON object appended to baseUrl as query parameters. |
| speak.ws.request_rules | Yes | None | Ordered JSON array that maps Rapida TTS packets to outbound WebSocket frames. Must contain at least one text rule. |
| speak.ws.response_rules | Yes | None | Ordered JSON array that maps provider frames to audio, done, or error events. |
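The sketch below collects the speak.* options in one place and checks the requirements stated in the table. A voice ID is required, the encoding must be LINEAR16 or MuLaw8, and the request rules must include a text rule. It is a hypothetical helper. It assumes option values are stored as strings, which is why the three DSL sections are serialized with json.dumps. The when.packet field is named on this page; any other rule fields you pass in (such as a send.frame field) are assumptions.

```python
import json

SUPPORTED_ENCODINGS = {"LINEAR16", "MuLaw8"}


def build_tts_options(voice_id, request_rules, response_rules, *, model="", language="",
                      encoding="LINEAR16", sample_rate=16000, query_params=None):
    """Assemble the speak.* options from the table above (hypothetical helper)."""
    if not voice_id:
        raise ValueError("speak.voice.id is required")
    if encoding not in SUPPORTED_ENCODINGS:
        raise ValueError(f"unsupported speak.audio.encoding: {encoding!r}")
    # request_rules must contain at least one rule for text packets.
    if not any(rule.get("when", {}).get("packet") == "text" for rule in request_rules):
        raise ValueError("speak.ws.request_rules needs at least one text rule")
    if not response_rules:
        raise ValueError("speak.ws.response_rules is required")
    # Assumption: option values are stored as strings, so DSL sections become JSON text.
    return {
        "speak.voice.id": voice_id,
        "speak.model": model,
        "speak.language": language,
        "speak.audio.encoding": encoding,
        "speak.audio.sample_rate": str(sample_rate),
        "speak.ws.query_params": json.dumps(query_params or {}),
        "speak.ws.request_rules": json.dumps(request_rules),
        "speak.ws.response_rules": json.dumps(response_rules),
    }
```

For example, build_tts_options("voice-123", request_rules, response_rules, sample_rate=24000) returns a dict whose speak.audio.sample_rate entry is "24000" and whose speak.ws.query_params entry is "{}".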
DSL sections
Custom TTS has three DSL sections:

| Section | Purpose |
|---|---|
| Query parameters | Add static or dynamic query params to the WebSocket URL. |
| Request rules | Convert Rapida text, done, and interrupt packets into provider WebSocket messages. |
| Response rules | Convert provider WebSocket frames into Rapida audio, done, or error events. |
Query parameters
Use speak.ws.query_params when your provider expects configuration in the WebSocket URL.
Supported variables:
| Variable | Source |
|---|---|
| message_id | Current synthesis message ID |
| voice_id | speak.voice.id |
| model | speak.model |
| language | speak.language |
| encoding | speak.audio.encoding |
| sample_rate | speak.audio.sample_rate |
- Query params must be a flat JSON object.
- Values must resolve to a primitive: string, number, boolean, or null.
- Existing query params in baseUrl are preserved unless the rendered DSL uses the same key.

text is not a supported query parameter variable. Use packet.text in request rules instead.
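Here is a minimal Python sketch of how the connection URL could be assembled from baseUrl, speak.ws.query_params, and the variables in the table above. It makes three assumptions: the operator object shape {"$var": "voice_id"}, that null renders as an empty value, and that booleans render as true/false. Rapida's actual rendering may differ, and $cast is not modeled.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Variables this page lists for query parameters; text is deliberately absent.
QUERY_VARIABLES = {"message_id", "voice_id", "model", "language", "encoding", "sample_rate"}


def render_param(value, variables):
    """Resolve one query param value; {"$var": name} is an assumed operator shape."""
    if isinstance(value, dict):
        if set(value) != {"$var"} or value["$var"] not in QUERY_VARIABLES:
            raise ValueError(f"unsupported query param expression: {value!r}")
        value = variables[value["$var"]]
    # Nested objects and arrays are rejected: values must resolve to a primitive.
    if value is not None and not isinstance(value, (str, int, float, bool)):
        raise ValueError("query param values must resolve to a primitive")
    return value


def to_query_text(value):
    """Assumption: null renders as an empty value and booleans as true/false."""
    if value is None:
        return ""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def build_ws_url(base_url, query_params, variables):
    """Append rendered query params to baseUrl; DSL keys overwrite existing ones."""
    parts = urlsplit(base_url)
    merged = dict(parse_qsl(parts.query, keep_blank_values=True))
    for key, raw in query_params.items():
        merged[key] = to_query_text(render_param(raw, variables))
    return urlunsplit(parts._replace(query=urlencode(merged)))
```

Consider baseUrl wss://tts.example.com/v1/speak?format=pcm&language=fr with query params {"voice": {"$var": "voice_id"}, "language": {"$var": "language"}}. The existing format parameter is kept and language is overwritten by the rendered value. Referencing text raises an error, matching the rule above.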
Request rules
Request rules are evaluated for normalized TTS packets produced by Rapida.

| Packet | When it is sent | Available paths |
|---|---|---|
| text | LLM text is ready for synthesis | packet.kind, packet.message_id, packet.text, config.voice.id, config.model, config.language, config.audio.encoding, config.audio.sample_rate |
| done | The LLM response is complete | packet.kind, packet.message_id, packet.text, config.* |
| interrupt | User interruption is detected | packet.kind, packet.message_id, packet.text, config.* |

Supported outbound frames:

| Frame | Body must resolve to |
|---|---|
| binary | Bytes or string. |
| json | Valid JSON value. |
| text | Value convertible to string. |
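To make the packet and frame tables concrete, here is a hypothetical request-rule evaluator in Python. It relies on when.packet and send.body, which this page names. The send.frame field and the {"$path": ...} object shape are assumptions, and $cast is not modeled. The sketch sends every matching rule in order, because this page does not say whether request rules stop at the first match. A one-shot provider would need only a text rule; the text, done, and interrupt pattern adds one rule per packet kind.

```python
import json


def resolve(expr, scope):
    """Render a send.body expression; {"$path": "packet.text"} reads from the scope."""
    if isinstance(expr, dict):
        if any(key.startswith("$") for key in expr):
            if set(expr) != {"$path"}:
                raise ValueError(f"unsupported operator object in this sketch: {expr!r}")
            path = expr["$path"]
            # Request rules can only read from packet and config.
            if path.split(".")[0] not in ("packet", "config"):
                raise ValueError(f"request rules can only read packet or config: {path}")
            node = scope
            for key in path.split("."):
                if not isinstance(node, dict) or key not in node:
                    raise KeyError(f"missing path in send.body: {path}")
                node = node[key]
            return node
        return {key: resolve(value, scope) for key, value in expr.items()}
    if isinstance(expr, list):
        return [resolve(item, scope) for item in expr]
    return expr  # literal values are sent as-is


def encode_frame(frame, body):
    """Turn a rendered body into an outbound message for the given frame type."""
    if frame == "binary":
        if isinstance(body, str):
            return body.encode("utf-8")
        if isinstance(body, bytes):
            return body
        raise TypeError("binary frames need bytes or a string")
    if frame == "json":
        return json.dumps(body)
    if frame == "text":
        # Assumption: non-string values are converted with their JSON spelling.
        return body if isinstance(body, str) else json.dumps(body)
    raise ValueError(f"unknown frame type: {frame}")


def frames_for_packet(rules, packet, config):
    """Return (frame, payload) pairs for every rule whose when.packet matches packet.kind."""
    scope = {"packet": packet, "config": config}
    frames = []
    for rule in rules:
        if rule["when"]["packet"] == packet["kind"]:
            send = rule["send"]
            frames.append((send["frame"], encode_frame(send["frame"], resolve(send["body"], scope))))
    return frames
```

Take a text rule whose body is {"type": "speak", "text": {"$path": "packet.text"}, "voice": {"$path": "config.voice.id"}}. A text packet with the text Hello produces one json frame that carries "Hello" and the configured voice ID. A done packet with no done rule produces no frames.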
One-shot synthesis
Use this when the provider synthesizes each text packet immediately.
Text, done, and interrupt
Use this when the provider expects text payloads, an explicit final message, and an explicit cancel message.
Response rules
Response rules parse provider WebSocket frames into Rapida audio, done, or error events. Rules are checked in order. The first matching rule is applied, and later rules are skipped for that frame. Supported inbound frames:

| Frame | Use when |
|---|---|
| binary | Provider streams raw audio frames. |
| json | Provider returns JSON with base64 audio, done, or error fields. |

A matching rule's emit object can set these keys:

| Emit key | Type | Effect |
|---|---|---|
| audio | bytes | Emits a TTS audio chunk. |
| message_id | string | Associates audio, error, or done with a message. Falls back to the current context ID when omitted. |
| done | boolean | Ends synthesis for the message, closes the connection, and emits a TTS end packet. |
| error | string | Emits a TTS error. |
Binary audio response
Use this when the provider streams raw audio as binary WebSocket frames.
JSON base64 audio response
Use $decode when the provider returns base64-encoded audio inside JSON.
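Both response patterns can be modeled with a small first-match evaluator. This Python sketch is illustrative and is not Rapida's parser. The when.frame field, the when.equals comparison key, and the {"$decode": ..., "format": "base64"} shape are assumptions. The when.path, emit, $path, $frame, and $decode names come from this page.

```python
import base64
import json


def _read_path(data, path):
    """Follow a dot path through a parsed JSON frame; raise KeyError if it is missing."""
    node = data
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            raise KeyError(path)
        node = node[key]
    return node


def _render(expr, kind, frame):
    """Evaluate one emit value: $frame, $path, $decode, or a literal such as True."""
    if isinstance(expr, dict) and "$frame" in expr:
        if expr["$frame"] != "binary" or kind != "binary":
            raise ValueError("only $frame: binary on binary frames is supported")
        return frame
    if isinstance(expr, dict) and "$path" in expr:
        if kind != "json":
            raise ValueError("$path works only with JSON frames")
        return _read_path(frame, expr["$path"])  # a missing path in emit is an error
    if isinstance(expr, dict) and "$decode" in expr:
        if expr.get("format", "base64") != "base64":
            raise ValueError("only base64 decoding is supported")
        return base64.b64decode(_render(expr["$decode"], kind, frame), validate=True)
    return expr


def _matches(when, kind, frame):
    if when.get("frame") != kind:
        return False
    if "path" in when:
        try:
            return _read_path(frame, when["path"]) == when.get("equals")
        except KeyError:
            return False  # a missing when.path means the rule does not match
    return True


def handle_frame(rules, raw):
    """Return the emitted fields for the first matching rule, or None to ignore the frame."""
    # Binary messages carry raw audio; text messages are parsed as JSON.
    kind = "binary" if isinstance(raw, bytes) else "json"
    frame = raw if kind == "binary" else json.loads(raw)
    for rule in rules:
        if _matches(rule["when"], kind, frame):
            return {key: _render(value, kind, frame) for key, value in rule["emit"].items()}
    return None
```

A rule list can start with {"when": {"frame": "binary"}, "emit": {"audio": {"$frame": "binary"}}} to handle raw audio. A JSON rule that matches type equal to audio can then emit {"$decode": {"$path": "data"}, "format": "base64"}. Frames that match no rule return None, mirroring the behavior that unmatched frames are ignored.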
Operators
Every operator object must contain only that operator and its required fields.

| Operator | Where supported | Description |
|---|---|---|
| $var | Query parameters | Reads message_id, voice_id, model, language, encoding, or sample_rate. |
| $path | Request rules, response rules | Reads a dot path from request scope or a JSON response frame. |
| $cast | Query parameters, request rules, response rules | Casts to string, number, or boolean. |
| $frame | Response rules | Reads the full current binary response frame. |
| $decode | Response rules | Decodes a base64 string into bytes. Only base64 is supported. |

Not supported:
- Text response frames
- $frame: "text"
- $frame: "json"
- Decode formats other than base64
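A local linter for the operator table could look like the following Python sketch. check_operators and the section names are hypothetical. It enforces only what the table states: one operator per object, the supported sections for each operator, and $frame limited to binary. It does not check each operator's required fields, whose exact shapes are not documented here.

```python
OPERATOR_SECTIONS = {
    "$var": {"query_params"},
    "$path": {"request_rules", "response_rules"},
    "$cast": {"query_params", "request_rules", "response_rules"},
    "$frame": {"response_rules"},
    "$decode": {"response_rules"},
}


def check_operators(node, section):
    """Raise ValueError if an operator is unknown, combined, or used outside its section."""
    if isinstance(node, dict):
        operators = [key for key in node if key.startswith("$")]
        if len(operators) > 1:
            raise ValueError(f"one operator per object: {operators}")
        if operators:
            op = operators[0]
            if op not in OPERATOR_SECTIONS:
                raise ValueError(f"unknown operator {op}")
            if section not in OPERATOR_SECTIONS[op]:
                raise ValueError(f"{op} is not supported in {section}")
            if op == "$frame" and node[op] != "binary":
                raise ValueError('only $frame: "binary" is supported')
        # Walk nested values, for example the argument of $decode.
        for value in node.values():
            check_operators(value, section)
    elif isinstance(node, list):
        for item in node:
            check_operators(item, section)
```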
Cast behavior
| Cast | Behavior |
|---|---|
| string | Converts strings, bytes, numbers, booleans, and null to string form. |
| number | Converts JSON numbers, numeric values, or numeric strings to an integer or float. |
| boolean | Converts booleans, boolean strings, and numeric values. JSON numbers are accepted as 0 or 1; typed numeric values use zero as false and non-zero as true. |
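The cast table can be read as the following Python sketch. It is an interpretation, not Rapida's code. The from_json flag stands in for the distinction between JSON numbers and typed numeric values. Two details are assumptions: the string form of null, and the accepted boolean spellings (true and false, case-insensitive).

```python
def cast(value, to, *, from_json=True):
    """Sketch of the $cast rules in the table above.

    from_json marks numbers that came from a JSON document; for boolean casts
    those must be exactly 0 or 1, while typed numbers treat any non-zero as true.
    """
    if to == "string":
        if value is None:
            return "null"  # assumption: the exact string form of null is not documented
        if isinstance(value, bool):
            return "true" if value else "false"
        if isinstance(value, bytes):
            return value.decode("utf-8")
        return str(value)
    if to == "number":
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return value
        if isinstance(value, str):
            try:
                return int(value)
            except ValueError:
                return float(value)  # raises ValueError for non-numeric strings
        raise TypeError(f"cannot cast {value!r} to number")
    if to == "boolean":
        if isinstance(value, bool):
            return value
        if isinstance(value, str) and value.lower() in ("true", "false"):
            return value.lower() == "true"  # assumption: only true/false spellings
        if isinstance(value, (int, float)):
            if from_json and value not in (0, 1):
                raise ValueError("JSON numbers cast to boolean must be 0 or 1")
            return value != 0
        raise ValueError(f"cannot cast {value!r} to boolean")
    raise ValueError(f"unsupported cast target: {to}")
```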
Path behavior
$path uses dot-separated paths.
- Keys containing a literal dot are not addressable.
- Request rules can only read from config and packet.
- Response rules can use $path only with JSON response frames.
- A missing path in when.path means the rule does not match.
- A missing path in emit or send.body is an error.
Runtime behavior
- The connection URL is built from baseUrl and speak.ws.query_params.
- A connection is opened per active message or context. A new context closes the previous connection.
- text packets open the WebSocket connection if needed.
- done and interrupt rules are optional. If no rule exists for that packet, nothing is sent.
- On interruption, Rapida sends the frame from the interrupt rule, if one is defined, then closes the connection.
- Audio returned by the provider is interpreted using speak.audio.encoding and speak.audio.sample_rate, then resampled internally when needed.
- If no response rule matches an inbound frame, the frame is ignored.
- If a response emits error, Rapida emits a TTS error packet.
- If a response emits done, Rapida closes the connection and emits a TTS end packet.
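The lifecycle rules above can be summarized as a small state machine. The Python class below is a hypothetical model for reasoning about when frames are sent and when the socket closes. It is not Rapida's implementation. frames_for and connect are placeholder callables you would supply, and all names are illustrative.

```python
class CustomTTSSession:
    """Hypothetical model of the runtime rules above; not Rapida's implementation."""

    def __init__(self, frames_for, connect):
        self.frames_for = frames_for  # packet -> outbound frames (empty if no rule)
        self.connect = connect        # context id -> object with send() and close()
        self.context_id = None
        self.conn = None

    def _close(self):
        if self.conn is not None:
            self.conn.close()
            self.conn = None

    def on_packet(self, packet):
        if packet["message_id"] != self.context_id:
            self._close()  # a new context closes the previous connection
            self.context_id = packet["message_id"]
        if packet["kind"] == "text" and self.conn is None:
            self.conn = self.connect(self.context_id)  # text packets open the connection
        if self.conn is None:
            return  # nothing can be sent without an open connection
        for frame in self.frames_for(packet):
            self.conn.send(frame)  # done/interrupt without a rule send nothing
        if packet["kind"] == "interrupt":
            self._close()  # the interrupt frame goes out first, then the socket closes

    def on_response(self, event):
        """Map one emitted event to the TTS packet kinds Rapida would produce."""
        out = []
        if "audio" in event:
            out.append("audio")
        if "error" in event:
            out.append("error")
        if event.get("done"):
            self._close()
            out.append("end")
        return out
```

Suppose you feed it text, done, and interrupt packets for one message. It sends the text frame, sends nothing for done when no done rule exists, then sends the interrupt frame and closes the connection.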
Current limits
- No regex, contains, starts-with, greater-than, or compound match conditions.
- No string interpolation or concatenation.
- No fallback values inside expressions.
- No dynamic headers or dynamic WebSocket path segments.
- No text response handling for TTS.
- No $frame: "json" selector in emit rules.
- $decode supports only base64.
Troubleshooting
| Symptom | Likely cause | What to check |
|---|---|---|
| WebSocket does not connect | Bad baseUrl, headers, or compatibility value | Confirm apiCompatibility, baseUrl, and auth headers. |
| Provider receives no text | Missing text request rule | Add a when.packet = text rule. |
| Audio never plays | Response rule does not emit audio | Check binary frames or base64 $decode mapping. |
| Audio sounds distorted | Encoding or sample rate mismatch | Confirm speak.audio.encoding and speak.audio.sample_rate. |
| Audio keeps playing after interruption | Missing provider cancel message | Add an interrupt request rule. |
| Session never ends cleanly | Missing done handling | Emit done: true from the provider’s done frame. |
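For the "Audio sounds distorted" row, it helps to compare how long a chunk should last under the configured settings. This Python helper assumes mono audio, with LINEAR16 as 2 bytes per sample (16-bit PCM) and MuLaw8 as 1 byte per sample. chunk_duration_ms is an illustrative name.

```python
# Bytes per sample for each supported encoding; mono audio is assumed.
BYTES_PER_SAMPLE = {"LINEAR16": 2, "MuLaw8": 1}


def chunk_duration_ms(num_bytes, encoding, sample_rate):
    """Playback length of an audio chunk under the given encoding and sample rate."""
    return 1000 * num_bytes / (BYTES_PER_SAMPLE[encoding] * sample_rate)
```

Consider 48,000 bytes of 24 kHz LINEAR16 audio, which should last 1000 ms. If speak.audio.sample_rate is set to 16000, the same bytes are treated as 1500 ms of audio, so playback is slower and lower in pitch than intended.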
Related
- Text-to-Speech: Configure standard TTS providers and speech delivery.
- Custom STT: Configure a custom WebSocket STT provider.
- Voice Pipeline Overview: See how TTS fits into the full pipeline.
- Open-source runtime reference: Review the assistant-api implementation reference.