Custom Speech-to-Text - rapida.ai documentation

Custom Speech-to-Text lets you connect Rapida to a transcription service that is not available as a built-in provider. You configure the provider URL, headers, audio settings, and a small JSON DSL that maps Rapida audio packets to provider messages and provider responses to transcripts.

Use Custom STT when your provider can accept streaming audio over WebSocket or one HTTP transcription request per speech segment, then return transcripts as JSON or text.

Provider identifier: custom-stt
API compatibility: websocket_v1 or http_v1

Custom STT is an end-user configuration feature. You do not need to write a new Rapida transformer when your provider can be described with the DSL below.

Setup flow

Create the Custom STT credential

Open Credentials or Integrations > Models, choose Custom STT, and create a credential with your provider connection details.

Select Custom STT in Voice Input

Open your assistant deployment, go to Voice Input, and select Custom STT as the speech-to-text provider.

Choose API compatibility

Select WebSocket v1 for streaming providers or HTTP v1 for providers that transcribe one completed speech segment per request.

Set audio arguments

Choose the audio encoding and sample rate your provider expects.

Write DSL rules

Define query parameters, request rules, and response rules so Rapida knows how to talk to your provider.

Test with conversation logs

Run a test call or web session. Check transcripts, latency, errors, and whether interim/final transcripts are emitted correctly.

Credential fields

Field	Required	Description
`apiCompatibility`	No	`websocket_v1` or `http_v1`. Defaults to `websocket_v1` when omitted.
`baseUrl`	Yes	WebSocket URL or HTTP endpoint URL for your STT service, for example `wss://stt.example.com/v1/listen` or `https://stt.example.com/v1/transcribe`.
`headers`	No	Header map sent during the WebSocket handshake or HTTP request, for example `{"Authorization":"Bearer sk_..."}`.

The runtime also accepts snake case keys: api_compatibility and base_url.

Headers are copied from the credential as static values. The DSL cannot template headers or change the URL path dynamically.

STT arguments

Option key	Required	Default	Description
`listen.model`	No	Empty	Optional provider model identifier. Available as `model` in query params and `config.model` in request rules when supplied by API or imported metadata.
`listen.language`	No	Empty	Optional provider language code. Available as `language` in query params and `config.language` in request rules when supplied by API or imported metadata.
`listen.audio.encoding`	Yes	`LINEAR16`	Audio encoding sent to the provider. Supported values: `LINEAR16`, `MuLaw8`.
`listen.audio.sample_rate`	Yes	`16000`	Audio sample rate sent to the provider. Supported UI values are `8000`, `16000`, `22050`, `24000`, `32000`, `44100`, and `48000`.
`listen.query_params`	No	`{}`	Flat JSON object appended to `baseUrl` as query parameters.
`listen.request_rules`	Yes	See example below	Ordered JSON array that maps Rapida packets to outbound provider messages. Must contain at least one `audio` rule.
`listen.response_rules`	Yes	None	Ordered JSON array that maps provider responses to transcripts or errors.

DSL sections

Custom STT has three DSL sections:

Section	Purpose
Query parameters	Add static or dynamic query params to the URL.
Request rules	Convert Rapida packets into provider messages.
Response rules	Convert provider WebSocket frames or HTTP responses into Rapida transcript events.

The DSL is intentionally small. It does not run JavaScript, call functions, read environment variables, use regex, concatenate strings, or evaluate compound conditions.

Query parameters

Use listen.query_params when your provider expects configuration in the URL.

Supported variables:

Variable	Source
`model`	`listen.model`
`language`	`listen.language`
`encoding`	`listen.audio.encoding`
`sample_rate`	`listen.audio.sample_rate`

```json
{
  "language": { "$var": "language" },
  "model": { "$var": "model" },
  "encoding": { "$var": "encoding" },
  "sample_rate": {
    "$cast": "number",
    "value": { "$var": "sample_rate" }
  }
}
```

Rules:

Query params must be a flat JSON object.
Values must resolve to a primitive: string, number, boolean, or null.
Existing query params in baseUrl are preserved unless the rendered DSL uses the same key.
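
The merge rule above can be tried out locally. The following Python sketch is not Rapida's implementation; it only models the documented behavior under a few assumptions: variables arrive as strings, only the number cast is implemented, and non-string primitives are written in their JSON form. The helper names `render_value` and `build_url` and the `tier=pro` parameter are hypothetical.

```python
import json
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def render_value(node, variables):
    """Resolve a literal, {"$var": ...}, or {"$cast": "number", "value": ...} node."""
    if not isinstance(node, dict):
        return node  # plain literal
    if "$var" in node:
        return variables[node["$var"]]
    if node.get("$cast") == "number":
        text = str(render_value(node["value"], variables))
        try:
            return int(text)
        except ValueError:
            return float(text)
    # Nested objects land here too, which enforces the "flat object" rule.
    raise ValueError(f"unsupported query param expression: {node!r}")


def build_url(base_url, query_params, variables):
    """Render query_params and merge them into the query string of base_url."""
    parts = urlsplit(base_url)
    merged = dict(parse_qsl(parts.query, keep_blank_values=True))
    for key, node in query_params.items():
        value = render_value(node, variables)
        if isinstance(value, (dict, list)):
            raise ValueError(f"query param {key!r} must resolve to a primitive")
        # Assumption: non-string primitives are serialized in their JSON form.
        merged[key] = value if isinstance(value, str) else json.dumps(value)
    return urlunsplit(parts._replace(query=urlencode(merged)))


url = build_url(
    "wss://stt.example.com/v1/listen?language=en&tier=pro",
    {
        "language": {"$var": "language"},
        "sample_rate": {"$cast": "number", "value": {"$var": "sample_rate"}},
    },
    {"language": "fr", "sample_rate": "16000"},
)
print(url)
```

In this run the rendered `language` replaces the value already present in the URL, `tier` is preserved, and `sample_rate` is appended.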

Request rules

Request rules are evaluated for normalized packets produced by Rapida.

Packet	When it is sent	Available paths
`turn_change`	A new turn or context starts	`packet.kind`, `packet.context_id`, `config.model`, `config.language`, `config.audio.encoding`, `config.audio.sample_rate`
`audio`	Audio is ready to send	`packet.kind`, `packet.context_id`, `packet.audio.bytes`, `packet.audio.base64`, `packet.audio.pcm_base64`, `packet.audio.wav_base64`, `config.*`
`interrupt`	User interruption is detected	`packet.kind`, `packet.context_id`, `config.*`

Supported outbound frames:

Frame	Body must resolve to
`binary`	Bytes or string. Use this for raw audio streams.
`json`	Valid JSON value.
`text`	Value convertible to string.

Binary audio stream

Use this when the provider expects raw audio WebSocket frames.

```json
[
  {
    "when": { "packet": "audio" },
    "send": {
      "frame": "binary",
      "body": { "$path": "packet.audio.bytes" }
    }
  }
]
```

JSON audio payload

Use this when the provider expects base64 audio inside JSON.

```json
[
  {
    "when": { "packet": "audio" },
    "send": {
      "frame": "json",
      "body": {
        "audio": { "$path": "packet.audio.base64" },
        "encoding": { "$path": "config.audio.encoding" },
        "sample_rate": {
          "$cast": "number",
          "value": { "$path": "config.audio.sample_rate" }
        }
      }
    }
  }
]
```

HTTP transcription request

Use http_v1 when your provider accepts one JSON POST for a completed speech segment. Rapida buffers speech audio until the STT end packet, evaluates the first matching audio rule, and sends the rendered JSON body to baseUrl.

```json
[
  {
    "when": { "packet": "audio" },
    "send": {
      "frame": "json",
      "body": {
        "audio": { "$path": "packet.audio.wav_base64" },
        "encoding": { "$path": "config.audio.encoding" },
        "sample_rate": {
          "$cast": "number",
          "value": { "$path": "config.audio.sample_rate" }
        }
      }
    }
  }
]
```

For http_v1, the audio request rule must render a json frame. Binary and text request frames are valid for WebSocket STT, but HTTP STT posts a JSON body.
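
When testing an http_v1 endpoint outside of a call, it can help to build a comparable request body by hand. The Python sketch below is one way to do that. It rests on assumptions the source does not confirm: that packet.audio.wav_base64 is a standard WAV file holding 16-bit mono PCM, base64 encoded, and that the body keys match the rule above. The helper names `wav_base64` and `read_wav_base64` are hypothetical.

```python
import base64
import io
import wave


def wav_base64(pcm_chunks, sample_rate):
    """Join buffered 16-bit mono PCM chunks into one base64-encoded WAV file."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)  # assumption: mono audio
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(b"".join(pcm_chunks))
    return base64.b64encode(buffer.getvalue()).decode("ascii")


def read_wav_base64(encoded):
    """Provider-side check: decode the payload and return (sample_rate, frame_count)."""
    with wave.open(io.BytesIO(base64.b64decode(encoded)), "rb") as wav:
        return wav.getframerate(), wav.getnframes()


# Two buffered 10 ms chunks at 16 kHz (160 samples of 2 bytes each).
chunks = [b"\x00\x00" * 160, b"\x01\x00" * 160]
body = {
    "audio": wav_base64(chunks, 16000),
    "encoding": "LINEAR16",
    "sample_rate": 16000,
}
print(read_wav_base64(body["audio"]))
```

Posting `body` as JSON to a test endpoint, and decoding it there with something like `read_wav_base64`, gives a quick check that the sample rate your provider sees matches listen.audio.sample_rate.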

Start, audio, and interrupt

Use this pattern when the provider expects a session-start message, binary audio frames, and a flush message on interruption.

```json
[
  {
    "when": { "packet": "turn_change" },
    "send": {
      "frame": "json",
      "body": {
        "type": "start",
        "language": { "$path": "config.language" },
        "sample_rate": {
          "$cast": "number",
          "value": { "$path": "config.audio.sample_rate" }
        }
      }
    }
  },
  {
    "when": { "packet": "audio" },
    "send": {
      "frame": "binary",
      "body": { "$path": "packet.audio.bytes" }
    }
  },
  {
    "when": { "packet": "interrupt" },
    "send": {
      "frame": "json",
      "body": { "type": "flush" }
    }
  }
]
```
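
To see how a rule set like this turns packets into provider messages, the following Python sketch evaluates it against sample packets. It is an illustration, not the Rapida runtime: it assumes the first rule whose when.packet equals packet.kind is used, implements only `$path` and the number cast, and uses hypothetical helper names (`resolve_path`, `render`, `outbound_message`) and a made-up context ID.

```python
def resolve_path(scope, dotted):
    """Walk a dot path through nested dicts; a missing segment is an error."""
    current = scope
    for segment in dotted.split("."):
        if not isinstance(current, dict) or segment not in current:
            raise KeyError(f"missing path in send.body: {dotted}")
        current = current[segment]
    return current


def render(node, scope):
    """Render a send.body template: literals, $path, number cast, nested objects."""
    if not isinstance(node, dict):
        return node
    if "$path" in node:
        return resolve_path(scope, node["$path"])
    if "$cast" in node:
        if node["$cast"] != "number":
            raise ValueError("this sketch only implements the number cast")
        text = str(render(node["value"], scope))
        try:
            return int(text)
        except ValueError:
            return float(text)
    return {key: render(child, scope) for key, child in node.items()}


def outbound_message(rules, packet, config):
    """Return (frame, body) for the first rule matching packet["kind"], else None."""
    scope = {"packet": packet, "config": config}
    for rule in rules:
        if rule["when"]["packet"] == packet["kind"]:
            return rule["send"]["frame"], render(rule["send"]["body"], scope)
    return None


RULES = [
    {"when": {"packet": "turn_change"},
     "send": {"frame": "json", "body": {
         "type": "start",
         "language": {"$path": "config.language"},
         "sample_rate": {"$cast": "number",
                         "value": {"$path": "config.audio.sample_rate"}}}}},
    {"when": {"packet": "audio"},
     "send": {"frame": "binary", "body": {"$path": "packet.audio.bytes"}}},
    {"when": {"packet": "interrupt"},
     "send": {"frame": "json", "body": {"type": "flush"}}},
]
CONFIG = {"language": "en", "audio": {"encoding": "LINEAR16", "sample_rate": "16000"}}

start = outbound_message(RULES, {"kind": "turn_change", "context_id": "ctx-1"}, CONFIG)
audio = outbound_message(
    RULES,
    {"kind": "audio", "context_id": "ctx-1", "audio": {"bytes": b"\x00\x01"}},
    CONFIG,
)
print(start)
print(audio)
```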

Response rules

Response rules parse provider WebSocket frames or HTTP response bodies into Rapida transcript packets. The first matching rule is evaluated and later rules are skipped for that response.

Supported inbound frames:

Frame	Use when
`json`	Provider returns structured transcript events.
`text`	Provider returns plain transcript text.

Supported emit keys:

Emit key	Type	Effect
`script`	string	Transcript text. Empty transcripts are ignored.
`confidence`	number	Optional transcript confidence. Defaults to `0` when omitted.
`language`	string	Optional transcript language. Falls back to `listen.language` when omitted.
`interim`	boolean	`true` emits an interim transcript; `false` emits a completed transcript.
`error`	string	Emits an STT error instead of a transcript.

JSON partial and final transcripts

```json
[
  {
    "when": { "frame": "json", "path": "type", "equals": "partial" },
    "emit": {
      "script": { "$path": "text" },
      "confidence": {
        "$cast": "number",
        "value": { "$path": "confidence" }
      },
      "language": { "$path": "language" },
      "interim": true
    }
  },
  {
    "when": { "frame": "json", "path": "type", "equals": "final" },
    "emit": {
      "script": { "$path": "text" },
      "confidence": {
        "$cast": "number",
        "value": { "$path": "confidence" }
      },
      "language": { "$path": "language" },
      "interim": false
    }
  },
  {
    "when": { "frame": "json", "path": "type", "equals": "error" },
    "emit": {
      "error": { "$path": "error.message" }
    }
  }
]
```
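
Before pointing these rules at a live provider, it can be useful to dry-run them against captured frames. The Python sketch below models the documented matching behavior (first match wins, a missing when.path means no match, a missing emit path is an error) for JSON frames only. It is an approximation rather than Rapida's code; the helper names and the sample frames are hypothetical, and the rules are a shortened version of the set above.

```python
import json

_MISSING = object()


def lookup(data, dotted):
    """Follow a dot path through nested dicts; return _MISSING if a key is absent."""
    current = data
    for segment in dotted.split("."):
        if not isinstance(current, dict) or segment not in current:
            return _MISSING
        current = current[segment]
    return current


def render_emit(node, frame):
    """Resolve one emit value: a literal, {"$path": ...}, or a number cast."""
    if not isinstance(node, dict):
        return node
    if "$path" in node:
        value = lookup(frame, node["$path"])
        if value is _MISSING:
            raise KeyError(f"missing path in emit: {node['$path']}")
        return value
    if node.get("$cast") == "number":
        return float(render_emit(node["value"], frame))
    raise ValueError(f"unsupported emit expression: {node!r}")


def apply_response_rules(rules, raw_frame):
    """Return the emit dict of the first matching JSON rule, or None (frame ignored)."""
    frame = json.loads(raw_frame)
    for rule in rules:
        when = rule["when"]
        if when.get("frame") != "json":
            continue
        if "path" in when and lookup(frame, when["path"]) != when.get("equals"):
            continue  # a missing when.path also ends up here: no match
        return {key: render_emit(node, frame) for key, node in rule["emit"].items()}
    return None


RESPONSE_RULES = [
    {"when": {"frame": "json", "path": "type", "equals": "partial"},
     "emit": {"script": {"$path": "text"},
              "confidence": {"$cast": "number", "value": {"$path": "confidence"}},
              "interim": True}},
    {"when": {"frame": "json", "path": "type", "equals": "final"},
     "emit": {"script": {"$path": "text"}, "interim": False}},
    {"when": {"frame": "json", "path": "type", "equals": "error"},
     "emit": {"error": {"$path": "error.message"}}},
]

partial = apply_response_rules(
    RESPONSE_RULES, '{"type": "partial", "text": "hel", "confidence": "0.4"}')
failure = apply_response_rules(
    RESPONSE_RULES, '{"type": "error", "error": {"message": "bad audio"}}')
ignored = apply_response_rules(RESPONSE_RULES, '{"type": "ping"}')
print(partial, failure, ignored)
```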

Plain text transcript response

```json
[
  {
    "when": { "frame": "text" },
    "emit": {
      "script": { "$frame": "text" },
      "interim": false
    }
  }
]
```

Operators

Every operator object must contain only that operator and its required fields.

Operator	Where supported	Description
`$var`	Query parameters	Reads `model`, `language`, `encoding`, or `sample_rate`.
`$path`	Request rules, response rules	Reads a dot path from request scope or a JSON response frame.
`$cast`	Query parameters, request rules, response rules	Casts to `string`, `number`, or `boolean`.
`$frame`	Response rules	Reads the full current text response frame.

Unsupported for Custom STT:

$decode
Binary response handling
$frame: "binary"
$frame: "json"

Cast behavior

Cast	Behavior
`string`	Converts strings, bytes, numbers, booleans, and null to string form.
`number`	Converts JSON numbers, numeric values, or numeric strings to an integer or float.
`boolean`	Converts booleans, boolean strings, and numeric values. JSON numbers are accepted as `0` or `1`; typed numeric values use zero as `false` and non-zero as `true`.
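
As a rough mental model of the table above, the casts can be sketched in Python. This is an approximation with stated gaps: bytes and null are left out because the source does not say what their string form looks like, the spelling of boolean strings (`true`/`false`) is an assumption, and numbers follow the stricter JSON-number rule (only `0` or `1` cast to boolean). The function name `cast` is hypothetical.

```python
def cast(value, target):
    """Cast a string, number, or boolean to the target type (sketch only)."""
    if target == "string":
        if isinstance(value, bool):
            return "true" if value else "false"  # assumption: JSON-style spelling
        if isinstance(value, (str, int, float)):
            return str(value)
    elif target == "number":
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return value
        if isinstance(value, str):
            try:
                return int(value)
            except ValueError:
                return float(value)  # raises ValueError for non-numeric strings
    elif target == "boolean":
        if isinstance(value, bool):
            return value
        if isinstance(value, (int, float)) and value in (0, 1):
            return value == 1  # JSON-number rule: only 0 or 1 are accepted
        if isinstance(value, str) and value.lower() in ("true", "false"):
            return value.lower() == "true"  # assumption: accepted boolean strings
    raise ValueError(f"cannot cast {value!r} to {target}")


print(cast("16000", "number"), cast("0.87", "number"), cast(1, "boolean"))
```

The number cast matters most in practice: providers that return confidence or sample rate as strings need it before the value reaches a numeric emit key or a numeric request field.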

Path behavior

$path uses dot-separated paths.

```json
{ "$path": "packet.audio.base64" }
```

Objects are traversed by key. Arrays are traversed by numeric index.

```json
{ "$path": "results.0.transcript" }
```

Limits:

Keys containing a literal dot are not addressable.
Request rules can only read from config and packet.
Response rules can use $path only with JSON response frames.
A missing path in when.path means the rule does not match.
A missing path in emit or send.body is an error.
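
The traversal and its dotted-key limit can be illustrated with a short Python sketch. It assumes a path is split on every dot and each segment is used either as an object key or as an array index; the function name `resolve_path` and the sample frame are hypothetical.

```python
def resolve_path(data, dotted):
    """Resolve a dot-separated path; objects by key, arrays by numeric index."""
    current = data
    for segment in dotted.split("."):
        if isinstance(current, list):
            if not segment.isdigit() or int(segment) >= len(current):
                raise KeyError(f"missing path: {dotted}")
            current = current[int(segment)]
        elif isinstance(current, dict):
            if segment not in current:
                raise KeyError(f"missing path: {dotted}")
            current = current[segment]
        else:
            raise KeyError(f"missing path: {dotted}")
    return current


frame = {
    "results": [{"transcript": "hello world"}],
    "meta.version": "2",  # key with a literal dot
}

print(resolve_path(frame, "results.0.transcript"))

try:
    resolve_path(frame, "meta.version")
except KeyError as exc:
    # The path is split into "meta" and "version", so the dotted key is never found.
    print("not addressable:", exc)
```

If a provider nests the transcript under a key that contains a dot, the DSL cannot reach it, so the response would need to be reshaped on the provider side.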

Runtime behavior

The URL is built from baseUrl and listen.query_params.
Audio is converted to listen.audio.encoding and resampled to listen.audio.sample_rate before request rules run.
For websocket_v1, turn_change and audio packets open the WebSocket connection if needed.
For http_v1, Rapida buffers audio for a speech segment and sends one HTTP POST when speech ends.
Messages rendered from interrupt rules are sent only over the WebSocket transport, and only when a connection is already active.
If no response rule matches an inbound frame, the frame is ignored.
If a response emits error, Rapida emits an STT error packet.
If a response emits a non-empty script, Rapida emits a transcript packet and a conversation event.

Current limits

No regex, contains, starts-with, greater-than, or compound match conditions.
No string interpolation or concatenation.
No fallback values inside expressions.
No dynamic headers or dynamic URL path segments.
No binary response handling for STT.
No $decode.

Troubleshooting

Symptom	Likely cause	What to check
WebSocket does not connect	Bad `baseUrl`, headers, or compatibility value	Confirm `apiCompatibility`, `baseUrl`, and auth headers.
HTTP request fails	Bad endpoint, headers, request body, or non-2xx status	Confirm `baseUrl`, auth headers, and that the `audio` rule emits `send.frame = json`.
Provider receives no audio	Missing `audio` request rule	Add a `when.packet = audio` rule.
Provider receives JSON but expected binary	Wrong `send.frame`	Use `binary` with `packet.audio.bytes`.
Transcript never appears	Response rules do not match provider frames	Check `when.frame`, `when.path`, and `when.equals`.
Partial transcripts show as final	Wrong `interim` value	Emit `interim: true` for partial responses.
Language or sample rate is wrong	Query params or request body not mapped	Use `$var` in query params or `$path` from `config.*` in request rules.

Related

Speech-to-Text

Configure standard STT providers and transcription tuning.

Custom TTS

Configure a custom WebSocket TTS provider.

Listen

See how STT fits into speech input configuration.

Open-source runtime reference

Review the assistant-api implementation reference.

