Architecture

The assistant-api decouples audio I/O from provider-specific logic through a transformer layer. Each provider implements the same generic interface. The factory functions resolve the provider string at call time.
assistant-api
└── internal/transformer/
    ├── transformer.go          ← interface + factory functions
    ├── deepgram/               ← deepgram.go, stt.go, tts.go, normalizer.go
    ├── elevenlabs/
    ├── google/
    ├── azure/
    ├── cartesia/
    ├── assembly-ai/
    ├── revai/
    ├── sarvam/
    ├── openai/
    ├── resemble/
    ├── aws/
    └── speechmatics/

Transformer Interface

Every STT and TTS provider implements the same Transformers[IN] generic interface:
// api/assistant-api/internal/type/transformer.go
type Transformers[IN any] interface {
    // Initialize sets up the provider connection (WebSocket, HTTP client, etc.)
    // Called once per call session before audio begins.
    Initialize() error

    // Transform sends one audio packet (STT) or one LLM text packet (TTS)
    // to the provider. Results are delivered via the onPacket callback
    // registered at construction time.
    Transform(context.Context, IN) error

    // Close tears down the connection and releases resources.
    Close(context.Context) error
}
Type aliases used in practice:
type SpeechToTextTransformer = Transformers[UserAudioPacket]
type TextToSpeechTransformer  = Transformers[LLMPacket]
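
The lifecycle implied by the interface (Initialize once, Transform per packet, Close at the end) can be sketched with a toy in-memory provider. This is a minimal sketch, not code from the repository: `TextPacket`, `upperTransformer`, `RunSession`, and `demo` are hypothetical names, and the "provider" only upper-cases text before handing it to the callback.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
)

// Transformers mirrors the interface shown above.
type Transformers[IN any] interface {
	Initialize() error
	Transform(context.Context, IN) error
	Close(context.Context) error
}

// TextPacket is a hypothetical stand-in for a real packet type.
type TextPacket struct{ Text string }

// upperTransformer is a hypothetical provider: it upper-cases text and
// delivers the result through onPacket, registered at construction time.
type upperTransformer struct {
	ready    bool
	onPacket func(string)
}

func NewUpperTransformer(onPacket func(string)) Transformers[TextPacket] {
	return &upperTransformer{onPacket: onPacket}
}

func (u *upperTransformer) Initialize() error { u.ready = true; return nil }

func (u *upperTransformer) Transform(ctx context.Context, in TextPacket) error {
	if !u.ready {
		return errors.New("Transform called before Initialize or after Close")
	}
	if err := ctx.Err(); err != nil {
		return err // the call session was cancelled
	}
	u.onPacket(strings.ToUpper(in.Text))
	return nil
}

func (u *upperTransformer) Close(context.Context) error { u.ready = false; return nil }

// RunSession drives one session against any transformer, whatever its input type.
func RunSession[IN any](ctx context.Context, t Transformers[IN], packets []IN) (err error) {
	if err = t.Initialize(); err != nil {
		return err
	}
	defer func() {
		// Always close; keep the first error if Transform already failed.
		if cerr := t.Close(ctx); err == nil {
			err = cerr
		}
	}()
	for _, p := range packets {
		if err = t.Transform(ctx, p); err != nil {
			return err
		}
	}
	return nil
}

// demo runs a two-packet session and returns what the callback received.
func demo() ([]string, error) {
	var out []string
	t := NewUpperTransformer(func(s string) { out = append(out, s) })
	err := RunSession(context.Background(), t, []TextPacket{{"hello"}, {"world"}})
	return out, err
}

func main() {
	out, err := demo()
	fmt.Println(out, err) // [HELLO WORLD] <nil>
}
```

Because results arrive through the callback rather than a return value, the caller never blocks on the provider inside Transform.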

Provider Identifiers

Provider strings are defined as AudioTransformer constants in api/assistant-api/internal/transformer/transformer.go:
const (
    DEEPGRAM              AudioTransformer = "deepgram"
    GOOGLE_SPEECH_SERVICE AudioTransformer = "google-speech-service"
    AZURE_SPEECH_SERVICE  AudioTransformer = "azure-speech-service"
    CARTESIA              AudioTransformer = "cartesia"
    REVAI                 AudioTransformer = "revai"
    SARVAM                AudioTransformer = "sarvamai"
    ELEVENLABS            AudioTransformer = "elevenlabs"
    ASSEMBLYAI            AudioTransformer = "assemblyai"
)
These strings are stored in the assistant’s configuration and resolved at call time by the factory functions.

Factory Functions

The factory functions accept the provider string and return the correct implementation:
// api/assistant-api/internal/transformer/transformer.go

func GetSpeechToTextTransformer(
    ctx        context.Context,
    logger     commons.Logger,
    provider   string,
    credential *protos.VaultCredential,
    onPacket   func(STTPacket),
    opts       utils.Option,
) (SpeechToTextTransformer, error)

func GetTextToSpeechTransformer(
    ctx        context.Context,
    logger     commons.Logger,
    provider   string,
    credential *protos.VaultCredential,
    onPacket   func(AudioPacket),
    opts       utils.Option,
) (TextToSpeechTransformer, error)
Both functions use a switch on AudioTransformer(provider). Adding a new provider means adding a case to each switch.
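
The dispatch described above might look roughly like the following. This is a simplified sketch under stated assumptions: `SpeechToText`, `stubSTT`, `getSpeechToText`, and `ErrUnsupportedProvider` are hypothetical stand-ins, and the real functions also take the context, logger, credential, callback, and options shown in the signatures.

```go
package main

import (
	"errors"
	"fmt"
)

type AudioTransformer string

const (
	DEEPGRAM              AudioTransformer = "deepgram"
	GOOGLE_SPEECH_SERVICE AudioTransformer = "google-speech-service"
)

// ErrUnsupportedProvider is a hypothetical sentinel for unknown provider strings.
var ErrUnsupportedProvider = errors.New("unsupported provider")

// SpeechToText is a trimmed stand-in for SpeechToTextTransformer.
type SpeechToText interface{ Name() string }

// stubSTT stands in for a real provider implementation.
type stubSTT struct{ name string }

func (s stubSTT) Name() string { return s.name }

// getSpeechToText converts the stored string to AudioTransformer and switches on it.
func getSpeechToText(provider string) (SpeechToText, error) {
	switch AudioTransformer(provider) {
	case DEEPGRAM:
		return stubSTT{name: "deepgram"}, nil
	case GOOGLE_SPEECH_SERVICE:
		return stubSTT{name: "google"}, nil
	default:
		// An explicit default keeps a typo in the stored configuration
		// from turning into a nil transformer.
		return nil, fmt.Errorf("%w: %q", ErrUnsupportedProvider, provider)
	}
}

func main() {
	stt, _ := getSpeechToText("deepgram")
	fmt.Println(stt.Name()) // deepgram

	_, err := getSpeechToText("nope")
	fmt.Println(err) // unsupported provider: "nope"
}
```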

Supported Providers

| Provider | Identifier | Streaming | Notes |
| --- | --- | --- | --- |
| Deepgram | deepgram | | Nova-2 / Nova-3; WebSocket SDK |
| Google Cloud STT | google-speech-service | | 100+ languages |
| Azure Cognitive Services | azure-speech-service | | Neural Speech |
| Cartesia | cartesia | | Low latency |
| AssemblyAI | assemblyai | | Speaker diarization |
| Rev.ai | revai | | Real-time |
| Sarvam AI | sarvamai | | Indian languages |
| AWS Transcribe | (aws) | | Real-time streaming |
| OpenAI Whisper | (openai) | | Batch only |
| Speechmatics | (speechmatics) | | Real-time |

Reference Implementation — Deepgram

The Deepgram transformer is the reference implementation; every other provider follows the same structure.
// api/assistant-api/internal/transformer/deepgram/deepgram.go

type deepgramOption struct {
    key     string         // API key from vault credential
    logger  commons.Logger
    mdlOpts utils.Option   // assistant-level options (language, model, etc.)
}

func NewDeepgramOption(
    logger          commons.Logger,
    vaultCredential *protos.VaultCredential,
    opts            utils.Option,
) *deepgramOption {
    credentials := vaultCredential.GetValue().AsMap()
    return &deepgramOption{
        key:     credentials["key"].(string),  // vault credential key
        logger:  logger,
        mdlOpts: opts,
    }
}
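The API key is read from the vault credential map. The key name ("key") must match what is stored in the credential vault. Note that the type assertion above is unchecked, so a missing or non-string value causes a panic at construction time.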
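
A checked lookup avoids a panic when the credential map does not contain the expected entry. The helper below is a hedged sketch: `credentialString` is a hypothetical name, and it takes a plain `map[string]any`, which is the type that `AsMap()` returns for a protobuf Struct.

```go
package main

import "fmt"

// credentialString returns the non-empty string stored under key,
// or an error describing what is wrong with the entry.
func credentialString(credentials map[string]any, key string) (string, error) {
	raw, ok := credentials[key]
	if !ok {
		return "", fmt.Errorf("credential %q is missing", key)
	}
	s, ok := raw.(string)
	if !ok {
		return "", fmt.Errorf("credential %q has type %T, want string", key, raw)
	}
	if s == "" {
		return "", fmt.Errorf("credential %q is empty", key)
	}
	return s, nil
}

func main() {
	// Numbers in a protobuf Struct come back from AsMap() as float64.
	creds := map[string]any{"key": "dg-secret", "retries": float64(3)}

	for _, k := range []string{"key", "retries", "token"} {
		v, err := credentialString(creds, k)
		fmt.Printf("%s: value=%q err=%v\n", k, v, err)
	}
}
```

Using it inside a constructor would mean returning an error alongside the option struct, which changes the constructor signature shown above; whether that trade-off is worth it is a design decision for the codebase.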
// api/assistant-api/internal/transformer/deepgram/stt.go

type deepgramSTT struct {
    opt      *deepgramOption
    onPacket func(STTPacket)        // callback when transcription arrives
    client   *client.WSUsingCallback
}

func (d *deepgramSTT) Initialize() error {
    // Builds LiveTranscriptionOptions from deepgramOption.SpeechToTextOptions()
    // Creates a WebSocket connection using the Deepgram SDK
    // Registers onPacket as the transcript callback
    d.client = client.NewWSUsingCallback(...)
    return d.client.Connect()
}

func (d *deepgramSTT) Transform(ctx context.Context, in UserAudioPacket) error {
    // Streams raw audio bytes to Deepgram over the WebSocket
    return d.client.Stream(bufio.NewReader(bytes.NewReader(in.Audio)))
}

func (d *deepgramSTT) Close(ctx context.Context) error {
    // Cancels context and stops the WebSocket client
    return d.client.Stop()
}
UserAudioPacket.Audio contains raw 16-bit PCM audio sampled at 16 kHz.
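
For sizing buffers, the byte arithmetic of this format is worth spelling out. The sketch below assumes mono audio, which the text above does not state; `frameBytes` and `pcmMillis` are hypothetical helper names.

```go
package main

import "fmt"

const (
	sampleRateHz   = 16000 // 16 kHz, as stated above
	bytesPerSample = 2     // 16-bit samples
	channels       = 1     // assumption: mono; the source does not say
)

// frameBytes returns how many bytes a frame of the given length occupies.
func frameBytes(ms int) int {
	samples := sampleRateHz * ms / 1000
	return samples * bytesPerSample * channels
}

// pcmMillis returns the playback length, in milliseconds, of n bytes of audio.
func pcmMillis(n int) int {
	bytesPerSecond := sampleRateHz * bytesPerSample * channels
	return n * 1000 / bytesPerSecond
}

func main() {
	fmt.Println(frameBytes(20))   // 640
	fmt.Println(pcmMillis(640))   // 20
	fmt.Println(pcmMillis(32000)) // 1000
}
```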
// api/assistant-api/internal/transformer/deepgram/tts.go

func (d *deepgramTTS) Initialize() error {
    // Connects to: wss://api.deepgram.com/v1/speak?encoding=linear16&sample_rate=16000&model=<voice>
    conn, _, err := websocket.DefaultDialer.Dial(d.opt.GetTextToSpeechConnectionString(), headers)
    if err != nil {
        // Do not start the reader goroutine on a failed dial: the connection would be nil
        return err
    }
    d.conn = conn
    // Starts goroutine to read audio chunks and call onPacket
    go d.readAudioChunks()
    return nil
}

func (d *deepgramTTS) Transform(ctx context.Context, in LLMPacket) error {
    switch pkt := in.(type) {
    case InterruptionPacket:
        // Send {"type":"Clear"} to cancel queued audio
        return d.conn.WriteJSON(map[string]string{"type": "Clear"})
    case LLMResponseDeltaPacket:
        // Send {"type":"Speak","text":"..."} for each token
        return d.conn.WriteJSON(map[string]string{"type": "Speak", "text": pkt.Text})
    case LLMResponseDonePacket:
        // Send {"type":"Flush"} to trigger final synthesis
        return d.conn.WriteJSON(map[string]string{"type": "Flush"})
    }
    return nil
}
The onPacket callback is called with each synthesized audio chunk, which is sent directly to the client.
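
The packet-to-message mapping in Transform can be isolated as a pure function, which makes it testable without a WebSocket. This is a sketch only: the packet types below are local stand-ins whose fields are assumptions, and `controlMessage` is a hypothetical helper, not part of the repository.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local stand-ins for the packet types used in the snippet above.
type (
	LLMPacket              interface{}
	InterruptionPacket     struct{}
	LLMResponseDeltaPacket struct{ Text string }
	LLMResponseDonePacket  struct{}
)

// controlMessage returns the JSON frame for a packet.
// ok is false when the packet type produces no frame.
func controlMessage(in LLMPacket) (frame string, ok bool) {
	var msg map[string]string
	switch pkt := in.(type) {
	case InterruptionPacket:
		msg = map[string]string{"type": "Clear"} // cancel queued audio
	case LLMResponseDeltaPacket:
		msg = map[string]string{"type": "Speak", "text": pkt.Text}
	case LLMResponseDonePacket:
		msg = map[string]string{"type": "Flush"} // trigger final synthesis
	default:
		return "", false
	}
	b, err := json.Marshal(msg) // map keys are emitted in sorted order
	if err != nil {
		return "", false
	}
	return string(b), true
}

func main() {
	for _, p := range []LLMPacket{
		LLMResponseDeltaPacket{Text: "Hi"},
		LLMResponseDonePacket{},
		InterruptionPacket{},
	} {
		frame, _ := controlMessage(p)
		fmt.Println(frame)
	}
	// {"text":"Hi","type":"Speak"}
	// {"type":"Flush"}
	// {"type":"Clear"}
}
```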

Adding a New Provider

Follow these steps to add a new STT or TTS provider.
1. Create the provider directory

mkdir api/assistant-api/internal/transformer/<provider>
Create these files inside the directory:
| File | Purpose |
| --- | --- |
| <provider>.go | Option struct, credential extraction, client initialization |
| stt.go | STT implementation (omit if TTS-only) |
| tts.go | TTS implementation (omit if STT-only) |
| normalizer.go | Text normalizer for TTS (strip markdown, apply pronunciation dict) |
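
As an illustration of what normalizer.go is responsible for, here is a deliberately small sketch. It is not the repository's normalizer: `Normalize`, the regular expressions, and the lowercase-keyed pronunciation map are all assumptions, and real markdown handling has more cases (for example, this version also strips underscores inside identifiers).

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// [label](url) becomes label
	mdLink = regexp.MustCompile(`\[([^\]]*)\]\([^)]*\)`)
	// emphasis markers, code ticks, strikethrough, and heading hashes
	mdMarkers = regexp.MustCompile("[*_`~#]+")
)

// Normalize strips common markdown and replaces whole words found in the
// pronunciation dictionary (keys are lowercase). Whitespace is collapsed.
func Normalize(text string, pronunciations map[string]string) string {
	text = mdLink.ReplaceAllString(text, "$1")
	text = mdMarkers.ReplaceAllString(text, "")

	words := strings.Fields(text)
	for i, w := range words {
		// Look the word up without trailing punctuation, then put it back.
		core := strings.TrimRight(w, ".,!?;:")
		if spoken, ok := pronunciations[strings.ToLower(core)]; ok {
			words[i] = spoken + w[len(core):]
		}
	}
	return strings.Join(words, " ")
}

func main() {
	dict := map[string]string{"api": "A P I", "sql": "sequel"}
	in := "**Call** the [API](https://example.test) via `SQL`."
	fmt.Println(Normalize(in, dict)) // Call the A P I via sequel.
}
```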

2. Implement the Option struct

// <provider>.go
type myProviderOption struct {
    apiKey  string
    logger  commons.Logger
    opts    utils.Option
}

func NewMyProviderOption(
    logger          commons.Logger,
    vaultCredential *protos.VaultCredential,
    opts            utils.Option,
) *myProviderOption {
    credentials := vaultCredential.GetValue().AsMap()
    return &myProviderOption{
        apiKey: credentials["key"].(string),
        logger: logger,
        opts:   opts,
    }
}

3. Implement SpeechToTextTransformer (stt.go)

// stt.go
type myProviderSTT struct {
    opt      *myProviderOption
    onPacket func(STTPacket)
    // provider-specific client
}

func NewMyProviderSTT(opt *myProviderOption, onPacket func(STTPacket)) *myProviderSTT {
    return &myProviderSTT{opt: opt, onPacket: onPacket}
}

func (s *myProviderSTT) Initialize() error {
    // Connect to provider STT endpoint
    // Register s.onPacket to be called with transcript results
    return nil
}

func (s *myProviderSTT) Transform(ctx context.Context, in UserAudioPacket) error {
    // Send in.Audio (raw PCM 16-bit 16kHz bytes) to the provider
    return nil
}

func (s *myProviderSTT) Close(ctx context.Context) error {
    // Close connection, cancel context
    return nil
}
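
The factory registration in the next step also calls NewMyProviderTTS, so a tts.go along the same lines is needed for providers that support synthesis. The skeleton below is a sketch under assumptions: the packet types and `AudioPacket` are local stand-ins with guessed fields, and "synthesis" is faked by echoing the buffered text as bytes so the example runs without a network connection.

```go
package main

import (
	"context"
	"fmt"
	"strings"
)

// Stand-ins for types the real package already defines; fields are assumptions.
type (
	LLMPacket              interface{}
	InterruptionPacket     struct{}
	LLMResponseDeltaPacket struct{ Text string }
	LLMResponseDonePacket  struct{}
	AudioPacket            struct{ Audio []byte }
	myProviderOption       struct{ apiKey string }
)

type myProviderTTS struct {
	opt      *myProviderOption
	onPacket func(AudioPacket)
	pending  []string // text received since the last flush
}

func NewMyProviderTTS(opt *myProviderOption, onPacket func(AudioPacket)) *myProviderTTS {
	return &myProviderTTS{opt: opt, onPacket: onPacket}
}

// Initialize would connect to the provider's TTS endpoint.
func (t *myProviderTTS) Initialize() error { return nil }

func (t *myProviderTTS) Transform(ctx context.Context, in LLMPacket) error {
	switch pkt := in.(type) {
	case InterruptionPacket:
		t.pending = nil // user barged in: drop anything not yet spoken
	case LLMResponseDeltaPacket:
		t.pending = append(t.pending, pkt.Text)
	case LLMResponseDonePacket:
		// A real provider would synthesize here; this sketch echoes the text.
		t.onPacket(AudioPacket{Audio: []byte(strings.Join(t.pending, ""))})
		t.pending = nil
	}
	return nil
}

// Close would tear down the provider connection.
func (t *myProviderTTS) Close(ctx context.Context) error { return nil }

// demo feeds an interrupted response followed by a complete one.
func demo() []string {
	var spoken []string
	tts := NewMyProviderTTS(&myProviderOption{apiKey: "test"}, func(p AudioPacket) {
		spoken = append(spoken, string(p.Audio))
	})
	ctx := context.Background()
	_ = tts.Initialize()
	for _, p := range []LLMPacket{
		LLMResponseDeltaPacket{Text: "discard me"},
		InterruptionPacket{},
		LLMResponseDeltaPacket{Text: "Hel"},
		LLMResponseDeltaPacket{Text: "lo"},
		LLMResponseDonePacket{},
	} {
		_ = tts.Transform(ctx, p)
	}
	_ = tts.Close(ctx)
	return spoken
}

func main() {
	fmt.Println(demo()) // [Hello]
}
```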

4. Register in the factory (transformer.go)

Add a case to both factory functions in api/assistant-api/internal/transformer/transformer.go:
// Add constant
const MY_PROVIDER AudioTransformer = "my-provider"

// In GetSpeechToTextTransformer:
case MY_PROVIDER:
    opt := myprovider.NewMyProviderOption(logger, credential, opts)
    return myprovider.NewMyProviderSTT(opt, onPacket), nil

// In GetTextToSpeechTransformer:
case MY_PROVIDER:
    opt := myprovider.NewMyProviderOption(logger, credential, opts)
    return myprovider.NewMyProviderTTS(opt, onPacket), nil
No changes to any other service are needed.

5. Add the credential to the vault

In the dashboard under Settings → Integrations, add the provider API key. The credential map key must match what your option struct reads from vaultCredential.GetValue().AsMap().

Next Steps