The assistant-api decouples audio transcription from provider-specific logic through a transformer layer. Every speech-to-text (STT) provider implements the same generic interface, and a factory resolves the provider string to a concrete implementation at call time.

Transformer Interface

Every STT provider implements Transformers[UserAudioPacket]:
// api/assistant-api/internal/type/transformer.go
type Transformers[IN any] interface {
    // Initialize sets up the provider connection (WebSocket, gRPC, HTTP client).
    // Called once per call session before audio begins.
    Initialize() error

    // Transform sends one audio packet to the provider.
    // Transcription results are delivered via the onPacket callback registered at construction.
    Transform(ctx context.Context, in IN) error

    // Close tears down the connection and releases resources.
    Close(ctx context.Context) error
}

// Type alias for STT
type SpeechToTextTransformer = Transformers[UserAudioPacket]
UserAudioPacket.Audio contains raw 16-bit mono PCM audio sampled at 16 kHz. All providers receive this same format; resampling from 8 kHz telephony audio is handled upstream.
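To make the lifecycle concrete, here is a minimal sketch of a type that satisfies the interface. It is not one of the real providers: `durationTransformer`, `newDurationTransformer`, `runDemo`, and the fields of `UserAudioPacket` and `STTPacket` are assumptions made for illustration. Instead of transcribing, it reports how much audio it has received, using the fact that 16-bit mono PCM at 16 kHz is 32,000 bytes per second.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Simplified stand-ins for the real packet types; the fields are assumptions.
type UserAudioPacket struct {
	Audio []byte // raw 16-bit mono PCM at 16 kHz
}

type STTPacket struct {
	Text string
}

// Transformers mirrors the interface shown above.
type Transformers[IN any] interface {
	Initialize() error
	Transform(ctx context.Context, in IN) error
	Close(ctx context.Context) error
}

type SpeechToTextTransformer = Transformers[UserAudioPacket]

// 16,000 samples per second * 2 bytes per sample * 1 channel.
const bytesPerSecond = 16000 * 2

// durationTransformer is a hypothetical provider that performs no transcription.
// It reports the total duration of audio received so far via onPacket.
type durationTransformer struct {
	onPacket func(STTPacket)
	ready    bool
	received int
}

func newDurationTransformer(onPacket func(STTPacket)) *durationTransformer {
	return &durationTransformer{onPacket: onPacket}
}

// Initialize is where a real provider would open its WebSocket, gRPC, or HTTP connection.
func (t *durationTransformer) Initialize() error {
	t.ready = true
	return nil
}

// Transform accepts one audio packet and delivers a result through the callback.
func (t *durationTransformer) Transform(ctx context.Context, in UserAudioPacket) error {
	if !t.ready {
		return errors.New("transformer not initialized")
	}
	if err := ctx.Err(); err != nil {
		return err
	}
	t.received += len(in.Audio)
	d := time.Duration(t.received) * time.Second / bytesPerSecond
	t.onPacket(STTPacket{Text: fmt.Sprintf("received %s of audio", d)})
	return nil
}

// Close marks the transformer unusable; a real provider would release its connection.
func (t *durationTransformer) Close(ctx context.Context) error {
	t.ready = false
	return nil
}

// Compile-time check that the sketch satisfies the STT alias.
var _ SpeechToTextTransformer = (*durationTransformer)(nil)

// runDemo walks through Initialize, three Transform calls, and Close.
func runDemo() ([]string, error) {
	var results []string
	var stt SpeechToTextTransformer = newDurationTransformer(func(p STTPacket) {
		results = append(results, p.Text)
	})
	if err := stt.Initialize(); err != nil {
		return nil, err
	}
	ctx := context.Background()
	defer stt.Close(ctx)

	frame := UserAudioPacket{Audio: make([]byte, 640)} // 640 bytes = 20 ms of audio
	for i := 0; i < 3; i++ {
		if err := stt.Transform(ctx, frame); err != nil {
			return nil, err
		}
	}
	return results, nil
}

func main() {
	results, err := runDemo()
	if err != nil {
		panic(err)
	}
	for _, r := range results {
		fmt.Println(r)
	}
}
```

The callback is passed to the constructor rather than to Transform, which matches the interface comment above: Transform only returns an error, and results arrive through onPacket.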

Factory Function

// api/assistant-api/internal/transformer/transformer.go
func GetSpeechToTextTransformer(
    ctx        context.Context,
    logger     commons.Logger,
    provider   string,                     // AudioTransformer constant string
    credential *protos.VaultCredential,    // decrypted vault credential
    onPacket   func(STTPacket),            // callback invoked with each transcript result
    opts       utils.Option,               // assistant-level options (language, model, etc.)
) (SpeechToTextTransformer, error)
The factory switches on the provider string. To add a new provider, add a case to this switch.

Provider Identifiers

// api/assistant-api/internal/transformer/transformer.go
const (
    DEEPGRAM              AudioTransformer = "deepgram"
    GOOGLE_SPEECH_SERVICE AudioTransformer = "google-speech-service"
    AZURE_SPEECH_SERVICE  AudioTransformer = "azure-speech-service"
    CARTESIA              AudioTransformer = "cartesia"
    REVAI                 AudioTransformer = "revai"
    SARVAM                AudioTransformer = "sarvamai"
    ELEVENLABS            AudioTransformer = "elevenlabs"
    ASSEMBLYAI            AudioTransformer = "assemblyai"
)

Supported STT Providers

| Provider | Identifier | Streaming | Notes |
| --- | --- | --- | --- |
| Deepgram | deepgram | | Nova-2 / Nova-3, WebSocket SDK |
| Google Cloud STT | google-speech-service | | 100+ languages |
| Azure Cognitive | azure-speech-service | | Neural Speech, 140+ languages |
| AssemblyAI | assemblyai | | Speaker diarization |
| Rev.ai | revai | | Real-time |
| Sarvam AI | sarvamai | | Indian languages |
| AWS Transcribe | aws | | Real-time streaming |
| OpenAI Whisper | openai | | Batch only |
| Speechmatics | speechmatics | | Real-time |

Provider Pages