assistant-api decouples audio transcription from provider-specific logic through a transformer layer. Every STT provider implements the same generic interface. The factory resolves the provider string at call time.
Transformer Interface
Every STT provider implements Transformers[UserAudioPacket].
UserAudioPacket.Audio contains raw PCM bytes: 16-bit, mono, 16 kHz. Every provider receives this same format; resampling from 8 kHz telephony audio is handled upstream.
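As a rough illustration of the contract described above, the following Go sketch shows one way a generic transformer interface and the audio packet could fit together. Only the names Transformers, UserAudioPacket, and Audio come from this page. The method set (Transform, Close), the []byte type of Audio, the Duration helper, and the byteCounter type are assumptions made for the example, not the actual assistant-api definitions.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Audio format stated in the text: PCM, 16-bit, mono, 16 kHz.
const (
	sampleRate     = 16000 // samples per second
	bytesPerSample = 2     // 16-bit samples, one channel
)

// UserAudioPacket carries one chunk of caller audio.
// Assumption: Audio is a []byte; the text only names the field.
type UserAudioPacket struct {
	Audio []byte
}

// Duration is a hypothetical helper that derives the playback length
// of the packet from its byte count and the fixed audio format.
func (p UserAudioPacket) Duration() time.Duration {
	samples := len(p.Audio) / bytesPerSample
	return time.Duration(samples) * time.Second / sampleRate
}

// Transformers is an assumed shape for the generic interface.
// The real method set may differ.
type Transformers[T any] interface {
	Transform(ctx context.Context, packet T) error
	Close(ctx context.Context) error
}

// byteCounter is a toy implementation standing in for a real provider.
// It only counts the bytes it receives.
type byteCounter struct {
	received int
}

func (b *byteCounter) Transform(_ context.Context, p UserAudioPacket) error {
	b.received += len(p.Audio)
	return nil
}

func (b *byteCounter) Close(context.Context) error { return nil }

// Compile-time check that byteCounter satisfies the interface.
var _ Transformers[UserAudioPacket] = (*byteCounter)(nil)

func main() {
	ctx := context.Background()
	var stt Transformers[UserAudioPacket] = &byteCounter{}
	defer stt.Close(ctx)

	// 640 bytes = 320 samples = 20 ms at 16 kHz.
	packet := UserAudioPacket{Audio: make([]byte, 640)}
	if err := stt.Transform(ctx, packet); err != nil {
		fmt.Println("transform failed:", err)
		return
	}
	fmt.Println(packet.Duration()) // prints "20ms"
}
```

Because the format is fixed, one second of audio is always 32,000 bytes, so a provider implementation can size buffers or frames from the byte count alone.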
Factory Function
The factory switches on the provider string. To add a new provider, add a case to this switch.
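A minimal sketch of such a factory is shown below, assuming a simplified one-method interface. The function name NewSTTTransformer, the STTTransformer interface, the stubTransformer type, and ErrUnknownProvider are all hypothetical names introduced for illustration. Only the identifier strings are taken from the provider table on this page.

```go
package main

import (
	"errors"
	"fmt"
)

// STTTransformer stands in for the shared provider interface.
// Assumption: reduced to a single method to keep the sketch short.
type STTTransformer interface {
	Name() string
}

// stubTransformer is a placeholder for a real provider client.
type stubTransformer struct {
	name string
}

func (s stubTransformer) Name() string { return s.name }

// ErrUnknownProvider is returned when no case matches the identifier.
var ErrUnknownProvider = errors.New("unknown STT provider")

// NewSTTTransformer is a hypothetical factory that resolves the
// provider string at call time.
func NewSTTTransformer(provider string) (STTTransformer, error) {
	switch provider {
	case "deepgram":
		return stubTransformer{name: "Deepgram"}, nil
	case "google-speech-service":
		return stubTransformer{name: "Google Cloud STT"}, nil
	case "azure-speech-service":
		return stubTransformer{name: "Azure Cognitive"}, nil
	case "assemblyai":
		return stubTransformer{name: "AssemblyAI"}, nil
	case "revai":
		return stubTransformer{name: "Rev.ai"}, nil
	case "sarvamai":
		return stubTransformer{name: "Sarvam AI"}, nil
	case "aws":
		return stubTransformer{name: "AWS Transcribe"}, nil
	case "openai":
		return stubTransformer{name: "OpenAI Whisper"}, nil
	case "speechmatics":
		return stubTransformer{name: "Speechmatics"}, nil
	// A new provider would get its own case here.
	default:
		return nil, fmt.Errorf("%w: %q", ErrUnknownProvider, provider)
	}
}

func main() {
	stt, err := NewSTTTransformer("deepgram")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(stt.Name()) // prints "Deepgram"
}
```

Returning a wrapped sentinel error from the default case lets callers distinguish a misconfigured identifier from a provider that failed to initialize.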
Provider Identifiers
Supported STT Providers
| Provider | Identifier | Streaming | Notes |
|---|---|---|---|
| Deepgram | deepgram | ✅ | Nova-2 / Nova-3, WebSocket SDK |
| Google Cloud STT | google-speech-service | ✅ | 100+ languages |
| Azure Cognitive | azure-speech-service | ✅ | Neural Speech, 140+ languages |
| AssemblyAI | assemblyai | ✅ | Speaker diarization |
| Rev.ai | revai | ✅ | Real-time |
| Sarvam AI | sarvamai | ✅ | Indian languages |
| AWS Transcribe | aws | ✅ | Real-time streaming |
| OpenAI Whisper | openai | ❌ | Batch only |
| Speechmatics | speechmatics | ✅ | Real-time |