Audio
Browse audio models and compare pricing across providers.
Whisper Large V3
OpenAI: Open-weight speech recognition supporting 50+ languages. Handles accents, noise, and technical language.
Sound Effect V2
ElevenLabs: AI sound effect generation from text descriptions.
Speech to Text
ElevenLabs: Speech recognition and transcription service.
Flash V2.5
ElevenLabs: Ultra-low latency TTS at 75ms TTFA. Best for real-time conversational voice agents.
Speech 2.6
MiniMax: Text-to-speech with HD and turbo variants. Supports voice cloning.
Multilingual V2
ElevenLabs: Ultra-realistic narration in 70+ languages with thousands of voice presets.
GPT-4o Transcribe
OpenAI: OpenAI's latest transcription model, with lower error rates than Whisper. Recommended over Whisper for API use.
TTS-1
OpenAI: Fast text-to-speech model optimized for real-time use.
TTS-1 HD
OpenAI: Higher-quality TTS with improved naturalness and pronunciation accuracy.
Aura 2
Deepgram: Low-latency TTS with an optimized TTFB of 90ms, built for production voice agents.
Chirp 2
Google: Latest speech recognition model with improved accuracy across 100+ languages.
Cloud TTS
Google: Text-to-speech service supporting 75+ languages with WaveNet and Neural2 voices.
Fish Speech S2
Fish Audio: Multilingual TTS supporting 80+ languages with voice cloning capabilities.
Nova 2
Deepgram: Real-time STT specialist with sub-300ms latency, a streaming WebSocket API, and domain-specific vocabulary.
Qwen 3 TTS
Alibaba: Qwen 3 text-to-speech model with voice cloning support.
Slam-1
AssemblyAI: Speech-language model with multilingual streaming, safety guardrails, and LLM gateway integration.
Sonic
Cartesia: Fastest production TTS at ~40ms TTFA. 15 languages, ~130 voices. One fifth the cost of ElevenLabs.
Universal-2
AssemblyAI: Benchmark-leading accuracy at ~8.4% WER with 30% fewer hallucinations than Whisper. Full audio intelligence suite.
Voxtral TTS
Mistral AI: Open-weight 4B TTS model. 9 languages, ~90ms TTFA, voice cloning from a 3s reference. Licensed under CC BY-NC 4.0.
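
Several entries in this list compare speech recognition models by error rate, and the Universal-2 entry quotes a WER (word error rate) figure. As a rough illustration of what that metric measures, the sketch below computes WER the conventional way: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a model's output, divided by the number of reference words. The function name `word_error_rate` and the simple lowercase-and-split normalization are assumptions made for this example. Published benchmarks typically apply their own text normalization, so this is not a reproduction of how any provider's figure was obtained.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Normalization here is only lowercasing and whitespace splitting
    (an assumption for this sketch; real benchmarks normalize more).
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "the quick brown fox jumps over the lazy dog"
    hypothesis = "the quick brown fox jumped over lazy dog"
    print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

In the usage example, the hypothesis has one substituted word and one dropped word against a nine-word reference, so the result is 2/9. Because insertions count as errors, WER can exceed 100% when a model outputs many extra words.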