Audio

Browse audio models and compare pricing across providers.

19 models

Whisper Large V3

OpenAI
STT

Open-weight speech recognition supporting 50+ languages. Handles accents, noise, and technical language.

1.5B Open
5 providers
$0.000278 /req
from fal.ai

Sound Effect V2

ElevenLabs
Audio Gen

AI sound effect generation from text descriptions.

3 providers
$0.0012 /req
from KIE AI

Speech to Text

ElevenLabs
STT

ElevenLabs speech recognition and transcription service.

3 providers
$0.018 /req
from KIE AI

Flash V2.5

ElevenLabs
TTS

Ultra-low-latency TTS with a 75ms time to first audio (TTFA). Best for real-time conversational voice agents.

4 providers
$0.030 /req
from KIE AI

Speech 2.6

MiniMax
TTS

MiniMax text-to-speech with HD and turbo variants. Supports voice cloning.

4 providers
$0.060 /req
from Replicate

Multilingual V2

ElevenLabs
TTS

Ultra-realistic narration in 70+ languages with thousands of voice presets.

2 providers
$0.100 /req
from fal.ai

GPT-4o Transcribe

OpenAI
STT

OpenAI's latest transcription model, with lower error rates than Whisper. Recommended over Whisper for API use.

1 provider
$6.00 /MTok input
from OpenAI

TTS-1

OpenAI
TTS

OpenAI's fast text-to-speech model optimized for real-time use.

1 provider
$15.00 /MTok input
from OpenAI

TTS-1 HD

OpenAI
TTS

Higher quality TTS with improved naturalness and pronunciation accuracy.

1 provider
$30.00 /MTok input
from OpenAI

Aura 2

Deepgram
TTS

Low-latency TTS with an optimized 90ms TTFB, built for production voice agents.

No providers yet

Chirp 2

Google
STT

Google's latest speech recognition model with improved accuracy across 100+ languages.

1 provider

Cloud TTS

Google
TTS

Google's text-to-speech service supporting 75+ languages with WaveNet and Neural2 voices.

1 provider

Fish Speech S2

Fish Audio
TTS

Multilingual TTS supporting 80+ languages with voice cloning capabilities.

No providers yet

Nova 2

Deepgram
STT

Real-time STT specialist with sub-300ms latency, streaming WebSocket API, and domain-specific vocabulary.

No providers yet

Qwen 3 TTS

Alibaba
TTS

Qwen 3 text-to-speech model with voice cloning support.

Open
2 providers

Slam-1

AssemblyAI
STT

Speech-language model with multilingual streaming, safety guardrails, and LLM gateway integration.

No providers yet

Sonic

Cartesia
TTS

Fastest production TTS at ~40ms TTFA. 15 languages, ~130 voices. One fifth the cost of ElevenLabs.

No providers yet

Universal-2

AssemblyAI
STT

Benchmark-leading accuracy at ~8.4% WER with 30% fewer hallucinations than Whisper. Full audio intelligence suite.

No providers yet

Voxtral TTS

Mistral AI
TTS

Open-weight 4B TTS model. 9 languages, ~90ms TTFA, voice cloning from a 3s reference. Licensed under CC BY-NC 4.0.

4B Open
No providers yet
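
The listed prices use two billing units (per request and per million input tokens), so they are only directly comparable within a unit. A minimal sketch of that comparison, using the starting prices shown above; the tuple layout and the `rank_by_unit` helper are illustrative assumptions, not part of any provider API:

```python
# Starting prices copied from the listing above.
# Each tuple is (model, price in USD, billing unit); this layout is illustrative.
LISTED_PRICES = [
    ("Whisper Large V3", 0.000278, "req"),
    ("Sound Effect V2", 0.0012, "req"),
    ("Speech to Text", 0.018, "req"),
    ("Flash V2.5", 0.030, "req"),
    ("Speech 2.6", 0.060, "req"),
    ("Multilingual V2", 0.100, "req"),
    ("GPT-4o Transcribe", 6.00, "MTok input"),
    ("TTS-1", 15.00, "MTok input"),
    ("TTS-1 HD", 30.00, "MTok input"),
]


def rank_by_unit(prices):
    """Group prices by billing unit and sort each group from cheapest up."""
    groups = {}
    for model, usd, unit in prices:
        groups.setdefault(unit, []).append((usd, model))
    return {unit: sorted(items) for unit, items in groups.items()}


ranked = rank_by_unit(LISTED_PRICES)
for unit, items in ranked.items():
    print(f"per {unit}:")
    for usd, model in items:
        print(f"  {model}: ${usd:g}")
```

Per-request and per-token prices are kept in separate groups on purpose: converting between them would require assumptions about audio length or token counts that the listing does not provide.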