Audio

from Cloudflare Workers AI

Open-weight speech recognition supporting 50+ languages. Handles accents, noise, and technical language.

1.5B Open

6 providers

$0.000008 /sec

Speech to Text

ElevenLabs speech recognition and transcription service.

3 providers

$0.000061 /sec

from ElevenLabs
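The two per-second rates listed above are hard to compare at a glance, so here is a minimal sketch that converts them to a per-hour figure. It assumes billing is strictly linear in audio duration, with no minimum charge or rounding (the listing does not say either way). The helper name `cost_for_duration` and the dictionary labels are illustrative only, not part of any provider's API.

```python
# Convert the listed per-second transcription prices to a per-hour cost.
# Assumption: cost scales linearly with audio length (no minimums or rounding).

SECONDS_PER_HOUR = 3600

# Rates copied from the listing above, in USD per second of audio.
RATES_PER_SECOND = {
    "open-weight speech recognition (from Cloudflare Workers AI)": 0.000008,
    "ElevenLabs Speech to Text": 0.000061,
}


def cost_for_duration(rate_per_second: float, seconds: float) -> float:
    """Return the USD cost of processing `seconds` of audio at a flat rate."""
    return rate_per_second * seconds


for name, rate in RATES_PER_SECOND.items():
    hourly = cost_for_duration(rate, SECONDS_PER_HOUR)
    print(f"{name}: ${hourly:.4f} per hour of audio")
```

Under that assumption, one hour of audio comes to roughly $0.03 at the lower rate and roughly $0.22 at the higher one.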

Sound Effect V2

Audio Gen

AI sound effect generation from text descriptions.

3 providers

$0.0012 /req

from KIE AI

Flash V2.5

Ultra-low-latency TTS with 75ms TTFA. Best for real-time conversational voice agents.

4 providers

$0.030 /req

from KIE AI

Speech 2.6

MiniMax

MiniMax text-to-speech with HD and turbo variants. Supports voice cloning.

4 providers

$0.060 /req

from Replicate

Multilingual V2

Ultra-realistic narration in 70+ languages with thousands of voice presets.

2 providers

$0.100 /req

from fal.ai

GPT-4o Transcribe

Latest OpenAI transcription with lower error rates than Whisper. Recommended over Whisper for API use.

$6.00 /MTok input

from OpenAI

TTS-1

OpenAI's fast text-to-speech model optimized for real-time use.

$15.00 /MTok input

from OpenAI

TTS-1 HD

Higher quality TTS with improved naturalness and pronunciation accuracy.

$30.00 /MTok input

from OpenAI

Aura 2

Deepgram

Low-latency TTS with an optimized 90ms TTFB, built for production voice agents.

Chirp 2

Google

Google's latest speech recognition model with improved accuracy across 100+ languages.

Cloud TTS

Google

Google's text-to-speech service supporting 75+ languages with WaveNet and Neural2 voices.

Fish Speech S2

Fish Audio

Multilingual TTS supporting 80+ languages with voice cloning capabilities.

Nova 2

Deepgram

Real-time STT specialist with sub-300ms latency, streaming WebSocket API, and domain-specific vocabulary.

Qwen 3 TTS

Alibaba

Qwen 3 text-to-speech model with voice cloning support.

Open

2 providers

Slam-1

AssemblyAI

Speech-language model with multilingual streaming, safety guardrails, and LLM gateway integration.

Sonic

Cartesia

Fastest production TTS at ~40ms TTFA. 15 languages, ~130 voices. One-fifth the cost of ElevenLabs.

Universal-2

AssemblyAI

Benchmark-leading accuracy at ~8.4% WER with 30% fewer hallucinations than Whisper. Full audio intelligence suite.

Voxtral TTS

Mistral AI

Open-weight 4B TTS model. 9 languages, ~90ms TTFA, voice cloning from a 3s reference. CC BY-NC 4.0.

4B Open