Whisper v3 and Beyond: ASR in 2025
State-of-the-art speech recognition with real-time streaming and 100+ languages.
Vijayakumar S
Jun 1, 2025 · 12 min read
The ASR Revolution
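By 2025, leading ASR systems approach human-level word error rates in many widely spoken languages, at least on clean benchmark audio. OpenAI's Whisper large-v3, Meta's SeamlessM4T v2, and NVIDIA's Canary lead the pack.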
Whisper v3: What's New
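- Same model size as large-v2, so raw speed is similar; faster inference comes from optimized runtimes such as faster-whisper or the smaller large-v3-turbo checkpoint
- Near-real-time streaming through community servers such as WhisperLive; latency depends on chunk size and hardware
- About 100 languages supported, with Cantonese added in v3
- Punctuation and capitalization out-of-the-box
- Speaker diarization for multi-speaker audio when paired with a separate model such as pyannote.audio; Whisper itself does not label speakers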
Architecture Improvements
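Whisper large-v3 is an encoder-decoder Transformer with roughly 1.55B parameters. It keeps the architecture of earlier releases, with a few details worth knowing:
- Standard Transformer blocks rather than Conformer blocks
- 128-bin log-Mel spectrogram input, up from 80 bins in large-v2
- Fixed 30-second input windows with absolute position embeddings; longer audio is processed window by window
- Timestamps predicted as special tokens at segment boundaries, with word-level timings derived from cross-attention alignment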
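To make the input side of the model concrete, the sketch below works out the tensor shapes for one Whisper window. The constants mirror those in the open-source `whisper.audio` module (16 kHz audio, a 10 ms hop, 30-second windows); the 128-bin Mel count applies to large-v3, and the helper name `window_shape` is made up for illustration.

```python
# Front-end constants as used by the open-source Whisper code.
SAMPLE_RATE = 16_000      # samples per second
HOP_LENGTH = 160          # samples between spectrogram frames (10 ms)
CHUNK_LENGTH = 30         # seconds of audio per model window
N_MELS_LARGE_V3 = 128     # Mel bins for large-v3 (earlier models use 80)


def window_shape(n_mels: int = N_MELS_LARGE_V3) -> tuple[int, int]:
    """Return (mel_bins, frames) for one 30-second window (hypothetical helper)."""
    samples = SAMPLE_RATE * CHUNK_LENGTH
    frames = samples // HOP_LENGTH
    return n_mels, frames


mel_bins, frames = window_shape()
# The encoder's second convolution has stride 2, halving the time axis.
encoder_positions = frames // 2
print(mel_bins, frames, encoder_positions)
```

Audio longer than one window is handled by running the model over successive 30-second windows.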
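```python
import whisper

# Load the large-v3 checkpoint (downloaded on first use).
model = whisper.load_model("large-v3")

result = model.transcribe(
    "meeting.wav",
    language="en",
    task="transcribe",
    word_timestamps=True,  # adds per-word timings to each segment
    # openai-whisper has no vad_filter argument; that option belongs to faster-whisper.
)

# Print each segment with its start and end time in seconds.
for segment in result["segments"]:
    print(f"{segment['start']:.1f}s - {segment['end']:.1f}s: {segment['text']}")
```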
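When `word_timestamps=True` is set, each entry in `result["segments"]` also carries a `words` list with per-word timings. The sketch below flattens that structure; `iter_words` is a hypothetical helper, and the sample dictionary is hand-written to mimic the shape of a transcription result rather than real model output.

```python
from typing import Iterator


def iter_words(result: dict) -> Iterator[tuple[str, float, float]]:
    """Yield (word, start, end) for every word in a Whisper-style result."""
    for segment in result.get("segments", []):
        # "words" is only present when word-level timestamps were requested.
        for word in segment.get("words", []):
            yield word["word"].strip(), word["start"], word["end"]


# Hand-written stand-in for model.transcribe(...) output, not real data.
sample_result = {
    "segments": [
        {
            "start": 0.0,
            "end": 1.2,
            "text": " Hello everyone",
            "words": [
                {"word": " Hello", "start": 0.0, "end": 0.5},
                {"word": " everyone", "start": 0.5, "end": 1.2},
            ],
        }
    ]
}

for word, start, end in iter_words(sample_result):
    print(f"{start:5.2f}s - {end:5.2f}s  {word}")
```

Per-word timings like these are what subtitle alignment and click-to-seek transcripts are usually built on.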
Real-time Streaming
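Whisper has no built-in streaming mode, so live applications usually run it behind a server that transcribes short audio chunks as they arrive. WhisperLive is one open-source project that takes this approach: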
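```python
# Requires a WhisperLive server already listening on localhost:9090.
# Client interface as described in the WhisperLive README; argument names
# can change between releases, so check them against your installed version.
from whisper_live.client import TranscriptionClient

client = TranscriptionClient(
    "localhost",
    9090,
    lang="en",
    translate=False,
    model="large-v3",
    use_vad=True,  # voice activity detection to skip silent stretches
)

# Calling the client with no arguments streams from the microphone and
# prints transcripts as they arrive; pass a file path to stream a recording.
client()
```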
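Streaming pipelines generally cut the incoming audio into short fixed-length chunks before handing them to the model. The following is a minimal sketch of that step, assuming 16 kHz mono samples held in a list; `chunk_audio` is an illustrative helper and is not part of Whisper or WhisperLive.

```python
from typing import Iterator, Sequence


def chunk_audio(
    samples: Sequence[float],
    sample_rate: int = 16_000,
    chunk_seconds: float = 2.0,
) -> Iterator[Sequence[float]]:
    """Yield consecutive chunks of at most chunk_seconds of audio."""
    chunk_len = int(sample_rate * chunk_seconds)
    if chunk_len <= 0:
        raise ValueError("chunk_seconds and sample_rate must be positive")
    for start in range(0, len(samples), chunk_len):
        # The final chunk may be shorter than chunk_len.
        yield samples[start:start + chunk_len]


# Five seconds of silence stands in for microphone input.
audio = [0.0] * (16_000 * 5)
chunk_lengths = [len(chunk) for chunk in chunk_audio(audio)]
print(chunk_lengths)
```

Text for a chunk cannot appear before the chunk has been fully captured, so the chunk length sets a floor on end-to-end latency. Shorter chunks lower that floor but give the model less context to work with.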
Accuracy Benchmarks
| Language | WER (Whisper v2) | WER (Whisper v3) | Human |
|----------|------------------|------------------|-------|
| English | 4.2% | 2.8% | 2.5% |
| Mandarin | 8.5% | 5.2% | 4.8% |
| Spanish | 5.8% | 3.5% | 3.2% |
| Arabic | 12.3% | 7.8% | 7.5% |
| Hindi | 15.2% | 9.5% | 9.0% |
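Word error rate (WER), the metric in the table above, is the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. A minimal sketch, assuming whitespace tokenization and no text normalization (published benchmarks usually normalize casing and punctuation first):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only one previous row.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)


wer = word_error_rate(
    "the quick brown fox jumps",
    "the quick brown box jumps over",
)
print(f"WER: {wer:.1%}")
```

Here one substitution ("fox" to "box") and one insertion ("over") against five reference words give a WER of 40%. Because insertions count as errors, WER can exceed 100% on very poor transcripts.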
SeamlessM4T v2: Unified Speech+Text
Meta's SeamlessM4T v2 covers several speech and text tasks in a single model:
- Speech-to-text (100+ languages)
- Speech-to-speech (100+ language pairs)
- Text-to-speech
- Text-to-text translation
Canary: NVIDIA's ASR Champion
Canary is optimized for NVIDIA hardware and reportedly reaches 8x real-time on an H100, meaning one minute of audio is transcribed in about 7.5 seconds.
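An "N x real-time" figure converts directly into wall-clock processing time. The small sketch below does that arithmetic, taking the 8x figure quoted above at face value; `processing_seconds` is a hypothetical helper, and actual throughput depends on batch size, precision, and audio length.

```python
def processing_seconds(audio_seconds: float, speed_factor: float) -> float:
    """Time needed to transcribe audio at `speed_factor` times real-time."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_seconds / speed_factor


one_hour = 60 * 60
seconds_needed = processing_seconds(one_hour, speed_factor=8)
print(f"{seconds_needed / 60:.1f} minutes for one hour of audio")
```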
Applications in 2025
- Meeting transcription: built into Zoom, Teams, and Meet
- Live captioning: TV, events, lectures
- Voice assistants: On-device processing
- Call centers: Real-time agent assistance
Vijayakumar S
AI Engineer · ML Enthusiast
Passionate about building intelligent systems, speech synthesis, and LLM applications. Writing about the tools and ideas shaping the next decade of software.