Whisper v3 and Beyond: ASR in 2025

State-of-the-art speech recognition with real-time streaming and 100+ languages.

Vijayakumar S

Jun 1, 2025 · 12 min read

The ASR Revolution

By 2025, leading ASR systems approach human-level accuracy on clean audio in many well-resourced languages. Whisper large-v3, SeamlessM4T v2, and NVIDIA's Canary are among the leading open models.

Whisper v3: What's New

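Same model size as large-v2, with a 10 to 20% reduction in errors reported by OpenAI across many languages
Faster decoding through the large-v3-turbo variant, which cuts the decoder from 32 layers to 4
Around 100 languages supported, including a new Cantonese language token
Punctuation and capitalization out-of-the-box
No native streaming or speaker diarization; both are added by surrounding tooling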
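
For multi-speaker audio, a common pattern is to run a separate diarization model and merge its speaker turns with the transcript segments by time overlap. The sketch below shows only that merge step. The `turns` list, the speaker labels, and the function names are illustrative assumptions, not output from any particular diarization library.

```python
from typing import Dict, List, Tuple

# Hypothetical speaker turns from a separate diarization model: (start, end, speaker).
Turn = Tuple[float, float, str]


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments: List[Dict], turns: List[Turn]) -> List[Dict]:
    """Label each transcript segment with the speaker whose turn overlaps it most."""
    labeled = []
    for seg in segments:
        best_speaker, best_overlap = "unknown", 0.0
        for start, end, speaker in turns:
            shared = overlap(seg["start"], seg["end"], start, end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labeled.append({**seg, "speaker": best_speaker})
    return labeled


# Made-up data in the same shape as transcript segments (start, end, text).
segments = [
    {"start": 0.0, "end": 2.5, "text": "Shall we begin?"},
    {"start": 2.7, "end": 6.0, "text": "Yes, the first item is the budget."},
]
turns = [(0.0, 2.6, "SPEAKER_00"), (2.6, 6.2, "SPEAKER_01")]

for seg in assign_speakers(segments, turns):
    print(f"[{seg['speaker']}] {seg['text']}")
```

Segments that overlap no turn keep the label "unknown", which makes gaps in the diarization output easy to spot.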

Architecture Improvements

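Whisper large-v3 is a roughly 1.55B-parameter encoder-decoder Transformer with:

Standard Transformer blocks rather than Conformer blocks, the same architecture as large-v2
128-bin log-Mel spectrogram input, up from 80 bins in earlier versions
Fixed 30-second input windows, with sinusoidal position embeddings in the encoder and learned ones in the decoder
Segment-level timestamp tokens, with word-level timestamps derived from cross-attention alignment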
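
Whisper's encoder sees audio as fixed 30-second windows of log-Mel frames, and a little arithmetic shows what that means for sequence length. The sketch below uses the front-end constants from OpenAI's open-source Whisper code (16 kHz audio, a hop of 160 samples, and a stride-2 convolution ahead of the encoder). The function names are illustrative.

```python
import math

SAMPLE_RATE = 16_000   # Hz; input audio is resampled to 16 kHz
CHUNK_SECONDS = 30     # each window is padded or trimmed to 30 seconds
HOP_LENGTH = 160       # samples between successive spectrogram frames
CONV_STRIDE = 2        # the encoder's second convolution halves the frame rate


def frames_per_window() -> int:
    """Log-Mel frames in one 30-second window (one frame every 10 ms)."""
    return SAMPLE_RATE * CHUNK_SECONDS // HOP_LENGTH


def encoder_positions() -> int:
    """Sequence length seen by the encoder's Transformer blocks (one per 20 ms)."""
    return frames_per_window() // CONV_STRIDE


def min_windows(duration_seconds: float) -> int:
    """Lower bound on the number of 30-second windows needed to cover a recording."""
    return max(1, math.ceil(duration_seconds / CHUNK_SECONDS))


print(frames_per_window())   # 3000
print(encoder_positions())   # 1500
print(min_windows(3600))     # 120
```

The window count is a lower bound because the transcription loop advances by the last predicted timestamp, which can be less than a full 30 seconds.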

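import whisper

# Load the large-v3 checkpoint (downloaded on first use).
model = whisper.load_model("large-v3")

# Voice activity detection is not a transcribe() option in openai-whisper;
# faster-whisper offers a vad_filter flag for that.
result = model.transcribe(
    "meeting.wav",
    language="en",         # skip automatic language detection
    task="transcribe",     # use "translate" to translate into English
    word_timestamps=True,  # adds per-word timings under segment["words"]
)

# Each segment carries start and end times in seconds plus the decoded text.
for segment in result["segments"]:
    print(f"{segment['start']:.1f}s - {segment['end']:.1f}s: {segment['text']}")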
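
The segment list is easy to turn into subtitles. The sketch below formats segments as SRT text, assuming only the `start`, `end`, and `text` keys used in the loop above. The helper names and the sample data are illustrative.

```python
from typing import Dict, List


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, the timestamp style used by SRT files."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: List[Dict]) -> str:
    """Render transcript segments as numbered SRT blocks separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)


# Made-up segments in the same shape as result["segments"].
sample_segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello everyone."},
    {"start": 2.5, "end": 5.0, "text": " Let's get started."},
]
print(segments_to_srt(sample_segments))
```

The whisper package also ships its own subtitle writers; this version only shows the shape of the data.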

Real-time Streaming

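Whisper itself has no native streaming mode; it decodes 30-second windows. Live applications usually run it behind a chunking server, such as the open-source WhisperLive project:

from whisper_live.client import TranscriptionClient

# Connect to a WhisperLive server already running on localhost:9090.
# Argument names follow the WhisperLive README and may differ between releases.
client = TranscriptionClient(
    "localhost",
    9090,
    lang="en",
    translate=False,
    model="large-v3",
    use_vad=True,  # drop silent audio before it reaches the model
)

# Calling the client with no arguments streams from the microphone
# and prints transcripts as the server returns them.
client()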
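
Under the hood, live transcription usually means feeding the model short, slightly overlapping chunks of audio. The sketch below shows one way to cut a sample buffer into such chunks. The 2-second chunk and 0.5-second overlap are illustrative defaults, and `chunk_audio` is a hypothetical helper, not part of any streaming library.

```python
from typing import Iterator, Sequence


def chunk_audio(
    samples: Sequence[float],
    sample_rate: int,
    chunk_seconds: float = 2.0,
    overlap_seconds: float = 0.5,
) -> Iterator[Sequence[float]]:
    """Yield fixed-length chunks that overlap, so words cut at a boundary reappear."""
    chunk_len = int(chunk_seconds * sample_rate)
    step = chunk_len - int(overlap_seconds * sample_rate)
    if chunk_len <= 0 or step <= 0:
        raise ValueError("chunk_seconds must be positive and larger than overlap_seconds")
    for start in range(0, len(samples), step):
        yield samples[start:start + chunk_len]
        if start + chunk_len >= len(samples):
            break  # the buffer is fully covered; the last chunk may be shorter


# Five seconds of silence at 16 kHz as stand-in audio.
audio = [0.0] * (16_000 * 5)
for i, chunk in enumerate(chunk_audio(audio, sample_rate=16_000)):
    print(f"chunk {i}: {len(chunk) / 16_000:.2f}s")
```

Smaller chunks lower latency but give the model less context, so the chunk size is a trade-off between responsiveness and accuracy.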

Accuracy Benchmarks

| Language | WER (Whisper v2) | WER (Whisper v3) | Human |
|----------|------------------|------------------|-------|
| English  | 4.2%             | 2.8%             | 2.5%  |
| Mandarin | 8.5%             | 5.2%             | 4.8%  |
| Spanish  | 5.8%             | 3.5%             | 3.2%  |
| Arabic   | 12.3%            | 7.8%             | 7.5%  |
| Hindi    | 15.2%            | 9.5%             | 9.0%  |
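
WER (word error rate) is the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. A minimal sketch using word-level edit distance follows. It splits on whitespace only, whereas real evaluations usually normalize case and punctuation first, and the function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One deleted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"{wer:.1%}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.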

SeamlessM4T v2: Unified Speech+Text

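Meta's SeamlessM4T v2 covers several speech and text tasks in a single model:

Speech-to-text transcription and translation (around 100 languages)
Speech-to-speech translation (around 100 source languages into roughly 35 target languages)
Text-to-speech translation
Text-to-text translation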

Canary: NVIDIA's ASR Champion

Optimized for NVIDIA hardware, Canary achieves 8x real-time throughput on an H100: one hour of audio is transcribed in about 7.5 minutes.
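
A figure such as "8x real-time" is the ratio of audio duration to processing time. The sketch below shows the arithmetic and a simple way to measure it around any transcription call. The function names are illustrative, and `time.sleep` stands in for a real model call.

```python
import time
from typing import Callable


def processing_seconds(audio_seconds: float, speedup: float) -> float:
    """Time needed to transcribe audio at a given real-time speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


def measure_speedup(transcribe: Callable[[], object], audio_seconds: float) -> float:
    """Run a transcription callable once and return audio duration / wall-clock time."""
    start = time.perf_counter()
    transcribe()
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero reading
    return audio_seconds / elapsed


# One hour of audio at 8x real time.
print(processing_seconds(3600, 8) / 60, "minutes")

# Stand-in workload: pretend a 1-second clip took about 10 ms to transcribe.
print(f"{measure_speedup(lambda: time.sleep(0.01), audio_seconds=1.0):.0f}x real time")
```

Measured speedup depends heavily on batch size, precision, and audio length, so compare figures only when they were taken under the same conditions.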

Applications in 2025

Meeting transcription: integrated into Zoom, Teams, and Meet
Live captioning: TV, events, and lectures
Voice assistants: on-device processing
Call centers: real-time agent assistance
