
Whisper v3 and Beyond: ASR in 2025

State-of-the-art speech recognition with real-time streaming and 100+ languages.

Vijayakumar S
Jun 1, 2025 · 12 min read
[Image: speech recognition waveform visualization]

The ASR Revolution

By 2025, leading ASR systems approach human-level word error rates in widely spoken languages. Whisper v3, SeamlessM4T v2, and Canary lead the pack.

Whisper v3: What's New

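  • Lower word error rates than v2 at the same model size; faster inference comes from the smaller large-v3-turbo variant and optimized runtimes such as faster-whisper
  • Near-real-time streaming through chunked inference in third-party servers such as WhisperLive, with latency governed by chunk size (streaming is not built into the model)
  • About 100 languages supported
  • Punctuation and capitalization out-of-the-box
  • Speaker diarization for multi-speaker audio when paired with a separate diarization model such as pyannote.audio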
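
Multi-speaker transcripts are usually built by combining ASR segments with speaker turns from a diarizer. The sketch below shows one simple way to do that merge: each transcript segment gets the speaker whose turn overlaps it the most. The tuple layouts, the `assign_speakers` name, and the `SPEAKER_00` style labels are assumptions made for illustration, not the output format of any particular library.

```python
from typing import List, Tuple

Segment = Tuple[float, float, str]  # (start_s, end_s, text), hypothetical layout
Turn = Tuple[float, float, str]     # (start_s, end_s, speaker_label), hypothetical layout


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments: List[Segment], turns: List[Turn]) -> List[Tuple[str, str]]:
    """Label each transcript segment with the speaker whose turn overlaps it most."""
    labeled = []
    for start, end, text in segments:
        best_speaker = "unknown"  # used when no turn overlaps the segment
        best_overlap = 0.0
        for t_start, t_end, speaker in turns:
            shared = overlap(start, end, t_start, t_end)
            if shared > best_overlap:
                best_overlap = shared
                best_speaker = speaker
        labeled.append((best_speaker, text))
    return labeled


segments = [(0.0, 2.0, "Hello everyone."), (2.0, 5.0, "Thanks for joining.")]
turns = [(0.0, 2.2, "SPEAKER_00"), (2.2, 6.0, "SPEAKER_01")]

for speaker, text in assign_speakers(segments, turns):
    print(f"{speaker}: {text}")
```

Maximum overlap is a deliberately simple rule; segments that straddle a speaker change would need to be split at word level to be labeled accurately.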

Architecture Improvements

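Whisper large-v3 is a roughly 1.55B-parameter encoder-decoder Transformer with:

  • Standard Transformer blocks, unchanged from v2
  • 128 Mel frequency bins in the input spectrogram, up from 80
  • Sinusoidal position embeddings in the encoder and learned ones in the decoder, over fixed 30-second windows
  • Segment-level timestamp tokens, with word-level timestamps derived from cross-attention alignment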
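
To make the time resolution of the model concrete, here is a small arithmetic sketch of how a 30-second window maps to spectrogram frames and encoder positions. The constants (16 kHz audio, a hop of 160 samples, and a stride-2 convolution in front of the encoder) are taken from the open-source Whisper reference code; treat them as assumptions if you use a different checkpoint or runtime. The function names are made up for this example.

```python
SAMPLE_RATE = 16_000   # samples per second expected by the model
HOP_LENGTH = 160       # samples between consecutive spectrogram frames
WINDOW_SECONDS = 30    # fixed audio window processed per forward pass
ENCODER_STRIDE = 2     # the convolutional front end halves the frame rate


def mel_frames(seconds: float) -> int:
    """Number of spectrogram frames produced for `seconds` of audio."""
    return int(seconds * SAMPLE_RATE) // HOP_LENGTH


def encoder_positions(seconds: float) -> int:
    """Number of positions the encoder attends over for `seconds` of audio."""
    return mel_frames(seconds) // ENCODER_STRIDE


def seconds_per_position() -> float:
    """Audio duration covered by a single encoder position."""
    return HOP_LENGTH * ENCODER_STRIDE / SAMPLE_RATE


print(mel_frames(WINDOW_SECONDS))         # 3000
print(encoder_positions(WINDOW_SECONDS))  # 1500
print(seconds_per_position())             # 0.02
```

The 20 ms per position is why timestamps from this family of models come in 0.02-second steps.
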
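
import whisper

# Load the large-v3 checkpoint (weights are downloaded on first use)
model = whisper.load_model("large-v3")
result = model.transcribe(
    "meeting.wav",
    language="en",
    task="transcribe",
    word_timestamps=True,  # adds per-word timings to each segment
)
# Note: vad_filter is a faster-whisper option; openai-whisper's transcribe() rejects it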

for segment in result["segments"]:
    print(f"{segment['start']:.1f}s - {segment['end']:.1f}s: {segment['text']}")
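
The segment list is easy to turn into subtitles. The following is a minimal sketch that converts segments shaped like the ones above (dicts with `start`, `end`, and `text`) into SRT text. The helper names `format_timestamp` and `segments_to_srt` are chosen for this example, and the sample data is invented so the code runs without loading a model.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from dicts that carry 'start', 'end', and 'text' keys."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # A blank line separates consecutive subtitle blocks
    return "\n".join(blocks)


# Invented sample in the same shape as result["segments"]
sample_segments = [
    {"start": 0.0, "end": 2.5, "text": " Welcome to the meeting."},
    {"start": 2.5, "end": 61.04, "text": " Let's get started."},
]
print(segments_to_srt(sample_segments))
```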

Real-time Streaming

Whisper has no built-in streaming mode. Live applications run it behind a streaming server such as WhisperLive, which transcribes audio in short chunks:

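from whisper_live.client import TranscriptionClient

# Connect to a WhisperLive server that is already running on localhost:9090
client = TranscriptionClient(
    "localhost",
    9090,
    lang="en",
    translate=False,
    model="large-v3",
    use_vad=True,  # voice activity detection on the server side
)

# Stream from the default microphone; transcripts are printed as they arrive.
# Pass a file path instead, e.g. client("meeting.wav"), to stream a recording.
client()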
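
Streaming setups generally work by cutting the incoming audio into fixed-length chunks and transcribing each one as it fills. Here is a minimal, library-free sketch of that buffering step. The `chunk_audio` function is hypothetical and is not how any particular server implements it; real systems usually add overlap and voice activity detection on top.

```python
from typing import Iterator, List, Sequence


def chunk_audio(
    samples: Sequence[float], sample_rate: int, chunk_seconds: float
) -> Iterator[List[float]]:
    """Yield consecutive fixed-length chunks; the last chunk may be shorter."""
    chunk_len = int(sample_rate * chunk_seconds)
    if chunk_len <= 0:
        raise ValueError("chunk_seconds must cover at least one sample")
    for start in range(0, len(samples), chunk_len):
        yield list(samples[start:start + chunk_len])


# Five seconds of silence at 16 kHz, split into 2-second chunks
audio = [0.0] * (16_000 * 5)
chunks = list(chunk_audio(audio, sample_rate=16_000, chunk_seconds=2.0))
print([len(chunk) / 16_000 for chunk in chunks])  # [2.0, 2.0, 1.0]
```

A chunk cannot be transcribed before it has been fully captured, so end-to-end latency is at least the chunk length plus inference time. Shorter chunks lower latency but give the model less context.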

Accuracy Benchmarks

| Language | WER (Whisper v2) | WER (Whisper v3) | Human |
|----------|------------------|------------------|-------|
| English  | 4.2%             | 2.8%             | 2.5%  |
| Mandarin | 8.5%             | 5.2%             | 4.8%  |
| Spanish  | 5.8%             | 3.5%             | 3.2%  |
| Arabic   | 12.3%            | 7.8%             | 7.5%  |
| Hindi    | 15.2%            | 9.5%             | 9.0%  |
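
For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a plain dynamic-programming implementation with naive lowercasing and whitespace tokenization. Published benchmarks typically apply heavier text normalization, so numbers from this function are not directly comparable to the table.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # consumed so far and the first j hypothesis words
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"{wer:.1%}")  # 16.7%
```

One substituted word out of six gives 16.7%, so the 2.8% English figure above corresponds to roughly one error every 36 words.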

SeamlessM4T v2: Unified Speech+Text

Meta's SeamlessM4T v2 handles speech and text, transcription and translation, in a single model:

  • Speech-to-text (100+ languages)
  • Speech-to-speech (100+ language pairs)
  • Text-to-speech
  • Text-to-text translation

Canary: NVIDIA's ASR Champion

Canary is optimized for NVIDIA hardware and reaches 8x real-time on an H100, so an hour of audio transcribes in about 7.5 minutes.
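
A speed quoted as "Nx real-time" is the ratio of audio duration to processing time. The helper names in this short sketch are chosen for illustration; it just makes the arithmetic explicit.

```python
def realtime_speedup(audio_seconds: float, processing_seconds: float) -> float:
    """How many seconds of audio are processed per second of compute."""
    return audio_seconds / processing_seconds


def processing_time(audio_seconds: float, speedup: float) -> float:
    """Seconds of compute needed for a clip at a given real-time speedup."""
    return audio_seconds / speedup


one_hour = 3600.0
print(processing_time(one_hour, speedup=8.0) / 60)  # 7.5 (minutes)
print(realtime_speedup(one_hour, 450.0))            # 8.0
```

Any speedup above 1.0 means the system keeps up with live audio; the margin above 1.0 is what lets a single GPU serve several streams at once.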

Applications in 2025

  • Meeting transcription: built into Zoom, Teams, and Meet
  • Live captioning: TV, events, lectures
  • Voice assistants: On-device processing
  • Call centers: Real-time agent assistance

Vijayakumar S
AI Engineer · ML Enthusiast

Passionate about building intelligent systems, speech synthesis, and LLM applications. Writing about the tools and ideas shaping the next decade of software.