Speech-to-Text Accuracy Comparison (2026)
Speech-to-text (STT) accuracy determines whether captioning, transcription, and voice control tools are genuinely useful or merely decorative. For deaf users depending on captions as their primary communication channel, a 5% error rate means roughly one wrong word per sentence. For voice control users with motor impairments, recognition failures mean repeated attempts and mounting frustration. This comparison evaluates the leading STT systems against the benchmarks that matter for accessibility.
How Accuracy Is Measured
Word Error Rate (WER) is the standard metric: the number of insertions, deletions, and substitutions divided by the total number of words in the reference transcript. A WER of 5% means 95% accuracy. The accessibility standard for captioning requires WER below 2% (98%+ accuracy).
Important caveat: WER varies dramatically by test conditions. Clean studio audio produces much better results than real-world environments with background noise, overlapping speakers, accents, and technical vocabulary.
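The metric itself is straightforward to compute: word-level Levenshtein distance between the reference and hypothesis transcripts, divided by the reference length. A minimal sketch (normalization here is just lowercasing and whitespace splitting; production scoring tools also strip punctuation and expand numerals):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / reference words,
    computed as word-level Levenshtein distance via dynamic programming."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution in a five-word reference gives WER 0.2 (80% accuracy):
# wer("the quick brown fox jumps", "the quick brown box jumps") -> 0.2
```

Note that WER can exceed 100% when the hypothesis inserts more words than the reference contains, which is exactly what happens during the hallucination failures discussed below.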
Model-by-Model Results
OpenAI Whisper Large-v3
- Clean English WER: ~3-5%
- Noisy conditions: ~8-15%
- Accented speech: ~6-12%
Whisper v3 reduces errors 10-20% over v2 across multiple languages. It supports 99 languages but accuracy drops substantially for low-resource languages. A known limitation: hallucinations that insert fabricated content into transcripts, particularly during silence or music segments.
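The hallucination problem can be partially mitigated in post-processing. Whisper's output segments include `no_speech_prob` and `avg_logprob` fields; a common heuristic drops segments where both signal trouble. A sketch, assuming segment dicts shaped like openai-whisper's output (the threshold values are illustrative, not canonical):

```python
def filter_hallucinations(segments, no_speech_max=0.6, logprob_min=-1.0):
    """Drop segments Whisper likely hallucinated: high no-speech probability
    combined with low average token log-probability. Thresholds are
    illustrative; tune them against your own audio."""
    kept = []
    for seg in segments:
        if seg["no_speech_prob"] > no_speech_max and seg["avg_logprob"] < logprob_min:
            continue  # likely fabricated text over silence or music
        kept.append(seg)
    return kept
```

For accessibility use this filtering matters: a fabricated caption over a music segment actively misinforms a deaf viewer, which is worse than no caption at all.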
GPT-4o Transcription (OpenAI, March 2025)
- Clean English WER: ~2-4%
- Noisy conditions: ~5-10%
- Accented speech: ~4-8%
The newer GPT-4o-based transcription models achieve lower error rates than any Whisper version. They are available through OpenAI's API only; unlike Whisper, there is no open-source release.
Google Cloud Speech-to-Text V2
- Clean English WER: ~3-5%
- Noisy conditions: ~7-12%
- Accented speech: ~5-10%
Strong multilingual support with automatic language detection. Enhanced model available for medical and phone call transcription.
Microsoft Azure Speech
- Clean English WER: ~3-5%
- Noisy conditions: ~6-12%
- Accented speech: ~5-10%
Custom speech models trained on domain-specific data can reduce WER significantly for specialized vocabulary (medical, legal, technical). Deep integration with Microsoft 365.
Deepgram Nova-2
- Clean English WER: ~3-5%
- Noisy conditions: ~6-10%
- Accented speech: ~5-9%
Optimized for real-time processing with low latency. Strong performance on multi-speaker audio and phone calls.
Human CART Captioner
- Clean English WER: ~1-2%
- Noisy conditions: ~2-4%
- Accented speech: ~2-5%
Human captioners consistently outperform AI, particularly in challenging conditions. They handle context, speaker intent, and ambiguity that AI systems miss.
Comparison Table
| System | Clean English | Noisy | Accented | Languages | Real-time | Cost |
|---|---|---|---|---|---|---|
| Whisper v3 | ~95-97% | ~85-92% | ~88-94% | 99 | Via third-party | Free (open source) |
| GPT-4o STT | ~96-98% | ~90-95% | ~92-96% | 50+ | Via API | Pay-per-minute |
| Google V2 | ~95-97% | ~88-93% | ~90-95% | 125+ | Yes | Pay-per-minute |
| Azure Speech | ~95-97% | ~88-94% | ~90-95% | 100+ | Yes | Pay-per-minute |
| Deepgram Nova-2 | ~95-97% | ~90-94% | ~91-95% | 30+ | Yes | Pay-per-minute |
| Human CART | ~98-99% | ~96-98% | ~95-98% | Variable | Yes | $100-250/hour |
Accessibility-Specific Considerations
Atypical Speech
None of the AI systems above are benchmarked against speakers with dysarthria, stuttering, or other speech differences. Real-world accuracy for these users is substantially lower. Google’s Project Relate and similar initiatives are specifically training models on atypical speech, but mainstream STT systems lag behind.
Technical Vocabulary
Medical, legal, and scientific terminology pushes error rates higher across all systems. Custom model training (available on Azure, Google, and Deepgram) can address this but requires labeled training data.
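When labeled training data is not available, a lightweight complement to custom model training is post-processing: replacing known misrecognitions with the correct domain term. This is my own sketch, not a vendor feature, and the glossary entries below are hypothetical examples:

```python
import re

# Hypothetical domain glossary mapping common misrecognitions to correct
# terms; build a real one from reviewed transcripts in your domain.
CORRECTIONS = {
    "metaprolol": "metoprolol",
    "hippa": "HIPAA",
}

def apply_corrections(transcript: str, corrections: dict[str, str] = CORRECTIONS) -> str:
    """Replace known misrecognized terms as whole words, case-insensitively."""
    for wrong, right in corrections.items():
        transcript = re.sub(rf"\b{re.escape(wrong)}\b", right, transcript,
                            flags=re.IGNORECASE)
    return transcript
```

This only catches errors you have already seen, so it complements rather than replaces custom training, but it requires no labeled audio and works with any STT backend.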
Speaker Diarization
Identifying who said what matters for meeting transcription. Otter.ai, Azure, and Deepgram handle speaker separation; Whisper does not support it natively.
Latency
Real-time captioning requires low latency. Streaming APIs from Google, Azure, and Deepgram provide sub-second lag. Whisper is batch-oriented by default, though third-party streaming implementations exist.
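The buffering step those third-party wrappers rely on is simple: slice incoming audio into fixed windows and feed each window to the batch model as it fills. A sketch of just that step (`transcribe_fn` is a stand-in for any batch STT call; real wrappers also overlap windows and merge the partial transcripts):

```python
def stream_chunks(audio_samples, sample_rate=16000, window_s=2.0):
    """Yield fixed-length windows of audio so a batch-oriented STT model
    can be called incrementally. Shows only the buffering step; production
    streaming also overlaps windows and reconciles the text at the seams."""
    window = int(sample_rate * window_s)
    for start in range(0, len(audio_samples), window):
        yield audio_samples[start:start + window]

# Usage sketch: caption latency is bounded by window_s plus model inference
# time, so smaller windows lower lag at the cost of less context per call.
# for chunk in stream_chunks(samples, window_s=1.0):
#     print(transcribe_fn(chunk))
```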
For detailed service comparisons, see AI captioning and transcription services compared. For sign language alternatives, read AI sign language translation.
Key Takeaways
- GPT-4o transcription currently leads AI accuracy benchmarks for clean English, approaching but not consistently reaching the 98% threshold required for accessibility compliance.
- All AI systems degrade significantly with noise, accents, and atypical speech, the conditions most relevant to real-world accessibility use.
- Human CART captioners remain the accuracy standard, particularly for high-stakes contexts.
- Custom model training (Azure, Google, Deepgram) can meaningfully improve accuracy for specialized vocabulary.
- Mainstream STT systems are not benchmarked or optimized for speakers with speech disabilities, a critical gap for accessibility.
Sources
- OpenAI Whisper — open-source speech recognition: https://openai.com/index/whisper/
- Radford et al., “Robust Speech Recognition via Large-Scale Weak Supervision” — Whisper model paper: https://arxiv.org/abs/2212.04356
- Google Cloud Speech-to-Text — cloud speech recognition API: https://cloud.google.com/speech-to-text
- Microsoft Azure Speech Services — enterprise speech recognition: https://azure.microsoft.com/en-us/products/ai-services/speech-to-text
- Deepgram — AI-powered speech recognition platform: https://deepgram.com/