Speech-to-Text Accuracy Comparison (2026)
Speech-to-text (STT) accuracy determines whether captioning, transcription, and voice control tools are genuinely useful or merely decorative. For deaf users depending on captions as their primary communication channel, a 5% error rate means roughly one wrong word per sentence. For voice control users with motor impairments, recognition failures mean repeated attempts and mounting frustration. This comparison evaluates the leading STT systems against the benchmarks that matter for accessibility.
How Accuracy Is Measured
Word Error Rate (WER) is the standard metric: the number of insertions, deletions, and substitutions divided by the total number of words in the reference transcript. A WER of 5% means 95% accuracy. The accessibility standard for captioning requires WER below 2% (98%+ accuracy).
Important caveat: WER varies dramatically by test conditions. Clean studio audio produces much better results than real-world environments with background noise, overlapping speakers, accents, and technical vocabulary.
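The metric itself is straightforward to compute: word-level Levenshtein distance between the reference and hypothesis transcripts, divided by the reference length. A minimal sketch (normalization here is just lowercasing and whitespace splitting; production scoring tools also strip punctuation and expand numerals):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / reference words,
    computed as word-level Levenshtein distance via dynamic programming."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution in a five-word reference gives WER 0.2 (80% accuracy):
# wer("the quick brown fox jumps", "the quick brown box jumps") -> 0.2
```

Note that WER can exceed 100% when the hypothesis inserts more words than the reference contains, which is exactly what happens during the hallucination failures discussed below.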
Model-by-Model Results
OpenAI Whisper Large-v3
- Clean English WER: ~3-5%
- Noisy conditions: ~8-15%
- Accented speech: ~6-12%
Whisper v3 reduces errors 10-20% over v2 across multiple languages. It supports 99 languages but accuracy drops substantially for low-resource languages. A known limitation: hallucinations that insert fabricated content into transcripts, particularly during silence or music segments.
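The hallucination problem can be partially mitigated in post-processing. Whisper's output segments include `no_speech_prob` and `avg_logprob` fields; a common heuristic drops segments where both signal trouble. A sketch, assuming segment dicts shaped like openai-whisper's output (the threshold values are illustrative, not canonical):

```python
def filter_hallucinations(segments, no_speech_max=0.6, logprob_min=-1.0):
    """Drop segments Whisper likely hallucinated: high no-speech probability
    combined with low average token log-probability. Thresholds are
    illustrative; tune them against your own audio."""
    kept = []
    for seg in segments:
        if seg["no_speech_prob"] > no_speech_max and seg["avg_logprob"] < logprob_min:
            continue  # likely fabricated text over silence or music
        kept.append(seg)
    return kept
```

For accessibility use this filtering matters: a fabricated caption over a music segment actively misinforms a deaf viewer, which is worse than no caption at all.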
GPT-4o Transcription (OpenAI, March 2025)
- Clean English WER: ~2-4%
- Noisy conditions: ~5-10%
- Accented speech: ~4-8%
The newer GPT-4o-based transcription models achieve lower error rates than any Whisper version. They are available through OpenAI's API only; unlike Whisper, there is no open-source release.
Google Cloud Speech-to-Text V2
- Clean English WER: ~3-5%
- Noisy conditions: ~7-12%
- Accented speech: ~5-10%
Strong multilingual support with automatic language detection. Enhanced model available for medical and phone call transcription.
Microsoft Azure Speech
- Clean English WER: ~3-5%
- Noisy conditions: ~6-12%
- Accented speech: ~5-10%
Custom speech models trained on domain-specific data can reduce WER significantly for specialized vocabulary (medical, legal, technical). Deep integration with Microsoft 365.
Deepgram Nova-2
- Clean English WER: ~3-5%
- Noisy conditions: ~6-10%
- Accented speech: ~5-9%
Optimized for real-time processing with low latency. Strong performance on multi-speaker audio and phone calls.
Human CART Captioner
- Clean English WER: ~1-2%
- Noisy conditions: ~2-4%
- Accented speech: ~2-5%
Human captioners consistently outperform AI, particularly in challenging conditions. They handle context, speaker intent, and ambiguity that AI systems miss.
Comparison Table
| System | Clean English | Noisy | Accented | Languages | Real-time | Cost |
|---|---|---|---|---|---|---|
| Whisper v3 | ~95-97% | ~85-92% | ~88-94% | 99 | Via third-party | Free (open source) |
| GPT-4o STT | ~96-98% | ~90-95% | ~92-96% | 50+ | Via API | Pay-per-minute |
| Google V2 | ~95-97% | ~88-93% | ~90-95% | 125+ | Yes | Pay-per-minute |
| Azure Speech | ~95-97% | ~88-94% | ~90-95% | 100+ | Yes | Pay-per-minute |
| Deepgram Nova-2 | ~95-97% | ~90-94% | ~91-95% | 30+ | Yes | Pay-per-minute |
| Human CART | ~98-99% | ~96-98% | ~95-98% | Variable | Yes | $100-250/hour |
Accessibility-Specific Considerations
Atypical Speech
None of the AI systems above are benchmarked against speakers with dysarthria, stuttering, or other speech differences. Real-world accuracy for these users is substantially lower. Google’s Project Relate and similar initiatives are specifically training models on atypical speech, but mainstream STT systems lag behind.
Technical Vocabulary
Medical, legal, and scientific terminology pushes error rates higher across all systems. Custom model training (available on Azure, Google, and Deepgram) can address this but requires labeled training data.
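When labeled training data is not available, a lightweight complement to custom model training is post-processing: replacing known misrecognitions with the correct domain term. This is my own sketch, not a vendor feature, and the glossary entries below are hypothetical examples:

```python
import re

# Hypothetical domain glossary mapping common misrecognitions to correct
# terms; build a real one from reviewed transcripts in your domain.
CORRECTIONS = {
    "metaprolol": "metoprolol",
    "hippa": "HIPAA",
}

def apply_corrections(transcript: str, corrections: dict[str, str] = CORRECTIONS) -> str:
    """Replace known misrecognized terms as whole words, case-insensitively."""
    for wrong, right in corrections.items():
        transcript = re.sub(rf"\b{re.escape(wrong)}\b", right, transcript,
                            flags=re.IGNORECASE)
    return transcript
```

This only catches errors you have already seen, so it complements rather than replaces custom training, but it requires no labeled audio and works with any STT backend.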
Speaker Diarization
Identifying who said what matters for meeting transcription. Otter.ai, Azure, and Deepgram handle speaker separation; Whisper does not support it natively.
Latency
Real-time captioning requires low latency. Streaming APIs from Google, Azure, and Deepgram provide sub-second lag. Whisper is batch-oriented by default, though third-party streaming implementations exist.
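The buffering step those third-party wrappers rely on is simple: slice incoming audio into fixed windows and feed each window to the batch model as it fills. A sketch of just that step (`transcribe_fn` is a stand-in for any batch STT call; real wrappers also overlap windows and merge the partial transcripts):

```python
def stream_chunks(audio_samples, sample_rate=16000, window_s=2.0):
    """Yield fixed-length windows of audio so a batch-oriented STT model
    can be called incrementally. Shows only the buffering step; production
    streaming also overlaps windows and reconciles the text at the seams."""
    window = int(sample_rate * window_s)
    for start in range(0, len(audio_samples), window):
        yield audio_samples[start:start + window]

# Usage sketch: caption latency is bounded by window_s plus model inference
# time, so smaller windows lower lag at the cost of less context per call.
# for chunk in stream_chunks(samples, window_s=1.0):
#     print(transcribe_fn(chunk))
```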
For detailed service comparisons, see AI captioning and transcription services compared. For sign language alternatives, read AI sign language translation.
Key Takeaways
- GPT-4o transcription currently leads AI accuracy benchmarks for clean English, approaching but not consistently reaching the 98% threshold required for accessibility compliance.
- All AI systems degrade significantly with noise, accents, and atypical speech, the conditions most relevant to real-world accessibility use.
- Human CART captioners remain the accuracy standard, particularly for high-stakes contexts.
- Custom model training (Azure, Google, Deepgram) can meaningfully improve accuracy for specialized vocabulary.
- Mainstream STT systems are not benchmarked or optimized for speakers with speech disabilities, a critical gap for accessibility.
Sources
- OpenAI Whisper — open-source speech recognition: https://openai.com/index/whisper/
- Radford et al., “Robust Speech Recognition via Large-Scale Weak Supervision” — Whisper model paper: https://arxiv.org/abs/2212.04356
- Google Cloud Speech-to-Text — cloud speech recognition API: https://cloud.google.com/speech-to-text
- Microsoft Azure Speech Services — enterprise speech recognition: https://azure.microsoft.com/en-us/products/ai-services/speech-to-text
- Deepgram — AI-powered speech recognition platform: https://deepgram.com/