Video Captioning and Audio Descriptions

Video content without captions excludes 466 million people worldwide with disabling hearing loss (WHO estimate). Video without audio descriptions excludes 2.2 billion people with vision impairments. Beyond disability, captions serve anyone in a noisy environment, a quiet office, or watching in a non-native language. Audio descriptions serve anyone who glances away from the screen or processes information better through narration.

WCAG Requirements for Multimedia

Criterion	Level	Requirement
1.2.1 Audio-only and Video-only (Prerecorded)	A	Provide a text transcript for audio-only content. Provide either a text alternative or an audio track for video-only content.
1.2.2 Captions (Prerecorded)	A	Synchronized captions for all prerecorded audio content in video.
1.2.3 Audio Description or Media Alternative (Prerecorded)	A	Audio description for prerecorded video, or a full text alternative.
1.2.4 Captions (Live)	AA	Real-time captions for live audio content in video.
1.2.5 Audio Description (Prerecorded)	AA	Audio description for all prerecorded video content (no text-alternative fallback at this level).
1.2.6 Sign Language (Prerecorded)	AAA	Sign language interpretation for prerecorded audio in video.
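
Because each conformance level includes the levels below it, the table can be read cumulatively. As a minimal sketch (the names `CRITERIA`, `LEVEL_ORDER`, and `criteria_for` are illustrative, not part of WCAG), the criteria that apply at a given target level could be looked up like this:

```python
# Criteria copied from the table above: (number, name, level).
CRITERIA = [
    ("1.2.1", "Audio-only and Video-only (Prerecorded)", "A"),
    ("1.2.2", "Captions (Prerecorded)", "A"),
    ("1.2.3", "Audio Description or Media Alternative (Prerecorded)", "A"),
    ("1.2.4", "Captions (Live)", "AA"),
    ("1.2.5", "Audio Description (Prerecorded)", "AA"),
    ("1.2.6", "Sign Language (Prerecorded)", "AAA"),
]

# Higher levels include every criterion from the lower levels.
LEVEL_ORDER = {"A": 1, "AA": 2, "AAA": 3}


def criteria_for(target_level: str) -> list[str]:
    """Return the criterion numbers that apply at target_level or below."""
    limit = LEVEL_ORDER[target_level]
    return [num for num, _name, level in CRITERIA if LEVEL_ORDER[level] <= limit]
```

Under these assumptions, `criteria_for("AA")` returns the three Level A criteria plus 1.2.4 and 1.2.5.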

Captioning Best Practices

Types of Captions

Closed captions (CC): User-toggleable text overlay. Can be turned on or off. This is the standard approach for web video.

Open captions: Burned into the video frame, always visible. Useful for social media feeds where players may not support CC tracks, but provide no user control.

Subtitles: Translations of dialogue into another language. Subtitles assume the viewer can hear; captions assume they cannot. Captions include non-speech audio cues.

What Captions Must Include

All spoken dialogue, attributed when the speaker changes or is off-screen.
Relevant sound effects: [door slams], [phone ringing], [applause].
Music descriptions: [upbeat jazz music], [ominous orchestral score].
Tone indicators when meaning depends on delivery: [sarcastically], [whispering].

Caption Quality Standards

Accuracy: 99%+ accuracy is the professional standard. Auto-generated captions typically achieve 80-90% — insufficient for accessibility compliance without human review.
Synchronization: Captions must appear within 1-2 frames of the spoken word and remain on screen long enough to read. The standard rate is 130-160 words per minute.
Readability: Maximum two lines per caption frame. Line breaks at natural phrase boundaries, not mid-word or mid-clause.
Positioning: Captions should not obscure critical visual content. They typically appear at the bottom center but may need repositioning.
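
Two of these standards, the two-line limit and the reading rate, lend themselves to an automated check. The sketch below is one possible approach, not an established tool: the `Cue` type and both thresholds are assumptions for illustration, with 160 wpm taken as the upper end of the range quoted above.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    text: str     # caption text; "\n" separates on-screen lines


MAX_LINES = 2   # readability guideline above
MAX_WPM = 160   # assumed ceiling: upper end of the 130-160 wpm range


def words_per_minute(cue: Cue) -> float:
    """Reading rate implied by the cue's word count and display time."""
    duration = cue.end - cue.start
    if duration <= 0:
        raise ValueError("cue must end after it starts")
    return len(cue.text.split()) * 60.0 / duration


def check_cue(cue: Cue) -> list[str]:
    """Return a list of problems found; an empty list means the cue passes."""
    problems = []
    if len(cue.text.splitlines()) > MAX_LINES:
        problems.append("more than two lines")
    if words_per_minute(cue) > MAX_WPM:
        problems.append("reading rate above 160 wpm")
    return problems
```

A check like this only flags candidates for review. Accuracy, line-break placement, and positioning still need a human reader.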

Caption Formats

Format	Use Case
WebVTT (.vtt)	Web standard, supported by all modern browsers via the `<track>` element
SRT (.srt)	Widely compatible, simple timestamp-and-text format
TTML (.ttml)	Broadcast and streaming, supports styling and positioning
SCC (.scc)	Broadcast television standard

For web delivery, WebVTT is the recommended format:

<video controls>
  <source src="demo.mp4" type="video/mp4">
  <track kind="captions" src="demo.en.vtt" srclang="en" label="English" default>
  <track kind="subtitles" src="demo.es.vtt" srclang="es" label="Español">
</video>
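
Caption files often arrive as SRT and need converting before they can be used in a `<track>` element. The two formats are close: WebVTT adds a `WEBVTT` header line and uses a period instead of a comma before the milliseconds. The following is a minimal sketch of that conversion, with a hypothetical function name (`srt_to_vtt`). It assumes a well-formed SRT input and does not handle styling tags or other SRT extensions.

```python
import re

# Matches the start of an SRT timing line, e.g. "00:00:01,000 --> 00:00:04,000".
TIMING = re.compile(r"^(\d{2,}:\d{2}:\d{2}),(\d{3}) --> (\d{2,}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text: str) -> str:
    """Convert simple SRT text to WebVTT text."""
    # Drop a leading byte-order mark if present, then start with the header.
    body = srt_text.lstrip("\ufeff").strip()
    lines = ["WEBVTT", ""]
    for line in body.splitlines():
        # Only timing lines change: comma becomes period before milliseconds.
        lines.append(TIMING.sub(r"\1.\2 --> \3.\4", line))
    return "\n".join(lines) + "\n"
```

The numeric cue counters from the SRT file are kept, since WebVTT allows an optional identifier line before each cue.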

Audio Descriptions

Audio descriptions narrate visual information that is not conveyed through dialogue or existing audio. They describe actions, scene changes, on-screen text, and visual details essential to understanding the content.

When Audio Descriptions Are Required

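At Level A (SC 1.2.3), either audio description or a full text alternative (a transcript that includes visual descriptions) is acceptable. At Level AA (SC 1.2.5), the text alternative no longer suffices: audio description itself is required, whether as an alternate audio track or as a separately described version of the video.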

Types of Audio Descriptions

Standard audio descriptions: Narration inserted during natural pauses in dialogue. Works when the video has sufficient gaps between speech.

Extended audio descriptions: The video pauses to allow longer narrated descriptions. Necessary when there are no natural pauses. This corresponds to WCAG SC 1.2.7 (Extended Audio Description, Level AAA).
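
Choosing between the two types comes down to whether the narration fits the available gap. As a rough sketch, the narration time can be estimated from the word count and compared with the pause length. The narration rate of 160 words per minute is an assumption chosen for illustration, not a WCAG figure, and the function names are hypothetical.

```python
ASSUMED_NARRATION_WPM = 160  # illustrative assumption; tune to the narrator


def narration_seconds(description: str, wpm: int = ASSUMED_NARRATION_WPM) -> float:
    """Estimate how long a description takes to speak at the given rate."""
    return len(description.split()) * 60.0 / wpm


def description_type(description: str, gap_seconds: float) -> str:
    """Return "standard" if the narration fits the dialogue gap, else "extended"."""
    if narration_seconds(description) <= gap_seconds:
        return "standard"
    return "extended"
```

An estimate like this is only a first pass when scripting. The recorded narration should still be timed against the actual video.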

Audio Description Content

Describe:

Actions and movements that are not explained by dialogue.
Scene and setting changes.
On-screen text (titles, signs, labels, credits).
Key visual information (facial expressions, gestures, clothing that identifies characters).
Charts, graphs, or diagrams shown on screen.

Do not describe:

Information already conveyed by dialogue or narration.
Obvious visual content that the audio makes clear.
Subjective interpretations: say what is visible (“She frowns and clenches her fists”) rather than what you infer (“She looks angry”).

Transcripts

Full transcripts serve as a universal fallback. They benefit:

Deaf-blind users who access content via Braille displays.
Users who prefer reading to watching.
Search engines (transcripts are readily indexable text; audio and video content on its own generally is not).

A transcript should include all dialogue, speaker identification, sound effects, and descriptions of visual content — essentially combining captions and audio descriptions into a single text document.
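
One way to produce such a document is to merge the caption cues and the audio description script by start time. The sketch below assumes both are already available as timed entries. The `Entry` type, the `[Description: ...]` labelling, and the timestamp format are illustrative choices, not a required transcript format.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    start: float  # seconds from the start of the video
    kind: str     # "caption" or "description"
    text: str


def timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS, dropping fractions of a second."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def build_transcript(captions: list[Entry], descriptions: list[Entry]) -> str:
    """Merge captions and descriptions into one time-ordered transcript."""
    merged = sorted(captions + descriptions, key=lambda e: e.start)
    lines = []
    for entry in merged:
        # Mark visual descriptions so readers can tell them from spoken audio.
        if entry.kind == "description":
            text = f"[Description: {entry.text}]"
        else:
            text = entry.text
        lines.append(f"{timestamp(entry.start)} {text}")
    return "\n".join(lines)
```

Captions that already carry speaker names and bracketed sound cues pass through unchanged, so the merged output covers dialogue, speaker identification, sound effects, and visual content.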

Tools and Workflows

Auto-captioning: YouTube, Vimeo, Rev, and Otter.ai provide automated captions. Always review and correct before publishing.
Professional services: Rev, 3Play Media, and Verbit offer human-reviewed captioning and audio description services.
DIY tools: Subtitle Edit (free, open source), Aegisub (free), and Descript (commercial) support manual caption creation and timing.
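
When reviewing auto-generated captions, it helps to measure how far they are from a corrected transcript. A common proxy is word error rate: the number of word substitutions, insertions, and deletions divided by the length of the reference. The sketch below assumes accuracy is defined as one minus word error rate, which is an assumption here. Professional accuracy measures may also count punctuation, speaker labels, and sound cues.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between hypothesis and reference, per reference word."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def accuracy(reference: str, hypothesis: str) -> float:
    """Accuracy as 1 - WER (assumed definition; can go below 0 with many insertions)."""
    return 1.0 - word_error_rate(reference, hypothesis)
```

Here `reference` would be the human-corrected transcript and `hypothesis` the auto-generated captions for the same audio.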

Key Takeaways

Captions (Level A) and audio descriptions (Level AA) are baseline WCAG requirements, not optional extras.
Auto-generated captions require human review: 80-90% accuracy falls short of the 99%+ professional standard.
Captions must include non-speech audio cues, not just dialogue.
Audio descriptions narrate visual information during pauses in dialogue.
Full transcripts serve as a universal fallback for all multimedia content.

Next Steps

Apply captioning and description standards to designing for deaf and hard of hearing users.
Review the WCAG 2.2 multimedia requirements in context.
Ensure video players have accessible keyboard controls and meet contrast standards.

Sources

WCAG 2.2 SC 1.2.2 Captions (Prerecorded) — The Level A captioning requirement.
W3C WAI: Making Audio and Video Accessible — Comprehensive multimedia accessibility resource.
WebAIM: Captions, Transcripts, and Audio Descriptions — Practical captioning guidance.
MDN Web Docs: WebVTT — Technical reference for the WebVTT caption format.

Multimedia accessibility requirements referenced from WCAG 2.2 Success Criteria 1.2.1 through 1.2.9. Captioning quality standards adapted from the Described and Captioned Media Program (DCMP) guidelines.