Video Captioning and Audio Descriptions

By EZUD

Video without captions excludes the 466 million people worldwide who have disabling hearing loss (WHO estimate). Video without audio descriptions can exclude many of the 2.2 billion people with vision impairments. Beyond disability, captions serve anyone in a noisy environment, in a quiet office, or watching in a non-native language. Audio descriptions serve anyone who glances away from the screen or processes information better through narration.

WCAG Requirements for Multimedia

| Criterion | Level | Requirement |
| --- | --- | --- |
| 1.2.1 Audio-only and Video-only (Prerecorded) | A | Provide a text transcript for audio-only. Provide either a text alternative or audio track for video-only. |
| 1.2.2 Captions (Prerecorded) | A | Synchronized captions for all prerecorded audio content in video. |
| 1.2.3 Audio Description or Media Alternative (Prerecorded) | A | Audio description for prerecorded video, or a full text alternative. |
| 1.2.4 Captions (Live) | AA | Real-time captions for live audio content in video. |
| 1.2.5 Audio Description (Prerecorded) | AA | Audio description for all prerecorded video content (no text-alternative fallback at this level). |
| 1.2.6 Sign Language (Prerecorded) | AAA | Sign language interpretation for prerecorded audio in video. |
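
Because WCAG conformance is cumulative, a Level AA target includes every Level A criterion as well. The sketch below encodes the six criteria listed above and filters them by target level. The names `CRITERIA`, `RANK`, and `criteriaForTarget` are illustrative, not part of any WCAG tooling.

```typescript
type Level = "A" | "AA" | "AAA";

interface Criterion {
  id: string;
  name: string;
  level: Level;
}

// The six multimedia criteria listed above (hypothetical data structure).
const CRITERIA: Criterion[] = [
  { id: "1.2.1", name: "Audio-only and Video-only (Prerecorded)", level: "A" },
  { id: "1.2.2", name: "Captions (Prerecorded)", level: "A" },
  { id: "1.2.3", name: "Audio Description or Media Alternative (Prerecorded)", level: "A" },
  { id: "1.2.4", name: "Captions (Live)", level: "AA" },
  { id: "1.2.5", name: "Audio Description (Prerecorded)", level: "AA" },
  { id: "1.2.6", name: "Sign Language (Prerecorded)", level: "AAA" },
];

const RANK: Record<Level, number> = { A: 1, AA: 2, AAA: 3 };

// A target level includes every criterion at that level and below.
function criteriaForTarget(target: Level): string[] {
  return CRITERIA.filter((c) => RANK[c.level] <= RANK[target]).map((c) => c.id);
}
```

Calling `criteriaForTarget("AA")` returns the five criteria from 1.2.1 through 1.2.5, which is the set a site aiming for AA would need to audit against.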

Captioning Best Practices

Types of Captions

Closed captions (CC): User-toggleable text overlay. Can be turned on or off. This is the standard approach for web video.

Open captions: Burned into the video frame, always visible. Useful for social media feeds where players may not support CC tracks, but provide no user control.

Subtitles: Translations of dialogue into another language. Subtitles assume the viewer can hear; captions assume they cannot. Captions include non-speech audio cues.

What Captions Must Include

  • All spoken dialogue, attributed when the speaker changes or is off-screen.
  • Relevant sound effects: [door slams], [phone ringing], [applause].
  • Music descriptions: [upbeat jazz music], [ominous orchestral score].
  • Tone indicators when meaning depends on delivery: [sarcastically], [whispering].

Caption Quality Standards

  • Accuracy: 99%+ accuracy is the professional standard. Auto-generated captions typically achieve 80-90%, which is insufficient for accessibility compliance without human review.
  • Synchronization: Captions must appear within 1-2 frames of the spoken word and remain on screen long enough to read. The standard rate is 130-160 words per minute.
  • Readability: Maximum two lines per caption frame. Line breaks at natural phrase boundaries, not mid-word or mid-clause.
  • Positioning: Captions should not obscure critical visual content. They typically appear at the bottom center but may need repositioning.
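
Two of these standards are easy to check mechanically: the two-line limit and the reading rate. The following is a minimal sketch that assumes cues are already parsed into a hypothetical `Cue` shape, with lines separated by `\n`. It uses 160 words per minute as the ceiling, taken from the upper end of the range above. Accuracy, line-break quality, and positioning still need human review.

```typescript
interface Cue {
  start: number; // seconds
  end: number;   // seconds
  text: string;  // caption lines separated by "\n"
}

const MAX_LINES = 2; // readability guideline above
const MAX_WPM = 160; // upper end of the 130-160 range above

// Words shown per minute of on-screen time for a single cue.
function wordsPerMinute(cue: Cue): number {
  const words = cue.text.split(/\s+/).filter((w) => w.length > 0).length;
  const seconds = cue.end - cue.start;
  if (words === 0) return 0;
  return seconds > 0 ? (words * 60) / seconds : Infinity;
}

// Returns a list of human-readable problems; an empty list means the cue passed.
function lintCue(cue: Cue): string[] {
  const problems: string[] = [];
  const lines = cue.text.split("\n").length;
  if (lines > MAX_LINES) {
    problems.push(`too many lines: ${lines}`);
  }
  const wpm = wordsPerMinute(cue);
  if (wpm > MAX_WPM) {
    problems.push(`reading rate too high: ${Math.round(wpm)} wpm`);
  }
  return problems;
}
```

A five-word cue shown for three seconds reads at 100 words per minute and passes. Six words crammed into one second reads at 360 words per minute and is flagged.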

Caption Formats

| Format | Use Case |
| --- | --- |
| WebVTT (.vtt) | Web standard, supported by all modern browsers via the <track> element |
| SRT (.srt) | Widely compatible, simple timestamp-and-text format |
| TTML (.ttml) | Broadcast and streaming, supports styling and positioning |
| SCC (.scc) | Broadcast television standard |

For web delivery, WebVTT is the recommended format:

<video controls>
  <source src="demo.mp4" type="video/mp4">
  <track kind="captions" src="demo.en.vtt" srclang="en" label="English" default>
  <track kind="captions" src="demo.es.vtt" srclang="es" label="Español">
</video>
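
The .vtt files referenced above are plain text: a `WEBVTT` header followed by cues, each with a timing line and a payload. As a rough sketch, the function below serializes caption data into that layout, using a WebVTT voice tag (`<v Name>`) for speaker identification and leaving bracketed sound cues as plain text. The `CaptionCue` shape and function names are assumptions for illustration, and speaker names are assumed to contain no markup characters.

```typescript
interface CaptionCue {
  start: number;    // seconds from the start of the video
  end: number;      // seconds
  text: string;     // dialogue, or a bracketed cue such as "[door slams]"
  speaker?: string; // optional speaker name, written as a voice tag
}

function pad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

// Format seconds as a WebVTT timestamp: hh:mm:ss.ttt
function formatTimestamp(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const totalSec = Math.floor(totalMs / 1000);
  const s = totalSec % 60;
  const m = Math.floor(totalSec / 60) % 60;
  const h = Math.floor(totalSec / 3600);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)}.${pad(ms, 3)}`;
}

// Escape the characters that have special meaning in cue text.
function escapeCueText(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

function toWebVTT(cues: CaptionCue[]): string {
  const blocks = cues.map((cue) => {
    const timing = `${formatTimestamp(cue.start)} --> ${formatTimestamp(cue.end)}`;
    const body = escapeCueText(cue.text);
    const payload = cue.speaker ? `<v ${cue.speaker}>${body}` : body;
    return `${timing}\n${payload}`;
  });
  // Header and cues are separated by blank lines.
  return ["WEBVTT"].concat(blocks).join("\n\n") + "\n";
}
```

A cue from 1 to 3.5 seconds spoken by Maria becomes the timing line `00:00:01.000 --> 00:00:03.500` followed by `<v Maria>Did you hear that?`.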

Audio Descriptions

Audio descriptions narrate visual information that is not conveyed through dialogue or existing audio. They describe actions, scene changes, on-screen text, and visual details essential to understanding the content.

When Audio Descriptions Are Required

At Level A (SC 1.2.3), either an audio description or a text alternative (a full transcript with visual descriptions) is acceptable. At Level AA (SC 1.2.5), an actual audio description of the prerecorded video is required.

Types of Audio Descriptions

Standard audio descriptions: Narration inserted during natural pauses in dialogue. Works when the video has sufficient gaps between speech.

Extended audio descriptions: The video pauses to allow longer narrated descriptions. Necessary when natural pauses are absent or too short to fit the description. This corresponds to WCAG SC 1.2.7 (Extended Audio Description, Level AAA).
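
One way to choose between the two types is to measure the silent stretches between dialogue. The sketch below finds gaps of at least `minGap` seconds from a list of dialogue time spans, then applies a simple rule of thumb: if no gap is long enough for a given description, extended description is needed. The types, function names, and the rule itself are illustrative assumptions, not a WCAG-defined procedure.

```typescript
interface TimedSpan {
  start: number; // seconds
  end: number;   // seconds
}

interface Gap {
  start: number;
  end: number;
  duration: number;
}

// Return the stretches without dialogue that last at least minGap seconds.
function findDescriptionGaps(dialogue: TimedSpan[], videoDuration: number, minGap: number): Gap[] {
  const sorted = dialogue.slice().sort((a, b) => a.start - b.start);
  const gaps: Gap[] = [];
  let cursor = 0; // end of the latest dialogue seen so far
  for (const span of sorted) {
    if (span.start - cursor >= minGap) {
      gaps.push({ start: cursor, end: span.start, duration: span.start - cursor });
    }
    cursor = Math.max(cursor, span.end);
  }
  // Trailing silence after the last line of dialogue.
  if (videoDuration - cursor >= minGap) {
    gaps.push({ start: cursor, end: videoDuration, duration: videoDuration - cursor });
  }
  return gaps;
}

// Hypothetical rule of thumb: no gap fits the description, so the video must pause.
function needsExtendedDescription(gaps: Gap[], secondsNeeded: number): boolean {
  return !gaps.some((g) => g.duration >= secondsNeeded);
}
```

For dialogue at 0-4, 4.5-10, and 14-20 seconds in a 25-second video, a 2-second minimum yields two gaps: 10-14 and 20-25. A 3-second description fits; a 6-second one would require extended description.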

Audio Description Content

Describe:

  • Actions and movements that are not explained by dialogue.
  • Scene and setting changes.
  • On-screen text (titles, signs, labels, credits).
  • Key visual information (facial expressions, gestures, clothing that identifies characters).
  • Charts, graphs, or diagrams shown on screen.

Do not describe:

  • Information already conveyed by dialogue or narration.
  • Obvious visual content that the audio makes clear.
  • Subjective interpretations. Describe what is visible (“She frowns and clenches her fists”) rather than what it means (“She looks angry”).

Transcripts

Full transcripts serve as a universal fallback. They benefit:

  • Deaf-blind users who access content via Braille displays.
  • Users who prefer reading to watching.
  • Search engines (transcript text is indexable; the audio and video streams themselves generally are not).

A transcript should include all dialogue, speaker identification, sound effects, and descriptions of visual content. In effect, it combines captions and audio descriptions into a single text document.
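
A minimal sketch of that combination: merge dialogue, sound cues, and visual descriptions into one time-ordered text. The `TranscriptEntry` shape and the output conventions (brackets for sounds, a "Description:" label for visuals) are assumptions chosen for illustration; other layouts work just as well.

```typescript
type EntryKind = "dialogue" | "sound" | "description";

interface TranscriptEntry {
  time: number;     // seconds from the start of the video
  kind: EntryKind;
  text: string;
  speaker?: string; // used only for dialogue
}

// Merge caption and description entries into one time-ordered transcript.
function buildTranscript(entries: TranscriptEntry[]): string {
  const sorted = entries.slice().sort((a, b) => a.time - b.time);
  return sorted
    .map((e) => {
      if (e.kind === "dialogue") {
        return e.speaker ? `${e.speaker}: ${e.text}` : e.text;
      }
      if (e.kind === "sound") {
        return `[${e.text}]`;
      }
      return `(Description: ${e.text})`;
    })
    .join("\n");
}
```

Given a description at 0 seconds, a sound cue at 3 seconds, and a line of dialogue at 5 seconds, the result is three lines in that order, regardless of the order in which the entries were supplied.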

Tools and Workflows

  • Auto-captioning: YouTube, Vimeo, Rev, and Otter.ai provide automated captions. Always review and correct before publishing.
  • Professional services: Rev, 3Play Media, and Verbit offer human-reviewed captioning and audio description services.
  • DIY tools: Subtitle Edit (free, open source), Aegisub (free), and Descript (commercial) support manual caption creation and timing.
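
When reviewing auto-generated captions, it can help to quantify how far they are from a corrected reference. The sketch below computes word-level accuracy as one minus the word error rate (word edit distance divided by reference length). This is one common way to measure accuracy and is an assumption here; the 99% standard cited earlier may be measured differently by a given vendor. The tokenizer handles only unaccented English text.

```typescript
// Lowercase, drop punctuation, and split into words (English-only assumption).
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9'\s]/g, " ")
    .split(/\s+/)
    .filter((w) => w.length > 0);
}

// Word-level Levenshtein distance: substitutions, insertions, and deletions.
function wordEditDistance(ref: string[], hyp: string[]): number {
  let prev: number[] = [];
  for (let j = 0; j <= hyp.length; j++) prev.push(j);
  for (let i = 1; i <= ref.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[hyp.length];
}

// Accuracy = 1 - (edit distance / reference word count), floored at 0.
function wordAccuracy(reference: string, hypothesis: string): number {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) return hyp.length === 0 ? 1 : 0;
  return Math.max(0, 1 - wordEditDistance(ref, hyp) / ref.length);
}
```

One wrong word in a ten-word sentence gives 90% accuracy, which shows how quickly errors accumulate: at that rate, a viewer relying on captions hits a mistake in almost every sentence.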

Key Takeaways

  • Captions (Level A) and audio descriptions (Level AA) are baseline WCAG requirements, not optional extras.
  • Auto-generated captions require human review: 80-90% accuracy falls well short of the 99% professional standard.
  • Captions must include non-speech audio cues, not just dialogue.
  • Audio descriptions narrate visual information during pauses in dialogue.
  • Full transcripts serve as a universal fallback for all multimedia content.

Sources

Multimedia accessibility requirements referenced from WCAG 2.2 Success Criteria 1.2.1 through 1.2.9. Captioning quality standards adapted from the Described and Captioned Media Program (DCMP) guidelines.