
# Captions and Transcripts

Captions are synchronized text overlays that display spoken dialogue and relevant sounds in video content, while transcripts are complete text documents of audio or video content. Both are essential for deaf and hard-of-hearing users.

In simple terms: Captions are the words you see at the bottom of a video that show what people are saying and what sounds are happening. Transcripts are like a written version of everything said in a video or podcast. Both help people who can't hear understand what's going on.


## What Are Captions and Transcripts?

Captions and transcripts are two complementary methods of making audio and video content accessible to people who are deaf, hard of hearing, or who otherwise benefit from text alternatives to audio content.

**Captions** are synchronized text displays that appear on screen during video playback. They convey not only spoken dialogue but also relevant non-speech audio information, including sound effects (e.g., "[door slams]"), music descriptions (e.g., "[suspenseful music]"), and speaker identification when necessary. Captions are time-coded to appear and disappear in sync with the corresponding audio.

There are two types of captions:

- **Closed captions (CC)**: Can be turned on or off by the viewer. These are the standard for web video and are delivered as separate text tracks (commonly in WebVTT, SRT, or TTML formats).
- **Open captions**: Permanently embedded in the video image and cannot be turned off. These are sometimes used in social media videos where caption controls may not be available.

**Transcripts** are complete text documents that capture all the spoken and relevant non-spoken audio content in a recording. Unlike captions, transcripts are not time-coded and are consumed as a standalone document, typically presented as a text block below or alongside the media player. Transcripts can include speaker labels, descriptions of visual content, and organizational elements like headings and paragraphs.

## Why Do Captions and Transcripts Matter?

Captions and transcripts matter because audio and video have become primary content formats on the web, and without text alternatives, this content is inaccessible to a significant population. Approximately 15% of the world's population experiences some degree of hearing loss, and 430 million people worldwide have disabling hearing loss according to the World Health Organization.

Beyond the deaf and hard-of-hearing community, captions and transcripts benefit a much wider audience:

- **Non-native speakers** use captions to better understand spoken content in a second language.
- **Users in noisy or quiet environments** rely on captions when they cannot listen to audio—in a crowded train, in a library, or in an open office.
- **Users with cognitive or learning disabilities** often comprehend content better when they can both hear and read it simultaneously.
- **Search engines cannot index audio or video content** directly, but they can index caption files and transcripts, significantly improving content discoverability and SEO.

From a legal perspective, captions are required under multiple laws. The ADA applies to video content published by covered entities. Section 508 requires captions for government multimedia. The FCC mandates captions for television content under the Twenty-First Century Communications and Video Accessibility Act (CVAA). WCAG includes multiple success criteria specifically addressing captions and transcripts.

Studies have shown that the majority of video viewers use captions at least some of the time, even among hearing users. Adding captions is not just an accessibility requirement—it is a content strategy that increases engagement, comprehension, and reach.

## How Captions and Transcripts Work

### Caption File Formats

Captions are stored in text files that contain time codes and corresponding text. The most common formats include:

- **WebVTT (.vtt)**: The standard format for HTML5 video. Supports basic styling and positioning.
- **SRT (.srt)**: A simple, widely supported format used by many video platforms.
- **TTML (.ttml/.xml)**: A more complex format that supports advanced styling, used by broadcast and streaming services.

A WebVTT file looks like:

```
WEBVTT

00:00:01.000 --> 00:00:04.000
Welcome to our accessibility training session.

00:00:04.500 --> 00:00:08.000
Today we'll cover web accessibility fundamentals.

00:00:08.500 --> 00:00:12.000
[upbeat background music]
```
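Caption files often start life in SRT and need converting for HTML5 video. As a rough illustration (assuming well-formed input), the two formats differ mainly in the header line and the millisecond separator, so a minimal conversion sketch might look like:

```python
import re

def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT caption text to WebVTT.

    Two changes are needed: WebVTT requires a "WEBVTT" header line,
    and its timestamps use a dot (not a comma) before milliseconds.
    SRT's numeric cue counters are legal WebVTT cue identifiers,
    so they can stay in place.
    """
    # 00:00:01,000 --> 00:00:04,000  becomes  00:00:01.000 --> 00:00:04.000
    body = re.sub(r"(\d{2}:\d{2}:\d{2}),(\d{3})", r"\1.\2", srt_text)
    return "WEBVTT\n\n" + body.strip() + "\n"

srt = """1
00:00:01,000 --> 00:00:04,000
Welcome to our accessibility training session.
"""
print(srt_to_vtt(srt))
```

Real-world files can contain styling tags and multi-line cues, so production conversions should use a dedicated library or tool rather than a regex like this.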

### Caption Quality Standards

High-quality captions must meet several criteria:

- **Accuracy**: Captions must faithfully represent the spoken content with minimal errors. The industry standard target is 99% accuracy.
- **Synchronization**: Captions must appear and disappear within 1-2 seconds of the corresponding audio.
- **Completeness**: All meaningful audio must be captioned, including dialogue, sound effects, and music.
- **Readability**: Captions should stay on screen long enough to be read comfortably, typically no more than 3 lines of about 32 characters each.
- **Speaker identification**: When multiple speakers are present, captions should identify who is speaking using labels or positioning.
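As a rough illustration of the readability guideline, a small helper could wrap cue text to the limits quoted above (the 32-character and 3-line figures are this article's examples; real style guides vary):

```python
import textwrap

# Illustrative limits taken from the readability guideline above.
MAX_CHARS_PER_LINE = 32
MAX_LINES = 3

def format_cue_text(text: str) -> str:
    """Wrap caption text to the line limits, or reject cues that
    are too long to display comfortably in a single cue."""
    lines = textwrap.wrap(text, width=MAX_CHARS_PER_LINE)
    if len(lines) > MAX_LINES:
        raise ValueError(
            f"Cue needs {len(lines)} lines; split it into multiple cues."
        )
    return "\n".join(lines)

print(format_cue_text("Today we'll cover web accessibility fundamentals."))
```

A cue that exceeds the limit is better split into two shorter, sequential cues than crammed onto extra lines the viewer cannot read in time.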

### Creating Captions

Captions can be created through several methods:

- **Professional captioning services**: Human captioners create accurate, time-coded captions. This is the gold standard.
- **Auto-captioning with human review**: Services like YouTube, Rev, or Otter.ai generate auto-captions that are then reviewed and corrected by a human editor.
- **DIY captioning tools**: Tools like Amara, Kapwing, and YouTube's caption editor allow content creators to add and edit captions manually.

### Live Captioning

Live events, webinars, and streams require real-time captioning provided by:

- **CART (Communication Access Realtime Translation)**: A trained stenographer produces real-time captions at speeds exceeding 200 words per minute with high accuracy.
- **Automated speech recognition (ASR)**: AI-powered live captioning is improving rapidly but still produces errors, particularly with specialized terminology, accents, and multiple speakers.

### Transcript Best Practices

Effective transcripts should include:

- Speaker identification for multi-speaker content
- Descriptions of relevant non-speech audio
- Headings and structure for long content
- Timestamps at regular intervals for reference
- Links to the corresponding media

Transcripts offer accessibility advantages that captions do not: they are searchable, can be consumed at the reader's own pace, work with braille displays, and can be translated more easily into other languages.
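The practices above can be sketched in code. A hypothetical helper that assembles a plain-text transcript from timed cues, adding speaker labels and periodic timestamp markers, might look like this (the cue structure and 60-second interval are assumptions for the example):

```python
# Each cue is (start_time_in_seconds, speaker_label_or_None, text).
def build_transcript(cues, timestamp_every=60):
    """Build a plain-text transcript with a [MM:SS] marker roughly
    every `timestamp_every` seconds and per-line speaker labels."""
    lines = []
    next_stamp = 0
    for start, speaker, text in cues:
        if start >= next_stamp:
            minutes, seconds = divmod(int(start), 60)
            lines.append(f"[{minutes:02d}:{seconds:02d}]")
            next_stamp = start + timestamp_every
        lines.append(f"{speaker}: {text}" if speaker else text)
    return "\n".join(lines)

cues = [
    (1, "Instructor", "Welcome to our accessibility training session."),
    (4, "Instructor", "Today we'll cover web accessibility fundamentals."),
]
print(build_transcript(cues))
```

Because the output is ordinary text, it inherits the advantages described above: it can be searched, read at any pace, rendered on a braille display, or fed to a translation workflow.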

## Examples

**Example 1: E-Learning Video** A training platform publishes instructional videos with closed captions in WebVTT format. Each caption line includes speaker identification ("[Instructor]"), descriptions of on-screen demonstrations ("[clicks Settings menu]"), and relevant sound cues. A full transcript is provided below the video player with headings corresponding to each lesson section, making it easy to find specific topics.

**Example 2: Podcast Accessibility** A company publishes a weekly podcast on its website. Since this is audio-only content, WCAG 1.2.1 requires a text alternative. The company provides a full transcript for each episode, formatted with speaker labels, timestamps every few minutes, and paragraph breaks for readability. The transcript is published on the same page as the audio player.

**Example 3: Live Webinar** A professional association hosts a live webinar for its members. A CART provider delivers real-time captions that appear in a caption window within the webinar platform. After the event, the CART output is cleaned up and converted into both a polished caption file for the recorded video and a downloadable transcript for attendees who want to review the content.

**Example 4: Social Media Video** A nonprofit creates a 60-second advocacy video for social media. Because many social platforms autoplay video without sound and may not display caption tracks reliably, the team chooses open captions—text permanently burned into the video—ensuring that the message is accessible regardless of platform limitations or user settings.

## Key Takeaways

- **Captions are synchronized text** that convey dialogue, sound effects, and speaker identification in video content; **transcripts are standalone text documents** of the full audio content.
- **Both are legally required** under the ADA, Section 508, WCAG, and the CVAA, with specific requirements varying by content type and context.
- **Auto-generated captions are not sufficient** for compliance—they must be reviewed and corrected to achieve the 99% accuracy standard.
- **Captions benefit far more than deaf users**—non-native speakers, users in sound-sensitive environments, and people with cognitive disabilities all benefit.
- **Transcripts provide unique advantages** including searchability, compatibility with braille displays, and the ability to be consumed at the reader's own pace.
- **Live captioning requires CART or high-quality ASR** to provide real-time access for live events and streams.
- **Providing both captions and transcripts** is the most inclusive approach and maximizes accessibility, SEO, and content reach.

## Frequently Asked Questions

### What is the difference between captions and subtitles?

Captions include all audio information—dialogue, sound effects, music descriptions, and speaker identification—and are designed for deaf and hard-of-hearing viewers. Subtitles typically include only dialogue and are designed for viewers who can hear the audio but need a text translation. Captions are an accessibility feature; subtitles are a translation feature.

### Are auto-generated captions sufficient for accessibility compliance?

Generally no. Auto-generated captions from services like YouTube's automatic captioning typically have error rates of 10-30%, which is too high for accessibility compliance. WCAG requires captions to be accurate, and the Department of Justice has stated that auto-captions alone may not satisfy ADA requirements. Auto-captions can serve as a starting point but must be reviewed and corrected by humans.

### When are transcripts required vs. captions?

For prerecorded video with audio, WCAG Level A requires either captions or a transcript. For prerecorded audio-only content (like podcasts), a transcript is required at Level A. For live video with audio, real-time captions are required at Level AA. Providing both captions and transcripts is the most inclusive approach.

Need help making your website ADA compliant?

Our team specializes in ADA-compliant web design and remediation. Get a free accessibility audit today.

Last updated: 2026-03-15