Live captioning

Accessibility requirements for live captioning

Guideline 1.2 of WCAG 2.0 relates to the provision of alternatives for all time-based media. The success criteria that relate to live content are:

  • Success Criterion 1.2.4: Captions are provided for all live audio content in synchronized media (Level AA)
  • Success Criterion 1.2.9: An alternative for time-based media that presents equivalent information for live audio-only content is provided (Level AAA)

To deliver real-time captions through the internet:

  1. Captions must be created for all audio information in real-time (live captioning); and
  2. Captions must be delivered to the end user so that they are synchronised with the audio (a 3–5 second delay is acceptable when live captions are provided).
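The delivery requirement above can be sketched in code. This is a minimal illustration, not a real captioning pipeline — all names, timings and the cue structure are invented — showing a check that a delivered caption falls within the commonly cited 3–5 second window:

```python
from dataclasses import dataclass

# Hypothetical sketch: all names and figures below are illustrative.
MAX_ACCEPTABLE_DELAY = 5.0  # seconds behind the live audio

@dataclass
class CaptionCue:
    text: str
    audio_time: float      # when the words were spoken, seconds into the stream
    delivery_time: float   # when the caption reached the end user

    @property
    def delay(self) -> float:
        return self.delivery_time - self.audio_time

def within_sync_window(cue: CaptionCue, max_delay: float = MAX_ACCEPTABLE_DELAY) -> bool:
    """True if the caption is acceptably synchronised with the audio."""
    return 0.0 <= cue.delay <= max_delay

cue = CaptionCue(text="Good evening, and welcome.", audio_time=12.0, delivery_time=16.2)
print(within_sync_window(cue))  # True: roughly 4.2 seconds behind, inside the window
```

A real system would measure the end-to-end delay across the captioner, encoder and streaming stack rather than on a single cue, but the acceptance test is the same.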

Creating live captions

There are currently two methods used to create live captions – stenocaptioning and captioning by an operator using voice recognition software (often called voice captioning). Both methods are used extensively on Australian television, with voice captioning considered more appropriate for news and sports programs.

With both methods, there is an inevitable delay between the dialogue being spoken and the captions appearing on screen – 3-5 seconds is usually considered acceptable. Captions also appear one word at a time, unlike pre-prepared captions which appear one or two lines at a time. Text files of the captions can be edited after the event.

Live captioning of speeches, conferences, school lessons or other events is known as CART (Communication Access Realtime Translation). It is often performed remotely, with the captioner connected via phone or the Internet.

A certain error rate is considered inevitable with all live captioning, but a figure of 95%–98% accuracy is commonly used as a benchmark of acceptable quality (which is what is usually expected of court reporters).


Stenocaptioning

Stenocaptioners are highly trained individuals who use a stenotype machine with a phonetic keyboard (as used by court reporters) to create captions. The quality of these captions depends on the skill level of the stenocaptioner and the time they have been given to prepare for their task. Part of this preparation involves entering into their software’s ‘dictionary’ any unusual names or words which are likely to appear in the dialogue they will be captioning, so that these will be spelled correctly in the captions as they appear. They can also create ‘shortforms’ of certain phrases which are likely to be used, so several words can be produced with a couple of keystrokes.

Stenocaptioners create captions by pressing combinations of keys on their stenotype machine to create words and phrases. ‘Mistranslates’, where the software does not recognise a particular word, typically appear in the caption as a string of random letters and other characters. Too many of these will render the captions incomprehensible to the viewer.

Voice captioning

Voice captions are created by a captioner using commercially available speech recognition programs such as Dragon NaturallySpeaking. While these programs are not yet capable of taking a direct audio feed of speech and turning it into captions of acceptable accuracy, they can be used with a captioner re-speaking the dialogue into a microphone in real time, in a clear and steady manner.

Voice captioning has been used on Australian television since 2005. It is considered particularly appropriate for programs where there is generally one person speaking at a time. Stenocaptioning is considered more appropriate for programs likely to have people speaking over the top of each other.

Voice captioner training involves the captioner using the software for 3–4 weeks, reading into it from books or magazine articles and then correcting the text produced. The software becomes more attuned to the captioner’s voice the longer they practise with it.

Like stenocaptioners, voice captioners build up ‘dictionaries’ and create shortforms of commonly used phrases. Unlike stenocaptioning, where a ‘mistranslate’ of a word will appear as random letters, voice recognition software always tries to represent a sound as a real word, even if it’s an incorrect one. (The viewer can often guess from this what the correct word is.)

If a voice captioner is properly trained, has had the necessary time to prepare, and the software is working correctly, the captions they produce are comparable to high-quality stenocaptioning.

Delivery of live captions online

Once the captions have been created, they must be delivered to the end user so that they are as closely synchronised with the audio content as possible. Unfortunately, very few real-time multimedia technologies have native support for captioning. Thus, synchronising captions with online content requires a custom-built solution that uses technology or hardware that runs in parallel with the multimedia software (e.g. live streaming through Adobe Flash Player).

For streaming media, captions must be created and then converted into a format for online delivery (e.g. Timed Text DFXP, SubViewer, SubRip or SAMI). The formatted captions must then be streamed with the audio content, ensuring that the delay between the audio content and the captions is as short as possible.
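As a rough illustration of that conversion step, the sketch below formats timed cues as SubRip (.srt) text — just one of the formats listed above, and the cue data here is invented:

```python
# Illustrative sketch: converting caption cues into SubRip (SRT) text.
# Cue contents and timings are invented for the example.

def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp SubRip expects."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(cues):
    """cues: list of (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for i, (start, end, text) in enumerate(cues, start=1):
        blocks.append(f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}")
    return "\n\n".join(blocks) + "\n"

cues = [(0.0, 2.5, "Good evening."), (2.5, 6.0, "Here is tonight's news.")]
print(to_srt(cues))
```

In a live scenario the cues would be emitted incrementally as the captioner produces them, rather than written out as a complete file after the event.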

It is also important that the embedded media player is accessible. In addition to standard captioning support, the media player must also be accessible for people with vision and mobility impairments (e.g. keyboard navigation to play and control the video).

Australian access companies that provide live captioning

The following Australian companies provide live captioning services:
