Live Captioning IP Workflows

IP is an enabling technology that allows broadcasters to integrate software solutions into their workflows. However, the plethora of SDI and IP workflows that are now coexisting in broadcast infrastructures means that technologists and engineers must look for new methods when designing their facilities, especially when considering the real-time nature of live captioning.

Subtitles and captioning not only provide a valuable service for the hard of hearing but also improve the viewing experience in loud environments where the audio may not be easily heard. Consequently, the audio-to-text conversion must be of very high quality, with the lowest possible delay between the spoken word and its on-screen display.

Although captioning may not take up much bandwidth in relation to video, its impact on the viewer can be huge and so must be considered at the very beginning of a broadcast infrastructure design or modification. Furthermore, the large number of variations of languages, fonts, and dialects requires technologists and engineers to think deeply about how they integrate live captioning into SDI and IP workflows.

Legacy Broadcasting

For live captioning to be of any value to the viewer it must be an accurate representation of the spoken audio and stay synchronized with the people speaking. For pre-recorded material this is relatively easy to achieve, as the captioning text can be edited prior to transmission, but live broadcasts prove more challenging.

Traditionally, a stenographer has manually typed the captioning text as the presenters and guests speak. Although this can provide an accurate representation of the spoken word, it requires highly skilled, and often multilingual, operators. And however fast the stenographer may be, there is often a noticeable delay between the spoken word and the captioning text appearing on screen.

SDI workflows have generally carried the captioning information by encoding it in the ancillary data space. This maintains frame-accurate timing and ensures the correct captioning text accompanies the program being broadcast. However, as we move to an OTT world, this method of operation proves inefficient for transmitting multiple languages across multiple platforms.

Translating Audio to Text

The stenography method has served broadcasters well for many years, but new methods of operation are now available using artificial intelligence to provide automated speech recognition. High-quality speech-to-text processing not only provides accurate representations of speech for the viewer, it also creates the captioning with incredibly low latency.

Google Assistant, Apple’s Siri and Amazon’s Alexa all demonstrate how speech-to-text is highly effective and fast. Neural networks apply time-series analysis to convert the speech into text, and separate models can be trained for different languages. This means the speech-to-text converter can both understand many languages and provide multi-language text. Accuracy is further enhanced because the neural network models learn the context of words and sentences, allowing them to transcribe words correctly even when the audio quality may not be particularly reliable.

Diagram 1 – One example of an audio-to-text converter using neural network machine learning. The Convolutional Neural Network (CNN) extracts the features from the raw audio, the Transformer determines the context of the audio and converts it to tokens, and the Connectionist Temporal Classification (CTC) stage provides the text output.
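The final CTC stage in Diagram 1 can be illustrated with a minimal sketch. In a real system, per-frame tokens would come from the trained network's output; here a tiny hand-made frame sequence stands in for them, and the token set and blank symbol are illustrative assumptions.

```python
BLANK = "_"  # the CTC blank token (illustrative choice of symbol)

def ctc_greedy_decode(frame_tokens):
    """Standard CTC collapse rule: merge repeated tokens, then drop blanks."""
    decoded = []
    prev = None
    for tok in frame_tokens:
        # Only emit a token when it differs from the previous frame
        # and is not the blank separator.
        if tok != prev and tok != BLANK:
            decoded.append(tok)
        prev = tok
    return "".join(decoded)

# Hypothetical per-frame argmax output for the word "news":
frames = ["n", "n", BLANK, "e", "e", BLANK, "w", BLANK, BLANK, "s", "s"]
print(ctc_greedy_decode(frames))  # -> "news"
```

The blank token is what lets CTC distinguish a genuinely repeated letter (separated by a blank) from the same letter held across several audio frames.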


For AI and ML to operate efficiently, the captioning text and the video and audio media must be separated from one another. This is nothing new: SMPTE's ST 2110 suite of standards abstracts the video, audio, and metadata essences away from the underlying transport stream, keeping the timing plane separate, and this has opened the door to generating captioning text as a separate file.

OTT Workflows

OTT and internet streaming involve many different streaming formats, required by a whole host of platforms and web browsers. Consequently, no single workflow can be assumed to operate across every streaming platform.

One of the challenges for traditional broadcast and OTT workflows is how the captions are communicated to the viewer. DVB sends bitmaps as part of the transport stream rather than the actual text. This has several advantages: there are no restrictions on the supported languages and special characters; text positioning is specified by the broadcaster, so the decoder does not have to make any decisions about placement; and broadcasters can create their own customized fonts.

However, DVB bitmaps are limited in scope, as devices that rely on HTTP adaptive streaming protocols, such as DASH and HLS, have little support for them. Also, some countries, such as the US, mandate that viewers be allowed to customize the visual appearance of subtitles, a requirement DVB bitmaps do not always meet.

The multiple distribution platforms for both linear and OTT have led to many file formats being developed and used. Consequently, live captioning software should be able to create the different formats needed to satisfy each platform-dependent system.
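As a sketch of what "creating the different formats" involves, the same set of caption cues can be serialized to two widely used text formats, SRT and WebVTT, from one internal representation. The cue structure and function names here are illustrative, not taken from any particular captioning product.

```python
def _ts(ms, sep):
    """Format milliseconds as HH:MM:SS<sep>mmm (SRT uses ',', WebVTT uses '.')."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms_part = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d}{sep}{ms_part:03d}"

def to_srt(cues):
    """Serialize (start_ms, end_ms, text) cues as numbered SRT blocks."""
    blocks = []
    for i, (start, end, text) in enumerate(cues, 1):
        blocks.append(f"{i}\n{_ts(start, ',')} --> {_ts(end, ',')}\n{text}\n")
    return "\n".join(blocks)

def to_webvtt(cues):
    """Serialize the same cues as a WebVTT file (header plus unnumbered cues)."""
    blocks = ["WEBVTT\n"]
    for start, end, text in cues:
        blocks.append(f"{_ts(start, '.')} --> {_ts(end, '.')}\n{text}\n")
    return "\n".join(blocks)

cues = [(0, 2500, "Good evening."), (2500, 5000, "Here is the news.")]
print(to_srt(cues))
print(to_webvtt(cues))
```

Keeping cues in a neutral internal form and serializing on demand means adding another delivery format is a new writer function, not a new captioning pipeline.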

Delivery of live captions within SDI is well established using caption insertion into the ancillary data, but it still requires specialist caption embedders and de-embedders. For IP, distribution is technically easier, as files can be transferred over IP networks, but broadcasters must keep track of those files so that the correct subtitles are streamed alongside their associated media streams.
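That bookkeeping can be as simple as a registry pairing each media stream with its caption sources per language. This is a minimal sketch; the stream identifiers, paths, and fallback policy are all assumptions for illustration.

```python
# Registry mapping (stream_id, language) -> caption file or URL.
captions = {}

def register(stream_id, language, caption_url):
    """Record the caption source for one media stream and language."""
    captions[(stream_id, language)] = caption_url

def lookup(stream_id, language, fallback="en"):
    """Find the caption source for a stream, falling back to a default
    language when the requested one has no track."""
    return captions.get((stream_id, language)) or captions.get((stream_id, fallback))

register("evening-news", "en", "captions/evening-news.en.vtt")
register("evening-news", "es", "captions/evening-news.es.vtt")

print(lookup("evening-news", "es"))  # -> captions/evening-news.es.vtt
print(lookup("evening-news", "fr"))  # no French track: falls back to English
```

In a production playout system this mapping would live in the media asset management layer, but the principle is the same: the association between essence and captions must be explicit, because IP transport itself does not bind them together the way SDI ancillary data does.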


To deliver the caliber of live captioning that viewers demand, broadcasters must design workflows that work across multiple platforms for both SDI and IP. The new breed of AI audio-to-text conversion software greatly simplifies the operation, as the software can be integrated into workflows more easily than the traditional methods, especially when multi-language applications are considered. Furthermore, IP networks allow for easier integration of live captioning files, as they do not rely on the specialized embedders and de-embedders found in the SDI domain. And working with a live captioning specialist vendor will help smooth the integration into both existing and new workflows.
