Live Captioning IP Workflows

IP is an enabling technology that allows broadcasters to integrate software solutions into their workflows. However, the plethora of SDI and IP workflows that are now coexisting in broadcast infrastructures means that technologists and engineers must look for new methods when designing their facilities, especially when considering the real-time nature of live captioning.

Subtitles and captioning not only provide a valuable service for the hard of hearing but also improve the viewing experience in loud environments where the audio may not be easily heard. Consequently, the audio-to-text conversion must be of very high quality, with the shortest possible delay between the spoken word and the text appearing on screen.

Although captioning may not take up much bandwidth in relation to video, its impact on the viewer can be huge and so must be considered at the very beginning of a broadcast infrastructure design or modification. Furthermore, the large number of variations of languages, fonts, and dialects requires technologists and engineers to think deeply about how they integrate live captioning into SDI and IP workflows.

Legacy Broadcasting

For live captioning to be of any value to the viewer it must be an accurate representation of the spoken word and be aligned with the people speaking. For pre-recorded material this can be relatively easy to achieve as captioning text can be edited prior to transmission, but for live broadcasts it proves more challenging.

Traditionally, a stenograph operator has manually typed the captioning text as the presenters and guests are speaking. Although this can provide an accurate representation of the spoken word, it requires multilingual and highly skilled stenograph operators. Even when the stenograph operator is fast, there is often a noticeable delay between the spoken word and the captioning text appearing on the screen.

SDI workflows have generally provided the captioning information by encoding it in the ancillary data space. This maintained frame accurate timing as well as making sure the correct captioning text was provided for the program being broadcast. However, as we move to an OTT world, this method of operation proves inefficient for transmitting multiple languages across multiple platforms.

Translating Audio to Text

The stenograph method has served broadcasters well for many years, but new methods of operation are now available using artificial intelligence to provide automated speech recognition. The high-quality speech to text processing not only provides accurate representations of speech for the viewer, but it also creates the captioning with incredibly low latency.

Google Assistant, Apple’s Siri and Amazon’s Alexa all demonstrate how effective and fast speech-to-text can be. Methods of time series analysis are applied using neural networks to convert the speech into text, and multiple models can be trained using different languages. This means the speech-to-text converter can both understand many languages and provide multi-language text. The accuracy of the conversion is enhanced because the neural network models can learn the context of the words and sentences, which allows them to transcribe words accurately even when the audio quality is not particularly reliable.

Diagram 1 – One example of an audio to text converter using neural network machine learning. The Convolutional Neural Network (CNN) extracts the features from the raw audio, the Transformer determines the context of the audio and converts it to tokens, and the Connectionist Temporal Classification (CTC) stage provides the text translation.

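The final CTC stage in Diagram 1 can be illustrated with a minimal sketch of greedy CTC decoding. It assumes the earlier stages have already produced one score per symbol for every audio frame. The function name `ctc_greedy_decode`, the toy alphabet, and the `_` blank symbol are hypothetical choices made for illustration and are not taken from any particular captioning product.

```python
from itertools import groupby

BLANK = "_"  # hypothetical symbol standing in for the CTC "blank" token


def ctc_greedy_decode(frame_scores, alphabet, blank=BLANK):
    """Greedy CTC decoding.

    frame_scores: one list of scores per audio frame, one score per symbol
    alphabet:     the symbols, in the same order as the scores
    """
    # 1. Pick the highest scoring symbol for every frame.
    best_path = [
        alphabet[max(range(len(scores)), key=scores.__getitem__)]
        for scores in frame_scores
    ]
    # 2. Collapse runs of the same symbol into a single symbol.
    collapsed = [symbol for symbol, _ in groupby(best_path)]
    # 3. Drop the blanks, which only exist to separate genuine repeats.
    return "".join(symbol for symbol in collapsed if symbol != blank)


if __name__ == "__main__":
    alphabet = [BLANK, "c", "a", "t"]
    # Toy scores whose best path is: c c _ a a _ t
    best = [1, 1, 0, 2, 2, 0, 3]
    frames = [[1.0 if i == b else 0.0 for i in range(len(alphabet))] for b in best]
    print(ctc_greedy_decode(frames, alphabet))  # prints "cat"
```

Greedy decoding is the simplest option. Production recognizers often use a beam search, sometimes combined with a language model, to exploit the context described above.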

For AI and ML to operate efficiently, the captioning text must be separated from the video and audio media. This is nothing new, as SMPTE’s ST 2110 suite of standards has abstracted the video, audio, and metadata away from the underlying transport stream, keeping the timing plane separate. This has opened the door for generating captioning text as a separate file.

OTT Workflows

OTT and internet streaming have many different streaming formats that are required by a whole host of platforms and web browsers. Consequently, one workflow cannot be assumed to operate across every streaming platform.

One of the challenges for traditional broadcast and OTT workflows is how the captions are communicated to the viewer. DVB sends bitmaps as part of the transport stream and not the actual text. This has several advantages: there are no restrictions on the supported languages and special characters, text positioning is specified by the broadcaster so the decoder doesn’t have to make any decisions about placement, and broadcasters can create their own customized fonts.

However, DVB bitmaps have limited reach, as devices that require HTTP adaptive streaming protocols, e.g., DASH and HLS, have limited support for them. Also, some countries, such as the US, mandate that users should be allowed to customize the visual appearance of the subtitles, and the DVB bitmaps do not always meet this requirement.

The multiple distribution platforms for both linear and OTT have led to many file formats being developed and used. Consequently, when creating live captioning, the software should be able to create the different formats needed to satisfy the platform dependent systems.
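As a rough sketch of what creating different formats from one caption source can look like, the example below renders the same timed cues as WebVTT and as SubRip (SRT). These two text formats are chosen here purely for illustration, since the article does not name specific formats. The `Cue` class and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """One caption: start and end time in milliseconds plus the text."""
    start_ms: int
    end_ms: int
    text: str


def _timestamp(ms, separator):
    # Format milliseconds as HH:MM:SS<separator>mmm.
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}{separator}{millis:03d}"


def to_webvtt(cues):
    # WebVTT starts with a "WEBVTT" header and uses "." before the milliseconds.
    blocks = ["WEBVTT"]
    for cue in cues:
        start = _timestamp(cue.start_ms, ".")
        end = _timestamp(cue.end_ms, ".")
        blocks.append(f"{start} --> {end}\n{cue.text}")
    return "\n\n".join(blocks) + "\n"


def to_srt(cues):
    # SRT numbers each cue from 1 and uses "," before the milliseconds.
    blocks = []
    for index, cue in enumerate(cues, start=1):
        start = _timestamp(cue.start_ms, ",")
        end = _timestamp(cue.end_ms, ",")
        blocks.append(f"{index}\n{start} --> {end}\n{cue.text}")
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    cues = [Cue(1_500, 4_000, "Good evening."), Cue(4_200, 6_000, "Here is the news.")]
    print(to_webvtt(cues))
    print(to_srt(cues))
```

Keeping the cues in one neutral structure and rendering each platform format on demand means a new target format only needs a new rendering function.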

Delivery of live captions within SDI is well established using caption-insertion into the ancillary data but still requires specialist caption embedders and de-embedders. However, for IP, the distribution systems are technically easier as files can be transferred over IP networks, but broadcasters must keep track of the files so that the correct subtitles are streamed alongside their associated media streams.
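One way the association between subtitles and their media can be expressed, taking HLS as an example, is a master playlist in which `EXT-X-MEDIA` entries of type `SUBTITLES` share a group that the video variant references. The sketch below builds such a playlist as text. The function name, the URIs, and the bandwidth figure are hypothetical, and a real deployment would carry more attributes and variants.

```python
def hls_master_playlist(video_uri, bandwidth, subtitle_tracks, group_id="subs"):
    """Build a minimal HLS master playlist tying subtitle tracks to one video variant.

    subtitle_tracks: list of (language_code, display_name, playlist_uri) tuples
    """
    lines = ["#EXTM3U"]
    for position, (language, name, uri) in enumerate(subtitle_tracks):
        # Mark only the first track as the default choice.
        default = "YES" if position == 0 else "NO"
        lines.append(
            f'#EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="{group_id}",NAME="{name}",'
            f'LANGUAGE="{language}",DEFAULT={default},AUTOSELECT=YES,URI="{uri}"'
        )
    # The variant refers to the subtitle group by its GROUP-ID.
    lines.append(f'#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},SUBTITLES="{group_id}"')
    lines.append(video_uri)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical URIs for illustration only.
    tracks = [
        ("en", "English", "captions/en.m3u8"),
        ("fr", "Français", "captions/fr.m3u8"),
    ]
    print(hls_master_playlist("video/main.m3u8", 5_000_000, tracks))
```

Because the player resolves the subtitle group from the same playlist that lists the video, the tracking burden moves from the operator to the manifest generator.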


Delivering the caliber of live captioning that viewers demand requires broadcasters to design workflows that work across multiple platforms for both SDI and IP. The new breed of AI audio to text conversion software greatly simplifies the operation, as the software can be integrated into the workflows more easily than the traditional methods, especially when multi-language applications are considered. Furthermore, IP networks allow for easier integration of live captioning files as they do not rely on the specialized embedders and de-embedders found in the SDI domain. Working with a live captioning specialist vendor will also help smooth the integration into both existing and new workflows.
