Live Captioning IP Workflows

IP is an enabling technology that allows broadcasters to integrate software solutions into their workflows. However, the plethora of SDI and IP workflows that are now coexisting in broadcast infrastructures means that technologists and engineers must look for new methods when designing their facilities, especially when considering the real-time nature of live captioning.

Subtitles and captioning not only provide a valuable service for the hard of hearing but also improve the viewing experience in loud environments where the audio may not be easily heard. Consequently, the audio-to-text conversion must be highly accurate, with the shortest possible delay between the words being spoken and the captions appearing on the display.

Although captioning may not take up much bandwidth in relation to video, its impact on the viewer can be huge and so must be considered at the very beginning of a broadcast infrastructure design or modification. Furthermore, the large number of variations of languages, fonts, and dialects requires technologists and engineers to think deeply about how they integrate live captioning into SDI and IP workflows.

Legacy Broadcasting

For live captioning to be of any value to the viewer it must be an accurate representation of the spoken audio and be synchronized to the people speaking. For pre-recorded material this is relatively easy to achieve as the captioning text can be edited prior to transmission, but live broadcasts prove much more challenging.

Traditionally, a stenograph operator has manually typed the captioning text as the presenters and guests speak. Although this can provide an accurate representation of the spoken word, it requires multilingual and highly skilled stenograph operators. And however fast the operator may be, there is often a noticeable delay between the spoken word and the captioning text appearing on the screen.

SDI workflows have generally provided the captioning information by encoding it in the ancillary data space. This maintains frame-accurate timing as well as ensuring the correct captioning text is provided for the program being broadcast. However, as we move to an OTT world, this method of operation proves inefficient for transmitting multiple languages across multiple platforms.
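As a simplified sketch of this method, the Python below wraps a CEA-708 caption payload into an ancillary data packet of the general shape defined by SMPTE 291 and SMPTE 334-1. It is illustrative only: real ANC words are 10-bit values carrying parity bits, and the payload bytes are placeholders.

    # Simplified sketch: wrap a CEA-708 caption payload (a CDP) into an
    # SDI ancillary data packet. Real ANC words are 10-bit with parity;
    # plain 8-bit values are used here for clarity.
    ADF = [0x000, 0x3FF, 0x3FF]   # Ancillary Data Flag marking packet start
    DID_CAPTIONS = 0x61           # SMPTE 334-1 DID for caption data
    SDID_CEA708 = 0x01            # SDID 0x01 = CEA-708 CDP, 0x02 = CEA-608

    def build_anc_packet(cdp_payload: bytes) -> list[int]:
        words = [DID_CAPTIONS, SDID_CEA708, len(cdp_payload), *cdp_payload]
        checksum = sum(words) & 0x1FF          # 9-bit checksum over DID..UDW
        return ADF + words + [checksum]

    # Inserted into the VANC space of each frame, the packet keeps the
    # caption text frame-accurately tied to the program video and audio.
    packet = build_anc_packet(bytes([0x96, 0x69]))  # 0x9669 = CDP identifier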

Translating Audio to Text

The stenograph method has served broadcasters well for many years, but new methods of operation are now available that use artificial intelligence to provide automated speech recognition. High-quality speech-to-text processing not only provides accurate representations of speech for the viewer, it also creates the captioning with incredibly low latency.

Google Assistant, Apple’s Siri and Amazon’s Alexa all demonstrate how speech-to-text is highly effective and fast. Neural networks apply time-series analysis to convert the speech into text, and multiple models can be trained on different languages. This means the speech-to-text converter can both understand many languages and provide multi-language text. The accuracy of the conversion is enhanced because the neural network models learn the context of words and sentences, allowing them to transcribe words accurately even when the audio quality is not particularly reliable.
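As a rough illustration of how accessible this technology has become, a pretrained speech-to-text model can be run in a few lines of Python using the Hugging Face transformers library. The model named below is just one example of a publicly available CTC-based English model, not a broadcast-grade engine, and the audio filename is a placeholder.

    # Minimal automated speech recognition sketch using the Hugging Face
    # `transformers` pipeline. A production live captioning system would
    # use a streaming, multi-language engine instead.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition",
                   model="facebook/wav2vec2-base-960h")

    # Transcribe a short clip of program audio (16 kHz mono WAV assumed).
    result = asr("program_audio.wav")
    print(result["text"])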

Diagram 1 – One example of an audio to text converter using neural network machine learning. The Convolutional Neural Network (CNN) extracts the features from the raw audio, the Transformer determines the context of the audio and converts it to tokens, and the Connectionist Temporal Classification (CTC) stage provides the text translation.

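The pipeline in Diagram 1 can be expressed as a short skeleton in PyTorch. This is illustrative only: the layer sizes and vocabulary are arbitrary, and a real model would add normalization, positional encoding, and a trained tokenizer.

    import torch
    import torch.nn as nn

    class SpeechToText(nn.Module):
        """Skeleton of the Diagram 1 pipeline: CNN feature extraction,
        Transformer context modelling, CTC-style output."""
        def __init__(self, vocab_size: int = 32, d_model: int = 256):
            super().__init__()
            # CNN: extract features from the raw audio samples
            self.cnn = nn.Sequential(
                nn.Conv1d(1, d_model, kernel_size=10, stride=5), nn.GELU(),
                nn.Conv1d(d_model, d_model, kernel_size=3, stride=2), nn.GELU(),
            )
            # Transformer: determine the context of the audio frames
            layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=4)
            # Linear head producing per-frame token scores for CTC decoding
            self.head = nn.Linear(d_model, vocab_size)

        def forward(self, audio: torch.Tensor) -> torch.Tensor:
            # audio: (batch, samples) -> (batch, frames, vocab) log-probs
            x = self.cnn(audio.unsqueeze(1)).transpose(1, 2)
            x = self.encoder(x)
            return self.head(x).log_softmax(dim=-1)

    # Training pairs these log-probs with nn.CTCLoss, which aligns the
    # per-frame tokens to the target caption text without timing labels.
    model = SpeechToText()
    log_probs = model(torch.randn(2, 16000))   # one second of 16 kHz audio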

For AI and ML to operate efficiently, the captioning text must be separated from the video and audio media. This is nothing new, as SMPTE’s ST 2110 suite of standards has abstracted the video, audio, and metadata away from the underlying transport stream, keeping the timing plane separate, and this has opened the door to generating captioning text as a separate file.

OTT Workflows

OTT and internet streaming encompass many different streaming formats, required by a whole host of platforms and web browsers. Consequently, no single workflow can be assumed to operate across every streaming platform.

One of the challenges for traditional broadcast and OTT workflows is how the captions are communicated to the viewer. DVB sends bitmaps as part of the transport stream, not the actual text. This has several advantages: there are no restrictions on the supported languages and special characters; text positioning is specified by the broadcaster, so the decoder doesn’t have to make any decisions about placement; and broadcasters can create their own customized fonts.

However, DVB bitmaps have a confined reach, as devices that rely on HTTP adaptive streaming protocols, e.g., DASH and HLS, have limited support for them. Also, some countries, such as the US, mandate that users must be allowed to customize the visual appearance of the subtitles, a requirement the DVB bitmaps do not always meet.
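Text-based subtitle delivery addresses both limitations. As an illustration, the fragment below shows how an HLS multivariant playlist can declare a WebVTT subtitle rendition alongside the video, letting the player render and restyle the text client-side; the URIs and bitrate are placeholders.

    #EXTM3U
    # Declare a WebVTT subtitle rendition the player can style client-side
    #EXT-X-MEDIA:TYPE=SUBTITLES,GROUP-ID="subs",NAME="English",LANGUAGE="en",AUTOSELECT=YES,DEFAULT=YES,URI="subs/en/playlist.m3u8"
    # Video variant referencing the subtitle group
    #EXT-X-STREAM-INF:BANDWIDTH=2500000,SUBTITLES="subs"
    video/1080p/playlist.m3u8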

The multiple distribution platforms for both linear and OTT have led to many caption file formats being developed and used. Consequently, when creating live captioning, the software should be able to create the different formats needed to satisfy the platform-dependent systems.
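As a simple sketch of this idea, the same caption cues can be rendered to more than one delivery format; the cue text and timings below are made up for illustration.

    # Illustrative sketch: render the same live caption cues to two common
    # text formats, WebVTT (used by HLS/DASH) and SRT.
    def fmt_time(seconds: float, sep: str) -> str:
        h, rem = divmod(seconds, 3600)
        m, s = divmod(rem, 60)
        ms = int(seconds * 1000) % 1000
        return f"{int(h):02d}:{int(m):02d}:{int(s):02d}{sep}{ms:03d}"

    def to_webvtt(cues) -> str:
        lines = ["WEBVTT", ""]
        for start, end, text in cues:
            lines += [f"{fmt_time(start, '.')} --> {fmt_time(end, '.')}", text, ""]
        return "\n".join(lines)

    def to_srt(cues) -> str:
        lines = []
        for i, (start, end, text) in enumerate(cues, 1):
            lines += [str(i), f"{fmt_time(start, ',')} --> {fmt_time(end, ',')}", text, ""]
        return "\n".join(lines)

    cues = [(1.0, 3.5, "Good evening and welcome."),
            (3.6, 6.0, "Tonight's top story...")]
    print(to_webvtt(cues))
    print(to_srt(cues))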

Delivery of live captions within SDI is well established using caption insertion into the ancillary data, but it still requires specialist caption embedders and de-embedders. For IP, distribution is technically easier as files can be transferred over standard IP networks, but broadcasters must keep track of the files so that the correct subtitles are streamed alongside their associated media streams.
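A minimal sketch of that bookkeeping, using hypothetical stream IDs and file paths, might pair each media stream with its language-keyed caption files:

    # Hypothetical registry pairing each media stream with its caption
    # files, so the correct subtitles accompany the associated media.
    caption_registry = {
        "evening-news-flow-01": {
            "en": "captions/evening-news.en.vtt",
            "es": "captions/evening-news.es.vtt",
        },
    }

    def captions_for(stream_id: str, language: str) -> str:
        return caption_registry[stream_id][language]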

Conclusion

To deliver the caliber of live captioning that viewers demand requires broadcasters to design workflows that work across multiple platforms for both SDI and IP. The new breed of AI audio-to-text conversion software greatly simplifies the operation as it can be integrated into workflows more easily than the traditional methods, especially when multi-language applications are considered. Furthermore, IP networks allow for easier integration of live captioning files as they do not rely on the specialized embedders and de-embedders found in the SDI domain. And working with a live captioning specialist vendor will help smooth the integration into both existing and new workflows.
