IP is an enabling technology that allows broadcasters to integrate software solutions into their workflows. However, the plethora of SDI and IP workflows that are now coexisting in broadcast infrastructures means that technologists and engineers must look for new methods when designing their facilities, especially when considering the real-time nature of live captioning.
Subtitles and captioning not only provide a valuable service for the hard of hearing but also improve the viewing experience in loud environments where the audio may not be easily heard. Consequently, the quality of the audio to text conversion must be of a very high-quality with the lowest translation to display-time possible.
Although captioning may not take up much bandwidth in relation to video, its impact on the viewer can be huge and so must be considered at the very beginning of a broadcast infrastructure design or modification. Furthermore, the large number of variations of languages, fonts, and dialects requires technologists and engineers to think deeply about how they integrate live captioning into SDI and IP workflows.
For live captioning to be of any value to the viewer it must be an accurate representation of the audio text and be aligned to the people speaking. For pre-recorded material this can be relatively easy to achieve as captioning text can be edited prior to transmission, but for live broadcasts it proves more interesting.
Traditionally, a stenograph operator has manually typed the captioning text as the presenters and guests are speaking. Although this can provide accurate representation of the spoken word, it requires multilingual and highly skilled stenograph operators. And the stenograph operator may be fast, but there is often a noticeable delay between the spoken word and the captioning text appearing on the screen.
SDI workflows have generally provided the captioning information by encoding it in the ancillary data space. This maintained frame accurate timing as well as making sure the correct captioning text was provided for the program being broadcast. However, as we move to an OTT world, this method of operation proves inefficient for transmitting multiple languages across multiple platforms.
Translating Audio to Text
The stenograph method has served broadcasters well for many years, but new methods of operation are now available using artificial intelligence to provide automated speech recognition. The high-quality speech to text processing not only provides accurate representations of speech for the viewer, but it also creates the captioning with incredibly low latency.
Google Assistant, Apple’s Siri and Amazon’s Alexa all demonstrate how speech-to-text is highly effective and fast. Methods of time series analysis are applied using neural networks to convert the speech into text so that multiple models can be trained using different languages. This means the speech-to-text converter can both understand many languages as well as provide multi-language text. The accuracy of the conversion is enhanced as the neural network models can learn the context of the words and sentences which allows them to accurately translate words when the audio quality may not be particularly reliable.
Diagram 1 – One example of an audio to text converter using neural network machine learning. The Convolution Neural Network (CNN) extracts the features from the raw audio, the Transformer determines the context of the audio and converts it to tokens where the Connectionist Temporal Classification (CTC) provides the text translation.
For AI and ML to operate efficiently, the captioning text and video and audio media must be separated from one another. This is not anything new as SMPTEs ST2110 series of standards have abstracted the video, audio, and metadata away from the underlying transport stream, thus keeping the timing plane separate, and this has opened the door for generating captioning text as a separate file.
OTT and internet streaming has many different streaming formats that are required by a whole host of platforms and web browsers. Consequently, one workflow cannot be assumed to operate across every streaming platform.
One of the challenges for traditional broadcast and OTT workflows is how the captions are communicated to the viewer. DVB sends bitmaps as part of the transport stream and not the actual text. This has several advantages such as there are no restrictions on the supported languages and special characters, text positioning is specified by the broadcaster so the decoder doesn’t have to make any decisions about their placement, and broadcasters can create their own customized fonts.
However, DVB bitmaps have a confined operation as devices that require HTTP adaptive streaming protocols, e.g., DASH and HLS, have limited support for the bitmaps. Also, some countries, such as the US, mandate that users should be allowed to customize the visual appearance of the subtitles, and the DVB bitmaps do not always meet this requirement.
The multiple distribution platforms for both linear and OTT has led to many file formats being developed and used. Consequently, when creating live captioning, the software should be able to create the different formats needed to satisfy the platform dependent systems.
Delivery of live captions within SDI is well established using caption-insertion into the ancillary data but still requires specialist caption embedders and de-embedders. However, for IP, the distribution systems are technically easier as files can be transferred over IP networks, but broadcasters must keep track of the files so that the correct subtitles are streamed alongside their associated media streams.
To deliver the caliber of live captioning that viewers demand requires broadcasters to design workflows that fit work across multiple platforms for both SDI and IP. The new breed of AI audio to text conversion software greatly simplifies the operation as the software can be more easily integrated into the workflows than the traditional methods, especially when multi-language applications are considered. Furthermore, IP networks allow for easier integration of live captioning files as they do not rely on specialized embedders and de-embedders as found in the SDI domain. And working with a live captioning specialist vendor will help smooth the integration into both existing and new workflows.
You might also like...
This article describes the various codecs in common use and their symbiotic relationship to the media container files which are essential when it comes to packaging the resulting content for storage or delivery.
This list of file container formats and their extensions is not exhaustive but it does describe the important ones whose standards are in everyday use in a broadcasting environment.
This article gives an overview of the various codec specifications currently in use. ISO and non-ISO standards will be covered alongside SMPTE 2110 elements to contextualize all the different video coding standard alternatives and their comparative efficiency - all of which…
Fast growing traction for 5G Broadcast and Multicast has the potential to disrupt over the air broadcasting by presenting an alternative to the established digital terrestrial networks just as they progress to the next generation. Yet the two may end…
This article gives an overview of the standards relating to production and transmission or playout. It prepares the ground for subsequent more detailed articles which will explore the following subject areas: ST 2110, higher bit rate codecs and profiles that are…