IP is an enabling technology that allows broadcasters to integrate software solutions into their workflows. However, the plethora of SDI and IP workflows that are now coexisting in broadcast infrastructures means that technologists and engineers must look for new methods when designing their facilities, especially when considering the real-time nature of live captioning.
Subtitles and captioning not only provide a valuable service for the hard of hearing but also improve the viewing experience in loud environments where the audio may not be easily heard. Consequently, the audio-to-text conversion must be highly accurate, with the lowest possible delay between the spoken word and its display.
Although captioning may not take up much bandwidth in relation to video, its impact on the viewer can be huge and so must be considered at the very beginning of a broadcast infrastructure design or modification. Furthermore, the large number of variations of languages, fonts, and dialects requires technologists and engineers to think deeply about how they integrate live captioning into SDI and IP workflows.
For live captioning to be of any value to the viewer, it must be an accurate representation of the audio and be aligned to the people speaking. For pre-recorded material this is relatively easy to achieve, as the captioning text can be edited prior to transmission, but live broadcasts prove more challenging.
Traditionally, a stenographer has manually typed the captioning text as the presenters and guests speak. Although this can provide an accurate representation of the spoken word, it requires multilingual and highly skilled stenographers, and however fast they may be, there is often a noticeable delay between the spoken word and the captioning text appearing on the screen.
SDI workflows have generally provided the captioning information by encoding it in the ancillary data space. This maintained frame accurate timing as well as making sure the correct captioning text was provided for the program being broadcast. However, as we move to an OTT world, this method of operation proves inefficient for transmitting multiple languages across multiple platforms.
Translating Audio to Text
The stenography method has served broadcasters well for many years, but new methods of operation are now available using artificial intelligence to provide automated speech recognition. High-quality speech-to-text processing not only provides accurate representations of speech for the viewer, it also creates the captioning with incredibly low latency.
Google Assistant, Apple’s Siri and Amazon’s Alexa all demonstrate how speech-to-text can be highly effective and fast. Time series analysis is applied using neural networks to convert the speech into text, and multiple models can be trained on different languages. This means the speech-to-text converter can both understand many languages and provide multi-language text. The accuracy of the conversion is enhanced because the neural network models learn the context of words and sentences, which allows them to translate words accurately even when the audio quality is not particularly reliable.
Diagram 1 – One example of an audio to text converter using neural network machine learning. The Convolution Neural Network (CNN) extracts the features from the raw audio, the Transformer determines the context of the audio and converts it to tokens where the Connectionist Temporal Classification (CTC) provides the text translation.
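The CTC stage described above can be illustrated with a toy example. The network emits one token per audio frame, including a special "blank" token; greedy CTC decoding then collapses repeated tokens and strips the blanks to recover the text. This is a minimal sketch of that final decoding step only, with invented frame predictions, not a full speech recognizer:

```python
# Minimal sketch of greedy CTC decoding. The "-" blank symbol and the
# per-frame token sequence are illustrative assumptions.
BLANK = "-"

def ctc_greedy_decode(frame_tokens):
    """Collapse repeated tokens and remove blanks from per-frame predictions."""
    out = []
    prev = None
    for tok in frame_tokens:
        # Only emit a token when it differs from the previous frame
        # and is not the blank symbol.
        if tok != prev and tok != BLANK:
            out.append(tok)
        prev = tok
    return "".join(out)

# Eleven audio frames predicting "hh-e-ll-llo" decode to "hello":
# the blank between the two "ll" runs is what preserves the double letter.
print(ctc_greedy_decode(list("hh-e-ll-llo")))  # prints "hello"
```

In practice, beam-search decoding with a language model replaces this greedy pass, which is how context improves accuracy on noisy audio.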
For AI and ML to operate efficiently, the captioning text and the video and audio media must be separated from one another. This is nothing new, as SMPTE's ST 2110 suite of standards has abstracted the video, audio, and metadata away from the underlying transport stream, keeping the timing plane separate, and this has opened the door to generating captioning text as a separate file.
OTT and internet streaming encompass many different streaming formats required by a whole host of platforms and web browsers. Consequently, no single workflow can be assumed to operate across every streaming platform.
One of the challenges for traditional broadcast and OTT workflows is how the captions are communicated to the viewer. DVB sends bitmaps as part of the transport stream rather than the actual text. This has several advantages: there are no restrictions on the supported languages and special characters, text positioning is specified by the broadcaster so the decoder does not have to make any decisions about placement, and broadcasters can create their own customized fonts.
However, DVB bitmaps have limited applicability, as devices that require HTTP adaptive streaming protocols, e.g., DASH and HLS, offer little support for them. Also, some countries, such as the US, mandate that users be allowed to customize the visual appearance of subtitles, a requirement DVB bitmaps do not always meet.
The multiple distribution platforms for both linear and OTT have led to many file formats being developed and used. Consequently, live captioning software should be able to create the different formats needed to satisfy each platform-dependent system.
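As a small illustration of why a single caption cue must be rendered into multiple formats, the sketch below emits the same hypothetical cue as both SRT and WebVTT. The two formats differ in small but incompatible ways, such as the millisecond separator (comma versus dot); the cue text and timings here are invented for illustration:

```python
# Hedged sketch: render one caption cue to SRT and WebVTT.
def fmt_time(ms, sep):
    """Format milliseconds as HH:MM:SS<sep>mmm."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, frac = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d}{sep}{frac:03d}"

def to_srt(index, start_ms, end_ms, text):
    # SRT cues are numbered and use a comma before the milliseconds.
    return f"{index}\n{fmt_time(start_ms, ',')} --> {fmt_time(end_ms, ',')}\n{text}\n"

def to_webvtt(start_ms, end_ms, text):
    # WebVTT cues use a dot before the milliseconds (file needs a "WEBVTT" header).
    return f"{fmt_time(start_ms, '.')} --> {fmt_time(end_ms, '.')}\n{text}\n"

print(to_srt(1, 1000, 3500, "Hello, world"))
print(to_webvtt(1000, 3500, "Hello, world"))
```

Production systems would typically also support TTML/IMSC and embedded CEA-608/708, but the principle is the same: one internal cue model, many renderers.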
Delivery of live captions within SDI is well established using caption-insertion into the ancillary data but still requires specialist caption embedders and de-embedders. However, for IP, the distribution systems are technically easier as files can be transferred over IP networks, but broadcasters must keep track of the files so that the correct subtitles are streamed alongside their associated media streams.
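The file-tracking problem mentioned above amounts to keeping a reliable association between each media stream and its sidecar caption files. A minimal sketch of such a registry, keyed by program and language, might look like this; the program IDs, languages, and paths are entirely illustrative:

```python
# Hypothetical registry mapping (program, language) to a caption file,
# so playout can stream the correct subtitles alongside the media.
captions = {}

def register(program_id, language, path):
    """Record the caption file for one program/language pair."""
    captions.setdefault(program_id, {})[language] = path

def lookup(program_id, language):
    """Return the caption file path, or None if not registered."""
    return captions.get(program_id, {}).get(language)

register("news-0900", "en", "captions/news-0900.en.vtt")
register("news-0900", "es", "captions/news-0900.es.vtt")
print(lookup("news-0900", "es"))  # prints "captions/news-0900.es.vtt"
```

Real deployments would hold this mapping in an asset-management database rather than in memory, but the lookup discipline is the same.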
Delivering the caliber of live captioning that viewers demand requires broadcasters to design workflows that work across multiple platforms for both SDI and IP. The new breed of AI audio-to-text conversion software greatly simplifies the operation, as the software can be integrated into workflows more easily than the traditional methods, especially where multi-language applications are considered. Furthermore, IP networks allow for easier integration of live captioning files as they do not rely on the specialized embedders and de-embedders found in the SDI domain. And working with a specialist live captioning vendor will help smooth the integration into both existing and new workflows.