At Facebook, AI Tackles Automated Captioning For Online Video

Online video captioning is critical for the deaf community at any time, but during a public health emergency like COVID-19, it has taken on a new significance, particularly as people stay at home.

On Facebook alone, 100 million hours of video are watched every day, and the platform sees more than 8 billion video views daily on average. More telling for the hearing impaired and others, Facebook says that 85 percent of its users watch videos with the sound off.

With so much video to process and a U.S. government mandate — the 21st Century Communications and Video Accessibility Act, which requires closed captioning for online video content that was originally broadcast on TV with captions — the job of syncing live or post-produced captioning across so much content is massive.

Using AI To Keep Up

Because Facebook videos auto-play on mute, closed captions are instrumental in capturing viewers’ attention as they scroll down their news feeds. Most will agree that captioning Facebook videos makes them more watchable. The company now provides automatic closed captioning for on-demand videos in 16 languages; however, access to live, real-time news and information is still a challenge in many parts of the world.

More recently, the company has invested considerable resources in artificial intelligence (AI) algorithms that make live video content more accessible, with automatic closed captions now applied to Facebook Live and Workplace Live platforms. Already, six languages are supported: English, Spanish, Portuguese, Italian, German and French.

Automated captioning for live video is still a challenge for most online video platforms. Note the difference between the online caption and the actual words spoken (bottom).


Facebook Live automatic captions are helping governments disseminate crucial public health information and ensuring that millions of viewers across the world — whether they have hearing loss or are just watching where audio is not available — get the message. And, as workplace policies evolve, automatic captioning has become essential for employers to keep their staff and customers informed through safety updates.

Automated Captioning Is Not New, Just Better

Although automated captioning technology, which predicts a sequence of words from a raw audio signal, has been around since the late 2000s, accurate automatic speech recognition (ASR) is still hard to accomplish. In the conversational speech typical of live streams, people don’t always speak clearly or wait their turn to talk. Unpredictable background noise, the wide variety of accents and dialects, and the broad range of tones that shape human speech make ASR even harder. The system also needs to learn to recognize hundreds of millions of different words across many languages, including uncommon names and jargon.

Conventional ASR systems are made up of three components: an acoustic model that predicts phonemes from short segments of audio; a pronunciation lexicon, which describes how the phonemes are combined to form the words of a given language; and a language model that captures the relationships among those words (e.g., which words are the most common and which words are likely to appear together).
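The three-stage pipeline above can be sketched in miniature. Everything here is invented for illustration (the phonemes, the lexicon entries, the probabilities); a real system learns all of it from large amounts of audio and text data.

```python
# Toy sketch of the three conventional ASR components.

# 1) Acoustic model: maps short audio segments to phoneme guesses.
#    Faked here with a fixed answer for a pretend audio clip.
def acoustic_model(audio_segments):
    return ["HH", "EH", "L", "OW"]  # phonemes for "hello"

# 2) Pronunciation lexicon: phoneme sequences -> words.
LEXICON = {
    ("HH", "EH", "L", "OW"): "hello",
    ("W", "ER", "L", "D"): "world",
}

# 3) Language model: relative likelihood of word pairs (a toy bigram table;
#    a real LM covers vastly more context and vocabulary).
BIGRAM_PROB = {("hello", "world"): 0.9, ("hello", "hello"): 0.01}

def transcribe(audio_segments):
    phonemes = acoustic_model(audio_segments)
    return LEXICON.get(tuple(phonemes), "<unk>")

print(transcribe(None))  # -> hello
```

In practice the three components are searched jointly, so that a word the lexicon and language model both favor can win even when the acoustic evidence is ambiguous.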

Acoustic Models That Predict The Characters Of A Word

One key discovery made by the Facebook AI team was that the phonetic pronunciation lexicon could be eliminated entirely: acoustic models can be trained to predict the graphemes (characters) of a word directly, with better accuracy, a result first shown for end-to-end systems and later confirmed for hybrid systems as well. This greatly simplified training and deployment of these ASR models across different languages.
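To make the grapheme idea concrete, here is a minimal sketch of one common character-level decoding scheme (CTC-style greedy decoding, a standard technique in end-to-end ASR, not necessarily the exact one Facebook uses). The model emits one symbol per audio frame, including a blank, and decoding simply collapses repeats and drops blanks, so no pronunciation lexicon is needed. The frame outputs below are hand-picked for illustration.

```python
# Greedy CTC-style collapse of per-frame character predictions.
def ctc_collapse(frame_symbols, blank="-"):
    out = []
    prev = None
    for s in frame_symbols:
        # keep a symbol only if it is not blank and not a repeat
        if s != blank and s != prev:
            out.append(s)
        prev = s
    return "".join(out)

# one predicted character per audio frame, "-" marks "no new character"
frames = ["h", "h", "-", "e", "l", "l", "-", "l", "o", "o"]
print(ctc_collapse(frames))  # -> hello
```

Note how the blank between the two "l" runs is what allows a genuine double letter to survive the collapse.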

The rapid spread of the COVID-19 pandemic caused a spike in both the supply and demand of public health information. Several local and state governments — that were accustomed to holding live press conferences but didn’t have the resources, staff or technology to record, stream and caption their live events — turned to Facebook Live. Several governments also discovered that video captioning was not just a luxury but vitally important, especially in the absence of available sign language interpreters. Many of them needed captions to comply with their own disability access rules for public broadcasts.

To handle the inevitable elevated spikes in traffic, Facebook’s ASR models had to get a lot faster in production to avoid falling behind.


People around the world were also tuning into newscasts and conferences streaming on Facebook Live, and watching for much longer periods of time than usual. In fact, the number of Facebook Live broadcasts from Pages doubled in June 2020 compared to the same time last year.

To handle those spikes, Facebook’s ASR models had to get a lot faster in production to avoid falling behind. Recent research in the captioning space has shown that “convolutional” encoders — neural networks best known for analyzing imagery but equally applicable to audio — yield the best balance of speed and accuracy for live streaming. In non-streaming use cases (i.e., when the entire video is available to the model for decoding), Facebook engineers found that “Transformer” encoders produce ASR models that are both very fast and the most accurate.
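The streaming/non-streaming distinction above can be sketched structurally. The `encode` function below is a trivial stand-in for a neural encoder; the point is only how much audio each call can see, not the model itself. A live stream must be processed in small chunks as frames arrive (low latency, limited context), while an on-demand video can be decoded with the whole recording in view.

```python
# Stand-in for an acoustic encoder over whatever frames it is given.
def encode(frames):
    return [f.upper() for f in frames]

def non_streaming_decode(all_frames):
    # on-demand video: the model sees the entire recording at once
    return encode(all_frames)

def streaming_decode(frame_iter, chunk_size=2):
    # live video: process fixed-size chunks as they arrive,
    # so captions appear with low delay
    out, chunk = [], []
    for f in frame_iter:
        chunk.append(f)
        if len(chunk) == chunk_size:
            out.extend(encode(chunk))
            chunk = []
    if chunk:  # flush any trailing partial chunk
        out.extend(encode(chunk))
    return out

frames = ["a", "b", "c", "d", "e"]
assert streaming_decode(iter(frames)) == non_streaming_decode(frames)
```

With a context-free encoder like this toy, the two paths agree exactly; for a real neural encoder, the streaming path trades some accuracy for latency because each chunk sees less surrounding audio, which is why the full-context Transformer wins when the whole video is available.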

Training The System With PyTorch

Facebook engineers were able to deploy these model variations with a number of infrastructure optimizations, which enabled the company to serve all the additional video traffic and resulted in machine resource savings despite the increased load. Models were trained using PyTorch, the open source deep learning framework developed at Facebook, which enables quick iteration on ideas and deployment to production. PyTorch provides common building blocks for deep learning research and allows high-level neural network modules to be chained together, which can greatly increase caption processing speed.
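The “chaining of high-level modules” idea can be illustrated without any framework at all. This is a pure-Python sketch of the pattern PyTorch offers through its `nn.Sequential` container; the tiny callables here are stand-ins for real neural network layers.

```python
# Minimal module-chaining container, in the spirit of nn.Sequential.
class Sequential:
    def __init__(self, *layers):
        self.layers = layers

    def __call__(self, x):
        # feed the output of each layer into the next
        for layer in self.layers:
            x = layer(x)
        return x

# toy "layers": simple element-wise transforms
double = lambda x: [v * 2 for v in x]
shift = lambda x: [v + 1 for v in x]

model = Sequential(double, shift)
print(model([1, 2, 3]))  # -> [3, 5, 7]
```

The appeal is that swapping, reordering, or adding stages (say, a faster encoder) changes one line, which is what makes quick iteration from research to production practical.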

Julian Chan, a Facebook AI software engineer, said that the system is also capable of adapting to new words such as “COVID,” which has been essential for captioning public health broadcasts during the pandemic.

“It can easily learn a new word and predict where it will occur,” he said. “This was largely made possible using text data from public Facebook posts to train the system.”
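The adaptation Chan describes can be sketched with a toy word-level language model that is refreshed from recent text, so a brand-new term becomes recognizable. The post text and the `UnigramLM` class are invented for illustration; the production system is far more sophisticated.

```python
from collections import Counter

# Toy word-frequency "language model" refreshed from recent text.
class UnigramLM:
    def __init__(self):
        self.counts = Counter()

    def update(self, text):
        # fold new text into the vocabulary counts
        self.counts.update(text.lower().split())

    def knows(self, word):
        return word.lower() in self.counts

lm = UnigramLM()
lm.update("city hall briefing on road closures")
assert not lm.knows("covid")  # unseen term: the model cannot predict it

# new public posts mention the new term; retraining picks it up
lm.update("live covid briefing from the governor")
assert lm.knows("covid")
```

Refreshing the language model from text alone is cheap compared with retraining the acoustic model, which is what makes tracking fast-moving vocabulary feasible.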

Clearly more work is needed to keep up with demand, but in the meantime broadcasters can look to automatic closed captions from consumer-facing platforms like Facebook, Google and Instagram to support their efforts: getting important messages out as easily read on-screen text while complying with online video laws. AI is up to the challenge, and Facebook is making that happen.
