In part one of this series, we looked at why machine learning (ML), with particular emphasis on neural networks, is different from traditional statistics-based methods of classification and prediction. In this article, we investigate some of the broadcast-specific applications at which ML excels.
For ML to be successful, it needs a large amount of labelled training data. The good news is that broadcasting has an abundance of data that is easily classified by domain experts. This leaves the data scientist free to get on with designing the ML model to optimize the task in question.
Detection Of Objects For Media Ingest
Broadcasters are fighting to digitize their historical media archives before tape players become completely obsolete. But this also provides a great opportunity for broadcasters to classify and tag their media assets as they are digitized. Data retrieval is only as effective as data storage, so tagging scenes with metadata is more important now than ever.
Convolutional Neural Networks (CNNs) are an adaptation of the neural network (NN) that facilitates object and scene detection in images. For example, a CNN may classify a chair, or a sunset over New York. During ingest, the ML engine can detect thousands or even tens of thousands of different objects and scenes, providing rich metadata tags that accurately and densely describe the images. This tagging is not limited to video: audio can also be tagged using Natural Language Processing (NLP).
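As a rough illustration of how per-frame detections could become searchable metadata, the sketch below collapses frame-by-frame labels into timed tag segments. The detector output format (one set of labels per frame), the function name `tags_to_segments`, and the `max_gap` parameter are assumptions made for this example and do not describe any particular ML engine.

```python
def tags_to_segments(frame_labels, fps=25.0, max_gap=1):
    """Collapse per-frame labels into (label, start_s, end_s) segments.

    frame_labels: list of sets, one per frame, as a hypothetical detector
    might return them. Gaps of up to max_gap missed frames are bridged so
    that a single dropped detection does not split a segment.
    """
    open_runs = {}  # label -> [first_frame, last_frame] of the current run
    closed = []     # finished runs as (label, first_frame, last_frame)
    for index, labels in enumerate(frame_labels):
        for label in labels:
            run = open_runs.get(label)
            if run is not None and index - run[1] <= max_gap + 1:
                run[1] = index  # extend the current run
            else:
                if run is not None:
                    closed.append((label, run[0], run[1]))
                open_runs[label] = [index, index]
    closed.extend((label, run[0], run[1]) for label, run in open_runs.items())
    # Convert frame indices to seconds; the end time is exclusive.
    return sorted(
        ((label, first / fps, (last + 1) / fps) for label, first, last in closed),
        key=lambda segment: (segment[1], segment[0]),
    )


frames = [{"chair"}, {"chair"}, set(), {"chair", "sunset"}, {"sunset"}]
print(tags_to_segments(frames, fps=1.0))
# [('chair', 0.0, 4.0), ('sunset', 3.0, 5.0)]
```

Storing segments with in and out times, rather than one tag per file, is what lets a later search land on the scene instead of the whole asset.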
Searching For Content
During media ingest into the library, numerous CNN, NLP, and NN engines will be parsing the images, sound, and metadata with an efficiency and accuracy far in excess of anything a human could achieve. Mainly, this is because library ingest and metadata tagging soon become very boring and repetitive tasks, something humans are poor at but computers excel at.
The rich and detailed metadata provided by the ML engines gives search engines a massive opportunity to deliver contextual searches, not just on one file, but across the whole media asset library. For example, a producer could enter a search string such as “find me a scene over the Sydney Opera House in winter with sailing boats in the harbor”.
Part of the role of an ML vendor is to continually retrain their engines so they become more accurate in providing tagged metadata. Consequently, the ML engine can be repeatedly run over the broadcaster's existing media asset library to improve the metadata accuracy and further advance searching for producers and editors.
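To make the idea of searching on tags concrete, here is a minimal sketch that returns every asset whose tags cover all the terms in a query. It assumes the natural-language request has already been reduced to a list of tags by some earlier step; the library contents, asset ids, and the function name `search_assets` are hypothetical.

```python
def search_assets(library, required_tags):
    """Return the ids of assets whose tag set contains every required tag.

    library: dict mapping an asset id to an iterable of metadata tags.
    Matching is case-insensitive; results are sorted for stable output.
    """
    wanted = {tag.lower() for tag in required_tags}
    return sorted(
        asset_id
        for asset_id, tags in library.items()
        if wanted <= {tag.lower() for tag in tags}
    )


library = {
    "clip_001": {"Sydney Opera House", "winter", "sailing boat", "harbor"},
    "clip_002": {"Sydney Opera House", "summer", "fireworks"},
    "clip_003": {"harbor", "sailing boat", "winter"},
}
print(search_assets(library, ["sydney opera house", "winter", "sailing boat"]))
# ['clip_001']
```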
Video And Audio Compression
If we can predict the next samples in a media sequence, then we only need to send the difference between the prediction and the actual samples, which gives a much greater opportunity for video and audio compression. Successive frames of video are often repetitive, leading to obvious compression opportunities. However, if we can improve on this by detecting and defining objects, as well as their trajectories, then we can represent them as vectors, greatly reducing the amount of data needed to represent a scene (compared to pixel representation).
Video and audio compression is a massive area of research as it has so many applications outside of broadcasting. As well as CNNs, GANs (Generative Adversarial Networks) are seeing much attention for video compression. Fundamentally, two NNs contest with each other: a generator produces candidate reconstructions, and a discriminator, acting as the arbiter, judges how closely they match the original, pushing the generator toward better solutions.
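The link between prediction and compression can be shown with a deliberately simple predictor. In the sketch below, each sample is predicted to equal the previous one and only the prediction error (the residual) is stored. This previous-sample predictor is a stand-in chosen for illustration, not the learned predictor an ML codec would use, and the function names are assumptions.

```python
def encode_residuals(samples):
    """Replace each sample with its difference from the predicted value.

    The prediction is simply the previous sample (0 before the first one).
    """
    residuals = []
    previous = 0
    for sample in samples:
        residuals.append(sample - previous)
        previous = sample
    return residuals


def decode_residuals(residuals):
    """Rebuild the original samples by adding each residual to the prediction."""
    samples = []
    previous = 0
    for residual in residuals:
        previous += residual
        samples.append(previous)
    return samples


luma = [100, 101, 102, 102, 103]
print(encode_residuals(luma))
# [100, 1, 1, 0, 1]
```

The better the predictor, the closer the residuals sit to zero, and small values clustered around zero are what an entropy coder can represent in fewer bits.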
Automated Quality Control (QC) is well established, especially for checking parameters such as gamut and level errors. But ML can also review the image and sound so that the media content itself can be assessed. This is especially useful where cultural restrictions apply to media, such as depictions of alcohol consumption or smoking. As CNNs, NNs, and GANs develop in complexity and in the types of images and sound they can reliably identify, QC-ML will add another dimension to the subjective checking of media, especially for contribution and ingest.
For much of the editing used in broadcasting, edit points are chosen using a well-defined rule structure based on established working practices. However, the subjective nature of editing has limited automation using traditional statistics-based approaches. Because ML can, through its training, understand a certain amount of the image content, it becomes possible to automate some subjective editing, especially for highlights packages.
It’s unlikely that, any time soon, ML will take over the creation of highlights packages for high-value games such as the Super Bowl. However, for smaller sports events, the automatic creation of highlights packages is already a reality. An adequately trained ML engine can determine when a goal has been scored, or missed, and identify the preamble leading up to the event. From here, the necessary edit points can be established, and the edited package provided.
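Once content has been tagged, a content-based QC check can be as simple as comparing tags against a per-territory rule set. The sketch below assumes tagged segments in the form (label, start, end) and a hypothetical mapping of regions to restricted labels; the region names and the function name `qc_flags` are invented for the example.

```python
def qc_flags(tagged_segments, restricted_by_region):
    """List, per region, the tagged segments that breach that region's rules.

    tagged_segments: list of (label, start_s, end_s) tuples.
    restricted_by_region: dict mapping a region name to a set of labels.
    """
    return {
        region: [segment for segment in tagged_segments if segment[0] in restricted]
        for region, restricted in restricted_by_region.items()
    }


segments = [("smoking", 12.0, 15.5), ("car", 20.0, 31.0), ("alcohol", 40.0, 44.0)]
rules = {"region_a": {"smoking", "alcohol"}, "region_b": {"smoking"}}
print(qc_flags(segments, rules)["region_b"])
# [('smoking', 12.0, 15.5)]
```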
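Given event times from such an engine, establishing edit points is largely arithmetic. The following sketch places an in point a fixed preamble before each event and an out point shortly after it, merging clips that overlap. The preamble and postamble durations and the function name `highlight_edit_points` are assumptions for illustration; a production system would more likely let the model choose these boundaries.

```python
def highlight_edit_points(event_times, preamble=10.0, postamble=5.0, duration=None):
    """Turn detected event times (seconds) into merged (in, out) edit points.

    Each clip starts `preamble` seconds before its event and ends
    `postamble` seconds after it, clamped to the programme duration when
    one is given. Clips that overlap are merged into a single clip.
    """
    clips = []
    for event in sorted(event_times):
        start = max(0.0, event - preamble)
        end = event + postamble
        if duration is not None:
            end = min(end, duration)
        if clips and start <= clips[-1][1]:
            # Overlaps the previous clip, so extend it instead of adding a new one.
            clips[-1] = (clips[-1][0], max(clips[-1][1], end))
        else:
            clips.append((start, end))
    return clips


print(highlight_edit_points([62.0, 70.0, 300.0]))
# [(52.0, 75.0), (290.0, 305.0)]
```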
Closed Captioning And Subtitling
Subtitling has traditionally relied on manually typing the spoken words into text, which is then transmitted along with the media. This presents two challenges: firstly, people are needed to enter the subtitles, and secondly, the process is inherently slow due to the time taken to decode and type the text. ML not only speeds up this process, but also reduces the need for human intervention.
Here, NLP, combined with speech recognition, treats the audio track of the production as a time series and converts the spoken word to text, which can then be broadcast as subtitles. NLP works in a variety of languages so that interpreters and language-specific operators are no longer needed, thus speeding up and reducing the complexity of the subtitling process.
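The step from recognized words to subtitles on screen can be sketched as follows. The example assumes a speech-to-text engine has already produced words with start and end times, then groups them into cues and writes them in the widely used SubRip (SRT) text layout. The word list, the `max_chars` limit, and the function names are assumptions for this sketch.

```python
def format_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words, max_chars=32):
    """Group timed words into subtitle cues of at most max_chars characters.

    words: list of (text, start_s, end_s) tuples from a hypothetical
    speech-to-text engine. A single word longer than max_chars still
    gets its own cue.
    """
    cues, current = [], []
    for word in words:
        candidate = " ".join(w[0] for w in current + [word])
        if current and len(candidate) > max_chars:
            cues.append(current)
            current = []
        current.append(word)
    if current:
        cues.append(current)

    blocks = []
    for number, cue in enumerate(cues, start=1):
        text = " ".join(w[0] for w in cue)
        timing = f"{format_timestamp(cue[0][1])} --> {format_timestamp(cue[-1][2])}"
        blocks.append(f"{number}\n{timing}\n{text}\n")
    return "\n".join(blocks)


words = [("Good", 0.0, 0.4), ("evening", 0.4, 0.9), ("and", 0.9, 1.1), ("welcome", 1.1, 1.6)]
print(words_to_srt(words, max_chars=12))
```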
Remote Camera Control
Facial recognition allows ML engines to detect players at a sports event. When combined with a remote camera that provides pan, tilt, zoom, and focus control, a method of tracking players around the sports arena becomes available. This not only reduces the need for an operator at each camera, but also provides the opportunity for many more cameras to be available for the production.
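A minimal sketch of the tracking step is shown below: given the bounding box of a recognized player, it works out how far the camera should pan and tilt to bring the player back toward the center of the frame. The field-of-view values, the `gain` factor, and the function name `pan_tilt_correction` are assumptions, and the linear pixel-to-angle mapping is only an approximation.

```python
def pan_tilt_correction(box, frame_width, frame_height,
                        h_fov_deg=60.0, v_fov_deg=34.0, gain=0.5):
    """Return (pan_deg, tilt_deg) that nudges the camera toward the target.

    box: (x_min, y_min, x_max, y_max) of the detected player, in pixels,
    with the origin at the top-left of the frame.
    Positive pan means turn right; positive tilt means tilt up.
    """
    center_x = (box[0] + box[2]) / 2
    center_y = (box[1] + box[3]) / 2
    # Offset of the target from the frame center, as a fraction of the frame.
    offset_x = center_x / frame_width - 0.5
    offset_y = center_y / frame_height - 0.5
    # Apply only part of the correction per update (gain < 1) to avoid overshoot.
    pan = gain * offset_x * h_fov_deg
    tilt = -gain * offset_y * v_fov_deg  # image y grows downward
    return pan, tilt


print(pan_tilt_correction((1390, 490, 1490, 590), 1920, 1080)[0])
# 7.5
```

Running this on every frame, or every few frames, forms a simple proportional control loop; a real system would also need to handle zoom, focus, and the moments when the player is lost from view.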
It’s entirely possible to have one camera assigned to each player, or multiple cameras giving lots of different angles assigned to specific players of interest. Furthermore, this could be achieved on the fly with the director selecting a specific sports person to follow. Combining remote camera control through facial recognition with automated editing would provide an incredibly flexible and efficient method of generating highlights for a sports production.
IP Network Optimization
ML is finding all types of applications in networks, from security to flow optimization. IP flows have a complex cyclical element that can be learned through time series ML models such as LSTMs (Long Short-Term Memory networks) and Transformers. This is particularly useful when SDNs (Software Defined Networks) are employed, as the data plane can be switched under the control of the management system, which in turn can obtain predictive information about the network from the ML model.
As more broadcasters migrate to IP, network operation will need to be better understood so that latency and packet loss can be kept low. ML provides an opportunity to better predict network behavior and anticipate anomalies before they affect the service.
A futuristic application of NLP is to have voice-activated devices such as production switchers in the studio. The director calls the shots, the NLP decodes the voice into commands, and the control system switches the inputs on the production switcher, all without an operator's intervention.
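The idea of learning a cyclical traffic pattern and flagging departures from it can be illustrated without a neural network. The sketch below uses a seasonal baseline (the average of earlier samples at the same point in the cycle) as a deliberately simple stand-in for an LSTM or Transformer forecast. The `period`, `threshold`, and `min_spread` parameters and the function name are assumptions.

```python
from statistics import mean, pstdev


def seasonal_anomalies(samples, period, threshold=3.0, min_spread=1.0):
    """Return indices of samples that deviate strongly from the cyclic norm.

    Each sample is compared with the earlier samples taken at the same
    phase of the cycle. At least two full cycles of history are required
    before anything is flagged.
    """
    anomalies = []
    for index in range(2 * period, len(samples)):
        history = samples[index % period:index:period]
        expected = mean(history)
        # Floor the spread so perfectly regular history does not flag noise.
        spread = max(pstdev(history), min_spread)
        if abs(samples[index] - expected) > threshold * spread:
            anomalies.append(index)
    return anomalies


# Hypothetical link utilisation figures with a four-sample cycle and one spike.
utilisation = [10, 20, 30, 20] * 3 + [10, 20, 90, 20]
print(seasonal_anomalies(utilisation, period=4))
# [14]
```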
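The last stage of that chain, turning recognized speech into a switcher command, might look like the sketch below. It assumes the director's speech has already been transcribed to text; the command vocabulary, the returned dictionary format, and the function name `parse_switcher_command` are hypothetical and do not correspond to any real switcher's control protocol.

```python
import re

NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4,
    "five": 5, "six": 6, "seven": 7, "eight": 8,
}

COMMAND_PATTERN = re.compile(r"\b(cut|mix|wipe)\s+to\s+(?:camera|cam)\s+(\w+)")


def parse_switcher_command(transcript):
    """Extract a transition and source number from a transcribed call.

    Returns a dict such as {"transition": "cut", "source": 2}, or None
    when no recognizable command is present.
    """
    match = COMMAND_PATTERN.search(transcript.lower())
    if match is None:
        return None
    transition, token = match.groups()
    source = int(token) if token.isdigit() else NUMBER_WORDS.get(token)
    if source is None:
        return None
    return {"transition": transition, "source": source}


print(parse_switcher_command("Stand by two... and cut to camera two"))
# {'transition': 'cut', 'source': 2}
```

Returning None for anything unrecognized is a deliberate choice here: on a live production, ignoring an unclear call is safer than guessing at it.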
Machine Learning is finding all kinds of applications in broadcasting and the massive amount of research being conducted in seemingly unrelated fields is working to our advantage. However, as we will see in the next few articles in this series, training data is critical for accurate ML prediction and classification, and its importance cannot be taken for granted.