Video Quality: Part 4 - Video Quality Focus On Generative AI

We continue our mini-series about Video Quality with a discussion of how Generative AI is having a growing impact on all aspects of video quality, from basic upscaling and image enhancement to improving the value of search, as well as generating clips and original content in full Ultra HD.

Concerns over threats posed by Generative AI can obscure some of its great benefits for video quality control and enhancement, increasing scope through automation as well as image generation. It also builds new bridges between basic aspects of quality enhancement, such as upscaling or sharpening blurry images, and capabilities that have a more profound impact on the viewer experience. The latter include rapid creation of clips from longer form videos in news and sports, for example, as well as generation of metadata that makes new content almost instantly searchable.

While there have been concerns and conflicts over the ability of Gen AI to usurp human creative functions, the automation of routine tasks previously performed manually can free people to focus on higher level work. There is even potential for raising the bar for human creativity, while recognizing that many tasks previously considered to lie within the domain of higher human intelligence will be automated or taken over by machines.

A good starting point is to consider the impact Gen AI is having on more routine aspects of quality control, especially upscaling and improving existing footage, or restoring archived material. In essence it makes these processes more accurate and realistic, so that an image upscaled from full HD to 4K is filled in with detail matched to similar real images, rather than simply padded out by copying existing pixels, which inevitably looks slightly artificial.
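To make the contrast concrete, the minimal sketch below (a toy illustration assuming Python with NumPy and a recent version of Pillow installed, not any broadcaster's production pipeline) shows the two non-generative baselines: naive pixel replication and bicubic interpolation. Neither invents new detail; a generative model goes further by synthesizing plausible high-frequency content learned from similar real images.

```python
# Sketch: contrasting naive pixel replication with interpolation when
# upscaling a frame from full HD (1920x1080) to 4K UHD (3840x2160).
# A generative model would go further, synthesizing plausible detail
# rather than only rearranging or smoothing existing pixels.
import numpy as np
from PIL import Image  # assumes Pillow >= 9.1 for Image.Resampling

frame = np.random.randint(0, 256, (1080, 1920, 3), dtype=np.uint8)  # stand-in HD frame

# 1. Naive "copying existing pixels": each pixel becomes a 2x2 block.
replicated = np.repeat(np.repeat(frame, 2, axis=0), 2, axis=1)  # 2160 x 3840

# 2. Bicubic interpolation: smoother, but still invents no new detail.
bicubic = Image.fromarray(frame).resize((3840, 2160), Image.Resampling.BICUBIC)

print(replicated.shape, bicubic.size)  # (2160, 3840, 3) (3840, 2160)
```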

Gen AI is the latest development of neural network based machine learning designed loosely to resemble the human brain but now with many more potential connections. Modern AI has been driven by the unexpectedly sustained advances in computational power that have continued to obey Moore’s Law for over 40 years. This equates roughly to a doubling in transistor density and therefore processing power every two years, or about a million fold over 40 years.

AI has become almost a synonym for machine learning in its broadest form, applying data sets to train software to identify features or correlations contained within them. Most machine learning models are based on neural networks comprising nodes whose interconnections are assigned weights that are adjusted during learning, analogous to the synapses joining neurons in the human brain.

These weights are the parameters sometimes cited to describe the complexity of a given model. During training, the difference between the actual output of the model and the desired target output is continually reduced until it either disappears or reaches a constant low level, at which point convergence is said to have occurred.
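As a rough illustration of that convergence process, the sketch below (a toy example in Python with NumPy, not any production training system) adjusts the weights of a single-layer model by gradient descent until the gap between its output and the target effectively disappears.

```python
# Sketch of convergence: a tiny one-layer "network" whose weights are nudged
# by gradient descent until the gap between its output and the target stops
# shrinking. Real models have millions or billions of such weights
# (parameters), but the principle is the same.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # 100 training examples, 3 features
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w                          # desired target outputs

w = np.zeros(3)                         # the model's weights, initially untrained
lr = 0.1                                # learning rate
for step in range(200):
    pred = X @ w
    error = pred - y                    # difference between output and target
    loss = np.mean(error ** 2)
    grad = 2 * X.T @ error / len(y)     # gradient of the loss w.r.t. the weights
    w -= lr * grad                      # adjust weights to reduce the difference
    if loss < 1e-10:                    # effectively converged
        break

print(step, loss, w)                    # weights approach true_w as the loss vanishes
```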

Large language models are then neural network based machine learning models designed to process and generate human language, trained on massive volumes of text. They use deep learning algorithms in neural networks to learn the patterns and nuances of language and then produce human-like responses, yielding the capabilities grouped under the banner of Gen AI.

At the application level, Gen AI goes further than preceding neural network based machine learning models by being able to create novel content on its own, by mining existing patterns within the datasets upon which it has been trained. This generated content can be text, images and software code, as well as video and audio.

The generative element then comes at a low level for upscaling and converting blurred to sharp images. It can also, along the same lines, restore damaged or aged archive footage, upscale it, and convert it from black and white to color. Needless to say, this is all work in progress and has not yet reached perfection, but it nonetheless represents great progress over the capabilities of earlier AI systems.

Adobe has shown how much more mileage there is in the traditional task of upscaling in a paper published in April 2024, describing its VideoGigaGAN as a “generative video super-resolution model that can upsample videos with high-frequency details while maintaining temporal consistency.”

The model achieves what it says on the tin, although, as always, claims that it goes far beyond what competitors have achieved should be treated as unproven marketing in the absence of rigorous independent testing. The main point is sound and indicative of how challenging it has been to fill in the pixel gaps to achieve convincing higher resolution than the original production.

The paper is worth mentioning because Adobe was correct in its starting point: until now video upscaling has failed to strike the best balance between the temporal and spatial dimensions. It has either yielded realistic individual frames at the expense of perceptible motion jitter, flickering, or aliasing distortion between frames, or delivered temporal consistency but with imperfections such as blurred images within frames.

Adobe noted that existing Video Super Resolution (VSR) models employing traditional machine learning have largely solved the temporal part of the problem, achieving consistent sequences of frames. At the same time, application of Generative Adversarial Networks (GANs) has demonstrated impressive generation of higher resolution individual images. The latest so-called GigaGANs have taken this almost to perfection when trained on billions of images. However, they fail to match successive frames and so fail the temporal consistency test.
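One crude way to see what temporal inconsistency means in practice is sketched below. This is an illustrative proxy of our own devising, not a metric defined in the Adobe paper: it simply measures the average frame-to-frame change, so spatially sharp but temporally inconsistent output shows large differences between successive frames in regions that should be static, perceived by the viewer as flicker.

```python
# Sketch of a crude temporal-consistency proxy (an illustrative assumption,
# not a metric from the VideoGigaGAN paper): the mean absolute change between
# successive frames. Static regions that vary frame to frame read as flicker.
import numpy as np

def temporal_flicker(frames: np.ndarray) -> float:
    # frames: array of shape (num_frames, height, width, channels), values in [0, 1]
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))
    return float(diffs.mean())

static_scene = np.ones((10, 64, 64, 3), dtype=np.float32) * 0.5
flickering = static_scene + np.random.default_rng(0).normal(0, 0.05, static_scene.shape)

print(temporal_flicker(static_scene))   # ~0.0 - perfectly consistent
print(temporal_flicker(flickering))     # noticeably higher - perceived as flicker
```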

GANs emerged a decade ago, around 2014, inspired by mimicry in evolutionary biology, where organisms imitate one another in some detail for selective advantages such as camouflage. The idea is that one network takes the role of a human trainer, issuing data to another, which then judges how “realistic” it is, that is, how closely it matches a target.

But then, instead of attempting to reduce the gap, the second network delivers the results of its assessment to the first, while also updating itself to make clearer distinctions next time. The first network in turn updates its weights to generate a closer match to the target next time. In this way, with one model trying to converge on the target and the other trying to stay ahead of it, the whole system can learn without supervision from the initial data alone, which works well for some problems.
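That adversarial loop can be sketched in a few lines. The toy example below is written in Python using PyTorch (an assumed choice of framework, not one named here): a generator learns to mimic a simple one-dimensional distribution while a discriminator learns to tell its samples from real ones, each update of one network being an attempt to outdo the other.

```python
# Sketch of the adversarial loop: generator G tries to produce samples the
# discriminator D scores as "real", while D tries to keep telling them apart.
import torch
import torch.nn as nn

real_data = lambda n: torch.randn(n, 1) * 0.5 + 2.0   # "real" samples drawn from N(2, 0.5)

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))                # generator
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())  # discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    # Discriminator update: learn to score real samples high and generated ones low.
    real = real_data(64)
    fake = G(torch.randn(64, 8)).detach()
    loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator update: adjust its weights so its output fools the discriminator.
    fake = G(torch.randn(64, 8))
    loss_g = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

print(G(torch.randn(1000, 8)).mean().item())  # ideally drifts toward ~2.0 as training progresses
```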

In upscaling this works well in the spatial but not the temporal dimension, because the latter is too open a problem if confined to matching. Adobe's idea was to combine GANs with proven temporal VSR upscaling in the model it calls VideoGigaGAN. Results suggest that at least this approach is worth pursuing.

AI based upscaling is of particular interest to makers of 4K and 8K TVs, which already incorporate AI based techniques for this. For broadcasters, there is great interest in upgrading archived content, which in the past has tended to be a labor intensive process. The role of Gen AI here lies more in automating processes than improving outcomes, making it possible to rework much more material.

While broadcasters have been applying some form of machine learning in various ways for some time, Gen AI opens new horizons that have led them to reappraise their ambitions and strategies. The BBC outlined its GenAI strategy in February 2024, starting with 12 internal pilots to assess the technology under three categories: increasing the value of existing content; creating new audience experiences; and accelerating or simplifying workflows. All three of these areas have quality implications in the broader sense as it is now understood.

An obvious one is machine translation, which was already being pursued with traditional AI, but BBC News is now considering how Gen AI can accelerate the process of converting breaking items into multiple languages accurately and quickly, almost on the fly.

In a similar vein, another of these projects is looking at reformatting content almost immediately to extend its appeal. One example is converting live sports radio commentaries to text for almost immediate posting on BBC Sport’s live pages.

Another project is looking at automated headline creation, helping beleaguered journalists cope with multiple breaking items by suggesting more amusing or creative titles for them. The idea is to present menus of headlines, with standfirsts below them, as well as summaries of articles derived from the video and audio.

The BBC is also looking at near live metadata creation with Gen AI, aiming to avoid newly emerging items being submerged in the morass of content that goes largely un-accessed simply because it cannot be found. This also feeds into clip generation, enabling snapshots better suited to social media to be created close to the time of first transmission, accompanied by labelling that makes them easier to find. Although there has been progress with object-based video search, that is, locating items on the basis of people in the video or keywords in the audio, for example, its scope is still limited by the computational complexity and time involved. If sufficiently comprehensive descriptive textual metadata has already been created, subsequent search is expedited greatly, as the sketch below illustrates.
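The benefit of generating descriptive metadata up front can be shown with a toy example. In the Python sketch below, describe_clip is a hypothetical stand-in for a Gen AI captioning model; the point is simply that once its text exists, search reduces to a fast index lookup rather than repeated analysis of the video itself.

```python
# Sketch: captions generated once at ingest (describe_clip is a hypothetical
# stand-in for a Gen AI captioning model) feed a simple inverted index, so
# later queries become quick text lookups with no video decoding.
from collections import defaultdict

def describe_clip(clip_id: str) -> str:
    # Hypothetical placeholder: a real system would call a vision-language model here.
    return {"clip1": "goal scored in second half rain soaked pitch",
            "clip2": "post match interview with winning coach"}[clip_id]

index = defaultdict(set)                 # word -> set of clip ids
for clip in ("clip1", "clip2"):          # run once, at or near transmission time
    for word in describe_clip(clip).split():
        index[word].add(clip)

def search(query: str) -> set:
    words = query.lower().split()
    return set.intersection(*(index[w] for w in words)) if words else set()

print(search("goal second half"))        # {'clip1'} - no video analysis at query time
```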

The BBC may well be informed by work done on video summarization by Google, among others. The latter's subsidiary Google DeepMind developed Flamingo as a visual language model, applied specifically to improving SEO (Search Engine Optimization) for its YouTube Shorts by generating descriptive text from the initial frames. Google has claimed big improvements in both searchability and user engagement as a result.

But the main advance embodied in Flamingo is faster learning, so that it can readily be applied to different scenarios quickly without extensive training on large numbers of instances. Apart from text creation from first frames, potential applications include captioning and answering questions about the video.

A few major broadcasters or video service providers are attempting to accelerate innovation around Gen AI by stimulating and funding startups in the field directly. Comcast NBCUniversal has taken this approach with its Generative AI Accelerator, which has secured pilots or proofs of concept managed by its LIFT Labs subsidiary.

The latter’s first accelerator focus for 2024 is Vertical AI, encouraging startups that integrate AI into typical employee workflows at all levels, including quality control.

Comcast insists that startups it helps will be free to sell their products and services to other video service providers that lack its resources to invest directly at large scale.

Comcast’s main motivation is to become agile at the R&D level, achieving the innovative pace of a startup focused on a given task. Essentially it is obtaining early access to various AI based technologies in the hope that at least some will prove valuable and confer an early mover advantage.
