Video Quality: Part 1 - Video Quality Faces New Challenges In Generative AI Era
In this first in a new series about Video Quality, we look at how the continuing proliferation of User Generated Content has brought new challenges for video quality assurance, with AI in turn helping address some of them. But new issues are raised as the lines between content generation, quality measurement, and restoration become blurred.
Video quality has been a recurring topic throughout TV history, as technologies have evolved across the content lifecycle, with new delivery mechanisms and rising consumer expectations. With the streaming era now long underway, the latest disruption comes from AI, especially the generative variety. Gen AI is not just revolutionizing content generation but blurring the lines around quality assurance, as enhancement becomes entwined with creation.
At the same time, streaming continues to pose challenges for quality assurance after distribution, beyond the creation process, and here too AI is making a growing impact. It is becoming possible to apply consistent video quality measurement after delivery, assessing the experience as a whole rather than just an aggregation of individual metrics. Even without AI it has been possible to detect and quantify impairments such as buffering and stuttering, as well as more specific parameters such as resolution, bit rate, frame rate, and the ability to display colors accurately after transmission.
Machine learning algorithms can home in on the results of human assessment, as used to be de rigueur in the heyday of Mean Opinion Scores (MOS). Indeed, crowdsourcing can be applied to conduct surveys of video quality perception, valuable for training AI models to provide meaningful assessments.
But before delving further in this latest The Broadcast Bridge series on video quality, it is worth establishing clear definitions. There is a fundamental distinction between source and destination, with source material further divided into archive and newly created content. Newly created material can in turn be sub-divided into content captured by camera and content generated automatically, whether by legacy CGI (Computer Generated Imagery) or contemporary Gen AI techniques.
The intrusion of AI into CGI has led to particular confusion over terminology. This is partly because CGI predates the Gen AI era by some 66 years, if we count Alfred Hitchcock’s Vertigo in 1958 as the first example of CGI, with its use of an in-camera mechanism to distort the image. Much later, CGI as we know it now was employed in various domains, including flight simulators and then CT scanners, generating 3D models of internal organs or tissues from multiple single-slice 2D x-rays.
CGI was then employed with growing sophistication in movie making, still without AI assistance. It still required considerable human input and did not on its own yield improvements in video quality. The advent of Gen AI has changed that, becoming conflated with CGI in the context of video generation and enhancement while also entering the domain of video quality. The role of Gen AI in video quality assurance will be analyzed further in a later article focusing exclusively on that topic.
Whatever technologies are employed, the objective on the distribution side is to transmit the best quality video possible under given cost and logistical constraints, usually subject to encoding and, in the OTT case, adaptive bit rate streaming (ABRS). With ABRS aiming to deliver the best quality possible under prevailing network conditions by encoding multiple versions as chunked streams at different bit rates, it became even more critical to measure the resulting experience as accurately as possible.
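To make the mechanism concrete, the following minimal sketch, with entirely illustrative renditions and figures rather than those of any particular service, shows the kind of ladder an ABRS service encodes and the throughput-driven selection a player might perform.

```python
# Hypothetical ABR ladder: each rendition is encoded once, then served as chunks.
LADDER = [
    {"name": "1080p", "bitrate_kbps": 5000},
    {"name": "720p",  "bitrate_kbps": 3000},
    {"name": "480p",  "bitrate_kbps": 1500},
    {"name": "360p",  "bitrate_kbps": 800},
]

def select_rendition(measured_throughput_kbps: float, headroom: float = 0.8) -> dict:
    """Pick the highest-bitrate rendition that fits within measured
    throughput, leaving headroom for network variability."""
    budget = measured_throughput_kbps * headroom
    for rendition in LADDER:  # ordered highest bitrate first
        if rendition["bitrate_kbps"] <= budget:
            return rendition
    return LADDER[-1]  # fall back to the lowest rung

print(select_rendition(4200))  # -> {'name': '720p', 'bitrate_kbps': 3000}
```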
Measurements had to cater for varying levels of information about both the originating and received content, leading to several options around three broad categories: Full Reference (FR), Reduced Reference (RR), and No Reference (NR) methods.
FR models calculate the difference between the original and received video signals exhaustively, pixel by pixel, regardless of whether encoding or transmission introduced the discrepancies. This yields the most accurate assessment of quality, but at a computational cost that is infeasible for many streaming services given the amount of content they now serve. In any case, the original reference signal is not always available.
RR models compromise by extracting just a subset of features from the source and target videos, reducing computational load at the expense of some accuracy. NR models then assess the quality of a received video, after the impacts of encoding and transmission, without access to the source at all. Some NR methods look for specific types of degradation, such as blurring, and attempt to equate these with overall perceived quality.
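As a concrete illustration of such an NR cue, the sketch below computes the variance of the Laplacian, a classic blur indicator, using OpenCV. The file name and threshold are hypothetical, and any real deployment would tune the threshold against perceptual data.

```python
import cv2

def blur_score(frame_bgr) -> float:
    """Variance of the Laplacian: low values suggest a blurry frame.
    This is a classic no-reference cue, not a full quality model."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

frame = cv2.imread("frame.png")   # hypothetical captured frame
if blur_score(frame) < 100.0:     # illustrative threshold, needs tuning
    print("Frame looks blurry; flag for review")
```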
The rise of User Generated Content (UGC) has increased demand for NR methods, since by definition there is no information about source quality. Google, as the pioneer and highest volume distributor of UGC with YouTube, has been working longest on the conundrum of video quality assessment in the absence of conventional reference metrics. Such legacy metrics include Peak Signal-to-Noise Ratio (PSNR) and Video Multi-Method Assessment Fusion (VMAF), developed by Netflix with help from the University of Southern California and made generally available as open source.
PSNR is the ratio between the maximum possible power of a signal and the power of the corrupting noise that affects the fidelity of its representation, usually expressed in decibels. It has commonly been applied to quantify reconstruction quality for video after lossy compression, where information has been permanently discarded.
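Since PSNR reduces to a simple formula over the mean squared error between reference and received frames, a minimal numpy sketch, assuming 8-bit frames, is enough to show the calculation.

```python
import numpy as np

def psnr(reference: np.ndarray, received: np.ndarray, max_value: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in decibels: 10 * log10(MAX^2 / MSE).
    For 8-bit video, MAX is 255."""
    mse = np.mean((reference.astype(np.float64) - received.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames, no distortion
    return 10.0 * np.log10((max_value ** 2) / mse)
```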
VMAF predicts subjective video quality by taking a reference signal and then accounting for distortion resulting from specific video codecs, encoding settings, or transmission conditions. Netflix was then able to optimize its encodes for quality across its on demand content library, including both its own creations and third party assets.
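In practice VMAF is usually computed with existing tooling rather than reimplemented. The sketch below shells out to FFmpeg's libvmaf filter from Python, assuming an FFmpeg build with libvmaf enabled; the file names are hypothetical.

```python
import subprocess

# Full-reference comparison: first input is the distorted (received) version,
# second is the pristine source. The VMAF score appears in FFmpeg's log.
result = subprocess.run(
    [
        "ffmpeg",
        "-i", "distorted.mp4",   # hypothetical received / encoded file
        "-i", "reference.mp4",   # hypothetical pristine source
        "-lavfi", "libvmaf",
        "-f", "null", "-",       # discard decoded output, keep the log
    ],
    capture_output=True, text=True,
)
print(result.stderr)  # contains a line with the computed VMAF score
```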
But this was of little use for YouTube, which developed a model called Universal Video Quality (UVQ) to evaluate UGC video quality and provide a foundation for improving it through optimal transmission and potentially AI enhancement. In essence, YouTube developed a machine learning algorithm to map features of a video to MOS by analyzing millions of raw videos in its collection.
This involved first employing an internal crowd-sourcing platform to collect mean opinion scores of quality on an ascending scale of 1–5, where 5 is the best, for its no-reference (NR) case. It then collected ground-truth labels from the YouTube-UGC dataset and categorized the factors that affect quality perception into three high-level categories. Ground truth in this context is information about quality known to correctly reflect human perception, obtained through direct questioning or measurement; it is empirical or experimental evidence, rather than inferred indirectly.
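A MOS itself is simply the arithmetic mean of the individual opinion scores collected for a clip, as the toy calculation below, with made-up ratings, illustrates.

```python
# Hypothetical crowd-sourced ratings for one clip, on the 1-5 scale.
ratings = [4, 5, 3, 4, 4, 2, 5]
mos = sum(ratings) / len(ratings)
print(f"MOS = {mos:.2f}")  # MOS = 3.86
```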
The three categories were content, distortions, and compression, all of which affect visual perception. If there is no meaningful content, the calculated MOS will be low. Similarly, both distortions introduced during video production and artefacts resulting from subsequent compression will reduce MOS scores.
Machine learning models, with their ability to segregate data sets to match target attributes, can be trained to match MOS to various features extracted from a given video. Google applied self-supervised learning (SSL) here, which was necessary given the absence of sufficiently large, readily categorized UGC data sets based on ground truth.
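As a hedged sketch of the supervised end of that mapping, assuming a table of pre-extracted features paired with crowd-sourced MOS labels (both synthetic here), a simple regressor can be fitted along the following lines. YouTube's actual UVQ model is a deep network, not this baseline.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: rows of pre-extracted features
# (e.g. blur, blockiness, bitrate) paired with MOS labels on a 1-5 scale.
rng = np.random.default_rng(0)
features = rng.random((500, 3))
mos = 1.0 + 4.0 * rng.random(500)

X_train, X_test, y_train, y_test = train_test_split(features, mos, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
predicted_mos = model.predict(X_test)  # estimated quality scores for unseen clips
```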
SSL models are trained on tasks using the data itself to generate supervisory signals designed to lead towards eventual convergence, rather than relying on external labels provided by humans. In the video context, SSL aims to converge on relevant features or relationships in the data. It is supposed to emulate the way humans learn to classify objects in the absence of direct supervision.
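One way to picture this, offered as an illustration rather than Google's actual recipe, is a pretext task in which pristine frames are synthetically degraded at known severities, so the data generates its own ordering labels without any human rater.

```python
import numpy as np

def degrade(frame: np.ndarray, severity: int) -> np.ndarray:
    """Synthetic distortion: coarser quantization at higher severity.
    A toy stand-in for real degradations such as compression or blur."""
    step = 2 ** severity
    return (frame // step) * step

def make_ssl_pairs(frame: np.ndarray, levels=(1, 3, 5)):
    """The data supplies its own supervisory signal: each degraded copy
    is labeled with its known severity, no human rating required."""
    return [(degrade(frame, s), s) for s in levels]

pristine = np.random.default_rng(0).integers(0, 256, (64, 64), dtype=np.uint8)
pairs = make_ssl_pairs(pristine)  # [(degraded_frame, severity_label), ...]
```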
There is a general trend in AI away from supervised learning towards unsupervised learning, because the latter is more flexible, scales better to complex tasks and large data sets, and enables model training to become more automated. But it is work in progress, and so there is plenty of scope for improvement in measuring the quality of user generated content.
This is relevant not just for YouTube and social media giants such as Facebook, whose platforms handle a lot of user generated content, but also for broadcasters and video service providers. An increasing amount of content is produced remotely at varying levels of quality, with growing demand for fast and accurate quality measurement so that delivery can be optimized as far as possible.
Effective optimization in turn depends on being able to identify where in the delivery chain quality issues arise. Problems often cannot be fixed instantly in real time, but can be resolved relatively quickly, while in some cases the information will instead inform ongoing network enhancement.
This comes under the heading of root cause analysis, which inevitably requires monitoring at multiple points across a delivery chain, in order to assess how quality changes during distribution and home in on problems. These can arise at the headend, during encoding, somewhere within a CDN (or one of multiple CDNs), during final delivery over an ISP network, in the user’s home over, say, WiFi, or in the client device itself.
This is a challenge for tool makers and monitoring aggregators or integrators, since service providers want to be able to manage and troubleshoot from a single dashboard fed by multiple points. We will drill into these issues in further articles in this series.