Up conversion must retain integrity of original video source and not just look good.
The ability of upscaling to give new life to old content, as well as to demonstrate the full value of the latest 8K TVs, has already been well proven through traditional techniques.
Meanwhile, reasonable results have been obtained through more advanced automated approaches, both within TV sets themselves and in processing of content prior to distribution, but these have involved compromise and failed to get close to what would be native quality at the target resolution, whether this is 4K or 8K.
The ideal approach would combine the quality obtainable through intensive manual efforts with an automated approach capable of operating in as near real time as possible for live or linear content in the TV. At the very least it should be capable of up converting on-demand and archive content cost-effectively for subsequent distribution. This has become a focus for a growing number of CE and infrastructure vendors.
MediaKind, the video technology business recently divested by Ericsson to private equity firm One Equity Partners, has just staged a webinar outlining its developments in the field, aiming to refine existing machine learning approaches to the task. The bottom line is that such methods should yield substantially better results than the proven non-AI up conversion technologies already widely deployed by both broadcasters and TV makers. The essential idea is to employ an iterative approach incorporating advanced versions of statistical regression to converge the algorithms towards the quality obtained natively at the desired resolution. In other words, content up converted from HD to 4K, for example, should look as close as possible to the same material captured natively in the latter format.
As MediaKind’s Principal Technologist Tony Jones noted in that webinar, up conversion using traditional methods offline is generally not much better than can be done on the fly by leading 4K TV sets. These yield pictures noticeably inferior to native 4K capture. Machine learning has then been applied, with the basic approach being to compare the output of the up conversion with the images obtained by native capture. The difference is computed and fed back into the model so that after repeating the process the error is reduced.
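The feedback loop described here can be illustrated with a deliberately tiny sketch. It is not MediaKind's method: the 1-D sine "signal", the two-weight interpolator and the learning rate are all assumptions chosen only to show the principle of comparing the up converted output with native capture and feeding the difference back into the model.

```python
import math

# Hypothetical toy data: a smooth 1-D "native" signal and a half-resolution copy.
native = [math.sin(0.3 * i) for i in range(40)]
low = native[::2]  # keeps samples 0, 2, 4, ... (the "HD" version)


def train(steps=200, lr=0.1):
    """Learn two weights that predict each missing sample from its two neighbours."""
    w0, w1 = 0.0, 0.0
    losses = []
    # Each pair: left neighbour, right neighbour, and the native sample between them.
    pairs = [(low[k], low[k + 1], native[2 * k + 1]) for k in range(len(low) - 1)]
    n = len(pairs)
    for _ in range(steps):
        g0 = g1 = loss = 0.0
        for left, right, target in pairs:
            err = (w0 * left + w1 * right) - target  # prediction minus native capture
            loss += err * err
            g0 += 2 * err * left
            g1 += 2 * err * right
        losses.append(loss / n)
        # Feed the error back: move the weights against the gradient of the error.
        w0 -= lr * g0 / n
        w1 -= lr * g1 / n
    return (w0, w1), losses


weights, losses = train()
print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.6f}, weights {weights}")
```

Real systems replace the two weights with millions of network parameters and the 1-D signal with video frames, but the loop of predict, compare and correct is the same.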
In theory, each successive pass yields output nearer to native capture, so the algorithm is presented with input that is closer to the desired result each time. In practice, full convergence never occurs and artefacts are created, because ultimately the algorithms have to estimate how to fill in the gaps: a 4K image comprises four times as many pixels as a full HD one at 1080p, and an 8K image four times as many again.
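The pixel arithmetic can be checked in a few lines. The sketch below assumes the consumer UHD frame sizes (3840x2160 for 4K and 7680x4320 for 8K), which the article does not state explicitly, and the "fraction estimated" figure is a simplification that treats every source pixel as surviving unchanged in the output.

```python
# Assumed frame sizes (width, height) for each format.
RESOLUTIONS = {
    "1080p": (1920, 1080),
    "4K UHD": (3840, 2160),
    "8K UHD": (7680, 4320),
}


def pixel_count(name):
    width, height = RESOLUTIONS[name]
    return width * height


def fraction_estimated(source, target):
    """Share of target pixels that have no counterpart in the source."""
    return 1 - pixel_count(source) / pixel_count(target)


for name in RESOLUTIONS:
    print(name, pixel_count(name))

print(fraction_estimated("1080p", "4K UHD"))  # three quarters of the 4K frame
print(fraction_estimated("1080p", "8K UHD"))  # fifteen sixteenths of the 8K frame
```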
Machine learning has also faced the common problem of scalability shared by conventional video processing techniques. This threatened to make the computational task too intensive even for contemporary GPUs (Graphics Processing Units) and other hardware designs when scaling up to 4K and 8K. This was the case whether the application was, say, object recognition or a translation task such as up conversion.
However, advances in the application of convolutional neural networks (CNNs) offered a solution, by breaking the task into manageable components that can then be stitched together. CNNs are modelled on the mammalian visual cortex, which processes information locally across the field of vision to yield complete images. Not surprisingly, therefore, CNNs have been widely applied in image processing, especially for facial recognition and object categorization, such as distinguishing different types of animals.
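The local processing that gives CNNs their name can be sketched as a small kernel sliding across an image, with each output value depending only on a small neighbourhood of input pixels. The example below is a plain-Python illustration with a hand-written edge kernel; in a trained network the kernel values are learned rather than fixed, and the image and kernel here are hypothetical.

```python
def conv2d(image, kernel):
    """Slide a small kernel over the image (no padding), as a CNN layer does."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[y + i][x + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


# Hypothetical 4x5 image: dark on the left, bright on the right.
image = [[0, 0, 0, 1, 1] for _ in range(4)]

# Hand-written kernel that responds to left-to-right brightness increases.
edge_kernel = [[-1, 0, 1] for _ in range(3)]

print(conv2d(image, edge_kernel))  # zero in the flat area, non-zero at the edge
```

Because the same small kernel is reused at every position, the cost grows with the number of pixels rather than with the number of possible pixel-to-pixel connections, which is what makes the approach tractable at 4K and 8K.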
However, up conversion is a rather different challenge, requiring fidelity to the original presumed image and estimation of added detail where relevant, rather than recognition. CNNs ran into problems when trained for up conversion because the iterative improvements would sometimes fizzle out well before convergence had occurred, leaving considerable errors. This was partially resolved by using a slightly different machine learning model based on residual neural networks (ResNets), in which layers of the network hierarchy can be bypassed. Data can then be carried straight from the training source where relevant and fed directly to the predicted version, element by element.
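A minimal sketch of the bypass idea follows, assuming the simplest possible arrangement: a single skip connection that carries a plainly upsampled copy of the input around the network, so that the network only has to predict the residual detail to add. The `predict_residual` argument is a hypothetical stand-in for a trained network, and real ResNets use many such skips between internal layers.

```python
def upsample_nearest(image, factor=2):
    """Repeat each pixel `factor` times horizontally and vertically."""
    return [
        [pixel for pixel in row for _ in range(factor)]
        for row in image
        for _ in range(factor)
    ]


def residual_upscale(image, predict_residual, factor=2):
    """Skip connection: output = upsampled input + predicted residual detail."""
    base = upsample_nearest(image, factor)
    residual = predict_residual(base)
    return [
        [b + r for b, r in zip(base_row, res_row)]
        for base_row, res_row in zip(base, residual)
    ]


def zero_residual(base):
    """Placeholder 'network' that has learned nothing yet."""
    return [[0 for _ in row] for row in base]


small = [[1, 2], [3, 4]]
print(residual_upscale(small, zero_residual))
```

Even with a network that contributes nothing, the output is already a usable upscale, which is one reason residual designs tend to be easier to train than networks that must reconstruct the whole picture from scratch.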
This in turn ran into a problem of bias whereby the up conversion prediction was too dependent on the sets of videos used for training. A fundamental problem outlined by Jones in that webinar is that most video frames are lacking in detail across much of their surface, which gives little for the predictive model to chew on. This can lead to up conversions that may look convincing to viewers but do not accurately reflect the actual source footage that would have been captured natively.
The best solution given the current state of the science is to run two separate neural networks side by side: a generator that produces the final up converted output, and a discriminator used only during training to judge that output. The discriminator is trained on lots of real images, say at 4K resolution if that is the target, and learns with increasing sensitivity how to tell real video from simulated. The generator in turn learns how to fool the discriminator by making its predicted output resemble the real video as closely as possible. The idea is that this combative approach, pitting one neural network against another in what is known as a generative adversarial network (GAN), reduces the gap between predicted output and source at the given resolution.
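The contest between the two networks is usually written as a minimax objective. The webinar as reported does not give a formula, so the expression below is the standard generative adversarial formulation rather than MediaKind's own, with symbols introduced here for illustration: $G$ is the network that produces the up converted frame, $D$ is the network that scores a frame as real (near 1) or simulated (near 0), $x$ is a natively captured frame at the target resolution and $y$ is a lower resolution input.

```latex
\min_{G} \max_{D} \;
  \mathbb{E}_{x}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{y}\bigl[\log\bigl(1 - D(G(y))\bigr)\bigr]
```

$D$ is trained to make both terms large by scoring native frames high and up converted frames low, while $G$ is trained to make the second term small by producing frames that $D$ scores as real. Practical super-resolution systems typically add a fidelity term that ties $G(y)$ to the true frame, so that the result is accurate and not merely plausible.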
Tony Jones is MediaKind’s evangelist for machine learning based video up conversion.
MediaKind’s demonstration showed that there is still room for further improvement, but that this double network approach gets much closer to native quality than either traditional non-AI upscaling or earlier machine learning approaches. “So while we can’t fully replace native UHD sources, because you always lose something, it certainly gives a UHD-like impression,” said Jones.
At least when up converting on-demand or archived content there’s the luxury of time to perfect the process. This is not the case for upscaling in the TV itself, which has to be performed in real time and inevitably requires some short cuts. That is why TV upscaling has generally not yielded the same quality as is possible with the latest AI based approaches, but Samsung has come closest by most reckonings with its machine learning approach, driven by its 2020 business strategy for TV sales of 8K or bust.
Faced with being undercut by Chinese and other makers of 4K TVs, Samsung has sought the high ground of top end QLED (quantum dot LED) 8K sets, exploiting up conversion to make up for the lack of content actually shot in that format. Samsung up converts not just 4K content into 8K, but even HD, although the results are not then as good. The basic principle is the same as that employed by MediaKind, that is to take high resolution images, then downgrade them to lower resolutions, tracking what visual data is lost. The process is then reversed to train the model to take those low resolution images and fill in the missing data to convert them back to high resolution.
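The downgrade-then-reverse principle amounts to manufacturing training pairs from high resolution material. The sketch below is an assumption-laden illustration and not Samsung's or MediaKind's pipeline: it uses a simple 2x2 averaging filter as the downgrade step and nearest-neighbour repetition as the naive upscale, and records the difference between the two as the detail a model would have to learn to restore.

```python
def downscale_2x(image):
    """Average each 2x2 block (image height and width must be even)."""
    return [
        [
            (image[y][x] + image[y][x + 1] + image[y + 1][x] + image[y + 1][x + 1]) / 4
            for x in range(0, len(image[0]), 2)
        ]
        for y in range(0, len(image), 2)
    ]


def upscale_2x(image):
    """Naive upscale: repeat every pixel twice in each direction."""
    return [[p for p in row for _ in range(2)] for row in image for _ in range(2)]


def make_training_pair(high_res):
    """Return (input, target, lost_detail) for one high resolution image."""
    low_res = downscale_2x(high_res)
    naive = upscale_2x(low_res)
    lost = [
        [h - n for h, n in zip(h_row, n_row)]
        for h_row, n_row in zip(high_res, naive)
    ]
    return low_res, high_res, lost


# Hypothetical 4x4 "high resolution" image.
high = [
    [1, 3, 5, 7],
    [1, 3, 5, 7],
    [2, 2, 6, 6],
    [4, 4, 6, 6],
]
low, target, lost = make_training_pair(high)
print(low)
print(lost)
```

The `lost` values are exactly what is thrown away by the downgrade; a trained model is asked to predict them from `low` alone.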
But to shorten the process of filling in the data, Samsung has created aids under two related headings. Firstly there is a set of image databases, comprising both low resolution and high resolution versions of given objects such as apples or individual letters. It uses these to repair both edges and textures of images during up conversion. Secondly, Samsung employs filters that modify this process of object restoration to the particular content genre, such as a category of movie or specific sport. This will change how much detail is added to particular objects. Of course there is a lot more to it, but Samsung has demonstrated that upscaling in the TV itself can achieve considerable quality improvement and in some cases avoid the need for up conversion in the field.
There is also temporal upscaling, which is becoming increasingly important given the rise of HFR (High Frame Rate) content. This is particularly required for sports and fast moving content at high resolutions, especially 8K, where legacy frame rates can inhibit perceptual quality improvements for viewers. In this case it is detail residing in intermediate frames that is missing, rather than pixels within a frame, raising different challenges.
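To see why intermediate frames raise a different challenge, consider the simplest possible way to double a frame rate: inserting the average of each pair of neighbouring frames. The sketch below is only that naive baseline, written with hypothetical one-row frames, and is not how motion-compensated converters work.

```python
def blend(frame_a, frame_b):
    """Pixel-by-pixel average of two frames of equal size."""
    return [
        [(a + b) / 2 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(frame_a, frame_b)
    ]


def double_frame_rate(frames):
    """Insert one blended frame between every pair of consecutive frames."""
    if not frames:
        return []
    output = [frames[0]]
    for previous, following in zip(frames, frames[1:]):
        output.append(blend(previous, following))
        output.append(following)
    return output


# Hypothetical clip: a bright pixel jumps from the left edge to the right edge.
clip = [
    [[10, 0, 0]],
    [[0, 0, 10]],
]
for frame in double_frame_rate(clip):
    print(frame)
```

The inserted frame shows two half-bright pixels at the start and end positions rather than one bright pixel in the middle. That ghosting is why serious frame rate conversion estimates the motion of objects between frames and places them where they would have been, instead of simply mixing the two pictures.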
Grass Valley’s Alchemist is one of the leading products for motion-compensated frame rate conversion, which operates without changing the resolution, increasingly at 4K. One challenge is to handle stationary objects overlaid by moving parts, with the risk of falsely making the former appear to shift relative position within a frame. Another is how to handle transient content such as graphics or captions, or rapid shifts in perspective through zooming, which need to be represented smoothly when effectively injecting new frames into the content.
Increasingly there will be demand to up convert both resolution and frame rate at the same time, perhaps adding High Dynamic Range to boot, since high quality perception will call for simultaneous gains across all these parameters. Again, with archived and on-demand content this can be achieved more readily and can even be done using publicly available tools.
For example, YouTube content creator Denis Shiryaev recently upscaled a well-known 1911 film of New York City to 4K, while raising the frame rate to 60 fps (frames per second) and introducing color. In this case, Google’s DAIN was used to create and insert the frames needed to reach 60 fps, while Topaz Labs’ Gigapixel AI did the up conversion to 4K.