Future Technologies: New Hardware Paradigms

As we continue our series on technologies of the near future and how they might transform the way we think about broadcast, we turn to the potential processing paradigm shift offered by GPU-based processing.

GPUs are appearing more and more in broadcast COTS server infrastructures, allowing broadcasters to achieve massively parallel processing, especially for video. But we’re not necessarily using the video output of the GPU, as there’s much more to a GPU than meets the eye.

Video processing makes unprecedented demands on servers and their processing infrastructures due to the relentless flow of data generated by uncompressed video acquisition devices. FPGAs deal with this because the device can be thought of as one huge assignable parallel logic block that can be configured in a whole host of different ways to deliver any type of processing needed. The latency in FPGAs is generally in the order of tens of serial clock cycles, making their processing appear near instantaneous. However, they still rely on custom hardware.

Serial Processing Bottlenecks

One of the motivations for transitioning to COTS-type datacenters is the amount of flexibility they provide, but they suffer from one fundamental issue: they process data serially. High-end servers certainly have multiple cores on a single CPU with multiple threads running on them, but they rely on software scheduling, which in turn introduces massive variance in the timing of their asynchronous operation. And the cores within the CPUs are limited in their data throughput.

Virtually all COTS-type servers are based on the Von Neumann architecture. This architecture stores instructions and data in the same memory, and processing occurs one instruction at a time. This works well for sequential operations, but even when parallel processing is adopted using multicore and multithreaded operation, it turns out to be just an adaptation of the same sequential processing architecture.

Weaknesses soon become apparent when we look at an operation such as an SD-to-HD upconverter. To create the new data needed to fill the additional HD lines, interpolation filters are often employed. Assuming a ten-tap filter, ten samples must be continuously available to the multipliers so that the product of each sample and its coefficient can be found before being summed and then normalized using a divide circuit. This is relatively straightforward in an FPGA as all the data can be moved in parallel and then presented to optimized hardware multipliers. However, the same cannot be said of the Von Neumann processing architecture.
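To make the arithmetic concrete, here is a minimal sketch of that ten-tap multiply, sum, and normalize operation executed serially on the CPU, written as CUDA host code (plain C++). The coefficients, line width, and sample values are illustrative placeholders, not a real upconversion filter.

```cuda
// Minimal sketch of a ten-tap interpolation filter executed serially on the
// CPU, one output sample at a time. Tap values are illustrative placeholders.
#include <cstdio>
#include <vector>

int main() {
    const int   TAPS = 10;
    const float coeff[TAPS] = {1, 2, 4, 8, 16, 16, 8, 4, 2, 1};  // placeholder taps
    float norm = 0.0f;
    for (int i = 0; i < TAPS; ++i) norm += coeff[i];

    std::vector<float> line(1920, 128.0f);                // one line of luma samples
    std::vector<float> interpolated(line.size(), 0.0f);

    // Each output sample needs ten neighbouring input samples fetched from
    // memory, multiplied, summed and normalized - all sequentially.
    for (size_t x = TAPS; x < line.size(); ++x) {
        float acc = 0.0f;
        for (int t = 0; t < TAPS; ++t)
            acc += line[x - t] * coeff[t];                // ten multiply-accumulates per sample
        interpolated[x] = acc / norm;                     // normalize with a divide
    }
    std::printf("sample 100 = %f\n", interpolated[100]);
    return 0;
}
```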

Constrained Data Throughput

A multicore CPU can certainly process each sample in parallel to provide the multiplication function. But the fundamental challenge is that the video frame of data resides in one area of memory, so each adjacent sample (or pixel) must be moved into the CPU’s processor cores. There may well be one hundred processor cores on a single CPU, so this sounds relatively trivial, but the challenge occurs when it becomes clear that the data must be streamed across the same address and data bus to reach all the processor cores. And this is further compounded when all the samples must be added together after their multiplication and then normalized, adding many more address and data bus operations. If we take this to its logical conclusion, where many pixels need to be processed simultaneously, we soon reach the point where address and data bus bottlenecks appear. In this case, the limitation of the server is not the number of processes that can occur simultaneously, but the maximum data throughput of the server’s address and data buses.
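As a rough, hedged illustration of why adding cores alone does not solve this, the sketch below spreads the same placeholder filter across several CPU threads; the thread count and frame dimensions are arbitrary. Every thread still fetches its samples from the same frame in system memory, so they all share the same address and data buses.

```cuda
#include <thread>
#include <vector>

// Hypothetical dimensions for illustration only.
constexpr int WIDTH  = 1920;
constexpr int HEIGHT = 1080;
constexpr int TAPS   = 10;

// Filter a horizontal band of the frame; every thread reads from the same
// frame buffer in system memory, so all of them contend for the same buses.
void filterRows(const float* frame, float* out, const float* coeff,
                int rowStart, int rowEnd) {
    for (int y = rowStart; y < rowEnd; ++y)
        for (int x = TAPS; x < WIDTH; ++x) {
            float acc = 0.0f;
            for (int t = 0; t < TAPS; ++t)
                acc += frame[y * WIDTH + x - t] * coeff[t];
            out[y * WIDTH + x] = acc / 62.0f;   // 62 = sum of the placeholder taps
        }
}

int main() {
    std::vector<float> frame(WIDTH * HEIGHT, 128.0f), out(WIDTH * HEIGHT, 0.0f);
    const float coeff[TAPS] = {1, 2, 4, 8, 16, 16, 8, 4, 2, 1};

    // Spread the rows across a handful of threads - more threads stop helping
    // once the memory buses, rather than the cores, are saturated.
    const int nThreads = 8;
    std::vector<std::thread> pool;
    for (int i = 0; i < nThreads; ++i)
        pool.emplace_back(filterRows, frame.data(), out.data(), coeff,
                          i * HEIGHT / nThreads, (i + 1) * HEIGHT / nThreads);
    for (auto& t : pool) t.join();
    return 0;
}
```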

Hardware Tweaking

In on-prem solutions the software designer has more control over the hardware and must employ that control to maintain the necessary data throughput and processing. But we have less (or no) control over public cloud hardware, so we can see why systems don’t always scale.

The good news is that we know real-time uncompressed video processing on COTS servers works, because several vendors are demonstrating the technology. But the Von Neumann architecture alone is a massive bottleneck for video processing. The even better news is that there is a solution on the horizon through the adoption of GPUs.

GPUs started their lives as simple memory buffers that the CPU could write to, with the GPU taking care of converting the image memory matrix into a raster scan for the monitor. Over the years, more and more video-centric functions have been off-loaded to the GPU hardware. In the extreme, the CPU treats the video as objects rather than pixel maps and sends instructions to the GPU on how to display the image. Anybody who has programmed video graphics using APIs such as OpenGL will understand the power of this design philosophy. In essence, we’re restricting the amount of video data that is relentlessly shifted around the server’s architecture.

Hardware Acceleration

As GPU technology has developed, more software functions have been offloaded to custom hardware within the device. Image processing relies heavily on a branch of mathematics called linear algebra, and GPUs provide highly optimized hardware for the multiplication, division, addition, and subtraction it requires. Linear algebra also employs a method of matrix processing where the individual mathematical functions act on every element in parallel. If we think of a matrix with each of its elements as a pixel, and each pixel with a hardware-accelerated mathematical block associated with it, then we can see how GPU technology has the potential to massively accelerate video, audio, and metadata processing. In effect there is a dedicated mathematical processing engine associated with every pixel in the image, which leads to GPUs having thousands of data processing units, all working in parallel with little chance of memory address and data bus bottlenecks occurring.
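As a hedged sketch of that per-pixel parallelism, the CUDA kernel below assigns one thread to each output pixel of the same illustrative ten-tap filter used earlier; the frame dimensions and coefficients remain placeholders.

```cuda
#include <cuda_runtime.h>

constexpr int WIDTH  = 1920;
constexpr int HEIGHT = 1080;
constexpr int TAPS   = 10;

// Placeholder coefficients held in the GPU's constant memory.
__constant__ float dCoeff[TAPS] = {1, 2, 4, 8, 16, 16, 8, 4, 2, 1};

// One thread per output pixel: the multiply-accumulates for every pixel run
// in parallel across the GPU's thousands of processing units.
__global__ void tenTapFilter(const float* in, float* out) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < TAPS || x >= WIDTH || y >= HEIGHT) return;

    float acc = 0.0f;
    for (int t = 0; t < TAPS; ++t)
        acc += in[y * WIDTH + x - t] * dCoeff[t];
    out[y * WIDTH + x] = acc / 62.0f;             // 62 = sum of the placeholder taps
}

int main() {
    float *dIn = nullptr, *dOut = nullptr;
    size_t bytes = WIDTH * HEIGHT * sizeof(float);
    cudaMalloc((void**)&dIn, bytes);
    cudaMalloc((void**)&dOut, bytes);
    cudaMemset(dIn, 0, bytes);                    // stand-in for an uploaded frame

    dim3 block(16, 16);
    dim3 grid((WIDTH + block.x - 1) / block.x, (HEIGHT + block.y - 1) / block.y);
    tenTapFilter<<<grid, block>>>(dIn, dOut);
    cudaDeviceSynchronize();

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```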

It’s fair to say that we’re still limited by the speed with which we can shift the video data to and from the GPU. In traditional server architectures, the media stream would be transferred from the NIC into system memory by the CPU, and then transferred again to the GPU. This method still presents the bottlenecks already highlighted; however, the ultra-high-speed PCIe buses now available provide a solution, as newer server technologies use the PCIe bus to transfer media data directly from the NIC to the GPU and back again. Another vendor-specific solution is to provide a dedicated inter-PCB connection between the NIC and the GPU, thus bypassing the server’s buses altogether.
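For context, the sketch below expresses the traditional two-hop path in CUDA terms: the NIC delivers a frame into pinned system memory, and the host then copies it across PCIe into GPU memory. The direct NIC-to-GPU transfers described above exist precisely to remove that first hop; the frame size here is illustrative.

```cuda
#include <cuda_runtime.h>
#include <cstring>

int main() {
    const size_t frameBytes = 1920 * 1080 * 2;   // illustrative: one 8-bit 4:2:2 HD frame

    // Hop 1: the NIC (via its driver) lands the frame in pinned system memory.
    unsigned char* hostFrame = nullptr;
    cudaMallocHost((void**)&hostFrame, frameBytes);  // page-locked so DMA to the GPU is possible
    std::memset(hostFrame, 0, frameBytes);           // stand-in for the network receive

    // Hop 2: the frame is copied again, this time across PCIe into GPU memory.
    unsigned char* deviceFrame = nullptr;
    cudaMalloc((void**)&deviceFrame, frameBytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(deviceFrame, hostFrame, frameBytes,
                    cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    // A direct NIC-to-GPU transfer removes hop 1 entirely: the frame never
    // touches system memory, and the CPU is left out of the data path.

    cudaStreamDestroy(stream);
    cudaFree(deviceFrame);
    cudaFreeHost(hostFrame);
    return 0;
}
```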

Server System Memory Bypass

By bypassing the server’s system memory and transferring video data directly between the NIC and GPU, we have the ultimate in video processing architectures. This method does rely on server architectures that are not necessarily mainstream, but the servers, NICs, and GPUs are all COTS components, so such systems can be, and are being, built with relative ease. This does provide some challenges for public cloud infrastructures, as the hardware connectivity within the server needs to be specified, but many public cloud service providers see this as a potential for new business and will hopefully make such hardware available in their infrastructures.

Another interesting idea is that we’ve almost come full circle. If we take the next logical step, which is to add a high-speed NIC to the GPU circuit board, thus providing direct connectivity between the network and the GPU with very low latency and a significantly reduced risk of transfer bottlenecks, then what is the point of using a traditional COTS server? It’s only acting as a mechanical hosting, configuration, and monitoring device for the NIC-GPU circuit board. Maybe this is the future of video processing in COTS datacenters, whether on- or off-prem?
