Future Technologies: New Hardware Paradigms

As we continue our series of articles considering technologies of the near future and how they might transform how we think about broadcast, we turn to the potential processing paradigm shift offered by GPU-based processing.

GPUs are appearing more and more in broadcast COTS server infrastructures, allowing broadcasters to achieve massively parallel processing, especially for video. But we’re not necessarily using the video output of the GPU, as there’s much more to a GPU than meets the eye.

Video processing makes unprecedented demands on servers and their processing infrastructures due to the relentless nature of the data generated by uncompressed video acquisition devices. FPGAs deal with this because the device can be thought of as one huge assignable parallel logic block that can be configured in a whole host of different ways to deliver any type of processing needed. The latency in FPGAs is generally of the order of tens of serial clock cycles, making their processing appear near instantaneous. However, they still rely on custom hardware.

Serial Processing Bottlenecks

One of the motivations for transitioning to COTS-type datacenters is the amount of flexibility they provide, but they do suffer from one fundamental issue: they process data serially. High-end servers certainly have multiple processor cores on each CPU with multiple threads running on them, but they rely on software scheduling, which in turn introduces massive variance in the timing of their asynchronous operation. And the cores on the CPUs are limited in their data throughput.

Virtually all COTS-type servers are based on the Von Neumann architecture. This architecture stores instructions and data in the same memory, and processing occurs one instruction at a time. It works well for sequential operations, but even when parallel processing is adopted using multicore and multithreaded operation, the result is just an adaptation of the sequential processing architecture.

Weaknesses soon become apparent when we look at an operation such as an SD-to-HD upconverter. To create the new data that fills the additional HD lines, interpolation filters are often employed. Assuming a ten-tap filter, ten samples must be continuously available to the multipliers so that the product of each sample and its coefficient can be found before the results are summed and then normalized using a divide circuit. This is relatively straightforward in an FPGA as all the data can be moved in parallel and then presented to optimized hardware multipliers. However, the same cannot be said of the Von Neumann processing architecture.
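
Expressed in code, the arithmetic itself is trivial; the problem is how the data reaches it. The sketch below is purely illustrative, assuming 16-bit fixed-point samples and filter coefficients that sum to 256. On a CPU, each of the ten samples must be fetched over the shared address and data bus before the multiply, sum, and normalize steps can complete, whereas an FPGA presents all ten to its multipliers at once.

```cpp
// Illustrative ten-tap interpolation for one output sample (not a broadcast-grade filter).
#include <cstdint>

int16_t interpolateSample(const int16_t* line, int pos, const int16_t coeff[10]) {
    int32_t acc = 0;
    for (int tap = 0; tap < 10; ++tap) {
        // Each read is another trip across the server's address and data bus.
        acc += static_cast<int32_t>(line[pos + tap]) * coeff[tap];
    }
    return static_cast<int16_t>(acc / 256);   // normalize (coefficients assumed to sum to 256)
}
```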

Constrained Data Throughput

A multicore CPU can certainly process each sample in parallel to provide the multiplication function. But the fundamental challenge is that the video frame of data resides in one area of memory, so each adjacent sample (or pixel) must be moved into the CPU's processor cores. There may well be one hundred processor cores on a single CPU, so this sounds relatively trivial, but the challenge occurs when it becomes clear that the data needs to be streamed across the same address and data bus to reach all the processor cores. And this is further compounded when all the samples must be added together after their multiplication and then normalized, adding many more address and data bus operations. If we take this to its logical conclusion, where many pixels need to be processed simultaneously, then we soon reach the point where we experience address and data bus bottlenecks. In this case, the limitation of the server is not the number of operations that can occur simultaneously, but the maximum data throughput of the server's address and data buses.
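
To put rough numbers on that bottleneck, here is a hedged back-of-envelope sketch. It assumes 1080p50 10-bit 4:2:2 video (roughly 20 bits per pixel of active picture) and a worst case in which every one of the ten filter taps is fetched from system memory; real caches and prefetchers will claw some of this back, but the order of magnitude is the point.

```cpp
// Back-of-envelope bus traffic for a naive ten-tap filter running on the CPU.
#include <cstdio>

int main() {
    const double pixelsPerSecond = 1920.0 * 1080.0 * 50.0;  // ~103.7 Mpixel/s for 1080p50
    const double bitsPerPixel    = 20.0;                    // 10-bit 4:2:2 average
    const double videoGbps       = pixelsPerSecond * bitsPerPixel / 1e9;

    const int    taps            = 10;                      // interpolation filter length
    const double readGbps        = videoGbps * taps;        // worst case: every tap re-fetched

    std::printf("Raw active video rate: %.2f Gbit/s\n", videoGbps);  // ~2.1 Gbit/s
    std::printf("Filter read traffic:   %.2f Gbit/s\n", readGbps);   // ~20.7 Gbit/s per stream
    return 0;
}
```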

Hardware Tweaking

In on-prem solutions the software designer has more control over the hardware, and they must employ this control to maintain the necessary data throughput and processing. But we have less (or no) control over public cloud hardware, so we can see why systems don’t always scale.

The good news is that we know real-time uncompressed video processing works because there are several vendors demonstrating the technology. But the Von Neumann architecture alone is a massive bottleneck for video processing. The even better news is that there is a solution on the horizon through the adoption of GPUs.

GPUs started life as a simple memory buffer that the CPU could write to; the GPU would take care of converting the image memory matrix into a raster scan for the monitor. Over the years, more and more video-centric functions have been off-loaded to the GPU hardware. In the extreme, the CPU treats the video as objects rather than pixel maps and sends instructions to the GPU on how to display the image. Anybody who has programmed video graphics using APIs such as OpenGL will understand the power of this design philosophy. In essence, we’re restricting the amount of video data that is relentlessly shifted around the server's architecture.

Hardware Acceleration

As GPU technology has developed, more and more software functions have been offloaded to custom hardware within the device. Image processing relies heavily on a branch of mathematics called linear algebra, and GPUs provide highly optimized hardware for the multiplication, division, addition, and subtraction it requires. Linear algebra also lends itself to matrix processing, where the same mathematical function acts on every element in parallel. If we think of a matrix with each of its elements as a pixel, each with a hardware-accelerated mathematical block associated with it, then we can see how GPU technology has the potential to massively accelerate video, audio, and metadata processing. In effect, a dedicated mathematical processing engine is associated with every pixel in the image, which is why GPUs have thousands of data processing units, all working in parallel with little chance of memory address and data bus bottlenecks occurring.
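
As a minimal sketch of this one-unit-per-pixel idea, the CUDA kernel below applies a simple gain and offset to every luma sample of a frame, with each GPU thread owning exactly one pixel. The kernel and launch parameters are illustrative assumptions rather than anything taken from a particular product, but the pattern is the one that underpins real GPU video processing.

```cuda
// Element-wise pixel processing: one thread per pixel, thousands in flight at once.
__global__ void gainOffsetKernel(const float* in, float* out,
                                 int width, int height,
                                 float gain, float offset) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        int idx = y * width + x;              // this thread's pixel
        out[idx] = in[idx] * gain + offset;   // multiply and add, all pixels in parallel
    }
}

// Host-side launch: a 16x16 thread block tiled across the frame.
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// gainOffsetKernel<<<grid, block>>>(d_in, d_out, width, height, 1.1f, 0.0f);
```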

It’s fair to say that we’re still limited by the speed with which we can shift the video data to and from the GPU. In traditional server architectures, the media stream would be transferred from the NIC to system memory by the CPU, and then transferred on to the GPU. This method still presents the bottlenecks already highlighted; however, the ultra-high-speed PCIe buses now available provide a solution, as newer server technologies use the PCIe bus to transfer media data directly from the NIC to the GPU and back again. Another, vendor-specific, solution is to provide a dedicated inter-PCB connection between the NIC and the GPU, thus bypassing the server's buses altogether.
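
The sketch below shows the traditional, staged path that these direct NIC-to-GPU designs remove. It is a hedged illustration only: receiveFrameInto() is a hypothetical stand-in for whatever the NIC driver or network stack actually provides, while the CUDA calls are the standard ones used to stage data through pinned system memory.

```cuda
// Traditional staged transfer: NIC -> pinned system memory -> PCIe copy -> GPU.
// Direct NIC-to-GPU transfer removes the second hop and the system memory staging.
#include <cuda_runtime.h>
#include <cstring>
#include <cstddef>

// Hypothetical stand-in for the NIC receive path; here it simply zero-fills the buffer.
static void receiveFrameInto(void* dst, size_t bytes) { std::memset(dst, 0, bytes); }

void stagedFrameTransfer(void* d_frame, size_t frameBytes, cudaStream_t stream) {
    void* h_staging = nullptr;
    cudaHostAlloc(&h_staging, frameBytes, cudaHostAllocDefault);  // pinned host buffer

    receiveFrameInto(h_staging, frameBytes);                      // hop 1: NIC into system memory

    cudaMemcpyAsync(d_frame, h_staging, frameBytes,               // hop 2: across PCIe to the GPU
                    cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaFreeHost(h_staging);
}
```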

Server System Memory Bypass

By bypassing the server’s system memory and transferring video data directly between the NIC and GPU, we have the ultimate in video processing architectures. This method does rely on server architectures that are not necessarily mainstream, but the servers, NICs, and GPUs are all COTS components, so such systems can be, and are being, built with ease. This does provide some challenges for public cloud infrastructures, as the hardware connectivity within the server needs to be specified, but many public cloud service providers see this as an opportunity for new business and will hopefully make such hardware available in their infrastructures.

Another interesting idea is that we’ve almost come full circle. If we take the next logical step, which is to add a high-speed NIC to the GPU circuit board, thus providing direct connectivity between the network and the GPU with very low latency and a significantly reduced risk of transfer bottlenecks, then what is the point of using a traditional COTS server? It’s only acting as a mechanical hosting, configuration, and monitoring device for the NIC-GPU circuit board. Maybe this is the future of video processing in COTS datacenters, whether on- or off-prem?
