Future Technologies: New Hardware Paradigms

As we continue our series of articles on technologies of the near future and how they might transform how we think about broadcast, we consider the potential processing paradigm shift offered by GPU-based processing.

GPUs are appearing more and more in broadcast COTS server infrastructures, allowing broadcasters to achieve massively parallel processing, especially for video. But we’re not necessarily using the video output of the GPU, as there’s much more to a GPU than meets the eye.

Video processing makes unprecedented demands on servers and their processing infrastructures due to the relentless nature of the data generated by uncompressed video acquisition devices. FPGAs deal with this well because an FPGA can be thought of as one huge assignable parallel logic block that can be configured in a whole host of different ways to deliver any type of processing needed. The latency in FPGAs is generally on the order of tens of clock cycles, making their processing appear near instantaneous. However, they still rely on custom hardware.

Serial Processing Bottlenecks

One of the motivations for transitioning to COTS-type datacenters is the flexibility they provide, but they do suffer from one fundamental issue: they process data serially. High-end servers certainly have multiple processor cores on a single CPU with multiple threads running on them, but they rely on software scheduling, which in turn introduces large variance in the timing of their operation. And the processor cores on the CPUs are limited in their data throughput.

Virtually all COTS-type servers are based on the von Neumann architecture. The architecture stores instructions and data in memory so that processing occurs one instruction at a time. This works well for sequential operations, but even when parallel processing is adopted using multicore and multithread operations, they turn out to be just an adaptation of the sequential processing architecture.

Weaknesses soon become apparent when we look at an operation such as an SD-to-HD upconverter. To create the new data that fills the additional HD lines, interpolation filters are often employed. Assuming a ten-tap filter is employed, ten samples must be continuously available to a multiplier so that each sample can be multiplied by its coefficient before the products are summed and then normalized using a divide circuit. This is relatively straightforward in an FPGA as all the data can be moved in parallel and then presented to an optimized hardware multiplier. However, the same cannot be said of the von Neumann processing architecture.
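The multiply, sum, and normalize steps of such a filter can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the ten coefficients, the 8-bit clipping range, and the function names are assumptions chosen for the example, not values from any particular upconverter.

```python
# Hypothetical ten-tap kernel: illustrative integer weights that sum to 64.
TAPS = [-1, 2, -5, 12, 24, 24, 12, -5, 2, -1]
NORM = sum(TAPS)  # the normalizing divisor


def interpolate(window):
    """Return one new sample from ten neighbouring input samples."""
    if len(window) != len(TAPS):
        raise ValueError("expected exactly ten samples")
    # Multiply each sample by its coefficient, then sum the products.
    acc = sum(sample * coeff for sample, coeff in zip(window, TAPS))
    # Normalize with a rounded integer divide, then clip to the 8-bit range.
    value = (acc + NORM // 2) // NORM
    return max(0, min(255, value))


def interpolate_column(column):
    """Slide the ten-sample window down one column of line samples."""
    last_start = len(column) - len(TAPS)
    return [interpolate(column[i:i + len(TAPS)]) for i in range(last_start + 1)]


flat = interpolate([100] * 10)            # a flat area is left unchanged
edge = interpolate([0] * 5 + [255] * 5)   # a hard edge is smoothed
```

Every call to `interpolate` needs all ten samples at once, which is the data-movement requirement discussed in the surrounding text.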

Constrained Data Throughput

A multicore CPU can certainly process the samples in parallel to provide the multiplication function. But the fundamental challenge is that the video frame of data resides in one area of memory, and so each adjacent sample (or pixel) must be moved into the CPU's processor cores. There may well be one hundred processor cores on a single CPU, so this sounds relatively trivial, but the challenge becomes clear when we consider that the data needs to be streamed across the same address and data bus to reach all the processor cores. And this is further compounded when all the samples must be added together after their multiplication and normalized, thus adding many more address and data bus operations. If we take this to the logical conclusion where many pixels need to be processed simultaneously, then we soon get to the point where we experience address and data bus bottlenecks. In this case, the limitation of the server is not the number of processes that can run simultaneously, but the maximum data throughput of the server's address and data buses.
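A rough calculation gives a feel for the scale of the problem. The sketch below assumes a 1920 x 1080 stream at 60 frames per second with 10-bit 4:2:2 sampling (20 bits per pixel on average, active picture only), and a deliberately naive filter that fetches all ten samples from memory for every output sample with no cache reuse. These figures are illustrative assumptions, not measurements of any server.

```python
WIDTH, HEIGHT, FPS = 1920, 1080, 60   # assumed HD raster, active picture only
BITS_PER_PIXEL = 20                   # 10-bit 4:2:2 averages 20 bits per pixel
TAPS = 10                             # ten-tap filter, as in the example above


def stream_bits_per_second(width, height, fps, bits_per_pixel):
    """Raw data rate of the uncompressed stream."""
    return width * height * fps * bits_per_pixel


def naive_filter_bus_bits_per_second(stream_bps, taps):
    """Bus traffic if every output sample costs `taps` reads and one write.

    Simplification: one output sample per input sample and no cache reuse.
    """
    return stream_bps * (taps + 1)


stream = stream_bits_per_second(WIDTH, HEIGHT, FPS, BITS_PER_PIXEL)
bus = naive_filter_bus_bits_per_second(stream, TAPS)
print(f"stream: {stream / 1e9:.2f} Gbit/s")
print(f"naive filter bus traffic: {bus / 1e9:.2f} Gbit/s")
```

Under these assumptions a stream of roughly 2.5 Gbit/s turns into more than 27 Gbit/s of bus traffic for a single filter stage, which is why data movement rather than core count becomes the limit.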

Hardware Tweaking

In on-prem solutions the software designer has more control over the hardware, and they must employ this control to maintain the necessary data throughput and processing. But we have less (or no) control over public cloud hardware so we can see why systems don’t always scale.

The good news is that we know real-time uncompressed video processing works because there are several vendors demonstrating the technology. But the von Neumann architecture alone is a massive bottleneck for video processing. The even better news is that there is a solution on the horizon through the adoption of GPUs.

GPUs started their lives as simple memory buffers that the CPU could write to, with the GPU taking care of converting the image memory matrix into a raster scan for the monitor. Over the years, more and more video-centric functions have been offloaded to the GPU hardware. In the extreme, the CPU treats the video as objects rather than pixel maps and then sends instructions to the GPU on how to display the image. Anybody who has programmed video graphics using APIs such as OpenGL will understand the power of this design philosophy. In essence, we’re restricting the amount of video data that is relentlessly shifted around the server's architecture.

Hardware Acceleration

As GPU technology has developed, more software functions have been offloaded to custom hardware within the device. Image processing relies heavily on a branch of mathematics called linear algebra, and GPUs provide highly optimized hardware multiplication, division, addition, and subtraction to support it. Linear algebra also employs a method of matrix processing where the individual mathematical functions act on each element in parallel. If we think of a matrix with each of its elements as a pixel, which in turn has a hardware-accelerated mathematical block associated with it, then we can see how GPU technology has the potential to massively accelerate video, audio, and metadata processing. Conceptually, there is a dedicated mathematical processing engine associated with every pixel in the image, which leads to GPUs having thousands of data processing units, all working in parallel with little chance of memory address and data bus bottlenecks occurring.
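The per-pixel model can be illustrated without a GPU. In the Python sketch below, a hypothetical `gain_offset_kernel` function computes one output pixel from one input pixel and touches nothing else, which is the shape of work a GPU distributes across its processing units. The gain and offset values and all names are assumptions made for the example, and running the kernel in a shuffled order simply stands in for parallel execution.

```python
import random


def gain_offset_kernel(index, src, dst, gain_num, gain_den, offset):
    """Compute one output pixel: reads only src[index], writes only dst[index]."""
    value = (src[index] * gain_num) // gain_den + offset
    dst[index] = max(0, min(255, value))  # clip to the 8-bit range


def run_in_order(src, order, **params):
    """Apply the kernel to every pixel, visiting them in the given order."""
    dst = [0] * len(src)
    for index in order:
        gain_offset_kernel(index, src, dst, **params)
    return dst


frame = [0, 16, 64, 128, 200, 255]                      # a tiny stand-in "frame"
params = {"gain_num": 3, "gain_den": 2, "offset": -8}   # illustrative values

sequential = run_in_order(frame, range(len(frame)), **params)

shuffled_order = list(range(len(frame)))
random.Random(1).shuffle(shuffled_order)
shuffled = run_in_order(frame, shuffled_order, **params)
```

Because no pixel depends on any other, `sequential` and `shuffled` are identical, and the same independence is what would let thousands of processing units work on one frame at the same time.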

It’s fair to say that we’re still limited by the speed with which we can shift the video data to and from the GPU. In traditional server architectures, the media stream would be transferred from the NIC to the system memory by the CPU, and then transferred to the GPU. This method still presents the bottlenecks already highlighted. However, the ultra-high-speed PCI Express (PCIe) buses now available provide a solution, as newer server technologies use the bus to transfer media data directly from the NIC to the GPU and back again. Another, vendor-specific, solution is to provide a dedicated inter-PCB connection between the NIC and the GPU, thus bypassing the server's buses altogether.
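The difference between the two routes can be expressed as a simple count of how often the stream crosses system memory. The Python sketch below is a toy model only: the hop names, the 2.5 Gbit/s stream rate, and the assumption that the processed output has the same rate as the input are all illustrative.

```python
STREAM_GBPS = 2.5  # assumed rate of one uncompressed HD stream, in Gbit/s

# Each hop is (source, destination). The names are illustrative only.
VIA_SYSTEM_MEMORY = [
    ("nic", "system_memory"),   # incoming stream lands in system memory
    ("system_memory", "gpu"),   # then is copied to the GPU
    ("gpu", "system_memory"),   # processed stream returns to system memory
    ("system_memory", "nic"),   # and is sent back out through the NIC
]
DIRECT = [
    ("nic", "gpu"),             # incoming stream goes straight to the GPU
    ("gpu", "nic"),             # processed stream goes straight back out
]


def system_memory_traffic(path, stream_gbps):
    """Total system-memory traffic (Gbit/s) generated by one stream on a path."""
    touches = sum(1 for src, dst in path if "system_memory" in (src, dst))
    return touches * stream_gbps


traditional = system_memory_traffic(VIA_SYSTEM_MEMORY, STREAM_GBPS)
direct = system_memory_traffic(DIRECT, STREAM_GBPS)
```

In this model the traditional route loads system memory with four times the stream rate, while the direct route places no load on it at all.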

Server System Memory Bypass

By bypassing the server’s system memory and transferring video data directly between the NIC and GPU, we have the ultimate in video processing architectures. This method does rely on server architectures that are not necessarily mainstream, but the servers, NICs, and GPUs are all COTS components, so such systems can be, and are being, built with ease. This does present some challenges for public cloud infrastructures, as hardware connectivity within the server needs to be specified, but many public cloud service providers are seeing this as a potential source of new business and will hopefully make such hardware available in their infrastructures.

Another interesting idea is that we’ve almost come full circle. If we take the next logical step, which is to add a high-speed NIC to the GPU circuit board, thus providing direct connectivity between the network and the GPU with very low latency and a significantly reduced risk of transfer bottlenecks, then what is the point of using a traditional COTS server? It’s only acting as a mechanical hosting, configuration, and monitoring device for the NIC-GPU circuit board. Maybe this is the future of video processing in COTS datacenters, whether on- or off-prem?
