AI-Based & GPU-Accelerated Compression & Image Processing Are Igniting Video’s Afterburner

The impact of AI on videoconferencing bandwidth reduction couldn’t be accelerating at a more opportune time.

Before DTV, digital cameras or smartphones, video was a one-volt, baseband composite TV picture signal connected from source to destination by coaxial cable. If the video wasn’t intended for broadcast it was called closed circuit TV (CCTV). Color cameras were expensive and generally used only for broadcast TV. TV gear was only practical in a few niche markets until the introduction of consumer VCRs and camcorders in the late 1970s.

As US TV stations began the transition to DTV and HDTV broadcasting in the early 2000s, streaming video began to emerge from science fiction. About the same time, telcos began to roll out DSL service. Until then, the fastest home internet available was 56 kbps. Version 1 of H.264/AVC, also known as MPEG-4 AVC, debuted in 2003. Streaming video didn’t go mainstream until Apple released HTTP Live Streaming (HLS) in 2009, which also happens to be the year the last analog SD TV stations in the US permanently signed off.

Over the 11 years between then and now, streaming video has steadily gained consumer popularity. The 2020 pandemic ignited its afterburner. Suddenly, nearly everyone is relying on streaming video and videoconferencing for meetings, school, socialization and entertainment, including TV station newscasts, all of which are increasing internet bandwidth requirements. In fact, NVIDIA estimates more than 30 million web meetings take place each day, and that video conferencing has increased by 10x since the beginning of the year.

Earlier this month, NVIDIA announced its Maxine platform. It provides developers a cloud-based suite of GPU-accelerated AI video conferencing software to enhance the internet’s biggest source of data traffic, streaming video. Video conference service providers running the platform on NVIDIA GPUs in the cloud can offer users all kinds of amazing AI magic. Because the data is processed in the cloud rather than on local devices, people can use and access the new features and effects without any special hardware.

AI can perceive the important features of a face, send only the changes of the features and reanimate the face at the receiver, reducing bandwidth by a factor of 10. Courtesy NVIDIA.

Breakthrough Features

The NVIDIA Maxine platform dramatically decreases the bandwidth needed for video calls and for typical studio-style TV reporting against a stationary background. Instead of streaming full frames of mainly repeating pixels, the AI software analyzes the key facial points of each person on a call and then intelligently re-animates the face in the video on the other side. This makes it possible to stream video with far less data flowing back and forth across the internet.
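To give a feel for why sending facial key points is so much cheaper than sending pixels, the sketch below compares the two payload sizes. It is an illustration only, not NVIDIA's method: the keypoint count (68), the 16-bit coordinate encoding and the 1280x720 frame size are all assumptions chosen for the example, and pack_keypoints and unpack_keypoints are hypothetical helpers.

```python
import struct

FRAME_WIDTH, FRAME_HEIGHT = 1280, 720   # assumed call resolution
NUM_KEYPOINTS = 68                      # assumed landmark count, not an NVIDIA figure


def pack_keypoints(points):
    """Pack (x, y) pixel coordinates as little-endian unsigned 16-bit integers."""
    flat = [coord for point in points for coord in point]
    return struct.pack(f"<{len(flat)}H", *flat)


def unpack_keypoints(payload):
    """Recover the (x, y) pairs from a packed payload."""
    flat = struct.unpack(f"<{len(payload) // 2}H", payload)
    return list(zip(flat[0::2], flat[1::2]))


# Dummy keypoints standing in for the output of a face tracker.
keypoints = [(i * 10, i * 5) for i in range(NUM_KEYPOINTS)]
payload = pack_keypoints(keypoints)

raw_frame_bytes = FRAME_WIDTH * FRAME_HEIGHT * 3   # uncompressed 8-bit RGB
print(len(payload), "bytes of keypoints vs", raw_frame_bytes, "bytes of raw pixels")
```

A real system compares against a compressed H.264 frame rather than raw pixels, and the receiver needs a trained model plus a reference image to rebuild the face, so the numbers here only show the order of magnitude involved.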

This new AI-based video compression technology running on NVIDIA GPUs allows developers to reduce video bandwidth consumption to approximately 10% of that of comparable H.264 streaming video. In one test, the required data fell from 97.28 KB/frame to 0.1165 KB/frame, a reduction to about 0.12% of the H.264 figure. NVIDIA’s research in generative adversarial networks (GANs) also provides a variety of new features and effects never before available for live video.
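The per-frame figures from that test can be checked with a few lines of arithmetic. The two KB/frame values come from the article; the 30 frames-per-second rate used to turn them into a bit rate is an assumption, since the test's frame rate is not stated.

```python
H264_KB_PER_FRAME = 97.28    # figure quoted in the article
AI_KB_PER_FRAME = 0.1165     # figure quoted in the article
FRAMES_PER_SECOND = 30       # assumption: the frame rate is not given


def kilobits_per_second(kb_per_frame, fps=FRAMES_PER_SECOND):
    """Convert kilobytes per frame to kilobits per second (1 KB = 8 kilobits)."""
    return kb_per_frame * 8 * fps


ratio = AI_KB_PER_FRAME / H264_KB_PER_FRAME
print(f"AI stream is {ratio:.2%} of the H.264 size")
print(f"H.264: {kilobits_per_second(H264_KB_PER_FRAME):.0f} kbps")
print(f"AI:    {kilobits_per_second(AI_KB_PER_FRAME):.2f} kbps")
```

At the assumed 30 frames per second, the H.264 stream works out to roughly 23 Mbps and the AI stream to under 30 kbps.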

Software Writing Software

Earlier this month, NVIDIA CEO Jensen Huang said in his GPU Technology Conference (GTC) Keynote address, “AI is writing software that achieves results no human-written software can. The scalability of AI started a race to create ever-larger, more complex and more capable AI.” He continued, “Computation to create state-of-the-art models has increased 30,000x in the last five years and is doubling every 2 months.”

“Enormous models and AI training data sets have pushed the limits of every aspect of the computer, storage and networking,” Huang said. He also explained that NVIDIA AI consists of three pillars. The first is data processing, feature engineering and training at all scales. The second is inference computing to optimize competing constraints. Inference computing is in every cloud and growing by 10x every couple of years. The third pillar is AI application frameworks such as pre-trained models. Some of the most challenging applications include self-driving cars, robotics, scientific discovery, and conversational AI.

Other parts of Huang’s NVIDIA GTC Keynote Address were titled ‘Everything that Moves will be Autonomous’, ‘Data Center Infrastructure-on-a-Chip’, ‘Trillions of Intelligent Things’ and ‘AI for Every Company’, all clues to where AI is headed.

New Audio and Video Effects

Maxine includes the latest sound and visual effects and correction innovations from NVIDIA research such as face alignment, gaze correction, face re-lighting and real-time translation in addition to capabilities such as super-resolution, noise removal, closed captioning and virtual assistants. These capabilities are fully accelerated on NVIDIA GPUs to run in real-time video streaming applications in the cloud.

The Free View feature can animate a face to artificially make eye contact with each person. Courtesy NVIDIA.

Face alignment, for example, enables faces to be automatically adjusted so that people appear to be facing each other during a call even when they are not. Gaze correction simulates eye contact, even if the camera isn’t aligned with the user’s screen. These new features help people stay focused on the conversation rather than their cameras.

AI-based super-resolution and artifact reduction can convert lower-resolution video to higher resolution in real time. Example super-resolution videos on YouTube show amazing visual results. Super-resolution helps lower the bandwidth requirements for video conference providers, and it improves the call experience for users with lower bandwidth. Developers can add features to automatically filter out background noise and frame the camera on a user’s face for a more personal and engaging conversation.
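The bandwidth saving from super-resolution comes from sending a smaller picture and upscaling it later. A rough sketch of the pixel arithmetic follows. The two resolutions are assumptions picked for illustration, and pixel count is only a proxy for encoded bit rate.

```python
def pixel_count(width, height):
    """Number of pixels in one frame."""
    return width * height


sent = pixel_count(640, 360)         # assumed low-resolution stream
displayed = pixel_count(1280, 720)   # assumed upscaled output

print(f"Pixels sent per frame: {sent / displayed:.0%} of the displayed frame")
```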

Conversational AI features powered by the NVIDIA Jarvis Software Developer Kit (SDK) allow developers to integrate virtual assistants that use state-of-the-art AI language models for speech recognition, language understanding and speech generation. The virtual assistants can take notes, set action items and answer questions in human-like voices. Additional conversational AI services such as translations, closed captioning and transcriptions help ensure that everyone can understand what is being discussed on the call.

For TV broadcasters, the new AI-based video compression technology running on NVIDIA GPUs can eliminate slow connection issues for news people reporting from home and live interviews over the internet from people’s homes or offices.

Cloud-Native

Demand for video conferencing at any given time can be hard to predict, with hundreds or even thousands of people potentially trying to join the same call. NVIDIA Maxine takes advantage of AI microservices running in Kubernetes container clusters on NVIDIA GPUs to help developers scale their services according to real-time demands. Kubernetes is an open-source container orchestration system that automates the deployment, scaling and management of containerized applications, including versioning and upgrades of the various AI components. Users can run multiple AI features simultaneously while remaining well within application latency requirements.
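One common way Kubernetes scales a service to match demand is its Horizontal Pod Autoscaler, which sizes a deployment from the ratio of observed load to target load. The sketch below reproduces that core rule in simplified form. Whether Maxine deployments use this autoscaler is an assumption, and the stream counts are hypothetical.

```python
import math


def desired_replicas(current_replicas, current_load, target_load):
    """Core Horizontal Pod Autoscaler rule:
    ceil(current_replicas * current_load / target_load)."""
    return math.ceil(current_replicas * current_load / target_load)


# Hypothetical numbers: 4 GPU-backed pods each handling 90 streams,
# with a target of 60 streams per pod, so the service scales up.
print(desired_replicas(4, 90, 60))

# Demand drops to 20 streams per pod across 6 pods, so it scales down.
print(desired_replicas(6, 20, 60))
```

The real autoscaler also applies a tolerance band, minimum and maximum replica limits and stabilization windows, all omitted here.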

Video conference service providers can use Maxine to deliver leading AI capabilities to hundreds of thousands of users by running AI inference workloads on NVIDIA GPUs in the cloud. The modular design of the Maxine platform enables developers to easily select AI capabilities to integrate into their video conferencing solutions.

The Maxine platform integrates technology from several NVIDIA AI SDKs and APIs. In addition to NVIDIA Jarvis, the Maxine platform leverages the NVIDIA DeepStream high-throughput audio and video streaming SDK and the NVIDIA TensorRT SDK for high-performance deep learning inference.

The AI audio, video and natural language capabilities provided in the NVIDIA SDKs used in the Maxine platform were developed through hundreds of thousands of training hours on NVIDIA DGX systems, a leading platform for training, inference and data science workloads.

Crawling Around the Room

AI is in its infancy, much like personal computers 30 years ago, HDTV 20 years ago and streaming video 10 years ago. The difference between AI and previous technologies is that AI learns and gains power as it is trained.

AI benefits were extended to Adobe Photoshop users with the introduction of GPU-accelerated neural filters at this month's Adobe MAX virtual conference. The filters leverage NVIDIA RTX GPUs within Adobe creative applications and include the next-level Adobe Photoshop 'Smart Portrait' neural filter.

Adobe Premiere Pro users also benefit from NVIDIA RTX GPUs, with virtually all video decoding offloaded to dedicated decoding hardware on the GPU, resulting in smoother video playback and sharper responsiveness when scrubbing through footage, especially with ultra-high resolution and multi-stream content.

Human-like AI is like a baby learning to crawl around a room. Imagine what it will be doing 10 years from now.
