Understanding Compression Technology: Motion Estimation Part 2

Part 1 of this article covered multiple aspects of compression technology: macroblocks, DCT, quantization, and lossless compression. Part 2 will focus on motion estimation for P- and B-frames—after a review of lossless data compacting. Part 3 (next month) will detail the critical role of Predicted frames and Difference frames in maintaining image quality.

Compression, defined as the removal of information deemed by an encoder’s designer as not essential to a successful transfer of visual information, can be performed in multiple ways: reduction of chroma information (e.g., 4:2:2 to 4:2:0 chroma subsampling); noise reduction; quantization; and lossless data compacting. (The definition of “successful” is based upon an encoder’s design specification.)

The quantization process is the primary locus of compression. During quantization, each of the 64 coefficients in the matrix produced by a DCT is divided by the corresponding cell of a pre-defined quantization matrix and the result is rounded. (Pre-defined here means based upon an encoder’s design.) Whether the compression factor stays fixed, however, depends on whether Variable Bit Rate (VBR) or Constant Bit Rate (CBR) encoding is applied. When the compression factor, represented by the quantization matrix, is held constant under varying image complexity, the result is Variable Bit Rate (VBR) encoding. Alternatively, by monitoring the output bit rate and dynamically altering the compression factor, data output is smoothed, thereby yielding Constant Bit Rate (CBR) encoding.
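
To make the arithmetic concrete, here is a minimal sketch in Python. The quantization matrix values and the scale factor are illustrative assumptions, not taken from any standard or any particular encoder.

```python
import numpy as np

# Hypothetical 8x8 quantization matrix: small divisors for low-frequency
# coefficients (top-left), larger divisors for high frequencies (bottom-right).
Q = np.array([[16 + 2 * (x + y) for x in range(8)] for y in range(8)])

def quantize(dct_block: np.ndarray, scale: float = 1.0) -> np.ndarray:
    """Divide each DCT coefficient by its matrix cell and round.

    The rounding is where information is discarded. `scale` stands in for
    the compression factor: held constant it yields VBR behavior; raised
    and lowered against the measured output bit rate it yields CBR behavior.
    """
    return np.round(dct_block / (Q * scale)).astype(int)
```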

Lossless data reduction follows quantization and employs Run Length Coding (RLC) and then Variable Length Coding (VLC). Both processes reduce the volume of data that results from quantization by compacting it; like ZIP file compression, no information is lost. Once the luminance (Y) blocks have been compressed, the Cb and Cr blocks are compressed. All compressed blocks are then stored.
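
Below is a simplified sketch of this compacting stage, assuming the zigzag scan order used by DCT-based codecs; the final table lookup that maps each (run, level) pair to a variable-length (e.g., Huffman) codeword is omitted.

```python
import numpy as np

def zigzag_order(n: int = 8):
    """(row, col) pairs along anti-diagonals, alternating direction."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]))

def run_length_code(block: np.ndarray):
    """Collapse a quantized block into (zero_run, level) pairs."""
    pairs, run = [], 0
    for r, c in zigzag_order(block.shape[0]):
        level = int(block[r, c])
        if level == 0:
            run += 1          # zeros are counted, not stored
        else:
            pairs.append((run, level))
            run = 0
    pairs.append("EOB")       # trailing zeros are implied by end-of-block
    return pairs
```

Because quantization drives most high-frequency coefficients to zero, the long zero runs collapse into a handful of pairs; that is where the compaction comes from.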

This compression process is the same for intra-frame encoding as for generating an I-frame as part of inter-frame encoding, in which a series of video frames is compressed into a sequence of several types of data frames. One frame type, of course, is the “I” frame. Two other types of frames found in a GOP (Group Of Pictures) data stream are “P” (predictive) frames and “B” (bi-predictive) frames.

A P-frame contains the information needed to recreate a video frame in conjunction with information from the closest previous I-frame or P-frame. A B-frame contains the information needed to reconstitute a video frame when combined with information from: the closest previous I-frame; the closest previous P-frame; the closest future I-frame (open GOPs only); and the closest future P-frame. (A closed GOP is never dependent on information from another GOP.) Frame dependencies for 15-frame open and closed GOPs are shown below.
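
The sketch below illustrates one common 15-frame layout (an I-frame followed by a repeating B-B-P pattern) and prints each frame’s dependencies; this is a typical MPEG-2 arrangement, not the only possible one.

```python
# Display order of an illustrative 15-frame GOP.
gop = "I B B P B B P B B P B B P B B".split()

for i, f in enumerate(gop):
    if f == "I":
        refs = "none (self-contained)"
    elif f == "P":
        refs = f"previous I/P at frame {max(j for j in range(i) if gop[j] in 'IP')}"
    else:  # B-frame: one past and one future reference
        past = max(j for j in range(i) if gop[j] in "IP")
        future = next((j for j in range(i + 1, len(gop)) if gop[j] in "IP"),
                      "next GOP's I-frame (open GOP only)")
        refs = f"past ref {past}, future ref {future}"
    print(f"frame {i:2d} ({f}): {refs}")
```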

(H.264 introduced the concept of slices—segments of a picture bigger than a macroblock but smaller than a frame. A P-slice depends only on a single link (motion vector) to another slice. A B-slice depends on two links. Unlike MPEG-2, “bi” here means two links—not two dependency directions.)

Motion estimation may be one of the most interesting digital technologies developed because, in a way, it “sees” the movement of objects over time in a sequence of video frames. One application of this technology is the creation of intermediate frames when, to avoid LCD display motion blur, 60fps video is converted to 120fps or 240fps video. Motion estimation is also the first stage of generating P-frames and B-frames.

Video frames that will become P- and B-frames are partitioned into macroblocks in the same way as done for an I-frame. Starting with the first (uppermost, leftmost) macroblock, which is contained in what is called the Present Image, a search is made to determine where its content can be found in the next video frame, which is called the Adjacent Image.

The first comparison is made in the Adjacent Image at X=0 and Y=0 coordinates to determine if the Present Image’s first macroblock remains at its initial location. To determine whether or not a macroblock has moved, a content match is made between the Present Image and the Adjacent Image. (To measure the strength of a match, a correlation technique is used. The correlation must be above a defined threshold to be a match.) When the contents of a macroblock have not moved, the macroblock’s Motion Vector is set to zero.
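
Here is a minimal sketch of that match test, assuming normalized cross-correlation as the measure and an arbitrary threshold; many real encoders use a cheaper cost such as the Sum of Absolute Differences, and the actual threshold is set by the encoder’s designer.

```python
import numpy as np

MATCH_THRESHOLD = 0.98  # hypothetical value chosen by the encoder's designer

def correlate(block_a: np.ndarray, block_b: np.ndarray) -> float:
    """Normalized cross-correlation of two equal-size luminance blocks."""
    a = block_a - block_a.mean()
    b = block_b - block_b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom else 1.0

def is_match(present_mb: np.ndarray, adjacent_mb: np.ndarray) -> bool:
    """True when the blocks correlate above the defined threshold."""
    return correlate(present_mb, adjacent_mb) >= MATCH_THRESHOLD
```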

When a match is not found at X=0 and Y=0, the Present Image’s comparison macroblock is moved, in a methodical pattern, an increasing but limited distance from its origin until there is a match or, ultimately, no match. The step size is typically one PEL (Picture Element, i.e., a pixel), although a step size of ½ PEL can be employed. The maximum number of X and Y steps allowed defines the search window, shown by the red square below.
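
A sketch of the search itself follows. For simplicity it is an exhaustive full search over whole-PEL displacements that keeps the lowest-cost candidate (using the Sum of Absolute Differences) rather than stopping at the first above-threshold match as described above; half-PEL steps would require interpolating the Adjacent Image and are omitted. The window and macroblock sizes are illustrative.

```python
import numpy as np

SEARCH_RANGE = 7   # search window: +/-7 PELs in X and Y (illustrative)
MB = 16            # macroblock size in PELs

def sad(a: np.ndarray, b: np.ndarray) -> int:
    """Sum of Absolute Differences: lower means a better match."""
    return int(np.abs(a.astype(int) - b.astype(int)).sum())

def find_motion_vector(present: np.ndarray, adjacent: np.ndarray,
                       mb_x: int, mb_y: int):
    """Best (dx, dy) displacement for the macroblock at (mb_x, mb_y)."""
    block = present[mb_y:mb_y + MB, mb_x:mb_x + MB]
    best, best_cost = (0, 0), None
    for dy in range(-SEARCH_RANGE, SEARCH_RANGE + 1):
        for dx in range(-SEARCH_RANGE, SEARCH_RANGE + 1):
            x, y = mb_x + dx, mb_y + dy
            if (x < 0 or y < 0 or
                    x + MB > adjacent.shape[1] or y + MB > adjacent.shape[0]):
                continue  # candidate block would fall outside the frame
            cost = sad(block, adjacent[y:y + MB, x:x + MB])
            if best_cost is None or cost < best_cost:
                best, best_cost = (dx, dy), cost
    return best
```

The winning displacement is exactly the motion vector described next: when the content has not moved, (0, 0) wins and the vector is zero.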

The displacement (direction and distance) moved until a match is made determines a macroblock’s motion vector. A motion vector (the small arrow) is shown below.

Once a search is made for the first macroblock, additional searches are made for every macroblock within the Present Image. In this manner every macroblock within a Present Image is assigned a motion vector. A Present Image’s motion vectors are stored in a Motion Estimation block. Although an estimation block will ultimately be stored, it is first used to generate a Predicted Frame. This process will be detailed in Part 3.
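
A short sketch of that outer loop, reusing find_motion_vector() and MB from the previous listing (frames are assumed to be 2-D luminance arrays whose dimensions are multiples of the macroblock size):

```python
def motion_estimation(present, adjacent):
    """One motion vector per macroblock position of the Present Image."""
    rows, cols = present.shape[0] // MB, present.shape[1] // MB
    return [[find_motion_vector(present, adjacent, c * MB, r * MB)
             for c in range(cols)]
            for r in range(rows)]
```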

The success of inter-frame compression depends on the contents of most macroblocks not moving from frame to frame. Therefore, the motion vector for most macroblocks is zero. When this assumption is violated, for example when an explosion fills the screen, you are likely to see a screen filled with ugly “macroblocking.”
