Compression: Part 5 - Motion Estimation
Motion estimation is one of the most important enabling technologies of video compression because it allows redundancy between pictures to be identified even in the presence of motion.
Motion estimation is difficult because in real pictures the motion can be complicated. Objects can move within the field of view of the camera, yet the field of view itself can also change if panning or zooming is used. An edit usually destroys redundancy between the pictures before and after it, so motion information across a cut is unlikely to be meaningful.
The motion estimator is presented with a set of pixels representing one picture and another set representing a second picture. It is expected to figure out what happened from that data. It is unlikely that a single approach would work in all cases and some kind of adaptive approach based upon what is found will be more efficient.
An attempt to correlate one picture with the other will give the motion estimator some idea of how to proceed. Strong correlation over the whole picture means there was no motion and there is nothing more to do. Strong correlation over most of the picture with failures in small areas suggests a fixed camera and scene with some moving objects. Further motion estimation would concentrate on those objects.
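As a rough sketch of this first stage, the fragment below (Python with NumPy, assuming 8-bit luma frames; the block size and threshold are arbitrary illustrative choices, not from any particular codec) compares two pictures block by block at zero displacement and reports what fraction of the picture failed to correlate.

```python
import numpy as np

def fraction_changed(prev, curr, block=16, threshold=8.0):
    """Compare two luma frames at zero displacement, block by block.

    Returns the fraction of blocks whose mean absolute difference
    exceeds the threshold: near 0 suggests a static scene, a small
    fraction suggests isolated moving objects, and a value near 1
    suggests a camera movement or an edit.
    """
    h, w = prev.shape
    failed = total = 0
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            a = prev[y:y + block, x:x + block].astype(np.float32)
            b = curr[y:y + block, x:x + block].astype(np.float32)
            if np.mean(np.abs(a - b)) > threshold:
                failed += 1
            total += 1
    return failed / total
```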
Correlation failure over the whole picture would suggest a camera movement or an edit. The motion estimator would then have to establish whether there had been movement. This requires a search process in which blocks of picture data are shifted relative to one another by different distances and directions, looking for a shift that allows correlation.
From a cold start, block matching is a computationally intensive process and a lot of effort has been expended in making such searches as efficient as possible. If a match was found in a given block, the motion estimator might try to obtain a match with the same amount of motion in an adjacent block. In the case of a pan, or of a large moving object, this would allow the estimator to shortcut the search.
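A minimal sketch of such a search, again assuming NumPy luma frames: it performs an exhaustive sum-of-absolute-differences match within a window, optionally centred on a vector inherited from an adjacent block so that a pan or a large moving object can be tracked without restarting from zero. The names and window size are illustrative.

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return int(np.sum(np.abs(a.astype(np.int32) - b.astype(np.int32))))

def block_match(ref, curr, y, x, block=16, search=16, start=(0, 0)):
    """Find the offset into the reference picture that best predicts
    the block of `curr` at (y, x).

    `start` is a predicted vector, e.g. the one found for an adjacent
    block; the window is searched around it rather than around zero.
    Returns the best (dy, dx) and its matching cost.
    """
    h, w = ref.shape
    target = curr[y:y + block, x:x + block]
    best_cost, best_vec = None, (0, 0)
    for dy in range(start[0] - search, start[0] + search + 1):
        for dx in range(start[1] - search, start[1] + search + 1):
            ry, rx = y + dy, x + dx
            if ry < 0 or rx < 0 or ry + block > h or rx + block > w:
                continue
            cost = sad(ref[ry:ry + block, rx:rx + block], target)
            if best_cost is None or cost < best_cost:
                best_cost, best_vec = cost, (dy, dx)
    return best_vec, best_cost
```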
A real motion estimator would have access to any motion estimated between earlier pictures and could keep a history of optic flow. Such optic flow axes could be extrapolated into the latest estimate on the basis that the motion between earlier pictures is likely to have continued in a similar direction. Beginning a search around an extrapolated optic flow axis could be more efficient than starting from cold.
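A small illustration of that idea, under a simple constant-velocity or constant-acceleration assumption (purely a sketch, not a description of any real estimator): the vector history for a picture area is extrapolated to give the starting point for the next search.

```python
def extrapolate_vector(history):
    """Predict the next motion vector for a picture area from the
    vectors measured between earlier pictures.

    With one previous vector the motion is assumed to continue
    unchanged; with two or more, a linear extrapolation follows any
    steady change in velocity.
    """
    if not history:
        return (0, 0)
    if len(history) == 1:
        return history[-1]
    (py, px), (cy, cx) = history[-2], history[-1]
    return (2 * cy - py, 2 * cx - px)
```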
On the other hand, all of the searches could fail and the motion estimator would have to assume that it had come across an edit, where a comparison of pictures before and after is not useful for coding purposes. The motion estimator would alert the coder to the existence of the cut, reset any optic flow axes it had been following and start cold on the new clip.
Finding motion by testing every possibility is inefficient and ideas turned towards ways of computing the amount of motion. These ideas rely on Fourier optics and optic flow. Real objects in video consist of areas that differ in brightness from their surroundings. They create a step, a kind of square wave, in the video data. That square wave can be broken down by Fourier analysis into its component frequencies.
If in a subsequent picture the same step exists, but in a different place, the coefficients coming from a Fourier transform will be different. The amplitudes may well be the same, but the phases will be different. Fig.1 shows that the phase difference is the product of the displacement and the frequency. It should therefore be possible to measure the displacement by looking at the phase differences.
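This is the Fourier shift theorem: delaying a signal by d samples multiplies the k-th coefficient by e^(-j2πkd/N), so the phase difference grows in proportion to both the displacement and the frequency. A one-dimensional NumPy check of the idea (the block length, displacement and test bin are arbitrary):

```python
import numpy as np

N, d = 64, 5                       # block length and displacement in samples
n = np.arange(N)
step = (n > 20).astype(float)      # a brightness step, i.e. an edge
shifted = np.roll(step, d)         # the same step d samples later (circular)

# Phase difference between the two transforms at each frequency bin.
phase_diff = np.angle(np.fft.fft(shifted) / np.fft.fft(step))

k = 3                              # a low-frequency bin with plenty of energy
print(phase_diff[k], -2 * np.pi * k * d / N)   # both are about -1.47 radians
```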
That is the basis of phase correlation. Fig.2 shows how it is done in practice. Blocks of video data are fed into a Fourier transform, which outputs amplitude and phase information. The amplitude information is normalized, meaning it is set to a predetermined value. The phase information is subject to a delay of one picture and the input and the output of the delay are subtracted to obtain the phase differences.
The phase differences and the normalized amplitudes are then subject to an inverse Fourier transform. This results in a two-dimensional pixel array, which is not an image. Instead it is a correlation surface: a carpet plot of all of the motions between the two input blocks. Wherever there is displacement between the pictures, there will be a corresponding peak on the correlation surface.
Fig.2 - A typical phase correlator. Phase differences are used to determine how picture areas moved.
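A compact NumPy sketch of the correlator in Fig.2, working on a pair of co-sited blocks (the normalization constant and the wrap-around handling are implementation details chosen here purely for illustration):

```python
import numpy as np

def phase_correlate(block_a, block_b):
    """Phase correlation between two co-sited picture blocks.

    Both blocks are transformed, the amplitudes are normalized so only
    the phase differences remain, and an inverse transform produces a
    correlation surface. Each peak marks a displacement present between
    the blocks; the strongest one is returned as (dy, dx).
    """
    A = np.fft.fft2(block_a.astype(np.float32))
    B = np.fft.fft2(block_b.astype(np.float32))
    cross = B * np.conj(A)               # phase differences between the pictures
    cross /= np.abs(cross) + 1e-9        # normalize amplitudes to unity
    surface = np.real(np.fft.ifft2(cross))
    dy, dx = np.unravel_index(np.argmax(surface), surface.shape)
    h, w = surface.shape
    if dy > h // 2: dy -= h              # convert wrapped indices to
    if dx > w // 2: dx -= w              # signed displacements
    return surface, (int(dy), int(dx))

# A quick test: a bright patch moved down 3 and right 7 pixels.
a = np.zeros((64, 64)); a[20:30, 20:30] = 255
b = np.roll(a, (3, 7), axis=(0, 1))
_, motion = phase_correlate(a, b)        # motion == (3, 7)
```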
Fig.3 shows a simplified version of what happens. If there is no motion, the correlation peak will be central. If the camera pans, it will move to one side. If the camera tilts, it will move up or down. In each case, the movement of the correlation peak will be proportional to the actual movement between the pictures. In practice with real video the correlation surface will be populated by a number of peaks, one for each movement.
It is known from transform duality that the more accurately something is known on one side of a transform, the less accurately it is known on the other. The Heisenberg inequality follows from that. The correlation surface contains precise information about the distance and direction of motions, but says absolutely nothing about where in the picture they took place.
This is resolved simply, because if a conventional block matcher is directed to use the distance and direction information from one of the peaks in the correlation surface, it will immediately find the picture area that moved in that way. This process can be repeated for every peak found. Phase correlation can be viewed as a method of directing or guiding block matching that makes it much more efficient by removing the trial-and-error stage.
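One way to picture the guided search, sketched below under the same assumptions as before: a displacement reported by one peak is tested against every block, and the blocks it predicts well are the ones that moved in that way. The threshold and the interpretation of the vector as the motion of the content are illustrative choices.

```python
import numpy as np

def blocks_explained_by(ref, curr, motion, block=16, threshold=4.0):
    """Mark the blocks of `curr` that are well predicted by assuming
    the picture content moved by `motion` = (dy, dx) since `ref`,
    e.g. a peak position taken from the correlation surface.
    """
    dy, dx = motion
    h, w = ref.shape
    rows, cols = h // block, w // block
    matched = np.zeros((rows, cols), dtype=bool)
    for by in range(rows):
        for bx in range(cols):
            y, x = by * block, bx * block
            ry, rx = y - dy, x - dx       # where this block came from in ref
            if ry < 0 or rx < 0 or ry + block > h or rx + block > w:
                continue
            pred = ref[ry:ry + block, rx:rx + block].astype(np.float32)
            act = curr[y:y + block, x:x + block].astype(np.float32)
            matched[by, bx] = float(np.mean(np.abs(pred - act))) < threshold
    return matched
```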
In most codecs, the unit of motion compensation is a pixel block of a certain size. For example in MPEG-2 it was 16 pixels square and was called a macroblock. In order to create a predicted picture, the decoder would take a block of that size from a picture it had already decoded. However, the block would not necessarily come from the same place in the source picture. In order to cancel out motion, the place in the source picture where the pixels were taken could be shifted in two dimensions by a vector sent by the encoder.
All being well, taking an area of pixels shifted in that way would produce a predicted picture that was closer to the real one and would thus require less residual data to turn the prediction into the real picture. In practice it doesn’t quite work like that. There is no ideal size for a macroblock. If every pixel were individually motion compensated, the prediction would be near perfect, but the bit rate required to send all of the vectors would cancel out any coding gain.
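A minimal sketch of that prediction step, assuming luma-only frames and one vector per macroblock (the function names are illustrative and bounds checking is omitted):

```python
import numpy as np

def predict_block(ref, y, x, vector, block=16):
    """Decoder side: fetch the macroblock prediction from the already
    decoded reference picture, displaced by the transmitted vector."""
    dy, dx = vector
    return ref[y + dy:y + dy + block, x + dx:x + dx + block]

def residual_block(ref, curr, y, x, vector, block=16):
    """Encoder side: the residual that must be sent is the difference
    between the real block and its motion-compensated prediction.
    The better the vector, the smaller this difference."""
    pred = predict_block(ref, y, x, vector, block).astype(np.int16)
    real = curr[y:y + block, x:x + block].astype(np.int16)
    return real - pred
```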
Fig.3 - The results of a) no motion, b) a pan and c) a tilt on the correlation surface. The peak position reveals the motion.
In practice ideal motion compensation only occurs when a macroblock is entirely within a moving object so that all the pixels move in the same way. Near the edge of a moving object, some of a macroblock could be moving in a first way and the rest of it could be moving in a second way or not at all. This means that there is no perfect solution. If the vector transmitted corresponded to motion in the first way, it would be inappropriate for the rest of the block and result in a mis-compensation requiring residual data to correct it.
On the other hand, if the vector corresponded to motion in the second way, the first motion would be mis-compensated. The coder might check the amount of residual data needed by each vector and compare it with not sending a vector at all. It could then choose the method that resulted in the least residual data.
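A sketch of that decision, again purely illustrative: each candidate vector is tried, with (0, 0) standing in for sending no vector at all, and the one leaving the smallest residual wins. A real encoder would also weigh the bits needed to code each vector.

```python
import numpy as np

def choose_vector(ref, curr, y, x, candidates, block=16):
    """Pick the candidate vector whose prediction leaves the least
    residual data for the block at (y, x)."""
    target = curr[y:y + block, x:x + block].astype(np.int32)
    best_cost, best_vec = None, (0, 0)
    for dy, dx in [(0, 0)] + list(candidates):
        ry, rx = y + dy, x + dx
        if (ry < 0 or rx < 0 or
                ry + block > ref.shape[0] or rx + block > ref.shape[1]):
            continue
        pred = ref[ry:ry + block, rx:rx + block].astype(np.int32)
        cost = int(np.sum(np.abs(target - pred)))
        if best_cost is None or cost < best_cost:
            best_cost, best_vec = cost, (dy, dx)
    return best_vec
```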
Whilst MPEG-2 worked quite well with one size of motion compensation block, later codecs allowed the size to vary. This has to be done carefully, as the decoder has to be told how the picture has been divided into blocks and a vector has to be provided for each block.
The precision of the vector also has a bearing on the coding efficiency. In real pictures, the motion is infinitely variable, whereas vectors have to have finite word length and might be restricted to shifting pixels in half-pixel steps. Later codecs tend to have more precise vectors that can shift pixel blocks to finer sub-pixel accuracy.
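Sub-pixel compensation requires the prediction to be interpolated. The sketch below uses simple bilinear interpolation for a vector expressed in half-pixel units; real codecs use longer interpolation filters, so this only illustrates the principle, and it assumes the displaced block lies safely inside the picture.

```python
import numpy as np

def half_pel_predict(ref, y, x, vector, block=16):
    """Fetch a prediction displaced by (dy, dx) given in half-pixel
    units, bilinearly interpolating for any fractional part."""
    dy2, dx2 = vector
    iy, fy = (y * 2 + dy2) // 2, (dy2 % 2) / 2.0
    ix, fx = (x * 2 + dx2) // 2, (dx2 % 2) / 2.0
    a = ref[iy:iy + block, ix:ix + block].astype(np.float32)
    b = ref[iy:iy + block, ix + 1:ix + 1 + block].astype(np.float32)
    c = ref[iy + 1:iy + 1 + block, ix:ix + block].astype(np.float32)
    d = ref[iy + 1:iy + 1 + block, ix + 1:ix + 1 + block].astype(np.float32)
    return (1 - fy) * ((1 - fx) * a + fx * b) + fy * ((1 - fx) * c + fx * d)
```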
Motion vectors also have redundancy, because every block within a moving object will have very nearly the same vector. The decoder contains a vector predictor and the bitstream contains residuals that cancel vector prediction errors. MPEG-2 had a structure known as a slice, which was a horizontal row of macroblocks in which the first had an absolute vector and subsequent blocks used differential coding.
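A sketch of that differential scheme for one slice (hypothetical helper names; MPEG-2 additionally entropy-codes the differences, which is omitted here):

```python
def code_slice_vectors(vectors):
    """Code the vectors of one slice: the first macroblock carries an
    absolute vector, the rest carry differences from their left-hand
    neighbour. Within a moving object the differences are near zero."""
    coded = [vectors[0]]
    for prev, curr in zip(vectors, vectors[1:]):
        coded.append((curr[0] - prev[0], curr[1] - prev[1]))
    return coded

def decode_slice_vectors(coded):
    """Decoder side: rebuild the absolute vectors by accumulation."""
    vectors = [coded[0]]
    for dy, dx in coded[1:]:
        py, px = vectors[-1]
        vectors.append((py + dy, px + dx))
    return vectors
```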
Codecs tend to use all the tricks that have been found to work over the years, refining their performance with a resultant increase in complexity. The complexity increase is made possible within economic bounds by Moore’s Law.