Image compression depends heavily on transforms and there are a number of these. The criteria for selecting a transform include appropriateness for the type of images to be coded and computational complexity.
Other articles in this series and other series by the same author:
The most common transform used in image compression is the DCT, which is an adaptation of the Fourier transform that results in only one coefficient per spatial frequency. Typically, a two-dimensional version of the DCT is used, in which eight horizontal frequencies and eight vertical frequencies are analyzed, allowing 64 combinations. As a result, after DCT the 64-pixel input block becomes a 64 coefficient block.
Fig.1 shows the DCT coefficient block. The top left coefficient represents the average brightness of the entire block, which is the same thing as representing energy at zero frequency in both directions. It is often erroneously called the DC coefficient because it causes no change across the block, neglecting the fact that there are no currents in an image.
Fig.1 - The 8 x 8 coefficient block of a DCT. The top left coefficient is the average brightness. Horizontal spatial frequency increases along rows and vertical spatial frequency increases down columns.
The top row represents increasing horizontal frequency and the left column represents increasing vertical frequency with the rest of the array representing combinations of horizontal and vertical frequency. The inverse DCT is essentially a synthesizer that creates cosine waves whose amplitude corresponds to the value of the coefficient. When these are added together, the original pixel data emerge. The DCT itself is lossless and loss only occurs because some coefficients are truncated and some are discarded altogether.
In real video signals, the energy at different spatial frequencies is affected by three dominant factors. The first is that real-life scenes contain energy that typically falls with spatial frequency, not least because they contain objects many pixels across. The second is that some areas of the picture may be out of focus because the camera has been focused on an object of interest. The third is that motion results in smear across the camera sensor, which filters out high frequencies.
The sum of these effects is that the magnitude of the DCT coefficients falls strongly as a function of the distance from the top left corner of the coefficient array. Most of the time high frequency coefficients to the bottom right of the array have zero value or such small values they can be set to zero without doing any visible harm. Zero-value coefficients need not be sent, of course.
In some cases simply failing to send all of the zero-valued coefficients provides enough compression, in other cases the surviving coefficients are quantized such that fewer bits are needed to represent them. The higher the spatial frequency concerned, the more severe can be the quantizing because the HVS is less sensitive to errors at high spatial frequency.
On the other hand, an error in the zero-frequency coefficient would affect every pixel in the block. Typically, that coefficient is sent with full accuracy.
After quantizing, even more coefficients that have zero value. The probability of finding a non-zero coefficient falls with increasing distance from the top left. This suggests a sequence of transmission reflecting that distribution. With an optimum sequence, the important non-zero coefficients would be transmitted at the beginning and the zero coefficients would find themselves at the end. Transmission could end at the last of the non-zero coefficients.
Fig.2 - The zig-zag scan approximates the order in which the magnitude of coefficients typically falls.
That is the reasoning behind the use of the zig-zag scan shown in Fig.2. Although the probability falls with radius from the zero-frequency coefficient, the scan sloping at 45 degrees is the best that can be done and although slightly sub-optimal it is a lot better than not doing it.
Variable length coding is used, in which the more probable coefficient values are represented by shorter codes. Where one or more zero-value coefficients appear in the scan prior to a non-zero coefficient, unique codes tell the receiver to skip to the correct place in the scan. Variable length coding relies on the codes themselves to declare their own wordlength, so a serial message can correctly be parsed. A single bit error will cause this process to crash, but error propagation cannot go beyond one DCT block and the system soon recovers.
The DCT works well with video signals because they seldom contain high spatial frequencies, which means that relatively few coefficients need to be transmitted. In contrast, still pictures, digital photographs, will contain high spatial frequencies. This is because photographers do not need to avoid background strobing and can use short shutter speeds. Photographers often shoot with very high-resolution equipment because the photograph may well be cropped in production. Starting with a high pixel count and sharp optics and technique allows the cropped picture to maintain adequate resolution.
Spatial transform theory suggests that the more accurately the spatial frequency is known the less accurately its location is known. This is not ideal for sharp images where the edges need both to be defined well as well as to be located well.
Efforts have been made to get around the Heisenberg uncertainty of transforms by making the window, the number of pixels processed, smaller as spatial frequency rises. The Haar transform and more recently the wavelet transform are examples of that. The Haar transform dates from 1909 and is rightly considered to be the first example of the use of wavelets.
Fig.3 shows a one-dimensional Haar Transform that works with blocks of eight pixels. The size of the block must be a power of two. The transform outputs eight coefficients because they are real rather than complex. The coefficients are 0, 1 and -1, which does wonders for the computational complexity. The top row of the matrix is all ones, meaning that the eight input pixels are added together to obtain the average brightness. The second row of coefficients is looking for a half cycle of modulation across the pixel block.
The third row of coefficients is interesting, because the second half are all zero. Those pixels are ignored and instead the first four are analyzed, looking for a half-cycle across four pixels, twice the frequency the second row was looking for, but with twice the spatial resolution because the window is half as big. The fourth row does the same, but with the second half of the pixels.
The next four coefficient rows have only two non-zero coefficients because they have closed the window down to two pixels and are looking for a half cycle in that window. Clearly four such windows are needed to work on the eight-pixel block, two pixels at a time. The frequency of interest is four times that of the second row of coefficients, and the window is four times smaller.
The characteristics of Haar and other wavelet transforms are shown in the frequency/time plot of Fig.4. The FT and the DCT have constant time resolution as frequency rises, whereas the wavelet transform (WT) shown has time resolution that increases with frequency.
The whole goal of transform coding is to obtain a set of coefficients that are sparse, in other words many of the values are zero or near zero and can be discarded to obtain compression. The Haar and wavelet transforms help, because edges in images are localized and lead to sparse transforms.
Fig.4 - The wavelet transform window gets smaller as frequency rises, so the temporal resolution increases.
Another way of looking at the difference between FT/DCT and the WT is that the basis functions that are used to multiply the input pixels have different characteristics. In the FT/DCT, the basis function has the length of the pixel block but a different number of cycles according to the frequency to be analyzed. In the WT, the basis function always has the same number of cycles and gets shorter as frequency rises.
That shortening is possible because the WT is based on cascaded filters each of which cut the spectrum of the input signal in half. Each one can be decimated by a factor of two so it contains half as many samples and so is shorter. After four such filters one sixteenth of the data are in each sub spectrum so the fact that the wavelet is getting shorter is cancelled out.
There is no one ideal transform that can deal with the range of possibilities in a video signal. One idea is for a coder to have more than one transform and to apply these to the input. The coder would then choose the transform that produced the most sparseness in the coefficients.
Whatever the type of transform employed, the required output bit rate must be met somehow. A certain amount of buffer memory is provided to allow for pictures that display more than average detail, but the buffers must not be allowed to overflow. The buffer is read out at the desired bit rate and it must be kept half full so that the instantaneous bit rate from the coder can rise and fall. The average bit rate from the coder must be reduced to equal the output bit rate and the truncation of coefficients has to be increased until that is achieved.
You might also like...
Here we begin a new five-part series looking at the current security landscape for OTT and streaming services.
Having considered all of the vital elements of moving image coding this final part looks at how these elements were combined throughout coding history.
The Edge network scales with the audience. The more people that stream concurrently, or the higher the average bitrate requested by a consistently sized audience, the more capacity the Edge network needs. Achieving best possible efficiency at the Edge requires…
The criticality of service assurance in OTT services is evolving quickly as audiences grow and large broadcasters double-down on their streaming strategies.
Having looked at the traditional approach to moving pictures and found that the portrayal of motion was irremediably poor, thoughts turn to how moving pictures might be portrayed properly.