Transforms: Part 6 - The Discrete Cosine Transform (DCT)

The Fourier Transform is complex in the mathematical sense, which means that each coefficient is represented by a complex number.

A complex number is a form of vector containing two parameters. Fig.1 shows that these parameters define the length of two vectors that are always at right angles to one another.

Readers who can recall the days of NTSC and PAL color television will recognize the similarity with the I/Q and U/V signals used in composite video, where the relationship between the magnitude of the two parameters determined the phase of the chroma signal.

The same idea is used in the inverse Fourier Transform. The Fourier Transform is reversible, in that, with suitable care, the transform and the inverse transform in series give back the original waveform, with the various frequency components having the correct amplitude and phase.

In the inverse Fourier Transform, the complex coefficients are used to get the phase of the time domain signal correct. One of the parameters determines the amplitude of a sine wave; the other parameter determines the amplitude of a cosine wave. As Fig.1 shows, the vector sum of those two components determines the amplitude and phase of the resultant signal.
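
The vector sum described above can be sketched in a few lines of Python. This is only an illustration: the function name `resultant`, the test frequency and the amplitudes 3 and 4 are arbitrary choices made here, not values from the figure.

```python
import numpy as np

def resultant(cos_amp, sin_amp):
    """Amplitude and phase (radians) of cos_amp*cos(wt) + sin_amp*sin(wt)."""
    amplitude = np.hypot(cos_amp, sin_amp)   # length of the vector sum
    phase = np.arctan2(sin_amp, cos_amp)     # angle of the vector sum
    return amplitude, phase

# Arbitrary example: cosine amplitude 3, sine amplitude 4.
a, b = 3.0, 4.0
amp, ph = resultant(a, b)

# Check that the two right-angle components really add up to a single
# sinusoid of that amplitude and phase: a*cos(wt) + b*sin(wt) = A*cos(wt - ph).
t = np.linspace(0.0, 1.0, 1000, endpoint=False)
w = 2 * np.pi * 5
two_components = a * np.cos(w * t) + b * np.sin(w * t)
single_wave = amp * np.cos(w * t - ph)
print(amp, np.degrees(ph), np.allclose(two_components, single_wave))
```

Changing the ratio of the two amplitudes moves the phase; changing them together scales the amplitude.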

Fig.1 - The coefficients of a Fourier transform are complex, so the two parameters between them define the phase.

This process is repeated at every frequency for which coefficients exist and when all of the components are added together the reconstructed waveform emerges.
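
As a rough sketch of that summation, the following uses NumPy's FFT on an arbitrary 16-sample real block and rebuilds the waveform one frequency at a time from the real (cosine) and imaginary (sine) parts of each coefficient. The block length and the random test data are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 16
x = rng.standard_normal(N)          # arbitrary real test waveform

X = np.fft.rfft(x)                  # complex coefficients, frequencies 0..N/2
n = np.arange(N)
rebuilt = np.zeros(N)

for k, coeff in enumerate(X):
    # For real input, every bin except zero frequency and Nyquist has a
    # mirror-image partner, so its contribution counts twice.
    weight = 1.0 if k in (0, N // 2) else 2.0
    angle = 2 * np.pi * k * n / N
    # Real part scales a cosine wave, imaginary part scales a sine wave.
    rebuilt += weight / N * (coeff.real * np.cos(angle) - coeff.imag * np.sin(angle))

print(np.allclose(rebuilt, x))
```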

This is good and bad, depending on the application. Where the application requires accuracy, it's good, because the use of a DFT in which the signal paths and the coefficients have adequate word length means that the signals can be as accurate as required. However, that is not the only application of transforms.

In compression, the goal is to obtain a reasonable impression of the original signal but using significantly fewer bits to transmit it. Most compression systems use a combination of lossless and lossy techniques to cut down the bit rate. Lossless techniques are to be preferred, because no quality impairment results. However, lossless compression can only achieve so much compression and if the bit rate needs to be reduced further, a lossy technique will have to be invoked.

Lossless compression retains the information in the original signal by seeking to eliminate redundancy. Lossy compressors must work by losing some information so that what is sent must be in error. The secret is to do it in such a way that the effects of the errors are minimized. This means studying the sensitivity of the destination to errors in the signal.

In the case of video compression, the signal destination is the human visual system (HVS). There is no other arbiter, because if no human viewer can see an error, it may as well not be there. A successful lossy compressor must be based upon what the HVS can and cannot see, so that any errors due to compression are rendered as invisible as possible.

As the bit rate goes down, compressed video will be found in one of three conditions. If the compression is lossless, the decoder will recreate the original signal. If the compression is lossy, the decoded signal would be found to be different from the original using suitable test equipment, but if the HVS cannot see the differences, we could say the signal is subjectively lossless. Further reduction in bit rate must result in the errors becoming visible at some point and then everyone knows the system is lossy.

One of the things that was learned about the HVS is that the sensitivity to noise in an image is not constant, but instead varies with spatial frequency as shown in Fig.2. The sensitivity to noise is greatest at low spatial frequencies, which correspond to larger areas being in error. The well-known phenomenon of wide-area flicker follows from that. Visibility of noise falls at higher spatial frequencies.

This contrasts with the characteristics of a conventional quantizer, in which the noise level is essentially constant. This means that if the noise floor is set low enough to be invisible where the HVS is the most sensitive, then the noise performance at other spatial frequencies will be over-specified.

Conventionally sampled digital video carries the entire visible range of spatial frequencies in the same way and so no advantage can be taken of Fig.2. This can only be done if the image information is converted into the spatial frequency domain, so that different frequencies can be treated in different ways. That immediately suggests the use of transforms.

One of the most common lossy compression techniques used in audio and video is noise shaping. A conventionally sampled digital representation of the signal is fed through a transform or filter bank such that it is divided into a number of different frequency bands. Within each band the noise level is increased according to knowledge of the human visual or auditory systems.

Fig.2 - The sensitivity of the HVS to noise falls with increasing spatial frequency.

The goal is to reduce the amount of data needed to represent the signal. If one bit is removed from the least significant end of a digital value, the noise rises by about 6dB because the size of the quantizing steps is effectively doubled. Reducing the number of bits is what we want; the increase in noise is the price paid for it.
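
The 6dB figure follows from the doubling of the step size, since 20 log10(2) is about 6.02. The sketch below computes that and also measures it on an arbitrary uniformly distributed test signal; the helper names and the test signal are assumptions made for illustration.

```python
import math
import numpy as np

def noise_increase_db(bits_removed):
    """Rise in quantizing noise when LSBs are dropped (step doubles per bit)."""
    return 20 * math.log10(2 ** bits_removed)

def noise_rms(x, step):
    """RMS error after rounding x to the nearest multiple of step."""
    return np.sqrt(np.mean((np.round(x / step) * step - x) ** 2))

rng = np.random.default_rng(1)
signal = rng.uniform(-1000, 1000, 100_000)   # arbitrary test signal

# Doubling the step size is equivalent to removing one bit.
measured = 20 * np.log10(noise_rms(signal, 2.0) / noise_rms(signal, 1.0))
print(noise_increase_db(1), measured)
```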

As the original signal has been split into different frequency bands, it follows that noise can be raised by a different amount at each frequency so that full advantage can be taken of a characteristic such as that of Fig.2.
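
A minimal sketch of that idea is shown below: each frequency coefficient is quantized with its own step size, so the error allowed in each band is different. The step sizes here are hypothetical and simply grow with frequency; a real codec would derive them from a measured characteristic such as Fig.2.

```python
import numpy as np

def quantize_bands(coeffs, steps):
    """Quantize each frequency coefficient with its own step size."""
    levels = np.round(coeffs / steps)   # small integers that would be sent
    decoded = levels * steps            # what the decoder reconstructs
    return levels, decoded

rng = np.random.default_rng(2)
coeffs = rng.uniform(-100, 100, 8)      # arbitrary coefficients, low to high frequency

# Hypothetical step sizes: fine at low frequencies, coarse at high frequencies.
steps = np.array([1, 1, 2, 2, 4, 4, 8, 8], dtype=float)

levels, decoded = quantize_bands(coeffs, steps)
errors = np.abs(decoded - coeffs)
print(errors)        # each error is at most half of that band's step size
```

Doubling a step size removes one bit from that coefficient and lets the error in that band roughly double, which is the trade the compressor is making.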

Transforms can be extremely useful for splitting the signal up into frequency bands, but they may be less useful when it comes to raising the noise floor. The Fourier Transform is at a disadvantage because the coefficients are complex. If the coefficients are preserved to full accuracy, there is no change to the signal phase. However, if complex coefficients are shortened in word length, the resultant rounding off does not just raise the noise level, but it can also change the phase of the signal.

If a complex coefficient is rounded, one parameter may be rounded up whereas the other is rounded down, and the phase rotates. In video a phase change is the equivalent of shifting picture information across the screen and is an unacceptable distortion.
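
The effect is easy to demonstrate with a single made-up coefficient. In this sketch the value 3.4 + 2.6j and the step size of 1 are arbitrary; the real part rounds down and the imaginary part rounds up, so the phase moves.

```python
import cmath
import math

def round_complex(c, step):
    """Round the real and imaginary parts independently to multiples of step."""
    return complex(round(c.real / step) * step, round(c.imag / step) * step)

original = complex(3.4, 2.6)            # arbitrary example coefficient
shortened = round_complex(original, 1.0)

phase_before = math.degrees(cmath.phase(original))
phase_after = math.degrees(cmath.phase(shortened))
print(shortened, phase_before, phase_after)
```

Here the coefficient becomes 3 + 3j and the phase rotates from roughly 37 degrees to 45 degrees, in addition to the small change in magnitude.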

One solution is to employ a transform that does not produce complex coefficients. The discrete cosine transform (DCT) is such a transform. When the input block of samples is being assembled from the source data, the samples are mirrored, or reversed in time, and appended to the front of the block so that the new block is symmetrical about the center.

Fig.3 shows the result, which is that any sinusoidal component of the input undergoes a phase reversal at the mirror, whereas a cosinusoidal component carries on unchanged. The result of mirroring is that any sinusoidal component on one side of the mirror is cancelled out by the component on the other side. This means that sinusoidal coefficients would always be zero and so need not be calculated. Only the cosine coefficients are computed, hence the name of the transform.

The computation works on the same principle as for a Fourier Transform. Basis functions for each frequency are multiplied by the mirrored input waveform. There is only one basis function for each frequency, which is cosinusoidal. The coefficients are no longer complex and so shortening their word length does not cause phase shifts, only an increase in the noise level.
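
The cancellation can be checked numerically. The sketch below mirrors an arbitrary 8-sample block, measures sample positions from the mirror (which falls halfway between the two center samples), and multiplies by sine and cosine basis functions. The basis frequencies and the lack of scaling are assumptions made here; they correspond to one common form of the transform, usually called the DCT-II, and practical codecs add normalization.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 8
x = rng.standard_normal(N)                   # arbitrary input block

mirrored = np.concatenate([x[::-1], x])      # reversed copy placed in front

# Positions relative to the mirror: -7.5, -6.5, ..., -0.5, 0.5, ..., 7.5
t = np.arange(2 * N) - (N - 0.5)
k = np.arange(N)[:, None]                    # one basis frequency per row

sin_coeffs = (mirrored * np.sin(np.pi * k * t / N)).sum(axis=1)
cos_coeffs = (mirrored * np.cos(np.pi * k * t / N)).sum(axis=1)

# The same cosine coefficients computed from the original block alone,
# without ever building the mirrored block.
n = np.arange(N)
direct = 2 * (x * np.cos(np.pi * k * (n + 0.5) / N)).sum(axis=1)

print(np.allclose(sin_coeffs, 0.0), np.allclose(cos_coeffs, direct))
```

The sine results are zero to within rounding error, and the cosine results are real numbers that can be computed from the unmirrored block, which is why the mirroring never needs to be carried out explicitly.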

Fig.3 - In the DCT, the mirroring of the data block, shown at a) results in sinusoidal components disappearing at b), leaving only the cosine components.

The discrete cosine transform became very popular in both audio and image compression and found application in JPEG, MPEG-1 and MPEG-2. For image compression the DCT would be used in a two-dimensional version where it computed both horizontal and vertical spatial frequencies.

The most common input format for the DCT is the 8 x 8 pixel array. Where there are color difference signals, these will be subsampled. In a 16 x 16 pixel block using 4:2:0 coding, there will be four 8 x 8 luma blocks and two 8 x 8 color difference blocks.
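
The block count can be worked through as simple arithmetic. The function below is a hypothetical helper written for this example; it assumes 4:2:0 means the color difference signals are subsampled by two both horizontally and vertically.

```python
def blocks_per_macroblock(size=16, block=8, chroma_div_h=2, chroma_div_v=2):
    """Count DCT blocks in a size x size pixel area with subsampled chroma."""
    luma_blocks = (size // block) ** 2
    # Each color difference signal covers the same picture area with fewer samples.
    chroma_w = size // chroma_div_h
    chroma_h = size // chroma_div_v
    blocks_per_chroma = (chroma_w // block) * (chroma_h // block)
    return luma_blocks, 2 * blocks_per_chroma   # two color difference signals

luma, chroma = blocks_per_macroblock()
print(luma, chroma)
```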

The result of transforming a pixel block is an array of 64 coefficients. Although each coefficient is a single binary number, it represents two spatial frequencies, one horizontal and one vertical. In the coefficient array, all of the coefficients in a particular row share the same vertical frequency, but each combines it with one of eight different horizontal frequencies.

The top left coefficient represents zero frequency in both dimensions. In other words it represents the average brightness of the entire pixel block and is often, if erroneously, referred to as the DC component of the block. Any error in that coefficient affects the greatest screen area and so will be highly visible. Typically the average brightness coefficient is treated differently than the others in that it will be transmitted with greater accuracy where possible.
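
A small sketch of a two-dimensional 8 x 8 DCT illustrates both points. It uses an orthonormal scaling, which is an assumption made here (standards differ in how they scale the coefficients), and the helper names `dct_matrix` and `dct2` are invented for the example.

```python
import numpy as np

def dct_matrix(N=8):
    """Orthonormal DCT basis: row k holds the cosine of frequency index k."""
    n = np.arange(N)
    k = n[:, None]
    C = np.sqrt(2.0 / N) * np.cos(np.pi * (2 * n + 1) * k / (2 * N))
    C[0, :] /= np.sqrt(2.0)
    return C

def dct2(block):
    """2D DCT of a square block: transform vertically, then horizontally."""
    C = dct_matrix(block.shape[0])
    return C @ block @ C.T

# A flat block: only the top left coefficient is non-zero, and with this
# scaling it equals 8 times the average brightness.
flat = np.full((8, 8), 100.0)
flat_coeffs = dct2(flat)

# A block that varies only horizontally: vertical frequency is zero, so all
# of the energy lands in the top row, here at horizontal frequency index 3.
cols = np.arange(8)
stripes = np.tile(np.cos(np.pi * (2 * cols + 1) * 3 / 16), (8, 1))
stripe_coeffs = dct2(stripes)

print(np.round(flat_coeffs[0, 0], 6))
print(np.round(stripe_coeffs[0], 3))
```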

If that accuracy is not possible, adjacent DCT blocks may not blend together properly and the regular block structure may become visible. As the position of the block boundaries is closely defined, later decoders were able to detect blocking and filter it out to some extent.
