Transforms: Part 6 - The Discrete Cosine Transform (DCT)

The Fourier Transform is complex in the mathematical sense, which means that each coefficient is represented by a complex number.

A complex number is a form of vector containing two parameters. Fig.1 shows that these parameters define the lengths of two vectors that are always at right angles to one another.

Readers who can recall the days of NTSC and PAL color television will recognize the similarity with the I/Q and U/V signals used in composite video, where the relationship between the magnitude of the two parameters determined the phase of the chroma signal.

The same idea is used in the inverse Fourier Transform. The Fourier Transform is reversible, in that, with suitable care, the transform and the inverse transform in series give back the original waveform, with the various frequency components having the correct amplitude and phase.

In the inverse Fourier Transform, the complex coefficients are used to get the phase of the time domain signal correct. One of the parameters determines the amplitude of a sine wave; the other parameter determines the amplitude of a cosine wave. As Fig.1 shows, the vector sum of those two components determines the amplitude and phase of the resultant signal.
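The vector sum described above can be checked numerically. The following is a minimal sketch in Python; the function name and the example values 3.0 and 4.0 are illustrative assumptions, not taken from Fig.1. It shows that a cosine wave and a sine wave of the same frequency add up to a single wave whose amplitude and phase follow from the two parameters.

```python
import math

def amplitude_and_phase(cos_amp, sin_amp):
    """Return the amplitude and phase (radians) of
    cos_amp*cos(wt) + sin_amp*sin(wt) = amplitude*cos(wt - phase)."""
    amplitude = math.hypot(cos_amp, sin_amp)   # length of the vector sum
    phase = math.atan2(sin_amp, cos_amp)       # angle of the vector sum
    return amplitude, phase

# Illustrative values only
cos_amp, sin_amp = 3.0, 4.0
amplitude, phase = amplitude_and_phase(cos_amp, sin_amp)

# Compare the two separate components with the single resultant
# at an arbitrary instant.
wt = 0.7
two_components = cos_amp * math.cos(wt) + sin_amp * math.sin(wt)
one_resultant = amplitude * math.cos(wt - phase)

print(amplitude, math.degrees(phase))
print(abs(two_components - one_resultant) < 1e-12)
```

Changing the ratio between the two parameters changes the phase, whereas scaling both by the same factor changes only the amplitude.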

Fig.1 - The coefficients of a Fourier transform are complex, so the two parameters between them define the phase.

This process is repeated at every frequency for which coefficients exist and when all of the components are added together the reconstructed waveform emerges.
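As a rough illustration of that reversibility, the sketch below implements a direct (slow) DFT and its inverse in Python. The function names and the sample values are assumptions made for the example; a practical system would use a fast algorithm instead.

```python
import cmath

def dft(samples):
    """Direct discrete Fourier transform: one complex coefficient per frequency."""
    n = len(samples)
    return [sum(s * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, s in enumerate(samples))
            for k in range(n)]

def inverse_dft(coefficients):
    """Add together the component at every frequency to rebuild the waveform."""
    n = len(coefficients)
    return [sum(c * cmath.exp(2j * cmath.pi * k * t / n)
                for k, c in enumerate(coefficients)).real / n
            for t in range(n)]

# Illustrative real-valued waveform
waveform = [0.0, 3.0, -1.0, 4.0, 2.0, -5.0, 1.5, 0.5]
rebuilt = inverse_dft(dft(waveform))
worst = max(abs(a - b) for a, b in zip(waveform, rebuilt))
print(worst < 1e-9)
```

The residual difference is due only to floating point rounding, which is the sense in which adequate word length makes the process as accurate as required.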

This is good and bad, depending on the application. Where the application requires accuracy, it's good, because the use of a DFT in which the signal paths and the coefficients have adequate word length means that the signals can be as accurate as required. However, that is not the only application of transforms.

In compression, the goal is to obtain a reasonable impression of the original signal but using significantly fewer bits to transmit it. Most compression systems use a combination of lossless and lossy techniques to cut down the bit rate. Lossless techniques are to be preferred, because no quality impairment results. However, lossless compression can only achieve so much compression and if the bit rate needs to be reduced further, a lossy technique will have to be invoked.

Lossless compression retains the information in the original signal by seeking to eliminate redundancy. Lossy compressors must work by losing some information so that what is sent must be in error. The secret is to do it in such a way that the effects of the errors are minimized. This means studying the sensitivity of the destination to errors in the signal.

In the case of video compression, the signal destination is the human visual system (HVS). There is no other arbiter, because if no human viewer can see an error, it may as well not be there. A successful lossy compressor must be based upon what the HVS can and cannot see, so that any errors due to compression are rendered as invisible as possible.

As the bit rate goes down, compressed video will be found in one of three conditions. If the compression is lossless, the decoder will recreate the original signal. If the compression is lossy, the decoded signal would be found to be different from the original using suitable test equipment, but if the HVS cannot see the differences, we could say the signal is subjectively lossless. Further reduction in bit rate must result in the errors becoming visible at some point and then everyone knows the system is lossy.

One of the things that was learned about the HVS is that the sensitivity to noise in an image is not constant, but instead varies with spatial frequency as shown in Fig.2. The sensitivity to noise is greatest at low spatial frequencies, which correspond to larger areas being in error. The well-known phenomenon of wide-area flicker follows from that. Visibility of noise falls at higher spatial frequencies.

This contrasts with the characteristics of a conventional quantizer, in which the noise level is essentially constant. This means that if the noise floor is set low enough to be invisible where the HVS is the most sensitive, then the noise performance at other spatial frequencies will be over-specified.

Conventionally sampled digital video carries the entire visible range of spatial frequencies in the same way and so no advantage can be taken of Fig.2. This can only be done if the image information is converted into the spatial frequency domain, so that different frequencies can be treated in different ways. That immediately suggests the use of transforms.

One of the most common lossy compression techniques used in audio and video is noise shaping. A conventionally sampled digital representation of the signal is fed through a transform or filter bank such that it is divided into a number of different frequency bands. Within each band the noise level is increased according to knowledge of the human visual or auditory systems.

Fig.2 - The sensitivity of the HVS to noise falls with increasing spatial frequency.

The goal is to reduce the amount of data needed to represent the signal. If one bit is removed from the least significant end of a digital value, the noise rises by 6dB because the size of the quantizing steps is effectively doubled. Reducing the number of bits is what we want; the increase in noise follows on from that.
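The 6dB figure can be verified with a short experiment. This is a sketch under simple assumptions: the input is a 12-bit ramp covering every code value once, and the helper names are hypothetical. It measures the RMS error after discarding three and then four least significant bits and compares the two.

```python
import math

def requantize(value, bits_removed):
    """Round an integer sample to the nearest multiple of the larger step."""
    step = 1 << bits_removed
    return (value + step // 2) // step * step

def rms_error(bits_removed, samples):
    errors = [requantize(s, bits_removed) - s for s in samples]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

samples = range(4096)              # assumed 12-bit ramp
error_3 = rms_error(3, samples)    # step size 8
error_4 = rms_error(4, samples)    # step size 16, one further bit removed
rise_db = 20 * math.log10(error_4 / error_3)
theory_db = 20 * math.log10(2)     # doubling the step size

print(rise_db, theory_db)
```

The measured rise comes out close to the theoretical value of about 6.02dB for a doubling of step size.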

As the original signal has been split into different frequency bands, it follows that noise can be raised by a different amount at each frequency so that full advantage can be taken of a characteristic such as that of Fig.2.
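One way to picture this is a quantizer whose step size differs from band to band. In the sketch below the coefficient values and the step sizes are invented purely for illustration and do not come from any standard or from Fig.2; the only point is that coarser steps are applied where noise is assumed to be less visible.

```python
def shape_noise(coefficients, steps):
    """Quantize each frequency band with its own step size."""
    return [round(c / s) * s for c, s in zip(coefficients, steps)]

# Hypothetical band values, lowest frequency first
coefficients = [183.0, -41.5, 22.25, -9.0, 5.5, -3.0, 1.75, 0.5]
# Hypothetical step sizes: fine at low frequencies, coarse at high ones
steps = [1, 2, 4, 8, 8, 16, 16, 32]

shaped = shape_noise(coefficients, steps)
errors = [abs(a - b) for a, b in zip(coefficients, shaped)]
print(shaped)
print(errors)
```

The error in each band never exceeds half of that band's step size, so the noise floor follows the chosen step sizes rather than being constant.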

Transforms can be extremely useful for splitting the signal up into frequency bands, but they may be less useful when it comes to raising the noise floor. The Fourier Transform is at a disadvantage because the coefficients are complex. If the coefficients are preserved to full accuracy, there is no change to the signal phase. However, if complex coefficients are shortened in word length, the resultant rounding off does not just raise the noise level, but it can also change the phase of the signal.

If a complex coefficient is rounded, one parameter may be rounded up whereas the other is rounded down, and the phase rotates. In video a phase change is the equivalent of shifting picture information across the screen and is an unacceptable distortion.
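The effect is easy to reproduce. In this minimal Python sketch the coefficient value and the step size are arbitrary choices made so that the real part rounds down while the imaginary part rounds up.

```python
import cmath
import math

def round_complex(c, step):
    """Shorten the word length of both parameters independently."""
    return complex(round(c.real / step) * step,
                   round(c.imag / step) * step)

coefficient = complex(5.4, 2.6)           # illustrative value
coarse = round_complex(coefficient, 4.0)  # illustrative step size

phase_before = math.degrees(cmath.phase(coefficient))
phase_after = math.degrees(cmath.phase(coarse))

print(coarse)
print(phase_before, phase_after)
```

Here 5.4 rounds down to 4 and 2.6 rounds up to 4, so the phase moves from under 26 degrees to 45 degrees even though neither parameter changed by more than half a step.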

One solution is to employ a transform that does not produce complex coefficients. The discrete cosine transform (DCT) is such a transform. When the input block of samples is being assembled from the source data, the samples are mirrored: a time-reversed copy is appended to the front of the block so that the new block is symmetrical about the center.

Fig.3 shows the result, which is that any sinusoidal component of the input undergoes a phase reversal at the mirror, whereas a cosinusoidal component carries on unchanged. The result of mirroring is that any sinusoidal component on one side of the mirror is cancelled out by the component on the other side. This means that sinusoidal coefficients would always be zero and so need not be calculated. Only the cosine coefficients are computed, hence the name of the transform.
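The cancellation can be demonstrated directly. The sketch below is an illustration under stated assumptions: the sample values are arbitrary, the function names are hypothetical, and position is measured from the mirror, which lies half a sample period between the two middle samples of the mirrored block.

```python
import math

def mirrored(block):
    """Prepend a time-reversed copy so the block is symmetrical about its center."""
    return block[::-1] + block

def sine_and_cosine_sums(samples, k):
    """Correlate a mirrored block with a sine and a cosine of frequency index k."""
    n = len(samples) // 2                  # length of the original block
    sin_sum = cos_sum = 0.0
    for i, s in enumerate(samples):
        t = i - n + 0.5                    # position relative to the mirror
        angle = math.pi * k * t / n
        sin_sum += s * math.sin(angle)
        cos_sum += s * math.cos(angle)
    return sin_sum, cos_sum

block = [12.0, 15.0, 9.0, 4.0, 7.0, 20.0, 18.0, 3.0]   # arbitrary samples
sums = [sine_and_cosine_sums(mirrored(block), k) for k in range(8)]

for k, (sin_sum, cos_sum) in enumerate(sums):
    print(k, round(sin_sum, 9), round(cos_sum, 3))
```

Every sine sum is zero to within rounding error, because each sample on one side of the mirror is paired with an equal sample at the opposite position, where the sine has the opposite sign.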

The computation works on the same principle as for a Fourier Transform. Basis functions for each frequency are multiplied by the mirrored input waveform. There is only one basis function for each frequency, which is cosinusoidal. The coefficients are no longer complex and so shortening their word length does not cause phase shifts, only an increase in the noise level.
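A compact way to see this is to compute the transform directly. The sketch below assumes the common DCT-II form with orthonormal scaling, which the text above does not specify, and uses arbitrary sample values and an arbitrary quantizing step. The mirroring is implicit in the half-sample offset of the cosine argument.

```python
import math

def scale(k, n):
    # Orthonormal scaling (an assumption made for this sketch)
    return math.sqrt((1 if k == 0 else 2) / n)

def dct(block):
    """One real coefficient per frequency, using a cosine basis only."""
    n = len(block)
    return [scale(k, n) * sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                              for i, x in enumerate(block))
            for k in range(n)]

def idct(coeffs):
    n = len(coeffs)
    return [sum(scale(k, n) * c * math.cos(math.pi * (i + 0.5) * k / n)
                for k, c in enumerate(coeffs))
            for i in range(n)]

block = [12.0, 15.0, 9.0, 4.0, 7.0, 20.0, 18.0, 3.0]   # arbitrary samples
coeffs = dct(block)
exact = idct(coeffs)

step = 4.0                                             # arbitrary step size
quantized = [round(c / step) * step for c in coeffs]
decoded = idct(quantized)

coeff_noise = sum((q - c) ** 2 for q, c in zip(quantized, coeffs))
sample_noise = sum((d - x) ** 2 for d, x in zip(decoded, block))
print(coeff_noise, sample_noise)
```

With this scaling the error energy in the decoded samples equals the error energy introduced into the coefficients, so coarser coefficients show up as added noise and nothing else.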

Fig.3 - In the DCT, the mirroring of the data block, shown at a) results in sinusoidal components disappearing at b), leaving only the cosine components.

The discrete cosine transform became very popular in both audio and image compression and found application in JPEG, MPEG-1 and MPEG-2. For image compression the DCT would be used in a two-dimensional version where it computed both horizontal and vertical spatial frequencies.

The most common input format for the DCT is the 8 x 8 pixel array. Where there are color difference signals, these will be subsampled. In a 16 x 16 pixel block using 4:2:0 coding, there will be four 8 x 8 luma blocks and two 8 x 8 color difference blocks.
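The block count follows from simple arithmetic, sketched here in Python. The function and parameter names are hypothetical; the subsampling factors of two in each direction correspond to 4:2:0 coding.

```python
def blocks_in_macroblock(size=16, block=8, h_sub=2, v_sub=2):
    """Count the 8 x 8 luma and color difference blocks in a square pixel area.

    h_sub and v_sub are the horizontal and vertical subsampling factors
    of the color difference signals (2 and 2 for 4:2:0).
    """
    luma = (size // block) ** 2
    per_signal = ((size // h_sub) // block) * ((size // v_sub) // block)
    color_difference = 2 * per_signal      # two color difference signals
    return luma, color_difference

print(blocks_in_macroblock())
```

For the default values this gives four luma blocks and two color difference blocks, six DCT blocks in all.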

The result of transforming a pixel block is an array of 64 coefficients. Although each coefficient is a single binary number, it represents two spatial frequencies, one horizontal and one vertical. In the coefficient array, a particular row of coefficients all result in the same vertical frequency, but combined with eight different horizontal frequencies.
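The layout of the coefficient array can be explored with a separable two-dimensional transform, in which rows are transformed first and then columns. This is a sketch assuming an orthonormal DCT-II; the test picture is a hypothetical horizontal ramp that is identical on every line and so contains no vertical detail.

```python
import math

def dct_1d(values):
    n = len(values)
    return [math.sqrt((1 if k == 0 else 2) / n) *
            sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                for i, x in enumerate(values))
            for k in range(n)]

def dct_2d(block):
    """Return coefficients indexed [vertical frequency][horizontal frequency]."""
    rows = [dct_1d(list(r)) for r in block]          # horizontal frequencies
    cols = [dct_1d(list(c)) for c in zip(*rows)]     # vertical frequencies
    return [list(r) for r in zip(*cols)]             # back to row-major order

# Hypothetical 8 x 8 block: brightness changes only from left to right
block = [[10.0 * x for x in range(8)] for _ in range(8)]
coeffs = dct_2d(block)

for row in coeffs:
    print([round(c, 2) for c in row])
```

Only the top row of the output is non-zero, because that row holds the coefficients having zero vertical frequency combined with each of the eight horizontal frequencies.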

The top left coefficient represents zero frequency in both dimensions. In other words it represents the average brightness of the entire pixel block and is often, if erroneously, referred to as the DC component of the block. Any error in that coefficient affects the greatest screen area and so will be highly visible. Typically the average brightness coefficient is treated differently than the others in that it will be transmitted with greater accuracy where possible.
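The following sketch illustrates both points, again assuming an orthonormal DCT-II so that the top left coefficient of an 8 x 8 block equals eight times the average brightness. The pixel values and the size of the injected error are arbitrary.

```python
import math

def scale(k, n):
    return math.sqrt((1 if k == 0 else 2) / n)

def dct_1d(v):
    n = len(v)
    return [scale(k, n) * sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                              for i, x in enumerate(v)) for k in range(n)]

def idct_1d(c):
    n = len(c)
    return [sum(scale(k, n) * ck * math.cos(math.pi * (i + 0.5) * k / n)
                for k, ck in enumerate(c)) for i in range(n)]

def apply_2d(transform, block):
    """Apply a 1-D transform to every row and then to every column."""
    rows = [transform(list(r)) for r in block]
    cols = [transform(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

# Arbitrary 8 x 8 block of pixel values
block = [[float((3 * x + 5 * y) % 32 + 100) for x in range(8)] for y in range(8)]
mean = sum(map(sum, block)) / 64

coeffs = apply_2d(dct_1d, block)
top_left = coeffs[0][0]
print(top_left, 8 * mean)

coeffs[0][0] += 8.0                      # inject an error into that coefficient
decoded = apply_2d(idct_1d, coeffs)
shifts = [decoded[y][x] - block[y][x] for y in range(8) for x in range(8)]
print(min(shifts), max(shifts))
```

The injected error moves every one of the 64 pixels by the same amount, which is why an error in this coefficient disturbs the whole block at once.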

If that accuracy is not possible, adjacent DCT blocks may not blend together properly and the regular block structure may become visible. As the position of the block boundaries is closely defined, later decoders were able to detect blocking and filter it out to some extent.
