Compression: Part 3 - Reference Codecs
Compression relies heavily on information theory, but also on a good deal of common sense. Here we pick our way between the two.
Image or video compression is in widespread use, so there is no point in arguing about whether it is possible: that much is obvious. On the other hand, being able to say why it is possible is a different matter. The human visual system (HVS) has evolved to help us survive and it works by creating a more or less accurate model of our surroundings. Our own bodies are included in the model. Whenever we use the HVS to reach out towards and then touch some object, we confirm that it is actually there and that our visual system is conveying something real.
What do we mean by some object? A solid object occupies some three-dimensional space, and it does so exclusively. Trying to make two such objects occupy the same space is the definition of a collision. When the HVS looks at an object, each eye sees a two-dimensional image and the third dimension has to be implied by a combination of clues. Common sense tells us that if an object is behind other objects we won’t be able to see it. There must be a direct optical path to or from the object. Along that path the HVS sees that the object differs from the background in some way, perhaps in color, brightness or texture. As the HVS is stereoscopic, there are also depth clues from interocular differences.
Most imaging systems, including those used in most television, are not stereoscopic. Instead, the cameras have single lenses and the systems are cyclopean. The Cyclops was a one-eyed monster from Greek mythology. Cyclopean systems omit the interocular differences, but leave all of the other depth clues intact and that seems to be adequate for most purposes, including television and cinema.
Fundamentally, the extent of the object is determined by its edges, where something changes from having the attributes of the object to the attributes of the background. Edges are very important in human and in machine vision, because they might help reveal the extent of an object.
On the other hand they might not. The whole art of camouflage is based on preventing the visual recognition of objects by disrupting the detection of edges. One approach is to conceal existing edges under some sort of covering; another is to introduce new and stronger edges that are completely unrelated to the object.
Given the importance of edges in vision, we need to look at what an edge actually is. As we have seen, it exists when there is a contrast between an object and its background. Contrast has to be loosely interpreted here, because a real edge may be revealed by a change of hue as well as by a change of brightness. By differentiating the signal from a camera, we pull out all of the edges. Further processing allows the boundary of an object to be established.
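As a rough illustration, the sketch below differentiates a single scan line of pixel values using NumPy; the differences are large only where the brightness changes, which is where the edges are. It is only a toy example with assumed values, not the edge detection used in any particular codec or vision system.

```python
import numpy as np

# One scan line: dark background, a brighter object, then background again.
scan_line = np.array([10, 10, 10, 10, 200, 200, 200, 200, 10, 10, 10], dtype=float)

# Differentiating the signal (first differences between neighbouring pixels)
# produces large values only where the brightness changes.
gradient = np.diff(scan_line)

# Positions where the gradient magnitude exceeds a threshold mark the edges.
edges = np.where(np.abs(gradient) > 50)[0]

print("gradient:", gradient)
print("edge positions:", edges)  # the two transitions, into and out of the object
```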
In so-called moving pictures, we have the further advantage that motion can help to reveal edges. When an object moves against a background, considerable changes occur at its edges. An object can then be defined as some area of the image in which motion is much the same and which contrasts with motion outside that area. It is immediately evident that a recognizable object in an image must be of finite size. To be recognizable, it must contain a number of pixels that are the same or similar in value.
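The motion clue can be sketched in the same spirit: the difference between two frames is large only where something has moved. The fragment below is a minimal, hypothetical frame-difference detector, not the motion estimation any real codec uses.

```python
import numpy as np

h, w = 8, 8
frame1 = np.zeros((h, w))
frame2 = np.zeros((h, w))

# A small bright object occupies columns 2-3 in frame 1 and has moved
# one pixel to the right by frame 2; the background does not change.
frame1[3:5, 2:4] = 200
frame2[3:5, 3:5] = 200

# Pixels that differ between the frames outline the moving object.
moving = np.abs(frame2 - frame1) > 50
rows, cols = np.nonzero(moving)
print("moving region: rows", rows.min(), "-", rows.max(),
      "cols", cols.min(), "-", cols.max())
```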
Any image that contains a series of similar pixel values contains redundancy and can be compressed. This takes us back to Murray Gell-Mann’s concept of complexity. According to that, an image in which the pixel values are totally random cannot contain any edges or objects and is incompressible. At the other extreme, an image in which every pixel is the same cannot have any edges or objects either, so it is easily compressible but not useful. Complexity theory suggests that information resides somewhere between the two extremes, where edges and objects of finite size can exist and where the picture carries information that can be compressed.
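A quick way to see this is to build three synthetic 'images' (one uniform, one containing a single object with an edge, one random) and compare how far a general-purpose compressor such as zlib can shrink each. The exact figures depend on the compressor and are only illustrative.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
size = (64, 64)

uniform = np.full(size, 128, dtype=np.uint8)              # every pixel identical
structured = np.zeros(size, dtype=np.uint8)
structured[:, 32:] = 200                                   # one object, one edge
random_img = rng.integers(0, 256, size, dtype=np.uint8)    # pure noise

for name, img in [("uniform", uniform),
                  ("structured", structured),
                  ("random", random_img)]:
    raw = img.tobytes()
    packed = zlib.compress(raw, level=9)
    print(f"{name:10s} {len(raw)} -> {len(packed)} bytes")
```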
It is known from information theory that truly random data cannot be compressed and the most efficient way of sending it is in its original form. It also follows that a good way of testing a proposed security password is to attempt to compress it. If it can be compressed easily it is not secure. The fact that random pixels cannot be compressed is of more than academic interest, because all video signals have a noise floor, which in many cases will be random. In a compression environment, noisy video signals are bad news, because the noise creates false spatial frequencies and false picture differences that must be coded. Noisy signals may need noise reduction before coding is attempted.
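The password observation can be sketched in the same way: compress each candidate and see how much it shrinks relative to the others. zlib is used here purely as an illustration, and compressibility is only one of many measures of password strength.

```python
import zlib

candidates = [
    b"aaaaaaaaaaaaaaaa",    # highly repetitive: compresses well, so weak
    b"passwordpassword",    # a repeated word: still plenty of redundancy
    b"q7#Vt9!kRx2&mZp4",    # closer to random: resists compression
]

for pw in candidates:
    packed = zlib.compress(pw, level=9)
    # zlib adds a fixed overhead on short inputs, so compare the candidates
    # with each other rather than treating the sizes as absolute ratios.
    print(f"{pw.decode():20s} {len(pw):2d} -> {len(packed):2d} bytes")
```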
Fig.1 - The encoder contains a decoder so that it knows what the remote decoder can predict. Whatever cannot be predicted is transmitted to the decoder as a residual.
Having established that video signals are compressible, it is possible to move on to look at redundancy. There are two ways of looking at the topic. One of them is from the standpoint of information theory, where redundancy is ideally anything that need not be sent, irrespective of the complication or processing power needed to find it. The other, perhaps more practical way of looking at redundancy is to accept that all real codecs have to be affordable, to work within a limited time scale and consume reasonable amounts of power. This means that the ideal will never be met and it can be thought of as a bound or limit. Bounds can be useful, because any proposal that appears to go beyond a bound may have been exaggerated.
In the real world we are faced with making something that works and the most critical part is the decoder, since every viewer will need one. That decoder need not be perfect or reach any bounds, provided redundancy is re-defined as anything the decoder can figure out for itself. In Fig.1, which shows at a high level practically every codec ever made, it will be seen that the encoder contains a decoder.
The decoder has a certain amount of memory and so can remember a known number of earlier images. It can use these to predict what the next image might look like. Such a prediction cannot be perfect, so the encoder subtracts the local prediction from the actual image to produce what is effectively the prediction error, also known as the residual. The residual is transmitted to the remote decoder. The sum of the remote prediction, which is identical to the local prediction, and the residual reveals the decoded image.
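A minimal sketch of the loop in Fig.1, assuming the simplest possible predictor (the previous decoded frame), is shown below. The encoder runs exactly the same prediction as the remote decoder, sends only the residual, and the decoder adds the residual to its own prediction to recover the image.

```python
import numpy as np

def encode(frames):
    """Yield residuals: each frame minus the prediction (the previous decoded frame)."""
    prediction = np.zeros_like(frames[0])
    for frame in frames:
        residual = frame - prediction        # what the remote decoder cannot predict
        prediction = prediction + residual   # local decode, kept in step with the remote one
        yield residual

def decode(residuals):
    """Rebuild the frames by adding each residual to the running prediction."""
    prediction = None
    for residual in residuals:
        if prediction is None:
            prediction = np.zeros_like(residual)
        prediction = prediction + residual
        yield prediction.copy()

frames = [np.full((4, 4), v, dtype=np.int32) for v in (10, 12, 12, 30)]
decoded = list(decode(encode(frames)))
print(all(np.array_equal(a, b) for a, b in zip(frames, decoded)))  # True: lossless
```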
If the residual is sent in its entirety, the compression is lossless. In most broadcast applications the complete residual is not sent. Instead, the residual is impaired in ways the HVS finds difficult to detect, so the result can be indistinguishable from lossless. If the compression factor is raised further, of course, the impairments become visible and it is then obvious that the system is lossy.
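To make the same loop lossy, the residual can be impaired before transmission, for example by coarse quantization. The crucial point, sketched below with a hypothetical step size of 8, is that the encoder's local decoder must use the impaired residual as well, so that both ends stay in step and the error remains small and bounded.

```python
import numpy as np

STEP = 8  # hypothetical quantizer step: larger means fewer bits but more impairment

def quantize(residual):
    return np.round(residual / STEP).astype(np.int32)   # sent as small integers

def dequantize(levels):
    return levels * STEP

frames = [np.full((4, 4), v, dtype=np.int32) for v in (10, 12, 12, 30)]

prediction = np.zeros((4, 4), dtype=np.int32)
for frame in frames:
    levels = quantize(frame - prediction)          # the impaired residual, as transmitted
    prediction = prediction + dequantize(levels)   # local and remote decode, kept identical
    print(int(frame[0, 0]), "->", int(prediction[0, 0]))  # small, bounded error per frame
```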
As Fig.1 is nearly universal, a lot can be learned from it. Firstly, the entire codec is defined when the decoder is defined. With the decoder defined, the instructions the decoder needs follow immediately. Thus a codec is defined by the vocabulary and syntax of the data sent from the encoder. The entire ability of the decoder is set by that. Anything outside the vocabulary of the decoder will not be understood. On the other hand the encoder is not forced to explore the entire vocabulary and a low cost encoder that failed to use some of the encoding tools would still be understood.
As the decoder is defined, the only definition of the encoder is that it should create the correct syntax for the specified decoder. How it works is down to the ingenuity and resources of the designer. A simple encoder might have a constant approach and treat every image in the same way. A more sophisticated encoder might try out several coding tools in parallel and pick the one that requires the least bandwidth. In all cases, the operation of the encoder is not revealed by the compressed data stream. This means that the manufacturer’s intellectual property is secure. It then follows that there can be competition between manufacturers to develop improved encoders that work with the same decoders.
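That freedom can be sketched as a simple 'mode decision': the encoder tries a few candidate predictions, measures which residual would cost the fewest bits, and signals only its choice and that residual. The predictor names below are hypothetical and zlib stands in for a real entropy coder.

```python
import zlib
import numpy as np

def cost(residual):
    # Stand-in for an entropy coder: the compressed size of the residual in bytes.
    return len(zlib.compress(residual.astype(np.int16).tobytes(), level=9))

def choose_mode(current, previous):
    candidates = {
        "skip": previous,                                    # assume nothing changed
        "dc":   np.full_like(current, int(current.mean())),  # a flat prediction
        "zero": np.zeros_like(current),                      # no prediction at all
    }
    best = min(candidates, key=lambda name: cost(current - candidates[name]))
    return best, current - candidates[best]

previous = np.full((16, 16), 100, dtype=np.int32)
current = previous.copy()
current[4:8, 4:8] += 20                                      # a small change in one area

mode, residual = choose_mode(current, previous)
print("chosen mode:", mode, "residual cost:", cost(residual), "bytes")
```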
As the codec effectively is the decoder, it follows that all advances in coding, all improvements in compression factor must largely be achieved by advancing the design of the decoder and possibly extending the syntax and the vocabulary of the decoder. As those are set at the design stage, it follows that existing decoders will not be able to understand new syntax and will not be compatible with new codecs. On the other hand as new codecs are generally refinements of existing codecs, it is relatively easy to make a new codec backwards compatible so that it continues to work with earlier encoders.