The transform is a useful device that has some interesting characteristics. On one side of a transform we might have the spatial domain, for example data describing an image in terms of brightness as a function of position. On the other side we might have coefficients describing the spatial frequencies and phases of that image data.
We are typically dealing with moving objects before our cameras that obviously take up different positions in each of a series of frames. Whilst this is fundamental to the process, it is also a bit of a nuisance. For example, if we have a video signal that is noisy, the noise level could be reduced by averaging two or more frames together. In the case where nothing moves, successive video frames are the same, but the noise, being statistical, isn't. Adding two frames and dividing by two leaves the picture the same but reduces the noise.
This falls down when anything moves, because the moving object will be in two different places in the two frames and will produce a double image when they are added.
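A minimal sketch of both effects, using a single row of pixels in place of a whole frame. The pixel values, the noise level and the function name average_frames are illustrative assumptions, not taken from any particular product.

```python
import random
import statistics

def average_frames(frame_a, frame_b):
    """Average two equal-length rows of pixel values, sample by sample."""
    return [(a + b) / 2 for a, b in zip(frame_a, frame_b)]

# Still scene: the picture is identical in both frames, the noise is not.
random.seed(1)
scene = [100.0] * 2000
noisy_1 = [p + random.gauss(0, 10) for p in scene]
noisy_2 = [p + random.gauss(0, 10) for p in scene]
averaged = average_frames(noisy_1, noisy_2)

noise_before = statistics.pstdev([n - p for n, p in zip(noisy_1, scene)])
noise_after = statistics.pstdev([a - p for a, p in zip(averaged, scene)])
print(f"noise before: {noise_before:.2f}, after averaging: {noise_after:.2f}")

# Moving object: a bright pixel at position 3 in one frame and 6 in the next.
moving_1 = [0, 0, 0, 200, 0, 0, 0, 0]
moving_2 = [0, 0, 0, 0, 0, 0, 200, 0]
double_image = average_frames(moving_1, moving_2)
print(double_image)  # two half-brightness copies, at positions 3 and 6
```

With uncorrelated noise, averaging two frames should reduce the noise by a factor of about the square root of two, whereas the moving pixel turns into the double image described above.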
In compression systems, a great deal of data can be saved by exploiting the similarities between one frame and the next. Instead of coding each frame individually, we send the difference between the present frame and the previous one. Again this falls down if anything moves, because the motion will create differences between the frames.
In a standards converter the output frame rate will not necessarily be the same as the input frame rate. This means that output frames are not in the same place on the time axis as input frames. If there is no motion, we can move the frames along the time axis as necessary, but if something moves its trajectory will be distorted if the frame appears at the wrong time. Fig.1 shows the problem that is visible as judder.
Hand-held cameras produce video signals in which inadvertent camera motion is added to the actual motion to be portrayed, and it would be an advantage if that inadvertent motion could be removed.
Any steps that are taken to overcome the difficulties set out above fall into the category of motion compensation. It's a large and important subject but basically whatever the application, it comes down to identifying moving objects and measuring how they moved.
Transforms can be used to measure motion in images.
Fig.1 - The origin of judder. If a frame is shifted in time, a moving object will no longer lie on its correct trajectory.
If one imagines a car driving along and we want to know where it is at a particular instant, we could arrange for a photograph to be taken at that instant.
If the exposure were long enough, we would see some motion blur. However, if the exposure was instead made extremely short, or a flash was used, any motion blur would be smaller than the resolution of the system. The position of the car would be known as accurately as possible, but the photograph would tell nothing about the speed, because the motion has been frozen. A perfect instantaneous Shannon sample has been taken, without any aperture effect, if you will.
If, on the other hand, it was required to know the speed of the car, the most accurate way would be to measure the time it took to travel between two points a known distance apart. Clearly in taking a speed measurement over a significant distance, we will lose any knowledge of where the car is.
That, in short, is the basis of the uncertainty principle, which is yet another aspect of transform duality. Stated another way, the more that is known about something on one side of a transform, the less will be known on the other. Werner Heisenberg established this in connection with quantum mechanics, but it applies to transforms in general.
Heisenberg showed that the more precisely the position of a particle such as a photon is known, the less precisely its momentum can be known, and vice versa. This has often been reported as the photon having some mysterious split personality, whereas it is more to do with measuring techniques. Television broadcasters, of course, depend heavily on the behavior of photons.
To keep things simple, let us suppose that a TV camera is panning slowly across a still scene. In successive frames the scene appears in a slightly different place on the sensor. The waveform along a TV line, or row of pixels, will be substantially the same in two successive frames, but shifted along the line.
If the two waveforms are the same, then their spectra will also be the same. In that case how can a transform be used? The answer is that if something like a Fourier transform is used on the same lines from two successive frames, the amplitudes of the coefficients may well be the same, but the phases will differ.
A DCT would be no good here, as its coefficients are real numbers and it is deliberately designed not to register phase.
The Fourier phases differ in a precise and predictable way. For any given feature that can be identified in the video waveform such as a step or a pulse, which moves from one frame to the next, the phase differences between the two frames will be proportional to the frequency of the basis function of the transform.
In other words if analysis at a given frequency displays a certain phase shift between the frames, analysis at twice that frequency will have double the phase shift.
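The proportionality is easy to check numerically. The sketch below analyses the same pulse in two positions at individual frequencies and compares the phases. It assumes a circular shift, so that nothing enters or leaves the window, and uses a slow hand-written DFT for one frequency at a time; the names dft_bin and phase_difference are illustrative only.

```python
import cmath
import math

def dft_bin(samples, k):
    """Return the DFT coefficient of `samples` at frequency bin k."""
    n = len(samples)
    return sum(x * cmath.exp(-2j * math.pi * k * i / n)
               for i, x in enumerate(samples))

N = 64      # samples in the analysis window
SHIFT = 3   # motion between the two frames, in pixels

# A short pulse on a line of the first frame.
line_a = [0.0] * N
for i in range(10, 14):
    line_a[i] = 1.0

# The same pulse SHIFT pixels further right in the second frame.
line_b = line_a[-SHIFT:] + line_a[:-SHIFT]

def phase_difference(k):
    """Phase of line_b minus phase of line_a at bin k, in radians."""
    return cmath.phase(dft_bin(line_b, k) * dft_bin(line_a, k).conjugate())

for k in (1, 2, 4):
    same_amplitude = math.isclose(abs(dft_bin(line_a, k)), abs(dft_bin(line_b, k)))
    print(k, round(phase_difference(k), 4), same_amplitude)
```

The amplitudes match at every frequency, while the phase difference at bin 2 is twice that at bin 1, and at bin 4 twice that again. For a shift of d pixels in a window of N samples, the difference at bin k is -2πkd/N radians.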
Fig.2 shows what can be done. Windows are chosen in the two selected frames, and Fourier transforms are performed on both. At each frequency, the phases of the two transforms are subtracted. As the amplitudes are meaningless, they are normalized to a convenient value. Then an inverse transform is performed on the phase differences.
Fig.2 - To measure motion, transforms are performed on two successive frames, and the phases are subtracted. An inverse transform of the phase shifts is then carried out.
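The steps of Fig.2 can be sketched for one line of picture as follows. This is an illustration under stated assumptions rather than a description of any real motion estimator: the line is random texture so that every frequency has some content, the motion is a circular shift, and the transform is a slow textbook DFT where a practical system would use an FFT. The function names are hypothetical.

```python
import cmath
import math
import random

def dft(samples, inverse=False):
    """Plain O(N^2) discrete Fourier transform (or its inverse)."""
    n = len(samples)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        out.append(sum(x * cmath.exp(sign * 2j * math.pi * k * i / n)
                       for i, x in enumerate(samples)))
    return [v / n for v in out] if inverse else out

def phase_correlate(line_a, line_b):
    """Subtract phases, normalize amplitudes, then inverse transform."""
    normalized = []
    for a, b in zip(dft(line_a), dft(line_b)):
        cross = b * a.conjugate()      # phase of b minus phase of a
        magnitude = abs(cross)
        # Discard the amplitude so that only the phase difference remains.
        normalized.append(cross / magnitude if magnitude > 1e-12 else 0j)
    return [v.real for v in dft(normalized, inverse=True)]

random.seed(7)
N = 64
SHIFT = 5
line_a = [random.random() for _ in range(N)]    # a line from the first frame
line_b = line_a[-SHIFT:] + line_a[:-SHIFT]      # the same line after a pan

surface = phase_correlate(line_a, line_b)
peak = max(range(N), key=surface.__getitem__)
print("spike at position", peak)
```

The position of the largest value in the result gives the motion in pixels. In this sketch a spike in the upper half of the window would correspond to motion in the opposite direction, because the transform is periodic.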
Fig.3 shows the result of the inverse transform for a uniform pan. There is a single spike whose position is proportional to the motion between frames. The amount of motion has actually been measured very accurately. At this point, transform duality in the shape of Heisenberg uncertainty hits us between the eyes: although we know precisely how much motion there was, we have no idea where it took place. We don't know how big the moving area was, where it began or where it stopped.
The solution is very simple: we revert to the original frame data, the pixels in the spatial domain. One of the two frames is simply shifted by the distance calculated, and there will be correlation between the frames for the part of the frame that moved. In the case of the pan the correlation will be everywhere. In the case of a moving object, the outline of the object will be identified.
Fig.3 - The result of an inverse transform on phase differences with normalized amplitude is a pulse or spike whose position is proportional to the motion.
In fact the system is much smarter than that, because if there are a number of objects moving at different speeds, each will produce its own spike in the inverse transform of the phase differences. The two frames can then be shifted a number of times, once for each spike, and each time the moving area will correlate and can be identified.
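The shift-and-compare step can be sketched on a single row of pixels. Here the candidate motions (0 and 3 pixels) are assumed to have been measured already, the pixel values are invented, and the helper names are hypothetical. A real system would compare areas rather than single pixels and would have to tolerate noise.

```python
def shift_line(line, distance, fill=None):
    """Shift a row of pixels right by `distance` (left if negative)."""
    n = len(line)
    return [line[i - distance] if 0 <= i - distance < n else fill
            for i in range(n)]

def matching_pixels(prev_line, next_line, distance, tolerance=0):
    """Flag pixels where prev_line, shifted by `distance`, agrees with next_line."""
    shifted = shift_line(prev_line, distance)
    return [s is not None and abs(s - cur) <= tolerance
            for s, cur in zip(shifted, next_line)]

# Static background with a bright three-pixel object that moves 3 pixels right.
prev_line = [10, 20, 200, 210, 220, 60, 70, 80, 90, 15, 25, 35]
next_line = [10, 20, 30, 40, 50, 200, 210, 220, 90, 15, 25, 35]

for candidate in (0, 3):    # one candidate per spike
    flags = matching_pixels(prev_line, next_line, candidate)
    print(candidate, [i for i, hit in enumerate(flags) if hit])
```

A shift of 3 correlates only over pixels 5 to 7, which is the outline of the object, while a shift of 0 correlates over the static background. Pixels 2 to 4 correlate with neither: they are background that the object has just uncovered.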
The explanation given is for a single direction of motion. In practice there are vertical processes as well so that motion in any direction can be measured.
To compensate for camera shake, the phase differences in the various image windows are compared. Genuine motion of objects will be restricted to certain windows, whereas camera shake will show the same motion in all windows. It is then possible to extract the motion due to shake and to oppose it in a simple DVE that shifts the entire frame by the same amount.
In practice the frame will need to be cropped by a small amount, so that the edge of the picture can move around without being visible.
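One way this might be sketched is shown below. Taking the median of the window motions as the shake estimate is an assumption made for illustration, chosen because a few windows containing genuine object motion will not disturb it; the text does not say how any particular stabilizer combines the windows. The function names and the motion values are hypothetical.

```python
import statistics

def estimate_shake(window_motions):
    """Estimate the motion common to all windows as a per-axis median."""
    dx = statistics.median_low([mx for mx, _ in window_motions])
    dy = statistics.median_low([my for _, my in window_motions])
    return dx, dy

def stabilize(frame, shake, fill=0):
    """Oppose the shake by shifting the whole frame (a list of rows)."""
    dx, dy = shake
    height, width = len(frame), len(frame[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            sx, sy = x + dx, y + dy     # where this pixel ended up
            inside = 0 <= sx < width and 0 <= sy < height
            row.append(frame[sy][sx] if inside else fill)
        out.append(row)
    return out

# (dx, dy) measured in five windows; two of them also contain moving objects.
motions = [(2, -1), (2, -1), (7, -1), (2, -1), (2, 3)]
print(estimate_shake(motions))

# A 3x3 picture [[1,2,3],[4,5,6],[7,8,9]] displaced one pixel right and down.
shaken = [[0, 0, 0],
          [0, 1, 2],
          [0, 4, 5]]
print(stabilize(shaken, (1, 1)))
```

The corrected frame has the picture back in its original place, but with fill values along two edges where no data exist. Those are the edges that the cropping hides.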
In a standards converter, once a moving object has been identified, it can be shifted to where it would have been if a frame had originally been captured where an output frame is required. That's relatively straightforward and it eliminates judder from the output. It is a little less straightforward to deal with gaps left in the background of the frame where an object is shifted and no longer conceals what was behind it. This requires pixel data to be taken from a later frame where more of the background would be revealed.
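The size of the shift follows from where the output frame falls between two input frames. The sketch below assumes the object moves at a constant speed between input frames, and uses 50 Hz to 60 Hz conversion and a motion of 12 pixels per input frame purely as example figures. The function name is hypothetical.

```python
from fractions import Fraction
import math

def interpolation_offset(motion_per_input_frame, input_rate, output_rate,
                         output_index):
    """Return (preceding input frame, shift to apply to the object in it)."""
    # Time of the output frame, measured in input-frame periods.
    position = Fraction(output_index * input_rate, output_rate)
    base_frame = math.floor(position)
    fraction = position - base_frame    # how far towards the next input frame
    return base_frame, motion_per_input_frame * fraction

for index in range(7):
    frame, shift = interpolation_offset(12, 50, 60, index)
    print(f"output {index}: input frame {frame}, shift object by {shift} px")
```

Output frame 1, for example, lies five sixths of the way from input frame 0 to input frame 1, so the object is shifted by 10 of its 12 pixels of travel. Output frame 6 coincides with input frame 5 and needs no shift.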
In moving image compression, knowledge of how an object moved between two frames is extremely useful, because if the motion is cancelled, the versions of the object in the two frames become very similar and only a small amount of difference data are needed to make one from the other.
In MPEG-2, for example, the motion information is sent as two-dimensional vectors, one for each macroblock, along with the image difference data. In some versions of MPEG-4, the outline of the moving object is sent with one set of vectors to describe the motion so that the moving object itself is coded as an entity.
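The saving can be illustrated on one row of pixels. This sketch is not MPEG syntax: it uses a single one-dimensional vector in place of two-dimensional vectors per macroblock, and it wraps the shifted line around at the ends for simplicity. The function names and pixel values are illustrative assumptions.

```python
def frame_difference(previous, current):
    """Difference data with no motion compensation."""
    return [c - p for p, c in zip(previous, current)]

def compensated_difference(previous, current, vector):
    """Difference data after shifting `previous` right by `vector` pixels."""
    n = len(previous)
    predicted = [previous[(i - vector) % n] for i in range(n)]  # circular shift
    return [c - p for p, c in zip(predicted, current)]

previous = [10, 40, 90, 160, 90, 40, 10, 0]
current = [10, 0, 10, 40, 90, 160, 90, 40]    # same content, 2 pixels right

plain = frame_difference(previous, current)
compensated = compensated_difference(previous, current, 2)
print(plain)          # large differences wherever the picture has detail
print(compensated)    # all zero once the motion is cancelled
```

With the motion cancelled, only the vector and a row of zeros need to be sent. In real pictures the compensated difference is rarely exactly zero, but it is typically far smaller than the uncompensated one.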