# Is Gamma Still Needed?: Part 8 - Floating Point

The need to provide a wide dynamic range economically is usually met by what is loosely called gamma, a non-linear transfer function that has numerous drawbacks. However, there is an alternative that does the same job and might loosely be called a form of gamma, which is floating point.

Setting aside historical reasons such as CRTs, gamma is used in video today because it is one way of giving the eye a better solution for a given cost. In other words, gamma produces a signal that is psycho-optically better than a linear signal having the same signal-to-noise ratio.

The reason is that the human eye works on what might be called constant accuracy. Over a huge range, whatever the absolute brightness might be, the eye can detect changes that are a constant percentage of that brightness. Both a digital system having uniform steps, and a linear analog system having a stable noise floor, become too accurate at high brightness, but not accurate enough at low brightness.

This phenomenon is not restricted to human vision. In fact it is everywhere because it is a practical way of going about things. If we consider the US customary units still used in recipes, there are units whose magnitude is appropriate to the problem. The spoonful, the cupful, the ounce, the pound, the pint and the quart all relate simply to the amount of an ingredient that might be needed.

The inch, the foot, the yard and so on also relate to the quantities that are met in everyday life. If we are told that something is ten inches long, that could mean it is between 9.5 and 10.5 inches long, so the figure is accurate to within five percent. But the same would be true if we were told a size of ten feet, or ten yards. The percentage accuracy, which is what matters, remains the same, but the absolute accuracy goes down.

Closer to home, the IRE used in video measurement is a unit of voltage. By having one hundred steps on the scale, it was intended to allow the accuracy of a luma signal to be measured in percent.

Possibly a better example is the decibel, which is in universal use for audio measurement. Two signals whose levels differ by 1 dB are about 12 percent apart in amplitude, no matter how big or small the signals may be.
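The 12 percent figure can be checked with a few lines of Python. This is only a sketch: the function name `db_to_amplitude_ratio` and the three test levels are made up for illustration, and the usual 20 log10 definition for amplitude ratios is assumed.

```python
def db_to_amplitude_ratio(db: float) -> float:
    """Convert a level difference in decibels to an amplitude ratio.

    Assumes the amplitude (voltage) convention: dB = 20 * log10(ratio).
    """
    return 10 ** (db / 20)


# Raise three very different signal levels by 1 dB each.
for level in (0.001, 1.0, 1000.0):
    raised = level * db_to_amplitude_ratio(1.0)
    percent = (raised - level) / level * 100
    print(f"{level:g} -> {raised:.6g} ({percent:.1f} percent higher)")
```

Whatever the starting level, the change comes out a little over 12 percent, which is the constant accuracy principle at work.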

Fig.1 - The radix point separates the integers from the fractions. In binary, bits to the right of the radix point represent one half, one quarter and so on.

Resistors in the days before surface mount used color-coding to specify the value. As the range of possible values is huge, the coding scheme was carefully designed. Typically two colors would specify the first two digits of the number. For example yellow and purple would mean 47. The next color would specify how many zeros followed the first two digits; so yellow, purple, yellow would be 470000 or 470K Ohms.
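The resistor code is a decimal floating-point scheme in miniature: two digits of accuracy and a power of ten. A minimal sketch follows, assuming the standard ten-color digit sequence; the names `DIGITS` and `resistor_ohms` are illustrative, and "purple" is accepted as an alias for violet to match the wording above.

```python
# Standard digit colors, black = 0 up to white = 9.
DIGITS = {
    "black": 0, "brown": 1, "red": 2, "orange": 3, "yellow": 4,
    "green": 5, "blue": 6, "violet": 7, "purple": 7, "grey": 8, "white": 9,
}


def resistor_ohms(first: str, second: str, multiplier: str) -> int:
    """Decode a three-band resistor value in ohms."""
    digits = 10 * DIGITS[first] + DIGITS[second]   # the two significant digits
    zeros = DIGITS[multiplier]                     # how many zeros follow them
    return digits * 10 ** zeros


print(resistor_ohms("yellow", "purple", "yellow"))
```

The first two bands play the part of the mantissa and the third band plays the part of the exponent.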

Given the universality of the requirement, we should not be surprised that computers have adopted the constant accuracy principle in attempting to describe the huge range of the real world by a series of numbers. The computer solution is known as floating point notation.

As floating-point notation is just a different way of expressing numbers as a series of bits, it can be used for any type of numbers, including those representing audio samples and image pixels.

In everyday decimal numbers, the decimal point separates the units and the tenths. In binary it's the same, except that the radix point separates the ones from the halves. Fig.1 shows the principle. Starting from the radix point and going left, we have the familiar one, two, four, eight and so on. Going right we have one half, one quarter, one eighth, one sixteenth and so on.

The floating-point number comes in two parts, both of which are binary numbers. One of them, known as the mantissa, determines the accuracy in the same way that word length determines accuracy in PCM audio and video. A sixteen-bit mantissa is accurate to about one part in 65,000, as sixteen bits have 65,536 combinations. The other number, the exponent, effectively specifies where the radix point is, like the third color on the resistor.

In binary, moving the radix point is the same as multiplying or dividing by two for each place moved. Fig.2 shows a simple example of a four-bit number 101.1, which is the equivalent of 5.5 in decimal. We could also describe it as 1011 divided by two, as 10.11 multiplied by two or as 1.011 multiplied by four.
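The equivalence of those descriptions is easy to confirm. The helper below, `binary_to_float`, is a hypothetical name for a small parser written just for this illustration; it reads a string of bits with an optional radix point.

```python
def binary_to_float(text: str) -> float:
    """Evaluate a binary string such as '101.1' as a number."""
    whole, _, fraction = text.partition(".")
    value = float(int(whole, 2)) if whole else 0.0
    # Bits to the right of the radix point are worth 1/2, 1/4, 1/8 ...
    for place, bit in enumerate(fraction, start=1):
        value += int(bit) * 2.0 ** -place
    return value


print(binary_to_float("101.1"))        # the original number
print(binary_to_float("1011") / 2)     # radix point moved one place right
print(binary_to_float("10.11") * 2)    # radix point moved one place left
print(binary_to_float("1.011") * 4)    # radix point moved two places left
```

All four lines print 5.5, because each place the radix point moves is compensated by a factor of two.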

It turns out that one of these is the best. If the original number is always shifted until there is a single 1 before the radix point, then it is not necessary to record or transmit that one, because its value is implicit. So we win one bit of accuracy with that approach. The mantissa is encoded in that way and the exponent tells us how to shift the mantissa, after the implicit 1 has been put back, to return to the original number.
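As a sketch of that normalization, Python's standard `math.frexp` can be used to split a number into a mantissa and an exponent. It returns a mantissa between one half and one, so the hypothetical helper `normalize` below adjusts the result to put a single 1 before the radix point. Only positive values are handled in this illustration.

```python
import math


def normalize(value: float):
    """Return (significand, exponent) with 1 <= significand < 2."""
    if value <= 0:
        raise ValueError("this sketch handles positive values only")
    mantissa, exponent = math.frexp(value)   # 0.5 <= mantissa < 1
    return mantissa * 2, exponent - 1        # shift one more place


significand, exponent = normalize(5.5)
print(significand, exponent)   # 1.375 is 1.011 in binary, exponent is 2
print(significand - 1)         # the part after the implicit 1 is what gets stored
```

For 5.5 the result is 1.375 with an exponent of 2, which is the "1.011 multiplied by four" form. Only the .011 needs to be recorded.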

The most common standard used for floating point notation in computers is IEEE 754. In the single precision version the resulting word is 32 bits long. It has a sign bit indicating if the original number is positive or negative and an eight-bit exponent, leaving space for a 23-bit mantissa, which is really a 24-bit value because of the implicit leading one. Any mantissa longer than 24 bits would need to be rounded. That process is where the precision is determined.

Fig.3 - 16-bit floating point has a sign bit, a five-bit exponent and a ten-bit mantissa. However, there is an implicit one that is not recorded, so the mantissa has 11-bit accuracy.

Although the number itself has a sign bit, the exponent doesn't. Instead it uses an offset, rather like color difference signals in 601. As the eight-bit exponent has 256 combinations, an offset is used such that a value of 127 represents an exponent of zero, meaning the mantissa doesn't need to be shifted at all.
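The three fields of a single precision word can be pulled apart with Python's standard `struct` module. The following is a minimal sketch; the function names `decode_single` and `rebuild` are illustrative, and `rebuild` deliberately ignores special cases such as zero, infinities and denormals.

```python
import struct


def decode_single(value: float):
    """Split a number, stored as IEEE 754 single precision, into its fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                      # 1 bit
    stored_exponent = (bits >> 23) & 0xFF  # 8 bits, offset by 127
    fraction = bits & 0x7FFFFF             # 23 bits, implicit 1 not stored
    return sign, stored_exponent, fraction


def rebuild(sign: int, stored_exponent: int, fraction: int) -> float:
    """Reverse the process for normal numbers only."""
    significand = 1 + fraction / 2 ** 23   # put the implicit 1 back
    magnitude = significand * 2.0 ** (stored_exponent - 127)
    return -magnitude if sign else magnitude


sign, stored_exponent, fraction = decode_single(5.5)
print(sign, stored_exponent, format(fraction, "023b"))
print(rebuild(sign, stored_exponent, fraction))
```

For 5.5 the stored exponent is 129, which is 2 plus the offset of 127, and the fraction begins 011 as expected from 1.011 in binary.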

There is also a double precision version whose performance is far beyond any audiovisual requirements. In fact the IEEE 754 single precision standard is too accurate for video use and wastes storage capacity. As a result a half-precision floating-point notation was developed by Industrial Light and Magic that uses a 16 bit word. As shown in Fig.3, the half-precision notation has one sign bit, a five-bit exponent and a 10-bit mantissa.
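The same field-splitting exercise works for the 16-bit word. This sketch assumes Python 3.6 or later, where the `struct` module supports the `e` format code for half precision; `half_fields` is an illustrative name.

```python
import struct


def half_fields(value: float):
    """Split a number, stored as 16-bit half precision, into its fields."""
    (bits,) = struct.unpack(">H", struct.pack(">e", value))
    sign = bits >> 15                      # 1 bit
    stored_exponent = (bits >> 10) & 0x1F  # 5 bits
    fraction = bits & 0x3FF                # 10 bits, implicit 1 not stored
    return sign, stored_exponent, fraction


sign, stored_exponent, fraction = half_fields(5.5)
print(sign, format(stored_exponent, "05b"), format(fraction, "010b"))
```

The fraction again begins 011. The stored exponent for 5.5 comes out as 17, showing that the half-precision exponent carries an offset in the same way as single precision, here 15.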

When used to express linear light signals such as true luminance (not gamma compensated), every multiplication of the pixel value by two is the equivalent of a stop in imaging. Thus the range of the exponent determines the number of stops over which the mantissa can be shifted. The five-bit exponent of half precision has 32 combinations, which gives a range of around 30 stops once the codes reserved by the format are set aside, and that leaves gamma-based video in the dust.

Fig.4 - At a) the floating-point conversion process causes the gain to halve every time full-scale is reached. At b) the effective gain needed in FP coding is a piecewise-linear approximation to a gamma curve, which is why FP coding is applicable to imaging.

So much for the numerical side of things; but what does floating point actually do? Essentially it's a gain ranging process in which the gain applied to the input signal is halved every time a certain output level is exceeded.

Fig.4a) shows the idea. The input is on the horizontal axis and, starting from zero, the output increases along the first slope until the maximum output is reached at point A. At that point the exponent is increased by one and the data are shifted one bit to the right. This halves the data values, which brings us to point B. It also halves the gain, so the range of inputs that must be covered before the maximum output is reached again at C is doubled.

A conventional uniformly quantizing ADC preceded by a variable gain amplifier could implement this process. Fig.4a) showed what happens at the output, whereas Fig.4b) shows the characteristic the variable gain amplifier must have. It should now immediately be clear why floating point notation is an alternative to gamma correction, because the graph of Fig.4b) is essentially a piecewise-linear version of a gamma curve.
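The gain ranging process can be sketched as a toy coder. The version below is an assumption-laden illustration and not any standard format: it uses a four-bit mantissa, truncates when it shifts, and has no implicit leading one. The names `gain_range` and `restore` are hypothetical.

```python
def gain_range(sample: int, mantissa_bits: int = 4):
    """Return (mantissa, exponent) for a non-negative linear sample."""
    if sample < 0:
        raise ValueError("this sketch handles non-negative samples only")
    full_scale = 1 << mantissa_bits
    mantissa, exponent = sample, 0
    while mantissa >= full_scale:   # maximum output exceeded
        mantissa >>= 1              # shift right: halve the gain
        exponent += 1               # and remember that we did so
    return mantissa, exponent


def restore(mantissa: int, exponent: int) -> int:
    """Return to linear PCM by undoing the shifts."""
    return mantissa << exponent


for sample in (14, 15, 16, 17, 31, 32, 35):
    mantissa, exponent = gain_range(sample)
    print(sample, (mantissa, exponent), restore(mantissa, exponent))
```

An input of 15 is the last one coded with full gain. At 16 the mantissa drops to 8 with an exponent of 1, and from there on the restored values move in steps of two, then four, as the input rises.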

On converting the floating-point data back to linear PCM, the effect of the gain ranging is to put steps in the noise floor so that in critical dark areas of the picture the quantizing noise is lowest whereas in bright areas where it doesn't matter, the quantizing noise is greatest.

Floating point coding and gamma are both techniques that respect the characteristics of human vision, but there is a major difference between them, which is that floating point coding remains linear. At all times a floating point number is proportional to the luminance of the original linear light signal. When returned to a fixed-point value, the number will display a quantizing error with respect to the original, but with eleven bits of accuracy (a ten-bit mantissa plus the implicit leading one) that error will be smaller than the error in any gamma-based video signal.
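The constant percentage accuracy of the eleven-bit mantissa can be observed by rounding a spread of linear light values to half precision and measuring the relative error. This is a sketch only: the test values are arbitrary, `to_half` and `relative_error` are illustrative names, and the `e` format code of Python's `struct` module (Python 3.6 or later) is assumed to stand in for a real half-precision pipeline.

```python
import struct


def to_half(value: float) -> float:
    """Round a value to the nearest half-precision number."""
    return struct.unpack(">e", struct.pack(">e", value))[0]


def relative_error(value: float) -> float:
    """Quantizing error as a fraction of the original value."""
    return abs(to_half(value) - value) / value


# Arbitrary luminance values spread over many stops, all within the normal range.
values = [0.001 * 1.37 ** n for n in range(40)]
worst = max(relative_error(v) for v in values)

print(f"worst relative error: {worst:.6f}")
print(f"bound for 11 bits:    {2 ** -11:.6f}")
```

However dark or bright the value, the error stays within one part in 2048 of the value itself, which is what eleven bits of accuracy with rounding implies.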
