Data Recording and Transmission: Part 14 - Error Handling

In the data recording or transmission fields, any time a recovered bit is not the same as what was supplied to the channel, there has been an error. Different types of data have different tolerances to error. Any time the error rate is in excess of the tolerance, we need to do something. Error handling is a broad term that includes everything to do with errors. It includes studying the error sensitivity of different types of data, measures to prevent or reduce errors, measures to actually correct them and measures to conceal the error if correction is not available or not possible.

Why do errors happen? Why don't we just make our equipment better? One answer to that is that if we are not in control of the channel, we can't control errors either. This is particularly true in transmission, via radio or cable, where the signal can be altered by interference from outside our system. Devices such as mobile phones need to radiate energy in order to work, and they may be used anywhere at any time. Powerful electrical installations can create electric and magnetic fields that induce unwanted voltages, as do the ignition systems of gasoline engines. Optical fibers are immune to this sort of interference.

In recording devices, dirt between the pickup and the medium may interfere with reading and writing. The information layer in the media may contain imperfections. Tape and disk media have a magnetic coating on a substrate of some kind, and the coating may not be perfectly uniform. Optical disks may have defects in the information layer, and the transparent disk material may not be uniform, which has the effect of throwing the light beam off track or out of focus.

Another source of errors is noise. Any electronic component having resistance is a source of thermal noise, which gets added to any replay signal from a pickup and is amplified with it. The noise is the sum of countless individual processes at the atomic level, and the macroscopic view can only be described statistically. The individual processes add together to give the actual noise waveform added to a wanted signal. The noise power may be constant, but the instantaneous voltage is neither constant nor predictable.

Fig.1 Statistical nature of noise – see text for description.

Fig.1 shows where the statistical nature of noise comes from. At a) is a simple system that has hard limits and the probability of it being anywhere between the limits is uniform, so the probability function is a rectangle. Suppose then we have two such systems and we add them. As they are independent it is possible for them to augment or cancel one another. The overall probability is obtained by convolving the individual probability functions. Convolution is where one function is slid across the other and, at each offset, the area of overlap gives the value of the combined function. Fig.1b) shows that convolving two rectangles yields a triangular function having twice the width. At c) we see the result of doing this a very large number of times: the function approaches the Gaussian probability curve.
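The progression from rectangle to triangle to bell curve can be tried numerically. What follows is only a sketch under stated assumptions: each simple system is modeled as taking one of eight equally likely levels, and the names `convolve`, `sum_of_uniform_systems` and `gaussian_peak` are invented for the illustration.

```python
import math

def convolve(p, q):
    """Discrete convolution of two probability functions held as lists."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def sum_of_uniform_systems(count, levels=8):
    """Probability function of the sum of `count` independent systems,
    each equally likely to be at any of `levels` values (an assumption)."""
    single = [1.0 / levels] * levels
    combined = single
    for _ in range(count - 1):
        combined = convolve(combined, single)
    return combined

def gaussian_peak(count, levels=8):
    """Peak height of a Gaussian with the same variance as the sum."""
    variance = count * (levels ** 2 - 1) / 12.0
    return 1.0 / math.sqrt(2.0 * math.pi * variance)

rectangle = sum_of_uniform_systems(1)   # Fig.1a: flat
triangle = sum_of_uniform_systems(2)    # Fig.1b: linear rise and fall
bell = sum_of_uniform_systems(12)       # approaching Fig.1c

print("rectangle:", [round(p, 4) for p in rectangle])
print("triangle: ", [round(p, 4) for p in triangle])
print("peak of 12 summed systems:", round(max(bell), 5))
print("peak of matching Gaussian:", round(gaussian_peak(12), 5))
```

Even a dozen convolutions bring the peak of the summed function close to that of a Gaussian having the same variance, although the sum of a finite number of limited systems still has hard limits that the true Gaussian lacks.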

The Gaussian curve has no lateral limits, and that tells us that in a large system, extreme events can happen with low probability. For example, waves in water can interfere constructively or destructively. In large oceans, such as the Pacific, very large waves can occur when there is chance coherence between lots of smaller waves. Where these come ashore, they are known as sneaker waves and they can be deadly. The signs erected on US beaches warning visitors not to turn their back on the ocean are not kidding.

The same thing happens with noise in electronic circuits. With low probability, noise waveforms can add coherently and produce a noise spike that cancels the voltage of a data bit, causing an error. Just as we can't stop sneaker waves, we can never make electronic systems that are free from bit error. Over-engineering the system with larger signals just makes the errors less frequent and adds cost. Once it is accepted that errors in raw data are inevitable, and that some form of error handling is required, a new way of thinking emerges.
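The effect of larger signals on error frequency can be estimated with a few lines of code. This is a minimal sketch, assuming a bit is represented by a voltage of plus or minus `signal_volts`, that the noise is Gaussian with an RMS value of `noise_rms_volts`, and that an error occurs when the noise drives the voltage past zero. The function name and the model are illustrative assumptions, not a description of any particular channel.

```python
import math

def bit_error_probability(signal_volts, noise_rms_volts):
    """Chance that Gaussian noise exceeds the signal in the opposing direction.

    Assumes a two-level signal at +/- signal_volts and a decision
    threshold at zero volts.
    """
    return 0.5 * math.erfc(signal_volts / (noise_rms_volts * math.sqrt(2.0)))

# Raise the signal relative to a fixed 1 V RMS noise and watch the
# error probability fall without ever reaching zero.
for signal in (1.0, 2.0, 4.0, 6.0, 8.0):
    print(f"signal/noise = {signal:3.0f}  P(error) = "
          f"{bit_error_probability(signal, 1.0):.3e}")
```

Under these assumptions every increase in signal reduces the error probability sharply, but the result never becomes zero because the Gaussian curve has no lateral limits.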

Just as a motor race is won by the best combination of car and driver, in data storage it is the combination of the medium and the error handling strategy that matters. Provided the original data are recovered correctly, no one cares how it was done. Whether a high-grade medium needed the occasional correction, or whether an average medium needed more frequent correction, doesn't make any difference to the overall performance. However, it might make an economic difference.

If error handling is incorporated into a system, it makes sense to use it. Instead of designing the medium to make infrequent errors, the medium can instead be designed to make errors within the capability of the error handling system. For example, a magnetic track on a disk or tape having a certain width will replay with a certain signal to noise ratio. If the width of the track is halved, the noise power is halved, so the noise voltage falls by a factor of the square root of two, or 3dB. The signal voltage is also halved, which is a fall of 6dB. As a result, the signal to noise ratio has fallen by 3dB and the probability of error has risen slightly. On the other hand, we just doubled the recording density, so a given device would store twice as much data.
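The decibel arithmetic in the track width example can be checked as follows. The sketch simply applies the standard definitions of the decibel for power and voltage ratios; the variable and function names are chosen for the illustration.

```python
import math

def power_ratio_db(ratio):
    """Decibel value of a power ratio."""
    return 10.0 * math.log10(ratio)

def voltage_ratio_db(ratio):
    """Decibel value of a voltage ratio."""
    return 20.0 * math.log10(ratio)

track_width_factor = 0.5  # the track is made half as wide

noise_change_db = power_ratio_db(track_width_factor)     # noise power halves
signal_change_db = voltage_ratio_db(track_width_factor)  # signal voltage halves
snr_change_db = signal_change_db - noise_change_db

print(f"noise level change:  {noise_change_db:.2f} dB")
print(f"signal level change: {signal_change_db:.2f} dB")
print(f"SNR change:          {snr_change_db:.2f} dB")
```

The signal falls by about 6dB and the noise by about 3dB, so the net loss of signal to noise ratio is about 3dB in exchange for a doubling of recording density.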

In electronic memory, such as flash, each bit is individually fabricated photolithographically, and the sheer number of bits means that the chances of making a memory chip without defective bits are slim. Instead of complaining, we add an error handling system and then proceed to put more bits on the chip, driving up the number of errors and driving down the cost per bit. The new way of thinking not only accepts that errors are normal but goes further and argues that if bit errors are not occurring, the medium is being under-utilized.

As will be seen, error correction requires additional information, known as check bits, to be recorded along with the actual data. Space has to be found on the medium for the check bits and it could be argued that this reduces storage capacity. Nothing could be further from the truth. The presence of the check bits may have allowed the storage density of the medium to be increased many times over, so the small amount of space lost to them is eclipsed by the overall gain. Far from being a nuisance, error handling is an enabling technology that allows storage and transmission channels to perform significantly better and at lower cost than they would without it. In all modern channels the actual channel and the error handling strategy are designed together as a system.

In order to design an error strategy, we can look at the distribution of errors in any channel. Typically, there are random bit errors, caused by noise, and burst errors, where a number of bits together are incorrect. This may be due to interference in a channel, or a small defect on a medium. Finally, there are catastrophic errors, occurring infrequently, that are beyond the correcting power of any code. There are many things we can do. For example, in a transmission channel with interference, the chances of the same interference occurring twice are not high, so if we have an error in a data block, we can simply send it again. A speck of dust might corrupt the reading of a disk block, but on a retry it might have blown away. In disks and flash memory, which are both addressable media, we can deal with defective blocks by creating a bad block file. This appears in the directory and it just happens to use all of the bad blocks, so no real data will ever be written there.
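The idea of sending a block again after an error can be sketched in code. This is an illustration only, not taken from any real protocol: the names `make_frame`, `check_frame`, `send_with_retry` and `make_bursty_channel` are hypothetical, and the CRC-32 from Python's `zlib` module stands in for whichever error-detecting code a real system would use.

```python
import zlib

def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 (an assumed choice of check) to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def check_frame(frame: bytes):
    """Return the payload if the check matches, otherwise None."""
    payload, received_crc = frame[:-4], frame[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") == received_crc:
        return payload
    return None

def send_with_retry(payload: bytes, channel, max_attempts: int = 3):
    """Send the same frame again until it arrives intact or attempts run out."""
    frame = make_frame(payload)
    for attempt in range(1, max_attempts + 1):
        received = check_frame(channel(frame))
        if received is not None:
            return received, attempt
    return None, max_attempts

def make_bursty_channel(bad_transmissions: int):
    """Toy channel: interference corrupts only the first few transmissions."""
    state = {"sent": 0}
    def channel(frame: bytes) -> bytes:
        state["sent"] += 1
        if state["sent"] <= bad_transmissions:
            corrupted = bytearray(frame)
            corrupted[0] ^= 0xFF  # an 8-bit burst error in the first byte
            return bytes(corrupted)
        return frame
    return channel

data, attempts = send_with_retry(b"block 42", make_bursty_channel(1))
print(data, "received after", attempts, "attempt(s)")
```

If the interference persists for longer than the permitted number of attempts, the function returns `None`, which corresponds to the catastrophic case where some other measure is needed.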

Creating a bad block file requires the medium to be formatted, meaning that every bit is written, read back and compared with what was written to see if it is working. In tapes this is not practical, so instead every block is read back as it is written using an additional head. If the read-back isn't identical, the block is skipped and another one is written further down the tape. This means that tape defects such as dropouts are skipped, and the data are written on good parts of the tape.
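Read-after-write on tape can be modeled in a few lines. The sketch below is a toy under stated assumptions: the `Tape` class, its `dropouts` set and the `record` function are invented for the illustration, a dropout is modeled as a position where nothing usable is recorded, and the way a real format tells the replay process which blocks were skipped is not modeled.

```python
class Tape:
    """Toy tape: a row of block positions, some of which are defective."""

    def __init__(self, length, dropouts):
        self.cells = [None] * length
        self.dropouts = set(dropouts)

    def write(self, position, data):
        # At a dropout the data fail to record (modeled as None).
        self.cells[position] = None if position in self.dropouts else data

    def read(self, position):
        return self.cells[position]

def record(tape, blocks):
    """Write each block, read it back, and rewrite further along on a mismatch."""
    position = 0
    placed_at = []
    for data in blocks:
        while True:
            if position >= len(tape.cells):
                raise RuntimeError("ran out of tape")
            tape.write(position, data)
            verified = tape.read(position) == data  # bit-for-bit comparison
            position += 1
            if verified:
                placed_at.append(position - 1)
                break
    return placed_at

tape = Tape(length=8, dropouts={1, 2})
print(record(tape, [b"A", b"B", b"C"]))  # positions where blocks ended up
```

With dropouts at positions 1 and 2, the second block is attempted twice on bad tape before it verifies, so the recorded blocks all sit on good parts of the tape.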

Whatever we do to deal with an error, we can't do it if the error has not been detected, so it is axiomatic that reliable error detection is the most important part of our strategy. On replay, the original data will no longer be available, so error-detecting codes have to work with whatever is on the medium. As will be seen, error-detecting codes are looking for certain patterns in the data. They are not infallible, and with low probability, large errors can simulate those patterns and make it look as if the data are good. Formatting disks and flash cards and read-after-write on tapes both rely on bit-for-bit comparison with the original data, which is infallible. These measures protect the later error-detecting codes.
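The way an error can simulate a valid pattern is easy to demonstrate with the simplest possible check. The sketch below uses a single even parity bit purely as a stand-in; it is an assumption made for illustration, and the codes used in practice are far more powerful. The helper names are hypothetical.

```python
def parity_bit(bits):
    """Bit that makes the total number of ones even."""
    return sum(bits) % 2

def looks_good(codeword):
    """The only pattern this detector checks: an even number of ones."""
    return sum(codeword) % 2 == 0

def flip(codeword, positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    return [bit ^ 1 if i in positions else bit
            for i, bit in enumerate(codeword)]

data = [1, 0, 1, 1, 0, 0, 1, 0]
codeword = data + [parity_bit(data)]

single_error = flip(codeword, {3})
double_error = flip(codeword, {3, 6})

print("original passes check:    ", looks_good(codeword))
print("single error passes check:", looks_good(single_error))
print("double error passes check:", looks_good(double_error))
```

The single error is caught, but the double error restores even parity, so corrupted data pass the check. A bit-for-bit comparison with the original would have caught both.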
