The first burst error correcting code was the Fire Code, which was once widely used on hard disk drives. Here we look at how it works and how it was used.
Data recording and transmission need different types of error correction because the channels act in different ways. For example a deep-space probe returning signals to Earth has limited power and the received signals will be very noisy and will contain many random errors. On the other hand storage media tend to suffer from burst errors due to defects in the medium or from particles of contaminant between the head and the medium.
Here we look at the correction of burst errors. Using the Hamming code described in an earlier piece, single bit errors could be corrected. Clearly using bit interleaving, a large number of Hamming codes could be created across a data block such that a burst error would cause a single bit error in many different codes, but the amount of redundancy needed would be very great. An error-correcting code incorporating interleaving principles could correct a burst error with a single code, which could be more efficient.
Fig.1 - A burst error is defined as a series of bits where those at the ends must be in error (1) whereas those between (X) may or may not be.
It is important to consider what a burst is in the context of error correction. If a pinhole in a magnetic coating on a hard drive intersects a data track, or if an arcing switch interferes with a transmission, the signal is impaired for a short time and for the rest of the time it is fine. The effect of impairment is interesting, because it does not always cause errors. For example a noise spike whose polarity opposes the polarity of a data bit may cancel it out and cause an error, but a noise spike having the same polarity as that of a data bit could reinforce it and no error occurs.
As a result a burst error has a peculiar definition. The first and last bits of the burst are always in error, but the bits in between may or may not be in error. Fig.1 shows the idea. It is important because it defines what the error detector needs to look for.
Fig.2a) shows a very simple parity generating circuit that could be designed to run at very high bit rate. Essentially the register forms a delay of nine bits, so as bit 0 appears on the output of the delay at one input to the XOR gate, bit 9 appears on the other input, so the XOR of bits 0 and 9 enters the delay. By the time that emerges from the delay, bit 18 will be on the input, and the gate calculates the XOR of bits 0,9 and 18 and so on.
Fig.2b) shows that the effect of the circuit of Fig.2a) is that of a ring having a circumference of nine bits rolling along a line of bits. Every time the ring contacts a new bit its value is XORed with the value in the ring.
A simple delay line and XOR gate at a) creates in this example nine check bits that form parity on bits 0,9,18,27….. 1,10,19,28…… 2,11,20,29…. And so on. The effect is analogous to a ring b) rolling down the serial data and making a new XOR at each bit.
The hard-working XOR gate is calculating parity on every ninth bit in the message. In other words, one parity check is on bits 0,9,18,27…., the next is on bits 1,10,19,28… and leaves nine check bits in the register.
As this parity checking is convolutional, there is no limit to the length of the data block that can be used.
A possible burst error of three bits is shown in Fig.3a) and this reflects in the pattern of parity errors, in which the syndrome literally draws a map of the bits in error in the burst.
Fig.3b) shows that a different configuration of errors could produce the same syndrome. This is serious, because if it is believed, a miscorrection will result and a mechanism has to be found that prevents such an occurrence. The syndrome of Fig.3b) could result from the ends of a long burst. It follows that prevention of miscorrection can be achieved by a system that can detect bursts that are longer than those that can be corrected.
Fig.3 - A genuine single burst of three bits at a) is mapped in the ring on replay. A different failure shown in b) produces the same syndrome. Steps must be taken to detect that to prevent miscorrection.
If the system is designed to correct a burst of length up to b bits, then in the worst case where only the first and last bits are in error, there will be a maximum of b-2 error free bits inside the burst. If the parity generator uses a delay of 2b - 1 bits, the ring of Fig.2b) will have a circumference of 2b – 1 bits. Outside any single correctable burst of b bits there must be b-1 bits that are not in error, whose parity bits will be zero.
That is the principle of the Fire code that was devised by Philip Fire in 1959. It was used extensively in hard disk drives to overcome errors due to media defects and contamination.
The nine-bit example of Fig.2 must have a maximum burst size b of 5 bits. On reading the data, the parity check of Fig.2a) is repeated to create a ring of parity bits. The decoder rotates the ring, looking for a run of zeros. If it can’t find a contiguous run of four (b – 1) zeros in the syndrome, the error is un-correctable. However, if it finds a run of b – 1 zeros or more, the end of the burst has been located with respect to the repeating sequence of nine bits.
Obviously the same pattern of parity failures would be found if the that burst error had occurred 9, 18, 27….. bits away and an additional mechanism is needed to determine where the burst was, so the pattern in the ring can correct it.
The Fire code combines the parity generator of Fig.2 with a cyclic code whose purpose is to find where the burst is so that the parity check can correct it. The polynomial also gives some protection against a random error occurring elsewhere in the block that would prevent a burst being corrected.
Fig.4 shows a simple Fire code in which a polynomial generator having a sequence length of 31 is multiplied by one having a sequence length of 9 to create 279-bit code words.
On writing data, the serial data are also clocked into the encoder. As the last data bit is written, the redundancy calculation has completed and the feedback is turned off so the check bits can be written immediately after the data, to form a code word.
Fig.4 - A Fire code having a sequence length of 279 bits of which 14 are check bits. As 2b -1 is 9 here, then b must be 5, the biggest burst that can be corrected.
If the data are subsequently read and shifted into the decoder, the in the case there is no error the syndrome will be all zeros and no action needs to be taken. However, if there is an error there will be a non-zero syndrome. In order to attempt correction, the input to the twisted ring counter is held at zero and the counter is clocked.
Fig.5 - In a punctured Fire code, the search for the burst results in a count with respect to the start of the code, not the start of the data block.
The ring counter begins to step through sequential powers of a Galois Field. If the problem is a single correctable burst, one of those powers is also the burst error. It is a characteristic of Galois Fields that sequential powers are highly non-sequential in pure binary. As the ring counter is stepped, if the burst error is within the allowable burst size, at some point a contiguous row of at least b -1 zeros will be found in the ring counter. This will only take place at one count. The idea is shown in Fig.5.
In actual use, Fire codes designed to correct large burst errors will have long sequence lengths, much longer than the size of a typical data block. The codewords will be punctured, meaning that only the end of the codeword is occupied by data, followed by the check bits. The part of the codeword before the data is considered to be all zeros. In writing and reading this is not a problem because zeros can be fed into a twisted ring counter indefinitely and nothing happens.
However, when an error is encountered, the number of clocks needed to locate b – 1 or more zeros is counted from the start of the codeword, so the number of leading zeros needs to be subtracted to locate the error with respect to the data.
You might also like...
The CRC (cyclic redundancy check) was primarily an error detector, but it did allow some early error correction systems to be implemented. There are many different CRCs but they all work in much the same way, which is that the…
The mathematics of finite fields and sequences seems to be a long way from everyday life, but it happens in the background every time we use a computer and without it, an explanation of modern error correction cannot be given.
Computer marketing departments typically do not promote all company products. Rather they focus on high margin products.
Here we look at one of the first practical error-correcting codes to find wide usage. Richard Hamming worked with early computers and became frustrated when errors made them crash. The rest is history.
Error correction is fascinating not least because it involves concepts that are not much used elsewhere, along with some idiomatic terminology that needs careful definition.