Data Recording and Transmission: Part 17 - Error Correction II

Here we look at one of the first practical error-correcting codes to find wide usage. Richard Hamming worked with early computers and became frustrated when errors made them crash. The rest is history.

The principle of parity generation and checking is very simple to implement, as the only calculation required is to determine if the number of ones in a given binary symbol is odd or even.

However, if there is a parity failure in a received symbol, all that can be said is that there has been an error that could, with decreasing probability, have been in one, three or five bits. Even if the statistics of the system are such that only single bit errors are probable, a simple parity check still cannot determine the position of the failing bit.

It follows from information theory that as parity has only two states, it can only indicate the definite presence or probable absence of an error, but no more. It should come as no surprise that to locate an error in a symbol requires more than one bit of redundancy.

One noteworthy approach actually allowing failing bits to be located and making actual correction possible, was published by Richard Hamming in 1950. The Hamming code works by making several parity checks on the same data symbol. The parity checking is selective. What that means is that in any given data symbol, not all of the bits are included in a given parity check.

The reasoning is simple, which is that if a given selective parity check fails, the error must have been in one of the selected bits (or in the parity bit). Clearly the error could not have been in the bits that were not included in that parity check. With appropriate design, when there are several different selective parity checks, a given failing bit will cause some checks to fail and other checks not to fail. The pattern of parity failures is called a syndrome and in a successful code the syndrome must be different for each possible failing bit.

Fig. 1. The truth table for a 7,4 Hamming code. There is one column in this table for each bit in the code word. There are three rows, one for each parity check. The crosses show the bits that the parity checks include.

If there are, for example, three different selective parity checks, then there will be a three-bit syndrome, which could have eight possible combinations. One of these is the error-free condition, so the remaining seven combinations can point to an error. It follows immediately that the code word cannot be longer than seven bits and that three of them must be parity bits. That is the reasoning behind the 7,4 Hamming code which protects four data bits with three check bits.

An elegant feature of the Hamming code is that that the syndrome is the address of the failing bit. Although larger code words are possible, the 7,4 Hamming code is small enough to allow various different worked examples to be shown from different viewpoints.

Fig.1 shows a table having seven columns, one for each bit in the code word, and having three rows, one for each selective parity check. The check bits are P1, P2 and P3 and the four data bits are in columns 3, 5, 6 and 7. In each row of Fig.1, four crosses will be found. These are the bits that are selected in each parity check. In the finished seven-bit code word, whatever the four data bit combination might be, those three parity checks will always be even. In the absence of an error, the parity checks will all succeed and the generated syndrome will be 000.

It is useful to consider how the pattern of crosses in Fig.1 was selected. If each cross were to be replaced by a one and the absence of a cross were to be replaced by a zero, the pattern shown would represent a pure binary count starting at one on the left and finishing with seven on the right, where the lowest row represents the least significant bit.

It is possible to work through Fig.1 and to select any failing bit. For example if bit 5 were to fail, the parity check P2 would be unaffected since it does not consider bit 5. However, parity checks P1 and P3 would fail so the syndrome would be 101, which is or course 5, the address of the failing bit.

If the failure can be assumed to be due to a single failing bit, then reversing the state of bit 5 as received will achieve a correction. This process can be repeated for any one failing bit and the syndrome will always point to that bit. However, if any two bits are allowed to fail, the syndrome will be non-zero, but clearly one syndrome cannot point to two failing bits.

For example, if bits 3 and 6 fail, parity check P2 will not fail because there have been two errors and the parity remains even. Parity checks P1 and P3 will fail giving a syndrome of 5. If that is believed, and bit 5 is reversed, there is now a three-bit error, illustrating that error correction makes the data worse when its power is exceeded.

Fig.2. The 7,4 Hamming code represented as a Venn diagram, where the bits included in each parity check form three sets. The bit numbers correspond to the columns of Fig. 1.

If three bits are in error, for example 3, 5 and 6, the syndrome will be zero because every parity check will suffer a double bit error, which leaves the parity even. A zero syndrome can only result from a code word (even if it is the wrong one) so the 7,4 Hamming code must have a distance of three, because changing three bits turns one code word into another.

As there are four data bits, they can have 16 combinations and so there can only be 16 different code words. In seven bits there can be 128 combinations and that is the number of possible received symbols. After subtracting the 16 code words, 112 combinations are left which cannot be code words and which have a variety of distances from code words.

Another way of considering the 7,4 Hamming code is to use a Venn diagram, as has been done in Fig. 2. It will be seen that the three circles overlap to produce seven areas, one for each bit in the code word. Each circle contains a different set of four bits, one being a parity bit, based on selected combinations of the data bits.

If parity checks P1 and P2 fail in Fig.2, the only single bit error that can cause that is D1. However, it can be see that a failure of D2 and D3 would also cause checks P1 and P2 to fail, whereas P3 would not fail because it contains a double error. It can also be seen that if bits D1, D2 and D3 were to fail the result would be that a syndrome of zero would generated, because each parity check would contain a double error. That is consistent with the code having a distance of three, which distance converts one code word into another.

It is also possible to see how the Hamming code works mathematically. Fig. 3a) shows that two matrices are created, one to generate the code word and one to check it. Fig.3b) shows what happens. The input data, in this example 1001, selects the top and bottom rows of the generator matrix. Those selected rows are added together, modulo-2, to create the code word.

Fig.3b) also shows the example of a bit error which changes the state of bit three. On receipt, the ones in the received symbol select columns in the check matrix H, which are added together horizontally modulo-two to create the syndrome, which in this case has a value of three, the address of the failing bit.

Fig.3. The generator and check matrices of the 7,4 Hamming code are shown at a). The generation and checking of an example code word is shown at b).

Hamming code has been used extensively in error-correcting computer memories, where the failure mechanism is due to high-energy radiation changing the charge on a bit in much the same way as light creates a signal in a camera. Such high-energy particles are relatively uncommon, and each one can only hit a single bit, so being able to correct single bits is adequate. For added security, most such systems added a further parity bit to the Hamming code to make it into a distance four code that was single bit error correcting and double-bit error detecting. This is universally abbreviated to SECDED.

If the additional parity check failed, but the Hamming 7,4 code gave a zero syndrome, the error was in the extra parity bit. If the additional parity check failed but a non-zero syndrome was obtained, that pointed to the failing bit. A non-zero syndrome without the failure of the additional parity check pointed to a double-bit error, as it would leave the additional parity check with even parity. Correction is not possible but error propagation would be avoided, and some other strategy could be invoked.

I can recall the introduction of SECDED memories in the 1970s when there was considerable controversy over how they should be introduced to users. Prior computer memory technology only made errors when it was faulty and there was concern that users would be unhappy if they knew their hardware was making errors. Fast forward ten years and I heard a record producer say at a conference that what error correction did for him was that he could chuck up on the medium. How times change.

You might also like...