Data Recording and Transmission: Part 16 - Error Correction

Error correction is fascinating not least because it involves concepts that are not much used elsewhere, along with some idiomatic terminology that needs careful definition.

The foundations of error correction go back a long way, to John Venn, whose diagrams helped visualization, to Charles Dodgson, the Oxford mathematician perhaps better known as Lewis Carroll, and to Evariste Galois.

One place to start is to stress that a single bit carries an absolutely fixed amount of information that determines which of only two states is the case. Those two states may represent truth or falsehood, presence or absence, or they may represent a one or a zero. Whichever it is, there is no gray scale: no value judgement; no intermediate condition. On account of that it is possible to say that a single bit in isolation can be either absolutely correct or irrefutably wrong.

The direct consequence is that if we are able to say that a bit is incorrect, then we also know what the correct value is, because that is simply the opposite state to the incorrect state. The actual correction process is trivial, because it consists of altering the state of the incorrect bit. The correction is perfect, because the bit goes to the absolutely correct state, as if it had never been wrong.

Fig.1 shows the correction process requires nothing more than an exclusive-OR gate, which acts as a selective inverter, flipping the bits that are wrong.

Figure 1a - An exclusive-OR gate acts as a selective invertor, changing the state of bits known to be wrong so they become correct again. The XOR gate is a parity generator as the output and the inputs together always have even parity. It also correctly adds two-bit together, except that it fails to create any carry out.

If the correction itself is trivial and perfect, the process of determining which bit or bits are incorrect certainly is neither. That is where the art lies. The determination of bits in error is statistical, not absolute. If the determination is incorrect then bits that are actually correct are flipped to increase the number of incorrect bits, instead of reducing the number.

The exclusive-OR gate has some interesting properties. The truth table is that the output is true only when the inputs are different. Fig.1 also shows some more of the attributes of this peculiar logic element. As the output will be true only for inputs of 10 and 01, then it acts as an even parity generator, where even parity is defined as the number of ones in a word being even.

If we consider adding two bits, then 0 + 0 = 0, 0 +1 or 1 +0 = 1 and 1 + 1 requires a carry to become 10. In fact the exclusive-OR, or XOR gate calculates the sum correctly, except that it fails to generate the carry, as a full adder would do. The type of arithmetic that describes the action of the XOR gate is that it works in modulo-2.

In modulo arithmetic, the result can never exceed one less than the modulus. For example, a modulo-ten counter would count 0123456789012345678901234 because there is no carry function. A modulo-2 counter counts 0101010101....

Although it is quite alien to the arithmetic of our school days with decimal numbers and carries and borrows, modulo arithmetic has some advantages, not least that the absence of carries speeds up the logic implementing a calculation. Take the case of a binary count that has to increment from 01111111111. Adding one to that sets off a series of carries that ripple through the word to change the MSB to a 1. The bits effectively have to be processed one at a time.

In comparison a modulo-2 circuit can process its bits in parallel because there are no carries or borrows. So we adopt the arithmetic that will work with fast logic, rather than slowing the logic down to what we are used to. Not only does error correction have to be smart, it has to work at the blistering bit rates of real media.

Error correction also has its own terminology and one of the first concepts to be considered has to be that of Hamming distance, after Richard Hamming. This is not a distance in the spatial sense, although simple spatial analogies are possible. Instead the distance is a measure of the number of bits that are different between two binary symbols of the same wordlength. It does not matter which bits differ, only the number of them.

Fig.2a) shows Hamming distance graphically in a four-bit system having sixteen combinations shown inside the disks. Every path between two disks represents a Hamming distance of one. Fig.2b) shows a Venn diagram for four bits, where some of the Hamming distance-one paths are depicted by worm holes.

Figure 2a).

Figure 2b).<br />At a) a four-bit system is shown where the sixteen combinations are separated by paths corresponding to a Hamming distance of one. The same is shown in b) which is a Venn diagram to which some distance-1 worm holes have been added.

Figure 2b).
At a) a four-bit system is shown where the sixteen combinations are separated by paths corresponding to a Hamming distance of one. The same is shown in b) which is a Venn diagram to which some distance-1 worm holes have been added.

The concept of Hamming distance and some other key buzzwords can be introduced by considering a simple parity check. Fig. 3 shows a symbol of four data bits connected to a parity generator that produces a fifth bit such that whatever combination the four data bits have, the number of ones in the message is always even.

The parity bit is created from the data and therefore adds nothing to it and is described as redundant. All error correction operates by adding redundancy to the data. The parity bit gives the message a constant testable characteristic. The test can be made without knowing anything else about the data. A message having a constant testable characteristic is called a codeword. Fig.3 shows that the codeword consists of the data plus the redundancy.

In the example of Fig.3, there are 32 possible combinations of 5 bits. Of these, 16 are codewords and 16 are not codewords. Fig.4a) shows that they have a Hamming distance of one from a codeword. One thing we can say with absolute certainty is that the receipt of a symbol that is not a codeword indicates that something is wrong, although in this case we cannot correct it.

Figure 3 - A simple parity system, in which a redundant bit is added to ensure that there is always an even number of ones in the output, whatever the state of the original data.

Unfortunately, the converse is not true. Receipt of a symbol that is a codeword does not guarantee nothing is wrong, because there are many possible codewords and we might have received the wrong one.

Between every pair of codewords exists a word having a Hamming distance of one that is not a codeword. A Hamming distance of two separates codewords from one another. Thus, parity is described as a distance 2 code. What we can say is that all single bit errors must result in a symbol at a Hamming distance of one from a codeword, which therefore cannot be a codeword and which must be detectible.

We can also say that a double bit error converts one codeword into another code word at a Hamming distance of two and so is not detectable.

If we extend that idea a little, we can say that the number of bits in error that can reliably be detected cannot exceed the number of bits of redundancy.

Fig.4b) shows that if we add two redundant bits to each symbol, then there are four times as many possible symbols as data combinations. One quarter of those are code words and three quarters are not. Thus between every pair of codewords exists three non-code words. To get from one codeword to another we have to travel a Hamming distance of four, so two bits of redundancy give us a distance-4 code.

Figure 4.
At a) a distance-2 code such as simple parity cannot perform correction but can detect single bit errors because they have a Hamming distance of 1 from codewords. At b) a distance-4 code has three steps of distance 1 between code words. A single bit error can be detected and attributed to the only code word from which it has a Hamming distance of 1. Two-bit errors can be detected, but not corrected, because there are many code words having a Hamming distance of two from a double-bit error.

A single bit error produces a symbol that is not a code word, but which has a Hamming distance of one from the correct code word. This has a Hamming distance of at least three from any other code word. In other words, the received symbol could have resulted from a single bit error suffered by only one codeword, or by a three-bit error suffered by a number of others.

If we know that three-bit errors are either not possible or are highly unlikely, we can deduce that the original code word was most likely the one which has a Hamming distance of one from the received symbol. That is the error-correcting equivalent of Occam's razor: the codeword having the shortest Hamming distance from the received symbol is probably the correct one. It follows from Fig.4b) that error correction is theoretically possible, even if it doesn't tell us how to do it.

Fig.4b) also shows that if two bits are in error, the received symbol will have a Hamming distance of two from not one, but many code words and the problem is that we can't tell which one. All we can say is that there has been an error, but it can't be corrected.

Using a distance-4 code, we can theoretically build an error checking system. If we know that only single bit errors can occur, we have a reliable single bit correction system. If mostly single bit errors with the occasional double-bit error occur, we can build a single-error-correcting-double-error-detecting (SECDED) system. The double errors cannot be corrected, but their detection allows perhaps a re-transmission request.

If three-bit errors can occur, then the codeword having the shortest Hamming distance from a three-bit error is the wrong one and a miscorrection will have taken place. Thus the probability of miscorrection becomes the probability of three bit errors. Three bit errors could be prevented or made less probable using, for example, interleaving.

You might also like...

Understanding IP Production Networks: Part 14 - Delay Monitoring

We use buffers to reassemble asynchronous streams so we must measure how long individual packets take to reliably get to the receiver, and the maximum and minimum delay of all packets at the receiver.

Understanding IP Production Networks: Part 13 - Quality Of Service

How QoS introduces a degree of control over packet prioritization to improve streaming over asynchronous networks.

Understanding IP Production Networks: Part 12 - Measuring Line Speeds

Broadcast and IT engineers take very different approaches to network speed and capacity; it is essential to reach a shared understanding.

Understanding IP Production Networks: Part 11 - Network Analyzers

Wireshark is an invaluable tool that enables engineers to examine network traffic in detail. Commercial monitoring platforms provide even deeper observation.

Understanding IP Production Networks: Part 10 - Security

The flexibility of IP and COTS brings with it all of the security dangers of the internet and the need for robust processes. It means new questions need to be asked of broadcast equipment manufacturers.