We are not done with statistics yet. In a sense we will never be done with it, and it is better to know how to deal with it than to ignore it. It is better still to know how others commonly fail to deal with it and so reach conclusions that cannot be justified. Such failures are typically very hard to explain to people who do not understand statistics.
Kelly Johnson, the designer of the U-2, had a problem with the US military. They were insisting the plane should have twin engines. Johnson explained that the only defense the plane had was extreme altitude, and although it might still fly on one engine it would have to fly lower and would then be shot down. He added that with twin engines the chance of an engine failure was roughly doubled, so the single-engine plane was more likely to complete the mission. He was right.
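Johnson's arithmetic can be checked with a short sketch. It assumes that each engine fails independently and with the same probability per mission. The figure of 1% used here is purely illustrative and does not come from any U-2 data.

```python
def prob_any_engine_fails(p_single: float, engines: int) -> float:
    """Chance that at least one of `engines` independent engines fails."""
    return 1.0 - (1.0 - p_single) ** engines


p = 0.01  # hypothetical per-mission failure probability of one engine

single = prob_any_engine_fails(p, 1)
twin = prob_any_engine_fails(p, 2)

# For small p, 1 - (1 - p)^2 = 2p - p^2, which is very nearly 2p.
print(f"single engine: {single:.4f}")
print(f"twin engines:  {twin:.4f}")
print(f"ratio:         {twin / single:.3f}")
```

Since the mission fails if either engine fails, adding a second engine nearly doubles the chance of failure rather than halving it.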
Much of statistics comes down to sampling; not just sampling in the closely defined sense used in, for example, digital audio and video, but in the larger sense. Sampling may or may not lose information, depending on how it is done. What Shannon did was to establish how to do it. Sampling is at least two-dimensional: samples are taken in time (or space) and then the value of each sample is itself sampled, or quantized, onto a limited set of levels.
For example, cameras, whether using film or electronics, whether for moving or still pictures, have limited dynamic range that is frequently exceeded by the scene they are expected to capture. Essentially the camera samples the dynamic range and only reproduces some of it. Areas that are too bright will clip to white and areas that are too dark will come out black. Information in such areas must be lost. The best we can do is to balance the losses by selecting an exposure that is a reasonable compromise.
The same thing happens when making an audio recording. The dynamic range of the recording system may be less than that of the original sounds. Using Shannon's concepts, we can work out how many bits we need to represent a given dynamic range, although that doesn't tell us how hard that range might be to achieve in practice.
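As a rough sketch of that calculation, assume uniform quantizing, where each extra bit doubles the number of levels and so adds about 6 dB of range. The function name is hypothetical. The sketch ignores refinements such as dither and the small constant that appears in the signal-to-noise ratio of a full-scale sine wave.

```python
import math


def bits_for_dynamic_range(dynamic_range_db: float) -> int:
    """Smallest word length whose uniform quantizing range covers the target."""
    # Doubling the number of levels adds 20*log10(2), about 6.02 dB, per bit.
    db_per_bit = 20.0 * math.log10(2.0)
    return math.ceil(dynamic_range_db / db_per_bit)


for target_db in (60, 96, 120):
    print(f"{target_db} dB needs about {bits_for_dynamic_range(target_db)} bits")
```

On these assumptions, a 96 dB range calls for 16 bits and a 120 dB range for 20 bits.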
Sampling also takes place in time, space and populations. An opinion poll, for example, is based on the views of a subset of the population. In electronic parlance, we would call it undersampling and that opens a whole can of Gaussian worms that we will need to look at. Undersampling implies loss of information, from which it follows that there is loss of certainty.
Shannon also looked at the problem and was able to determine how to sample without loss of information. Fig.1 shows the concept. The bound explained by Shannon is the sampling rate at which information loss is just avoided. Below the bound we have undersampling, or polling. Above the bound we have oversampling, where the sampling rate is higher than necessary.
The probability of a fair coin landing heads up is 0.5. But if we toss two such coins, or the same one twice, we may get TT, TH, HT or HH, and two of those outcomes, TT and HH, do not match the suggested probability. The reason is that the sample is too small. Statistics cannot work reliably with small samples, which do not result in Gaussian curves.
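The effect of sample size can be made concrete by counting outcomes exactly rather than by tossing coins. The sketch below works out how likely it is that the fraction of heads lands within a chosen tolerance of one half. The 10% tolerance and the function name are illustrative choices only.

```python
from math import comb


def prob_heads_fraction_near_half(tosses: int, tolerance_percent: int) -> float:
    """Probability that heads/tosses lies within tolerance_percent of 50%."""
    favourable = 0
    for heads in range(tosses + 1):
        # Integer form of |heads/tosses - 1/2| <= tolerance_percent/100,
        # which avoids floating-point trouble at the edges.
        if abs(2 * heads - tosses) * 100 <= 2 * tolerance_percent * tosses:
            favourable += comb(tosses, heads)
    return favourable / 2 ** tosses


for n in (2, 10, 100, 1000):
    print(n, prob_heads_fraction_near_half(n, 10))
```

With two tosses the answer is exactly 0.5, because only TH and HT qualify. As the number of tosses grows, the probability climbs towards one, which is why large samples can be trusted and small ones cannot.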
Fig.1 - The bound determined by Shannon, often called the Nyquist frequency, explains how to sample without losing information. Below the bound we are polling or undersampling. Above it we are oversampling.
In any polling situation, where less than the entire population is sampled, our conclusions must be qualified. Whatever feature we are looking for is subject to a distribution and that distribution can never be known accurately using subsampling. In other words our conclusions may not be true for the population as a whole, because by chance we could have taken unrepresentative samples.
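One common way of qualifying a poll result is to quote a margin of error. The sketch below uses the usual normal approximation for a proportion. It assumes a simple random sample and a 95% confidence level (z = 1.96), neither of which is specified in the discussion above.

```python
import math


def margin_of_error(sample_size: int, proportion: float = 0.5, z: float = 1.96) -> float:
    """Approximate half-width of the confidence interval for a polled proportion."""
    # Standard error of a proportion, scaled by z for the chosen confidence level.
    return z * math.sqrt(proportion * (1.0 - proportion) / sample_size)


for n in (100, 1000, 10000):
    print(f"sample of {n}: about +/- {100 * margin_of_error(n):.1f} percentage points")
```

The margin shrinks only with the square root of the sample size, so it takes four times as many respondents to halve the uncertainty, and it never reaches zero short of asking everyone.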
The chi-squared test, much loved by statisticians, is often used to see whether sampled data could have come about by chance, but it has to be interpreted carefully because the outcome of the chi-squared test is itself probabilistic. It does not guarantee that your findings could not have come about by chance, but simply says how likely that was.
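A minimal sketch of the test, applied to a coin tossed 100 times, shows why the interpretation needs care. The observed counts are invented for illustration. The value 3.841 is the standard 5% critical value for one degree of freedom.

```python
def chi_squared_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


CRITICAL_5_PERCENT_1_DOF = 3.841  # standard table value, one degree of freedom

observed = [60, 40]  # hypothetical result: 60 heads, 40 tails
expected = [50, 50]  # what a fair coin predicts on average

statistic = chi_squared_statistic(observed, expected)
print(f"chi-squared = {statistic}")
print("exceeds 5% critical value:", statistic > CRITICAL_5_PERCENT_1_DOF)
```

The statistic exceeds the critical value, so the result would be declared significant at the 5% level. Yet a perfectly fair coin will produce a result at least this lopsided a little less than one time in twenty, so the test has not proved the coin is biased.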
In any process that is truly random, all kinds of things happen that do not seem intuitive. For example, random distribution of some characteristic in a population does not require the distribution to be uniform. It actually requires the distribution to be non-uniform and the result is areas where the characteristic is rare and other areas, known as clusters, where it is more common than average.
The popular assumption when a cluster is found is that something is causing it. Like sneaker waves, clusters are expected in random systems, so it is fallacious to assume that there must be a cause.
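Clustering of this kind is easy to demonstrate. The sketch below scatters cases at random over equal areas, with no cause at work at all, and then looks at the busiest and quietest areas. The numbers of cases and areas, and the seed, are arbitrary choices.

```python
import random
from collections import Counter


def scatter_counts(cases: int, areas: int, seed: int) -> list:
    """Drop each case into a uniformly random area and count cases per area."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    tally = Counter(rng.randrange(areas) for _ in range(cases))
    return [tally.get(area, 0) for area in range(areas)]


counts = scatter_counts(cases=1000, areas=100, seed=1)
mean = sum(counts) / len(counts)

print(f"average per area: {mean}")
print(f"quietest area:    {min(counts)}")
print(f"busiest area:     {max(counts)}")
```

Although every area is equally likely to receive each case, some areas end up with far more than the average of ten and others with far fewer. Anyone shown only the busiest area might be tempted to look for a cause that does not exist.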
Interestingly, almost the same statistical concepts apply to insurance and error correction. In any endeavor, experience shows how often something goes wrong, how often vehicles crash, or ships sink and so on. Once that is known, the cost of putting things right will also be known. Generally it is not known by whom these losses will be sustained. Insurance works by spreading the cost of damage over the whole field. Everyone involved in the endeavor pays a premium, which allows the insurers to have sufficient funds to pay when an insured loss takes place.
In error correction, the equivalent of paying the insurance premium is the need to transmit extra data, known as redundancy, which allows most of the errors introduced in real channels to be corrected.
Whilst the way in which an electrical signal is interpreted might be binary, the signal itself has to be considered as if it were an analog signal. The ideal signal will be corrupted by noise. The amplitude of the noise is important. If it is small, it has no effect on a digital signal, whereas if it is large, it might change the interpretation of the signal from a one to a zero or vice versa.
Knowing the distribution of noise in the signal path allows the probability distribution of errors to be calculated.
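As a sketch of that calculation, assume the simplest possible case: a two-level signal that sits a fixed distance either side of the decision threshold, corrupted by Gaussian noise. A bit is then misread whenever the noise exceeds that distance in the wrong direction. The function and parameter names are hypothetical.

```python
import math


def bit_error_probability(distance_to_threshold: float, noise_rms: float) -> float:
    """Chance that Gaussian noise pushes a level across the decision threshold."""
    # Tail area of a Gaussian beyond distance/rms standard deviations.
    return 0.5 * math.erfc(distance_to_threshold / (noise_rms * math.sqrt(2.0)))


for ratio in (1, 3, 6):
    print(f"signal {ratio}x noise rms: error probability "
          f"{bit_error_probability(ratio, 1.0):.3g}")
```

Because the tails of the Gaussian curve fall away so steeply, a modest improvement in the ratio of signal to noise reduces the error rate by orders of magnitude, but the error rate never falls to zero.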
As is well known, error correction works by adding check bits to the data. The check bits are calculated from the data and so do not increase the information sent. For that reason they are referred to as redundancy. The redundancy and the data from which it was calculated form an entity called a code word.
On receipt of the data, we can repeat the same calculation. If the result of the calculation does not agree with the check bits, what we have received is not a code word and there has definitely been an error. However, the converse does not follow. If the check bits do agree with the data, all we can say is that we have received a code word. We cannot say that it was the code word that was sent.
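The simplest possible illustration is a single parity bit, used here only because it is easy to follow and not because any practical system would rely on it alone. One bit in error is detected, but two bits in error produce a different, perfectly valid code word.

```python
def parity_bit(data_bits):
    """Check bit chosen so that the whole code word has an even number of ones."""
    return sum(data_bits) % 2


def make_code_word(data_bits):
    return data_bits + [parity_bit(data_bits)]


def is_code_word(word):
    *data_bits, check = word
    return parity_bit(data_bits) == check  # repeat the calculation on receipt


sent = make_code_word([1, 0, 1, 1, 0, 0, 1, 0])

one_error = sent.copy()
one_error[2] ^= 1  # flip a single bit

two_errors = sent.copy()
two_errors[2] ^= 1  # flip two bits
two_errors[5] ^= 1

print("sent:       ", is_code_word(sent))
print("one error:  ", is_code_word(one_error))
print("two errors: ", is_code_word(two_errors), "same as sent:", two_errors == sent)
```

The word with two errors passes the check even though it is not what was sent.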
As a result we cannot assume the data are error free. The reason is easy to see, because in all real systems, the check bits are far fewer than the data bits. It follows that there must be a far greater number of combinations of data than there are combinations of check bits, so a given set of check bits could result from many different data combinations.
We have the same problem as we have with undersampling or polling, where results could be due to chance. In a noisy channel it is perfectly possible for a valid code word to be generated by chance. If we are foolish enough to think that the code word represents an error free message, we shall accept what are essentially random numbers as if they were valid data.
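The chance of that happening can be counted directly for a small code. The sketch below uses four data bits and three check bits, with the check bits formed as parities in the manner of a Hamming code. The particular code is chosen only for illustration. It then tries every possible seven-bit pattern to see how many would be accepted as code words.

```python
from itertools import product


def check_bits(data):
    """Three illustrative parity checks over four data bits."""
    d1, d2, d3, d4 = data
    return (d1 ^ d2 ^ d4, d1 ^ d3 ^ d4, d2 ^ d3 ^ d4)


def is_code_word(word):
    return check_bits(word[:4]) == tuple(word[4:])


all_patterns = list(product((0, 1), repeat=7))
valid = sum(is_code_word(pattern) for pattern in all_patterns)
fraction = valid / len(all_patterns)

print(f"{valid} of {len(all_patterns)} patterns are code words")
print(f"fraction accepted: {fraction}")
```

Sixteen of the 128 patterns are code words, so random noise has a one in eight chance of passing the check. In general the fraction is one over two to the power of the number of check bits, which is why more redundancy gives more reliable detection but never certainty.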
The characteristic of Fig.2 follows. Received errors below a certain bound are corrected, whereas above that bound they are mis-corrected and the data get worse. In all practical error correction systems we have to do whatever it takes to avoid going beyond that bound.
Fig.2 - The trade-off of error correction is that reduction in the effect of errors at low error rates is balanced by an increase in errors when the ability of the system is exceeded.
Modern error correcting systems use techniques such as interleaving and product codes in which the ability to detect errors far exceeds the correction power so that mis-correction is all but prevented. All system design must begin with a study of the error statistics of the channel.
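Interleaving can be sketched with a simple block interleaver, which is only one of several possible arrangements. Code words are written into a memory as rows and read out as columns, so that a burst of consecutive errors in the channel is shared out among several code words after de-interleaving. The sizes used and the marking of corrupted symbols with -1 are arbitrary choices for the demonstration.

```python
def interleave(symbols, rows, cols):
    """Write row by row, read column by column."""
    assert len(symbols) == rows * cols
    return [symbols[r * cols + c] for c in range(cols) for r in range(rows)]


def deinterleave(symbols, rows, cols):
    """Reverse the process: write column by column, read row by row."""
    assert len(symbols) == rows * cols
    return [symbols[c * rows + r] for r in range(rows) for c in range(cols)]


rows, cols = 4, 6  # four code words of six symbols each
data = list(range(rows * cols))

sent = interleave(data, rows, cols)
received = sent.copy()
for position in range(7, 11):  # a burst corrupting four consecutive symbols
    received[position] = -1

restored = deinterleave(received, rows, cols)
errors_per_word = [restored[r * cols:(r + 1) * cols].count(-1) for r in range(rows)]
print("errors in each code word:", errors_per_word)
```

Each code word receives just one of the four errors, which is far easier to handle than four errors in a single code word. The price is the delay needed to fill and empty the memory.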
Without statistics we might not even be here. If the replication of DNA that forms the basis of life were totally error free, then offspring would be clones and no evolution could take place. Fortunately, the replication of DNA is not error free, but results in mutations. Some are so radical that the offspring cannot live. Some are detrimental, but some are beneficial and natural selection will automatically ensure the prevalence of beneficial genes.
The random content of many processes essentially guarantees that the outcome cannot be predicted. The old idea that the future could be predicted from the present using the laws of physics was soon abandoned when the statistical nature of quantum mechanics was found. It is simply not possible to predict the future with 100% certainty.
Evolutionary theory provides a good argument. If any species developed true clairvoyance, it would give such a tremendous evolutionary advantage to that species it would become dominant. That has not, of course, happened.