Cameras, Pixels and Resolution

There is a lot of confusion about what a pixel is. John Watkinson argues that it doesn’t matter because the resolution of a camera has little to do with the pixel count.

I think it may be useful to start by considering what sampling means in technology. Unlike many terms, it means exactly what it does in everyday speech. When you sample some wine, you don’t drink the whole bottle, but there’s an underlying assumption that the wine is uniform, so one sip is representative of the whole. Clearly that’s not true of an opinion poll. Asking just one member of the public how he intends to vote is a pretty shaky way of predicting election results. The underlying assumption is different: the more people you ask, the more statistically reliable the prediction is going to be.

If someone claims something is obvious, the only reasonable reaction is “to whom is it obvious?” Equally, whenever the term sampling is used, the only reasonable consideration has to be “what is the underlying assumption?”

If the number of samples of any given source is steadily increased, there must come a point where nothing further can be known and the statistical gives way to the certain. That is what was behind the work of Edmund Whittaker, Vladimir Kotelnikov and Claude Shannon, whose thoughts today are known as the WKS Sampling Theorem. The WKS sampling rate is the point below which we have statistical representation and above which we have oversampling. It is sometimes called the Nyquist rate or limit.
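Stated compactly, and using the usual convention that sinc x = sin(πx)/(πx), the theorem says that a signal containing no frequencies above B is completely determined by samples taken at any rate of at least 2B, and can be rebuilt exactly from them:

```latex
x(t) = \sum_{n=-\infty}^{\infty} x(nT)\,\operatorname{sinc}\!\left(\frac{t - nT}{T}\right),
\qquad T = \frac{1}{f_s}, \qquad f_s \ge 2B .
```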

Before mathematics and technology had advanced much, people were building ships. As they got bigger, the designer would make a scale model of the hull and the shipyard could measure anywhere on the model and scale it up. But it was hard to describe the shape on paper, so lines were invented. Imagine putting the model through a bread slicer so it comes out as a series of slices. Trace round each slice and we have a set of lines.

What are the underlying assumptions here? Firstly, the model will be smooth, so there are no sudden changes between the slices. Today we would say it was pre-filtered. Secondly, the spacing of the slices meets the WKS criterion. Thirdly, the shipyard smoothes out the resultant shape so that the lines are joined up cleanly. Today we would say that it is post-filtered.

Digital audio is identical to shipbuilding. We have a pre-filter that prevents any out-of-band signals passing, we have WKS sampling that gives us complete certainty about what is happening within the chosen bandwidth, and we have a reconstruction filter that smoothes out the reproduced waveform so it is identical to the input waveform. Thus the underlying assumption in digital audio is that the sampling process is preceded and followed by a pair of identical filters having a known relationship to the sampling rate.
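As an entirely illustrative sketch of that reconstruction step (the rate and tone frequency are assumptions, not figures from the article), the snippet below samples a band-limited tone and rebuilds it by summing sinc pulses, which is what an ideal reconstruction filter does:

```python
# A minimal sketch, assuming a 48 kHz rate and a 1 kHz tone: sample a
# band-limited signal and reconstruct it by summing sinc pulses.
import numpy as np

fs = 48_000                 # sampling rate, Hz (assumed)
f = 1_000                   # tone frequency, well inside the band
n = np.arange(64)           # sample indices
samples = np.sin(2 * np.pi * f * n / fs)

# Rebuild the waveform on a fine time grid: each sample launches a sinc pulse
# centred on its own instant, and the pulses sum to a smooth curve.
t = np.linspace(0, (len(n) - 1) / fs, 4000)
reconstructed = sum(s * np.sinc(fs * t - k) for k, s in zip(n, samples))

# Away from the ends of the block (where the sinc tails are truncated) the
# curve hugs the original sine; the print shows the worst-case deviation there.
mid = slice(500, 3500)
print(np.max(np.abs(reconstructed[mid] - np.sin(2 * np.pi * f * t[mid]))))
```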

The journalistic explanations of digital audio involving ugly jagged staircases are completely flawed and serve only to help the misguided audiophile believe that vinyl discs must somehow be better. Yet none of them has ever seen a ship’s hull with jaggies. They have it completely backwards: a properly reconstructed digital audio waveform is as smooth as whipped cream and it’s the groove of the vinyl disc that is like a cat’s tongue.

We can also sample images, still and moving. I read somewhere that there was opposition to the use of the word “sample” to describe a picture element because of the statistical implication. Clearly that argument was flawed, because it is only necessary to state that the samples are taken according to WKS Theory and the statistical element disappears. Nevertheless the term “pixel” came into use. Abbreviate “picture” to “pic” and the plural becomes “pics”, which can be spelt “pix”; hence pixel: an element of pictures. But it’s still a sample by another name. Recall what Richard Feynman said: “You can know what something is called without knowing anything about it”.

In digital audio we can fully describe a sample with a single two’s complement number. It’s a scalar quantity having only magnitude. In images we can only do that in monochrome, where each sample describes the brightness. If we want colour, then each sample becomes a vector, or multi-dimensional number, that not only describes the brightness but also the hue and saturation. So here’s the definition of an image sample or pixel:

“A vector representing the brightness, hue and saturation of an infinitesimally small point in spatio-temporal image space.”

What that means in the digital domain is that a set of three binary numbers, for example R, G and B, stored perhaps in a RAM, together completely describe the visual attributes of a point. The point is located by two spatial dimensions, how far down and how far across the screen, and by a time dimension. Because it is a point, the three components R, G and B are, by definition, co-sited. A 2-D array of such points, sampled according to WKS, completely defines an image as what we would call a frame.
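As a minimal sketch of that idea (the frame dimensions and names are assumptions for illustration, not from the article), a frame can be held in memory as a 2-D array of co-sited three-component vectors:

```python
# A minimal sketch, assuming an HD frame size: each (row, column) location
# holds one co-sited R, G, B vector, i.e. the three components describe the
# same point.
import numpy as np

HEIGHT, WIDTH = 1080, 1920
frame = np.zeros((HEIGHT, WIDTH, 3), dtype=np.uint8)   # one frame, full RGB at every site

# Reading the pixel 100 lines down and 200 samples across returns the whole
# vector at once: brightness, hue and saturation are all encoded in it.
r, g, b = frame[100, 200]
print(r, g, b)
```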

So what we can say is that a pixel is a concept that exists only in the digital domain, in RAM or on a mass storage device as an array of samples each of which is described by three binary numbers. In television parlance we would call it 4:4:4.

Now the bad news: no camera or display uses pixels, because they are conceptual and nobody has ever seen one. Cameras and displays use photo-sites, which are real, are clearly not the same as pixels and are frequently not co-sited. Consideration of the colour triads of the traditional CRT makes that obvious. One obvious difference is that a pixel is required by sampling theory to have come from a vanishingly small point, whereas photo-sites practically touch one another in order to have as much light-gathering area as possible. Thus the infinitely small delta function of ideal WKS sampling is replaced by a zero-order hold system having a 100 percent aperture ratio. Even a monochrome sensor having one photo-site per pixel suffers from this aperture effect. As a result no camera with n photo-sites can ever have n-pixel resolution.
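To put a rough figure on that loss (the numbers below are illustrative, not from the article), the frequency response of a rectangular sampling aperture is a sinc function, so a 100 percent aperture is already about 4 dB down at the Nyquist limit before the lens or the display has had any say:

```python
# A small sketch of the aperture effect: a photo-site with a 100 percent
# aperture behaves as a zero-order hold, whose response rolls off as a sinc
# function instead of staying flat out to the Nyquist limit.
import numpy as np

def aperture_mtf(f, pitch_rate, aperture_ratio=1.0):
    """MTF of a rectangular aperture covering aperture_ratio of the pitch."""
    return np.abs(np.sinc(aperture_ratio * f / pitch_rate))   # np.sinc(x) = sin(pi*x)/(pi*x)

fs = 1.0                  # spatial sampling rate, in cycles per photo-site pitch
nyquist = fs / 2

print(aperture_mtf(nyquist, fs))        # ~0.64, i.e. about 4 dB down at Nyquist
print(aperture_mtf(nyquist, fs, 0.5))   # a smaller aperture loses less sharpness but gathers less light
```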

To make matters worse, the zero temporal aperture required by WKS is also violated. There is a finite exposure time, typically a significant chunk of the frame period, so if anything moves it will be blurred.
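For a sense of scale (these figures are assumptions, not from the article), a 180-degree shutter at 50 frames per second exposes for 10 ms, during which a briskly moving subject smears across a good many photo-sites:

```python
# An illustrative calculation with assumed figures: how far an object moves
# during one exposure, i.e. the size of the temporal aperture violation.
frame_rate = 50.0                  # frames per second
shutter_fraction = 0.5             # a "180 degree" shutter: half the frame period
exposure = shutter_fraction / frame_rate           # 10 ms

speed_pixels_per_second = 2_000.0  # object crossing a 1920-wide frame in about a second
blur_extent = speed_pixels_per_second * exposure
print(f"{blur_extent:.0f} pixels of motion blur")  # 20 pixels: dwarfs the spatial aperture effect
```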

Displays also have large photo-sites, so all displayed images ever seen are subject to two aperture effects in series: one in the sensor and one in the display. The camera lens causes further loss. Pixel count teaches nothing about these inescapable losses.
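Because modulation transfer functions in series multiply, those losses compound; the figures below are purely illustrative:

```python
# An illustrative stack-up of assumed MTF values at the Nyquist limit: the
# sensor aperture, the display aperture and the lens each take their toll.
sensor_mtf  = 0.64    # 100 percent aperture, from the sinc roll-off above
display_mtf = 0.64    # the display aperture does the same again
lens_mtf    = 0.5     # an assumed, reasonably good lens

total = sensor_mtf * display_mtf * lens_mtf
print(f"overall MTF at Nyquist ~{total:.2f}")   # ~0.20: most of the nominal resolution is gone
```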

Things are complicated in colour cameras because of the need to capture three different primaries. In a three-chip camera there is a beam splitter that filters light into three different bands, each of which has its own sensor. One photo-site at the same place in each of the three sensors contributes to the output vector. The pixel count is one third the photo-site count.
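A trivial sketch of that assembly (array sizes assumed for illustration): the three registered sensor planes simply stack into one array of co-sited vectors, three photo-sites per pixel:

```python
# A trivial sketch with assumed dimensions: three spatially registered sensor
# planes, one per primary, stack into a single array of co-sited RGB vectors.
import numpy as np

red, green, blue = (np.random.rand(1080, 1920) for _ in range(3))   # one plane per sensor
frame = np.stack([red, green, blue], axis=-1)

print(frame.shape)   # (1080, 1920, 3): ~2.1 million pixels from ~6.2 million photo-sites
```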

Most single-chip cameras use a sensor having a Bayer pattern, named after Bryce Bayer of Kodak, who invented it. A monochrome sensor is converted to work in colour by placing over it three types of filter, each the same size as a photo-site. Recognising the response of human vision, half of the photo-sites are filtered green, and red and blue have a quarter each. As each photo-site can only respond to one colour, clearly there can be no co-siting.

In order to create pixels, which are co-sited, from Bayer pattern sensors, which are not, it is necessary to interpolate the photo-site information. A point at which the pixel is considered to be located is fixed with respect to the Bayer pattern, and for each colour in turn the data from surrounding photo-sites are interpolated to compute what the value would have been had it been sampled at that point. The same process, but with different coefficients, is carried out for each colour to obtain the co-sited output vector. A set of four photo-sites, two green, one red and one blue, is the smallest meaningful image unit, and typically after interpolation a pixel will correspond to each such set. Thus the pixel count is one quarter the photo-site count.
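A minimal bilinear sketch of that interpolation, assuming an RGGB layout and a pixel co-sited with a green photo-site, and ignoring the edge handling and more sophisticated weighting a real camera would use:

```python
# A minimal bilinear demosaic sketch (illustrative only): the missing red and
# blue values at a green site of an RGGB mosaic are interpolated from the
# nearest photo-sites of those colours.
import numpy as np

def demosaic_at_green(mosaic, r, c):
    """Return an (R, G, B) vector at an interior green site of an RGGB mosaic.

    Assumes even row, even column = red; even row, odd column = green;
    odd row, even column = green; odd row, odd column = blue.
    """
    g = mosaic[r, c]
    if r % 2 == 0:                      # green site on a red/green row
        red  = (mosaic[r, c - 1] + mosaic[r, c + 1]) / 2
        blue = (mosaic[r - 1, c] + mosaic[r + 1, c]) / 2
    else:                               # green site on a green/blue row
        red  = (mosaic[r - 1, c] + mosaic[r + 1, c]) / 2
        blue = (mosaic[r, c - 1] + mosaic[r, c + 1]) / 2
    return np.array([red, g, blue])

mosaic = np.random.randint(0, 1024, (8, 8)).astype(float)   # 10-bit raw values
print(demosaic_at_green(mosaic, 2, 3))                       # even row, odd column: a green site
```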

The Foveon sensor uses a different approach in which the photo-sites are stacked vertically in the same place. The colour filtering is based on the principle that short wavelengths are filtered more strongly than long wavelengths. By having photo-sites at three different depths in a filtering medium, three different spectral responses are obtained. The photo-site near the surface sees the entire visible spectrum, the next one down sees a spectrum from which blue is absent and the bottom one sees a spectrum from which blue and green are absent. By matrixing together the signals from the three photo-sites it is possible to obtain an RGB pixel. The pixel count is one third the photo-site count.
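A sketch of that matrixing step with made-up coefficients (the real matrix depends on the silicon absorption depths and is not given here): the three layer signals are combined linearly to recover an RGB vector:

```python
# A sketch of the matrixing step with hypothetical coefficients. Rows are the
# layers, top to bottom; columns are the R, G, B content each layer integrates.
import numpy as np

layer_response = np.array([
    [0.3, 0.3, 0.4],   # top layer: sees the whole visible spectrum
    [0.4, 0.6, 0.0],   # middle layer: blue largely absorbed above it
    [1.0, 0.0, 0.0],   # bottom layer: only the long wavelengths remain
])

# Inverting the assumed mix gives the matrix that recovers an RGB vector from
# the three layer signals.
matrix = np.linalg.inv(layer_response)

layer_signals = layer_response @ np.array([0.8, 0.5, 0.2])   # a known test colour
print(matrix @ layer_signals)                                # recovers [0.8, 0.5, 0.2]
```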

It is possible to make all kinds of comparisons between three-chip, Bayer and Foveon technologies, but I don’t believe it is particularly useful to do so because there are more important factors. In particular the theoretical resolution advantage of a given sensor becomes meaningless when the lens performance is the dominant factor, which is how it should be in a good camera. The resolution is limited by the point-spread function of the lens, and it doesn’t matter whether there are ten photo-sites or a million inside that point-spread function: the resolution will be the same.
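To put rough numbers on that (assumed figures, not from the article), the Airy disc of a diffraction-limited lens has a diameter of about 2.44 times the wavelength times the f-number, which easily spans several small photo-sites:

```python
# An illustrative comparison of the lens point-spread function and the
# photo-site pitch, using assumed figures.
wavelength_um = 0.55            # green light, in micrometres
f_number = 8.0                  # a typical stopped-down aperture

airy_diameter_um = 2.44 * wavelength_um * f_number   # ~10.7 um
photo_site_pitch_um = 3.0                            # a small-sensor camera

print(f"Airy disc ~{airy_diameter_um:.1f} um across vs {photo_site_pitch_um} um photo-sites: "
      f"the spot spans ~{airy_diameter_um / photo_site_pitch_um:.1f} photo-sites, "
      "so packing in more of them cannot add resolution here.")
```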


Moore’s Law does not apply to lenses, so there is such a thing as an ideal photo-site size, above which resolution suffers and below which noise increases. The ideal somewhat depends on what the camera is for. In professional still photography, where it may be necessary to enlarge only a small part of the image, static resolution is uppermost. That and depth-of-field control suggest a physically large sensor, typically 60 × 45 mm, for which a single Bayer sensor is ideal, and longer focal length lenses. Cinematography also relies on depth-of-field control, again suggesting large sensors and long lenses. In television, focus pullers are not employed and depth-of-field control is seldom used, suggesting physically small sensors and short focal length lenses, which is where Foveon sensors may have an advantage.
