Noted audio engineer for The Doors, Bruce Botnick, relies on three JBL M2 Master Reference Monitors in his studio.
In this second part of his loudspeaker series, John Watkinson considers the importance of the time domain to human hearing.
In Part 1, I mentioned that the recording and distribution of audio had been transformed by the application of IT. This means that audio must contain information and it follows immediately that anything audio passes through can be considered as an information channel – including hard drives, networks, loudspeakers and human hearing, all of which will be seen to have actual or effective information capacity or bit rates.
But what form does audio information take? To answer that, we have to go back to what hearing is for. In evolutionary terms, electronic entertainment and IT have happened in the last few milliseconds. Long ago, hearing was a means to survive, and evolution rewarded species that evolved better means to avoid being eaten and to find food and a mate. A sense of hearing would benefit its owner in that respect.
What would be the most important information that a hearing mechanism could tell an early living being? Pretty obviously the location of a source of sound must be at the top of the list, closely followed by the size of the sound source. Is this sound a threat or does it reveal our next meal? In the absence of speech or music, the concept of establishing pitch was of limited importance; indeed the frequency domain was of little importance and means to deal with it evolved later.
Those prehistoric means to establish direction and size are still with us, hard wired into the Human Audio System (HAS) and functional at birth. Failure to consider the aspects of audio reproduction that these mechanisms interpret results in loss of realism and in listening fatigue.
Let us consider how direction is established, first in principle and then in the presence of reflections. Figure 1 shows that in the case of an off-centre sound source, the distance from the source to each ear is different. The finite speed of sound means that whatever waveform arrives at the nearer ear will arrive at the more distant ear with a predictable delay. This introduces the first hard-wired mechanism in the HAS, which is a variable delay and a correlator. Any new waveform, the onset of a sound, recognised by either ear will result in the other ear trying to find that waveform at a later time. The time difference tells us the direction.
Figure 1. When a sound source is off to one side of the listener, the sound arrives at the two ears at different times. The HAS can identify the same transient at both ears and measure the time difference.
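The geometry of Figure 1 can be put into rough numbers. The following is only a sketch under stated assumptions: a distant source, a plain path-difference model that ignores the head itself, and illustrative values for ear spacing (0.18 m) and the speed of sound (343 m/s), none of which come from this article. The function names are likewise hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at roughly 20 degrees C
EAR_SPACING = 0.18      # m; assumed illustrative ear separation


def interaural_delay(azimuth_deg, spacing=EAR_SPACING, c=SPEED_OF_SOUND):
    """Arrival-time difference in seconds for a distant source.

    Azimuth is measured from straight ahead. The model is the plain
    path difference spacing * sin(azimuth) and ignores the head itself.
    """
    return spacing * math.sin(math.radians(azimuth_deg)) / c


def azimuth_from_delay(delay_s, spacing=EAR_SPACING, c=SPEED_OF_SOUND):
    """Invert the model: recover the azimuth in degrees from a delay."""
    ratio = max(-1.0, min(1.0, delay_s * c / spacing))
    return math.degrees(math.asin(ratio))


for angle in (0, 30, 90):
    delay_us = interaural_delay(angle) * 1e6
    print(angle, "degrees ->", round(delay_us), "microseconds")
```

With these assumed figures the largest delay, for a source directly to one side, is roughly half a millisecond, which gives a feel for the time resolution the correlator has to work with.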
It ought to be clear that unambiguous measurement of the time shift between two waveforms is only possible if the waveform is transient. Trying to do it with a pure tone or sine wave suffers from two difficulties. Firstly, all cycles look the same, so correlation can be found at a number of time shifts. Secondly, in the real world pure tones jump to the nearest standing wave or eigentone in the room, so the location of the source is concealed. This is hardly an issue in the real world, where the majority of sounds, such as footfalls, doors closing and objects falling, are transient.
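The first difficulty is easy to demonstrate. The sketch below is a toy correlator, not a model of the HAS: it slides one signal against the other and lists every trial lag that ties for the best match. The signals, the 5-sample delay, the 16-sample tone period and the helper names are all assumptions made for illustration.

```python
import math


def delayed(signal, d):
    """Shift a signal later by d samples, padding the start with silence."""
    return [0.0] * d + signal[:len(signal) - d]


def best_lags(near, far, max_lag, tol=1e-9):
    """Return every trial lag whose correlation score ties for the best."""
    window = len(near) - max_lag
    scores = [sum(near[i] * far[i + lag] for i in range(window))
              for lag in range(max_lag + 1)]
    peak = max(scores)
    return [lag for lag, s in enumerate(scores) if peak - s <= tol * abs(peak)]


n, true_delay = 200, 5

# A transient: one click that dies away quickly.
click = [0.0] * n
for k in range(20):
    click[50 + k] = math.exp(-k / 4)

# A pure tone with a period of 16 samples.
tone = [math.sin(2 * math.pi * i / 16) for i in range(n)]

click_lags = best_lags(click, delayed(click, true_delay), max_lag=40)
tone_lags = best_lags(tone, delayed(tone, true_delay), max_lag=40)
print("click:", click_lags)
print("tone: ", tone_lags)
```

The click yields the single answer of 5 samples, whereas the tone ties at 5, 21 and 37 samples, one candidate for every cycle that fits in the search range.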
This explains why wailing sirens on emergency vehicles are not very smart and why the vehicle often can’t be located until the flashing lights are visible. The use of blue lights is equally dumb as human vision is least sensitive to blue. Given that sub-optimal applications are the rule rather than the exception in acoustics, which has always been a Cinderella subject, we should not be surprised to find mediocrity in a lot of legacy loudspeakers. Equally it’s not appropriate for me to complain if I can’t advance solutions.
It is interesting to consider the problem from a communications theory standpoint. Sine waves are pure tones and so have no bandwidth. Thus their information capacity is zero. The bandwidth of a transient is large. It follows that almost all of the information in audio is carried by transients and that failing to consider the time domain accuracy of a loudspeaker may seriously compromise its information capacity. This is one of the reasons we hear speakers that all have the same frequency response yet all sound different. We find, for example, speakers that are good on violins but lousy on percussion.
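The bandwidth contrast can be checked numerically with a small discrete Fourier transform. This is a sketch only: a finite window holding a whole number of cycles stands in for the ideal endless tone, a single non-zero sample stands in for a transient, and the 1 per cent threshold and function names are arbitrary choices for illustration.

```python
import cmath
import math


def magnitude_spectrum(x):
    """Magnitudes of the DFT bins from 0 Hz up to half the sample rate."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]


def occupied_bins(x, threshold=0.01):
    """Count the bins holding more than `threshold` times the peak magnitude."""
    mags = magnitude_spectrum(x)
    peak = max(mags)
    return sum(1 for m in mags if m > threshold * peak)


n = 128
tone = [math.sin(2 * math.pi * 8 * t / n) for t in range(n)]  # 8 whole cycles
click = [1.0 if t == 10 else 0.0 for t in range(n)]           # one-sample transient

tone_bins = occupied_bins(tone)
click_bins = occupied_bins(click)
print("tone occupies", tone_bins, "bin(s); click occupies", click_bins)
```

Under these assumptions the tone sits in a single bin while the click spreads evenly over all 65, which is the sense in which the transient is the one carrying bandwidth.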
Getting back to the plot, not only can the HAS measure the delay between the versions of a sound at the two ears, but it can also insert that delay prior to adding the sound from the two ears, so that sound from that direction is emphasised and sounds from other directions are diminished. This is known as attentional selectivity, or in familiar terms, the cocktail party effect, which allows the listener to pay attention to one source in preference to others. The two ears have been made into a simple phased array.
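The phased-array idea can be sketched as a two-element delay-and-sum. This is a deliberately crude illustration rather than a claim about how the HAS is wired: both sources are assumed to be independent noise, the delays of 6 and 4 samples are invented, and all names are hypothetical.

```python
import random


def delayed(signal, d):
    """Shift a signal later by d samples, padding the start with silence."""
    return [0.0] * d + signal[:len(signal) - d]


def steer_and_sum(near_ear, far_ear, delay):
    """Delay the near-ear signal by `delay` samples, then add the far-ear signal."""
    return [a + b for a, b in zip(delayed(near_ear, delay), far_ear)]


def energy(signal):
    return sum(x * x for x in signal)


rng = random.Random(0)
n = 2000
target = [rng.uniform(-1, 1) for _ in range(n)]
interferer = [rng.uniform(-1, 1) for _ in range(n)]

# Target on the left: it reaches the left ear first, the right ear 6 samples later.
# Interferer on the right: it reaches the right ear first, the left ear 4 samples later.
# The array is steered at the target by delaying the left ear by 6 samples.
# Because the process is linear, each source can be passed through on its own.
target_out = steer_and_sum(target, delayed(target, 6), 6)
interferer_out = steer_and_sum(delayed(interferer, 4), interferer, 6)

gain_target = energy(target_out) / energy(target)
gain_interferer = energy(interferer_out) / energy(interferer)
print("target energy gain:", round(gain_target, 2))
print("interferer energy gain:", round(gain_interferer, 2))
```

The two copies of the target line up and add coherently, so its energy rises about fourfold, while the misaligned copies of the interferer add only as unrelated noise and gain about twofold. With only two sensors the advantage is modest, but it is an advantage in the chosen direction.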
Figure 2. Reflections don’t reduce the ability to determine direction, because they arrive too late. Instead the reflections and the direct sound are time shifted to make the source more audible.
The HAS deals with reflections using an extension of the same mechanism. Figure 2 shows that reflections must have travelled by a longer path and must have suffered a delay that is greater than that due to ear separation. Thus after the onset of a sound and after the inter-aural delay has been detected, the correlator continues to run and looks for further versions of the sound, which will indicate the presence of reflections.
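To put rough numbers on the path-length argument, the sketch below uses the mirror-image construction for a single flat wall. The room layout, the 343 m/s speed of sound and the 0.18 m ear spacing are assumptions chosen purely for illustration, and the function name is hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed


def reflection_extra_delay(source, listener, c=SPEED_OF_SOUND):
    """Extra delay in seconds of one wall reflection over the direct sound.

    Points are (x, y) in metres and the wall is the line y = 0. The
    reflected path has the same length as a straight line from the
    source's mirror image behind the wall to the listener.
    """
    sx, sy = source
    direct = math.dist(source, listener)
    reflected = math.dist((sx, -sy), listener)
    return (reflected - direct) / c


# Source and listener 2 m apart, both 1.5 m from the wall.
extra = reflection_extra_delay(source=(0.0, 1.5), listener=(2.0, 1.5))
max_interaural = 0.18 / SPEED_OF_SOUND  # assumed 0.18 m ear spacing

print("reflection lags by", round(extra * 1e3, 2), "ms")
print("largest inter-aural delay", round(max_interaural * 1e3, 2), "ms")
```

In this made-up layout the reflection trails the direct sound by several milliseconds, many times the largest inter-aural delay under the same assumptions, so the two cannot be confused.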
The reflections are recognised, and the delay is used to give us a sense of the distance from the reflecting surface, or in an enclosed space, an idea of the size of that space. The fact that reflections are recognised for what they are means that they do not diminish the accuracy of the initial location of the source via the direct sound. Now here comes the clever part. Provided the reflections are not too late, the HAS time-aligns all the reflections with the original sound and adds them, so that the original sound can be heard better in a reverberant environment. This is known as the Haas effect. Compare that with a microphone which has no such mechanism and where reflections make things worse.
This is why amateur sound recordings are invariably terrible: it is not understood that the microphone doesn’t hear as living things do. The poor microphone doesn’t have a brain and, instead of thinking for it, the amateur recordist seeks to emulate it.
Here we find one of the great contradictions of audio. Ask an acoustician whether it is the reverberant sound or the direct sound that conveys the most power to the listener in an auditorium and he will correctly say it’s the reverberant sound. Take the reverberation out of a concert hall and the audience will condemn it.
For critical listening, why are legacy loudspeakers traditionally played in acoustically treated, practically dead spaces? The brief answer is that legacy speakers cannot reproduce the time domain correctly and cannot excite reverberation correctly, so the Haas effect cannot work and the reflected sounds become a distraction that has to be absorbed. It doesn’t have to be like that. A fuller answer will emerge as this series progresses.
Real sources of sound are frequently physical objects that have been set into motion. If this is someone stepping on a twig which then breaks, the vibration and the sound will be transient. Imagine some surface suddenly moving forward in a step-like manner. A sound transient having increased pressure will be radiated. However, the atmosphere cannot sustain local pressure differences, so the over-pressure leaks away. The speed with which it leaks away is the time constant of the transient sound. The waveform of a hand gun firing has the same shape as the waveform of a Howitzer firing, except that the latter has a much longer time constant. So if you need a big gun sound effect, just record a pistol shot and slow it down.
The time constant is a function of the size of the radiating object. Larger objects block the path by which pressure equalises to a greater extent and cause longer time constants. In real life, the HAS can measure the time constant of a transient and estimate the size of the source from it. Most legacy loudspeaker designs do not allow this mechanism to operate because they superimpose fixed time constants of their own which come down on the sound like a Pythonesque foot.
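The time-constant idea, and the trick of slowing a recording down, can be illustrated with a toy model. The sketch assumes the over-pressure decays as a simple exponential, which is a simplification made here rather than something the article specifies, and the 1 ms time constant and 48 kHz sample rate are arbitrary.

```python
import math


def decay_transient(tau_s, sample_rate, duration_s):
    """Samples of exp(-t / tau): an idealised step that leaks away."""
    n = int(duration_s * sample_rate)
    return [math.exp(-i / (tau_s * sample_rate)) for i in range(n)]


def estimate_time_constant(samples, playback_rate):
    """Time in seconds for the transient to fall to 1/e of its first sample."""
    target = samples[0] / math.e
    for i, value in enumerate(samples):
        if value <= target:
            return i / playback_rate
    raise ValueError("transient does not decay to 1/e within the buffer")


record_rate = 48000
small_source = decay_transient(tau_s=0.001, sample_rate=record_rate, duration_s=0.05)

tau_normal = estimate_time_constant(small_source, record_rate)
# Replaying the same samples at a quarter of the rate stretches time by four.
tau_slowed = estimate_time_constant(small_source, record_rate / 4)

print("normal speed:", round(tau_normal * 1e3, 2), "ms")
print("quarter speed:", round(tau_slowed * 1e3, 2), "ms")
```

The shape of the waveform is untouched, but the measured time constant scales with the slow-down factor, which in this model is all that separates the small source from the large one.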
What we are concluding is that for realism, audio waveforms need to have their phase linearity preserved. Readers familiar with television technology know that this is paramount for video waveforms and won’t be surprised at all.
Figure 3. The square-wave response of a loudspeaker designed to meet the time-accuracy criteria of human hearing.
Audio amplifier designers test their products using square waves, or strictly speaking band-limited square waves, and often publish the results. According to Fourier, a square wave consists of a series of harmonics which are closely defined in amplitude and phase. A system that can reproduce a square wave at its output not only has a flat frequency response, but is also phase linear and capable of carrying the time domain information the HAS requires.
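The role of phase can be seen by building a band-limited square wave from its Fourier series, in which odd harmonic n has amplitude 4/(πn). The sketch below compares three versions with identical harmonic amplitudes: the original, one with a pure delay (phase proportional to frequency), and one with every harmonic shifted by 90 degrees. The 100 Hz fundamental, 25 harmonics, 1 ms delay and the function names are illustrative assumptions.

```python
import math


def band_limited_square(t, f0, harmonics, phase_fn=lambda n: 0.0):
    """Sum the first `harmonics` odd harmonics, each with an optional phase shift."""
    return sum((4 / math.pi) / n
               * math.sin(2 * math.pi * n * f0 * t + phase_fn(n))
               for n in range(1, 2 * harmonics, 2))


def peak_over_one_period(f0, harmonics, sample_rate, phase_fn=lambda n: 0.0):
    """Largest absolute sample value across one period of the fundamental."""
    samples_per_period = int(sample_rate / f0)
    return max(abs(band_limited_square(i / sample_rate, f0, harmonics, phase_fn))
               for i in range(samples_per_period))


f0, harmonics, rate = 100.0, 25, 48000
delay = 0.001  # seconds; a whole number of samples at this rate

peak_original = peak_over_one_period(f0, harmonics, rate)
# Linear phase: a shift proportional to frequency is just a time delay.
peak_delayed = peak_over_one_period(
    f0, harmonics, rate, lambda n: -2 * math.pi * n * f0 * delay)
# Non-linear phase: the same 90 degree shift applied to every harmonic.
peak_shifted = peak_over_one_period(
    f0, harmonics, rate, lambda n: math.pi / 2)

print("original peak:", round(peak_original, 2))
print("delayed peak:", round(peak_delayed, 2))
print("90 degree shifted peak:", round(peak_shifted, 2))
```

The delayed version keeps the same peak because it is the same waveform arriving later. The shifted version has exactly the same harmonic amplitudes, and so the same frequency response, yet its peak is far higher and it no longer resembles a square wave at all.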
Very few loudspeaker manufacturers publish the square wave response of their products, usually because the output waveform is unrecognisable. However, just to illustrate that it is possible, Figure 3 shows the square wave response of a speaker I designed about 15 years ago.
In Part 3 of this series, we will begin by looking at how the frequency domain comes into play in human hearing.
John Watkinson Consultant, publisher, London (UK)