Spatial Audio: Part 1 - Current Formats & The Rise Of HRTF

Spatial audio has become mainstream in gaming because it takes the principles and technology of immersive audio and places the player inside the on-screen action. This series of articles defines the current formats and technologies and asks the obvious question: with such widespread adoption by next-generation consumers, what is its potential in broadcast?

Finding Space In The Everyday – The Rise Of Spatial Audio And Where To Find It

If you are wondering how commonplace spatial audio is today, look no further than the 22 million people a month who play Riot Games’ epic first-person shooter Valorant.

During 2022, an average of two million people a day strapped on a pair of headphones to play Valorant. In March 2021, its v2.06 software update made headlines when the upgrade incorporated Head-Related Transfer Function (HRTF) capabilities – more on that later – and it is still the best-known game with that feature.

It’s not often one can say this and genuinely mean it, but the upgrade was a game changer; having the spatial awareness to hear where opponents are hiding gives players the edge.

Making Space

Spatial audio is anything which puts the listener in a 3D soundscape and places objects somewhere within that space. It is a fully immersive concept which includes audio objects on the vertical plane as well as the horizontal plane. It is designed for multi-speaker immersive systems and soundbars, and it is commonly virtualized for headphones.

Spatial audio is such a big deal that a number of well-established cross-industry organizations are developing products to generate appropriate content, such as the EBU’s EAR Production Suite, while technologies like Fraunhofer’s Cingo and THX Spatial Audio can upscale content using binaural and HRTF techniques to virtualize the effect for headphones.

Where You At?

Valorant’s spatial awareness is created by THX Spatial Audio, an object-based renderer which provides HRTF functionality.

There are three sound localization cues which enable people to perceive where a source sits in space. The first is interaural level difference (ILD) – the difference in loudness between the two ears; if a sound is louder in one ear, the brain perceives it as being located on the side nearest that ear.

The second is interaural time difference (ITD) – the difference in the time a sound takes to reach each ear; in our earlier example, the sound will reach the left ear before it reaches the right.

Both these effects can be replicated using delay and panning techniques.
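To make that concrete, below is a minimal sketch of delay-and-pan binaural placement in Python, assuming NumPy is available. The function name and parameters are illustrative, not any game engine’s API.

```python
import numpy as np

def pan_binaural(mono, sample_rate, itd_seconds, ild_db):
    """Place a mono signal using an interaural time difference (ITD)
    and an interaural level difference (ILD).

    Positive itd_seconds / ild_db push the source toward the left ear.
    """
    delay = int(round(abs(itd_seconds) * sample_rate))  # ITD in whole samples
    direct = np.concatenate([mono, np.zeros(delay)])    # near ear: undelayed
    delayed = np.concatenate([np.zeros(delay), mono])   # far ear: arrives late
    gain = 10.0 ** (-abs(ild_db) / 20.0)                # far ear is quieter

    if itd_seconds >= 0:   # source on the listener's left
        left, right = direct, delayed * gain
    else:                  # source on the listener's right
        left, right = delayed * gain, direct
    return np.stack([left, right], axis=1)  # (samples, 2) stereo buffer

# Example: a 440 Hz tone placed to the left (ITD ~0.6 ms, ILD ~6 dB).
sr = 48_000
t = np.arange(sr) / sr
stereo = pan_binaural(np.sin(2 * np.pi * 440 * t), sr, 0.0006, 6.0)
```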

HRTF is the third localization cue, and it provides additional information for sound source localization based on the shape of a person’s head and ears. These physical attributes are a big part of how we localize sounds, because they attenuate and absorb sound before it reaches the eardrum, and they provide height information as well as position and distance. Binaural recordings are often made at source with a dummy head built to replicate an average person’s head, and they can also be created in post-production with binaural renderers, which are likewise typically modeled on an average head.
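As a rough illustration of what a binaural renderer does internally, the sketch below convolves a mono source with a pair of head-related impulse responses (HRIRs, the time-domain form of an HRTF). The HRIR arrays are placeholders; a real renderer would select them for the desired direction from a measured set.

```python
import numpy as np

def render_binaural(mono, hrir_left, hrir_right):
    """Convolve a mono source with left/right HRIRs to bake in the level,
    timing and spectral cues of a single direction in space.
    Assumes both HRIRs have the same length."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right], axis=1)  # (samples, 2) stereo buffer
```

Moving sources are typically handled by switching (and crossfading) between HRIR pairs as the object’s direction changes.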

While for gaming, HRTF-based rendering is more than enough to put you in that space and make it an active part of gameplay, it’s not always ideal for prolonged TV and radio listening because we all have different ears. In fact, unless you have a perfectly average head and a perfectly average pair of ears, your experience will differ from what was intended.

Encoded And Decoded

There are two main types of spatial audio format: encoded and decoded. Decoded audio arrives in binaural format from source – either recorded as a stereo feed using binaural techniques or created in post-production using binaural renderers.

Encoded audio formats are quite the opposite. They arrive at the consumer with embedded metadata describing which sounds should be reproduced by which speakers (although because headphone playback is limited to two channels, those consumers will still receive a decoded binaural signal).

The best-known encoded format is Dolby Atmos, a long-established immersive format first launched in 2012. The other best-known encoded format for consumers is Sony 360 Reality Audio. Both are object-based, and both carry metadata which the playback device uses to decode and render the signal.
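As a purely illustrative sketch of the idea (not the actual Atmos or 360 Reality Audio bitstream, both of which are proprietary), an object-based stream pairs each audio asset with positional metadata that the consumer’s device renders to whatever speakers are available:

```python
from dataclasses import dataclass

@dataclass
class AudioObject:
    """Toy stand-in for an object-based stream's positional metadata."""
    name: str
    azimuth_deg: float    # 0 = straight ahead, +90 = hard left
    elevation_deg: float  # +90 = directly overhead
    distance_m: float

# The decoder reads this metadata and works out, per speaker layout
# (or per ear for headphones), how to reproduce each object.
helicopter = AudioObject("helicopter", azimuth_deg=45.0,
                         elevation_deg=30.0, distance_m=12.0)
```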

The Big Apple

Many consumers will have experienced spatial audio through Apple’s specialist service (helpfully branded as “Apple Spatial Audio”). Apple developed its service in collaboration with Dolby, resulting in a huge amount of Dolby Atmos content on its Apple Music service, but it also supplies something else which adds to the immersive experience – head tracking.

This is the other big value-add with spatial audio. Head tracking enables the soundscape to move as you move. Apple’s spatial audio service not only decodes Atmos content, it also tracks the user’s head movements using accelerometers and gyroscopes built into compatible devices like Apple’s AirPods and Beats headphones.

These sensors allow the audio to be rendered relative to where the screen is, so if one turns one’s head, the sound stays anchored to the same on-screen content, further strengthening the illusion of immersion.
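In principle, the counter-rotation is simple: subtract the head’s orientation from each object’s position before rendering. The sketch below shows the yaw-only case; the function name is illustrative, not any vendor’s API.

```python
def screen_anchored_azimuth(object_azimuth_deg, head_yaw_deg):
    """Return the azimuth to render relative to the listener's head,
    wrapped to the range [-180, 180) degrees."""
    return (object_azimuth_deg - head_yaw_deg + 180.0) % 360.0 - 180.0

# Dialogue anchored to the screen (0 degrees): if the listener turns
# 30 degrees to the left, the renderer places it 30 degrees to their right.
print(screen_anchored_azimuth(0.0, 30.0))  # -30.0
```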

But while Apple’s technology is well established, there are others snapping at its heels. Qualcomm’s Snapdragon 8 Gen 2 mobile platform, launched in November 2022, also supports spatial audio with dynamic head tracking, and is already being adopted by several Android OEMs including Honor, Motorola and Sony.

What About The Others?

Apple’s main competition is the aforementioned Sony 360 Reality Audio format, which delivers 3D audio on Sony’s PlayStation games consoles and has been building a hardware ecosystem with home theater suppliers like McIntosh and Sennheiser, as well as supporting spatial audio on Sony’s own hardware.

Like Dolby Atmos, Sony 360 Reality Audio is object-based, so audio streams are encoded with metadata describing the placement of each sound source in the sound field. Sony 360 Reality Audio is available across a range of music services including Amazon Music HD, Nugs.net and Tidal.

And while Sony has the PlayStation ecosystem all tied up for gaming, Dolby Atmos is supported across Xbox consoles as well as a range of PCs. Blizzard’s Overwatch was an early adopter of Dolby Atmos over headphones, while THX Spatial Audio covers many more games, including Riot Games’ Valorant.

In an attempt to create more realistic HRTF results, Sony also provides a custom service which analyzes a user’s ear shape so its AI can match it to an ear model in its database – Apple has its own version of this, which it calls “Personalized Spatial Audio”.

Appetite

Appetite is huge. Qualcomm, now getting into the spatial audio market on Android with its Snapdragon platform, found in its 2022 State of Sound report that spatial audio is in high demand.

“Spatial Audio is the next “must-have” feature,” it said. “More than half of respondents claimed spatial audio will have an influence on their decision to buy their next pair of true wireless earbuds, and 41% said they would be willing to spend more for the feature.”

It also claims that a massive 73% of people know what spatial audio is (and not all of them are gamers!)

We’re Just Getting Started

The world of spatial audio is still developing and already there are a number of competing and proprietary formats and delivery mechanisms vying for attention.

The Audio Engineering Society already has a standard in place for spatial audio. AES69-2022, also referred to as SOFA 2.1, was created to provide a standardized file format for exchanging space-related acoustic data such as HRTFs.
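Because SOFA is built on netCDF-4 (itself layered on HDF5), a generic HDF5 reader can open one; the sketch below uses h5py, with dataset names taken from the AES69 convention and a placeholder filename.

```python
import h5py

with h5py.File("subject.sofa", "r") as f:
    hrirs = f["Data.IR"][:]             # (measurements, 2 ears, samples)
    fs = f["Data.SamplingRate"][:]      # sampling rate in Hz
    positions = f["SourcePosition"][:]  # typically azimuth, elevation, distance

print(hrirs.shape, float(fs[0]))
```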

Like Qualcomm, Apple, Dolby, THX, Sony and other stakeholders, the AES also recognizes the explosion in demand driven by the growth in compatible devices.

It says: “Binaural listening is growing fast, because of growing sales in smartphones, tablets and other individual entertainment systems. The lack of a standard for the exchange of head-related transfer functions (HRTF) means each company keeps its binaural capture and rendering algorithms private. 3D audio is arising, and binaural listening could be the very first 3D audio vector with sufficient fidelity of HRTF.”

So, where to next?
