Spatial Audio: Part 2 - Approaches, Creation And Delivery

In part one we looked at some of the reasons for the growth in adoption by next generation consumers, and how hardware and content production are combining to give more people better access to spatial content. Here we look at some of the basics: how spatial audio is achieved, how it can be captured at source or spatialized post-capture, and which formats are required to manipulate it all.

Spatial audio is as mainstream as it gets. Consumers understand exactly what it is and how to get it, while at the production end spatial renderers are becoming more accessible and delivery methods more streamlined. But how does it all work?

Changing the channel

The first principles of immersive and spatial sound can be boiled down to three fundamental approaches, so let’s look at the difference between channel-based, scene-based and object-based audio.

Channel-based audio (CBA) is what we have all grown up with, where each audio channel is sent directly to a loudspeaker in a specific location; one channel for mono, two channels for stereo, six channels for 5.1 surround, 12 channels for 7.1.4 immersive, and so on. Channel-based audio signals end up at their intended loudspeaker without any modification.
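The channel counts above follow directly from the layout label. As a minimal sketch, assuming the common convention that the label lists ear-level speakers, LFE channels and height speakers in that order (the function name is purely illustrative):

```python
def channel_count(layout: str) -> int:
    """Return the number of discrete channels in a layout label such as '7.1.4'."""
    # Each dot-separated field is a count: ear-level speakers, LFE, height speakers.
    return sum(int(part) for part in layout.split("."))


for label in ("5.1", "7.1.4"):
    print(label, "->", channel_count(label), "channels")
```

Every one of those channels maps to exactly one loudspeaker feed, which is why a mix made for one label cannot simply be played on another.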

It's a simple system, but it’s also the reason why channel-based formats are limited: they assume that the playback hardware at the consumer end matches the loudspeaker layout the mix was made for. With a vast range of listening formats, including mono, stereo, surround and immersive as well as headphones, this is no longer a safe assumption.

Setting the scene

Scene-based and object-based audio both have some provision to adapt the audio format to the listening environment.

Scene-based audio (SBA) is a term that includes Ambisonics and Higher Order Ambisonics (HOA). Ambisonics dates back to the early 1970s when Michael Gerzon, who pioneered what we now know as immersive audio, created a framework for 3D immersive sound which was modelled on sound fields on a spherical surface and was entirely independent of speaker placement.

Gerzon’s concept is now known as first order Ambisonics and is made up of four component channels; an omni-directional signal plus components representing the X, Y and Z dimensions of the sound.
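To make the four components concrete, here is a rough sketch of how a single mono sample could be encoded into first order components. It assumes the traditional B-format convention (W attenuated by 3 dB, azimuth measured anticlockwise from straight ahead); other conventions order and weight the components differently, and the function name is illustrative only.

```python
import math


def encode_first_order(sample, azimuth_deg, elevation_deg=0.0):
    """Encode one mono sample into first order components (W, X, Y, Z)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample / math.sqrt(2.0)                # omni-directional component
    x = sample * math.cos(az) * math.cos(el)   # front-back component
    y = sample * math.sin(az) * math.cos(el)   # left-right component
    z = sample * math.sin(el)                  # up-down component
    return w, x, y, z


# A source straight ahead appears in W and X only; one hard left appears in W and Y.
print(encode_first_order(1.0, 0.0))
print(encode_first_order(1.0, 90.0))
```

Because the direction is carried in the balance between the components rather than in a speaker feed, the same four channels can later be decoded to any loudspeaker layout or to binaural.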

This concept has since been developed into Higher Order Ambisonics (HOA), or scene-based audio. Second order HOA adds five additional components on top of the first order ones to make nine component channels, and third order HOA adds a further seven to total 16 component channels. HOA adds more value because the more channels used, the greater the spatial resolution.
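The channel counts quoted here follow a simple pattern: a full 3D Ambisonic signal of order N has (N + 1)² components, and each new order adds 2N + 1 of them. A short sketch (function names are illustrative):

```python
def ambisonic_channels(order: int) -> int:
    """Total component channels for a full 3D Ambisonic signal of this order."""
    return (order + 1) ** 2


def added_components(order: int) -> int:
    """Components that this order adds on top of the order below it."""
    return 2 * order + 1


for n in (1, 2, 3):
    print(f"order {n}: {ambisonic_channels(n)} channels, {added_components(n)} added")
```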

How to record

As with all audio recording, the best way to record scene-based spatial audio depends on the environment and what the recording engineer is trying to achieve.

There are a variety of immersive microphone techniques and approaches to capture 360° recordings at source. Standard microphone arrays can be used, but more conveniently there are also a variety of Ambisonic microphones available. Like the original Calrec Soundfield microphone launched in 1978, these all use multiple capsules in a single array, which can capture up to third order Ambisonic content. These mics capture the raw “A-Format” audio from each capsule, which must be converted into Ambisonic B-Format before it can be manipulated in an editor; again, there are multiple options to do this, including Ambisonic mics which come with their own decoding and conversion software.

In addition, as many people will experience spatial audio as a decoded binaural stereo format, there are various ways to record binaural content at source, from inexpensive in-ear mics to more expensive dummy heads, although all are subject to the same potential issues that affect any binaural recording.

Being objectified

While scene-based formats provide more spatial separation, most commonly available immersive and spatial audio content is derived from object-based audio (OBA); Dolby Atmos, THX Spatial Audio, DTS:X, MPEG-H and Sony’s 360 Reality Audio all support object-based formats.

OBA treats an audio component (such as a microphone, an instrument or a commentary feed) as an independent object with associated metadata to describe it, such as relative levels and its position in space. In this way an immersive scene can be built up from a number of separate audio objects, each of which has a place in that scene. A consumer’s receiver reproduces those encoded objects in accordance with the metadata and recreates the soundscape as intended.
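As a loose illustration of the idea, the sketch below models an object as audio plus metadata and lets a very simple "receiver" turn that metadata into stereo speaker gains. The class, field names, the ±30° speaker positions and the constant-power pan law are all assumptions made for the example; real renderers are considerably more sophisticated.

```python
import math
from dataclasses import dataclass


@dataclass
class AudioObject:
    name: str
    gain_db: float       # relative level carried in the metadata
    azimuth_deg: float   # position: positive is left of centre, negative is right


def stereo_gains(obj: AudioObject, half_width_deg: float = 30.0):
    """Render one object to (left, right) gains with a constant-power pan."""
    # Clamp the object position to the span between the two loudspeakers.
    az = max(-half_width_deg, min(half_width_deg, obj.azimuth_deg))
    # Map +half_width..-half_width onto a pan angle of 0..90 degrees.
    theta = (half_width_deg - az) / (2.0 * half_width_deg) * (math.pi / 2.0)
    level = 10.0 ** (obj.gain_db / 20.0)
    return level * math.cos(theta), level * math.sin(theta)


commentary = AudioObject("commentary", gain_db=0.0, azimuth_deg=0.0)
crowd_left = AudioObject("crowd", gain_db=-6.0, azimuth_deg=30.0)
print(stereo_gains(commentary))
print(stereo_gains(crowd_left))
```

The same objects could be handed to a different rendering function for a soundbar or an immersive speaker layout without touching the mix, which is the point of keeping position as metadata.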

For example, Apple Spatial consumers will receive an encoded Dolby Atmos format which replicates the location of each channel according to whatever audio hardware is being used, whether that’s a 3D soundbar, a pair of Apple AirPods or a full immersive speaker system.

As objects are independent of each other, OBA provides mix engineers with enormous creativity, as well as providing the possibility of end-user personalization further down the line, a concept which is rapidly gaining more recognition in broadcast. If enabled in the metadata, consumers can be given the ability to modify the contribution of individual objects, such as commentary levels or languages.
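A rough sketch of how that kind of personalization might be gated by metadata is shown below. The field names and the idea of a permitted offset range are assumptions for illustration, not the schema of any particular format.

```python
from dataclasses import dataclass


@dataclass
class InteractiveObject:
    name: str
    gain_db: float                  # level set by the mix engineer
    user_adjustable: bool = False   # whether the metadata enables interaction
    min_offset_db: float = 0.0      # lowest offset the consumer may apply
    max_offset_db: float = 0.0      # highest offset the consumer may apply


def personalized_gain(obj: InteractiveObject, requested_offset_db: float) -> float:
    """Return the playback gain after applying a consumer's requested offset."""
    if not obj.user_adjustable:
        # Interaction is not enabled, so the mixed level is preserved.
        return obj.gain_db
    offset = max(obj.min_offset_db, min(obj.max_offset_db, requested_offset_db))
    return obj.gain_db + offset


commentary = InteractiveObject("commentary", -3.0, True, -12.0, 6.0)
music = InteractiveObject("music", -6.0)
print(personalized_gain(commentary, 10.0))  # request is limited to the allowed range
print(personalized_gain(music, 10.0))       # request is ignored
```

Keeping the limits in the metadata means the producer, not the playback device, decides how far a consumer can move away from the intended mix.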

What language is this?

This is all made easier by the adoption of the Audio Definition Model (ADM).

Originally published in 2014, ADM was developed by the European Broadcasting Union (EBU). It is codec-agnostic, so it doesn’t tie content producers into any codec-specific ecosystem during production. It also supports existing CBA delivery alongside object-based and scene-based assets, which means producers don’t need to create separate mixes for every output format they need to support.

It can be interpreted by a wide range of encoders such as Dolby AC-4 and MPEG-H, and most streaming services request ADM BWF (which stands for the Audio Definition Model Broadcast Wave Format) as a delivery format.

In fact, the extension to the BWF format (another EBU-specified format, which dates back to 1997) was one of ADM’s first applications. BWF includes an 'axml' chunk to allow XML metadata to be carried, and it is the standard file format for Dolby Atmos and Apple Spatial.

ADM has widespread adoption to provide the metadata to author and render spatial content from individual audio assets, and in establishing itself as the open production standard for NGA metadata, it enables producers to maintain their creative vision.

Tools

When it comes to designing the final mix and generating that metadata there are a number of renderers available, with fundamental differences between them. The most well-known are from Dolby and Apple.

The Dolby Atmos Production Suite is fundamental to its Atmos system. It integrates with a number of DAWs and in addition to creating immersive mixes it renders down to standard channel-based formats like stereo, 5.1 surround and binaural.

The Dolby Atmos plugin was added to the Apple Logic Pro renderer in 2021, and within a year its 10.7.3 update introduced two key features. One was support for Apple’s head tracking, which shifts the soundscape as the user moves around it. As head tracking becomes more commonplace, and with more devices looking to support it in the future, renderers which are compatible with this functionality will become increasingly practical.
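At its simplest, head tracking means re-deriving each source direction relative to where the listener is currently facing. The sketch below covers only horizontal rotation (yaw) and assumes angles in degrees with positive values to the left; it illustrates the principle and is not a description of how any particular product implements it.

```python
def relative_azimuth(source_deg: float, head_yaw_deg: float) -> float:
    """Direction of a source relative to the listener's nose, wrapped to [-180, 180)."""
    return (source_deg - head_yaw_deg + 180.0) % 360.0 - 180.0


# A source fixed straight ahead in the room, heard as the listener turns left.
for yaw in (0.0, 30.0, 90.0):
    print(f"head yaw {yaw:5.1f} -> source at {relative_azimuth(0.0, yaw):6.1f}")
```

Feeding the corrected angle to the binaural renderer keeps the source anchored in the room rather than moving with the listener's head.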

The second feature was the ability to preview Atmos tracks as they would sound on Apple Music, which is significant because Apple Music uses a different codec to encode the ADM BWF file. While Dolby Atmos uses the AC-4 IMS codec for binaural headphone playback, Apple uses its own renderer to interpret a Dolby Atmos mix, which means an Apple Music spatial mix will sound different to one rendered by Dolby on another streaming service.

More of everything please

The tools are in place, the techniques are well-tested and bedded in, and demand is there.

From a broadcast perspective, live content technologies are also gaining momentum. Recent updates to ITU-R BS.2125, which describes a serial representation of the Audio Definition Model (S-ADM) suitable for use in linear workflows like live broadcasting and streaming, take us all a step closer to experiencing live spatial content. There have already been examples of S-ADM use in Europe, with France TV’s coverage of the French Open as far back as 2020 utilizing a serialized ADM workflow over Dolby AC-4.

Dynamic head tracking was also trialed in a live broadcast environment in 2022, when the 5G Edge-XR project won the Content Everywhere Award at the IBC exhibition.

Developments like these add significant value. As adoption rates continue to grow and processes become more streamlined, spatial audio continues to amplify levels of immersion in all forms of entertainment, from pre-produced content like music, film and television to more dynamic pursuits like gaming and live broadcast.

And everyone is on board.
