Object-Based Audio Mixing: A New Path To Personalized Listening

With the advent of immersive audio mixing using codecs like Dolby Atmos and DTS:X (the successor to DTS-HD), professionals now have the ability to create interactive, personalized, scalable and immersive content by representing it as a set of individual assets, together with metadata describing their relationships and associations.


This is called object-based audio mixing, and it is adding a new dimension to multi-channel mixes for television and film. Some say it creates a multi-dimensional sound experience in which sound moves around the viewer as it would in real life.

A Unique Experience For Each Audience Member

Object-based media allows the content of programs to change according to the requirements of each individual audience member. The ‘objects’ refer to the different assets that are used to make up a piece of content. These could be large objects—the audio and video used for a scene in a drama—or small objects, like an individual frame of video, a caption, or a sound effect. By breaking down a piece of media into separate objects, attaching meaning to them, and describing how they can be rearranged, a program can be changed to reflect the context of an individual consumer.
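
To make the idea concrete, here is a minimal sketch in Python of an "object" as an asset plus descriptive metadata, with a simple rule for rearranging objects per listener. All class, field and object names here are hypothetical illustrations, not taken from any broadcast standard.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AudioObject:
    """One asset plus the metadata that makes it an 'object'.
    (Illustrative field names, not from any broadcast standard.)"""
    name: str                       # e.g. "commentary_en", "crowd"
    role: str                       # "dialogue", "ambience", "effects", ...
    language: Optional[str] = None  # only meaningful for dialogue objects
    default_gain_db: float = 0.0    # the level the mixer intended

def assemble_mix(objects, listener_language="en", wants_commentary=True):
    """Pick the objects that suit one individual listener."""
    selected = []
    for obj in objects:
        if obj.role == "dialogue" and not wants_commentary:
            continue  # listener opted out of commentary entirely
        if obj.role == "dialogue" and obj.language != listener_language:
            continue  # drop commentary in other languages
        selected.append(obj)
    return selected

program = [
    AudioObject("commentary_en", "dialogue", language="en"),
    AudioObject("commentary_es", "dialogue", language="es"),
    AudioObject("crowd", "ambience"),
]
print([o.name for o in assemble_mix(program, listener_language="es")])
# -> ['commentary_es', 'crowd']
```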

However, object-based audio is not just about Dolby Atmos and DTS:X. Object audio can also be used to deliver content in a form that lets the end user adjust the balance between content elements. MPEG-H Audio, for instance, offers interactive and immersive sound, employing audio objects, height channels, and Higher-Order Ambisonics for other types of distribution—including OTT services, digital radio, music streaming, VR, AR, and web content. Fraunhofer and others are now offering personalized audio delivery systems based around the MPEG-H Audio standard, enabling the end user to choose what they want to hear or not hear. In tennis, for example, if you would rather not hear the shrieks from a player, you will have the option to turn them down.
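
The sketch below illustrates the general shape of that interactivity: the mixer authors a permitted gain range for each object, and the player clamps the listener's request to those limits. The structure is a hypothetical simplification for illustration, not the actual MPEG-H metadata syntax.

```python
# Hypothetical, simplified producer-bounded interactivity: the mixer
# authors min/max gain limits per object; the player clamps the
# listener's request to them.
OBJECT_LIMITS_DB = {
    "commentary": (-6.0, 6.0),   # listener may duck or boost commentary
    "crowd":      (-12.0, 0.0),  # crowd can be turned down, never up
    "court_fx":   (0.0, 0.0),    # on-court sounds stay as mixed
}

def apply_user_gain(obj_name: str, requested_db: float) -> float:
    """Clamp a listener's gain request to the range the mixer allowed."""
    lo, hi = OBJECT_LIMITS_DB[obj_name]
    return max(lo, min(hi, requested_db))

# A viewer who finds the on-court shrieks distracting turns the crowd down:
print(apply_user_gain("crowd", -20.0))  # -> -12.0, clamped to the authored limit
```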

Object-based audio demands higher-performance audio processors to handle complex computing during the mix.

Audiences want to watch (and listen to) content everywhere, and with mobile devices they might start a program at home and finish it on the bus. Object-based media allows the mixer to specify different audio mixes for different environments. If people are listening on the move, the mixer can make sure the sound suits that environment, for example by keeping dialogue intelligible over background noise.
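
A minimal sketch of that idea, with hypothetical profile names: the mixer authors rendering profiles once, and the receiver picks the one that fits its playback environment at the moment.

```python
# Hypothetical environment profiles a mixer might author once,
# leaving the receiver to choose at playback time.
MIX_PROFILES = {
    "living_room": {"layout": "5.1.4",    "dynamic_range": "full"},
    "headphones":  {"layout": "binaural", "dynamic_range": "medium"},
    "mobile":      {"layout": "stereo",   "dynamic_range": "compressed",
                    "dialogue_boost_db": 3.0},  # keep speech clear on the bus
}

def choose_profile(device: str, on_the_move: bool) -> dict:
    """Pick the rendering profile that fits the listening environment."""
    if device == "phone":
        return MIX_PROFILES["mobile" if on_the_move else "headphones"]
    return MIX_PROFILES["living_room"]

print(choose_profile("phone", on_the_move=True))
```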

This new workflow requires audio professionals to rethink how they approach the mix, as well as extra processing power to use the technique successfully.

What Is An “Object”?

Audio becomes an object when it is accompanied by metadata that describes its existence, position and function. An audio object can therefore be the sound of a bee flying over your head, the crowd noise at a stadium, or the commentary to a sporting event in any language. All of this remains fully adjustable at the consumer’s end to suit their specific listening environment, needs and liking, regardless of the device.
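
Position metadata is what lets a renderer place an object appropriately for whatever speakers are present. The sketch below shows the idea in the simplest possible case, a constant-power stereo pan computed from an object's azimuth, using an arbitrary angle convention chosen for this sketch; real object renderers for 3D speaker layouts are far more sophisticated.

```python
import math

def stereo_gains_from_azimuth(azimuth_deg: float):
    """Constant-power pan from an object's azimuth to L/R speaker gains.
    Convention for this sketch only: -30 deg = full left, +30 deg = full right."""
    az = max(-30.0, min(30.0, azimuth_deg))  # clamp to the speaker arc
    pan = (az + 30.0) / 60.0                 # 0.0 = hard left, 1.0 = hard right
    return math.cos(pan * math.pi / 2), math.sin(pan * math.pi / 2)

# A bee object flying left to right across the front of the scene:
for az in (-30, 0, 30):
    left, right = stereo_gains_from_azimuth(az)
    print(f"azimuth {az:+d} deg -> L {left:.2f}, R {right:.2f}")
```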

In the UK, the BBC has been experimenting with object-based audio, work that led to a new ITU recommendation, ITU-R BS.2125 (“A serial representation of the Audio Definition Model”), published in February 2019. It outlines a specification for metadata that can be used to describe object-based, scene-based and channel-based audio.
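
To give a flavor of what such metadata looks like, the snippet below embeds a heavily stripped-down, ADM-style XML fragment and reads the positional metadata back out. The fragment is illustrative only: the element names echo the Audio Definition Model, but it is not a complete or schema-valid ADM document.

```python
import xml.etree.ElementTree as ET

# Illustrative only: a stripped-down, non-schema-valid sketch of
# ADM-style metadata pairing an object with a spatial position.
ADM_SKETCH = """
<audioFormatExtended>
  <audioObject audioObjectID="AO_1001" audioObjectName="Commentary"/>
  <audioBlockFormat audioBlockFormatID="AB_00031001_00000001">
    <position coordinate="azimuth">-30.0</position>
    <position coordinate="elevation">0.0</position>
    <position coordinate="distance">1.0</position>
  </audioBlockFormat>
</audioFormatExtended>
"""

root = ET.fromstring(ADM_SKETCH)
print(root.find("audioObject").get("audioObjectName"))  # -> Commentary
for pos in root.iter("position"):
    print(pos.get("coordinate"), float(pos.text))
```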

“People’s interest in object-based broadcasting varies enormously depending on their level of understanding of it,” Andrew Mason, BBC R&D senior research engineer, said in 2019. “In some areas, for example BBC Radio Engineering, it is the focus of a significant amount of effort, designing the next generation of radio broadcasting infrastructure. The impact on production areas—both TV and radio—is still modest, being limited at the moment to an underpinning technology for binaural productions, many of which have now been aired or published on the BBC website. [Meanwhile] the interest of program commissioners and program makers in the possibilities of personalization is still being developed.”

MPEG-H Audio In The Mix

Another important element in delivering object-based audio to the consumer has been the development of the MPEG-H Audio standard. MPEG-H Audio is already on-air in Korea and the US (ATSC 3.0), Europe (DVB UHD), and China.

MPEG-H was developed by Germany’s Fraunhofer IIS research institute and is an audio system devised for delivering format-agnostic object-based audio.

Fraunhofer IIS has demonstrated an end-to-end production to consumer system that includes MPEG-H monitoring units for real-time monitoring and content authoring, post-production tools, MPEG-H Audio real-time broadcast encoders, and decoders in professional and consumer receivers.

Adrian Murtaza, senior manager in Fraunhofer IIS’ technical standards and business development unit, has said that with MPEG-H it is possible to offer immersive sound that increases the realism and immersion of the scene, [as well as] the use of audio objects to enable interactivity.

“This means viewers can personalize a program’s audio mix, for instance by switching between different languages, enhancing hard-to-understand dialogue, or adjusting the volume of the commentator in sports broadcasts,” he said, adding that along with Dolby’s new AC-4 format, which natively supports the Dolby Atmos immersive audio technology, MPEG-H is expected to have a significant impact on broadcast delivery services over the next two years.

Object Mixing In Live Sports

Several production companies—like Salsa Sound, an offshoot of research initiatives completed at Salford University in the UK—have developed tools for automatic mixing that are both channel- and object-based. These tools are focused on live sports, where a machine learning engine can automatically create a mix of the on-pitch sounds without any additional equipment, services or human input. This frees sound supervisors to create better mixes.

Applying a machine learning engine can automatically create a mix of the on-pitch sounds without any additional equipment, services or human input.

“Our solutions not only create a mix for a channel-based world,” said Rob Oldfield, co-founder at Salsa Sound, “but also allow for the individual objects to be broadcast separately with accompanying metadata from our optimized triangulation procedure, which places all of the sounds in 3D space—even in a high-noise environment—which helps facilitate immersive and interactive applications.”

Based on machine learning, Salsa Sound has been able to identify where the ball is on the pitch and to automate the mixing of all the field mics. The technology has also been taught not only to identify the ball but to detect how hard it is being kicked, performing automated ball-kick foley on the fly and at last delivering the impact that mixers have long struggled to achieve.
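
The localization step can be pictured as follows: when several field mics pick up the same kick, the differences in arrival time constrain where on the pitch the sound must have originated. The toy sketch below brute-force searches a grid of candidate positions for the one most consistent with the measured time differences; it illustrates the general principle only and is not Salsa Sound's actual algorithm.

```python
import itertools, math

SPEED_OF_SOUND = 343.0  # metres per second

def tdoa(p, mic_a, mic_b):
    """Predicted arrival-time difference (s) at two mics for a sound at p."""
    return (math.dist(p, mic_a) - math.dist(p, mic_b)) / SPEED_OF_SOUND

def locate(mics, measured, pitch=(105.0, 68.0), step=1.0):
    """Grid-search the pitch for the point whose predicted time
    differences best match the measured ones (least squares)."""
    best, best_err = None, float("inf")
    xs = [i * step for i in range(int(pitch[0] / step) + 1)]
    ys = [i * step for i in range(int(pitch[1] / step) + 1)]
    for p in itertools.product(xs, ys):
        err = sum((tdoa(p, mics[a], mics[b]) - t) ** 2
                  for (a, b), t in measured.items())
        if err < best_err:
            best, best_err = p, err
    return best

# Four mics at the corners of a 105 m x 68 m pitch, a kick at (30, 20):
mics = [(0, 0), (105, 0), (0, 68), (105, 68)]
measured = {(a, b): tdoa((30.0, 20.0), mics[a], mics[b])
            for a, b in itertools.combinations(range(4), 2)}
print(locate(mics, measured))  # -> (30.0, 20.0)
```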

Audio equipment vendors have begun to develop compatible products and are beginning to see interest from their customers.

“Over the last couple of years, our users have started migrating to next-generation audio and producing Dolby Atmos—among others—by adding channels to each path to add height legs, as well as adding objects to their mix,” said Pete Walker, Senior Product Manager at audio mixing console maker Calrec Audio, adding that Calrec has introduced height legs and height panning to provide native immersive input channels, buses, monitoring and metering in its Impulse audio processing and routing engine. “That’s quite a lot of extra DSP being used and we need to make sure that we provide enough so there’s no compromise.”

At the end of the day, object-based audio offers the consumer a lot more control while also giving content providers the technology to deliver one stream of object-based content and then use the metadata to render the most appropriate version for the hardware the consumer is using to play back the content. There are still many issues to work out—like deciding what should be objects and what should remain beds in a Dolby Atmos or DTS:X mix—but with time and experimentation, the promise of true personalization for the consumer, using object-based mixing, will be welcomed by all.
