Object-Based Audio Mixing: A New Path To Personalized Listening

With the advent of immersive audio mixing using codecs like Dolby Atmos and DTS:X (the successor to DTS-HD), professionals now have the ability to create interactive, personalized, scalable and immersive content by representing it as a set of individual assets, together with metadata describing their relationships and associations.


This is called object-based audio mixing, and it is adding a new dimension to multi-channel mixes for television and film. Some say it creates a multi-dimensional sound experience in which sound moves around the viewer as it would in real life.

A Unique Experience For Each Audience Member

Object-based media allows the content of programs to change according to the requirements of each individual audience member. The ‘objects’ refer to the different assets that are used to make up a piece of content. These could be large objects—the audio and video used for a scene in a drama—or small objects, like an individual frame of video, a caption, or a sound effect. By breaking down a piece of media into separate objects, attaching meaning to them, and describing how they can be rearranged, a program can be changed to reflect the context of an individual consumer.
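
To make the idea concrete, here is a minimal sketch in Python of an "object" as an asset plus descriptive metadata, with a simple rule for rearranging objects per listener. All class, field and object names here are hypothetical illustrations, not taken from any broadcast standard.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AudioObject:
    """One asset plus the metadata that makes it an 'object'.
    (Illustrative field names, not from any broadcast standard.)"""
    name: str                       # e.g. "commentary_en", "crowd"
    role: str                       # "dialogue", "ambience", "effects", ...
    language: Optional[str] = None  # only meaningful for dialogue objects
    default_gain_db: float = 0.0    # the level the mixer intended

def assemble_mix(objects, listener_language="en", wants_commentary=True):
    """Pick the objects that suit one individual listener."""
    selected = []
    for obj in objects:
        if obj.role == "dialogue" and not wants_commentary:
            continue  # listener opted out of commentary entirely
        if obj.role == "dialogue" and obj.language != listener_language:
            continue  # drop commentary in other languages
        selected.append(obj)
    return selected

program = [
    AudioObject("commentary_en", "dialogue", language="en"),
    AudioObject("commentary_es", "dialogue", language="es"),
    AudioObject("crowd", "ambience"),
]
print([o.name for o in assemble_mix(program, listener_language="es")])
# -> ['commentary_es', 'crowd']
```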

However, object-based audio is not just about Dolby Atmos and DTS:X. Object audio can also be used to deliver content in a form that lets the end user adjust the balance between content elements. MPEG-H Audio, for instance, offers interactive and immersive sound, employing audio objects, height channels, and Higher-Order Ambisonics for other types of distribution—including OTT services, digital radio, music streaming, VR, AR, and web content. Fraunhofer and others are now offering personalized audio delivery systems based around the MPEG-H Audio standard, enabling the end user to choose what they want to hear or not hear. In tennis, for example, if you would rather not hear the shrieks from a player, you will have the option to turn them down.
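
The sketch below illustrates the general shape of that interactivity: the mixer authors a permitted gain range for each object, and the player clamps the listener's request to those limits. The structure is a hypothetical simplification for illustration, not the actual MPEG-H metadata syntax.

```python
# Hypothetical, simplified producer-bounded interactivity: the mixer
# authors min/max gain limits per object; the player clamps the
# listener's request to them.
OBJECT_LIMITS_DB = {
    "commentary": (-6.0, 6.0),   # listener may duck or boost commentary
    "crowd":      (-12.0, 0.0),  # crowd can be turned down, never up
    "court_fx":   (0.0, 0.0),    # on-court sounds stay as mixed
}

def apply_user_gain(obj_name: str, requested_db: float) -> float:
    """Clamp a listener's gain request to the range the mixer allowed."""
    lo, hi = OBJECT_LIMITS_DB[obj_name]
    return max(lo, min(hi, requested_db))

# A viewer who finds the on-court shrieks distracting turns the crowd down:
print(apply_user_gain("crowd", -20.0))  # -> -12.0, clamped to the authored limit
```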

Object-based audio demands higher-performance audio processors to handle complex computing during the mix.

Audiences want to watch (and listen to) content everywhere, and with mobile devices they might start a program at home and finish it on the bus. Object-based media allows the mixer to specify different audio mixes for different environments. If people are listening on the move, the mixer can make sure the sound suits that environment, for example by keeping dialogue intelligible over background noise.
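
A minimal sketch of that idea, with hypothetical profile names: the mixer authors rendering profiles once, and the receiver picks the one that fits its playback environment at the moment.

```python
# Hypothetical environment profiles a mixer might author once,
# leaving the receiver to choose at playback time.
MIX_PROFILES = {
    "living_room": {"layout": "5.1.4",    "dynamic_range": "full"},
    "headphones":  {"layout": "binaural", "dynamic_range": "medium"},
    "mobile":      {"layout": "stereo",   "dynamic_range": "compressed",
                    "dialogue_boost_db": 3.0},  # keep speech clear on the bus
}

def choose_profile(device: str, on_the_move: bool) -> dict:
    """Pick the rendering profile that fits the listening environment."""
    if device == "phone":
        return MIX_PROFILES["mobile" if on_the_move else "headphones"]
    return MIX_PROFILES["living_room"]

print(choose_profile("phone", on_the_move=True))
```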

This new workflow requires audio professionals to rethink how they approach the mix, as well as extra processing power to use the technique successfully.

What Is An “Object”?

Audio becomes an object when it is accompanied by metadata that describes its existence, position and function. An audio object can therefore be the sound of a bee flying over your head, the crowd noise at a stadium, or the commentary to a sporting event in any language. All of this remains fully adjustable at the consumer’s end to suit their specific listening environment, needs and liking, regardless of the device.
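
Position metadata is what lets a renderer place an object appropriately for whatever speakers are present. The sketch below shows the idea in the simplest possible case, a constant-power stereo pan computed from an object's azimuth, using an arbitrary angle convention chosen for this sketch; real object renderers for 3D speaker layouts are far more sophisticated.

```python
import math

def stereo_gains_from_azimuth(azimuth_deg: float):
    """Constant-power pan from an object's azimuth to L/R speaker gains.
    Convention for this sketch only: -30 deg = full left, +30 deg = full right."""
    az = max(-30.0, min(30.0, azimuth_deg))  # clamp to the speaker arc
    pan = (az + 30.0) / 60.0                 # 0.0 = hard left, 1.0 = hard right
    return math.cos(pan * math.pi / 2), math.sin(pan * math.pi / 2)

# A bee object flying left to right across the front of the scene:
for az in (-30, 0, 30):
    left, right = stereo_gains_from_azimuth(az)
    print(f"azimuth {az:+d} deg -> L {left:.2f}, R {right:.2f}")
```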

In the UK, the BBC has been experimenting with object-based audio, work that led to a new ITU recommendation, ITU-R BS.2125 (“A serial representation of the Audio Definition Model”), published in February 2019. It outlines a specification for metadata that can be used to describe object-based, scene-based and channel-based audio.
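
To give a flavor of what such metadata looks like, the snippet below embeds a heavily stripped-down, ADM-style XML fragment and reads the positional metadata back out. The fragment is illustrative only: the element names echo the Audio Definition Model, but it is not a complete or schema-valid ADM document.

```python
import xml.etree.ElementTree as ET

# Illustrative only: a stripped-down, non-schema-valid sketch of
# ADM-style metadata pairing an object with a spatial position.
ADM_SKETCH = """
<audioFormatExtended>
  <audioObject audioObjectID="AO_1001" audioObjectName="Commentary"/>
  <audioBlockFormat audioBlockFormatID="AB_00031001_00000001">
    <position coordinate="azimuth">-30.0</position>
    <position coordinate="elevation">0.0</position>
    <position coordinate="distance">1.0</position>
  </audioBlockFormat>
</audioFormatExtended>
"""

root = ET.fromstring(ADM_SKETCH)
print(root.find("audioObject").get("audioObjectName"))  # -> Commentary
for pos in root.iter("position"):
    print(pos.get("coordinate"), float(pos.text))
```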

“People’s interest in object-based broadcasting varies enormously depending on their level of understanding of it,” Andrew Mason, BBC R&D senior research engineer, said in 2019. “In some areas, for example BBC Radio Engineering, it is the focus of a significant amount of effort, designing the next generation of radio broadcasting infrastructure. The impact on production areas—both TV and radio—is still modest, being limited at the moment to an underpinning technology for binaural productions, many of which have now been aired or published on the BBC website. [Meanwhile] the interest of program commissioners and program makers in the possibilities of personalization is still being developed.”

MPEG-H Audio In The Mix

Another important element in delivering object-based audio to the consumer has been the development of the MPEG-H Audio standard. MPEG-H Audio is already on-air in Korea and the US (ATSC 3.0), Europe (DVB UHD), and China.

MPEG-H was developed by Germany’s Fraunhofer IIS research institute and is an audio system devised for delivering format-agnostic object-based audio.

Fraunhofer IIS has demonstrated an end-to-end production to consumer system that includes MPEG-H monitoring units for real-time monitoring and content authoring, post-production tools, MPEG-H Audio real-time broadcast encoders, and decoders in professional and consumer receivers.

Adrian Murtaza, senior manager in Fraunhofer IIS’ technical standards and business development unit, has said that with MPEG-H it is possible to offer immersive sound that increases the realism and immersion of the scene, [as well as] the use of audio objects to enable interactivity.

“This means viewers can personalize a program’s audio mix, for instance by switching between different languages, enhancing hard-to-understand dialogue, or adjusting the volume of the commentator in sports broadcasts,” he said, adding that along with Dolby’s new AC-4 format, which natively supports the Dolby Atmos immersive audio technology, MPEG-H is expected to have a significant impact on broadcast delivery services over the next two years.

Object Mixing In Live Sports

Several production companies—like Salsa Sound, an offshoot of research initiatives completed at Salford University in the UK—have developed tools for automatic mixing that are both channel- and object-based. These tools are focused on live sports, where a machine learning engine can automatically create a mix of the on-pitch sounds without any additional equipment, services or human input. This frees sound supervisors to create better mixes.

Applying a machine learning engine can automatically create a mix of the on-pitch sounds without any additional equipment, services or human input.

“Our solutions not only create a mix for a channel-based world,” said Rob Oldfield, co-founder at Salsa Sound, “but also allow for the individual objects to be broadcast separately with accompanying metadata from our optimized triangulation procedure, which places all of the sounds in 3D space—even in a high-noise environment—which helps facilitate immersive and interactive applications.”

Based on machine learning, Salsa Sound has been able to identify where the ball is on the pitch and to automate the mixing of all the field mics. The technology has also been taught not only to identify the ball but to detect how hard it is being kicked, performing automated ball-kick foley on the fly and at last delivering the impact that mixers have long struggled to achieve.
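
The localization step can be pictured as follows: when several field mics pick up the same kick, the differences in arrival time constrain where on the pitch the sound must have originated. The toy sketch below brute-force searches a grid of candidate positions for the one most consistent with the measured time differences; it illustrates the general principle only and is not Salsa Sound's actual algorithm.

```python
import itertools, math

SPEED_OF_SOUND = 343.0  # metres per second

def tdoa(p, mic_a, mic_b):
    """Predicted arrival-time difference (s) at two mics for a sound at p."""
    return (math.dist(p, mic_a) - math.dist(p, mic_b)) / SPEED_OF_SOUND

def locate(mics, measured, pitch=(105.0, 68.0), step=1.0):
    """Grid-search the pitch for the point whose predicted time
    differences best match the measured ones (least squares)."""
    best, best_err = None, float("inf")
    xs = [i * step for i in range(int(pitch[0] / step) + 1)]
    ys = [i * step for i in range(int(pitch[1] / step) + 1)]
    for p in itertools.product(xs, ys):
        err = sum((tdoa(p, mics[a], mics[b]) - t) ** 2
                  for (a, b), t in measured.items())
        if err < best_err:
            best, best_err = p, err
    return best

# Four mics at the corners of a 105 m x 68 m pitch, a kick at (30, 20):
mics = [(0, 0), (105, 0), (0, 68), (105, 68)]
measured = {(a, b): tdoa((30.0, 20.0), mics[a], mics[b])
            for a, b in itertools.combinations(range(4), 2)}
print(locate(mics, measured))  # -> (30.0, 20.0)
```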

Audio equipment vendors have begun to develop compatible products and are beginning to see interest from their customers.

“Over the last couple of years, our users have started migrating to next-generation audio and producing Dolby Atmos—among others—by adding channels to each path to add height legs, as well as adding objects to their mix,” said Pete Walker, Senior Product Manager at audio mixing console maker Calrec Audio, adding that Calrec has introduced height legs and height panning to provide native immersive input channels, buses, monitoring and metering in its Impulse audio processing and routing engine. “That’s quite a lot of extra DSP being used and we need to make sure that we provide enough so there’s no compromise.”

At the end of the day, object-based audio offers the consumer a lot more control while also giving content providers the technology to deliver one stream of object-based content and then use the metadata to render the most appropriate version for the hardware the consumer is using to play back the content. There are still many issues to work out—like deciding what should be objects and what should remain beds in a Dolby Atmos or DTS:X mix—but with time and experimentation, the promise of true personalization for the consumer, using object-based mixing, will be welcomed by all.
