Object-Based Audio Mixing: A New Way To Personalize Listening

With the advent of immersive audio formats like Dolby Atmos and DTS:X (the successor to DTS-HD), professionals can now create interactive, personalized, scalable and immersive content by representing it as a set of individual assets together with metadata describing their relationships and associations.

This is called object-based audio mixing, and it is adding a new dimension to multi-channel mixes for television and film. Proponents say it helps create a multi-dimensional sound experience in which sound moves around the viewer as it would in real life.

A Unique Experience For Each Audience Member

Object-based media allows the content of programs to change according to the requirements of each individual audience member. The ‘objects’ refer to the different assets that are used to make up a piece of content. These could be large objects—the audio and video used for a scene in a drama—or small objects, like an individual frame of video, a caption, or a sound effect. By breaking down a piece of media into separate objects, attaching meaning to them, and describing how they can be rearranged, a program can be changed to reflect the context of an individual consumer.

However, object-based audio is not just about Dolby Atmos and DTS:X. It can also be used to deliver content in which the end user adjusts the balance between content elements. MPEG-H Audio also offers interactive and immersive sound, employing audio objects, height channels and Higher-Order Ambisonics for other types of distribution, including OTT services, digital radio, music streaming, VR, AR and web content. Dolby and others are now offering personalized audio delivery systems, some based around the MPEG-H Audio standard, that let the end user choose what they want to hear or not hear. In tennis, for example, a viewer who does not want to hear a player’s shrieks will have the option to turn them down.

Object-based audio demands higher-performance audio processors to handle complex computing during the mix.

Audiences want to watch (and listen to) content everywhere, and with mobile devices, they might start watching or listening to a program at home and then finish the rest on the bus. Object-based media allows the mixer to specify different audio mixes for different environments. If people are listening on the move, with object-based audio the mixer can make sure that the sound is just right for them, no matter where they are.

This new workflow requires audio professionals to rethink how they approach the mix, and it needs extra processing power to be used successfully.

What Is An “Object?”

Audio becomes an object when it is accompanied by metadata that describes its existence, position and function. An audio object can, therefore, be the sound of a bee flying over your head, the crowd noise, or commentary on a sporting event in any language. All of this remains fully adjustable at the consumer’s end to suit their specific listening environment, needs and preferences, regardless of the device.
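As a rough illustration of that definition, the sketch below pairs a sound with metadata covering its existence, position and function. The class and field names are hypothetical and are not taken from Dolby Atmos, DTS:X, MPEG-H or the Audio Definition Model; each of those defines its own metadata schema.

```python
from dataclasses import dataclass


@dataclass
class AudioObject:
    """One audio element plus the metadata that makes it an 'object'.

    All field names are illustrative assumptions, not a real schema.
    """
    name: str                # label for the underlying audio, e.g. "bee"
    kind: str                # function: "dialogue", "effects", "ambience"
    azimuth_deg: float       # horizontal position, 0 = straight ahead
    elevation_deg: float     # vertical position, 90 = directly overhead
    gain_db: float = 0.0     # level chosen by the mixer
    active: bool = True      # existence: is the object present right now?


# The bee flying overhead and the crowd noise from the text above.
bee = AudioObject("bee", "effects", azimuth_deg=30.0, elevation_deg=80.0)
crowd = AudioObject("crowd", "ambience", azimuth_deg=0.0, elevation_deg=0.0,
                    gain_db=-6.0)
scene = [bee, crowd]

# A renderer can query the metadata, for example to find overhead sounds.
overhead = [obj.name for obj in scene if obj.elevation_deg > 45.0]
```

Because the position and function travel with the audio rather than being baked into fixed channels, the playback device can decide how to reproduce each object.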

In the UK, the BBC has been experimenting with object-based audio, which has led to a new ITU recommendation, ITU-R BS.2125 (“A serial representation of the Audio Definition Model”), published in February 2019. It outlines a specification for metadata that can be used to describe object-based audio, scene-based audio and channel-based audio.

“People’s interest in object-based broadcasting varies enormously depending on their level of understanding of it,” Andrew Mason, BBC R&D senior research engineer, said in 2019. “In some areas, for example BBC Radio Engineering, it is the focus of a significant amount of effort, designing the next generation of radio broadcasting infrastructure. The impact on production areas—both TV and radio—is still modest, being limited at the moment to an underpinning technology for binaural productions, many of which have now been aired or published on the BBC website. [Meanwhile] the interest of program commissioners and program makers in the possibilities of personalization is still being developed.”

MPEG-H Audio In The Mix

Another important element in delivering object-based audio to the consumer has been the development of the MPEG-H Audio standard. MPEG-H Audio is already on-air in Korea and the US (ATSC 3.0), Europe (DVB UHD), and China.

MPEG-H was developed by Germany’s Fraunhofer IIS research institute and is an audio system devised for delivering format-agnostic object-based audio.

Fraunhofer IIS has demonstrated an end-to-end, production-to-consumer system that includes MPEG-H monitoring units for real-time monitoring and content authoring, post-production tools, MPEG-H Audio real-time broadcast encoders, and decoders in professional and consumer receivers.

Adrian Murtaza, Senior Manager at Fraunhofer IIS’ technical standards and business development unit, has said that with MPEG-H it is possible to offer immersive sound that increases the realism of the scene, as well as audio objects that enable interactivity.

“This means viewers can personalize a program’s audio mix, for instance by switching between different languages, enhancing hard-to-understand dialogue, or adjusting the volume of the commentator in sports broadcasts,” he said, adding that along with Dolby’s new AC-4 format, which natively supports the Dolby Atmos immersive audio technology, MPEG-H is expected to have a significant impact on broadcast delivery services over the next two years.
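The kind of personalization described here can be sketched in a few lines. This is a simplified illustration and not the MPEG-H or AC-4 decoder logic: the dictionary keys, the `personalize` function and the example programme are all assumptions made for the sketch.

```python
def db_to_linear(db: float) -> float:
    """Convert a level in decibels to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)


def personalize(objects, user_offsets_db, language):
    """Return {object name: linear gain} for one listener.

    objects         -- list of dicts with hypothetical keys "name", "kind",
                       "gain_db" and, for dialogue, "language"
    user_offsets_db -- per-kind level offsets chosen by the listener
    language        -- the dialogue language the listener selected
    """
    gains = {}
    for obj in objects:
        # Drop dialogue objects that are not in the selected language.
        if obj["kind"] == "dialogue" and obj.get("language") != language:
            continue
        offset = user_offsets_db.get(obj["kind"], 0.0)
        gains[obj["name"]] = db_to_linear(obj["gain_db"] + offset)
    return gains


programme = [
    {"name": "commentary_en", "kind": "dialogue", "language": "en", "gain_db": 0.0},
    {"name": "commentary_de", "kind": "dialogue", "language": "de", "gain_db": 0.0},
    {"name": "crowd", "kind": "ambience", "gain_db": -6.0},
]

# A listener who wants German commentary, boosted by 6 dB over the mixer's level.
gains = personalize(programme, {"dialogue": 6.0}, language="de")
```

A production system would typically also limit how far the listener can move each control, so that the result stays within a range the mixer has approved.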

Object Mixing In Live Sports

Several production companies—like Salsa Sound, an offshoot of research initiatives completed at Salford University in the UK—have developed tools for automatic mixing that are both channel and object-based. These are focused on live sports, where a machine learning engine can automatically create a mix of the on-pitch sounds without any additional equipment, services or human input. This frees up the sound supervisors to be able to create better mixes.

A machine learning engine can automatically create a mix of the on-pitch sounds without any additional equipment, services or human input.

“Our solutions not only create a mix for a channel-based world,” said Rob Oldfield, co-founder at Salsa Sound, “but also allow for the individual objects to be broadcast separately with accompanying metadata from our optimized triangulation procedure which places all of the sounds in 3D space, even in a high noise environment, which helps facilitate immersive and interactive applications.”

Using machine learning, Salsa Sound has been able to identify where the ball is on the pitch and to automate the mixing of all the field mics. In addition, the technology has been taught to identify not only the ball but also how hard it is being kicked, and to generate automated ball-kick Foley on the fly, delivering the impact that sound mixers have long struggled to achieve.
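To make the idea of position-driven automixing concrete, here is a minimal sketch that weights each field mic by its distance from a known ball position. It is not Salsa Sound’s algorithm, which the article does not describe in detail; the inverse-distance rule, the `mic_weights` function, its parameters and the mic coordinates are all assumptions.

```python
import math


def mic_weights(ball_xy, mic_positions, rolloff=2.0, min_dist=1.0):
    """Return normalized mix weights that favor mics close to the ball.

    ball_xy       -- (x, y) ball position in meters (assumed already known)
    mic_positions -- {mic name: (x, y)} in the same coordinate system
    rolloff       -- exponent of the inverse-distance law (assumption)
    min_dist      -- floor on distance to avoid division by zero
    """
    raw = {}
    for name, (mx, my) in mic_positions.items():
        dist = max(min_dist, math.hypot(ball_xy[0] - mx, ball_xy[1] - my))
        raw[name] = 1.0 / dist ** rolloff
    total = sum(raw.values())
    return {name: weight / total for name, weight in raw.items()}


# Four hypothetical mics around a 105 m x 68 m pitch.
mics = {
    "north": (52.5, 68.0),
    "south": (52.5, 0.0),
    "west": (0.0, 34.0),
    "east": (105.0, 34.0),
}

# With the ball near the east goal, the east mic dominates the mix.
weights = mic_weights((100.0, 34.0), mics)
loudest = max(weights, key=weights.get)
```

A real system would also need to smooth these weights over time so that fader moves are not audible as the ball travels.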

Audio equipment vendors have begun to develop compatible products and are beginning to see interest from their customers.

“Over the last couple of years, our users have started migrating to next-generation audio and producing Dolby Atmos, among others, by adding channels to each path to add height legs, as well as adding objects to their mix,” said Pete Walker, Senior Product Manager at audio mixing console maker Calrec Audio, adding that the company has added height legs and height panning to provide native immersive input channels, buses, monitoring and metering in its Impulse audio processing and routing engine. “That’s quite a lot of extra DSP being used and we need to make sure that we provide enough so there’s no compromise.”

Ultimately, object-based audio offers the consumer a lot more control while also giving content providers the technology to deliver one stream of object-based content and then use the metadata to render the most appropriate version for the hardware the consumer is using to play back the content. There are still many issues to work out, such as deciding what should be objects and what should remain beds in a Dolby Atmos or DTS:X mix. With time and experimentation, though, the promise of true personalization for the consumer through object-based mixing is likely to be widely welcomed.
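The “one stream, many renders” idea can be illustrated with a toy renderer that turns the same position metadata into speaker gains for two different playback layouts. This is a sketch using a standard constant-power pan law, not the renderer of any particular format; the `render_gains` function, the layout names and the sign convention (positive azimuth means right) are assumptions.

```python
import math


def render_gains(azimuth_deg: float, layout: str) -> dict:
    """Return per-speaker linear gains for one object on a given layout.

    Assumes positive azimuth is to the listener's right and that the
    stereo speakers sit at -30 and +30 degrees.
    """
    if layout == "mono":
        # A single speaker carries everything; position is discarded.
        return {"C": 1.0}
    if layout == "stereo":
        # Clamp to the speaker arc, then map -30..+30 degrees to 0..pi/2.
        az = max(-30.0, min(30.0, azimuth_deg))
        theta = (az + 30.0) / 60.0 * (math.pi / 2.0)
        # Constant-power pan: L^2 + R^2 is always 1.
        return {"L": math.cos(theta), "R": math.sin(theta)}
    raise ValueError(f"unsupported layout: {layout}")


# The same object metadata rendered for two different devices.
object_azimuth = 30.0
stereo = render_gains(object_azimuth, "stereo")
mono = render_gains(object_azimuth, "mono")
centre = render_gains(0.0, "stereo")
```

The content provider sends the object and its position once; a soundbar, a phone and a full immersive speaker system would each run their own version of this step.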
