Object-Based Audio Mixing: A New Way To Personalize Listening

With the advent of immersive audio formats like Dolby Atmos and DTS:X (the successor to DTS-HD), professionals can now create interactive, personalized, scalable and immersive content by representing it as a set of individual assets together with metadata describing their relationships and associations.

This is called object-based audio mixing, and it is adding a new dimension to multi-channel mixes for television and film. Proponents say it creates a multi-dimensional sound experience in which sound moves around the viewer as it would in real life.

A Unique Experience For Each Audience Member

Object-based media allows the content of programs to change according to the requirements of each individual audience member. The ‘objects’ refer to the different assets that are used to make up a piece of content. These could be large objects—the audio and video used for a scene in a drama—or small objects, like an individual frame of video, a caption, or a sound effect. By breaking down a piece of media into separate objects, attaching meaning to them, and describing how they can be rearranged, a program can be changed to reflect the context of an individual consumer.

However, object-based audio is not just about Dolby Atmos and DTS:X. Object audio can also deliver content in a form that lets the end user adjust the balance between content elements. MPEG-H Audio also offers interactive and immersive sound, employing audio objects, height channels, and Higher-Order Ambisonics, for other types of distribution, including OTT services, digital radio, music streaming, VR, AR, and web content. Dolby and others are now offering personalized audio delivery systems based around the MPEG-H audio standard, enabling the end user to choose what they want to hear or not hear. For example, in a tennis broadcast, a viewer who does not want to hear a player’s shrieks will have the option to turn them down.
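As a rough illustration of what "adjusting the balance between content elements" means, the sketch below sums separately delivered objects after applying a per-object gain chosen by the listener. The object names, sample values and the `personalized_mix` function are hypothetical; this is not how MPEG-H or any other decoder is implemented, only the basic arithmetic behind the idea.

```python
def db_to_linear(gain_db):
    """Convert a gain in decibels to a linear amplitude factor."""
    return 10 ** (gain_db / 20)


def personalized_mix(objects, user_gains_db):
    """Sum audio objects into one signal, applying the listener's gain offsets.

    objects: dict mapping an object name to a list of samples.
    user_gains_db: dict mapping an object name to a dB offset (default 0 dB).
    """
    length = max(len(samples) for samples in objects.values())
    mix = [0.0] * length
    for name, samples in objects.items():
        gain = db_to_linear(user_gains_db.get(name, 0.0))
        for i, sample in enumerate(samples):
            mix[i] += gain * sample
    return mix


# Hypothetical tennis broadcast: the viewer turns the player's vocals down by 20 dB.
tennis_objects = {
    "commentary": [0.5, 0.5],
    "player_vocals": [1.0, 0.0],
}
tennis_mix = personalized_mix(tennis_objects, {"player_vocals": -20.0})
```

Because the objects stay separate until this final step, the same delivered content can produce a different mix for every listener.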

Object-based audio demands higher-performance audio processors to handle complex computing during the mix.

Audiences want to watch (and listen to) content everywhere, and with mobile devices, they might start watching or listening to a program at home and then finish the rest on the bus. Object-based media allows the mixer to specify different audio mixes for different environments. If people are listening on the move, with object-based audio the mixer can make sure that the sound is just right for them, no matter where they are.
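One way to picture mixer-specified mixes for different environments is as a set of presets carried in the metadata, each giving a gain offset per type of object. The preset names, roles and dB values below are invented for illustration and do not come from any standard or product.

```python
# Hypothetical presets authored by the mixer: dB offsets per object role.
PRESETS = {
    "living_room": {"dialogue": 0.0, "music": 0.0, "ambience": 0.0},
    "bus": {"dialogue": 6.0, "music": -3.0, "ambience": -9.0},
}


def gains_for(environment, object_roles, presets=PRESETS):
    """Return the dB gain for each object in the given listening environment.

    object_roles maps an object name to its role. Unknown environments and
    unknown roles fall back to 0 dB, i.e. the reference mix.
    """
    preset = presets.get(environment, {})
    return {name: preset.get(role, 0.0) for name, role in object_roles.items()}


program = {"narrator": "dialogue", "score": "music", "street": "ambience"}
bus_gains = gains_for("bus", program)
```

In this sketch a playback device would only need to know which environment applies; the creative decisions stay with the mixer.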

This new workflow requires audio professionals to rethink how they approach the mix and requires extra processing power to use this technique successfully.

What Is An “Object?”

Audio becomes an object when it is accompanied by metadata that describes its existence, position and function. An audio object can, therefore, be the sound of a bee flying over your head, crowd noise, or commentary on a sporting event in any language. All of this remains fully adjustable at the consumer’s end to suit their specific listening environment, needs and liking, regardless of the device.
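The definition above can be sketched as a small data structure: a sound plus fields for its existence, position and function. The field names here are hypothetical and much simpler than a real metadata model; they are meant only to show the shape of the idea.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AudioObject:
    """A sound plus the metadata that makes it an 'object' (illustrative schema)."""
    name: str                       # existence: what the sound is
    role: str                       # function: e.g. "effect", "ambience", "commentary"
    azimuth_deg: float = 0.0        # position: angle around the listener
    elevation_deg: float = 0.0      # position: angle above the horizon
    language: Optional[str] = None  # set only for language-specific objects
    user_adjustable: bool = True    # whether the consumer may change its level


scene = [
    AudioObject("bee", "effect", azimuth_deg=30.0, elevation_deg=60.0),
    AudioObject("crowd", "ambience", user_adjustable=False),
    AudioObject("commentary_en", "commentary", language="en"),
    AudioObject("commentary_de", "commentary", language="de"),
]


def select_language(objects, language):
    """Keep language-neutral objects plus those in the chosen language."""
    return [obj for obj in objects if obj.language in (None, language)]


german_scene = select_language(scene, "de")
```

Choosing a commentary language then becomes a matter of filtering objects on the receiving device rather than producing a separate mix for each language.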

In the UK, the BBC has been experimenting with object-based audio, work that led to a new ITU recommendation (ITU-R BS.2125, “A serial representation of the Audio Definition Model”), published in February 2019. It outlines a specification for metadata that can be used to describe object-based audio, scene-based audio and channel-based audio.

“People’s interest in object-based broadcasting varies enormously depending on their level of understanding of it,” Andrew Mason, BBC R&D senior research engineer, said in 2019. “In some areas, for example BBC Radio Engineering, it is the focus of a significant amount of effort, designing the next generation of radio broadcasting infrastructure. The impact on production areas—both TV and radio—is still modest, being limited at the moment to an underpinning technology for binaural productions, many of which have now been aired or published on the BBC website. [Meanwhile] the interest of program commissioners and program makers in the possibilities of personalization is still being developed.”

MPEG-H Audio In The Mix

Another important element in delivering object-based audio to the consumer has been the development of the MPEG-H Audio standard. MPEG-H Audio is already on-air in Korea and the US (ATSC 3.0), Europe (DVB UHD), and China.

MPEG-H was developed by Germany’s Fraunhofer IIS research institute and is an audio system devised for delivering format-agnostic object-based audio.

Fraunhofer IIS has demonstrated an end-to-end, production-to-consumer system that includes MPEG-H monitoring units for real-time monitoring and content authoring, post-production tools, MPEG-H Audio real-time broadcast encoders, and decoders in professional and consumer receivers.

Adrian Murtaza, senior manager at Fraunhofer IIS’ technical standards and business development unit, has said that MPEG-H makes it possible to offer immersive sound that increases the realism and immersion in the scene, as well as to use audio objects to enable interactivity.

“This means viewers can personalize a program’s audio mix, for instance by switching between different languages, enhancing hard-to-understand dialogue, or adjusting the volume of the commentator in sports broadcasts,” he said, adding that along with Dolby’s new AC-4 format, which natively supports the Dolby Atmos immersive audio technology, MPEG-H is expected to have a significant impact on broadcast delivery services over the next two years.

Object Mixing In Live Sports

Several production companies, such as Salsa Sound, an offshoot of research initiatives completed at Salford University in the UK, have developed tools for automatic mixing that are both channel- and object-based. These are focused on live sports, where a machine learning engine can automatically create a mix of the on-pitch sounds without any additional equipment, services or human input. This frees up the sound supervisors to concentrate on creating better mixes.

A machine learning engine can automatically create a mix of the on-pitch sounds without any additional equipment, services or human input.

“Our solutions not only create a mix for a channel-based world,” said Rob Oldfield, co-founder at Salsa Sound, “but also allow for the individual objects to be broadcast separately with accompanying metadata from our optimized triangulation procedure which places all of the sounds in 3D space, even in a high noise environment, which helps facilitate immersive and interactive applications.”

Using machine learning, Salsa Sound has been able to identify where the ball is on the pitch and to automate the mixing of all the field mics. The technology has also been taught to identify how hard the ball is being kicked and to add automated ball-kick Foley on the fly, at last delivering the impact that sound mixers have struggled to achieve.
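To give a feel for what automating the field mics from the ball position could involve, the sketch below weights each microphone by its distance from the ball, so that the nearest mics dominate the mix. This is not Salsa Sound's method, which is not described here; the weighting formula, the `rolloff_m` parameter and the mic layout are assumptions made purely for illustration.

```python
import math


def mic_gains(ball_xy, mic_positions, rolloff_m=10.0):
    """Return a normalized gain per microphone based on distance to the ball.

    ball_xy: (x, y) position of the ball on the pitch, in metres.
    mic_positions: dict mapping a mic name to its (x, y) position in metres.
    rolloff_m: distance at which a mic's weight drops to half (assumed value).
    """
    weights = {}
    for name, (mic_x, mic_y) in mic_positions.items():
        distance = math.hypot(ball_xy[0] - mic_x, ball_xy[1] - mic_y)
        weights[name] = 1.0 / (1.0 + (distance / rolloff_m) ** 2)
    total = sum(weights.values())
    # Normalize so the gains always sum to 1, keeping the overall level steady.
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical layout: one mic at each end of a 100 m touchline.
field_mics = {"north": (0.0, 0.0), "south": (100.0, 0.0)}
gains_near_north = mic_gains((10.0, 0.0), field_mics)
gains_midfield = mic_gains((50.0, 0.0), field_mics)
```

In a real system the ball position would come from the detection engine and the gains would be smoothed over time to avoid audible jumps.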

Audio equipment vendors have begun to develop compatible products and are beginning to see interest from their customers.

“Over the last couple of years, our users have started migrating to next-generation audio and producing Dolby Atmos, among others, by adding channels to each path to add height legs, as well as adding objects to their mix,” said Pete Walker, senior product manager at audio mixing console maker Calrec Audio, adding that the company has added height legs and height panning to provide native immersive input channels, buses, monitoring and metering in its Impulse audio processing and routing engine. “That’s quite a lot of extra DSP being used and we need to make sure that we provide enough so there’s no compromise.”

Ultimately, object-based audio offers the consumer a lot more control while also allowing content providers to deliver one stream of object-based content and then use the metadata to render the most appropriate version for the hardware the consumer is using to play back the content. There are still many issues to work out, such as deciding what should be objects and what should remain beds in a Dolby Atmos or DTS:X mix, but with time and experimentation, object-based mixing promises true personalization for the consumer.
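The idea of one stream rendered differently for each device can be sketched with a single object whose position metadata is turned into speaker gains at playback time. The example uses a standard constant-power pan law for stereo; the layout names, the `render_gains` function and the azimuth convention (negative values to the left, positive to the right) are assumptions for this sketch, and real renderers handle many more layouts, including height speakers.

```python
import math


def render_gains(azimuth_deg, layout):
    """Return per-speaker gains for one object on the given playback layout.

    azimuth_deg: object position, -90 (hard left) to +90 (hard right) in this sketch.
    layout: "mono" or "stereo" (hypothetical layout names).
    """
    if layout == "mono":
        # A single speaker cannot convey position, so the metadata is ignored.
        return {"C": 1.0}
    if layout == "stereo":
        clamped = max(-90.0, min(90.0, azimuth_deg))
        pan = (clamped + 90.0) / 180.0   # 0.0 = left, 1.0 = right
        angle = pan * math.pi / 2
        # Constant-power pan: L^2 + R^2 stays equal to 1 at every position.
        return {"L": math.cos(angle), "R": math.sin(angle)}
    raise ValueError(f"unsupported layout: {layout}")


centre_on_stereo = render_gains(0.0, "stereo")
left_on_stereo = render_gains(-90.0, "stereo")
same_object_on_mono = render_gains(-90.0, "mono")
```

The delivered object and its metadata are identical in every case; only the final rendering step changes with the hardware.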