Implementing MPEG-H Audio in Television Broadcast Operations

A new world of immersive audio awaits both broadcasters and viewers with MPEG-H audio.

The coding efficiency of audio compression is about to improve markedly because of a new MPEG audio standard—MPEG-H. This standard will permit many new and useful capabilities for those involved in television audio.

The world of television audio is advancing. Interactive and immersive sound is about to displace surround sound, just as surround sound began to displace stereo about 20 years ago and stereo began to displace mono about 30 years ago. The improvements in television audio have been and are being made possible by advancements in the technology for delivery of audio content; recent improvements, in particular, have been made possible by significant increases in the efficiency of audio compression.

A major increase in the coding efficiency of audio compression is embodied in the new MPEG International Standard on MPEG-H Audio. MPEG-H Audio will permit many new and useful capabilities in the delivery of television audio. This article will examine some of those capabilities and look at the requirements for implementing them in practical situations in television broadcast operations.

MPEG-H Audio yields roughly double the coding efficiency of the methods in use for U.S. television broadcasting today. That doubling means that twice as many channels can be carried at the same bitrate. For example, in the space required today for a 5.1-channel surround sound service (typically 384 kb/s), it becomes possible to carry 7.1 surround channels plus 4 height channels (12 channels vs. 6). In MPEG-H Audio, the signals representing audio essence can be carried in any of three forms – channels, objects, and coefficients (in a method called Scene-Based Audio) – and those forms can be combined as needed to best deliver content through the broadcast chain to consumers.
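The channel arithmetic above can be checked with a short sketch. It assumes, purely for illustration, that the bit budget is split evenly across channels (real encoders allocate bits dynamically, and an LFE channel needs far fewer bits than a full-range channel) and that "double the coding efficiency" means half the bits per channel at comparable quality. The function and constant names are hypothetical.

```python
# Back-of-the-envelope check of the "12 channels vs. 6" example.
# Assumption: bits are divided evenly among channels, which real
# encoders do not do; this only illustrates the proportion.

LEGACY_BITRATE_KBPS = 384   # typical 5.1 service today, per the article
LEGACY_CHANNELS = 6         # 5.1 = five full-range channels plus LFE
EFFICIENCY_GAIN = 2.0       # "roughly double the coding efficiency"


def channels_in_budget(budget_kbps, legacy_bitrate_kbps, legacy_channels, gain):
    """Return how many channels fit in budget_kbps after an efficiency gain."""
    per_channel_legacy = legacy_bitrate_kbps / legacy_channels  # 64 kb/s here
    per_channel_new = per_channel_legacy / gain                 # 32 kb/s here
    return int(budget_kbps // per_channel_new)


if __name__ == "__main__":
    n = channels_in_budget(
        LEGACY_BITRATE_KBPS, LEGACY_BITRATE_KBPS, LEGACY_CHANNELS, EFFICIENCY_GAIN
    )
    print(n)  # 12, matching 7.1 (8 channels) plus 4 height channels
```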

Interactive sound is all about giving listeners the ability to control audio to their preferences as it is presented to them. With MPEG-H Audio, for example, broadcasters can give listeners choices between home-team- or away-team-oriented commentary for sporting events, between languages for dialog and commentary, and between sound effects of different sorts. Once the various selections are offered, control over the relative levels at which the different sounds are presented, within adjustment ranges set by broadcasters, also can be given to listeners.

In general, interactive sound is made possible through the transmission to receivers of Audio Objects, which essentially are single channels or groups of channels delivered in such a way that they can be mixed with the remainder of program audio content in the listener’s receiver, be that a television set, a set-top box, or a high-end digital processor for a home theater. Audio Objects can be Static (in fixed locations) or Dynamic (able to be moved around in their locations in the sound field). MPEG-H Audio provides a wealth of options that can be enabled by broadcasters – for example, to make listener selections among Static Audio Objects mutually exclusive between choices of languages or between choices of commentary, or to control the ranges over which relative presentation levels can be adjusted by listeners. Broadcasters also can establish “presets” or “presentations” among which listeners can choose to indirectly select sets of parameters that tailor the listening experience to their particular circumstances.
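The two broadcaster-side controls described above, mutually exclusive selections and bounded level adjustment, can be sketched as a small data model. This is a minimal illustration only: the class, field, and function names are hypothetical and do not reflect the metadata syntax defined in the MPEG-H Audio standard.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AudioObject:
    """Hypothetical model of one static audio object and its limits."""
    name: str
    switch_group: Optional[str]  # objects sharing a group are mutually exclusive
    min_gain_db: float           # lower bound set by the broadcaster
    max_gain_db: float           # upper bound set by the broadcaster
    active: bool = False
    gain_db: float = 0.0


def select(objects: List[AudioObject], name: str) -> None:
    """Activate one object and deactivate the others in its switch group."""
    chosen = next(o for o in objects if o.name == name)
    if chosen.switch_group is not None:
        for o in objects:
            if o.switch_group == chosen.switch_group:
                o.active = False
    chosen.active = True


def set_gain(obj: AudioObject, requested_db: float) -> float:
    """Clamp the listener's requested gain to the broadcaster's range."""
    obj.gain_db = max(obj.min_gain_db, min(obj.max_gain_db, requested_db))
    return obj.gain_db
```

In this sketch a listener who picks the away-team commentary automatically deselects the home-team commentary, and a request for +12 dB on an object limited to +6 dB is simply held at +6 dB.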

Immersive sound is to surround sound as surround sound was to stereo. Immersive sound in MPEG-H Audio provides an expanded sense of realism through improvements in the perception of directionality of sound sources. With MPEG-H Audio, not only can directionality be perceived in the plane of the listener’s ears but also in planes above and below the listener, depending on the number of channels that broadcasters choose to transmit. The increased sensation of directionality can be presented to listeners through conventional methods such as placing additional loudspeakers in the listening environment and through alternative methods such as use of “3D soundbars” surrounding displays to create three-dimensional sound images.

With MPEG-H Audio, it also is possible to correct for loudspeakers that are not positioned in “standard” locations and to render content for listening on headphones or earbuds while retaining the sense of directionality that can be achieved in larger spaces using arrays of loudspeakers. Another aspect of MPEG-H Audio is the ability for broadcasters to create “profiles” that address listening on particular types of devices or in particular environments (think listening on a portable device in a noisy subway station environment) and to deliver separate control information appropriate for simultaneous use of the same essence in each of the situations to which the profiles are targeted.

Changes from Current Equipment and Practice

Making advances of the sort described, naturally, will require some changes in the equipment currently in use and in some operating practices. With MPEG-H Audio, however, the changes required will be surprisingly small, especially when compared to the changes needed to move from stereo to surround sound, for example. To obtain the full benefits of MPEG-H Audio, two primary changes will be necessary: changing from the audio encoders and decoders currently in use to new MPEG-H Audio encoders and decoders, of course, and adding one piece of equipment for monitoring, rendering, and data authoring purposes at the input of each MPEG-H Audio encoder in the broadcast chain. As will be discussed below with respect to potential stages of implementation, it is possible to get started with just the new encoders and decoders and to have different levels of implementation at different locations along the chain (network operations center versus affiliate master control, for instance). An example of a recent demonstration system is shown in simplified form in Figure 1, which shows replacement encoders and decoders highlighted in yellow and added equipment in red.

Figure 1. MPEG-H signal flow block diagram

Of great significance to the progression from today’s surround sound to future interactive and immersive sound is the fact that MPEG-H Audio technology can be used wherever audio compression is needed in the broadcast chain. There no longer will be a need for one form of compression within the broadcast chain and a separate compression structure used to transmit content to consumers. This ubiquity of technology means that the same type of equipment can be used throughout the broadcast chain and that moving from encoded to PCM to encoded forms of representation of audio content can be accomplished readily. Such repeated concatenation of methods means that it is possible to carry metadata, as defined in the MPEG-H Audio standard, in the compressed audio domain and to carry the information needed to populate the metadata in a Control Track on an audio channel in the uncompressed domain within a broadcast plant.

The Control Track developed to facilitate use of MPEG-H Audio in broadcast plants provides many simplifications of operations relative to some of the workarounds that have been required for previous technologies. Data in the Control Track is synchronized with the video frame rate in use in such a way that switching between audio sources can be done on video frame boundaries without disruption to the audio. At the same time, audio frames remain at their optimum size so that coding efficiency is maintained. The ability to switch audio and the Control Track together in the PCM domain means that it becomes easy to transition between audio formats (e.g., from stereo to 7.1+4 channels to 5.1-channel surround + 4 objects). Such transitions can occur at such places as on the output of a master control switcher changing sources and in a standard video editor intercutting clips from different sources.
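One reason switching on video frame boundaries needs care is that audio coding frames and video frames generally have different durations. The sketch below works through that arithmetic for 48 kHz audio. The 1024-sample audio frame length is an assumption made for illustration, the function names are hypothetical, and the code says nothing about how the Control Track itself is formatted.

```python
from fractions import Fraction


def samples_per_video_frame(sample_rate_hz, frame_rate):
    """Exact number of audio samples spanned by one video frame."""
    return Fraction(sample_rate_hz) / Fraction(frame_rate)


def audio_frames_per_video_frame(sample_rate_hz, frame_rate, audio_frame_len):
    """Exact number of audio coding frames spanned by one video frame."""
    return samples_per_video_frame(sample_rate_hz, frame_rate) / audio_frame_len


if __name__ == "__main__":
    ntsc = Fraction(30000, 1001)  # the "29.97" frames-per-second rate
    # 8008/5, i.e. 1601.6 samples: not even a whole number of samples
    print(samples_per_video_frame(48000, ntsc))
    # 1001/640, i.e. about 1.56 audio frames per video frame
    print(audio_frames_per_video_frame(48000, ntsc, 1024))
    # At 25 frames per second the sample count is whole (1920), but it is
    # still not a multiple of the assumed 1024-sample audio frame.
    print(samples_per_video_frame(48000, 25))
```

Because the ratio is not an integer, audio frame boundaries drift relative to video frame boundaries, which is why a mechanism tied to the video frame rate is useful for clean switching.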

Stages of Implementation

Implementation of MPEG-H Audio can begin very simply, essentially duplicating the services provided today and using the same workflows. It can progress over time to offer more features to consumers as broadcasters make incremental additions of equipment and incremental changes to workflows. Even if a program source – for instance, a network – chooses to move to full implementation of all of the capabilities of MPEG-H Audio, a program distributor downstream from that program source – for instance, an affiliate – can choose to retain its operation with just the services and workflows that it offers currently.

One of the characteristics of the MPEG-H Audio standard is that it includes the technology for a renderer – a subsystem that can generate output in any form needed – as part of the standardized decoder. The renderer in a decoder can be set to produce output content that is downmixed from the source content. Thus, if a network chooses to produce a program in 7.1-channel surround plus 4 height channels plus 3 objects (7.1+4+3) and send that to its affiliates, some affiliates might choose to put the program on the air in that form. Other affiliates might choose to set the decoders on the inputs to their operations to always output in 5.1-channel surround only, and so the 7.1+4+3 program will be converted automatically to 5.1-channel surround when it arrives. Similar fixed settings can be used for parameters such as dialnorm, in the same way as many networks and stations operate currently.
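The idea of a decoder fixed to a smaller output layout can be illustrated with a simple static downmix. The sketch below folds only a 7.1 bed into 5.1 and ignores height channels and objects. The channel labels and the 1/√2 gain are illustrative assumptions, not the coefficients or the rendering algorithm specified for the MPEG-H Audio renderer.

```python
import math

# Assumed channel labels for a 7.1 bed and a 5.1 output (hypothetical naming).
IN_LAYOUT = ["L", "R", "C", "LFE", "Lss", "Rss", "Lrs", "Rrs"]
OUT_LAYOUT = ["L", "R", "C", "LFE", "Ls", "Rs"]

G = 1 / math.sqrt(2)  # illustrative gain that keeps two equal-power sources level

# Each output channel is a weighted sum of input channels.
DOWNMIX = {
    "L": {"L": 1.0},
    "R": {"R": 1.0},
    "C": {"C": 1.0},
    "LFE": {"LFE": 1.0},
    "Ls": {"Lss": G, "Lrs": G},  # side and rear left fold into one surround
    "Rs": {"Rss": G, "Rrs": G},  # side and rear right fold into one surround
}


def downmix_sample(frame):
    """Map one sample per input channel (dict) to one sample per output channel."""
    return {
        out: sum(gain * frame[src] for src, gain in sources.items())
        for out, sources in DOWNMIX.items()
    }
```

An affiliate that always outputs 5.1 would, in effect, leave one such mapping selected permanently, so that whatever layout arrives is converted without operator intervention.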

To take advantage of the sort of transition described for MPEG-H Audio, making it possible for different participants in the broadcast distribution chain to progress at their own paces, a four-stage process has been identified.

Stage 1 begins with replacement of just the encoders and decoders that exist today. Fixed settings are used for all parameters, as today, and the services offered are the same as today. The immediate benefit is a lower audio bitrate and better loudness control.
Stage 2 adds some static objects to the stereo or 5.1-channel surround bed that is delivered to consumers, potentially allowing them to select between objects representing dialog in different languages, different types of commentary, and the like and also potentially allowing them to adjust the relative audio levels of the objects. The extent to which consumers can make such choices and such adjustments is up to the broadcasters to control. Benefits of this stage are viewer interactivity, personalized choice of content, and personalized mix of audio elements.
Stage 3 involves adding height or overhead loudspeaker channels delivered to consumers, enabling true immersiveness of content. Benefits of Stage 3 are improved quality of the listening experience and parity with other presentation modalities such as cinema and Blu-ray discs.
Stage 4 adds Dynamic (moving) objects, providing the ability to track screen action with mono channels.

While the progression of MPEG-H Audio implementation has been described assuming moving successively through the stages, there is nothing to preclude broadcasters from choosing to start at Stage 1, then moving to Stage 3, Stage 2, and Stage 4. Other orders are possible. The important point is that a gradual transition is possible, with minimal investment but significant benefits from the start, followed by increasing attractiveness to consumers concomitant with gradual additions of equipment, changes in workflows, and the like.

About the author

S. Merrill Weiss is a consultant in electronic media technology and technology management with over 48 years of experience in broadcast operations and in designing facilities for radio and television stations, television networks, major studios, and similar organizations. He has spent over 38 years in developing technology for digital audio and video and in helping to write standards for its implementation, having produced the tests that led to the first digital television standard, upon which practically all subsequent digital video standards are based. He is a SMPTE Fellow, received the SMPTE David Sarnoff Gold Medal Award and the SMPTE Progress Medal, received the NAB Television Engineering Achievement Award and the ATSC Bernard J. Lechner Outstanding Contributor Award, received the IEEE Broadcast Technology Society Matti S. Siukola Award twice, and is recognized by the Society of Broadcast Engineers as a Certified Professional Broadcast Engineer. Weiss holds six patents for broadcast transmission technology and is a graduate of the Wharton School of the University of Pennsylvania.
