Audio Over IP - Why It Is Essential

The impact of IP on the design of broadcast equipment and infrastructures is profound. Many broadcasters are replacing existing analog, AES3, MADI and SDI ports with a new class of interface for connecting to standard IT switch infrastructure, together with new control mechanisms for connection management and device discovery. In the process, they’re embracing an emerging set of open standards for interoperable, vendor-neutral signal transport.

Internet Protocol (IP) technology is changing the broadcast industry in a revolutionary way, sweeping aside traditional approaches that relied on a mass of point-to-point cable connections, each restricted to transporting a given type of media/data between two locations, and in audio terms, limited to a relatively low channel count (typically a maximum of 64). These are being replaced with network connections that allow for higher audio channel counts and the ability to pass different types of media and data over the same connection.

In an IP-based operation, a device only needs one connection to a network to be able to send and receive audio and data to and from any other device on the network, rather than needing a direct connection with every other device. Not only is the number of cables massively reduced, but the cables themselves are shorter, since they only need to connect with the nearest switch, rather than running all the way to the piece of kit they connect with. Network cables are also cheap and readily available.

In this simple star network, multiple consoles are networked. Each console has its own dedicated I/O interface, handling a different audio output format.

The benefits of IP topology are many. As broadcasters replace their existing analog, AES3, MADI and SDI solutions with standard IT switch infrastructure and new control mechanisms for connection management and device discovery, the traditional distinctions between audio, video and data transports are disappearing, replaced by a single agnostic, scalable network. Gradually, old video and audio cabling is being replaced by networks carrying IP traffic, and standard IT switches are taking over the role of conventional audio and video routing equipment.

Next, broadcasters’ ambitions will extend beyond merely replicating existing working practices as confidence and competence grow, and workflows will evolve to take advantage of the greater flexibility and geographical freedom now available to them. And longer-term, the broadcast industry will further borrow from the IT industry by shifting away from bespoke hardware towards software-based processing running on commodity computing platforms. Broadcast processes that fit this model will benefit profoundly in terms of both scalability and economy.

In this tutorial, we’ll take a look at the evolution of IP standards and the key considerations for audio engineers as they approach the brave new world of audio over IP.

IP Basics for Audio

So why is IP asserting such influence in the broadcast and media industries? Perhaps the biggest reason is that it’s proven technology — used for many years by computers in the home, the office and around the world to connect over the internet via their Ethernet ports. Because IP is so widespread, it’s supported by commodity, low-cost hardware that’s easy to acquire and set up. Almost every device has an IP connection these days, giving it the ability to connect to other devices around the world. There are also many, many firmware and software libraries available that make it easy to develop on the IP platform. 

Sitting on top of the IP layers, the Transmission Control Protocol (TCP) is the means by which devices exchange packets of data in a reliable, organized and error-free fashion. TCP works well for control data, but it’s not appropriate for live streaming of media because it introduces delays for the error-checking and resending of packets. Therefore, live streaming uses an alternative transport mechanism called the User Datagram Protocol (UDP), by which the sender streams packets one after another without waiting for confirmation from the receiver.
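To make the contrast concrete, the Python sketch below (with a placeholder destination address) shows the fire-and-forget character of UDP: the sender pushes packets onto the wire one after another and never waits for an acknowledgment.

import socket

# Minimal sketch of UDP's fire-and-forget behaviour: the sender pushes
# packets onto the wire without waiting for acknowledgements, which is
# why live media favours UDP over TCP. Address and port are placeholders.
DEST = ("192.0.2.10", 5004)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for seq in range(1000):
    payload = seq.to_bytes(2, "big") + bytes(288)  # dummy audio payload
    sock.sendto(payload, DEST)  # returns immediately; no retransmission
sock.close()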

UDP boosts efficiency and prevents large buffers or delays before media is played out at the receiving end, and its partner, the Real-time Transport Protocol (RTP), tags each packet with a sequence number and timestamp so the receiver knows in what order they are to be played out. A further safeguard against dropped or corrupt packets is the SMPTE ST 2022-7 standard, which allows for the same media to be passed over two connections. A receiver takes in both, so if there is a missing or corrupt packet on one, it can take the same media data from the other.
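As an illustration of the ST 2022-7 idea, here is a simplified sketch of a receiver reconciling two redundant legs by RTP sequence number; the function name and packet layout are invented for the example, and a real receiver merges the legs packet-by-packet as they arrive rather than from lists.

# Sketch of SMPTE ST 2022-7-style seamless protection: two identical RTP
# streams arrive over separate paths, and the receiver keeps the first copy
# of each RTP sequence number, discarding the duplicate from the other leg.
def merge_redundant(primary_packets, secondary_packets):
    """Yield each packet exactly once, whichever leg delivered it.

    Packets are (rtp_sequence_number, payload) tuples; in a real receiver
    the two legs would be read from two sockets, not two lists.
    """
    seen = set()
    for seq, payload in sorted(primary_packets + secondary_packets):
        if seq not in seen:
            seen.add(seq)
            yield seq, payload

# Example: packet 2 is lost on the primary path but arrives on the backup.
primary = [(1, b"a"), (3, b"c")]
secondary = [(1, b"a"), (2, b"b"), (3, b"c")]
print(list(merge_redundant(primary, secondary)))  # all three packets recovered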

Audio signals have come a long way since the days when it was only possible to transport a single channel over a single cable. The AES3 standard changed that back in the 1980s, enabling the transport of a stereo pair with digital encoding to prevent degradation of the audio signal due to noise picked up on the cable. Next came Multichannel Audio Digital Interface (MADI), with the ability to transport 64 channels of uncompressed audio, in sync, over a single wire. MADI was a big step forward because it meant operations could get rid of large bundles of heavy multicore cabling, but it’s still a point-to-point connection, meaning that each piece of equipment needs a separate connection to each other device for sharing audio.

In the meantime, various audio networking technologies have emerged to increase channel counts and improve the flexibility of media-sharing between devices, while also providing breakthroughs in workflows and reducing setup and cabling requirements. Even better, growing numbers of manufacturers are adopting open formats based on existing IP protocols for media transport over networks using generic off-the-shelf switches, allowing media to coexist on the same networks as other types of data. In other words, these protocols enable devices from different manufacturers to be able to connect together via networks based on open standards. Ravenna, Livewire, Q-LAN, WheatNet and Dante are all popular audio networking technologies developed independently by different manufacturers, but they share an underlying commonality with the key IP protocols they employ. In fact, these IP protocols are the common denominator for audio sharing that has led to the development of AES67.

AES67 allows individual items to be connected to a network that primarily uses one protocol and could even allow a system to be made up of items all using different protocols. Image: Telos Alliance.

Enter AES67

Essentially, AES67 defines a common standard for all devices to stream media over a shared network, and for media from one manufacturer’s device to be received by another. Rather than defining new standards, AES67 describes how to use proven and reliable protocols and technologies from the IEEE, IETF and other standards bodies in an interoperable manner.

AES67 defines a minimum feature set. If two devices are AES67 compliant, then they can exchange audio streams with each other. Some devices will go beyond the minimum capabilities; for example, AES67 only mandates the transport of up to eight channels of audio within each stream. Audio professionals used to SDI appreciate being able to bundle 16 channels together, and MADI users are accustomed to 64, so it’s common to see AES67-compliant devices that support higher channel counts per stream.

The AES67 standard is sometimes misunderstood as the specification for how all professional digital audio gear is supposed to work and interconnect. Not exactly. In fact, AES67 simply defines the requirements for high-performance AoIP (Audio-over-IP) interoperability. A manufacturer can implement AES67 any way it wants, and there’s the rub.

Of course, low latency is another important feature for broadcast audio. Latency depends on a number of factors: the time it takes to transmit a data packet between two points, the number of switches the packet has to pass through on the way, and the length of audio (the packet time) carried within each packet. The lower the packet time, the sooner a packet can be sent. With a 1ms packet time, the transmitter has to wait until it has accumulated 1ms worth of audio before it can send, which adds an extra 1ms of latency to the audio signal. With smaller packet times, there is less holding back of audio before each packet can go out. AES67 defines several mandatory packet times, with 125 µs being preferred on a LAN to minimize the latency of the audio signal.
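A quick worked calculation makes the trade-off clear. Assuming 48kHz sampling, 24-bit samples and an eight-channel stream (AES67’s mandatory minimum), the packet time fixes both the payload size and the buffering delay:

# Worked example of packet time versus buffering latency for AES67-style
# streams. Sample rate, bit depth and channel count are illustrative values.
SAMPLE_RATE = 48_000          # samples per second per channel
CHANNELS = 8                  # AES67's mandatory minimum stream width
BYTES_PER_SAMPLE = 3          # 24-bit linear PCM (L24)

for packet_time_us in (1000, 250, 125):
    samples = SAMPLE_RATE * packet_time_us // 1_000_000
    payload = samples * CHANNELS * BYTES_PER_SAMPLE
    print(f"{packet_time_us:>5} us packet: {samples:>2} samples/ch, "
          f"{payload:>4}-byte payload, {packet_time_us / 1000:.3f} ms buffering")

At 1ms the transmitter must buffer 48 samples per channel before sending; at 125 µs it buffers only six.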

Professional-quality audio at 16 bits and 48kHz or higher requires a high-performance media network that can ensure the low latencies compatible with live sound. This level of performance is achievable on local-area networks but isn’t generally available on wide-area networks or the public Internet. With AES67, however, all compatible protocols are easily carried over Ethernet and routable over IP networks of all sizes.

Multicast vs. Unicast

Another important concept of IP audio is multicast, the standard means of transporting audio signals over a LAN. Unlike a unicast stream, which is a direct connection between two devices, multicast enables a single switch to forward the data on to multiple receivers. In other words, instead of sending the same audio to multiple places (as with unicast), a multicast transmitter only sends the data once and the IP network takes care of forwarding it on to multiple IP addresses.
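The sketch below shows what “choosing to receive” a multicast stream looks like at the socket level; the group address and port are placeholders, and a real AES67 receiver would go on to parse RTP from the packets it receives.

import socket
import struct

# Minimal sketch of a multicast receiver: rather than asking the sender for
# its own copy, the receiver joins a group address and the network forwards
# the single transmitted stream to it. Group and port are placeholders.
GROUP, PORT = "239.69.1.10", 5004

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))

# IGMP join: tell the network we want traffic addressed to GROUP.
mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

packet, sender = sock.recvfrom(2048)  # blocks until a packet arrives
print(f"received {len(packet)} bytes from {sender}")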

It’s easy to see why multicast is easier and more efficient when you consider the role of a typical A1 audio mixer. In a traditional live broadcast, the A1’s job is just as much about routing, and making sure the right feeds are going to the right places, as it is about making a sweet mix for a program. Using IP, it’s as simple as making all of your mixes available on multicast addresses, with labels to clearly identify each, and other users on the same network can see and choose to receive them. Likewise, all the mics, playout, and other sources can come in by IP, advertising themselves and their content for you to choose to receive. With IP simplifying the routing task, the production crew has more time to focus on making things sound great.

Timing and Quality of Service

By enabling synchronization of both audio and video over IP, the Precision Time Protocol (PTP) is fundamental to the reliability of AES67. PTP provides a time-of-day reference, ensuring packets of media are played out at the right time. PTP-aware Ethernet switches are able to adjust PTP messages to compensate for the latency imposed by the switch itself, keeping timing tight and accurate.

PTP is the new “black.” It is a more accurate counterpart to NTP, the Network Time Protocol, controlling the synchronization and timing of data packets in the IP environment. In the world of bits and bytes, locking each bit to a slice of time must be highly accurate.
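Underneath, PTP’s accuracy comes from a simple exchange of timestamped messages. A Sync message gives the slave timestamps t1 (master send) and t2 (slave receive); a Delay_Req gives t3 (slave send) and t4 (master receive). Assuming a symmetric path, the slave’s clock offset falls out of the arithmetic below; this is a sketch of the calculation only, not a PTP implementation.

# The arithmetic at the heart of PTP: a Sync message leaves the master at t1
# and arrives at the slave at t2; a Delay_Req leaves the slave at t3 and
# arrives at the master at t4. Assuming a symmetric path, the slave's clock
# offset and the one-way path delay fall out of the four timestamps.
def ptp_offset_and_delay(t1, t2, t3, t4):
    offset = ((t2 - t1) - (t4 - t3)) / 2   # how far the slave clock is ahead
    delay = ((t2 - t1) + (t4 - t3)) / 2    # mean one-way network delay
    return offset, delay

# Example: the slave clock runs 50 us ahead over a 10 us path (times in us).
print(ptp_offset_and_delay(t1=0, t2=60, t3=100, t4=60))  # -> (50.0, 10.0)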

PTP-aware switches are currently fairly expensive, but PTP awareness becomes critical as soon as the network begins to scale up. We expect the price of these switches to fall and we’re already seeing the emergence of reasonably cost-effective, yet high-performing switches like the ARG/Artel Quarra.

To stream AES67 successfully across a LAN, network switches need to support a quality-of-service (QoS) system that prioritizes traffic. The switches should be configured for DSCP (DiffServ) quality of service to ensure that PTP sync messages are given the highest priority, followed by the audio itself, and all other data after that.
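From the sending side, DSCP marking is just a header field; the switches must still be configured to act on it. The sketch below uses the default classes recommended by AES67 (EF for PTP, AF41 for media) to show how an application might mark its media packets.

import socket

# Sketch of DSCP marking from the sending application: the DSCP value sits
# in the top six bits of the IP TOS byte, so it is shifted left by two.
# DiffServ-configured switches then prioritise the marked packets.
DSCP_EF = 46     # Expedited Forwarding - AES67's default for PTP events
DSCP_AF41 = 34   # Assured Forwarding 41 - AES67's default for RTP media

media_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
media_sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, DSCP_AF41 << 2)
# Packets sent on media_sock now carry the AF41 marking in their IP header.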

Unpacking the New SMPTE IP Standards Suite

The potential of IP for broadcasting moved to a new level last September with the approval of the first standards within SMPTE ST-2110, Professional Media Over Managed IP Networks. The suite specifies the carriage, synchronization and description of separate elementary essence streams over professional IP networks in real time for live production, playout and other professional media applications.

The SMPTE ST-2110 standards enable traffic within and beyond a facility to be all-IP, allowing organizations to rely on a single common data-center infrastructure and giving broadcasters greater confidence in the interoperability of their equipment. With a unified approach to internal and external transport, workflows are simplified and consistent regardless of the physical location of equipment, and the number of encoders and decoders is reduced.

Whilst audio can be streamed as part of a video signal, such as that specified by ST-2022-6 (SDI over IP), ST-2110 looks towards “elemental” streaming in which audio, video and metadata are streamed separately, but over the same network and in perfect alignment through RTP and PTP. ST-2110 defines the use of AES67 for audio as part of its elemental streaming. This means the audio component and its metadata are easy to access and don’t require unpacking a video stream, simplifying tasks such as adding captions, subtitles and teletext, and the processing of multiple audio languages and types. Elemental streaming also removes the latency introduced by encoding and decoding, and it minimizes the overall amount of data being input or output by a given device.

Within SMPTE ST-2110, parts 30 and 31 are the most relevant for audio professionals. SMPTE ST 2110-30 defines AES67 as the audio transport format of the standard and requires its use for audio streaming. ST-2110 also goes beyond the base recommendations of AES67 with a few extras, such as the option to supply labels for the audio channels as part of a stream’s session description (SDP) file, and we can expect SMPTE to continue expanding on the standard.
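For reference, a representative session description for an AES67 stream might look like the following; all names and addresses are illustrative. The rtpmap line declares 24-bit linear PCM at 48kHz with eight channels, ptime declares the 125 µs packet time, and ts-refclk ties the stream to a specific PTP grandmaster.

v=0
o=- 1311738121 1311738121 IN IP4 192.0.2.10
s=Stage Box 1 : Mics 1-8
c=IN IP4 239.69.1.10/32
t=0 0
m=audio 5004 RTP/AVP 96
a=rtpmap:96 L24/48000/8
a=ptime:0.125
a=ts-refclk:ptp=IEEE1588-2008:00-1D-C1-FF-FE-12-34-56:0
a=mediaclk:direct=0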

SMPTE ST 2110-31 covers the transparent transport of AES3, including the individual status components of an AES3 signal, so formats that rely on those status bits, such as Dolby E, are maintained.

Looking Ahead - NMOS

AES67 defines the format of audio transport, but it does not define how different devices discover each other or how they connect. Currently, most AES67 implementations use Bonjour/mDNS for discovery, which allows devices to see the streams that other devices are transmitting. However, just because a device is AES67 compliant does not mean it will use Bonjour/mDNS. If two devices do not share a common discovery mechanism, then connecting the two is more manual – often requiring the user to copy the SDP from the transmitter and enter it on the receiver.

The AIMS agenda does not end with SMPTE ST-2110. The latest objective provides a common means of identifying and registering devices across all workflows and locations based on the Network Media Open Specifications (NMOS) IS-04 developed by the Advanced Media Workflow Association (AMWA).

Typically, connecting AES67 streams requires engineers to access each device’s web app for configuration. When a transmitter is configured on one device and the receiving device supports the same discovery mechanism, the operator should be able to log onto the second device and see the stream created on the first. If not, the receiving device has to be configured manually, based on the SDP copied from the transmitter. NMOS will revolutionize this process by defining discovery and connection management, allowing controllers to switch stream connections far more dynamically.

Eagerly awaited by the broadcast market, NMOS will greatly simplify the discovery and connection of both audio and video devices on an IP network. NMOS-based software applications and hardware switching panels are being developed that will let people connect streams between devices in a user-friendly way, some of them replicating familiar router interfaces and removing the “science fair” feel of setup and configuration. In effect, the network becomes the router.
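To give a flavor of what this looks like in practice, the sketch below queries an IS-04 registry for its available senders. The registry address is a placeholder, while the /x-nmos/query path and the label and manifest_href fields come from the AMWA IS-04 specification; each manifest_href points at exactly the SDP that today has to be copied by hand.

import requests

# Sketch of stream discovery against an NMOS IS-04 Query API. Each sender
# record advertises a human-readable label and a manifest_href, the URL of
# the SDP a receiver needs in order to take the stream.
REGISTRY = "http://registry.example.com"  # placeholder registry address

senders = requests.get(f"{REGISTRY}/x-nmos/query/v1.3/senders", timeout=5).json()
for sender in senders:
    print(f"{sender['label']}: SDP at {sender['manifest_href']}")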

Conclusion

The IP revolution is well underway, bringing unprecedented opportunities to broadcast operations. We’re headed into a world where broadcasters shift away from a conventional signal-based approach in favor of a services model; where content, both live and stored, may be discovered and accessed by anyone in possession of access rights and an appropriate IP connection, regardless of location. And now, many of the challenges and barriers to effective IP signal transport and management are being addressed by breakthroughs such as NMOS, bringing plug-and-play simplicity to operations and reducing the reliance on specialized engineers.

There’s lots of good news there for audio engineers, and the emergence of key standards will smooth the path – but it’s critical that broadcast engineers gain a keen understanding of AES67 and SMPTE ST 2110 as well as the nuts and bolts of configuring and managing IP switches.

Peter Walker, Product Manager, Calrec Audio.
