Standards: Part 10 - Embedding And Multiplexing Streams

Audio-visual content is constructed from several different media types. The simplest case is a single video stream and a single audio stream synchronized together. Additional complexity is commonplace, and it requires careful synchronization with accurate timing control.

This article is part of our growing series on Standards. There is an overview of all 26 articles in Part 1 - An Introduction To Standards.

There are several different kinds of streams involved:

  • Elementary (sometimes called essence) streams.
  • Program streams.
  • Transport streams.

These are described as streams, which is short for bitstreams. Each one is a sequence of bits, assembled into bytes and then combined into packets for transmission over the network. Elementary streams are combined in a nested fashion to create program streams, which are in turn combined to make transport streams for delivery.

Timing & Synchronization

The timing and synchronization of audio, video and optional metadata tracks is governed by the systems layer. This is fundamental. Timing and synchronization are relevant at several levels:

  • At the lowest level, audio and video need to be precisely synchronized. Video is locked on a frame-by-frame basis. Audio is synchronized at the sample level. It must be correct to within a few milliseconds to preserve lip-sync.
  • The program stream is referenced to the start time so that text assets and sub-titles can be delivered at the correct frame times. The accuracy should be better than half a second.
  • When programs are broadcast on-air, the time of broadcast is important so programs can be transmitted as described in the EPG. This is often just an approximation. Programs may be broadcast up to 10 minutes early or late and sometimes not at all.
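
The relationship between frame-level video timing and sample-level audio timing can be illustrated with some simple arithmetic. The following is a minimal sketch only: the function names are hypothetical, and the 5 millisecond tolerance is an illustrative stand-in for "a few milliseconds", not a figure taken from a standard.

```python
from fractions import Fraction


def samples_per_frame(sample_rate_hz: int, frame_rate: Fraction) -> Fraction:
    """Number of audio samples that span exactly one video frame."""
    return Fraction(sample_rate_hz) / frame_rate


def offset_ms(sample_offset: int, sample_rate_hz: int) -> float:
    """Convert an audio offset measured in samples into milliseconds."""
    return 1000.0 * sample_offset / sample_rate_hz


def within_lip_sync(sample_offset: int, sample_rate_hz: int,
                    tolerance_ms: float = 5.0) -> bool:
    """True if the audio offset is inside the (assumed) tolerance."""
    return abs(offset_ms(sample_offset, sample_rate_hz)) <= tolerance_ms


# 48 kHz audio against 25 fps video gives a whole number of samples per frame.
print(samples_per_frame(48_000, Fraction(25)))
# Against 30000/1001 fps the result is fractional, so it is kept as a Fraction.
print(samples_per_frame(48_000, Fraction(30_000, 1_001)))
```

Using `Fraction` avoids rounding drift when the frame rate is not an integer, which matters when offsets are accumulated over a long program.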

The relevant standards organizations are:

  • IEEE Precision Time Protocols. Used as a basis for most other standards.
  • AES standards for audio carriage.
  • SMPTE standards for timing and professional media over IP.
  • ISO MPEG standards for audio/visual content.
  • DVB broadcast transmission specifications.
  • ETSI detailed specs for DVB use.

These are the relevant specifications for timing:

Standard | Description
IEEE 1588-2002 | Precision Time Protocol (PTPv1).
IEEE 1588-2008 | Precision Time Protocol (PTPv2).
AES3 | Integral clock timing is embedded within the essence data.
AES11-2020 | Synchronization of digital audio equipment in studio operations.
AES-R11-2009 | Methods for the measurement of audio-video synchronization error.
AES67 | Uses IEEE 1588-2008 - PTPv2.
ST 12-1 | SMPTE Timecode.
ST 309 | SMPTE timing - Date values.
ST 2059-1 | Generation and Alignment of Interface Signals to the SMPTE Epoch.
ST 2059-2 | SMPTE Profile for use of IEEE 1588-2008 - PTPv2. Read this together with part 1 to understand the entire concept.
ST 2059-10 | Engineering guideline - Introduction to the ST 2059 Synchronization System.
RP 2059-15 | Recommended practice for using the YANG Data Model for ST 2059-2 PTP Device Monitoring in Professional Broadcast Applications.
ST 2110-10 | SMPTE Professional Media over Managed IP Networks: System Timing and Definitions. This uses IEEE 1588-2008 - PTPv2.
ISO 11172-1 | MPEG-1 Systems layer provides timing and synchronization and describes how to multiplex elementary streams together to create a single system stream.
ISO 13818-1 | MPEG-2 Systems layer adds to and enhances MPEG-1.
ISO 14496-1 | MPEG-4 Systems layer adds to and enhances MPEG-2.
DVB | The Time Date Table (TDT) and Time Offset Table (TOT) are embedded inside the broadcast signal.
EN 300 468 v1.3.1 | ETSI Specification for Service Information (SI) in DVB systems describes the TDT and TOT data embedded in the DVB transmissions.
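
As one concrete illustration of systems-layer timing, MPEG-2 Systems (ISO 13818-1) expresses presentation timestamps as counts of a 90 kHz clock held in a 33-bit field. That detail comes from the MPEG-2 specification rather than from this article, and the function names below are hypothetical. The sketch shows the conversion and why a timestamp comparison has to allow for wraparound.

```python
PTS_CLOCK_HZ = 90_000      # MPEG-2 presentation timestamp clock rate
PTS_MODULUS = 1 << 33      # timestamps are carried in a 33-bit field


def seconds_to_pts(seconds: float) -> int:
    """Convert a time in seconds to a wrapped 33-bit timestamp."""
    return round(seconds * PTS_CLOCK_HZ) % PTS_MODULUS


def pts_to_seconds(ticks: int) -> float:
    """Convert a tick count back into seconds."""
    return ticks / PTS_CLOCK_HZ


def pts_delta(later: int, earlier: int) -> int:
    """Ticks elapsed from `earlier` to `later`, assuming at most one wrap."""
    return (later - earlier) % PTS_MODULUS


# One 25 fps frame lasts 0.04 s, which is 3600 ticks of the 90 kHz clock.
print(seconds_to_pts(0.04))
# The counter wraps after roughly 26.5 hours of continuous running.
print(pts_to_seconds(PTS_MODULUS) / 3600)
```

The modulo arithmetic in `pts_delta` is the important part: a naive subtraction gives a large negative number when the counter has wrapped between the two timestamps.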


Elementary Streams

MPEG describes elementary streams as the output of an encoder. They only carry a single type of media:

  • Audio
  • Video
  • Sub-title text
  • URLs
  • Metadata
  • Control signals

Elementary streams are a sequence of bits. The codec specifications only describe how they should be decoded. This allows encoder developers to improve the compression algorithms. Thus, coding efficiency improves over time without revising the standard. The bitstreams are sliced into packets for transmission or storage. The term 'track' is used in place of 'stream' when the content is stored in a file.
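
The slicing of a bitstream into packets can be sketched in a few lines. This is a simplified illustration under assumed names: `packetize` and its `payload_size` parameter are hypothetical, and real packet formats add headers, timestamps and stuffing that are omitted here.

```python
from typing import List


def packetize(bitstream: bytes, payload_size: int) -> List[bytes]:
    """Slice a bitstream into consecutive payloads of at most payload_size bytes."""
    if payload_size <= 0:
        raise ValueError("payload_size must be positive")
    return [bitstream[i:i + payload_size]
            for i in range(0, len(bitstream), payload_size)]


def reassemble(packets: List[bytes]) -> bytes:
    """Concatenate payloads, in order, to recover the original bitstream."""
    return b"".join(packets)


packets = packetize(b"abcdefgh", 3)
print(packets)               # the last packet is shorter than the others
print(reassemble(packets))   # the original bytes are recovered unchanged
```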

Multiple Audio Channels

Audio complexity increases as more channels are introduced. Wikipedia describes 21 different surround-sound arrangements. The most common formats are:

Format | Details | Notation | Channels
Mono | A single channel on its own. | 1.0 | 1
Stereo | Separate left and right sound-mixes. | 2.0 | 2
Quadraphonic | A legacy format creating a surround effect. Sounds are placed left and right, front and back. | 4.0 | 4
Surround-sound | Sounds are placed left, right and middle at the front with additional left and right rear speakers. Plus, a non-directional single low-frequency channel. | 5.1 | 6
Early Dolby Atmos™ | Enhances the 5.1 arrangement to add four more channels at ceiling height. Two on each side. | 5.1.4 | 10
Dolby Atmos™ (high performance) | An advanced configuration places loudspeakers around the audience and in the ceiling above them. There are 11 full-range speakers, one low-frequency woofer and 8 overhead (height) speakers. | 11.1.8 | 20
NHK Super Hi-Vision | The sound-system for the 8K demonstrations in 2018 used multiple 7.1 surround systems (floor, wall and ceiling) with extra low-frequency support. | 22.2 | 24


Microsoft defines an informal specification for channel names that identifies 18 discrete sound sources within a multi-channel environment. It is described as KSAUDIO_CHANNEL_CONFIG and is also useful for non-Windows applications. Other naming conventions may apply in different environments. These individual sound sources are mixed down via a matrix when fewer channels are required. Mixing tools for spatially positioning events within the soundscape become more sophisticated as the number of channels increases.
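
A matrix mix-down can be sketched as one row of gains per output channel. The example below folds 5.1 down to stereo. The channel order and the gain of 1/√2 (about -3 dB) for the centre and surround channels are a commonly used convention assumed here for illustration, and the names are hypothetical. Actual delivery specifications may call for different coefficients.

```python
import math
from typing import List, Sequence

# Channel order assumed for this sketch: L, R, C, LFE, Ls, Rs
G = 1 / math.sqrt(2)  # roughly -3 dB

DOWNMIX_5_1_TO_STEREO = [
    # L    R    C   LFE  Ls   Rs
    [1.0, 0.0, G, 0.0, G, 0.0],   # left output
    [0.0, 1.0, G, 0.0, 0.0, G],   # right output
]


def downmix(frame: Sequence[float],
            matrix: Sequence[Sequence[float]]) -> List[float]:
    """Mix one multi-channel sample frame down through a gain matrix."""
    if any(len(row) != len(frame) for row in matrix):
        raise ValueError("each matrix row needs one gain per input channel")
    return [sum(gain * sample for gain, sample in zip(row, frame))
            for row in matrix]


# A sound present only in the centre channel lands equally in left and right.
print(downmix([0.0, 0.0, 1.0, 0.0, 0.0, 0.0], DOWNMIX_5_1_TO_STEREO))
```

In this sketch the LFE channel is simply discarded, which is one common choice. A production mixer would also guard against clipping after the gains are summed.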

Multiple Video Channels

Usually, only a single video channel is required. Emerging virtual-reality applications with stereoscopic vision require two. DVDs support multiple video angles that are selectable at playback. Some orchestral concerts offer different views for each section of the orchestra. These must be perfectly synchronized with each other and with the audio.

Program Streams

Embedding combines several streams of media to create a single higher-level stream. In its simplest form, packets of audio, video and timed-text are interleaved alternately. Audio requires much less space than video and intermittent text requires even less. The interleaving may not always be in a regular pattern. Surround-sound audio will be even more complex as there are several more audio streams to include.
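
One way to picture the interleaving is as a merge of several time-ordered packet queues into a single sequence ordered by timestamp. The sketch below assumes each elementary stream is already sorted by time. The `Packet` type, its fields and the sample timings are hypothetical, and a real multiplexer also has to account for decoder buffer models.

```python
import heapq
from typing import Iterable, List, NamedTuple


class Packet(NamedTuple):
    timestamp_ms: int   # presentation time of the packet
    stream: str         # which elementary stream it belongs to
    payload: bytes


def interleave(*streams: Iterable[Packet]) -> List[Packet]:
    """Merge time-ordered elementary streams into one time-ordered sequence."""
    return list(heapq.merge(*streams, key=lambda p: p.timestamp_ms))


video = [Packet(t, "video", b"") for t in (0, 40, 80)]
audio = [Packet(t, "audio", b"") for t in (0, 20, 40, 60)]
text = [Packet(50, "text", b"")]

for packet in interleave(video, audio, text):
    print(packet.timestamp_ms, packet.stream)
```

The output pattern is irregular, as the paragraph above suggests: audio packets arrive more often than video packets, and the text packet appears only once.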

All of the streams need to arrive at the same time so that downstream devices can maintain synchronization with minimal buffering.

[Figure: A program stream created from audio, video and timed text elementary streams.]

Synchronization is important and is described in delivery specifications:

  • The DPP requires that sound and vision synchronization markers at the start of the program are within 5 milliseconds.
  • Netflix mandates that separated audio files must match the duration of their companion video files to within 1 second.

Program streams are suitable for delivery via streaming services but need to be combined with several others for on-air broadcasting.

SDI Audio Embedding

SMPTE ST 2110-2x standards accommodate SDI conforming to the ST 292M standard as a source format. SDI has the capacity to carry up to 16 channels of audio depending on the sample size and frequency.

The SDI format is derived from classic analogue TV services, having a space where the video is blanked. Each horizontal line has a space at the start for ancillary data (HANC). Lines are reserved at the top and bottom of the frame for more ancillary data (VANC).

Digital audio is stored in the HANC space and is extracted for conversion to MPEG or ST 2110 compatible formats.

Constructing A Transport Stream

The individual elementary streams are combined to make a program stream. This represents a single TV channel in a broadcast scenario. Several program streams are combined with additional engineering and EPG metadata to construct a transport stream.

Here is the nested structure:
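
As a rough sketch, the nesting can be modelled with three small container types. The class names, fields and bitrates are hypothetical and exist only to show how elementary streams sit inside program streams, which in turn sit inside a transport stream. Engineering and EPG metadata are left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ElementaryStream:
    kind: str            # for example "video", "audio" or "subtitle"
    bitrate_kbps: int


@dataclass
class ProgramStream:
    name: str            # a single TV channel in a broadcast scenario
    elementary_streams: List[ElementaryStream] = field(default_factory=list)

    def total_kbps(self) -> int:
        return sum(es.bitrate_kbps for es in self.elementary_streams)


@dataclass
class TransportStream:
    name: str            # the multiplex carrying several channels
    programs: List[ProgramStream] = field(default_factory=list)

    def total_kbps(self) -> int:
        return sum(p.total_kbps() for p in self.programs)


channel_a = ProgramStream("Channel A", [
    ElementaryStream("video", 4000),
    ElementaryStream("audio", 256),
    ElementaryStream("subtitle", 16),
])
channel_b = ProgramStream("Channel B", [
    ElementaryStream("video", 3000),
    ElementaryStream("audio", 192),
])
mux = TransportStream("Example multiplex", [channel_a, channel_b])
print(mux.total_kbps())
```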

Over The Air (OTA) DVB Transport Streams

Terrestrial or satellite broadcasts combine multiple separate program streams (channels) into a transport stream. Terrestrial broadcasters call this a multiplex while satellite broadcasters describe it as a transponder. It might also be described as a bouquet. This is based on the Digital Video Broadcasting (DVB) standards.

The available frequency bands limit the DTT broadcasts to a half-dozen multiplexes. This is sufficient to deliver more than a hundred channels. The commercial DSat service carries more transponders and delivers approximately 300 channels. Any space that is too small to squeeze in another TV channel is used to carry radio broadcasts and data services.

Cable broadcasts were historically delivered using similar DVB standardized transports. They are rapidly migrating to a streamed over broadband IP service model.

[Figure: An example transport stream spanning 24 hours.]

Some channels are only available for part of the day but the EPG hides this and displays them as separate channels even though they use the same slot in the multiplex.

Simple Multiplexing

Program streams that are combined into a transport stream have no knowledge of each other or of how much bandwidth the others are consuming. A simple multiplexing scheme bets on their combined demand not exceeding 100% of the bandwidth. If several channels have a burst of activity simultaneously, the picture complexity increases and the available bandwidth is insufficient to carry the load. This leads to massive data loss and the picture breaks up. This is described as 'blowing up' or 'busting' the mux.
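
The bet made by a simple multiplexer can be expressed as a sum. In this minimal sketch the function name, the capacity and the channel bitrates are all hypothetical figures chosen to make the arithmetic easy to follow.

```python
from typing import Sequence


def mux_overflows(channel_kbps: Sequence[int], capacity_kbps: int) -> bool:
    """True when the combined channel demand exceeds the multiplex capacity."""
    return sum(channel_kbps) > capacity_kbps


CAPACITY_KBPS = 24_000  # assumed multiplex capacity for this example

# Six channels at their average rate exactly fill the multiplex.
average_load = [4_000] * 6
print(mux_overflows(average_load, CAPACITY_KBPS))

# Two channels burst to 6 Mbit/s at the same moment and the mux is busted.
burst_load = [6_000, 6_000, 4_000, 4_000, 4_000, 4_000]
print(mux_overflows(burst_load, CAPACITY_KBPS))
```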

Statistical Multiplexing

The available bandwidth can be used more effectively when several channels are transmitted together. Hardware compression in the head-end applies statistical multiplexing to avoid overloading the available capacity.

Statistical multiplexing allows individual channels to burst and use more than their average bandwidth at the expense of other channels. The Stat-mux adjusts the compression ratios to maintain a constant bitrate even when several channels are bursting simultaneously.
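
The idea can be sketched as sharing a fixed total bitrate in proportion to how demanding each channel currently is. This is a deliberately simplified model, not a description of any real stat-mux product: the function name, the per-channel floor and the complexity scores are hypothetical, and practical systems rely on continuous feedback between the encoders and the multiplexer.

```python
from typing import List, Sequence


def allocate_bitrates(complexities: Sequence[float],
                      capacity_kbps: float,
                      floor_kbps: float) -> List[float]:
    """Share a constant total bitrate between channels.

    Every channel receives a guaranteed floor. The remaining capacity is
    divided in proportion to each channel's current complexity score.
    """
    count = len(complexities)
    spare = capacity_kbps - count * floor_kbps
    if spare < 0:
        raise ValueError("capacity is too small for the guaranteed floors")
    total = sum(complexities)
    if total == 0:
        # No channel is demanding anything extra, so share the spare evenly.
        return [floor_kbps + spare / count for _ in complexities]
    return [floor_kbps + spare * c / total for c in complexities]


# The third channel is twice as complex as the others, so it bursts above
# its average share while the overall total remains constant.
shares = allocate_bitrates([1, 1, 2], capacity_kbps=12_000, floor_kbps=1_000)
print(shares, sum(shares))
```

Because the shares always sum to the fixed capacity, a burst on one channel is paid for by the others rather than by overflowing the multiplex.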

These are important caveats to bear in mind:

  • Statistical multiplexing for broadcasting can only be used with uncompressed (raw) source material. Consequently, this requires a slightly higher network capacity for delivery of content to the head-end.
  • Some channels are stored and re-broadcast one hour later. The chances of the original and the time-shifted copy bursting simultaneously are small.
  • When the same content is simulcast on several channels (for example BBC News), a statistical multiplexing approach is very helpful.

Timing and synchronization are critically important when creating program streams for delivery over IP or combining into broadcast transmissions.

Ensuring that broadcast bouquets of channels are deliverable within the available bandwidth requires careful management of the compression with a statistical multiplexor.
