Standards: Containers - About The MPEG-4 Standard

MPEG-4 is far more than just a codec or container — it’s a sprawling multimedia platform spanning 34 separate parts, from audio and video compression to interactive scene description and rights management. This guide maps the standard’s architecture, separates what succeeded from what didn’t, and untangles the patent landscape.

ISO 14496 – a.k.a. MPEG-4

The original scope of ISO 14496 – MPEG-4 was to improve the coding of audio and video as a replacement for MPEG-2. It has grown far beyond that to become a powerful multimedia platform for building interactive user experiences.

MPEG-4 is often described as a codec. The standard does include video coding specifications but there is more to MPEG-4 than that. MPEG-4 Part 2 introduced a codec that improved the compression ratio compared with MPEG-2. That codec didn’t gain much traction because the AVC codec described in Part 10 delivered even better performance. Later parts and other standards documents embellish the codec support.

Elsewhere you will find MPEG-4 described as a container. This is also true because parts 12, 14 and 15 describe the ISO Base Media File Format (ISOBMFF).

As covered in the previous chapter, the ISOBMFF storage file structure is based on Apple QuickTime movie containers. MPEG-4 inherits all the capabilities of QuickTime and therefore describes a sophisticated interactive multimedia platform for deployment on digital TV and the Internet. It can support interactive user experiences delivered over the Internet and via digital TV services from a common source that can potentially work everywhere.

A Menagerie Of MPEG-4 Parts

MPEG-4 currently comprises 34 individual parts with the possibility of more being introduced later. There are also occasional references to other MPEG standards.

The audio/visual coding and file storage specifications have been widely adopted by removable media, broadcast, streaming and World-Wide-Web communities. The rest have largely failed to gain any traction and have been supplanted by open standards alternatives.

Here is a list of all 34 parts of the ISO 14496 – MPEG-4 standard. The last three columns indicate the version and revisions:

  • Full – The most recent edition of the full standard.
  • Amd – Indicates the latest published amendment or corrigendum.
  • New – A pending new edition that may affect your development plans. These include prior updates.
Part  Content                                                             Full     Amd   New
1     Systems                                                             2010     2014  Pending
2     Visual                                                              2004     2013  -
3     Audio                                                               2019     -     -
4     Conformance testing                                                 2004     2019  -
5     Reference software                                                  2019     -     -
6     DMIF                                                                2000     -     -
7     Reference software for A/V Objects                                  2004     -     -
8     Carriage over IP                                                    2004     -     -
9     Reference hardware                                                  2009     -     -
10    AVC                                                                 2022     -     Pending
11    BIFS (Scene description)                                            2015     -     -
12    ISO Base File Format                                                2022     -     Pending
13    IPMP                                                                2004     -     -
14    MP4 file format                                                     2020     -     -
15    Carriage of NAL format video                                        2022     2022  Pending
16    AFX – Animation Framework                                           2011     2017  -
17    Streaming Text Format                                               2006     -     -
18    Font compression and streaming                                      2004     2014  -
19    Synthesized texture stream                                          2004     -     -
20    LASeR (Lightweight scene description)                               2008     2010  -
21    MPEG-J extension for rendering                                      2019     2023  Pending
22    Open font format                                                    2019     2023  Pending
23    Symbolic Music Representation (SMR)                                 2008     -     -
24    Audio and systems interaction                                       2008     -     Pending
25    3D Graphics Compression Model                                       2011     -     -
26    Audio conformance testing                                           2010     2018  Pending
27    3D Graphics conformance                                             2009     2015  -
28    Composite font representation                                       2012     2014  -
29    Web Video Coding (WVC)                                              2018     2022  -
30    Timed text and other visual overlays in ISO base media file format  2018     2022  -
31    Video coding for browsers                                           Deleted  -     -
32    File format reference software and conformance                      2021     -     Pending
33    Internet Video Coding (IVC)                                         2019     -     -
34    Bitstream Syntactic Description Language (SDL)                      -        -     Pending

This list was derived from the ISO standards online store which is the authoritative source. The dates will change as the developers publish further revisions.


Organizing The MPEG-4 Parts Into Categories

It helps to categorize the parts to see how they interoperate and enhance one another. These groupings are particularly important and have achieved significant traction:

  • Systems layer and DMIF.
  • Audio standards.
  • Video standards.
  • Containment.

The remainder are interesting but have not been so widely adopted:

  • Typography.
  • Text.
  • 3D & Graphics.
  • Multimedia binding and presentation.
  • Streaming.
  • Rights control.
  • Supporting technologies.

Profiles and levels can be applied to constrain the content coding. If a player supports the same profile and level as the encoder there is no need to implement the entire standard in all its parts.
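
The profile and level check can be sketched in a few lines. This is a deliberately simplified illustration: the `Capability` class and `can_play` function are hypothetical names, the profile must match exactly, and the level is treated as a plain number. Real decoders apply more detailed compatibility rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    """A profile name plus the highest level supported (player) or used (stream)."""
    profile: str
    level: float


def can_play(player: Capability, stream: Capability) -> bool:
    """Return True when the stream stays within the player's profile and level.

    Simplification for this sketch: the profile must match exactly and the
    stream's level must not exceed the player's level.
    """
    return player.profile == stream.profile and stream.level <= player.level


player = Capability(profile="Main", level=4.0)
print(can_play(player, Capability("Main", 3.1)))  # True
print(can_play(player, Capability("Main", 5.1)))  # False
print(can_play(player, Capability("High", 3.1)))  # False
```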

Systems Layer

MPEG-4 Part 1 describes the systems layer, much as the earlier MPEG-2 standard did. This provides a foundation to synchronize separate streams of media so they can be presented together. It identifies and organizes the elementary objects within the package, which are then multiplexed for delivery.

Delivery Multimedia Integration Framework (DMIF)

Part 6 of the standard describes the Delivery Multimedia Integration Framework (DMIF). This specifies two interfaces that separate the description of an object based interactive multimedia user experience from the transport mechanisms that deliver the content to the receiver. The DMIF Application Interface (DAI) separates the application from the FlexMux coded elements. Then the DMIF-Network Interface (DNI) separates the FlexMux from the physical transport layer.

The content is first encoded into the elementary streams. FlexMux then multiplexes them into a single package. The transport multiplexor then breaks the package into fragments for onward transmission.

Provided the transport mechanism supports DNI, any practical means can carry the content to the end user.

A receiving player accepts incoming transport packets, reassembles the package and hands it to the FlexMux, which demultiplexes the elementary streams and forwards them for decoding. The application layer then marshals the elementary objects and renders the user experience.

DMIF completely abstracts the packaging and transport layers and the MPEG-4 application layer is completely unaware of how the content is delivered. The symmetry of DMIF at both ends is very elegant.
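
The symmetry of multiplexing, fragmenting, reassembling and demultiplexing can be illustrated with a toy round trip. The framing used here (a 1-byte stream id followed by a 4-byte length) is invented for the sketch and is not the FlexMux syntax. The function names are hypothetical, and fragments are assumed to arrive complete and in order.

```python
from typing import Dict, List


def mux(streams: Dict[int, bytes]) -> bytes:
    """Multiplex elementary streams into one package.

    Invented framing: 1-byte stream id, 4-byte big-endian length, payload.
    """
    package = bytearray()
    for stream_id, payload in streams.items():
        package += stream_id.to_bytes(1, "big")
        package += len(payload).to_bytes(4, "big")
        package += payload
    return bytes(package)


def fragment(package: bytes, size: int) -> List[bytes]:
    """Split the package into transport fragments of at most `size` bytes."""
    return [package[i:i + size] for i in range(0, len(package), size)]


def reassemble(fragments: List[bytes]) -> bytes:
    """Rebuild the package from fragments received in order."""
    return b"".join(fragments)


def demux(package: bytes) -> Dict[int, bytes]:
    """Recover the elementary streams from a reassembled package."""
    streams = {}
    pos = 0
    while pos < len(package):
        stream_id = package[pos]
        length = int.from_bytes(package[pos + 1:pos + 5], "big")
        streams[stream_id] = package[pos + 5:pos + 5 + length]
        pos += 5 + length
    return streams


elementary = {1: b"video-frames", 2: b"audio-frames", 3: b"scene-description"}
fragments = fragment(mux(elementary), size=8)
received = demux(reassemble(fragments))
print(received == elementary)  # True
```

The application code only ever sees `elementary` and `received`. The fragment size and transport are free to change without affecting it, which is the point of the abstraction.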

Audio Coding

The audio support in Part 3 builds on earlier MPEG-1 and MPEG-2 standards, but where earlier MPEG Audio standards concentrated on general audio coding, MPEG-4 covers a much wider range of target applications.

It manages a collection of versatile codecs through a unified interface. The new codecs are sometimes a more efficient choice than AAC:

Category              Codecs
Lossy speech coding   HVXC and CELP.
General audio coding  AAC, TwinVQ and BSAC.
Lossless coding       MPEG-4 SLS, Audio Lossless Coding, MPEG-4 DST.
Text to speech        TTSI.
Structured Audio      SAOL, SASL and MIDI.
Audio Synthesis       Wavetable based, Sample based, Algorithmic and Effects.

It also adds support for spatialized sound.

Part 23 adds Symbolic Music Representation, which operates at a higher structural level than the Musical Instrument Digital Interface (MIDI) standard. It is perhaps closer to the domain of a sequencer.

Part 24 describes additional audio & systems interaction with file storage containers. Refer to Chapter 3-3 of this book for an in-depth examination of the AAC codec.

Building your own custom player for these special codecs might incur a patent licensing liability.

Video Coding

The video coding in Part 2 is superseded by the much-improved compression possible with AVC which is described in Part 10.

Quite early on, Part 2 supported non-rectangular alpha channel masked video which has now been added to Part 10. An auxiliary monochrome picture is decoded with the main image and the grey levels are used by the player as an alpha mask to proportionally mix the background with the decoded content overlay when rendering the video.
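
The proportional mix can be sketched as follows. This is a minimal illustration on a single plane of 8-bit samples, assuming that an alpha value of 255 means fully opaque overlay. The `blend` function is a hypothetical helper, not code from the standard, and a real player would also process the chroma planes.

```python
from typing import List

Plane = List[List[int]]  # rows of 8-bit samples in the range 0..255


def blend(background: Plane, overlay: Plane, alpha: Plane) -> Plane:
    """Mix overlay onto background, weighting each pixel by its grey-level alpha."""
    out = []
    for bg_row, ov_row, a_row in zip(background, overlay, alpha):
        out.append([
            # Weighted average with rounding: alpha 0 keeps the background,
            # alpha 255 keeps the overlay.
            (a * ov + (255 - a) * bg + 127) // 255
            for bg, ov, a in zip(bg_row, ov_row, a_row)
        ])
    return out


background = [[0, 0, 0]]
overlay = [[200, 200, 200]]
alpha = [[0, 128, 255]]
print(blend(background, overlay, alpha))  # [[0, 100, 200]]
```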

Part 29 – Web Video Coding (WVC) is an abridged version of the Constrained Baseline Profile covered in Part 10.

Part 31 – Video coding for browsers is withdrawn since it describes profiles already covered by Part 10 and was therefore redundant.

Part 33 – Internet Video Coding (IVC) was designed as a royalty-free codec based on older technologies whose patents had expired. New patent holders then emerged, which undermined that premise and rendered IVC irrelevant.

Parts 29, 31 and 33 largely duplicate existing specifications and have not been embraced by the industry.

Video coding has been a major success with the AVC codec described by part 10 and the newer and more efficient HEVC codec covered by ISO 23008 (MPEG-H). It is up to the encoder and player manufacturers to support the more exotic functionality where it is appropriate.

Refer to Chapter 2-2 for more information about the AVC codec. Chapter 2-3 discusses the HEVC codec.

Containment

Media container files are described initially in Part 12 of the standard, which covers the ISO Base Media File Format. Basic timing and synchronization of multiple element streams is described with metadata for managing the content.

Part 14 describes the version two enhancement to ISOBMFF for storing MPEG-4 content. This completely replaces the version one file format described in earlier editions of Part 1.

Part 15 extends the container specification to support HEVC video and other element types.

The ISOBMFF container is described in Chapter 4-2.
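
As a brief illustration of the container structure, an ISOBMFF file is a sequence of boxes, each starting with a 32-bit big-endian size and a four-character type. The sketch below lists the top-level boxes of a stream. The `iter_boxes` and `make_box` helpers are hypothetical names, and the sample bytes are synthetic rather than taken from a real file.

```python
import io
import struct
from typing import BinaryIO, Iterator, Tuple


def iter_boxes(stream: BinaryIO) -> Iterator[Tuple[str, int]]:
    """Yield (type, total size in bytes) for each top-level box in the stream."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return
        size, box_type = struct.unpack(">I4s", header)
        name = box_type.decode("ascii", "replace")
        header_len = 8
        if size == 1:
            # A size of 1 means a 64-bit size follows the type field.
            size = struct.unpack(">Q", stream.read(8))[0]
            header_len = 16
        elif size == 0:
            # A size of 0 means the box runs to the end of the file.
            yield name, header_len + len(stream.read())
            return
        if size < header_len:
            raise ValueError("box size is smaller than its header")
        yield name, size
        stream.seek(size - header_len, io.SEEK_CUR)  # skip the box payload


def make_box(box_type: bytes, payload: bytes) -> bytes:
    """Build a synthetic box for the demonstration."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


sample = (
    make_box(b"ftyp", b"isom\x00\x00\x02\x00isomiso2")
    + make_box(b"free", b"")
    + make_box(b"mdat", b"\x00" * 10)
)
print(list(iter_boxes(io.BytesIO(sample))))
# [('ftyp', 24), ('free', 8), ('mdat', 18)]
```

To inspect a real file, pass a handle opened in binary mode, for example `open(path, "rb")`.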

Typography

The typography support in MPEG-4 delivers custom fonts whose metrics should then be identical on all platforms.

Part 18 deals with the compression of font data so it can be delivered more efficiently. Part 22 includes these font specifications and references:

  • TrueType fonts.
  • PostScript fonts.
  • Open Font Format (OFF) that combines aspects of TrueType and PostScript fonts.
  • Compact Font Format (CFF).

Part 28 describes composite font packages that virtualize multiple different font types into a single consistent font description. It also adds support for Unicode. This increases the number of available glyphs for displaying different languages, symbols and emoji pictograms.

Timed Text Fragments

Part 17 describes how text fragments can be packaged, streamed and synchronized to the timebase controlled by the Part 1 systems layer. Each fragment of text is called a Timed Text Unit (TTU) with a specific time-stamp and associated payload. Very low bit-rates are possible when multiple timed text fragments are combined into a single delivery packet. This allows text to be efficiently downloaded to mobile devices for applications such as Karaoke.

The format of the payload is left to the application to define. This might conform to other standards or be a proprietary design.

Part 30 describes time-synchronized text storage in ISOBMFF files. This includes W3C specified WebVTT text tracks for subtitle support in web video players. These will trigger a JavaScript event in your web browser that handles the content of each timed-text fragment to make decisions based on the content.


This technique was actually supported as far back as 1999 in proprietary RealVideo streams, where an embedded JavaScript: URL could trigger dynamic page changes. The newer WebVTT support is more flexible: multiple tracks can carry different languages, and control signaling can alter the viewing experience in the player (aspect ratio switching, for example).
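
To show what a player has to work with, here is a minimal sketch that extracts cue timings and payloads from a WebVTT document. It handles only the basic `start --> end` timing line and ignores cue settings, styles and regions. The `parse_webvtt` function and the `aspect-ratio: 16:9` control payload are hypothetical examples, not part of any specification.

```python
import re
from typing import List, NamedTuple


class Cue(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    text: str


# Timestamps are hh:mm:ss.ttt, with the hours field optional.
STAMP = r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
TIMING = re.compile(STAMP + r"\s+-->\s+" + STAMP)


def _seconds(h, m, s, ms) -> float:
    return int(h or 0) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_webvtt(document: str) -> List[Cue]:
    """Return the cues found in a WebVTT document, in file order."""
    cues = []
    blocks = re.split(r"\n{2,}", document.replace("\r\n", "\n").strip())
    for block in blocks:
        lines = block.split("\n")
        for index, line in enumerate(lines):
            match = TIMING.match(line)
            if match:
                g = match.groups()
                payload = "\n".join(lines[index + 1:])
                cues.append(Cue(_seconds(*g[:4]), _seconds(*g[4:]), payload))
                break  # any line before the timing line is a cue identifier
    return cues


sample = "\n".join([
    "WEBVTT",
    "",
    "00:00:01.000 --> 00:00:04.000",
    "Hello",
    "",
    "intro",
    "00:01:05.500 --> 00:01:07.250",
    "aspect-ratio: 16:9",
])
cues = parse_webvtt(sample)
for cue in cues:
    print(cue.start, cue.end, cue.text)
```

In a browser the equivalent timing information arrives through text track cue events. The sketch only illustrates the data each cue carries.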


Refer to Chapter 5-6 in the ST 2110 section for additional coverage of the context for applying timed-text. It is also highly relevant to the design of client-side video players. 

3D & Graphics

Parts 19 and 25 describe synthesized textures and 3D graphics compression. These may be useful in Metaverse applications. They might be replaced by, or absorbed into, the point-cloud work being developed as component parts of ISO standard 23090 (MPEG-I).

A Java based graphics sub-system (MPEG-J) is described in Part 21.

The AFX Animation Framework is covered by Part 16.

Multimedia Binding & User Interface Presentation

In 2000, interactive TV and multimedia were a growing and popular trend. The MPEG-4 platform looked as if it would be the pre-eminent technology choice.

Part 11 describes a Binary Format for Scenes (BIFS) which is based on much earlier work to describe Web-based VR user-experiences with the VRML markup language.

The BIFS specification was followed by Part 20, which described the Lightweight Application Scene Representation (LASeR). This is a binary representation of Scalable Vector Graphic (SVG) content.

These two technologies offered the potential for very advanced interactive user experiences. More than 20 years later, we would now implement interactive user experience with open and patent-free web standards. HTML5 combined with CSS and JavaScript is the optimum and affordable way to build portable interactivity.

Streaming

Part 8 describes the carriage of MPEG-4 packages over IP networks. There are useful guidelines for designing Real Time Transport Protocol (RTP) payloads. Security and multicasting to deliver the same content to multiple receiving clients are also discussed.

The use of Session Description Protocol (SDP) when implementing low-latency transport for Voice over IP (VoIP) and video conferencing systems is also described.

Appropriate media type identifier specifications are included. Defining the right media type when deploying content on the Internet invokes the appropriate handling mechanisms for correct playback in the receiver.
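
As an illustration, a server typically labels MP4 content with one of the registered media types `video/mp4`, `audio/mp4` or `application/mp4`, optionally followed by a `codecs` parameter that names the coded streams inside. The sketch below builds such a string. The `mp4_media_type` helper is a hypothetical name, and the codec identifiers shown are commonly used examples for AVC and AAC rather than values taken from Part 8.

```python
from typing import Sequence


def mp4_media_type(kind: str, codecs: Sequence[str] = ()) -> str:
    """Build a media type string for MP4 content, with an optional codecs list."""
    if kind not in ("video", "audio", "application"):
        raise ValueError(f"unsupported top-level type: {kind}")
    media_type = f"{kind}/mp4"
    if codecs:
        # The codecs parameter is a quoted, comma-separated list.
        media_type += '; codecs="' + ", ".join(codecs) + '"'
    return media_type


print(mp4_media_type("video", ["avc1.42E01E", "mp4a.40.2"]))
# video/mp4; codecs="avc1.42E01E, mp4a.40.2"
print(mp4_media_type("audio"))
# audio/mp4
```

A receiver can use the resulting string to decide whether it is able to play the content before downloading any media.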

Refer to Chapter 6-4 for further discussion on how streaming protocols work. The coverage of ST 2110 in Section 5 is also relevant.

Rights Control

MPEG-4 Part 13 describes the Intellectual Property Management and Protection (IPMP) tools for embedding rights controlling metadata inside the bitstream. This hooks into a rights management system to control access to the content for subscription based services.

Consult ISO 21000 (MPEG-21) for additional material related to this topic.

Supporting Technologies

MPEG-4 Part 34 covers a high-level Syntactic Description Language (SDL) for describing bitstream content. Many MPEG standards use this abstraction to specify how coded video and audio are formatted for transmission.


Coding standards only describe the content that a player is expected to support. The encoding strategy is left for innovators to develop and improve as they create ever more intelligent and efficient coding tools.


Reference software (Parts 5 and 7) and hardware (Part 9) implementations are provided for developers as a starting point. Part 5 is patent encumbered and might be subject to a license fee but parts 7 and 9 are patent free.

Conformance testing is covered in Parts 4, 26 and 27, which describe how to check that your implementation is standards compliant. These techniques are generally applicable to other standards. Part 4 describes generic concepts while Parts 26 and 27 focus on audio and 3D graphics. Parts 4 and 27 use patented technologies for which a license fee may be due.

Can MPEG-4 Compete Against Open-Standards?

The fortunes of MPEG-4 have largely been dictated by patent license fees. Those parts of MPEG-4 that have been adopted widely have been very successful (AVC & AAC). The rest have been less popular and might never gain significant traction.

Newer open-standards-based alternatives are commercially more attractive and well supported by most web browsers. As HTML-5 has become more capable, audio-visual playback is supported as a first-class citizen and no longer needs plug-in support. Adding CSS3, JavaScript, SVG and Unicode gives you all the tools you need to build sophisticated user experiences.

Licensing & Patents In MPEG-4

Read carefully where a standard describes whether patents apply to its use. It is usually very near the front. The authors may not have known which patents apply when the standard was drafted and the descriptions may be incomplete. It is worth the time and effort to research this before committing to the use of a standard to avoid license fees later on.

Details of the patents known to ISO are described in a downloadable spreadsheet but there may be other patents that they do not know about until the owner declares their interest in the standards:

https://www.iso.org/patents

This list of MPEG-4 parts indicates whether they are encumbered by patents.

The list will evolve because patents don’t last forever (typically 20 years) and submarine patents do not surface right away. Almost all of the patents that encumber MPEG-4 Part 2 (Visual) have expired. The patents on AVC will take a little longer but will all have expired by 2030. HEVC will take longer still but eventually will be patent free.

Part  Content                                                             Patents
1     Systems                                                             Yes
2     Visual                                                              Yes
3     Audio                                                               Yes
4     Conformance Testing                                                 Yes
5     Reference Software                                                  Yes
6     DMIF                                                                Yes
7     Reference Software for A/V Objects                                  No
8     Carriage over IP                                                    No
9     Reference Hardware                                                  No
10    AVC                                                                 Yes
11    BIFS (Scene description)                                            Yes
12    ISO Base File Format                                                Yes
13    IPMP                                                                No
14    MP4 File Format                                                     Yes
15    Carriage of NAL format video (Formerly AVC File Format)             Yes
16    AFX – Animation Framework                                           Yes
17    Streaming Text Format                                               No
18    Font Compression and streaming                                      Yes
19    Synthesized texture stream                                          Yes
20    LASeR (Lightweight scene description)                               Yes
21    MPEG-J extension for rendering                                      No
22    Open font format                                                    Yes
23    Symbolic Music Representation (SMR)                                 No
24    Audio and systems interaction                                       No
25    3D Graphics Compression Model                                       Yes
26    Audio Conformance                                                   No
27    3D Graphics conformance                                             Yes
28    Composite font representation                                       No
29    Web video coding                                                    Yes
30    Timed text and other visual overlays in ISO base media file format  No
31    Video coding for browsers                                           Yes
32    File format reference software and conformance                      No
33    Internet video coding                                               Yes
34    Syntactic description language                                      No

Publicly Available Standards

Once a standard is published, you can purchase it from the ISO online store. Amendments and corrigenda will need to be purchased separately. Some Publicly Available Standards (PAS) are available free of charge. Download them from the ISO webstore and search using the key ‘PAS’:

https://www.iso.org/store.html

The following MPEG-4 parts are available under the PAS scheme:

Part  Title
4     Conformance testing bitstreams.
5     Reference software.
7     Reference Software for A/V Objects.
10    AVC.
20    LASeR (Lightweight scene description).
22    Open font format.
26    Audio Conformance testing.
27    3D Graphics conformance testing.
28    Composite font representation.
32    File format reference software and conformance.

These further entries carry part numbers and titles that correspond to the multimedia application formats of ISO/IEC 23000 (MPEG-A) rather than to MPEG-4, whose Parts 11 to 19 cover different topics:

Part  Title
11    Stereoscopic video.
12    Interactive music.
13    Augmented reality.
15    Multimedia preservation.
16    Publish/Subscribe.
17    Multiple sensorial media.
18    Media linking.
19    Common media (CMAF) for segmented media.

PAS means that the standards documents are freely available but this does not mean they are free of patent encumbrances. You may still need to pay license fees.
