Standards: Containers - About The MPEG-4 Standard
MPEG-4 is far more than just a codec or container — it’s a sprawling multimedia platform spanning 34 separate parts, from audio and video compression to interactive scene description and rights management. This guide maps the standard’s architecture, separates what succeeded from what didn’t, and untangles the patent landscape.
ISO 14496 – a.k.a. MPEG-4
ISO 14496 (MPEG-4) was originally intended to improve the coding of audio and video as a replacement for MPEG-2. It has grown far beyond that to become a powerful multimedia platform for building interactive user experiences.
MPEG-4 is often described as a codec. The standard does include video coding specifications, but there is more to MPEG-4 than that. MPEG-4 Part 2 introduced a codec with a better compression ratio than MPEG-2. That codec gained limited traction because the AVC codec described in Part 10 delivered even better performance. Later parts and other standards documents extend the codec support.
Elsewhere you will find MPEG-4 described as a container. This is also true because Part 12 defines the ISO Base Media File Format (ISOBMFF), and Parts 14 and 15 build on it.
As covered in the previous chapter, the ISOBMFF storage file structure is based on the Apple QuickTime movie container. Building on the capabilities of QuickTime, MPEG-4 describes a sophisticated interactive multimedia platform. Interactive user experiences can be delivered over the Internet and via digital TV services from a common source that can potentially work everywhere.
A Menagerie Of MPEG-4 Parts
MPEG-4 currently comprises 34 individual parts with the possibility of more being introduced later. There are also occasional references to other MPEG standards.
The audio/visual coding and file storage specifications have been widely adopted by the removable media, broadcast, streaming and World Wide Web communities. The rest have largely failed to gain any traction and have been supplanted by open-standards alternatives.
Here is a list of all 34 parts of the ISO 14496 – MPEG-4 standard. The last three columns indicate the version and revisions:
- Full – The most recent edition of the full standard.
- Amd – Indicates the latest published amendment or corrigendum.
- New – A pending new edition that may affect your development plans. These include prior updates.
| Part | Content | Full | Amd | New |
|---|---|---|---|---|
| 1 | Systems | 2010 | 2014 | Pending |
| 2 | Visual | 2004 | 2013 | |
| 3 | Audio | 2019 | | |
| 4 | Conformance testing | 2004 | 2019 | |
| 5 | Reference software | 2019 | | |
| 6 | DMIF | 2000 | | |
| 7 | Reference software for A/V Objects | 2004 | | |
| 8 | Carriage over IP | 2004 | | |
| 9 | Reference hardware | 2009 | | |
| 10 | AVC | 2022 | Pending | |
| 11 | BIFS (Scene description) | 2015 | | |
| 12 | ISO Base Media File Format | 2022 | Pending | |
| 13 | IPMP | 2004 | | |
| 14 | MP4 file format | 2020 | | |
| 15 | Carriage of NAL format video | 2022 | 2022 | Pending |
| 16 | AFX – Animation Framework | 2011 | 2017 | |
| 17 | Streaming Text Format | 2006 | | |
| 18 | Font compression and streaming | 2004 | 2014 | |
| 19 | Synthesized texture stream | 2004 | | |
| 20 | LASeR (Lightweight scene description) | 2008 | 2010 | |
| 21 | MPEG-J extension for rendering | 2019 | 2023 | Pending |
| 22 | Open font format | 2019 | 2023 | Pending |
| 23 | Symbolic Music Representation (SMR) | 2008 | | |
| 24 | Audio and systems interaction | 2008 | Pending | |
| 25 | 3D Graphics Compression Model | 2011 | | |
| 26 | Audio conformance testing | 2010 | 2018 | Pending |
| 27 | 3D Graphics conformance | 2009 | 2015 | |
| 28 | Composite font representation | 2012 | 2014 | |
| 29 | Web Video Coding (WVC) | 2018 | 2022 | |
| 30 | Timed text and other visual overlays in ISO base media file format | 2018 | 2022 | |
| 31 | Video coding for browsers | Deleted | | |
| 32 | File format reference software and conformance | 2021 | Pending | |
| 33 | Internet Video Coding (IVC) | 2019 | | |
| 34 | Bitstream Syntactic Description Language (SDL) | Pending | | |
This list was derived from the ISO standards online store which is the authoritative source. The dates will change as the developers publish further revisions.
Organizing The MPEG-4 Parts Into Categories
It helps to categorize the parts to see how they interoperate and enhance one another. These groupings are particularly important and have achieved significant traction:
- Systems layer and DMIF.
- Audio standards.
- Video standards.
- Containment.
The remainder are interesting but have not been so widely adopted:
- Typography.
- Text.
- 3D & Graphics.
- Multimedia binding and presentation.
- Streaming.
- Rights control.
- Supporting technologies.
Profiles and levels can be applied to constrain the content coding. A player only needs to support the profile and level used by the encoder; it does not need to implement the entire standard in all its parts.
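The profile and level idea can be illustrated with a small capability check. This is only a sketch: the `Capability` class, the `can_play` function and the profile names are hypothetical, and levels are stored as integers (level 3.1 becomes 31) purely for easy comparison. Real codecs define their own, more detailed compatibility rules.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Capability:
    """One profile a player supports, with the highest level it can decode."""
    profile: str    # illustrative label such as "Main"
    max_level: int  # level multiplied by 10, so level 4.0 is stored as 40


def can_play(profile: str, level: int, capabilities: Iterable[Capability]) -> bool:
    """Return True if any declared capability covers the stream."""
    return any(
        cap.profile == profile and level <= cap.max_level
        for cap in capabilities
    )


# A hypothetical player that supports two profiles.
player = [Capability("Baseline", 31), Capability("Main", 40)]

print(can_play("Main", 31, player))  # profile supported, level within range
print(can_play("High", 40, player))  # profile not supported by this player
```

The player declares only what it implements, and the check never needs to know about any other part of the standard.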
Systems Layer
Like the earlier MPEG-2 standard, MPEG-4 defines a systems layer, which is described in Part 1. This provides a foundation to synchronize separate streams of media so they can be presented together. It identifies and organizes the elementary objects within the package which are then multiplexed for delivery.
Delivery Multimedia Integration Framework (DMIF)
Part 6 of the standard describes the Delivery Multimedia Integration Framework (DMIF). This specifies two interfaces that separate the description of an object-based interactive multimedia user experience from the transport mechanisms that deliver the content to the receiver. The DMIF Application Interface (DAI) separates the application from the FlexMux coded elements. The DMIF-Network Interface (DNI) then separates the FlexMux from the physical transport layer.
The content is first encoded into elementary streams. FlexMux then multiplexes them into a single package. The transport multiplexer then breaks the package into fragments for onward transmission.
Provided the transport mechanism supports DNI, any practical means can carry the content to the end user.
A receiving player accepts incoming transport packets, reassembles the package and hands it to the FlexMux, which demultiplexes the elementary streams and forwards them for decoding. The application layer then marshals the elementary objects and renders the user experience.
DMIF abstracts the packaging and transport layers so that the MPEG-4 application layer is unaware of how the content is delivered. The symmetry of DMIF at both ends is very elegant.
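The layering described above can be sketched as a toy pipeline. This is not the real FlexMux or DNI syntax, which the standard defines in detail. The record layout (1-byte stream id, 2-byte length, payload), the fragment numbering and all function names are assumptions made up for illustration.

```python
import struct
from typing import Dict, List, Tuple


def flexmux(streams: Dict[int, bytes]) -> bytes:
    """Multiplex elementary streams into one package (toy record layout)."""
    package = b""
    for stream_id, payload in streams.items():
        package += struct.pack(">BH", stream_id, len(payload)) + payload
    return package


def fragment(package: bytes, size: int) -> List[Tuple[int, bytes]]:
    """Break the package into numbered fragments for the transport layer."""
    return [(seq, package[start:start + size])
            for seq, start in enumerate(range(0, len(package), size))]


def reassemble(fragments: List[Tuple[int, bytes]]) -> bytes:
    """Put fragments back in sequence order and join them."""
    ordered = sorted(fragments, key=lambda item: item[0])
    return b"".join(chunk for _, chunk in ordered)


def demux(package: bytes) -> Dict[int, bytes]:
    """Recover the elementary streams from the reassembled package."""
    streams: Dict[int, bytes] = {}
    offset = 0
    while offset < len(package):
        stream_id, length = struct.unpack_from(">BH", package, offset)
        offset += 3  # size of the toy record header
        streams[stream_id] = package[offset:offset + length]
        offset += length
    return streams


sent = {1: b"video-data", 2: b"audio-data"}
packets = fragment(flexmux(sent), 8)
packets.reverse()  # simulate out-of-order arrival
received = demux(reassemble(packets))
print(received == sent)
```

The sender and receiver mirror one another, and neither `flexmux` nor `demux` knows how the fragments travelled, which is the separation DMIF is designed to provide.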
Audio Coding
The audio support in Part 3 builds on earlier MPEG-1 and MPEG-2 standards, but where earlier MPEG Audio standards concentrated on general audio coding, MPEG-4 covers a much wider range of target applications.
It manages a collection of versatile codecs through a unified interface. The new codecs are sometimes a more efficient choice than AAC:
| Category | Codecs |
|---|---|
| Lossy speech coding | HVXC and CELP. |
| General audio coding | AAC, TwinVQ and BSAC. |
| Lossless coding | MPEG-4 SLS, Audio Lossless Coding, MPEG-4 DST. |
| Text to speech | TTSI. |
| Structured Audio | SAOL, SASL and MIDI. |
| Audio Synthesis | Wavetable based, Sample based, Algorithmic and Effects. |
It also adds support for spatialized sound.
Part 23 adds Symbolic Music Representation, which operates at a higher structural level than the Musical Instrument Digital Interface (MIDI) standard and is perhaps closer to the domain of a sequencer.
Part 24 describes additional audio & systems interaction with file storage containers. Refer to Chapter 3-3 of this book for an in-depth examination of the AAC codec.
Building your own custom player for these special codecs might incur a patent licensing liability.
Video Coding
The video coding in Part 2 is superseded by the much-improved compression possible with AVC which is described in Part 10.
Quite early on, Part 2 supported non-rectangular alpha channel masked video which has now been added to Part 10. An auxiliary monochrome picture is decoded with the main image and the grey levels are used by the player as an alpha mask to proportionally mix the background with the decoded content overlay when rendering the video.
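The proportional mix described above is conventional alpha blending. The sketch below assumes 8-bit samples and integer rounding, which are illustrative choices rather than anything taken from the standard. The function names are hypothetical.

```python
from typing import List, Sequence


def blend_sample(foreground: int, background: int, alpha: int) -> int:
    """Mix two 8-bit samples. Alpha 0 shows only the background,
    alpha 255 shows only the decoded foreground."""
    return (foreground * alpha + background * (255 - alpha) + 127) // 255


def composite_row(fg: Sequence[int], bg: Sequence[int],
                  mask: Sequence[int]) -> List[int]:
    """Apply the grey-level mask sample by sample across one row."""
    return [blend_sample(f, b, a) for f, b, a in zip(fg, bg, mask)]


# One row of three samples: fully transparent, half mixed, fully opaque.
print(composite_row([200, 200, 200], [100, 100, 100], [0, 128, 255]))
```

Because the mask is an ordinary monochrome picture, soft edges come for free: intermediate grey levels simply produce intermediate mixes.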
Part 29 – Web Video Coding (WVC) is an abridged version of the Constrained Baseline Profile covered in Part 10.
Part 31 – Video coding for browsers is withdrawn since it describes profiles already covered by Part 10 and was therefore redundant.
Part 33 – Internet Video Coding (IVC) was intended to be a royalty-free codec based on older, patent-expired technologies. New patent holders then emerged, which undermined the case for IVC.
Parts 29, 31 and 33 largely duplicate existing specifications and have not been embraced by the industry.
Video coding has been a major success with the AVC codec described by Part 10 and the newer and more efficient HEVC codec covered by ISO 23008 (MPEG-H). It is up to the encoder and player manufacturers to support the more exotic functionality where it is appropriate.
Refer to Chapter 2-2 for more information about the AVC codec. Chapter 2-3 discusses the HEVC codec.
Containment
Media container files are described initially in Part 12 of the standard, which covers the ISO Base Media File Format. It describes the basic timing and synchronization of multiple elementary streams, together with metadata for managing the content.
Part 14 describes the version two enhancement to ISOBMFF for storing MPEG-4 content. This completely replaces the version one file format described in earlier editions of Part 1.
Part 15 extends the container specification to carry NAL format video, including AVC and HEVC, and other element types.
The ISOBMFF container is described in Chapter 4-2.
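As a taste of the container structure, an ISOBMFF file is a sequence of boxes, each starting with a 32-bit big-endian size and a four-character type. The sketch below walks only the top-level boxes and does not descend into nested ones. The function name `iter_boxes` and the in-memory sample data are illustrative assumptions.

```python
import io
import struct
from typing import BinaryIO, Iterator, Tuple


def iter_boxes(f: BinaryIO) -> Iterator[Tuple[str, int, int]]:
    """Yield (type, offset, size) for each top-level box."""
    f.seek(0, io.SEEK_END)
    end = f.tell()
    offset = 0
    while offset + 8 <= end:
        f.seek(offset)
        size, box_type = struct.unpack(">I4s", f.read(8))
        if size == 1:
            # A size of 1 means a 64-bit size follows the type field.
            if offset + 16 > end:
                break
            size = struct.unpack(">Q", f.read(8))[0]
        elif size == 0:
            # A size of 0 means the box runs to the end of the file.
            size = end - offset
        if size < 8:
            break  # malformed header, stop rather than loop forever
        yield box_type.decode("latin-1"), offset, size
        offset += size


# A tiny hand-built sample: a 16-byte 'ftyp' box then a 12-byte 'mdat' box.
sample = (struct.pack(">I4s4sI", 16, b"ftyp", b"isom", 0)
          + struct.pack(">I4s", 12, b"mdat") + b"\x00" * 4)

boxes = list(iter_boxes(io.BytesIO(sample)))
print(boxes)
```

To inspect a real file, open it in binary mode and pass the file object to `iter_boxes` in place of the `BytesIO` sample.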
Typography
The typography support in MPEG-4 delivers custom fonts whose metrics should then be identical on all platforms.
Part 18 deals with the compression of font data so it can be delivered more efficiently. Part 22 includes these font specifications and references:
- TrueType fonts.
- PostScript fonts.
- Open Font Format (OFF) that combines aspects of TrueType and PostScript fonts.
- Compact Font Format (CFF).
Part 28 describes composite font packages that virtualize multiple different font types into a single consistent font description. It also adds support for Unicode. This increases the number of available character glyphs to display different languages, symbols and emoji pictograms.
Timed Text Fragments
Part 17 describes how text fragments can be packaged, streamed and synchronized to the timebase controlled by the Part 1 systems layer. Each fragment of text is called a Timed Text Unit (TTU) with a specific time-stamp and associated payload. Very low bit-rates are possible when multiple timed text fragments are combined into a single delivery packet. This allows text to be efficiently downloaded to mobile devices for applications such as Karaoke.
The format of the payload is left to the application to define. This might conform to other standards or be a proprietary design.
Part 30 describes time-synchronized text storage in ISOBMFF files. This includes W3C specified WebVTT text tracks for subtitle support in web video players. As each timed-text fragment becomes active, the web browser triggers a JavaScript event, and a handler can then make decisions based on the content of that fragment.
This technique was actually supported as far back as 1999 in proprietary RealVideo streams, where you could embed a JavaScript: URL to trigger dynamic page changes. The newer WebVTT support is more flexible, allowing multiple tracks to carry different languages and control signaling that alters the viewing experience in the player (aspect ratio switching for example).
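For readers who have not seen a WebVTT track, the text form is simple: a `WEBVTT` header line followed by cues, each with a start and end timestamp and a payload. The sketch below generates that text form only. It does not show how Part 30 stores the cues inside an ISOBMFF file, and the function names are hypothetical.

```python
from typing import Iterable, Tuple


def format_timestamp(seconds: float) -> str:
    """Render seconds as a WebVTT timestamp in hh:mm:ss.ttt form."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def build_webvtt(cues: Iterable[Tuple[float, float, str]]) -> str:
    """Build a WebVTT document from (start, end, text) cues."""
    blocks = ["WEBVTT"]
    for start, end, text in cues:
        timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{timing}\n{text}")
    # Cues are separated from the header and each other by a blank line.
    return "\n\n".join(blocks) + "\n"


track = build_webvtt([
    (0.0, 2.5, "Hello"),
    (2.5, 5.0, "and welcome"),
])
print(track)
```

The payload of a cue is just text, so a control-signaling track can carry whatever short messages the player's script has been written to understand.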
Refer to Chapter 5-6 in the ST 2110 section for additional coverage of the context for applying timed-text. It is also highly relevant to the design of client-side video players.
3D & Graphics
Parts 19 and 25 describe synthesized textures and 3D graphics compression. These may be useful in Metaverse applications. They might be replaced by, or absorbed into, the point-cloud work being developed as component parts of ISO standard 23090 (MPEG-I).
A Java based graphics sub-system (MPEG-J) is described in Part 21.
The AFX Animation Framework is covered by Part 16.
Multimedia Binding & User Interface Presentation
In 2000, interactive TV and multimedia were a growing and popular trend. The MPEG-4 platform looked as if it would be the pre-eminent technology choice.
Part 11 describes a Binary Format for Scenes (BIFS) which is based on much earlier work to describe Web-based VR user-experiences with the VRML markup language.
The BIFS specification was followed by Part 20, which described the Lightweight Application Scene Representation (LASeR). This is a binary representation of Scalable Vector Graphics (SVG) content.
These two technologies offered the potential for very advanced interactive user experiences. More than 20 years later, interactive user experiences are instead implemented with open and patent-free web standards. HTML5 combined with CSS and JavaScript is the most practical and affordable way to build portable interactivity.
Streaming
Part 8 describes the carriage of MPEG-4 packages over IP networks. There are useful guidelines for designing Real-time Transport Protocol (RTP) payloads. Security and multicasting, which delivers the same content to multiple receiving clients, are also discussed.
The use of the Session Description Protocol (SDP) in low-latency Voice over IP (VoIP) and video conferencing systems is also described.
Appropriate media type identifier specifications are included. Defining the right media type when deploying content on the Internet invokes the appropriate handling mechanisms for correct playback in the receiver.
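On the web, the media type is often accompanied by a `codecs` parameter so the receiver can decide whether it can play the content before fetching it. The sketch below follows the widely used `avc1.PPCCLL` and `mp4a.40.2` conventions from RFC 6381 and related registrations, which is an assumption on my part and not necessarily what Part 8 specifies. The function names are hypothetical.

```python
from typing import Sequence


def avc_codec_string(profile_idc: int, constraint_flags: int, level_idc: int) -> str:
    """Build an 'avc1.PPCCLL' identifier from three bytes written as hex."""
    return f"avc1.{profile_idc:02X}{constraint_flags:02X}{level_idc:02X}"


def media_type(container: str, codecs: Sequence[str]) -> str:
    """Combine a container media type with its codecs parameter."""
    joined = ", ".join(codecs)
    return f'{container}; codecs="{joined}"'


# Baseline profile (66), constraint flags 0xE0, level 3.0 (30), with AAC-LC audio.
video = avc_codec_string(66, 0xE0, 30)
content_type = media_type("video/mp4", [video, "mp4a.40.2"])
print(content_type)
```

A string like this is what a web server would send in a `Content-Type` header, or what a page would pass to the browser when asking whether a format is playable.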
Refer to Chapter 6-4 for further discussion on how streaming protocols work. The coverage of ST 2110 in Section 5 is also relevant.
Rights Control
MPEG-4 Part 13 describes the Intellectual Property Management and Protection (IPMP) tools for embedding rights-control metadata inside the bitstream. This hooks into a rights management system to control access to the content for subscription-based services.
Consult ISO 21000 (MPEG-21) for additional material related to this topic.
Supporting Technologies
MPEG-4 Part 34 covers a high-level Syntactic Description Language (SDL) for describing bitstream content. This abstraction is used in many MPEG standards to specify how coded video and audio are laid out in a bitstream ready for transmission.
Coding standards only describe the content that a player is expected to support. The encoding strategy is left for innovators to develop and improve as they create ever more intelligent and efficient coding tools.
Reference software (Parts 5 and 7) and hardware (Part 9) implementations are provided for developers as a starting point. Part 5 is patent-encumbered and might be subject to a license fee but Parts 7 and 9 are patent-free.
The conformance testing specifications in Parts 4, 26 and 27 describe how to check that your implementation is standards compliant. These techniques are generally applicable to other standards. Part 4 describes generic concepts while Parts 26 and 27 focus on audio and 3D graphics. Parts 4 and 27 use patented technologies for which a license fee may be due.
Can MPEG-4 Compete Against Open-Standards?
The fortunes of MPEG-4 have largely been dictated by patent license fees. The parts that have been widely adopted, notably AVC and AAC, have been very successful. The rest have been less popular and might never gain significant traction.
Newer open-standards-based alternatives are commercially more attractive and well supported by most web browsers. As HTML5 has become more capable, audio-visual playback is supported as a first-class citizen and no longer needs plug-in support. Adding CSS3, JavaScript, SVG and Unicode gives you all the tools you need to build sophisticated user experiences.
Licensing & Patents In MPEG-4
Read carefully where a standard describes whether patents apply to its use. This is usually very near the front. The authors may not have known which patents apply when the standard was drafted and the descriptions may be incomplete. It is worth the time and effort to research this before committing to a standard, to avoid unexpected license fees later on.
Details of the patents known to ISO are described in a downloadable spreadsheet but there may be other patents that they do not know about until the owner declares their interest in the standards:
https://www.iso.org/patents
The table below indicates which MPEG-4 parts are encumbered by patents.
The list will evolve because patents don’t last forever (typically 20 years) and submarine patents do not surface right away. Almost all of the patents that encumber MPEG-4 Part 2 (Visual) have expired. The patents on AVC will take a little longer but are expected to have expired by 2030. HEVC will take longer still but will eventually be patent free.
| Part | Content | Patents |
|---|---|---|
| 1 | Systems. | Yes |
| 2 | Visual. | Yes |
| 3 | Audio. | Yes |
| 4 | Conformance Testing. | Yes |
| 5 | Reference Software. | Yes |
| 6 | DMIF. | Yes |
| 7 | Reference Software for A/V Objects. | No |
| 8 | Carriage over IP. | No |
| 9 | Reference Hardware. | No |
| 10 | AVC. | Yes |
| 11 | BIFS (Scene description). | Yes |
| 12 | ISO Base Media File Format. | Yes |
| 13 | IPMP. | No |
| 14 | MP4 File Format. | Yes |
| 15 | Carriage of NAL format video (Formerly AVC File Format). | Yes |
| 16 | AFX – Animation Framework. | Yes |
| 17 | Streaming Text Format. | No |
| 18 | Font Compression and streaming. | Yes |
| 19 | Synthesized texture stream. | Yes |
| 20 | LASeR (Lightweight scene description). | Yes |
| 21 | MPEG-J extension for rendering. | No |
| 22 | Open font format. | Yes |
| 23 | Symbolic Music Representation (SMR). | No |
| 24 | Audio and systems interaction. | No |
| 25 | 3D Graphics Compression Model. | Yes |
| 26 | Audio Conformance. | No |
| 27 | 3D Graphics conformance. | Yes |
| 28 | Composite font representation. | No |
| 29 | Web video coding. | Yes |
| 30 | Timed text and other visual overlays in ISO base media file format. | No |
| 31 | Video coding for browsers. | Yes |
| 32 | File format reference software and conformance. | No |
| 33 | Internet video coding. | Yes |
| 34 | Syntactic description language. | No |
Publicly Available Standards
Once a standard is published, you can purchase it from the ISO online store. Amendments and corrigenda will need to be purchased separately. Some Publicly Available Standards (PAS) are available free of charge. Download them from the ISO webstore and search using the key ‘PAS’:
https://www.iso.org/store.html
The following MPEG-4 parts are available under the PAS scheme:
| Part | Title |
|---|---|
| 4 | Conformance testing bitstreams. |
| 5 | Reference software. |
| 7 | Reference Software for A/V Objects. |
| 10 | AVC. |
| 20 | LASeR (Lightweight scene description). |
| 22 | Open font format. |
| 26 | Audio Conformance testing. |
| 27 | 3D Graphics conformance testing. |
| 28 | Composite font representation. |
| 32 | File format reference software and conformance. |
These parts of ISO 23000 (MPEG-A), rather than MPEG-4, are also available under the PAS scheme:
| Part | Title |
|---|---|
| 11 | Stereoscopic video. |
| 12 | Interactive music. |
| 13 | Augmented reality. |
| 15 | Multimedia preservation. |
| 16 | Publish/Subscribe. |
| 17 | Multiple sensorial media. |
| 18 | Media linking. |
| 19 | Common media (CMAF) for segmented media. |
PAS means that the standards documents are freely available but this does not mean they are free of patent encumbrances. You may still need to pay license fees.