Standards: Part 6 - About The ISO 14496 – MPEG-4 Standard

This article describes the various parts of the MPEG-4 standard and discusses how it is much more than a video codec. MPEG-4 describes a sophisticated interactive multimedia platform for deployment on digital TV and the Internet.

This article is part of our growing series on Standards. There is an overview of all 26 articles in Part 1 - An Introduction To Standards.

The original scope of ISO 14496 - MPEG-4 was to improve the coding of audio and video and replace MPEG-2. Since then, it has evolved into a sophisticated interactive multimedia platform. It can support interactive user experiences delivered over the Internet and via digital TV services from a common source that can potentially work everywhere.

About The MPEG-4 Parts

MPEG-4 comprises 34 individual parts with the possibility of more being introduced later. There are also occasional references to other MPEG standards.

The audio/visual coding and file storage specifications have been widely adopted by removable media, broadcast, streaming and World-Wide-Web communities. The rest have largely failed to gain any traction and have been supplanted by open standards alternatives.

It helps to categorize the parts to see how they interoperate and enhance one another. These groupings are particularly important and have achieved significant traction:

  • Systems layer and DMIF.
  • Audio standards.
  • Video standards.
  • Containment.

The remainder are interesting but have not been so widely adopted:

  • Typography.
  • Text.
  • 3D & Graphics.
  • Multimedia binding and presentation.
  • Streaming.
  • Rights control.
  • Supporting technologies.

Profiles and levels can be applied to constrain the content coding. If a player supports the same profile and level as the content, there is no need to implement the entire standard in all its parts.
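As an illustration of how profiles and levels are signalled, MPEG-4 content delivered on the web carries an RFC 6381 `codecs` string such as `avc1.64001F`, which encodes the AVC profile and level. This hypothetical Python helper decodes the common cases (the profile table is a partial mapping for illustration, not the complete list from the standard):

```python
# Decode profile and level from an RFC 6381 "codecs" string, e.g. "avc1.64001F".
# The hex digits after the dot are profile_idc, constraint flags, level_idc.
AVC_PROFILES = {66: "Baseline", 77: "Main", 88: "Extended", 100: "High"}

def parse_avc_codecs(codecs: str):
    fourcc, hexpart = codecs.split(".")
    profile_idc = int(hexpart[0:2], 16)   # e.g. 0x64 = 100 = High
    level_idc = int(hexpart[4:6], 16)     # e.g. 0x1F = 31 = level 3.1
    return AVC_PROFILES.get(profile_idc, "Unknown"), level_idc / 10

print(parse_avc_codecs("avc1.64001F"))  # ('High', 3.1)
```

A player that recognizes the profile and level pair can decide whether it can decode the stream without fetching any media data.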

Systems Layer

Part 1 describes the systems layer, much as the earlier MPEG-2 standard did. This provides a foundation for synchronizing separate streams of media so they can be presented together. It identifies and organizes the elementary objects within the package, which are then multiplexed for delivery.

Delivery Multimedia Integration Framework (DMIF)

Part 6 describes the Delivery Multimedia Integration Framework (DMIF). This specifies two interfaces that separate the description of an object based interactive multimedia user experience from the transport mechanisms that deliver the content to the receiver. The DMIF-Application Interface (DAI) separates the application from the FlexMux coded elements. Then the DMIF-Network Interface (DNI) separates the FlexMux from the physical transport layer.

The content is first encoded into the elementary streams. FlexMux then multiplexes them into a single package. The transport multiplexor then breaks the package into fragments for onward transmission.

Provided the transport mechanism supports DNI, any practical means can carry the content to the end user.

A receiving player accepts incoming transport packets, reassembles the package and hands it to the FlexMux, which then forwards the elementary streams for decoding. The application layer then marshals the elementary objects and renders the user experience.

DMIF completely abstracts the packaging and transport layers, so the MPEG-4 application layer is unaware of how the content is delivered. The symmetry of DMIF at both ends is very elegant.
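The multiplexing step can be sketched in a few lines. This toy Python example interleaves tagged access units in the spirit of FlexMux simple mode, with a small header carrying a stream index and payload length. It is an illustration of the idea, not the normative packet syntax:

```python
import struct

def flexmux_pack(units):
    """Interleave (index, payload) access units into one byte stream:
    a 1-byte stream index and 1-byte length precede each payload."""
    out = bytearray()
    for index, payload in units:
        assert len(payload) <= 255, "toy format: short payloads only"
        out += struct.pack("BB", index, len(payload)) + payload
    return bytes(out)

def flexmux_unpack(data):
    """Recover the (index, payload) units from the multiplexed stream."""
    units, pos = [], 0
    while pos < len(data):
        index, length = data[pos], data[pos + 1]
        units.append((index, data[pos + 2 : pos + 2 + length]))
        pos += 2 + length
    return units

units = [(0, b"video-AU"), (1, b"audio-AU")]
assert flexmux_unpack(flexmux_pack(units)) == units
```

The symmetry noted above is visible here: the receiver simply runs the packing logic in reverse to recover each elementary stream's access units.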

Audio Coding

The Audio support in Part 3 builds on earlier MPEG-1 and MPEG-2 standards and adds support for the following:

  • Perceptual coding of audio signals with variations of Advanced Audio Coding (AAC).
  • Audio Lossless Coding (ALS).
  • Scalable Lossless Coding (SLS).
  • Structured Audio for very low bit rate delivery and reconstruction in the player.
  • Text-To-Speech Interface (TTSI).
  • Harmonic Vector Excitation Coding (HVXC) for very low bit rate speech coding.
  • Code-Excited Linear Prediction (CELP) speech coding.
  • Spatialized sound.

Part 23 adds Symbolic Music Representation, which operates at a higher structural level than the Musical Instrument Digital Interface (MIDI) standard. This is perhaps more in the domain of a sequencer.

Part 24 describes additional audio & systems interaction with file storage containers.

Video Coding

The video coding in Part 2 is superseded by the much-improved compression possible with AVC which is described in Part 10.

Quite early on, Part 2 supported non-rectangular alpha masked video which has now been added to Part 10. An auxiliary monochrome picture is decoded with the main image and the grey levels are used by the player as an alpha mask when rendering the video.
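The compositing step the player performs is straightforward. This sketch blends a single decoded sample over a background using the grey level from the auxiliary picture as the alpha value (per-sample and illustrative only; a real renderer works on whole frames):

```python
def composite(fg: int, alpha: int, bg: int) -> int:
    """Blend one decoded video sample (fg) over a background sample (bg)
    using the grey level (0-255) from the auxiliary alpha picture."""
    a = alpha / 255.0
    return round(a * fg + (1.0 - a) * bg)

# A fully opaque mask keeps the video sample; mid-grey blends roughly 50/50.
print(composite(200, 255, 40))  # -> 200
print(composite(200, 128, 40))  # -> 120
```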

Part 29 - Web Video Coding (WVC) is an abridged version of the Constrained Baseline Profile covered in Part 10.

Part 31 - Video coding for browsers was withdrawn since it described profiles already covered by Part 10 and was therefore redundant.

Part 33 - Internet Video Coding (IVC) was intended to be a completely royalty-free codec based on older, patent-expired technologies. Patent holders subsequently emerged, rendering IVC irrelevant.

Parts 29, 31 and 33 can be ignored since they duplicate existing specifications and have not been embraced by the industry.

Video coding has been a major success with the AVC codec described by Part 10 and the newer, more efficient HEVC codec covered by ISO 23008 (MPEG-H). It is up to the encoder and player manufacturers to support the more exotic functionality where it is appropriate.

Containment

Media container files are described initially in Part 12, which covers the ISO Base Media File Format. Basic timing and synchronization of multiple element streams is described with metadata for managing the content.
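The ISOBMFF structure is simple to walk at the top level: every box begins with a 32-bit big-endian size followed by a four-character type. A minimal Python sketch (top-level boxes only, ignoring the 64-bit `largesize` extension):

```python
import struct

def iter_boxes(data: bytes):
    """Walk the top-level boxes of an ISOBMFF buffer. Each box header is
    a 32-bit big-endian size and a 4-character ASCII type."""
    pos = 0
    while pos + 8 <= len(data):
        size, box_type = struct.unpack_from(">I4s", data, pos)
        if size < 8:
            break  # sketch: skip size==0/1 (to-end-of-file / largesize) cases
        yield box_type.decode("ascii"), size
        pos += size

# A minimal hand-built example: an 'ftyp' box followed by an empty 'mdat'.
ftyp = struct.pack(">I4s4sI4s", 20, b"ftyp", b"isom", 512, b"mp42")
mdat = struct.pack(">I4s", 8, b"mdat")
print(list(iter_boxes(ftyp + mdat)))  # [('ftyp', 20), ('mdat', 8)]
```

The same pattern recurses inside container boxes such as `moov`, which is how the metadata described above is located and read.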

Part 14 describes the version 2 enhancement to ISOBMFF for storing MPEG-4 content. This completely replaces the version 1 file format described in earlier editions of Part 1.

Part 15 extends the container specification to support HEVC video and other element types.

Typography

The typography support delivers custom fonts whose metrics should be identical on all platforms.

Part 18 deals with the compression of font data so it can be delivered more efficiently.

Part 22 includes these font specifications and references:

  • TrueType fonts.
  • PostScript fonts.
  • Open Font Format (OFF) that combines aspects of TrueType and PostScript fonts.
  • Compact Font Format (CFF).

Part 28 describes composite font packages that virtualize multiple different font types into a single consistent font description. It also adds support for Unicode. This increases the number of available character glyphs to display different languages, symbols and Emoji pictograms.

Text Fragments

Part 17 describes how text fragments can be packaged, streamed and synchronized to the timebase controlled by the Part 1 systems layer. Each fragment of text is called a Timed Text Unit (TTU) with a specific time-stamp and associated payload. Very low bit-rates are possible when multiple timed text fragments are combined into a single delivery packet. This allows text to be efficiently downloaded to mobile devices for Karaoke applications for example.

The format of the payload is left to the application to define. This might conform to other standards or be a proprietary design.

Part 30 describes time-synchronized text storage in ISOBMFF files. This includes W3C-specified WebVTT text tracks for subtitle support in web video players. These trigger JavaScript events in the browser so that a handler can act on the content of each timed-text fragment.

This technique was actually supported as far back as 1999 in proprietary RealVideo streams, where you could embed a javascript: URL to trigger dynamic page changes. The newer WebVTT support is more flexible, allowing multiple tracks carrying different languages and control signalling to alter the viewing experience in the player (aspect-ratio switching, for example).
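The WebVTT cue timing syntax is easy to handle programmatically. This Python sketch converts a cue timing line into start and end times in seconds (it assumes the full `hh:mm:ss.mmm` form; the W3C specification also allows the hours field to be omitted):

```python
import re

# WebVTT cue timing line, e.g. "00:01:02.500 --> 00:01:05.000"
CUE_TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})\.(\d{3}) --> (\d{2}):(\d{2}):(\d{2})\.(\d{3})"
)

def parse_timing(line: str):
    """Convert a WebVTT cue timing line into (start, end) in seconds."""
    m = CUE_TIMING.match(line)
    if not m:
        raise ValueError("not a cue timing line")
    h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
    start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
    end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
    return start, end

print(parse_timing("00:01:02.500 --> 00:01:05.000"))  # (62.5, 65.0)
```

These are the times at which the player raises the cue events that a subtitle or Karaoke handler responds to.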

3D & Graphics

Parts 19 and 25 describe synthesized textures and 3D graphics compression. These may be useful in Metaverse applications. They might be replaced by or absorbed into the point-cloud work being developed as component parts of ISO standard 23090 (MPEG-I).

A Java based graphics sub-system (MPEG-J) is described in Part 21.

The AFX Animation Framework is covered by Part 16.

Multimedia Binding & User Interface Presentation

In 2000, Interactive TV and multimedia was a growing and popular trend. The MPEG-4 platform looked as if it would be the pre-eminent technology choice.

Part 11 describes the BInary Format for Scenes (BIFS), which is based on VRML work done on Web-based VR user experiences. This was followed by Part 20, which described the Lightweight Application Scene Representation (LASeR). This is a binary representation of Scalable Vector Graphics (SVG) content. These two technologies offered the potential for very advanced interactive user experiences. Ultimately, patent licensing made this approach too costly to deploy commercially.

Streaming

Part 8 describes the carriage of MPEG-4 packages over IP networks. There are useful guidelines for designing Real-Time Transport Protocol (RTP) payloads. Security and multicasting (delivering the same content to multiple receiving clients) are also discussed.

The use of the Session Description Protocol (SDP) for low-latency transport in Voice over IP (VoIP) and video conferencing systems is also covered.

Appropriate MIME-type specifications are included. Defining the correct MIME type when deploying media content on the Internet ensures the appropriate handling mechanisms are invoked for correct playback.

Rights Control

Part 13 describes the Intellectual Property Management and Protection (IPMP) tools for embedding rights controlling metadata inside the bitstream. This hooks into a rights management system to control access to the content for subscription-based services.

Consult ISO 21000 (MPEG-21) for additional material related to this topic.

Supporting Technologies

Part 34 covers a high-level Syntactic Description Language (SDL) for describing bitstream content. This abstraction is used in many MPEG standards to specify video and audio coding into a format ready for transmission.

Coding standards only describe the content that a player is expected to support. The encoding strategy is left for innovators to develop and improve as they create ever more intelligent coding tools.

Reference software (Parts 5 and 7) and hardware (Part 9) implementations are provided for developers as a starting point. Part 5 is patent-encumbered and might be subject to a license fee, but Parts 7 and 9 are patent free.

Conformance testing is covered in Parts 4, 26 & 27, which describe how to check that your implementation is standards compliant. These techniques are generally applicable to other standards. Part 4 describes generic concepts while Parts 26 and 27 focus on audio and 3D graphics respectively. Parts 4 and 27 use patented technologies for which a license fee may be due.

Conclusion

The fortunes of MPEG-4 have largely been dictated by patent license fees. The parts that have been widely adopted (AVC & AAC) have been very successful. The rest have been less popular.

Newer open-standards-based alternatives are commercially more attractive and well supported by most web browsers. As HTML5 has become more capable, audio-visual playback is supported as a first-class citizen and no longer needs plug-in support. Adding CSS3, JavaScript, SVG and Unicode provides all the tools needed to build sophisticated user experiences.
