How To Achieve Broadcast-Grade Latency For Live Video Streaming

Achieving ‘broadcast-grade-streaming’ sets a latency target of 5s - no easy task with current norms of about 30-60 seconds. We delve into the standards and cutting edge thinking involved in trying to reach this goal.

This article is part of 'The Big Guide To OTT - The Book'

For a serious discussion about “making streaming broadcast-grade” we must address latency. Our benchmark is 5 seconds from encoder input to device, but we cannot compromise on quality. It’s not easy, and leaders in the field grapple with the trade-offs to ensure the best possible customer experience.

The consumer expectation of “broadcast-grade streaming” is a simple target to aim for – 5-second latency (or less) at a sustained (i.e., no buffering) highest bitrate for the device, and negligible bitrate ladder changes to accommodate network conditions.

Perfect bandwidth between encoder and device would deliver perfect video quality, but on its own it does not deliver broadcast-grade latency. Reducing latency requires reengineering how we process the video and decode it in the video player. Once that is done, we still need to achieve that perfect quality, and this becomes harder because reducing latency removes margins for error. Broadcast-grade video streaming is not easy.

Therefore, we talk about “safely” reaching 8 seconds latency as a reasonable target for today, improving from the current norm of about 30-60 seconds. It is possible to reach 8 seconds with an appropriate amount of safety built into the stream, using standard HLS and DASH protocols. Beyond this, moving to the 5 second range, and onward to the 2-4 second range, there are the less-tested options related to CMAF media formats, and pioneering initiatives like HESP (High Efficiency Streaming Protocol). This 3-part article looks at what can be achieved with these approaches.

The First Low Latency Zone – 8 Seconds

In latency terms, 8 seconds is considered by experts to be a relatively safe level to reach. 8 seconds includes allowing enough time to correct stream errors so that the viewing experience is not negatively affected. This uses DASH and HLS standards and shortens the all-important segment size.

Latency in this article is broken down into components and typical timings for 6 second segments as described in the below diagram and table from the Streaming Video Alliance (SVA).

Clearly, the biggest driver of latency is in the packager and player components, which are intrinsically linked. Once a segment size (i.e., a GOP, or Group of Pictures) has been defined by the packager, it will typically be multiplied by 3 to give the player a buffer window that helps it to negotiate network conditions that slow down the stream or cause it to drop packets. 3 segments of 6 seconds each (18 seconds of player buffer) is an HLS recommendation, which, together with the delays in the other components, adds up to our industry-famous “30 seconds” of latency.

Figure 1 – Live streaming workflow and the “30-second” latency norm.

The “safety-zone” of about 8 seconds, which is only a small way behind our customer experience target for broadcast-grade live viewing experiences, is achieved by taking two important process reengineering steps.

First, we need to reduce the size of each segment to 2 seconds. HLS and DASH both allow for this change in their specifications. This change is made at the encoder stage, but the latency benefits are really seen at the packager and player stages. The SVA note that the trade-offs between latency, start-up delay, and quality are best optimized at the 2-second segment size. Even though the HLS specification allows for 1 second segments, and the DASH specification allows for 200ms segments, the smaller segment sizes can lead to encoding efficiency problems in terms of cost increase or a reduction in quality for the same cost.

Second, we need to reduce the number of segments to 2 from the typical standard of 3. This is a simple choice that makes the biggest single improvement to latency although it is risky from a quality perspective, even with ABR options. This subject is covered in more detail later in this article.
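As a rough illustration of these two steps, the player buffer's share of latency is simply the segment duration multiplied by the number of segments held. The sketch below is a minimal Python calculation. The function name is hypothetical, and it deliberately ignores the encoder, packager, and CDN delays that make up the rest of the end-to-end figure.

```python
def player_buffer_seconds(segment_seconds: float, segments_buffered: int) -> float:
    """Media time held in the player buffer (buffer contribution only)."""
    if segment_seconds <= 0 or segments_buffered <= 0:
        raise ValueError("segment duration and segment count must be positive")
    return segment_seconds * segments_buffered


# Configurations discussed in the text: (segment duration in seconds, segments buffered)
legacy = player_buffer_seconds(6, 3)          # the familiar 3 x 6-second set-up
short_segments = player_buffer_seconds(2, 3)  # step one: 2-second segments
fewer_segments = player_buffer_seconds(2, 2)  # step two: hold 2 segments instead of 3

print(legacy, short_segments, fewer_segments)
```

The gap between the first and last values is where most of the saving comes from. The remaining components still have to be added on top to arrive at a full end-to-end latency.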

Third, there are small tweaks that can be done to reengineer or optimise for further latency reduction. These can remove c. 3-4 seconds if done in isolation of the previous two points, which doesn’t get us close to broadcast grade on their own, but they may be important details to address to optimise customer experience.

Figure 2 – An example of end-to-end latency using a two-second segment duration (DASH or HLS).

In most situations, these tweaks will be useful only when the first two steps have been taken. For example, if a decision is made to stick with 3 segments in the buffer, then these extra optimisations could save an equivalent amount of time and keep latency and quality jointly optimised. The tweaks include:

Minimise “waste” in the video processing infrastructure by optimizing TCP settings and streaming in UDP between the encoder and packager when they are separate.
Locate VOD storage immediately next to the Origin, using local storage rather than network-attached storage.
If using network-attached storage, use lower storage replication factors.
Ensure the CDN Edge serves stream requests as close as possible to the consumer.
Optimise for Cache Hit Efficiency, to reduce round-trip request times between CDN layers or between CDN and Origin.

How ABR Complicates Latency Management

ABR (Adaptive Bitrate) was invented to optimise the streaming viewing experience when there are unpredictable and inconsistent network conditions (which can be almost all the time on the internet). The primary goal is to maintain the stream to the consumer and stop the dreaded video freeze from occurring. It has been a very important development in the evolution of video streaming.

Like low latency, ABR depends on the buffer size, but for a different reason. Latency is heavily affected by the size of the segments and the number of segments held in the buffer. ABR is heavily affected by how much time is available to measure network performance and switch between bitrate profiles. ABR needs time to identify a network issue through network measurements and choose to switch to a lower bitrate that can be delivered in time to reach the player and provide stream continuity. When the buffer is longer there is simply more time available to take action.

As an example, a segment of 2 seconds will take 2 seconds to download across the network to the player from the upstream CDN. If there are 3 segments in the buffer, this means that there are 6 seconds queued up. If the 4th segment has an issue being downloaded in time when the 1st segment ends, the player can request a new 4th segment. To arrive in time, the request must be made and the download must begin within the 2nd 2-second segment, so that it can be downloaded when the 3rd segment plays and be ready to play when the 3rd segment ends.

But if there are network issues, like capacity problems, round-trip delays, and network drops, then there is also a higher chance that other issues can occur, such as a restart or retry. There needs to be enough margin for error to manage the segment download. This is where the 3-segment standard has come from. But low latency reduces this margin for error, so sensitivity to network conditions increases, and even ABR’s ability to help is reduced.
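The timing in the example above can be sketched as a small calculation. This is a simplified model, not a real ABR algorithm: the function name is hypothetical, and it assumes a single replacement download that takes one segment duration, as in the example.

```python
def retry_slack_seconds(segments_remaining: int,
                        segment_seconds: float,
                        redownload_seconds: float) -> float:
    """Time left to detect a problem and start a replacement download
    before the buffered media runs out."""
    buffered = segments_remaining * segment_seconds
    return buffered - redownload_seconds


# 3-segment buffer: the 1st segment has just finished, so 2 segments remain.
three_segment_slack = retry_slack_seconds(2, 2.0, 2.0)

# 2-segment buffer: only 1 segment remains when the problem is noticed.
two_segment_slack = retry_slack_seconds(1, 2.0, 2.0)

print(three_segment_slack, two_segment_slack)
```

With three segments the player has one full segment duration in which to react. With two segments the slack falls to zero in this model, so any retry or restart eats directly into playback.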

How Smart TVs Can Complicate Latency Management

Ironically, the type of content where low latency is important, like live events, is also the type of content that has high potential to create network overload because of large audiences viewing concurrently. And a new issue is emerging with low latency delivery to Smart TVs, which again is ironically the environment where big live events are being increasingly viewed for best quality on the big screen.

The issue is that HTML5 browser environments, which include many Smart TVs, do not measure bandwidth availability for ABR calculations in the same way as players have traditionally done. In low latency mode, it is critical to receive data into the decoder’s buffer as quickly as possible. In HTML5 this uses the Fetch API combined with HTTP’s Transfer-Encoding: chunked header, which allows data to be transmitted as it becomes available. But the Fetch API cannot measure idle time when chunks of data are being sent, so it will calculate a shorter transfer duration than reality, making it difficult for a player to estimate accurately whether it should increase the video quality.

Working with low latency increases the importance of measuring upstream network performance to make fast decisions. So, if network measurement becomes less accurate or takes longer to get a measurement, then potentially we may need to change how ABR works.

To Overengineer Or Not To Overengineer?

If you want to overengineer your stream management approach to avoid these problems, you don’t really have a good choice.

The first point to make is that it is not possible to know when networks will go down or become non-performant for the video that is trying to traverse them, so when should you choose to over-engineer? ABR is the solution we have today to maintain stream delivery rather than freeze the video. But other than this, we have few choices to fix performance problems.

We could overengineer our delivery by sending multiple streams simultaneously to each player to give a real-time failover option to the player. But this brute-force approach leads to problems with bandwidth utilisation and the cost of delivery for broadcasters, potentially doubling costs, or worse. The goal is not to increase CDN costs, so we must be clever.

Leading distribution and player business, THEO Technologies, have a lot of internal discussions about this. They conclude that the first principle is to avoid needing to redownload a segment. As THEO CTO, Pieter-Jan Speelmans says, “We believe that aborting a download is the start of a chain-reaction of problems for the viewer experience. It can throw off critical timing deadlines in software algorithms that are running many simultaneous calculations to make stream request decisions. We do have customers working with multiple streams, or preloading channels based on customer behaviours within their EPG, but these are custom projects for very specific environmental conditions or specific types of content. The real solution for low latency is to engineer more cleverly to make best use of the video segments and chunks we have access to.”

Streaming is a pull system which creates some challenges for latency management due to the nature of the networks that streaming traverses. But this exact pull system is also the reason why there is an opportunity to exceed today’s “broadcast-grade” latency standards and perhaps set a new bar that we could call “streaming-grade”.

The first part of this article explained how the 8-second latency standard from encoder input to player output is achieved. This second part focuses on how streaming delivery could effectively match broadcast-grade latency standards in the 5-second range.

Why Key Frames Really Matter

First, we need to look at the technical subject of key frames, which are an integral part of engineering for low latency. They are the main markers of segment start/stop points and are used by video players to request content from the upstream delivery chain.

The current challenge with the time to download a new segment is related to how segments are marked. A key frame for every segment – i.e., every 2 seconds if the segment is 2 seconds long – is the marker in the video where playback can initiate. The choice a player makes in downloading a new segment is to download up to the end of the next keyframe in the same quality, or to download from the existing keyframe at a lower quality. Switching streams can happen between key frames, but playing a new segment is constrained by key frames.

To manage video quality, a player must be able to detect that a network change has occurred and initiate action to rectify the problem. This process typically takes 300-400ms and can occur once every segment. If the segment has been reduced to just 1 second, then taking 400ms to measure and initiate the download of a new 1 second segment puts pressure on stream quality.

Pieter-Jan Speelmans, CTO at THEO, states “Given typical network measurement speeds and the inherent nature of key frames and segments, it becomes rather complex to write an ABR algorithm that works well for low latency streaming. This is particularly difficult for web browser and Smart TV environments where low latency chunking is applied because we can only measure when a chunk was completed and not when it started. There can also be idle time when servers are not sending data. These two points make the normal calculation methods too simple and can result in inaccurate measurements. To engineer for low latency and high quality we should look to network set-up and new ways to manage segments. The latter is where we are focused.”
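To illustrate why idle time makes the normal calculation too simple, the sketch below compares two throughput estimates over the same chunked download. The chunk timings and sizes are invented for illustration. The `start_ms` values are exactly what a browser-based player cannot observe according to the quote above, so `active_time_bps` represents an idealised measurement rather than something a web player can compute directly.

```python
from dataclasses import dataclass


@dataclass
class ChunkTransfer:
    start_ms: int    # when the chunk began arriving (not visible to a web player)
    end_ms: int      # when the chunk finished arriving
    size_bytes: int


def wall_clock_bps(chunks: list[ChunkTransfer]) -> float:
    """Naive estimate: all bits divided by elapsed time, idle gaps included."""
    total_bits = 8 * sum(c.size_bytes for c in chunks)
    elapsed_ms = chunks[-1].end_ms - chunks[0].start_ms
    return total_bits * 1000 / elapsed_ms


def active_time_bps(chunks: list[ChunkTransfer]) -> float:
    """Idealised estimate: all bits divided by time spent actually receiving."""
    total_bits = 8 * sum(c.size_bytes for c in chunks)
    active_ms = sum(c.end_ms - c.start_ms for c in chunks)
    return total_bits * 1000 / active_ms


# Hypothetical download: one 250 kB chunk is produced every 200 ms,
# and each chunk takes 50 ms to arrive once the server starts sending it.
chunks = [ChunkTransfer(i * 200, i * 200 + 50, 250_000) for i in range(10)]

print(f"wall-clock estimate:  {wall_clock_bps(chunks) / 1e6:.1f} Mbps")
print(f"active-time estimate: {active_time_bps(chunks) / 1e6:.1f} Mbps")
```

In this made-up case the two estimates differ by a factor of almost four, which is the kind of discrepancy that can mislead an ABR decision.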

How CMAF Helps To Lower Latency

CMAF (Common Media Application Format) is a format, whereas DASH and HLS are protocols. CMAF describes how media can be stored and transported, and the interoperability with the client. It does not provide guidance about the other network elements involved in the low latency workflow (like the packager and origin), and so it is implemented within the DASH and HLS specifications, which do refer to all elements in the workflow.

Figure 3 – Legacy HLS/DASH and the CMAF LLC and HTTP Chunk Transfer processing method.

CMAF LLC (Low Latency Chunk) defines a sub-segment GOP (Group of Pictures) or “chunk”, and it allows for delivery of the chunk before the full segment duration is calculated. The chunk size is applied across the whole workflow, as shown below, to allow a decoder to start decoding the video stream before a complete segment is received.

CMAF requires a key frame at the start of each segment. However, protocols making use of CMAF, such as LL DASH, are not required to specify the sub-structure of these segments. This can cause problems for player clients which need to determine where they can start playback, and where they should start to download. In these cases, CMAF requires downloads to start from the beginning of a segment, which has implications for managing quality and latency simultaneously.

The Second Low Latency Zone – The 5-second Range

HLS LL and DASH LL, the two leading low latency (LL) protocols, have fundamental design differences in how they operate. HLS LL delivers the chunks as individual files called “Parts” which must be fully formed before they can be delivered to the client. DASH LL progressively downloads chunks as they become available.

The HLS LL approach means that it aligns fully with the existing methodology used in standard HLS. In essence, it is like reducing the segment duration using standard HLS, but by using CMAF LLC there are optimisations to the manifest and media segment request workflows. For example, HLS LL makes use of “Preload Hints” which allow a playback client to request the next upcoming Part even though it is not available yet. This means the media data can be acquired faster, reach the buffer earlier in the process, and then be rendered out at a lower latency compared to an approach without CMAF chunks.

For DASH LL, the implementation of CMAF LLC means that a chunk can be created from even a single frame of video (e.g., 33 milliseconds for a video using 30 frames per second). The SVA (Streaming Video Alliance) recommend using 4-8 frames per CMAF chunk (e.g., 133 to 267 milliseconds for a 30 FPS video). This is because lower chunk sizes would over-engineer the workflow considering the latency that is implemented at other points in the delivery chain, such as the encoder, CDN, and player.
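The chunk durations quoted here follow directly from the frame count and frame rate. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def chunk_duration_ms(frames_per_chunk: int, fps: float) -> float:
    """Duration of a CMAF chunk containing the given number of frames."""
    if frames_per_chunk <= 0 or fps <= 0:
        raise ValueError("frame count and frame rate must be positive")
    return frames_per_chunk * 1000 / fps


# Single-frame chunks and the 4-8 frame range mentioned in the text.
for frames, fps in [(1, 30), (4, 30), (8, 30), (1, 60)]:
    print(f"{frames} frame(s) at {fps} FPS: {chunk_duration_ms(frames, fps):.1f} ms")
```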

Implementing either of these low latency protocols creates the opportunity to reach 5 seconds of latency between encoder input and player output (see below). This example is using a 2 second segment size, which is generally recommended to optimise for the encoder quality and efficiency. Anything lower results in excessive costs and processing requirements for the perceived benefits.

Figure 4 – How 5-second latency is achieved.

For the whole delivery chain to support the HLS LL and DASH LL protocols, HTTP chunked transfer must be supported. This requires the implementation of HTTP/2 at the CDN and the player. To improve bitrate switching for ABR support within the DASH LL protocol, it is necessary to implement the new MPEG DASH mechanism called Resynch (MPEG DASH 4th edition), which provides additional markers (“IDR Pictures”) within a segment from which to initialise a new segment on the player. However, none of the player or packager solutions appear to have implemented this to date.

Then the encoder and packager must inform the player where those starting opportunities are in the other bitrate profiles (see below), and the player must be able to start downloading the new profile’s mp4 file from the right byte-position. This means that the CDN and origin must be able to answer byte-range requests.

In practice, the use of a single segment is unlikely, which adds 2 seconds back to the above latency calculation. To reduce latency down again to 5 seconds would require reducing the segment size to 1 second, which introduces challenges to encoder efficiency, and reduces the margins for error related to new segment downloads. To sustain high quality streaming the player must be able to switch stream and access a new key frame before the buffer runs out. So, if the key frames are 1 second apart and a switch takes about 400ms, then you need at least 1400ms in the buffer. If there are just 1000ms of data (one segment), and if there is some variance in CDN delivery time or in the network – which is likely the cause of the switch request anyway – then there is a high chance of not receiving the next key frame in time, leading to rebuffering and a poor viewer experience.
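The buffer arithmetic in this paragraph can be written down as a simple check. This is a sketch of the reasoning only, with hypothetical function names, and it ignores the delivery-time variance that the paragraph warns about.

```python
def min_buffer_ms(key_frame_interval_ms: int, switch_time_ms: int) -> int:
    """Smallest buffer that still lets the player reach the next key frame
    after spending switch_time_ms detecting the problem and switching."""
    return key_frame_interval_ms + switch_time_ms


def can_switch_in_time(buffer_ms: int, key_frame_interval_ms: int, switch_time_ms: int) -> bool:
    """True if the buffered media outlasts the switch plus the wait for a key frame."""
    return buffer_ms >= min_buffer_ms(key_frame_interval_ms, switch_time_ms)


# Values from the text: key frames 1 second apart, a switch taking about 400 ms.
required = min_buffer_ms(1000, 400)
one_segment_ok = can_switch_in_time(1000, 1000, 400)   # a single 1-second segment buffered
two_segments_ok = can_switch_in_time(2000, 1000, 400)  # two 1-second segments buffered

print(required, one_segment_ok, two_segments_ok)
```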

Figure 5 – Bitrate switching can happen in the middle of a segment due to the chunking method and Resynch implementation.

So, while CMAF LLC is a very important step forwards for achieving broadcast-grade streaming, it doesn’t yet help us reach the ultra-low latency standard that could potentially bring us to reliable 1-second latency, explored in Part 3 of this article.

There are very practical steps available today for Streamers to reach broadcast-grade latency. Low latency protocols from HLS and DASH that implement CMAF Low Latency Chunking are the main proponents of 5-8 seconds of latency from encoder input to player output. But the pull system of streaming creates an opportunity to reach a new 1-second standard, perhaps that we can call “streaming-grade”.

The Third Low Latency Zone – The 1-second Range

HESP (High Efficiency Streaming Protocol) was founded by THEO Technologies and Synamedia, and an alliance of other streaming technology providers have added their weight to the initiative, including MainStreaming, G-Core Labs, Noisypeak, Native Waves, and EZ DRM.

HESP created a new way to look at decision-making within ABR constructs to smoothly handle ultra-low latency. The traditional thinking has always been that quality must come before low latency, which is natural given the overriding interests of consumers and content providers for good quality viewing experiences. So far, we have therefore lived with the fact that we need to switch off alerts from social media and other apps (and try not to listen to our neighbours) in order not to spoil our live event viewing experience. HESP aims to solve this problem by achieving broadcast-grade latency with the safety valve of ABR to avoid rebuffering and poor viewing experiences.

A core design principle in HESP is that the overall delivery chain must be set-up so latency can be pushed to its absolute limits. In other words, we should be able to stream at the fastest possible speed allowed by the inherent latency of the whole delivery chain, without worrying about this causing poor video quality. This means that HESP focuses on three things – 1) using long contribution segments to the player for maximum stream input stability, 2) using standardized low-latency chunk sizes (e.g., 200ms), and 3) using ultra-fast but efficient stream switching abilities.

The use of long segment sizes creates the opportunity for encoding efficiency with high quality. As a rule, shorter segment sizes create a heavier workload on the encoder and create trade-offs for quality. This first technical design point allows HESP to achieve best quality output from the encoder, by allowing time to calculate the optimal motion vectors and use optimal references. A 2-second segment is sufficient for a broadcast-grade encoder output.

The second design point, using standardised low latency chunk sizes of around 200ms, complies with best practices as researched and advocated by the SVA and leading vendors. The overall efficiency of the delivery chain does not typically warrant smaller chunk sizes, even if 1-frame chunks of about 17ms could theoretically be achieved for video at 60 frames per second (or 33ms for 30 FPS video).

The third design point of having ultra-fast switching capabilities is how HESP continues to use ABR within a very low latency environment (note that this also brings fast channel-change capabilities, but this is a separate story). The key to this is to first introduce what HESP call Initialization Frames, a unique feature in this protocol. The Initialisation Frames are generated at the encoder and are output as a second stream. There are choices about how many initialisation frames are necessary – for example, super high-performance streaming can have initialisation frames for every frame, but it is also possible to use a sparse set-up where the initialization frames are inserted every 10-20 frames. These frames are the switching point that the player can use.
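To give a feel for the trade-off between dense and sparse Initialization Frames, the sketch below estimates how long a player would wait for the next switching point. It is an illustrative model only and is not taken from the HESP specification: the function names are hypothetical, and it assumes switching points fall on frame indices that are exact multiples of the chosen interval.

```python
def frames_until_switch(current_frame: int, init_interval: int) -> int:
    """Frames to wait until the next frame index that has an Initialization Frame.

    Returns 0 when the current frame is itself a switching point.
    """
    if init_interval <= 0:
        raise ValueError("init_interval must be positive")
    return (-current_frame) % init_interval


def switch_wait_ms(current_frame: int, init_interval: int, fps: float) -> float:
    """The same wait expressed in milliseconds for a given frame rate."""
    return frames_until_switch(current_frame, init_interval) * 1000 / fps


# Dense set-up: an Initialization Frame for every frame.
print(switch_wait_ms(25, 1, 30))

# Sparse set-up: an Initialization Frame every 10 frames, 30 FPS video.
print(round(switch_wait_ms(25, 10, 30), 1))
```

In this model the worst-case wait for a sparse set-up is one frame short of the interval, which is the price paid for generating and storing fewer Initialization Frames.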

The second stream concept is inevitably going to create additional egress from the origin, but it doesn’t add any extra overhead to the CDN, because the Player will only request chunks from one of the two available streams, rather than from both. For high-performance streaming needs, perhaps only applied during live events that use high resolution formats, this level of investment could easily make very good economic sense for the excellent viewer experience it should deliver.

In the end, the buffer size that HESP can work with is the length of time it takes a player to receive the second stream from the CDN, which can easily be in the 300-500ms range if the content is already in the Edge Cache. This allows the player to receive the right content very quickly which solves one of the most important issues to reduce latency to its absolute lowest levels in an ABR streaming environment.

Because of these capabilities, in an ultra-high-performance single-frame chunk size set-up with HESP, it is possible to consider the following latency table. The length of the single frame is assumed to be 33ms from a 30 frames per second video (rounded to the nearest millisecond).

Figure 6 – The HESP construct of the Initialization stream that supports the main “Continuation stream”.

As Pieter-Jan Speelmans at THEO concludes: “HESP can quite easily achieve sub-second latency from encoder input to player output given that we have implemented a real-time back-up plan with initialization frame architecture. If the delivery chain conditions are excellent, we have observed as low as 400ms end to end latency. Of course, if we want to reduce risk, we can increase latency to 3-4 seconds to achieve a completely buffering-free experience and still deliver better than broadcast-grade latency. We have spent over 5 years working on this approach and believe it can truly bring D2C Streamers the key low latency performance that will move streaming performance to the next level.”

Figure 7 – How 1-second latency is achieved.

Conclusion

The engineering tools to achieve near-broadcast grade latency of 5-8 seconds are now embedded in the HLS and DASH specifications. Implementations are still in the early stages, but clearly D2C Streamers with the need for low latency performance now have the ability to reengineer for lower latency.

The restrictions of the CMAF LLC format have encouraged the introduction of HESP, to find ways to overcome the inherent trade-off between quality and latency and beat what we currently call “broadcast-grade streaming”.

In streaming we need to consider the quality of the network and the available capacity to pass bits from encoder to player. ABR is a very important tool for us to protect the viewer experience, but low latency really stretches ABR to its limits. The set-up of the encoder, packager, and player is fundamental to achieving low latency, and an excellent CDN is essential to quickly pass through the smaller chunk sizes and sustain the overall streaming performance, particularly when large live audiences are involved and “latency sensitivity” is at its highest.

The engineering innovation of HESP creates incredible possibilities for “streaming-grade” to be considered the gold standard of content delivery. Remarkably, it could be even better than broadcast-grade. No pun intended, but things move fast in the world of streaming.
