OTT (Or Is It ABR?) - Part 2 - The Mechanics Of ABR Delivery

ABR delivery offers the prospect of automatic adaptation of bit rates to the prevailing network conditions. Since the client is aware of the set of available bit rates, it can determine, segment-by-segment, which is the optimal bit rate to use.

This article was first published as part of Essential Guide: OTT (or is it ABR?)


The aim is to receive the highest bit rate that the network can deliver (based on the measured download rate) while avoiding too many rate step-changes. There are two things the client should avoid: a) running at too low a bit rate, since that results in more quantization artifacts and possibly reduced resolution, and b) changing the selected rate too often. A change in either direction causes a visual disturbance to the viewer, but a “panic” change down (i.e. when a segment at a given rate is failing to download quickly enough) is severely disruptive.

One unfortunate aspect is that ABR clients create bursty, periodic traffic. That is not an issue if there is only one client, but when multiple ABR clients share the same capacity, the measured download rate can be unstable or unfair. The degree of instability is a function of how responsive the client is to changes in measured rate: too fast and the selected rate will change frequently, too slow and there is a risk of congestion and a resulting panic rate change. Stability is also a function of the segment length.
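The responsiveness trade-off can be illustrated with a smoothed throughput estimate. The sketch below uses an exponentially weighted moving average, which is one common smoothing technique and not necessarily what any particular player does; the function name, the `alpha` parameter and the sample values are all assumptions chosen for illustration.

```python
def ewma_estimates(samples_kbps, alpha):
    """Return the smoothed throughput estimate after each sample.

    alpha close to 1.0 reacts quickly to each new measurement;
    alpha close to 0.0 reacts slowly. (Hypothetical helper.)
    """
    estimates = []
    estimate = None
    for sample in samples_kbps:
        if estimate is None:
            estimate = float(sample)  # first sample seeds the estimate
        else:
            estimate = alpha * sample + (1.0 - alpha) * estimate
        estimates.append(estimate)
    return estimates


# One short dip in measured rate, e.g. caused by another client's burst.
samples = [8000, 8000, 2000, 8000, 8000]
fast = ewma_estimates(samples, alpha=0.9)  # follows the dip almost fully
slow = ewma_estimates(samples, alpha=0.1)  # barely moves
```

With these made-up samples, the fast estimator drops to roughly 2600 kbit/s on the dip and would probably trigger a rate change, while the slow one stays near 7400 kbit/s and would ride it out, at the cost of reacting late if the drop were real.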

Understanding The Challenge of Low Latency ABR

Achieving low latency for OTT, or even for ABR on-network, is not trivial. To understand why, it’s important to consider the way data is formatted and transferred, and how networks and clients behave. A typical configuration today will use 6 second segments, with 30 or 60 frames per second, depending on the profile. In the sections below, the diagrams show a reduced number of frames per segment, just for clarity.

Figure 2: ABR live workflow.

Live content is initially processed by the live transcoder in real time. Some latency is incurred at that point, but in principle it is equivalent to the buffering delay of conventional TV, so it is not included in this comparison. Encoded video is fed to the packager as a continuous data flow (with embedded boundary point markers denoting the start and end of each segment), where it is formatted into segment files and published onto the origin server. The segment duration is typically about 6 seconds, and the manifest cannot be updated to advertise a segment’s availability until the complete segment has been placed on the origin. Once it is available, it can be pulled by a client via the CDN (which caches it in the process) using a single HTTP transfer. Whole-file transfers must take place, and unless the CDN bandwidth is very much higher than the sum of the data being moved across it, the time that must be allowed to guarantee that each file has been moved is less than a full segment duration but not negligible. The CDN may also not serve the data from its cache until the segment is fully received.

At the client, there are typically three segments in operation: one being decoded, one ready to be decoded next and one being received. The “next segment” needs to be complete and waiting because of the need to adapt bit rate. Each time the bit rate adapts significantly, the perception of quality is impaired (the quality discontinuity is visually unpleasant). Since each client makes an autonomous decision about what bit rate to attempt to download, there is a need for it to:

Measure download rate with a reasonable accuracy.
Determine whether the segment will be received sufficiently ahead of time.
Abandon an HTTP transfer, start a new one and complete the download without underflowing the buffer.
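The first two requirements in the list above can be sketched as a check made part-way through a transfer. This is a minimal illustration under stated assumptions, not a description of any real player: the function names are hypothetical, the deadline is counted from the start of the transfer, and the measured rate is assumed to hold for the rest of the download.

```python
def measured_rate_bps(bytes_received, elapsed_s):
    """Average download rate so far, in bits per second."""
    return 8.0 * bytes_received / elapsed_s


def should_abandon(bytes_received, elapsed_s, segment_bytes, deadline_s):
    """Decide whether an in-flight segment download should be abandoned.

    deadline_s is the time, counted from the start of the transfer, by
    which the whole segment must have arrived (an assumed convention).
    """
    rate = measured_rate_bps(bytes_received, elapsed_s)
    if rate <= 0.0:
        return True  # nothing has arrived, so the deadline cannot be met
    remaining_bits = 8.0 * (segment_bytes - bytes_received)
    projected_total_s = elapsed_s + remaining_bits / rate
    return projected_total_s > deadline_s


# A 6 second segment encoded at 8 Mbit/s is about 6,000,000 bytes.
congested = should_abandon(500_000, 2.0, 6_000_000, deadline_s=6.0)
healthy = should_abandon(2_500_000, 2.0, 6_000_000, deadline_s=6.0)
```

In the congested case only 500,000 bytes arrive in the first 2 seconds (2 Mbit/s), so the projected completion time is far beyond the deadline and the transfer is abandoned. In the healthy case the projection is under 5 seconds and the transfer continues.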

Consider what happens when the available bandwidth suddenly reduces, for example because of competing traffic on the network connection.

Figure 3 shows a single example with three representations (or “profiles”), with a, b and c designating the representation, and 1, 2, 3 and 4 showing the time sequence. Since the content is being encoded live, each segment has an earliest availability point, after which it can be downloaded. In the absence of any restriction, the highest representation will be downloaded.

Figure 3: Segments available over time to be downloaded by client.

Next consider the case of competing traffic causing the transfer rate to reduce. Having downloaded 2a, the client next tries to download 3a. This time, however, because of the competing traffic, the rate is much lower, meaning the transfer would not complete in time. For the client to be able to make that decision, enough data must have been downloaded to give a reasonably accurate measurement. While there is no single right answer to how long to measure, about 2 seconds’ worth of data appears to give a reasonable balance between time taken and accuracy. Since there is inevitably some uncertainty, clients request a representation with a lower bit rate than the capacity they measure: about 70% of the measured rate is typical. Without this margin, the clients would spuriously adapt rates too frequently, causing frequent visual disturbance.
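The 70% margin can be expressed as a simple selection rule. The sketch below is illustrative only: the bit rate ladder, the function name and the fallback to the lowest rung when nothing fits are assumptions, not something specified by the article.

```python
def select_bitrate(ladder_bps, measured_bps, safety_factor=0.7):
    """Pick the highest representation that fits within a fraction of
    the measured download rate; fall back to the lowest rung if none fit.
    """
    budget_bps = safety_factor * measured_bps
    candidates = [rate for rate in ladder_bps if rate <= budget_bps]
    return max(candidates) if candidates else min(ladder_bps)


# Hypothetical three-representation ladder (a, b, c in the figures).
ladder = [1_500_000, 4_000_000, 8_000_000]

choice_at_12m = select_bitrate(ladder, 12_000_000)  # budget 8.4 Mbit/s
choice_at_10m = select_bitrate(ladder, 10_000_000)  # budget 7.0 Mbit/s
choice_at_1m = select_bitrate(ladder, 1_000_000)    # nothing fits
```

Note that a client measuring 10 Mbit/s does not request the 8 Mbit/s representation even though it would nominally fit: the margin keeps it on the 4 Mbit/s rung so that ordinary measurement noise does not cause repeated switching.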

Figure 4: Rate adaptation by the client device.

Once the client has made the decision to abandon a transfer and drop to a lower rate, some time will have elapsed (i.e. the time taken to make the measurement and then the decision). The alternative representation for the same segment must therefore be downloaded in substantially less time than normal, which is why clients often drop all the way to the lowest rate representation when adapting downwards. Unfortunately, this also maximizes the visual disturbance caused by the adaptation. If shorter segments were used, the period available for measurement would be shorter, leading to a corresponding increase in measurement volatility (and therefore in the frequency of spurious adaptation). This can be mitigated to some extent by further reducing the 70% figure, but either way, using shorter segments to reduce latency will lead to a reduction in quality.
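A rough calculation shows why the fallback tends to land on the lowest representation. The sketch assumes that the client normally has one segment duration in which to fetch a segment, that the time spent measuring is deducted from that budget, and that the same 70% margin still applies; the ladder and the function name are hypothetical.

```python
def fallback_bitrate(ladder_bps, segment_s, time_left_s, measured_bps,
                     safety_factor=0.7):
    """Highest representation whose whole segment can still be fetched
    in time_left_s at a fraction of the measured rate.
    """
    # A segment of duration segment_s at rate r holds r * segment_s bits,
    # so the highest affordable r scales with time_left_s / segment_s.
    max_rate_bps = safety_factor * measured_bps * time_left_s / segment_s
    feasible = [rate for rate in ladder_bps if rate <= max_rate_bps]
    return max(feasible) if feasible else min(ladder_bps)


ladder = [1_000_000, 2_000_000, 4_000_000, 8_000_000]

# Full 6 s budget at a measured 4 Mbit/s: up to 2.8 Mbit/s is affordable.
normal = fallback_bitrate(ladder, 6.0, 6.0, 4_000_000)

# 2 s already spent measuring, so only 4 s remain: about 1.87 Mbit/s.
after_abandon = fallback_bitrate(ladder, 6.0, 4.0, 4_000_000)
```

With these illustrative numbers, the client that has already spent 2 seconds measuring can only afford the 1 Mbit/s rung, one step lower than it could sustain in steady state.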

The client’s need for measurement is fundamental to the client being able to adapt autonomously. Other system delays, however, are not fundamental and there exist mechanisms to improve the latency.

Overall, the end-to-end latency looks similar to Figure 5, with complete segments moving between each stage.

Figure 5: Overall end-to-end latency with conventional ABR and HTTP transfer.
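The way whole-segment stages add up can be approximated with a simple budget. The figures below are illustrative assumptions, not measurements from the article: one segment duration for packaging (the segment must be complete before it is advertised), half a segment duration allowed for the CDN transfer, and two complete segments held at the client.

```python
def end_to_end_latency_s(segment_s, cdn_fraction=0.5,
                         client_buffered_segments=2):
    """Rough latency budget when complete segments move between stages."""
    packaging_s = segment_s                 # segment must be complete first
    cdn_transfer_s = cdn_fraction * segment_s  # less than one full segment
    client_s = client_buffered_segments * segment_s
    return packaging_s + cdn_transfer_s + client_s


latency_6s_segments = end_to_end_latency_s(6.0)  # 6 + 3 + 12 seconds
latency_2s_segments = end_to_end_latency_s(2.0)  # 2 + 1 + 4 seconds
```

Under these assumptions every term scales with segment duration, which is why shortening segments looks attractive for latency, and why the measurement problem described earlier becomes the limiting factor.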

Another aspect to consider is that a client also needs to guess when to next request a segment (or in the case of HLS, when to poll for the next update to the manifest). This results in further latency as the client needs to avoid asking too early, so must apply some degree of conservatism.
