OTT - What and Where to Monitor – Part 3

In the last two articles in this series we looked at why we need to monitor in OTT. Then, through analysing a typical OTT distribution chain, we sought to understand where the technical points of demarcation and challenges arise. In this concluding article, we look at what and where to monitor in a multi-service-provider OTT delivery system.

Quality of experience (QoE) is the key deciding factor in designing a monitoring infrastructure. We have already seen how the seemingly simple challenge of delivering a live sports event to IP playback or mobile devices soon becomes incredibly complex.

The key aspects of QoE for the viewer are stable pictures and distortion-free sound. Even the slightest audio click or picture freeze can leave viewers disgruntled and looking for an alternative source of entertainment, resulting in lost revenue for the broadcaster.

Start with Video and Audio

Video and audio leaving the studio are the first point of monitoring. The usual video and audio levels must be checked to confirm they are within specification and that there are no frozen or black frames. A single video and audio stream may be compressed to deliver five or six different bit-rates to comply with DASH and HLS protocols. Excessive noise or out-of-specification signals can have devastating consequences during this process.
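The freeze and black-frame checks can be illustrated with a minimal sketch. It assumes each frame is already available as a flat list of 8-bit luma samples; the function name, the black threshold of 20, and the three-frame freeze limit are all hypothetical. Real detectors tolerate noise rather than testing for exact equality between frames.

```python
def check_frames(frames, black_max=20, freeze_min=3):
    """Return (black_indices, frozen_runs) for a list of luma frames.

    frames: list of frames, each a flat list of 8-bit luma samples.
    black_max: mean luma at or below this counts as black (assumed value).
    freeze_min: identical consecutive frames needed to report a freeze.
    """
    # A frame is treated as black when its average luma is very low.
    black = [i for i, f in enumerate(frames) if sum(f) / len(f) <= black_max]

    # A freeze is a run of identical frames at least freeze_min long.
    frozen = []
    run_start = 0
    for i in range(1, len(frames) + 1):
        if i == len(frames) or frames[i] != frames[run_start]:
            if i - run_start >= freeze_min:
                frozen.append((run_start, i - 1))
            run_start = i
    return black, frozen


# Three identical frames, one black frame, then a normal frame.
sample = [[100, 120], [100, 120], [100, 120], [0, 0], [50, 60]]
black, frozen = check_frames(sample)
print(black, frozen)
```

Running the check before the encoder matters because every bit-rate in the adaptive ladder inherits whatever fault is present in the source.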

QoE further leads to calculating and establishing the Mean Opinion Score (MOS) to represent the quality of the system. This is the arithmetic mean of ratings given on a predefined scale and expresses the perceived video and audio quality. MOS scales range from 1 for bad to 5 for excellent.
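As a small worked example of the arithmetic, the sketch below averages a hypothetical panel of ratings on the 1 to 5 scale. The function name and the sample ratings are illustrative only.

```python
def mean_opinion_score(ratings):
    """Return the arithmetic mean of ratings on a 1 (bad) to 5 (excellent) scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for r in ratings:
        if not 1 <= r <= 5:
            raise ValueError(f"rating {r} is outside the 1-5 scale")
    return sum(ratings) / len(ratings)


# Hypothetical panel of five viewers rating the same stream.
panel = [4, 5, 3, 4, 4]
print(mean_opinion_score(panel))  # 4.0
```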

Any SCTE-35 ad-insertion markers must be checked to verify the data is correct on leaving the broadcaster. Compliance with closed-caption and audio-loudness requirements must also be established, and HDR metadata must be confirmed as valid. The streams can also be decoded and compared to the original to confirm they are good at the headend before moving to the distribution network.

Headend Monitoring

Monitoring must be added post segmentation and packaging to make sure the file chunks and manifest files are correct and comply with the relevant standard. At this point the files are pushed to either the origin server or directly to the CDN edge servers depending on the model adopted by the broadcaster. Again, these files must be monitored to confirm they are arriving at the server in good time, are intact, the right size, and that the bit rates are correct.
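One such manifest check can be sketched for an HLS media playlist. The example below is a minimal illustration, not a full validator: it only confirms the `#EXTM3U` header, reads `#EXT-X-TARGETDURATION`, and flags any `#EXTINF` duration that exceeds the target once rounded to the nearest integer. The function name and the sample playlist are hypothetical.

```python
def check_media_playlist(text):
    """Return a list of problems found in an HLS media playlist."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    problems = []
    if not lines or lines[0] != "#EXTM3U":
        problems.append("missing #EXTM3U header")

    target = None
    durations = []
    for ln in lines:
        if ln.startswith("#EXT-X-TARGETDURATION:"):
            target = int(ln.split(":", 1)[1])
        elif ln.startswith("#EXTINF:"):
            # Tag format is "#EXTINF:<duration>,[<title>]".
            durations.append(float(ln.split(":", 1)[1].split(",", 1)[0]))

    if target is None:
        problems.append("missing #EXT-X-TARGETDURATION")
    if not durations:
        problems.append("no segments listed")
    for i, d in enumerate(durations):
        # Rounded segment duration must not exceed the target duration.
        if target is not None and round(d) > target:
            problems.append(f"segment {i} lasts {d}s, over target {target}s")
    return problems


playlist = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:6
#EXT-X-MEDIA-SEQUENCE:100
#EXTINF:6.006,
seg100.ts
#EXTINF:9.2,
seg101.ts
"""
print(check_media_playlist(playlist))
```

A production probe would also fetch each listed chunk and compare its size and arrival time against expectations, which this sketch leaves out.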

The CDN edge servers should hold just enough file chunks and manifest files to service the playback and mobile devices. Unlike traditional RF and cable broadcasting, the playback and mobile devices are not synchronous. Each device is free running and requests data from the edge server at a different time.

If the CDN edge server has not been correctly configured and it’s not caching effectively, it may find it cannot service a chunk request from the playback or mobile device. The edge server then requests repeat chunks and manifest files from the origin server. If this occurs regularly or for many devices, the result is excessive load on the origin servers as well as unnecessary bandwidth consumption on the network. Again, there is a compromise, because if the cache is too big, excessive latency occurs.

Figure 1 – When a mobile or playback device requests a chunk from the edge server, the edge server requests the chunk from the origin. It then stores the chunk in the cache so that when the next device requests the segment it’s available locally without the request having to go back to the origin server. If the edge server’s cache retention is too short, that is, it deletes the chunks prematurely, any subsequent device requesting a chunk that should be cached will force the edge server to request the chunk again from the origin server. This adds unnecessary load to the origin server as well as increasing the load on the network bandwidth.
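The effect of cache retention on origin load can be shown with a toy simulation. This is a sketch under simplifying assumptions: a single edge server, a fixed retention time per chunk, and a hypothetical request trace. Real CDN caches use more elaborate eviction policies.

```python
def origin_fetches(requests, retention_s):
    """Count requests the edge server must forward to the origin.

    requests: iterable of (time_s, chunk_name), sorted by time.
    retention_s: how long the edge keeps a chunk after fetching it.
    """
    fetched_at = {}
    misses = 0
    for t, chunk in requests:
        last = fetched_at.get(chunk)
        # Miss if the chunk was never fetched or has already been deleted.
        if last is None or t - last > retention_s:
            misses += 1
            fetched_at[chunk] = t
    return misses


# Hypothetical trace: several free-running devices asking for two chunks.
trace = [(0, "seg1"), (2, "seg1"), (5, "seg1"),
         (6, "seg2"), (9, "seg2"), (12, "seg1")]

print(origin_fetches(trace, retention_s=1))   # 6: every request reaches the origin
print(origin_fetches(trace, retention_s=10))  # 3
print(origin_fetches(trace, retention_s=20))  # 2: one fetch per chunk
```

With a one-second retention every device request goes back to the origin, which is the premature-deletion case described in the caption.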

Intelligently monitoring the data flow between the edge and origin servers is critical. It’s not just a matter of monitoring data-rates, as the information they provide is limited. We need to know if multiple chunk and manifest file requests are occurring and how often. Therefore, the monitoring probe must be able to understand the DASH or HLS protocol to determine if an error is occurring or has occurred.

Furthermore, correlating monitoring data across multiple bit rates and multiple protocols such as DASH and HLS is incredibly difficult, especially if they are spread over several silos. Tracing these errors back to their source is essential to resolving them.

Low-level protocol data errors must also be checked. TCP is very good at masking underlying link or device errors. By its very nature, it will continue to resend packets if the receiver either does not receive or does not acknowledge them. Monitoring just the data-rate will not help, as TCP will continually resend data packets, so the bit-rate on the link looks good while the useful data throughput is very low. These errors may occur on one CDN but not another. Unless adequate rate-shaping is applied to a server, a misbehaving TCP link can easily consume disproportionate amounts of bandwidth, resulting in the loss of other services.
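The gap between link bit-rate and useful throughput can be made concrete with a short calculation. The sketch assumes byte counters for total and retransmitted TCP payload are available from a capture or probe; the function name and the sample figures are hypothetical, and protocol header overhead is ignored.

```python
def goodput_report(bytes_on_wire, bytes_retransmitted, interval_s):
    """Return (line_rate_bps, goodput_bps, retransmission_ratio)."""
    line_rate = bytes_on_wire * 8 / interval_s
    # Goodput counts only data that was not a repeat of earlier packets.
    goodput = (bytes_on_wire - bytes_retransmitted) * 8 / interval_s
    ratio = bytes_retransmitted / bytes_on_wire if bytes_on_wire else 0.0
    return line_rate, goodput, ratio


# Hypothetical 10-second sample: 10 MB seen on the link, 6 MB of it resent.
line_rate, goodput, ratio = goodput_report(10_000_000, 6_000_000, 10)
print(f"line rate {line_rate / 1e6:.1f} Mbit/s, "
      f"goodput {goodput / 1e6:.1f} Mbit/s, "
      f"retransmitted {ratio:.0%}")
```

In this sample the link appears to carry a healthy 8 Mbit/s, yet only 3.2 Mbit/s is new data, which is the condition a plain bit-rate monitor would miss.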

Consuming Bandwidth

An origin-server may easily provide six bit-streams with varying data-rates for an adaptive distribution, and each of these will ultimately be a TCP connection. If one of them misbehaves and consumes the link bandwidth, all the other streams will cease to work. Normal bit-rate monitoring may not easily detect this situation, as the bit rate will appear correct, so specific DASH- and HLS-aware monitoring must be used.

Load-balancing is the process of distributing HTTP messages to origin-servers to evenly distribute the load processing between them. In a virtualized environment, new servers can be quickly added to respond to peak demands and removed once the event is complete. Monitoring the connections to the load-balancer will soon expose requests that are not being correctly responded to, or excessive latency.

Monitoring at the data level is often referred to as Quality of Service (QoS). Importantly, this doesn’t require decoding into audio and video as we can make some assumptions about the integrity of the video and audio based on the delivery and progress of the chunks and manifest files.

Figure 2 – Adding a combination of media probes and intelligent monitoring is key to building a successful infrastructure, especially when many service vendors are involved.

Some CDN providers give broadcasters monitoring access to allow intelligent monitoring along the signal path. This allows broadcasters to measure the chunk and manifest files to confirm they are progressing through the chain correctly. If this service is not available, the broadcaster can add intelligent monitoring before and after the CDN.

Improvement Through Collaboration

Although it may seem counterintuitive for a CDN provider to offer monitoring, what is becoming clear is that if a CDN provider can quickly demonstrate their system is working correctly, then the broadcaster can look elsewhere to determine the source of an error. More and more CDN service providers see the benefit of collaboration and are now working together to provide intelligent monitoring tools to deliver a better service, thus breaking down the silo culture.

Further monitoring can be added to the Access and ISP points to analyse the final point in the chain. Bringing all the monitoring back to a central resource for the broadcaster now delivers an incredibly detailed, proactive, and efficient monitoring system. Broadcasters can quickly determine if an error has occurred and have deep insight into the network to understand where it has occurred.

Maintaining QoE is paramount to keeping viewers engaged and watching programs. And now, data-analytics companies can determine if a playback or mobile device is providing a good QoE for the viewer. The software probes can determine if a DASH system is constantly switching between streams and why, for example. They can establish if there is buffer underrun, or poor network coverage.
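A minimal sketch of the kind of summary a player-side probe might produce is shown below. It assumes the player reports a simple event list of bit-rate selections and stall durations; the event format, function name, and sample session are all hypothetical rather than taken from any particular analytics product.

```python
def summarise_session(events, duration_s):
    """Summarise stream switching and buffer underruns for one session.

    events: list of (time_s, kind, value), where kind is "bitrate"
            (value in kbit/s) or "stall" (value is stall length in seconds).
    """
    switches = 0
    stall_time = 0.0
    current = None
    for _, kind, value in events:
        if kind == "bitrate":
            # Count a switch only when the selected bit-rate changes.
            if current is not None and value != current:
                switches += 1
            current = value
        elif kind == "stall":
            stall_time += value
    return {
        "switches_per_min": switches * 60 / duration_s,
        "stall_ratio": stall_time / duration_s,
    }


session = [
    (0, "bitrate", 3000), (10, "bitrate", 1500), (20, "bitrate", 3000),
    (25, "stall", 1.5), (30, "bitrate", 800), (50, "bitrate", 800),
]
print(summarise_session(session, duration_s=60))
```

A high switch rate alongside a non-zero stall ratio would suggest the device is struggling with its network connection rather than with the content itself.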

Understand Who is Affected

Companies must understand the interaction between their respective points of demarcation now. The impending migration to virtualized infrastructures and cloud makes this kind of monitoring critical.

Being able to understand whether a problem has occurred at the headend, CDN, or access network is incredibly powerful for a broadcaster. Establishing who was affected, how often, and when, will help broadcasters quickly fix a fault as well as deal with any resulting viewer fall-out.
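Once probe results from each point of demarcation are brought together, locating the fault can be as simple as walking the chain in order. The sketch below assumes each probe reports a pass or fail flag; the stage names and the function are hypothetical.

```python
# Delivery chain in signal-flow order (assumed stage names).
CHAIN = ["headend", "origin", "cdn", "access"]


def locate_fault(probe_ok):
    """Return the first stage whose probe reports a failure, or None.

    probe_ok: dict mapping stage name to True (good) or False (fault).
    Stages with no probe result are skipped.
    """
    for stage in CHAIN:
        if probe_ok.get(stage) is False:
            return stage
    return None


# Headend and origin look good, so the fault is first seen at the CDN.
print(locate_fault({"headend": True, "origin": True, "cdn": False, "access": False}))
```

The earliest failing stage is the most likely origin of the fault, since every later stage inherits it.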

But the end game is to be able to detect problems before they become apparent to a viewer so they can be dealt with quickly. As intelligent monitoring drives deeper into multi-service-provider networks and facilities, problems will be identified more quickly and, more importantly, the silo culture will diminish as service providers collaborate further.

