A Video Quality and Measurement Overview: Part 2

In this Part 2 of the series, the author reviews best practices and tools needed to measure consumers’ QoE.

In the first part of this series, we focused on content preparation. This second installment will focus on content distribution. Distribution architectures vary greatly from service provider to service provider, but there are generally some common elements that we can focus on: hub sites, regional edge sites, and playback/viewing devices.

Hub Sites

Hub sites are typically fed via small to large fiber rings. Traditionally there has been little need to spend energy (or money) on QoE monitoring at this point, because the content itself was not being processed here. However, with the advent of distributed processing architectures, such as ad insertion and just-in-time packaging of adaptive bitrate content (HLS, HDS, DASH…), this is no longer always the case.

In general, the primary goal at these hub sites is to localize any possible network distribution issues. To measure key packet performance metrics, you will need to perform IP inspection (HTTP, UDP, RTMP, etc.). You may also need to inspect higher protocol layer content containers, such as MPEG Transport Stream metrics, to determine packet loss: some architectures lack any form of IP encapsulation such as RTP, making it necessary to look at metrics like MPEG PID continuity counters to infer loss in the network. You’ll also want to measure network jitter using metrics like MDI (Media Delivery Index, defined in RFC 4445) to understand how your routers and switches handle video traffic bursts and buffering.

Building caching nests (distributed caching sites deployed tactically at certain edge areas of the network) is very common now, but can your network infrastructure handle microbursts? I could also discuss for hours how often we see just a few misbehaving IPTV multicast flows adversely impacting many other flows across a common router interface. Before any new video content distribution network carries production traffic, it’s highly recommended to perform comprehensive baseline testing of video bursts, jitter, and loss to understand the mechanics of the deployed hardware and software. Would you believe that a firmware bug on a router line card could cause a single bit to change in an IP packet? Neither did I, until we diagnosed exactly such a case.

Regional Edge Sites

Attention now shifts to the regional edge sites, which are downstream of the hubs. Here again, the content may be aggregated or transformed prior to final distribution. You might be dealing with last-mile distribution to hundreds of service groups, edge-QAMs, DSLAMs, LTE wireless, and more, so the ability to quickly isolate issues is the primary consideration. Monitoring coverage is crucial in this part of the network, because issues here will lead to many more “truck rolls,” compared to issues in the head-end or core of the network, which are regularly staffed.

A fine balance must be struck between the amount of monitoring and the accuracy of that monitoring. I recommend targeting a false positive rate of less than 0.1% (fewer than one false alarm in 1,000). This is a tall order, but absolutely achievable with the right plan and architecture. I would argue that it’s not required (or economically feasible) to have probes in every one of those edge segments; intelligent scanning and probing can cover the situation in many cases. If you can guarantee detection of an issue with this methodology within a 10-minute scanning window, you can dramatically reduce the Mean-Time-To-Repair (MTTR) for network issues.
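The scanning trade-off is easy to sanity-check on paper. The sketch below works through the arithmetic for a round-robin scanning plan; the group counts, probe count, and dwell time are illustrative assumptions, not figures from the article:

```python
# Back-of-the-envelope check for a round-robin scanning plan:
# can P shared probes sweep N edge service groups within the
# 10-minute (600 s) detection window? Numbers are illustrative.

def sweep_time_seconds(groups: int, probes: int, dwell_s: float) -> float:
    """Time for each probe to visit its share of service groups once."""
    groups_per_probe = -(-groups // probes)   # ceiling division
    return groups_per_probe * dwell_s

# Example: 480 service groups, 8 probes, 10 s dwell per group
t = sweep_time_seconds(480, 8, 10)
print(t, t <= 600)  # 600.0 True -> every group is checked within 10 minutes
```

Running the same calculation against your real topology tells you the minimum probe count (or maximum dwell time) compatible with your detection-window guarantee, before any hardware is purchased.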

You’ll also want to eliminate those sporadic “ghost” problems that keep coming up time and time again. I recall a case a few years ago where a service provider was frustrated by the relatively high rate of calls/truck rolls being experienced in a certain section of their network. After planning and deploying a proper monitoring architecture, they quickly localized the issue to a weekly event every Saturday afternoon. They dispatched a crew the following week to investigate and found a lawn maintenance crew operating around the acquisition antennas. Interference from the engines of some of their equipment was overdriving the satellite receivers, leading to intermittent outages.

Issue trending and reporting should be another key monitoring focus at the network edge, possibly even more so than at the head-end. Let’s face it: most costly truck rolls occur not because of well-known head-end outages but because of sporadic, recurring issues originating much further out in the network.

The Playback/Viewing Device

We’ve now reached the proverbial end-of-the-line monitoring target: the consumer device itself. Although long-established standards and platforms have been applied for the last 10-20 years (OOB analog signaling, TR-069, SNMP polling, etc.), they’ve failed to keep up with the tsunami of delivery methodologies and the proliferation of consumer devices ingesting today’s content. Consumers are now being freed from the historical shackles of service provider-mandated equipment, and we can use our smartphones, tablets, TVs, game consoles, and more to consume video content whenever, wherever, and however we want it.

In order to properly evaluate QoE, you must have visibility into the delivery infrastructure, even if it’s not your own.

But what does that mean to a service provider? How can you make sure that your customers have a great Quality of Experience if they’re viewing content on a device you do not own, carried over a last mile network that isn’t yours, cached by a third-party CDN, and prepped and packaged by a third-party cloud-based head-end? Daunting, isn’t it? But you can take back control of this complex ecosystem by having proper measurements and methodologies in place that provide real-time visibility. Daily or weekly reports telling you that playback startup time was slower than average or that HTTP errors are on the rise at night will not help; all they do is validate your customer churn and negative feedback. You need data that is reliable and real-time so that appropriate technical and business decisions can be made to address the issues. You need to know the “whys.”

It might seem counter-intuitive at first, but if you want the end-device monitoring to be valuable and effective, you must have visibility into the delivery infrastructure, even if it’s not your own.

For example:

  • Are you monitoring the publishing of content from your cloud head-end partner to your CDN partner?
  • Are you proactively monitoring CDN performance from a content availability perspective?
  • Do you have a proper SDK/agent strategy for your consumer devices that provides more than simple operating system (iOS, Android, etc.) and media player log scraping?
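On the second point, one form of proactive content-availability monitoring is to periodically fetch HLS playlists from each CDN edge and validate them before customers hit the problem. Below is a minimal sketch of the validation step; the sample manifest, field checks, and threshold are illustrative assumptions, not a complete probe (a real one would also fetch over HTTP, time the response, and verify segment reachability):

```python
# Sketch: validating the basic health of a fetched HLS media playlist.
# The manifest syntax (#EXTM3U, #EXT-X-TARGETDURATION, segment URIs) is
# standard HLS; the sample body and threshold below are illustrative.

def check_hls_playlist(body: str, max_target_duration: float = 10.0) -> dict:
    """Return a simple health verdict for an HLS media playlist body."""
    lines = [ln.strip() for ln in body.strip().splitlines() if ln.strip()]
    result = {"valid": False, "segments": 0, "target_duration": None}
    if not lines or lines[0] != "#EXTM3U":
        return result  # not an HLS playlist at all
    for ln in lines:
        if ln.startswith("#EXT-X-TARGETDURATION:"):
            result["target_duration"] = float(ln.split(":", 1)[1])
        elif not ln.startswith("#"):
            result["segments"] += 1  # a segment URI line
    result["valid"] = (
        result["segments"] > 0
        and result["target_duration"] is not None
        and result["target_duration"] <= max_target_duration
    )
    return result

sample = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:6
#EXTINF:6.0,
seg_001.ts
#EXTINF:6.0,
seg_002.ts
"""
print(check_hls_playlist(sample))  # {'valid': True, 'segments': 2, 'target_duration': 6.0}
```

Run against every CDN edge on a schedule, a check like this turns “the CDN seems fine” into measurable availability data you can hold against an SLA.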

If you answered no to any of those questions, how will you deal with the finger-pointing when problems do arise (and trust me, they will)? There are several schools of thought, some advocating brute-force solutions like CDN switching when problems are seen. In my experience, that’s a shotgun approach in the dark of night. If you can’t pinpoint where the problem is, all you are doing is over-engineering the cost and complexity of the platform. You may occasionally get lucky and improve something with this approach, but you’ll never really know why. Is that really an appropriate quality strategy?

To properly identify and repair issues, engineers need to bring together the “operational” nature of traditional service monitoring with the “behavioral” nature of the end-user experience. This means you need to require performance-oriented SLAs that you can independently measure, validate, and qualify, all in real time.

Tying it All Together

To gain the necessary perspective on quality, it all comes down to tying together the “operational” nature of traditional service monitoring with the “behavioral” nature of end-user content consumption and service usage. Today’s technologies offer unprecedented capabilities to capture the quality at the viewer’s screen and correlate it with the distribution ecosystem’s performance. And while you may not own the overall network or delivery ecosystem, you are still responsible for the viewer’s QoE: if viewers are unhappy, they will stop paying for subscriptions or viewing ads. If you are using a multi-vendor OTT distribution architecture, I would advocate a “trust but verify” approach: select solid partners and solutions you can count on, but also work out performance-oriented SLAs that you can independently measure, validate, and qualify in real time.

Summary

In this two-part series, I first focused on measuring the quality of the content and its preparation for downstream delivery. In this part, I emphasized the need to measure and monitor at multiple points in the distribution and delivery network. The two parts were separated only by publishing space constraints; in reality, end-viewer quality is an end-to-end game, requiring a strong defense (24/7/365 monitoring) and a focused offense (active testing of asset availability, targeted MOS measurements, etc.) that together provide the real-time visibility you need, not only to know when there are quality issues, but to provide the information needed to address those issues as efficiently and effectively as possible. An end-to-end, operational and behavioral correlation analytics architecture is the only way to truly gain the necessary control of your video service, ensuring you can always get the right answers when issues arise and you are asked “why?” Only when you can measure it do you truly own it.

Gino Dion, Vice President of Engineering, IneoQuest
