Broadcasters Should Learn From Fastly Internet Outage

Broadcasters such as the BBC and CNN were among the major casualties of the recent global internet outage involving Californian CDN provider Fastly, an incident that will prompt re-evaluation of measures to reduce the impact of such disruptions in future.

The incident underlined the heavy dependence of service providers on just a handful of major CDNs and cloud networks in the modern streaming world, but also illuminated how some web sites suffered much less as a result of steps already taken to avoid single points of failure in their delivery ecosystems. There are also lessons for Fastly and other cloud service providers themselves, which can do more to insulate themselves against the impact of software bugs.

Fastly is one of the largest CDNs, vying with Akamai, Cloudflare, Amazon’s CloudFront, Microsoft’s Azure Cloud and Google Cloud, among others. The function of a CDN is to enable large-scale transport of content across the world for local delivery from edge servers, optimizing performance and minimizing long-haul bandwidth consumption, while insulating customers such as broadcasters against service disruptions and security risks like Distributed Denial of Service (DDoS) attacks. This last point raises an irony: the scale required for such operations has whittled the number of providers down to just a handful, which, as has just been seen, exposes many web sites to points of failure on those networks. So while the networks are designed to insulate customers against failure, they can themselves become causes of it.

In this case Fastly responded quickly to the outage, which lasted just under an hour on Tuesday June 8th, with 95% of the network back up and running after 49 minutes, according to the company. But even that was enough to have a significant business impact in some cases, with Amazon losing an estimated $30 million in online sales, despite having its own CDN that was unaffected by the outage.

Indeed, Amazon’s case highlights the tradeoffs major internet companies themselves have to make. Amazon has fingers in several pies: as well as being a network infrastructure company, it is also a video service provider and, of course, an ecommerce giant. The ecommerce business balances its traffic load across several CDNs to optimize the experience for consumers and mitigate risk for the business, one being its own CloudFront while others include Fastly, Akamai and Edgecast.

Traffic originating from the same locations or users is frequently switched from one of these CDNs to another according to varying load and availability. These days the CDN can switch several times even while a single web page loads, given the dynamic nature of the content, with a number of media elements often combined from multiple sources.
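A minimal sketch of this kind of multi-CDN balancing is shown below. The provider names echo those mentioned above, but the weights, health flags and selection logic are invented for illustration and do not represent Amazon’s actual traffic management:

```python
import random

# Illustrative multi-CDN load balancer (hypothetical weights and logic):
# each request picks a CDN from those currently marked healthy,
# weighted by the share of traffic it should carry.
CDNS = {
    "cloudfront": {"weight": 40, "healthy": True},
    "fastly":     {"weight": 30, "healthy": True},
    "akamai":     {"weight": 20, "healthy": True},
    "edgecast":   {"weight": 10, "healthy": True},
}

def pick_cdn(cdns=CDNS):
    """Return the name of a healthy CDN, weighted by its traffic share.

    If one provider is marked unhealthy (e.g. during an outage), its
    traffic is redistributed automatically, because it simply drops
    out of the candidate list for subsequent requests.
    """
    candidates = [(name, c["weight"]) for name, c in cdns.items() if c["healthy"]]
    if not candidates:
        raise RuntimeError("no healthy CDN available; fall back to origin")
    names, weights = zip(*candidates)
    return random.choices(names, weights=weights, k=1)[0]

# Simulate the outage: mark one CDN down and traffic shifts to the rest.
CDNS["fastly"]["healthy"] = False
assert pick_cdn() != "fastly"
```

Because the choice is made per request, a page composed of elements from multiple sources can be served through several CDNs during a single load, which is how the switching described above happens invisibly to the user.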

As the outage evolved, some Amazon users were initially connected via Fastly’s CDN and so experienced a short service disruption. However, Amazon quickly routed all traffic to other CDNs including its own, so that the disruption was confined to at most 10 minutes; otherwise far more than $30 million would have been lost.

Some service providers lacking the same dynamic load balancing across multiple CDNs were still able to minimize impact by rerouting directly to the content origin, even if this might incur greater short-term cost. This was the case for the New York Times, which reduced downtime for most users and customers by temporarily redirecting them to the Google Cloud Platform (GCP) where the origin servers holding its content were hosted.

This led to its service availability increasing significantly 40 minutes into the outage, 10 minutes before the fix implemented by Fastly came into effect. So, while the NY Times fared less well than Amazon’s ecommerce site, it did better than those content providers that had done nothing to mitigate impact of such an outage.

The lessons for broadcasters are little different than for many other service providers. Their own software and hardware engineers are powerless to do much about an outage when it occurs but can do plenty to mitigate the impact in advance. They should revisit their contingency planning and assess where their weaknesses and dependencies are, especially for streaming provision. It is their online sites and VoD portals that were affected by the Fastly outage, but these account for a large and growing proportion of eyeball time, as well as revenues.

While the outage was widely interpreted as a warning against too much consolidation, there is in fact a sufficient number of major CDNs for load balancing to cushion against the impact of such outages. The incident will, though, cause major sites such as Amazon to consider how they can accelerate their mitigation procedures to bring that 10-minute service hiatus down to just one minute or even a few seconds.

It may be uneconomic for smaller video service providers to adopt a multi-CDN policy directly, and they may instead be interested in a service such as the Umbrella CDN Selection package from Broadpeak, a French specialist in the field. This package was made available on the AWS Marketplace in April 2021 and allows providers to select the best CDN for streaming video content according to various criteria, such as device, geolocation, type of customer, or network operator. Any CDN supporting HTTP redirect can participate in Broadpeak’s umbrella service, which could cushion the shock of any individual CDN outage.
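The reason plain HTTP redirect is enough to build such an umbrella is that the selector only has to answer each request with a 302 pointing at the chosen CDN’s hostname; the client then fetches the content directly. The sketch below illustrates the principle only — the rules, hostnames and request attributes are invented, not Broadpeak’s actual implementation:

```python
# Illustrative HTTP-redirect-based CDN selection. Rules are evaluated in
# order; the first matching predicate decides which CDN serves the request.
# All hostnames and criteria here are hypothetical.
CDN_RULES = [
    (lambda req: req.get("country") == "FR",    "https://cdn-eu.example.net"),
    (lambda req: req.get("device") == "mobile", "https://cdn-mobile.example.net"),
    (lambda req: True,                          "https://cdn-default.example.net"),
]

def select_cdn(request, path):
    """Return an HTTP 302 status and Location header for the chosen CDN.

    Any CDN able to serve the same path under its own hostname can take
    part: the client simply follows the Location header, so no deeper
    integration with each CDN is required.
    """
    for predicate, base_url in CDN_RULES:
        if predicate(request):
            return 302, {"Location": base_url + path}

status, headers = select_cdn({"country": "FR"}, "/video/master.m3u8")
assert status == 302 and headers["Location"].startswith("https://cdn-eu")
```

If one CDN goes down, removing or reordering its rule shifts all subsequent requests to the survivors, which is how such a selector could cushion the shock of an individual CDN outage.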

Nick Rockwell, Senior Vice President of Engineering and Infrastructure at CDN provider Fastly, was praised for leading a rapid response to the recent global internet outage.

There are also lessons for CDN providers themselves and especially Fastly, as its head of engineering and infrastructure Nick Rockwell conceded. “Even though there were specific conditions that triggered this outage, we should have anticipated it,” said Rockwell. “We provide mission-critical services, and we treat any action that can cause service issues with the utmost sensitivity and priority. We apologize to our customers and those who rely on them for the outage and sincerely thank the community for its support.”

The outage was caused by a bug in Fastly’s code, introduced in mid-May as part of an upgrade to help customers reconfigure their services more easily. This lay dormant for almost a month until an unnamed customer updated its settings, triggering the bug, which then took down most of the company’s network.

“On May 12, we began a software deployment that introduced a bug that could be triggered by a specific customer configuration under specific circumstances,” Rockwell said. “Early June 8, a customer pushed a valid configuration change that included the specific circumstances that triggered the bug, which caused 85% of our network to return errors.”

The disruption was detected within one minute, then identified and isolated, but it took 49 minutes to restore normal operation. An obvious improvement would be a much faster return to normality, given that the initial incident was detected quickly; that might require automated troubleshooting or, more likely in practice, greater built-in software redundancy.
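Detection within a minute is achievable precisely because an outage of this scale shows up as a sudden jump in error rate. A minimal sketch of such a detector, with a window size and threshold invented purely for illustration, might look like this:

```python
from collections import deque

class ErrorRateMonitor:
    """Illustrative fast-detection sketch: track the error rate over a
    short sliding window of responses and trip a failover flag as soon
    as it crosses a threshold, rather than waiting for human diagnosis.
    Window size and threshold here are hypothetical."""

    def __init__(self, window=100, threshold=0.5):
        self.window = deque(maxlen=window)  # 1 = error, 0 = success
        self.threshold = threshold

    def record(self, is_error):
        self.window.append(1 if is_error else 0)

    def tripped(self):
        """True once the recent error rate exceeds the threshold."""
        if len(self.window) < self.window.maxlen:
            return False  # not enough samples yet to judge
        return sum(self.window) / len(self.window) > self.threshold

monitor = ErrorRateMonitor()
for _ in range(100):
    monitor.record(is_error=True)  # e.g. most of the network returning errors
assert monitor.tripped()           # automated failover can begin immediately
```

The hard part, as the following paragraph explains, is not tripping the alarm but having somewhere safe to fail over to once it trips.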

While hardware redundancy in major systems has been in place for around four decades, with procedures such as hot standby relying on features such as redundant storage and CPU, achieving the same for software has proved much more elusive. This reflects the greater complexities involved in, say, rolling back to previous versions of software in the event of failure, maintaining duplicate logic paths, or even running dual code stacks developed by different teams. All of these are possible and have been done, as in aircraft control systems, but enabling rapid failover in all circumstances has proved impossible so far.

Some enterprises and academic institutions are still conducting formal studies of software redundancy, aiming to develop a framework for avoiding single points of failure in critical systems including CDNs. One aim is to quantify the degree of software redundancy in a system and then enable points of failure to be identified and eliminated. This will have to be a multi-layered approach, feeding down from what might be called shallow redundancies in high-level code to deeper distinctions at the algorithmic level. The latter is becoming more important with growing reliance on deep learning for critical processes in broadcasting, among other sectors.

That is another story, but one takeaway here is that Fastly, far from suffering business damage itself from the outage, seems almost to have gained credibility from it. The company’s stock price actually rose 12% on the day of the outage, for which there are several possible explanations.

One was that the incident highlighted just how big a role the company’s CDN infrastructure played for such a large number of major enterprises, with others including Reddit, PayPal, Spotify, Al Jazeera Media Network, and AT&T’s HBO.

Another explanation was that the company responded quickly to the outage, gaining favor in the financial community. Most likely, perhaps, it reflected the company holding its hand up and owning up quickly to the cause of the problem, as well as to the need for remedial measures. Another message, then, is that potential business damage can be turned around by a rapid and open response, rather than by the obfuscation and evasion so many companies have resorted to in the past when experiencing a major security breach or technical failure.
