The BBC was one broadcaster whose web site was hit by the recent major internet outage.
Broadcasters such as the BBC and CNN were among the major casualties of the recent global internet outage involving Californian CDN provider Fastly, an incident that will prompt re-evaluation of measures to reduce the impact of such disruptions in future.
The incident underlined the heavy dependence of service providers on just a handful of major CDNs and cloud networks in the modern streaming world, but also illuminated how some web sites suffered much less as a result of steps already taken to avoid single points of failure in their delivery ecosystems. There are also lessons for Fastly and other cloud service providers themselves, which can do more to insulate themselves against the impact of software bugs.
Fastly is one of the largest CDNs, vying with Akamai, Cloudflare, Amazon’s CloudFront, Microsoft’s Azure Cloud and Google Cloud, among others. Their function is to enable large scale transport of content across the world for local delivery from edge servers, optimizing performance and minimizing long-haul bandwidth consumption, while insulating customers such as broadcasters against service disruptions and security risks like Distributed Denial of Service (DDoS) attacks. This last point raises an irony: the scale required for such operations has whittled the field down to just a handful of providers, which, as has just been seen, exposes many web sites to points of failure on those networks. So while the networks are designed to insulate customers against failure, they can themselves become causes of it.
In this case Fastly responded quickly to the outage, which lasted just under an hour on Tuesday, June 8th, with 95% of the network back up and running after 49 minutes, according to the company. But this was enough to have a significant business impact in some cases, with Amazon losing an estimated $30 million in online sales, despite having its own CDN that was unaffected by the outage.
Indeed, Amazon’s case highlights the tradeoffs major internet companies themselves have to make. Amazon has fingers in several pies: as well as being a network infrastructure company, it is also a video service provider and, of course, an ecommerce giant. The ecommerce business balances its traffic load across several CDNs to optimize the experience for consumers and mitigate risk for the business, one being its own CloudFront while others include Fastly, Akamai and Edgecast.
Traffic originating from the same locations or users is frequently switched from one to another of these according to varying load and availability. These days the CDN can switch several times even while loading a single web page, given the dynamic nature of the content with often a number of media elements being combined from multiple sources.
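This kind of per-request switching can be reduced to a simple selection loop. The Python sketch below illustrates the idea with invented hostnames, weights and health flags; a real traffic manager would drive these values from live latency and availability telemetry rather than a static table.

```python
import random

# Hypothetical CDN hosts, weights and health flags for illustration only.
CDNS = {
    "cloudfront.example.net": {"weight": 50, "healthy": True},
    "fastly.example.net":     {"weight": 30, "healthy": True},
    "akamai.example.net":     {"weight": 20, "healthy": True},
}

def pick_cdn(cdns):
    """Weighted random choice among the currently healthy CDNs."""
    healthy = {h: c for h, c in cdns.items() if c["healthy"]}
    if not healthy:
        raise RuntimeError("no healthy CDN available")
    hosts = list(healthy)
    weights = [healthy[h]["weight"] for h in hosts]
    return random.choices(hosts, weights=weights, k=1)[0]

def asset_url(path, cdns=CDNS):
    """Each asset on a page may resolve to a different CDN host."""
    return f"https://{pick_cdn(cdns)}{path}"

# During an outage the affected CDN is simply marked unhealthy, and
# subsequent requests flow to the remaining providers.
CDNS["fastly.example.net"]["healthy"] = False
```

After the health flag is flipped, every call to `asset_url` resolves to one of the surviving CDNs, which is the essence of the mitigation Amazon applied.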
As the outage evolved, some Amazon users were initially connected via Fastly’s CDN and so experienced a short service disruption. However, Amazon quickly routed all traffic to other CDNs including its own, so that the disruption was confined to at most 10 minutes; otherwise far more than $30 million would have been lost.
Some service providers lacking the same dynamic load balancing across multiple CDNs were still able to minimize impact by rerouting directly to the content origin, even if this might incur greater short-term cost. This was the case for the New York Times, which reduced downtime for most users and customers by temporarily redirecting them to the Google Cloud Platform (GCP) where the origin servers holding its content were hosted.
This led to its service availability increasing significantly 40 minutes into the outage, 10 minutes before the fix implemented by Fastly came into effect. So, while the NY Times fared less well than Amazon’s ecommerce site, it did better than those content providers that had done nothing to mitigate impact of such an outage.
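The origin-fallback tactic can be expressed as a simple ordered-fallback pattern: try the CDN first, and only serve from the more expensive origin when the CDN fails. The Python sketch below is purely illustrative, with the two delivery paths simulated as functions rather than real endpoints; a production system would reroute at the DNS or load-balancer layer rather than per request.

```python
def fetch_with_fallback(path, fetchers):
    """Try each delivery path in order (CDN first, origin last)
    until one succeeds."""
    last_err = None
    for fetch in fetchers:
        try:
            return fetch(path)
        except ConnectionError as err:
            last_err = err  # this path is down; try the next one
    raise RuntimeError("all delivery paths failed") from last_err

# Simulated delivery paths: the CDN raises as if mid-outage.
def via_cdn(path):
    raise ConnectionError("CDN returning errors")

def via_origin(path):
    # Serving from origin skips edge caching, so it costs more
    # bandwidth, but keeps the site reachable during a CDN outage.
    return f"content of {path} served from origin"
```

Calling `fetch_with_fallback("/index.html", [via_cdn, via_origin])` transparently serves the page from origin while the CDN is down, trading short-term delivery cost for availability.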
The lessons for broadcasters are little different from those for other service providers. Their own software and hardware engineers can do little about an outage once it occurs, but can do plenty in advance to mitigate the impact. They should revisit their contingency planning and assess where their weaknesses and dependencies lie, especially for streaming provision. It was their online sites and VoD portals that were affected by the Fastly outage, and these account for a large and growing proportion of eyeball time, as well as revenues.
While the outage was widely interpreted as a warning against too much consolidation, there are in fact enough major CDNs for load balancing to cushion the impact of such outages. The incident will, though, cause major sites such as Amazon to consider how they can accelerate their mitigation procedures to bring that 10-minute service hiatus down to just one minute or even a few seconds.
It may be uneconomic for smaller video service providers to adopt a multiple CDN policy directly and they may be interested in a service such as the Umbrella CDN Selection package from French specialist in that field Broadpeak. This package was made available on the AWS Marketplace in April 2021 and allows providers to select the best CDN for streaming video content according to various criteria, such as device, geolocation, type of customer, or network operator. Any CDN supporting HTTP redirect can participate in Broadpeak’s umbrella service, which could cushion the shock of any individual CDN outage.
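To illustrate how redirect-based CDN selection works in principle, here is a minimal Python sketch. The rules, hostnames and request attributes are invented for the example and do not reflect Broadpeak’s actual logic; the point is only that the selector answers each request with an ordinary HTTP 302, which any participating CDN can honor.

```python
# Hypothetical selection rules: each pairs a predicate over request
# attributes with the CDN base URL to use when it matches.
RULES = [
    (lambda req: req["country"] == "FR", "https://cdn-eu.example.net"),
    (lambda req: req["device"] == "tv", "https://cdn-abr.example.net"),
]
DEFAULT_CDN = "https://cdn-global.example.net"

def select_cdn(request):
    """Pick a CDN by criteria such as geolocation or device type."""
    for predicate, cdn in RULES:
        if predicate(request):
            return cdn
    return DEFAULT_CDN

def redirect_response(request, path):
    """Answer the player's request with an HTTP 302 pointing at the
    chosen CDN; any CDN supporting plain HTTP redirect can take part."""
    return {"status": 302, "location": select_cdn(request) + path}
```

A request such as `redirect_response({"country": "FR", "device": "phone"}, "/live/ch1.m3u8")` yields a 302 toward the European CDN; if that CDN suffered an outage, removing or reordering its rules would steer all subsequent sessions elsewhere.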
Nick Rockwell, Senior Vice President of Engineering and Infrastructure at CDN provider Fastly, was praised for leading a rapid response to the recent global internet outage.
There are also lessons for CDN providers themselves and especially Fastly, as its head of engineering and infrastructure Nick Rockwell conceded. “Even though there were specific conditions that triggered this outage, we should have anticipated it,” said Rockwell. “We provide mission-critical services, and we treat any action that can cause service issues with the utmost sensitivity and priority. We apologize to our customers and those who rely on them for the outage and sincerely thank the community for its support.”
The outage was caused by a bug in Fastly’s code introduced in mid-May as part of an upgrade to help customers reconfigure their service more easily. This lay dormant for almost a month until an unnamed customer updated its settings, triggering the bug, which then took down most of the company’s network.
“On May 12, we began a software deployment that introduced a bug that could be triggered by a specific customer configuration under specific circumstances,” Rockwell said. “Early June 8, a customer pushed a valid configuration change that included the specific circumstances that triggered the bug, which caused 85% of our network to return errors.”
The disruption was detected within one minute, then identified and isolated, but it took 49 minutes for normal operation to be restored. An obvious improvement would be to ensure a much faster response and return to normality given that the initial incident was detected quickly, and that might require automated troubleshooting, or probably in practice greater in-built software redundancy.
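One way to close the gap between one-minute detection and 49-minute recovery is to tie detection directly to an automatic rollback of the offending configuration. The Python sketch below is a hypothetical illustration of that idea, with an arbitrary error threshold and window size; it is not how Fastly’s systems actually work.

```python
from collections import deque

ERROR_THRESHOLD = 0.5   # fraction of failing requests that trips rollback
WINDOW = 100            # sliding window of recent requests (illustrative)

class ConfigGuard:
    """Roll back to the last known-good configuration as soon as the
    error rate spikes, rather than waiting on manual diagnosis."""

    def __init__(self, good_config):
        self.good_config = good_config
        self.active = good_config
        self.results = deque(maxlen=WINDOW)

    def deploy(self, new_config):
        """Activate a new configuration and start watching it."""
        self.active = new_config
        self.results.clear()

    def record(self, ok: bool):
        """Record one request outcome; trip the rollback if the window
        fills with too many errors."""
        self.results.append(ok)
        if len(self.results) == WINDOW:
            error_rate = 1 - sum(self.results) / WINDOW
            if error_rate > ERROR_THRESHOLD:
                self.active = self.good_config  # automatic rollback
```

Under this scheme a bad deployment is reverted after at most one window of failing requests, which at network scale is seconds rather than the better part of an hour.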
While hardware redundancy in major systems has been in place for around four decades, with procedures such as hot standby relying on features such as redundant storage and CPUs, achieving the same for software has proved much more elusive. This reflects the greater complexities involved in, say, rolling back to previous versions of software in the event of failure, maintaining duplicate logic paths, or even running dual code stacks developed by different teams. All of these are possible and have been done, as in aircraft control systems, but enabling rapid failover in all circumstances has so far proved impossible.
Some enterprises and academic institutions are still conducting formal studies of software redundancy, aiming to develop a framework for avoiding single points of failure in critical systems including CDNs. One aim is to quantify the degree of software redundancy in a system and then enable points of failure to be identified and eliminated. There will have to be a multi-layered approach, feeding down from what might be called shallow redundancies in high level code down to deeper distinctions at the algorithmic level. The latter is becoming more important with growing reliance on deep learning for critical processes in broadcasting among other sectors.
That is another story, but one takeaway here is that Fastly, far from suffering business damage itself from the outage, seems almost to have gained credibility from it. The company’s stock price actually rose 12% on the day of the outage, with several possible factors.
One was that the incident highlighted just how big a role the company’s CDN infrastructure played for such a large number of major enterprises, others including Reddit, PayPal, Spotify, Al Jazeera Media Network, and AT&T’s HBO.
Another explanation was that the company responded quickly to the outage, gaining favor in the financial community. Most likely it reflected the company holding its hand up, owning up quickly to the cause of the problem and to the need for remedial measures. Another message, then, is that potential business damage can be turned around by a rapid and open response, rather than by the obfuscation and evasion so many companies have resorted to in the past when experiencing a major security breach or technical failure.