Hardware Infrastructure Global Viewpoint – June 2021
Reading the newspaper headlines about the Fastly internet service outage this week would make anybody relying on the internet draw a deep and thoughtful breath. But when we dig below the headlines into the detail, there are a lot of positives that broadcasters can take away from this event.
For as long as I can remember, I’ve been designing broadcast A-B failover systems. Like many engineers before me, I’ve spent hours analyzing systems, finding the single points of failure, and engineering them “out” so that when a video or audio chain stopped working, there was always a backup.
Diligently, we would test the failover procedures, even running the standby generator every Sunday morning for 20 minutes to power the playout-critical systems and maintain transmission. This was generally successful, and the checkbox would be ticked, until on one occasion the building power failed for real. Thankfully, the generator kicked in and took the load from the UPS as we continued our uninterrupted broadcast, at least for 30 minutes, after which the lights went dark, the screens went blank, and the generator shuddered to a halt.
During the “after event debrief”, or “engineer kicking” as it’s more affectionately known, we discovered that a family of rodents had built a nest within the confines of the cooling radiator, deep within the generator’s structure. These rodents were probably grateful for their Sunday morning heat boost but neglected to build their nest from non-flammable materials, so it ignited after 30 minutes of generator usage and sent the generator into a safety shutdown mode.
During the initial design phase I cannot remember a single person advising of the danger of rats building a nest within the confines of the generator, or the user manual advising against this.
I’m sure many engineers have similar anecdotes about failover systems that did not operate at the critical moment. My point is that, no matter how clever and experienced we may be, as Nassim Taleb of the Black Swan Theory would tell us, humans cannot predict extreme events with any certainty. We only know they will happen, but not when or how.
Could the Fastly developers have predicted that their code would be responsible for 85% of their systems failing within a few minutes? I am firmly of the opinion that they could not. It’s impossible for anybody to make a system 100% reliable.
The really good news is that within 49 minutes of the original failure, 95% of Fastly’s services were restored and operating as normal. I’m sure this was due to an army of highly knowledgeable and experienced engineers identifying and fixing the outage with clinical precision.
For me, the fact that Fastly had the resources and skill set to fix this problem so quickly is the big takeaway, and it inspires great optimism and faith in public cloud systems. More broadcasters are adopting cloud workflows, and in doing so we are buying into a system that fails so rarely that, on the few occasions it does, it hits the international news headlines.
No single person can think of all the events that may lead to a failure; all we can really do is accept and work within the limits of probabilities. A cloud service provider may be able to operate within the realms of “eleven nines”, but they’re not claiming to be infallible, and nor should they, as no other system is 100% reliable.
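To make the idea of working within probabilities a little more concrete, here is a minimal Python sketch of the arithmetic behind a "nines" figure. It is an illustration under stated assumptions, not anything taken from Fastly or any cloud provider: the function names are hypothetical, the figures are placeholders, and the combined A-B calculation assumes the two chains fail independently, which the generator story suggests is rarely guaranteed in practice.

```python
# Illustrative arithmetic only: function names and figures are assumptions,
# not taken from any provider's published service levels.

SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def availability_from_nines(nines: int) -> float:
    """Convert a count of nines (e.g. 3 for 99.9%) into a fraction."""
    return 1.0 - 10.0 ** (-nines)


def expected_downtime_seconds(availability: float,
                              period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Expected unavailable time over a period for a given availability."""
    return (1.0 - availability) * period_seconds


def parallel_availability(availabilities) -> float:
    """Combined availability of redundant chains (A-B failover).

    Assumes failures are independent, so the system is down only when
    every chain is down at the same time. A shared cause, such as a
    common power feed, breaks this assumption.
    """
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)
    return 1.0 - all_down


if __name__ == "__main__":
    for nines in (2, 3, 5):
        a = availability_from_nines(nines)
        minutes = expected_downtime_seconds(a) / 60
        print(f"{nines} nines: about {minutes:.1f} minutes of downtime per year")

    # Two hypothetical chains at 99% each, assuming independent failures.
    print(f"A-B pair: {parallel_availability([0.99, 0.99]):.4%}")
```

Even under these generous assumptions the result never reaches 100%, which is the point: the numbers describe an expectation to plan around, not a guarantee.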
I fear that for many years broadcast engineers have lulled themselves, and those they work with, into a false sense of security with binary main-backup methodologies. System outages and extreme unpredictable events are a fact of life, and we must work with them instead of attempting to negate them. Maybe we should really be thinking in terms of probability, expectation, and contingency? Cloud systems provide all these.