To deliver resilient, flexible, and scalable infrastructures broadcasters must accept failure and design their systems to recover quickly from it. Measuring success is critical to understanding points of failure in parallel and serial workflows, and in part 2 of this series, we dig deeper into measuring resilience to quantify design infrastructure decision making.
Other articles from this series.
Designing For Resilience
It’s difficult to quantify every element in a highly dynamic broadcast infrastructure employed in either on-prem or off-prem clouds and virtualization. However, due to the flexible nature of the infrastructure, high levels of resilience can be achieved by dispersing the workload throughout the facility.
It’s worth remembering that IT professionals have been working with highly dynamic infrastructures that provide massive levels of reliability for many years. Although television is special due to the sampled nature of video and audio that creates synchronous data, as broadcasters move to IP, cloud and virtualized infrastructures, the underlying architecture employed is being used every day in equally challenging industries such as finance and medical.
Monitoring goes a long way in keeping cloud systems reliable as faults can be detected and rectified quickly, but many of the fixes are built into the architectures and software. For example, a “retry” strategy is often adopted when a microservice doesn’t receive an acknowledge back from another microservice it is communicating with. The retry will resend the message several times in the hope that the message has been lost through a transient fault.
After a certain number of retries it’s assumed that the fault is no longer transient and to stop the network being flooded with messages, the control software will instigate a circuit breaker. Just like an electrical circuit breaker, this method cuts off the server sending the messages into the network, thus stopping any potential network congestion. After a predefined length of time, the circuit breaker is automatically reset so the microservice can start sending messages again.
Although the traditional broadcast main-backup model may not be as resilient as broadcasters would like it to be, the cloud and virtualized equivalent can deliver much greater reliability and flexibility. The assignable nature of much of the hardware, such as the servers, means that we can identify and duplicate the infrastructure at many more points throughout the workflow.
A-B IT enterprise architectures have redundancy built into every part of the infrastructure. For example, servers have multiple disk arrays to form RAID storage that is fault tolerant, and dual power supplies are built into every device as a matter of course. Isolating functionality and distributing across multiple areas or sites is much easier as the hardware follows a design which is relatively easy to replicate.
Every element of the platform must be assessed for risk. Every, server, switch, router, and cable must be considered. If a cable fails then how is the operation affected and more importantly, what is the remedy? Although this sort of thinking isn’t new to broadcast engineers, but the automated nature of finding a remedy is. Rather than thinking in terms of “if it ain’t broke then don’t fix it”, now is the time to think “this will break, and this is my plan to resume operation in the shortest possible time”.
The rip-and-replace model is at the core of any cloud or virtualized datacenter, and this has been boosted with microservice architectures. Rip-and-replace allows us to create services to scale infrastructures and delete them again when not needed. Although this doesn’t go as deep as physical hardware, the concepts help understand how modern enterprise datacenters operate and how the people who design them think.
Understanding failure is important, but what is most important to the operation is success, otherwise known as uptime. The uptime, or availability is what the viewers are most interested in. They don’t really care about how much effort has gone into designing resilience, but they are interested in watching their favorite programs undisturbed, especially for high-value live programs such as sport.
Measuring availability helps engineers quantify both the reliability and efficiency of their systems. Designing dual redundancy may increase availability by 0.5% and creating triple redundancy might increase availability by another 0.003%, but at an additional cost of $100K. Is the extra $100K a good investment? That’s for the CEO to decide, but as engineers we can at least present accurate data to them so they can make an informed decision.
Mean Time Between Failure (MTBF) is a measure supplied by vendors to give an idea of how reliable their equipment is. And equally important is the Mean Time to Recovery (MTTR). These are combined to provide the availability, or uptime as follows:
MTTR is the time taken to recover after a failure has happened and is dependent on the sort of problem that has occurred. Public cloud providers often cite availability with values such as 99.9999999%. As MTTR is inversely proportional to the availability, it can be seen that the shorter the MTTR, the higher the availability. This both works at a system and function level. But again, the weakest link prevails. So, if a datacenter only has an availability of 99.0% then no matter how good the architecture and application design, the best availability the system could ever achieve is 99.0%.
As a benchmark, availability is proportional to the number of individual processes in a system. If a single workflow is considered, then the total availability follows the following equation:
Where i is the number of processes in the workflow. For example, for three processes (where i = 2) with availability measures of 0.999, 0.998, and 0.997 respectively and connected in series, then the total availability will be:
However, resilient systems that operate with parallel redundancy improve overall availability as the individual process failure rates, that is (1-availability), are multiplied as per the following equation:
Where i is the number of parallel (or resilient) processes in the workflow. For example, for two processes (where i = 1) with availability measures of 0.994, and 0.994 respectively and connected in parallel, then the total availability will be:
Therefore, just adding a single parallel workflow with adequate automated workflow re-routing will increase the availability from 0.994 to 0.999964. However, if another parallel workflow is added then the uptime will only increase from 0.999964 to 0.999999784. There has to be a point where the cost of increasing the amount of resource hits the point of diminishing returns.
Cloud and virtualized computing are providing outstanding opportunities for broadcasters to not only improve the flexibility and scalability of their workflows, but to also increase their resilience. By looking at how other industries operate and understanding how they assess risk, broadcasters can both improve the resilience of their workflows and at the same time quantify the risk.
Building workflows that not only accept failure but also embrace it is key to designing and implementing reliability and uptime into broadcast workflows. The outdated thinking of “if it ain’t broke then don’t fix it” has been superseded by “this will break, and this is my plan to resume operation in the shortest possible time”.
You might also like...
The technology used to create deepfake videos is evolving very rapidly. Is the technology used to detect them keeping pace and are there other approaches broadcast newsrooms can use to protect themselves?
Scalable Dynamic Software For Broadcasters is a free 88 page eBook containing a collection of 12 articles which give a detailed explanation of the principles, terminology and technology required to leverage microservices based, software only broadcast production infrastructure.
Traditional monolithic software applications were often difficult to maintain and upgrade. In part, this was due to the massive interdependencies within the code that required the entire application to be upgraded and restarted resulting in down-time that regularly created many…
A self-described “technologist” at heart, Louis Hernandez Jr. knows an emerging trend when he sees one and likes to ride the wave as long as possible. Trained by his father, a computer science teacher, with his formal undergraduate and MBA in …
The criticality of service assurance in OTT services is evolving quickly as audiences grow and large broadcasters double-down on their streaming strategies.