Delivering High Availability Cloud - Part 2
To deliver resilient, flexible, and scalable infrastructures, broadcasters must accept failure and design their systems to recover quickly from it. Measuring success is critical to understanding points of failure in parallel and serial workflows, and in part 2 of this series, we dig deeper into measuring resilience so that infrastructure design decisions can be quantified.
Designing For Resilience
It’s difficult to quantify every element in a highly dynamic broadcast infrastructure employed in either on-prem or off-prem clouds and virtualization. However, due to the flexible nature of the infrastructure, high levels of resilience can be achieved by dispersing the workload throughout the facility.
It’s worth remembering that IT professionals have been working with highly dynamic infrastructures that provide massive levels of reliability for many years. Television is special because the sampled nature of video and audio creates synchronous data, but as broadcasters move to IP, cloud and virtualized infrastructures, the underlying architecture they adopt is already used every day in equally challenging industries such as finance and healthcare.
Monitoring goes a long way in keeping cloud systems reliable as faults can be detected and rectified quickly, but many of the fixes are built into the architectures and software. For example, a “retry” strategy is often adopted when a microservice doesn’t receive an acknowledgment back from another microservice it is communicating with. The retry resends the message several times, on the assumption that the original was lost through a transient fault.
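The retry strategy can be sketched in a few lines of Python. This is a minimal illustration rather than any particular framework's implementation: `send_with_retry`, `TransientError`, and the doubling delay between attempts are all assumptions introduced here, and the article itself only states that the message is resent several times.

```python
import time
from typing import Callable


class TransientError(Exception):
    """Hypothetical error raised when no acknowledgment arrives in time."""


def send_with_retry(
    send: Callable[[], str],
    max_attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call send() until it succeeds, retrying on TransientError.

    The wait doubles after each failed attempt (an assumed policy) so
    that the retries themselves do not flood the network.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except TransientError:
            if attempt == max_attempts:
                raise  # the fault no longer looks transient
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable only so the behavior can be exercised without real delays; in production code the default `time.sleep` would be used.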
After a certain number of retries it’s assumed that the fault is no longer transient and to stop the network being flooded with messages, the control software will instigate a circuit breaker. Just like an electrical circuit breaker, this method cuts off the server sending the messages into the network, thus stopping any potential network congestion. After a predefined length of time, the circuit breaker is automatically reset so the microservice can start sending messages again.
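A circuit breaker of this kind might look like the following sketch. The class name, the failure threshold, and the reset timeout are illustrative assumptions, not values taken from any specific product; the logic simply blocks calls once enough consecutive failures have occurred and resets itself after a fixed interval.

```python
import time
from typing import Callable, Optional


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is blocked."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means closed

    def call(self, func: Callable[[], str]) -> str:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still open: refuse to put more traffic on the network.
                raise CircuitOpenError("circuit open; message not sent")
            # Timeout elapsed: reset automatically and allow sending again.
            self.opened_at = None
            self.failures = 0
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip the breaker
            raise
        self.failures = 0  # a success clears the failure count
        return result
```

Wrapping each outbound call as `breaker.call(lambda: send(message))`, where `send` is whatever hypothetical function delivers the message, keeps the protection in one place rather than scattered through the service.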
Implementing Resilience
Although the traditional broadcast main-backup model may not be as resilient as broadcasters would like it to be, the cloud and virtualized equivalent can deliver much greater reliability and flexibility. The assignable nature of much of the hardware, such as the servers, means that we can identify and duplicate the infrastructure at many more points throughout the workflow.
A-B IT enterprise architectures have redundancy built into every part of the infrastructure. For example, servers have multiple disk arrays to form RAID storage that is fault tolerant, and dual power supplies are built into every device as a matter of course. Isolating functionality and distributing across multiple areas or sites is much easier as the hardware follows a design which is relatively easy to replicate.
Every element of the platform must be assessed for risk. Every server, switch, router, and cable must be considered. If a cable fails, how is the operation affected and, more importantly, what is the remedy? This sort of thinking isn’t new to broadcast engineers, but the automated nature of finding a remedy is. Rather than thinking in terms of “if it ain’t broke then don’t fix it”, now is the time to think “this will break, and this is my plan to resume operation in the shortest possible time”.
The rip-and-replace model is at the core of any cloud or virtualized datacenter, and this has been boosted with microservice architectures. Rip-and-replace allows us to create services to scale infrastructures and delete them again when not needed. Although this doesn’t go as deep as physical hardware, the concepts help us understand how modern enterprise datacenters operate and how the people who design them think.
Measuring Success
Understanding failure is important, but what matters most to the operation is success, otherwise known as uptime. The uptime, or availability, is what the viewers are most interested in. They don’t really care about how much effort has gone into designing resilience, but they are interested in watching their favorite programs undisturbed, especially high-value live programs such as sport.
Measuring availability helps engineers quantify both the reliability and efficiency of their systems. Designing dual redundancy may increase availability by 0.5% and creating triple redundancy might increase availability by another 0.003%, but at an additional cost of $100K. Is the extra $100K a good investment? That’s for the CEO to decide, but as engineers we can at least present accurate data to them so they can make an informed decision.
Mean Time Between Failures (MTBF) is a measure supplied by vendors to give an idea of how reliable their equipment is. Equally important is the Mean Time to Recovery (MTTR). These are combined to provide the availability, or uptime, as follows:

$$\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$$
MTTR is the time taken to recover after a failure has happened and is dependent on the sort of problem that has occurred. Public cloud providers often cite availability with values such as 99.9999999%. For a given MTBF, availability increases as MTTR decreases, so the shorter the MTTR, the higher the availability. This works at both a system and a function level. But again, the weakest link prevails. So, if a datacenter only has an availability of 99.0% then no matter how good the architecture and application design, the best availability a system that depends on it could ever achieve is 99.0%.
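To make the effect of MTTR concrete, the short sketch below uses the standard definition, availability = MTBF / (MTBF + MTTR), and converts the result into expected downtime per year. The function names and the MTBF and MTTR figures are hypothetical, chosen only for illustration.

```python
HOURS_PER_YEAR = 24 * 365


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a component is up, from MTBF and MTTR."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_hours_per_year(avail: float) -> float:
    """Expected hours of downtime in a year for a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Hypothetical device: MTBF of 10,000 hours, repaired in 4 hours or in 1 hour.
for mttr in (4.0, 1.0):
    a = availability(10_000, mttr)
    print(f"MTTR {mttr} h: availability {a:.6f}, "
          f"downtime {downtime_hours_per_year(a):.2f} h/year")
```

With the same MTBF, cutting the recovery time is what moves the availability figure, which is why automated recovery matters as much as reliable hardware.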
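As a benchmark, availability depends on the number of individual processes in a system. If a single workflow is considered, with its processes connected in series, then the total availability is the product of the individual availabilities:

$$A_{\text{total}} = \prod_{i=1}^{n} A_i$$

Where n is the number of processes in the workflow and A_i is the availability of the i-th process. For example, for three processes (n = 3) with availability measures of 0.999, 0.998, and 0.997 respectively and connected in series, the total availability will be:

$$A_{\text{total}} = 0.999 \times 0.998 \times 0.997 \approx 0.994011$$

Every process added in series can only lower the total, which is why the result is worse than the weakest individual process.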
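The series calculation is easy to check numerically. The helper below is a minimal sketch (the name `series_availability` is an assumption) that multiplies the individual availabilities together, using the three figures from the example.

```python
from math import prod
from typing import Iterable


def series_availability(availabilities: Iterable[float]) -> float:
    """Total availability of processes connected in series."""
    return prod(availabilities)


total = series_availability([0.999, 0.998, 0.997])
print(f"{total:.6f}")  # 0.994011
```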
However, resilient systems that operate with parallel redundancy improve overall availability as the individual process failure rates, that is (1 - availability), are multiplied as per the following equation:

$$A_{\text{total}} = 1 - \prod_{i=1}^{n} (1 - A_i)$$

Where n is the number of parallel (or resilient) processes in the workflow and A_i is the availability of the i-th process. For example, for two processes (n = 2), each with an availability of 0.994 and connected in parallel, the total availability will be:

$$A_{\text{total}} = 1 - (1 - 0.994)^2 = 1 - 0.000036 = 0.999964$$
Therefore, just adding a single parallel workflow with adequate automated workflow re-routing will increase the availability from 0.994 to 0.999964. However, if another parallel workflow is added then the uptime will only increase from 0.999964 to 0.999999784. At some point, the cost of adding further resource reaches diminishing returns.
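The diminishing returns can be seen by computing the parallel availability for one, two, and three identical workflows. This is a sketch under the same assumption as the text, namely that re-routing between the parallel paths is automatic and itself never fails; the function name is hypothetical.

```python
from math import prod
from typing import Iterable


def parallel_availability(availabilities: Iterable[float]) -> float:
    """Total availability of redundant processes connected in parallel.

    The system is down only when every parallel path is down, so the
    individual failure probabilities (1 - availability) are multiplied.
    """
    return 1.0 - prod(1.0 - a for a in availabilities)


for n in (1, 2, 3):
    total = parallel_availability([0.994] * n)
    print(f"{n} workflow(s): {total:.9f}")
# 1 workflow(s): 0.994000000
# 2 workflow(s): 0.999964000
# 3 workflow(s): 0.999999784
```

The first duplicate removes most of the remaining downtime; the second removes only a small fraction more, which is the figure to weigh against its cost.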
Conclusion
Cloud and virtualized computing are providing outstanding opportunities for broadcasters to not only improve the flexibility and scalability of their workflows, but to also increase their resilience. By looking at how other industries operate and understanding how they assess risk, broadcasters can both improve the resilience of their workflows and at the same time quantify the risk.
Building workflows that not only accept failure but also embrace it is key to designing and implementing reliability and uptime into broadcast workflows. The outdated thinking of “if it ain’t broke then don’t fix it” has been superseded by “this will break, and this is my plan to resume operation in the shortest possible time”.