Delivering High Availability Cloud - Part 2

To deliver resilient, flexible, and scalable infrastructures, broadcasters must accept failure and design their systems to recover quickly from it. Measuring success is critical to understanding points of failure in parallel and serial workflows, and in part 2 of this series we dig deeper into measuring resilience to quantify infrastructure design decisions.

This article was first published as part of Essential Guide: Delivering High Availability Cloud.

Designing For Resilience

It’s difficult to quantify every element in a highly dynamic broadcast infrastructure employed in either on-prem or off-prem clouds and virtualization. However, due to the flexible nature of the infrastructure, high levels of resilience can be achieved by dispersing the workload throughout the facility.

It’s worth remembering that IT professionals have been working with highly dynamic infrastructures that provide massive levels of reliability for many years. Although television is special due to the sampled nature of video and audio that creates synchronous data, as broadcasters move to IP, cloud and virtualized infrastructures, the underlying architecture employed is being used every day in equally challenging industries such as finance and medical.

Monitoring goes a long way in keeping cloud systems reliable as faults can be detected and rectified quickly, but many of the fixes are built into the architectures and software. For example, a “retry” strategy is often adopted when a microservice doesn’t receive an acknowledgment back from another microservice it is communicating with. The retry will resend the message several times, on the assumption that the original was lost through a transient fault.

After a certain number of retries it’s assumed that the fault is no longer transient and to stop the network being flooded with messages, the control software will instigate a circuit breaker. Just like an electrical circuit breaker, this method cuts off the server sending the messages into the network, thus stopping any potential network congestion. After a predefined length of time, the circuit breaker is automatically reset so the microservice can start sending messages again.
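The retry and circuit breaker behavior described above can be sketched in a few lines of Python. This is a minimal illustration only, not any particular framework's implementation: the class name `CircuitBreaker`, the `send_message` callback (assumed to return `True` when an acknowledgment arrives), and the `max_retries` and `reset_after` parameters are all hypothetical names chosen for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and sending is blocked."""


class CircuitBreaker:
    """Retry a send a few times, then stop sending for a fixed period."""

    def __init__(self, max_retries=3, reset_after=30.0, clock=time.monotonic):
        self.max_retries = max_retries  # resends allowed after the first attempt
        self.reset_after = reset_after  # seconds the breaker stays open
        self.clock = clock              # injectable so the timing can be tested
        self.opened_at = None           # None means the breaker is closed

    def send(self, send_message, message):
        """Send with retries; return how many retries were needed."""
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still open: do not put any more messages on the network.
                raise CircuitOpenError("circuit open; message not sent")
            self.opened_at = None  # predefined time has passed: automatic reset

        for attempt in range(1 + self.max_retries):
            if send_message(message):  # True means an acknowledgment came back
                return attempt

        # Every attempt went unacknowledged: treat the fault as non-transient.
        self.opened_at = self.clock()
        raise CircuitOpenError("no acknowledgment; circuit opened")
```

A production version would normally also wait between retries, often with an increasing delay, so that the retries themselves do not add to congestion. That is left out here to keep the sketch short.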

Implementing Resilience

Although the traditional broadcast main-backup model may not be as resilient as broadcasters would like it to be, the cloud and virtualized equivalent can deliver much greater reliability and flexibility. The assignable nature of much of the hardware, such as the servers, means that we can identify and duplicate the infrastructure at many more points throughout the workflow.

A-B IT enterprise architectures have redundancy built into every part of the infrastructure. For example, servers have multiple disk arrays to form RAID storage that is fault tolerant, and dual power supplies are built into every device as a matter of course. Isolating functionality and distributing across multiple areas or sites is much easier as the hardware follows a design which is relatively easy to replicate.

Every element of the platform must be assessed for risk. Every server, switch, router, and cable must be considered. If a cable fails, how is the operation affected and, more importantly, what is the remedy? Although this sort of thinking isn’t new to broadcast engineers, the automated nature of finding a remedy is. Rather than thinking in terms of “if it ain’t broke then don’t fix it”, now is the time to think “this will break, and this is my plan to resume operation in the shortest possible time”.

The rip-and-replace model is at the core of any cloud or virtualized datacenter, and this has been boosted with microservice architectures. Rip-and-replace allows us to create services to scale infrastructures and delete them again when not needed. Although this doesn’t go as deep as physical hardware, the concepts help us understand how modern enterprise datacenters operate and how the people who design them think.

Measuring Success

Understanding failure is important, but what is most important to the operation is success, otherwise known as uptime. The uptime, or availability, is what the viewers are most interested in. They don’t really care about how much effort has gone into designing resilience, but they are interested in watching their favorite programs undisturbed, especially for high-value live programs such as sport.

Measuring availability helps engineers quantify both the reliability and efficiency of their systems. Designing dual redundancy may increase availability by 0.5% and creating triple redundancy might increase availability by another 0.003%, but at an additional cost of $100K. Is the extra $100K a good investment? That’s for the CEO to decide, but as engineers we can at least present accurate data to them so they can make an informed decision.

Mean Time Between Failure (MTBF) is a measure supplied by vendors to give an idea of how reliable their equipment is. Equally important is the Mean Time to Recovery (MTTR). These are combined to provide the availability, or uptime, as follows:

$$\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$$

MTTR is the time taken to recover after a failure has happened and is dependent on the sort of problem that has occurred. Public cloud providers often cite availability with values such as 99.9999999%. For a given MTBF, availability rises as MTTR falls, so the shorter the MTTR, the higher the availability. This works at both a system and a function level. But again, the weakest link prevails. So, if a datacenter only has an availability of 99.0% then no matter how good the architecture and application design, the best availability the system could ever achieve is 99.0%.
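To make the effect of MTTR concrete, the short Python sketch below assumes the usual relationship of availability = MTBF / (MTBF + MTTR) and converts the result into expected downtime per year. The function names and the 10,000-hour MTBF are illustrative assumptions, not vendor figures.

```python
HOURS_PER_YEAR = 24 * 365


def availability(mtbf_hours, mttr_hours):
    """Fraction of time a device or service is expected to be working."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(avail):
    """Expected hours of downtime in a 365-day year."""
    return (1.0 - avail) * HOURS_PER_YEAR


# Same (assumed) MTBF, but recovery is cut from four hours to one hour.
for mttr in (4.0, 1.0):
    a = availability(10_000.0, mttr)
    print(f"MTTR {mttr:>3} h: availability {a:.6f}, "
          f"downtime {annual_downtime_hours(a):.2f} h/year")
```

Running this shows that shortening recovery time raises availability without changing how often the equipment fails, which is why automated recovery matters as much as reliable hardware.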

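As a benchmark, availability falls as the number of individual processes connected in series grows, because every process must be working for the workflow to work. If a single workflow is considered, then the total availability is the product of the individual availabilities:

$$A_{\text{total}} = \prod_{i=1}^{n} A_i$$

Where n is the number of processes in the workflow and A_i is the availability of process i. For example, for three processes (n = 3) with availability measures of 0.999, 0.998, and 0.997 respectively and connected in series, the total availability will be:

$$A_{\text{total}} = 0.999 \times 0.998 \times 0.997 \approx 0.994$$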
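The series calculation is simply the product of the individual availabilities, which can be checked with a few lines of Python. The function name `series_availability` is chosen for this example only.

```python
from math import prod


def series_availability(availabilities):
    """Availability of processes in series: every one must be working."""
    return prod(availabilities)


total = series_availability([0.999, 0.998, 0.997])
print(round(total, 6))  # 0.994011
```

Note that the result is lower than the availability of even the weakest individual process, so every extra stage in a serial chain reduces overall uptime.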

However, resilient systems that operate with parallel redundancy improve overall availability, as it is the individual process failure rates, that is (1 - availability), that are multiplied as per the following equation:

$$A_{\text{total}} = 1 - \prod_{i=1}^{n} (1 - A_i)$$

Where n is the number of parallel (or resilient) processes in the workflow and A_i is the availability of process i. For example, for two processes (n = 2) with availability measures of 0.994 and 0.994 respectively and connected in parallel, the total availability will be:

$$A_{\text{total}} = 1 - (1 - 0.994)(1 - 0.994) = 1 - 0.000036 = 0.999964$$

Therefore, just adding a single parallel workflow with adequate automated workflow re-routing will increase the availability from 0.994 to 0.999964. However, if another parallel workflow is added then the uptime will only increase from 0.999964 to 0.999999784. At some point, the cost of adding more resource delivers diminishing returns.
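The diminishing return from each extra parallel workflow can be reproduced with a short Python sketch. It assumes that the parallel paths fail independently and that re-routing between them is automatic and instantaneous; the function name `parallel_availability` is chosen for this example only.

```python
from math import prod


def parallel_availability(availabilities):
    """Availability of redundant paths: the system fails only if all fail."""
    return 1.0 - prod(1.0 - a for a in availabilities)


previous = 0.0
for paths in (1, 2, 3):
    total = parallel_availability([0.994] * paths)
    print(f"{paths} path(s): {total:.9f} (gain {total - previous:.9f})")
    previous = total
# 1 path(s): 0.994000000
# 2 path(s): 0.999964000
# 3 path(s): 0.999999784
```

The gain from the second path is roughly 0.006, while the third adds only about 0.000036, which is the kind of figure that can be set against the cost of the extra infrastructure.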

Conclusion

Cloud and virtualized computing are providing outstanding opportunities for broadcasters to not only improve the flexibility and scalability of their workflows, but to also increase their resilience. By looking at how other industries operate and understanding how they assess risk, broadcasters can both improve the resilience of their workflows and at the same time quantify the risk.

Building workflows that not only accept failure but also embrace it is key to designing and implementing reliability and uptime into broadcast workflows. The outdated thinking of “if it ain’t broke then don’t fix it” has been superseded by “this will break, and this is my plan to resume operation in the shortest possible time”.
