Delivering High Availability Cloud - Part 1

Broadcast television is a 24/7 mission critical operation and resilience has always been at the heart of every infrastructure design. However, as broadcasters continue to embrace IP and cloud production systems, we need to take a different look at how we assess risk.

This article was first published as part of Essential Guide: Delivering High Availability Cloud.

Accepting Failure

After decades of building highly resilient systems it might seem strange that broadcasters should even consider accepting failure, but this is exactly how IT centric companies think when employing enterprise grade datacenters. Instead of attempting to avoid failure, which is invariably difficult if not impossible to achieve, they accept it and build cloud native services to respond to failure and return to a fully functioning state as quickly as possible.
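The trade-off between avoiding failure and recovering quickly can be made concrete with the standard steady-state availability formula, MTBF / (MTBF + MTTR). The sketch below is illustrative only: the figures for mean time between failures and mean time to repair are hypothetical and are not taken from any real facility.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time a service is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical figures, chosen only to illustrate the point.
# A hardened system that rarely fails but needs a four-hour manual repair.
hardened = availability(mtbf_hours=10_000, mttr_hours=4)
# A system that fails ten times as often but recovers automatically in three minutes.
self_healing = availability(mtbf_hours=1_000, mttr_hours=0.05)

print(f"hardened:     {hardened:.6f}")
print(f"self-healing: {self_healing:.6f}")
```

Under these assumed numbers the system that fails more often still spends more time on air, because its recovery time is so much shorter.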

Traditional broadcast workflows can be thought of as monolithic systems: each stage relies on the previous stage in the signal flow, and if any stage fails then the whole workflow fails. In effect, the workflow is only as good as its weakest link. Developing main-backup workflows does alleviate this, but only in a limited fashion. With the traditional broadcast main-backup model we simply have one monolithic workflow backing up another.
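A rough way to reason about this is to treat each stage as having its own availability. The sketch below assumes failures are independent, which is optimistic for real facilities, and the five-stage chain and its 99.9% figure are hypothetical.

```python
from math import prod

def series(availabilities):
    """Every stage must work, so the availabilities multiply."""
    return prod(availabilities)

def parallel(a, b):
    """Either path may carry the signal; the system fails only if both fail."""
    return 1 - (1 - a) * (1 - b)

# Hypothetical chain of five stages, each available 99.9% of the time.
stages = [0.999] * 5
chain = series(stages)                # lower than any single stage
main_backup = parallel(chain, chain)  # one monolithic chain backing up another

print(f"single chain: {chain:.6f}")
print(f"main-backup:  {main_backup:.6f}")
```

The chain as a whole is less available than any one of its stages, and the main-backup figure holds only if the two chains truly fail independently.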

The reason for the limited backup capability comes from the need to provide application specific hardware in the workflow. For example, a standards converter will only ever convert between standards, and a transcoder can only transcode. It would be very difficult to convert a transcoder into a high-end standards converter, and vice versa. However, with software defined systems, such as those employed in the cloud, the hardware can adopt significantly more functions, and this in turn provides many more opportunities to include resilience.

Figure 1 – achieving resilience is far from linear. No matter how much money is spent on a workflow, the point of diminishing returns will be reached. Cloud and virtualized infrastructures help us quantify this so that the best balance between resilience and cost can be achieved.
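The shape of that curve can be sketched with a simple model. This is an assumption-laden illustration, not the model behind Figure 1: it assumes independent replica failures plus one shared dependency (such as common power or cooling), and all the figures are hypothetical.

```python
def replicated(a_path: float, n: int, a_shared: float = 1.0) -> float:
    """Availability of n independent replicas behind one shared dependency."""
    return a_shared * (1 - (1 - a_path) ** n)

# Hypothetical: each replica is 99% available, the shared dependency 99.99%.
A_PATH, A_SHARED = 0.99, 0.9999

# Improvement gained by adding one more replica, for n = 1..4.
gains = [
    replicated(A_PATH, n + 1, A_SHARED) - replicated(A_PATH, n, A_SHARED)
    for n in range(1, 5)
]
for n, gain in enumerate(gains, start=1):
    print(f"replica {n} -> {n + 1}: +{gain:.10f}")
```

Each additional replica buys less than the one before it, and no number of replicas can lift the total above the availability of the shared dependency.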

Single Points Of Failure

It’s possible to argue that the traditional broadcast main-backup model does have resilience built in, as any failure within the main signal flow is covered by the parallel backup workflow. But how often is the backup workflow tested? In an industry where the saying “if it ain’t broke, then don’t fix it” prevails, regular testing of the backup is often feared to the point where it isn’t checked as often as it should be.

The argument for the main-backup model is further strained when we look at where the equipment is located and how it is powered and cooled. Although the broadcast workflow may be separated, its physical location, power and cooling sources generally are not. Diverse power supplies with UPSs and generators are often employed, but to what extent? How many broadcast facilities have a completely diversified power supply system that guarantees power 24/7?

Disaster recovery models do exist where the entire infrastructure is mirrored in a location some miles away. But again, how many broadcasters can afford this level of resilience? And those that can are always looking to improve resilience and make cost savings.

Compare and contrast this with the mindset of the equivalent IT enterprise datacenter business using cloud and virtualized services. Public cloud systems are built from the ground up with resilience in mind. Datacenters are physically separated to provide A-B systems that are not only resilient, but have entirely separate power sources, air conditioning systems and security.

On-prem cloud and virtualized systems may well share the same shortcomings of the broadcast main-backup model but they excel as the equipment they employ is much more versatile. Servers can take on a whole multitude of tasks from high-end standards converters to transcoders and proc-amps. The very nature of software applications running on servers makes them flexible, and completely assignable. Furthermore, cloud systems employing microservice architectures follow a similar core infrastructure. This allows them to be scaled and mirrored to other public cloud or off-prem private cloud systems thus creating very high levels of resilience.

Using cloud infrastructures encourages the broadcaster to not only adopt IT models but also adopt their mindset. That is, assume there will be failures and build a system that can resolve them quickly.
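As a minimal sketch of that mindset, the function below assumes several interchangeable instances of a service and routes work to the first one that passes a health check. The `Instance` type, the instance names and the health probes are all hypothetical; a real platform would rely on its orchestrator's own health checks.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Instance:
    name: str
    healthy: Callable[[], bool]  # hypothetical health probe

def pick_instance(instances: Sequence[Instance]) -> Instance:
    """Return the first instance whose health probe passes."""
    for inst in instances:
        try:
            if inst.healthy():
                return inst
        except Exception:
            continue  # a probe that raises is treated as unhealthy
    raise RuntimeError("no healthy instance available")

# Hypothetical pool: the on-prem transcoder has failed, the cloud one is fine.
pool = [
    Instance("transcoder-onprem", lambda: False),
    Instance("transcoder-cloud-a", lambda: True),
]
chosen = pick_instance(pool)
print(chosen.name)
```

The failure of the on-prem instance is assumed rather than prevented; the workflow simply moves on to an instance that is working.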

Sources Of Interest

Working with cloud infrastructures with the IT mindset requires an understanding of where failure can occur. IP packets form the basis of an asynchronous system, and all the equipment processing, transferring, and storing the media data is also working in an asynchronous manner. SDI and AES infrastructures, by their very nature, are synchronous point-to-point systems, so we think about them differently from IP systems.

Network latency is something broadcasters didn’t really consider prior to adoption of IP but it is now a significant factor for cloud and IP infrastructures.

Figure 2 – Microservice architectures provide great scope for automated re-routing of workflows to significantly improve resilience.

Broadcasters need to know if they are suffering from latency issues and where they are occurring, especially for contribution services. It’s all well and good switching to a back-up circuit, but what if the latency is caused by a long TCP flow that is hogging the network because a server has gone faulty and isn’t obeying the rules regarding fair use policies?

There tend to be two sorts of faults: long-term and short-lived, or transient. Transient faults are the most difficult to isolate and fix, and they may only occur once a week with minimal impact on the transmission. Network logging is key to solving these types of transient faults so that forensic analysis can be carried out later. Long-term faults are generally easier to identify as they exist for longer periods of time and can be chased through the system. However, the asynchronous and software driven nature of IP means that the source of the problem may not be immediately obvious due to the interaction of software buffers and bursting data transmissions.
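One possible way to capture transient faults for later analysis is to log any latency sample that stands well above a rolling baseline. The sketch below is an assumption, not a description of any particular monitoring product: the `LatencyMonitor` class, the window size and the three-times-median threshold are all hypothetical choices.

```python
import logging
from collections import deque
from statistics import median

class LatencyMonitor:
    """Flags latency samples that far exceed the recent median."""

    def __init__(self, window: int = 100, factor: float = 3.0):
        self.samples = deque(maxlen=window)  # rolling window of recent samples
        self.factor = factor                 # hypothetical spike threshold
        self.log = logging.getLogger("latency")

    def record(self, timestamp: float, latency_ms: float) -> bool:
        """Store a sample; return True if it looks like a transient spike."""
        # Wait for at least ten samples before trusting the baseline.
        baseline = median(self.samples) if len(self.samples) >= 10 else None
        self.samples.append(latency_ms)
        if baseline is not None and latency_ms > self.factor * baseline:
            # Timestamped log entry kept for forensic analysis later.
            self.log.warning(
                "latency spike at t=%.3f: %.1f ms (baseline %.1f ms)",
                timestamp, latency_ms, baseline,
            )
            return True
        return False

monitor = LatencyMonitor()
flags = [monitor.record(t, 5.0) for t in range(20)]  # steady 5 ms traffic
flags.append(monitor.record(20, 40.0))               # a single 40 ms spike
```

Only the final sample is flagged, and the log entry records when it happened and how far it was from the baseline, which is the information needed to correlate it with other events afterwards.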

Orchestration services handle the creation of resource to meet peak demand, but the amount of resource they can create is limited by the availability of on-prem hardware, and costs for off-prem cloud systems. Again, monitoring is key to keeping these services running efficiently and reliably.
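The two limits described here can be sketched as a simple planning function. This is a hypothetical illustration rather than how any real orchestrator works: the function name, its parameters and the figures used are all assumptions.

```python
import math

def plan_instances(demand, per_instance, onprem_free,
                   cloud_cost_per_hour, hourly_budget):
    """Split the required instances between on-prem and off-prem cloud.

    On-prem is capped by free hardware; cloud is capped by the hourly budget.
    """
    needed = math.ceil(demand / per_instance)
    onprem = min(needed, onprem_free)
    affordable_cloud = int(hourly_budget // cloud_cost_per_hour)
    cloud = min(needed - onprem, affordable_cloud)
    shortfall = needed - onprem - cloud  # demand that cannot be met
    return {"onprem": onprem, "cloud": cloud, "shortfall": shortfall}

# Hypothetical peak: 23 streams, 4 streams per instance, 3 free on-prem
# servers, cloud instances at 2.0 per hour against a budget of 5.0 per hour.
plan = plan_instances(demand=23, per_instance=4, onprem_free=3,
                      cloud_cost_per_hour=2.0, hourly_budget=5.0)
print(plan)
```

A non-zero shortfall is exactly the condition that monitoring needs to surface, since it means peak demand has outgrown both the available hardware and the cloud budget.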

Well-designed cloud platforms can detect and mitigate many of the challenges raised, and more.

Other related articles posted on The Broadcast Bridge.

Delivering High Availability Cloud - Part 2
