Delivering High Availability Cloud - Part 1

Broadcast television is a 24/7 mission-critical operation, and resilience has always been at the heart of every infrastructure design. However, as broadcasters continue to embrace IP and cloud production systems, we need to take a different look at how we assess risk.

This article was first published as part of Essential Guide: Delivering High Availability Cloud.

The standard workflow model is based on the main-backup philosophy. That is, every live workflow is paralleled by a backup workflow that imitates its operation. At key points along the workflows, automated and manual intervention allows the signal to be re-routed should a problem occur, such as equipment failure or power loss.

With such infrastructures, it doesn’t take too much digging to find holes in the system where single points of failure quickly manifest themselves.

Cloud and virtualized infrastructures are not magic: they still suffer equipment failure just like any other electronic and mechanical system. But the good news is that greater workflow resilience is much easier to achieve.

One of the key advantages for broadcasters migrating to IP and the cloud is that they can take advantage of technology gains in unrelated industries. Finance and medicine are two sectors from which broadcasters can learn how to build resilience into their infrastructures.

Accepting Failure

After decades of building highly resilient systems it might seem strange that broadcasters should even consider accepting failure, but this is exactly how IT-centric companies think when operating enterprise-grade datacenters. Instead of attempting to avoid failure, which is invariably difficult if not impossible, they accept it and build cloud-native services that respond to failure and return to a fully functioning state as quickly as possible.
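The "accept failure and recover quickly" principle can be sketched in a few lines. The worker names below (flaky_transcoder, standby_transcoder) are hypothetical stand-ins for cloud processing functions, not part of any real broadcast API; the point is simply that the caller expects failure and re-routes the task rather than assuming the main path always works:

```python
import time

def process_with_failover(task, workers, retries_per_worker=2):
    """Try each worker in turn; move on after repeated failures.

    Rather than assuming a worker never fails, we expect failure and
    recover by re-routing the task to the next available worker.
    """
    for worker in workers:
        for _attempt in range(retries_per_worker):
            try:
                return worker(task)
            except RuntimeError:
                time.sleep(0)  # placeholder for a real back-off delay
    raise RuntimeError("all workers exhausted")

# Illustrative workers: the "main" path fails, the "backup" succeeds.
def flaky_transcoder(task):
    raise RuntimeError("main path down")

def standby_transcoder(task):
    return f"transcoded:{task}"

print(process_with_failover("clip-001", [flaky_transcoder, standby_transcoder]))
# prints "transcoded:clip-001"
```

The key design choice is that failure handling lives in the workflow itself, not in a separate manually-switched backup chain.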

Traditional broadcast workflows can be thought of as monolithic systems: each stage relies on the previous stage in the signal flow, and if any stage fails then the whole workflow fails. In effect, the workflow is only as good as its weakest link. The main-backup approach does alleviate this, but only in a limited fashion: with the traditional broadcast main-backup model we simply have one monolithic workflow backing up another.

The reason for the limited backup capability is the application-specific hardware in the workflow. For example, a standards converter will only ever convert between standards, and a transcoder can only transcode. It would be very difficult to turn a transcoder into a high-end standards converter, and vice versa. However, with software-defined systems, such as those employed in the cloud, the same hardware can adopt many more functions, and this in turn provides many more opportunities to build in resilience.

Figure 1 – Achieving resilience is far from linear. No matter how much money is spent on a workflow, the point of diminishing returns will be reached. Cloud and virtualized infrastructures help us quantify this so that the best balance between resilience and cost can be achieved.

Single Points Of Failure

It’s possible to argue that the traditional broadcast main-backup model does have resilience built in, as any failure within the main signal flow is covered by the backup workflow. But how often is the backup workflow tested? In an industry where the saying “if it ain’t broke, then don’t fix it” prevails, regular testing of the backup is often feared, to the point where it isn’t checked as often as it should be.

The argument for the main-backup model is further strained when we look at where the equipment is located and how it is powered and cooled. Although the broadcast workflow may be separated, its physical location, power and cooling sources are generally not. Diverse power supplies with UPSs and generators are often employed, but to what extent? How many broadcast facilities have a completely diversified power supply system that guarantees power 24/7?

Disaster recovery models do exist where the entire infrastructure is mirrored in a location some miles away. But again, how many broadcasters can afford this level of resilience? And those that can are always looking to improve resilience and make cost savings.

Compare and contrast this to the mindset of the equivalent IT enterprise datacenter business using cloud and virtualized services. Public cloud systems are built from the ground up with resilience in mind. Datacenters are physically separated to provide A-B systems that are not only resilient, but have entirely separate power sources, air conditioning systems and security.

On-prem cloud and virtualized systems may share some of the shortcomings of the broadcast main-backup model, but they excel because the equipment they employ is much more versatile. Servers can take on a whole multitude of tasks, from high-end standards conversion to transcoding and proc-amp adjustment. The very nature of software applications running on servers makes them flexible and completely reassignable. Furthermore, cloud systems employing microservice architectures follow a similar core infrastructure, which allows them to be scaled and mirrored to public cloud or off-prem private cloud systems, creating very high levels of resilience.
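The reassignability described above can be illustrated with a toy sketch. The role names and processing functions here are invented for illustration; the point is that, unlike fixed-function hardware, the same generic server can be switched between roles on demand:

```python
# A minimal sketch of software-defined flexibility: generic servers
# that can be (re)assigned any processing role on demand.
# Role names and their behaviors are purely illustrative.
ROLES = {
    "transcode": lambda media: f"transcoded({media})",
    "convert": lambda media: f"converted({media})",
    "proc_amp": lambda media: f"adjusted({media})",
}

class GenericServer:
    def __init__(self, name):
        self.name = name
        self.role = None

    def assign(self, role):
        # Unlike a hardware transcoder or standards converter,
        # the same server can switch roles at any time.
        self.role = role
        return self

    def run(self, media):
        return ROLES[self.role](media)

pool = [GenericServer(f"srv{i}") for i in range(3)]
print(pool[0].assign("transcode").run("clip"))  # prints "transcoded(clip)"
print(pool[0].assign("convert").run("clip"))    # prints "converted(clip)"
```

Because any node in the pool can cover any role, a failed "transcoder" can be replaced by repurposing whichever server is free, rather than by dedicated standby hardware.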

Using cloud infrastructures encourages the broadcaster to adopt not only IT models but also the IT mindset. That is, assume there will be failures and build a system that can resolve them quickly.

Sources Of Interest

Working with cloud infrastructures with the IT mindset requires an understanding of where failure can occur. IP packets form the basis of an asynchronous system, and all the equipment processing, transferring, and storing the media data also works asynchronously. SDI and AES infrastructures, by their very nature, are synchronous point-to-point systems; consequently, we think about them differently than we do about IP systems.

Network latency is something broadcasters didn’t really need to consider prior to the adoption of IP, but it is now a significant factor for cloud and IP infrastructures.

Fig 2 – Microservice architectures provide great scope for automated re-routing of workflows to significantly improve resilience.

Broadcasters need to know if they are suffering from latency issues and where they are occurring, especially for contribution services. It’s all well and good switching to a back-up circuit, but what if the latency is caused by a long TCP flow hogging the network because a faulty server is no longer obeying fair-use policies?
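Knowing where latency occurs starts with measuring it. The sketch below is a deliberately crude probe that times a TCP handshake to a service endpoint; real deployments would rely on dedicated telemetry such as RTP/RTCP statistics or network taps, but it illustrates the kind of measurement involved:

```python
import socket
import time

def tcp_rtt_ms(host, port, timeout=2.0):
    """Rough latency estimate: time a TCP connection to a service.

    A crude probe for illustration only; it measures the three-way
    handshake, which approximates one network round trip.
    """
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass  # connection established, handshake complete
    return (time.monotonic() - start) * 1000.0
```

Sampling such a probe regularly, per circuit, is what lets an operator distinguish a genuinely slow path from a transient queueing problem before switching to a back-up circuit.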

There tend to be two sorts of faults: long-term, and short-lived or transient. Transient faults are the most difficult to isolate and fix; they may occur only once a week with minimal impact on the transmission. Network logging is key to solving these transient faults, so that forensic analysis can be carried out later. Long-term faults are generally easier to identify as they persist for longer periods and can be chased through the system. However, the asynchronous and software-driven nature of IP means that the source of a problem may not be immediately obvious due to the interaction of software buffers and bursting data transmissions.
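The logging needed for forensic analysis of transient faults can be as simple as a bounded "flight recorder" of timestamped events. This is a minimal sketch, with invented event names; production systems would persist logs to durable storage rather than memory:

```python
import collections
import time

class FlightRecorder:
    """Keep a bounded in-memory log of network events so that rare,
    transient faults can be examined forensically after the fact."""

    def __init__(self, capacity=10000):
        # deque with maxlen silently evicts the oldest entry when full
        self.events = collections.deque(maxlen=capacity)

    def record(self, source, message):
        self.events.append((time.time(), source, message))

    def events_from(self, source):
        return [e for e in self.events if e[1] == source]

rec = FlightRecorder(capacity=3)
rec.record("switch-a", "link up")
rec.record("encoder-1", "buffer underrun")  # the transient fault
rec.record("switch-a", "link flap")
rec.record("switch-a", "link up")           # oldest entry is evicted
print([m for _, s, m in rec.events_from("switch-a")])
# prints ['link flap', 'link up']
```

The bounded buffer matters: transient faults may not surface for days, so the recorder must run continuously without exhausting memory.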

Orchestration services handle the creation of resources to meet peak demand, but the amount of resource they can create is limited by the availability of on-prem hardware and by the cost of off-prem cloud systems. Again, monitoring is key to keeping these services running efficiently and reliably.
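The scaling decision an orchestrator makes can be reduced to a small calculation: fill on-prem capacity first, then burst to the cloud up to a cost-driven ceiling. All parameter names below are hypothetical; real orchestrators (Kubernetes autoscalers, cloud auto-scaling groups) expose equivalent knobs under their own names:

```python
def desired_instances(queue_depth, per_instance_capacity, on_prem_max, burst_max):
    """Scale worker count to demand, capped by on-prem hardware first
    and then by a cost-limited cloud burst ceiling."""
    needed = -(-queue_depth // per_instance_capacity)  # ceiling division
    on_prem = min(needed, on_prem_max)
    burst = min(max(needed - on_prem_max, 0), burst_max)
    return on_prem, burst

# 50 queued jobs, 10 per instance: 5 needed -> 3 on-prem + 2 cloud burst
print(desired_instances(queue_depth=50, per_instance_capacity=10,
                        on_prem_max=3, burst_max=4))
# prints (3, 2)
```

The explicit caps are the point: without them, a demand spike (or a fault that inflates the queue) could scale costs without bound, which is exactly why the article stresses monitoring these services.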

Well-designed cloud platforms can detect and mitigate many of the challenges raised, and more.
