Cloud Best Practices - Part 2

In Part 1 of this series we looked at how availability zones improve reliability and business continuity. In this part we look at improving systems further using chaos engineering and SLAs.

This article was first published as part of Essential Guide: Cloud Best Practices.

Microservice Implementation

The physical constraints of monolithic designs make scaling them very difficult, especially in the highly dynamic systems that are now prevalent in broadcasting. Microservices continue to advance, and not only do they deliver exceptional scalability and resilience, but they can also be cloud service provider agnostic, allowing broadcasters to diversify their infrastructures much more easily.

Cloud vendors do provide microservice and container specific components, but they do not have to be used. If the broadcaster is willing to invest in the microservices learning curve, then they can deploy their own virtual machines and build the required containerized infrastructure on them. This further allows broadcasters to distribute the servers either across multiple cloud regions or across entirely different cloud service providers. Other than learning how to deploy dynamic microservice architectures, one of the most significant challenges is latency.

Using internationally distributed cloud regions allows broadcasters to position the physical datacenter hardware closer to their clients, and this has the potential to significantly reduce latency. Also, regional datacenters from the same vendor tend to be interconnected with high-speed, dedicated networks, and this allows data to be moved with relatively low latency, much lower than could be achieved over the public internet.
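As a rough illustration of why distance matters, the sketch below estimates propagation delay over optical fiber. It assumes light travels through fiber at roughly two-thirds of its speed in a vacuum, about 200 km per millisecond, and it ignores routing, queuing, and processing delays, so real figures will be higher. The function names and distances are hypothetical.

```python
# Approximate speed of light in optical fiber, in km per millisecond.
# This is an assumed round figure, not a measured value for any network.
FIBER_SPEED_KM_PER_MS = 200.0


def propagation_delay_ms(distance_km, speed_km_per_ms=FIBER_SPEED_KM_PER_MS):
    """One-way propagation delay for a given fiber path length."""
    return distance_km / speed_km_per_ms


def round_trip_ms(distance_km):
    """Round-trip propagation delay, ignoring all other sources of latency."""
    return 2 * propagation_delay_ms(distance_km)


if __name__ == "__main__":
    for km in (100, 1000, 8000):
        print(f"{km} km path: about {round_trip_ms(km):.1f} ms round trip")
```

Under these assumptions, moving a service from 8,000 km away to 1,000 km away removes tens of milliseconds from every round trip before any other optimization is applied.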

However, to achieve the best results for container and microservice architectures, system architects must design for distributed processing, low latency, and security from the very beginning of the build. This also makes it possible to build high levels of redundancy into the broadcast system at all levels of the workflows.

As viewer requirements change, broadcasters can see where and how the systems come under stress, so they can increase capacity as required. And as systems progress and broadcasters learn more about microservice architectures, they can even automate much of the scaling to further improve resilience and flexibility.

Chaos Engineering

It might seem a bit of a contradiction that, having spent so much time and effort making broadcast workflows operate efficiently and reliably, we should then purposely try to break them. But this is exactly how chaos engineering works.

The principle is to introduce controlled chaos into a system so that weaknesses can be quickly identified, allowing DevOps teams to find strategies to strengthen them and improve overall resilience. Although this is a powerful tool for improving reliability, it should only be conducted as part of a planned process. Deliberately deleting the microservice system that plays the opening heads of the six o'clock news with two minutes to transmission, with no planning, is clearly a bad idea.

Figure 2 – Chaos engineering is a continuous practice that aims to improve resilience by testing and stressing infrastructures for unexpected events.

However, this does demonstrate the power of distributed systems such as microservice architectures that have been built from the ground up with resilience and backup in mind. With adequate planning, it should be possible to remove network cables, switch off routers and servers, and delete applications. However, no matter how resilient a designer thinks the architecture should be, it's only when it's tested using a chaos engineering approach that the true validity of the design becomes apparent.

Chaos engineering isn't just a one-off test. Instead, it's an integral part of the system, with "chaos" being injected into the architecture on a regular basis (with adequate planning). And this isn't restricted to the microservice components but extends to the whole broadcast system. For example, what would happen if the electrical power supply to the datacenter was switched off? Would the UPS take the load for long enough for the stand-by generator to be switched on and stabilize? It's better to find the failures within a planned test than in the middle of the night when a real fault occurs.
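The idea of a planned, repeatable experiment can be sketched in a few lines of Python. This is a toy simulation only, not a real chaos engineering tool: the `Replica` class, the `call_with_failover` helper, and the "playout-request" message are all hypothetical stand-ins for a redundant microservice and its client. The experiment disables a chosen number of replicas at random, checks whether the service still answers, and then restores the steady state.

```python
import random


class Replica:
    """Hypothetical stand-in for one running instance of a microservice."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


def call_with_failover(replicas, request):
    """Try each replica in turn and return the first successful response."""
    for replica in replicas:
        try:
            return replica.handle(request)
        except ConnectionError:
            continue
    raise RuntimeError("no healthy replica available")


def run_chaos_experiment(replicas, kill_count, rng):
    """Disable kill_count random replicas and report whether the service survived."""
    victims = rng.sample(replicas, kill_count)
    for victim in victims:
        victim.healthy = False
    try:
        call_with_failover(replicas, "playout-request")
        survived = True
    except RuntimeError:
        survived = False
    finally:
        # Always restore the steady state, whatever the outcome.
        for victim in victims:
            victim.healthy = True
    return survived, [victim.name for victim in victims]


if __name__ == "__main__":
    pool = [Replica("playout-a"), Replica("playout-b"), Replica("playout-c")]
    survived, killed = run_chaos_experiment(pool, 2, random.Random())
    print(f"killed {killed}, service survived: {survived}")
```

In a real system the "kill" step would act on actual infrastructure and the check would be a meaningful health measurement, but the shape is the same: define the steady state, inject a bounded fault, observe, and restore.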

Just like agile processes that govern DevOps, chaos engineering is a way of life that should be embraced as it will continuously improve performance and resilience of an IP broadcast infrastructure.

Continuous System Performance Monitoring and SLAs

Monitoring IT systems goes well beyond the video and audio signal monitoring broadcasters have become accustomed to. Although this is still important and plays a big role for broadcasters, monitoring IP networks, the resource they're using, and the costs being incurred is equally important.

With dynamic infrastructures, especially those using public cloud services, new resource is added as required, which incurs extra cost. Not only is it important to know when more resource is being allocated, but the effectiveness and efficiency of the algorithms deciding on the new allocation must also be continuously scrutinized so that they're not over- or under-scheduling cloud services. Also, DevOps teams can learn a lot from how the system is behaving overall and pre-empt any problems that may be about to occur.
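One simple way to scrutinize a scaling algorithm is to compare what it allocated against what was actually needed in each monitoring interval. The following is a minimal sketch under stated assumptions: the `audit_scaling` function, the 20% headroom allowance, and the sample figures are all hypothetical, and a real audit would use measurements from the platform's own monitoring.

```python
def audit_scaling(demand, allocated, headroom=0.2):
    """Label each interval 'under', 'over', or 'ok'.

    demand and allocated are equal-length sequences in the same unit
    (for example, number of instances). An allocation is treated as
    'over' when it exceeds demand by more than the headroom fraction.
    """
    report = []
    for needed, provided in zip(demand, allocated):
        if provided < needed:
            report.append("under")
        elif provided > needed * (1 + headroom):
            report.append("over")
        else:
            report.append("ok")
    return report


if __name__ == "__main__":
    needed = [10, 10, 10]
    provided = [8, 11, 20]
    print(audit_scaling(needed, provided))
```

Counting how often each label appears over a day or a week gives a quick indication of whether the algorithm is wasting money or starving the workflow.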

One example is a slow API response. This could be due to an unusually long database query or network congestion, but the monitoring will give a more focused indication of where the latency is occurring. Research has demonstrated that users can adapt to reasonable amounts of constant latency, but variable latency is difficult to work with and problematic. Due to the dynamic nature of the infrastructure, latencies can develop and accumulate without warning, hence the need for continuous monitoring.
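To illustrate the difference between constant and variable latency, the sketch below summarizes a window of response-time samples by its mean and its spread (the population standard deviation, used here as a simple jitter measure). The thresholds of 200 ms and 20 ms and the function names are assumptions chosen for the example, not recommended values.

```python
import statistics


def summarize_latency(samples_ms):
    """Return the mean, jitter (population standard deviation), and maximum."""
    return {
        "mean_ms": statistics.fmean(samples_ms),
        "jitter_ms": statistics.pstdev(samples_ms),
        "max_ms": max(samples_ms),
    }


def classify(samples_ms, mean_limit_ms=200.0, jitter_limit_ms=20.0):
    """Flag variable latency first, since it is the harder case to work with."""
    summary = summarize_latency(samples_ms)
    if summary["jitter_ms"] > jitter_limit_ms:
        return "variable"
    if summary["mean_ms"] > mean_limit_ms:
        return "slow but steady"
    return "ok"


if __name__ == "__main__":
    print(classify([100, 101, 99, 100]))
    print(classify([300, 301, 299, 300]))
    print(classify([50, 400, 60, 350]))
```

Running the same summary per component (API gateway, database, network hop) is one way to narrow down where the latency is being introduced.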

As well as providing deep insight into the operation of the system, continuous monitoring highlights the areas of the infrastructure that would benefit most from service level agreements (SLAs) and helps determine the extent of their cover. Providing blanket SLAs for every piece of equipment or service is costly and inefficient, especially as many vendors now provide varying levels of SLA. Being able to allocate the most appropriate SLA to each part of the infrastructure will deliver optimal resilience and efficiency.
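The arithmetic behind comparing SLA levels can be sketched as follows. The availability figures are hypothetical, and the redundancy formula assumes that failures are independent, which real systems often violate (shared power, shared network, shared software bugs), so the results should be read as optimistic estimates.

```python
HOURS_PER_YEAR = 365 * 24


def serial_availability(availabilities):
    """Availability of a chain in which every component must be working."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


def redundant_availability(availability, copies):
    """Availability of identical copies where any one copy is sufficient.

    Assumes the copies fail independently of each other.
    """
    return 1.0 - (1.0 - availability) ** copies


def downtime_hours_per_year(availability):
    """Expected unavailable hours in a 365-day year."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    chain = serial_availability([0.999, 0.999])
    print(f"two 99.9% services in series: {chain:.6f}")
    print(f"downtime: {downtime_hours_per_year(chain):.2f} hours per year")
    print(f"two 99% services in parallel: {redundant_availability(0.99, 2):.4f}")
```

Two services each at 99.9% give a lower figure when chained together, while two modest 99% services in parallel can exceed either one alone. This is why a higher SLA is usually best spent on components that sit in series with everything else and have no redundant path.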

Conclusion

Best practices for building hybrid, on- and off-prem cloud and virtualized infrastructures must be considered at the very beginning of the design. These practices develop further as the functional aspects of the workflows expand, and therefore need to be under constant review. Due to their dynamic nature, best practices, like so many other aspects of broadcast IP infrastructure design and maintenance, should be considered an ongoing process that is always open to development and improvement.
