Cloud Best Practices - Part 2

In Part 1 of this series we looked at how availability zones improve reliability and business continuity. In this part we look at improving systems further using chaos engineering and SLAs.


This article was first published as part of Essential Guide: Cloud Best Practices.

Microservice Implementation

The physical constraints of monolithic designs make them very difficult to scale, especially in the highly dynamic systems now prevalent in broadcasting. Microservices continue to advance: not only do they deliver exceptional scalability and resilience, but they can also be cloud service provider agnostic, allowing broadcasters to diversify their infrastructures much more easily.

Cloud vendors do provide microservice- and container-specific components, but they do not have to be used. If broadcasters are willing to invest in the microservices learning curve, they can deploy their own virtual machines and build the required containerized infrastructure on them. This allows them to distribute servers across multiple cloud regions, or even across entirely different cloud service providers. Beyond learning how to deploy dynamic microservice architectures, one of the most significant challenges is latency.
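As a minimal illustration of what provider-agnostic design can look like in practice, the Python sketch below hides vendor differences behind a common interface so the same workload definition can be deployed to either provider. The provider classes, image name, and region identifier are entirely hypothetical; a real adapter would call the vendor's actual orchestration API.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    """A vendor-neutral description of a containerized service."""
    image: str      # container image, e.g. a playout service
    replicas: int   # desired number of container instances
    region: str     # target region identifier

class CloudProvider:
    """Common interface each vendor-specific adapter implements."""
    def deploy(self, workload: Workload) -> None:
        raise NotImplementedError

class ProviderA(CloudProvider):  # hypothetical adapter for vendor A
    def deploy(self, workload: Workload) -> None:
        # Translate the neutral Workload into vendor A's API calls here.
        print(f"[A] deploying {workload.image} x{workload.replicas} in {workload.region}")

class ProviderB(CloudProvider):  # hypothetical adapter for vendor B
    def deploy(self, workload: Workload) -> None:
        print(f"[B] deploying {workload.image} x{workload.replicas} in {workload.region}")

# The same workload can target either vendor, or both for redundancy.
playout = Workload(image="registry.example.com/playout:1.4", replicas=3, region="eu-west")
for provider in (ProviderA(), ProviderB()):
    provider.deploy(playout)
```

The adapter layer is where the investment in the learning curve pays off: workflow logic stays vendor-neutral, and only the thin adapters need rewriting when a provider changes.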

Using internationally distributed cloud regions allows broadcasters to position physical datacenter hardware closer to their clients, which has the potential to significantly reduce latency. Interconnected regional datacenters from the same vendor also tend to be linked by high-speed, dedicated networks, allowing data to be moved with much lower latency than could be achieved over the public internet.
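A simple way to act on this is to probe each regional endpoint and direct traffic to the nearest. The sketch below uses TCP connection setup time as a rough latency estimate; the hostnames are placeholders, not real services.

```python
import socket
import time

# Hypothetical regional endpoints; substitute real service hostnames.
REGIONS = {
    "eu-west": "eu-west.example.com",
    "us-east": "us-east.example.com",
    "ap-south": "ap-south.example.com",
}

def tcp_rtt(host: str, port: int = 443, timeout: float = 2.0) -> float:
    """Measure one TCP connection setup time as a rough latency probe."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.perf_counter() - start
    except OSError:
        return float("inf")  # unreachable regions sort last

# Pick the region with the lowest measured round-trip estimate.
best = min(REGIONS, key=lambda r: tcp_rtt(REGIONS[r]))
print(f"lowest-latency region: {best}")
```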

However, to achieve the best results from container and microservice architectures, system architects must design for distributed processing, low latency, and security from the very beginning of the build. This also makes it possible to build high levels of redundancy into the broadcast system at every level of the workflow.

As viewer requirements change, broadcasters can see how and where the system is stressed, and increase capacity as required. As systems mature and broadcasters learn more about microservice architectures, they can even automate much of the scaling to further improve resilience and flexibility.
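The heart of such automation is often a target-tracking rule: scale the replica count in proportion to measured load. The sketch below shows the idea in Python; the target, bounds, and utilization figures are illustrative and not taken from any particular cloud API.

```python
import math

def desired_replicas(current: int, utilization: float,
                     target: float = 0.6,
                     minimum: int = 2, maximum: int = 20) -> int:
    """Target-tracking rule: scale replicas in proportion to load.

    utilization is the average fraction of capacity in use (0.0-1.0)
    across the current replicas; target is the level we want to hold.
    """
    if current == 0:
        return minimum
    wanted = math.ceil(current * utilization / target)
    return max(minimum, min(maximum, wanted))

# e.g. 5 replicas running at 90% average load -> scale out to 8
print(desired_replicas(current=5, utilization=0.9))
```

Keeping a minimum of two replicas preserves redundancy even at idle, while the maximum caps runaway cost if the utilization metric misbehaves.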

Chaos Engineering

It might seem a contradiction that, having spent so much time and effort making broadcast workflows operate efficiently and reliably, we should then purposely try to break them. But this is exactly how chaos engineering works.

The principle is to introduce controlled chaos into a system so that weaknesses can be quickly identified, allowing DevOps teams to find strategies to strengthen them and improve overall resilience. Although this is a powerful tool for improving reliability, it should only be conducted as part of a planned process. Deliberately deleting the microservice system that plays the opening headlines of the six o'clock news, two minutes before transmission and with no planning, is clearly a bad idea.

Figure 2 – Chaos engineering is a continuous practice that aims to improve resilience by testing and stressing infrastructures for unexpected events.

However, this does demonstrate the power of distributed systems such as microservice architectures that have been built from the ground up with resilience and backup in mind. With adequate planning, it should be possible to remove network cables, switch off routers and servers, and delete applications. Yet no matter how resilient a designer thinks the architecture is, it is only when it is tested using a chaos engineering approach that the true validity of the design becomes apparent.

Chaos engineering isn’t a one-off test; it’s an integral part of the system, with “chaos” being injected into the architecture on a regular basis (with adequate planning). And this isn’t restricted to the microservice components but extends to the whole broadcast system. For example, what would happen if the electrical power supply to the datacenter was switched off? Would the UPS take the load long enough for the stand-by generator to start and stabilize? It’s better to find the failure points in a planned test than in the middle of the night when a real fault occurs.
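A typical experiment follows the same shape every time: confirm the system is healthy, inject one fault, then check whether the steady state held. The Python sketch below simulates that loop; the health probe and fault injection are stand-ins for real metric checks and orchestrator API calls, and the instance names are hypothetical.

```python
import random

def steady_state_ok() -> bool:
    """Hypothetical health probe: in practice, check real service metrics
    (error rate, playout continuity) against agreed thresholds."""
    return random.random() > 0.1  # simulated 90% healthy

def kill_random_instance(instances: list) -> str:
    """Simulated fault injection: in practice, terminate a real container
    or VM through the orchestrator's API, inside a planned window."""
    victim = random.choice(instances)
    instances.remove(victim)
    return victim

def run_experiment(instances: list) -> None:
    if not steady_state_ok():
        print("system not healthy; aborting experiment")  # never inject into a sick system
        return
    victim = kill_random_instance(instances)
    print(f"terminated {victim}; remaining: {instances}")
    if steady_state_ok():
        print("steady state held: redundancy worked")
    else:
        print("steady state broken: weakness found, fix before the next run")

run_experiment(["playout-1", "playout-2", "playout-3"])
```

The abort-if-unhealthy guard is what makes this controlled chaos rather than sabotage: faults are only injected when the system starts from a known-good state.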

Just like the agile processes that govern DevOps, chaos engineering is a way of life that should be embraced, as it will continuously improve the performance and resilience of an IP broadcast infrastructure.

Continuous System Performance Monitoring and SLAs

Monitoring IT systems goes well beyond the video and audio signal monitoring broadcasters have become accustomed to. Although signal monitoring is still important and plays a big role, monitoring IP networks, the resources they’re using, and the costs being incurred is equally important.

With dynamic infrastructures, especially those using public cloud services, new resources are added as required, which incurs extra cost. Not only is it important to know when more resources are being allocated, but the effectiveness and efficiency of the algorithms deciding on the allocation must also be continuously scrutinized so that they’re not over- or under-provisioning cloud services. DevOps teams can also learn a great deal from how the system is behaving overall and pre-empt problems before they occur.

One example is a slow API response: this could be due to an unusually long database query or network congestion, but monitoring will give a more focused indication of where the latency is occurring. Research has demonstrated that users can adapt to reasonable amounts of constant latency, but variable latency is far harder to work with. Due to the dynamic nature of the infrastructure, latencies can develop and accumulate without warning, hence the need for continuous monitoring.
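Because variability matters as much as magnitude, a monitoring probe should report spread as well as averages. The sketch below summarizes a batch of response-time samples with median, 95th percentile, and jitter; the sample values and the alert threshold are illustrative.

```python
import math
import statistics

def latency_report(samples_ms: list) -> dict:
    """Summarize API response times: steady latency is tolerable,
    high variance (jitter) is the bigger operational warning sign."""
    ordered = sorted(samples_ms)
    idx = max(0, math.ceil(0.95 * len(ordered)) - 1)  # nearest-rank p95
    return {
        "median_ms": statistics.median(ordered),
        "p95_ms": ordered[idx],
        "jitter_ms": statistics.stdev(ordered),  # spread, not just magnitude
    }

# Hypothetical samples from a periodic API probe; note the one outlier.
samples = [42.0, 45.0, 41.0, 44.0, 250.0, 43.0, 46.0, 42.0]
report = latency_report(samples)
print(report)
if report["jitter_ms"] > 0.5 * report["median_ms"]:
    print("variable latency detected: investigate DB queries / congestion")
```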

As well as providing deep insight into the operation of the system, continuous monitoring highlights the areas of the infrastructure that would benefit most from service level agreements (SLAs), and helps determine the extent of their coverage. Providing blanket SLAs for every piece of equipment or service is costly and inefficient, especially as many vendors now provide varying levels of SLA. Allocating the most appropriate SLA to each part of the infrastructure delivers optimal resilience and efficiency.
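A quick calculation shows why per-component SLA choices matter: the availability of components in series multiplies, so the weakest SLA in a workflow chain dominates the whole. The figures below are illustrative only.

```python
# Availability of components in series multiplies, so a single weak SLA
# drags the whole chain down. Component names and figures are illustrative.
slas = {"ingest": 0.999, "playout": 0.9999, "storage": 0.9995}

composite = 1.0
for name, availability in slas.items():
    composite *= availability

minutes_per_year = 365 * 24 * 60
downtime = (1 - composite) * minutes_per_year
print(f"composite availability: {composite:.4%}")
print(f"expected downtime: {downtime:.0f} minutes/year")
```

Spending on a premium SLA for the storage tier is wasted if the ingest tier's cheaper SLA caps the chain; monitoring data shows which component that is.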

Conclusion

Determining best practices for building hybrid, on- and off-prem cloud and virtualized infrastructures must begin at the very start of the design. These practices develop further as the functional aspects of the workflows expand, and therefore need to be kept under constant review. Like so many other aspects of broadcast IP infrastructure design and maintenance, best practices should be treated as an ongoing process that is always open to development and improvement.
