Building Software Defined Infrastructure: Observability In Microservice Architecture

Building dynamic, microservice-based infrastructure introduces the potential for variable latency, which brings new monitoring challenges that require an understanding of observability.

Monitoring microservices not only embraces traditional broadcast signal monitoring but builds on it to establish the reliability of an infrastructure that is dynamic by its very design. This leads to a monitoring philosophy that may be new to broadcasters but is well known to IT and network engineers: observability.

Monitoring can be thought of as the process of tracking anomalies over time, whereas observability is centered on understanding dynamic systems in real time. On the face of it these may look virtually the same, but as we understand more about the dynamic nature of microservices, the differences become much more apparent.

Monitoring signals in a broadcast facility is an engineer’s fundamental role. The reliability of video and audio signals is of paramount importance as it fundamentally drives the quality of the broadcast. Until the adoption of software infrastructures, this was the limit of our monitoring needs. We don’t need to worry whether a production switcher responds quickly when switching to another camera, or whether opening a fader on a sound desk will quickly increase the level of the audio signal, because the operational consoles have a direct connection to the video and audio processing units and generally do not go through any type of computer network. This is not true of software infrastructures.

Software Infrastructure Dynamic Interaction

When adopting software defined infrastructures, the dynamic nature of both the network and the software applications, such as microservices, not only introduces latency but also has the potential to make that latency variable. This leads to many new challenges, as we can no longer assume that the control interface to the microservice providing the service is predictable or static.

To get a deeper understanding of this we need to look at what is going on within a group of services. For example, if four cameras are streamed into a microservice production switcher, we not only need to consider the latency and signal quality of the video, but also the control path between the human control interface (HCI) and the various elements within the production switcher.

The art of fully embracing microservice architectures requires us to think of the production switcher not as a single monolithic software application, but as many smaller programs that combine to make the whole production switcher unit. This delivers incredible flexibility for the broadcaster and provides many monetizing opportunities for the vendor. A broadcaster may only need a single program bank with a downstream keyer for a specific application one day, and a full two-ME switcher with multiple keyers and DVE on another. In the spirit of microservices, this can be achieved relatively easily.

Variable Control Latency

Looking at the signal flow for the simple four-input, single-bank switcher with a DSK, the control from the user’s HCI is relatively straightforward as it is just a TCP/IP connection to the application. The major difference between this microservice infrastructure and a dedicated hardware switcher is that the control path between the HCI and the microservice production switcher application is subject to all the challenges any other IP packets face in a network. Packets may be lost, delivered out of sequence, or simply delayed by the network switches. TCP solves all these challenges (assuming there is no permanent link failure in the network) but does so at the expense of latency. If the latency is small, then nobody cares, but if it is large, a major operational problem soon transpires.
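As a concrete illustration, the sketch below sends a single control message to a hypothetical switcher microservice over TCP and measures the round-trip time. The address, the newline-delimited JSON message format and the acknowledgement behavior are assumptions made for the example, not a vendor API.

```python
# A minimal sketch, assuming a hypothetical microservice that accepts
# newline-delimited JSON control messages over TCP and returns an acknowledgement.
import json
import socket
import time

SWITCHER_ADDR = ("10.0.0.42", 5100)   # hypothetical endpoint for the switcher app

def send_cut(camera: int) -> float:
    """Send a 'cut' command and return the round-trip latency in milliseconds."""
    command = json.dumps({"action": "cut", "source": camera}) + "\n"
    with socket.create_connection(SWITCHER_ADDR, timeout=1.0) as sock:
        start = time.monotonic()
        sock.sendall(command.encode())
        sock.recv(1024)               # block until the acknowledgement arrives
        return (time.monotonic() - start) * 1000.0

if __name__ == "__main__":
    print(f"Cut to camera 2 acknowledged in {send_cut(2):.1f} ms")
```

If the acknowledgement never arrives within the timeout, an exception is raised rather than the video simply failing to change, which is exactly the kind of event observability is there to surface.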

When a message is sent to a microservice from the HCI, how do we know if it’s been received? At a simple level, we will see the video output change to a different camera, or the DSK will show the required graphic. But what if the message hasn’t been received? Where is the error? And even more challenging, what if the control response starts to become intermittent or varies in its response times?

This is where the concept of observability comes in. Observability allows us to understand how the system is behaving from the perspective of an outside observer. We need to be collecting data through logs and also analyzing data in real time. Although real-time analysis may be relatively new for IT engineers, it’s something broadcast engineers have been doing since the beginning of television, hence our waveform monitors and vectorscopes. Logging, on the other hand, is relatively new to broadcasters. Videotape can be thought of as a data logger, but with live television there are no second chances, hence the reason broadcast engineers spend so much effort and money on making real-time infrastructures reliable.

Diagram 1 – Latency in microservices appears not only in the network delays, but also in the processing needed to action the command.


Observability, then, is heavily focused on real-time events. This may seem obvious for broadcast engineers as that’s what we have always done, but as we move further into microservices, the dynamic interaction of the apps and the underlying networks, combined with the media streams and control messages, results in a system that is anything but static. Furthermore, instead of thinking in absolute terms, we must start thinking in terms of probability and statistics. Although this may sound like complete heresy, it’s not a million miles away from where broadcasting has always been. We may think of an HD-SDI signal as having a data rate of exactly 1.485Gb/s, but there is variance in this, and the clock frequency does drift a little within its tolerance. It’s just that we don’t notice it.
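To make that concrete, the short sketch below treats control latency as a distribution rather than a single number. The samples are simulated purely for illustration; in practice they would come from measurements such as the round-trip timing shown earlier.

```python
# A minimal sketch of treating control latency as a distribution rather than
# a fixed number. The samples are simulated here purely for illustration.
import random
import statistics

random.seed(1)
# Simulate 1000 control round trips: nominally 250 ms plus some network jitter.
latency_ms = [random.gauss(250, 8) + random.expovariate(1 / 3) for _ in range(1000)]

mean = statistics.mean(latency_ms)
stdev = statistics.stdev(latency_ms)
p99 = statistics.quantiles(latency_ms, n=100)[98]   # 99th percentile

print(f"mean {mean:.1f} ms, stdev {stdev:.1f} ms, 99th percentile {p99:.1f} ms")
```

The mean tells us very little on its own; it is the spread and the tail, the 99th percentile, that reveal whether the control response is starting to become intermittent.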

Latency Is A Fact Of Life

The price we pay for the flexibility and scalability, along with all the commercial savings that IP and software defined infrastructures offer, is that the systems we are now designing have a certain amount of temporal variance in them. The more flexible we make a system, the more temporal variance we must accept. Key to building a reliable IP and software infrastructure is knowing how much latency is acceptable.

Humans can work with latency very easily if it is predictable, and we do this in all walks of life. Anybody who drives a car experiences this when braking: there is a latency between pushing the pedal, the pads pressing against the discs, and the reduction in speed. Experienced drivers don’t notice this latency as they’ve adjusted to it. This brings us to how much latency we can accept in a control type application. For the production switcher cutting between four cameras, the latency could easily be 250ms, roughly six frames at 25fps. How many production switcher operators would be able to detect that delay between pushing the cut button and the video output changing? Not many, if any at all.
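As a quick sanity check, the snippet below converts a control latency budget into video frames at common frame rates; the 250ms figure is simply the example value used above.

```python
# Convert a control latency budget into video frames at common frame rates.
LATENCY_MS = 250.0

for fps in (25.0, 29.97, 50.0, 59.94):
    frame_ms = 1000.0 / fps
    print(f"{LATENCY_MS:.0f} ms is about {LATENCY_MS / frame_ms:.1f} frames at {fps} fps")
```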

Coming back to latency measurements in software defined infrastructures and microservice apps, we must be careful how and what we measure. If we log and record every message between every microservice in a broadcast facility, we will generate so much data that it will be almost useless, as we won’t know where to start our analysis. This is analogous to putting a waveform monitor on every patch-cord in every bay of a broadcast facility: it’s both unworkable and unnecessary. Instead, we must take a more pragmatic look at what we observe and how. Observability is a wonderful buzzword, but if we observe a system to the point where the observer is negatively impacting it, then the observer becomes the problem and not the solution. In COTS and IP infrastructures, it is very easy to negatively influence systems through excessive logging and monitoring.

Just In Time Monitoring

One solution to observability is to use data sampling. In the case of the HCI controlling the four-input production switcher, instead of recording every single message, we could record just one in every hundred. To understand why recording every message is an issue, just think about how the message is recorded and where. If it’s recorded on a local NAS, the logging device will have to send the messages to the NAS, probably over the same network, and by doing so would halve the capacity of the network, further increasing the latency variance.
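A minimal sketch of that 1-in-100 sampling is shown below. The log_to_store() function is a hypothetical placeholder for wherever the records are actually kept, and the sample rate is just the example figure from above.

```python
# A minimal sketch of 1-in-100 sampling of control messages, assuming a
# hypothetical log_to_store() that ships the record to the chosen log store.
import time

SAMPLE_RATE = 100          # keep one message in every hundred
_message_count = 0

def log_to_store(record: dict) -> None:
    # Placeholder for the real log destination (NAS, log collector, etc.).
    print(record)

def observe_control_message(message: dict, latency_ms: float) -> None:
    """Record every hundredth control message rather than all of them."""
    global _message_count
    _message_count += 1
    if _message_count % SAMPLE_RATE == 0:
        log_to_store({"ts": time.time(), "latency_ms": latency_ms, "msg": message})
```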

There will probably be some sort of heartbeat or keep-alive message being exchanged between the HCI and the production switcher app, and a control message is just a variation of this, so recording some aspects of every hundredth message will help enormously. We could record the latency and use this to provide a variance measurement; if the control latency deviates too far from the expected 250ms, we could send an alarm to the engineering central monitoring system. Or we could count the number of deviations in a minute and provide this as a measure to the engineers. By doing this we have subtly shifted into the world of probability measurements and predictions. Nothing is absolute; even the MCR SPG master oscillator has some variance.
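A sketch of that deviation count is shown below. The 250ms target, the 50ms tolerance, the alarm threshold and the raise_alarm() hook are all illustrative assumptions, not figures from any particular system.

```python
# A minimal sketch of turning sampled latencies into a per-minute deviation count.
import time

TARGET_MS = 250.0          # expected control latency (illustrative)
TOLERANCE_MS = 50.0        # how far a sample may stray before it counts as a deviation

def raise_alarm(count: int) -> None:
    # Placeholder for a message to the engineering central monitoring system.
    print(f"ALARM: {count} latency deviations in the last minute")

class DeviationCounter:
    def __init__(self, alarm_threshold: int = 5):
        self.alarm_threshold = alarm_threshold
        self.window_start = time.monotonic()
        self.deviations = 0

    def record(self, latency_ms: float) -> None:
        """Count samples that stray too far from the expected latency."""
        if abs(latency_ms - TARGET_MS) > TOLERANCE_MS:
            self.deviations += 1
        # At the end of each one-minute window, report and reset the count.
        if time.monotonic() - self.window_start >= 60.0:
            if self.deviations >= self.alarm_threshold:
                raise_alarm(self.deviations)
            self.window_start = time.monotonic()
            self.deviations = 0
```

Feeding the sampled heartbeat latencies into record() gives the engineers a simple per-minute statistic rather than a flood of raw messages.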

Like all engineering disciplines, monitoring and observability are as much an art as a science, and a great deal of understanding of the underlying technologies is required.
