Messaging is an incredibly important concept in microservices and forms the backbone of their communication. But keeping systems coherent and resilient requires an understanding of how microservices communicate and why.
There are two fundamental types of messaging in a microservice ecosystem: public-facing messages and private messages. Public-facing messages are those users need to execute a process, such as an ingest operation, while private messages are those the microservices use to communicate with each other without exposing them to the outside world.
As users will generally access the operational side of the microservice through a web browser or over the internet, the public messages must be internet compliant. Generally, this means HTTPS running over TCP/IP. The TCP layer provides a logical connection with guaranteed delivery, and HTTP provides the transfer protocol that allows web browsers and servers to exchange data.
Messaging becomes particularly interesting when we consider what happens when things go wrong. A message may be lost in transit or corrupted before it is delivered, a node may fail, or an instance might crash or be overwhelmed with requests, rendering it unable to process jobs in the message queue.
There may be tens, hundreds, or even thousands of instances of microservices running in a particular ecosystem. All these need to exchange messages to maintain the workflow and provide the relevant updates.
Failures fall into two camps: transient and complete. With transient failures, a failure may only exist for a short period of time as the system effectively repairs itself. This could be caused by a network switch failing, resulting in the IP packets taking another route, or by a node becoming temporarily overloaded but recovering quickly. Complete failures persist for much longer and can be caused by situations such as a server failing or a power system lacking sufficient failsafes.
If a recovery strategy is not adopted, the messages will either cause congestion in the network and servers or render the system unstable. Consequently, transient and complete failures require different message recovery strategies, and these fall into two types: retry and circuit breaker.
The “retry” method deals best with suspected transient failures: the sender keeps transmitting messages to the target microservice on the assumption that the failure will clear quickly. If it does, any duplicate messages are simply discarded and the system continues to function. However, this makes a big assumption, as continually resending the same message requires the receiving microservice to be idempotent, that is, processing the same message multiple times must not have a detrimental effect on the microservice.
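As a sketch of the retry strategy described above, the following assumes a hypothetical `send` callable that raises on failure, and adds an exponential backoff so a transient fault has time to clear before the next attempt:

```python
import time

def send_with_retry(send, message, max_attempts=3, base_delay=0.1):
    """Retry a send, assuming the receiving microservice is idempotent.

    'send' is a hypothetical callable that raises ConnectionError on
    failure; exponential backoff gives a transient fault time to clear.
    """
    for attempt in range(max_attempts):
        try:
            return send(message)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise                       # give up after the final attempt
            time.sleep(base_delay * (2 ** attempt))  # back off, then retry
```

Note that the sender only discovers success or failure per attempt; it is the receiver's idempotency that makes the duplicates harmless.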
Figure 1 – If microservice ‘B’ is not idempotent, resending a failed message will result in the microservice executing the instruction twice, with potentially disastrous side effects that can make the system unstable.
Maintaining idempotent software is incredibly important, otherwise the system can become unstable. For example, if microservice ‘A’ sends a message to microservice ‘B’ requesting an ‘id’ to check whether an operation has been executed, this is considered idempotent as the message from microservice ‘A’ does not change the state of microservice ‘B’ in any way. So, if multiple messages were sent, there would be no unintended side effects for microservice ‘B’. However, if microservice ‘A’ sent a message instructing microservice ‘B’ to read the next file in a directory, microservice ‘B’ would certainly not be idempotent as there would be a massive side effect. If the same message was sent ten times by microservice ‘A’, and some of the messages were lost, then the state of microservice ‘B’ would be unknown; it would be indeterminate.
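A common way to make an operation like "read the next file" safe to retry is to deduplicate on a message identifier, so a resent message is recognised and ignored. The sketch below models the hypothetical microservice ‘B’ from the example, with an inherently idempotent status query alongside the deduplicated read:

```python
class FileReader:
    """Hypothetical microservice 'B': reads files from a directory in order."""

    def __init__(self, files):
        self.files = files
        self.position = 0
        self.seen = set()   # ids of messages already processed

    def status(self):
        # Idempotent: repeating this query changes no state.
        return self.position

    def read_next(self, message_id):
        # Made idempotent by deduplicating on the message id:
        # a retried (duplicate) message becomes a harmless no-op.
        if message_id in self.seen:
            return None
        self.seen.add(message_id)
        file = self.files[self.position]
        self.position += 1
        return file
```

With this scheme, microservice ‘A’ can resend "msg-1" as many times as it likes; only the first delivery advances the file position, so the state of ‘B’ remains determinate.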
If too many failed messages are transmitted by the sender, a bottleneck can easily occur, leading to congestion in the network and the destination microservices as the messages overflow the queue buffer. The circuit breaker detects the failing messages once a predetermined threshold is reached and stops the sender from transmitting. The error is then raised to the orchestrator software, where other action can be taken, such as re-routing or raising an alarm.
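A minimal circuit-breaker sketch, with an illustrative failure threshold and reset timeout (real implementations add half-open probing, metrics, and orchestrator hooks):

```python
import time

class CircuitBreaker:
    """Trips after 'threshold' consecutive failures, blocking further sends."""

    def __init__(self, threshold=5, reset_after=30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, send, message):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                # Breaker is open: fail fast and let the orchestrator act.
                raise RuntimeError("circuit open: escalate to orchestrator")
            self.opened_at = None   # timeout elapsed: allow a trial send
            self.failures = 0
        try:
            result = send(message)
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0           # any success resets the failure count
        return result
```

Because the breaker fails fast while open, the queue buffers stop filling with doomed messages, avoiding the congestion described above.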
Synchronous And Asynchronous Messaging
When discussing synchronous and asynchronous messaging it’s important to note that these refer to the protocol, not the underlying I/O. An operating system can provide either synchronous or asynchronous I/O access, also referred to as blocking and non-blocking. With blocking I/O the software must wait for the operation to complete before it can continue, but with non-blocking I/O, the software’s execution is not held up. Synchronous and asynchronous messaging follow a similar idea, but they work at the protocol level, not the I/O level. It is entirely possible for a synchronous message to operate over asynchronous I/O.
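The blocking/non-blocking distinction can be seen directly at the socket level. In this sketch (using a local socket pair purely for illustration), a non-blocking read returns immediately when no data is available instead of waiting:

```python
import socket
import time

a, b = socket.socketpair()
b.setblocking(False)        # switch socket 'b' to non-blocking I/O

pending = False
try:
    b.recv(16)              # nothing has been sent yet...
except BlockingIOError:
    pending = True          # ...so the call returns at once instead of waiting

a.sendall(b"hello")         # now send some data
time.sleep(0.1)             # give the kernel a moment to deliver it
data = b.recv(16)           # data has arrived, so recv succeeds

a.close()
b.close()
```

With a blocking socket, the first `recv` would have simply stalled the program until data arrived; the protocol above it is unaware of which mode is in use.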
There are many pros and cons to synchronous and asynchronous messaging systems, but the fundamental difference is whether the sender waits for a response from the receiver (synchronous) or not (asynchronous). That is, with asynchronous messaging the sending thread is not blocked, which potentially results in improved latency. For example, if microservice ‘A’ calls microservice ‘B’, and this in turn calls microservice ‘C’, then with a synchronous system microservice ‘A’ could find itself blocked until microservice ‘C’ completes its action and reports back through microservice ‘B’.
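The blocking chain in that example can be sketched with ordinary function calls standing in for synchronous messages (service names and delays are illustrative):

```python
import time

def service_c():
    time.sleep(0.02)            # simulated processing time in 'C'
    return "C result"

def service_b():
    # 'B' blocks until 'C' replies...
    return "B saw " + service_c()

def service_a():
    # ...so 'A' is blocked for the combined latency of 'B' and 'C'.
    start = time.monotonic()
    reply = service_b()
    elapsed = time.monotonic() - start
    return reply, elapsed
```

Chain a few more services together and the caller’s latency grows with every synchronous hop, which is exactly what asynchronous messaging avoids.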
An asynchronous messaging system can also be thought of as a fire-and-forget communication system.
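A minimal sketch of fire-and-forget delivery, using a Python `Queue` as a stand-in for a message broker and a thread as the receiving microservice (all names are illustrative):

```python
import queue
import threading

inbox = queue.Queue()
received = []

def microservice_b():
    # Consumer: processes whatever arrives on the queue.
    received.append(inbox.get())

worker = threading.Thread(target=microservice_b)
worker.start()

# Microservice 'A' publishes and continues immediately;
# it does not wait for 'B' to respond.
inbox.put("job-started")
worker.join()
```

The queue decouples the two services: ‘A’ is never blocked on ‘B’, and ‘B’ drains the queue at its own pace.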
Figure 2 – In the top diagram microservice ‘A’ sends a synchronous message to microservice ‘B’ but must wait until ‘B’ has responded before it can continue. In the bottom diagram, microservice ‘A’ sends messages to the message broker using a fire-and-forget methodology, so microservice ‘A’ does not have to wait for any responses. Furthermore, the messaging broker allows multiple microservices to subscribe to, and receive messages from, microservice ‘A’.
As asynchronous events do not need to wait for a response from the receiver, they lend themselves well to broadcasting or streaming messages to multiple receivers, often referred to as the pub/sub (publisher–subscriber) model. Conceptually, this is similar to IP multicast, where routers subscribe to a multicast group to receive the data on behalf of multiple destination devices. Within microservice architectures, a separate messaging bus is often employed to deliver this type of event-driven communication. For example, five microservices may subscribe to a microservice sender through the messaging bus. As the sender does not expect a response from any of the receivers, it will send (publish) the message and any service subscribing to it will receive the message.
To facilitate message delivery, microservice architectures often employ a messaging broker. This is effectively an intelligent queue management system that buffers the messages in memory and makes sure they are delivered to the microservices that have subscribed to the publishing microservice.
The messaging broker also handles all the management of the subscriptions and facilitates their delivery. A microservice architecture may employ thousands of publishers and subscribers making the whole asynchronous messaging system incredibly complicated.
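The essential shape of such a broker can be sketched as an in-memory topic map; the topic name and two-subscriber setup below are purely illustrative, and a real broker would add persistence, acknowledgements, and network transport:

```python
from collections import defaultdict

class Broker:
    """Minimal in-memory pub/sub broker sketch (not production code)."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        # Register a microservice's handler for a topic.
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Fire-and-forget: the publisher does not wait for replies.
        for callback in self.subscribers[topic]:
            callback(message)

broker = Broker()
received = []
for name in ("svc1", "svc2"):   # two subscribing microservices
    broker.subscribe("ingest.events", lambda m, n=name: received.append((n, m)))
broker.publish("ingest.events", "file-ready")
```

One publish fans out to every subscriber of the topic, which is exactly the management burden the broker lifts off the individual microservices.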
Without a messaging broker, the microservices themselves would have to provide the necessary management and protocols to facilitate asynchronous delivery. This is a complex and arduous task and so is often left to the expert service providers that deliver this type of service, leaving the broadcast vendor free to focus on building their application.
Messaging within microservice architectures not only provides a method of communication for different microservice apps but also delivers resilience and scaling to greatly improve system reliability and flexibility.