Message Queue Pattern
Modern applications are complicated and there is a lot that goes on behind the scenes to deliver functionality to large bases of users. In this post, I will talk through the Message-Queue pattern, the problem it tries to solve and why we would want to use it.
Table of Contents
The problem of long-running tasks
Frustratingly, like most things in the world of software engineering, the message-queue pattern goes by a number of different names: job-queue systems, asynchronous task execution, and many more. But they all refer to the same concept.
The idea is that we often need to be able to execute time-consuming tasks.
So let’s take a look at what our simplest, base case scenario looks like.

In this simple scenario, our user makes a request to our application. Our application heads off to carry out that computation and gives a response back to the user when it is ready. Crucially, this would mean our application would be unable to serve any other requests during this time. This is what is known as a blocking call.
If another user wanted to make a request at the same time, our application would be blocked, as it would need to finish executing the long-running task associated with the first inbound request. I'm sure you'll agree that's a poor experience for the user.
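The blocking scenario can be sketched in a few lines of Python. The `handle_request` function and the sleep duration are purely illustrative stand-ins for a real handler and a real computation:

```python
import time

def handle_request(request_id):
    """Stand-in for a long-running computation (here, a short sleep)."""
    time.sleep(0.1)
    return f"result for {request_id}"

# The server deals with one request at a time: the second request
# waits for the first to finish before it is even started.
for request_id in ["req-1", "req-2"]:
    print(handle_request(request_id))
```

While `handle_request` runs for the first request, nothing else is served; that wait is the blocking call.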
What does it mean to be an application?
So we know that as soon as our application has to deal with any kind of scale, that configuration just isn't going to cut it.
We need to be able to offload these time-consuming tasks to something else in our system and free up our application so that the user experience is not degraded.
Which brings us to the question of what do we expect of our application?
We can say that the responsibility of our application is to serve users with the functionality they need.
We also have a single point of failure at the moment. If our application goes down, then we have lost the ability to serve our users as well as to execute tasks. And to top it off, we cannot scale those pieces of functionality independently of one another.
Restaurants have been doing this for years
Let’s imagine we go to a restaurant and a waiter comes over to take our order.
If the waiter then left to go and cook our food, then that same waiter cannot take the orders of other customers.
So how is this solved? Well of course, when the waiter takes our order, they then pass the order on to the kitchen. The chefs then do the time-consuming task of preparing our order.
In this system, the waiter is the entity that the customer interacts with. The customers tend not to interact directly with the chefs.
The waiter is also then freed up to take additional orders and pass them to the kitchen.
At busy times, the restaurant might plan to have more chefs (within reason), or orders may be queued for short periods of time before they are picked up by the chefs.
Can you see where we’re going with this?
The restaurant system can adjust its components in response to how busy it will get.
In the restaurant, each person has a specific role. Without at least one person in each of those roles, the restaurant would not be able to operate. We can say that the system works as the sum of its parts.
Separating the cooking from the serving
Going back to our application, if we created a new component responsible for executing the long-running task, then we could free up our waiter application to handle more requests.

Okay, so this feels like progress… But something is missing.
Remember our restaurant analogy? When the restaurant gets busy, we can enlist more chefs in the kitchen to help prepare orders and speed things up for us.
But right now, our application (the waiter) has to know exactly who to pass the request to.
It would make a lot more sense if we could just leave the order in a designated place.
That way, our worker (the chef) can simply pick up the next order when it is ready.
Right now, our application is too tightly coupled with our worker.
Our application has to know exactly which worker to send the task to, so we are unable to deploy more workers easily.
Ideally, our application should just be able to communicate with a single component, and then our workers could check that component for messages.
A simple message-queue
This is where the message-queue pattern becomes useful for us.

In this configuration, our application is the publisher, the queue is the broker, and the worker is the consumer. Note that the sole purpose of the queue is to act as an intermediary.
This allows us to decouple the communication between our application and the worker. Instead of our application directly asking our worker to complete a task, the application can publish a message to the queue.
Our application can take an inbound request and publish a message to the queue. The application can then return a response to the user, with an identifier which can later be used to check the progress of the task. This is known as the asynchronous request-response pattern.
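A minimal in-process sketch of this flow, using Python's standard `queue` module. The names `submit_task`, `worker_step` and the status values are assumptions for illustration, not any particular library's API:

```python
import queue
import uuid

# In-memory stand-ins for the broker and a status store; in a real
# system these would be a message broker and a database.
task_queue = queue.Queue()
task_status = {}

def submit_task(payload):
    """Publish a message and return an identifier straight away."""
    task_id = str(uuid.uuid4())
    task_status[task_id] = "PENDING"
    task_queue.put((task_id, payload))
    return task_id  # the user polls for progress with this id

def worker_step():
    """One consume cycle of a worker process."""
    task_id, payload = task_queue.get()
    # ... the long-running task would execute here ...
    task_status[task_id] = "DONE"
    task_queue.task_done()

task_id = submit_task("resize-image")
print(task_status[task_id])  # PENDING: we responded before the work ran
worker_step()
print(task_status[task_id])  # DONE
```

The key point is that `submit_task` returns immediately; the caller gets an identifier, not a result, and checks back later.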
The message-queue pattern does add complexity to our API and to our architecture. It also incurs some overhead because we now have to publish and consume messages. However, this pattern does promote modifiability and scalability for us.
But this solution is not a free lunch.
We have introduced a lot of complexity to our platform. And it has a number of problems that we haven’t considered yet.
Acknowledging messages
Now that we have our message-queue-consumer system in place, we can horizontally scale out the number of workers that we have in response to high traffic loads.
But this brings with it a problem we haven’t had to face up to yet.

How do we guarantee that we don’t end up in a scenario in which we have multiple workers trying to consume the same message from the queue?
Whether we implement the broker with AWS SQS, RabbitMQ, Redis, Kafka or some other solution, the semantics will differ slightly, and the nuances also depend on what is consuming the message. But essentially, our consumers will have to acknowledge or reserve the messages they pick up, which prevents other consumers from taking the same message.

This seems fine, but we have some design decisions to make as to how to handle the failure scenarios.
What do we do when Worker A reserves the message, but then suffers a failure or an outage?
Do we simply want the message to become available again for Worker B to take?
Or do we delete the message entirely and propagate the failure to the rest of our system?
The answer to these questions will be heavily influenced by whether our task is idempotent, i.e. whether processing the same message twice has the same effect as processing it once.
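One way to picture reserve/acknowledge semantics is with a toy broker that uses a visibility timeout, loosely in the style of SQS. The `Broker` class and its methods are entirely illustrative:

```python
import time

class Broker:
    """Toy broker: messages become visible again if not acknowledged in time."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.messages = {}  # message body -> time it was reserved (None = free)

    def publish(self, body):
        self.messages[body] = None

    def reserve(self):
        now = time.monotonic()
        for body, reserved_at in self.messages.items():
            # A message is available if it was never reserved, or if the
            # previous reservation expired (that worker presumably crashed).
            if reserved_at is None or now - reserved_at > self.visibility_timeout:
                self.messages[body] = now
                return body
        return None

    def acknowledge(self, body):
        del self.messages[body]  # work finished: remove it for good

broker = Broker(visibility_timeout=0.05)
broker.publish("task-1")
print(broker.reserve())   # Worker A reserves "task-1"
print(broker.reserve())   # Worker B gets None: the message is invisible
time.sleep(0.06)          # Worker A never acknowledges...
print(broker.reserve())   # ...so "task-1" is redelivered to Worker B
broker.acknowledge("task-1")
```

Notice the design decision baked into `reserve`: an expired reservation means redelivery, which is only safe if the task is idempotent.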
Processing different types of messages
One major benefit of the message-queue pattern is that we can take advantage of this communication protocol and connect different components together. We've already done something similar with our application and a single worker.
But let’s say our application needed different types of work to be done. We could add a new queue with its own set of dedicated workers.

The workers for each queue can be scaled and configured independently of each other. This is a really important feature and is integral to the design of most distributed modern systems.
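Sketching this in Python: the application only decides which queue a task belongs to, while each worker pool consumes from its own queue. The queue names here are made up for illustration:

```python
import queue

# One queue per type of work; each has its own dedicated worker pool.
queues = {
    "thumbnails": queue.Queue(),
    "reports": queue.Queue(),
}

def publish(task_type, payload):
    """The application only decides which queue the message goes on."""
    queues[task_type].put(payload)

publish("thumbnails", "photo-42.jpg")
publish("reports", "monthly-sales")

# Thumbnail workers consume only from the thumbnails queue, report
# workers only from the reports queue, so each pool scales on its own.
print(queues["thumbnails"].get())  # photo-42.jpg
print(queues["reports"].get())     # monthly-sales
```

If thumbnail generation backs up under load, we add workers to that pool alone; the report workers are untouched.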
Overheads
So far we’ve talked through long-running intensive tasks. But what would happen if we had a short task which was not as intensive? Well, the message-queue pattern brings some overhead that must be considered:
- It takes time to publish messages to queues.
- We might have to wait for a worker to complete its previous task before consuming the latest message.
- We might have to wait for additional workers to spin up before one of them can consume the latest message.
This means that when our application publishes a message to the queue, we give up direct control over when the task executes. You could look at this as being both a benefit and a flaw at the same time. But it does mean that debugging can be a little trickier in comparison to a simple client-server system.
The overheads that we incur also mean that this pattern is not suitable for all types of tasks. If our task is quick to execute and consumes minimal resources, then publishing the message and waiting for a worker to consume it might just end up taking more time than the task itself!
Message persistence
The only thing we can be sure of with distributed systems is that there will be failures. Our system now has more moving parts compared to how we first started, which also means that more things can go wrong!
So what do we do if our broker suffers a failure and goes down?
Most message-queue solutions can back up messages to disk when a message is published to the queue.

If the queue itself suffers a failure and goes down, then we are safe in the knowledge that the message was persisted to disk.

When the queue is brought back online, the message can be restored to the queue and regular operation can commence again.

When the worker acknowledges the message and consumes it, the message can then be deleted from disk.
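A toy sketch of this persist-then-delete lifecycle: each message is written to a file on publish, restored from disk on recovery, and removed only once a consumer acknowledges it. The `DurableQueue` class and file layout are assumptions for illustration:

```python
import json
import tempfile
from pathlib import Path

class DurableQueue:
    """Toy broker persistence: one file per unacknowledged message."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self.counter = 0

    def publish(self, body):
        self.counter += 1
        path = self.directory / f"msg-{self.counter}.json"
        path.write_text(json.dumps(body))  # persisted before we return
        return path

    def recover(self):
        """After a crash, restore every message not yet acknowledged."""
        return [json.loads(p.read_text())
                for p in sorted(self.directory.glob("msg-*.json"))]

    def acknowledge(self, path):
        path.unlink()  # consumed: the on-disk backup is no longer needed

with tempfile.TemporaryDirectory() as d:
    q = DurableQueue(d)
    msg = q.publish({"task": "send-email"})
    print(q.recover())  # the message survives a broker restart
    q.acknowledge(msg)
    print(q.recover())  # acknowledged, so nothing left to restore
```

The extra disk write on every publish is exactly where the latency penalty discussed next comes from.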
This feature does incur a latency penalty: it takes time for the broker to persist the message to disk, which in turn means throughput will be negatively impacted.
Summary
There are a number of considerations and issues that we need to address when designing message-queue systems. Although it is a commonly applied pattern, it is not without complexity. It can solve certain problems for us, but we have to pay attention to the baggage that it can bring.
In this post, we have talked about the message-queue pattern and a handful of the considerations we would normally be required to make.
In follow-up posts, we will talk through some more concrete solutions like AWS SQS & Celery.
Related articles coming soon:
- Message Queues with AWS SQS
- Message Queues with Celery