Evolving service architecture

Ever since the concept of services running in separate processes was formed, efficient communication between different services has been the one thing that has kept software architects awake at night.

In this text I use the terms client and service quite a lot. For the purpose of this discussion a client is basically just the origin of some call and a service is the receiver. The client could be an actual client application, but it could be just another service as well.

Stage 1: Point-to-Point Communication

[Diagram: point-to-point communication between services]

Many companies still start out with a very simple idea of what a system should be, mostly resulting in very large monolithic designs. Slowly but surely most of these systems then get confronted with the need to integrate with each other and communicate with the wider world.

The first thing developers tend to do is go for the obvious answer: build one bespoke integration at a time. They add a set of service endpoints at the edges of the monolith that fulfill the business requirement at hand. The contracts are well known and typically shared between the service and its clients.
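
To make this concrete, here is a minimal sketch of such a bespoke integration, assuming a hypothetical order service and a shared request contract (all names and URLs are invented for the example):

```typescript
// Shared contract, typically published as a package that both sides depend on.
// Any change here forces the client and the service to release in lockstep.
interface CreateOrderRequest {
  customerId: string;
  items: { sku: string; quantity: number }[];
}

interface CreateOrderResponse {
  orderId: string;
}

// The client calls the service endpoint directly; the URL is hard-coded,
// or at best pulled from per-environment configuration.
async function createOrder(req: CreateOrderRequest): Promise<CreateOrderResponse> {
  const res = await fetch("https://orders.internal.example.com/orders", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });
  if (!res.ok) {
    // No broker, no queue: the client has to handle unavailability itself.
    throw new Error(`Order service returned ${res.status}`);
  }
  return (await res.json()) as CreateOrderResponse;
}
```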

To start with this seems a pretty good solution, but very soon, as more and more integrations are required, things start to get out of hand. You start to see the same kinds of messages being sent around in slightly different formats. Then the discussions start about who owns the contract, the service or the client. And what if the client does not have all the data to fulfill the service's contract? Does it have to fetch the missing data from another service first?

These scenarios often create a need for orchestration, which mostly ends up written into the clients, because the service usually dictates the contract.

The clients effectively push data out to the services to keep the overall system consistent. That makes the whole system more brittle, since availability can't be guaranteed for every service all of the time.

The clients need to know which services to call in order to get work done; therefore it also becomes very difficult to add a new service into the mix. Many existing services will need to be adapted to talk to it, and you can't deploy the new service until every impacted service is ready to deploy again.

The deployment itself also becomes a configuration nightmare, especially if you maintain multiple environments. How does one service know the endpoint of another service? How do you scale these services? Do clients need to know multiple endpoints and call them randomly to spread load? Do you introduce some kind of load balancer in front of the services? Will you introduce some kind of discovery mechanism and if so what happens if a certain service can't be found?
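
To make the configuration problem concrete, here is a sketch of the kind of per-environment endpoint map that tends to accumulate on the client side (all service names and URLs are hypothetical):

```typescript
// Every client needs a map like this for every environment it runs in.
// Each new service, replica, or environment means touching this file
// (or its equivalent in some configuration store) in every client.
const endpoints: Record<string, Record<string, string[]>> = {
  test: {
    orders: ["https://orders.test.example.com"],
    billing: ["https://billing.test.example.com"],
  },
  production: {
    // Two replicas: callers pick one at random, a poor man's load balancer.
    orders: [
      "https://orders-1.example.com",
      "https://orders-2.example.com",
    ],
    billing: ["https://billing.example.com"],
  },
};

function resolve(env: string, service: string): string {
  const urls = endpoints[env]?.[service];
  if (!urls || urls.length === 0) {
    // What happens if a service can't be found? Every client decides alone.
    throw new Error(`No endpoint configured for ${service} in ${env}`);
  }
  return urls[Math.floor(Math.random() * urls.length)];
}
```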

The coupling between services grows tighter and tighter until you realize you have ended up with an even bigger monolith than you started with. The main objections to this kind of architecture are the ever-increasing complexity, the loss of resiliency and the loss of agility.

Stage 2: Enterprise Service Bus

[Diagram: hub-and-spoke architecture with an Enterprise Service Bus]

At this point most organizations start to think about introducing a hub-and-spoke architecture with an Enterprise Service Bus. The bus handles the orchestration, so clients and services no longer need to contact each other directly. The service bus usually also becomes the owner of the message contracts, to which both services and clients must conform.

The message will usually contain everything that any service participating in a certain flow could ever want to know. If a client does not have everything, the orchestration inside the service bus can take care of the enrichment by making additional calls, resulting in a stateful workflow.
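
As an illustration, a bus-side enrichment step might look roughly like this; the flow, the credit-rating service and all names are invented for the example:

```typescript
// Inside the bus: if the incoming message lacks data a downstream service
// needs, the workflow makes extra calls to fill the gaps before routing on.
interface CustomerMessage {
  customerId: string;
  creditRating?: string; // required downstream, not always supplied by clients
}

async function enrichAndRoute(msg: CustomerMessage): Promise<void> {
  if (msg.creditRating === undefined) {
    // Enrichment call: the bus now depends on this service being up,
    // and the workflow is stateful while it waits for the answer.
    const res = await fetch(
      `https://rating.internal.example.com/customers/${msg.customerId}/rating`
    );
    msg.creditRating = (await res.json()).rating as string;
  }
  await routeToSubscribers(msg);
}

async function routeToSubscribers(msg: CustomerMessage): Promise<void> {
  // Routing logic owned by the bus team lives here.
}
```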

Retries and queuing are also handled by the bus, as well as request throttling if that is required.

Again this will work fine for a while. You can add services and clients without having to redeploy everything. Configuration is much easier, because all you need to know is where the service bus for that particular environment is situated. Services can be scaled and only the service bus will need to know about it.

The service bus is a potential single point of failure, but that can be mitigated quite easily.

The downside appears when you have to change the message formats. A new service or a new requirement can force the format to change, and that change can have a very wide impact. The workflows inside the bus also change frequently as the services evolve.

After a while, the service bus becomes the most volatile part of the system, needing to be deployed with every change to every other part. Hit worst of all, of course, is the team that has to maintain the service bus. It is very hard for them to keep to a regular deployment schedule. Versioning and staging changes becomes a nightmare when other teams start to delay features, forcing those changes to be rolled back in the service bus just before the next release, while other changes still need to go ahead as planned.

Pressure from the business to reduce time-to-market keeps increasing, and in this situation the traditional ESB just doesn't cut it. Change has not been isolated effectively enough, so it tends to spread across the system.

Stage 3: Event Broker pattern with lightweight events

[Diagram: Event Broker pattern with lightweight events]

After a while most architects come to realise that neither of these solutions is very satisfactory.

In organizations that have been moving towards microservice-like architectures for a while, an even more loosely coupled, reactive style of architecture is preferred, relying on events rather than messages. The main difference, you could say, is that whereas the architectures above rely on a client telling a service what to do, here services "react" to an event in the system.

In the middle sits an Event Broker that does not know the clients and does not care about the contents of the messages being passed around. It does not try to orchestrate services in the way a service bus usually does. All it does is pass along messages to anyone who is interested. All subscribers will get the same messages delivered to them. This is the typical pub/sub pattern.

A service can indicate its interest in messages by subscribing to the Event Broker. Usually the subscription allows some basic filtering, as well as subscription to a specific message channel (or topic). The service can also choose between receiving messages immediately (with retries) or having the broker keep them on a queue for later consumption.
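
A minimal sketch of that subscription model, using a hypothetical in-memory broker rather than any specific product, might look like this:

```typescript
// A minimal pub/sub shape: topics, optional filtering, and a choice between
// immediate push delivery and a queue for later consumption.
interface DomainEvent {
  origin: string;                   // which system emitted it
  name: string;                     // what happened, e.g. "customer-created"
  payload: Record<string, string>;  // just enough to retrieve the rest
}

interface Subscription {
  topic: string;
  filter?: (e: DomainEvent) => boolean;   // basic content-based filtering
  delivery: "push" | "queue";             // immediate (with retries) or queued
  handler: (e: DomainEvent) => Promise<void>;
}

class EventBroker {
  private subscriptions: Subscription[] = [];

  subscribe(sub: Subscription): void {
    this.subscriptions.push(sub);
  }

  // The broker neither knows the services nor inspects payload semantics:
  // every matching subscriber gets the same event.
  async publish(topic: string, event: DomainEvent): Promise<void> {
    for (const sub of this.subscriptions) {
      if (sub.topic !== topic) continue;
      if (sub.filter && !sub.filter(event)) continue;
      await sub.handler(event); // a real broker would retry or enqueue here
    }
  }
}
```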

The content of the message, or "event" as we should now call it, is very basic. It states the origin, the name of the event (what happened to trigger it) and only the information essential to retrieve further data about it. For example, the event could be the creation of a "customer" out of an existing "prospect" in the CRM system. That information is not enough for the finance system to set up an account for the new customer, but it is enough for the finance system to know what it has to do next. The finance system will call the CRM system with the information provided in the event in order to get everything it needs to finish its job.
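
A sketch of that flow, with invented field names and URLs: the event carries only identity, and the subscriber pulls the details it needs:

```typescript
// The event itself is deliberately thin: origin, name, and just enough
// to fetch the rest. No customer details travel on the broker.
const customerCreated = {
  origin: "crm",
  name: "customer-created",
  payload: { customerId: "c-1042" },
};

// The finance service reacts by pulling the full record from the CRM API
// (the URL and response shape are assumptions for the example).
async function onCustomerCreated(event: typeof customerCreated): Promise<void> {
  const res = await fetch(
    `https://crm.internal.example.com/api/v1/customers/${event.payload.customerId}`
  );
  const customer = await res.json();
  await createFinanceAccount(customer);
}

async function createFinanceAccount(customer: unknown): Promise<void> {
  // Finance-side account setup would go here.
}
```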

"Wait a minute", you say, "You've introduced tight coupling again!"
Well no, and there are a number of reasons why.

First of all, the events are very simple and not prone to change much. Usually they gradually grow a bit to add some information, but stay backwards compatible over time. The event broker does not care about the content of the messages anyway, so it is not affected. Additional types of events can be added without impacting any existing services, and services will subscribe to them when they are ready.
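
For instance, an event can grow an optional field without breaking existing subscribers (the types below are purely illustrative):

```typescript
// Version 1 of the event, as originally published.
interface CustomerCreatedV1 {
  origin: "crm";
  name: "customer-created";
  customerId: string;
}

// Later the event grows, but only with optional fields, so every
// subscriber written against V1 keeps working unchanged.
interface CustomerCreated extends CustomerCreatedV1 {
  prospectId?: string; // added later; older subscribers simply ignore it
}
```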

Secondly, in this case the finance service does call the CRM service to pull information, but we can version the interface on the CRM system so that the two never need to be deployed together. Using HTTP and REST for the API helps reduce the number of times the interface actually needs to be versioned. The CRM system will always be the owner of this contract, and the finance service can follow its own release cycle to upgrade to a newer API version when it is ready.
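
For example, the CRM service might keep both versions of the resource live so consumers can migrate on their own schedule; the paths and response shapes below are invented for illustration:

```typescript
// Two versions of the same resource served side by side. Consumers pin a
// version in the path and migrate when they are ready; nothing deploys
// in lockstep.
type Handler = (customerId: string) => Promise<object>;

const routes: Record<string, Handler> = {
  // v1: the original contract, kept alive for existing consumers.
  "GET /api/v1/customers/:id": async (id) => ({
    customerId: id,
    name: "Ada Lovelace",
  }),
  // v2: adds structure and fields without breaking v1 consumers.
  "GET /api/v2/customers/:id": async (id) => ({
    customerId: id,
    name: { first: "Ada", last: "Lovelace" },
    segment: "enterprise",
  }),
};
```

The design choice is that the provider absorbs the cost of running two versions for a while, in exchange for never having to coordinate a joint release with its consumers.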

Thirdly, if the CRM service is not available, the finance service can simply report the failure back to the event broker so the call can be retried later. The finance system itself is not adversely affected by the failure of the CRM system. Likewise, if the event from the CRM system cannot reach the finance system, it will be retried without the CRM system having to wait.
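
One way to sketch that failure handling: the subscriber's handler reports success or failure back to the broker, which owns the retry (the result type here is an assumption, not a specific broker API):

```typescript
// The handler reports success or failure back to the broker. On failure the
// broker re-delivers later (possibly with backoff); the finance service does
// not have to track the retry itself.
type HandlerResult = "ack" | "retry";

async function handleCustomerCreated(customerId: string): Promise<HandlerResult> {
  try {
    const res = await fetch(
      `https://crm.internal.example.com/api/v1/customers/${customerId}`
    );
    if (!res.ok) return "retry"; // CRM unavailable: let the broker try again
    const customer = await res.json();
    // ... set up the finance account with the retrieved data ...
    return "ack";
  } catch {
    return "retry"; // network failure: same story, no state lost
  }
}
```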

In some versions of this architecture the subscribers to the Event Broker are not the services themselves, but some serverless code (lambdas or functions) that does the orchestration and updates the store of the service it belongs to. That allows for better scaling and less reliance on known endpoints. The serverless code is always versioned and deployed alongside the service and is therefore seen as part of the same microservice.
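
A sketch of that variant, with the event and store interfaces as assumptions rather than any specific cloud SDK:

```typescript
// A function-shaped subscriber: the broker invokes it per event, it does the
// small orchestration and writes to the store owned by its microservice.
interface BrokerEvent {
  name: string;
  payload: Record<string, string>;
}

interface FinanceStore {
  upsertAccount(customerId: string, data: object): Promise<void>;
}

// Deployed and versioned together with the finance service, so it counts as
// part of the same microservice even though it runs as serverless code.
export async function onEvent(event: BrokerEvent, store: FinanceStore): Promise<void> {
  if (event.name !== "customer-created") return;
  const res = await fetch(
    `https://crm.internal.example.com/api/v1/customers/${event.payload.customerId}`
  );
  const customer = (await res.json()) as object;
  await store.upsertAccount(event.payload.customerId, customer);
}
```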

Really complex orchestrations that must happen in a specific order are not handled by the Event Broker. For those a dedicated service will need to be built, and serverless is again a good candidate. With versioning on all the APIs, such an orchestration can actually be quite stable: it changes whenever it needs to for business reasons, rather than for technical ones.

Final thoughts

Whether a CRM and a finance service are good candidates to be considered microservices is a discussion for another time. Here they just serve as an example of communication with events.

One of the most important goals of any good software architecture is to isolate change, and identifying the right boundaries for each service within a system is the most important part of that. It will reduce the number of synchronous calls and complex orchestrations required later. The business is usually resistant to the idea of eventual consistency, but handled correctly it is the key to the agility they crave. It usually helps to explain that "eventually" does not mean in a day or two, but rather in seconds or less, and that many systems are already eventually consistent due to caching anyway.

When you absolutely need feedback from an outside service before completing an action, consider using states to accomplish this rather than synchronous calls. Put a new customer in a 'to-be-approved' state in the CRM service and wait for the 'customer-approved' event to come back from the finance service. Assuming the finance system is not down forever, the CRM service can complete its action by reacting to that second event.
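
A sketch of that handshake on the CRM side, with hypothetical states and event names:

```typescript
// Instead of a blocking synchronous call to finance, the CRM service parks
// the customer in an intermediate state and reacts when the answer arrives
// as an event.
type CustomerState = "to-be-approved" | "active" | "rejected";

const customerStates = new Map<string, CustomerState>();

// Step 1: create the customer in a pending state and publish the event;
// no waiting on the finance service here.
function createCustomer(customerId: string): void {
  customerStates.set(customerId, "to-be-approved");
  // publish("customer-created", { customerId }) would go here
}

// Step 2: later (usually seconds), the finance service's verdict comes back
// as an event and the CRM service completes its action.
function onCustomerApproved(customerId: string): void {
  if (customerStates.get(customerId) === "to-be-approved") {
    customerStates.set(customerId, "active");
  }
}
```

Nothing blocks between the two steps, so the CRM service stays available and responsive even while the finance system is down.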