Managing Microservices: Tyranny of Too Many Channels

Tom Harrison
Tom Harrison’s Blog
12 min read · Feb 12, 2023


How do we manage the explosion of signals that arise when we implement a microservice architecture?

Multiplying microservices, Slack channels, Asana boards, alerts, cloud services, and all the rest make it impossible to keep track of anything, let alone everything. Some get attention, some don’t.

Every new thing requires effective communication that explains why it is needed. Yet effective communication does not seem to be a strength of most companies. Worse, we’re all afraid to get rid of things, so the sheer number of things we’re supposed to keep track of grows and grows.

As an organization grows, the tyranny of all the channels, signals, events, and knowledge becomes unsupportable by the teams that create and (theoretically) own a given service. The services need to be managed as a group, by another team focused on making them fit in an architecture that is designed to keep them current.

Microservices are Better

My company has embraced microservices, and they are better. I came from a company whose monolith code took 30 minutes to rebuild and whose dependencies were tied in knots — a two-year project managed to partly extract four sub-services from the monolith. It’s clear: microservices force a kind of clarity of purpose, and done well they create clear boundaries and interfaces. Both my old and current companies have about the same scope of functionality, and my new company has on the order of 100 distinct services.

Each service tends to have its own resources, such as a database of some kind, a message queueing system of some kind, object storage, usage logging, alerting, configuration and secrets, documentation, a GitHub repo, libraries, a build process, and so on. Each new feature or update to a service begets an Asana board, a Slack channel, and documentation describing the idea and then the design. If we do the project, it has a team. Also, each one has a name.

I cannot keep track of them all. No one can.

As the scope of our broader system expands, our older services begin to show their age. So we use GitHub’s Dependabot to keep track of libraries with new versions and security issues. Some teams religiously update their apps. Some don’t. Some can’t, because the app no longer builds for some reason.

Maintenance is a Different Dimension

I have been working on one service for months now, and each new push to a branch results in a build. A few days ago, something changed, and the linter that checks the Golang code now reports a deprecation in the use of rand.Seed() and the build started failing. The Golang part of the service was built as a separate project by a separate team, and I am working on the PySpark code that uses the output of the Golang part. So I know almost nothing about that part of my service. Yet, a failed build is a blocker for me, so I needed to fix it.

I needed to figure out whether the deprecation was important, and how best to deal with it. After several hours of investigation, it turned out the warning is just intended to alert developers that this random seed is not truly random, and thus a bad choice when true randomness is needed. But we’re just using it to come up with a GUID for a log record, and if one in a trillion records end up with the same ID, no biggie. So in our case, the deprecation was not important.

I figured out how our linting program worked, built it locally to reproduce the linting error, figured out how to configure the linter, and finally realized that I could just add a comment that suppressed this specific check. And now, the build passes again.
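For flavor, here’s roughly what that kind of suppression looks like. This is a minimal sketch assuming a golangci-lint/staticcheck setup; the function name and the exact directive are illustrations, not our actual code:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// newLogGUID builds a pseudo-random identifier for a log record.
// Collisions are harmless for this use, so math/rand is good enough.
func newLogGUID() string {
	// rand.Seed is deprecated as of Go 1.20. We only need non-crypto
	// randomness for a log ID, so suppress this one check rather than
	// rework a part of the service another team owns.
	rand.Seed(time.Now().UnixNano()) //nolint:staticcheck // fine for non-crypto log IDs
	return fmt.Sprintf("%016x-%016x", rand.Uint64(), rand.Uint64())
}

func main() {
	fmt.Println(newLogGUID())
}
```

The point isn’t the one-line directive; it’s that hours of archaeology were needed before a one-line fix felt safe.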

Yay.

The World is Dynamic

I call this out not because it’s remotely unusual, but because in a dynamic system of microservices the world is not fixed. New builds pick up new libraries that are needed for security, privacy, stability, and so on.

We have an internal project to pin libraries to specific versions rather than picking up “latest”. That would have prevented my two-hour digression with a deprecated function, but it has a downstream cost: aging software. In this case, we rely on tools like Dependabot to call out old versions — it’s pretty smart, and in most cases even creates a PR that someone needs only to review, build, and deploy. But hundreds of libraries, hundreds of builds, and unclear interdependencies make such a world quite complex. Just as I found with the rand.Seed() case, a sentient human is often needed to work out and resolve issues. Also, it’s a lot of noise.

Slack message sent to one team on a given day: 4 separate services need updates
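For what it’s worth, in the Go ecosystem pinning mostly comes down to what gets committed in go.mod. A minimal, hypothetical sketch (the module path and library versions are made up, not our actual manifest):

```
module github.com/example/reporting-service // hypothetical module path

go 1.20

require (
	// Versions are pinned explicitly; Dependabot proposes bumps as PRs
	// instead of builds silently picking up whatever is newest.
	github.com/aws/aws-sdk-go v1.44.122
	github.com/sirupsen/logrus v1.9.0
)
```

Ruby’s Gemfile.lock and JavaScript’s package-lock.json play the same role; the tradeoff is the same in every ecosystem.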

Reality often conflicts with how we all know things should happen, and reality usually wins. Several of our services use libraries that are past end of life. Had we done the “right thing” and incrementally bumped versions as they came out, added new tests or tweaks to code as issues arose, and otherwise stayed current, each new change would have taken just a few minutes. Each time we ignore a version bump, changes build up, and the effort needed to bring a service up to date becomes more and more complicated. Further, the team or engineer that worked on a service is probably working on other projects, or may no longer work here.

Chain Reactions are Hard to Stop

Once a service diverges from being up to date, we begin a chain reaction of issues that can turn a simple change an engineer might make into a task that takes hours, days, or longer to resolve.

All languages suffer from this. We use Ruby, and the term of art is gem dependency hell. Here’s a short 15-section document on how to manage GoLang dependencies. Looks like JavaScript also has hell.

Every version upgrade that is ignored increases the likelihood that another upgrade will be in conflict. A single version bump is usually safe because it stands alone — it’s atomic.

We want to have atomic upgrades, not atomic bombs.

Hundreds of services each need a little love and attention from time to time, with new ones being added regularly to make our product better or to replace an old and creaking one. Each service may get updated frequently or not at all, which means each one has its own special set of library versions. So each build creates a bespoke container image with its own graph of library versions, and hundreds of builds lead to decreasing reuse.

Microservices are Not Independent

The promise of microservices is that they are independent, have clear interfaces, and control their own dependencies. We have used this to our advantage on multiple occasions — we have rewritten a number of services originally written in Ruby to be in Golang — and if the interface is the same, it’s a drop-in replacement.

Except it’s not a drop-in replacement — each service lives in an ecosystem, in our case Kubernetes. The new service’s log output is different. We’ll have different events to monitor and alert on. And of course, when you’re building a new version of something, it’s usually to fix or enhance the old one, so now there are new endpoints, or different ones.

Differences in the new app, or even just the fact of its existence, usually mean that the old service cannot be sunset until all of its callers are migrated to the new version. The teams working on those callers don’t care about the new version — it’s a risk because it’s new, and the old service is working fine. So we have a conflict: the organization wants to reduce the number of moving parts, the teams want stability, and none of us has time to focus on anything other than making the part of the product we’re working on better.

A Horizontal Team Makes Microservices Viable

There’s a conflict in our world of software engineering. On the one hand, we know that monolithic apps eventually collapse upon themselves or fall over like a Jenga tower. On the other, as I have described, microservices fail in less catastrophic ways, but the chain reaction of an exploding dependency graph means that a service itself is only the part of the iceberg you can see.

  • Vertical teams focus on a service and its functionality: does it work, and how are we going to implement the new feature the company needs?
  • Horizontal teams focus across services and their fit into the larger infrastructure and standards.

The goal of microservices is to allow a company to scale up and still remain nimble enough to add sophisticated capabilities to the product without requiring a huge re-architecture effort. This is absolutely an advantage that my company has been able to exploit, and it’s pretty awesome how quickly we can do some things.

As I have described, however, the microservices model has diminishing marginal returns over time as the technical debt of unloved older services becomes more urgent, immediate, and disruptive — such as a security issue. Or, in one recent case, an old service that had worked for years filled up a database. Completely. With 2³² records. Unwinding this shit show took most of the senior staff a number of days and has resulted in a new architecture for the service using a new tool; three months after the incident, we expect to deploy the new version. It will do the same thing as the old service.

Once a company reaches this point, a new function is needed — a maintenance team — whose charter is to look across all services and build the processes needed to keep them up to date. This is a horizontal team, and directly aligned with the goals of the infrastructure team, operations team, information systems team, and typically, the finance team.

Maintenance is Boring and Thankless

If you ask most software engineers, they would probably prefer to be on a team making new features. That’s what the Sales and Marketing teams cheer about in company meetings. That’s where the honor and recognition go. It’s a chance for an engineer to design a greenfield feature without all the constraints of integration. In short, maintenance is boring.

Except there are some of us who are odd that way, and would much rather build systems that manage other systems.

Historically, operations or infrastructure teams focused on how to perform the configuration tasks needed to support a system. The DevOps movement grew out of frustration: operations always took forever to create new infrastructure or set up existing infrastructure. Automation tools like Chef, Puppet, Ansible and Terraform arose. Kubernetes made the idea of a “server” quaint. Our company’s infrastructure is now effectively 100% managed by Terraform, with all workloads running in the cloud, mainly on Kubernetes.

Automated infrastructure is truly amazing, and the benefits are stunning. At our company, we can build an entire cluster with all services and configuration working in several hours. With this we can test new versions, make changes needed to support them, and roll them out to our dev / staging / prod clusters with almost no impact. And we do.

Our services, however, are still managed by vertical teams. Once you write it, you own it forever. As people come and go, ownership moves to the team. I have already described how this works in reality: these vertical teams are responsible for and measured by their ability to deliver new features and functionality. They are putatively responsible for keeping apps running, up to date, secure, and scaling up. But that’s really not how it works.

Letting Go of my Beautiful Little Pony

Once an app is deployed into production, it needs to stop being a handcrafted, artisanal little thing of beauty, modernism, and correctness that’s the brainchild of the team that created it.

Let’s be honest: software engineering is about creativity. When we build something, we do it in a way that we agree is a best practice and that solves lots of the problems we have seen with the way we used to do things. We create our own beautiful little pony, whom we love, and care for, and give a name to. When it has grown up and is ready, we add our pony to the herd, deploying it to production, and help it through its introduction.

Then, we forget our little pony, because now we’re working on an amazing new unicorn! So cute! So magical! I am going to call it UniGlow!

This is the moment when we need to let go of the pony. Now is when the horizontal team takes over. They focus on the herd of ponies, horses, and unicorns, the flock of gulls, the murder of crows, the bale of turtles, and the den of snakes.

Managing the Herd

The team that manages the herd of services is really an architecture team, and that’s a far sexier and more desirable moniker than maintenance team. The architecture team is responsible for:

  • Integrating new services into the platform and infrastructure, confirming that each one follows our standards and patterns (and changing the service as needed to ensure it does!)
  • Monitoring services and building suitable escalation channels for identifying and resolving issues.
  • Building automation that identifies when new versions of dependencies are available (a small example is sketched after this list). This includes libraries, but also resources like database versions, messaging service versions, and so on.
  • Documenting and training teams and new engineers on processes and patterns. Re-training existing engineers on new patterns and processes that have evolved.
  • Reporting on stability, versioning status, and build status for all services as a whole. Identifying aging services that fail often, and creating project definitions for the engineering teams to take on alongside their normal task of building new features.
  • Migrating services to use new automation and tooling. Tracking sunsetting projects and driving their elimination.
  • Consolidating libraries into bundles that can be incorporated into a service as a set: pre-built, vetted for security and stability, and weighed for the tradeoff of functionality vs. commonality. Replacing one-by-one library usage in services with these bundles.
  • Partnering at design time. Far more important: following along as new services are built, both to align engineers with the standards and patterns the architecture team endorses, and to identify new technical opportunities that are not yet part of the standards, creating new automation and processes to support them.
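To give a flavor of the automation item above, here is a minimal sketch of checking whether a pinned Go module has a newer release, using the public Go module proxy. The module and pinned version are hypothetical; a real tool would read them from each service’s go.mod:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// latestInfo mirrors the JSON the Go module proxy returns from its @latest endpoint.
type latestInfo struct {
	Version string `json:"Version"`
	Time    string `json:"Time"`
}

// latestVersion asks proxy.golang.org for the newest published version of a module.
// Note: module paths containing uppercase letters must be case-encoded for the proxy.
func latestVersion(modulePath string) (string, error) {
	resp, err := http.Get("https://proxy.golang.org/" + modulePath + "/@latest")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("proxy returned %s for %s", resp.Status, modulePath)
	}
	var info latestInfo
	if err := json.NewDecoder(resp.Body).Decode(&info); err != nil {
		return "", err
	}
	return info.Version, nil
}

func main() {
	// Hypothetical dependency and the version we currently pin.
	module, pinned := "github.com/sirupsen/logrus", "v1.9.0"
	latest, err := latestVersion(module)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if latest != pinned {
		fmt.Printf("%s: pinned %s, latest is %s\n", module, pinned, latest)
	}
}
```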

Justifying the Cost of an Architecture Team

If you don’t already have a team like the one I describe, it’s very hard to justify the cost unless you have great systems for tracking effort. Which, let’s be honest, you probably don’t, especially when it comes to “quick” bug fixes and critical issues. Further, the buildup of technical debt is one of those problems like a pandemic or climate change: we all know it’s happening, but it’s really hard to know when it is going to become an issue for us.

It almost always seems more logical to do what makes money instead of spending more on reducing friction.

If you cannot really keep track of how much time is actually spent dealing with the cost of debt, then it’s abstract and probably manifests as a feeling like “yeah, we should really upgrade our services”.

We almost always underestimate the benefit of processes that reduce friction.

As a result, the first and most important step towards justifying resources for a team like this is figuring out how to measure the cost of not doing it.

In some ideal world, we would like an accurate measurement of how much time engineers spend fighting with the software process — builds failing, new libraries creating dependency hell, issues and incidents, and so on. But this is hard: it requires fairly complete tracking, plus estimates by engineers.

There may be proxy metrics that can be gathered instead. Even after figuring out my linting issue, it took me three more builds to get it right. You can track failed builds, for example. You can count the number of PRs opened by Dependabot, and how long they stay open. You can create simple pre-work estimates that break projects down into buckets of “how long it should take” vs. “how long it will probably actually take”. You can count incidents. You can create a graph of all the versions used by all our software. All of this can be tracked over time so that you can begin to measure progress.
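Some of these proxy metrics are cheap to automate. As one example, here is a minimal sketch in Go that lists how long Dependabot PRs have been sitting open on a single repository via GitHub’s REST API; the repository name is made up, and a real version would authenticate and page through every repo:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// pull captures only the fields we need from GitHub's pull request list API.
type pull struct {
	Number    int       `json:"number"`
	CreatedAt time.Time `json:"created_at"`
	User      struct {
		Login string `json:"login"`
	} `json:"user"`
}

func main() {
	// Hypothetical repository; unauthenticated requests are heavily rate-limited.
	repo := "example-org/reporting-service"
	resp, err := http.Get("https://api.github.com/repos/" + repo + "/pulls?state=open&per_page=100")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		fmt.Println("unexpected status:", resp.Status)
		return
	}

	var pulls []pull
	if err := json.NewDecoder(resp.Body).Decode(&pulls); err != nil {
		fmt.Println("error:", err)
		return
	}

	// Report each open Dependabot PR and how long it has been waiting.
	for _, p := range pulls {
		if p.User.Login == "dependabot[bot]" {
			fmt.Printf("PR #%d open for %.0f days\n", p.Number, time.Since(p.CreatedAt).Hours()/24)
		}
	}
}
```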

Conclusions

In the end, tech debt can and often does lead companies into a state where “nothing gets done quickly”, or worse. Recognizing this requires intuitive awareness by company leaders that allocating resources here is a necessary cost whose benefit is almost impossible to measure, that takes time to begin to have a positive impact, and that is really hard to staff with the right people. It is for all of these reasons that this function is rare in companies.

I wonder whether cleaning up technical debt and building systems to keep things up to date is actually a mid-life company strategy that can provide true competitive advantage, even as a company gets bigger.
