Terraform modules: separate repo with semantic version tags

Tom Harrison
Tom Harrison’s Blog
6 min read · Apr 30, 2022


(image: semantic versioning diagram, borrowed with love from https://www.geeksforgeeks.org/introduction-semantic-versioning/)

When I came to my company, we had a main terraform repo for building the resources needed by the 50–100 microservices we host on our Kubernetes clusters. In a subdirectory called modules we have code that defines our databases, redis, kafka, CI/CD pipelines, GitHub repos, monitoring, namespaces and secrets for the cluster, permissions, and several more. A root module references these modules by path, e.g. source = "../modules/postgres", so any change to the postgres module must work for all of its callers, at present about 40 databases.
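A root module in that original layout looks roughly like this (the module inputs here are hypothetical, for illustration only):

```hcl
# root module for one service, referencing the shared module by relative path
module "postgres" {
  source = "../modules/postgres"

  # hypothetical inputs; the real module's variables differ
  name        = "orders-db"
  environment = "production"
}
```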

This is not a sustainable pattern.

Version-Tagged Modules Repo

So, we have created a separate repo, tf-modules, that we're moving these modules into. This repo has multiple subfolders, one for each module, e.g. eks, node-groups, vpc, transit-gateway, and now postgres and a bunch more.

The key is that each new change to the tf-modules repo gets a new semantic version tag, with release notes. Callers from the other repo now reference the modules using source = "git::https://github.com/our-org/tf-modules.git//postgres?ref=1.6.2"

Our internal rule is that a patch version bump isn't needed unless it fixes a bug affecting you, a minor version may be a breaking change for a specific module, and the major version is just 1, haha! We make a change, open a pull request, and once it's approved we tag a new release version for the repo. We'll get to proper semantic versioning at some point real soon now.
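On the caller side, a pinned reference looks like this (org and version as in the text; note the `//` that separates the repo URL from the module subdirectory):

```hcl
module "postgres" {
  # ?ref= pins an immutable release tag of the tf-modules repo
  source = "git::https://github.com/our-org/tf-modules.git//postgres?ref=1.6.2"
}
```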

Keep changes small and version changes isolated to a single module

To avoid chaos, we are careful to make changes as local as possible, almost never to more than one module at a time.

But this is still not perfect. A root module that refers to several remote modules may use different versions of the same remote repo. For example, s3.tf might refer to version 1.6.2 of the repo, whereas postgres.tf might refer to version 1.7.0. And this is actually fine, but it can be confusing, especially if documentation is weak.
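Concretely, the same root module can pin two different releases of the same repo (versions as in the text, org name illustrative):

```hcl
# s3.tf — pinned to an older release of tf-modules
module "s3" {
  source = "git::https://github.com/our-org/tf-modules.git//s3?ref=1.6.2"
}

# postgres.tf — same repo, newer release
module "postgres" {
  source = "git::https://github.com/our-org/tf-modules.git//postgres?ref=1.7.0"
}
```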

Version-tags avoid unexpected breakage of root modules

The beauty of this method is that we can add to the module, change as needed, and know we won’t break anything, since all callers reference a tagged version that works for them. The callers can upgrade whenever they are making changes, or if we (in the infrastructure team) want to upgrade a set of modules we can do that one by one. This is especially important because we have around 80+ separate services that are callers, and at the moment they only tend to get updated when the team in charge of the service needs a new resource.

We still have some modules in the original repo's modules directory, and this means any change there will be applied the next time a root module is planned and applied. This is bad, because some of these are breaking changes that require a formerly working root module to be upgraded before any new changes are added.

But having 80+ root modules calling any number of tagged versions of several remote modules can be kind of messy, especially when we want to make a global change.

Downsides of version-tagged modules repos

For example, recently the AWS terraform provider introduced default_tags that are linked to the provider block and propagate AWS tags to each AWS resource. This is a great and important change for us — consistent tagging of resources is essential for cost management, discovery, inventory, ownership, status and many more attributes we want to track, even permissions using ABAC with AWS SSO support.
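Because default_tags lives on the provider block, every AWS resource that provider creates inherits the tags. A minimal sketch (region and tag keys are illustrative):

```hcl
provider "aws" {
  region = "us-east-1"

  # these tags propagate to every AWS resource created by this provider
  default_tags {
    tags = {
      Team        = "infrastructure"
      ManagedBy   = "terraform"
      Environment = "production"
    }
  }
}
```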

So our tf-modules repo now has a version that is ready for default_tags, and we now need to upgrade each of the root modules to this version. This means all of those older root modules that haven't been updated now need to get new versions of all of their tf-modules references, some of which have changed rather significantly over time.

Upgrading isn’t usually simple

Refactoring and upgrading those root modules falls to our Infrastructure team, and it's a lot of work. Indeed, we sometimes find that the current module in the tf-modules repo no longer supports some attribute or capability that an older version had. Or worse, we have had to add the horrible hack of using count to conditionally enable resources:

resource "my-potentially-used-resource" "default" {
  count = var.enabled ? 1 : 0
  # ...
}

Adding this count element to a resource means that every reference to it must now be indexed, moving from

my-potentially-used-resource.default.id

to

my-potentially-used-resource.default[0].id
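Putting the pieces together, the conditional-resource hack looks roughly like this (the resource type is the article's placeholder; the variable and output are mine, for illustration):

```hcl
variable "enabled" {
  type    = bool
  default = true
}

resource "my-potentially-used-resource" "default" {
  count = var.enabled ? 1 : 0
  # ...
}

output "resource_id" {
  # collapse the (possibly empty) list to the id, or null when disabled;
  # one() requires Terraform 0.15 or later
  value = one(my-potentially-used-resource.default[*].id)
}
```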

This pattern doesn’t scale well — we are consistently adding new services, defined as new root modules that call our tf-modules repo, so the scope of any global change gets bigger over time.

Testing is a bit of a trick

We often add modules to our version-tagged modules repo when there's a new requirement or service. So we have a new root module, calling a new version of the remote repo. There are two tricks we use to make that easier:

  1. For local development, pull both repos locally and make a symbolic link in the root-module repo to the remote repo, then temporarily reference the remote repo using a relative path, e.g. source = "../tf-modules/postgres". You'll have to run terraform init again, but this makes local testing a lot quicker.
  2. For team development, use the branch name of the pull request as the ref element in the git reference in the source attribute. So, make a change to the module repo on a branch, push the branch, and open a PR. Then you can reference the remote repo from the root module by changing the source to source = "git::https://github.com/our-org/tf-modules.git//postgres?ref=new-postgres-ability-pull-request"

We use the first method a lot, but when we want to have our CI/CD system build multiple environments with terraform, we use the PR method. It’s pretty clean.
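Both tricks amount to temporarily swapping the module source, something like this, with the unused alternatives kept as comments (branch name is the article's example):

```hcl
module "postgres" {
  # normal, pinned release reference:
  # source = "git::https://github.com/our-org/tf-modules.git//postgres?ref=1.6.2"

  # trick 1 — local development, via a symlinked sibling checkout:
  source = "../tf-modules/postgres"

  # trick 2 — team development, pointing at a pushed branch:
  # source = "git::https://github.com/our-org/tf-modules.git//postgres?ref=new-postgres-ability-pull-request"
}
```

Either swap requires re-running terraform init before planning.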

A version-tagged module repo is a huge win over a module subfolder

Despite its drawbacks, a version-tagged external repo for terraform modules is still a win compared to having modules in a subfolder.

Still Trying to Manage Our Terraform Infrastructure

So how best to manage this? We have teams that develop services, and they are mainly responsible for defining the resources their services need via terraform. Of course we should demand that they keep their terraform code up to date. And of course they should comply. But they don't.

In truth, our terraform code isn't obvious or well-documented enough that an engineer working on a golang service that needs an S3 bucket and an RDS database can just create these things and know that permissions, security, versions, tags, and all the other “standards” have been met.

So in Infrastructure we work with the teams to build resources and approve their PRs for new ones. And then they go away to make their service work. Terraform isn't the thing they think about every day, and indeed neither are the resources created by terraform that their apps depend upon.

Also, some changes, like moving a database version from 12 to 13 … result in extended downtime. It's a variable we expose, but not one we want developers to change.
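Exposing the engine version as a module variable is easy; the risk is who changes it. A sketch using the AWS RDS resource (attribute subset only, names illustrative):

```hcl
variable "engine_version" {
  type        = string
  default     = "12"
  description = "Postgres major version; bumping to 13 triggers an upgrade with extended downtime"
}

resource "aws_db_instance" "this" {
  engine         = "postgres"
  engine_version = var.engine_version
  # ... remaining required attributes omitted
}
```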

It’s hard to move to true devops

So a lot of terraforming tends to fall on the Infrastructure team. And this really blows the whole idea of devops back down the line to the infrastructure team. In some more ideal future that we're trying to build, we would have great modules that handle security, permissions, naming, tagging, linkage to networking, account, region, version upgrades, and all the rest, all documented, with easy-to-follow processes.

Until then, we try to keep up, building modules and migrating existing code into them, adding new capabilities to modules as needed by app development teams, and upgrading as needed.

Terraform is still better than the alternatives

It’s still worth the effort — having all of our infrastructure and resources defined in terraform gives us a lot of benefit, and a lot of ability to make global changes. Tagging our modules repo has made this a lot easier and better in most ways, but far from all.


30 Years of Developing Software, 20 Years of Being a Parent, 10 Years of Being Old. (Effective: 2020)