AWS has improved a lot, sort of

Tom Harrison
Tom Harrison’s Blog
12 min read · Jul 24, 2020


Working on it … soon!

I am studying now for the Solutions Architect Professional certification from AWS, and as I review my knowledge, it’s clear that they have really upped their game in several important ways over the last year or two.

Incredibly Boring Features

Proper support for enterprises has required a number of improvements. First-class support for external directory services, notably Active Directory, has been around for a while, but it has gotten substantially more useful. Organizations have also been around for a while, but now there are clearly defined patterns for managing multiple accounts, billing, and global service policies for access and management. Options for secure and fast connection to multiple data centers are well worked out now. Options for transferring data are excellent.

There are new or refined services for managing things that annoy all organizations, like SSL certificates, and you can even create your own private CA so internal users don't have to get certificate warnings. And finally (finally!!) Route 53 DNS allows for creation of internal addresses (private hosted zones). First-class support for secret management, exceptional support for key management, and fuller support for service roles and other mechanisms for role-based security make it possible to reduce the number of personal keys that are needed.

Built-in IDS/IPS services complement the CloudTrail auditing system, WAF, log analysis and management, and several other "glad I don't have to figure that out" kinds of services. More and more, AWS fulfills its broader mission to provide the tools needed for a properly securable system.

Clever New Features

There are two new sets of services that I think could be really important for companies that are either hybrid or (god forbid) multi-cloud.

Systems Manager is a beast

The Systems Manager tools are really important, and work not just with AWS services and instances, but with almost any machine, whether on-prem or in a remote data center. In my last company, I even set up our Microsoft Azure-based instances to support AWS Systems Manager. The relatively new CI/CD tools, especially Code Pipeline, solve a large problem that many businesses, small and large, have. Both of these services are cleverly implemented.
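
To give a concrete sense of how the non-AWS piece works: outside machines are enrolled through a "hybrid activation." Here is a minimal sketch using boto3; the role name, instance name, and registration command details are illustrative, not the exact setup I used.

```python
import boto3

ssm = boto3.client("ssm")

# Create a hybrid activation; the returned code/ID pair is what the SSM
# agent on an on-prem or other-cloud machine uses to register itself.
# "SSMHybridRole" is a placeholder IAM role name, not an AWS default.
activation = ssm.create_activation(
    DefaultInstanceName="azure-web",
    IamRole="SSMHybridRole",
    RegistrationLimit=10,
)

# On each target machine, the agent then registers with something like:
#   amazon-ssm-agent -register -code <ActivationCode> -id <ActivationId> -region us-east-1
print(activation["ActivationId"], activation["ActivationCode"])
```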

Systems Manager offers a number of features. I used it to manage OS patch compliance and inventory when it was first released a couple of years ago. It was really rough at the time, and had some serious limitations.

But the service has gotten much better, and has one feature that's absolutely huge: you can perform nearly all operations on any node in your system without logging onto it. You can run commands, scripts, and more, which means no sprawling SSH key management, no OS-level user management, and no open ports. Authentication and authorization are managed centrally.

Each interaction is linked to other AWS services like CloudWatch, CloudTrail, Config, and more. Batch updates to a fleet of servers are also possible, e.g. installing a new version of a software dependency across every machine at once, as sketched below. So while we all want to live in a world where all system changes are done with scripts (Infrastructure as Code), that's just not the reality, especially with all of those non-production machines you have sitting around.
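
Here is a hypothetical boto3 sketch of that fleet-wide batch update: one call, targeted by tag, with no SSH involved. The tag, command, and comment are illustrative.

```python
import boto3

ssm = boto3.client("ssm")

# Run a shell command on every instance tagged Environment=staging,
# with no SSH keys or open ports involved.
response = ssm.send_command(
    Targets=[{"Key": "tag:Environment", "Values": ["staging"]}],
    DocumentName="AWS-RunShellScript",  # a standard AWS-managed document
    Parameters={"commands": ["sudo yum update -y openssl"]},
    Comment="Patch openssl across the staging fleet",
)
command_id = response["Command"]["CommandId"]

# Each invocation is auditable per instance (and recorded in CloudTrail).
invocations = ssm.list_command_invocations(CommandId=command_id)
for inv in invocations["CommandInvocations"]:
    print(inv["InstanceId"], inv["Status"])
```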

Code Pipeline and Friends

Code Pipeline strikes me as a very smart new set of services, including Code Artifact, Code Build, Code Commit, Code Deploy (and Code Star, I guess). You may say "Yeah, but I already use Artifactory, Jenkins, GitHub or GitLab, and Bamboo for my CI/CD pipeline," and if so, you'll be happy: for the most part, these external tools work with Code Pipeline. That is, you don't have to give up your existing tools. If you don't have some of these tools, you can use the ones AWS provides.
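
For flavor, here's roughly what a minimal two-stage pipeline looks like when defined through boto3: a CodeCommit source feeding a CodeBuild project. All names, ARNs, and buckets are placeholders, and a GitHub or Jenkins integration would simply swap in different action providers.

```python
import boto3

codepipeline = boto3.client("codepipeline")

# A minimal two-stage pipeline; role, bucket, repo, and project names
# are all illustrative.
codepipeline.create_pipeline(
    pipeline={
        "name": "demo-pipeline",
        "roleArn": "arn:aws:iam::123456789012:role/demo-pipeline-role",
        "artifactStore": {"type": "S3", "location": "demo-artifact-bucket"},
        "stages": [
            {
                "name": "Source",
                "actions": [{
                    "name": "Source",
                    "actionTypeId": {
                        "category": "Source", "owner": "AWS",
                        "provider": "CodeCommit", "version": "1",
                    },
                    "configuration": {"RepositoryName": "demo-app",
                                      "BranchName": "main"},
                    "outputArtifacts": [{"name": "SourceOutput"}],
                }],
            },
            {
                "name": "Build",
                "actions": [{
                    "name": "Build",
                    "actionTypeId": {
                        "category": "Build", "owner": "AWS",
                        "provider": "CodeBuild", "version": "1",
                    },
                    "configuration": {"ProjectName": "demo-build"},
                    "inputArtifacts": [{"name": "SourceOutput"}],
                }],
            },
        ],
    }
)
```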

We all want our IT and software development systems to do CI/CD, but the truth is that, of all the software we build and deploy in business, far from all of it goes through a proper pipeline.

It turns out managed software services are great — just look at the success of GitHub and then GitLab. No one runs JIRA servers, git servers, and the like locally anymore, and while you can get hosted Jenkins or other CI services, it's just one more thing to keep track of. Chef and Puppet are both crappy ways to deploy software, but a lot of people use them for that.

I am very impressed with Code Pipeline as a service that is actually pretty easy to configure and use, and has a really nice set of features. And the AWS alternatives are available at any time and fully managed.

Further, Code Deploy fully supports (but does not require) Blue/Green deployments. These are something of a holy grail of software delivery (Martin Fowler wrote about the pattern a decade ago), and the flexibility of cloud computing now makes them practical. Code Deploy also supports native deploys to various container and serverless platforms.
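
Here is a hedged sketch of what configuring a Blue/Green deployment group looks like via boto3, for an EC2/Auto Scaling fleet behind a load balancer. Every name here is a placeholder.

```python
import boto3

codedeploy = boto3.client("codedeploy")

codedeploy.create_deployment_group(
    applicationName="demo-app",
    deploymentGroupName="demo-blue-green",
    serviceRoleArn="arn:aws:iam::123456789012:role/demo-codedeploy-role",
    deploymentStyle={
        "deploymentType": "BLUE_GREEN",
        "deploymentOption": "WITH_TRAFFIC_CONTROL",
    },
    blueGreenDeploymentConfiguration={
        # Copy the existing ASG to build the green fleet from scratch.
        "greenFleetProvisioningOption": {"action": "COPY_AUTO_SCALING_GROUP"},
        # Shift traffic as soon as the green fleet is ready.
        "deploymentReadyOption": {"actionOnTimeout": "CONTINUE_DEPLOYMENT"},
        # Tear down the blue fleet an hour after a successful cutover.
        "terminateBlueInstancesOnDeploymentSuccess": {
            "action": "TERMINATE",
            "terminationWaitTimeInMinutes": 60,
        },
    },
    loadBalancerInfo={"targetGroupInfoList": [{"name": "demo-tg"}]},
    autoScalingGroups=["demo-asg"],
)
```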

As an aside, several years ago my team at Paytronix built our own Blue/Green deployment system using Elastic Map Reduce (EMR) and related tools on AWS. Each phase was a separately buildable and runnable element, and each phase launched all of its needed resources, building the entire stack from scratch based on configuration and code that was checked in along with the rest of the software.

Our ability to reliably launch in a pristine environment each time we tested in development, ran through full automated QA, and deployed to production completely changed our relationship with our software. We went from being fearful and risk averse to confident and fast. We also saved huge amounts of money by tearing down dev and QA machines once we were done with them. I cannot recommend this approach enough.

These tools make it hard to leave AWS

The boring security, compliance, networking, integration, and management capabilities that have advanced are just the "blocking and tackling" needed to support a broader and broader range of enterprise customers. This is just AWS doing their best to expand their market. They have done well.

Perhaps I am overly enthusiastic about Systems Manager. Having a centralized tool for managing servers, secrets, compliance, inventory, and patching, not just for AWS instances, but also for on-prem, internal, and even other-cloud machines, is pretty epic. Being able to securely execute runbooks, or even arbitrary commands, without needing open SSH or RDP ports is also pretty amazing. If you still have an Ops team separate from development, they are going to like the level of control and visibility this system provides.

Any customer who embraces Systems Manager is taking a step that tends to make AWS very sticky and appealing. As the tool evolves, and it’s doing so quickly, it becomes a significant differentiator from other cloud platforms.

Code Pipeline is similar. While there are services out there, and you might even already use them, Code Pipeline provides a way to integrate and unify an existing CI/CD pipeline while also taking advantage of the capabilities of a cloud platform, e.g. spinning up build or test servers on demand. Simply not having to manage user accounts or set up SSO on yet another system is a boon in itself.

How do we know these services are intended to attract users to the platform? They are both free.

Is More Better?

Overall, AWS has improved dramatically. Things that in the past only worked if you were "all in" on AWS now provide great hooks to external tools your organization may already be using. Of course, most of this serves the aim of migrating existing on-prem services to the cloud.

But AWS is doing more platform as a service (PaaS) now, and interestingly seems to be focusing on the plethora of secondary services needed by modern IT and Development teams. If you can get your Ops, QA, and Dev teams to love AWS, then it’s a good proposition.

However, AWS has failed, and continues to do so in several critical ways:

  1. Too many choices, overwhelming
  2. Poorly integrated documentation

Just some of the available AWS services. Wow.

Choice overload

I have been using AWS since around 2008 and over 12 years I have watched the system grow from almost nothing to almost too much.

During that time, I have figured out how the really core services of AWS work, namely IAM, EC2, S3, VPC, plus CloudWatch/CloudTrail, Tagging, SNS/SQS, CloudFormation, and some other core bits. Keeping up just with these services is a lot of work. It’s effectively impossible to know all of the services offered, certainly not in any depth. That’s not necessarily bad, but there’s a real human/organizational problem that AWS has failed to address.

A new user or a new organization moving to AWS, faced with the entirety of the AWS stack, is going to be overwhelmed, and will struggle to figure out even how to get things going.

Security demands simplicity

When thinking about security, my mantra is “simple is more secure” and AWS fails on this front, quite badly, in fact. It is very easy, and thus quite common, for people to develop systems whose data is stored in an S3 bucket, and also fantastically easy to mess up and make that bucket accessible to the world. Corey Quinn’s awesome Last Week in AWS presents the S3 Bucket Negligence Awards to companies who have failed to secure their buckets. There’s pretty much one every week, and these are not small breaches.

Solving the most common AWS security issue is … simple?

Consider the S3 Bucket Negligence Award for this week, which goes to Twilio. Most of these incidents occur when someone with more permissions than knowledge checks the "public access" checkbox on the S3 console. Lately AWS sets off many blaring alarms and regularly shames you for using this feature. But Twilio's error was more subtle.

Some time ago (but, they assure us, after 2015) a single line was added to an S3 access policy buried 12 levels deep within IAM, one they called AllowPublicRead: in addition to the "s3:GetObject" action, someone added "s3:PutObject", which makes the bucket public-writable. Whoopsy. So a bad actor put an object that replaced a JavaScript file, and the breach was underway. Twilio had manually changed the S3 policy to diagnose or temporarily resolve some kind of problem, but never changed it back. Today there are tools to prevent this stuff, but without engaging them, you could still have a problem.
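
To make the failure mode concrete, here is an illustrative bucket policy of the read-only-public sort (a placeholder, not Twilio's actual policy). The difference between harmless and disastrous is a single extra action in one list.

```python
import json
import boto3

s3 = boto3.client("s3")

# A public-read policy, common for static asset buckets.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowPublicRead",
        "Effect": "Allow",
        "Principal": "*",
        # Adding "s3:PutObject" to this list is the one-line change that
        # turns a public-read asset bucket into a public-write one.
        "Action": ["s3:GetObject"],
        "Resource": "arn:aws:s3:::my-asset-bucket/*",
    }],
}

s3.put_bucket_policy(Bucket="my-asset-bucket", Policy=json.dumps(policy))
```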

Indeed, while I didn't write about the Service Control Policies now available as part of Organizations, a simple SCP is one method that would work organization-wide. Another is the S3 Block Public Access setting, which can be applied to every bucket in an account. Or you could use Detective to look for unauthorized access. Or you could use Inspector to find the problem. Then again, CloudWatch might be the way to find problems over time, and CloudTrail could have recorded the change (I think?).
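
As a sketch, here is what the first two of those guardrails might look like in boto3. The account ID and names are placeholders, and a production SCP would need carve-outs for whoever legitimately manages bucket policies.

```python
import json
import boto3

# Account-wide Block Public Access: four flags that override bucket
# policies and ACLs for every bucket in the account.
s3control = boto3.client("s3control")
s3control.put_public_access_block(
    AccountId="123456789012",  # placeholder account ID
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# Or an SCP created in the management account, denying member accounts
# the ability to loosen a bucket policy or ACL at all.
organizations = boto3.client("organizations")
scp = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Action": ["s3:PutBucketPolicy", "s3:PutBucketAcl"],
        "Resource": "*",
    }],
}
organizations.create_policy(
    Name="deny-s3-policy-changes",
    Description="Prevent member accounts from loosening S3 access",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps(scp),
)
```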

And if you ask me, all of this is a little bit of the tail wagging the dog. The real solution is to build and maintain all aspects of the system infrastructure as code, perhaps using CloudFormation. But you could also use AWS OpsWorks if you already have IaC implemented for your organization using Puppet or Chef. And of course all the cool kids use Ansible and Terraform these days, and you can mos’ def’ make those work as well.
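
A minimal sketch of the IaC approach, assuming CloudFormation driven from boto3: the bucket's public-access settings live in version control, so a manual console change becomes detectable drift rather than a silent surprise. Stack and resource names are illustrative.

```python
import boto3

# The bucket and its public-access settings declared as code.
TEMPLATE = """
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  AssetBucket:
    Type: AWS::S3::Bucket
    Properties:
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        IgnorePublicAcls: true
        BlockPublicPolicy: true
        RestrictPublicBuckets: true
"""

cloudformation = boto3.client("cloudformation")
cloudformation.create_stack(StackName="asset-bucket", TemplateBody=TEMPLATE)
```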

It looks like Twilio has done or plans to do some of this kind of stuff. But in the end, let’s be real: there’s a huge amount of complexity and knowledge needed to select the right options for any given organization. It’s a huge surface area, and nearly all of the solutions require more than just a checkbox.

If it's complicated, it's not secure

IAM is incredibly complex, and also important

IAM is waaaaaaay too hard to use, and it is the primary mechanism for configuring authentication, authorization, users, groups, permissions, policies, roles, and their relationships to each other.

IAM solves the right problem the right way. That is, securing access by granting principals permissions to perform actions on resources is a centralized and advanced strategy for implementing security controls. But it turns out doing security this way is really, really tricky, as it requires that you understand the majority of the API endpoints that are actually being accessed under the covers. And there are thousands upon thousands of endpoints.

AWS tried to make IAM easier by providing what it calls managed policies. These are typically associated with common user roles; for example, there is a managed policy for Data Scientists that opens access to tools like SageMaker but also allows a number of operations on S3 and other services. These are "get going" policies.

Managed policies are not "least privilege" policies, though, so using them is almost necessarily not a best practice. So what happens? When a given managed policy doesn't provide enough permissions, people tend to attach more managed policies … that offer more access. (Or maybe just make everyone an administrator.)
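
For contrast, here is a sketch of the least-privilege alternative: a small customer-managed policy scoped to one bucket prefix, attached to one role. All names are illustrative.

```python
import json
import boto3

iam = boto3.client("iam")

# A narrow policy: read-only access to one team's bucket prefix, instead
# of a broad managed policy covering many services.
policy_doc = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::team-datasets",
            "arn:aws:s3:::team-datasets/research/*",
        ],
    }],
}

created = iam.create_policy(
    PolicyName="research-read-only",
    PolicyDocument=json.dumps(policy_doc),
)
iam.attach_role_policy(
    RoleName="research-role",
    PolicyArn=created["Policy"]["Arn"],
)
```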

This is a road to hell, and there are dead bodies strewn around. In fact, I am one of them. (figuratively speaking, of course)

So, as with everything, there are ways to get going with AWS services really quickly, and that's pretty great. But nearly everything we do with AWS is intrinsically more difficult, and also more exposed. S3 access, by default, does not pass through a router or firewall you control, so logging and monitoring are up to you. If you have done the right thing and set up CloudTrail and CloudWatch, then you'll know. Otherwise …
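
Here is what opting in to that visibility might look like, as a sketch with placeholder bucket and trail names: server access logs on the bucket, plus CloudTrail data events for object-level operations.

```python
import boto3

# Server access logging: S3 writes request logs to a separate bucket.
s3 = boto3.client("s3")
s3.put_bucket_logging(
    Bucket="my-asset-bucket",
    BucketLoggingStatus={
        "LoggingEnabled": {
            "TargetBucket": "my-log-bucket",
            "TargetPrefix": "s3-access/",
        }
    },
)

# CloudTrail data events record object-level reads and writes.
cloudtrail = boto3.client("cloudtrail")
cloudtrail.put_event_selectors(
    TrailName="my-trail",
    EventSelectors=[{
        "ReadWriteType": "All",
        "IncludeManagementEvents": True,
        "DataResources": [{
            "Type": "AWS::S3::Object",
            "Values": ["arn:aws:s3:::my-asset-bucket/"],
        }],
    }],
)
```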

The path to getting a secure, performant, manageable, stable, cost-effective system can be quite a long one, and require a great deal of knowledge. In the end, it’s as good or better than other enterprise solutions because it is almost perfectly consistent across the hundreds of AWS services, and that’s good.

I have repeatedly defended AWS in my career, claiming that companies who run into security issues have a problem with their process. Failure to properly secure and maintain a system is on the company, not on AWS. Indeed, AWS does so many things to make proper security possible — the boring things — that any company wanting to roll their own security is just kidding themselves.

But there's a counter-argument: AWS is so huge, and has so many overlapping options that boil down to arcane ACLs, policies, roles, and permissions, all so detailed, that the level of knowledge needed is beyond what people can correctly manage. Complexity like this tends to create the opportunity for a simple yet potentially catastrophic error that no one knows about for years.

Complexity compounded by “unit” level documentation

If you need to know how to use a service, or accomplish a specific goal with a service, the AWS documentation is pretty good. Nearly all documents are at the level of the service itself, and discuss "unit" level functionality, like what a given API does or how to use the web console to accomplish a task. There are some "functional" descriptions of ways and reasons you might use the service, with some reasonable information on how to do that as well. Again, good doc, generally speaking.

But in a service-oriented architecture like AWS, someone has to come along and offer an "integration" level document for each of perhaps scores of major and common use cases that span multiple services.

We don't just need to know how to launch an AutoScaling group of servers to run a WordPress site. The documentation that would actually help would describe broad architectures, both "recommended" and "real," with information on how to achieve the noble and aspirational goals of the AWS Well-Architected Framework.

We need a “wordpress simple” example that does everything — CloudFormation to set up the multi-AZ network and resources, maybe ImageBuilder, Code* pipeline for build and deploy, use of Secrets Manager and KMS, CA Manager, SSL, Route 53, AutoScaling, RDS, Redis, CloudWatch and CloudTrail, CloudFront for assets, WAF, Systems Manager for patching and compliance, proper tagging, Organizations with Accounts for environments and security, CloudWatch Logs, or maybe ElasticSearch with Kibana … etc. And when done, let’s containerize it and explore ECS, EKS and all that. Or maybe set up commenting as a Lambda function.

I guess that’s why AWS has so many Solutions Architects. Hopefully they hire another one soon!
