CloudFormation, Route 53, and … EKS?

The other day I got halfway through writing a very irate support ticket to AWS, stopped to do some fact checking, and learned something deeply annoying.
One of the teams I work with manages a bunch of services. One of these “services” is some common Amazon Route 53 infrastructure that is set up using AWS CloudFormation, and over the history of the project the deployment in the development account the team uses has been a little flaky. Every time we hit a deployment problem, it turned out to be rate limiting. It never happened in production, so the flakiness didn’t quite get the attention that it could have.
Rate limiting in Route 53
Some background: Route 53 has a hard limit of 5 control plane requests per second per AWS account. For most folks, this is fine. However, it wasn’t working well for this team.
The team had raised support tickets and gotten advice like “CloudFormation will attempt to create your resources in parallel. One option to avoid rate limiting is to add DependsOn links to serialize the resource creation.” We weren’t super-happy with that answer, and there’s a CloudFormation roadmap item to fix it, but we needed something in the interim, and it worked … mostly.
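To make that concrete, here’s a rough sketch of what a serialized template looks like. The resource and record names are made up for illustration; the point is the DependsOn chain, which stops CloudFormation from firing all of the Route 53 calls at once.

```yaml
Parameters:
  HostedZoneId:
    Type: AWS::Route53::HostedZone::Id

Resources:
  RecordA:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: !Ref HostedZoneId
      Name: a.dev.example.com.
      Type: CNAME
      TTL: "300"
      ResourceRecords:
        - target-a.example.net.

  # DependsOn forces this record to wait for RecordA, so CloudFormation
  # issues the Route 53 API calls one at a time instead of in parallel.
  RecordB:
    Type: AWS::Route53::RecordSet
    DependsOn: RecordA
    Properties:
      HostedZoneId: !Ref HostedZoneId
      Name: b.dev.example.com.
      Type: CNAME
      TTL: "300"
      ResourceRecords:
        - target-b.example.net.
```

The obvious downside is that every new record has to be threaded into the chain by hand, and creation becomes strictly sequential.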
On this day a deployment had failed again, and after chasing down the usual suspects and making sure that the resources were properly serialized, I got very irate. I was halfway through writing a support ticket:
We are still encountering Route 53 rate limits and our CloudFormation stack deployment / updates are intermittently failing, sometimes after only two resources are created. There are no Route 53 API calls being made by our applications, only through CloudFormation.
We are quite frustrated at this point and would like to request a session with a solution architect to help us understand how we should be doing this and
and I paused.
Check yourself before you wreck yourself
“There is no point in using the word ‘impossible’ to describe something that has clearly happened.” — Douglas Adams, Dirk Gently’s Holistic Detective Agency
“Is it true that there are no other Route 53 API calls being made?” I asked myself. A quick jaunt into AWS CloudTrail told me the answer, and also opened a gaping pit beneath my feet.
There were 436 Route 53 API calls made in the 2-minute period surrounding our CloudFormation failure. If you do the math, that’s an average of 3.6 requests per second, so it’s not at all surprising that a burst tipped us over the limit of 5 at some point in there.
“But where are these coming from?” was my immediate question, and it was immediately answered.

Virtually all of these requests were being made by an EC2 instance that was part of an Amazon Elastic Kubernetes Service (EKS) cluster.
Talking through this with some other folks, I learned that they’d configured external-dns on the cluster, and that this behaviour is actually documented.
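If you haven’t run into it, external-dns watches Kubernetes Services and Ingresses and keeps matching DNS records up to date in Route 53. A hypothetical Service it would manage looks something like this (the hostname and app names are invented for illustration):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app
  annotations:
    # external-dns sees this annotation and creates/updates the matching
    # Route 53 record on every reconciliation pass.
    external-dns.alpha.kubernetes.io/hostname: my-app.dev.example.com
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
```

On each pass the controller lists the hosted zones and their record sets to figure out whether anything needs to change, which is where all of those API calls were coming from.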
The production account doesn’t have the EKS cluster, so it’s not overwhelmed with Route 53 API calls, which explains why deployment never failed there.
Buh-bye
I wanted to decommission the cluster immediately, but unfortunately some teams still need it, so I wasn’t able to.
The external-dns documentation says that one workaround for the controller eating your entire Route 53 request budget is to extend the interval at which the controller’s reconciliation loop runs. In this particular cluster, the reconciliation loop was running every minute (the default!) to reconcile a set of records that change approximately never. I followed the instructions, set the interval to a week, and settled in to see what happened.
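For reference, the change boils down to a single flag on the external-dns container. The manifest below is a sketch with illustrative values, not this cluster’s actual deployment; the only part that matters is --interval, which takes a Go-style duration (so a week is 168h).

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-dns
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: external-dns
  template:
    metadata:
      labels:
        app: external-dns
    spec:
      # Assumes a service account that already has Route 53 permissions.
      serviceAccountName: external-dns
      containers:
        - name: external-dns
          image: registry.k8s.io/external-dns/external-dns:v0.14.2
          args:
            - --source=service
            - --source=ingress
            - --provider=aws
            - --registry=txt
            - --txt-owner-id=my-cluster   # hypothetical owner id
            # The default is --interval=1m; stretching it means the controller
            # only lists and reconciles Route 53 records once a week.
            - --interval=168h
```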
The first thing I noticed was that the calls to Route 53 stopped immediately. Not surprising, but great to see the confirmation. Several hours after the change, there were still no calls from the previously misbehaving cluster.
All is well now, and I get to put away my detective hat for another day.
“The light works,” he said, indicating the window, “the gravity works,” he said, dropping a pencil on the floor. “Anything else we have to take our chances with.” — Douglas Adams, Dirk Gently’s Holistic Detective Agency
What I learned
First, CloudTrail was instrumental here. I’m still a novice, but I’m learning how powerful a tool it is. Once I knew what to look for, it was immediately obvious what the source of the rate limiting was. The events in CloudTrail identified the EC2 instance and even made it clear that the source of the requests was in an EKS cluster.
Second, I was reminded that Kubernetes is not a get-out-of-ops-free card. There is a lot of expertise involved in running Kubernetes well, even when you’re using a managed service like EKS. I knew this before, but this was an example of a cluster I didn’t even know existed (don’t worry: someone more responsible did know!) having side effects way outside its scope.