SQS, DLQs, and KMS, oh my
Recently one of the teams I work with had a fun time (note: it was not fun in the moment) with Amazon Simple Queue Service (SQS), dead-letter queues (DLQs), and AWS Key Management Service (KMS).
I thought I’d share because we learned something pretty important.
If you’re new to SQS, you may be vaguely aware that it’s a general-purpose queue service that lets you post messages onto a queue and then process queued messages in a handler, kind of like this:
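A minimal sketch of what that handler side might look like in Python, assuming a Lambda function processing JSON message bodies (the function name and message fields here are hypothetical):

```python
import json

def handler(event, context):
    """A minimal SQS-triggered Lambda handler sketch.

    The event source mapping invokes the function with a batch of
    records; each record's "body" is the original message body.
    """
    processed = 0
    for record in event["Records"]:
        payload = json.loads(record["body"])  # assumes JSON message bodies
        # ... real work on `payload` would happen here ...
        processed += 1
    return {"processed": processed}

# Producer side (needs AWS credentials, so shown as a comment only):
# import boto3
# boto3.client("sqs").send_message(
#     QueueUrl=queue_url, MessageBody=json.dumps({"order_id": 42}))
```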
There is an event source mapping in between the queue and the handler; we’re going to gloss over that detail for a minute.
SQS is super-handy, reliable, and amazingly scalable; AWS guides us to use it anywhere that we need to buffer communications between components of a solution.
There’s even a mechanism to deal with failures. You can configure a policy on your queue so that if SQS fails to deliver¹ a message to the handler enough times, the message gets moved to another queue for special handling. This second queue is called a “dead-letter” queue, named after the post-office concept where undeliverable mail is sent to a special place for handling.
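That policy is a queue attribute called RedrivePolicy; as a sketch, applying it with boto3 might look like this (the queue names and ARN are hypothetical, and maxReceiveCount is the “enough times” threshold):

```python
import json

def redrive_policy(dlq_arn, max_receive_count=3):
    """Build the RedrivePolicy attribute value (a JSON string) that tells
    SQS to move a message to the DLQ after max_receive_count failed
    receives."""
    return json.dumps({
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    })

# Applied to the primary queue (not executed here):
# sqs = boto3.client("sqs")
# sqs.set_queue_attributes(
#     QueueUrl=primary_queue_url,
#     Attributes={"RedrivePolicy": redrive_policy(dlq_arn)},
# )
```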
¹ “SQS fails to deliver” is not technically accurate: SQS does not deliver messages. Rather, the event source mapping retrieves the message, attempts to deliver it to the Lambda function, and deletes the message if it was handled successfully. If there is an error retrieving or processing the message, SQS makes the message available again after the visibility timeout expires. Eventually the message either expires after the queue’s retention period or, after a specified number of retrievals, is moved to the dead-letter queue.
You may have a mental model of this setup that looks a bit like this:
SQS also lets you encrypt messages in the queue; you can use an AWS-managed key or a customer-managed key. Our teams always use customer-managed keys, so we have an SQS queue with a KMS CMK and DLQs (say that three times fast!). The updated model looks a bit like this:
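In boto3 terms, the encryption key is just another queue attribute, KmsMasterKeyId; a sketch with hypothetical key alias and ARN values:

```python
import json

def encrypted_queue_attributes(kms_key_id, dlq_arn, max_receive_count=3):
    """Attributes for a KMS-encrypted primary queue with a DLQ attached."""
    return {
        "KmsMasterKeyId": kms_key_id,  # alias or ID of a customer-managed key
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": str(max_receive_count),
        }),
    }

# sqs.create_queue(QueueName="orders",           # not executed here
#                  Attributes=encrypted_queue_attributes(
#                      "alias/orders-key", dlq_arn))
```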
If you’re an SQS expert, someone who has read the documentation carefully, or someone who read very carefully through the paragraphs above, you will instantly see the problem with this picture. If you didn’t catch it, you are not alone.
The key phrasing above is that undeliverable messages are moved from the primary queue to the dead-letter queue. They’re not retrieved from the primary queue and then sent onto the dead-letter queue, they’re moved. This isn’t explicitly stated in the documentation we have been able to find, but it’s clear from the behaviour we experienced and a couple of other hints in the documentation.
Our team discovered that the DLQ handler wasn’t processing messages, and in fact its event source mapping (the invisible thing we agreed to gloss over) was entirely disabled. An extremely helpful AWS support person was able to tell us why.
When messages are put onto a primary queue that has KMS encryption enabled, they are encrypted using that queue’s KMS key. The handler must also have permissions to decrypt messages using the key, and we had that all working perfectly.
The Lambda function does not use the decryption permission itself; rather, it “loans” that permission to the event source mapping. In our case, the Lambda function’s execution role (and therefore its event source mapping) did not have sufficient permissions to decrypt the message, so the mapping failed to decrypt the message after retrieving it from the DLQ and did not invoke the Lambda function.
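Concretely, the “loaned” permission is a kms:Decrypt statement on the function’s execution role; a sketch of the relevant IAM policy fragment, with a hypothetical key ARN:

```python
def decrypt_statement(key_arn):
    """IAM policy statement letting the execution role (and therefore the
    event source mapping) decrypt messages encrypted with key_arn."""
    return {
        "Effect": "Allow",
        "Action": ["kms:Decrypt"],
        "Resource": key_arn,
    }

# Attach this to the handler's execution role; the key ARN is hypothetical.
policy = {
    "Version": "2012-10-17",
    "Statement": [decrypt_statement(
        "arn:aws:kms:us-east-1:123456789012:key/1234abcd-hypothetical")],
}
```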
If you have a flawed understanding of the DLQ like we did, you might expect that messages would be decrypted and then re-encrypted when they are sent on the DLQ. However, they’re not decrypted and re-encrypted. They’re not even sent on the DLQ. They’re moved from the primary queue to the DLQ.
We had set up our DLQ with a separate key, the DLQ handler had permissions to use that key, and in tests where we sent messages directly to the DLQ the handler worked perfectly. However, in a full end-to-end scenario where messages fell off the primary queue and were moved to the DLQ, the event source mapping failed to decrypt them (it was set up with permissions on the DLQ key, not the primary queue key) and did not invoke the handler.
AWS is super-smart, and they know not to try things forever when there’s persistent failure. The event source mapping eventually disables itself and emits a CloudTrail event when it does so. If you’re vigilant or, preferably, if you set up a CloudWatch alarm based on the ApproximateAgeOfOldestMessage metric on the DLQ, you’ll be able to detect that messages are queuing up in the DLQ and not getting handled.
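Such an alarm might be sketched with boto3 like this; the alarm name, queue name, and threshold are hypothetical choices:

```python
# Sketch of a CloudWatch alarm on the DLQ's ApproximateAgeOfOldestMessage
# metric. Alarm name, queue name, and threshold are hypothetical.
alarm_params = {
    "AlarmName": "orders-dlq-stuck-messages",
    "Namespace": "AWS/SQS",
    "MetricName": "ApproximateAgeOfOldestMessage",
    "Dimensions": [{"Name": "QueueName", "Value": "orders-dlq"}],
    "Statistic": "Maximum",
    "Period": 300,              # evaluate over 5-minute windows
    "EvaluationPeriods": 1,
    "Threshold": 3600,          # alarm if the oldest message is over an hour old
    "ComparisonOperator": "GreaterThanThreshold",
}
# boto3.client("cloudwatch").put_metric_alarm(**alarm_params)  # not executed here
```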
It’s important to remember that because messages are moved and not re-sent on the DLQ, ApproximateAgeOfOldestMessage does not reflect the time the message has been in the DLQ but rather the time since the message was originally sent on the primary queue. This is also an important thing to remember when you are configuring your DLQ: if you use the same policy to age messages out of your DLQ as you have on your primary queue, you may find that messages disappear without getting handled the way you expected.
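For example, giving the DLQ a longer retention period than the primary queue leaves moved messages (whose age clock started on the primary queue) time to be handled; a sketch with hypothetical retention values:

```python
# Sketch: give the DLQ a longer MessageRetentionPeriod than the primary
# queue so moved messages don't age out before they are handled.
PRIMARY_RETENTION_SECONDS = 4 * 24 * 3600   # 4 days (hypothetical)
DLQ_RETENTION_SECONDS = 14 * 24 * 3600      # 14 days, the SQS maximum

dlq_attributes = {"MessageRetentionPeriod": str(DLQ_RETENTION_SECONDS)}
# sqs.set_queue_attributes(QueueUrl=dlq_url, Attributes=dlq_attributes)
```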
Did I mention that AWS does document the requirement to use the same key?
Your source queue and its corresponding dead-letter queue need to share the same KMS key.
Oops. 😳
We’re looking into creating a cfn-lint custom rule to prevent us from tripping over this issue again.
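A sketch of the check such a rule would perform, written as plain Python over a parsed template’s Resources section (resource names are hypothetical, and a real cfn-lint rule would wrap this logic in cfn-lint’s rule API and handle more ARN reference forms):

```python
def mismatched_dlq_keys(resources):
    """Return logical IDs of SQS queues whose KmsMasterKeyId differs from
    their dead-letter queue's, scanning a parsed template's Resources."""
    problems = []
    for name, res in resources.items():
        if res.get("Type") != "AWS::SQS::Queue":
            continue
        props = res.get("Properties", {})
        redrive = props.get("RedrivePolicy")
        if not redrive:
            continue
        target = redrive.get("deadLetterTargetArn")
        # Handle the common Fn::GetAtt reference to another queue's Arn.
        if isinstance(target, dict) and "Fn::GetAtt" in target:
            dlq_name = target["Fn::GetAtt"][0]
            dlq_props = resources.get(dlq_name, {}).get("Properties", {})
            if props.get("KmsMasterKeyId") != dlq_props.get("KmsMasterKeyId"):
                problems.append(name)
    return problems

# A hypothetical template exhibiting exactly the mistake described above:
resources = {
    "Primary": {"Type": "AWS::SQS::Queue", "Properties": {
        "KmsMasterKeyId": "alias/primary-key",
        "RedrivePolicy": {
            "deadLetterTargetArn": {"Fn::GetAtt": ["Dlq", "Arn"]},
            "maxReceiveCount": 3}}},
    "Dlq": {"Type": "AWS::SQS::Queue", "Properties": {
        "KmsMasterKeyId": "alias/dlq-key"}},
}
```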