SQS, DLQs, and KMS, oh my
Recently one of the teams I work with had a fun time (note: it was not fun in the moment) with Amazon Simple Queue Service (SQS), dead-letter queues (DLQs), and AWS Key Management Service (KMS).
I thought I’d share because we learned something pretty important.
If you’re new to SQS, you may be vaguely aware that it’s a general-purpose queue service that lets you post messages onto a queue and then process queued messages in a handler, kind of like this:
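A minimal sketch of what that handler side might look like in Python, assuming a Lambda function processing JSON message bodies (the function name and message fields here are hypothetical):

```python
import json

def handler(event, context):
    """A minimal SQS-triggered Lambda handler sketch.

    The event source mapping invokes the function with a batch of
    records; each record's "body" is the original message body.
    """
    processed = 0
    for record in event["Records"]:
        payload = json.loads(record["body"])  # assumes JSON message bodies
        # ... real work on `payload` would happen here ...
        processed += 1
    return {"processed": processed}

# Producer side (needs AWS credentials, so shown as a comment only):
# import boto3
# boto3.client("sqs").send_message(
#     QueueUrl=queue_url, MessageBody=json.dumps({"order_id": 42}))
```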
There is an event source mapping in between the queue and the handler; we’re going to gloss over that detail for a minute.
SQS is super-handy, reliable, and amazingly scalable; AWS guides us to use it anywhere that we need to buffer communications between components of a solution.
There’s even a mechanism to deal with failures. You can configure a policy on your queue so that if SQS fails to deliver¹ a message to the handler enough times, the message gets moved to another queue for special handling. This second queue is called a “dead-letter” queue, named after the post-office concept where undeliverable mail is sent to a special place for handling.
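That policy is a queue attribute called RedrivePolicy; as a sketch, applying it with boto3 might look like this (the queue names and ARN are hypothetical, and maxReceiveCount is the “enough times” threshold):

```python
import json

def redrive_policy(dlq_arn, max_receive_count=3):
    """Build the RedrivePolicy attribute value (a JSON string) that tells
    SQS to move a message to the DLQ after max_receive_count failed
    receives."""
    return json.dumps({
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    })

# Applied to the primary queue (not executed here):
# sqs = boto3.client("sqs")
# sqs.set_queue_attributes(
#     QueueUrl=primary_queue_url,
#     Attributes={"RedrivePolicy": redrive_policy(dlq_arn)},
# )
```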
¹ “SQS fails to deliver” is not technically accurate: SQS does not deliver messages. Rather, the event source mapping retrieves the message, attempts to deliver it to the Lambda function, and deletes the message if it was handled successfully. If there is an error retrieving or processing the message, SQS makes the message available again after the visibility timeout expires. Eventually the message either expires after the queue’s retention period or, after a specified number of retrievals, is moved to the dead-letter queue.
You may have a mental model of this setup that looks a bit like this:
SQS also lets you encrypt messages in the queue; you can use an AWS-managed key or a customer-managed key. Our teams always use customer-managed keys, so we have an SQS queue with a KMS CMK and DLQs (say that three times fast!). The updated model looks a bit like this:
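In boto3 terms, the encryption key is just another queue attribute, KmsMasterKeyId; a sketch with hypothetical key alias and ARN values:

```python
import json

def encrypted_queue_attributes(kms_key_id, dlq_arn, max_receive_count=3):
    """Attributes for a KMS-encrypted primary queue with a DLQ attached."""
    return {
        "KmsMasterKeyId": kms_key_id,  # alias or ID of a customer-managed key
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": str(max_receive_count),
        }),
    }

# sqs.create_queue(QueueName="orders",           # not executed here
#                  Attributes=encrypted_queue_attributes(
#                      "alias/orders-key", dlq_arn))
```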
If you’re an SQS expert, someone who has read the documentation carefully, or someone who read very carefully through the paragraphs above, you will instantly see the problem with this picture. If you didn’t catch it, you are not alone.
The key phrasing above is that undeliverable messages are moved from the primary queue to the dead-letter queue. They’re not retrieved from the primary queue and then sent onto the dead-letter queue, they’re moved. This isn’t explicitly stated in the documentation we have been able to find, but it’s clear from the behaviour we experienced and a couple of other hints in the documentation.
Our team discovered that the DLQ handler wasn’t processing messages, and in fact its event source mapping (the invisible thing we agreed to gloss over) was entirely disabled. An extremely helpful AWS support person was able to tell us why.
When messages are put onto a primary queue that has KMS encryption enabled, they are encrypted using that queue’s KMS key. The handler must also have permissions to decrypt messages using the key, and we had that all working perfectly.
The Lambda function does not use the decryption permission itself; rather, it “loans” that permission to the event source mapping. In our case, the Lambda function’s execution role (and therefore its event source mapping) did not have sufficient permissions to decrypt the message, so the mapping failed to decrypt the message after retrieving it from the DLQ and did not invoke the Lambda function.
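Concretely, the “loaned” permission is a kms:Decrypt statement on the function’s execution role; a sketch of the relevant IAM policy fragment, with a hypothetical key ARN:

```python
def decrypt_statement(key_arn):
    """IAM policy statement letting the execution role (and therefore the
    event source mapping) decrypt messages encrypted with key_arn."""
    return {
        "Effect": "Allow",
        "Action": ["kms:Decrypt"],
        "Resource": key_arn,
    }

# Attach this to the handler's execution role; the key ARN is hypothetical.
policy = {
    "Version": "2012-10-17",
    "Statement": [decrypt_statement(
        "arn:aws:kms:us-east-1:123456789012:key/1234abcd-hypothetical")],
}
```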
If you have a flawed understanding of the DLQ like we did, you might expect that messages would be decrypted and then re-encrypted when they are sent on the DLQ. However, they’re not decrypted and re-encrypted. They’re not even sent on the DLQ. They’re moved from the primary queue to the DLQ.
We had set up our DLQ with a separate key, the DLQ handler had permissions to use that key, and in tests where we sent messages directly to the DLQ the handler worked perfectly. However, in a full end-to-end scenario where messages fell off the primary queue and were moved to the DLQ, the event source mapping failed to decrypt them (it was set up with permissions on the DLQ key, not the primary queue key) and did not invoke the handler.
AWS is super-smart, and they know not to try things forever when there’s persistent failure. The event source mapping eventually disables itself and emits a CloudTrail event when it does so. If you’re vigilant or, preferably, if you set up a CloudWatch alarm based on the ApproximateAgeOfOldestMessage metric on the DLQ, you’ll be able to detect that messages are queuing up in the DLQ and not getting handled.
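Such an alarm might be sketched with boto3 like this; the alarm name, queue name, and threshold are hypothetical choices:

```python
# Sketch of a CloudWatch alarm on the DLQ's ApproximateAgeOfOldestMessage
# metric. Alarm name, queue name, and threshold are hypothetical.
alarm_params = {
    "AlarmName": "orders-dlq-stuck-messages",
    "Namespace": "AWS/SQS",
    "MetricName": "ApproximateAgeOfOldestMessage",
    "Dimensions": [{"Name": "QueueName", "Value": "orders-dlq"}],
    "Statistic": "Maximum",
    "Period": 300,              # evaluate over 5-minute windows
    "EvaluationPeriods": 1,
    "Threshold": 3600,          # alarm if the oldest message is over an hour old
    "ComparisonOperator": "GreaterThanThreshold",
}
# boto3.client("cloudwatch").put_metric_alarm(**alarm_params)  # not executed here
```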
It’s important to remember that because messages are moved and not re-sent on the DLQ, ApproximateAgeOfOldestMessage does not reflect the time the message has been in the DLQ but rather the time since the message was originally sent on the primary queue. This is also an important thing to remember when you are configuring your DLQ: if you use the same policy to age messages out of your DLQ as you have on your primary queue, you may find that messages disappear without getting handled the way you expected.
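For example, giving the DLQ a longer retention period than the primary queue leaves moved messages (whose age clock started on the primary queue) time to be handled; a sketch with hypothetical retention values:

```python
# Sketch: give the DLQ a longer MessageRetentionPeriod than the primary
# queue so moved messages don't age out before they are handled.
PRIMARY_RETENTION_SECONDS = 4 * 24 * 3600   # 4 days (hypothetical)
DLQ_RETENTION_SECONDS = 14 * 24 * 3600      # 14 days, the SQS maximum

dlq_attributes = {"MessageRetentionPeriod": str(DLQ_RETENTION_SECONDS)}
# sqs.set_queue_attributes(QueueUrl=dlq_url, Attributes=dlq_attributes)
```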
Did I mention that AWS does document the requirement to use the same key?
Your source queue and its corresponding dead-letter queue need to share the same KMS key.
Oops. 😳
We’re looking into creating a cfn-lint custom rule to prevent us from tripping over this issue again.
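A sketch of the check such a rule would perform, written as plain Python over a parsed template’s Resources section (resource names are hypothetical, and a real cfn-lint rule would wrap this logic in cfn-lint’s rule API and handle more ARN reference forms):

```python
def mismatched_dlq_keys(resources):
    """Return logical IDs of SQS queues whose KmsMasterKeyId differs from
    their dead-letter queue's, scanning a parsed template's Resources."""
    problems = []
    for name, res in resources.items():
        if res.get("Type") != "AWS::SQS::Queue":
            continue
        props = res.get("Properties", {})
        redrive = props.get("RedrivePolicy")
        if not redrive:
            continue
        target = redrive.get("deadLetterTargetArn")
        # Handle the common Fn::GetAtt reference to another queue's Arn.
        if isinstance(target, dict) and "Fn::GetAtt" in target:
            dlq_name = target["Fn::GetAtt"][0]
            dlq_props = resources.get(dlq_name, {}).get("Properties", {})
            if props.get("KmsMasterKeyId") != dlq_props.get("KmsMasterKeyId"):
                problems.append(name)
    return problems

# A hypothetical template exhibiting exactly the mistake described above:
resources = {
    "Primary": {"Type": "AWS::SQS::Queue", "Properties": {
        "KmsMasterKeyId": "alias/primary-key",
        "RedrivePolicy": {
            "deadLetterTargetArn": {"Fn::GetAtt": ["Dlq", "Arn"]},
            "maxReceiveCount": 3}}},
    "Dlq": {"Type": "AWS::SQS::Queue", "Properties": {
        "KmsMasterKeyId": "alias/dlq-key"}},
}
```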