Automatically Tagging Uploads to S3 - Part II

Published on 08 November 2021

I previously wrote a post on how to automatically tag uploads to S3, which can be found here. This update has a few changes: firstly, everything is now deployed via Terraform, and secondly, there is a small change to the code of the lambda function.

Components

The components within AWS remain pretty similar; we need the following:

  • A lambda function (written in Python here)
  • An S3 bucket (with a Terraform generated name)
  • An IAM role
  • An IAM policy
  • Permissions for S3 to trigger the lambda function
  • An SQS queue

The queue is an addition that receives a notification each time a tag has been applied. You could then connect something to retrieve those messages and act on them.

Lastly, it is done via Terraform, so that you can apply/destroy as often as required.

One thing to note is that if you want to use a specific or existing bucket, you would need to modify the code to remove the bucket resource, and then apply the permissions to the bucket of your choice using the bucket's ARN, which you could pass in as a variable.

The lambda function is created using the "archive_file" data source, which takes all the files in the "/files" subfolder and places them in a zip, which is then used as the source for the function. Alternatively, you could pass the code inline. One advantage of doing it this way (with a file) is that it won't automatically trigger a change to your Terraform just because you have updated the code. That is something that would happen if you edited the code inline and then ran "terraform apply".
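For reference, a minimal sketch of that part might look something like the following. The handler name, runtime and function name here are assumptions on my part; the resource names match those referenced later in the post.

        # Zip up everything in the /files subfolder to use as the function source
        data "archive_file" "lambda_zip" {
          type        = "zip"
          source_dir  = "${path.module}/files"
          output_path = "${path.module}/files.zip"
        }

        resource "aws_lambda_function" "s3lambda" {
          function_name = "${random_id.id.hex}-s3-tagger"   # assumed name
          filename      = data.archive_file.lambda_zip.output_path
          handler       = "lambda_function.lambda_handler"  # assumed handler file/function
          runtime       = "python3.9"
          role          = aws_iam_role.s3_lambda_role.arn
          # Uncomment if you want code changes to force a redeploy on apply:
          # source_code_hash = data.archive_file.lambda_zip.output_base64sha256
        }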

The Lambda function code

The code is pretty similar to the original, with just one small change.

        import json
        import urllib.parse
        import boto3
        import datetime

        print('Loading function')

        s3 = boto3.client('s3')
        now = datetime.datetime.now()
        tagValue = now.date()
        tagName = "MyTag"


        def lambda_handler(event, context):
            print("Recevied event: " + json.dumps(event, indent=2))
            
            #Get the object from the event
            bucket = event['Records'][0]['s3']['bucket']['name']
            key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
            
            if not key.endswith("/"):
                try:
                    response = s3.put_object_tagging(
                        Bucket = bucket,
                        Key = key,
                        Tagging={
                            'TagSet': [
                                {
                                    'Key': tagName,
                                    'Value': str(tagValue)
                                },
                            ]
                        }
                    )
                except Exception as e:
                    print(e)
                    print('Error applying tag {} to {}.'.format(tagName, key))
                    raise e

The change is to put the actual tagging attempt under an if statement:

        if not key.endswith("/"):

This just ensures that you don't tag "folders" (technically keys ending in a slash) in S3. If you do want to tag them, simply remove the if statement. The reason for skipping them is that if you later decide to delete all objects with a particular tag, you could remove the "folder" too, which you may not want to do.

You may also want to remove some of the "print" statements. I have them in there so that I can get a good picture of what is going on in CloudWatch, but once you are happy that everything is working, you could remove them.

The S3 Bucket

Terraform will create a new bucket with a Terraform-generated name. By default, this gives the bucket a name that looks something like this:

        terraform-20211127132921436100000001
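That is what you get if you declare the bucket resource without a name, something like this (a minimal sketch):

        # No "bucket" argument, so Terraform generates a unique name for it
        resource "aws_s3_bucket" "bucket" {
        }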

If you want, you could specify a name. I use the "random_id" resource to generate a unique prefix, so you could add a line to the aws_s3_bucket resource that says something like:

        function_name = "${random_id.id.hex}-Mybucket"

The result would be something like:

        35213353-mybucket
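A sketch of that approach, assuming a 4-byte random ID (which gives the 8 hex characters shown above):

        resource "random_id" "id" {
          byte_length = 4
        }

        resource "aws_s3_bucket" "bucket" {
          bucket = "${random_id.id.hex}-mybucket"
        }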

If you want to use an existing bucket, that can be done, but you will need to update the Terraform code to refer to your bucket, rather than the bucket we define here. You could pass the ARN in via a variable.
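If you go that route, one option (just an illustration; the variable and data source names are assumptions) is to look the existing bucket up by name and reference its attributes wherever this post uses aws_s3_bucket.bucket:

        variable "existing_bucket_name" {
          description = "Name of an existing bucket to attach the tagging function to"
          type        = string
        }

        data "aws_s3_bucket" "existing" {
          bucket = var.existing_bucket_name
        }

        # Then use data.aws_s3_bucket.existing.id and data.aws_s3_bucket.existing.arn
        # in place of aws_s3_bucket.bucket.id / aws_s3_bucket.bucket.arn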

One thing you then need to do is add the bucket notification so that S3 will trigger your Lambda function when something is uploaded. There are a couple of commented-out lines here that you could use to limit the notification to specific prefixes (keys) or file suffixes.

        resource "aws_s3_bucket_notification" "s3_bucket_notification" {
            bucket = aws_s3_bucket.bucket.id
            lambda_function {
                lambda_function_arn = aws_lambda_function.s3lambda.arn
                events              = ["s3:ObjectCreated:*"]
                #filter_prefix       = "AWSLogs/"
                #filter_suffix       = ".log"
            }
            depends_on = [
                aws_iam_role_policy_attachment.logging_policy_attach,
                aws_lambda_permission.allow_bucket,
                aws_lambda_function.s3lambda
            ]
        }

We also have a "depends_on" section there. This is to make sure that everything else is in place before we attempt to create this notification. If you do not have it, Terraform can try to create the notification before the function and the permissions are in place, which will throw an error. Normally you don't need to worry about ordering with declarative code, but this is one of the occasions where you do.
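The aws_lambda_permission.allow_bucket resource referenced in that depends_on is what allows S3 to invoke the function. Its code is not shown above, but a sketch might look like this (the statement_id is an assumption):

        resource "aws_lambda_permission" "allow_bucket" {
          statement_id  = "AllowExecutionFromS3Bucket"
          action        = "lambda:InvokeFunction"
          function_name = aws_lambda_function.s3lambda.function_name
          principal     = "s3.amazonaws.com"
          source_arn    = aws_s3_bucket.bucket.arn
        }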

IAM Role and IAM Policy

We then create a new IAM Role and an IAM Policy.

        resource "aws_iam_role" "s3_lambda_role" {
        name               = "s3_lambda_function_role"
        assume_role_policy = <<EOF
        {
        "Version": "2012-10-17",
        "Statement": [
            {
            "Action": "sts:AssumeRole",
            "Principal": {
                "Service": "lambda.amazonaws.com"
            },
            "Effect": "Allow"
            }
        ]
        }
        EOF
        }


        # IAM policy for logging from a lambda
        resource "aws_iam_policy" "lambda_logging" {
        name        = "LambdaLoggingPolicy"
        path        = "/"
        description = "IAM policy for logging from a lambda"
        policy      = <<EOF
        {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Action": [
                        "logs:CreateLogGroup",
                        "logs:CreateLogStream",
                        "logs:PutLogEvents"
                    ],
                    "Resource": "arn:aws:logs:*:*:*",
                    "Effect": "Allow"
                },
                {
                    "Sid": "VisualEditor0",
                    "Effect": "Allow",
                    "Action": "s3:PutObjectTagging",
                    "Resource": "${aws_s3_bucket.bucket.arn}/*"
                },
                {
                    "Action": [
                        "sqs:ReceiveMessage",
                        "sqs:DeleteMessage",
                        "sqs:GetQueueAttributes",
                        "sqs:SendMessage",
                        "logs:CreateLogGroup",
                        "logs:CreateLogStream",
                        "logs:PutLogEvents"
                    ],
                    "Resource": "${aws_sqs_queue.sqs_queue.arn}",
                    "Effect": "Allow"
                }
            ]
        }
        EOF    
        }

Note the interpolation, which ensures the s3:PutObjectTagging permission only applies to the specific bucket in question. Even if you are just playing around, it is always good to take some steps to improve security when you get a chance.

You then need to attach the policy to the role:

        # Attaching the policy to the role
        resource "aws_iam_role_policy_attachment" "logging_policy_attach" {
            role       = aws_iam_role.s3_lambda_role.name
            policy_arn = aws_iam_policy.lambda_logging.arn
        }

SQS

Almost done. We then create the SQS queue, complete with a random prefix on the name:

        resource "aws_sqs_queue" "sqs_queue" {
            name = "${random_id.id.hex}-lambda-s3-tag"
        }

And then we need to create the event invoke config to send a notification to SQS when the lambda function completes successfully.

        resource "aws_lambda_function_event_invoke_config" "invokesuccess" {
            function_name = aws_lambda_function.s3lambda.function_name
            destination_config {
                on_success {
                destination = aws_sqs_queue.sqs_queue.arn
                }
            }
        }

And that is it!

Testing

Simply upload a file to your bucket and it should be tagged automatically. Look at the properties of the object in the S3 console to see the tag.

(Screenshot: the tag shown in the object's properties.)

You can then go to SQS, click on your queue, choose "Send and receive messages", scroll down and choose "Poll for messages". You should be able to see any messages there.

(Screenshot: polling for messages in the SQS console.)

Troubleshooting

After an upload, you can check for the tag on the object straight away; if it is not there, check CloudWatch Logs to get an idea of the likely issue.

For SQS, you can use either the command line or the SQS console; in both cases you should see the created queue and its messages.

Source

You can also find the source for this on GitHub.

Direct to EventBridge

Amazon has just announced (Nov '21) that you can now send your notifications directly to EventBridge. This gives you additional filtering options, and of course the opportunity to use different (and multiple) destinations with less code.
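If you want to experiment with that, recent versions of the AWS provider let you turn it on from the same notification resource. A sketch, assuming a provider version that supports the eventbridge argument:

        # Send bucket notifications to EventBridge instead of invoking the function directly
        resource "aws_s3_bucket_notification" "s3_bucket_notification" {
          bucket      = aws_s3_bucket.bucket.id
          eventbridge = true
        }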
