
Lambda@Edge: Select test origin or stick to the old

11 March 2024

Last Friday I joined the live stream of AWS Power Hour: Architecting Professional Season 2. As in every episode, we went through some example questions. I decided to take one of them and implement a working solution. The question I picked states: there's a company that uses CloudFront to host a website on two origin servers. They deployed a new app version on one of the origins and want to route a small percentage of users to the new app. However, users currently on the old version should be able to finish their activities there. The answer is: implement a new Lambda@Edge function that examines requests from users, checks if the session in the cookie is new, and only then routes some users to the new version. Users with a session in progress should remain on their version. For reference, check out this Power Hour episode, and here is the full schedule if you are interested.

It's possible to just keep the application version in the cookie. Other options include JWT tokens with an issue date (iat) and some other attributes, or keeping the session in ElastiCache for Redis or DynamoDB. I decided to go with the last approach, where session details - such as the version - will be stored in a DynamoDB table and the user will just send a cookie containing the item's primary key. To decrease latency, I will use a DynamoDB Global Table and pick the nearest replica in the Lambda function.

Solution overview

Also check out the repository for this article

An example application

First we need an application that will be served by CloudFront but that will also generate some cookies. In this example we will use a simple FastAPI application combined with the AWS SDK that will store sessions in DynamoDB. When the user enters the application, they will be issued a session cookie valid for 10 minutes, and with each refresh the session will be updated in the table.

First, let's create a simple FastAPI application that will just return an HTML page with some CSS. config.py contains some values for the CSS and the application version code, which is critical to this task as it must match the value used by Lambda@Edge later.

# config.py
BACKGROUND_COLOR="#F0A5C1"
FONT_FAMILY="Times New Roman, Serif"
COLOR="#255"
VERSION_NAME="Version 1.0"
VERSION_CODE="1"
# index.py
from fastapi import FastAPI, Cookie, Response
from fastapi.responses import HTMLResponse
from typing import Annotated

import config

app = FastAPI()

@app.get("/", response_class=HTMLResponse)
async def root(response: Response, sessiontoken: Annotated[str | None, Cookie()] = None):

    return f"""
    <html>
        <head>
            <title>Application {config.VERSION_NAME}</title>
            <style>
                body {{
                    background-color: {config.BACKGROUND_COLOR};
                    color: {config.COLOR};
                    font-family: {config.FONT_FAMILY};
                }}
            </style>
            <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
        </head>
        <body>
            <h1>This is application {config.VERSION_NAME}</h1>
            <p>You now got a cookie :)</p>
        </body>
    </html>
    """

For simplicity of development, I will create a Dockerfile that embeds our application. We will use a build argument so that we can reuse the same Dockerfile for the next version.

FROM python:3.11-alpine
RUN pip3 install "uvicorn[standard]" "fastapi" "boto3"

WORKDIR /app
ARG VERSION=version1
COPY app-$VERSION/* /app/

ENTRYPOINT [ "uvicorn", "index:app", "--host", "0.0.0.0" ]

You can then build the image locally and run the application for testing. It just serves a dummy HTML template. In the browser, navigate to localhost and the port you chose when running the container with Docker.

$ docker build -t ppabis/cf-lambda-origins:version1 .
$ docker run --rm -it -p 38888:8000 ppabis/cf-lambda-origins:version1

First version of the application

Creating sessions table

Now we will create a DynamoDB table that will hold the sessions of our users. The table will have only three fields: token, version and expiration. The token will be the primary key, version will determine which version we want to stick the user to, and expiration will be the time when the session expires. We don't have to validate expiration on the application side for now; it is just for the sake of keeping the table clean with TTL.
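To make the data model concrete, a single session item will look roughly like this in boto3's low-level attribute-value notation (the token and timestamp are just example values):

# A session item as boto3 sees it - token is the hash key,
# version selects the app and expiration drives the TTL cleanup
item = {
    "token":      {"S": "472ed977dc2fd2700dbb03a58634f789fd58d911"},
    "version":    {"S": "1"},
    "expiration": {"N": "1710151092"},  # epoch seconds
}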

Let's fire up the editor and configure our Terraform module. We will use AWS provider version 5.40.0 and Terraform 1.7.4, and these versions will be used throughout the whole project.

provider "aws" {
  region = "us-east-1"
}

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.40.0"
    }
  }
}

Next we will define our SessionsTable table. As previously mentioned, we will enable TTL cleanup (which will remove expired sessions periodically) and set the primary key to token. We will also make it a global table by adding replicas - one in an eu region and one in an ap region. The original table will reside in the us-east-1 region. We don't declare the version and expiration fields - as the name suggests, DynamoDB is dynamic and non-key attributes don't need to be defined upfront. In theory we could use one of those fields as a sort key or define indexes. However, let's just assume that in case version is not present, we will default to the first version. We will also use PAY_PER_REQUEST billing mode so that we don't have to deal with the autoscaling that a provisioned global table would require.

resource "aws_dynamodb_table" "sessions" {
  name           = "SessionsTable"
  billing_mode   = "PAY_PER_REQUEST"
  hash_key       = "token"

  attribute {
    name = "token"
    type = "S"
  }

  ttl {
    attribute_name = "expiration"
    enabled        = true
  }

  replica { region_name = "ap-southeast-1" }
  replica { region_name = "eu-west-1" }
}

Let's init Terraform and apply the infrastructure. We still continue with the first version of our application.

Adding sessions to the application

In the root() function there are already some hints about what the session cookie will look like. We just need to implement the logic that generates and stores the session. We will use the AWS SDK for Python, boto3, to store the session in the newly created DynamoDB table. We will generate a random token that will serve as the primary key. If update_item doesn't fail, we will set the cookie in the response so that the user's browser can store it. If the cookie was already sent to us with the request, we will just update the expiration time and set the version to the one assigned by the current origin processing the request.

from boto3 import client
from random import randbytes
from hashlib import sha1
from datetime import datetime
import os

# Inside the container AWS_REGION is not set, so default to us-east-1
if os.environ.get("AWS_REGION") is None:
    dynamo = client("dynamodb", region_name="us-east-1")
else:
    dynamo = client("dynamodb")

def update_session(response: Response, sessiontoken: Annotated[str | None, Cookie()] = None):
    # If there is no session token we pick a random one
    if sessiontoken is None:
        sessiontoken = sha1(randbytes(128)).hexdigest()
    # And we update the session token in the database. Update in DynamoDB works also for new items.
    dynamo.update_item(
        TableName="SessionsTable",
        # The key name must match the table's hash key "token" and the
        # TTL attribute "expiration", as defined in Terraform
        Key={ "token": {"S": sessiontoken} },
        UpdateExpression="SET expiration = :expiration, version = :version",
        ExpressionAttributeValues={
            ":expiration": { "N": str(int(datetime.now().timestamp() + 600)) },  # 10 minutes from now
            ":version": { "S": config.VERSION_CODE }
        },
    )
    # We give the user either a new cookie or the same one as before
    response.set_cookie(key="sessiontoken", value=sessiontoken)
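One detail worth noting: set_cookie without extra arguments issues a browser session cookie that lives until the browser closes, while the server-side session expires after 10 minutes. As an optional tweak (not required for this task), you could align the two lifetimes by ending update_session with:

    # Optional: make the browser cookie expire together with the server-side session
    response.set_cookie(key="sessiontoken", value=sessiontoken, max_age=600)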

Now we only have to edit the root function to call update_session inside try and except blocks, to handle cases where DynamoDB throws an error, such as missing permissions.

@app.get("/", response_class=HTMLResponse)
async def root(response: Response, sessiontoken: Annotated[str | None, Cookie()] = None):
    try:
        update_session(response, sessiontoken)
    except Exception as e:
        print(e)
        response.status_code = 500
        return f"""<html><body><h1>ERROR 500</h1><p>We expected unexpected!</p></body></html>"""
    return f"""
    <html>
    ...

If we run the application on our local machine with Docker, we should get this 500 error unless we specified some credentials for AWS. Let's try.

docker build -t ppabis/cf-lambda-origins:latest .
docker run --rm -it -p 38888:8000 ppabis/cf-lambda-origins:latest

Errors locally

We can also verify that we are not receiving any cookies in case of an error. This will ensure that our application doesn't cause any unwanted behavior and errors are handled correctly (assuming 5xx is a correct way to handle it 😁).

$ curl -v http://localhost:38888 
> GET / HTTP/1.1
> 
< HTTP/1.1 500 Internal Server Error
< date: Sat, 09 Mar 2024 10:38:12 GMT
< server: uvicorn
< content-length: 74
< content-type: text/html; charset=utf-8
< 
<html><body><h1>ERROR 500</h1><p>We expected unexpected!</p></body></html>

Hosting the application

Let's spin up an EC2 instance with Docker and host the application there. I will use t4g.nano with Amazon Linux 2023. During boot, user data will install Docker for us. We will also define an IAM role that allows the instance to be accessed via SSM and to read from and write to the DynamoDB table of our choice. We will also open an HTTP port (such as 8080) in the security group so that we can test our application.

For simplicity, I will not type out the whole code for this instance here. Refer to the GitHub repository for the complete solution.

data "aws_ssm_parameter" "AL2023" {
  name = "/aws/service/ami-amazon-linux-latest/al2023-ami-kernel-6.1-arm64"
}

resource "aws_instance" "AppInstance" {
  ami                    = data.aws_ssm_parameter.AL2023.value
  instance_type          = "t4g.nano"
  iam_instance_profile   = aws_iam_instance_profile.AppEC2Role.name
  vpc_security_group_ids = [aws_security_group.AppEC2.id]
  user_data              = <<-EOF
  #!/bin/bash
  yum install -y docker
  systemctl enable --now docker
  usermod -aG docker ec2-user
  EOF
  tags = {
    Name = "AppVersion1"
  }
}

After init and apply, you should find the instance in the AWS console on the list of instances. Right-click on it and select Connect. I will be connecting with Session Manager but feel free to use your own method. After you log in, run docker version to verify that Docker was installed correctly. Now we have to transfer our application to the instance. You could use scp if you SSH'd into it, but there are many other ways to deal with it: S3, vi, SSM parameters, etc. I am an old-school nerd and will use cat with <<EOF 🤓. With this method you have to remember to escape $ as \$.

$ sudo su ec2-user
$ cd
$ cat <<EOF > index.py
from fastapi import FastAPI, Cookie, Response
# the file continues...
EOF
$ cat <<EOF > Dockerfile
FROM python:3.11-alpine
# the file continues...
COPY app-\$VERSION/* /app/
EOF
$ cat <<EOF > config.py
...
EOF
$ mkdir -p app-version1 && mv index.py app-version1/ && mv config.py app-version1/

Now it's time to build and run the application. If you use Session Manager, switch to ec2-user with su. We will run the application exposed on the port we opened in our security group; in my case it is 8080.

$ docker build -t ppabis/cf-lambda-origins:latest .
$ docker run --rm -it -p 8080:8000 ppabis/cf-lambda-origins:latest

Now, don't disconnect from your SSH/SSM session. Open the AWS console with EC2 instances in a new tab and find the public IP. Go to http://127.255.255.0:8080 (replace with the IP of your instance) and you should see the page. Open the developer console (in Safari it is Option+Command+I) and go to the Network tab. It should show that you got a cookie in the response. Refresh, and you should send the cookie with the request and get the same one back in the response.

Cookie received

Cookie sent and received

Remember the first characters of the cookie. Go to DynamoDB in the AWS console (N. Virginia/us-east-1) and find the SessionsTable table. Click Explore table items and check if the same cookie is stored there. If you see it, we were successful in this part of the task. Switch to some other region (such as Ireland/eu-west-1) and check that the table is there too. Explore the items - they should be replicated.

Items in us-east-1
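If you prefer the terminal over clicking through the console, a quick boto3 sketch like the one below can check all the replicas at once (assuming your local credentials can read the table; paste in your own token):

from boto3 import client

# Token copied from the browser's developer tools - replace with your own
token = "472ed977dc2fd2700dbb03a58634f789fd58d911"

for region in ["us-east-1", "eu-west-1", "ap-southeast-1"]:
    dynamo = client("dynamodb", region_name=region)
    item = dynamo.get_item(TableName="SessionsTable", Key={"token": {"S": token}})
    print(region, item.get("Item"))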

Now hit Ctrl+C in the terminal where you ran the application. Run the Docker container in the background with the -d flag. We will not remove the container and we will name it appropriately so that we can read its logs in case of issues. We also set the restart policy to always so that the container starts again after a failure.

docker run -d --name "app_v1" -p 8080:8000 --restart always ppabis/cf-lambda-origins:latest

Now you can disconnect from this instance. We will continue the next part with the updated application version hosted on another instance.

New application version

Go back to the Terraform file defining the instance. Add count = 2 so that two instances are created in total. Also update the tags to reflect this change. The new instance will be a clone of the first one, but we will deploy a different app version there.

resource "aws_instance" "AppInstance" {
  ami                    = data.aws_ssm_parameter.AL2023.value
  ...
  tags = {
    Name = "AppVersion${count.index + 1}"
  }
}

Apply this change. Terraform should only create one new instance and move the current one to index 0. If that doesn't happen, you need to repeat the previous steps or import the instance with terraform import before hitting apply.

Let's change the app for the second version. Copy both files from the app-version1 directory to a new app-version2 directory. We will only modify some values in config.py to distinguish between the two applications.

BACKGROUND_COLOR="#08122C"
FONT_FAMILY="Calibri, Helvetica, Sans-Serif"
COLOR="#E8C2C8"
VERSION_NAME="Version 2.0"
VERSION_CODE="2"

Copy the files over to the instance using your preferred method. I will use cat again. Next, build the Docker image with an extra argument that specifies the new version. Run the container in the background the same way as before.

$ docker build -t ppabis/cf-lambda-origins:2.0 --build-arg VERSION=version2 .
$ docker run -d --name "app_v2" -p 8080:8000 --restart always ppabis/cf-lambda-origins:2.0

Go to the AWS console and get the public IP of the new instance. Open that address in your browser. You should get the newer version of the app. Let's also explore the SessionsTable in DynamoDB to check if the new sessions are created with the correct version number.

New app version

DynamoDB sessions V2

Putting application behind CloudFront

We currently have two versions of the application. In the question we were told that, in the current state, only the first version of the application is served by CloudFront. So let's create a CloudFront distribution that will have just one origin for now; we will figure out the rest later.

We will need the DNS name of the EC2 instance to specify as the CloudFront origin. There are many ways to get it, not necessarily copy and paste from the console 😉. If we composed the previous Terraform directories as modules into one project, we could use outputs and variables. But I will use a data source in this example.

data "aws_instance" "AppInstanceV1" {
  instance_tags = { Name = "AppVersion1" }
}

resource "aws_cloudfront_distribution" "AppDistribution" {
  default_cache_behavior {
    allowed_methods        = ["GET", "HEAD", "OPTIONS"]
    cached_methods         = ["GET", "HEAD"]
    target_origin_id       = "AppVersion1"
    viewer_protocol_policy = "redirect-to-https"
    forwarded_values {
      query_string = true
      cookies { forward = "all" }
    }
  }

  viewer_certificate {
    cloudfront_default_certificate = true
    minimum_protocol_version       = "TLSv1.2_2019"
  }

  restrictions {
    geo_restriction { restriction_type = "none" }
  }

  enabled = true

  origin {
    custom_origin_config {
      http_port              = 8080
      https_port             = 443
      origin_protocol_policy = "http-only"
      origin_ssl_protocols   = ["TLSv1.2"]
    }
    domain_name = data.aws_instance.AppInstanceV1.public_dns
    origin_id   = "AppVersion1"
  }
}

output "CloudFrontDNS" {
  value = aws_cloudfront_distribution.AppDistribution.domain_name
}

Apply might take some time. The distribution above has a single origin (the first instance, reached over plain HTTP on port 8080), redirects viewers to HTTPS, and forwards all query strings and cookies to the origin, so our session cookie passes through.

After the CloudFront distribution is created, you can get its domain name from the AWS console or from the Terraform outputs. Go to that address in your browser. You should be connected to the instance via HTTPS (or to CloudFront at least).

CloudFront serving V1 app

Creating Lambda@Edge function

Now it's time to finally add some routing logic to our solution. We will create a Lambda function in the us-east-1 region in Python. The function will be triggered by CloudFront on the origin request event. It will contact the nearest DynamoDB table based on the current execution region (Lambda@Edge is distributed across multiple locations) and check if the user has a valid cookie that points to a valid application version. The function will then change the origin if needed. It would be nice to store the second origin in environment variables, but that's not possible with Lambda@Edge. For simplicity, let's just hardcode the values.

For convenience of development we will include the hashicorp/archive provider to create ZIP files. Add this to your main.tf file or any other file you use for storing the providers. Run terraform init afterwards.

terraform {
  required_providers {
    ...
    archive = {
      source  = "hashicorp/archive"
      version = ">= 2.0.0"
    }
  }
}

Let's start with the Lambda function code in Python. I will use the draft from the AWS docs. The function will first select which DynamoDB replica to use based on its execution region. We defined replicas only for eu-west-1 and ap-southeast-1, so for all other continents we will just default to us-east-1. For example, a function executing in eu-central-1 will read from eu-west-1, while one running in us-west-2 or sa-east-1 falls back to us-east-1.

import boto3, os

# To reduce latency, we will pick one of the tables based on the region
aws_region = os.environ.get("AWS_REGION")
aws_region = "us-east-1" if aws_region is None else aws_region
if aws_region.startswith("eu-"):
    aws_region = "eu-west-1"
elif aws_region.startswith("ap-"):
    aws_region = "ap-southeast-1"

dynamo = boto3.client("dynamodb", region_name=aws_region)

Next we will create two functions: one that picks a random origin and one that determines the version based on the session token by fetching the session details from DynamoDB. If the session doesn't contain a version, or there's no session at all, we will just select a random origin with a 50/50 chance. Provide your own second origin's DNS address in the function.

import random
second_origin = "ec2-127-128-20-164.compute-1.amazonaws.com" # Get DNS address of your own secondary instance

def pick_random(request):
    # If we rolled heads, change the origin
    if random.choice([0, 1]) == 0:
        request['origin']['custom']['domainName'] = second_origin
    # Else just return the request unchanged
    return request

def select_version(token, request):
    item = dynamo.get_item(
        TableName="SessionsTable",
        Key={ "token": {"S": token} }
    )
    # If this token is found in the table and this session has a version, we stick to it
    if 'Item' in item and 'version' in item['Item']:
        if item['Item']['version']['S'] == "2":
            request['origin']['custom']['domainName'] = second_origin
    # Session might be invalid so roll any origin 50/50
    else:
        request = pick_random(request)

    return request

The last part is to define the handler that will process the CloudFront request. Here we need to extract the cookies from the request headers, look for the session token cookie (if it exists) and call the appropriate function.

def handler(event, context):
    token = None
    request = event['Records'][0]['cf']['request']
    headers = request['headers']

    # A single Cookie header entry can carry multiple cookies separated
    # by "; ", so split every entry into name=value pairs before matching
    for header in headers.get('cookie', []):
        for pair in header['value'].split(';'):
            name, _, value = pair.strip().partition('=')
            if name == 'sessiontoken':
                token = value
                break

    if token:
        request = select_version(token, request)
    else:
        # As with the invalid session case, we roll any origin 50/50
        request = pick_random(request)

    return request
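Before deploying, you can sanity-check the routing logic locally. Below is a minimal harness of my own (not part of the deployed code) that calls handler() with a stripped-down origin-request event. The token is hypothetical, and since select_version reaches out to DynamoDB, run it with credentials that can read SessionsTable.

# test_origin_selector.py - minimal local harness for the function above
from origin_selector import handler

event = {
    "Records": [{
        "cf": {
            "request": {
                "headers": {
                    "cookie": [
                        # Replace with a token that exists in your SessionsTable
                        {"key": "Cookie", "value": "sessiontoken=472ed977dc2fd2700dbb03a58634f789fd58d911"}
                    ]
                },
                "origin": {"custom": {"domainName": "example.org"}}
            }
        }
    }]
}

# Prints either example.org or the hardcoded second origin
print(handler(event, None)["origin"]["custom"]["domainName"])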

Now we will define IAM policies for the function. We need the Lambda to be executable by CloudFront and able to read from DynamoDB. The role must be assumable by both lambda.amazonaws.com and edgelambda.amazonaws.com. We will just give it permission to read from the table.

resource "aws_iam_role" "OriginSelectionLambdaRole" {
  name = "OriginSelectionLambdaRole"
  assume_role_policy = jsonencode({
    Version = "2012-10-17",
    Statement = [ {
      Action = "sts:AssumeRole",
      Effect = "Allow",
      Principal = {
        Service = [
          "lambda.amazonaws.com",
          "edgelambda.amazonaws.com"
        ]
      }
    } ]
  })
}

data "aws_iam_policy_document" "DynamoDBSessionsTable" {
  statement {
    actions   = [ "dynamodb:GetItem" ]
    resources = [ "arn:aws:dynamodb:*:*:table/SessionsTable" ]
  }
}

resource "aws_iam_role_policy" "DynamoDBRead" {
  name   = "DynamoDBRead"
  role   = aws_iam_role.OriginSelectionLambdaRole.id
  policy = data.aws_iam_policy_document.DynamoDBSessionsTable.json
}

As stated before, we can utilize HashiCorp's archive provider to create the ZIP file needed for deploying the Lambda function. It also handles generating the hash of the file in the required format. We will also set publish = true so that applying the infrastructure creates an actual version of the function, as we cannot use $LATEST with CloudFront.

data "archive_file" "lambda" {
  type        = "zip"
  source_file = "${path.module}/origin_selector.py"
  output_path = "${path.module}/origin_selector.zip"
}

resource "aws_lambda_function" "OriginSelector" {
  filename         = data.archive_file.lambda.output_path
  function_name    = "OriginSelector"
  role             = aws_iam_role.OriginSelectionLambdaRole.arn
  handler          = "origin_selector.handler"
  runtime          = "python3.11"
  source_code_hash = data.archive_file.lambda.output_base64sha256
  publish          = true
}

Before we attach the function to CloudFront, let's test it in the AWS console. Go to the Lambda service, enter the function and select the Test tab. Create a new test event and use the JSON from this page. You can modify the request headers. I added the sessiontoken cookie as the second cookie to test the function completely. You can also remove it to see if the function keeps the origin the same as in the test event.

//...
            ],
            "cookie": [
             {
                 "key": "Cookie",
                 "value": "tracking=123456"
             },
             {
                 "key": "Cookie",
                 "value": "sessiontoken=472ed977dc2fd2700dbb03a58634f789fd58d911"
             }
            ],
            "user-agent": [
//...

I inserted an actual working token from the DynamoDB table, one with version 2, to see if the origin changes from example.org to the one specified in the function.

Origin changed

Everything seems to be working. Now we can attach the function to CloudFront. Go back to its resource, to the default cache behavior block, and add the Lambda function association for the origin-request event. Since the distribution lives in a separate Terraform directory, we look the function up with a data source. Also pick the appropriate version in the ARN - if you deployed the working function successfully on the first try, it will be 1; otherwise browse the Versions tab of the Lambda function in the AWS console and choose the latest one.

resource "aws_cloudfront_distribution" "AppDistribution" {
  default_cache_behavior {
    lambda_function_association {
      event_type   = "origin-request"
      lambda_arn   = "${data.aws_lambda_function.origin_request.arn}:1"
      include_body = false
    }
  ...
}

After applying the change, you can try opening and closing a private browser window and navigating to the page multiple times, or clearing the cookies. You can also use this curl almost-one-liner. It should show how the version of the app switches between requests made without a cookie - that is, a random origin is chosen each time.

for i in {1..20}; do                                   
  curl -s https://$(terraform output -raw CloudFrontDNS) | grep "Version"
  sleep 3
done

App versions switching

Once you visit the website with a browser, you should stay on the same version no matter how many times you refresh the page, even with cache disabled.
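You can also verify the stickiness from code. Here is a short sketch using the requests library (an assumption on my side - it is not used elsewhere in this project) that reuses one cookie jar, just like a browser does:

import re
import requests  # assumed installed: pip install requests

url = "https://dxxxxxxxxxxxxx.cloudfront.net"  # replace with your CloudFrontDNS output
session = requests.Session()  # persists the sessiontoken cookie between requests

versions = set()
for _ in range(10):
    html = session.get(url).text
    versions.add(re.search(r"Version \d\.\d", html).group(0))

# With a sticky session this should print exactly one version
print(versions)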

Is the table really global?

Observe the activity in DynamoDB. The apps use only the us-east-1 table. However, I am based in Europe, so I expect Lambda@Edge to use the eu-west-1 table. You can also try connecting via VPN or making an SSH tunnel through AWS to change your region. The charts below show the usage of both the us-east-1 and eu-west-1 tables.
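If you'd rather pull the numbers than read the charts, a boto3 sketch along these lines sums the consumed read capacity of each replica over the last hour (ConsumedReadCapacityUnits is the standard DynamoDB CloudWatch metric; credentials are assumed):

from datetime import datetime, timedelta
import boto3

# Sum the read capacity consumed by each replica over the last hour
for region in ["us-east-1", "eu-west-1"]:
    cloudwatch = boto3.client("cloudwatch", region_name=region)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/DynamoDB",
        MetricName="ConsumedReadCapacityUnits",
        Dimensions=[{"Name": "TableName", "Value": "SessionsTable"}],
        StartTime=datetime.utcnow() - timedelta(hours=1),
        EndTime=datetime.utcnow(),
        Period=3600,
        Statistics=["Sum"],
    )
    total = sum(point["Sum"] for point in stats["Datapoints"])
    print(region, total)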

Activity in us-east-1

Activity in eu-west-1