pabis.eu

CloudWatch Metrics? No, thank you. Timestream (feat CloudFormation)!

10 November 2023

Ok, Hashi, that's it. I'm giving up Terraform. And even OpenTofu doesn't taste right anymore. Amazon convinced me: CloudFormation is the leader of Infrastructure as Code. But today, I'll break even more rules. CloudWatch Metrics? Nah, let's put our CPU and memory usage into Timestream. Because why not? Having a graph of your filesystem usage updated every minute is not enough. We need real-time.

A meme

Find the repository here: https://github.com/ppabis/ec2-metrics-timestream

What is Timestream?

Similar to CloudWatch Metrics but on a much larger scale, Timestream is a database that collects values over time. It's a managed service, so there is no need to create instances. It can process millions of requests from thousands of sources. It can aggregate the results and smooth out the graphs. It can also export to S3, so that you can use the data later for ETL processes, analytics or ML.

Creating a Timestream database

Let's start with Timestream's structure. Databases - the top-level objects - contain tables, similar to MySQL. However, the data the tables hold is not rows and columns but measures. Each measure has a required name and value and can also carry multiple dimensions. Compare it to Prometheus, where you have multiple metrics with their names, values and labels (dimensions).
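To make this concrete, here is what a single record looks like when passed to Timestream's WriteRecords API - a sketch with a made-up instance ID; the full script appears later in this post.

import time

# One measure: a name, a stringified value, its type, a timestamp
# and a list of dimensions that describe where it came from.
record = {
    'Dimensions': [{'Name': 'InstanceId', 'Value': 'i-0123456789abcdef0'}],
    'MeasureName': 'cpu_utilization',
    'MeasureValue': '12.5',
    'MeasureValueType': 'DOUBLE',
    'Time': str(int(time.time() * 1000)),
    'TimeUnit': 'MILLISECONDS',
}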

In this project, we would like to collect system metrics of our EC2 instances, namely CPU utilization (per core), memory (buffered, free, used) and root filesystem usage (percent used, percent free). It would be possible to put everything into a single table, but we will use a separate table for each component.

Let's start with a simple template with a single database that can be named however we want using a parameter.

---
AWSTemplateFormatVersion: 2010-09-09
Description: Timestream database for system metrics

Parameters:
  DatabaseName:
    Type: String
    Description: Name of the Timestream database
    Default: MyEc2SystemMetrics

Resources:
  TimestreamDatabase:
    Type: AWS::Timestream::Database
    Properties:
      DatabaseName: !Ref DatabaseName
      Tags:
        - Key: Name
          Value: !Ref DatabaseName

Put the contents into timestream.yaml and deploy it using the AWS CLI.

$ aws cloudformation create-stack --stack-name MyTimestreamMetrics --template-body file://timestream.yaml
$ aws cloudformation describe-stacks
{
  "Stacks": [
    {
      "StackName": "MyTimestreamMetrics",
      "Description": "Timestream database for system metrics",
      "Parameters": [
        {
          "ParameterKey": "DatabaseName",
          "ParameterValue": "MyEc2SystemMetrics"
        }
      ],
      "CreationTime": "2023-11-06T20:53:58.134000+00:00",
      "StackStatus": "CREATE_COMPLETE",
      ...

We can also view it in the console by choosing the correct region and going to either Timestream or CloudFormation.

Database

CloudFormation Stack
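You can also verify the database from the CLI - assuming the default database name from the parameter:

$ aws timestream-write describe-database --database-name MyEc2SystemMetrics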

Next we can create the Timestream tables. Append the following resources to the previous YAML file and update the stack.

...
Resources:
...
  CpuTable:
    Type: AWS::Timestream::Table
    Properties:
      DatabaseName: !Ref TimestreamDatabase
      TableName: CpuUtilization
      RetentionProperties:
        MemoryStoreRetentionPeriodInHours: 36
        MagneticStoreRetentionPeriodInDays: 2
      Tags:
        - Key: Name
          Value: CpuUtilization

  MemoryTable:
    Type: AWS::Timestream::Table
    Properties:
      DatabaseName: !Ref TimestreamDatabase
      TableName: MemoryUtilization
      RetentionProperties:
        MemoryStoreRetentionPeriodInHours: 36
        MagneticStoreRetentionPeriodInDays: 2
      Tags:
        - Key: Name
          Value: MemoryUtilization

  FilesystemTable:
    Type: AWS::Timestream::Table
    Properties:
      DatabaseName: !Ref TimestreamDatabase
      TableName: DiskUsed
      RetentionProperties:
        MemoryStoreRetentionPeriodInHours: 36
        MagneticStoreRetentionPeriodInDays: 2
      Tags:
        - Key: Name
          Value: DiskUsed

Update the stack by uploading the template again.
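From the CLI this is an update-stack call with the same template file:

$ aws cloudformation update-stack --stack-name MyTimestreamMetrics --template-body file://timestream.yaml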

Stack updated

IAM role with permissions

Now we would like to create an IAM role that will be able to write to all the tables in the Timestream database. We will use it later in our EC2 instances. Our CloudFormation template is growing, so maybe we should split it. In that case, the current template needs to export at least the database name or ARN, so that the other template can reference it and specify which resource the principal is allowed to write to.

In the current CloudFormation template, add a new Outputs section where we will create an exported value for the database ARN.

...
Resources:
...
Outputs:
  Ec2MetricsTimestreamDBArn:
    Description: ARN of the Timestream database for EC2 system metrics
    Value: !GetAtt TimestreamDatabase.Arn
    Export:
      Name: Ec2MetricsTimestreamDBArn
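After updating the stack, you can verify that the value was exported:

$ aws cloudformation list-exports --query 'Exports[].[Name, Value]'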

We will now reference it in the new template by the name from the Export section. The new stack will contain the policy, the role and an instance profile. The policy will also restrict access to this specific database (all of its tables). It would be nice if we could require the instance to add appropriate dimensions to the metrics based on its instance ID, however as of today it's not possible - there are no IAM condition keys for the Timestream WriteRecords action.

---
AWSTemplateFormatVersion: 2010-09-09
Description: IAM setup for EC2 instances that will report metrics to Timestream

Parameters:
  RoleName:
    Type: String
    Description: Name of the role that will be created
    Default: MeteredEC2Role

  PolicyName:
    Type: String
    Description: Name of the policy that will be created
    Default: MeteredEC2TimestreamPolicy

Resources:
  EC2Role:
    Type: AWS::IAM::Role
    Properties:
      RoleName: !Ref RoleName
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal:
              Service: ec2.amazonaws.com
            Action: sts:AssumeRole

  ManagedEC2Policy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      # Attach this policy to EC2 Role on create
      Roles:
        - !Ref EC2Role
      ManagedPolicyName: !Ref PolicyName
      PolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Action:
              - timestream:WriteRecords
            # Creates a wildcard table resource for the Timestream database
            Resource: !Join ["/", [!ImportValue Ec2MetricsTimestreamDBArn, "table", "*"]]
          - Effect: Allow
            Action:
              - timestream:DescribeEndpoints
            Resource: "*"

  EC2Profile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      Path: /
      Roles:
        - !Ref EC2Role

Outputs:
  ProfileName:
    Description: Name of the instance profile
    Value: !Ref EC2Profile
    Export:
      Name: "MeteredEC2Profile"

Apply it with IAM capabilities to a new stack.

$ aws cloudformation create-stack\
 --stack-name MyTimestreamEC2Role\
 --template-body file://ec2-timestream-iam.yaml\
 --region eu-west-1\
 --capabilities CAPABILITY_NAMED_IAM

You can watch the stacks being created in the AWS Console or using the following CLI command.

$ aws cloudformation describe-stacks\
 --region eu-west-1\
 --query 'Stacks[].[StackName, StackStatus]'

Script to send metrics to Timestream

Now we will create a small script that will frequently report data to the Timestream tables. It needs to be a daemon because cron runs at most once a minute. We will use Python with the standard AWS SDK, Boto3. For even more performance I would recommend going with Go, but it would take too much time for this demo.

import boto3, os, time, requests

"""
This script runs in a loop, collects CPU, Memory and root filesystem metrics and
sends it to Timestream.
"""

def get_cpu() -> list[float]:
    """
    Get average CPU utilization per core
    Based on: https://www.idnt.net/en-US/kb/941772
    """
    with open('/proc/stat') as f:
        lines = f.readlines()
    # Leave only each CPU core
    lines = [line.rstrip() for line in lines if line.startswith('cpu') and not line.startswith('cpu ')]
    # Split each line into values
    values = [line.split() for line in lines]
    # Convert to int
    values = [[int(v) for v in value[1:]] for value in values]
    # Get second measurement
    time.sleep(0.1)
    with open('/proc/stat') as f:
        lines = f.readlines()
    # Leave only each CPU core
    lines = [line.rstrip() for line in lines if line.startswith('cpu') and not line.startswith('cpu ')]
    values2 = [line.split() for line in lines]
    values2 = [[int(v) for v in value[1:]] for value in values2]

    total_deltas = [(sum(values2[i]) - sum(values[i])) for i in range(len(values))]
    total_idle = [(values2[i][3] - values[i][3]) for i in range(len(values))]
    total_used = [(total_deltas[i] - total_idle[i]) for i in range(len(values))]
    total_usage = [(total_used[i] / total_deltas[i]) * 100 for i in range(len(values))]
    return total_usage

def get_memory() -> tuple[float, float, float]:
    """
    Get memory utilization: (buffered, free, used) as % of total memory
    """
    # Returns something like:
    #               total        used        free      shared  buff/cache   available
    #Mem:             427         138          84           0         204         279
    #Swap:            426           0         426
    #Total:           854         138         511
    mem = os.popen("free -t -m").readlines()[1]
    values = mem.split()
    total = float(values[1])
    used_pct = float(values[2]) / total * 100
    free_pct = float(values[3]) / total * 100
    buffer_pct = float(values[5]) / total * 100

    return (buffer_pct, free_pct, used_pct)

def get_disk() -> tuple[float, float]:
    """
    Get disk utilization: (used, free) as % of total disk space
    """
    # Returns something like:
    # Filesystem     1K-blocks    Used Available Use% Mounted on
    # /dev/nvme0n1p1   8311788 1770492   6541296  22% /
    disk = os.popen("df /").readlines()[1]
    values = disk.split()
    total = float(values[1])
    used_pct = float(values[2]) / total * 100
    free_pct = float(values[3]) / total * 100

    return (used_pct, free_pct)

def main():
    database_name = os.environ['TS_DATABASE']
    # Ask IMDSv2 for Instance ID
    token = requests.put('http://169.254.169.254/latest/api/token', headers={'X-aws-ec2-metadata-token-ttl-seconds': '60'}).text
    instance_id = requests.get('http://169.254.169.254/latest/meta-data/instance-id', headers={'X-aws-ec2-metadata-token': token}).text
    region = requests.get('http://169.254.169.254/latest/meta-data/placement/region', headers={'X-aws-ec2-metadata-token': token}).text

    dimensions = [ {'Name': 'InstanceId', 'Value': instance_id} ]

    ts = boto3.client('timestream-write', region_name=region)
    while True:
        cpu = get_cpu()
        memory = get_memory()
        disk = get_disk()

        cpu_records = [
            {
                'Dimensions': dimensions + [{'Name': 'Core', 'Value': str(i)}],
                'MeasureName': 'cpu_utilization',
                'MeasureValue': str(cpu[i]),
                'MeasureValueType': 'DOUBLE',
                'Time': str(int(time.time() * 1000)),
                'TimeUnit': 'MILLISECONDS',
            } for i in range(len(cpu))
        ]

        memory_records = [
            {
                'Dimensions': dimensions,
                'MeasureName': 'memory_buffered',
                'MeasureValue': str(memory[0]),
                'MeasureValueType': 'DOUBLE',
                'Time': str(int(time.time() * 1000)),
                'TimeUnit': 'MILLISECONDS',
            },
            {
                'Dimensions': dimensions,
                'MeasureName': 'memory_free',
                'MeasureValue': str(memory[1]),
                'MeasureValueType': 'DOUBLE',
                'Time': str(int(time.time() * 1000)),
                'TimeUnit': 'MILLISECONDS',
            },
            {
                'Dimensions': dimensions,
                'MeasureName': 'memory_used',
                'MeasureValue': str(memory[2]),
                'MeasureValueType': 'DOUBLE',
                'Time': str(int(time.time() * 1000)),
                'TimeUnit': 'MILLISECONDS',
            }
        ]

        disk_records = [
            {
                'Dimensions': dimensions,
                'MeasureName': 'disk_used',
                'MeasureValue': str(disk[0]),
                'MeasureValueType': 'DOUBLE',
                'Time': str(int(time.time() * 1000)),
                'TimeUnit': 'MILLISECONDS',
            },
            {
                'Dimensions': dimensions,
                'MeasureName': 'disk_free',
                'MeasureValue': str(disk[1]),
                'MeasureValueType': 'DOUBLE',
                'Time': str(int(time.time() * 1000)),
                'TimeUnit': 'MILLISECONDS',
            }
        ]

        ts.write_records(DatabaseName=database_name, TableName='CpuUtilization', Records=cpu_records)
        ts.write_records(DatabaseName=database_name, TableName='MemoryUtilization', Records=memory_records)
        ts.write_records(DatabaseName=database_name, TableName='DiskUsed', Records=disk_records)
        time.sleep(0.2)


if __name__ == '__main__':
    main()

Creating Instances

Use the template below to create new instances that can be used for testing. This setup is written for eu-west-1 with EC2 Instance Connect. Adapt the security group CIDR to your liking. Also remember to update the versions of the SSM parameters resolving the AMI IDs if you need a newer one (the latest at the time of writing is 57). You will also need to supply your default VPC ID: go to the region where you will be creating the instances and copy the default VPC ID from the VPCs section (or look it up with the CLI command shown after the stack creation below). Apply like this:

$ aws cloudformation create-stack\
 --region eu-west-1\
 --stack-name EC2InstancesTimestream\
 --template-body file://ec2-instances.yaml\
 --parameters ParameterKey=VPC,ParameterValue=vpc-090909098a8a
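If you don't want to open the console, the default VPC ID can also be found from the CLI:

$ aws ec2 describe-vpcs --region eu-west-1 --filters Name=isDefault,Values=true --query 'Vpcs[].VpcId'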

The template looks like the following. It also imports the instance profile name from the previous stack.

---
AWSTemplateFormatVersion: 2010-09-09
Description: EC2 instances that will report metrics to Timestream

Parameters:
  ArmAmi:
    Type: String
    Description: AL2023 AMI ID for ARM-based EC2 instances
    Default: "{{resolve:ssm:/aws/service/ami-amazon-linux-latest/al2023-ami-kernel-6.1-arm64:57}}"
  AmdAmi:
    Type: String
    Description: AL2023 AMI ID for AMD64-based EC2 instances
    Default: "{{resolve:ssm:/aws/service/ami-amazon-linux-latest/al2023-ami-kernel-6.1-x86_64:57}}"
  VPC:
    Type: AWS::EC2::VPC::Id
    Description: VPC ID for the EC2 instances

Resources:
  SSHSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      VpcId: !Ref VPC
      GroupName: SSHSecurityGroup
      GroupDescription: Allow SSH access from EC2 Instance Connect
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 22
          ToPort: 22
          # EC2 Instance Connect IP range for eu-west-1
          CidrIp: 18.202.216.48/29

  Instance1:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: t3a.micro
      ImageId: !Ref AmdAmi
      NetworkInterfaces:
        - DeviceIndex: 0
          GroupSet:
            - !Ref SSHSecurityGroup
          AssociatePublicIpAddress: true
      IamInstanceProfile: !ImportValue MeteredEC2Profile
      Tags:
        - Key: Name
          Value: Instance1

  Instance2:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: t4g.small
      ImageId: !Ref ArmAmi
      NetworkInterfaces:
        - DeviceIndex: 0
          GroupSet:
            - !Ref SSHSecurityGroup
          AssociatePublicIpAddress: true
      IamInstanceProfile: !ImportValue MeteredEC2Profile
      Tags:
        - Key: Name
          Value: Instance2

EC2 Instance Connect

To transfer the script to the instance you can use an S3 bucket, some other HTTP endpoint, or just cat in the console. Select all the Python code, copy it, go to the EC2 Instance Connect window and do it like this (the EOF delimiter is quoted so that the shell doesn't expand anything in the pasted code):

$ cat > measure.py <<'EOF'
here you click paste and press enter/return
EOF

To run it in daemon mode, we will use nohup and & in Bash. First install pip, Boto3 and requests, and specify the database name in an environment variable.

$ sudo yum install -y python3-pip
$ sudo pip3 install boto3 requests
$ export TS_DATABASE=MyEc2SystemMetrics
$ nohup python3 measure.py &
# Press enter.

If you get the prompt back without any message such as [1]+ Exit 1 nohup python3 measure.py, the script is running. Verify by typing ps aux | grep python3 and you should see the process. For troubleshooting, type cat nohup.out. Now you can put some heavy load on the instance.

$ for i in {1..1000}; do (echo $i | sha1sum 2>&1 1>/dev/null); done

And after a while you can check the results in Timestream. Open Timestream in the AWS Console and head to the Query Editor. You can use the following query to get CPU utilization averages binned into 10-second intervals. You should get results similar to those below.

SELECT
  "InstanceId",
  "Core",
  BIN(time, 10s) as "binnedTime",
  AVG("measure_value::double") as "average"
FROM "MyEc2SystemMetrics"."CpuUtilization"
WHERE time between ago(30m) and now()
GROUP BY "Core", "InstanceId", BIN(time, 10s)
LIMIT 100

Timestream Query Editor
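A similar query works for the other tables. For example, for memory, grouping by the measure name shows buffered, free and used side by side (a sketch, assuming the measure names from the script above):

SELECT
  "InstanceId",
  "measure_name",
  BIN(time, 10s) as "binnedTime",
  AVG("measure_value::double") as "average"
FROM "MyEc2SystemMetrics"."MemoryUtilization"
WHERE time between ago(30m) and now()
GROUP BY "InstanceId", "measure_name", BIN(time, 10s)
LIMIT 100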

It would be better to see some graphs. This can be done with QuickSight or Grafana but it's a topic for another post.