Automating Infrastructure Drift Detection and Remediation with Terraform and CloudWatch Events

Introduction

Infrastructure as Code (IaC) has revolutionized how we manage and deploy infrastructure. Terraform, a popular IaC tool, allows us to define and provision infrastructure using declarative configuration files. However, even with Terraform, infrastructure can drift from its intended state due to manual changes, misconfigurations, or external factors. This drift can lead to inconsistencies, security vulnerabilities, and unexpected behavior.

This article explores how to automate infrastructure drift detection and remediation using Terraform and CloudWatch Events. By combining these tools, we can proactively identify and correct deviations from our desired infrastructure state, ensuring consistency, compliance, and operational stability.

Understanding Infrastructure Drift

Infrastructure drift occurs when the actual state of your infrastructure deviates from the state defined in your Terraform configuration. This can happen for several reasons:

Manual Changes: Operators might make ad-hoc changes directly to the infrastructure through the cloud provider’s console or CLI, bypassing Terraform.
External Factors: Services or events outside of Terraform’s control (for example, auto scaling actions, automated patching, or another team’s tooling) might modify infrastructure configurations.
Configuration Errors: Mistakes in Terraform configurations can lead to unintended infrastructure changes.

Drift can have significant consequences, including:

Inconsistencies: Different parts of your infrastructure might have different configurations, leading to unpredictable behavior.
Security Vulnerabilities: Manual changes might introduce security loopholes that are not covered by your Terraform configuration.
Compliance Issues: Drift can violate compliance policies and regulations, leading to audits and penalties.
Difficulty in Troubleshooting: When infrastructure deviates from its defined state, troubleshooting becomes more complex and time-consuming.
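
To make the idea concrete, drift detection boils down to comparing the attributes declared in configuration with the attributes the cloud provider actually reports. The sketch below is a simplified illustration only; Terraform's real diffing is far richer. The function name find_drift and the sample attribute dictionaries are hypothetical.

```python
def find_drift(desired: dict, actual: dict) -> dict:
    """Return {attribute: (desired_value, actual_value)} for every mismatch."""
    drift = {}
    for key in sorted(set(desired) | set(actual)):
        want = desired.get(key)
        have = actual.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical example: someone edited a tag directly in the console.
desired = {"bucket": "my-unique-bucket-name", "tags.Environment": "Production"}
actual = {"bucket": "my-unique-bucket-name", "tags.Environment": "Staging"}
print(find_drift(desired, actual))
# → {'tags.Environment': ('Production', 'Staging')}
```

An empty result means the two views agree; anything else is drift that someone needs to either revert or adopt into the configuration.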

Setting up Drift Detection with CloudWatch Events and Terraform

We’ll use CloudWatch Events (now part of Amazon EventBridge) to run terraform plan on a schedule. The result of each plan is then analyzed to detect any drift.

1. Terraform Configuration:

First, ensure you have a Terraform configuration that defines your infrastructure. For example, let’s consider a simple configuration that creates an AWS S3 bucket:

resource "aws_s3_bucket" "example" {
  bucket = "my-unique-bucket-name" # Replace with your unique bucket name

  # The inline acl argument is deprecated since AWS provider v4, and new
  # buckets are private by default. Use the aws_s3_bucket_acl resource if
  # you need to manage an explicit ACL.

  tags = {
    Name        = "My Example Bucket"
    Environment = "Production"
  }
}

2. CloudWatch Event Rule:

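Create a CloudWatch Events rule that triggers a Lambda function on a schedule (e.g., daily). You can do this via the AWS console, CLI, or using Terraform itself. Here’s an example rule defined in Terraform:

resource "aws_cloudwatch_event_rule" "drift_detection_rule" {
  name                = "drift-detection-rule"
  description         = "Triggers Lambda function to detect infrastructure drift daily"
  schedule_expression = "cron(0 0 * * ? *)" # Runs daily at midnight UTC
}

resource "aws_cloudwatch_event_target" "drift_detection_target" {
  rule  = aws_cloudwatch_event_rule.drift_detection_rule.name
  arn   = aws_lambda_function.drift_detection_lambda.arn
  input = "{}" # Static empty event; the function ignores its input
}

# Without this permission the rule fires but cannot invoke the function
resource "aws_lambda_permission" "allow_drift_detection_rule" {
  statement_id  = "AllowExecutionFromCloudWatchEvents"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.drift_detection_lambda.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.drift_detection_rule.arn
}

This defines a rule that fires daily at midnight UTC, targets the drift-detection Lambda function (described in the next step), and grants CloudWatch Events permission to invoke it. The aws_lambda_function.drift_detection_lambda resource itself is assumed to be defined elsewhere in your configuration.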
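
Terraform is not the only way to create this schedule. As a rough sketch, the same rule, target, and invoke permission can be created through the AWS API. The function below takes the EventBridge and Lambda clients as parameters (for example boto3.client("events") and boto3.client("lambda")). The function name create_daily_schedule, the target Id, and the statement Id are illustrative choices, not part of the original setup.

```python
def create_daily_schedule(events_client, lambda_client, function_name, function_arn,
                          rule_name="drift-detection-rule"):
    """Create a daily rule, allow it to invoke the function, and attach the target."""
    rule = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression="cron(0 0 * * ? *)",  # daily at midnight UTC
        State="ENABLED",
        Description="Triggers Lambda function to detect infrastructure drift daily",
    )
    rule_arn = rule["RuleArn"]

    # Resource-based policy statement that lets the rule invoke the function
    lambda_client.add_permission(
        FunctionName=function_name,
        StatementId="AllowExecutionFromCloudWatchEvents",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule_arn,
    )

    # Attach the function as the rule's only target, with a static empty event
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "drift-detection-lambda", "Arn": function_arn, "Input": "{}"}],
    )
    return rule_arn
```

Note that add_permission fails if a statement with the same Id already exists, so this sketch is meant for a one-time setup rather than repeated runs. Managing the rule in Terraform, as shown earlier, avoids that problem.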

3. Lambda Function for Drift Detection:

Create a Lambda function that executes the Terraform plan and analyzes the output. This function will:

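Run under an IAM execution role that lets Terraform read the managed resources and the state backend. terraform init and terraform plan are local processes, so the permissions that matter are the AWS API calls they make.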
Download the Terraform configuration files from a source (e.g., S3 bucket, CodeCommit repository).
Execute terraform init and terraform plan.
Parse the output of terraform plan to identify any changes.
Trigger a notification (e.g., send an email, post to Slack) if drift is detected.

Here’s a Python example of such a Lambda function:

import json
import os
import subprocess
import zipfile

import boto3

# With -detailed-exitcode, terraform plan exits 0 (no changes),
# 1 (error), or 2 (changes present, i.e. drift).
EXIT_NO_CHANGES = 0
EXIT_CHANGES = 2

def lambda_handler(event, context):
    try:
        # 1. AWS credentials and region come from the Lambda execution role
        #    and runtime environment; do not set access keys by hand.

        # 2. Download Terraform configuration (example: from S3)
        s3 = boto3.client('s3')
        bucket_name = os.environ['TERRAFORM_CONFIG_BUCKET']
        object_key = os.environ['TERRAFORM_CONFIG_KEY']
        download_path = '/tmp/terraform_config.zip'
        extract_path = '/tmp/terraform_config'

        s3.download_file(bucket_name, object_key, download_path)

        # Unzip with the standard library (no unzip binary is needed)
        with zipfile.ZipFile(download_path) as archive:
            archive.extractall(extract_path)

        # 3. Initialize Terraform (the terraform binary must be on PATH)
        terraform_dir = extract_path
        subprocess.run(['terraform', 'init', '-input=false', '-no-color'],
                       cwd=terraform_dir, check=True, capture_output=True, text=True)

        # 4. Execute Terraform plan
        plan_result = subprocess.run(
            ['terraform', 'plan', '-detailed-exitcode', '-input=false', '-no-color'],
            cwd=terraform_dir, capture_output=True, text=True)

        # 5. Analyze the result via the exit code rather than the message
        #    text, which differs between Terraform versions
        plan_output = plan_result.stdout
        if plan_result.returncode == EXIT_CHANGES:
            # Drift detected!
            print("Drift detected!")
            print(plan_output)

            # 6. Send notification (example: SNS topic)
            sns = boto3.client('sns')
            topic_arn = os.environ['SNS_TOPIC_ARN']
            message = f"Infrastructure drift detected!\n\n{plan_output}"
            sns.publish(TopicArn=topic_arn, Message=message, Subject="Infrastructure Drift Alert")

            return {
                'statusCode': 200,
                'body': json.dumps('Drift detected and notification sent!')
            }
        elif plan_result.returncode == EXIT_NO_CHANGES:
            print("No drift detected.")
            return {
                'statusCode': 200,
                'body': json.dumps('No drift detected.')
            }
        else:
            # A failed plan is an error, not drift
            raise RuntimeError(f"terraform plan failed: {plan_result.stderr}")

    except Exception as e:
        print(f"Error: {e}")
        return {
            'statusCode': 500,
            'body': json.dumps(f'Error: {e}')
        }
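
Beyond a yes/no answer, it is often useful to know which resources drifted. Terraform can emit a machine-readable plan: save it with terraform plan -out=tfplan, then run terraform show -json tfplan. The sketch below assumes that JSON has already been loaded into a dictionary and lists every resource whose planned actions go beyond no-op or read. The function name summarize_changes and the sample data (including aws_sns_topic.alerts) are hypothetical.

```python
def summarize_changes(plan: dict) -> list:
    """Return (address, actions) pairs for resources the plan would change."""
    changes = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # "no-op" means in sync; "read" is a data source refresh, not a change
        if any(action not in ("no-op", "read") for action in actions):
            changes.append((rc["address"], actions))
    return changes


# Hypothetical, heavily trimmed plan document
sample_plan = {
    "resource_changes": [
        {"address": "aws_s3_bucket.example", "change": {"actions": ["update"]}},
        {"address": "aws_sns_topic.alerts", "change": {"actions": ["no-op"]}},
    ]
}
print(summarize_changes(sample_plan))
# → [('aws_s3_bucket.example', ['update'])]
```

A short list of addresses and actions like this usually makes a more readable alert than the full plan text.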

Important Considerations for the Lambda Function:

IAM Role: The Lambda function’s IAM role needs permissions to:
- Read from the S3 bucket or CodeCommit repository where your Terraform configuration is stored.
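- Read the AWS resources that Terraform manages (describe, get, and list calls), so that terraform plan can refresh their state. No IAM permission is needed to run the Terraform binary itself.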
- Publish to an SNS topic (if you’re using SNS for notifications).
- Write logs to CloudWatch Logs.
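Environment Variables: Use environment variables for configuration such as bucket names and SNS topic ARNs. Do not store AWS access keys in them: the function receives temporary credentials from its execution role automatically. Never hardcode credentials in your Lambda function code, and keep any other secrets in a service such as AWS Secrets Manager or SSM Parameter Store.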
Error Handling: Implement robust error handling to catch exceptions and log errors for debugging.
Terraform Version: Ensure the Lambda function uses the same Terraform version that you use to manage your infrastructure. Lambda runtimes do not include Terraform, so ship the binary in the deployment package, a Lambda layer, or a container image.
State Management: Store your Terraform state remotely (e.g., in an S3 bucket with DynamoDB locking) and give the Lambda function’s role access to it. terraform plan reads the state and, by default, acquires the state lock, so the role needs access to both the bucket and the lock table.
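
As a small safeguard for the version point above, the function could compare the bundled binary's version with the one you expect before planning. The sketch below parses the output of terraform version -json, which reports a terraform_version field. The function name terraform_version_matches and the idea of passing the expected version in (for instance from an environment variable) are assumptions for illustration.

```python
import json


def terraform_version_matches(version_json: str, expected: str) -> bool:
    """Compare the version reported by `terraform version -json` with the expected one."""
    reported = json.loads(version_json).get("terraform_version")
    return reported == expected


# Sample of what `terraform version -json` prints (values are illustrative)
sample_output = (
    '{"terraform_version": "1.5.7", "platform": "linux_amd64", '
    '"provider_selections": {}, "terraform_outdated": false}'
)
print(terraform_version_matches(sample_output, "1.5.7"))
# → True
```

Inside the Lambda function, the JSON string would come from subprocess.run(['terraform', 'version', '-json'], capture_output=True, text=True, check=True).stdout.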

4. Notification Mechanism:

Configure a notification mechanism to alert you when drift is detected. This could be:

Email: Send an email notification using Amazon Simple Email Service (SES).
Slack: Post a message to a Slack channel using the Slack API.
PagerDuty: Create an incident in PagerDuty.
SNS: Publish a message to an SNS topic, which can then be subscribed to by various notification services.

The Lambda function example above uses SNS for notification.
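
One practical detail with SNS: a message is limited to 256 KB and a subject to 100 characters, while the plan output for a large configuration can exceed that. The helper below is a sketch of one way to trim the plan output before publishing. The function name build_drift_alert and the truncation notice are hypothetical; the returned dictionary simply holds keyword arguments for sns.publish.

```python
SNS_MAX_MESSAGE_BYTES = 256 * 1024  # SNS rejects larger messages
SNS_MAX_SUBJECT_CHARS = 100


def build_drift_alert(plan_output: str, topic_arn: str,
                      subject: str = "Infrastructure Drift Alert",
                      max_bytes: int = SNS_MAX_MESSAGE_BYTES) -> dict:
    """Build sns.publish keyword arguments, truncating oversized plan output."""
    header = "Infrastructure drift detected!\n\n"
    notice = "\n\n[plan output truncated]"
    message = header + plan_output
    if len(message.encode("utf-8")) > max_bytes:
        budget = max_bytes - len(header.encode("utf-8")) - len(notice.encode("utf-8"))
        # Cut on a byte budget and drop any partial trailing character
        body = plan_output.encode("utf-8")[:max(budget, 0)].decode("utf-8", errors="ignore")
        message = header + body + notice
    return {
        "TopicArn": topic_arn,
        "Message": message,
        "Subject": subject[:SNS_MAX_SUBJECT_CHARS],
    }
```

With this in place, the publish call becomes sns.publish(**build_drift_alert(plan_output, topic_arn)).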

Automating Remediation

While detecting drift is crucial, automatically remediating it is even more powerful. We can extend the Lambda function to automatically apply the Terraform configuration and correct the drift.

1. Enhanced Lambda Function:

Modify the Lambda function to execute terraform apply if drift is detected. Add the following to the Python code, after the drift detection logic:

        if "No changes. Your infrastructure matches the configuration