Cost Optimization

Systematic cost optimization reduces cloud spend without sacrificing performance or reliability. This guide covers proven strategies across compute, storage, and networking — with real Terraform modules, scripts, and decision frameworks you can deploy today.

The Cost Optimization Framework

Effective cost optimization follows a structured approach. Based on my Lean Six Sigma Black Belt training, I apply the DMAIC methodology to cloud cost reduction:

Define scope: Identify the target environment (dev, staging, prod), services, and team budgets.
Measure current state: Establish baseline costs using Cost Explorer or your FinOps tool. Identify the top 10 cost drivers.
Analyze waste: Identify idle resources, overprovisioning, unused RIs, and opportunities for rate optimization.
Improve: Execute right-sizing, RI purchases, architectural changes, and policy enforcement.
Control: Implement automated governance to prevent regression. Set up anomaly detection and budget alerts.
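
As a rough illustration of the Measure step, the sketch below ranks services by total spend. The input rows and the function name `top_cost_drivers` are hypothetical; in practice the rows would come from a Cost Explorer export or whatever FinOps tool you use.

```python
from collections import defaultdict


def top_cost_drivers(line_items, n=10):
    """Rank services by total cost; returns (service, cost, share_of_total) tuples."""
    totals = defaultdict(float)
    for service, cost in line_items:
        totals[service] += cost
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]
    return [
        (service, cost, cost / grand_total if grand_total else 0.0)
        for service, cost in ranked
    ]


# Hypothetical billing rows: (service, monthly cost in USD)
sample_rows = [
    ("EC2", 4200.0),
    ("S3", 900.0),
    ("RDS", 1500.0),
    ("EC2", 800.0),
    ("NAT Gateway", 600.0),
]

for service, cost, share in top_cost_drivers(sample_rows, n=3):
    print(f"{service:<12} ${cost:>9,.2f}  {share:.0%}")
```

Sorting by share of total keeps the Analyze step focused on the few services that dominate the bill.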

Compute Optimization

Compute typically represents 50-70% of cloud spend. It is the highest-impact optimization target.

Right-Sizing Instances

AWS Compute Optimizer analyzes CloudWatch metrics and recommends optimal instance types. Enable it at the organization level for maximum coverage.

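# Enable Compute Optimizer for the entire organization
# (run from the management account; --include-member-accounts opts in every member account)
aws compute-optimizer update-enrollment-status \
  --status Active \
  --include-member-accounts

# Enable enhanced infrastructure metrics (a paid feature that extends the lookback
# period from 14 days to up to 93 days). Memory utilization is picked up separately,
# once the CloudWatch agent publishes it from the instances.
aws compute-optimizer put-recommendation-preferences \
  --resource-type Ec2Instance \
  --enhanced-infrastructure-metrics Active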

# Export recommendations for analysis
aws compute-optimizer get-ec2-instance-recommendations \
  --output json > compute_optimizer_recommendations.json

# Identify overprovisioned instances
# (the comparison ignores case and underscores, so either spelling of the finding value matches)
jq '.instanceRecommendations[] |
  select((.finding | ascii_downcase | gsub("_"; "")) == "overprovisioned") |
  {
    instanceArn: .instanceArn,
    currentType: .currentInstanceType,
    recommendedType: .recommendationOptions[0].instanceType,
    estimatedMonthlySavings: .recommendationOptions[0].savingsOpportunity.estimatedMonthlySavings.value
  }' compute_optimizer_recommendations.json

Right-Sizing Safety: Always review memory utilization before downsizing. CPU-only metrics lead to memory-constrained outages. Install the CloudWatch agent and collect mem_used_percent before making sizing decisions.
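
The check described above can be sketched as a small guard function. This is a simplified model, not Compute Optimizer's logic: it assumes utilization scales linearly with instance size, and the 80% headroom threshold and the name `safe_to_downsize` are illustrative choices.

```python
def safe_to_downsize(cpu_p95, mem_p95, size_ratio=0.5, headroom_pct=80.0):
    """Project p95 utilization onto a smaller instance and check the headroom.

    cpu_p95, mem_p95: observed p95 utilization in percent. mem_p95 is None
        when the CloudWatch agent is not reporting mem_used_percent.
    size_ratio: target capacity divided by current capacity (0.5 = one size down).
    Returns (is_safe, reason).
    """
    if mem_p95 is None:
        return False, "no memory data; collect mem_used_percent first"
    projected = {"cpu": cpu_p95 / size_ratio, "memory": mem_p95 / size_ratio}
    for resource, pct in projected.items():
        if pct > headroom_pct:
            return False, f"{resource} would reach about {pct:.0f}%"
    return True, "projected utilization stays within headroom"


print(safe_to_downsize(20, 30))    # CPU and memory both leave room
print(safe_to_downsize(15, 55))    # low CPU, but memory would be exhausted
print(safe_to_downsize(15, None))  # CPU-only data is not enough to decide
```

The second call is the failure mode the warning is about: CPU alone suggests a downsize, while memory rules it out.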

Spot / Preemptible Instances

Spot instances offer up to 90% savings for fault-tolerant workloads. Use them for batch processing, CI/CD, data analytics, and stateless microservices.

# spot-instance-policy.yaml — EKS/Kubernetes Spot configuration
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot-workloads
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m6i.large", "m6i.xlarge", "m5.large", "m5.xlarge"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]  # m6i and m5 are x86_64 only; add Graviton instance types above before allowing arm64
      nodeClassRef:
        name: default
  limits:
    cpu: 1000
    memory: 4000Gi
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h
  weight: 10
---
# Pod disruption budget for spot-aware workloads
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-service-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api-service

# AWS Auto Scaling Group with Spot and On-Demand mixed instances policy
resource "aws_autoscaling_group" "mixed_workload" {
  name                = "mixed-workload-asg"
  vpc_zone_identifier = var.private_subnet_ids
  desired_capacity    = 4
  min_size            = 2
  max_size            = 20

  mixed_instances_policy {
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.mixed.id
        version            = "$Latest"
      }
      override {
        instance_type     = "m6i.xlarge"
        weighted_capacity = "4"
      }
      override {
        instance_type     = "m6i.2xlarge"
        weighted_capacity = "8"
      }
      override {
        instance_type     = "m5.xlarge"
        weighted_capacity = "4"
      }
    }

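    instances_distribution {
      on_demand_base_capacity                  = 2      # First 2 capacity units are On-Demand (units, not instances, because the overrides are weighted)
      on_demand_percentage_above_base_capacity = 25     # 25% On-Demand above the base
      spot_allocation_strategy                 = "capacity-optimized"
      # spot_instance_pools applies only to the "lowest-price" strategy, so it is omitted here
    }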
  }

  tag {
    key                 = "CostOptimization"
    value               = "spot-mixed"
    propagate_at_launch = true
  }
}
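
To see what the instances_distribution settings above mean in cost terms, here is a minimal sketch of the On-Demand/Spot split and the resulting blended hourly rate. It works in fractional capacity units and ignores how AWS rounds to whole instances. The 65% Spot discount is an assumed figure, and the per-unit price is the m6i.xlarge rate of $0.192/hour divided by its weight of 4.

```python
def split_capacity(total_units, on_demand_base=2, on_demand_pct_above_base=25):
    """Return (on_demand_units, spot_units) for a given total capacity."""
    base = min(total_units, on_demand_base)
    above_base = total_units - base
    on_demand = base + above_base * on_demand_pct_above_base / 100
    return on_demand, total_units - on_demand


def blended_hourly_cost(total_units, unit_price, spot_discount):
    """Hourly cost when Spot units are billed at (1 - spot_discount) of On-Demand."""
    on_demand, spot = split_capacity(total_units)
    return on_demand * unit_price + spot * unit_price * (1 - spot_discount)


UNIT_PRICE = 0.192 / 4   # m6i.xlarge On-Demand price per weighted capacity unit
SPOT_DISCOUNT = 0.65     # assumption; real Spot discounts vary by pool and over time

on_demand, spot = split_capacity(20)
print(f"On-Demand units: {on_demand}, Spot units: {spot}")
print(f"Blended:   ${blended_hourly_cost(20, UNIT_PRICE, SPOT_DISCOUNT):.4f}/hour")
print(f"On-Demand: ${20 * UNIT_PRICE:.4f}/hour")
```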

Graviton / ARM-Based Instances

AWS Graviton2/3 instances offer up to 40% better price-performance over comparable x86 instances. GCP's Tau T2A and Azure's Dpsv5 series (both Arm-based) provide similar benefits.

#!/bin/bash
# graviton-migration-checklist.sh
# Check workload compatibility for Graviton migration

echo "=== Graviton Migration Readiness Check ==="

# 1. Check if application has ARM-compatible containers
if docker manifest inspect myapp:latest 2>/dev/null | grep -q "arm64"; then
    echo "✓ Container image has arm64 variant"
else
    echo "✗ Container image missing arm64 variant — rebuild with multi-arch"
fi

# 2. Check language runtime ARM support
if command -v python3 &>/dev/null; then
    python3 -c "import platform; print(f'✓ Python running on {platform.machine()}')"
fi

if command -v node &>/dev/null; then
    node -e "console.log('✓ Node.js:', process.arch)"
fi

# 3. Check for native dependencies
if ldd /opt/myapp/bin/binary 2>/dev/null | grep -q "x86-64"; then
    echo "✗ Binary compiled for x86-64 — recompile for arm64"
fi

# 4. Check database driver compatibility
pip list 2>/dev/null | grep -E "(psycopg2|PyMySQL|redis)" &>/dev/null && \
    echo "✓ Common Python DB drivers are ARM-compatible"

# 5. Verify no AVX/AVX2 CPU instructions required
grep -r "avx" /opt/myapp/config 2>/dev/null && \
    echo "⚠ AVX references found — verify ARM NEON equivalent exists"

echo ""
echo "Graviton instance pricing comparison (us-east-1):"
# Single quotes keep the shell from expanding $0 (the script name) inside the prices
echo '  m6i.xlarge (x86):   $0.192/hour'
echo '  m6g.xlarge (ARM):   $0.154/hour  (~20% savings)'
echo '  m7g.xlarge (ARM):   $0.163/hour  (~15% savings, better perf)'
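
The savings percentages in the comparison above follow directly from the hourly prices. The sketch below reproduces them and separates plain price savings from price-performance. The `relative_perf` parameter is an assumption you would replace with your own benchmark result; 1.0 means the workload runs equally fast on both architectures.

```python
def savings_pct(x86_hourly, arm_hourly):
    """Percentage saved on the hourly price alone."""
    return (1 - arm_hourly / x86_hourly) * 100


def price_performance_gain_pct(x86_hourly, arm_hourly, relative_perf=1.0):
    """Gain in work done per dollar; relative_perf = ARM throughput / x86 throughput."""
    return (relative_perf * x86_hourly / arm_hourly - 1) * 100


X86_PRICE = 0.192  # m6i.xlarge, us-east-1
for name, price in [("m6g.xlarge", 0.154), ("m7g.xlarge", 0.163)]:
    print(
        f"{name}: {savings_pct(X86_PRICE, price):.1f}% cheaper, "
        f"{price_performance_gain_pct(X86_PRICE, price):.1f}% more work per dollar at equal speed"
    )
```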

Autoscaling Policies and Scheduled Scaling

# cost_optimized_autoscaling.tf — Terraform module for cost-optimized EC2

locals {
  # Recurrence expressions are evaluated in time_zone, not UTC
  business_hours_scale_up = {
    min_size         = 4
    max_size         = 20
    desired_capacity = 8
    recurrence       = "0 8 * * MON-FRI"  # 8 AM Eastern, weekdays
    time_zone        = "America/New_York"
  }

  after_hours_scale_down = {
    min_size         = 1
    max_size         = 5
    desired_capacity = 1
    recurrence       = "0 19 * * MON-FRI"  # 7 PM Eastern, weekdays
    time_zone        = "America/New_York"
  }

  weekend_scale_down = {
    min_size         = 0
    max_size         = 2
    desired_capacity = 0
    recurrence       = "0 0 * * SAT"  # Midnight at the start of Saturday, Eastern
    time_zone        = "America/New_York"
  }
}

resource "aws_autoscaling_group" "application" {
  name                = "${var.app_name}-asg"
  vpc_zone_identifier = var.private_subnet_ids
  target_group_arns   = var.target_group_arns
  health_check_type   = "ELB"
  health_check_grace_period = 300

  min_size         = var.min_size
  max_size         = var.max_size
  desired_capacity = var.desired_capacity

  launch_template {
    id      = aws_launch_template.application.id
    version = "$Latest"
  }

  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 66
      instance_warmup        = 120
    }
  }

  tag {
    key                 = "Name"
    value               = "${var.app_name}-instance"
    propagate_at_launch = true
  }
  tag {
    key                 = "Environment"
    value               = var.environment
    propagate_at_launch = true
  }
  tag {
    key                 = "AutoShutdown"
    value               = var.environment == "production" ? "false" : "true"
    propagate_at_launch = true
  }
  tag {
    key                 = "CostCenter"
    value               = var.cost_center
    propagate_at_launch = true
  }
}

# Target tracking scaling — scale based on average CPU
resource "aws_autoscaling_policy" "cpu_target_tracking" {
  name                   = "${var.app_name}-cpu-tracking"
  autoscaling_group_name = aws_autoscaling_group.application.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value     = 60.0
    disable_scale_in = false
  }
}

# Predictive scaling — ML-based proactive scaling
resource "aws_autoscaling_policy" "predictive" {
  count                  = var.enable_predictive_scaling ? 1 : 0
  name                   = "${var.app_name}-predictive"
  autoscaling_group_name = aws_autoscaling_group.application.name
  policy_type            = "PredictiveScaling"

  predictive_scaling_configuration {
    metric_specification {
      target_value = 60.0
      predefined_load_metric_specification {
        predefined_metric_type = "ASGTotalCPUUtilization"
        # Expected form: "app/<lb-name>/<lb-id>/targetgroup/<tg-name>/<tg-id>".
        # var.alb_resource_label is an assumed input, not defined elsewhere in this module.
        resource_label         = var.alb_resource_label
      }
      customized_scaling_metric_specification {
        metric_data_queries {
          id = "scaling"
          metric_stat {
            metric {
              metric_name = "CPUUtilization"
              namespace   = "AWS/EC2"
              dimensions {
                name  = "AutoScalingGroupName"
                value = aws_autoscaling_group.application.name
              }
            }
            stat = "Average"
          }
        }
      }
    }
    mode                   = "ForecastAndScale"
    scheduling_buffer_time = 10
  }
}

# Scheduled scaling — business hours only
resource "aws_autoscaling_schedule" "scale_up_morning" {
  count                  = var.environment != "production" ? 1 : 0
  scheduled_action_name  = "scale-up-morning"
  autoscaling_group_name = aws_autoscaling_group.application.name
  min_size               = local.business_hours_scale_up.min_size
  max_size               = local.business_hours_scale_up.max_size
  desired_capacity       = local.business_hours_scale_up.desired_capacity
  recurrence             = local.business_hours_scale_up.recurrence
  time_zone              = local.business_hours_scale_up.time_zone
}

resource "aws_autoscaling_schedule" "scale_down_evening" {
  count                  = var.environment != "production" ? 1 : 0
  scheduled_action_name  = "scale-down-evening"
  autoscaling_group_name = aws_autoscaling_group.application.name
  min_size               = local.after_hours_scale_down.min_size
  max_size               = local.after_hours_scale_down.max_size
  desired_capacity       = local.after_hours_scale_down.desired_capacity
  recurrence             = local.after_hours_scale_down.recurrence
  time_zone              = local.after_hours_scale_down.time_zone
}

resource "aws_autoscaling_schedule" "scale_down_weekend" {
  count                  = var.environment != "production" ? 1 : 0
  scheduled_action_name  = "scale-down-weekend"
  autoscaling_group_name = aws_autoscaling_group.application.name
  min_size               = local.weekend_scale_down.min_size
  max_size               = local.weekend_scale_down.max_size
  desired_capacity       = local.weekend_scale_down.desired_capacity
  recurrence             = local.weekend_scale_down.recurrence
  time_zone              = local.weekend_scale_down.time_zone
}

# Launch template with cost-optimized settings
resource "aws_launch_template" "application" {
  name_prefix   = "${var.app_name}-"
  image_id      = var.ami_id
  instance_type = var.instance_type
  key_name      = var.key_name

  iam_instance_profile {
    name = aws_iam_instance_profile.application.name
  }

  vpc_security_group_ids = var.security_group_ids

  # Enable detailed monitoring for better metrics granularity
  monitoring {
    enabled = true
  }

  # Use gp3 EBS: about 20% cheaper per GB than gp2, with a 3,000 IOPS baseline regardless of volume size
  block_device_mappings {
    device_name = "/dev/xvda"
    ebs {
      volume_size           = var.root_volume_size
      volume_type           = "gp3"
      iops                  = 3000
      throughput            = 125
      encrypted             = true
      delete_on_termination = true
    }
  }

  metadata_options {
    http_endpoint               = "enabled"
    http_tokens                 = "required"  # IMDSv2 — security best practice
    http_put_response_hop_limit = 1
  }

  tag_specifications {
    resource_type = "instance"
    tags = {
      Name        = "${var.app_name}-instance"
      Environment = var.environment
      CostCenter  = var.cost_center
    }
  }

  tag_specifications {
    resource_type = "volume"
    tags = {
      Name        = "${var.app_name}-root-volume"
      Environment = var.environment
      CostCenter  = var.cost_center
    }
  }

  user_data = base64encode(templatefile("${path.module}/userdata.sh", {
    app_name    = var.app_name
    environment = var.environment
  }))
}
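
A quick way to size the benefit of the three scheduled actions above is to count instance-hours over one week. The sketch assumes the group sits exactly at the scheduled desired_capacity (8 during business hours, 1 after hours, 0 on weekends) and ignores any scale-out from target tracking. The function names are illustrative.

```python
def scheduled_capacity(weekday, hour):
    """Desired capacity in local time; weekday 0 = Monday ... 6 = Sunday."""
    if weekday >= 5:
        return 0  # weekend_scale_down, applied at Saturday 00:00
    if weekday == 0 and hour < 8:
        return 0  # Monday before 08:00 still carries the weekend setting
    if 8 <= hour < 19:
        return 8  # business_hours_scale_up
    return 1      # after_hours_scale_down


def weekly_instance_hours():
    return sum(scheduled_capacity(day, hour) for day in range(7) for hour in range(24))


scheduled = weekly_instance_hours()
always_on = 8 * 7 * 24
print(f"Scheduled: {scheduled} instance-hours/week")
print(f"Always on: {always_on} instance-hours/week")
print(f"Reduction: {1 - scheduled / always_on:.0%}")
```

Under these assumptions a non-production group runs roughly a third of the instance-hours of an always-on fleet of 8.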

Savings Plans and Reserved Instances

Purchase Strategy

Payment Option	Discount vs On-Demand	Cash Flow Impact	Best For
All Upfront	Highest (~40-60%)	Large initial outlay	Organizations with committed budgets
Partial Upfront (~50% paid upfront)	Medium (~35-50%)	Moderate initial outlay	Balanced approach; most common choice
No Upfront	Lower (~25-40%)	None	Cash-constrained teams; still significant savings

Break-Even Analysis

#!/usr/bin/env python3
# ri_break_even.py — Reserved Instance break-even calculator
import argparse

def calculate_break_even(
    on_demand_hourly,
    ri_upfront,
    ri_hourly,
    hours_per_month=730,
    term_months=12
):
    """Calculate RI break-even point in months.

    Returns None when the RI's hourly rate is not below on-demand.
    term_months is the commitment length used to amortize the upfront fee.
    """
    monthly_on_demand = on_demand_hourly * hours_per_month
    monthly_ri = ri_hourly * hours_per_month
    monthly_savings = monthly_on_demand - monthly_ri

    # Without recurring savings the upfront fee is never recovered
    if monthly_savings <= 0:
        return None

    break_even_months = ri_upfront / monthly_savings
    total_savings_1yr = (monthly_savings * 12) - ri_upfront
    total_savings_3yr = (monthly_savings * 36) - ri_upfront

    roi_1yr = (total_savings_1yr / ri_upfront * 100) if ri_upfront > 0 else float('inf')
    roi_3yr = (total_savings_3yr / ri_upfront * 100) if ri_upfront > 0 else float('inf')

    return {
        "monthly_on_demand": monthly_on_demand,
        "monthly_ri_total": monthly_ri + (ri_upfront / term_months),
        "monthly_savings": monthly_savings,
        "break_even_months": break_even_months,
        "total_savings_1yr": total_savings_1yr,
        "total_savings_3yr": total_savings_3yr,
        "roi_1yr_pct": roi_1yr,
        "roi_3yr_pct": roi_3yr
    }

def main():
    parser = argparse.ArgumentParser(description="RI Break-Even Calculator")
    parser.add_argument("--ondemand-hourly", type=float, required=True)
    parser.add_argument("--ri-upfront", type=float, default=0)
    parser.add_argument("--ri-hourly", type=float, required=True)
    parser.add_argument("--term", choices=["1yr", "3yr"], default="1yr")
    args = parser.parse_args()

    result = calculate_break_even(
        args.ondemand_hourly,
        args.ri_upfront,
        args.ri_hourly,
        term_months=12 if args.term == "1yr" else 36
    )

    if result is None:
        print("ERROR: RI costs more than on-demand. Not recommended.")
        return

    print(f"\n{'='*55}")
    print(f"  RI Break-Even Analysis")
    print(f"{'='*55}")
    print(f"  On-Demand Hourly:     ${args.ondemand_hourly:.4f}")
    print(f"  RI Upfront:           ${args.ri_upfront:,.2f}")
    print(f"  RI Hourly:            ${args.ri_hourly:.4f}")
    print(f"  Term:                 {args.term}")
    print(f"  {'-'*51}")
    print(f"  Monthly On-Demand:    ${result['monthly_on_demand']:,.2f}")
    print(f"  Monthly RI (amortized): ${result['monthly_ri_total']:,.2f}")
    print(f"  Monthly Savings:      ${result['monthly_savings']:,.2f}")
    print(f"  {'-'*51}")
    print(f"  Break-Even Point:     {result['break_even_months']:.1f} months")
    print(f"  1-Year Total Savings: ${result['total_savings_1yr']:,.2f}")
    print(f"  3-Year Total Savings: ${result['total_savings_3yr']:,.2f}")
    print(f"  1-Year ROI:           {result['roi_1yr_pct']:.1f}%")
    print(f"  3-Year ROI:           {result['roi_3yr_pct']:.1f}%")
    print(f"{'='*55}")

    if result['break_even_months'] <= 6:
        print("  ✅ RECOMMENDED: Break-even under 6 months")
    elif result['break_even_months'] <= 12:
        print("  ⚠️ CONDITIONAL: Review utilization before committing")
    else:
        print("  ❌ NOT RECOMMENDED: Break-even exceeds 12 months")

if __name__ == "__main__":
    main()

# Example: m6i.xlarge in us-east-1
# python3 ri_break_even.py --ondemand-hourly 0.192 --ri-upfront 514.0 --ri-hourly 0.054 --term 1yr

Storage Optimization

S3 Lifecycle Policies and Intelligent Tiering

# s3_lifecycle.tf — Cost-optimized S3 bucket with lifecycle policies
resource "aws_s3_bucket" "data_lake" {
  bucket = "company-data-lake-${var.environment}-${data.aws_caller_identity.current.account_id}"
}

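# Configure the optional archive tiers for objects stored in the INTELLIGENT_TIERING class
# (this only affects objects uploaded or transitioned into that storage class)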
resource "aws_s3_bucket_intelligent_tiering_configuration" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id
  name   = "EntireBucket"

  tiering {
    access_tier = "ARCHIVE_ACCESS"
    days        = 90
  }

  tiering {
    access_tier = "DEEP_ARCHIVE_ACCESS"
    days        = 180
  }
}

# Lifecycle policy for explicit transitions
resource "aws_s3_bucket_lifecycle_configuration" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id

  rule {
    id     = "transition-to-cheaper-storage"
    status = "Enabled"

    filter {
      prefix = ""  # Apply to entire bucket
    }

    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }

    transition {
      days          = 90
      storage_class = "GLACIER_IR"  # Glacier Instant Retrieval — for rarely accessed but needed quickly
    }

    transition {
      days          = 365
      storage_class = "DEEP_ARCHIVE"
    }

    noncurrent_version_transition {
      noncurrent_days = 30
      storage_class   = "STANDARD_IA"
    }

    noncurrent_version_transition {
      noncurrent_days = 60
      storage_class   = "GLACIER"
    }

    noncurrent_version_expiration {
      noncurrent_days = 365
    }

    abort_incomplete_multipart_upload {
      days_after_initiation = 7
    }
  }

  rule {
    id     = "temp-data-cleanup"
    status = "Enabled"

    filter {
      prefix = "temp/"
    }

    expiration {
      days = 7
    }
  }

  rule {
    id     = "log-cleanup"
    status = "Enabled"

    filter {
      prefix = "logs/"
    }

    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }

    expiration {
      days = 90
    }
  }
}

# S3 Storage Lens for visibility into storage optimization opportunities
resource "aws_s3control_storage_lens_configuration" "organization" {
  config_id = "organization-storage-lens"
  account_id = var.master_account_id

  storage_lens_configuration {
    enabled = true

    account_level {
      activity_metrics {
        enabled = true
      }
      advanced_cost_optimization_metrics {
        enabled = true
      }
      advanced_data_protection_metrics {
        enabled = true
      }

      bucket_level {
        activity_metrics {
          enabled = true
        }
        prefix_level {
          storage_metrics {
            enabled = true
            selection_criteria {
              delimiter                    = "/"
              max_depth                    = 3
              min_storage_bytes_percentage = 1.0  # Only report prefixes holding at least 1% of the bucket's storage
            }
          }
        }
      }
    }

    data_export {
      s3_bucket_destination {
        format     = "Parquet"
        output_schema_version = "V_1"
        account_id = var.master_account_id
        arn        = aws_s3_bucket.storage_lens_export.arn
        prefix     = "storage-lens/"
      }
    }

    exclude {
      # Buckets are excluded by ARN, not by name
      buckets = [
        aws_s3_bucket.temp_excluded.arn
      ]
    }
  }
}
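
To estimate what the bucket-wide lifecycle rule above saves, the sketch below maps object age to the storage class the rule would place it in (30, 90, and 365 days) and prices the result. The per-GB prices are assumed us-east-1 list prices. Request, retrieval, transition, and minimum-storage-duration charges are left out, so treat the output as an upper bound on savings.

```python
# Assumed us-east-1 storage prices in USD per GB-month
PRICE_PER_GB = {
    "STANDARD": 0.023,
    "STANDARD_IA": 0.0125,
    "GLACIER_IR": 0.004,
    "DEEP_ARCHIVE": 0.00099,
}


def storage_class_for_age(age_days):
    """Storage class an object lands in under the 30/90/365-day transitions."""
    if age_days >= 365:
        return "DEEP_ARCHIVE"
    if age_days >= 90:
        return "GLACIER_IR"
    if age_days >= 30:
        return "STANDARD_IA"
    return "STANDARD"


def monthly_storage_cost(gb_by_age):
    """gb_by_age: iterable of (age_days, gigabytes) buckets."""
    return sum(gb * PRICE_PER_GB[storage_class_for_age(age)] for age, gb in gb_by_age)


# Hypothetical age distribution of a 10 TB data lake
data_lake = [(10, 1000), (45, 2000), (200, 3000), (800, 4000)]
with_lifecycle = monthly_storage_cost(data_lake)
all_standard = sum(gb for _, gb in data_lake) * PRICE_PER_GB["STANDARD"]
print(f"With lifecycle rule: ${with_lifecycle:,.2f}/month")
print(f"All in STANDARD:     ${all_standard:,.2f}/month")
```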

EBS Volume Right-Sizing and gp3 Migration

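gp3 volumes cost about 20% less per GB than gp2 and let you provision IOPS and throughput independently of volume size. Migrating gp2 volumes to gp3 is typically the easiest storage win, but check large volumes first: gp2 baseline performance scales at 3 IOPS per GB, so a gp2 volume larger than 1,000 GB already exceeds gp3's default 3,000 IOPS and needs extra provisioned IOPS to keep its performance.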
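
Whether a given volume is a like-for-like swap depends on its size. The sketch below computes the gp2 baseline (3 IOPS per GB, floor of 100, cap of 16,000) and compares monthly cost against a gp3 volume provisioned to match it. Prices are assumed us-east-1 list prices, and throughput is left at the gp3 default unless you pass a higher value.

```python
GP3_FREE_IOPS = 3000
GP3_FREE_THROUGHPUT = 125  # MiB/s


def gp2_baseline_iops(size_gb):
    """gp2 baseline: 3 IOPS per GB, minimum 100, maximum 16,000."""
    return min(16000, max(100, 3 * size_gb))


def gp2_monthly_cost(size_gb):
    return size_gb * 0.10


def gp3_monthly_cost(size_gb, iops=GP3_FREE_IOPS, throughput=GP3_FREE_THROUGHPUT):
    """Storage plus any IOPS and throughput provisioned above the free baseline."""
    extra_iops = max(0, iops - GP3_FREE_IOPS)
    extra_throughput = max(0, throughput - GP3_FREE_THROUGHPUT)
    return size_gb * 0.08 + extra_iops * 0.005 + extra_throughput * 0.04


for size_gb in (100, 500, 2000):
    target_iops = max(GP3_FREE_IOPS, gp2_baseline_iops(size_gb))
    print(
        f"{size_gb:>5} GB: gp2 ${gp2_monthly_cost(size_gb):.2f}, "
        f"gp3 at {target_iops} IOPS ${gp3_monthly_cost(size_gb, iops=target_iops):.2f}"
    )
```

Small volumes get the full 20% saving and more IOPS than before. Large volumes still save money once the extra IOPS are paid for, but by a smaller margin.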

#!/bin/bash
# ebs_gp3_migration.sh — Identify and migrate gp2 volumes to gp3

AWS_PROFILE=${AWS_PROFILE:-default}
REGIONS=$(aws ec2 describe-regions --query 'Regions[].RegionName' --output text)

total_volumes=0
total_savings=0.00

echo "=== EBS gp2 → gp3 Migration Analysis ==="
echo "Region,Volumes,Monthly Savings"

for REGION in $REGIONS; do
    # One line per gp2 volume: "<VolumeId> <SizeGiB>"
    volumes=$(aws ec2 describe-volumes \
        --region "$REGION" \
        --filters Name=volume-type,Values=gp2 \
        --query 'Volumes[].[VolumeId,Size]' \
        --output text)

    # Skip regions without gp2 volumes (wc -l would report 1 for an empty string)
    [ -z "$volumes" ] && continue

    count=$(echo "$volumes" | wc -l | tr -d ' ')
    total_gb=$(echo "$volumes" | awk '{sum += $2} END {print sum}')
    # gp2 is $0.10/GB-month and gp3 $0.08/GB-month in us-east-1; other regions differ slightly
    savings=$(echo "$total_gb * 0.02" | bc -l)

    echo "$REGION,$count,\$$savings"
    total_volumes=$((total_volumes + count))
    total_savings=$(echo "$total_savings + $savings" | bc -l)
done

echo ""
echo "=== Summary ==="
echo "Total gp2 volumes: $total_volumes"
echo "Estimated monthly savings from migration: \$$total_savings"

echo ""
read -p "Proceed with migration? (yes/no): " confirm
if [ "$confirm" != "yes" ]; then
    echo "Aborted."
    exit 0
fi

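# Execute migration
for REGION in $REGIONS; do
    aws ec2 describe-volumes \
        --region "$REGION" \
        --filters Name=volume-type,Values=gp2 \
        --query 'Volumes[].[VolumeId,Size]' --output text |
    while read -r vol size; do
        [ -z "$vol" ] && continue
        # Preserve the gp2 baseline (3 IOPS/GB, capped at 16,000) for volumes above 1,000 GB
        iops=$(( size * 3 ))
        [ "$iops" -lt 3000 ] && iops=3000
        [ "$iops" -gt 16000 ] && iops=16000
        echo "Migrating $vol in $REGION (${size} GB, ${iops} IOPS)..."
        # Throughput stays at the gp3 default; raise it for volumes that relied on gp2's higher ceiling
        aws ec2 modify-volume \
            --region "$REGION" \
            --volume-id "$vol" \
            --volume-type gp3 \
            --iops "$iops" \
            --throughput 125 < /dev/null
    done
done

echo "Modification requests submitted. Track progress with 'aws ec2 describe-volumes-modifications' and monitor CloudWatch VolumeReadOps/VolumeWriteOps to verify performance."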

Network Optimization

NAT Gateway Alternatives

NAT Gateway is one of the most expensive networking services in AWS ($0.045 per hour plus $0.045 per GB processed in us-east-1). For non-production or batch workloads, alternatives include NAT instances, VPC endpoints, and IPv6 egress-only gateways.

# nat_gateway_optimization.tf — Hybrid NAT strategy

# Option 1: VPC Endpoints for AWS services (avoids NAT entirely for S3, DynamoDB, etc.)
resource "aws_vpc_endpoint" "s3" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.${data.aws_region.current.name}.s3"
  route_table_ids = aws_route_table.private[*].id

  tags = {
    Name = "s3-vpc-endpoint"
  }
}

resource "aws_vpc_endpoint" "dynamodb" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.${data.aws_region.current.name}.dynamodb"
  route_table_ids = aws_route_table.private[*].id

  tags = {
    Name = "dynamodb-vpc-endpoint"
  }
}

# Interface endpoints for other AWS services
resource "aws_vpc_endpoint" "ssm" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.${data.aws_region.current.name}.ssm"
  vpc_endpoint_type   = "Interface"
  private_dns_enabled = true
  subnet_ids          = var.private_subnet_ids
  security_group_ids  = [aws_security_group.vpc_endpoints.id]

  tags = {
    Name = "ssm-endpoint"
  }
}

# Option 2: NAT Instance for dev/test environments (instead of NAT Gateway)
resource "aws_instance" "nat_instance" {
  count                  = var.environment != "production" ? 1 : 0
  ami                    = data.aws_ami.amazon_linux_2023_arm64.id  # t4g is Graviton, so it needs the arm64 AMI
  instance_type          = "t4g.micro"  # ARM-based, cheapest option
  subnet_id              = var.public_subnet_ids[0]
  vpc_security_group_ids = [aws_security_group.nat_instance.id]
  source_dest_check      = false  # Required for NAT

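  user_data = <<-EOF
    #!/bin/bash
    # Install iptables before using it; a stock Amazon Linux 2023 image may not include it
    yum install -y iptables-services
    systemctl enable --now iptables

    # Persist IP forwarding across reboots
    echo "net.ipv4.ip_forward=1" > /etc/sysctl.d/90-nat.conf
    sysctl -p /etc/sysctl.d/90-nat.conf

    # Nitro instances usually name the primary interface ens5 rather than eth0, so detect it
    IFACE=$(ip route show default | awk '{print $5; exit}')
    iptables -t nat -A POSTROUTING -o "$IFACE" -j MASQUERADE
    iptables -F FORWARD
    service iptables save
  EOF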

  tags = {
    Name = "${var.environment}-nat-instance"
  }
}

# Option 3: IPv6 Egress-Only Internet Gateway (for IPv6 workloads)
resource "aws_egress_only_internet_gateway" "ipv6" {
  vpc_id = aws_vpc.main.id

  tags = {
    Name = "${var.environment}-eigw"
  }
}

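# Cost comparison (us-east-1, per month at 720 hours):
# NAT Gateway:                          ~$32.40 base + $0.045/GB
# NAT Instance (t4g.micro):             ~$6.05 (24/7 On-Demand at $0.0084/hour); use spot for further savings
# Interface VPC Endpoint:               ~$7.20 per AZ + $0.01/GB
# Gateway VPC Endpoint (S3, DynamoDB):  no hourly or per-GB charge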
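
An interface endpoint only pays for itself once enough traffic moves off the NAT Gateway. The sketch below works out that break-even volume from the us-east-1 figures in the comparison above, assuming the NAT Gateway stays in place for other traffic, so only its per-GB charge is avoided.

```python
NAT_PER_GB = 0.045
ENDPOINT_PER_GB = 0.01
ENDPOINT_MONTHLY_PER_AZ = 7.20  # $0.01/hour x 720 hours


def endpoint_break_even_gb(az_count=1):
    """Monthly GB to one AWS service above which an interface endpoint is cheaper."""
    fixed_cost = ENDPOINT_MONTHLY_PER_AZ * az_count
    saving_per_gb = NAT_PER_GB - ENDPOINT_PER_GB
    return fixed_cost / saving_per_gb


def monthly_saving(gb_to_service, az_count=1):
    """Net monthly saving from routing gb_to_service through the endpoint (negative = loss)."""
    fixed_cost = ENDPOINT_MONTHLY_PER_AZ * az_count
    return gb_to_service * (NAT_PER_GB - ENDPOINT_PER_GB) - fixed_cost


for az_count in (1, 2, 3):
    print(f"{az_count} AZ(s): break-even at {endpoint_break_even_gb(az_count):,.0f} GB/month")
print(f"1 TB through a 3-AZ endpoint saves ${monthly_saving(1000, 3):,.2f}/month")
```

Gateway endpoints for S3 and DynamoDB carry no such fixed cost, so they are worth adding regardless of volume.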

CloudFront Caching Strategies

# cloudfront_optimization.tf — Cost-optimized CloudFront distribution
resource "aws_cloudfront_distribution" "app" {
  enabled             = true
  is_ipv6_enabled     = true
  comment             = "${var.app_name} CDN"
  default_root_object = "index.html"
  price_class         = var.environment == "production" ? "PriceClass_All" : "PriceClass_100"

  # Origin: S3 static assets
  origin {
    domain_name = aws_s3_bucket.static_assets.bucket_regional_domain_name
    origin_id   = "S3-static"

    s3_origin_config {
      origin_access_identity = aws_cloudfront_origin_access_identity.oai.cloudfront_access_identity_path
    }

    # Origin Shield reduces origin load and improves cache hit ratio
    origin_shield {
      enabled              = true
      origin_shield_region = "us-east-1"
    }
  }

  # Origin: ALB for dynamic content
  origin {
    domain_name = aws_lb.app.dns_name
    origin_id   = "ALB-dynamic"

    custom_origin_config {
      http_port              = 80
      https_port             = 443
      origin_protocol_policy = "https-only"
      origin_ssl_protocols   = ["TLSv1.2"]
    }
  }

  # Cache behavior for static assets (aggressive caching)
  ordered_cache_behavior {
    path_pattern     = "/static/*"
    allowed_methods  = ["GET", "HEAD", "OPTIONS"]
    cached_methods   = ["GET", "HEAD"]
    target_origin_id = "S3-static"

    forwarded_values {
      query_string = false
      cookies {
        forward = "none"
      }
    }

    viewer_protocol_policy = "redirect-to-https"
    min_ttl                = 86400     # 1 day
    default_ttl            = 604800    # 7 days
    max_ttl                = 31536000  # 1 year
    compress               = true
  }

  # Cache behavior for images and media
  ordered_cache_behavior {
    path_pattern     = "/media/*"
    allowed_methods  = ["GET", "HEAD"]
    cached_methods   = ["GET", "HEAD"]
    target_origin_id = "S3-static"

    forwarded_values {
      query_string = false
      cookies {
        forward = "none"
      }
    }

    viewer_protocol_policy = "redirect-to-https"
    min_ttl                = 86400
    default_ttl            = 2592000   # 30 days
    max_ttl                = 31536000  # 1 year
    compress               = true
  }

  # Default cache behavior for dynamic content
  default_cache_behavior {
    allowed_methods  = ["DELETE", "GET", "HEAD", "OPTIONS", "PATCH", "POST", "PUT"]
    cached_methods   = ["GET", "HEAD"]
    target_origin_id = "ALB-dynamic"

    cache_policy_id          = aws_cloudfront_cache_policy.dynamic.id
    origin_request_policy_id = aws_cloudfront_origin_request_policy.dynamic.id
    viewer_protocol_policy   = "redirect-to-https"
    compress                 = true
  }

  restrictions {
    geo_restriction {
      restriction_type = "none"
    }
  }

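  viewer_certificate {
    # The default *.cloudfront.net certificate requires minimum_protocol_version = "TLSv1"
    # and takes no ssl_support_method; only the custom ACM certificate path sets both.
    cloudfront_default_certificate = var.environment != "production"
    acm_certificate_arn            = var.environment == "production" ? var.acm_certificate_arn : null
    ssl_support_method             = var.environment == "production" ? "sni-only" : null
    minimum_protocol_version       = var.environment == "production" ? "TLSv1.2_2021" : "TLSv1"
  }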
}

# Cache policy for dynamic content (short TTL, selective caching)
resource "aws_cloudfront_cache_policy" "dynamic" {
  name        = "${var.app_name}-dynamic-policy"
  comment     = "Cache dynamic content selectively"
  default_ttl = 0
  max_ttl     = 60
  min_ttl     = 0

  parameters_in_cache_key_and_forwarded_to_origin {
    enable_accept_encoding_gzip   = true
    enable_accept_encoding_brotli = true

    headers_config {
      header_behavior = "none"
    }
    cookies_config {
      cookie_behavior = "none"
    }
    query_strings_config {
      query_string_behavior = "whitelist"
      query_strings {
        items = ["version", "cache-buster"]
      }
    }
  }
}
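
The min_ttl, default_ttl, and max_ttl values above interact with whatever Cache-Control: max-age the origin sends. The sketch below is a simplified model of that interaction, enough to sanity-check a configuration. It deliberately ignores Expires, s-maxage, and the no-cache/no-store/private directives.

```python
def effective_ttl(origin_max_age, min_ttl, default_ttl, max_ttl):
    """Seconds CloudFront keeps an object, under the simplified model described above.

    origin_max_age: max-age sent by the origin, or None when no caching header is present.
    """
    if origin_max_age is None:
        return max(min_ttl, default_ttl)
    return max(min_ttl, min(origin_max_age, max_ttl))


# /static/* behavior: a one-hour max-age from the origin is raised to the one-day minimum
print(effective_ttl(3600, min_ttl=86400, default_ttl=604800, max_ttl=31536000))
# /static/* behavior with no header: falls back to the seven-day default
print(effective_ttl(None, min_ttl=86400, default_ttl=604800, max_ttl=31536000))
# Dynamic cache policy: a five-minute max-age is capped at 60 seconds
print(effective_ttl(300, min_ttl=0, default_ttl=0, max_ttl=60))
```

A high min_ttl on static paths raises the cache hit ratio, but it also means a short max-age from the origin is not honored, so asset URLs should be versioned.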

Database Optimization

RDS Optimization

# rds_optimized.tf — Cost-optimized RDS configuration
resource "aws_db_instance" "primary" {
  identifier = "${var.app_name}-${var.environment}"

  # Engine configuration
  engine         = "postgres"
  engine_version = "15.4"
  instance_class = var.environment == "production" ? "db.m6g.xlarge" : "db.t4g.micro"

  # Storage — always use gp3 for new instances
  allocated_storage     = 100
  max_allocated_storage = 1000  # Enable storage autoscaling
  storage_type          = "gp3"
  storage_encrypted     = true
  # IOPS left unset: RDS PostgreSQL gp3 volumes under 400 GiB run at the 3,000 IOPS baseline and reject a custom value

  # Database configuration
  db_name  = var.database_name
  username = var.database_username
  password = var.database_password

  # High availability — only in production
  multi_az = var.environment == "production"

  # Backup and maintenance
  backup_retention_period = var.environment == "production" ? 30 : 7
  backup_window           = "03:00-04:00"
  maintenance_window      = "Mon:04:00-Mon:05:00"

  # Performance Insights — invaluable for right-sizing
  performance_insights_enabled    = true
  performance_insights_retention_period = 7

  # Enhanced monitoring for detailed metrics
  monitoring_interval = 60
  monitoring_role_arn = aws_iam_role.rds_monitoring.arn

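  # Deletion protection in production; non-prod skips the final snapshot
  deletion_protection       = var.environment == "production"
  skip_final_snapshot       = var.environment != "production"
  # A snapshot name is needed whenever the final snapshot is not skipped
  final_snapshot_identifier = var.environment == "production" ? "${var.app_name}-${var.environment}-final" : null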

  # Auto minor version upgrades
  auto_minor_version_upgrade = true

  vpc_security_group_ids = var.database_security_group_ids
  db_subnet_group_name   = aws_db_subnet_group.main.name

  tags = {
    Name        = "${var.app_name}-postgres"
    Environment = var.environment
    CostCenter  = var.cost_center
  }
}

# Read replica for read-heavy production workloads
resource "aws_db_instance" "replica" {
  count               = var.environment == "production" && var.enable_read_replica ? 1 : 0
  identifier          = "${var.app_name}-${var.environment}-replica"
  replicate_source_db = aws_db_instance.primary.identifier  # Same-region replicas take the identifier; the ARN is for cross-region replicas
  instance_class      = "db.m6g.large"  # Can be smaller than primary

  storage_type = "gp3"

  performance_insights_enabled    = true
  performance_insights_retention_period = 7

  monitoring_interval = 60
  monitoring_role_arn = aws_iam_role.rds_monitoring.arn

  auto_minor_version_upgrade = true
  skip_final_snapshot        = true

  vpc_security_group_ids = var.database_security_group_ids
  # db_subnet_group_name is omitted: a same-region replica inherits the source's subnet group

  tags = {
    Name        = "${var.app_name}-postgres-replica"
    Environment = var.environment
    CostCenter  = var.cost_center
  }
}

# RDS Reserved Instance for production databases
# Purchase via AWS Console or API after 30 days of stable usage
# Example: db.m6g.xlarge, 1 year, partial upfront = ~40% savings

Aurora Serverless for Variable Workloads

# aurora_serverless.tf — Aurora Serverless v2 for variable or dev workloads
resource "aws_rds_cluster" "serverless" {
  count              = var.use_serverless ? 1 : 0
  cluster_identifier = "${var.app_name}-${var.environment}-aurora"
  engine             = "aurora-postgresql"
  engine_mode        = "provisioned"  # Aurora Serverless v2 uses "provisioned" with scaling
  engine_version     = "15.4"

  database_name   = var.database_name
  master_username = var.database_username
  master_password = var.database_password

  # Serverless v2 scaling configuration
  serverlessv2_scaling_configuration {
    min_capacity = var.environment == "production" ? 1.0 : 0.5  # ACUs
    max_capacity = var.environment == "production" ? 16.0 : 4.0
  }

  db_subnet_group_name   = aws_db_subnet_group.main.name
  vpc_security_group_ids = var.database_security_group_ids
  storage_encrypted      = true

  backup_retention_period = 7
  preferred_backup_window = "03:00-04:00"

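  # Production keeps a final snapshot, which requires an identifier at destroy time
  skip_final_snapshot       = var.environment != "production"
  final_snapshot_identifier = var.environment == "production" ? "${var.app_name}-${var.environment}-aurora-final" : null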

  tags = {
    Name        = "${var.app_name}-aurora-serverless"
    Environment = var.environment
    CostCenter  = var.cost_center
  }
}

resource "aws_rds_cluster_instance" "serverless_writer" {
  count              = var.use_serverless ? 1 : 0
  identifier         = "${var.app_name}-${var.environment}-aurora-1"
  cluster_identifier = aws_rds_cluster.serverless[0].id
  instance_class     = "db.serverless"
  engine             = aws_rds_cluster.serverless[0].engine
}

# Cost comparison for variable workload (50% idle time):
# Aurora Provisioned db.r6g.large:  ~$280/month (always on)
# Aurora Serverless v2 (avg 2 ACU): ~$175/month (scales down to min_capacity when idle, not to zero)
# Savings: ~37% for variable workloads
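
The arithmetic behind that comparison can be reproduced in a few Terraform locals. This is a rough sketch, not a pricing tool: the $0.12 per ACU-hour rate and the 730-hour month are assumptions (check current pricing for your region), and the $280 provisioned figure is taken from the comparison above.

```hcl
# Hypothetical cost sketch. Prices are assumptions, not quotes.
locals {
  hours_per_month         = 730
  acu_hour_price_usd      = 0.12 # assumed Aurora Serverless v2 rate per ACU-hour
  average_acus            = 2    # average capacity from the comparison above
  provisioned_monthly_usd = 280  # db.r6g.large figure from the comparison above

  serverless_monthly_usd = local.average_acus * local.acu_hour_price_usd * local.hours_per_month
  savings_percent        = 100 * (local.provisioned_monthly_usd - local.serverless_monthly_usd) / local.provisioned_monthly_usd
}

output "aurora_serverless_monthly_usd" {
  value = local.serverless_monthly_usd # 2 * 0.12 * 730 = 175.2
}

output "aurora_serverless_savings_percent" {
  value = floor(local.savings_percent) # 37
}
```

Under these assumptions the break-even point is an average of roughly 3.2 ACUs (280 / (0.12 × 730)). A workload that averages more than that is likely cheaper on the provisioned instance.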

Complete Cost-Optimized EC2 Terraform Module

# modules/cost_optimized_ec2/main.tf
# Complete reusable module for cost-optimized EC2 deployments

terraform {
  required_version = ">= 1.5"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# Data sources for AMI selection
data "aws_ami" "amazon_linux_2023" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-*-x86_64"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}

data "aws_ami" "amazon_linux_2023_arm64" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["al2023-ami-*-arm64"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}

locals {
  # Select AMI based on architecture preference
  ami_id = var.use_arm64 ? data.aws_ami.amazon_linux_2023_arm64.id : data.aws_ami.amazon_linux_2023.id

  # Map instance families to ARM equivalents
  arm64_equivalent = {
    "t3.micro"   = "t4g.micro"
    "t3.small"   = "t4g.small"
    "t3.medium"  = "t4g.medium"
    "t3.large"   = "t4g.large"
    "m5.large"   = "m6g.large"
    "m5.xlarge"  = "m6g.xlarge"
    "m5.2xlarge" = "m6g.2xlarge"
    "m6i.large"  = "m6g.large"
    "m6i.xlarge" = "m6g.xlarge"
    "c5.large"   = "c6g.large"
    "c5.xlarge"  = "c6g.xlarge"
    "r5.large"   = "r6g.large"
    "r5.xlarge"  = "r6g.xlarge"
  }

  effective_instance_type = var.use_arm64 ? lookup(local.arm64_equivalent, var.instance_type, var.instance_type) : var.instance_type
}

# IAM role and instance profile
resource "aws_iam_role" "instance" {
  name = "${var.name}-instance-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "ec2.amazonaws.com"
      }
    }]
  })
}

resource "aws_iam_instance_profile" "instance" {
  name = "${var.name}-profile"
  role = aws_iam_role.instance.name
}

# Security group
resource "aws_security_group" "instance" {
  name_prefix = "${var.name}-"
  vpc_id      = var.vpc_id
  description = "Security group for ${var.name}"

  dynamic "ingress" {
    for_each = var.ingress_rules
    content {
      from_port   = ingress.value.from_port
      to_port     = ingress.value.to_port
      protocol    = ingress.value.protocol
      cidr_blocks = ingress.value.cidr_blocks
      description = ingress.value.description
    }
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
    description = "Allow all outbound"
  }

  tags = merge(var.common_tags, {
    Name = "${var.name}-sg"
  })

  lifecycle {
    create_before_destroy = true
  }
}

# Launch template with all cost optimizations
resource "aws_launch_template" "this" {
  name_prefix   = "${var.name}-"
  image_id      = local.ami_id
  instance_type = local.effective_instance_type
  key_name      = var.key_name
  user_data     = base64encode(var.user_data)

  iam_instance_profile {
    name = aws_iam_instance_profile.instance.name
  }

  vpc_security_group_ids = [aws_security_group.instance.id]

  monitoring {
    enabled = var.enable_detailed_monitoring
  }

  block_device_mappings {
    device_name = "/dev/xvda"
    ebs {
      volume_size           = var.root_volume_size
      volume_type           = "gp3"
      iops                  = var.root_volume_iops
      throughput            = var.root_volume_throughput
      encrypted             = true
      kms_key_id            = var.kms_key_id
      delete_on_termination = true
    }
  }

  metadata_options {
    http_endpoint               = "enabled"
    http_tokens                 = "required"  # IMDSv2
    http_put_response_hop_limit = 1
    instance_metadata_tags      = "enabled"
  }

  tag_specifications {
    resource_type = "instance"
    tags = merge(var.common_tags, {
      Name = var.name
    })
  }

  tag_specifications {
    resource_type = "volume"
    tags = merge(var.common_tags, {
      Name = "${var.name}-root-volume"
    })
  }
}

# Spot instance request (optional)
resource "aws_spot_instance_request" "this" {
  count = var.use_spot ? 1 : 0

  launch_template {
    id      = aws_launch_template.this.id
    version = "$Latest"
  }

  spot_price                     = var.spot_max_price
  wait_for_fulfillment           = true
  spot_type                      = "persistent"
  instance_interruption_behavior = "stop"

  subnet_id = var.subnet_id

  tags = merge(var.common_tags, {
    Name = var.name
  })
}

# On-demand instance (default)
resource "aws_instance" "this" {
  count = var.use_spot ? 0 : 1

  launch_template {
    id      = aws_launch_template.this.id
    version = "$Latest"
  }

  subnet_id = var.subnet_id

  tags = merge(var.common_tags, {
    Name = var.name
  })

  lifecycle {
    ignore_changes = [ami]
  }
}

# Automated backup with Data Lifecycle Manager
# Assumes the default DLM service role (AWSDataLifecycleManagerDefaultRole) already
# exists in the account. The ARN is built directly because try() cannot recover
# from a data source that fails to read.
resource "aws_dlm_lifecycle_policy" "instance_backups" {
  count              = var.enable_automated_backups ? 1 : 0
  description        = "Daily backups for ${var.name}"
  execution_role_arn = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/service-role/AWSDataLifecycleManagerDefaultRole"

  policy_details {
    resource_types = ["VOLUME"]

    schedule {
      name = "Daily backups"

      create_rule {
        interval      = 24
        interval_unit = "HOURS"
        times         = ["03:00"]
      }

      retain_rule {
        count = var.backup_retention_count
      }

      tags_to_add = {
        SnapshotCreator = "DLM"
        Name            = "${var.name}-daily"
      }

      copy_tags = true
    }

    # Must match the volume tag set in the launch template, not the instance Name tag
    target_tags = {
      Name = "${var.name}-root-volume"
    }
  }
}

data "aws_caller_identity" "current" {}
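
The module above references a number of input variables that are not declared in main.tf. A minimal variables.tf might look like the sketch below. The variable names come from the module itself, but every type, default, and description here is an assumption and should be adjusted to your own standards.

```hcl
# modules/cost_optimized_ec2/variables.tf
# Hypothetical sketch: types and defaults are assumptions.

variable "name" {
  description = "Base name for all resources created by the module"
  type        = string
}

variable "vpc_id" { type = string }
variable "subnet_id" { type = string }

variable "instance_type" {
  description = "x86 instance type; mapped to a Graviton type when use_arm64 is true"
  type        = string
  default     = "t3.medium"
}

variable "use_arm64" {
  type    = bool
  default = true
}

variable "use_spot" {
  type    = bool
  default = false
}

variable "spot_max_price" {
  description = "Maximum hourly Spot price; null caps the bid at the On-Demand price"
  type        = string
  default     = null
}

variable "key_name" {
  type    = string
  default = null
}

variable "user_data" {
  type    = string
  default = ""
}

variable "enable_detailed_monitoring" {
  description = "1-minute CloudWatch metrics, billed separately"
  type        = bool
  default     = false
}

variable "root_volume_size" {
  type    = number
  default = 20 # GiB
}

variable "root_volume_iops" {
  type    = number
  default = 3000 # gp3 baseline
}

variable "root_volume_throughput" {
  type    = number
  default = 125 # gp3 baseline, MiB/s
}

variable "kms_key_id" {
  type    = string
  default = null # null uses the account's default EBS encryption key
}

variable "ingress_rules" {
  type = list(object({
    from_port   = number
    to_port     = number
    protocol    = string
    cidr_blocks = list(string)
    description = string
  }))
  default = []
}

variable "common_tags" {
  type    = map(string)
  default = {}
}

variable "enable_automated_backups" {
  type    = bool
  default = true
}

variable "backup_retention_count" {
  type    = number
  default = 7
}
```

With those declarations in place, a caller could use the module roughly as follows. The module label, tag values, and the `var.vpc_id` / `var.private_subnet_id` inputs are placeholders.

```hcl
module "api_server" {
  source = "./modules/cost_optimized_ec2"

  name          = "api-server"
  vpc_id        = var.vpc_id            # placeholder input
  subnet_id     = var.private_subnet_id # placeholder input
  instance_type = "m5.large"            # resolves to m6g.large because use_arm64 = true
  use_arm64     = true

  common_tags = {
    Environment = "staging"
    CostCenter  = "platform"
  }
}
```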

Cost Optimization Checklist

## Monthly Cost Optimization Checklist

### Compute
- [ ] Run AWS Compute Optimizer and review recommendations
- [ ] Review idle EC2 instances (CPU < 5% for 7+ days)
- [ ] Check for orphaned/unattached EBS volumes
- [ ] Verify Spot instance usage meets 30%+ of eligible compute
- [ ] Review Auto Scaling group min/desired capacity settings
- [ ] Check for unused Elastic IPs ($0.005/hour each; attached public IPv4 addresses are billed at the same rate)
- [ ] Review Load Balancer utilization (requests per ALB)
- [ ] Evaluate Graviton migration candidates
- [ ] Review Lambda function memory allocation (Power Tuning)
- [ ] Check for instances that have been stopped for more than 30 days (their EBS volumes still incur charges)

### Storage
- [ ] Review S3 bucket sizes and lifecycle policy coverage
- [ ] Identify incomplete multipart uploads older than 7 days
- [ ] Check for S3 buckets with no lifecycle policy
- [ ] Review EBS gp2 → gp3 migration candidates
- [ ] Identify orphaned snapshots (source volume or AMI deleted) older than the retention period
- [ ] Check EFS burst credit balance
- [ ] Review RDS snapshot retention and automated backups

### Database
- [ ] Review RDS Performance Insights for right-sizing
- [ ] Check RDS Reserved Instance coverage
- [ ] Evaluate Aurora Serverless for dev/test databases
- [ ] Review database storage autoscaling configuration
- [ ] Check for unused read replicas

### Network
- [ ] Review NAT Gateway data processing charges
- [ ] Verify VPC endpoint usage for AWS service traffic
- [ ] Check CloudFront cache hit ratio (target >80%)
- [ ] Review data transfer charges (inter-region, internet egress)
- [ ] Check for unused Elastic Load Balancers

### Rate Optimization
- [ ] Review Savings Plans utilization and coverage
- [ ] Check for expiring Reserved Instances (90-day lookahead)
- [ ] Evaluate 3-year vs 1-year commitment for stable workloads
- [ ] Review Savings Plans recommendation reports
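
Some checklist items can be enforced in code instead of reviewed by hand. As one illustration, the "incomplete multipart uploads older than 7 days" item from the Storage list could be handled by a lifecycle rule along these lines. The variable name `bucket_id` and the resource label are hypothetical, and the sketch assumes the bucket has no other lifecycle configuration, since a bucket accepts only one.

```hcl
# Hypothetical sketch: abort multipart uploads left incomplete for 7 days.
variable "bucket_id" {
  description = "Name of an existing S3 bucket (placeholder input)"
  type        = string
}

resource "aws_s3_bucket_lifecycle_configuration" "abort_stale_uploads" {
  bucket = var.bucket_id

  rule {
    id     = "abort-incomplete-multipart-uploads"
    status = "Enabled"

    filter {} # empty filter applies the rule to every object in the bucket

    abort_incomplete_multipart_upload {
      days_after_initiation = 7
    }
  }
}
```

If the bucket already has lifecycle rules managed in Terraform, add the `abort_incomplete_multipart_upload` block to a rule in that existing resource. A second `aws_s3_bucket_lifecycle_configuration` for the same bucket would overwrite the first.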

30% Cost Reduction Playbook: Based on my experience introducing FinOps practices across multiple organizations, five quick wins typically combine to deliver about 30% savings: (1) gp2→gp3 migration (~5-8%), (2) scheduled shutdown for dev/test environments (~10-15%), (3) Spot instance adoption for eligible workloads (~8-12%), (4) right-sizing the top 20 instances (~5-10%), and (5) S3 lifecycle policies (~3-5%). These changes are non-disruptive and can be implemented within 30 days.