Cost Optimization
Systematic cost optimization reduces cloud spend without sacrificing performance or reliability. This guide covers proven strategies across compute, storage, and networking, with real Terraform modules, scripts, and decision frameworks you can deploy today.
The Cost Optimization Framework
Effective cost optimization follows a structured approach. Based on my Lean Six Sigma Black Belt training, I apply the DMAIC methodology to cloud cost reduction:
- Define scope: Identify the target environment (dev, staging, prod), services, and team budgets.
- Measure current state: Establish baseline costs using Cost Explorer or your FinOps tool. Identify top 10 cost drivers.
- Analyze waste: Identify idle resources, overprovisioning, unused RIs, and opportunities for rate optimization.
- Improve: Execute right-sizing, RI purchases, architectural changes, and policy enforcement.
- Control: Implement automated governance to prevent regression. Set up anomaly detection and budget alerts.
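The Measure step can be sketched in a few lines of Python. This is a minimal illustration, assuming billing data has already been exported as `(service, cost)` rows; the function name `top_cost_drivers` and the sample figures are hypothetical and not part of any AWS SDK.

```python
from collections import defaultdict


def top_cost_drivers(line_items, n=10):
    """Return the n most expensive services as (service, cost, share_pct) tuples."""
    totals = defaultdict(float)
    for service, cost in line_items:
        totals[service] += cost
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return [
        (service, round(cost, 2),
         round(100 * cost / grand_total, 1) if grand_total else 0.0)
        for service, cost in ranked
    ]


# Hypothetical billing rows: (service, monthly cost in USD)
sample = [("EC2", 6200.0), ("S3", 900.0), ("RDS", 2100.0), ("EC2", 800.0)]
for row in top_cost_drivers(sample, n=3):
    print(row)
```

Ranking by share of total spend keeps the Analyze step focused on the few services that dominate the bill.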
Compute Optimization
Compute typically represents 50-70% of cloud spend. It is the highest-impact optimization target.
Right-Sizing Instances
AWS Compute Optimizer analyzes CloudWatch metrics and recommends optimal instance types. Enable it at the organization level for maximum coverage.
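# Enable Compute Optimizer for the entire organization (run from the management account)
aws compute-optimizer update-enrollment-status --status Active --include-member-accounts
# Enable enhanced infrastructure metrics (a paid feature that extends the lookback period).
# Memory utilization is analyzed only when the CloudWatch agent publishes it from the instances.
aws compute-optimizer put-recommendation-preferences \
--resource-type Ec2Instance \
--enhanced-infrastructure-metrics Active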
# Export recommendations for analysis
aws compute-optimizer get-ec2-instance-recommendations \
--output json > compute_optimizer_recommendations.json
# Identify overprovisioned instances
jq '.instanceRecommendations[] |
select(.finding == "Overprovisioned") |
{
instanceArn: .instanceArn,
currentType: .currentInstanceType,
recommendedType: .recommendationOptions[0].instanceType,
estimatedMonthlySavings: .recommendationOptions[0].savingsOpportunity.estimatedMonthlySavings.value
}' compute_optimizer_recommendations.json
Note: Compute Optimizer factors in memory only when the CloudWatch agent publishes it. Collect mem_used_percent before making sizing decisions.
Spot / Preemptible Instances
Spot instances offer up to 90% savings for fault-tolerant workloads. Use them for batch processing, CI/CD, data analytics, and stateless microservices.
# spot-instance-policy.yaml - EKS/Kubernetes Spot configuration
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot-workloads
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m6i.large", "m6i.xlarge", "m5.large", "m5.xlarge"]
        # All instance types listed above are x86; add Graviton types
        # together with "arm64" here once images are multi-arch
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:
        name: default
  limits:
    cpu: 1000
    memory: 4000Gi
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h
  weight: 10
---
# Pod disruption budget for spot-aware workloads
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-service-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api-service
# AWS Auto Scaling Group with Spot and On-Demand mixed instances policy
resource "aws_autoscaling_group" "mixed_workload" {
name = "mixed-workload-asg"
vpc_zone_identifier = var.private_subnet_ids
desired_capacity = 4
min_size = 2
max_size = 20
mixed_instances_policy {
launch_template {
launch_template_specification {
launch_template_id = aws_launch_template.mixed.id
version = "$Latest"
}
override {
instance_type = "m6i.xlarge"
weighted_capacity = "4"
}
override {
instance_type = "m6i.2xlarge"
weighted_capacity = "8"
}
override {
instance_type = "m5.xlarge"
weighted_capacity = "4"
}
}
instances_distribution {
on_demand_base_capacity = 2 # First 2 capacity units are always On-Demand
on_demand_percentage_above_base_capacity = 25 # 25% On-Demand above base
spot_allocation_strategy = "capacity-optimized" # spot_instance_pools applies only to "lowest-price", so it is omitted
}
}
tag {
key = "CostOptimization"
value = "spot-mixed"
propagate_at_launch = true
}
}
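To see how the `instances_distribution` settings split capacity, the arithmetic can be sketched in Python. This is an approximation that ignores how Auto Scaling rounds to whole instances, and the Spot price in the usage example is a hypothetical figure, not a quoted rate.

```python
def split_capacity(desired, on_demand_base, on_demand_pct_above_base):
    """Split desired capacity units into (on_demand, spot), ignoring rounding."""
    base = min(desired, on_demand_base)
    above_base = desired - base
    on_demand = base + above_base * on_demand_pct_above_base / 100
    return on_demand, desired - on_demand


def blended_hourly_cost(desired, on_demand_base, on_demand_pct,
                        on_demand_price, spot_price):
    """Hourly cost of the fleet given per-unit On-Demand and Spot prices."""
    on_demand, spot = split_capacity(desired, on_demand_base, on_demand_pct)
    return on_demand * on_demand_price + spot * spot_price


# 10 units with base 2 and 25% On-Demand above base
on_demand_units, spot_units = split_capacity(10, 2, 25)
print(on_demand_units, spot_units)
# Hypothetical prices per capacity unit: $0.192 On-Demand, $0.06 Spot
print(round(blended_hourly_cost(10, 2, 25, 0.192, 0.06), 3))
```

Raising the base or the percentage buys stability at the cost of a higher blended rate, so it helps to run the numbers before changing either setting.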
Graviton / ARM-Based Instances
AWS Graviton2/3 instances offer up to 40% better price-performance than comparable x86 instances. GCP's Arm-based Tau T2A and Azure's Dpsv5 series provide similar benefits.
#!/bin/bash
# graviton-migration-checklist.sh
# Check workload compatibility for Graviton migration
echo "=== Graviton Migration Readiness Check ==="
# 1. Check if application has ARM-compatible containers
if docker manifest inspect myapp:latest 2>/dev/null | grep -q "arm64"; then
  echo "[OK] Container image has arm64 variant"
else
  echo "[FAIL] Container image missing arm64 variant: rebuild with multi-arch"
fi
# 2. Check language runtime ARM support (reports the architecture of this host)
if command -v python3 &>/dev/null; then
  python3 -c "import platform; print(f'[INFO] Python running on {platform.machine()}')"
fi
if command -v node &>/dev/null; then
  node -e "console.log('[INFO] Node.js:', process.arch)"
fi
# 3. Check for native dependencies
if ldd /opt/myapp/bin/binary 2>/dev/null | grep -q "x86-64"; then
  echo "[FAIL] Binary compiled for x86-64: recompile for arm64"
fi
# 4. Check database driver compatibility
pip list 2>/dev/null | grep -E "(psycopg2|PyMySQL|redis)" &>/dev/null && \
  echo "[OK] Common Python DB drivers are ARM-compatible"
# 5. Verify no AVX/AVX2 CPU instructions required
grep -r "avx" /opt/myapp/config 2>/dev/null && \
  echo "[WARN] AVX references found: verify ARM NEON equivalent exists"
echo ""
echo "Graviton instance pricing comparison (us-east-1):"
# Single quotes keep the shell from expanding $0 inside the prices
echo ' m6i.xlarge (x86): $0.192/hour'
echo ' m6g.xlarge (ARM): $0.154/hour (~20% savings)'
echo ' m7g.xlarge (ARM): $0.163/hour (~15% savings, better perf)'
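The savings percentages quoted in the script follow directly from the hourly prices. A small sketch of that arithmetic, using the us-east-1 figures from the script and a hypothetical fleet size of 10 instances:

```python
HOURS_PER_MONTH = 730


def savings_pct(x86_hourly, arm_hourly):
    """Percentage saved per instance-hour by moving from x86 to ARM."""
    return round((x86_hourly - arm_hourly) / x86_hourly * 100, 1)


def monthly_fleet_savings(x86_hourly, arm_hourly, instance_count):
    """Monthly USD saved for a fleet that runs around the clock."""
    return (x86_hourly - arm_hourly) * instance_count * HOURS_PER_MONTH


print(savings_pct(0.192, 0.154))  # m6i.xlarge -> m6g.xlarge
print(savings_pct(0.192, 0.163))  # m6i.xlarge -> m7g.xlarge
print(round(monthly_fleet_savings(0.192, 0.154, 10), 2))
```

This compares list prices only. Price-performance also depends on how the workload behaves on ARM, so benchmark before resizing the fleet.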
Autoscaling Policies and Scheduled Scaling
# cost_optimized_autoscaling.tf - Terraform module for cost-optimized EC2
locals {
business_hours_scale_up = {
min_size = 4
max_size = 20
desired_capacity = 8
recurrence = "0 8 * * MON-FRI" # 8 AM Eastern (per time_zone), weekdays
time_zone = "America/New_York"
}
after_hours_scale_down = {
min_size = 1
max_size = 5
desired_capacity = 1
recurrence = "0 19 * * MON-FRI" # 7 PM Eastern (per time_zone), weekdays
time_zone = "America/New_York"
}
weekend_scale_down = {
min_size = 0
max_size = 2
desired_capacity = 0
recurrence = "0 0 * * SAT" # Midnight Eastern at the start of Saturday (end of Friday)
time_zone = "America/New_York"
}
}
resource "aws_autoscaling_group" "application" {
name = "${var.app_name}-asg"
vpc_zone_identifier = var.private_subnet_ids
target_group_arns = var.target_group_arns
health_check_type = "ELB"
health_check_grace_period = 300
min_size = var.min_size
max_size = var.max_size
desired_capacity = var.desired_capacity
launch_template {
id = aws_launch_template.application.id
version = "$Latest"
}
instance_refresh {
strategy = "Rolling"
preferences {
min_healthy_percentage = 66
instance_warmup = 120
}
}
tag {
key = "Name"
value = "${var.app_name}-instance"
propagate_at_launch = true
}
tag {
key = "Environment"
value = var.environment
propagate_at_launch = true
}
tag {
key = "AutoShutdown"
value = var.environment == "production" ? "false" : "true"
propagate_at_launch = true
}
tag {
key = "CostCenter"
value = var.cost_center
propagate_at_launch = true
}
}
# Target tracking scaling: scale based on average CPU
resource "aws_autoscaling_policy" "cpu_target_tracking" {
name = "${var.app_name}-cpu-tracking"
autoscaling_group_name = aws_autoscaling_group.application.name
policy_type = "TargetTrackingScaling"
target_tracking_configuration {
predefined_metric_specification {
predefined_metric_type = "ASGAverageCPUUtilization"
}
target_value = 60.0
disable_scale_in = false
}
}
# Predictive scaling: ML-based proactive scaling
resource "aws_autoscaling_policy" "predictive" {
count = var.enable_predictive_scaling ? 1 : 0
name = "${var.app_name}-predictive"
autoscaling_group_name = aws_autoscaling_group.application.name
policy_type = "PredictiveScaling"
predictive_scaling_configuration {
metric_specification {
target_value = 60.0
predefined_load_metric_specification {
predefined_metric_type = "ASGTotalCPUUtilization"
# resource_label identifies an ALB target group and applies only to request-count metrics, so it is omitted for CPU
}
customized_scaling_metric_specification {
metric_data_queries {
id = "scaling"
metric_stat {
metric {
metric_name = "CPUUtilization"
namespace = "AWS/EC2"
dimensions {
name = "AutoScalingGroupName"
value = aws_autoscaling_group.application.name
}
}
stat = "Average"
}
}
}
}
mode = "ForecastAndScale"
scheduling_buffer_time = 10
}
}
# Scheduled scaling: business hours only
resource "aws_autoscaling_schedule" "scale_up_morning" {
count = var.environment != "production" ? 1 : 0
scheduled_action_name = "scale-up-morning"
autoscaling_group_name = aws_autoscaling_group.application.name
min_size = local.business_hours_scale_up.min_size
max_size = local.business_hours_scale_up.max_size
desired_capacity = local.business_hours_scale_up.desired_capacity
recurrence = local.business_hours_scale_up.recurrence
time_zone = local.business_hours_scale_up.time_zone
}
resource "aws_autoscaling_schedule" "scale_down_evening" {
count = var.environment != "production" ? 1 : 0
scheduled_action_name = "scale-down-evening"
autoscaling_group_name = aws_autoscaling_group.application.name
min_size = local.after_hours_scale_down.min_size
max_size = local.after_hours_scale_down.max_size
desired_capacity = local.after_hours_scale_down.desired_capacity
recurrence = local.after_hours_scale_down.recurrence
time_zone = local.after_hours_scale_down.time_zone
}
resource "aws_autoscaling_schedule" "scale_down_weekend" {
count = var.environment != "production" ? 1 : 0
scheduled_action_name = "scale-down-weekend"
autoscaling_group_name = aws_autoscaling_group.application.name
min_size = local.weekend_scale_down.min_size
max_size = local.weekend_scale_down.max_size
desired_capacity = local.weekend_scale_down.desired_capacity
recurrence = local.weekend_scale_down.recurrence
time_zone = local.weekend_scale_down.time_zone
}
# Launch template with cost-optimized settings
resource "aws_launch_template" "application" {
name_prefix = "${var.app_name}-"
image_id = var.ami_id
instance_type = var.instance_type
key_name = var.key_name
iam_instance_profile {
name = aws_iam_instance_profile.application.name
}
vpc_security_group_ids = var.security_group_ids
# Enable detailed monitoring for better metrics granularity
monitoring {
enabled = true
}
# Use gp3 EBS: about 20% cheaper per GB than gp2, with a 3,000 IOPS baseline
block_device_mappings {
device_name = "/dev/xvda"
ebs {
volume_size = var.root_volume_size
volume_type = "gp3"
iops = 3000
throughput = 125
encrypted = true
delete_on_termination = true
}
}
metadata_options {
http_endpoint = "enabled"
http_tokens = "required" # IMDSv2, a security best practice
http_put_response_hop_limit = 1
}
tag_specifications {
resource_type = "instance"
tags = {
Name = "${var.app_name}-instance"
Environment = var.environment
CostCenter = var.cost_center
}
}
tag_specifications {
resource_type = "volume"
tags = {
Name = "${var.app_name}-root-volume"
Environment = var.environment
CostCenter = var.cost_center
}
}
user_data = base64encode(templatefile("${path.module}/userdata.sh", {
app_name = var.app_name
environment = var.environment
}))
}
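The effect of the scheduled actions for non-production environments can be estimated by counting instance-hours per week. The sketch below assumes the desired capacity set by each scheduled action holds until the next one fires (in practice target tracking can move it); the segment breakdown is derived from the three `locals` schedules and is illustrative only.

```python
HOURS_PER_WEEK = 168


def weekly_instance_hours(segments):
    """Sum instance-hours for (hours_per_week, instance_count) segments."""
    covered = sum(hours for hours, _ in segments)
    if covered != HOURS_PER_WEEK:
        raise ValueError(f"segments cover {covered} hours, expected {HOURS_PER_WEEK}")
    return sum(hours * count for hours, count in segments)


schedule = [
    (55, 8),  # Mon-Fri 08:00-19:00 at desired_capacity = 8
    (57, 1),  # Mon-Thu 19:00-08:00 plus Fri 19:00-24:00 at desired_capacity = 1
    (56, 0),  # Sat 00:00 through Mon 08:00 at desired_capacity = 0
]
always_on = [(168, 8)]

scheduled = weekly_instance_hours(schedule)
baseline = weekly_instance_hours(always_on)
print(f"{scheduled} vs {baseline} instance-hours/week "
      f"({100 * (1 - scheduled / baseline):.0f}% fewer)")
```

The coverage check catches schedules that leave gaps or overlap, which is an easy mistake when the cron expressions change.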
Savings Plans and Reserved Instances
Purchase Strategy
| Payment Option | Discount vs On-Demand | Cash Flow Impact | Best For |
|---|---|---|---|
| All Upfront | Highest (~40-60%) | Large initial outlay | Organizations with committed budgets; highest discount |
| Partial Upfront (~50%) | Medium (~35-50%) | Moderate initial outlay | Balanced approach; most common choice |
| No Upfront | Lower (~25-40%) | None | Cash-constrained teams; still significant savings |
Break-Even Analysis
#!/usr/bin/env python3
# ri_break_even.py - Reserved Instance break-even calculator
import argparse


def calculate_break_even(on_demand_hourly, ri_upfront, ri_hourly,
                         term_months=12, hours_per_month=730):
    """Return RI break-even metrics, or None if the RI saves nothing."""
    monthly_on_demand = on_demand_hourly * hours_per_month
    monthly_ri = ri_hourly * hours_per_month
    monthly_savings = monthly_on_demand - monthly_ri
    if monthly_savings <= 0:
        return None
    break_even_months = ri_upfront / monthly_savings
    # Both totals assume a single upfront payment
    total_savings_1yr = (monthly_savings * 12) - ri_upfront
    total_savings_3yr = (monthly_savings * 36) - ri_upfront
    roi_1yr = (total_savings_1yr / ri_upfront * 100) if ri_upfront > 0 else float('inf')
    roi_3yr = (total_savings_3yr / ri_upfront * 100) if ri_upfront > 0 else float('inf')
    return {
        "monthly_on_demand": monthly_on_demand,
        # Upfront fee amortized over the selected term
        "monthly_ri_total": monthly_ri + (ri_upfront / term_months),
        "monthly_savings": monthly_savings,
        "break_even_months": break_even_months,
        "total_savings_1yr": total_savings_1yr,
        "total_savings_3yr": total_savings_3yr,
        "roi_1yr_pct": roi_1yr,
        "roi_3yr_pct": roi_3yr,
    }


def main():
    parser = argparse.ArgumentParser(description="RI Break-Even Calculator")
    parser.add_argument("--ondemand-hourly", type=float, required=True)
    parser.add_argument("--ri-upfront", type=float, default=0)
    parser.add_argument("--ri-hourly", type=float, required=True)
    parser.add_argument("--term", choices=["1yr", "3yr"], default="1yr")
    args = parser.parse_args()
    term_months = 12 if args.term == "1yr" else 36
    result = calculate_break_even(
        args.ondemand_hourly,
        args.ri_upfront,
        args.ri_hourly,
        term_months=term_months,
    )
    if result is None:
        print("ERROR: RI costs more than on-demand. Not recommended.")
        return
    print(f"\n{'='*55}")
    print(f" RI Break-Even Analysis")
    print(f"{'='*55}")
    print(f" On-Demand Hourly: ${args.ondemand_hourly:.4f}")
    print(f" RI Upfront: ${args.ri_upfront:,.2f}")
    print(f" RI Hourly: ${args.ri_hourly:.4f}")
    print(f" Term: {args.term}")
    print(f" {'-'*51}")
    print(f" Monthly On-Demand: ${result['monthly_on_demand']:,.2f}")
    print(f" Monthly RI (amortized): ${result['monthly_ri_total']:,.2f}")
    print(f" Monthly Savings: ${result['monthly_savings']:,.2f}")
    print(f" {'-'*51}")
    print(f" Break-Even Point: {result['break_even_months']:.1f} months")
    print(f" 1-Year Total Savings: ${result['total_savings_1yr']:,.2f}")
    print(f" 3-Year Total Savings: ${result['total_savings_3yr']:,.2f}")
    print(f" 1-Year ROI: {result['roi_1yr_pct']:.1f}%")
    print(f" 3-Year ROI: {result['roi_3yr_pct']:.1f}%")
    print(f"{'='*55}")
    if result['break_even_months'] <= 6:
        print(" RECOMMENDED: Break-even under 6 months")
    elif result['break_even_months'] <= 12:
        print(" CONDITIONAL: Review utilization before committing")
    else:
        print(" NOT RECOMMENDED: Break-even exceeds 12 months")


if __name__ == "__main__":
    main()

# Example: m6i.xlarge in us-east-1
# python3 ri_break_even.py --ondemand-hourly 0.192 --ri-upfront 514.0 --ri-hourly 0.054 --term 1yr
Storage Optimization
S3 Lifecycle Policies and Intelligent Tiering
# s3_lifecycle.tf - Cost-optimized S3 bucket with lifecycle policies
resource "aws_s3_bucket" "data_lake" {
bucket = "company-data-lake-${var.environment}-${data.aws_caller_identity.current.account_id}"
}
# Intelligent-Tiering archive tiers: applies only to objects stored in the INTELLIGENT_TIERING storage class
resource "aws_s3_bucket_intelligent_tiering_configuration" "data_lake" {
bucket = aws_s3_bucket.data_lake.id
name = "EntireBucket"
tiering {
access_tier = "ARCHIVE_ACCESS"
days = 90
}
tiering {
access_tier = "DEEP_ARCHIVE_ACCESS"
days = 180
}
}
# Lifecycle policy for explicit transitions
resource "aws_s3_bucket_lifecycle_configuration" "data_lake" {
bucket = aws_s3_bucket.data_lake.id
rule {
id = "transition-to-cheaper-storage"
status = "Enabled"
filter {
prefix = "" # Apply to entire bucket
}
transition {
days = 30
storage_class = "STANDARD_IA"
}
transition {
days = 90
storage_class = "GLACIER_IR" # Glacier Instant Retrieval: for data rarely accessed but needed quickly
}
transition {
days = 365
storage_class = "DEEP_ARCHIVE"
}
noncurrent_version_transition {
noncurrent_days = 30
storage_class = "STANDARD_IA"
}
noncurrent_version_transition {
noncurrent_days = 60
storage_class = "GLACIER"
}
noncurrent_version_expiration {
noncurrent_days = 365
}
abort_incomplete_multipart_upload {
days_after_initiation = 7
}
}
rule {
id = "temp-data-cleanup"
status = "Enabled"
filter {
prefix = "temp/"
}
expiration {
days = 7
}
}
rule {
id = "log-cleanup"
status = "Enabled"
filter {
prefix = "logs/"
}
transition {
days = 30
storage_class = "STANDARD_IA"
}
expiration {
days = 90
}
}
}
# S3 Storage Lens for visibility into storage optimization opportunities
resource "aws_s3control_storage_lens_configuration" "organization" {
config_id = "organization-storage-lens"
account_id = var.master_account_id
storage_lens_configuration {
enabled = true
account_level {
activity_metrics {
enabled = true
}
advanced_cost_optimization_metrics {
enabled = true
}
advanced_data_protection_metrics {
enabled = true
}
bucket_level {
activity_metrics {
enabled = true
}
prefix_level {
storage_metrics {
enabled = true
selection_criteria {
delimiter = "/"
max_depth = 3
min_storage_bytes_percentage = 1.0 # Only prefixes holding at least 1% of bucket storage
}
}
}
}
}
data_export {
s3_bucket_destination {
format = "Parquet"
output_schema_version = "V_1"
account_id = var.master_account_id
arn = aws_s3_bucket.storage_lens_export.arn
prefix = "storage-lens/"
}
}
exclude {
buckets = [
aws_s3_bucket.temp_excluded.arn
]
}
}
}
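To reason about what the `transition-to-cheaper-storage` rule does to a current object version over time, the thresholds can be modeled as a simple lookup. This is a sketch only: it mirrors the 30/90/365-day transitions from the rule and ignores the fact that S3 evaluates lifecycle rules asynchronously, so real transitions can lag by a day or more.

```python
# (age threshold in days, storage class) taken from the lifecycle rule
TRANSITIONS = [(30, "STANDARD_IA"), (90, "GLACIER_IR"), (365, "DEEP_ARCHIVE")]


def storage_class_at(age_days, transitions=TRANSITIONS, initial="STANDARD"):
    """Return the storage class an object is expected to be in at a given age."""
    current = initial
    for threshold, storage_class in sorted(transitions):
        if age_days >= threshold:
            current = storage_class
    return current


for age in (0, 29, 30, 100, 400):
    print(age, storage_class_at(age))
```

A model like this is handy for estimating how much data sits in each class before committing to the thresholds.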
EBS Volume Right-Sizing and gp3 Migration
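gp3 volumes cost about 20% less per GB than gp2 and let you scale IOPS and throughput independently of volume size. Migrating gp2 volumes to gp3 is typically the easiest storage win. One caveat: gp2 delivers 3 IOPS per GiB, so volumes larger than 1,000 GiB exceed the gp3 baseline of 3,000 IOPS and need extra provisioned IOPS to keep the same performance.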
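Before migrating, it is worth checking whether a volume's gp2 baseline is above what gp3 provides by default. The sketch below uses the published gp2 formula (3 IOPS per GiB, with a floor of 100 and a cap of 16,000); the helper names are hypothetical.

```python
GP3_BASELINE_IOPS = 3000


def gp2_baseline_iops(size_gib):
    """Baseline IOPS a gp2 volume of this size delivers."""
    return max(100, min(16000, 3 * size_gib))


def gp3_iops_needed(size_gib):
    """IOPS to provision on gp3 so baseline performance does not drop."""
    return max(GP3_BASELINE_IOPS, gp2_baseline_iops(size_gib))


for size in (10, 500, 2000, 6000):
    print(size, gp2_baseline_iops(size), gp3_iops_needed(size))
```

Volumes whose `gp3_iops_needed` value exceeds 3,000 should be migrated with a higher `--iops` setting than the default.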
#!/bin/bash
# ebs_gp3_migration.sh - Identify and migrate gp2 volumes to gp3
export AWS_PROFILE=${AWS_PROFILE:-default}
REGIONS=$(aws ec2 describe-regions --query 'Regions[].RegionName' --output text)
total_volumes=0
total_savings=0
echo "=== EBS gp2 -> gp3 Migration Analysis ==="
echo "Region,Volumes,Monthly Savings"
for REGION in $REGIONS; do
  # One size (GiB) per line for every gp2 volume in the region
  sizes=$(aws ec2 describe-volumes \
    --region "$REGION" \
    --filters Name=volume-type,Values=gp2 \
    --query 'Volumes[].Size' \
    --output text | tr '\t' '\n' | sed '/^$/d')
  # Skip regions without gp2 volumes
  [ -z "$sizes" ] && continue
  count=$(echo "$sizes" | wc -l | tr -d ' ')
  total_gib=$(echo "$sizes" | paste -sd+ - | bc)
  # Assumes us-east-1 list prices: gp2 $0.10 and gp3 $0.08 per GiB-month
  savings=$(echo "$total_gib * 0.02" | bc -l)
  echo "$REGION,$count,\$$savings"
  total_volumes=$((total_volumes + count))
  total_savings=$(echo "$total_savings + $savings" | bc -l)
done
echo ""
echo "=== Summary ==="
echo "Total gp2 volumes: $total_volumes"
echo "Estimated monthly savings from migration: \$$total_savings"
echo ""
read -p "Proceed with migration? (yes/no): " confirm
if [ "$confirm" != "yes" ]; then
echo "Aborted."
exit 0
fi
# Execute migration
for REGION in $REGIONS; do
volume_ids=$(aws ec2 describe-volumes \
--region "$REGION" \
--filters Name=volume-type,Values=gp2 \
--query 'Volumes[].VolumeId' --output text)
for vol in $volume_ids; do
echo "Migrating $vol in $REGION..."
aws ec2 modify-volume \
--region "$REGION" \
--volume-id "$vol" \
--volume-type gp3 \
--iops 3000 \
--throughput 125
done
done
echo "Modification requests submitted. Track progress with 'aws ec2 describe-volumes-modifications' and monitor CloudWatch VolumeReadOps/VolumeWriteOps to verify performance."
Network Optimization
NAT Gateway Alternatives
NAT Gateway is one of the most expensive networking services in AWS ($0.045/hour + $0.045/GB). For non-production or batch workloads, alternatives include NAT instances, VPC endpoints, and IPv6 egress-only gateways.
# nat_gateway_optimization.tf - Hybrid NAT strategy
# Option 1: VPC Endpoints for AWS services (avoids NAT entirely for S3, DynamoDB, etc.)
resource "aws_vpc_endpoint" "s3" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${data.aws_region.current.name}.s3"
route_table_ids = aws_route_table.private[*].id
tags = {
Name = "s3-vpc-endpoint"
}
}
resource "aws_vpc_endpoint" "dynamodb" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${data.aws_region.current.name}.dynamodb"
route_table_ids = aws_route_table.private[*].id
tags = {
Name = "dynamodb-vpc-endpoint"
}
}
# Interface endpoints for other AWS services
resource "aws_vpc_endpoint" "ssm" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${data.aws_region.current.name}.ssm"
vpc_endpoint_type = "Interface"
private_dns_enabled = true
subnet_ids = var.private_subnet_ids
security_group_ids = [aws_security_group.vpc_endpoints.id]
tags = {
Name = "ssm-endpoint"
}
}
# Option 2: NAT Instance for dev/test environments (instead of NAT Gateway)
resource "aws_instance" "nat_instance" {
count = var.environment != "production" ? 1 : 0
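ami = data.aws_ami.amazon_linux_2023_arm64.id # t4g is ARM-based, so it needs an arm64 AMI
instance_type = "t4g.micro" # ARM-based, cheapest option
subnet_id = var.public_subnet_ids[0]
vpc_security_group_ids = [aws_security_group.nat_instance.id]
source_dest_check = false # Required for NAT
user_data = <<-EOF
#!/bin/bash
# iptables is not installed by default on Amazon Linux 2023
dnf install -y iptables-services
systemctl enable --now iptables
# Persist IP forwarding across reboots
echo "net.ipv4.ip_forward = 1" > /etc/sysctl.d/90-nat.conf
sysctl -p /etc/sysctl.d/90-nat.conf
# Use the primary interface (often ens5 rather than eth0 on current instance types)
IFACE=$(ip -o -4 route show to default | awk '{print $5}' | head -n1)
iptables -t nat -A POSTROUTING -o "$IFACE" -j MASQUERADE
iptables -F FORWARD
service iptables save
EOF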
tags = {
Name = "${var.environment}-nat-instance"
}
}
# Option 3: IPv6 Egress-Only Internet Gateway (for IPv6 workloads)
resource "aws_egress_only_internet_gateway" "ipv6" {
vpc_id = aws_vpc.main.id
tags = {
Name = "${var.environment}-eigw"
}
}
# Cost comparison (us-east-1, per month):
# NAT Gateway: ~$32.40 base + $0.045/GB
# NAT Instance (t4g.micro): ~$7.39 (24/7); use Spot for further savings
# Interface VPC Endpoint: ~$7.20/month per AZ + $0.01/GB (gateway endpoints for S3 and DynamoDB are free)
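The comparison above depends heavily on traffic volume. A minimal sketch of the monthly cost model, using the us-east-1 rates quoted in this section and a 720-hour month to match the ~$32.40 base figure; the function names are hypothetical.

```python
HOURS_PER_MONTH = 720  # matches the ~$32.40 NAT Gateway base figure


def nat_gateway_monthly(gb_processed, hourly=0.045, per_gb=0.045):
    """Monthly NAT Gateway cost: hourly charge plus data processing."""
    return hourly * HOURS_PER_MONTH + per_gb * gb_processed


def interface_endpoint_monthly(gb_processed, azs=1, hourly=0.01, per_gb=0.01):
    """Monthly interface endpoint cost: hourly charge per AZ plus data processing."""
    return hourly * HOURS_PER_MONTH * azs + per_gb * gb_processed


for gb in (0, 100, 1000, 10000):
    print(gb,
          round(nat_gateway_monthly(gb), 2),
          round(interface_endpoint_monthly(gb, azs=2), 2))
```

At low volumes the fixed hourly charges dominate. As traffic to AWS services grows, the lower per-GB rate of an endpoint accounts for most of the difference.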
CloudFront Caching Strategies
# cloudfront_optimization.tf - Cost-optimized CloudFront distribution
resource "aws_cloudfront_distribution" "app" {
enabled = true
is_ipv6_enabled = true
comment = "${var.app_name} CDN"
default_root_object = "index.html"
price_class = var.environment == "production" ? "PriceClass_All" : "PriceClass_100"
# Origin: S3 static assets
origin {
domain_name = aws_s3_bucket.static_assets.bucket_regional_domain_name
origin_id = "S3-static"
s3_origin_config {
origin_access_identity = aws_cloudfront_origin_access_identity.oai.cloudfront_access_identity_path
}
# Origin Shield reduces origin load and improves cache hit ratio
origin_shield {
enabled = true
origin_shield_region = "us-east-1"
}
}
# Origin: ALB for dynamic content
origin {
domain_name = aws_lb.app.dns_name
origin_id = "ALB-dynamic"
custom_origin_config {
http_port = 80
https_port = 443
origin_protocol_policy = "https-only"
origin_ssl_protocols = ["TLSv1.2"]
}
}
# Cache behavior for static assets (aggressive caching)
ordered_cache_behavior {
path_pattern = "/static/*"
allowed_methods = ["GET", "HEAD", "OPTIONS"]
cached_methods = ["GET", "HEAD"]
target_origin_id = "S3-static"
forwarded_values {
query_string = false
cookies {
forward = "none"
}
}
viewer_protocol_policy = "redirect-to-https"
min_ttl = 86400 # 1 day
default_ttl = 604800 # 7 days
max_ttl = 31536000 # 1 year
compress = true
}
# Cache behavior for images and media
ordered_cache_behavior {
path_pattern = "/media/*"
allowed_methods = ["GET", "HEAD"]
cached_methods = ["GET", "HEAD"]
target_origin_id = "S3-static"
forwarded_values {
query_string = false
cookies {
forward = "none"
}
}
viewer_protocol_policy = "redirect-to-https"
min_ttl = 86400
default_ttl = 2592000 # 30 days
max_ttl = 31536000 # 1 year
compress = true
}
# Default cache behavior for dynamic content
default_cache_behavior {
allowed_methods = ["DELETE", "GET", "HEAD", "OPTIONS", "PATCH", "POST", "PUT"]
cached_methods = ["GET", "HEAD"]
target_origin_id = "ALB-dynamic"
cache_policy_id = aws_cloudfront_cache_policy.dynamic.id
origin_request_policy_id = aws_cloudfront_origin_request_policy.dynamic.id
viewer_protocol_policy = "redirect-to-https"
compress = true
}
restrictions {
geo_restriction {
restriction_type = "none"
}
}
viewer_certificate {
cloudfront_default_certificate = var.environment != "production"
acm_certificate_arn = var.environment == "production" ? var.acm_certificate_arn : null
ssl_support_method = "sni-only"
minimum_protocol_version = "TLSv1.2_2021"
}
}
# Cache policy for dynamic content (short TTL, selective caching)
resource "aws_cloudfront_cache_policy" "dynamic" {
name = "${var.app_name}-dynamic-policy"
comment = "Cache dynamic content selectively"
default_ttl = 0
max_ttl = 60
min_ttl = 0
parameters_in_cache_key_and_forwarded_to_origin {
enable_accept_encoding_gzip = true
enable_accept_encoding_brotli = true
headers_config {
header_behavior = "none"
}
cookies_config {
cookie_behavior = "none"
}
query_strings_config {
query_string_behavior = "whitelist"
query_strings {
items = ["version", "cache-buster"]
}
}
}
}
Database Optimization
RDS Optimization
# rds_optimized.tf - Cost-optimized RDS configuration
resource "aws_db_instance" "primary" {
identifier = "${var.app_name}-${var.environment}"
# Engine configuration
engine = "postgres"
engine_version = "15.4"
instance_class = var.environment == "production" ? "db.m6g.xlarge" : "db.t4g.micro"
# Storage: always use gp3 for new instances
allocated_storage = 100
max_allocated_storage = 1000 # Enable storage autoscaling
storage_type = "gp3"
storage_encrypted = true
# iops is omitted: gp3 includes a 3,000 IOPS baseline, and RDS rejects custom IOPS below a per-engine storage threshold (400 GiB for PostgreSQL)
# Database configuration
db_name = var.database_name
username = var.database_username
password = var.database_password
# High availability: only in production
multi_az = var.environment == "production"
# Backup and maintenance
backup_retention_period = var.environment == "production" ? 30 : 7
backup_window = "03:00-04:00"
maintenance_window = "Mon:04:00-Mon:05:00"
# Performance Insights: invaluable for right-sizing
performance_insights_enabled = true
performance_insights_retention_period = 7
# Enhanced monitoring for detailed metrics
monitoring_interval = 60
monitoring_role_arn = aws_iam_role.rds_monitoring.arn
# Deletion protection and skip final snapshot for non-prod
deletion_protection = var.environment == "production"
skip_final_snapshot = var.environment != "production"
# Auto minor version upgrades
auto_minor_version_upgrade = true
vpc_security_group_ids = var.database_security_group_ids
db_subnet_group_name = aws_db_subnet_group.main.name
tags = {
Name = "${var.app_name}-postgres"
Environment = var.environment
CostCenter = var.cost_center
}
}
# Read replica for read-heavy production workloads
resource "aws_db_instance" "replica" {
count = var.environment == "production" && var.enable_read_replica ? 1 : 0
identifier = "${var.app_name}-${var.environment}-replica"
replicate_source_db = aws_db_instance.primary.identifier # Same-region replicas reference the source by identifier (ARN is for cross-region)
instance_class = "db.m6g.large" # Can be smaller than primary
storage_type = "gp3"
performance_insights_enabled = true
performance_insights_retention_period = 7
monitoring_interval = 60
monitoring_role_arn = aws_iam_role.rds_monitoring.arn
auto_minor_version_upgrade = true
skip_final_snapshot = true
vpc_security_group_ids = var.database_security_group_ids
# db_subnet_group_name is omitted: a same-region replica inherits the subnet group from its source
tags = {
Name = "${var.app_name}-postgres-replica"
Environment = var.environment
CostCenter = var.cost_center
}
}
# RDS Reserved Instance for production databases
# Purchase via AWS Console or API after 30 days of stable usage
# Example: db.m6g.xlarge, 1 year, partial upfront = ~40% savings
Aurora Serverless for Variable Workloads
# aurora_serverless.tf - Aurora Serverless v2 for variable or dev workloads
resource "aws_rds_cluster" "serverless" {
count = var.use_serverless ? 1 : 0
cluster_identifier = "${var.app_name}-${var.environment}-aurora"
engine = "aurora-postgresql"
engine_mode = "provisioned" # Aurora Serverless v2 uses "provisioned" with scaling
engine_version = "15.4"
database_name = var.database_name
master_username = var.database_username
master_password = var.database_password
# Serverless v2 scaling configuration
serverlessv2_scaling_configuration {
min_capacity = var.environment == "production" ? 1.0 : 0.5 # ACUs
max_capacity = var.environment == "production" ? 16.0 : 4.0
}
db_subnet_group_name = aws_db_subnet_group.main.name
vpc_security_group_ids = var.database_security_group_ids
storage_encrypted = true
backup_retention_period = 7
preferred_backup_window = "03:00-04:00"
skip_final_snapshot = var.environment != "production"
tags = {
Name = "${var.app_name}-aurora-serverless"
Environment = var.environment
CostCenter = var.cost_center
}
}
resource "aws_rds_cluster_instance" "serverless_writer" {
count = var.use_serverless ? 1 : 0
identifier = "${var.app_name}-${var.environment}-aurora-1"
cluster_identifier = aws_rds_cluster.serverless[0].id
instance_class = "db.serverless"
engine = aws_rds_cluster.serverless[0].engine
}
# Cost comparison for variable workload (50% idle time):
# Aurora Provisioned db.r6g.large: ~$280/month (always on)
# Aurora Serverless v2 (avg 2 ACU): ~$175/month (scales down to the configured minimum ACUs when idle)
# Savings: ~37% for variable workloads
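The Serverless v2 figure follows from average ACU consumption. The sketch below assumes a rate of $0.12 per ACU-hour, which is the rate implied by the ~$175/month figure for 2 ACUs; treat it as an assumption and check current pricing for your region.

```python
HOURS_PER_MONTH = 730


def serverless_v2_monthly(avg_acu, acu_hour_price=0.12):
    """Monthly compute cost for a given average ACU consumption."""
    return avg_acu * acu_hour_price * HOURS_PER_MONTH


def break_even_acu(provisioned_monthly, acu_hour_price=0.12):
    """Average ACU level at which Serverless v2 costs the same as provisioned."""
    return provisioned_monthly / (acu_hour_price * HOURS_PER_MONTH)


print(round(serverless_v2_monthly(2), 2))
print(round(break_even_acu(280), 1))
```

If the workload averages more ACUs than the break-even value, the provisioned instance is the cheaper option.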
Complete Cost-Optimized EC2 Terraform Module
# modules/cost_optimized_ec2/main.tf
# Complete reusable module for cost-optimized EC2 deployments
terraform {
required_version = ">= 1.5"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}
# Data sources for AMI selection
data "aws_ami" "amazon_linux_2023" {
most_recent = true
owners = ["amazon"]
filter {
name = "name"
values = ["al2023-ami-*-x86_64"]
}
filter {
name = "virtualization-type"
values = ["hvm"]
}
}
data "aws_ami" "amazon_linux_2023_arm64" {
most_recent = true
owners = ["amazon"]
filter {
name = "name"
values = ["al2023-ami-*-arm64"]
}
filter {
name = "virtualization-type"
values = ["hvm"]
}
}
locals {
# Select AMI based on architecture preference
ami_id = var.use_arm64 ? data.aws_ami.amazon_linux_2023_arm64.id : data.aws_ami.amazon_linux_2023.id
# Map instance families to ARM equivalents
arm64_equivalent = {
"t3.micro" = "t4g.micro"
"t3.small" = "t4g.small"
"t3.medium" = "t4g.medium"
"t3.large" = "t4g.large"
"m5.large" = "m6g.large"
"m5.xlarge" = "m6g.xlarge"
"m5.2xlarge" = "m6g.2xlarge"
"m6i.large" = "m6g.large"
"m6i.xlarge" = "m6g.xlarge"
"c5.large" = "c6g.large"
"c5.xlarge" = "c6g.xlarge"
"r5.large" = "r6g.large"
"r5.xlarge" = "r6g.xlarge"
}
effective_instance_type = var.use_arm64 ? lookup(local.arm64_equivalent, var.instance_type, var.instance_type) : var.instance_type
}
# IAM role and instance profile
resource "aws_iam_role" "instance" {
name = "${var.name}-instance-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "ec2.amazonaws.com"
}
}]
})
}
resource "aws_iam_instance_profile" "instance" {
name = "${var.name}-profile"
role = aws_iam_role.instance.name
}
# Security group
resource "aws_security_group" "instance" {
name_prefix = "${var.name}-"
vpc_id = var.vpc_id
description = "Security group for ${var.name}"
dynamic "ingress" {
for_each = var.ingress_rules
content {
from_port = ingress.value.from_port
to_port = ingress.value.to_port
protocol = ingress.value.protocol
cidr_blocks = ingress.value.cidr_blocks
description = ingress.value.description
}
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
description = "Allow all outbound"
}
tags = merge(var.common_tags, {
Name = "${var.name}-sg"
})
lifecycle {
create_before_destroy = true
}
}
# Launch template with all cost optimizations
resource "aws_launch_template" "this" {
name_prefix = "${var.name}-"
image_id = local.ami_id
instance_type = local.effective_instance_type
key_name = var.key_name
user_data = base64encode(var.user_data)
iam_instance_profile {
name = aws_iam_instance_profile.instance.name
}
vpc_security_group_ids = [aws_security_group.instance.id]
monitoring {
enabled = var.enable_detailed_monitoring
}
block_device_mappings {
device_name = "/dev/xvda"
ebs {
volume_size = var.root_volume_size
volume_type = "gp3"
iops = var.root_volume_iops
throughput = var.root_volume_throughput
encrypted = true
kms_key_id = var.kms_key_id
delete_on_termination = true
}
}
metadata_options {
http_endpoint = "enabled"
http_tokens = "required" # IMDSv2
http_put_response_hop_limit = 1
instance_metadata_tags = "enabled"
}
tag_specifications {
resource_type = "instance"
tags = merge(var.common_tags, {
Name = var.name
})
}
tag_specifications {
resource_type = "volume"
tags = merge(var.common_tags, {
Name = "${var.name}-root-volume"
})
}
}
# Spot instance request (optional)
resource "aws_spot_instance_request" "this" {
count = var.use_spot ? 1 : 0
launch_template {
id = aws_launch_template.this.id
version = "$Latest"
}
spot_price = var.spot_max_price
wait_for_fulfillment = true
spot_type = "persistent"
instance_interruption_behavior = "stop"
subnet_id = var.subnet_id
tags = merge(var.common_tags, {
Name = var.name
})
}
# On-demand instance (default)
resource "aws_instance" "this" {
count = var.use_spot ? 0 : 1
launch_template {
id = aws_launch_template.this.id
version = "$Latest"
}
subnet_id = var.subnet_id
tags = merge(var.common_tags, {
Name = var.name
})
lifecycle {
ignore_changes = [ami]
}
}
# Automated backup with Data Lifecycle Manager
# Assumes the default DLM service role already exists in the account
locals {
dlm_role_arn = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/service-role/AWSDataLifecycleManagerDefaultRole"
}
resource "aws_dlm_lifecycle_policy" "instance_backups" {
count = var.enable_automated_backups ? 1 : 0
description = "Daily backups for ${var.name}"
execution_role_arn = local.dlm_role_arn
policy_details {
resource_types = ["VOLUME"]
schedule {
name = "Daily backups"
create_rule {
interval = 24
interval_unit = "HOURS"
times = ["03:00"]
}
retain_rule {
count = var.backup_retention_count
}
tags_to_add = {
SnapshotCreator = "DLM"
Name = "${var.name}-daily"
}
copy_tags = true
}
target_tags = {
Name = "${var.name}-root-volume" # Must match the volume tag set in the launch template
}
}
}
data "aws_caller_identity" "current" {}
Cost Optimization Checklist
## Monthly Cost Optimization Checklist
### Compute
- [ ] Run AWS Compute Optimizer and review recommendations
- [ ] Review idle EC2 instances (CPU < 5% for 7+ days)
- [ ] Check for orphaned/unattached EBS volumes
- [ ] Verify Spot instance usage meets 30%+ of eligible compute
- [ ] Review Auto Scaling group min/desired capacity settings
- [ ] Check for unused Elastic IPs ($0.005/hour each)
- [ ] Review Load Balancer utilization (requests per ALB)
- [ ] Evaluate Graviton migration candidates
- [ ] Review Lambda function memory allocation (Power Tuning)
- [ ] Check for instances stopped for more than 30 days (attached EBS volumes still incur charges)
### Storage
- [ ] Review S3 bucket sizes and lifecycle policy coverage
- [ ] Identify incomplete multipart uploads older than 7 days
- [ ] Check for S3 buckets with no lifecycle policy
- [ ] Review EBS gp2 to gp3 migration candidates
- [ ] Identify orphaned snapshots older than the retention period
- [ ] Check EFS burst credit balance
- [ ] Review RDS snapshot retention and automated backups
### Database
- [ ] Review RDS Performance Insights for right-sizing
- [ ] Check RDS Reserved Instance coverage
- [ ] Evaluate Aurora Serverless for dev/test databases
- [ ] Review database storage autoscaling configuration
- [ ] Check for unused read replicas
### Network
- [ ] Review NAT Gateway data processing charges
- [ ] Verify VPC endpoint usage for AWS service traffic
- [ ] Check CloudFront cache hit ratio (target >80%)
- [ ] Review data transfer charges (inter-region, internet egress)
- [ ] Check for unused Elastic Load Balancers
### Rate Optimization
- [ ] Review Savings Plans utilization and coverage
- [ ] Check for expiring Reserved Instances (90-day lookahead)
- [ ] Evaluate 3-year vs 1-year commitment for stable workloads
- [ ] Review Savings Plans recommendation reports
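Some checklist items are easy to automate once metrics are exported. As an illustration, the "CPU < 5% for 7+ days" check could be sketched as follows, assuming daily average CPU values per instance have already been pulled from CloudWatch; the function names and sample data are hypothetical.

```python
def is_idle(daily_avg_cpu, threshold_pct=5.0, min_days=7):
    """True if the most recent min_days daily averages are all below the threshold."""
    if len(daily_avg_cpu) < min_days:
        return False  # Not enough history to decide
    return all(cpu < threshold_pct for cpu in daily_avg_cpu[-min_days:])


def idle_instances(cpu_by_instance, **kwargs):
    """Return the sorted IDs of instances that qualify as idle."""
    return sorted(
        instance_id
        for instance_id, series in cpu_by_instance.items()
        if is_idle(series, **kwargs)
    )


# Hypothetical daily average CPU percentages, oldest first
sample = {
    "i-aaa": [2.1, 1.8, 3.0, 2.2, 1.9, 2.5, 2.0],
    "i-bbb": [2.1, 1.8, 3.0, 2.2, 1.9, 2.5, 41.0],
    "i-ccc": [1.0, 1.0],
}
print(idle_instances(sample))
```

Requiring a full window of history avoids flagging instances that were only launched recently.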