Budgets & Alerts

Proactive budget monitoring prevents cost overruns and enables teams to make informed spending decisions before costs escalate. This guide covers AWS Budgets, anomaly detection, third-party tooling, and the operational processes that make cost alerting actionable.

AWS Budgets Setup and Configuration

AWS Budgets is the native cost monitoring service. It supports four budget types and integrates with SNS for notifications. All budgets should be created programmatically via CloudFormation or Terraform to ensure consistency across accounts.

Budget Types

| Budget Type | Monitors | Use Case | Alert Triggers |
|---|---|---|---|
| Cost Budget | Actual or forecasted spend against a fixed amount | Monthly/quarterly spend caps by team or account | Actual > 80%, 100%; Forecasted > 100% |
| Usage Budget | Service usage quantities (e.g., EC2 instance hours) | Track compute consumption for capacity planning | Usage exceeds threshold |
| Reserved Instance Budget | RI utilization and coverage | Ensure purchased RIs are fully utilized | Utilization < 80%, Coverage < 80% |
| Savings Plans Budget | Savings Plans utilization and coverage | Monitor SP commitment consumption | Utilization < 80%, Coverage < 80% |

Alert Thresholds

Best practice is a three-tier alerting system:

| Tier | Threshold | Channel | Response |
|---|---|---|---|
| 🟡 Advisory | Forecasted > 80% of budget | Slack #finops-alerts | Inform team; no action required |
| 🟠 Warning | Actual > 90% or Forecasted > 100% | Slack + Email to team lead | Investigate within 48 hours |
| 🔴 Critical | Actual > 100% or Anomaly detected | Slack + Email + PagerDuty | Immediate investigation required |
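
The tier table can be expressed as a small routing function. The sketch below is illustrative only: the function name `classify_alert`, the `CHANNELS` mapping, and the percentage inputs are assumptions made for this example, not part of any AWS API.

```python
from typing import Optional

# Channels per tier, taken from the table above (names are illustrative).
CHANNELS = {
    "advisory": ["slack"],
    "warning": ["slack", "email"],
    "critical": ["slack", "email", "pagerduty"],
}


def classify_alert(actual_pct: float, forecast_pct: float,
                   anomaly: bool = False) -> Optional[str]:
    """Return the highest matching tier, or None when no threshold is crossed.

    actual_pct and forecast_pct are spend expressed as a percentage of budget.
    """
    if actual_pct > 100 or anomaly:
        return "critical"
    if actual_pct > 90 or forecast_pct > 100:
        return "warning"
    if forecast_pct > 80:
        return "advisory"
    return None


tier = classify_alert(actual_pct=92.0, forecast_pct=97.0)
print(tier, CHANNELS[tier])
```

Checking the tiers from most to least severe means an alert that satisfies several rows of the table is routed once, to the strictest channel set.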

Complete AWS Budgets Terraform Configuration

# budgets.tf - Comprehensive AWS Budgets configuration

locals {
  budget_notifications = {
    advisory = {
      comparison_operator = "GREATER_THAN"
      threshold           = 80
      threshold_type      = "PERCENTAGE"
      notification_type   = "FORECASTED"
      subscriber_sns_topic = aws_sns_topic.budget_alerts.arn
    }
    warning = {
      comparison_operator = "GREATER_THAN"
      threshold           = 90
      threshold_type      = "PERCENTAGE"
      notification_type   = "ACTUAL"
      subscriber_sns_topic = aws_sns_topic.budget_alerts.arn
    }
    critical = {
      comparison_operator = "GREATER_THAN"
      threshold           = 100
      threshold_type      = "PERCENTAGE"
      notification_type   = "ACTUAL"
      subscriber_sns_topic = aws_sns_topic.budget_critical.arn
    }
  }

  team_budgets = {
    platform = { amount = 50000, costcenter = "CC-10001" }
    product  = { amount = 30000, costcenter = "CC-10002" }
    data     = { amount = 25000, costcenter = "CC-10003" }
    security = { amount = 15000, costcenter = "CC-10004" }
  }
}

# SNS Topics for budget notifications
resource "aws_sns_topic" "budget_alerts" {
  name              = "budget-alerts"
  kms_master_key_id = aws_kms_key.sns.arn

  tags = local.mandatory_tags
}

resource "aws_sns_topic" "budget_critical" {
  name              = "budget-critical-alerts"
  kms_master_key_id = aws_kms_key.sns.arn

  tags = local.mandatory_tags
}

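# SNS topic policies to allow Budgets to publish.
# Each topic that receives notifications needs its own policy.
resource "aws_sns_topic_policy" "budget_alerts" {
  arn = aws_sns_topic.budget_alerts.arn

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "AllowBudgetsToPublish"
        Effect = "Allow"
        Principal = {
          # costalerts.amazonaws.com lets Cost Anomaly Detection publish here too
          Service = ["budgets.amazonaws.com", "costalerts.amazonaws.com"]
        }
        Action   = "SNS:Publish"
        Resource = aws_sns_topic.budget_alerts.arn
      }
    ]
  })
}

resource "aws_sns_topic_policy" "budget_critical" {
  arn = aws_sns_topic.budget_critical.arn
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid       = "AllowBudgetsToPublish"
      Effect    = "Allow"
      Principal = { Service = "budgets.amazonaws.com" }
      Action    = "SNS:Publish"
      Resource  = aws_sns_topic.budget_critical.arn
    }]
  })
}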

# Overall account budget
resource "aws_budgets_budget" "monthly_total" {
  name              = "monthly-total-spend"
  budget_type       = "COST"
  limit_amount      = "150000"
  limit_unit        = "USD"
  time_unit         = "MONTHLY"
  time_period_start = "2024-01-01_00:00"

  cost_filter {
    name = "TagKeyValue"
    values = [
      "user:CostCenter$CC-10001",
      "user:CostCenter$CC-10002",
      "user:CostCenter$CC-10003",
      "user:CostCenter$CC-10004",
    ]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_sns_topic_arns  = [aws_sns_topic.budget_alerts.arn]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 90
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_sns_topic_arns  = [aws_sns_topic.budget_alerts.arn]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_sns_topic_arns  = [aws_sns_topic.budget_critical.arn]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_sns_topic_arns  = [aws_sns_topic.budget_critical.arn]
  }
}

# Per-team budgets with tag-based filtering
resource "aws_budgets_budget" "team" {
  for_each = local.team_budgets

  name         = "team-${each.key}-monthly"
  budget_type  = "COST"
  limit_amount = each.value.amount
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  cost_filter {
    name   = "TagKeyValue"
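    # "$${" is Terraform's escape for a literal "${", so build the "Key$Value" filter with format()
    values = [format("user:CostCenter$%s", each.value.costcenter)]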
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_sns_topic_arns  = [aws_sns_topic.budget_alerts.arn]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_sns_topic_arns  = [aws_sns_topic.budget_critical.arn]
  }
}

# EC2-specific usage budget (track compute hours)
resource "aws_budgets_budget" "ec2_usage" {
  name         = "ec2-compute-hours"
  budget_type  = "USAGE"
  limit_amount = "50000"  # instance-hours per month
  limit_unit   = "Hours"
  time_unit    = "MONTHLY"

  cost_filter {
    name   = "Service"
    values = ["Amazon Elastic Compute Cloud - Compute"]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 90
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_sns_topic_arns  = [aws_sns_topic.budget_alerts.arn]
  }
}

# Savings Plans utilization budget
resource "aws_budgets_budget" "sp_utilization" {
  name              = "savings-plans-utilization"
  budget_type       = "SAVINGS_PLANS_UTILIZATION"
  limit_amount      = "100"
  limit_unit        = "PERCENTAGE"
  time_unit         = "MONTHLY"

  cost_types {
    include_subscription = true
    use_blended          = false
  }

  notification {
    comparison_operator        = "LESS_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_sns_topic_arns  = [aws_sns_topic.budget_alerts.arn]
  }
}

# Reserved Instance coverage budget
resource "aws_budgets_budget" "ri_coverage" {
  name              = "ri-coverage"
  budget_type       = "RI_COVERAGE"
  limit_amount      = "100"
  limit_unit        = "PERCENTAGE"
  time_unit         = "MONTHLY"

  cost_types {
    include_subscription = true
    use_blended          = false
  }

  notification {
    comparison_operator        = "LESS_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_sns_topic_arns  = [aws_sns_topic.budget_alerts.arn]
  }
}

# KMS key for SNS encryption
resource "aws_kms_key" "sns" {
  description             = "KMS key for budget alert SNS topics"
  deletion_window_in_days = 7
  enable_key_rotation     = true

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "Enable IAM User Permissions"
        Effect = "Allow"
        Principal = {
          AWS = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:root"
        }
        Action   = "kms:*"
        Resource = "*"
      },
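      {
        Sid    = "Allow Budgets Service"
        Effect = "Allow"
        Principal = {
          # costalerts.amazonaws.com is included because Cost Anomaly Detection
          # also publishes to the encrypted budget_alerts topic
          Service = ["budgets.amazonaws.com", "costalerts.amazonaws.com"]
        }
        Action = [
          "kms:GenerateDataKey*",
          "kms:Decrypt"
        ]
        Resource = "*"
      }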
    ]
  })

  tags = local.mandatory_tags
}

data "aws_caller_identity" "current" {}

AWS Budgets CloudFormation Template

# budgets.yaml - CloudFormation equivalent for organizations using CloudFormation
AWSTemplateFormatVersion: '2010-09-09'
Description: 'AWS Budgets configuration with multi-tier alerting'

Parameters:
  MonthlyBudgetAmount:
    Type: Number
    Description: 'Monthly budget amount in USD'
    Default: 150000

  AlertEmail:
    Type: String
    Description: 'Email address for budget alerts'
    Default: 'finops-alerts@company.com'

  SlackWebhookSecretArn:
    Type: String
    Description: 'ARN of Secrets Manager secret containing Slack webhook URL'

Resources:
  # SNS Topic for budget notifications
  BudgetAlertTopic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: budget-alerts
      KmsMasterKeyId: !Ref BudgetAlertKey

  BudgetAlertKey:
    Type: AWS::KMS::Key
    Properties:
      Description: 'KMS key for budget alert SNS topics'
      EnableKeyRotation: true
      KeyPolicy:
        Version: '2012-10-17'
        Statement:
          - Sid: Enable IAM User Permissions
            Effect: Allow
            Principal:
              AWS: !Sub 'arn:aws:iam::${AWS::AccountId}:root'
            Action: 'kms:*'
            Resource: '*'
          - Sid: Allow Budgets Service
            Effect: Allow
            Principal:
              Service: budgets.amazonaws.com
            Action:
              - 'kms:GenerateDataKey*'
              - 'kms:Decrypt'
            Resource: '*'

  # SNS Topic Policy
  BudgetAlertTopicPolicy:
    Type: AWS::SNS::TopicPolicy
    Properties:
      Topics:
        - !Ref BudgetAlertTopic
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Sid: AllowBudgetsToPublish
            Effect: Allow
            Principal:
              Service: budgets.amazonaws.com
            Action: SNS:Publish
            Resource: !Ref BudgetAlertTopic

  # Lambda for Slack notification forwarding
  BudgetNotificationFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: budget-to-slack
      Runtime: python3.11
      Handler: index.lambda_handler
      Timeout: 30
      Role: !GetAtt BudgetLambdaRole.Arn
      Environment:
        Variables:
          SLACK_SECRET_ARN: !Ref SlackWebhookSecretArn
      Code:
        ZipFile: |
          import json
          import os
          import time
          import urllib.request
          import boto3

          secrets = boto3.client('secretsmanager')

          def lambda_handler(event, context):
              secret = secrets.get_secret_value(SecretId=os.environ['SLACK_SECRET_ARN'])
              webhook_url = json.loads(secret['SecretString'])['webhook_url']

              for record in event.get('Records', []):
                  raw = record['Sns']['Message']
                  # Notifications may arrive as plain text rather than JSON;
                  # fall back to forwarding the raw body in that case.
                  try:
                      message = json.loads(raw)
                  except ValueError:
                      message = None
                  if not isinstance(message, dict):
                      message = {}

                  budget_name = message.get('BudgetName', 'Unknown')
                  alert_type = message.get('NotificationType', 'Unknown')
                  threshold = (message.get('Trigger') or {}).get('Threshold', 'N/A')
                  amount = message.get('Amount', 'N/A')
                  forecasted = message.get('ForecastedAmount', 'N/A')

                  # A missing or non-numeric threshold is treated as a warning
                  try:
                      severity = "warning" if float(str(threshold).replace('$', '')) < 100 else "danger"
                  except ValueError:
                      severity = "warning"

                  slack_message = {
                      "attachments": [{
                          "color": severity,
                          "title": f"AWS Budget Alert: {budget_name}",
                          "text": "" if message else raw,
                          "fields": [
                              {"title": "Alert Type", "value": alert_type, "short": True},
                              {"title": "Threshold", "value": f"{threshold}%", "short": True},
                              {"title": "Budget Amount", "value": f"${amount}", "short": True},
                              {"title": "Forecasted", "value": f"${forecasted}", "short": True},
                          ],
                          "footer": "AWS Budgets",
                          "ts": int(time.time())  # Slack expects epoch seconds
                      }]
                  }

                  req = urllib.request.Request(
                      webhook_url,
                      data=json.dumps(slack_message).encode(),
                      headers={'Content-Type': 'application/json'}
                  )
                  urllib.request.urlopen(req, timeout=10)

              return {"statusCode": 200}

  BudgetLambdaRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
      Policies:
        - PolicyName: SecretsManagerAccess
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action: secretsmanager:GetSecretValue
                Resource: !Ref SlackWebhookSecretArn

  # SNS to Lambda subscription
  BudgetAlertSubscription:
    Type: AWS::SNS::Subscription
    Properties:
      Protocol: lambda
      TopicArn: !Ref BudgetAlertTopic
      Endpoint: !GetAtt BudgetNotificationFunction.Arn

  # Lambda permission for SNS
  BudgetLambdaPermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref BudgetNotificationFunction
      Action: lambda:InvokeFunction
      Principal: sns.amazonaws.com
      SourceArn: !Ref BudgetAlertTopic

  # Monthly cost budget
  MonthlyBudget:
    Type: AWS::Budgets::Budget
    Properties:
      Budget:
        BudgetName: monthly-total-budget
        BudgetLimit:
          Amount: !Ref MonthlyBudgetAmount
          Unit: USD
        TimeUnit: MONTHLY
        BudgetType: COST
        CostTypes:
          IncludeTax: true
          IncludeSubscription: true
          UseBlended: false
      NotificationsWithSubscribers:
        - Notification:
            NotificationType: FORECASTED
            ComparisonOperator: GREATER_THAN
            Threshold: 80
          Subscribers:
            - SubscriptionType: SNS
              Address: !Ref BudgetAlertTopic
        - Notification:
            NotificationType: ACTUAL
            ComparisonOperator: GREATER_THAN
            Threshold: 90
          Subscribers:
            - SubscriptionType: SNS
              Address: !Ref BudgetAlertTopic
        - Notification:
            NotificationType: ACTUAL
            ComparisonOperator: GREATER_THAN
            Threshold: 100
          Subscribers:
            - SubscriptionType: SNS
              Address: !Ref BudgetAlertTopic
            - SubscriptionType: EMAIL
              Address: !Ref AlertEmail

Outputs:
  BudgetAlertTopicArn:
    Description: 'SNS topic for budget alerts'
    Value: !Ref BudgetAlertTopic
    Export:
      Name: BudgetAlertTopicArn

AWS Cost Anomaly Detection

Cost Anomaly Detection uses ML to identify unusual spending patterns. Unlike budget alerts (which are threshold-based), anomaly detection identifies deviations from historical patterns, catching issues that budgets miss.

# anomaly_detection.tf - AWS Cost Anomaly Detection configuration
resource "aws_ce_anomaly_monitor" "service_monitor" {
  name              = "service-level-monitor"
  monitor_type      = "DIMENSIONAL"
  monitor_dimension = "SERVICE"
}

resource "aws_ce_anomaly_monitor" "member_account_monitor" {
  name         = "member-account-monitor"
  monitor_type = "CUSTOM"  # monitor_specification is only valid on CUSTOM monitors
  monitor_specification = jsonencode({
    # Cost Explorer expressions use the plural key "Dimensions"
    Dimensions = {
      Key          = "LINKED_ACCOUNT"
      MatchOptions = ["EQUALS"]
      Values       = var.member_account_ids
    }
  })
}

# Anomaly subscription with alert threshold
# IMMEDIATE subscriptions deliver through SNS only; email subscribers are
# supported for DAILY and WEEKLY frequencies. To reach finops-alerts@company.com
# immediately, subscribe that address to the SNS topic instead.
resource "aws_ce_anomaly_subscription" "immediate_alerts" {
  name      = "immediate-anomaly-alerts"
  threshold = 100  # Alert when anomaly impact exceeds $100
  frequency = "IMMEDIATE"

  monitor_arn_list = [
    aws_ce_anomaly_monitor.service_monitor.arn,
    aws_ce_anomaly_monitor.member_account_monitor.arn
  ]

  subscriber {
    type    = "SNS"
    address = aws_sns_topic.budget_alerts.arn
  }

  depends_on = [aws_sns_topic_policy.budget_alerts]
}

# Weekly digest subscription
resource "aws_ce_anomaly_subscription" "weekly_digest" {
  name      = "weekly-anomaly-digest"
  threshold = 500  # Only include anomalies > $500 in digest
  frequency = "WEEKLY"

  monitor_arn_list = [
    aws_ce_anomaly_monitor.service_monitor.arn
  ]

  subscriber {
    type    = "EMAIL"
    address = "finops-weekly@company.com"
  }
}
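
To make the contrast between threshold-based and pattern-based alerting concrete, here is a minimal sketch using a z-score against recent daily spend. This is a deliberately simple stand-in, not the model AWS uses; the function name `is_anomalous`, the 3.0 cutoff, and the sample figures are all assumptions for illustration.

```python
from statistics import mean, stdev


def is_anomalous(history, today, z_threshold=3.0):
    """Flag today's spend when it deviates strongly from the recent daily pattern."""
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_threshold


# Hypothetical daily spend (USD) for the past week, followed by a sudden jump.
history = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 103.0]
today = 160.0
monthly_budget = 5000.0

month_to_date = sum(history) + today
print("budget alert fires:", month_to_date > 0.8 * monthly_budget)
print("anomaly detected:", is_anomalous(history, today))
```

In this example the month-to-date total is far below 80% of the budget, so a threshold alert stays silent, while the daily jump stands out clearly against the historical pattern.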

CloudHealth, CloudCheckr, and Native Tools Comparison

| Capability | AWS Native | CloudHealth | CloudCheckr | Vantage |
|---|---|---|---|---|
| Budget alerts | ✅ Full support | ✅ Advanced | ✅ Full support | ✅ Simple setup |
| Anomaly detection | ✅ ML-based | ✅ ML-based | ✅ Rule + ML | ✅ Basic |
| Forecasting | ✅ 12 months | ✅ Custom models | ✅ Trend-based | ✅ 12 months |
| Multi-cloud | ❌ AWS only | ✅ AWS + Azure + GCP | ✅ AWS + Azure + GCP | ✅ AWS + Azure + GCP |
| Right-sizing | ✅ Compute Optimizer | ✅ Full stack | ✅ Full stack | ❌ Limited |
| RI/SP recommendations | ✅ Native | ✅ Advanced | ✅ Full support | ✅ Basic |
| API access | ✅ Cost Explorer API | ✅ REST API | ✅ REST API | ✅ REST API |
| Pricing model | Free | ~1-3% of spend | ~1-3% of spend | Fixed monthly |
| Best for | AWS-only, budget start | Enterprise multi-cloud | MSP environments | Developer-focused |

PagerDuty and Slack Integration

# budget_to_pagerduty.py - Forward budget alerts to PagerDuty for critical thresholds
import json
import os
import requests
from datetime import datetime

PAGERDUTY_ROUTING_KEY = os.environ['PAGERDUTY_ROUTING_KEY']
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def lambda_handler(event, context):
    """AWS Lambda handler for critical budget alerts to PagerDuty."""
    for record in event.get('Records', []):
        sns_message = json.loads(record['Sns']['Message'])

        budget_name = sns_message.get('BudgetName', 'Unknown')
        alert_type = sns_message.get('NotificationType', 'Unknown')
        threshold = sns_message.get('Trigger', {}).get('Threshold', 'N/A')
        actual_amount = sns_message.get('Amount', {}).get('Actual', 'N/A')
        budget_limit = sns_message.get('BudgetLimit', {}).get('Amount', 'N/A')

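        # Threshold may be missing or non-numeric ('N/A'); default to warning in that case
        try:
            severity = "critical" if float(str(threshold)) >= 100 else "warning"
        except ValueError:
            severity = "warning"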

        payload = {
            "routing_key": PAGERDUTY_ROUTING_KEY,
            "event_action": "trigger",
            "dedup_key": f"budget-{budget_name}-{datetime.now().strftime('%Y-%m')}",
            "payload": {
                "summary": f"Budget Alert: {budget_name} has exceeded {threshold}% threshold",
                "severity": severity,
                "source": "AWS Budgets",
                "component": "Cloud Cost",
                "group": "FinOps",
                "class": "budget-overrun",
                "custom_details": {
                    "budget_name": budget_name,
                    "alert_type": alert_type,
                    "threshold_percent": threshold,
                    "actual_spend": actual_amount,
                    "budget_limit": budget_limit,
                    "account_id": sns_message.get('AccountId', 'Unknown'),
                    "region": sns_message.get('Region', 'us-east-1')
                }
            }
        }

        response = requests.post(
            PAGERDUTY_EVENTS_URL,
            json=payload,
            timeout=10
        )
        response.raise_for_status()

        print(f"PagerDuty alert sent: {response.json().get('dedup_key')}")

    return {"statusCode": 200}

# budget_to_slack.py - Format and send budget alerts to Slack
import json
import os
import time
import urllib.request

SLACK_WEBHOOK_URL = os.environ['SLACK_WEBHOOK_URL']

def format_budget_alert(message):
    """Format budget alert message for Slack."""
    budget_name = message.get('BudgetName', 'Unknown')
    alert_type = message.get('NotificationType', 'Unknown')
    threshold = message.get('Trigger', {}).get('Threshold', 'N/A')
    actual = message.get('Amount', 'N/A')
    forecasted = message.get('ForecastedAmount', 'N/A')
    limit_amount = message.get('BudgetLimit', {}).get('Amount', 'N/A')

    # Threshold may be missing or non-numeric ('N/A'); treat that as advisory
    try:
        threshold_value = float(str(threshold))
    except ValueError:
        threshold_value = 0.0

    color = "#ff9900"  # Orange for advisory
    if threshold_value >= 100:
        color = "#ff0000"  # Red for critical
    elif threshold_value >= 90:
        color = "#ffcc00"  # Yellow for warning

    return {
        "attachments": [{
            "color": color,
            "title": f"AWS Budget Alert: {budget_name}",
            "title_link": f"https://console.aws.amazon.com/billing/home#/budgets/details/{budget_name}",
            "fields": [
                {"title": "Alert Type", "value": alert_type, "short": True},
                {"title": "Threshold", "value": f"{threshold}%", "short": True},
                {"title": "Budget Limit", "value": f"${limit_amount}", "short": True},
                {"title": "Actual Spend", "value": f"${actual}", "short": True},
                {"title": "Forecasted", "value": f"${forecasted}", "short": True},
                {"title": "Account", "value": message.get('AccountId', 'Unknown'), "short": True}
            ],
            "footer": "AWS Budgets via KuyaOps FinOps",
            "ts": int(time.time())  # Slack expects epoch seconds, not a date string
        }]
    }

def lambda_handler(event, context):
    """AWS Lambda handler for Slack budget notifications."""
    for record in event.get('Records', []):
        sns_message = json.loads(record['Sns']['Message'])
        slack_message = format_budget_alert(sns_message)

        req = urllib.request.Request(
            SLACK_WEBHOOK_URL,
            data=json.dumps(slack_message).encode('utf-8'),
            headers={'Content-Type': 'application/json'}
        )

        with urllib.request.urlopen(req, timeout=10) as response:
            print(f"Slack notification sent: {response.status}")

    return {"statusCode": 200}

Monthly Cost Review Process

A structured monthly review ensures budgets are meaningful and optimization opportunities are captured.

  1. Week 1 - Data Collection: Run Cost Explorer reports, pull anomaly detection results, generate tag compliance scorecard.
  2. Week 1 - Variance Analysis: Compare actual vs. budget for each team. Identify >10% variances requiring explanation.
  3. Week 2 - Team Reviews: Meet with each engineering team to review their costs, optimization opportunities, and forecast trends.
  4. Week 2 - Optimization Execution: Implement approved right-sizing, schedule shutdowns, purchase RIs/SPs as needed.
  5. Week 3 - Executive Summary: Publish executive dashboard with month-over-month trends, optimization savings, and forecast.
  6. Week 4 - Process Improvement: Update budgets based on trend analysis. Refine anomaly detection thresholds. Update tagging policies.
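
The variance analysis step can be sketched as a short function. The budget figures below come from the `team_budgets` map in the Terraform configuration; the actual-spend numbers and the function name `flag_variances` are hypothetical, chosen only to illustrate the >10% rule.

```python
def flag_variances(budgets, actuals, tolerance_pct=10.0):
    """Return {team: variance %} for teams whose spend deviates beyond the tolerance."""
    flagged = {}
    for team, budget in budgets.items():
        actual = actuals.get(team, 0.0)
        variance_pct = (actual - budget) / budget * 100
        if abs(variance_pct) > tolerance_pct:
            flagged[team] = round(variance_pct, 1)
    return flagged


budgets = {"platform": 50000, "product": 30000, "data": 25000, "security": 15000}
# Hypothetical month-end actuals (USD).
actuals = {"platform": 56000, "product": 29000, "data": 21000, "security": 15500}

print(flag_variances(budgets, actuals))
```

Underspend is flagged as well as overspend, since a large negative variance usually means the budget itself needs revisiting in the Week 4 step.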

Cost Allocation Dashboards

# Example: Grafana dashboard JSON for cost visualization
# This assumes you're exporting CUR data to Athena and querying via Grafana

{
  "dashboard": {
    "title": "FinOps Cost Dashboard",
    "panels": [
      {
        "title": "Monthly Spend by Team",
        "type": "timeseries",
        "targets": [{
          "rawSql": "SELECT date_trunc('month', line_item_usage_start_date) as month, resource_tags_user_owner as team, SUM(line_item_blended_cost) as cost FROM cur WHERE line_item_usage_start_date >= date_add('month', -6, current_date) GROUP BY 1, 2 ORDER BY 1 DESC",
          "refId": "A"
        }],
        "fieldConfig": {
          "defaults": {
            "unit": "currencyUSD",
            "custom": {"axisLabel": "Cost (USD)"}
          }
        }
      },
      {
        "title": "Top 10 Services by Cost",
        "type": "piechart",
        "targets": [{
          "rawSql": "SELECT line_item_product_code as service, SUM(line_item_blended_cost) as cost FROM cur WHERE line_item_usage_start_date >= date_trunc('month', current_date) GROUP BY 1 ORDER BY 2 DESC LIMIT 10",
          "refId": "B"
        }]
      },
      {
        "title": "Daily Cost Trend vs Budget",
        "type": "timeseries",
        "targets": [
          {
            "rawSql": "SELECT date(line_item_usage_start_date) as day, SUM(line_item_blended_cost) as actual FROM cur WHERE line_item_usage_start_date >= date_trunc('month', current_date) GROUP BY 1 ORDER BY 1",
            "refId": "C"
          },
          {
            "rawSql": "SELECT date(line_item_usage_start_date) as day, 150000.0 / day(date_trunc('month', current_date) + interval '1' month - interval '1' day) as daily_budget FROM cur WHERE line_item_usage_start_date >= date_trunc('month', current_date) GROUP BY 1 ORDER BY 1",
            "refId": "D"
          }
        ]
      },
      {
        "title": "Tag Compliance Rate",
        "type": "gauge",
        "targets": [{
          "rawSql": "SELECT (COUNT(DISTINCT CASE WHEN resource_tags_user_costcenter IS NOT NULL THEN line_item_resource_id END) * 100.0 / COUNT(DISTINCT line_item_resource_id)) as compliance_rate FROM cur WHERE line_item_usage_start_date >= date_trunc('month', current_date)",
          "refId": "E"
        }],
        "fieldConfig": {
          "defaults": {
            "min": 0,
            "max": 100,
            "thresholds": {
              "steps": [
                {"color": "red", "value": 0},
                {"color": "yellow", "value": 85},
                {"color": "green", "value": 95}
              ]
            },
            "unit": "percent"
          }
        }
      }
    ]
  }
}
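
The "Daily Cost Trend vs Budget" panel spreads the 150000 monthly budget evenly over the days of the current month. The same arithmetic is sketched below for checking the SQL by hand; the helper names are assumptions for this example.

```python
import calendar
from datetime import date


def daily_budget(monthly_budget, day):
    """Even daily share of the monthly budget for the month containing `day`."""
    days_in_month = calendar.monthrange(day.year, day.month)[1]
    return monthly_budget / days_in_month


def expected_spend_to_date(monthly_budget, day):
    """Budgeted cumulative spend through `day`, assuming a linear burn rate."""
    return daily_budget(monthly_budget, day) * day.day


sample_day = date(2024, 6, 15)  # June has 30 days
print(daily_budget(150000, sample_day))
print(expected_spend_to_date(150000, sample_day))
```

A linear burn rate is a simplification; workloads with strong weekly or month-end cycles would need a weighted baseline instead.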

Automated Cost Reporting with Lambda

# weekly_cost_report.py - Automated weekly cost report via Lambda
import boto3
import json
import os
from datetime import datetime, timedelta

def get_weekly_spend():
    """Get cost and usage data for the current week."""
    ce = boto3.client('ce')

    end = datetime.now().strftime('%Y-%m-%d')
    start = (datetime.now() - timedelta(days=7)).strftime('%Y-%m-%d')

    response = ce.get_cost_and_usage(
        TimePeriod={'Start': start, 'End': end},
        Granularity='DAILY',
        Metrics=['BlendedCost', 'UsageQuantity'],
        GroupBy=[
            {'Type': 'DIMENSION', 'Key': 'SERVICE'}
        ]
    )

    return response['ResultsByTime']

def get_mom_change():
    """Calculate month-over-month cost change, normalized for days elapsed."""
    ce = boto3.client('ce')

    # Cost Explorer treats End as exclusive and requires Start < End,
    # so both periods are expressed as [start, end) over complete days.
    today = datetime.now().date()
    current_start = today.replace(day=1)
    prev_end = current_start  # exclusive end covers the whole previous month
    prev_start = (current_start - timedelta(days=1)).replace(day=1)

    days_in_prev = (prev_end - prev_start).days
    days_elapsed = (today - current_start).days  # completed days this month

    previous = ce.get_cost_and_usage(
        TimePeriod={'Start': prev_start.isoformat(), 'End': prev_end.isoformat()},
        Granularity='MONTHLY',
        Metrics=['BlendedCost']
    )
    prev_cost = float(previous['ResultsByTime'][0]['Total']['BlendedCost']['Amount'])

    if days_elapsed == 0:
        # First day of the month: no completed days to compare yet
        return 0.0, 0.0, prev_cost

    current = ce.get_cost_and_usage(
        TimePeriod={'Start': current_start.isoformat(), 'End': today.isoformat()},
        Granularity='MONTHLY',
        Metrics=['BlendedCost']
    )
    current_cost = float(current['ResultsByTime'][0]['Total']['BlendedCost']['Amount'])

    # Project month-to-date spend onto the length of the previous month
    normalized_current = (current_cost / days_elapsed) * days_in_prev
    # Guard against division by zero when the previous month had no spend
    change_pct = ((normalized_current - prev_cost) / prev_cost) * 100 if prev_cost else 0.0

    return change_pct, current_cost, prev_cost

def generate_report():
    """Generate formatted weekly cost report."""
    weekly = get_weekly_spend()
    mom_change, current_spend, prev_spend = get_mom_change()

    total_weekly = 0.0
    top_services = []

    for day in weekly:
        for group in day.get('Groups', []):
            cost = float(group['Metrics']['BlendedCost']['Amount'])
            if cost > 0:
                total_weekly += cost
                service = group['Keys'][0]
                top_services.append((service, cost))

    # Aggregate by service
    service_totals = {}
    for svc, cost in top_services:
        service_totals[svc] = service_totals.get(svc, 0) + cost

    top_services = sorted(service_totals.items(), key=lambda x: x[1], reverse=True)[:5]

    report = f"""
*📊 Weekly Cloud Cost Report*
*Week of:* {datetime.now().strftime('%Y-%m-%d')}

*7-Day Total:* ${total_weekly:,.2f}
*MTD Spend:* ${current_spend:,.2f}
*MoM Trend:* {mom_change:+.1f}%

*Top 5 Services:*
"""
    for svc, cost in top_services:
        pct = (cost / total_weekly) * 100
        report += f"• `{svc}`: ${cost:,.2f} ({pct:.1f}%)\n"

    report += f"\n_Budget Status: {'✅ On track' if mom_change < 10 else '⚠️ Review needed'}_"

    return report

def lambda_handler(event, context):
    """Lambda entry point for scheduled weekly reports."""
    report = generate_report()

    # Send to Slack
    slack_webhook = os.environ['SLACK_WEBHOOK_URL']
    import urllib.request
    req = urllib.request.Request(
        slack_webhook,
        data=json.dumps({"text": report}).encode(),
        headers={'Content-Type': 'application/json'}
    )
    urllib.request.urlopen(req, timeout=10)

    # Send to S3 for archival
    s3 = boto3.client('s3')
    s3.put_object(
        Bucket=os.environ['REPORTS_BUCKET'],
        Key=f"cost-reports/weekly/{datetime.now().strftime('%Y-%m-%d')}.txt",
        Body=report.encode()
    )

    return {"statusCode": 200, "body": "Report generated and sent."}
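
The day-normalization used for the month-over-month figure is easier to verify in isolation. The sketch below restates it as a pure function with no AWS calls; the function name and the sample figures are assumptions for illustration.

```python
def normalized_mom_change(mtd_cost, days_elapsed, prev_month_cost, days_in_prev_month):
    """Percent change after projecting month-to-date spend onto the previous month's length."""
    if days_elapsed <= 0:
        raise ValueError("days_elapsed must be positive")
    if prev_month_cost <= 0:
        raise ValueError("prev_month_cost must be positive")
    projected = mtd_cost / days_elapsed * days_in_prev_month
    return (projected - prev_month_cost) / prev_month_cost * 100


# Hypothetical: $60,000 spent in the first 10 days, against $155,000 over a 31-day month.
change = normalized_mom_change(60000.0, 10, 155000.0, 31)
print(f"{change:+.1f}%")
```

Without normalization, the raw comparison of 60,000 against 155,000 would read as a large decrease even though the daily run rate is higher than last month's.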

FinOps Review Meeting Agenda

## FinOps Monthly Review Meeting - Agenda Template

*Date:* ___________  *Attendees:* ___________

### 1. Executive Summary (5 min)
- Total cloud spend vs. budget
- Month-over-month change (%)
- Year-to-date spend vs. annual budget

### 2. Cost Allocation by Team (10 min)
- Each team presents their spend, variance, and forecast
- Tag compliance score for each team
- Action items from previous month

### 3. Optimization Opportunities (15 min)
- Right-sizing recommendations (top 10 by savings)
- Idle/orphaned resources identified
- RI/SP utilization and coverage
- Spot instance adoption rate

### 4. Anomalies and Incidents (10 min)
- Cost anomalies detected and investigated
- Root cause for any budget overruns
- Preventive measures implemented

### 5. Forecasting and Planning (10 min)
- Next month forecast by team
- Upcoming infrastructure changes (launches, migrations)
- RI/SP purchase recommendations

### 6. Process Improvements (10 min)
- Tagging policy updates
- Automation opportunities
- Tool evaluation or changes
- Training needs

### Action Items
| # | Action | Owner | Due Date | Status |
|---|--------|-------|----------|--------|
| 1 | | | | |
| 2 | | | | |
| 3 | | | | |

*Next Meeting:* ___________

Emergency Cost Response Runbook

This runbook addresses sudden, unexpected cost spikes that threaten to exceed monthly budgets.

  1. T+0 - Detection: Anomaly detection alert or budget threshold breach triggers PagerDuty incident. On-call engineer acknowledges within 15 minutes.
  2. T+15 - Initial Assessment:
    • Check AWS Cost Explorer for the top cost-increasing services in the last 24 hours.
    • Identify the specific resource(s) driving the increase (instance ID, Lambda function, etc.).
    • Determine if the increase is expected (launch event, traffic spike) or anomalous.
  3. T+30 - Containment (if anomalous):
    • For EC2: Scale down Auto Scaling group or stop non-critical instances.
    • For Lambda: Reduce concurrency limit or disable function (if non-critical).
    • For S3: Enable lifecycle policy to move data to cheaper tier immediately.
    • For data transfer: Identify and block unexpected egress if malicious.
    #!/bin/bash
    # emergency_containment.sh - Quick containment for cost spikes
    # Emergency cost containment script
    
    SERVICE=$1  # e.g., ec2, lambda, rds
    ACTION=$2   # e.g., stop, downscale, limit
    
    case "$SERVICE-$ACTION" in
      ec2-stop)
        # Stop all running dev and staging instances immediately
        aws ec2 describe-instances \
          --filters "Name=tag:Environment,Values=dev,staging" \
                    "Name=instance-state-name,Values=running" \
          --query 'Reservations[].Instances[].InstanceId' \
          --output text | xargs -r aws ec2 stop-instances --instance-ids
        ;;
    
      lambda-limit)
        FUNCTION_NAME=$3
        aws lambda put-function-concurrency \
          --function-name "$FUNCTION_NAME" \
          --reserved-concurrent-executions 0
        ;;
    
      asg-downscale)
        ASG_NAME=$3
        aws autoscaling update-auto-scaling-group \
          --auto-scaling-group-name "$ASG_NAME" \
          --desired-capacity 1 \
          --min-size 0
        ;;
    
      *)
        echo "Unknown service-action combination: $SERVICE-$ACTION"
        exit 1
        ;;
    esac
    
  4. T+1h - Communication: Post incident summary to #incidents Slack channel. Notify affected teams. Begin root cause analysis.
  5. T+4h - Root Cause: Complete root cause analysis. Common causes:
    • Infinite loop in Lambda or ECS task
    • Runaway autoscaling due to incorrect metric or DDoS
    • Data transfer loop (cross-AZ, inter-region)
    • Crypto mining via compromised credentials
    • Accidental provisioning of large instances (e.g., xlarge instead of large)
  6. T+24h - Remediation: Implement permanent fix. Update monitoring thresholds. Document lessons learned.
  7. T+1w - Retrospective: Conduct blameless post-mortem. Update runbook with any new containment procedures identified.
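
The initial assessment step in the runbook (finding the top cost-increasing services over the last 24 hours) reduces to a per-service delta between two days. A minimal sketch follows, assuming the per-service totals have already been pulled from Cost Explorer; the function name and the sample figures are hypothetical.

```python
def top_cost_increases(previous, current, limit=3):
    """Rank services by day-over-day cost increase, largest first."""
    services = set(previous) | set(current)
    deltas = {svc: current.get(svc, 0.0) - previous.get(svc, 0.0) for svc in services}
    increases = [(svc, delta) for svc, delta in deltas.items() if delta > 0]
    return sorted(increases, key=lambda item: item[1], reverse=True)[:limit]


# Hypothetical per-service daily cost (USD) for yesterday and today.
yesterday = {"Amazon EC2": 1200.0, "AWS Lambda": 80.0, "Amazon S3": 300.0}
today = {"Amazon EC2": 1250.0, "AWS Lambda": 940.0, "Amazon S3": 290.0, "Amazon RDS": 100.0}

for service, delta in top_cost_increases(yesterday, today, limit=2):
    print(f"{service}: +${delta:,.2f}")
```

Services that appear only in the current day are treated as starting from zero, so newly launched resources surface in the ranking.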
Cost Spike Emergency: If you detect a cost spike exceeding $10K/day above baseline, the first priority is containment, not root cause analysis. Stop the bleeding first, then investigate. A 4-hour delay in stopping a runaway process can mean $50K+ in unrecoverable spend. Pre-script your containment actions and keep them ready.
Communication Cadence: Budget alerts sent only to finance are ignored by engineering. Alerts sent only to engineering are invisible to finance. The FinOps function bridges this gap by ensuring alerts reach the right people with the right context. Every alert should include: what happened, how much it costs, who owns the resource, and what action to take.
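
One way to enforce the four required pieces of alert context is to make them mandatory fields on the alert object itself. The sketch below is an illustration under that assumption; the `CostAlert` class and its field names are hypothetical and not part of any tool mentioned in this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostAlert:
    what_happened: str
    cost_impact_usd: float
    owner: str
    recommended_action: str

    def __post_init__(self):
        # Reject alerts that omit any of the required context fields.
        for name in ("what_happened", "owner", "recommended_action"):
            if not getattr(self, name).strip():
                raise ValueError(f"alert is missing required context: {name}")

    def render(self) -> str:
        """Format the alert as plain text suitable for Slack or email."""
        return (
            f"What happened: {self.what_happened}\n"
            f"Cost impact: ${self.cost_impact_usd:,.2f}\n"
            f"Owner: {self.owner}\n"
            f"Action: {self.recommended_action}"
        )


alert = CostAlert(
    what_happened="NAT gateway egress tripled overnight",
    cost_impact_usd=4200.0,
    owner="platform",
    recommended_action="Review VPC flow logs and block unexpected destinations",
)
print(alert.render())
```

Validating at construction time means an incomplete alert fails in the sending code, where it can be fixed, rather than arriving in a channel without an owner or next step.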