Budgets & Alerts
Proactive budget monitoring prevents cost overruns and enables teams to make informed spending decisions before costs escalate. This guide covers AWS Budgets, anomaly detection, third-party tooling, and the operational processes that make cost alerting actionable.
AWS Budgets Setup and Configuration
AWS Budgets is the native AWS cost monitoring service. It supports cost, usage, Reserved Instance, and Savings Plans budgets (the RI and Savings Plans types each track utilization and coverage separately) and integrates with SNS for notifications. Create all budgets programmatically via CloudFormation or Terraform to ensure consistency across accounts.
Budget Types
| Budget Type | Monitors | Use Case | Alert Triggers |
|---|---|---|---|
| Cost Budget | Actual or forecasted spend against a fixed amount | Monthly/quarterly spend caps by team or account | Actual > 80%, 100%; Forecasted > 100% |
| Usage Budget | Service usage quantities (e.g., EC2 instance hours) | Track compute consumption for capacity planning | Usage exceeds threshold |
| Reserved Instance Budget | RI utilization and coverage | Ensure purchased RIs are fully utilized | Utilization < 80%, Coverage < 80% |
| Savings Plans Budget | Savings Plans utilization and coverage | Monitor SP commitment consumption | Utilization < 80%, Coverage < 80% |
Alert Thresholds
Best practice is a three-tier alerting system:
| Tier | Threshold | Channel | Response |
|---|---|---|---|
| 🟡 Advisory | Forecasted > 80% of budget | Slack #finops-alerts | Inform team; no action required |
| 🟠 Warning | Actual > 90% or Forecasted > 100% | Slack + Email to team lead | Investigate within 48 hours |
| 🔴 Critical | Actual > 100% or Anomaly detected | Slack + Email + PagerDuty | Immediate investigation required |
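The tier table reads as a precedence rule: check the most severe condition first and stop at the first match. A minimal Python sketch of that routing, assuming the actual and forecasted percentages are computed upstream; `classify_alert` and `CHANNELS` are hypothetical names used only for illustration, not part of AWS Budgets:

```python
from typing import Optional

# Channels per tier, taken from the table above.
CHANNELS = {
    "advisory": ["slack"],
    "warning": ["slack", "email"],
    "critical": ["slack", "email", "pagerduty"],
}


def classify_alert(actual_pct: float, forecast_pct: float,
                   anomaly: bool = False) -> Optional[str]:
    """Return the most severe tier that applies, or None if no tier fires."""
    if actual_pct > 100 or anomaly:
        return "critical"
    if actual_pct > 90 or forecast_pct > 100:
        return "warning"
    if forecast_pct > 80:
        return "advisory"
    return None


if __name__ == "__main__":
    tier = classify_alert(actual_pct=72.0, forecast_pct=104.0)
    print(tier, CHANNELS.get(tier, []))
```

Evaluating tiers from most to least severe means a single spend reading never produces two different responses.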
Complete AWS Budgets Terraform Configuration
# budgets.tf: Comprehensive AWS Budgets configuration
locals {
budget_notifications = {
advisory = {
comparison_operator = "GREATER_THAN"
threshold = 80
threshold_type = "PERCENTAGE"
notification_type = "FORECASTED"
subscriber_sns_topic = aws_sns_topic.budget_alerts.arn
}
warning = {
comparison_operator = "GREATER_THAN"
threshold = 90
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_sns_topic = aws_sns_topic.budget_alerts.arn
}
critical = {
comparison_operator = "GREATER_THAN"
threshold = 100
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_sns_topic = aws_sns_topic.budget_critical.arn
}
}
team_budgets = {
platform = { amount = 50000, costcenter = "CC-10001" }
product = { amount = 30000, costcenter = "CC-10002" }
data = { amount = 25000, costcenter = "CC-10003" }
security = { amount = 15000, costcenter = "CC-10004" }
}
}
# SNS Topics for budget notifications
resource "aws_sns_topic" "budget_alerts" {
name = "budget-alerts"
kms_master_key_id = aws_kms_key.sns.arn
tags = local.mandatory_tags
}
resource "aws_sns_topic" "budget_critical" {
name = "budget-critical-alerts"
kms_master_key_id = aws_kms_key.sns.arn
tags = local.mandatory_tags
}
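# SNS topic policy to allow Budgets to publish.
# Note: aws_sns_topic.budget_critical needs an equivalent policy before Budgets can publish to it.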
resource "aws_sns_topic_policy" "budget_alerts" {
arn = aws_sns_topic.budget_alerts.arn
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "AllowBudgetsToPublish"
Effect = "Allow"
Principal = {
Service = "budgets.amazonaws.com"
}
Action = "SNS:Publish"
Resource = aws_sns_topic.budget_alerts.arn
}
]
})
}
# Overall account budget
resource "aws_budgets_budget" "monthly_total" {
name = "monthly-total-spend"
budget_type = "COST"
limit_amount = "150000"
limit_unit = "USD"
time_unit = "MONTHLY"
time_period_start = "2024-01-01_00:00"
cost_filter {
name = "TagKeyValue"
values = [
"user:CostCenter$CC-10001",
"user:CostCenter$CC-10002",
"user:CostCenter$CC-10003",
"user:CostCenter$CC-10004",
]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 80
threshold_type = "PERCENTAGE"
notification_type = "FORECASTED"
subscriber_sns_topic_arns = [aws_sns_topic.budget_alerts.arn]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 90
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_sns_topic_arns = [aws_sns_topic.budget_alerts.arn]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 100
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_sns_topic_arns = [aws_sns_topic.budget_critical.arn]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 100
threshold_type = "PERCENTAGE"
notification_type = "FORECASTED"
subscriber_sns_topic_arns = [aws_sns_topic.budget_critical.arn]
}
}
# Per-team budgets with tag-based filtering
resource "aws_budgets_budget" "team" {
for_each = local.team_budgets
name = "team-${each.key}-monthly"
budget_type = "COST"
limit_amount = each.value.amount
limit_unit = "USD"
time_unit = "MONTHLY"
cost_filter {
name = "TagKeyValue"
values = ["user:CostCenter$${each.value.costcenter}"]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 80
threshold_type = "PERCENTAGE"
notification_type = "FORECASTED"
subscriber_sns_topic_arns = [aws_sns_topic.budget_alerts.arn]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 100
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_sns_topic_arns = [aws_sns_topic.budget_critical.arn]
}
}
# EC2-specific usage budget (track compute hours)
resource "aws_budgets_budget" "ec2_usage" {
name = "ec2-compute-hours"
budget_type = "USAGE"
limit_amount = "50000" # instance-hours per month
limit_unit = "Hours"
time_unit = "MONTHLY"
cost_filter {
name = "Service"
values = ["Amazon Elastic Compute Cloud - Compute"]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 90
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_sns_topic_arns = [aws_sns_topic.budget_alerts.arn]
}
}
# Savings Plans utilization budget
resource "aws_budgets_budget" "sp_utilization" {
name = "savings-plans-utilization"
budget_type = "SAVINGS_PLANS_UTILIZATION"
limit_amount = "100"
limit_unit = "PERCENTAGE"
time_unit = "MONTHLY"
cost_types {
include_subscription = true
use_blended = false
}
notification {
comparison_operator = "LESS_THAN"
threshold = 80
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_sns_topic_arns = [aws_sns_topic.budget_alerts.arn]
}
}
# Reserved Instance coverage budget
resource "aws_budgets_budget" "ri_coverage" {
name = "ri-coverage"
budget_type = "RI_COVERAGE"
limit_amount = "100"
limit_unit = "PERCENTAGE"
time_unit = "MONTHLY"
cost_types {
include_subscription = true
use_blended = false
}
notification {
comparison_operator = "LESS_THAN"
threshold = 80
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_sns_topic_arns = [aws_sns_topic.budget_alerts.arn]
}
}
# KMS key for SNS encryption
resource "aws_kms_key" "sns" {
description = "KMS key for budget alert SNS topics"
deletion_window_in_days = 7
enable_key_rotation = true
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "Enable IAM User Permissions"
Effect = "Allow"
Principal = {
AWS = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:root"
}
Action = "kms:*"
Resource = "*"
},
{
Sid = "Allow Budgets Service"
Effect = "Allow"
Principal = {
Service = "budgets.amazonaws.com"
}
Action = [
"kms:GenerateDataKey*",
"kms:Decrypt"
]
Resource = "*"
}
]
})
tags = local.mandatory_tags
}
data "aws_caller_identity" "current" {}
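Two details in the configuration above are easy to get wrong by hand: the `TagKeyValue` filter strings (`user:<TagKey>$<value>`) and whether the per-team limits stay within the account-level limit. A small Python sketch of both checks, reusing the amounts from `local.team_budgets`; the helper names are hypothetical:

```python
# Values copied from the Terraform locals above.
TEAM_BUDGETS = {
    "platform": {"amount": 50000, "costcenter": "CC-10001"},
    "product": {"amount": 30000, "costcenter": "CC-10002"},
    "data": {"amount": 25000, "costcenter": "CC-10003"},
    "security": {"amount": 15000, "costcenter": "CC-10004"},
}
TOTAL_LIMIT = 150000  # limit_amount of the monthly-total-spend budget


def tag_filter_values(budgets, tag_key="CostCenter"):
    """Build TagKeyValue filter strings in the 'user:<key>$<value>' form."""
    return [f"user:{tag_key}${b['costcenter']}" for b in budgets.values()]


def unallocated_headroom(budgets, total_limit):
    """Return how much of the total limit is not assigned to any team."""
    return total_limit - sum(b["amount"] for b in budgets.values())


if __name__ == "__main__":
    print(tag_filter_values(TEAM_BUDGETS))
    print(unallocated_headroom(TEAM_BUDGETS, TOTAL_LIMIT))
```

With the figures above, the team budgets sum to 120,000 USD, leaving 30,000 USD of the account-level limit unallocated. A negative result would mean the team budgets over-commit the total.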
AWS Budgets CloudFormation Template
# budgets.yaml: CloudFormation equivalent for organizations using CloudFormation
AWSTemplateFormatVersion: '2010-09-09'
Description: 'AWS Budgets configuration with multi-tier alerting'
Parameters:
  MonthlyBudgetAmount:
    Type: Number
    Description: 'Monthly budget amount in USD'
    Default: 150000
  AlertEmail:
    Type: String
    Description: 'Email address for budget alerts'
    Default: 'finops-alerts@company.com'
  SlackWebhookSecretArn:
    Type: String
    Description: 'ARN of Secrets Manager secret containing Slack webhook URL'
Resources:
  # SNS Topic for budget notifications
  BudgetAlertTopic:
    Type: AWS::SNS::Topic
    Properties:
      TopicName: budget-alerts
      KmsMasterKeyId: !Ref BudgetAlertKey
  BudgetAlertKey:
    Type: AWS::KMS::Key
    Properties:
      Description: 'KMS key for budget alert SNS topics'
      EnableKeyRotation: true
      KeyPolicy:
        Version: '2012-10-17'
        Statement:
          - Sid: Enable IAM User Permissions
            Effect: Allow
            Principal:
              AWS: !Sub 'arn:aws:iam::${AWS::AccountId}:root'
            Action: 'kms:*'
            Resource: '*'
          - Sid: Allow Budgets Service
            Effect: Allow
            Principal:
              Service: budgets.amazonaws.com
            Action:
              - 'kms:GenerateDataKey*'
              - 'kms:Decrypt'
            Resource: '*'
  # SNS Topic Policy
  BudgetAlertTopicPolicy:
    Type: AWS::SNS::TopicPolicy
    Properties:
      Topics:
        - !Ref BudgetAlertTopic
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Sid: AllowBudgetsToPublish
            Effect: Allow
            Principal:
              Service: budgets.amazonaws.com
            Action: SNS:Publish
            Resource: !Ref BudgetAlertTopic
  # Lambda for Slack notification forwarding
  BudgetNotificationFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: budget-to-slack
      Runtime: python3.11
      Handler: index.lambda_handler
      Timeout: 30
      Role: !GetAtt BudgetLambdaRole.Arn
      Environment:
        Variables:
          SLACK_SECRET_ARN: !Ref SlackWebhookSecretArn
      Code:
        ZipFile: |
          import json
          import os
          import time
          import urllib.request
          import boto3
          secrets = boto3.client('secretsmanager')
          def post(webhook_url, body):
              req = urllib.request.Request(
                  webhook_url,
                  data=json.dumps(body).encode(),
                  headers={'Content-Type': 'application/json'}
              )
              urllib.request.urlopen(req, timeout=10)
          def lambda_handler(event, context):
              secret = secrets.get_secret_value(SecretId=os.environ['SLACK_SECRET_ARN'])
              webhook_url = json.loads(secret['SecretString'])['webhook_url']
              for record in event.get('Records', []):
                  raw = record['Sns']['Message']
                  try:
                      message = json.loads(raw)
                  except ValueError:
                      message = None
                  if not isinstance(message, dict):
                      # Budgets notifications may arrive as plain text; forward them unparsed
                      post(webhook_url, {"text": raw})
                      continue
                  budget_name = message.get('BudgetName', 'Unknown')
                  alert_type = message.get('NotificationType', 'Unknown')
                  threshold = (message.get('Trigger') or {}).get('Threshold', 'N/A')
                  amount = message.get('Amount', 'N/A')
                  forecasted = message.get('ForecastedAmount', 'N/A')
                  try:
                      below_limit = float(str(threshold).replace('$', '')) < 100
                  except ValueError:
                      below_limit = True  # unknown threshold: fall back to warning colour
                  severity = "warning" if below_limit else "danger"
                  post(webhook_url, {
                      "attachments": [{
                          "color": severity,
                          "title": f"AWS Budget Alert: {budget_name}",
                          "fields": [
                              {"title": "Alert Type", "value": alert_type, "short": True},
                              {"title": "Threshold", "value": f"{threshold}%", "short": True},
                              {"title": "Budget Amount", "value": f"${amount}", "short": True},
                              {"title": "Forecasted", "value": f"${forecasted}", "short": True},
                          ],
                          "footer": "AWS Budgets",
                          "ts": int(time.time())  # Slack expects a Unix timestamp
                      }]
                  })
              return {"statusCode": 200}
  BudgetLambdaRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
      Policies:
        - PolicyName: SecretsManagerAccess
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action: secretsmanager:GetSecretValue
                Resource: !Ref SlackWebhookSecretArn
  # SNS to Lambda subscription
  BudgetAlertSubscription:
    Type: AWS::SNS::Subscription
    Properties:
      Protocol: lambda
      TopicArn: !Ref BudgetAlertTopic
      Endpoint: !GetAtt BudgetNotificationFunction.Arn
  # Lambda permission for SNS
  BudgetLambdaPermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref BudgetNotificationFunction
      Action: lambda:InvokeFunction
      Principal: sns.amazonaws.com
      SourceArn: !Ref BudgetAlertTopic
  # Monthly cost budget
  MonthlyBudget:
    Type: AWS::Budgets::Budget
    Properties:
      Budget:
        BudgetName: monthly-total-budget
        BudgetLimit:
          Amount: !Ref MonthlyBudgetAmount
          Unit: USD
        TimeUnit: MONTHLY
        BudgetType: COST
        CostTypes:
          IncludeTax: true
          IncludeSubscription: true
          UseBlended: false
      NotificationsWithSubscribers:
        - Notification:
            NotificationType: FORECASTED
            ComparisonOperator: GREATER_THAN
            Threshold: 80
          Subscribers:
            - SubscriptionType: SNS
              Address: !Ref BudgetAlertTopic
        - Notification:
            NotificationType: ACTUAL
            ComparisonOperator: GREATER_THAN
            Threshold: 90
          Subscribers:
            - SubscriptionType: SNS
              Address: !Ref BudgetAlertTopic
        - Notification:
            NotificationType: ACTUAL
            ComparisonOperator: GREATER_THAN
            Threshold: 100
          Subscribers:
            - SubscriptionType: SNS
              Address: !Ref BudgetAlertTopic
            - SubscriptionType: EMAIL
              Address: !Ref AlertEmail
Outputs:
  BudgetAlertTopicArn:
    Description: 'SNS topic for budget alerts'
    Value: !Ref BudgetAlertTopic
    Export:
      Name: BudgetAlertTopicArn
AWS Cost Anomaly Detection
Cost Anomaly Detection uses ML to identify unusual spending patterns. Budget alerts fire only when spend crosses a fixed threshold; anomaly detection flags deviations from historical patterns, so it can catch a spike that is still well under budget.
# anomaly_detection.tf: AWS Cost Anomaly Detection configuration
resource "aws_ce_anomaly_monitor" "service_monitor" {
name = "service-level-monitor"
monitor_type = "DIMENSIONAL"
monitor_dimension = "SERVICE"
}
resource "aws_ce_anomaly_monitor" "member_account_monitor" {
  name         = "member-account-monitor"
  monitor_type = "CUSTOM" # monitor_specification is only valid for CUSTOM monitors
  monitor_specification = jsonencode({
    Dimensions = {
      Key          = "LINKED_ACCOUNT"
      MatchOptions = ["EQUALS"]
      Values       = var.member_account_ids
    }
  })
}
# Anomaly subscription with alert threshold
# The SNS topic policy (and the KMS key policy for encrypted topics) must also
# allow costalerts.amazonaws.com to publish, in addition to budgets.amazonaws.com.
resource "aws_ce_anomaly_subscription" "immediate_alerts" {
  name      = "immediate-anomaly-alerts"
  threshold = 100 # Alert when anomaly impact exceeds $100
  frequency = "IMMEDIATE"
  monitor_arn_list = [
    aws_ce_anomaly_monitor.service_monitor.arn,
    aws_ce_anomaly_monitor.member_account_monitor.arn
  ]
  # IMMEDIATE subscriptions deliver through SNS only; e-mail subscribers
  # belong on DAILY or WEEKLY subscriptions such as the digest below.
  subscriber {
    type    = "SNS"
    address = aws_sns_topic.budget_alerts.arn
  }
  depends_on = [aws_sns_topic_policy.budget_alerts]
}
# Weekly digest subscription
resource "aws_ce_anomaly_subscription" "weekly_digest" {
name = "weekly-anomaly-digest"
threshold = 500 # Only include anomalies > $500 in digest
frequency = "WEEKLY"
monitor_arn_list = [
aws_ce_anomaly_monitor.service_monitor.arn
]
subscriber {
type = "EMAIL"
address = "finops-weekly@company.com"
}
}
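To make the difference between threshold alerts and anomaly detection concrete, the sketch below flags a day whose spend deviates sharply from recent history even though a monthly budget would not yet fire. It is a deliberately simple z-score illustration, not the model AWS Cost Anomaly Detection uses; `is_anomalous` and the sample figures are hypothetical:

```python
from statistics import mean, stdev


def is_anomalous(history, today, z_threshold=3.0):
    """Flag `today` if it is more than z_threshold standard deviations from the mean."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_threshold


if __name__ == "__main__":
    daily_spend = [100, 102, 98, 101, 99, 100, 103]  # USD per day, illustrative
    print(is_anomalous(daily_spend, 160))  # a sudden jump
    print(is_anomalous(daily_spend, 104))  # normal day-to-day variation
```

A jump from roughly 100 USD to 160 USD per day is flagged immediately. A threshold alert on a 5,000 USD monthly budget would stay silent until much later in the month.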
CloudHealth, CloudCheckr, and Native Tools Comparison
| Capability | AWS Native | CloudHealth | CloudCheckr | Vantage |
|---|---|---|---|---|
| Budget alerts | Full support | Advanced | Full support | Simple setup |
| Anomaly detection | ML-based | ML-based | Rule + ML | Basic |
| Forecasting | 12 months | Custom models | Trend-based | 12 months |
| Multi-cloud | AWS only | AWS + Azure + GCP | AWS + Azure + GCP | AWS + Azure + GCP |
| Right-sizing | Compute Optimizer | Full stack | Full stack | Limited |
| RI/SP recommendations | Native | Advanced | Full support | Basic |
| API access | Cost Explorer API | REST API | REST API | REST API |
| Pricing model | Free | ~1-3% of spend | ~1-3% of spend | Fixed monthly |
| Best for | AWS-only, budget start | Enterprise multi-cloud | MSP environments | Developer-focused |
PagerDuty and Slack Integration
# budget_to_pagerduty.py: Forward budget alerts to PagerDuty for critical thresholds
import json
import os
from datetime import datetime, timezone

import requests  # Not bundled with the Lambda Python runtime; package it or use a layer

PAGERDUTY_ROUTING_KEY = os.environ['PAGERDUTY_ROUTING_KEY']
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def parse_message(raw):
    """Return the SNS message as a dict; Budgets may deliver plain text instead of JSON."""
    try:
        parsed = json.loads(raw)
    except ValueError:
        return {'RawMessage': raw}
    return parsed if isinstance(parsed, dict) else {'RawMessage': raw}


def as_dict(value):
    """Guard against fields that arrive as scalars rather than nested objects."""
    return value if isinstance(value, dict) else {}


def lambda_handler(event, context):
    """AWS Lambda handler for critical budget alerts to PagerDuty."""
    for record in event.get('Records', []):
        sns_message = parse_message(record['Sns']['Message'])
        budget_name = sns_message.get('BudgetName', 'Unknown')
        alert_type = sns_message.get('NotificationType', 'Unknown')
        threshold = as_dict(sns_message.get('Trigger')).get('Threshold', 'N/A')
        actual_amount = as_dict(sns_message.get('Amount')).get('Actual', 'N/A')
        budget_limit = as_dict(sns_message.get('BudgetLimit')).get('Amount', 'N/A')
        try:
            severity = "critical" if float(threshold) >= 100 else "warning"
        except (TypeError, ValueError):
            # Threshold missing or unparseable: page as critical rather than drop the alert
            severity = "critical"
        month = datetime.now(timezone.utc).strftime('%Y-%m')
        payload = {
            "routing_key": PAGERDUTY_ROUTING_KEY,
            "event_action": "trigger",
            # One incident per budget per month; repeat alerts are deduplicated
            "dedup_key": f"budget-{budget_name}-{month}",
            "payload": {
                "summary": f"Budget Alert: {budget_name} has exceeded {threshold}% threshold",
                "severity": severity,
                "source": "AWS Budgets",
                "component": "Cloud Cost",
                "group": "FinOps",
                "class": "budget-overrun",
                "custom_details": {
                    "budget_name": budget_name,
                    "alert_type": alert_type,
                    "threshold_percent": threshold,
                    "actual_spend": actual_amount,
                    "budget_limit": budget_limit,
                    "account_id": sns_message.get('AccountId', 'Unknown'),
                    "region": sns_message.get('Region', 'us-east-1')
                }
            }
        }
        response = requests.post(PAGERDUTY_EVENTS_URL, json=payload, timeout=10)
        response.raise_for_status()
        print(f"PagerDuty alert sent: {response.json().get('dedup_key')}")
    return {"statusCode": 200}
# budget_to_slack.py: Format and send budget alerts to Slack
import json
import os
import time
import urllib.request

SLACK_WEBHOOK_URL = os.environ['SLACK_WEBHOOK_URL']


def format_budget_alert(message):
    """Format budget alert message for Slack."""
    budget_name = message.get('BudgetName', 'Unknown')
    alert_type = message.get('NotificationType', 'Unknown')
    threshold = (message.get('Trigger') or {}).get('Threshold', 'N/A')
    actual = message.get('Amount', 'N/A')
    forecasted = message.get('ForecastedAmount', 'N/A')
    limit_amount = (message.get('BudgetLimit') or {}).get('Amount', 'N/A')
    try:
        threshold_value = float(threshold)
    except (TypeError, ValueError):
        threshold_value = 0.0  # unknown threshold: treat as advisory
    color = "#ffcc00"  # Yellow for advisory
    if threshold_value >= 100:
        color = "#ff0000"  # Red for critical
    elif threshold_value >= 90:
        color = "#ff9900"  # Orange for warning
    return {
        "attachments": [{
            "color": color,
            "title": f"AWS Budget Alert: {budget_name}",
            "title_link": f"https://console.aws.amazon.com/billing/home#/budgets/details/{budget_name}",
            "fields": [
                {"title": "Alert Type", "value": alert_type, "short": True},
                {"title": "Threshold", "value": f"{threshold}%", "short": True},
                {"title": "Budget Limit", "value": f"${limit_amount}", "short": True},
                {"title": "Actual Spend", "value": f"${actual}", "short": True},
                {"title": "Forecasted", "value": f"${forecasted}", "short": True},
                {"title": "Account", "value": message.get('AccountId', 'Unknown'), "short": True}
            ],
            "footer": "AWS Budgets via KuyaOps FinOps",
            "ts": int(time.time())  # Slack expects a Unix timestamp
        }]
    }


def lambda_handler(event, context):
    """AWS Lambda handler for Slack budget notifications."""
    for record in event.get('Records', []):
        raw = record['Sns']['Message']
        try:
            sns_message = json.loads(raw)
        except ValueError:
            sns_message = None
        if isinstance(sns_message, dict):
            slack_message = format_budget_alert(sns_message)
        else:
            # Budgets may deliver plain text rather than JSON; forward it as-is
            slack_message = {"text": raw}
        req = urllib.request.Request(
            SLACK_WEBHOOK_URL,
            data=json.dumps(slack_message).encode('utf-8'),
            headers={'Content-Type': 'application/json'}
        )
        with urllib.request.urlopen(req, timeout=10) as response:
            print(f"Slack notification sent: {response.status}")
    return {"statusCode": 200}
Monthly Cost Review Process
A structured monthly review ensures budgets are meaningful and optimization opportunities are captured.
- Week 1 - Data Collection: Run Cost Explorer reports, pull anomaly detection results, generate tag compliance scorecard.
- Week 1 - Variance Analysis: Compare actual vs. budget for each team. Identify >10% variances requiring explanation.
- Week 2 - Team Reviews: Meet with each engineering team to review their costs, optimization opportunities, and forecast trends.
- Week 2 - Optimization Execution: Implement approved right-sizing, schedule shutdowns, purchase RIs/SPs as needed.
- Week 3 - Executive Summary: Publish executive dashboard with month-over-month trends, optimization savings, and forecast.
- Week 4 - Process Improvement: Update budgets based on trend analysis. Refine anomaly detection thresholds. Update tagging policies.
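The variance analysis step in Week 1 is a simple percentage comparison per team. A minimal sketch, assuming actuals and budgets are available as plain dictionaries; `flag_variances` and the sample actuals are hypothetical, while the budget amounts mirror the team budgets defined earlier:

```python
def flag_variances(actuals, budgets, tolerance_pct=10.0):
    """Return {team: variance %} for teams whose spend deviates more than the tolerance."""
    flagged = {}
    for team, budget in budgets.items():
        if budget <= 0:
            continue  # a percentage variance is undefined without a positive budget
        actual = actuals.get(team, 0.0)
        variance_pct = (actual - budget) / budget * 100
        if abs(variance_pct) > tolerance_pct:
            flagged[team] = round(variance_pct, 1)
    return flagged


if __name__ == "__main__":
    budgets = {"platform": 50000, "product": 30000, "data": 25000, "security": 15000}
    actuals = {"platform": 56000, "product": 29000, "data": 21000, "security": 15500}
    print(flag_variances(actuals, budgets))
```

Under-spend is flagged as well as over-spend. A team consistently 15% under budget usually means the budget is stale and should be revised in the Week 4 step.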
Cost Allocation Dashboards
# Example: Grafana dashboard JSON for cost visualization
# This assumes you're exporting CUR data to Athena and querying via Grafana
{
"dashboard": {
"title": "FinOps Cost Dashboard",
"panels": [
{
"title": "Monthly Spend by Team",
"type": "timeseries",
"targets": [{
"rawSql": "SELECT date_trunc('month', line_item_usage_start_date) as month, resource_tags_user_owner as team, SUM(line_item_blended_cost) as cost FROM cur WHERE line_item_usage_start_date >= date_add('month', -6, current_date) GROUP BY 1, 2 ORDER BY 1 DESC",
"refId": "A"
}],
"fieldConfig": {
"defaults": {
"unit": "currencyUSD",
"custom": {"axisLabel": "Cost (USD)"}
}
}
},
{
"title": "Top 10 Services by Cost",
"type": "piechart",
"targets": [{
"rawSql": "SELECT line_item_product_code as service, SUM(line_item_blended_cost) as cost FROM cur WHERE line_item_usage_start_date >= date_trunc('month', current_date) GROUP BY 1 ORDER BY 2 DESC LIMIT 10",
"refId": "B"
}]
},
{
"title": "Daily Cost Trend vs Budget",
"type": "timeseries",
"targets": [
{
"rawSql": "SELECT date(line_item_usage_start_date) as day, SUM(line_item_blended_cost) as actual FROM cur WHERE line_item_usage_start_date >= date_trunc('month', current_date) GROUP BY 1 ORDER BY 1",
"refId": "C"
},
{
"rawSql": "SELECT date(line_item_usage_start_date) as day, 150000.0 / day(date_trunc('month', current_date) + interval '1' month - interval '1' day) as daily_budget FROM cur WHERE line_item_usage_start_date >= date_trunc('month', current_date) GROUP BY 1 ORDER BY 1",
"refId": "D"
}
]
},
{
"title": "Tag Compliance Rate",
"type": "gauge",
"targets": [{
"rawSql": "SELECT (COUNT(DISTINCT CASE WHEN resource_tags_user_costcenter IS NOT NULL THEN line_item_resource_id END) * 100.0 / COUNT(DISTINCT line_item_resource_id)) as compliance_rate FROM cur WHERE line_item_usage_start_date >= date_trunc('month', current_date)",
"refId": "E"
}],
"fieldConfig": {
"defaults": {
"min": 0,
"max": 100,
"thresholds": {
"steps": [
{"color": "red", "value": 0},
{"color": "yellow", "value": 85},
{"color": "green", "value": 95}
]
},
"unit": "percent"
}
}
}
]
}
}
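The "Daily Cost Trend vs Budget" panel divides the monthly budget by the number of days in the month to draw a flat daily budget line. The same arithmetic in Python, as a sketch for checking the panel's numbers; `daily_budget` and `pacing_ratio` are hypothetical helper names:

```python
import calendar
from datetime import date


def daily_budget(monthly_budget, on):
    """Spread a monthly budget evenly across the days of the month containing `on`."""
    days_in_month = calendar.monthrange(on.year, on.month)[1]
    return monthly_budget / days_in_month


def pacing_ratio(mtd_actual, monthly_budget, on):
    """Ratio of month-to-date spend to the budget expected by `on` (1.0 means on pace)."""
    expected = daily_budget(monthly_budget, on) * on.day
    return mtd_actual / expected


if __name__ == "__main__":
    today = date(2024, 6, 15)
    print(daily_budget(150000, today))
    print(pacing_ratio(82500, 150000, today))
```

A ratio above 1.0 means spend is running ahead of an even daily pace. Workloads with strong weekly or end-of-month seasonality will need a less naive baseline.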
Automated Cost Reporting with Lambda
# weekly_cost_report.py: Automated weekly cost report via Lambda
import json
import os
import urllib.request
from datetime import datetime, timedelta

import boto3

DATE_FMT = '%Y-%m-%d'


def get_weekly_spend():
    """Get cost and usage data for the last seven complete days."""
    ce = boto3.client('ce')
    end = datetime.now().strftime(DATE_FMT)  # End date is exclusive
    start = (datetime.now() - timedelta(days=7)).strftime(DATE_FMT)
    response = ce.get_cost_and_usage(
        TimePeriod={'Start': start, 'End': end},
        Granularity='DAILY',
        Metrics=['BlendedCost', 'UsageQuantity'],
        GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}]
    )
    return response['ResultsByTime']


def get_mom_change():
    """Calculate month-over-month cost change, normalized for days elapsed."""
    ce = boto3.client('ce')
    now = datetime.now()
    current_start = now.replace(day=1)
    prev_month_end = current_start - timedelta(days=1)
    prev_start = prev_month_end.replace(day=1)
    # Cost Explorer treats End as exclusive, so the previous month's period
    # ends on the first day of the current month.
    previous = ce.get_cost_and_usage(
        TimePeriod={'Start': prev_start.strftime(DATE_FMT),
                    'End': current_start.strftime(DATE_FMT)},
        Granularity='MONTHLY',
        Metrics=['BlendedCost']
    )
    prev_cost = float(previous['ResultsByTime'][0]['Total']['BlendedCost']['Amount'])
    days_elapsed = now.day - 1  # complete days so far; today is excluded
    if days_elapsed == 0:
        # First of the month: Start would equal End, which Cost Explorer rejects
        return 0.0, 0.0, prev_cost
    current = ce.get_cost_and_usage(
        TimePeriod={'Start': current_start.strftime(DATE_FMT),
                    'End': now.strftime(DATE_FMT)},
        Granularity='MONTHLY',
        Metrics=['BlendedCost']
    )
    current_cost = float(current['ResultsByTime'][0]['Total']['BlendedCost']['Amount'])
    # Project month-to-date spend onto the length of the previous month
    days_in_prev = prev_month_end.day
    normalized_current = (current_cost / days_elapsed) * days_in_prev
    change_pct = ((normalized_current - prev_cost) / prev_cost) * 100 if prev_cost else 0.0
    return change_pct, current_cost, prev_cost


def generate_report():
    """Generate formatted weekly cost report."""
    weekly = get_weekly_spend()
    mom_change, current_spend, prev_spend = get_mom_change()
    total_weekly = 0.0
    service_totals = {}
    # Aggregate daily figures by service
    for day in weekly:
        for group in day.get('Groups', []):
            cost = float(group['Metrics']['BlendedCost']['Amount'])
            if cost > 0:
                total_weekly += cost
                service = group['Keys'][0]
                service_totals[service] = service_totals.get(service, 0) + cost
    top_services = sorted(service_totals.items(), key=lambda x: x[1], reverse=True)[:5]
    report = f"""
*Weekly Cloud Cost Report*
*Week of:* {datetime.now().strftime(DATE_FMT)}
*7-Day Total:* ${total_weekly:,.2f}
*MTD Spend:* ${current_spend:,.2f}
*MoM Trend:* {mom_change:+.1f}%
*Top 5 Services:*
"""
    for svc, cost in top_services:
        pct = (cost / total_weekly) * 100 if total_weekly else 0.0
        report += f"• `{svc}`: ${cost:,.2f} ({pct:.1f}%)\n"
    report += f"\n_Budget Status: {'✅ On track' if mom_change < 10 else '⚠️ Review needed'}_"
    return report


def lambda_handler(event, context):
    """Lambda entry point for scheduled weekly reports."""
    report = generate_report()
    # Send to Slack
    req = urllib.request.Request(
        os.environ['SLACK_WEBHOOK_URL'],
        data=json.dumps({"text": report}).encode(),
        headers={'Content-Type': 'application/json'}
    )
    urllib.request.urlopen(req, timeout=10)
    # Send to S3 for archival
    s3 = boto3.client('s3')
    s3.put_object(
        Bucket=os.environ['REPORTS_BUCKET'],
        Key=f"cost-reports/weekly/{datetime.now().strftime(DATE_FMT)}.txt",
        Body=report.encode()
    )
    return {"statusCode": 200, "body": "Report generated and sent."}
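The month-over-month figure in the report depends on normalizing a partial month before comparing it with a full one. The arithmetic can be isolated and checked without calling Cost Explorer. A sketch with a hypothetical `normalized_mom_change` helper:

```python
def normalized_mom_change(mtd_cost, days_elapsed, prev_cost, days_in_prev):
    """Project month-to-date spend onto the previous month's length and return the % change.

    Returns None when the comparison is undefined (no elapsed days or no prior spend).
    """
    if days_elapsed <= 0 or prev_cost <= 0:
        return None
    projected = mtd_cost / days_elapsed * days_in_prev
    return (projected - prev_cost) / prev_cost * 100


if __name__ == "__main__":
    # 55,000 USD spent in 10 days, against 150,000 USD over a 30-day previous month
    print(normalized_mom_change(55000, 10, 150000, 30))
```

Comparing the raw 55,000 USD against 150,000 USD would suggest spend is far below last month. Projecting the daily run rate onto 30 days shows it is actually trending about 10% higher.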
FinOps Review Meeting Agenda
## FinOps Monthly Review Meeting: Agenda Template
*Date:* ___________ *Attendees:* ___________
### 1. Executive Summary (5 min)
- Total cloud spend vs. budget
- Month-over-month change (%)
- Year-to-date spend vs. annual budget
### 2. Cost Allocation by Team (10 min)
- Each team presents their spend, variance, and forecast
- Tag compliance score for each team
- Action items from previous month
### 3. Optimization Opportunities (15 min)
- Right-sizing recommendations (top 10 by savings)
- Idle/orphaned resources identified
- RI/SP utilization and coverage
- Spot instance adoption rate
### 4. Anomalies and Incidents (10 min)
- Cost anomalies detected and investigated
- Root cause for any budget overruns
- Preventive measures implemented
### 5. Forecasting and Planning (10 min)
- Next month forecast by team
- Upcoming infrastructure changes (launches, migrations)
- RI/SP purchase recommendations
### 6. Process Improvements (10 min)
- Tagging policy updates
- Automation opportunities
- Tool evaluation or changes
- Training needs
### Action Items
| # | Action | Owner | Due Date | Status |
|---|--------|-------|----------|--------|
| 1 | | | | |
| 2 | | | | |
| 3 | | | | |
*Next Meeting:* ___________
Emergency Cost Response Runbook
This runbook addresses sudden, unexpected cost spikes that threaten to exceed monthly budgets.
- T+0 - Detection: Anomaly detection alert or budget threshold breach triggers a PagerDuty incident. The on-call engineer acknowledges within 15 minutes.
- T+15m - Initial Assessment:
  - Check AWS Cost Explorer for the top cost-increasing services in the last 24 hours.
  - Identify the specific resource(s) driving the increase (instance ID, Lambda function, etc.).
  - Determine if the increase is expected (launch event, traffic spike) or anomalous.
- T+30m - Containment (if anomalous):
  - For EC2: Scale down Auto Scaling group or stop non-critical instances.
  - For Lambda: Reduce concurrency limit or disable function (if non-critical).
  - For S3: Add a lifecycle policy to transition data to a cheaper storage tier (transitions run asynchronously, so savings are not immediate).
  - For data transfer: Identify and block unexpected egress if malicious.

  Emergency cost containment script:

  #!/bin/bash
  # emergency_containment.sh: Quick containment for cost spikes
  SERVICE=$1  # e.g., ec2, lambda, asg
  ACTION=$2   # e.g., stop, limit, downscale
  case "$SERVICE-$ACTION" in
    ec2-stop)
      # Stop all running dev/staging instances immediately
      aws ec2 describe-instances \
        --filters "Name=tag:Environment,Values=dev,staging" \
                  "Name=instance-state-name,Values=running" \
        --query 'Reservations[].Instances[].InstanceId' \
        --output text | xargs -r aws ec2 stop-instances --instance-ids
      ;;
    lambda-limit)
      FUNCTION_NAME=$3
      aws lambda put-function-concurrency \
        --function-name "$FUNCTION_NAME" \
        --reserved-concurrent-executions 0
      ;;
    asg-downscale)
      ASG_NAME=$3
      aws autoscaling update-auto-scaling-group \
        --auto-scaling-group-name "$ASG_NAME" \
        --desired-capacity 1 \
        --min-size 0
      ;;
    *)
      echo "Unknown service-action combination: $SERVICE-$ACTION"
      exit 1
      ;;
  esac

- T+1h - Communication: Post incident summary to #incidents Slack channel. Notify affected teams. Begin root cause analysis.
- T+4h - Root Cause: Complete root cause analysis. Common causes:
  - Infinite loop in Lambda or ECS task
  - Runaway autoscaling due to incorrect metric or DDoS
  - Data transfer loop (cross-AZ, inter-region)
  - Crypto mining via compromised credentials
  - Accidental provisioning of large instances (e.g., xlarge instead of large)
- T+24h - Remediation: Implement permanent fix. Update monitoring thresholds. Document lessons learned.
- T+1w - Retrospective: Conduct blameless post-mortem. Update runbook with any new containment procedures identified.
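The initial assessment step in the runbook (finding the top cost-increasing services over the last 24 hours) comes down to a per-service delta between two daily totals. A minimal sketch, assuming the per-service costs have already been pulled from Cost Explorer into dictionaries; `top_cost_increases` and the sample figures are hypothetical:

```python
def top_cost_increases(previous, current, limit=5):
    """Return (service, increase) pairs sorted by largest day-over-day increase."""
    services = set(previous) | set(current)
    deltas = {svc: current.get(svc, 0.0) - previous.get(svc, 0.0) for svc in services}
    increases = [(svc, delta) for svc, delta in deltas.items() if delta > 0]
    return sorted(increases, key=lambda item: item[1], reverse=True)[:limit]


if __name__ == "__main__":
    yesterday = {"EC2": 1000, "Lambda": 50, "S3": 200}
    today = {"EC2": 1100, "Lambda": 900, "S3": 190, "NAT Gateway": 75}
    for service, delta in top_cost_increases(yesterday, today):
        print(f"{service}: +${delta:,.2f}")
```

Services that appear only in the current day count their full cost as an increase. Newly provisioned resources are a common source of spikes, so this is usually the behavior wanted during triage.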