Project: Terraform Datadog Module
A reusable Terraform module for comprehensive Datadog integration, including monitors, dashboards, and synthetic tests, designed for multi-environment AWS deployments.
Project Overview
In production environments, observability is not optional. This Terraform module codifies years of production SRE experience into a single, reusable component that provisions complete Datadog observability for any AWS-hosted service. Instead of manually creating monitors that drift out of date, teams declare their monitoring requirements alongside their infrastructure, so every deployment is monitored from day one.
- Convention over configuration: Sensible defaults for standard AWS services
- Environment parity: Same monitoring in dev, staging, and production, with different alert thresholds
- Tag propagation: AWS resource tags automatically flow into Datadog for filtering and cost attribution
- Alert routing: Built-in PagerDuty and Slack integration with environment-aware routing
- Composable: Use the full module or pick individual sub-modules (monitors-only, dashboard-only)
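As a rough illustration of the composable option, a sub-module can be referenced directly with Terraform's `//` subdirectory syntax. This is a minimal sketch: the `modules/monitors` path comes from the module structure, while the input names are assumed to mirror the root module and may differ in the actual sub-module.

```hcl
# Sketch only: input names are assumed to match the root module's variables.
module "datadog_monitors" {
  # The double slash selects a subdirectory inside the repository.
  source = "github.com/j1-medilo06/terraform-datadog-module//modules/monitors?ref=v2.1.0"

  environment  = "production"
  service_name = "api-service"

  enable_apm_monitors   = true
  enable_infra_monitors = false
  enable_log_monitors   = false
}
```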
Module Structure
terraform-datadog-module/
├── modules/
│   ├── monitors/               # APM, infrastructure, and custom log monitors
│   │   ├── apm_monitors.tf
│   │   ├── infra_monitors.tf
│   │   ├── log_monitors.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── dashboard/              # Service overview dashboard
│   │   ├── main.tf
│   │   ├── widgets/            # Reusable widget definitions
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── synthetics/             # API and browser tests
│   │   ├── api_tests.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   └── alert-routing/          # PagerDuty and Slack notification channels
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
├── examples/
│   ├── complete/               # Full module usage example
│   ├── monitors-only/          # Monitors sub-module only
│   └── multi-environment/      # Dev/staging/prod pattern
├── main.tf                     # Root module composition
├── variables.tf
├── outputs.tf
└── README.md
Module Usage
Complete Example
module "datadog" {
source = "github.com/j1-medilo06/terraform-datadog-module?ref=v2.1.0"
# ---------------------------------------------------------------------------
# Core Configuration
# ---------------------------------------------------------------------------
environment = "production"
service_name = "api-service"
team = "platform"
# AWS tag propagation: monitors will include these tags automatically
tags = {
Environment = "production"
Service = "api-service"
Team = "platform"
CostCenter = "engineering"
ManagedBy = "terraform"
}
# ---------------------------------------------------------------------------
# APM Monitors
# ---------------------------------------------------------------------------
enable_apm_monitors = true
apm_error_rate_threshold = {
critical = 5.0 # Alert if error rate exceeds 5%
warning = 2.0 # Warn at 2%
}
apm_latency_p99_threshold = {
critical = 2000 # 2 seconds (critical)
warning = 1000 # 1 second (warning)
}
apm_latency_p95_threshold = {
critical = 1000 # 1 second
warning = 500 # 500ms
}
apm_request_rate_anomaly = true # Enable anomaly detection for traffic drops
# ---------------------------------------------------------------------------
# Infrastructure Monitors (auto-discovered from AWS tags)
# ---------------------------------------------------------------------------
enable_infra_monitors = true
# Target the ECS service by cluster and service name
ecs_cluster_name = "api-service-prod"
ecs_service_name = "api"
cpu_threshold = {
critical = 85
warning = 70
}
memory_threshold = {
critical = 90
warning = 75
}
disk_threshold = {
critical = 90
warning = 80
}
# ---------------------------------------------------------------------------
# Log-Based Monitors
# ---------------------------------------------------------------------------
enable_log_monitors = true
# High error rate from application logs
log_error_pattern = "status:error @service:api-service"
log_error_threshold = {
critical = 50 # 50 errors in evaluation window
warning = 20
evaluation_window = "last_5m"
}
# Detect specific error patterns (e.g., database connection failures)
custom_log_monitors = [
{
name = "Database Connection Failures"
query = "logs('service:api-service @error_type:db_connection').index('*').rollup('count').last('5m') > 10"
message = <<-EOT
Database connection failures detected on {{service.name}}.
{{#is_alert}}
Critical: More than 10 DB connection failures in the last 5 minutes.
Runbook: https://wiki.internal/runbooks/db-connection-failures
{{/is_alert}}
{{#is_warning}}
Warning: Elevated DB connection failures. Investigate before escalation.
{{/is_warning}}
EOT
priority = 2
}
]
# ---------------------------------------------------------------------------
# Synthetic Tests
# ---------------------------------------------------------------------------
enable_synthetics = true
synthetic_api_tests = [
{
name = "API Health Check"
url = "https://api.kuyaops.com/health"
method = "GET"
assertions = [
{ type = "statusCode", operator = "is", target = "200" },
{ type = "responseTime", operator = "lessThan", target = "2000" },
{ type = "body", operator = "contains", target = "\"status\":\"ok\"" }
]
locations = ["aws:us-east-1", "aws:us-west-2", "aws:eu-west-1"]
frequency = 60 # seconds
},
{
name = "Authentication Endpoint"
url = "https://api.kuyaops.com/v1/auth/token"
method = "POST"
request_body = jsonencode({
grant_type = "client_credentials"
scope = "read"
})
assertions = [
{ type = "statusCode", operator = "is", target = "200" },
{ type = "body", operator = "validatesJSONPath", target = "$.access_token" }
]
locations = ["aws:us-east-1"]
frequency = 300
}
]
# ---------------------------------------------------------------------------
# Dashboard
# ---------------------------------------------------------------------------
enable_dashboard = true
dashboard_title = "API Service Overview - Production"
# Dashboard will include:
# - Request rate, error rate, latency (p50/p95/p99)
# - Infrastructure metrics (CPU, memory, task count)
# - Log volume and error breakdown
# - Synthetic test results
# - Top errors and slowest endpoints
# ---------------------------------------------------------------------------
# Alert Routing
# ---------------------------------------------------------------------------
enable_alert_routing = true
# PagerDuty: critical alerts create incidents
pagerduty_service_key = var.pagerduty_service_key # Stored in Vault/AWS SM
# Slack: warnings and notifications
slack_channel = "#sre-alerts"
# Route by severity: critical -> PagerDuty and Slack, warning and info -> Slack only
alert_routing_rules = {
critical = ["pagerduty", "slack"]
warning = ["slack"]
info = ["slack"]
}
}
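The threshold inputs in the example are objects with `critical` and `warning` keys. One way such an input could be declared inside the module is sketched below; the description, defaults, and validation rule are assumptions for illustration, not the module's confirmed `variables.tf`.

```hcl
# Hypothetical declaration for one threshold input.
variable "cpu_threshold" {
  description = "Warning and critical CPU utilization thresholds, in percent."

  type = object({
    critical = number
    warning  = number
  })

  # Illustrative defaults taken from the usage example.
  default = {
    critical = 85
    warning  = 70
  }

  # Reject configurations where the warning would fire after the critical alert.
  validation {
    condition     = var.cpu_threshold.warning < var.cpu_threshold.critical
    error_message = "The warning threshold must be lower than the critical threshold."
  }
}
```

A validation block like this surfaces an inverted threshold pair at `terraform plan` time rather than as a rejected monitor during apply.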
Resources Created
| Category | Resource | Description |
|---|---|---|
| APM Monitors | Error Rate Monitor | Alert on HTTP 5xx rate above threshold |
| | P99 Latency Monitor | Alert on tail latency degradation |
| | P95 Latency Monitor | Earlier warning for latency trends |
| | Request Rate Anomaly | Detect traffic drops (outage indicator) |
| Infrastructure | ECS CPU Monitor | Container CPU utilization |
| | ECS Memory Monitor | Container memory utilization |
| | Disk Usage Monitor | EBS/ECS Fargate ephemeral storage |
| | Task Count Monitor | Ensure minimum healthy task count |
| Log Monitors | Error Volume Monitor | Count of error-level log entries |
| | Custom Pattern Monitors | User-defined log query alerts |
| Synthetic Tests | API Health Checks | Multi-region HTTP endpoint monitoring |
| | SSL Certificate Expiry | Alert 30/14/7 days before expiry |
| Dashboard | Service Overview | Unified dashboard with all key metrics |
| Alert Routing | PagerDuty Integration | Critical incident creation |
| | Slack Notifications | Warning and informational alerts |
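The SSL Certificate Expiry row could be implemented with one Datadog SSL synthetic test per threshold. The sketch below is an assumption about how that might look, not the module's actual code: the `ssl_expiry_days` variable and resource name are hypothetical, and the host is taken from the health-check URL in the usage example.

```hcl
# Hypothetical input: days-before-expiry thresholds from the table above.
variable "ssl_expiry_days" {
  type    = list(number)
  default = [30, 14, 7]
}

resource "datadog_synthetics_test" "ssl_expiry" {
  # for_each needs a set of strings, so the numbers are converted first.
  for_each = toset([for d in var.ssl_expiry_days : tostring(d)])

  name      = "api-service SSL certificate expires within ${each.value} days"
  type      = "api"
  subtype   = "ssl"
  status    = "live"
  message   = "Certificate for api.kuyaops.com expires in fewer than ${each.value} days."
  locations = ["aws:us-east-1"]

  request_definition {
    host = "api.kuyaops.com"
    port = "443" # recent provider versions expect a string here
  }

  # The test fails once the certificate is no longer valid for more than N days.
  assertion {
    type     = "certificate"
    operator = "isInMoreThan"
    target   = each.value
  }

  options_list {
    tick_every = 86400 # check once per day
  }
}
```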
Tag Propagation from AWS Resources
A key feature of this module is automatic tag discovery from AWS resources. When you tag your ECS services, RDS instances, and ALBs consistently, the module queries those tags and applies them to Datadog monitors, enabling filtering and cost attribution.
# modules/monitors/infra_monitors.tf
# Automatically discover AWS resource tags and apply to Datadog monitors
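data "aws_ecs_cluster" "target" {
count = var.ecs_cluster_name != "" ? 1 : 0
cluster_name = var.ecs_cluster_name
}
# The aws_ecs_service data source identifies its cluster by ARN (cluster_arn), not by name
data "aws_ecs_service" "target" {
count = var.ecs_cluster_name != "" ? 1 : 0
cluster_arn = data.aws_ecs_cluster.target[0].arn
service_name = var.ecs_service_name
}
locals {
# Tags discovered on the ECS service; empty when no cluster is configured
discovered_tags = try(data.aws_ecs_service.target[0].tags, {})
# Merge module-provided tags with discovered AWS tags
datadog_tags = concat(
[for k, v in var.tags : "${lower(k)}:${lower(v)}"],
[for k, v in local.discovered_tags : "${lower(k)}:${lower(v)}"],
["env:${var.environment}"],
["service:${var.service_name}"]
)
# Standard monitor naming convention
monitor_name = "${var.service_name} - ${var.environment}"
}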
# ECS CPU Monitor
resource "datadog_monitor" "ecs_cpu" {
count = var.enable_infra_monitors && var.ecs_cluster_name != "" ? 1 : 0
name = "${local.monitor_name} - ECS CPU High"
type = "metric alert"
message = <<-EOT
ECS CPU utilization is {{value}}% for {{service.name}}.
{{#is_alert}}
Cluster: ${var.ecs_cluster_name}
Service: ${var.ecs_service_name}
Runbook: ${var.runbook_url}
Escalation: @pagerduty-${var.service_name}-${var.environment}
{{/is_alert}}
EOT
query = <<-EOQ
avg(last_5m):
avg:aws.ecs.service.cpuutilization{
cluster:${var.ecs_cluster_name},
service:${var.ecs_service_name}
} > ${var.cpu_threshold.critical}
EOQ
monitor_thresholds {
critical = var.cpu_threshold.critical
warning = var.cpu_threshold.warning
}
notify_no_data = true
no_data_timeframe = 10
evaluation_delay = 60
tags = local.datadog_tags
priority = 2 # P2 = high
}
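To make the tag conversion concrete, the following standalone sketch evaluates the same `for` expression on a small sample map. The `example_*` names are hypothetical. It also shows one thing worth checking: when the input map already contains a `Service` key equal to the service name, plain `concat` yields `service:api-service` twice, and wrapping the list in `distinct()` is one possible way to drop the repeat.

```hcl
locals {
  # Hypothetical sample input, a subset of the tags in the usage example.
  example_tags = {
    Environment = "production"
    Service     = "api-service"
    Team        = "platform"
  }

  # Keys are visited in lexical order; distinct() keeps the first occurrence.
  example_datadog_tags = distinct(concat(
    [for k, v in local.example_tags : "${lower(k)}:${lower(v)}"],
    ["env:production"],
    ["service:api-service"]
  ))
}

output "example_datadog_tags" {
  # Evaluates to:
  # ["environment:production", "service:api-service", "team:platform", "env:production"]
  value = local.example_datadog_tags
}
```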
Multi-Environment Configuration Pattern
The module is instantiated once per environment, each in its own directory, so monitor definitions stay consistent while thresholds and alert routing are adjusted to each environment's requirements.
# environments/production/main.tf
module "datadog_prod" {
source = "github.com/j1-medilo06/terraform-datadog-module?ref=v2.1.0"
environment = "production"
service_name = "api-service"
# Production: tight thresholds, PagerDuty on critical
apm_error_rate_threshold = { critical = 5.0, warning = 2.0 }
apm_latency_p99_threshold = { critical = 2000, warning = 1000 }
pagerduty_service_key = data.aws_secretsmanager_secret_version.pagerduty_prod.secret_string
slack_channel = "#sre-alerts"
}
# environments/staging/main.tf
module "datadog_staging" {
source = "github.com/j1-medilo06/terraform-datadog-module?ref=v2.1.0"
environment = "staging"
service_name = "api-service"
# Staging: more lenient, Slack only (no PagerDuty)
apm_error_rate_threshold = { critical = 10.0, warning = 5.0 }
apm_latency_p99_threshold = { critical = 5000, warning = 3000 }
enable_alert_routing = true
pagerduty_service_key = "" # No PagerDuty in staging
slack_channel = "#staging-alerts"
}
# environments/development/main.tf
module "datadog_dev" {
source = "github.com/j1-medilo06/terraform-datadog-module?ref=v2.1.0"
environment = "development"
service_name = "api-service"
# Dev: monitors for visibility, no alert routing
enable_apm_monitors = true
enable_infra_monitors = true
enable_synthetics = false # No synthetics in dev
enable_alert_routing = false # No alerts in dev
enable_dashboard = true
}
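Where all environments live in a single Terraform configuration, the same pattern can be expressed with a module-level `for_each` (Terraform 0.13 or later). This is an alternative sketch rather than the layout the module prescribes. The `environments` local is hypothetical, and the threshold values are copied from the production and staging examples.

```hcl
locals {
  # Hypothetical per-environment settings.
  environments = {
    production = {
      error_rate  = { critical = 5.0, warning = 2.0 }
      latency_p99 = { critical = 2000, warning = 1000 }
      slack       = "#sre-alerts"
    }
    staging = {
      error_rate  = { critical = 10.0, warning = 5.0 }
      latency_p99 = { critical = 5000, warning = 3000 }
      slack       = "#staging-alerts"
    }
  }
}

module "datadog" {
  for_each = local.environments
  source   = "github.com/j1-medilo06/terraform-datadog-module?ref=v2.1.0"

  environment  = each.key
  service_name = "api-service"

  apm_error_rate_threshold  = each.value.error_rate
  apm_latency_p99_threshold = each.value.latency_p99
  slack_channel             = each.value.slack
}
```

Separate directories keep state isolated per environment; the `for_each` form trades that isolation for less repetition.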
Alert Routing with PagerDuty and Slack
The alert routing sub-module creates notification channels and applies them to monitors based on severity rules.
# modules/alert-routing/main.tf
# ---------------------------------------------------------------------------
# PagerDuty Integration
# ---------------------------------------------------------------------------
resource "datadog_integration_pagerduty_service_object" "critical" {
count = var.pagerduty_service_key != "" ? 1 : 0
service_name = "${var.service_name}-${var.environment}"
service_key = var.pagerduty_service_key
}
# ---------------------------------------------------------------------------
# Slack Integration
# ---------------------------------------------------------------------------
resource "datadog_integration_slack_channel" "alerts" {
count = var.slack_channel != "" ? 1 : 0
channel_name = var.slack_channel
account_name = var.slack_account_name # Configured in Datadog org settings
display {
message = true
notified = true
snapshot = true
tags = true
}
}
# ---------------------------------------------------------------------------
# Helper outputs for monitor module consumption
# ---------------------------------------------------------------------------
locals {
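critical_mentions = compact([
# The PagerDuty handle must match the service_name registered above
var.pagerduty_service_key != "" ? "@pagerduty-${var.service_name}-${var.environment}" : "",
# Slack handles omit the leading "#" of the channel name
var.slack_channel != "" ? "@slack-${trimprefix(var.slack_channel, "#")}" : "",
])
warning_mentions = compact([
var.slack_channel != "" ? "@slack-${trimprefix(var.slack_channel, "#")}" : "",
])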
# Build mention strings
mention_critical = length(local.critical_mentions) > 0 ? join(" ", local.critical_mentions) : ""
mention_warning = length(local.warning_mentions) > 0 ? join(" ", local.warning_mentions) : ""
}
output "critical_mention" {
description = "Mention string for critical alerts"
value = local.mention_critical
}
output "warning_mention" {
description = "Mention string for warning alerts"
value = local.mention_warning
}
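The outputs above are meant to be consumed when building monitor messages. A minimal sketch of that wiring in the root module follows. The `count` guard, the local names, and the `notification_footer` string are assumptions about how the composition could be done, not the module's confirmed `main.tf`.

```hcl
# Root main.tf (sketch; wiring and local names are assumptions)
module "alert_routing" {
  source = "./modules/alert-routing"
  count  = var.enable_alert_routing ? 1 : 0

  service_name          = var.service_name
  environment           = var.environment
  pagerduty_service_key = var.pagerduty_service_key
  slack_channel         = var.slack_channel
  slack_account_name    = var.slack_account_name
}

locals {
  # Fall back to empty strings when alert routing is disabled.
  critical_mention = try(module.alert_routing[0].critical_mention, "")
  warning_mention  = try(module.alert_routing[0].warning_mention, "")

  # Datadog's {{#is_alert}} and {{#is_warning}} conditionals pass through
  # Terraform untouched; only the ${...} parts are interpolated.
  notification_footer = <<-EOT
    {{#is_alert}}${local.critical_mention}{{/is_alert}}
    {{#is_warning}}${local.warning_mention}{{/is_warning}}
  EOT
}
```

Appending `local.notification_footer` to each monitor's `message` would then send critical alerts to PagerDuty and Slack, and warnings to Slack only.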
Dashboard Example
The module generates a comprehensive service overview dashboard. Here is a sample of the widget configuration:
# modules/dashboard/main.tf (excerpt)
resource "datadog_dashboard" "service_overview" {
count = var.enable_dashboard ? 1 : 0
title = var.dashboard_title
description = "Auto-generated dashboard for ${var.service_name} (${var.environment})"
layout_type = "ordered"
# Group: Request Overview
widget {
group_definition {
title = "Request Overview"
layout_type = "ordered"
# Request rate timeseries
widget {
timeseries_definition {
title = "Request Rate (req/s)"
request {
q = "sum:aws.applicationelb.request_count{service:${var.service_name}}.as_rate()"
style { palette = "dog_classic" }
}
}
}
# Error rate
widget {
timeseries_definition {
title = "Error Rate (%)"
request {
q = "( sum:aws.applicationelb.httpcode_target_5xx{service:${var.service_name}}.as_rate() / sum:aws.applicationelb.request_count{service:${var.service_name}}.as_rate() ) * 100"
style { palette = "warm" }
}
}
}
# Latency distribution
widget {
timeseries_definition {
title = "Latency (average target response time)"
request {
q = "avg:aws.applicationelb.target_response_time{service:${var.service_name}}.rollup(avg)"
display_type = "line"
}
}
}
}
}
# Group: Infrastructure
widget {
group_definition {
title = "Infrastructure"
layout_type = "ordered"
widget {
query_value_definition {
title = "ECS Task Count"
unit = "none"
autoscale = true
request {
q = "avg:aws.ecs.service.running{cluster:${var.ecs_cluster_name},service:${var.ecs_service_name}}"
}
}
}
widget {
heatmap_definition {
title = "CPU vs Memory Heatmap"
request {
q = "avg:aws.ecs.service.cpuutilization{cluster:${var.ecs_cluster_name}}"
}
request {
q = "avg:aws.ecs.service.memory_utilization{cluster:${var.ecs_cluster_name}}"
}
}
}
}
}
# Datadog dashboards accept only tags of the form team:<name>
tags = ["team:${var.team}"]
}
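Because the dashboard resource is created conditionally with `count`, any output that exposes it has to handle the case where it does not exist. One possible shape for such an output is sketched here; the output name is hypothetical.

```hcl
# modules/dashboard/outputs.tf (sketch)
output "dashboard_url" {
  description = "URL of the service overview dashboard, or null when enable_dashboard is false."

  # one() returns the single element of the list, or null for an empty list.
  value = one(datadog_dashboard.service_overview[*].url)
}
```

The returned value may be a path relative to the Datadog site rather than a full URL, so callers might need to prepend their site's base address.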
Integration with Existing AWS Infrastructure
This module is designed to be added to existing Terraform projects without conflict. Simply reference it alongside your existing AWS resources:
# Existing infrastructure: no changes needed
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
# ... existing config
}
module "ecs" {
source = "terraform-aws-modules/ecs/aws"
# ... existing config
}
# Add Datadog monitoring: references existing resources by name
module "monitoring" {
source = "github.com/j1-medilo06/terraform-datadog-module?ref=v2.1.0"
environment = "production"
service_name = "payment-api"
ecs_cluster_name = module.ecs.cluster_name
ecs_service_name = "payment-service"
enable_apm_monitors = true
enable_infra_monitors = true
enable_synthetics = true
enable_dashboard = true
enable_alert_routing = true
pagerduty_service_key = var.pagerduty_payment_key
slack_channel = "#payments-alerts"
}
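The calling project also needs the Datadog provider configured next to the AWS provider. A minimal sketch follows. The variable names are assumptions, and version constraints are left out because the source does not state which versions the module supports.

```hcl
terraform {
  required_providers {
    datadog = {
      source = "DataDog/datadog"
    }
    aws = {
      source = "hashicorp/aws"
    }
  }
}

# Hypothetical variable names; supply the values from a secret store.
variable "datadog_api_key" {
  type      = string
  sensitive = true
}

variable "datadog_app_key" {
  type      = string
  sensitive = true
}

provider "datadog" {
  api_key = var.datadog_api_key
  app_key = var.datadog_app_key
}
```

The provider can alternatively read the `DD_API_KEY` and `DD_APP_KEY` environment variables, which keeps the keys out of Terraform variables entirely.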
SRE Best Practices Implemented
- Monitoring the Four Golden Signals: Latency, Traffic, Errors, and Saturation monitors for every service
- Alert on symptoms, not causes: Error rate and latency alerts fire on user-impacting metrics, not just resource thresholds
- Tunable thresholds per environment: Production uses strict SLOs; dev/staging use relaxed values
- Actionable alerts: Every alert includes runbook links and escalation paths
- No alert fatigue: Warning thresholds catch trends early; critical thresholds require immediate action
- Tag-driven automation: Consistent tagging enables automated monitor scoping and cost allocation
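As an illustration of alerting on symptoms, an APM error-rate monitor could be written roughly as follows. This is a sketch, not the module's actual `apm_monitors.tf`: the `trace.http.request` metric prefix depends on the instrumented span name and is an assumption, as are the resource name and message text.

```hcl
resource "datadog_monitor" "apm_error_rate" {
  count = var.enable_apm_monitors ? 1 : 0

  name = "${var.service_name} - ${var.environment} - APM Error Rate High"
  type = "query alert"

  # Error percentage = errors / hits * 100 over the last five minutes.
  # The trace.http.request.* metric names are assumed; adjust to the service's span name.
  query = "sum(last_5m):( sum:trace.http.request.errors{service:${var.service_name},env:${var.environment}}.as_count() / sum:trace.http.request.hits{service:${var.service_name},env:${var.environment}}.as_count() ) * 100 > ${var.apm_error_rate_threshold.critical}"

  message = <<-EOT
    Error rate for ${var.service_name} is {{value}}%.
    Runbook: ${var.runbook_url}
  EOT

  monitor_thresholds {
    critical = var.apm_error_rate_threshold.critical
    warning  = var.apm_error_rate_threshold.warning
  }

  tags = ["env:${var.environment}", "service:${var.service_name}"]
}
```

The query measures what users experience (failed requests as a share of traffic) rather than a resource level such as CPU.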
GitHub Repository
| Resource | Link |
|---|---|
| Terraform Module Repository | github.com/j1-medilo06/terraform-datadog-module |
| Datadog Terraform Provider | registry.terraform.io/providers/DataDog/datadog |
| Author GitHub | github.com/j1-medilo06 |