Project: Terraform Datadog Module

A reusable Terraform module for comprehensive Datadog integration, including monitors, dashboards, and synthetic tests — designed for multi-environment AWS deployments.

Project Overview

In production environments, observability is not optional. This Terraform module codifies years of production SRE experience into a single, reusable component that provisions complete Datadog observability for any AWS-hosted service. Instead of manually creating monitors that drift out of date, teams declare their monitoring requirements alongside their infrastructure — ensuring every deployment is monitored from day one.

Design Goals:

Convention over configuration: Sensible defaults for standard AWS services
Environment parity: Same monitoring in dev, staging, and production — with different alert thresholds
Tag propagation: AWS resource tags automatically flow into Datadog for filtering and cost attribution
Alert routing: Built-in PagerDuty and Slack integration with environment-aware routing
Composable: Use the full module or pick individual sub-modules (monitors-only, dashboard-only)
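The "Composable" goal could look like the following in practice. This is a minimal sketch, assuming the sub-module can be addressed with Terraform's `//` subdirectory syntax and that it accepts the same input names as the root module; the sub-module's actual inputs are not listed here, so treat the arguments as assumptions.

```hcl
# Sketch: consume only the monitors sub-module.
# Assumption: modules/monitors accepts these inputs under the same names
# as the root module. Check its variables.tf before relying on this.
module "monitors_only" {
  source = "github.com/j1-medilo06/terraform-datadog-module//modules/monitors?ref=v2.1.0"

  environment  = "staging"
  service_name = "api-service"

  enable_apm_monitors   = true
  enable_infra_monitors = false
}
```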

Module Structure

terraform-datadog-module/
├── modules/
│   ├── monitors/           # APM, infrastructure, and custom log monitors
│   │   ├── apm_monitors.tf
│   │   ├── infra_monitors.tf
│   │   ├── log_monitors.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── dashboard/          # Service overview dashboard
│   │   ├── main.tf
│   │   ├── widgets/        # Reusable widget definitions
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── synthetics/         # API and browser tests
│   │   ├── api_tests.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   └── alert-routing/      # PagerDuty and Slack notification channels
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
├── examples/
│   ├── complete/           # Full module usage example
│   ├── monitors-only/      # Monitors sub-module only
│   └── multi-environment/  # Dev/staging/prod pattern
├── main.tf                 # Root module composition
├── variables.tf
├── outputs.tf
└── README.md

Module Usage

Complete Example

module "datadog" {
  source = "github.com/j1-medilo06/terraform-datadog-module?ref=v2.1.0"

  # ---------------------------------------------------------------------------
  # Core Configuration
  # ---------------------------------------------------------------------------
  environment  = "production"
  service_name = "api-service"
  team         = "platform"
  
  # AWS tag propagation — monitors will include these tags automatically
  tags = {
    Environment = "production"
    Service     = "api-service"
    Team        = "platform"
    CostCenter  = "engineering"
    ManagedBy   = "terraform"
  }

  # ---------------------------------------------------------------------------
  # APM Monitors
  # ---------------------------------------------------------------------------
  enable_apm_monitors = true
  
  apm_error_rate_threshold = {
    critical = 5.0    # Alert if error rate exceeds 5%
    warning  = 2.0    # Warn at 2%
  }
  
  apm_latency_p99_threshold = {
    critical = 2000   # 2 seconds (critical)
    warning  = 1000   # 1 second (warning)
  }
  
  apm_latency_p95_threshold = {
    critical = 1000   # 1 second
    warning  = 500    # 500ms
  }
  
  apm_request_rate_anomaly = true  # Enable anomaly detection for traffic drops

  # ---------------------------------------------------------------------------
  # Infrastructure Monitors (auto-discovered from AWS tags)
  # ---------------------------------------------------------------------------
  enable_infra_monitors = true
  
  # Target the ECS service by cluster and service name
  ecs_cluster_name = "api-service-prod"
  ecs_service_name = "api"
  
  cpu_threshold = {
    critical = 85
    warning  = 70
  }
  
  memory_threshold = {
    critical = 90
    warning  = 75
  }
  
  disk_threshold = {
    critical = 90
    warning  = 80
  }

  # ---------------------------------------------------------------------------
  # Log-Based Monitors
  # ---------------------------------------------------------------------------
  enable_log_monitors = true
  
  # High error rate from application logs
  log_error_pattern = "status:error @service:api-service"
  log_error_threshold = {
    critical          = 50    # 50 errors in evaluation window
    warning           = 20
    evaluation_window = "last_5m"
  }
  
  # Detect specific error patterns (e.g., database connection failures)
  custom_log_monitors = [
    {
      name     = "Database Connection Failures"
      query    = "logs(\"service:api-service @error_type:db_connection\").index(\"*\").rollup(\"count\").last(\"5m\") > 10"
      message  = <<-EOT
        Database connection failures detected on {{service.name}}.
        
        {{#is_alert}}
        Critical: More than 10 DB connection failures in the last 5 minutes.
        Runbook: https://wiki.internal/runbooks/db-connection-failures
        {{/is_alert}}
        
        {{#is_warning}}
        Warning: Elevated DB connection failures — investigate before escalation.
        {{/is_warning}}
      EOT
      priority = 2
    }
  ]

  # ---------------------------------------------------------------------------
  # Synthetic Tests
  # ---------------------------------------------------------------------------
  enable_synthetics = true
  
  synthetic_api_tests = [
    {
      name    = "API Health Check"
      url     = "https://api.kuyaops.com/health"
      method  = "GET"
      assertions = [
        { type = "statusCode", operator = "is", target = "200" },
        { type = "responseTime", operator = "lessThan", target = "2000" },
        { type = "body", operator = "contains", target = "\"status\":\"ok\"" }
      ]
      locations = ["aws:us-east-1", "aws:us-west-2", "aws:eu-west-1"]
      frequency = 60  # seconds
    },
    {
      name    = "Authentication Endpoint"
      url     = "https://api.kuyaops.com/v1/auth/token"
      method  = "POST"
      request_body = jsonencode({
        grant_type = "client_credentials"
        scope      = "read"
      })
      assertions = [
        { type = "statusCode", operator = "is", target = "200" },
        { type = "body", operator = "validatesJSONPath", target = "$.access_token" }
      ]
      locations = ["aws:us-east-1"]
      frequency = 300
    }
  ]

  # ---------------------------------------------------------------------------
  # Dashboard
  # ---------------------------------------------------------------------------
  enable_dashboard = true
  dashboard_title  = "API Service Overview - Production"
  
  # Dashboard will include:
  # - Request rate, error rate, latency (p50/p95/p99)
  # - Infrastructure metrics (CPU, memory, task count)
  # - Log volume and error breakdown
  # - Synthetic test results
  # - Top errors and slowest endpoints

  # ---------------------------------------------------------------------------
  # Alert Routing
  # ---------------------------------------------------------------------------
  enable_alert_routing = true
  
  # PagerDuty — critical alerts create incidents
  pagerduty_service_key = var.pagerduty_service_key  # Stored in Vault/AWS SM
  
  # Slack — warnings and notifications
  slack_channel = "#sre-alerts"
  
  # Route by severity: critical → PagerDuty and Slack, warning and info → Slack only
  alert_routing_rules = {
    critical = ["pagerduty", "slack"]
    warning  = ["slack"]
    info     = ["slack"]
  }
}
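The example passes `var.pagerduty_service_key` without showing where it comes from. One possible way to supply it is sketched below, assuming the key lives in AWS Secrets Manager; the secret name is hypothetical, and the variable declaration belongs to the calling configuration, not to the module.

```hcl
# Option 1: declare the key as a sensitive input of the calling configuration
# and supply it via TF_VAR_pagerduty_service_key or a secrets-aware pipeline.
variable "pagerduty_service_key" {
  description = "PagerDuty integration key for the service"
  type        = string
  sensitive   = true
}

# Option 2: read the key from AWS Secrets Manager at plan time.
# Assumption: the secret name below is a placeholder, not a real path.
data "aws_secretsmanager_secret_version" "pagerduty" {
  secret_id = "datadog/pagerduty/api-service"
}

locals {
  # Pass this to the module instead of var.pagerduty_service_key
  pagerduty_key_from_sm = data.aws_secretsmanager_secret_version.pagerduty.secret_string
}
```

Either way the value ends up in the Terraform state, so the state backend should be encrypted and access-controlled.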

Resources Created

Category	Resource	Description
APM Monitors	Error Rate Monitor	Alert on HTTP 5xx rate above threshold
APM Monitors	P99 Latency Monitor	Alert on tail latency degradation
APM Monitors	P95 Latency Monitor	Earlier warning for latency trends
APM Monitors	Request Rate Anomaly	Detect traffic drops (outage indicator)
Infrastructure	ECS CPU Monitor	Container CPU utilization
Infrastructure	ECS Memory Monitor	Container memory utilization
Infrastructure	Disk Usage Monitor	EBS/ECS Fargate ephemeral storage
Infrastructure	Task Count Monitor	Ensure minimum healthy task count
Log Monitors	Error Volume Monitor	Count of error-level log entries
Log Monitors	Custom Pattern Monitors	User-defined log query alerts
Synthetic Tests	API Health Checks	Multi-region HTTP endpoint monitoring
Synthetic Tests	SSL Certificate Expiry	Alert 30/14/7 days before expiry
Dashboard	Service Overview	Unified dashboard with all key metrics
Alert Routing	PagerDuty Integration	Critical incident creation
Alert Routing	Slack Notifications	Warning and informational alerts
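The SSL certificate expiry row is the one resource in the table that has no corresponding input in the usage example. A rough sketch of what such a test might look like with the Datadog provider's `datadog_synthetics_test` resource is shown below. It uses a single 30-day threshold and the health-check hostname from the example; how the module itself implements the 30/14/7-day tiers is not shown in this document, so this is an illustration and not the module's code.

```hcl
# Sketch: SSL certificate expiry check (single threshold for brevity)
resource "datadog_synthetics_test" "ssl_expiry" {
  name      = "api-service SSL certificate expiry"
  type      = "api"
  subtype   = "ssl"
  status    = "live"
  message   = "TLS certificate for api.kuyaops.com expires in under 30 days."
  locations = ["aws:us-east-1"]

  request_definition {
    host = "api.kuyaops.com"
    port = 443
  }

  # Fail when the certificate is valid for 30 days or less
  assertion {
    type     = "certificate"
    operator = "isInMoreThan"
    target   = 30
  }

  options_list {
    tick_every = 86400 # once a day, in seconds
  }
}
```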

Tag Propagation from AWS Resources

A key feature of this module is automatic tag discovery from AWS resources. When you tag your ECS services, RDS instances, and ALBs consistently, the module reads those tags and applies them to the Datadog monitors it creates, so monitors can be filtered by the same keys you use in AWS and costs can be attributed per team or service. The excerpt below shows the ECS case.

# modules/monitors/infra_monitors.tf
# Automatically discover AWS resource tags and apply to Datadog monitors

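data "aws_ecs_service" "target" {
  count = var.ecs_cluster_name != "" ? 1 : 0

  # The data source argument is cluster_arn; a cluster short name
  # generally works here too, because the ECS API accepts either.
  cluster_arn  = var.ecs_cluster_name
  service_name = var.ecs_service_name
}

locals {
  # Tags read from the ECS service; empty when no cluster is configured
  discovered_tags = try(data.aws_ecs_service.target[0].tags, {})

  # Merge module-provided tags with discovered AWS tags
  datadog_tags = distinct(concat(
    [for k, v in var.tags : "${lower(k)}:${lower(v)}"],
    [for k, v in local.discovered_tags : "${lower(k)}:${lower(v)}"],
    ["env:${var.environment}"],
    ["service:${var.service_name}"]
  ))

  # Standard monitor naming convention
  monitor_name = "${var.service_name} — ${var.environment}"
}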

# ECS CPU Monitor
resource "datadog_monitor" "ecs_cpu" {
  count = var.enable_infra_monitors && var.ecs_cluster_name != "" ? 1 : 0
  
  name    = "${local.monitor_name} — ECS CPU High"
  type    = "metric alert"
  message = <<-EOT
    ECS CPU utilization is {{value}}% for ${var.service_name}.
    
    {{#is_alert}}
    Cluster: ${var.ecs_cluster_name}
    Service: ${var.ecs_service_name}
    Runbook: ${var.runbook_url}
    Escalation: @pagerduty-${var.service_name}-${var.environment}
    {{/is_alert}}
  EOT

  # Single-line query: avoids whitespace-only diffs between plan and the API
  query = "avg(last_5m):avg:aws.ecs.service.cpuutilization{cluster:${var.ecs_cluster_name},service:${var.ecs_service_name}} > ${var.cpu_threshold.critical}"

  monitor_thresholds {
    critical = var.cpu_threshold.critical
    warning  = var.cpu_threshold.warning
  }

  notify_no_data    = true
  no_data_timeframe = 10
  evaluation_delay  = 60

  tags = local.datadog_tags

  priority = 2  # P2 = high
}
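The monitor above reads `var.cpu_threshold.critical` and `var.cpu_threshold.warning`, but the variable declaration is not shown. A plausible declaration is sketched here, using the defaults from the usage example and a validation rule that rejects a warning level at or above the critical level. The validation rule is an addition for illustration, not something the module is known to include.

```hcl
# Sketch: threshold input with defaults and a sanity check
variable "cpu_threshold" {
  description = "CPU utilization thresholds in percent"
  type = object({
    critical = number
    warning  = number
  })
  default = {
    critical = 85
    warning  = 70
  }

  validation {
    condition     = var.cpu_threshold.warning < var.cpu_threshold.critical
    error_message = "The warning threshold must be lower than the critical threshold."
  }
}
```

Catching an inverted pair at plan time is cheaper than finding out from a rejected monitor at apply time.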

Multi-Environment Configuration Pattern

The module is called from one root configuration per environment. Each configuration pins the same module version and uses the same inputs, changing only thresholds, routing targets, and feature flags.

# environments/production/main.tf
module "datadog_prod" {
  source = "github.com/j1-medilo06/terraform-datadog-module?ref=v2.1.0"

  environment  = "production"
  service_name = "api-service"
  
  # Production: tight thresholds, PagerDuty on critical
  apm_error_rate_threshold  = { critical = 5.0, warning = 2.0 }
  apm_latency_p99_threshold = { critical = 2000, warning = 1000 }
  
  pagerduty_service_key = data.aws_secretsmanager_secret_version.pagerduty_prod.secret_string
  slack_channel         = "#sre-alerts"
}

# environments/staging/main.tf
module "datadog_staging" {
  source = "github.com/j1-medilo06/terraform-datadog-module?ref=v2.1.0"

  environment  = "staging"
  service_name = "api-service"
  
  # Staging: more lenient, Slack only (no PagerDuty)
  apm_error_rate_threshold = { critical = 10.0, warning = 5.0 }
  apm_latency_p99_threshold = { critical = 5000, warning = 3000 }
  
  enable_alert_routing  = true
  pagerduty_service_key = "" # No PagerDuty in staging
  slack_channel         = "#staging-alerts"
}

# environments/development/main.tf
module "datadog_dev" {
  source = "github.com/j1-medilo06/terraform-datadog-module?ref=v2.1.0"

  environment  = "development"
  service_name = "api-service"
  
  # Dev: monitors for visibility, no alert routing
  enable_apm_monitors      = true
  enable_infra_monitors    = true
  enable_synthetics        = false  # No synthetics in dev
  enable_alert_routing     = false  # No alerts in dev
  enable_dashboard         = true
}
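An alternative to one directory per environment is a single configuration that looks up its settings by environment name. The sketch below assumes that layout and reuses the production and staging values from the examples above; development is left out because no thresholds are given for it. The `local.settings` map and its keys are hypothetical names.

```hcl
variable "environment" {
  type = string

  validation {
    condition     = contains(["production", "staging"], var.environment)
    error_message = "Environment must be production or staging."
  }
}

locals {
  # Per-environment settings, copied from the examples above
  settings = {
    production = {
      error_rate    = { critical = 5.0, warning = 2.0 }
      latency_p99   = { critical = 2000, warning = 1000 }
      slack_channel = "#sre-alerts"
    }
    staging = {
      error_rate    = { critical = 10.0, warning = 5.0 }
      latency_p99   = { critical = 5000, warning = 3000 }
      slack_channel = "#staging-alerts"
    }
  }

  env = local.settings[var.environment]
}

module "datadog" {
  source = "github.com/j1-medilo06/terraform-datadog-module?ref=v2.1.0"

  environment  = var.environment
  service_name = "api-service"

  apm_error_rate_threshold  = local.env.error_rate
  apm_latency_p99_threshold = local.env.latency_p99
  slack_channel             = local.env.slack_channel
}
```

This keeps every difference between environments visible in one map, at the cost of sharing one state layout across environments unless workspaces or separate backends are used.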

Alert Routing with PagerDuty and Slack

The alert routing sub-module creates notification channels and applies them to monitors based on severity rules.

# modules/alert-routing/main.tf

# ---------------------------------------------------------------------------
# PagerDuty Integration
# ---------------------------------------------------------------------------
resource "datadog_integration_pagerduty_service_object" "critical" {
  count = var.pagerduty_service_key != "" ? 1 : 0
  
  service_name = "${var.service_name}-${var.environment}"
  service_key  = var.pagerduty_service_key
}

# ---------------------------------------------------------------------------
# Slack Integration
# ---------------------------------------------------------------------------
resource "datadog_integration_slack_channel" "alerts" {
  count = var.slack_channel != "" ? 1 : 0
  
  channel_name = var.slack_channel
  account_name = var.slack_account_name  # Configured in Datadog org settings
  
  display {
    message  = true
    notified = true
    snapshot = true
    tags     = true
  }
}

# ---------------------------------------------------------------------------
# Helper outputs for monitor module consumption
# ---------------------------------------------------------------------------
locals {
  # Datadog mentions use the channel name without the leading "#"
  slack_mention = var.slack_channel != "" ? "@slack-${trimprefix(var.slack_channel, "#")}" : ""

  # Must match the service_name of the PagerDuty service object above
  pagerduty_mention = var.pagerduty_service_key != "" ? "@pagerduty-${var.service_name}-${var.environment}" : ""

  critical_mentions = compact([local.pagerduty_mention, local.slack_mention])
  warning_mentions  = compact([local.slack_mention])

  # Build mention strings; join() returns "" for an empty list
  mention_critical = join(" ", local.critical_mentions)
  mention_warning  = join(" ", local.warning_mentions)
}

output "critical_mention" {
  description = "Mention string for critical alerts"
  value       = local.mention_critical
}

output "warning_mention" {
  description = "Mention string for warning alerts"
  value       = local.mention_warning
}
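How these outputs reach the monitors is not shown in the excerpt. One way the root `main.tf` could wire the two sub-modules together is sketched below. The `critical_mention` and `warning_mention` inputs on the monitors sub-module are hypothetical, and the `var.*` references stand for the root module's own inputs.

```hcl
# Sketch of root-module composition (root main.tf)
module "alert_routing" {
  source = "./modules/alert-routing"
  count  = var.enable_alert_routing ? 1 : 0

  service_name          = var.service_name
  environment           = var.environment
  pagerduty_service_key = var.pagerduty_service_key
  slack_channel         = var.slack_channel
  slack_account_name    = var.slack_account_name
}

module "monitors" {
  source = "./modules/monitors"

  service_name = var.service_name
  environment  = var.environment
  tags         = var.tags

  # Hypothetical inputs: fall back to an empty string when alert routing
  # is disabled and module.alert_routing has no instances.
  critical_mention = try(module.alert_routing[0].critical_mention, "")
  warning_mention  = try(module.alert_routing[0].warning_mention, "")
}
```

Inside the monitors sub-module, the mention strings would then be interpolated into each monitor's `message`, inside the `{{#is_alert}}` and `{{#is_warning}}` sections respectively.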

Dashboard Example

The module generates a comprehensive service overview dashboard. Here is a sample of the widget configuration:

# modules/dashboard/main.tf (excerpt)

resource "datadog_dashboard" "service_overview" {
  count = var.enable_dashboard ? 1 : 0

  title       = var.dashboard_title
  description = "Auto-generated dashboard for ${var.service_name} (${var.environment})"
  layout_type = "ordered"

  # Group: Request Overview
  widget {
    group_definition {
      title       = "Request Overview"
      layout_type = "ordered"
      
      # Request rate timeseries
      widget {
        timeseries_definition {
          title = "Request Rate (req/s)"
          request {
            q = "sum:aws.applicationelb.request_count{service:${var.service_name}}.as_rate()"
            style { palette = "dog_classic" }
          }
        }
      }
      
      # Error rate
      widget {
        timeseries_definition {
          title = "Error Rate (%)"
          request {
            q = "( sum:aws.applicationelb.httpcode_target_5xx{service:${var.service_name}}.as_rate() / sum:aws.applicationelb.request_count{service:${var.service_name}}.as_rate() ) * 100"
            style { palette = "warm" }
          }
        }
      }
      
      # Latency distribution
      widget {
        timeseries_definition {
          # Only the average is queried here; percentile series need their own requests
          title = "Latency (average target response time)"
          request {
            q = "avg:aws.applicationelb.target_response_time{service:${var.service_name}}.rollup(avg)"
            display_type = "line"
          }
        }
      }
    }
  }

  # Group: Infrastructure
  widget {
    group_definition {
      title       = "Infrastructure"
      layout_type = "ordered"
      
      widget {
        query_value_definition {
          title     = "ECS Task Count"
          autoscale = true
          request {
            q = "avg:aws.ecs.service.running{cluster:${var.ecs_cluster_name},service:${var.ecs_service_name}}"
          }
        }
      }
      
      widget {
        heatmap_definition {
          title = "CPU vs Memory Heatmap"
          request {
            q = "avg:aws.ecs.service.cpuutilization{cluster:${var.ecs_cluster_name}}"
          }
          request {
            q = "avg:aws.ecs.service.memory_utilization{cluster:${var.ecs_cluster_name}}"
          }
        }
      }
    }
  }

  # Dashboard tags only accept team:<name> values, unlike monitor tags
  tags = ["team:${var.team}"]
}
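The `widgets/` directory is described as holding reusable widget definitions, but its contents are not shown. One common way to avoid repeating near-identical `widget` blocks is a `dynamic` block driven by a list, sketched below. The `timeseries_widgets` list and the resource name are hypothetical; the two queries are taken from the excerpt above.

```hcl
locals {
  service = "api-service" # example value

  # Hypothetical widget catalogue: each entry becomes one timeseries widget
  timeseries_widgets = [
    {
      title = "Request Rate (req/s)"
      query = "sum:aws.applicationelb.request_count{service:${local.service}}.as_rate()"
    },
    {
      title = "Target 5xx Rate (req/s)"
      query = "sum:aws.applicationelb.httpcode_target_5xx{service:${local.service}}.as_rate()"
    },
  ]
}

resource "datadog_dashboard" "generated" {
  title       = "Generated widgets for ${local.service}"
  layout_type = "ordered"

  # One widget block is rendered per list entry
  dynamic "widget" {
    for_each = local.timeseries_widgets
    content {
      timeseries_definition {
        title = widget.value.title
        request {
          q            = widget.value.query
          display_type = "line"
        }
      }
    }
  }
}
```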

Integration with Existing AWS Infrastructure

This module is designed to be added to existing Terraform projects without conflict. Simply reference it alongside your existing AWS resources:

# Existing infrastructure — no changes needed
module "vpc" {
  source = "terraform-aws-modules/vpc/aws"
  # ... existing config
}

module "ecs" {
  source = "terraform-aws-modules/ecs/aws"
  # ... existing config
}

# Add Datadog monitoring — references existing resources by name
module "monitoring" {
  source = "github.com/j1-medilo06/terraform-datadog-module?ref=v2.1.0"
  
  environment       = "production"
  service_name      = "payment-api"
  ecs_cluster_name  = module.ecs.cluster_name
  ecs_service_name  = "payment-service"
  
  enable_apm_monitors   = true
  enable_infra_monitors = true
  enable_synthetics     = true
  enable_dashboard      = true
  enable_alert_routing  = true
  
  pagerduty_service_key = var.pagerduty_payment_key
  slack_channel         = "#payments-alerts"
}
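An existing AWS-only project will also need the Datadog provider declared before the module can be planned. A minimal sketch follows; the version constraint and the variable names `datadog_api_key` and `datadog_app_key` are assumptions, and the module may state its own provider requirements.

```hcl
terraform {
  required_providers {
    datadog = {
      source  = "DataDog/datadog"
      version = "~> 3.0" # assumed constraint; align with the module's requirements
    }
  }
}

variable "datadog_api_key" {
  type      = string
  sensitive = true
}

variable "datadog_app_key" {
  type      = string
  sensitive = true
}

provider "datadog" {
  api_key = var.datadog_api_key
  app_key = var.datadog_app_key
}
```

The provider can also read its credentials from the `DD_API_KEY` and `DD_APP_KEY` environment variables, in which case the two variables and the provider arguments can be dropped.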

SRE Best Practices Implemented

Google SRE Principles Applied:

Monitoring the Four Golden Signals: Latency, Traffic, Errors, and Saturation monitors for every service
Alert on symptoms, not causes: Error rate and latency alerts fire on user-impacting metrics, not just resource thresholds
Tunable thresholds per environment: Production uses strict thresholds; dev/staging use relaxed values
Actionable alerts: Every alert includes runbook links and escalation paths
Reduced alert fatigue: Warning thresholds surface trends early in Slack; only critical thresholds page someone
Tag-driven automation: Consistent tagging enables automated monitor scoping and cost allocation
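To make "alert on symptoms" concrete, here is a sketch of what a symptom-based error rate monitor might look like. It is not taken from `apm_monitors.tf`, which is not shown in this document. The variable names mirror the module inputs used earlier, and the `trace.http.request.*` metric names are an assumption that depends on the span name your tracer emits.

```hcl
variable "service_name" {
  type = string
}

variable "environment" {
  type = string
}

variable "runbook_url" {
  type = string
}

variable "apm_error_rate_threshold" {
  type    = object({ critical = number, warning = number })
  default = { critical = 5.0, warning = 2.0 }
}

locals {
  # Assumes the APM env tag matches the module's environment value
  scope = "env:${var.environment},service:${var.service_name}"
}

resource "datadog_monitor" "apm_error_rate" {
  name = "${var.service_name} (${var.environment}) error rate high"
  type = "query alert"

  # Errors divided by hits, as a percentage over the last 5 minutes
  query = "sum(last_5m):( sum:trace.http.request.errors{${local.scope}}.as_count() / sum:trace.http.request.hits{${local.scope}}.as_count() ) * 100 > ${var.apm_error_rate_threshold.critical}"

  message = <<-EOT
    Error rate is {{value}}% for ${var.service_name}.
    Runbook: ${var.runbook_url}
  EOT

  monitor_thresholds {
    critical = var.apm_error_rate_threshold.critical
    warning  = var.apm_error_rate_threshold.warning
  }

  tags = ["env:${var.environment}", "service:${var.service_name}"]
}
```

The monitor fires on what users experience (failed requests) regardless of which resource caused it, while the infrastructure monitors cover saturation separately.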

GitHub Repository

Resource	Link
Terraform Module Repository	github.com/j1-medilo06/terraform-datadog-module
Datadog Terraform Provider	registry.terraform.io/providers/DataDog/datadog
Author GitHub	github.com/j1-medilo06