GitHub Portfolio

Project: Terraform Datadog Module

A reusable Terraform module for comprehensive Datadog integration, including monitors, dashboards, and synthetic tests, designed for multi-environment AWS deployments.

Project Overview

In production environments, observability is not optional. This Terraform module codifies years of production SRE experience into a single, reusable component that provisions Datadog monitors, dashboards, and synthetic tests for an AWS-hosted service. Instead of manually creating monitors that drift out of date, teams declare their monitoring requirements alongside their infrastructure, so every deployment is monitored from day one.

Design Goals:
  • Convention over configuration: Sensible defaults for standard AWS services
  • Environment parity: Same monitoring in dev, staging, and production, with different alert thresholds
  • Tag propagation: AWS resource tags automatically flow into Datadog for filtering and cost attribution
  • Alert routing: Built-in PagerDuty and Slack integration with environment-aware routing
  • Composable: Use the full module or pick individual sub-modules (monitors-only, dashboard-only)
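
The composability goal can be illustrated with a sketch that pulls in only the monitors sub-module. The `//modules/monitors` suffix is standard Terraform syntax for selecting a subdirectory of a Git source. The input names are an assumption: they are taken to mirror the root module's variables, which this page does not confirm for the sub-module.

```hcl
# Hypothetical: consume only the monitors sub-module instead of the root module.
# Input names are assumed to match the root module's variables.
module "monitors_only" {
  source = "github.com/j1-medilo06/terraform-datadog-module//modules/monitors?ref=v2.1.0"

  environment  = "staging"
  service_name = "api-service"

  enable_apm_monitors   = true
  enable_infra_monitors = false
}
```

Pinning `ref` to a tag keeps the sub-module on the same release as any other part of the module used elsewhere.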

Module Structure

terraform-datadog-module/
├── modules/
│   ├── monitors/           # APM, infrastructure, and custom log monitors
│   │   ├── apm_monitors.tf
│   │   ├── infra_monitors.tf
│   │   ├── log_monitors.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── dashboard/          # Service overview dashboard
│   │   ├── main.tf
│   │   ├── widgets/        # Reusable widget definitions
│   │   ├── variables.tf
│   │   └── outputs.tf
│   ├── synthetics/         # API and browser tests
│   │   ├── api_tests.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   └── alert-routing/      # PagerDuty and Slack notification channels
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
├── examples/
│   ├── complete/           # Full module usage example
│   ├── monitors-only/      # Monitors sub-module only
│   └── multi-environment/  # Dev/staging/prod pattern
├── main.tf                 # Root module composition
├── variables.tf
├── outputs.tf
└── README.md
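
The root `main.tf` is described as a composition of the sub-modules. A minimal sketch of what that wiring might look like follows. The exact inputs each sub-module accepts are assumptions based on the variable names used in the usage example, and the input lists are abbreviated.

```hcl
# main.tf (root module): hypothetical composition sketch, inputs abbreviated.

module "alert_routing" {
  source = "./modules/alert-routing"
  # Skip the whole sub-module when routing is disabled (requires Terraform 0.13+).
  count = var.enable_alert_routing ? 1 : 0

  service_name          = var.service_name
  environment           = var.environment
  pagerduty_service_key = var.pagerduty_service_key
  slack_channel         = var.slack_channel
}

module "monitors" {
  source = "./modules/monitors"

  service_name = var.service_name
  environment  = var.environment
  tags         = var.tags

  # Each monitor family is toggled inside the sub-module via these flags.
  enable_apm_monitors   = var.enable_apm_monitors
  enable_infra_monitors = var.enable_infra_monitors
  enable_log_monitors   = var.enable_log_monitors
}

module "dashboard" {
  source = "./modules/dashboard"

  service_name     = var.service_name
  environment      = var.environment
  enable_dashboard = var.enable_dashboard
  dashboard_title  = var.dashboard_title
}
```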

Module Usage

Complete Example

module "datadog" {
  source = "github.com/j1-medilo06/terraform-datadog-module?ref=v2.1.0"

  # ---------------------------------------------------------------------------
  # Core Configuration
  # ---------------------------------------------------------------------------
  environment  = "production"
  service_name = "api-service"
  team         = "platform"
  
  # AWS tag propagation: monitors will include these tags automatically
  tags = {
    Environment = "production"
    Service     = "api-service"
    Team        = "platform"
    CostCenter  = "engineering"
    ManagedBy   = "terraform"
  }

  # ---------------------------------------------------------------------------
  # APM Monitors
  # ---------------------------------------------------------------------------
  enable_apm_monitors = true
  
  apm_error_rate_threshold = {
    critical = 5.0    # Alert if error rate exceeds 5%
    warning  = 2.0    # Warn at 2%
  }
  
  apm_latency_p99_threshold = {
    critical = 2000   # 2 seconds (critical)
    warning  = 1000   # 1 second (warning)
  }
  
  apm_latency_p95_threshold = {
    critical = 1000   # 1 second
    warning  = 500    # 500ms
  }
  
  apm_request_rate_anomaly = true  # Enable anomaly detection for traffic drops

  # ---------------------------------------------------------------------------
  # Infrastructure Monitors (auto-discovered from AWS tags)
  # ---------------------------------------------------------------------------
  enable_infra_monitors = true
  
  # Target the ECS service by cluster and service name
  ecs_cluster_name = "api-service-prod"
  ecs_service_name = "api"
  
  cpu_threshold = {
    critical = 85
    warning  = 70
  }
  
  memory_threshold = {
    critical = 90
    warning  = 75
  }
  
  disk_threshold = {
    critical = 90
    warning  = 80
  }

  # ---------------------------------------------------------------------------
  # Log-Based Monitors
  # ---------------------------------------------------------------------------
  enable_log_monitors = true
  
  # High error rate from application logs
  log_error_pattern = "status:error @service:api-service"
  log_error_threshold = {
    critical          = 50    # 50 errors in evaluation window
    warning           = 20
    evaluation_window = "last_5m"
  }
  
  # Detect specific error patterns (e.g., database connection failures)
  custom_log_monitors = [
    {
      name     = "Database Connection Failures"
      query    = "logs('service:api-service @error_type:db_connection').index('*').rollup('count').last('5m') > 10"
      message  = <<-EOT
        Database connection failures detected on {{service.name}}.
        
        {{#is_alert}}
        Critical: More than 10 DB connection failures in the last 5 minutes.
        Runbook: https://wiki.internal/runbooks/db-connection-failures
        {{/is_alert}}
        
        {{#is_warning}}
        Warning: Elevated DB connection failures. Investigate before escalation.
        {{/is_warning}}
      EOT
      priority = 2
    }
  ]

  # ---------------------------------------------------------------------------
  # Synthetic Tests
  # ---------------------------------------------------------------------------
  enable_synthetics = true
  
  synthetic_api_tests = [
    {
      name    = "API Health Check"
      url     = "https://api.kuyaops.com/health"
      method  = "GET"
      assertions = [
        { type = "statusCode", operator = "is", target = "200" },
        { type = "responseTime", operator = "lessThan", target = "2000" },
        { type = "body", operator = "contains", target = "\"status\":\"ok\"" }
      ]
      locations = ["aws:us-east-1", "aws:us-west-2", "aws:eu-west-1"]
      frequency = 60  # seconds
    },
    {
      name    = "Authentication Endpoint"
      url     = "https://api.kuyaops.com/v1/auth/token"
      method  = "POST"
      request_body = jsonencode({
        grant_type = "client_credentials"
        scope      = "read"
      })
      assertions = [
        { type = "statusCode", operator = "is", target = "200" },
        { type = "body", operator = "validatesJSONPath", target = "$.access_token" }
      ]
      locations = ["aws:us-east-1"]
      frequency = 300
    }
  ]

  # ---------------------------------------------------------------------------
  # Dashboard
  # ---------------------------------------------------------------------------
  enable_dashboard = true
  dashboard_title  = "API Service Overview - Production"
  
  # Dashboard will include:
  # - Request rate, error rate, latency (p50/p95/p99)
  # - Infrastructure metrics (CPU, memory, task count)
  # - Log volume and error breakdown
  # - Synthetic test results
  # - Top errors and slowest endpoints

  # ---------------------------------------------------------------------------
  # Alert Routing
  # ---------------------------------------------------------------------------
  enable_alert_routing = true
  
  # PagerDuty: critical alerts create incidents
  pagerduty_service_key = var.pagerduty_service_key  # Stored in Vault/AWS SM
  
  # Slack: warnings and notifications
  slack_channel = "#sre-alerts"
  
  # Route by severity: critical → PagerDuty and Slack, warning → Slack only
  alert_routing_rules = {
    critical = ["pagerduty", "slack"]
    warning  = ["slack"]
    info     = ["slack"]
  }
}
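
Each threshold input in the example is an object with `critical` and `warning` keys. A sketch of how such a variable could be declared, with a validation rule that rejects inverted thresholds, is shown below. The declaration is an assumption about the module's `variables.tf`; only the variable name and default values come from the example above.

```hcl
# variables.tf: hypothetical declaration for one threshold input.
variable "cpu_threshold" {
  description = "Warning and critical CPU utilization thresholds, in percent."

  type = object({
    critical = number
    warning  = number
  })

  default = {
    critical = 85
    warning  = 70
  }

  # Fail at plan time if the warning level would never fire before critical.
  validation {
    condition     = var.cpu_threshold.warning < var.cpu_threshold.critical
    error_message = "The warning threshold must be lower than the critical threshold."
  }
}
```

Catching an inverted pair during `terraform plan` is cheaper than discovering it when a monitor pages without a preceding warning.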

Resources Created

| Category | Resource | Description |
| --- | --- | --- |
| APM Monitors | Error Rate Monitor | Alert on HTTP 5xx rate above threshold |
| APM Monitors | P99 Latency Monitor | Alert on tail latency degradation |
| APM Monitors | P95 Latency Monitor | Earlier warning for latency trends |
| APM Monitors | Request Rate Anomaly | Detect traffic drops (outage indicator) |
| Infrastructure | ECS CPU Monitor | Container CPU utilization |
| Infrastructure | ECS Memory Monitor | Container memory utilization |
| Infrastructure | Disk Usage Monitor | EBS/ECS Fargate ephemeral storage |
| Infrastructure | Task Count Monitor | Ensure minimum healthy task count |
| Log Monitors | Error Volume Monitor | Count of error-level log entries |
| Log Monitors | Custom Pattern Monitors | User-defined log query alerts |
| Synthetic Tests | API Health Checks | Multi-region HTTP endpoint monitoring |
| Synthetic Tests | SSL Certificate Expiry | Alert 30/14/7 days before expiry |
| Dashboard | Service Overview | Unified dashboard with all key metrics |
| Alert Routing | PagerDuty Integration | Critical incident creation |
| Alert Routing | Slack Notifications | Warning and informational alerts |
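
The SSL Certificate Expiry row could be implemented with Datadog SSL synthetic tests. The sketch below is one possible approach, not the module's confirmed implementation: it creates one test per expiry horizon using `for_each`. The resource name, test names, message text, and daily frequency are assumptions; the host is taken from the health-check URL in the usage example, and the `@slack-sre-alerts` handle assumes the Slack channel configured there.

```hcl
# Hypothetical sketch: one SSL test per expiry horizon (30, 14, and 7 days).
resource "datadog_synthetics_test" "ssl_expiry" {
  for_each = toset(["30", "14", "7"])

  name      = "api-service SSL certificate expires within ${each.value} days"
  type      = "api"
  subtype   = "ssl"
  status    = "live"
  message   = "Certificate for api.kuyaops.com expires in fewer than ${each.value} days. @slack-sre-alerts"
  locations = ["aws:us-east-1"]

  request_definition {
    host = "api.kuyaops.com"
    port = 443
  }

  # The test fails once the certificate has fewer than each.value days left.
  assertion {
    type     = "certificate"
    operator = "isInMoreThan"
    target   = each.value
  }

  options_list {
    tick_every = 86400 # run once a day, in seconds
  }
}
```

Three separate tests make it possible to attach a different message or priority to each horizon, at the cost of three test runs per day.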

Tag Propagation from AWS Resources

A key feature of this module is automatic tag discovery from AWS resources. When you tag your ECS services, RDS instances, and ALBs consistently, the module reads those tags and applies them to Datadog monitors, which enables filtering and cost attribution by tag.

# modules/monitors/infra_monitors.tf
# Automatically discover AWS resource tags and apply to Datadog monitors

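data "aws_ecs_service" "target" {
  count        = var.ecs_cluster_name != "" ? 1 : 0
  cluster_arn  = var.ecs_cluster_name # the ECS API accepts a cluster name or ARN
  service_name = var.ecs_service_name
}

locals {
  # Tags discovered on the ECS service (empty when no cluster is configured)
  discovered_tags = try(data.aws_ecs_service.target[0].tags, {})

  # Merge discovered AWS tags with module-provided tags (module tags win on
  # conflict), then render them as Datadog "key:value" strings
  datadog_tags = concat(
    [for k, v in merge(local.discovered_tags, var.tags) : "${lower(k)}:${lower(v)}"],
    ["env:${var.environment}"],
    ["service:${var.service_name}"]
  )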
  
  # Standard monitor naming convention
  monitor_name = "${var.service_name} - ${var.environment}"
}

# ECS CPU Monitor
resource "datadog_monitor" "ecs_cpu" {
  count = var.enable_infra_monitors && var.ecs_cluster_name != "" ? 1 : 0
  
  name    = "${local.monitor_name} - ECS CPU High"
  type    = "metric alert"
  message = <<-EOT
    ECS CPU utilization is {{value}}% for {{service.name}}.
    
    {{#is_alert}}
    Cluster: ${var.ecs_cluster_name}
    Service: ${var.ecs_service_name}
    Runbook: ${var.runbook_url}
    Escalation: @pagerduty-${var.service_name}-${var.environment}
    {{/is_alert}}
  EOT

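  # Kept on one line: the query must match what the Datadog API returns,
  # otherwise Terraform reports a diff on every plan
  query = "avg(last_5m):avg:aws.ecs.service.cpuutilization{cluster:${var.ecs_cluster_name},service:${var.ecs_service_name}} > ${var.cpu_threshold.critical}"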

  monitor_thresholds {
    critical = var.cpu_threshold.critical
    warning  = var.cpu_threshold.warning
  }

  notify_no_data    = true
  no_data_timeframe = 10
  evaluation_delay  = 60

  tags = local.datadog_tags

  priority = 2  # P2 = high
}

Multi-Environment Configuration Pattern

Each environment has its own root configuration that calls the same module version. This keeps the set of monitors consistent while thresholds and alert routing are adjusted to each environment's requirements.

# environments/production/main.tf
module "datadog_prod" {
  source = "github.com/j1-medilo06/terraform-datadog-module?ref=v2.1.0"

  environment  = "production"
  service_name = "api-service"
  
  # Production: tight thresholds, PagerDuty on critical
  apm_error_rate_threshold = { critical = 5.0, warning = 2.0 }
  apm_latency_p99_threshold = { critical = 2000, warning = 1000 }
  
  pagerduty_service_key = data.aws_secretsmanager_secret_version.pagerduty_prod.secret_string
  slack_channel         = "#sre-alerts"
}

# environments/staging/main.tf
module "datadog_staging" {
  source = "github.com/j1-medilo06/terraform-datadog-module?ref=v2.1.0"

  environment  = "staging"
  service_name = "api-service"
  
  # Staging: more lenient, Slack only (no PagerDuty)
  apm_error_rate_threshold = { critical = 10.0, warning = 5.0 }
  apm_latency_p99_threshold = { critical = 5000, warning = 3000 }
  
  enable_alert_routing   = true
  pagerduty_service_key  = ""  # No PagerDuty in staging
  slack_channel          = "#staging-alerts"
}

# environments/development/main.tf
module "datadog_dev" {
  source = "github.com/j1-medilo06/terraform-datadog-module?ref=v2.1.0"

  environment  = "development"
  service_name = "api-service"
  
  # Dev: monitors for visibility, no alert routing
  enable_apm_monitors      = true
  enable_infra_monitors    = true
  enable_synthetics        = false  # No synthetics in dev
  enable_alert_routing     = false  # No alerts in dev
  enable_dashboard         = true
}
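
An alternative to three separate root configurations is a single configuration that looks up settings by environment name. The sketch below shows that pattern under stated assumptions: the `env_settings` map and `settings` local are hypothetical names, and the values are copied from the production and staging examples above.

```hcl
variable "environment" {
  description = "Target environment: production or staging."
  type        = string
}

locals {
  # Hypothetical per-environment settings, copied from the examples above.
  env_settings = {
    production = {
      apm_error_rate_threshold  = { critical = 5.0, warning = 2.0 }
      apm_latency_p99_threshold = { critical = 2000, warning = 1000 }
      slack_channel             = "#sre-alerts"
    }
    staging = {
      apm_error_rate_threshold  = { critical = 10.0, warning = 5.0 }
      apm_latency_p99_threshold = { critical = 5000, warning = 3000 }
      slack_channel             = "#staging-alerts"
    }
  }

  # Fails at plan time if var.environment is not a key of the map.
  settings = local.env_settings[var.environment]
}

module "datadog" {
  source = "github.com/j1-medilo06/terraform-datadog-module?ref=v2.1.0"

  environment  = var.environment
  service_name = "api-service"

  apm_error_rate_threshold  = local.settings.apm_error_rate_threshold
  apm_latency_p99_threshold = local.settings.apm_latency_p99_threshold
  slack_channel             = local.settings.slack_channel
}
```

The trade-off is that all environments then share one configuration directory, so state separation has to come from workspaces or separate backends rather than from the directory layout.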

Alert Routing with PagerDuty and Slack

The alert routing sub-module creates the notification channels and exposes mention strings that monitors embed in their messages according to severity.

# modules/alert-routing/main.tf

# ---------------------------------------------------------------------------
# PagerDuty Integration
# ---------------------------------------------------------------------------
resource "datadog_integration_pagerduty_service_object" "critical" {
  count = var.pagerduty_service_key != "" ? 1 : 0
  
  service_name = "${var.service_name}-${var.environment}"
  service_key  = var.pagerduty_service_key
}

# ---------------------------------------------------------------------------
# Slack Integration
# ---------------------------------------------------------------------------
resource "datadog_integration_slack_channel" "alerts" {
  count = var.slack_channel != "" ? 1 : 0
  
  channel_name = var.slack_channel
  account_name = var.slack_account_name  # Configured in Datadog org settings
  
  display {
    message  = true
    notified = true
    snapshot = true
    tags     = true
  }
}

# ---------------------------------------------------------------------------
# Helper outputs for monitor module consumption
# ---------------------------------------------------------------------------
locals {
  # Datadog Slack handles omit the leading "#", so strip it from the channel name
  slack_mention = var.slack_channel != "" ? "@slack-${trimprefix(var.slack_channel, "#")}" : ""

  # The PagerDuty handle must match the service_name registered above
  pagerduty_mention = var.pagerduty_service_key != "" ? "@pagerduty-${var.service_name}-${var.environment}" : ""

  critical_mentions = compact([local.pagerduty_mention, local.slack_mention])
  warning_mentions  = compact([local.slack_mention])

  # Build mention strings (join returns "" for an empty list)
  mention_critical = join(" ", local.critical_mentions)
  mention_warning  = join(" ", local.warning_mentions)
}

output "critical_mention" {
  description = "Mention string for critical alerts"
  value       = local.mention_critical
}

output "warning_mention" {
  description = "Mention string for warning alerts"
  value       = local.mention_warning
}
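
On the consuming side, a monitor can place each mention string inside the matching Datadog conditional block so that critical and warning states notify different channels. The sketch below assumes the monitors sub-module receives the two outputs as input variables named `critical_mention` and `warning_mention`; those variable names and the `alert_message` local are hypothetical.

```hcl
# modules/monitors (hypothetical): inputs fed from the alert-routing outputs.
variable "service_name" {
  type = string
}

variable "critical_mention" {
  description = "Mention string appended to critical alerts."
  type        = string
  default     = ""
}

variable "warning_mention" {
  description = "Mention string appended to warning alerts."
  type        = string
  default     = ""
}

locals {
  # {{#is_alert}} and {{#is_warning}} are Datadog template conditionals;
  # Terraform only interpolates the ${...} expressions.
  alert_message = <<-EOT
    Error rate is elevated for ${var.service_name}.

    {{#is_alert}}
    ${var.critical_mention}
    {{/is_alert}}

    {{#is_warning}}
    ${var.warning_mention}
    {{/is_warning}}
  EOT
}
```

With empty defaults, the monitors still render a valid message when alert routing is disabled, as in the development example.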

Dashboard Example

The module generates a comprehensive service overview dashboard. Here is a sample of the widget configuration:

# modules/dashboard/main.tf (excerpt)

resource "datadog_dashboard" "service_overview" {
  count = var.enable_dashboard ? 1 : 0

  title       = var.dashboard_title
  description = "Auto-generated dashboard for ${var.service_name} (${var.environment})"
  layout_type = "ordered"

  # Group: Request Overview
  widget {
    group_definition {
      title       = "Request Overview"
      layout_type = "ordered"
      
      # Request rate timeseries
      widget {
        timeseries_definition {
          title = "Request Rate (req/s)"
          request {
            q = "sum:aws.applicationelb.request_count{service:${var.service_name}}.as_rate()"
            style { palette = "dog_classic" }
          }
        }
      }
      
      # Error rate
      widget {
        timeseries_definition {
          title = "Error Rate (%)"
          request {
            q = "( sum:aws.applicationelb.httpcode_target_5xx{service:${var.service_name}}.as_rate() / sum:aws.applicationelb.request_count{service:${var.service_name}}.as_rate() ) * 100"
            style { palette = "warm" }
          }
        }
      }
      
      # Latency distribution
      widget {
        timeseries_definition {
          title = "Target Response Time (avg)"
          request {
            q = "avg:aws.applicationelb.target_response_time{service:${var.service_name}}.rollup(avg)"
            display_type = "line"
          }
        }
      }
    }
  }

  # Group: Infrastructure
  widget {
    group_definition {
      title       = "Infrastructure"
      layout_type = "ordered"
      
      widget {
        query_value_definition {
          title     = "ECS Task Count"
          autoscale = true
          request {
            q = "avg:aws.ecs.service.running{cluster:${var.ecs_cluster_name},service:${var.ecs_service_name}}"
          }
        }
      }
      
      widget {
        heatmap_definition {
          title = "CPU vs Memory Heatmap"
          request {
            q = "avg:aws.ecs.service.cpuutilization{cluster:${var.ecs_cluster_name}}"
          }
          request {
            q = "avg:aws.ecs.service.memory_utilization{cluster:${var.ecs_cluster_name}}"
          }
        }
      }
    }
  }

  tags = local.datadog_tags
}
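
Because the dashboard resource uses `count`, an output that exposes it has to handle the disabled case. A small sketch of such an output follows; the output name is an assumption, while `url` is an attribute exported by the `datadog_dashboard` resource.

```hcl
# modules/dashboard/outputs.tf (hypothetical output name)
output "dashboard_url" {
  description = "URL of the service overview dashboard, or null when the dashboard is disabled."

  # one() returns the single element of the list, or null if the list is empty.
  value = one(datadog_dashboard.service_overview[*].url)
}
```

`one()` requires Terraform 0.15 or later; on older versions, `try(datadog_dashboard.service_overview[0].url, null)` is a common equivalent.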

Integration with Existing AWS Infrastructure

This module is designed to be added to existing Terraform projects without conflict. Simply reference it alongside your existing AWS resources:

# Existing infrastructure: no changes needed
module "vpc" {
  source = "terraform-aws-modules/vpc/aws"
  # ... existing config
}

module "ecs" {
  source = "terraform-aws-modules/ecs/aws"
  # ... existing config
}

# Add Datadog monitoring: references existing resources by name
module "monitoring" {
  source = "github.com/j1-medilo06/terraform-datadog-module?ref=v2.1.0"
  
  environment       = "production"
  service_name      = "payment-api"
  ecs_cluster_name  = module.ecs.cluster_name
  ecs_service_name  = "payment-service"
  
  enable_apm_monitors  = true
  enable_infra_monitors = true
  enable_synthetics    = true
  enable_dashboard     = true
  enable_alert_routing = true
  
  pagerduty_service_key = var.pagerduty_payment_key
  slack_channel         = "#payments-alerts"
}
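
An existing project that has only used the AWS provider also needs the Datadog provider declared and configured before the module can create resources. A minimal sketch follows; the version constraint and variable names are assumptions, not requirements stated by the module.

```hcl
terraform {
  required_providers {
    datadog = {
      source  = "DataDog/datadog"
      version = "~> 3.0" # assumed constraint; check the module's README
    }
  }
}

variable "datadog_api_key" {
  type      = string
  sensitive = true
}

variable "datadog_app_key" {
  type      = string
  sensitive = true
}

provider "datadog" {
  api_key = var.datadog_api_key
  app_key = var.datadog_app_key
}
```

The provider can also read `DD_API_KEY` and `DD_APP_KEY` from the environment, which avoids passing the keys through Terraform variables at all.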

SRE Best Practices Implemented

Google SRE Principles Applied:
  • Monitoring the Four Golden Signals: Latency, Traffic, Errors, and Saturation monitors for every service
  • Alert on symptoms, not causes: Error rate and latency alerts fire on user-impacting metrics, not just resource thresholds
  • Tunable thresholds per environment: Production uses strict thresholds; dev/staging use relaxed values
  • Actionable alerts: Every alert includes runbook links and escalation paths
  • Reduced alert fatigue: Warning thresholds catch trends early; critical thresholds are reserved for conditions that require immediate action
  • Tag-driven automation: Consistent tagging enables automated monitor scoping and cost allocation

GitHub Repository

| Resource | Link |
| --- | --- |
| Terraform Module Repository | github.com/j1-medilo06/terraform-datadog-module |
| Datadog Terraform Provider | registry.terraform.io/providers/DataDog/datadog |
| Author GitHub | github.com/j1-medilo06 |