Monitoring & Observability

Observability provides insight into distributed systems through metrics, logs, and traces. This guide covers Datadog, Prometheus, Grafana, OpenTelemetry, and modern observability practices with production-ready configuration examples, Ansible playbooks, and alert routing.

Monitoring vs. Observability vs. Telemetry

These terms are often used interchangeably but have distinct meanings. Understanding the difference is important for designing effective observability systems.

| Term | Definition | Analogy |
| --- | --- | --- |
| Monitoring | Collecting and alerting on predefined metrics. You decide what to measure in advance and watch for known failure modes. | A dashboard showing your car's speed, fuel level, and engine temperature. You know what to watch for. |
| Observability | The property of a system that allows you to understand its internal state by examining its outputs. You can ask novel questions about system behavior without deploying new code. | A mechanic connecting a diagnostic scanner to your car's computer. They can investigate any problem, even ones you didn't anticipate. |
| Telemetry | The data itself: metrics, logs, traces, and events emitted by the system. The raw material of observability. | The electrical signals from all the car's sensors. Data flowing from the system to the observer. |

A system is observable when you can determine its internal state by examining its outputs, without needing to ship new code to answer new questions. Observability lets you investigate unknown unknowns: failure modes you haven't seen before and didn't instrument for.

Three Pillars of Observability

1. Metrics

Numeric measurements collected at regular intervals. Metrics are efficient for alerting and trending but provide limited context. Examples: request rate, error rate, latency percentiles, CPU utilization, memory usage.

2. Logs

Timestamped records of discrete events. Logs provide rich context but are expensive to store and query at scale. Structured logs (JSON) are essential for automated analysis. Examples: application error logs, access logs, audit logs.

3. Traces

End-to-end request flows through distributed systems. A trace follows a single request as it propagates through multiple services, database calls, and message queues. Essential for understanding latency in microservices architectures.

┌──────────────────────────────────────────────────────────────────┐
│                    THREE PILLARS VISUALIZED                      │
│                                                                  │
│  METRICS (What)        LOGS (Why)           TRACES (Where)       │
│                                                                  │
│  ┌────────────┐        ┌──────────┐         ┌───────────┐        │
│  │ CPU: 85%   │        │ ERROR:   │         │ [req-abc] │        │
│  │ Mem: 12G   │        │ Database │         │ ├─svc-a   │        │
│  │ RPS: 2.4K  │        │ connect  │         │ │  15ms   │        │
│  │ p95: 230ms │        │ failed:  │         │ ├─svc-b   │        │
│  │ errs: 0.1% │        │ conn     │         │ │  45ms   │        │
│  │            │        │ refused  │         │ ├─db-q    │        │
│  │ [numbers]  │        │ [text]   │         │ │  200ms  │        │
│  └────────────┘        └──────────┘         │ └─cache   │        │
│                                             │    5ms    │        │
│  Good for: Alerting    Good for: Debugging  └───────────┘        │
│  Trending              Root cause           Good for:            │
│  Capacity planning     Audit trails         Latency analysis     │
│  SLO measurement       Security forensics   Dependency mapping   │
│                                                                  │
│  When an alert fires at 2 AM:                                    │
│  1. METRICS tell you SOMETHING is wrong (CPU spiked)             │
│  2. TRACES tell you WHERE it's wrong (DB query in svc-b)         │
│  3. LOGS tell you WHY it's wrong (connection pool exhausted)     │
└──────────────────────────────────────────────────────────────────┘
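
The trace column of the diagram can be modelled as a small span tree. The sketch below is illustrative only: `Span`, `walk`, and `slowest_span` are hypothetical names rather than part of any tracing library, the durations are copied from the diagram, and the root total assumes the calls run one after another.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One timed operation inside a trace (hypothetical shape, illustration only)."""
    name: str
    duration_ms: float
    children: list = field(default_factory=list)


def walk(span):
    """Yield the span and all of its descendants, depth first."""
    yield span
    for child in span.children:
        yield from walk(child)


def slowest_span(root):
    """Return the descendant span with the largest duration (the root is excluded)."""
    return max((s for s in walk(root) if s is not root), key=lambda s: s.duration_ms)


# Durations from the diagram; 265 ms is their sum under the sequential-call assumption.
trace = Span("req-abc", 265, [
    Span("svc-a", 15),
    Span("svc-b", 45),
    Span("db-q", 200),
    Span("cache", 5),
])

print(slowest_span(trace).name)  # db-q
```

This is the "WHERE" step of the 2 AM workflow: the trace points at the database query, and the logs for that span then explain why it was slow.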

RED Method (Rate, Errors, Duration)

The RED method provides a minimal set of metrics for monitoring request-driven services. Every microservice should have RED metrics.

| Metric | Definition | Prometheus Example | Alert On |
| --- | --- | --- | --- |
| Rate | Requests per second | rate(http_requests_total[5m]) | Unexpected drop (traffic loss) or spike (DDoS) |
| Errors | Error ratio (failed requests / total requests) | rate(http_errors_total[5m]) / rate(http_requests_total[5m]) | Error ratio above SLO threshold |
| Duration | Request latency distribution | histogram_quantile(0.95, sum by (le) (rate(http_duration_bucket[5m]))) | p95 or p99 above SLO threshold |
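
To make the three RED numbers concrete, here is a minimal sketch that computes them from raw request records for one time window. The `Request` record and `red_summary` function are hypothetical, and the percentile uses the nearest-rank method, whereas Prometheus `histogram_quantile` interpolates within histogram buckets, so the two will not agree exactly.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One completed HTTP request (hypothetical record shape)."""
    duration_ms: float
    status: int


def red_summary(requests, window_seconds):
    """Compute Rate, Errors, and Duration (p95) for one time window."""
    total = len(requests)
    if total == 0:
        return {"rate_rps": 0.0, "error_ratio": 0.0, "p95_ms": None}
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank p95: smallest value with at least 95% of samples at or below it.
    rank = (95 * total + 99) // 100  # integer ceiling of 0.95 * total
    return {
        "rate_rps": total / window_seconds,
        "error_ratio": errors / total,
        "p95_ms": durations[rank - 1],
    }


# 100 requests in a 10-second window: 90 fast, 8 slower, 2 failing.
window = ([Request(20.0, 200)] * 90
          + [Request(400.0, 200)] * 8
          + [Request(900.0, 503)] * 2)
summary = red_summary(window, window_seconds=10)
print(summary)  # {'rate_rps': 10.0, 'error_ratio': 0.02, 'p95_ms': 400.0}
```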

USE Method (Utilization, Saturation, Errors)

The USE method, developed by Brendan Gregg, provides metrics for infrastructure resources. Apply USE to every resource in your system: CPU, memory, disk, network, connection pools.

| Metric | Definition | Prometheus Example |
| --- | --- | --- |
| Utilization | Percent of resource actively used | 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) |
| Saturation | Amount of work queued or unable to be serviced | node_load1 / count without(cpu, mode) (node_cpu_seconds_total{mode="idle"}) |
| Errors | Count of error events | node_filesystem_device_error or application-specific error counters |
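
The two PromQL expressions for utilization and saturation reduce to simple arithmetic. A small sketch, assuming a cumulative idle-seconds counter summed over all cores and a 1-minute load average; the function names and sample values are made up for illustration:

```python
def cpu_utilization_percent(idle_start, idle_end, elapsed_seconds, cores):
    """Utilization from two readings of a cumulative idle-seconds counter (all cores summed)."""
    idle_fraction = (idle_end - idle_start) / (elapsed_seconds * cores)
    return 100.0 * (1.0 - idle_fraction)


def cpu_saturation(load1, cores):
    """Load average per core; values above 1.0 mean runnable work is queuing."""
    return load1 / cores


# Hypothetical 4-core host sampled 60 seconds apart: 60 of 240 core-seconds were idle.
print(cpu_utilization_percent(1000.0, 1060.0, 60, 4))  # 75.0
print(cpu_saturation(6.0, 4))                          # 1.5
```

A host can sit well below 100% utilization and still be saturated, which is why USE tracks the two separately.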

Datadog Setup

Datadog is the recommended observability platform for its unified approach to metrics, traces, logs, and synthetic monitoring. The following sections cover agent installation, APM configuration, and log pipelines.

Complete Datadog Agent Installation: Ansible Playbook

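# playbook-datadog.yml
# Ansible playbook to install and configure Datadog Agent across all nodes
# The install tasks below cover Debian/Ubuntu only; CentOS and Amazon Linux
# hosts need equivalent yum/dnf tasks before this playbook supports them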

---
- name: Install Datadog Agent
  hosts: all
  become: yes
  vars:
    datadog_api_key: "{{ vault_datadog_api_key }}"  # From Ansible Vault
    datadog_site: "datadoghq.com"
    datadog_agent_version: "7.50.0"
    
    # APM Configuration
    datadog_apm_enabled: true
    datadog_apm_non_local_traffic: true
    datadog_apm_env: "{{ env }}"  # prod, staging, dev
    datadog_apm_tags:
      - "service:{{ service_name }}"
      - "team:{{ team_name }}"
      - "region:{{ aws_region }}"
    
    # Process monitoring
    datadog_process_config:
      enabled: true
      interval: 10
    
    # Log collection
    datadog_logs_enabled: true
    datadog_logs_config:
      - type: file
        path: /var/log/application/*.json
        service: "{{ service_name }}"
        source: "{{ service_name }}"
        sourcecategory: "application"
    
    # Integrations
    datadog_integrations:
      - name: postgres
        enabled: true
        config:
          host: localhost
          port: 5432
          username: datadog
          password: "{{ vault_pg_datadog_password }}"
          dbname: payment_db
          collect_default_metrics: true
      
      - name: redisdb
        enabled: true
        config:
          host: localhost
          port: 6379
      
      - name: docker
        enabled: true
        config:
          url: "unix://var/run/docker.sock"
      
      - name: nginx
        enabled: true
        config:
          nginx_status_url: http://localhost:80/nginx_status

  tasks:
    - name: Install Datadog Agent GPG key
      ansible.builtin.apt_key:
        url: https://keys.datadoghq.com/DATADOG_APT_KEY_CURRENT.public
        state: present
      when: ansible_os_family == "Debian"

    - name: Add Datadog repository
      ansible.builtin.apt_repository:
        repo: "deb https://apt.datadoghq.com/ stable 7"
        state: present
        update_cache: yes
      when: ansible_os_family == "Debian"

    - name: Install Datadog Agent
      ansible.builtin.apt:
        # Datadog's apt packages are versioned as 1:<version>-1 (epoch 1, revision 1)
        name: "datadog-agent=1:{{ datadog_agent_version }}-1"
        state: present
      when: ansible_os_family == "Debian"

    - name: Configure datadog.yaml
      ansible.builtin.template:
        src: datadog.yaml.j2
        dest: /etc/datadog-agent/datadog.yaml
        owner: dd-agent
        group: dd-agent
        mode: '0640'
      notify: restart datadog-agent

    # APM needs no separate file: Agent 7 reads it from the apm_config section
    # of datadog.yaml, so datadog.yaml.j2 should render the datadog_apm_* vars there.

    - name: Configure Postgres integration
      ansible.builtin.template:
        src: postgres.yaml.j2
        dest: /etc/datadog-agent/conf.d/postgres.d/conf.yaml
        owner: dd-agent
        group: dd-agent
        mode: '0640'
      notify: restart datadog-agent

    - name: Enable and start Datadog Agent
      ansible.builtin.systemd:
        name: datadog-agent
        enabled: yes
        state: started

  handlers:
    - name: restart datadog-agent
      ansible.builtin.systemd:
        name: datadog-agent
        state: restarted

Datadog APM Configuration (Application-Side)

# Java application with Datadog APM (dd-java-agent)
# Dockerfile

FROM eclipse-temurin:17-jre-alpine

# Download dd-java-agent
ADD https://dtdg.co/latest-java-tracer /opt/datadog/dd-java-agent.jar

COPY target/payment-api-*.jar /app/payment-api.jar

ENV DD_SERVICE=payment-api
ENV DD_ENV=prod
ENV DD_VERSION=2.4.0
ENV DD_AGENT_HOST=datadog-agent
ENV DD_TRACE_AGENT_PORT=8126
ENV DD_PROFILING_ENABLED=true
ENV DD_APPSEC_ENABLED=true
ENV DD_TAGS="team:pay-sre,region:us-west-2"

# JVM options for APM
ENV JAVA_TOOL_OPTIONS="-javaagent:/opt/datadog/dd-java-agent.jar \
  -Ddd.logs.injection=true \
  -Ddd.trace.sample.rate=0.1 \
  -Ddd.dbm.propagation.mode=service"

EXPOSE 8080
ENTRYPOINT ["java", "-jar", "/app/payment-api.jar"]

# Python application with Datadog APM (ddtrace)
# requirements.txt: ddtrace>=2.0.0

import os

from ddtrace import config, patch
from fastapi import FastAPI
from pydantic import BaseModel

# Configure Datadog APM
config.env = os.environ.get("DD_ENV", "dev")
config.service = "payment-api"
config.version = os.environ.get("DD_VERSION", "unknown")

# Auto-instrument libraries
patch(fastapi=True, sqlalchemy=True, redis=True, requests=True)

app = FastAPI()


class PaymentRequest(BaseModel):
    # Illustrative request body; the field names are assumed, not a real API contract
    amount: float
    currency: str
    card_token: str

# Distributed tracing headers will be automatically extracted
# and injected for requests made with the `requests` library

@app.get("/health")
def health_check():
    return {"status": "healthy"}

@app.post("/v1/payments")
def process_payment(request: PaymentRequest):
    # This endpoint is automatically traced
    # Spans for SQL queries and Redis calls are automatically created
    ...

Datadog Dashboard JSON (Microservice Template)

{
  "title": "SRE Microservice Dashboard: Payment API",
  "description": "Standard SRE dashboard for microservice observability",
  "widgets": [
    {
      "id": 1,
      "definition": {
        "type": "group",
        "title": "RED Metrics (Service Health)",
        "widgets": [
          {
            "definition": {
              "type": "timeseries",
              "title": "Request Rate (RPS)",
              "requests": [
                {
                  "q": "sum:trace.http.request.hits{service:payment-api,env:prod}.as_rate()",
                  "display_type": "line",
                  "style": {"palette": "dog_classic", "line_type": "solid", "line_width": "normal"}
                }
              ],
              "yaxis": {"label": "requests/sec", "min": 0}
            }
          },
          {
            "definition": {
              "type": "timeseries",
              "title": "Error Rate (%)",
              "requests": [
                {
                  "q": "(sum:trace.http.request.errors{service:payment-api,env:prod}.as_rate() / sum:trace.http.request.hits{service:payment-api,env:prod}.as_rate()) * 100",
                  "display_type": "line",
                  "style": {"palette": "warm", "line_type": "solid"}
                }
              ],
              "yaxis": {"label": "error %", "max": 5},
              "markers": [
                {"value": "y = 0.1", "display_type": "error dashed", "label": "SLO: 0.1%"}
              ]
            }
          },
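          {
            "definition": {
              "type": "timeseries",
              "title": "Latency (p50, p95, p99)",
              "requests": [
                {
                  "q": "p50:trace.http.request{service:payment-api,env:prod}",
                  "display_type": "line"
                },
                {
                  "q": "p95:trace.http.request{service:payment-api,env:prod}",
                  "display_type": "line"
                },
                {
                  "q": "p99:trace.http.request{service:payment-api,env:prod}",
                  "display_type": "line"
                }
              ],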
              "markers": [
                {"value": "y = 0.3", "display_type": "warning dashed", "label": "SLO: 300ms"}
              ]
            }
          }
        ]
      }
    },
    {
      "id": 2,
      "definition": {
        "type": "group",
        "title": "USE Metrics (Resource Health)",
        "widgets": [
          {
            "definition": {
              "type": "timeseries",
              "title": "CPU Utilization (%)",
              "requests": [
                {
                  "q": "avg:system.cpu.user{service:payment-api,env:prod} + avg:system.cpu.system{service:payment-api,env:prod}",
                  "display_type": "line"
                }
              ],
              "markers": [
                {"value": "y = 80", "display_type": "warning dashed", "label": "80% threshold"}
              ]
            }
          },
          {
            "definition": {
              "type": "timeseries",
              "title": "Memory Utilization (%)",
              "requests": [
                {
                  "q": "100 * (1 - avg:system.mem.pct_usable{service:payment-api,env:prod})",
                  "display_type": "line"
                }
              ]
            }
          },
          {
            "definition": {
              "type": "timeseries",
              "title": "Database Connection Pool",
              "requests": [
                {
                  "q": "avg:payment_api.db.connections.active{env:prod}",
                  "display_type": "line",
                  "style": {"palette": "warm"}
                },
                {
                  "q": "avg:payment_api.db.connections.max{env:prod}",
                  "display_type": "line",
                  "style": {"palette": "cool", "line_type": "dashed"}
                }
              ]
            }
          }
        ]
      }
    }
  ],
  "template_variables": [
    {"name": "env", "prefix": "env", "default": "prod"},
    {"name": "service", "prefix": "service", "default": "payment-api"}
  ],
  "layout_type": "ordered",
  "is_read_only": false,
  "notify_list": ["pay-sre@company.com"]
}

Datadog Synthetic Tests

Synthetic tests are the single most effective tool for reducing MTTD. They probe your endpoints from multiple global locations at regular intervals, catching issues before users do.

# Terraform: Datadog Synthetic Tests for critical endpoints
# terraform/synthetic-tests.tf

# API Health Check: runs every 60 seconds from 5 locations
resource "datadog_synthetics_test" "payment_api_health" {
  type    = "api"
  subtype = "http"
  name    = "[Critical] Payment API Health Check"
  message = "@pagerduty-pay-sre @slack-alerts-pay-sre"
  tags    = ["env:prod", "service:payment-api", "team:pay-sre", "critical:true"]
  status  = "live"

  request_definition {
    method = "GET"
    url    = "https://api.payment.samsung.com/v1/health"
  }

  assertion {
    type     = "statusCode"
    operator = "is"
    target   = "200"
  }

  assertion {
    type     = "responseTime"
    operator = "lessThan"
    target   = "2000"
  }

  assertion {
    type     = "body"
    operator = "contains"
    target   = "\"status\":\"healthy\""
  }

  options_list {
    tick_every          = 60    # Run every 60 seconds
    min_failure_duration = 120  # Alert after 2 consecutive failures
    min_location_failed  = 2    # At least 2 locations must fail
    retry {
      count    = 2
      interval = 5000  # 5 second retry interval
    }
    monitor_options {
      renotify_interval = 30
    }
  }

  locations = [
    "aws:us-west-2",
    "aws:us-east-1",
    "aws:eu-west-1",
    "aws:ap-northeast-1",
    "aws:ap-southeast-1"
  ]
}

# Multi-step API Test: full user journey
resource "datadog_synthetics_test" "payment_flow" {
  type    = "api"
  subtype = "multi"  # required for tests built from api_step blocks
  name    = "[Critical] Payment End-to-End Flow"
  message = "@pagerduty-pay-sre @slack-alerts-pay-sre"
  tags    = ["env:prod", "service:payment-api", "team:pay-sre", "flow:e2e"]
  status  = "live"

  # Step 1: Authenticate
  api_step {
    name    = "Get Auth Token"
    subtype = "http"
    request_definition {
      method = "POST"
      url    = "https://auth.samsung.com/v1/token"
      body   = jsonencode({grant_type = "client_credentials"})
    }
    assertion {
      type     = "statusCode"
      operator = "is"
      target   = "200"
    }
    assertion {
      type     = "body"
      operator = "contains"
      target   = "access_token"
    }
    # Extract token for next step
    extracted_value {
      name = "AUTH_TOKEN"
      type = "http_body"
      parser {
        type  = "json_path"
        value = "$.access_token"
      }
    }
  }

  # Step 2: Process payment
  api_step {
    name    = "Process Test Payment"
    subtype = "http"
    request_definition {
      method = "POST"
      url    = "https://api.payment.samsung.com/v1/payments"
      body   = jsonencode({
        amount     = 1.00
        currency   = "USD"
        card_token = "test-token-valid"
      })
    }
    # Headers are set beside request_definition, not inside it
    request_headers = {
      Authorization = "Bearer {{ AUTH_TOKEN }}"
    }
    assertion {
      type     = "statusCode"
      operator = "is"
      target   = "200"
    }
    assertion {
      type     = "body"
      operator = "contains"
      target   = "\"status\":\"approved\""
    }
  }

  options_list {
    tick_every           = 300   # Every 5 minutes (less frequent for E2E)
    min_failure_duration = 300
    min_location_failed  = 1
  }

  locations = ["aws:us-west-2", "aws:us-east-1", "aws:eu-west-1"]
}
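
The interaction between `min_failure_duration` and `min_location_failed` is easier to see as code. The following is a simplified model of the alerting decision, not Datadog's actual implementation: it assumes each location reports a time-ordered list of pass/fail results and that a location counts as failed once it has failed continuously for the configured duration. The function names are hypothetical.

```python
def failing_since(results):
    """Return the timestamp where the current unbroken run of failures began, or None.

    `results` is a list of (timestamp_seconds, passed) tuples in time order.
    """
    start = None
    for timestamp, passed in results:
        if passed:
            start = None
        elif start is None:
            start = timestamp
    return start


def should_alert(results_by_location, now, min_failure_duration, min_location_failed):
    """Alert when enough locations have failed continuously for long enough."""
    failing = 0
    for results in results_by_location.values():
        start = failing_since(results)
        if start is not None and now - start >= min_failure_duration:
            failing += 1
    return failing >= min_location_failed


results = {
    "aws:us-west-2": [(0, True), (60, False), (120, False), (180, False)],
    "aws:us-east-1": [(0, True), (60, False), (120, False), (180, False)],
    "aws:eu-west-1": [(0, True), (60, True), (120, True), (180, True)],
}
print(should_alert(results, now=180, min_failure_duration=120, min_location_failed=2))  # True
print(should_alert(results, now=180, min_failure_duration=120, min_location_failed=3))  # False
```

Requiring two locations filters out a single probe's network trouble, at the cost of missing outages that affect only one region.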

Prometheus + Grafana on Kubernetes

For teams using open-source stacks, Prometheus and Grafana provide powerful metrics collection and visualization. This section covers deployment via the kube-prometheus-stack Helm chart.

# Install kube-prometheus-stack via Helm
# This installs Prometheus, Grafana, Alertmanager, and node-exporter

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# values-override.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    retentionSize: "50GB"
    resources:
      requests:
        memory: "2Gi"
        cpu: "500m"
      limits:
        memory: "4Gi"
        cpu: "2000m"
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    additionalScrapeConfigs:
      # Scrape your application metrics
      - job_name: 'payment-api'
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - payment
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)

grafana:
  enabled: true
  adminPassword: "CHANGE_ME"  # Use SealedSecret or ExternalSecret
  persistence:
    enabled: true
    size: 10Gi
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: 'default'
          orgId: 1
          folder: ''
          type: file
          disableDeletion: false
          editable: true
          options:
            path: /var/lib/grafana/dashboards/default
  dashboards:
    default:
      payment-api-red:
        url: https://raw.githubusercontent.com/org/grafana-dashboards/main/payment-api-red.json

alertmanager:
  enabled: true
  config:
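    global:
      # Helm does not expand ${...} placeholders; substitute real values at
      # deploy time (for example with envsubst or a secret-backed values file)
      slack_api_url: '${SLACK_WEBHOOK_URL}'
      pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'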
    route:
      receiver: 'default'
      group_by: ['alertname', 'severity', 'service']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      routes:
        - match:
            severity: critical
          receiver: 'pagerduty-critical'
          continue: true
        - match:
            severity: warning
          receiver: 'slack-warnings'
    receivers:
      - name: 'default'
        slack_configs:
          - channel: '#alerts-default'
            title: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
      - name: 'pagerduty-critical'
        pagerduty_configs:
          # routing_key is the Events API v2 setting; service_key targets the older v1 integration
          - routing_key: '${PAGERDUTY_INTEGRATION_KEY}'
            severity: critical
            description: '{{ .CommonAnnotations.summary }}'
      - name: 'slack-warnings'
        slack_configs:
          - channel: '#alerts-warnings'
            send_resolved: true

# Install
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values values-override.yaml
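
The `__address__` relabel rule in the scrape config above is its least obvious line. A rough Python equivalent follows, relying on two Prometheus behaviors: source labels are joined with `;` by default, and relabel regexes are anchored at both ends. The `rewrite_address` helper and the sample addresses are hypothetical.

```python
import re

# Same pattern as the relabel rule; re.fullmatch reproduces Prometheus's anchoring.
ADDRESS_RULE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address, port_annotation):
    """Apply the `$1:$2` replacement to the joined source labels."""
    joined = f"{address};{port_annotation}"  # source_labels are joined with ';'
    match = ADDRESS_RULE.fullmatch(joined)
    if match is None:
        return address  # no match: the rule leaves __address__ unchanged
    return f"{match.group(1)}:{match.group(2)}"


print(rewrite_address("10.0.3.7:8080", "9102"))  # 10.0.3.7:9102
print(rewrite_address("10.0.3.7", "9102"))       # 10.0.3.7:9102
print(rewrite_address("10.0.3.7:8080", ""))      # 10.0.3.7:8080
```

The last case shows why pods without a `prometheus.io/port` annotation keep their discovered address.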

Alert Routing (PagerDuty, Slack, Email)

Effective alert routing ensures the right people are notified through the right channels at the right urgency level.

┌──────────────────────────────────────────────────────────────────┐
│                    ALERT ROUTING FLOW                            │
│                                                                  │
│  ┌────────────┐   ┌────────────┐   ┌────────────┐                │
│  │ Prometheus │   │ Datadog    │   │ Synthetic  │                │
│  │ Alerts     │   │ Monitors   │   │ Tests      │                │
│  └─────┬──────┘   └─────┬──────┘   └─────┬──────┘                │
│        │                │                │                       │
│        └────────────────┼────────────────┘                       │
│                         ▼                                        │
│                ┌─────────────────┐                               │
│                │  Alertmanager   │                               │
│                │  / PagerDuty    │                               │
│                └────────┬────────┘                               │
│                         │                                        │
│           ┌─────────────┼─────────────┐                          │
│           ▼             ▼             ▼                          │
│     ┌───────────┐ ┌───────────┐ ┌───────────┐                    │
│     │ Critical  │ │ Warning   │ │ Info      │                    │
│     │ → PAGE    │ │ → Slack   │ │ → Ticket  │                    │
│     │ (2 AM)    │ │ (9 AM)    │ │ (daily)   │                    │
│     └───────────┘ └───────────┘ └───────────┘                    │
│                                                                  │
│  CRITICAL: PagerDuty page + Slack #alerts-critical               │
│  WARNING:  Slack #alerts-warnings + email (business hours)       │
│  INFO:     Jira ticket + Slack #alerts-info (next business day)  │
└──────────────────────────────────────────────────────────────────┘
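
The severity-to-channel mapping in the diagram amounts to a lookup table. A minimal sketch with hypothetical target names; the fallback for unknown severities is an assumption of this example, not something the diagram specifies.

```python
# Targets per severity, following the three rows at the bottom of the diagram.
ROUTES = {
    "critical": ["pagerduty", "slack:#alerts-critical"],
    "warning": ["slack:#alerts-warnings", "email"],
    "info": ["jira", "slack:#alerts-info"],
}


def route_alert(alert):
    """Return the notification targets for an alert dict with a 'severity' label."""
    severity = alert.get("severity", "").lower()
    # Assumption: unknown or missing severities fall back to the warning route,
    # so a mislabelled alert is still seen by a human without paging anyone.
    return ROUTES.get(severity, ROUTES["warning"])


print(route_alert({"alertname": "HighErrorRate", "severity": "critical"}))
# ['pagerduty', 'slack:#alerts-critical']
```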

Log Aggregation

Structured Logging Best Practices

# Python structured logging example
import logging
import os
import sys
from datetime import datetime, timezone

from pythonjsonlogger import jsonlogger

class StructuredLogFormatter(jsonlogger.JsonFormatter):
    def add_fields(self, log_record, record, message_dict):
        super().add_fields(log_record, record, message_dict)
        # Use the time the event was logged, not the time it was formatted
        log_record['timestamp'] = datetime.fromtimestamp(
            record.created, timezone.utc
        ).isoformat()
        log_record['level'] = record.levelname
        log_record['service'] = 'payment-api'
        # Read the environment from DD_ENV instead of hardcoding it
        log_record['environment'] = os.environ.get('DD_ENV', 'prod')
        log_record['logger'] = record.name
        # Carry trace context through when the caller supplies it
        if hasattr(record, 'trace_id'):
            log_record['trace_id'] = record.trace_id
        if hasattr(record, 'span_id'):
            log_record['span_id'] = record.span_id

log_handler = logging.StreamHandler(sys.stdout)
formatter = StructuredLogFormatter(
    '%(timestamp)s %(level)s %(service)s %(message)s'
)
log_handler.setFormatter(formatter)

logger = logging.getLogger('payment-api')
logger.addHandler(log_handler)
logger.setLevel(logging.INFO)

# Usage: always log with context
logger.info(
    'Payment processed',
    extra={
        'event': 'payment.processed',
        'payment_id': 'pay_12345',
        'amount': 99.99,
        'currency': 'USD',
        'merchant_id': 'merch_678',
        'latency_ms': 145,
        'trace_id': 'abc123def456'
    }
)

# Example output (JSON, machine-parseable; the timestamp will differ):
# {
#   "timestamp": "2024-01-15T14:30:00+00:00",
#   "level": "INFO",
#   "service": "payment-api",
#   "message": "Payment processed",
#   "event": "payment.processed",
#   "payment_id": "pay_12345",
#   "amount": 99.99,
#   "currency": "USD",
#   "merchant_id": "merch_678",
#   "latency_ms": 145,
#   "trace_id": "abc123def456",
#   "environment": "prod",
#   "logger": "payment-api"
# }

Datadog Log Pipeline Configuration

# datadog-agent log pipeline configuration
# /etc/datadog-agent/conf.d/payment-api.d/conf.yaml

logs:
  - type: file
    path: /var/log/payment-api/app.json
    service: payment-api
    source: python
    sourcecategory: application
    
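    # Log processing rules (applied by the Agent before logs are sent)
    log_processing_rules:
      # Ship only warning/error logs and requests that took 1000 ms or longer.
      # WARN(ING)? is needed because Python's logging emits "WARNING".
      - type: include_at_match
        name: include_only_errors_and_slow_requests
        pattern: '(?i)("level":\s*"(ERROR|WARN(ING)?|CRITICAL)"|"latency_ms":\s*\d{4,})'
      
      # Mask card numbers (13 to 19 digits) and keep the line valid JSON
      - type: mask_sequences
        name: mask_card_numbers
        replace_placeholder: '"card_number": "[MASKED]"'
        pattern: '"card_number":\s*"\d{13,19}"'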
    
    tags:
      - env:prod
      - team:pay-sre

# For high-volume services, use log sampling
# Only index ERROR/WARN logs and 1% of INFO logs
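
The sampling policy in the comment above can be sketched in a few lines. This is an illustration of the decision logic only; in Datadog the same effect is usually configured with index exclusion filters rather than in application code. The `should_index` function and the choice to hash on `trace_id` are assumptions of this example.

```python
import zlib

ALWAYS_INDEX = {"ERROR", "WARN", "WARNING", "CRITICAL"}


def should_index(log, info_sample_percent=1):
    """Keep every warning/error log and a fixed percentage of everything else."""
    if log.get("level", "").upper() in ALWAYS_INDEX:
        return True
    # Hash the trace ID so all logs from one request are kept or dropped together.
    key = str(log.get("trace_id", "")).encode("utf-8")
    bucket = zlib.crc32(key) % 100
    return bucket < info_sample_percent


print(should_index({"level": "ERROR", "trace_id": "abc123def456"}))  # True
```

Hashing instead of calling a random number generator makes the decision repeatable, so every service that sees the same trace ID reaches the same verdict.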

OpenTelemetry Overview

OpenTelemetry (OTel) is the emerging open standard for instrumentation. It provides a vendor-neutral way to collect traces, metrics, and logs. OTel is particularly valuable for organizations that want to avoid vendor lock-in or need to send telemetry to multiple backends.

# OpenTelemetry Collector configuration
# otel-collector-config.yaml
# Receives OTLP data and exports to multiple backends

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  
  resource:
    attributes:
      - key: environment
        value: prod
        action: upsert
      - key: team
        value: pay-sre
        action: upsert
  
  # Tail-based sampling: keep all errors, all traces slower than 1s, and 10% of the rest
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow_requests
        type: latency
        latency: {threshold_ms: 1000}
      - name: probabilistic
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

exporters:
  # Export traces and metrics to Datadog
  datadog:
    api:
      key: ${DD_API_KEY}
  
  # Export traces to Jaeger (for development/debugging).
  # Recent Collector releases dropped the dedicated jaeger exporter; Jaeger
  # accepts OTLP directly, so this assumes its OTLP gRPC port (4317).
  otlp/jaeger:
    endpoint: jaeger-collector.monitoring:4317
    tls:
      insecure: true
  
  # Export metrics to Prometheus (its remote-write receiver must be enabled)
  prometheusremotewrite:
    endpoint: http://prometheus.monitoring:9090/api/v1/write
  
  # Also export to S3 for long-term archival
  awss3:
    s3uploader:
      region: us-west-2
      s3_bucket: otel-traces-archive
      s3_prefix: traces/

service:
  pipelines:
    traces:
      receivers: [otlp]
      # Sample before batching so that dropped traces are never batched
      processors: [resource, tail_sampling, batch]
      exporters: [datadog, otlp/jaeger, awss3]
    metrics:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [prometheusremotewrite, datadog]
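
The three `tail_sampling` policies are OR-ed: a trace is kept if any one of them matches. The sketch below mirrors that decision in Python under simplifying assumptions. It treats the longest span as the trace latency, whereas the real processor measures the whole trace, and it replaces random sampling with a hash of the trace ID. The span dict shape and `keep_trace` name are hypothetical.

```python
import zlib


def keep_trace(spans, trace_id, latency_threshold_ms=1000, sampling_percentage=10):
    """Decide whether to keep a finished trace; any matching policy keeps it.

    `spans` is a list of dicts with 'status' and 'duration_ms' keys.
    """
    if not spans:
        return False
    if any(span["status"] == "ERROR" for span in spans):
        return True  # policy: errors
    if max(span["duration_ms"] for span in spans) >= latency_threshold_ms:
        return True  # policy: slow_requests
    # policy: probabilistic, made repeatable by hashing the trace ID
    return zlib.crc32(trace_id.encode("utf-8")) % 100 < sampling_percentage


failed = [{"status": "OK", "duration_ms": 12}, {"status": "ERROR", "duration_ms": 30}]
slow = [{"status": "OK", "duration_ms": 1500}]
print(keep_trace(failed, "trace-1"))  # True
print(keep_trace(slow, "trace-2"))    # True
```

The decision can only be made once the trace is complete, which is why the collector buffers spans for `decision_wait` before applying the policies.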

Alert Fatigue Prevention

Alert fatigue is one of the most common and dangerous problems in operations. When engineers receive too many alerts, they begin to ignore them, which leads to missed real incidents. Based on experience at Samsung, these strategies are effective:

| Strategy | Implementation | Target |
| --- | --- | --- |
| SLO-based alerting | Alert on burn rate and error budget, not thresholds | < 2 pages per day per team |
| Actionable alerts only | Every alert must have a runbook and a defined response | 100% of alerts have runbooks |
| Page on symptoms, not causes | Page when users are affected, not when an internal metric is unusual | > 80% of pages are symptom-based |
| Alert review | Weekly review of all alerts that fired; tune or remove noisy ones | < 5% false positive rate |
| Time-based suppression | Suppress maintenance windows, known issues | 0 pages during planned maintenance |
| Alert grouping | Group related alerts into single notification | 1 notification per incident, not per symptom |
| Graduated severity | Tickets for slow-burn; pages only for fast-burn | < 5 pages per on-call week |

Alert Quality Metric: Track the "page-to-action ratio": the percentage of pages that resulted in a meaningful human action (not just "acknowledged and watched it resolve"). Target: > 90% of pages require meaningful action. If a page fires and the on-call routinely does nothing, the alert is wrong; tune it or delete it.
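
The "SLO-based alerting" and "Graduated severity" rows both rest on the burn rate: the observed error ratio divided by the error budget (1 minus the SLO target). A minimal sketch follows. The 14.4 page threshold is an assumption borrowed from common multiwindow practice (it corresponds to spending 2% of a 30-day budget in one hour), and the function names are hypothetical.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def alert_action(long_window_ratio, short_window_ratio, slo_target,
                 page_threshold=14.4, ticket_threshold=1.0):
    """Page on fast burn, open a ticket on slow burn, otherwise do nothing.

    Both windows must exceed the page threshold, so a page clears soon after recovery.
    """
    long_burn = burn_rate(long_window_ratio, slo_target)
    short_burn = burn_rate(short_window_ratio, slo_target)
    if long_burn >= page_threshold and short_burn >= page_threshold:
        return "page"
    if long_burn >= ticket_threshold:
        return "ticket"
    return "none"


# 99.9% SLO: the error budget is 0.1% of requests.
print(alert_action(0.02, 0.03, slo_target=0.999))      # page
print(alert_action(0.002, 0.002, slo_target=0.999))    # ticket
print(alert_action(0.0005, 0.0005, slo_target=0.999))  # none
```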

Weekly Alert Review Process

WEEKLY ALERT REVIEW AGENDA (15 minutes)

Attendees: On-call engineer (last week), SRE lead
Time: Every Monday 10:00 AM

1. ALERTS FIRED (5 min)
   Review every alert that fired last week:
   ┌──────────┬──────────┬────────┬────────────────┐
   │ Alert    │ Fired    │ Action │ Real Issue?    │
   │ Name     │ Times    │ Taken  │ Y/N/Unsure     │
   ├──────────┼──────────┼────────┼────────────────┤
   │ HighCPU  │ 12       │ None   │ N (baseline)   │
   │ SlowDB   │ 3        │ Tuned  │ Y              │
   │ DiskFull │ 1        │ Clean  │ Y              │
   └──────────┴──────────┴────────┴────────────────┘

2. FALSE POSITIVE ANALYSIS (5 min)
   For alerts marked "N":
   • Tune threshold? [Action + Owner]
   • Remove alert? [Action + Owner]
   • Improve runbook? [Action + Owner]

3. MISSING ALERTS (3 min)
   Were there issues that SHOULD have alerted but didn't?
   • New alert needed? [Action + Owner]

4. METRICS TRACKING (2 min)
   • Pages this week: [X]
   • Pages per on-call week (rolling 4-week avg): [X]
   • Target: < 5 pages/week
   • Action if trending up: ____________________
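
The numbers for agenda item 4 can be produced by a short script before the meeting. A sketch, assuming page counts are available as one integer per on-call week, oldest first; the function names and sample counts are hypothetical.

```python
def rolling_average(weekly_pages, weeks=4):
    """Average of the most recent `weeks` entries (fewer if the history is shorter)."""
    recent = weekly_pages[-weeks:]
    return sum(recent) / len(recent) if recent else 0.0


def review_status(weekly_pages, target=5):
    """Summarize paging load against the '< 5 pages/week' target."""
    avg = rolling_average(weekly_pages)
    return {
        "pages_this_week": weekly_pages[-1] if weekly_pages else 0,
        "rolling_avg": avg,
        "over_target": avg >= target,
    }


# Five weeks of made-up page counts, oldest first.
print(review_status([3, 4, 9, 6, 7]))
# {'pages_this_week': 7, 'rolling_avg': 6.5, 'over_target': True}
```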

MTTD Reduction Strategies

Mean Time To Detect (MTTD) is the metric most directly controllable by the observability team. At Samsung, we reduced MTTD by 50% through these specific improvements:

  1. Synthetic monitors on critical user journeys: 60-second probes from 5 global locations caught issues before users reported them. This alone accounted for 60% of the MTTD improvement.
  2. Anomaly detection on error rate: Datadog's anomaly detection monitors flagged gradual error rate increases that threshold-based alerts missed. Static thresholds work for sudden failures; anomaly detection catches slow degradation.
  3. Correlation dashboards: A single "service health" dashboard per service showing RED metrics, error budget, and deployment markers. On-call engineers could triage in under 2 minutes.
  4. Deployment correlation: Automatic markers on all dashboards showing when deployments occurred. 80% of incidents at Samsung were correlated with a deployment within 30 minutes.
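
Deployment correlation, the last item above, is a time-window lookup. A minimal sketch, assuming deployment completion times and incident start times are available as `datetime` values; the `preceding_deployment` helper and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def preceding_deployment(incident_start, deployments, window=timedelta(minutes=30)):
    """Return the latest deployment within `window` before the incident, or None."""
    candidates = [
        d for d in deployments
        if timedelta(0) <= incident_start - d <= window
    ]
    return max(candidates) if candidates else None


deployments = [datetime(2024, 1, 15, 13, 0), datetime(2024, 1, 15, 14, 10)]

# Incident 20 minutes after the second deployment: flagged as correlated.
print(preceding_deployment(datetime(2024, 1, 15, 14, 30), deployments))
# 2024-01-15 14:10:00

# Incident 50 minutes after the last deployment: outside the window.
print(preceding_deployment(datetime(2024, 1, 15, 15, 0), deployments))
# None
```

A match is only a lead for the on-call engineer to check first, not proof that the deployment caused the incident.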

Last updated: June 2026 | Author: SRE Team