Monitoring & Observability
Observability provides insight into distributed systems through metrics, logs, and traces. This guide covers Datadog, Prometheus, Grafana, OpenTelemetry, and modern observability practices with production-ready configuration examples, Ansible playbooks, and alert routing.
Monitoring vs. Observability vs. Telemetry
These terms are often used interchangeably but have distinct meanings. Understanding the difference is important for designing effective observability systems.
| Term | Definition | Analogy |
|---|---|---|
| Monitoring | Collecting and alerting on predefined metrics. You decide what to measure in advance and watch for known failure modes. | A dashboard showing your car's speed, fuel level, and engine temperature. You know what to watch for. |
| Observability | The property of a system that allows you to understand its internal state by examining its outputs. You can ask novel questions about system behavior without deploying new code. | A mechanic connecting a diagnostic scanner to your car's computer. They can investigate any problem, even ones you didn't anticipate. |
| Telemetry | The data itself: metrics, logs, traces, and events emitted by the system. The raw material of observability. | The electrical signals from all the car's sensors. Data flowing from the system to the observer. |
A system is observable when you can determine its internal state by examining its outputs, without needing to ship new code to answer new questions. Observability lets you investigate unknown unknowns: failure modes you haven't seen before and didn't instrument for.
Three Pillars of Observability
1. Metrics
Numeric measurements collected at regular intervals. Metrics are efficient for alerting and trending but provide limited context. Examples: request rate, error rate, latency percentiles, CPU utilization, memory usage.
2. Logs
Timestamped records of discrete events. Logs provide rich context but are expensive to store and query at scale. Structured logs (JSON) are essential for automated analysis. Examples: application error logs, access logs, audit logs.
3. Traces
End-to-end request flows through distributed systems. A trace follows a single request as it propagates through multiple services, database calls, and message queues. Essential for understanding latency in microservices architectures.
Three pillars visualized:
| Aspect | Metrics (What) | Logs (Why) | Traces (Where) |
|---|---|---|---|
| Data type | Numbers | Text | Per-request spans with timings |
| Example | CPU: 85%, Mem: 12G, RPS: 2.4K, p95: 230ms, errs: 0.1% | ERROR: Database connect failed: conn refused | [req-abc] svc-a 15ms, svc-b 45ms, db-q 200ms, cache 5ms |
| Good for | Alerting, trending, capacity planning, SLO measurement | Debugging, root cause, audit trails, security forensics | Latency analysis, dependency mapping |
When an alert fires at 2 AM:
1. METRICS tell you SOMETHING is wrong (CPU spiked)
2. TRACES tell you WHERE it's wrong (DB query in svc-b)
3. LOGS tell you WHY it's wrong (connection pool exhausted)
RED Method (Rate, Errors, Duration)
The RED method provides a minimal set of metrics for monitoring request-driven services. Every microservice should have RED metrics.
| Metric | Definition | Prometheus Example | Alert On |
|---|---|---|---|
| Rate | Requests per second | rate(http_requests_total[5m]) | Unexpected drop (traffic loss) or spike (DDoS) |
| Errors | Error rate (failed requests / total requests) | rate(http_errors_total[5m]) / rate(http_requests_total[5m]) | Error rate above SLO threshold |
| Duration | Request latency distribution | histogram_quantile(0.95, rate(http_duration_bucket[5m])) | p95 or p99 above SLO threshold |
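To make the three RED signals concrete, the sketch below computes them from an in-memory list of finished requests instead of from Prometheus. The `Request` record, the `red_summary` helper, and the sample data are all hypothetical, and the percentile uses a simple nearest-rank rule rather than the bucket interpolation that `histogram_quantile` performs.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One completed request (hypothetical record shape)."""
    duration_s: float
    is_error: bool


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling division
    return ordered[rank - 1]


def red_summary(requests: list[Request], window_s: float) -> dict[str, float]:
    """Compute Rate, Errors, and Duration for one observation window."""
    total = len(requests)
    errors = sum(1 for r in requests if r.is_error)
    durations = [r.duration_s for r in requests]
    return {
        "rate_rps": total / window_s,
        "error_ratio": errors / total if total else 0.0,
        "p95_s": percentile(durations, 95) if total else 0.0,
    }


# Invented sample: 100 requests over 50 s, durations 0.01 s to 1.00 s,
# with every 20th request failing.
sample = [Request(duration_s=i / 100, is_error=(i % 20 == 0)) for i in range(1, 101)]
summary = red_summary(sample, window_s=50.0)
print(summary)
```

In production these numbers would come from counters and histograms scraped by the metrics backend; the point here is only the arithmetic behind each column of the table.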
USE Method (Utilization, Saturation, Errors)
The USE method, developed by Brendan Gregg, provides metrics for infrastructure resources. Apply USE to every resource in your system: CPU, memory, disk, network, connection pools.
| Metric | Definition | Prometheus Example |
|---|---|---|
| Utilization | Percent of resource actively used | 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) |
| Saturation | Amount of work queued / unable to be serviced | node_load1 / count without(cpu, mode) (node_cpu_seconds_total{mode="idle"}) |
| Errors | Count of error events | node_filesystem_device_error or application-specific error counters |
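The same three questions apply to resources that node-exporter does not cover, such as a database connection pool. The following is a minimal sketch under assumed names: `PoolSnapshot` and its fields are hypothetical and do not correspond to any specific pool library.

```python
from dataclasses import dataclass


@dataclass
class PoolSnapshot:
    """Point-in-time view of a connection pool (hypothetical field names)."""
    in_use: int    # connections currently checked out
    capacity: int  # maximum pool size
    waiting: int   # callers queued for a connection
    errors: int    # failed acquisitions since the previous snapshot


def use_summary(snapshot: PoolSnapshot) -> dict[str, float]:
    """Map a pool snapshot onto Utilization, Saturation, and Errors."""
    return {
        "utilization_pct": 100.0 * snapshot.in_use / snapshot.capacity,
        "saturation": float(snapshot.waiting),  # work queued, not yet serviced
        "errors": float(snapshot.errors),
    }


# Invented sample: 18 of 20 connections busy, 4 callers waiting, 1 failure.
pool = use_summary(PoolSnapshot(in_use=18, capacity=20, waiting=4, errors=1))
print(pool)
```

A pool at 90% utilization with a non-zero wait queue is already saturated, which is why USE tracks the queue separately from the busy percentage.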
Datadog Setup
Datadog is the recommended observability platform for its unified approach to metrics, traces, logs, and synthetic monitoring. The following sections cover agent installation, APM configuration, and log pipelines.
Complete Datadog Agent Installation: Ansible Playbook
# playbook-datadog.yml
# Ansible playbook to install and configure the Datadog Agent across all nodes.
# The install tasks below target Debian/Ubuntu (apt); CentOS and Amazon Linux
# need equivalent yum/dnf tasks, which are not shown here.
---
- name: Install Datadog Agent
  hosts: all
  become: true
  vars:
    datadog_api_key: "{{ vault_datadog_api_key }}"  # From Ansible Vault
    datadog_site: "datadoghq.com"
    datadog_agent_version: "7.50.0"
    # APM Configuration
    datadog_apm_enabled: true
    datadog_apm_non_local_traffic: true
    datadog_apm_env: "{{ env }}"  # prod, staging, dev
    datadog_apm_tags:
      - "service:{{ service_name }}"
      - "team:{{ team_name }}"
      - "region:{{ aws_region }}"
    # Process monitoring
    datadog_process_config:
      enabled: true
      interval: 10
    # Log collection
    datadog_logs_enabled: true
    datadog_logs_config:
      - type: file
        path: /var/log/application/*.json
        service: "{{ service_name }}"
        source: "{{ service_name }}"
        sourcecategory: "application"
    # Integrations
    datadog_integrations:
      - name: postgres
        enabled: true
        config:
          host: localhost
          port: 5432
          username: datadog
          password: "{{ vault_pg_datadog_password }}"
          dbname: payment_db
          collect_default_metrics: true
      - name: redisdb
        enabled: true
        config:
          host: localhost
          port: 6379
      - name: docker
        enabled: true
        config:
          url: "unix:///var/run/docker.sock"  # three slashes: scheme + absolute path
      - name: nginx
        enabled: true
        config:
          nginx_status_url: http://localhost:80/nginx_status
  tasks:
    - name: Install Datadog Agent GPG key
      ansible.builtin.apt_key:
        url: https://keys.datadoghq.com/DATADOG_APT_KEY_CURRENT.public
        state: present
      when: ansible_os_family == "Debian"
    - name: Add Datadog repository
      ansible.builtin.apt_repository:
        repo: "deb https://apt.datadoghq.com/ stable 7"
        state: present
        update_cache: true
      when: ansible_os_family == "Debian"
    - name: Install Datadog Agent
      ansible.builtin.apt:
        # Datadog apt packages carry an epoch and revision, e.g. 1:7.50.0-1
        name: "datadog-agent=1:{{ datadog_agent_version }}-1"
        state: present
      when: ansible_os_family == "Debian"
    - name: Configure datadog.yaml
      ansible.builtin.template:
        src: datadog.yaml.j2
        dest: /etc/datadog-agent/datadog.yaml
        owner: dd-agent
        group: dd-agent
        mode: '0640'
      notify: restart datadog-agent
    # Note: Agent 6/7 reads APM settings from the apm_config section of
    # datadog.yaml, so make sure datadog.yaml.j2 renders that section as well.
    - name: Configure APM
      ansible.builtin.template:
        src: apm.yaml.j2
        dest: /etc/datadog-agent/conf.d/apm.yaml
        owner: dd-agent
        group: dd-agent
        mode: '0640'
      when: datadog_apm_enabled
      notify: restart datadog-agent
    - name: Configure Postgres integration
      ansible.builtin.template:
        src: postgres.yaml.j2
        dest: /etc/datadog-agent/conf.d/postgres.d/conf.yaml
        owner: dd-agent
        group: dd-agent
        mode: '0640'
      notify: restart datadog-agent
    - name: Enable and start Datadog Agent
      ansible.builtin.systemd:
        name: datadog-agent
        enabled: true
        state: started
  handlers:
    - name: restart datadog-agent
      ansible.builtin.systemd:
        name: datadog-agent
        state: restarted
Datadog APM Configuration (Application-Side)
# Java application with Datadog APM (dd-java-agent)
# Dockerfile
FROM eclipse-temurin:17-jre-alpine
# Download dd-java-agent
ADD https://dtdg.co/latest-java-tracer /opt/datadog/dd-java-agent.jar
COPY target/payment-api-*.jar /app/payment-api.jar
ENV DD_SERVICE=payment-api
ENV DD_ENV=prod
ENV DD_VERSION=2.4.0
ENV DD_AGENT_HOST=datadog-agent
ENV DD_TRACE_AGENT_PORT=8126
ENV DD_PROFILING_ENABLED=true
ENV DD_APPSEC_ENABLED=true
ENV DD_TAGS="team:pay-sre,region:us-west-2"
# JVM options for APM
ENV JAVA_TOOL_OPTIONS="-javaagent:/opt/datadog/dd-java-agent.jar \
-Ddd.logs.injection=true \
-Ddd.trace.sample.rate=0.1 \
-Ddd.dbm.propagation.mode=service"
EXPOSE 8080
ENTRYPOINT ["java", "-jar", "/app/payment-api.jar"]
# Python application with Datadog APM (ddtrace)
# requirements.txt: ddtrace>=2.0.0
import os

from ddtrace import config, patch

# Auto-instrument libraries. Call patch() before importing the libraries
# it instruments so that they are wrapped when first imported.
patch(fastapi=True, sqlalchemy=True, redis=True, requests=True)

from fastapi import FastAPI  # noqa: E402 (imported after patch() on purpose)
from pydantic import BaseModel  # noqa: E402

# Configure Datadog APM
config.env = os.environ.get("DD_ENV", "dev")
config.service = "payment-api"
config.version = os.environ.get("DD_VERSION", "unknown")

app = FastAPI()


class PaymentRequest(BaseModel):
    # Illustrative request model; the field names are assumptions
    amount: float
    currency: str
    card_token: str


# Distributed tracing headers will be automatically extracted
# and injected for requests made with the `requests` library
@app.get("/health")
def health_check():
    return {"status": "healthy"}


@app.post("/v1/payments")
def process_payment(request: PaymentRequest):
    # This endpoint is automatically traced
    # Spans for SQL queries and Redis calls are automatically created
    ...
Datadog Dashboard JSON (Microservice Template)
{
"title": "SRE Microservice Dashboard - Payment API",
"description": "Standard SRE dashboard for microservice observability",
"widgets": [
{
"id": 1,
"definition": {
"type": "group",
"title": "RED Metrics (Service Health)",
"widgets": [
{
"definition": {
"type": "timeseries",
"title": "Request Rate (RPS)",
"requests": [
{
"q": "sum:trace.http.request.hits{service:payment-api,env:prod}.as_rate()",
"display_type": "line",
"style": {"palette": "dog_classic", "line_type": "solid", "line_width": "normal"}
}
],
"yaxis": {"label": "requests/sec", "min": 0}
}
},
{
"definition": {
"type": "timeseries",
"title": "Error Rate (%)",
"requests": [
{
"q": "(sum:trace.http.request.errors{service:payment-api,env:prod}.as_rate() / sum:trace.http.request.hits{service:payment-api,env:prod}.as_rate()) * 100",
"display_type": "line",
"style": {"palette": "warm", "line_type": "solid"}
}
],
"yaxis": {"label": "error %", "max": 5},
"markers": [
{"value": "y = 0.1", "display_type": "error dashed", "label": "SLO: 0.1%"}
]
}
},
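{
"definition": {
"type": "timeseries",
"title": "Latency (p50, p95, p99)",
"requests": [
{
"q": "p50:trace.http.request{service:payment-api,env:prod}",
"display_type": "line"
},
{
"q": "p95:trace.http.request{service:payment-api,env:prod}",
"display_type": "line"
},
{
"q": "p99:trace.http.request{service:payment-api,env:prod}",
"display_type": "line"
}
],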
"markers": [
{"value": "y = 0.3", "display_type": "warning dashed", "label": "SLO: 300ms"}
]
}
}
]
}
},
{
"id": 2,
"definition": {
"type": "group",
"title": "USE Metrics (Resource Health)",
"widgets": [
{
"definition": {
"type": "timeseries",
"title": "CPU Utilization (%)",
"requests": [
{
"q": "avg:system.cpu.user{service:payment-api,env:prod} + avg:system.cpu.system{service:payment-api,env:prod}",
"display_type": "line"
}
],
"markers": [
{"value": "y = 80", "display_type": "warning dashed", "label": "80% threshold"}
]
}
},
{
"definition": {
"type": "timeseries",
"title": "Memory Utilization (%)",
"requests": [
{
"q": "100 * (1 - avg:system.mem.pct_usable{service:payment-api,env:prod})",
"display_type": "line"
}
]
}
},
{
"definition": {
"type": "timeseries",
"title": "Database Connection Pool",
"requests": [
{
"q": "avg:payment_api.db.connections.active{env:prod}",
"display_type": "line",
"style": {"palette": "warm"}
},
{
"q": "avg:payment_api.db.connections.max{env:prod}",
"display_type": "line",
"style": {"palette": "cool", "line_type": "dashed"}
}
]
}
}
]
}
}
],
"template_variables": [
{"name": "env", "prefix": "env", "default": "prod"},
{"name": "service", "prefix": "service", "default": "payment-api"}
],
"layout_type": "ordered",
"is_read_only": false,
"notify_list": ["pay-sre@company.com"]
}
Datadog Synthetic Tests
Synthetic tests are the single most effective tool for reducing MTTD. They probe your endpoints from multiple global locations at regular intervals, catching issues before users do.
# Terraform: Datadog Synthetic Tests for critical endpoints
# terraform/synthetic-tests.tf
# API Health Check: runs every 60 seconds from 5 locations
resource "datadog_synthetics_test" "payment_api_health" {
type = "api"
subtype = "http"
name = "[Critical] Payment API Health Check"
message = "@pagerduty-pay-sre @slack-alerts-pay-sre"
tags = ["env:prod", "service:payment-api", "team:pay-sre", "critical:true"]
status = "live"
request_definition {
method = "GET"
url = "https://api.payment.samsung.com/v1/health"
}
assertion {
type = "statusCode"
operator = "is"
target = "200"
}
assertion {
type = "responseTime"
operator = "lessThan"
target = "2000"
}
assertion {
type = "body"
operator = "contains"
target = "\"status\":\"healthy\""
}
options_list {
tick_every = 60 # Run every 60 seconds
min_failure_duration = 120 # Seconds the test must keep failing before alerting (about 2 runs at 60 s)
min_location_failed = 2 # At least 2 locations must fail
retry {
count = 2
interval = 5000 # 5 second retry interval
}
monitor_options {
renotify_interval = 30
}
}
locations = [
"aws:us-west-2",
"aws:us-east-1",
"aws:eu-west-1",
"aws:ap-northeast-1",
"aws:ap-southeast-1"
]
}
# Multi-step API Test: full user journey
resource "datadog_synthetics_test" "payment_flow" {
type = "api"
subtype = "multi" # required for tests that use api_step blocks
name = "[Critical] Payment End-to-End Flow"
message = "@pagerduty-pay-sre @slack-alerts-pay-sre"
tags = ["env:prod", "service:payment-api", "team:pay-sre", "flow:e2e"]
status = "live"
# Step 1: Authenticate
api_step {
name = "Get Auth Token"
subtype = "http"
request_definition {
method = "POST"
url = "https://auth.samsung.com/v1/token"
body = jsonencode({grant_type = "client_credentials"})
}
assertion {
type = "statusCode"
operator = "is"
target = "200"
}
assertion {
type = "body"
operator = "contains"
target = "access_token"
}
# Extract token for next step
extracted_value {
name = "AUTH_TOKEN"
type = "http_body"
parser {
type = "json_path"
value = "$.access_token"
}
}
}
# Step 2: Process payment
api_step {
name = "Process Test Payment"
subtype = "http"
request_definition {
method = "POST"
url = "https://api.payment.samsung.com/v1/payments"
body = jsonencode({
amount = 1.00
currency = "USD"
card_token = "test-token-valid"
})
}
# Headers are set with request_headers on the step, not inside request_definition
request_headers = {
Authorization = "Bearer {{ AUTH_TOKEN }}"
}
assertion {
type = "statusCode"
operator = "is"
target = "200"
}
assertion {
type = "body"
operator = "contains"
target = "\"status\":\"approved\""
}
}
options_list {
tick_every = 300 # Every 5 minutes (less frequent for E2E)
min_failure_duration = 300
min_location_failed = 1
}
locations = ["aws:us-west-2", "aws:us-east-1", "aws:eu-west-1"]
}
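The interaction between `min_failure_duration` and `min_location_failed` is easier to see as code. The sketch below is a simplified model of that alerting rule, not Datadog's actual evaluation logic: the `should_alert` function, its `failing_since` input, and the timestamps are all assumptions made for illustration.

```python
from typing import Optional


def should_alert(
    failing_since: dict[str, Optional[float]],
    now: float,
    min_failure_duration: float,
    min_location_failed: int,
) -> bool:
    """Return True when enough locations have been failing for long enough.

    failing_since maps a location to the time (in seconds) at which it started
    failing continuously, or to None if its latest probe passed.
    """
    sustained = [
        location
        for location, since in failing_since.items()
        if since is not None and now - since >= min_failure_duration
    ]
    return len(sustained) >= min_location_failed


# Invented state at t=1000 s: two locations failing for 150 s and 130 s.
state = {
    "aws:us-west-2": 850.0,
    "aws:us-east-1": 870.0,
    "aws:eu-west-1": None,
}
alerting = should_alert(state, now=1000.0, min_failure_duration=120, min_location_failed=2)
print(alerting)
```

Requiring both a duration and a location count filters out single-probe blips and regional network issues, at the cost of a slightly later page.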
Prometheus + Grafana on Kubernetes
For teams using open-source stacks, Prometheus and Grafana provide powerful metrics collection and visualization. This section covers deployment via the kube-prometheus-stack Helm chart.
# Install kube-prometheus-stack via Helm
# This installs Prometheus, Grafana, Alertmanager, and node-exporter
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# values-override.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    retentionSize: "50GB"
    resources:
      requests:
        memory: "2Gi"
        cpu: "500m"
      limits:
        memory: "4Gi"
        cpu: "2000m"
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    additionalScrapeConfigs:
      # Scrape your application metrics
      - job_name: 'payment-api'
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - payment
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
grafana:
  enabled: true
  adminPassword: "CHANGE_ME"  # Use SealedSecret or ExternalSecret
  persistence:
    enabled: true
    size: 10Gi
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: 'default'
          orgId: 1
          folder: ''
          type: file
          disableDeletion: false
          editable: true
          options:
            path: /var/lib/grafana/dashboards/default
  dashboards:
    default:
      payment-api-red:
        url: https://raw.githubusercontent.com/org/grafana-dashboards/main/payment-api-red.json
alertmanager:
  enabled: true
  config:
    global:
      # Placeholder: Alertmanager does not expand ${...} variables itself.
      # Substitute the real value at deploy time or load it from a secret.
      slack_api_url: '${SLACK_WEBHOOK_URL}'
      pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
    route:
      receiver: 'default'
      group_by: ['alertname', 'severity', 'service']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      routes:
        - match:
            severity: critical
          receiver: 'pagerduty-critical'
          continue: true
        - match:
            severity: warning
          receiver: 'slack-warnings'
    receivers:
      - name: 'default'
        slack_configs:
          - channel: '#alerts-default'
            title: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
            text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
      - name: 'pagerduty-critical'
        pagerduty_configs:
          # routing_key is the Events API v2 key; service_key is the v1 equivalent
          - routing_key: '${PAGERDUTY_INTEGRATION_KEY}'  # placeholder, see note above
            severity: critical
            description: '{{ .CommonAnnotations.summary }}'
      - name: 'slack-warnings'
        slack_configs:
          - channel: '#alerts-warnings'
            send_resolved: true
# Install
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--values values-override.yaml
Alert Routing (PagerDuty, Slack, Email)
Effective alert routing ensures the right people are notified through the right channels at the right urgency level.
Alert routing flow: Prometheus alerts, Datadog monitors, and synthetic tests all feed Alertmanager / PagerDuty, which routes each alert by severity:
| Severity | Action | When | Channels |
|---|---|---|---|
| Critical | Page | Any hour (2 AM) | PagerDuty page + Slack #alerts-critical |
| Warning | Slack | Business hours (9 AM) | Slack #alerts-warnings + email (business hours) |
| Info | Ticket | Daily | Jira ticket + Slack #alerts-info (next business day) |
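The severity rules above amount to a small lookup table. The sketch below expresses them as one, assuming hypothetical target names such as `slack:#alerts-critical`; real routing would live in Alertmanager or PagerDuty configuration, and the fallback to `#alerts-default` mirrors the default receiver in the Helm values earlier in this guide.

```python
# Severity-to-channel mapping taken from the routing rules above.
ROUTES = {
    "critical": ["pagerduty", "slack:#alerts-critical"],
    "warning": ["slack:#alerts-warnings", "email"],
    "info": ["jira", "slack:#alerts-info"],
}
DEFAULT_ROUTE = ["slack:#alerts-default"]


def route_alert(severity: str, business_hours: bool = True) -> list[str]:
    """Return the notification targets for an alert of the given severity."""
    targets = list(ROUTES.get(severity.lower(), DEFAULT_ROUTE))
    # Email is only sent during business hours.
    if not business_hours and "email" in targets:
        targets.remove("email")
    return targets


print(route_alert("critical", business_hours=False))
print(route_alert("warning", business_hours=False))
```

Keeping the mapping in data instead of nested conditionals makes it easy to review in the weekly alert meeting and to extend with a new severity.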
Log Aggregation
Structured Logging Best Practices
# Python structured logging example
import logging
import sys
from datetime import datetime, timezone

from pythonjsonlogger import jsonlogger


class StructuredLogFormatter(jsonlogger.JsonFormatter):
    def add_fields(self, log_record, record, message_dict):
        super().add_fields(log_record, record, message_dict)
        # Use the time the event was created, not the time it was formatted
        log_record['timestamp'] = datetime.fromtimestamp(
            record.created, tz=timezone.utc
        ).isoformat()
        log_record['level'] = record.levelname
        log_record['service'] = 'payment-api'
        log_record['environment'] = 'prod'
        log_record['logger'] = record.name
        # Carry trace context through when the caller supplied it
        if hasattr(record, 'trace_id'):
            log_record['trace_id'] = record.trace_id
        if hasattr(record, 'span_id'):
            log_record['span_id'] = record.span_id


log_handler = logging.StreamHandler(sys.stdout)
formatter = StructuredLogFormatter(
    '%(timestamp)s %(level)s %(service)s %(message)s'
)
log_handler.setFormatter(formatter)
logger = logging.getLogger('payment-api')
logger.addHandler(log_handler)
logger.setLevel(logging.INFO)

# Usage: always log with context
logger.info(
    'Payment processed',
    extra={
        'event': 'payment.processed',
        'payment_id': 'pay_12345',
        'amount': 99.99,
        'currency': 'USD',
        'merchant_id': 'merch_678',
        'latency_ms': 145,
        'trace_id': 'abc123def456'
    }
)
# Example output (JSON, machine-parseable; shown pretty-printed):
# {
#   "timestamp": "2024-01-15T14:30:00+00:00",
#   "level": "INFO",
#   "service": "payment-api",
#   "message": "Payment processed",
#   "event": "payment.processed",
#   "payment_id": "pay_12345",
#   "amount": 99.99,
#   "currency": "USD",
#   "merchant_id": "merch_678",
#   "latency_ms": 145,
#   "trace_id": "abc123def456",
#   "environment": "prod",
#   "logger": "payment-api"
# }
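Passing `trace_id` through `extra` on every call is easy to forget. One possible alternative, sketched here with only the standard library, is a `logging.Filter` that copies the current trace ID from a context variable onto each record. The names `current_trace_id`, `TraceContextFilter`, and `ListHandler` are hypothetical, and a tracer such as ddtrace can already inject these fields when log injection is enabled.

```python
import contextvars
import logging

# Holds the trace ID for the request currently being handled (assumed to be
# set by request middleware; nothing in this sketch sets it automatically).
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class TraceContextFilter(logging.Filter):
    """Attach the active trace ID to every record that lacks one."""

    def filter(self, record: logging.LogRecord) -> bool:
        trace_id = current_trace_id.get()
        if trace_id is not None and not hasattr(record, "trace_id"):
            record.trace_id = trace_id
        return True  # never drop the record


class ListHandler(logging.Handler):
    """Collects records in memory so the demo can inspect them."""

    def __init__(self) -> None:
        super().__init__()
        self.records: list[logging.LogRecord] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


handler = ListHandler()
handler.addFilter(TraceContextFilter())
demo_logger = logging.getLogger("trace-filter-demo")
demo_logger.addHandler(handler)
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False

current_trace_id.set("abc123def456")
demo_logger.info("Payment processed")
print(handler.records[0].trace_id)
```

Because the filter sets a plain attribute on the record, a formatter that checks `hasattr(record, 'trace_id')` picks the value up without any change.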
Datadog Log Pipeline Configuration
# datadog-agent log pipeline configuration
# /etc/datadog-agent/conf.d/payment-api.d/conf.yaml
logs:
  - type: file
    path: /var/log/payment-api/app.json
    service: payment-api
    source: python
    sourcecategory: application
    # Log processing rules (applied by the Agent before logs leave the host)
    log_processing_rules:
      # Forward only errors, warnings, and requests slower than 1 second
      - type: include_at_match
        name: include_only_errors_and_slow_requests
        pattern: '(?i)("level":\s*"(ERROR|WARN|CRITICAL)"|"latency_ms":\s*\d{4,})'
      # Mask card numbers; the placeholder keeps the line valid JSON
      - type: mask_sequences
        name: mask_card_numbers
        replace_placeholder: '"card_number": "[MASKED]"'
        pattern: '"card_number":\s*"\d{13,19}"'
    tags:
      - env:prod
      - team:pay-sre
# For high-volume services, consider log sampling instead of a hard filter:
# index all ERROR/WARN logs and only 1% of INFO logs
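The sampling rule in the closing comment (keep every ERROR/WARN, keep 1% of INFO) can be sketched as a deterministic filter. This is an illustration of the idea, not how Datadog implements index sampling; `should_index`, `sample_key`, and the use of CRC32 as the hash are assumptions.

```python
import zlib

ALWAYS_KEEP = {"ERROR", "WARN", "WARNING", "CRITICAL"}


def should_index(level: str, sample_key: str, info_rate: float = 0.01) -> bool:
    """Decide whether a log line is indexed.

    sample_key should be stable per request (for example a trace ID) so that
    all lines belonging to one sampled request are kept together.
    """
    if level.upper() in ALWAYS_KEEP:
        return True
    # Hash the key into one of 10,000 buckets and keep the lowest fraction.
    bucket = zlib.crc32(sample_key.encode("utf-8")) % 10_000
    return bucket < info_rate * 10_000


print(should_index("ERROR", "trace-1"))
print(should_index("INFO", "trace-1") == should_index("INFO", "trace-1"))
```

Hashing a stable key instead of calling a random number generator means the decision is repeatable, so a sampled request keeps all of its log lines.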
OpenTelemetry Overview
OpenTelemetry (OTel) is an open, vendor-neutral standard for instrumentation that covers traces, metrics, and logs. It is particularly valuable for organizations that want to avoid vendor lock-in or need to send telemetry to multiple backends.
# OpenTelemetry Collector configuration
# otel-collector-config.yaml
# Receives OTLP data and exports to multiple backends
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  resource:
    attributes:
      - key: environment
        value: prod
        action: upsert
      - key: team
        value: pay-sre
        action: upsert
  # Tail-based sampling: keep all errors, all slow requests (> 1 s),
  # and 10% of the remaining traces
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow_requests
        type: latency
        latency: {threshold_ms: 1000}
      - name: probabilistic
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
exporters:
  # Export traces and metrics to Datadog
  datadog:
    api:
      key: ${env:DD_API_KEY}
  # Export traces to Jaeger (for development/debugging). Jaeger accepts OTLP
  # natively; the dedicated jaeger exporter was removed from the Collector.
  otlp/jaeger:
    endpoint: jaeger-collector.monitoring:4317
    tls:
      insecure: true
  # Export metrics to Prometheus (requires the remote-write receiver to be
  # enabled on the Prometheus side)
  prometheusremotewrite:
    endpoint: http://prometheus.monitoring:9090/api/v1/write
  # Also export to S3 for long-term archival
  awss3:
    s3uploader:
      region: us-west-2
      s3_bucket: otel-traces-archive
      s3_prefix: traces
service:
  pipelines:
    traces:
      receivers: [otlp]
      # Sample before batching so that only kept traces are batched
      processors: [resource, tail_sampling, batch]
      exporters: [datadog, otlp/jaeger, awss3]
    metrics:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [prometheusremotewrite, datadog]
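The three `tail_sampling` policies combine as a logical OR: a trace is kept if any policy matches. The sketch below models that decision for a finished trace. It is a simplification under stated assumptions: `Span` and `keep_trace` are hypothetical names, latency is approximated by the longest span instead of the full trace duration, and the hash-based 10% draw only stands in for the Collector's own probabilistic sampler.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Span:
    """Minimal span record (hypothetical shape)."""
    trace_id: str
    duration_ms: float
    is_error: bool


def keep_trace(
    spans: list[Span],
    latency_threshold_ms: float = 1000.0,
    sampling_percentage: float = 10.0,
) -> bool:
    """Decide, once the whole trace has arrived, whether to keep it."""
    if any(span.is_error for span in spans):
        return True  # policy 1: keep every trace that contains an error
    if max(span.duration_ms for span in spans) > latency_threshold_ms:
        return True  # policy 2: keep slow traces
    # Policy 3: keep a fixed percentage of everything else, chosen by hashing
    # the trace ID so that the decision is repeatable.
    bucket = zlib.crc32(spans[0].trace_id.encode("utf-8")) % 100
    return bucket < sampling_percentage


error_trace = [Span("t1", 40.0, False), Span("t1", 12.0, True)]
slow_trace = [Span("t2", 1500.0, False)]
print(keep_trace(error_trace), keep_trace(slow_trace))
```

The decision needs the complete trace, which is why the Collector buffers spans for `decision_wait` before evaluating the policies.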
Alert Fatigue Prevention
Alert fatigue is one of the most common and dangerous problems in operations. When engineers receive too many alerts, they begin to ignore them, which leads to missed real incidents. Based on experience at Samsung, these strategies are effective:
| Strategy | Implementation | Target |
|---|---|---|
| SLO-based alerting | Alert on burn rate and error budget, not thresholds | < 2 pages per day per team |
| Actionable alerts only | Every alert must have a runbook and a defined response | 100% of alerts have runbooks |
| Page on symptoms, not causes | Page when users are affected, not when an internal metric is unusual | > 80% of pages are symptom-based |
| Alert review | Weekly review of all alerts that fired; tune or remove noisy ones | < 5% false positive rate |
| Time-based suppression | Suppress maintenance windows, known issues | 0 pages during planned maintenance |
| Alert grouping | Group related alerts into single notification | 1 notification per incident, not per symptom |
| Graduated severity | Tickets for slow-burn; pages only for fast-burn | < 5 pages per on-call week |
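SLO-based alerting and graduated severity both rest on the burn rate: the observed error ratio divided by the error budget (1 minus the SLO target). The sketch below shows the arithmetic and a two-window check. The `classify` function and its thresholds are assumptions: 14.4 is a commonly cited fast-burn threshold for a 30-day SLO, not a value taken from this guide.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def classify(
    short_window_error_ratio: float,
    long_window_error_ratio: float,
    slo_target: float,
    page_threshold: float = 14.4,   # assumed fast-burn threshold
    ticket_threshold: float = 1.0,  # assumed slow-burn threshold
) -> str:
    """Return 'page', 'ticket', or 'ok' for the current burn rates."""
    short_burn = burn_rate(short_window_error_ratio, slo_target)
    long_burn = burn_rate(long_window_error_ratio, slo_target)
    # Page only when both windows agree, which filters out brief spikes.
    if short_burn >= page_threshold and long_burn >= page_threshold:
        return "page"
    if long_burn >= ticket_threshold:
        return "ticket"
    return "ok"


# Invented error ratios against a 99.9% SLO (0.1% error budget).
print(classify(0.02, 0.02, slo_target=0.999))
print(classify(0.002, 0.002, slo_target=0.999))
print(classify(0.0005, 0.0005, slo_target=0.999))
```

A burn rate of 1 means the budget lasts exactly the SLO period, so anything sustained above 1 deserves at least a ticket, while only a rapid burn justifies waking someone up.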
Weekly Alert Review Process
WEEKLY ALERT REVIEW AGENDA (15 minutes)
Attendees: On-call engineer (last week), SRE lead
Time: Every Monday 10:00 AM
1. ALERTS FIRED (5 min)
Review every alert that fired last week:
| Alert Name | Times Fired | Action Taken | Real Issue? (Y/N/Unsure) |
|---|---|---|---|
| HighCPU | 12 | None | N (baseline) |
| SlowDB | 3 | Tuned | Y |
| DiskFull | 1 | Clean | Y |
2. FALSE POSITIVE ANALYSIS (5 min)
For alerts marked "N":
- Tune threshold? [Action + Owner]
- Remove alert? [Action + Owner]
- Improve runbook? [Action + Owner]
3. MISSING ALERTS (3 min)
Were there issues that SHOULD have alerted but didn't?
- New alert needed? [Action + Owner]
4. METRICS TRACKING (2 min)
- Pages this week: [X]
- Pages per on-call week (rolling 4-week avg): [X]
- Target: < 5 pages/week
- Action if trending up: ____________________
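The two numbers tracked in this review, false positive rate and the rolling four-week page average, are simple to compute from the review table. The sketch below uses the three example alerts from the agenda; the `AlertRow` type and the weekly page counts are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class AlertRow:
    """One row of the weekly review table."""
    name: str
    times_fired: int
    real_issue: bool


def false_positive_rate(rows: list[AlertRow]) -> float:
    """Fraction of alert firings that did not correspond to a real issue."""
    total = sum(row.times_fired for row in rows)
    false_firings = sum(row.times_fired for row in rows if not row.real_issue)
    return false_firings / total if total else 0.0


def rolling_average(pages_per_week: list[int], weeks: int = 4) -> float:
    """Average pages per week over the most recent `weeks` entries."""
    recent = pages_per_week[-weeks:]
    return sum(recent) / len(recent) if recent else 0.0


review = [
    AlertRow("HighCPU", 12, real_issue=False),
    AlertRow("SlowDB", 3, real_issue=True),
    AlertRow("DiskFull", 1, real_issue=True),
]
fp_rate = false_positive_rate(review)
avg_pages = rolling_average([9, 8, 6, 5, 3])  # invented weekly page counts
print(fp_rate, avg_pages)
```

In this example HighCPU alone pushes the false positive rate to 75%, far above the 5% target, which is exactly the kind of alert the review should tune or remove.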
MTTD Reduction Strategies
Mean Time To Detect (MTTD) is the metric most directly controllable by the observability team. At Samsung, we reduced MTTD by 50% through these specific improvements:
- Synthetic monitors on critical user journeys: 60-second probes from 5 global locations caught issues before users reported them. This alone accounted for 60% of the MTTD improvement.
- Anomaly detection on error rate: Datadog's anomaly detection monitors flagged gradual error rate increases that threshold-based alerts missed. Static thresholds work for sudden failures; anomaly detection catches slow degradation.
- Correlation dashboards: A single "service health" dashboard per service showing RED metrics, error budget, and deployment markers. On-call engineers could triage in under 2 minutes.
- Deployment correlation: Automatic markers on all dashboards showing when deployments occurred. 80% of incidents at Samsung were correlated with a deployment within 30 minutes.
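Deployment correlation comes down to a time-window lookup: for each incident, find the most recent deployment that finished shortly before it started. A minimal sketch follows, assuming a hypothetical `correlated_deployment` helper, the 30-minute window mentioned above, and invented timestamps.

```python
from datetime import datetime, timedelta
from typing import Optional


def correlated_deployment(
    incident_start: datetime,
    deployments: list[datetime],
    window: timedelta = timedelta(minutes=30),
) -> Optional[datetime]:
    """Return the latest deployment within `window` before the incident."""
    candidates = [
        deployed_at
        for deployed_at in deployments
        if timedelta(0) <= incident_start - deployed_at <= window
    ]
    return max(candidates) if candidates else None


# Invented timeline for one day.
deploys = [
    datetime(2024, 1, 15, 13, 0),
    datetime(2024, 1, 15, 14, 5),
    datetime(2024, 1, 15, 14, 30),
]
suspect = correlated_deployment(datetime(2024, 1, 15, 14, 20), deploys)
print(suspect)
```

Deployments that happen after the incident starts are excluded on purpose, since they cannot have caused it.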
Related Topics
- SLO / SLI / SLA: Defining reliability targets and SLO-based alerting
- Incident Management: Responding to alerts and incidents
- Reliability Engineering: Building observable systems
Last updated: June 2026 | Author: SRE Team