Incident Management

Effective incident management minimizes service disruption and restores normal operations quickly. This guide covers the complete incident lifecycle from detection to resolution, with severity classification, on-call design, communication templates, and production-ready runbooks.

Incident Lifecycle

Every incident follows a predictable lifecycle. Understanding these phases enables teams to build structured processes that reduce MTTR and improve outcomes.

┌──────────┐    ┌──────────┐    ┌───────────┐    ┌───────────┐    ┌────────────┐
│ DETECTION│───▶│  TRIAGE  │───▶│ ESCALATION│───▶│ RESOLUTION│───▶│POST-MORTEM │
└──────────┘    └──────────┘    └───────────┘    └───────────┘    └────────────┘
     │               │                │                 │                  │
     ▼               ▼                ▼                 ▼                  ▼
 Alert fires    Determine     Engage right      Restore service    Document,
 Monitoring     severity      people with       Apply fix          learn,
 User report    Assess        needed skills     Apply workaround   improve
 Synthetic test  impact        Communicate      Verify recovery
                Establish     to stakeholders
                IC role

Target Times:
  Detection:  Automated (MTTD target: < 2 minutes)
  Triage:     < 5 minutes from detection
  Escalation: < 10 minutes from detection (if needed)
  Resolution: Varies by severity (SEV1 target: < 1 hour)
  Post-mortem: Within 48 hours of resolution
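
The target times above can be checked mechanically once incident timestamps are recorded. The following is a minimal sketch, not a definitive implementation: the field names (started_at, detected_at, triaged_at, escalated_at, resolved_at, postmortem_at) are hypothetical, and the resolution target is left out because it varies by severity.

```python
from datetime import datetime, timedelta

# Targets taken from the "Target Times" list above.
TARGETS = {
    "detection": timedelta(minutes=2),    # start -> detected
    "triage": timedelta(minutes=5),       # detected -> triaged
    "escalation": timedelta(minutes=10),  # detected -> escalated
    "postmortem": timedelta(hours=48),    # resolved -> post-mortem held
}

# Hypothetical field names marking the start and end of each phase.
SPANS = {
    "detection": ("started_at", "detected_at"),
    "triage": ("detected_at", "triaged_at"),
    "escalation": ("detected_at", "escalated_at"),
    "postmortem": ("resolved_at", "postmortem_at"),
}


def missed_targets(incident):
    """Return the names of the lifecycle targets this incident missed.

    Phases whose timestamps are absent (for example, an incident that
    never needed escalation) are skipped.
    """
    missed = []
    for name, (begin, end) in SPANS.items():
        if begin in incident and end in incident:
            if incident[end] - incident[begin] > TARGETS[name]:
                missed.append(name)
    return missed


t0 = datetime(2024, 1, 1, 12, 0)
example = {
    "started_at": t0,
    "detected_at": t0 + timedelta(minutes=1),
    "triaged_at": t0 + timedelta(minutes=9),  # 8 minutes after detection
    "resolved_at": t0 + timedelta(minutes=40),
    "postmortem_at": t0 + timedelta(hours=30),
}
print(missed_targets(example))  # ['triage']
```

Running a check like this over each quarter's incidents shows which phase most often exceeds its target.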

Phase 1: Detection

Incidents are detected through automated alerting, synthetic monitoring, user reports, or anomaly detection. The detection phase ends when a responder acknowledges the alert and begins triage. Automated detection is preferred because it removes the delay of waiting for users to report a problem.

MTTD Reduction Strategy: At Samsung, we reduced MTTD by 50% by implementing Datadog synthetic monitors on critical user journeys (login, payment, device registration). These monitors run every 60 seconds from multiple global locations and page immediately on failure, catching issues before users report them.

Phase 2: Triage

The responder assesses severity, determines impact scope, and decides whether to activate the incident response process. If the issue is a SEV1 or SEV2, an Incident Commander (IC) is appointed. Triage should complete within 5 minutes of detection.

Phase 3: Escalation

The IC engages additional responders based on the escalation policy. This includes paging subject matter experts, opening a war room bridge, and notifying stakeholders. Communication channels are established and maintained throughout the incident.

Phase 4: Resolution

Responders work to restore service. The priority is always mitigation over root cause: get the service working first, then investigate. A fix can be a rollback, traffic shift, configuration change, or code patch. Verify that service is restored before closing the incident.

Phase 5: Post-Mortem

A blameless post-mortem is conducted to understand what happened, why, and how to prevent recurrence. All SEV1 and SEV2 incidents require post-mortems within 48 hours of resolution. See the Post-Mortem Culture guide for templates and facilitation techniques.

Severity Levels (SEV1–SEV4)

Consistent severity classification ensures the right level of response for every incident. These definitions should be adopted organization-wide.

SEV1 (Critical)
  Definition: Complete service outage; critical functionality unusable; significant revenue impact; data loss or security breach
  Response:   All hands; IC appointed; war room opened; executive notified within 15 min; page all on-call engineers
  Examples:   Payment processing completely down; all users cannot authenticate; data corruption affecting production

SEV2 (Major)
  Definition: Significant degradation; major feature impaired; partial outage affecting a large user segment
  Response:   IC appointed; war room optional; engineering manager notified; page primary on-call + relevant SMEs
  Examples:   Latency >5s for 50% of requests; checkout failure rate >10%; mobile app crashes on launch for Android users

SEV3 (Minor)
  Definition: Localized or partial issue; workaround available; limited user impact
  Response:   Primary on-call handles; ticket created; no war room; fix during business hours acceptable
  Examples:   Admin dashboard slow; one region experiencing elevated errors; non-critical feature degraded

SEV4 (Informational)
  Definition: No user impact; potential issue detected; proactive maintenance needed
  Response:   Ticket created; scheduled for next sprint; no immediate action required
  Examples:   Certificate expiring in 30 days; disk usage >80%; single node down in multi-node cluster

Severity Escalation: If in doubt, escalate. It is always better to downgrade a SEV1 to SEV2 than to realize too late that a SEV2 should have been SEV1. The IC has the authority to change severity at any time based on new information.
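
The severity definitions and the "if in doubt, escalate" rule can be encoded so that tooling applies them consistently. This is a sketch under stated assumptions: the Severity enum, the RESPONSE dictionary keys, and the resolve_severity helper are illustrative names, and the response values are condensed from the severity definitions above.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


# Response expectations condensed from the severity definitions above.
RESPONSE = {
    Severity.SEV1: {"ic_required": True, "war_room": "required",
                    "responders": "all on-call engineers"},
    Severity.SEV2: {"ic_required": True, "war_room": "optional",
                    "responders": "primary on-call + relevant SMEs"},
    Severity.SEV3: {"ic_required": False, "war_room": "none",
                    "responders": "primary on-call"},
    Severity.SEV4: {"ic_required": False, "war_room": "none",
                    "responders": "ticket only"},
}


def resolve_severity(*assessments):
    """Apply "if in doubt, escalate": keep the most severe assessment."""
    if not assessments:
        raise ValueError("at least one assessment is required")
    return min(assessments)  # SEV1 has the lowest number


chosen = resolve_severity(Severity.SEV2, Severity.SEV1)
print(chosen.name, RESPONSE[chosen]["war_room"])  # SEV1 required
```

Using an integer-backed enum keeps comparisons simple: a lower number always means a more severe incident.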

On-Call Rotation Design

Well-designed on-call rotations prevent burnout while ensuring adequate coverage. Based on experience directing a 6-person SRE team, these are the principles for sustainable on-call.

Design Principles

  • Primary + Secondary: Always have a secondary on-call who is paged if the primary doesn't acknowledge within 5 minutes.
  • Follow-the-sun: For global services, have on-call engineers in multiple time zones (APAC, EMEA, Americas) to reduce overnight pages.
  • Rotation length: One-week rotations are standard. Avoid rotations longer than one week, because burnout increases significantly after 7 days.
  • Post-incident rest: If an engineer handles a SEV1 overnight, they should not be expected in standup the next morning. Protect sleep.
  • Handoff meeting: A brief (10-minute) handoff between outgoing and incoming on-call at rotation boundaries captures ongoing issues.

Sample On-Call Schedule (6-Person Team)

Week            Primary On-Call   Secondary On-Call   Shadow (optional)
Jan 1–7         Alice (SRE)       Bob (SRE)           -
Jan 8–14        Bob (SRE)         Charlie (SRE)       -
Jan 15–21       Charlie (SRE)     Diana (SRE)         Eve (new hire)
Jan 22–28       Diana (SRE)       Eve (SRE)           -
Jan 29–Feb 4    Eve (SRE)         Frank (SRE)         -
Feb 5–11        Frank (SRE)       Alice (SRE)         -

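This schedule gives each engineer a 5-week gap between primary on-call weeks. With 6 engineers on a weekly rotation, each engineer is primary for approximately 8–9 weeks per year (52 / 6 ≈ 8.7) and secondary for the same number. Because the secondary is next in line for primary, each engineer's secondary week falls immediately before their primary week, which leaves 4 consecutive weeks with no on-call duty.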
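
The sample schedule follows a simple pattern: the secondary is always the next engineer in the list. A minimal sketch that regenerates it, assuming the rotation starts on Monday, 2024-01-01 (the table itself does not state a year) and using an illustrative build_rotation helper:

```python
from datetime import date, timedelta

ENGINEERS = ["Alice", "Bob", "Charlie", "Diana", "Eve", "Frank"]


def build_rotation(engineers, first_day, weeks):
    """Yield (week_start, week_end, primary, secondary) for a weekly rotation.

    The secondary is the next engineer in the list, so each person serves
    as secondary in the week before their primary week.
    """
    n = len(engineers)
    for week in range(weeks):
        start = first_day + timedelta(weeks=week)
        end = start + timedelta(days=6)
        yield start, end, engineers[week % n], engineers[(week + 1) % n]


schedule = list(build_rotation(ENGINEERS, date(2024, 1, 1), 6))
for start, end, primary, secondary in schedule:
    print(f"{start:%b %d} to {end:%b %d}: primary={primary}, secondary={secondary}")

# Primary weeks per engineer per year on a 6-person weekly rotation.
primary_weeks_per_year = 52 / len(ENGINEERS)
print(round(primary_weeks_per_year, 1))  # 8.7
```

Changing the team size in ENGINEERS shows directly how the on-call load per person changes.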

PagerDuty Schedule Configuration

# PagerDuty Terraform configuration for on-call rotation
# File: terraform/pagerduty-oncall.tf

# Create the team
resource "pagerduty_team" "sre" {
  name = "SRE Team"
}

# Team members
resource "pagerduty_user" "alice" {
  name  = "Alice Smith"
  email = "alice.smith@company.com"
  role  = "user"
}

# ... (define other team members similarly)

# Create the primary on-call schedule
# Handoff time: Monday 9:00 AM Pacific (2024-01-01 is a Monday)
resource "pagerduty_schedule" "sre_primary" {
  name      = "SRE Primary On-Call"
  time_zone = "America/Los_Angeles"

  layer {
    name                         = "Primary Layer"
    start                        = "2024-01-01T09:00:00-08:00"
    rotation_virtual_start       = "2024-01-01T09:00:00-08:00"
    rotation_turn_length_seconds = 604800  # 7 days

    users = [
      pagerduty_user.alice.id,
      pagerduty_user.bob.id,
      pagerduty_user.charlie.id,
      pagerduty_user.diana.id,
      pagerduty_user.eve.id,
      pagerduty_user.frank.id,
    ]
  }
}

# Create the secondary on-call schedule: the same rotation shifted by one
# engineer. It is a separate schedule because the layers inside one schedule
# resolve to a single on-call person, so a second layer would override the
# primary layer instead of adding a backup.
resource "pagerduty_schedule" "sre_secondary" {
  name      = "SRE Secondary On-Call"
  time_zone = "America/Los_Angeles"

  layer {
    name                         = "Secondary Layer"
    start                        = "2024-01-01T09:00:00-08:00"
    rotation_virtual_start       = "2024-01-01T09:00:00-08:00"
    rotation_turn_length_seconds = 604800  # 7 days

    users = [
      pagerduty_user.bob.id,
      pagerduty_user.charlie.id,
      pagerduty_user.diana.id,
      pagerduty_user.eve.id,
      pagerduty_user.frank.id,
      pagerduty_user.alice.id,
    ]
  }
}

# Team offsite (2024-07-15 through 2024-07-19): a layer cannot express
# "nobody on call". Every layer needs a users list, and a restriction block
# only limits the hours in which that layer's users are on call. Arrange
# offsite coverage with schedule overrides instead.

# Create escalation policy
# Each escalation_delay_in_minutes is the time the rule's target has to
# acknowledge before the next rule fires, so page times are cumulative.
resource "pagerduty_escalation_policy" "sre_critical" {
  name      = "SRE Critical Escalation"
  num_loops = 3  # Repeat escalation cycle 3 times before giving up

  # T+0: page the primary on-call
  rule {
    escalation_delay_in_minutes = 5

    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.sre_primary.id
    }
  }

  # T+5: page the secondary on-call if the primary has not acknowledged
  rule {
    escalation_delay_in_minutes = 10

    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.sre_secondary.id
    }
  }

  # T+15: escalate to engineering manager
  rule {
    escalation_delay_in_minutes = 15

    target {
      type = "user_reference"
      id   = pagerduty_user.eng_manager.id
    }
  }

  # T+30: escalate to director
  rule {
    escalation_delay_in_minutes = 15

    target {
      type = "user_reference"
      id   = pagerduty_user.eng_director.id
    }
  }
}
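
Escalation delays are easy to misread because each delay belongs to the rule it is declared on: it is the time that rule's target has to acknowledge before the next rule fires. A small sketch of that arithmetic, using hypothetical target names and delay values rather than the ones in the Terraform configuration:

```python
def escalation_timeline(rules):
    """Return (minutes_after_trigger, target) for each rule, in order.

    `rules` is a list of (target, escalation_delay_in_minutes) pairs. A
    target is paged once the delays of all earlier rules have elapsed
    without an acknowledgment, so page times are a running sum.
    """
    timeline = []
    elapsed = 0
    for target, delay in rules:
        timeline.append((elapsed, target))
        elapsed += delay
    return timeline


# Illustrative values only; substitute the delays from your own policy.
rules = [
    ("first responder", 5),
    ("backup", 10),
    ("manager", 15),
]
for minute, target in escalation_timeline(rules):
    print(f"T+{minute:>2} min: page {target}")
# T+ 0 min: page first responder
# T+ 5 min: page backup
# T+15 min: page manager
```

Printing the timeline before applying a policy change is a quick way to confirm that the comments in the configuration match the actual page times.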

Incident Commander Role and Responsibilities

The Incident Commander (IC) is the single point of coordination during an incident. The IC does not need to be the person fixing the problem. Their role is to coordinate, communicate, and make decisions.

IC Responsibilities

Triage (0-5 min)
  • Declare the incident and assign severity
  • Appoint IC (self-appoint if first responder)
  • Open incident channel/war room
  • Send initial notification

Escalation (5-15 min)
  • Page additional responders as needed
  • Establish communication channels
  • Notify stakeholders per severity
  • Begin timeline documentation

Resolution (ongoing)
  • Coordinate between responders
  • Approve mitigation strategies
  • Make go/no-go decisions on risky fixes
  • Communicate status updates every 15-30 min
  • Ensure timeline is being updated

Closeout
  • Confirm service is fully restored
  • Send all-clear notification
  • Schedule post-mortem
  • Close incident channel after 24h
  • Hand off to post-mortem owner

IC Rotation: The IC role rotates every 2 hours during prolonged incidents to prevent decision fatigue. The outgoing IC briefs the incoming IC on current state, active workstreams, and key decisions made.
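
For a long-running incident, the 2-hour IC rotation can be turned into concrete handoff times. A minimal sketch, assuming a hypothetical ic_handoff_times helper that is given the incident start and the current time:

```python
from datetime import datetime, timedelta


def ic_handoff_times(incident_start, now, shift=timedelta(hours=2)):
    """List the IC handoff times between incident start and `now`."""
    handoffs = []
    next_handoff = incident_start + shift
    while next_handoff <= now:
        handoffs.append(next_handoff)
        next_handoff += shift
    return handoffs


start = datetime(2024, 1, 1, 22, 0)
now = start + timedelta(hours=5)
print([t.strftime("%H:%M") for t in ic_handoff_times(start, now)])
# ['00:00', '02:00']
```

A chat bot or scheduler could use the same calculation to remind the current IC a few minutes before each handoff.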

Communication Templates

Status Page Update Template

─── INCIDENT OPENED ──────────────────────────────────────────────

[Investigating] Payment Processing Degradation

We are investigating reports of elevated error rates on the
Payment Processing API. Some users may experience failed
transactions. We will provide updates every 15 minutes.

Started: [TIMESTAMP UTC]
Severity: SEV[1-4]
Impact: [BRIEF DESCRIPTION]

─── STATUS UPDATE (every 15-30 min) ────────────────────────────

[Update X] Payment Processing Degradation

We have identified the root cause as [CAUSE]. The team is
[CURRENT ACTION]. We estimate [TIMELINE] for full resolution.

Next update: [TIME]

─── RESOLVED ────────────────────────────────────────────────────

[Resolved] Payment Processing Degradation

The issue has been resolved. All payment processing is now
operating normally.

Root cause: [BRIEF]
Resolution: [BRIEF]
Duration: [X minutes]
Impact: [SUMMARY]

A post-mortem will be published within 48 hours.

Stakeholder Notification Template (SEV1)

Subject: [SEV1] Incident DECLARED - [Service Name] - [YYYYMMDD-HHMM]

SEV1 Incident declared for [Service Name].

IMPACT:    [Service] is experiencing [outage/degradation].
           [X]% of users affected. [Revenue impact if known].

STARTED:   [TIMESTAMP UTC]
STATUS:    Investigating

RESPONDERS: [IC Name] (IC), [Engineer 1], [Engineer 2]
WAR ROOM:  [Zoom/Meet link]
CHANNEL:   [Slack channel]

CUSTOMER IMPACT: [Yes/No; if yes, status page updated]
STATUS PAGE: [link]

NEXT UPDATE: [TIME UTC]

---
This is an automated notification from the Incident Management System.
Reply STOP to disable notifications for this incident.
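
Filling templates by hand during a SEV1 invites mistakes, so the subject line is a good candidate for automation. The sketch below uses Python's string.Template; the build_subject helper is illustrative, and rendering the timestamp in UTC is an assumption based on the UTC timestamps used elsewhere in the template.

```python
from datetime import datetime, timezone
from string import Template

# Mirrors the subject line of the stakeholder notification template.
SUBJECT = Template("[SEV$sev] Incident DECLARED - $service - $stamp")


def build_subject(sev, service, declared_at):
    """Fill the subject line, with the timestamp as YYYYMMDD-HHMM in UTC."""
    stamp = declared_at.astimezone(timezone.utc).strftime("%Y%m%d-%H%M")
    return SUBJECT.substitute(sev=sev, service=service, stamp=stamp)


declared = datetime(2024, 3, 5, 14, 7, tzinfo=timezone.utc)
print(build_subject(1, "Payment API", declared))
# [SEV1] Incident DECLARED - Payment API - 20240305-1407
```

Template.substitute raises KeyError when a placeholder is missing, which is preferable to sending a notification with an unfilled field.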

War Room / Bridge Procedures

For SEV1 incidents, open a war room (video bridge) immediately. The war room is the central coordination point for all incident response activity.

War Room Rules

  1. IC speaks first: The IC opens the bridge, establishes the agenda, and assigns roles.
  2. One conversation at a time: Side conversations happen in Slack, not on the bridge.
  3. Declare workstreams: Responders declare what they are investigating to prevent duplicate work.
  4. No blame: Focus on fixing the problem, not assigning fault. Blameless culture is critical during incidents.
  5. IC approves changes: No production changes during a SEV1 without explicit IC approval.
  6. Take notes: Someone (designated scribe) captures timeline, decisions, and actions in real-time.
  7. Rotate every 2 hours: IC and key responders rotate to prevent fatigue.

War Room Checklist

WAR ROOM OPEN CHECKLIST
Incident: [ID] - [DESCRIPTION]
IC: [NAME], opened at [TIME]

□ Bridge opened and link shared in #incidents
□ IC has confirmed role and declared severity
□ Scribe assigned: [NAME]
□ Primary responders on bridge: [NAMES]
□ Communication channels established
  □ Slack channel: #[channel-name]
  □ Status page updated (if customer-facing)
  □ Stakeholders notified per escalation policy
□ Timeline document created: [DOC LINK]
□ Initial status update sent

EVERY 15 MINUTES:
□ Status update posted to Slack
□ Status page updated (if public)
□ Timeline updated with new findings

RESOLUTION:
□ Service fully restored and verified
□ All-clear communicated
□ War room closed: [TIME]
□ Post-mortem scheduled: [DATE/TIME]

Complete Incident Response Runbook Template

Runbooks are step-by-step procedures for responding to specific alert types. Every alert should have a corresponding runbook. Store runbooks in a searchable repository (Confluence, GitHub, wiki).

═══════════════════════════════════════════════════════════════════
                    INCIDENT RUNBOOK
                    Template v2.0
═══════════════════════════════════════════════════════════════════

Runbook ID: RB-[SERVICE]-[ALERT]-###
Alert: [Alert name exactly as it appears in PagerDuty/Datadog]
Service: [Service name]
Severity: [SEV level typically triggered]
Owner: [Team / On-call rotation]
Last Updated: [DATE]

─── ALERT DESCRIPTION ────────────────────────────────────────────

[Describe what this alert means in plain language. What system
condition triggers it? What does it indicate about service health?]

─── IMPACT ───────────────────────────────────────────────────────

User Impact: [What does the user see when this alert fires?]
Business Impact: [Revenue, SLA, reputation impact]
Scope: [How many users/regions/features affected?]

─── INITIAL ASSESSMENT (do this first, within 2 minutes) ─────────

1. Check service health dashboard:
   [Direct link to Datadog/Grafana dashboard]

2. Check for correlated alerts:
   [Link to alert history or correlated alerts dashboard]

3. Check recent deployments:
   $ kubectl rollout history deployment/[name] -n [namespace]
   [Or link to deployment tracking]

4. Check error logs (last 15 minutes):
   Datadog: service:[service] status:error (time range: Past 15 Minutes)
   [Or equivalent log query]

─── COMMON CAUSES (check in order) ───────────────────────────────

Cause 1: [Most common cause]
  Symptoms: [How to recognize this cause]
  Verification: [Command or check to confirm]
  Fix: [Step-by-step fix procedure]

Cause 2: [Second most common]
  Symptoms:
  Verification:
  Fix:

Cause 3: [Third most common]
  Symptoms:
  Verification:
  Fix:

─── MITIGATION PROCEDURES ────────────────────────────────────────

If the service is severely degraded, apply these mitigations
(in order of speed/safety):

1. ROLLBACK (fastest, if recent deploy):
   $ ./scripts/rollback.sh [service] [previous-version]
   [Or kubectl rollout undo deployment/[name]]

2. TRAFFIC SHIFT (if multi-region):
   $ ./scripts/drain-region.sh [problem-region]
   $ ./scripts/increase-capacity.sh [healthy-region]

3. CIRCUIT BREAKER (if downstream dependency failing):
   [Procedure to open circuit breaker]

4. SCALE UP (if capacity issue):
   $ kubectl scale deployment [name] --replicas=[N] -n [namespace]

─── VERIFICATION ─────────────────────────────────────────────────

After applying a fix:
1. Check error rate has returned to baseline
2. Check latency is within SLO
3. Run smoke test: [link to smoke tests or command]
4. Monitor for 15 minutes before declaring resolved

─── ESCALATION ───────────────────────────────────────────────────

Escalate to [TEAM/ROLE] if:
  • Fix does not resolve within 15 minutes
  • Root cause is unclear after 30 minutes
  • Impact is larger than documented scope
  • Data integrity concerns exist

Escalation path:
  Primary: [On-call engineer]
  Secondary: [Engineering manager], page after 15 min
  Tertiary: [Director], page after 30 min

─── RELATED RUNBOOKS ─────────────────────────────────────────────

• RB-[SERVICE]-002: [Related runbook]
• RB-[SERVICE]-003: [Related runbook]
• [Link to service architecture diagram]
• [Link to dependency runbooks]

─── CHANGE LOG ───────────────────────────────────────────────────

YYYY-MM-DD: v1.0 - Initial runbook (Author)
YYYY-MM-DD: v1.1 - Added Cause 3 after INC-2024-0015 (Author)
YYYY-MM-DD: v2.0 - Major rewrite after Q1 game day (Author)
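
The verification step in the template ("monitor for 15 minutes before declaring resolved") can be expressed as a simple check over recent error-rate samples. This is a sketch under assumptions: one sample per minute, oldest first, and a hypothetical is_recovered helper. How the samples are fetched from your monitoring system is left out.

```python
def is_recovered(error_rates, baseline, window=15):
    """True when the last `window` samples are all at or below `baseline`.

    `error_rates` holds one sample per minute, oldest first.
    """
    if len(error_rates) < window:
        return False  # not enough observation time yet
    return all(rate <= baseline for rate in error_rates[-window:])


samples = [0.08, 0.05] + [0.004] * 15  # fix applied after minute 2
print(is_recovered(samples, baseline=0.005))       # True
print(is_recovered(samples[:-3], baseline=0.005))  # False
```

Requiring the full window guards against declaring an incident resolved during a brief lull.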

Escalation Policy Example

═══════════════════════════════════════════════════════════════════
                    ESCALATION POLICY
                    Payment Platform Team
═══════════════════════════════════════════════════════════════════

This escalation policy applies to all production services owned by
the Payment Platform team.

─── SEV1 ESCALATION ──────────────────────────────────────────────

T+0 min:    Primary on-call paged
            → Acknowledge within 5 minutes

T+5 min:    If unacknowledged → Secondary on-call paged
            → War room opened automatically

T+10 min:   If still unacknowledged → Engineering Manager paged
            → All SRE team members notified

T+15 min:   VP Engineering notified (email + Slack)
            → Customer support activated (if user-facing)
            → Status page updated

T+30 min:   CTO notified
            → Executive briefing prepared

T+60 min:   If unresolved → External communications team engaged
            → Legal notified (if compliance/regulatory impact)

─── SEV2 ESCALATION ──────────────────────────────────────────────

T+0 min:    Primary on-call paged

T+10 min:   If unacknowledged → Secondary on-call paged

T+20 min:   If unacknowledged → Engineering Manager paged

T+30 min:   IC appointed (if not already done by responders)

─── SEV3 ESCALATION ──────────────────────────────────────────────

T+0:        Ticket created, assigned to on-call
T+4 hours:  If untouched → Reassigned to secondary
T+24 hours: If unresolved → Escalated to engineering manager

─── ESCALATION BYPASS ────────────────────────────────────────────

Any responder may immediately escalate to Engineering Manager
or VP if:
  • Data loss is suspected
  • Security breach indicators present
  • Customer-facing impact exceeds $100K/hour
  • Regulatory compliance at risk
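
The bypass conditions above are simple enough to encode as a single predicate, for example in a chat command that pages the Engineering Manager directly. A minimal sketch with a hypothetical may_bypass helper; only the $100K/hour threshold comes from the policy.

```python
BYPASS_IMPACT_THRESHOLD = 100_000  # dollars per hour, from the policy above


def may_bypass(data_loss=False, security_breach=False,
               impact_per_hour=0, regulatory_risk=False):
    """True if any escalation-bypass condition in the policy is met."""
    return (data_loss or security_breach or regulatory_risk
            or impact_per_hour > BYPASS_IMPACT_THRESHOLD)


print(may_bypass(impact_per_hour=250_000))  # True
print(may_bypass(impact_per_hour=20_000))   # False
```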

Incident Tracking with Jira and Opsgenie

Every incident should have a corresponding tracking ticket for action item follow-up. Integrate PagerDuty/Opsgenie with Jira for automatic ticket creation.

# Jira incident ticket template (created automatically from PagerDuty)

Project: INC (Incidents)
Issue Type: Incident
Summary: [SEV{X}] [Service] - [Brief description]
Priority: [Critical/High/Medium/Low based on severity]
Labels: incident, sev{X}, [service-name]

Description:
  Incident ID: INC-2024-XXXX
  Started: [TIMESTAMP]
  Resolved: [TIMESTAMP]
  Duration: [X minutes]
  Severity: SEV[X]
  Service: [NAME]
  
  Impact:
  - Availability: [X%]
  - Users affected: [N or "unknown"]
  - Revenue impact: [$X or "unknown"]
  
  Root Cause: [To be filled in post-mortem]
  
  Post-Mortem: [Link to document]
  
  Action Items:
  1. [ ] [Description] - Owner: [Name] - Due: [Date]
  2. [ ] [Description] - Owner: [Name] - Due: [Date]

Linked Issues:
  - Caused by: [Jira ticket if known bug]
  - Related: [Other incident tickets]

# Jira Automation Rule:
# When incident created:
#   1. Set priority based on severity label
#   2. Assign to current on-call engineer
#   3. Add watchers: engineering manager, SRE lead
#   4. Link to service component in JSM
#   5. Post to #incidents Slack channel
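
Step 1 of the automation rule (set priority based on the severity label) amounts to a lookup. The sketch below shows that logic outside Jira; the mapping of sev1 through sev4 onto Critical/High/Medium/Low follows the order listed in the template and is an assumption, as is the priority_from_labels helper.

```python
# Assumed mapping, following the Critical/High/Medium/Low order in the template.
PRIORITY_BY_SEVERITY = {
    "sev1": "Critical",
    "sev2": "High",
    "sev3": "Medium",
    "sev4": "Low",
}


def priority_from_labels(labels):
    """Return the ticket priority for the most severe sev label present."""
    found = sorted(
        label.lower() for label in labels
        if label.lower() in PRIORITY_BY_SEVERITY
    )
    if not found:
        return None  # no severity label; leave the priority for a human to set
    return PRIORITY_BY_SEVERITY[found[0]]


print(priority_from_labels(["incident", "sev2", "payment-api"]))  # High
```

If a ticket carries two severity labels after a reclassification, the more severe one wins, which matches the "if in doubt, escalate" rule.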

Game Days and Chaos Engineering

Game days are scheduled exercises where teams deliberately inject failures into production or staging systems to validate resilience and practice response procedures. They are one of the most reliable ways to confirm that your incident response process works before a real incident tests it.

Game Day Planning Template

═══════════════════════════════════════════════════════════════════
                        GAME DAY PLAN
═══════════════════════════════════════════════════════════════════

Game Day ID: GD-2024-Q1-001
Date: [DATE]
Time: [START] – [END] (2 hours)
Location: [Video link / Room]
Scope: [Staging / Production (canary only) / Full Production]

─── OBJECTIVES ───────────────────────────────────────────────────

1. Validate Payment API auto-failover to secondary region
2. Verify detection meets the MTTD target (< 2 min) and on-call acknowledgment meets the MTTA target (< 5 min)
3. Test IC handoff procedure
4. Confirm runbook accuracy for region failover

─── SCENARIOS ────────────────────────────────────────────────────

Scenario 1: Region Failure Simulation (45 min)
  Inject:     Terminate all pods in us-west-2 for payment-api
  Expected:   Traffic auto-routes to us-east-1
  Success:    < 30s failover, no 5xx errors
  Observer:   [Name] (monitors dashboards, does not intervene)

Scenario 2: Database Primary Failure (45 min)
  Inject:     Failover database primary to replica
  Expected:   Automatic promotion, brief read-only period
  Success:    < 60s read-only window, no data loss
  Observer:   [Name]

─── PARTICIPANTS ─────────────────────────────────────────────────

IC (practice):      [Name]
Primary Responder:  [Name]
SRE Observer:       [Name] (evaluates response)
Engineering Mgr:    [Name] (observes, available if needed)

─── SAFETY GUARDRAILS ────────────────────────────────────────────

□ Abort criteria defined: [What triggers immediate stop]
□ Rollback plan ready: [How to restore if experiment goes wrong]
□ Blast radius limited: [Only canary users affected]
□ Observer has kill switch access
□ Customer support on standby

─── SCHEDULE ─────────────────────────────────────────────────────

T+0:   Briefing and scenario reveal
T+15:  Injection 1: observe response
T+45:  Debrief Injection 1
T+60:  Injection 2: observe response
T+90:  Debrief Injection 2
T+105: Retrospective and action items
T+120: Close

─── POST-GAME-DAY ────────────────────────────────────────────────

Action items captured in: [Jira epic / Board]
Retro document: [Link]
Process improvements identified: [List]
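
Recording scenario results against the stated success criteria keeps the debrief objective. A minimal sketch for Scenario 1 (failover in under 30 seconds with no 5xx errors), assuming a hypothetical evaluate_failover helper and that the observer supplies the measured values:

```python
def evaluate_failover(failover_seconds, http_5xx_count,
                      max_failover_seconds=30, max_5xx=0):
    """Compare observed results with the Scenario 1 success criteria."""
    failures = []
    if failover_seconds >= max_failover_seconds:
        failures.append(
            f"failover took {failover_seconds}s (limit < {max_failover_seconds}s)"
        )
    if http_5xx_count > max_5xx:
        failures.append(f"{http_5xx_count} 5xx responses (limit {max_5xx})")
    return {"passed": not failures, "failures": failures}


print(evaluate_failover(failover_seconds=22, http_5xx_count=0))
# {'passed': True, 'failures': []}
print(evaluate_failover(failover_seconds=41, http_5xx_count=3)["passed"])  # False
```

The failures list can be pasted straight into the retro document as candidate action items.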

Incident Metrics: MTTD, MTTR, MTBF

Track these metrics quarterly to measure the maturity of your incident management practice. Use them to identify trends and justify investments in reliability.

MTTD (Mean Time to Detect)
  Definition:     Average time from incident start to detection
  Formula:        Sum(detection_time - start_time) / incident_count
  Target:         < 2 minutes
  How to improve: Synthetic monitoring, tighter alerting thresholds, anomaly detection

MTTR (Mean Time to Resolve)
  Definition:     Average time from detection to full resolution
  Formula:        Sum(resolution_time - detection_time) / incident_count
  Target:         < 30 min (SEV1); < 4 hours (SEV2)
  How to improve: Runbook automation, self-healing systems, feature flags for rollback

MTBF (Mean Time Between Failures)
  Definition:     Average time between incident starts
  Formula:        Total_operating_time / incident_count
  Target:         > 7 days
  How to improve: Fix root causes, chaos engineering, production readiness reviews

MTTA (Mean Time to Acknowledge)
  Definition:     Average time from page to acknowledgment
  Formula:        Sum(ack_time - page_time) / page_count
  Target:         < 5 minutes
  How to improve: Clear escalation policies, pager training, notification channels

# Incident metrics calculation (Python)
# Track these metrics monthly and report quarterly

# Export incident data from the PagerDuty/Opsgenie API, then calculate:

from datetime import datetime


def calculate_metrics(incidents, period_days=None):
    """Calculate MTTD, MTTR, and MTBF from incident data.

    Each incident is a dict with ISO 8601 'started_at', 'detected_at', and
    'resolved_at' strings. period_days is the length of the reporting window
    used for MTBF; if omitted, MTBF is the mean gap between incident starts.
    """
    if not incidents:
        raise ValueError("incidents must contain at least one incident")

    starts = []
    mttd_values = []
    mttr_values = []

    for inc in incidents:
        # Note: before Python 3.11, fromisoformat() rejects a trailing 'Z'
        start = datetime.fromisoformat(inc['started_at'])
        detected = datetime.fromisoformat(inc['detected_at'])
        resolved = datetime.fromisoformat(inc['resolved_at'])

        starts.append(start)
        mttd_values.append((detected - start).total_seconds())
        mttr_values.append((resolved - detected).total_seconds())

    mttd = sum(mttd_values) / len(mttd_values) / 60  # minutes
    mttr = sum(mttr_values) / len(mttr_values) / 60  # minutes

    if period_days is not None:
        # MTBF: total operating time / number of incidents
        mtbf = period_days / len(incidents)
    elif len(incidents) > 1:
        # No window given: n incidents are separated by n - 1 gaps
        span_days = (max(starts) - min(starts)).total_seconds() / 86400
        mtbf = span_days / (len(incidents) - 1)
    else:
        mtbf = None  # a single incident with no window has no defined MTBF

    return {
        'mttd_minutes': round(mttd, 1),
        'mttr_minutes': round(mttr, 1),
        'mtbf_days': None if mtbf is None else round(mtbf, 1),
        'total_incidents': len(incidents)
    }

# Example return value for a monthly report (illustrative numbers):
# {
#   'mttd_minutes': 1.8,
#   'mttr_minutes': 22.4,
#   'mtbf_days': 4.2,
#   'total_incidents': 7
# }

Metrics That Matter: MTTR is the single most important incident metric. Reducing MTTR by 45% (as achieved at Samsung through Datadog APM and runbook automation) directly translates to reduced revenue loss and improved user experience. Focus your reliability investments on MTTR reduction: better detection, faster mitigation, and reliable rollbacks.
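
MTTA is defined in the metrics list but not computed by the script above. A minimal sketch that follows the stated formula, Sum(ack_time - page_time) / page_count; the field names paged_at and acknowledged_at are hypothetical and should be mapped to whatever your paging tool exports.

```python
from datetime import datetime


def calculate_mtta(pages):
    """Mean time to acknowledge, in minutes, over a list of page records."""
    if not pages:
        raise ValueError("pages must contain at least one page")
    total_seconds = sum(
        (datetime.fromisoformat(p["acknowledged_at"])
         - datetime.fromisoformat(p["paged_at"])).total_seconds()
        for p in pages
    )
    return round(total_seconds / len(pages) / 60, 1)


pages = [
    {"paged_at": "2024-01-03T02:10:00+00:00",
     "acknowledged_at": "2024-01-03T02:13:00+00:00"},
    {"paged_at": "2024-01-09T14:00:00+00:00",
     "acknowledged_at": "2024-01-09T14:06:00+00:00"},
]
print(calculate_mtta(pages))  # 4.5
```

Comparing the result with the 5-minute target shows whether acknowledgment delays, rather than detection or repair time, are the bottleneck.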

Last updated: June 2026 | Author: SRE Team