Incident Management
Effective incident management minimizes service disruption and restores normal operations quickly. This guide covers the complete incident lifecycle from detection to resolution, with severity classification, on-call design, communication templates, and production-ready runbooks.
Incident Lifecycle
Every incident follows a predictable lifecycle. Understanding these phases enables teams to build structured processes that reduce MTTR and improve outcomes.
+-----------+    +-----------+    +------------+    +------------+    +-------------+
| DETECTION |--->|  TRIAGE   |--->| ESCALATION |--->| RESOLUTION |--->| POST-MORTEM |
+-----------+    +-----------+    +------------+    +------------+    +-------------+
      |                |                 |                 |                 |
      v                v                 v                 v                 v
Alert fires      Determine        Engage right      Restore service   Document,
Monitoring       severity         people with       Apply fix         learn,
User report      Assess           needed skills     Apply workaround  improve
Synthetic test   impact           Communicate       Verify recovery
                 Establish        to stakeholders
                 IC role
Target Times:
Detection: Automated (MTTD target: < 2 minutes)
Triage: < 5 minutes from detection
Escalation: < 10 minutes from detection (if needed)
Resolution: Varies by severity (SEV1 target: < 1 hour)
Post-mortem: Within 48 hours of resolution
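The targets above can be checked mechanically once an incident has timestamps for each phase. The following is a minimal sketch, assuming every target is measured from the detection time and that the one-hour SEV1 figure is the resolution target; the function and parameter names are hypothetical, not part of any tool.

```python
from datetime import datetime, timedelta

# Targets from the list above, all measured from detection (an assumption).
TRIAGE_TARGET = timedelta(minutes=5)
ESCALATION_TARGET = timedelta(minutes=10)
SEV1_RESOLUTION_TARGET = timedelta(hours=1)


def missed_targets(detected_at, triaged_at, escalated_at, resolved_at):
    """Return the names of the phase targets this incident missed."""
    missed = []
    if triaged_at - detected_at >= TRIAGE_TARGET:
        missed.append("triage")
    # Escalation is optional, so None means it was not needed.
    if escalated_at is not None and escalated_at - detected_at >= ESCALATION_TARGET:
        missed.append("escalation")
    if resolved_at - detected_at >= SEV1_RESOLUTION_TARGET:
        missed.append("resolution")
    return missed


t0 = datetime(2024, 1, 1, 12, 0)
print(missed_targets(
    detected_at=t0,
    triaged_at=t0 + timedelta(minutes=4),
    escalated_at=t0 + timedelta(minutes=12),
    resolved_at=t0 + timedelta(minutes=50),
))  # ['escalation']
```

Running a check like this over past incidents shows which phase most often runs over its target.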
Phase 1: Detection
Incidents are detected through automated alerting, synthetic monitoring, user reports, or anomaly detection. The detection phase ends when a responder acknowledges the alert and begins triage. Automated detection is preferred because it removes the lag of waiting for users to report a problem.
Phase 2: Triage
The responder assesses severity, determines impact scope, and decides whether to activate the incident response process. If the issue is a SEV1 or SEV2, an Incident Commander (IC) is appointed. Triage should complete within 5 minutes of detection.
Phase 3: Escalation
The IC engages additional responders based on the escalation policy. This includes paging subject matter experts, opening a war room bridge, and notifying stakeholders. Communication channels are established and maintained throughout the incident.
Phase 4: Resolution
Responders work to restore service. The priority is always mitigation over root cause: get the service working first, then investigate. A fix can be a rollback, traffic shift, configuration change, or code patch. Service restoration is verified before the incident is closed.
Phase 5: Post-Mortem
A blameless post-mortem is conducted to understand what happened, why, and how to prevent recurrence. All SEV1 and SEV2 incidents require post-mortems within 48 hours of resolution. See the Post-Mortem Culture guide for templates and facilitation techniques.
Severity Levels (SEV1-SEV4)
Consistent severity classification ensures the right level of response for every incident. These definitions should be adopted organization-wide.
| Level | Name | Definition | Response | Examples |
|---|---|---|---|---|
| SEV1 | Critical | Complete service outage; critical functionality unusable; significant revenue impact; data loss or security breach | All hands; IC appointed; war room opened; executive notified within 15 min; page all on-call engineers | Payment processing completely down; all users cannot authenticate; data corruption affecting production |
| SEV2 | Major | Significant degradation; major feature impaired; partial outage affecting a large user segment | IC appointed; war room optional; engineering manager notified; page primary on-call + relevant SMEs | Latency >5s for 50% of requests; checkout failure rate >10%; mobile app crashes on launch for Android users |
| SEV3 | Minor | Localized or partial issue; workaround available; limited user impact | Primary on-call handles; ticket created; no war room; fix during business hours acceptable | Admin dashboard slow; one region experiencing elevated errors; non-critical feature degraded |
| SEV4 | Informational | No user impact; potential issue detected; proactive maintenance needed | Ticket created; scheduled for next sprint; no immediate action required | Certificate expiring in 30 days; disk usage >80%; single node down in multi-node cluster |
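One way to make classification repeatable during triage is to encode the table as a small decision function. The sketch below is one possible reading of the definitions, not an authoritative rule set: the `Signals` fields and the 10% threshold for a "large user segment" are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical triage inputs; the field names are illustrative."""
    full_outage: bool           # complete outage of a critical service
    data_loss_or_breach: bool   # data loss or security breach suspected
    users_affected_pct: float   # share of users impacted, 0-100
    workaround_available: bool


def classify(s: Signals, major_threshold_pct: float = 10.0) -> str:
    """Map triage signals to SEV1-SEV4, checking the worst case first."""
    if s.full_outage or s.data_loss_or_breach:
        return "SEV1"
    if s.users_affected_pct >= major_threshold_pct and not s.workaround_available:
        return "SEV2"
    if s.users_affected_pct > 0:
        return "SEV3"
    return "SEV4"


print(classify(Signals(False, False, 50.0, False)))  # SEV2
```

Checking the most severe conditions first means an incident that matches several rows is always assigned the higher severity.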
On-Call Rotation Design
Well-designed on-call rotations prevent burnout while ensuring adequate coverage. Based on experience directing a 6-person SRE team, these are the principles for sustainable on-call.
Design Principles
- Primary + Secondary: Always have a secondary on-call who is paged if the primary doesn't acknowledge within 5 minutes.
- Follow-the-sun: For global services, have on-call engineers in multiple time zones (APAC, EMEA, Americas) to reduce overnight pages.
- Rotation length: One-week rotations are standard. Avoid rotations longer than one week, because burnout increases significantly after 7 days.
- Post-incident rest: If an engineer handles a SEV1 overnight, they should not be expected in standup the next morning. Protect sleep.
- Handoff meeting: A brief (10-minute) handoff between outgoing and incoming on-call at rotation boundaries captures ongoing issues.
Sample On-Call Schedule (6-Person Team)
| Week | Primary On-Call | Secondary On-Call | Shadow (optional) |
|---|---|---|---|
| Jan 1-7 | Alice (SRE) | Bob (SRE) | - |
| Jan 8-14 | Bob (SRE) | Charlie (SRE) | - |
| Jan 15-21 | Charlie (SRE) | Diana (SRE) | Eve (new hire) |
| Jan 22-28 | Diana (SRE) | Eve (SRE) | - |
| Jan 29-Feb 4 | Eve (SRE) | Frank (SRE) | - |
| Feb 5-11 | Frank (SRE) | Alice (SRE) | - |
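This schedule gives each engineer a 5-week gap between primary on-call weeks. With 6 engineers on a weekly rotation, each engineer is primary approximately 8-9 weeks per year (52 / 6 is about 8.7). Because each secondary week falls immediately before that engineer's primary week, everyone carries a pager for two consecutive weeks out of every six.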
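The rotation arithmetic can be reproduced with a short generator. This is a sketch under two assumptions: the rotation starts on Monday 2024-01-01, and the secondary is always the engineer who becomes primary the following week, as in the sample table. The `rotation` function is hypothetical.

```python
from datetime import date, timedelta

ENGINEERS = ["Alice", "Bob", "Charlie", "Diana", "Eve", "Frank"]


def rotation(engineers, first_monday, weeks):
    """Yield (week_start, primary, secondary) for a weekly rotation.

    The secondary is the next engineer in the list, who becomes
    primary the following week.
    """
    n = len(engineers)
    for w in range(weeks):
        start = first_monday + timedelta(weeks=w)
        yield start, engineers[w % n], engineers[(w + 1) % n]


schedule = list(rotation(ENGINEERS, date(2024, 1, 1), 52))
for name in ENGINEERS:
    primary_weeks = sum(1 for _, primary, _ in schedule if primary == name)
    print(f"{name}: {primary_weeks} primary weeks")
```

Over 52 weeks, four engineers take 9 primary weeks and two take 8, which is where the 8-9 figure comes from.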
PagerDuty Schedule Configuration
# PagerDuty Terraform configuration for on-call rotation
# File: terraform/pagerduty-oncall.tf

# Create the team
resource "pagerduty_team" "sre" {
  name = "SRE Team"
}

# Team members
resource "pagerduty_user" "alice" {
  name  = "Alice Smith"
  email = "alice.smith@company.com"
  role  = "user"
}
# ... (define other team members similarly)
# Primary on-call schedule (weekly rotation, handoff Monday 9:00 AM Pacific)
resource "pagerduty_schedule" "sre_primary" {
  name      = "SRE Primary On-Call"
  time_zone = "America/Los_Angeles"

  layer {
    name                         = "Primary Layer"
    start                        = "2024-01-01T09:00:00-08:00"
    rotation_virtual_start       = "2024-01-01T09:00:00-08:00"
    rotation_turn_length_seconds = 604800 # 7 days
    users = [
      pagerduty_user.alice.id,
      pagerduty_user.bob.id,
      pagerduty_user.charlie.id,
      pagerduty_user.diana.id,
      pagerduty_user.eve.id,
      pagerduty_user.frank.id,
    ]
  }
}

# Secondary on-call schedule. It is a separate schedule rather than a second
# layer: layers in one schedule override each other, so a second full-time
# layer would replace the primary rotation instead of backing it up.
resource "pagerduty_schedule" "sre_secondary" {
  name      = "SRE Secondary On-Call"
  time_zone = "America/Los_Angeles"

  layer {
    name                         = "Secondary Layer"
    start                        = "2024-01-01T09:00:00-08:00"
    rotation_virtual_start       = "2024-01-01T09:00:00-08:00"
    rotation_turn_length_seconds = 604800 # 7 days
    users = [
      pagerduty_user.bob.id,
      pagerduty_user.charlie.id,
      pagerduty_user.diana.id,
      pagerduty_user.eve.id,
      pagerduty_user.frank.id,
      pagerduty_user.alice.id,
    ]
  }
}

# Team offsite (2024-07-15 to 2024-07-19): a layer restriction only limits
# when that layer's users are on call, and every layer requires users, so it
# cannot express "nobody on call". Arrange offsite coverage with schedule
# overrides instead.

# Create escalation policy
resource "pagerduty_escalation_policy" "sre_critical" {
  name      = "SRE Critical Escalation"
  num_loops = 3 # Repeat the escalation cycle 3 times before giving up

  # escalation_delay_in_minutes is how long an unacknowledged incident waits
  # on a rule before moving to the next one, so the times below are cumulative.

  # T+0: page the primary on-call
  rule {
    escalation_delay_in_minutes = 5
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.sre_primary.id
    }
  }

  # T+5: page the secondary on-call
  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.sre_secondary.id
    }
  }

  # T+15: escalate to engineering manager
  rule {
    escalation_delay_in_minutes = 15
    target {
      type = "user_reference"
      id   = pagerduty_user.eng_manager.id
    }
  }

  # T+30: escalate to director
  rule {
    escalation_delay_in_minutes = 15 # assumed wait before the cycle repeats
    target {
      type = "user_reference"
      id   = pagerduty_user.eng_director.id
    }
  }
}
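Escalation delays are easy to misread because each rule's delay is the wait before moving on from that rule, so notification times accumulate. The sketch below works through that arithmetic for a hypothetical three-rule policy with delays of 5, 10, and 15 minutes; it is a plain calculation and does not call the PagerDuty API.

```python
def escalation_timeline(rules):
    """Return (minutes_after_trigger, target) for each rule in order.

    `rules` is a list of (target, delay_minutes) pairs, where the delay
    is how long an unacknowledged incident waits on that rule before
    moving to the next one.
    """
    timeline = []
    elapsed = 0
    for target, delay in rules:
        timeline.append((elapsed, target))
        elapsed += delay
    return timeline


# Hypothetical policy used only for illustration.
rules = [("primary on-call", 5), ("engineering manager", 10), ("director", 15)]
for minute, target in escalation_timeline(rules):
    print(f"T+{minute} min: notify {target}")
```

With these delays the second rule fires at T+5 and the third at T+15, not at T+10 and T+15 as the raw delay values might suggest.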
Incident Commander Role and Responsibilities
The Incident Commander (IC) is the single point of coordination during an incident. The IC does not need to be the person fixing the problem. Their role is to coordinate, communicate, and make decisions.
IC Responsibilities
| Phase | Responsibilities |
|---|---|
| Triage (0-5 min) | Declare the incident and assign severity; appoint IC (self-appoint if first responder); open incident channel/war room; send initial notification |
| Escalation (5-15 min) | Page additional responders as needed; establish communication channels; notify stakeholders per severity; begin timeline documentation |
| Resolution (ongoing) | Coordinate between responders; approve mitigation strategies; make go/no-go decisions on risky fixes; communicate status updates every 15-30 min; ensure timeline is being updated |
| Closeout | Confirm service is fully restored; send all-clear notification; schedule post-mortem; close incident channel after 24h; hand off to post-mortem owner |
Communication Templates
Status Page Update Template
--- INCIDENT OPENED ---
[Investigating] Payment Processing Degradation
We are investigating reports of elevated error rates on the
Payment Processing API. Some users may experience failed
transactions. We will provide updates every 15 minutes.
Started: [TIMESTAMP UTC]
Severity: SEV[1-4]
Impact: [BRIEF DESCRIPTION]

--- STATUS UPDATE (every 15-30 min) ---
[Update X] Payment Processing Degradation
We have identified the root cause as [CAUSE]. The team is
[CURRENT ACTION]. We estimate [TIMELINE] for full resolution.
Next update: [TIME]

--- RESOLVED ---
[Resolved] Payment Processing Degradation
The issue has been resolved. All payment processing is now
operating normally.
Root cause: [BRIEF]
Resolution: [BRIEF]
Duration: [X minutes]
Impact: [SUMMARY]
A post-mortem will be published within 48 hours.
Stakeholder Notification Template (SEV1)
Subject: [SEV1] Incident DECLARED - [Service Name] - [YYYYMMDD-HHMM]

SEV1 Incident declared for [Service Name].

IMPACT: [Service] is experiencing [outage/degradation].
        [X]% of users affected. [Revenue impact if known].
STARTED: [TIMESTAMP UTC]
STATUS: Investigating
RESPONDERS: [IC Name] (IC), [Engineer 1], [Engineer 2]
WAR ROOM: [Zoom/Meet link]
CHANNEL: [Slack channel]
CUSTOMER IMPACT: [Yes/No; if yes, status page updated]
STATUS PAGE: [link]
NEXT UPDATE: [TIME UTC]
---
This is an automated notification from the Incident Management System.
Reply STOP to disable notifications for this incident.
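Templates like these are easiest to keep consistent when they are filled in by code rather than by hand under pressure. The following sketch renders a shortened version of the stakeholder notification with Python's standard `string.Template`; the field names and the sample values are hypothetical, and a real integration would send the result through whatever notification tool is in use.

```python
from string import Template

SUBJECT = Template("[$severity] Incident DECLARED - $service - $stamp")
BODY = Template(
    "$severity Incident declared for $service.\n"
    "IMPACT: $impact\n"
    "STARTED: $started\n"
    "STATUS: $status\n"
    "NEXT UPDATE: $next_update\n"
)


def render(fields):
    """Fill both templates; substitute() raises KeyError on a missing field."""
    return SUBJECT.substitute(fields), BODY.substitute(fields)


subject, body = render({
    "severity": "SEV1",
    "service": "payment-api",
    "stamp": "20240101-1200",
    "impact": "Checkout failing for all users",
    "started": "2024-01-01 12:00 UTC",
    "status": "Investigating",
    "next_update": "12:15 UTC",
})
print(subject)
print(body)
```

Using `substitute` rather than `safe_substitute` is deliberate: a notification with an unfilled placeholder fails loudly instead of reaching stakeholders half-complete.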
War Room / Bridge Procedures
For SEV1 incidents, open a war room (video bridge) immediately. The war room is the central coordination point for all incident response activity.
War Room Rules
- IC speaks first: The IC opens the bridge, establishes the agenda, and assigns roles.
- One conversation at a time: Side conversations happen in Slack, not on the bridge.
- Declare workstreams: Responders declare what they are investigating to prevent duplicate work.
- No blame: Focus on fixing the problem, not assigning fault. Blameless culture is critical during incidents.
- IC approves changes: No production changes during a SEV1 without explicit IC approval.
- Take notes: Someone (designated scribe) captures timeline, decisions, and actions in real-time.
- Rotate every 2 hours: IC and key responders rotate to prevent fatigue.
War Room Checklist
WAR ROOM OPEN CHECKLIST
Incident: [ID] - [DESCRIPTION]
IC: [NAME] (opened at [TIME])

[ ] Bridge opened and link shared in #incidents
[ ] IC has confirmed role and declared severity
[ ] Scribe assigned: [NAME]
[ ] Primary responders on bridge: [NAMES]
[ ] Communication channels established
    [ ] Slack channel: #[channel-name]
    [ ] Status page updated (if customer-facing)
[ ] Stakeholders notified per escalation policy
[ ] Timeline document created: [DOC LINK]
[ ] Initial status update sent

EVERY 15 MINUTES:
[ ] Status update posted to Slack
[ ] Status page updated (if public)
[ ] Timeline updated with new findings

RESOLUTION:
[ ] Service fully restored and verified
[ ] All-clear communicated
[ ] War room closed: [TIME]
[ ] Post-mortem scheduled: [DATE/TIME]
Complete Incident Response Runbook Template
Runbooks are step-by-step procedures for responding to specific alert types. Every alert should have a corresponding runbook. Store runbooks in a searchable repository (Confluence, GitHub, wiki).
=================================================================
INCIDENT RUNBOOK
Template v2.0
=================================================================
Runbook ID:   RB-[SERVICE]-[ALERT]-###
Alert:        [Alert name exactly as it appears in PagerDuty/Datadog]
Service:      [Service name]
Severity:     [SEV level typically triggered]
Owner:        [Team / On-call rotation]
Last Updated: [DATE]

--- ALERT DESCRIPTION ---
[Describe what this alert means in plain language. What system
condition triggers it? What does it indicate about service health?]

--- IMPACT ---
User Impact:     [What does the user see when this alert fires?]
Business Impact: [Revenue, SLA, reputation impact]
Scope:           [How many users/regions/features affected?]

--- INITIAL ASSESSMENT (do this first, within 2 minutes) ---
1. Check service health dashboard:
   [Direct link to Datadog/Grafana dashboard]
2. Check for correlated alerts:
   [Link to alert history or correlated alerts dashboard]
3. Check recent deployments:
   $ kubectl rollout history deployment/[name] -n [namespace]
   [Or link to deployment tracking]
4. Check error logs (last 15 minutes):
   Datadog: service:[service] status:error (time range: Past 15 Minutes)
   [Or equivalent log query]

--- COMMON CAUSES (check in order) ---
Cause 1: [Most common cause]
  Symptoms:     [How to recognize this cause]
  Verification: [Command or check to confirm]
  Fix:          [Step-by-step fix procedure]
Cause 2: [Second most common]
  Symptoms:
  Verification:
  Fix:
Cause 3: [Third most common]
  Symptoms:
  Verification:
  Fix:

--- MITIGATION PROCEDURES ---
If the service is severely degraded, apply these mitigations
(in order of speed/safety):
1. ROLLBACK (fastest, if recent deploy):
   $ ./scripts/rollback.sh [service] [previous-version]
   [Or kubectl rollout undo deployment/[name] -n [namespace]]
2. TRAFFIC SHIFT (if multi-region):
   $ ./scripts/drain-region.sh [problem-region]
   $ ./scripts/increase-capacity.sh [healthy-region]
3. CIRCUIT BREAKER (if downstream dependency failing):
   [Procedure to open circuit breaker]
4. SCALE UP (if capacity issue):
   $ kubectl scale deployment [name] --replicas=[N] -n [namespace]

--- VERIFICATION ---
After applying a fix:
1. Check error rate has returned to baseline
2. Check latency is within SLO
3. Run smoke test: [link to smoke tests or command]
4. Monitor for 15 minutes before declaring resolved

--- ESCALATION ---
Escalate to [TEAM/ROLE] if:
- Fix does not resolve within 15 minutes
- Root cause is unclear after 30 minutes
- Impact is larger than documented scope
- Data integrity concerns exist
Escalation path:
  Primary:   [On-call engineer]
  Secondary: [Engineering manager] (page after 15 min)
  Tertiary:  [Director] (page after 30 min)

--- RELATED RUNBOOKS ---
- RB-[SERVICE]-[ALERT]-###: [Related runbook]
- RB-[SERVICE]-[ALERT]-###: [Related runbook]
- [Link to service architecture diagram]
- [Link to dependency runbooks]

--- CHANGE LOG ---
YYYY-MM-DD: v1.0 - Initial runbook (Author)
YYYY-MM-DD: v1.1 - Added Cause 3 after INC-2024-0015 (Author)
YYYY-MM-DD: v2.0 - Major rewrite after Q1 game day (Author)
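The verification step "monitor for 15 minutes before declaring resolved" can be expressed as a simple gate over recent error-rate samples. This is a sketch that assumes one sample per minute, supplied as `(minute, error_rate)` pairs from whatever metrics backend is in use; the `stable_for` helper is hypothetical.

```python
def stable_for(samples, baseline, window_minutes=15):
    """Return True if the error rate stayed at or below baseline for the
    whole trailing window.

    `samples` is a list of (minute, error_rate) pairs in ascending order,
    one per minute.
    """
    if not samples:
        return False
    last_minute = samples[-1][0]
    window = [rate for minute, rate in samples if minute > last_minute - window_minutes]
    # Require a full window of data so a short quiet spell does not count.
    return len(window) >= window_minutes and all(rate <= baseline for rate in window)


healthy = [(m, 0.001) for m in range(20)]
print(stable_for(healthy, baseline=0.01))  # True
```

Requiring a full window guards against closing an incident a few minutes after a fix, before a recurring failure has had time to show up again.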
Escalation Policy Example
=================================================================
ESCALATION POLICY
Payment Platform Team
=================================================================
This escalation policy applies to all production services owned by
the Payment Platform team.

--- SEV1 ESCALATION ---
T+0 min:  Primary on-call paged
          -> Acknowledge within 5 minutes
T+5 min:  If unacknowledged -> Secondary on-call paged
          -> War room opened automatically
T+10 min: If still unacknowledged -> Engineering Manager paged
          -> All SRE team members notified
T+15 min: VP Engineering notified (email + Slack)
          -> Customer support activated (if user-facing)
          -> Status page updated
T+30 min: CTO notified
          -> Executive briefing prepared
T+60 min: If unresolved -> External communications team engaged
          -> Legal notified (if compliance/regulatory impact)

--- SEV2 ESCALATION ---
T+0 min:  Primary on-call paged
T+10 min: If unacknowledged -> Secondary on-call paged
T+20 min: If unacknowledged -> Engineering Manager paged
T+30 min: IC appointed (if not already done by responders)

--- SEV3 ESCALATION ---
T+0:        Ticket created, assigned to on-call
T+4 hours:  If untouched -> Reassigned to secondary
T+24 hours: If unresolved -> Escalated to engineering manager

--- ESCALATION BYPASS ---
Any responder may immediately escalate to Engineering Manager
or VP if:
- Data loss is suspected
- Security breach indicators present
- Customer-facing impact exceeds $100K/hour
- Regulatory compliance at risk
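A policy like this can be encoded as data so that tooling and people read from the same source. The sketch below transcribes only the paging steps for SEV1 and SEV2 and the bypass conditions; the function names and parameters are hypothetical.

```python
# (minutes after the page, who is paged if the incident is still unacknowledged)
PAGE_STEPS = {
    "SEV1": [(0, "primary on-call"), (5, "secondary on-call"), (10, "engineering manager")],
    "SEV2": [(0, "primary on-call"), (10, "secondary on-call"), (20, "engineering manager")],
}


def paged_so_far(severity, minutes_unacknowledged):
    """List everyone who should have been paged by now, in order."""
    return [who for at, who in PAGE_STEPS[severity] if at <= minutes_unacknowledged]


def may_bypass(data_loss=False, breach_indicators=False,
               impact_usd_per_hour=0, compliance_risk=False):
    """True if any escalation-bypass condition holds."""
    return bool(data_loss or breach_indicators or compliance_risk
                or impact_usd_per_hour > 100_000)


print(paged_so_far("SEV1", 7))               # ['primary on-call', 'secondary on-call']
print(may_bypass(impact_usd_per_hour=250_000))  # True
```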
Incident Tracking with Jira and Opsgenie
Every incident should have a corresponding tracking ticket for action item follow-up. Integrate PagerDuty/Opsgenie with Jira for automatic ticket creation.
# Jira incident ticket template (created automatically from PagerDuty)
Project:    INC (Incidents)
Issue Type: Incident
Summary:    [SEV{X}] [Service] - [Brief description]
Priority:   [Critical/High/Medium/Low based on severity]
Labels:     incident, sev{X}, [service-name]
Description:
  Incident ID: INC-2024-XXXX
  Started:     [TIMESTAMP]
  Resolved:    [TIMESTAMP]
  Duration:    [X minutes]
  Severity:    SEV[X]
  Service:     [NAME]
  Impact:
    - Availability: [X%]
    - Users affected: [N or "unknown"]
    - Revenue impact: [$X or "unknown"]
  Root Cause:  [To be filled in post-mortem]
  Post-Mortem: [Link to document]
  Action Items:
    1. [ ] [Description] - Owner: [Name] - Due: [Date]
    2. [ ] [Description] - Owner: [Name] - Due: [Date]
  Linked Issues:
    - Caused by: [Jira ticket if known bug]
    - Related: [Other incident tickets]

# Jira Automation Rule:
# When incident created:
#   1. Set priority based on severity label
#   2. Assign to current on-call engineer
#   3. Add watchers: engineering manager, SRE lead
#   4. Link to service component in JSM
#   5. Post to #incidents Slack channel
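The first automation step, setting priority from severity, comes down to a lookup table. The sketch below builds the summary, priority, and labels from the template as a plain dictionary. It does not call the Jira API, the dictionary keys are illustrative rather than Jira's exact field schema, and the one-to-one mapping of SEV1-SEV4 onto Critical/High/Medium/Low is an assumption.

```python
# Assumed mapping; the template only says priority is "based on severity".
PRIORITY_BY_SEVERITY = {1: "Critical", 2: "High", 3: "Medium", 4: "Low"}


def ticket_fields(severity, service, description):
    """Build the summary, priority, and labels for an incident ticket."""
    if severity not in PRIORITY_BY_SEVERITY:
        raise ValueError(f"unknown severity: {severity}")
    return {
        "project": "INC",
        "issuetype": "Incident",
        "summary": f"[SEV{severity}] {service} - {description}",
        "priority": PRIORITY_BY_SEVERITY[severity],
        "labels": ["incident", f"sev{severity}", service],
    }


print(ticket_fields(2, "payment-api", "Elevated checkout failures"))
```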
Game Days and Chaos Engineering
Game days are scheduled exercises where teams deliberately inject failures into production or staging systems to validate resilience and practice response procedures. They are the most effective way to ensure your incident response process works when it matters.
Game Day Planning Template
=================================================================
GAME DAY PLAN
=================================================================
Game Day ID: GD-2024-Q1-001
Date:        [DATE]
Time:        [START] - [END] (2 hours)
Location:    [Video link / Room]
Scope:       [Staging / Production (canary only) / Full Production]

--- OBJECTIVES ---
1. Validate Payment API auto-failover to secondary region
2. Verify detection meets the MTTD target (< 2 min) and on-call
   acknowledgment meets the MTTA target (< 5 min)
3. Test IC handoff procedure
4. Confirm runbook accuracy for region failover

--- SCENARIOS ---
Scenario 1: Region Failure Simulation (45 min)
  Inject:   Terminate all pods in us-west-2 for payment-api
  Expected: Traffic auto-routes to us-east-1
  Success:  < 30s failover, no 5xx errors
  Observer: [Name] (monitors dashboards, does not intervene)
Scenario 2: Database Primary Failure (45 min)
  Inject:   Failover database primary to replica
  Expected: Automatic promotion, brief read-only period
  Success:  < 60s read-only window, no data loss
  Observer: [Name]

--- PARTICIPANTS ---
IC (practice):     [Name]
Primary Responder: [Name]
SRE Observer:      [Name] (evaluates response)
Engineering Mgr:   [Name] (observes, available if needed)

--- SAFETY GUARDRAILS ---
[ ] Abort criteria defined: [What triggers immediate stop]
[ ] Rollback plan ready: [How to restore if experiment goes wrong]
[ ] Blast radius limited: [Only canary users affected]
[ ] Observer has kill switch access
[ ] Customer support on standby

--- SCHEDULE ---
T+0:   Briefing and scenario reveal
T+15:  Injection 1, observe response
T+45:  Debrief Injection 1
T+60:  Injection 2, observe response
T+90:  Debrief Injection 2
T+105: Retrospective and action items
T+120: Close

--- POST-GAME-DAY ---
Action items captured in: [Jira epic / Board]
Retro document: [Link]
Process improvements identified: [List]
Incident Metrics: MTTD, MTTR, MTBF
Track these metrics monthly and review them quarterly to measure the maturity of your incident management practice. Use them to identify trends and justify investments in reliability.
| Metric | Definition | Formula | Target | How to Improve |
|---|---|---|---|---|
| MTTD (Mean Time to Detect) | Average time from incident start to detection | Sum(detection_time - start_time) / incident_count | < 2 minutes | Synthetic monitoring, tighter alerting thresholds, anomaly detection |
| MTTR (Mean Time to Resolve) | Average time from detection to full resolution | Sum(resolution_time - detection_time) / incident_count | < 30 min (SEV1); < 4 hours (SEV2) | Runbook automation, self-healing systems, feature flags for rollback |
| MTBF (Mean Time Between Failures) | Average time between incident starts | Total_operating_time / incident_count | > 7 days | Fix root causes, chaos engineering, production readiness reviews |
| MTTA (Mean Time to Acknowledge) | Average time from page to acknowledgment | Sum(ack_time - page_time) / page_count | < 5 minutes | Clear escalation policies, pager training, notification channels |
# Incident metrics calculation (Python)
# Export incident data from the PagerDuty/Opsgenie API, then compute
# MTTD, MTTR, and MTBF. Track these monthly and report quarterly.
from datetime import datetime


def calculate_metrics(incidents):
    """Calculate MTTD, MTTR, and MTBF from a list of incident dicts.

    Each incident needs ISO 8601 'started_at', 'detected_at', and
    'resolved_at' timestamps.
    """
    if not incidents:
        raise ValueError("at least one incident is required")

    mttd_values = []
    mttr_values = []
    starts = []
    for inc in incidents:
        start = datetime.fromisoformat(inc['started_at'])
        detected = datetime.fromisoformat(inc['detected_at'])
        resolved = datetime.fromisoformat(inc['resolved_at'])
        starts.append(start)
        mttd_values.append((detected - start).total_seconds())
        mttr_values.append((resolved - detected).total_seconds())

    mttd = sum(mttd_values) / len(mttd_values) / 60  # minutes
    mttr = sum(mttr_values) / len(mttr_values) / 60  # minutes

    # MTBF: mean gap between consecutive incident starts. N incidents
    # bound N - 1 gaps, so divide by N - 1; undefined for one incident.
    if len(starts) > 1:
        span_days = (max(starts) - min(starts)).total_seconds() / 86400
        mtbf = round(span_days / (len(starts) - 1), 1)
    else:
        mtbf = None

    return {
        'mttd_minutes': round(mttd, 1),
        'mttr_minutes': round(mttr, 1),
        'mtbf_days': mtbf,
        'total_incidents': len(incidents)
    }


# Example: Monthly report
# {
#     'mttd_minutes': 1.8,
#     'mttr_minutes': 22.4,
#     'mtbf_days': 4.2,
#     'total_incidents': 7
# }
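The metrics table also lists MTTA and gives separate MTTR targets per severity, neither of which the function above computes. The following is a minimal sketch of both, assuming each page record carries `paged_at` and `acked_at` timestamps and each incident carries a `severity` field; those field names are hypothetical and would need to match the actual export format.

```python
from collections import defaultdict
from datetime import datetime


def mtta_minutes(pages):
    """Mean time to acknowledge, in minutes, over acknowledged pages only."""
    waits = [
        (datetime.fromisoformat(p["acked_at"])
         - datetime.fromisoformat(p["paged_at"])).total_seconds()
        for p in pages
        if p.get("acked_at")
    ]
    return round(sum(waits) / len(waits) / 60, 1) if waits else None


def mttr_by_severity(incidents):
    """Mean detection-to-resolution time in minutes, grouped by severity."""
    buckets = defaultdict(list)
    for inc in incidents:
        detected = datetime.fromisoformat(inc["detected_at"])
        resolved = datetime.fromisoformat(inc["resolved_at"])
        buckets[inc["severity"]].append((resolved - detected).total_seconds() / 60)
    return {sev: round(sum(v) / len(v), 1) for sev, v in buckets.items()}
```

Grouping MTTR by severity matters because a single blended average hides whether the SEV1 target is being met. Pages that were never acknowledged are excluded from MTTA here and are worth counting separately.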
Related Topics
- Post-Mortem Culture: Blameless post-mortem templates and facilitation
- Monitoring & Observability: Detection through synthetic monitoring and alerting
- SRE Principles: Core SRE practices and error budgets
Last updated: June 2026 | Author: SRE Team