Post-Mortem Culture

Blameless post-mortems are essential for learning from failures and preventing recurrence. This guide provides templates, processes, facilitation techniques, and a complete example document to help teams build a culture of continuous learning from incidents.

Blameless Culture Principles

A blameless post-mortem culture is built on the understanding that humans are an essential component of complex systems, and that human error is a symptom of systemic problems, not a cause. The goal is not to identify who made a mistake, but to understand how the system allowed a mistake to lead to an incident.

Core Tenets

  1. No individual blame: Never name individuals as causes. Instead of "Bob didn't restart the service," write "The service restart procedure was not automatically triggered."
  2. System over individual: Focus on system design, process gaps, tooling deficiencies, and automation failures; these are the things that allowed a human action (or inaction) to cause harm.
  3. Psychological safety: Engineers must feel safe to share exactly what they did, including mistakes. If people hide information out of fear, the organization cannot learn.
  4. Actionable outcomes: Every post-mortem must produce concrete, assigned action items that prevent recurrence or improve response.
  5. Share widely: Learning from one team's incident benefits the entire organization. Publish post-mortems broadly unless they contain sensitive security information.

Blameless ≠ Consequence-Free: Blameless culture does not mean there are no consequences for negligent or malicious behavior. If someone deliberately bypasses safety procedures, that is a performance management issue, not a post-mortem topic. Post-mortems address systemic failures, not misconduct.

Reframing: Individual Blame vs. System Analysis

| Blame-Oriented (Wrong) | System-Oriented (Right) |
|------------------------|-------------------------|
| "Alice forgot to update the config after the deployment." | "The deployment process did not automatically validate that configuration matched the deployed code version." |
| "Bob was tired and clicked the wrong button in the admin panel." | "The admin panel had no confirmation dialog for destructive actions, and the 'delete' button was visually similar to 'disable'." |
| "The on-call engineer didn't know how to handle the alert." | "The runbook for this alert was outdated (last updated 8 months ago) and did not cover the current failure mode." |
| "Charlie didn't check the database connection pool before deploying." | "The CI pipeline does not include a pre-deploy health check for database connection capacity." |
| "The team didn't follow the change management process." | "The change management process was bypassed because it added 4 hours of delay, and engineers felt pressure to deploy quickly." |

When to Write a Post-Mortem

Requiring post-mortems for every alert would create too much overhead. These are the standard triggers that balance learning with sustainable workload.

| Trigger | Post-Mortem Required | Timeline |
|---------|----------------------|----------|
| SEV1 incident (Critical) | Always (mandatory) | Within 48 hours |
| SEV2 incident (Major) | Always (mandatory) | Within 48 hours |
| Data loss of any kind | Always (mandatory) | Within 24 hours |
| Customer-visible outage | Yes, if duration > 15 minutes | Within 48 hours |
| Security incident | Yes, after security review | After security clearance |
| On-call page with no action taken | Optional (if pattern detected) | Weekly review |
| Near-miss (almost an incident) | Recommended (valuable learning) | Within 1 week |
| SEV3/SEV4 | Not required unless trending | Review in weekly SRE meeting |
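
The trigger table can be encoded as a small decision function so that incident tooling applies it the same way every time. The following is a minimal sketch, not an existing tool: the `Incident` record, its field names, and `postmortem_requirement` are hypothetical, the order in which the rules are checked is an assumption, and the "on-call page with no action taken" row is left out for brevity. The thresholds are copied from the table.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Incident:
    """Hypothetical incident record; the field names are illustrative."""
    severity: int                       # 1-4, as in SEV1..SEV4
    data_loss: bool = False
    customer_visible_minutes: float = 0.0
    security: bool = False
    near_miss: bool = False


def postmortem_requirement(incident: Incident) -> Tuple[str, str]:
    """Return (requirement, timeline) following the trigger table."""
    # Data loss carries the tightest deadline, so it is checked first.
    if incident.data_loss:
        return ("mandatory", "within 24 hours")
    if incident.security:
        return ("required after security review", "after security clearance")
    if incident.severity in (1, 2):
        return ("mandatory", "within 48 hours")
    if incident.customer_visible_minutes > 15:
        return ("required", "within 48 hours")
    if incident.near_miss:
        return ("recommended", "within 1 week")
    return ("not required unless trending", "review in weekly SRE meeting")


print(postmortem_requirement(Incident(severity=2)))
print(postmortem_requirement(Incident(severity=3, customer_visible_minutes=20)))
```

Keeping the rules in one function makes the policy reviewable in a pull request, which fits the guide's emphasis on fixing process rather than relying on memory.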

Post-Mortem Template Structure

This template is the standard format for all post-mortems. Consistent structure makes it easy to find information and compare across incidents.

═══════════════════════════════════════════════════════════════
                    POST-MORTEM TEMPLATE
═══════════════════════════════════════════════════════════════

Incident ID: [INC-YYYY-NNNN]
Title: [Service Name] - [Brief description of what happened]
Date: [YYYY-MM-DD]
Author: [Name, Team]
Status: [DRAFT / IN REVIEW / PUBLISHED]
Reviewers: [Names]

─── EXECUTIVE SUMMARY ─────────────────────────────────────────

[2-3 sentences describing what happened, impact, and resolution.
This should be understandable by non-technical leadership.]

─── INCIDENT METRICS ──────────────────────────────────────────

| Metric          | Value        |
|-----------------|--------------|
| Severity        | SEV[1-4]     |
| Detection Time  | [HH:MM UTC]  |
| Start Time      | [HH:MM UTC]  |
| Resolution Time | [HH:MM UTC]  |
| Total Duration  | [X minutes]  |
| MTTD            | [X minutes]  |
| MTTR            | [X minutes]  |

─── IMPACT ────────────────────────────────────────────────────

User Impact:
  • [Description of what users experienced]
  • [Number/percentage of users affected]

Business Impact:
  • Revenue: [$ impact or "none"]
  • SLA: [Did this breach SLA? Which one?]
  • Reputation: [Social media mentions, support tickets]

─── TIMELINE ──────────────────────────────────────────────────

[Reconstruct the incident minute-by-minute. Use UTC. Include
system logs, chat messages, and human actions. This is the most
important section: it tells the story of the incident.]

[HH:MM:SS] - [Event description]
[HH:MM:SS] - [Event description]
[HH:MM:SS] - [Event description]

─── ROOT CAUSE ANALYSIS ───────────────────────────────────────

[Use 5 Whys or Fault Tree Analysis. Identify contributing causes
and the root cause. Distinguish between proximate cause and
underlying systemic issues.]

Proximate Cause: [What directly caused the incident]

Root Cause: [The underlying systemic issue that allowed the
             proximate cause to become an incident]

Contributing Factors:
  1. [Factor that made the incident worse or more likely]
  2. [Another factor]

─── LESSONS LEARNED ───────────────────────────────────────────

What Went Well:
  • [Something the team did well during response]
  • [Effective process, tool, or decision]

What Went Poorly:
  • [Something that delayed resolution or made impact worse]
  • [Process gap, missing tooling, unclear procedure]

Where We Got Lucky:
  • [Something that prevented worse impact but was not by design]
  • [Acknowledge luck to avoid complacency]

─── ACTION ITEMS ──────────────────────────────────────────────

| # | Action | Owner | Priority | Due Date | Status |
|---|--------|-------|----------|----------|--------|
| 1 | [Description] | [Name] | P0/P1/P2 | [Date] | OPEN |

─── RELATED INCIDENTS ─────────────────────────────────────────

[Link to any past incidents with the same or similar root cause.
This helps identify patterns and recurring issues.]

─── APPENDIX ──────────────────────────────────────────────────

[Links to logs, metrics, dashboards, chat transcripts]
[Grafana dashboard snapshot]
[Relevant code diffs or configuration changes]
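
The duration fields in the INCIDENT METRICS table can be derived from three timestamps rather than typed in by hand. This is a minimal sketch under stated assumptions: `incident_metrics` is a hypothetical helper, and time to resolve is measured from detection to resolution because the sample report later in this guide lists 20 minutes for a 14:25 detection and a 14:45 resolution. Some teams measure from the start of impact instead. For a single incident, the MTTD and MTTR fields are simply that incident's time to detect and time to resolve.

```python
from datetime import datetime, timezone


def incident_metrics(start: datetime, detected: datetime, resolved: datetime) -> dict:
    """Derive the duration fields of the INCIDENT METRICS table, in minutes."""
    if not (start <= detected <= resolved):
        raise ValueError("expected start <= detected <= resolved")

    def minutes(delta) -> float:
        return delta.total_seconds() / 60

    return {
        "total_duration_min": minutes(resolved - start),
        "time_to_detect_min": minutes(detected - start),
        # Assumption: time to resolve is measured from detection.
        "time_to_resolve_min": minutes(resolved - detected),
    }


# Timestamps taken from the sample report in this guide.
utc = timezone.utc
metrics = incident_metrics(
    start=datetime(2024, 1, 15, 14, 23, tzinfo=utc),
    detected=datetime(2024, 1, 15, 14, 25, tzinfo=utc),
    resolved=datetime(2024, 1, 15, 14, 45, tzinfo=utc),
)
print(metrics)
# {'total_duration_min': 22.0, 'time_to_detect_min': 2.0, 'time_to_resolve_min': 20.0}
```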

Timeline Construction Techniques

The timeline is the most important section of the post-mortem. A good timeline tells the complete story of the incident and enables readers to understand the sequence of events that led to the outcome.

Data Sources for Timeline Construction

  • Monitoring systems: Datadog, Prometheus, CloudWatch (extract metric anomalies and alert timestamps)
  • Log aggregation: Correlated log entries from all affected services
  • Deployment records: CI/CD pipeline timestamps for code or config changes
  • Chat history: Slack/Teams incident channel timestamps
  • PagerDuty/Opsgenie: Page, acknowledgment, escalation timestamps
  • Infrastructure changes: Terraform applies, Kubernetes events, scaling actions
  • Version control: Git commits, pull request merges

Automated Timeline Extraction Script

#!/usr/bin/env python3
"""
timeline_extractor.py: build an incident timeline from multiple sources.

Usage: python timeline_extractor.py

Running the script prints the example timeline defined at the bottom of the
file. The incident window and service (for example "2024-01-15T14:00:00Z" to
"2024-01-15T16:00:00Z" for payment-api) are set in code, and save() writes
the rendered timeline to a file such as incident-timeline.md.
"""

from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class TimelineEvent:
    timestamp: datetime
    source: str
    description: str
    severity: str = "info"  # info, warning, critical, action

class TimelineExtractor:
    def __init__(self, start: datetime, end: datetime, service: str):
        self.start = start
        self.end = end
        self.service = service
        self.events: List[TimelineEvent] = []
    
    def add_manual_event(self, timestamp: str, description: str, severity: str = "info"):
        """Add a manually recorded event (from Slack, human recollection)."""
        self.events.append(TimelineEvent(
            timestamp=datetime.fromisoformat(timestamp.replace('Z', '+00:00')),
            source="manual",
            description=description,
            severity=severity
        ))
    
    def add_alert_event(self, timestamp: str, alert_name: str, status: str):
        """Add an alert firing/recovery event."""
        icon = "🔥" if status == "firing" else "✅"
        self.events.append(TimelineEvent(
            timestamp=datetime.fromisoformat(timestamp.replace('Z', '+00:00')),
            source="alert",
            description=f"{icon} Alert [{alert_name}] {status}",
            severity="critical" if status == "firing" else "info"
        ))
    
    def add_deployment_event(self, timestamp: str, version: str, user: str):
        """Add a deployment event."""
        self.events.append(TimelineEvent(
            timestamp=datetime.fromisoformat(timestamp.replace('Z', '+00:00')),
            source="deployment",
            description=f"🚀 Deployment: version {version} by {user}",
            severity="warning"
        ))
    
    def add_human_action(self, timestamp: str, actor: str, action: str):
        """Add a human action during incident response."""
        self.events.append(TimelineEvent(
            timestamp=datetime.fromisoformat(timestamp.replace('Z', '+00:00')),
            source="human",
            description=f"👤 {actor}: {action}",
            severity="action"
        ))
    
    def render_markdown(self) -> str:
        """Render timeline as markdown for post-mortem."""
        self.events.sort(key=lambda e: e.timestamp)
        
        lines = ["### Incident Timeline\n", "| Time (UTC) | Source | Event |", "|------------|--------|-------|"]
        
        for event in self.events:
            if self.start <= event.timestamp <= self.end:
                time_str = event.timestamp.strftime("%H:%M:%S")
                lines.append(f"| {time_str} | {event.source} | {event.description} |")
        
        return "\n".join(lines)
    
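    def save(self, filepath: str):
        """Write the rendered timeline to a file."""
        # UTF-8 is set explicitly because the timeline contains emoji, which
        # the platform default encoding may not be able to represent.
        with open(filepath, 'w', encoding='utf-8') as f:
            f.write(self.render_markdown())
        print(f"Timeline saved to {filepath}")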


# Example usage for a real incident
if __name__ == "__main__":
    extractor = TimelineExtractor(
        start=datetime.fromisoformat("2024-01-15T14:00:00+00:00"),
        end=datetime.fromisoformat("2024-01-15T16:00:00+00:00"),
        service="payment-api"
    )
    
    # These would normally be extracted from APIs
    extractor.add_deployment_event("2024-01-15T14:23:00Z", "2.4.0", "deploy-bot")
    extractor.add_alert_event("2024-01-15T14:25:00Z", "PaymentAPI_ErrorRateHigh", "firing")
    extractor.add_human_action("2024-01-15T14:26:00Z", "oncall-alice", "Acknowledged alert, checking dashboard")
    extractor.add_human_action("2024-01-15T14:30:00Z", "oncall-alice", "Identified latency spike correlating with v2.4.0 deploy")
    extractor.add_human_action("2024-01-15T14:32:00Z", "oncall-alice", "Opened war room, paging secondary")
    extractor.add_human_action("2024-01-15T14:35:00Z", "sre-bob", "Initiated rollback to v2.3.9")
    extractor.add_deployment_event("2024-01-15T14:40:00Z", "2.3.9 (rollback)", "sre-bob")
    extractor.add_alert_event("2024-01-15T14:45:00Z", "PaymentAPI_ErrorRateHigh", "resolved")
    extractor.add_human_action("2024-01-15T14:50:00Z", "oncall-alice", "Confirmed metrics normal, closing war room")
    
    print(extractor.render_markdown())
    # extractor.save("incident-timeline.md")
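
One thing `render_markdown` does not handle is free-form text that contains a `|` character or a line break, either of which would break the table row it is placed in. The helper below is a small sketch of one way to guard against that, assuming the output is rendered as a GitHub-style Markdown table. The name `escape_table_cell` is hypothetical and not part of the script above.

```python
def escape_table_cell(text: str) -> str:
    """Make free-form text safe to place in a single Markdown table cell."""
    # A literal pipe would otherwise be read as a column separator.
    escaped = text.replace("|", "\\|")
    # A newline would end the table row, so line breaks become spaces.
    return " ".join(escaped.splitlines())


description = "Connection pool at 100/100 | all active\nlatency still rising"
print(f"| 14:33:00 | human | {escape_table_cell(description)} |")
```

Applying the helper to each description before building a row keeps pasted chat messages and log lines from shifting columns in the published timeline.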

5 Whys Root Cause Analysis

The 5 Whys technique involves asking "why?" repeatedly until you reach the systemic root cause. Typically 3 to 7 iterations are needed. The root cause should be a systemic issue (a missing process, inadequate tooling, or insufficient automation), not a human error.

Example: Payment API Latency Incident

Problem: Payment API p95 latency exceeded 5 seconds for 30 minutes.

Why? (1): Why was latency high?
→ The database connection pool was exhausted.

Why? (2): Why was the connection pool exhausted?
→ A new query in v2.4.0 held connections open 10x longer than before.

Why? (3): Why did a slow query get deployed?
→ The query was tested with 100 rows in staging but production has 10M rows.

Why? (4): Why wasn't production data volume tested?
→ The CI pipeline doesn't include performance testing with production-like data.

Why? (5): Why doesn't the CI pipeline include performance testing?
→ The team has not prioritized building a performance test environment
   with production data volume. [ROOT CAUSE: Systemic]

Additional Why for connection handling:
Why didn't the connection pool reject queries instead of hanging?
→ The connection pool doesn't have a query timeout configured.
   [CONTRIBUTING CAUSE: Systemic]

═══════════════════════════════════════════════════════════════

ACTION ITEMS FROM 5 WHYS:

1. Add query timeout (30s) to all database connection pools
   Owner: sre-team | Priority: P0

2. Build performance test stage in CI with production-like
   data volume (10M+ rows)
   Owner: platform-team | Priority: P1

3. Add deployment gate: block deploy if performance tests
   show >20% latency regression
   Owner: eng-team | Priority: P1
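
Action item 3 above describes a deployment gate based on a relative latency threshold. The arithmetic behind such a gate can be sketched as follows. The function names are hypothetical, the 20% default comes from the action item, and the example latencies are illustrative (250 ms and 5 s echo the figures used elsewhere in this guide). How the baseline and candidate latencies are measured is left open.

```python
def latency_regression(baseline_ms: float, candidate_ms: float) -> float:
    """Fractional change in latency relative to the baseline (0.25 means +25%)."""
    if baseline_ms <= 0:
        raise ValueError("baseline latency must be positive")
    return (candidate_ms - baseline_ms) / baseline_ms


def should_block_deploy(baseline_ms: float, candidate_ms: float,
                        max_regression: float = 0.20) -> bool:
    """Block the deploy when the candidate is more than max_regression slower."""
    return latency_regression(baseline_ms, candidate_ms) > max_regression


print(should_block_deploy(250, 290))   # 16% slower: False
print(should_block_deploy(250, 5000))  # 1900% slower: True
```

A relative threshold is used here instead of an absolute one so the same gate works for services with very different baseline latencies.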

Action Items: Prioritization and Tracking

Action items are the primary output of a post-mortem. Without them, the exercise is just storytelling. Every action item must have an owner, priority, and due date.

| Priority | Definition | Due Date | Escalation |
|----------|------------|----------|------------|
| P0 | Prevents incident recurrence; addresses safety-critical gap | Within 1 week | Daily tracking; VP notified if overdue |
| P1 | Significantly reduces likelihood or impact of similar incidents | Within 1 month | Weekly tracking; EM notified if overdue |
| P2 | Nice-to-have improvement; reduces toil or improves observability | Within 1 quarter | Monthly review; standard sprint planning |

# Action Item Tracking (Jira query for post-mortem action items)
# Save this as a Jira filter and review weekly

project IN (SRE, PLATFORM) AND
labels = "post-mortem-action" AND
status != Done AND
priority = P0
ORDER BY duedate ASC

# Dashboard: Post-Mortem Action Items
# Show by priority, by owner, by due date, by status
# Alert: P0 items overdue by >24 hours
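
The due-date rules in the priority table and the overdue alert noted above can be expressed as date arithmetic. This is a sketch under stated assumptions: "1 month" and "1 quarter" are approximated as 30 and 90 days, dates have day granularity (so "overdue by more than 24 hours" means at least two days past the due date), and the function names are hypothetical.

```python
from datetime import date, timedelta

# Assumption: "1 month" and "1 quarter" are approximated as 30 and 90 days.
DUE_WITHIN = {
    "P0": timedelta(days=7),
    "P1": timedelta(days=30),
    "P2": timedelta(days=90),
}


def due_date(priority: str, created: date) -> date:
    """Latest acceptable due date for an action item of this priority."""
    return created + DUE_WITHIN[priority]


def needs_p0_alert(priority: str, due: date, today: date) -> bool:
    """True once a P0 item is more than one full day past its due date."""
    return priority == "P0" and (today - due) > timedelta(days=1)


created = date(2024, 1, 15)
print(due_date("P0", created))  # 2024-01-22
print(needs_p0_alert("P0", due_date("P0", created), date(2024, 1, 24)))  # True
```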

Post-Mortem Review Meeting Facilitation

The review meeting is where the draft post-mortem is socialized, root cause is validated, and action items are agreed upon. Effective facilitation ensures psychological safety and productive outcomes.

Review Meeting Agenda (45 minutes)

  1. Read-aloud (5 min): The author reads the executive summary and timeline. Everyone else listens. No interruptions.
  2. Clarifying questions (10 min): Attendees ask factual questions about the timeline or technical details. No debate, no solutions yet.
  3. Root cause validation (10 min): Discuss the root cause analysis. Does everyone agree? Are there alternative contributing factors?
  4. Action item review (15 min): Review each proposed action item. Confirm priority, feasibility, and owner. Add missing items.
  5. Closing (5 min): Confirm publish date, approvers, and next steps. Thank participants.

Facilitation Rules

  • Start with safety: Remind everyone that this is blameless. If someone uses a name in a blaming way, gently reframe the statement in terms of the system.
  • No "should have": Ban phrases like "they should have" or "I would have." Focus on what the system should do differently.
  • Timebox debate: If root cause analysis is contentious, timebox discussion and document areas of disagreement.
  • Include the responders: The people who responded to the incident must be in the review. Their perspective is irreplaceable.
  • Action items must be concrete: Vague actions like "improve testing" are unacceptable. Specific: "Add integration test for payment webhook timeout scenario."

Publishing and Sharing Learnings

Post-mortems should be published within 48 hours of the incident resolution. The publishing workflow:

POST-MORTEM PUBLISHING WORKFLOW

1. AUTHOR writes draft post-mortem (within 24h of resolution)
   → Stores in post-mortem repository

2. REVIEWERS validate (within 12h of receiving the draft)
   → SRE Lead: validates technical accuracy
   → Engineering Manager: validates action item feasibility
   → IC from incident: validates timeline accuracy

3. REVIEW MEETING (within 36h of resolution)
   → Team discusses and approves

4. PUBLISH (within 48h of resolution)
   → Post to #post-mortems Slack channel
   → Add to incident tracking Jira ticket
   → Update status page with post-mortem link
   → Archive incident channel (after 7 days)

5. ORGANIZATIONAL LEARNING
   → Include summary in weekly SRE newsletter
   → Add to quarterly incident review deck
   → If pattern detected, trigger larger initiative
   → If relevant to other teams, share cross-team
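
The workflow deadlines can be computed from the resolution timestamp so that each post-mortem ticket carries concrete dates. The sketch below assumes the 24h, 36h, and 48h marks are all measured from incident resolution. The reviewer validation step is omitted because its window depends on when the draft is delivered. `publishing_deadlines` is a hypothetical helper, and the resolution time is the one from the sample report in this guide.

```python
from datetime import datetime, timedelta, timezone

# Hours after incident resolution; assumed to all be measured from resolution.
DEADLINES_H = {"draft": 24, "review_meeting": 36, "publish": 48}


def publishing_deadlines(resolved_at: datetime) -> dict:
    """Map each workflow step to its deadline, measured from resolution."""
    return {
        step: resolved_at + timedelta(hours=hours)
        for step, hours in DEADLINES_H.items()
    }


resolved = datetime(2024, 1, 15, 14, 45, tzinfo=timezone.utc)
for step, deadline in publishing_deadlines(resolved).items():
    print(f"{step:15s} {deadline:%Y-%m-%d %H:%M} UTC")
```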

Post-Mortem Repository Organization

post-mortems/
├── README.md                    # Index and search instructions
├── TEMPLATE.md                  # Copy for new post-mortems
├── 2024/
│   ├── Q1/
│   │   ├── INC-2024-0001-payment-api-connection-pool.md
│   │   ├── INC-2024-0002-knox-auth-degraded-latency.md
│   │   └── INC-2024-0003-cache-invalidation-failure.md
│   ├── Q2/
│   │   ├── INC-2024-0015-smartthings-device-registration.md
│   │   └── INC-2024-0016-pass-biometric-timeout.md
│   └── Q3/
│       └── ...
├── 2023/
│   └── ...
└── patterns/
    ├── PATTERN-001-deployment-correlated-failures.md
    ├── PATTERN-002-cascade-failures.md
    └── PATTERN-003-third-party-outage-handling.md
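
File paths in this layout follow a predictable pattern, so they can be generated rather than typed. The following sketch assumes calendar quarters and the `<incident-id>-<slug>.md` naming shown in the tree. `postmortem_path` is a hypothetical helper, and the date in the example is an arbitrary Q1 date chosen for illustration.

```python
from datetime import date
from pathlib import PurePosixPath


def postmortem_path(incident_id: str, incident_date: date, slug: str) -> PurePosixPath:
    """Build post-mortems/<year>/Q<n>/<incident_id>-<slug>.md."""
    # Calendar quarters: Jan-Mar is Q1, Apr-Jun is Q2, and so on.
    quarter = (incident_date.month - 1) // 3 + 1
    filename = f"{incident_id}-{slug}.md"
    return PurePosixPath("post-mortems") / str(incident_date.year) / f"Q{quarter}" / filename


print(postmortem_path("INC-2024-0001", date(2024, 1, 15), "payment-api-connection-pool"))
# post-mortems/2024/Q1/INC-2024-0001-payment-api-connection-pool.md
```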

Sample Post-Mortem Document (Full Example)

Below is a complete, filled-out post-mortem for a real-world-style incident. Use this as a reference for quality and depth expectations.

═══════════════════════════════════════════════════════════════
                    POST-MORTEM REPORT
═══════════════════════════════════════════════════════════════

Incident ID: INC-2024-0015
Title: Samsung Pay - Payment Processing Latency Degradation
Date: 2024-01-15
Author: Alice Smith, Pay SRE Team
Status: PUBLISHED
Reviewers: Bob Johnson (SRE Lead), Carol Williams (Eng Manager)

─── EXECUTIVE SUMMARY ─────────────────────────────────────────

On January 15, 2024, at 14:23 UTC, deployment of Payment API
v2.4.0 introduced a slow database query that exhausted the
PostgreSQL connection pool. This caused p95 latency to spike
from ~250ms to over 5 seconds for 22 minutes, affecting ~12%
of payment transactions. The team rolled back to v2.3.9
(completed at 14:40 UTC), and alerts cleared at 14:45 UTC.
No payments were lost; failed transactions were automatically
retried by the client SDK.

─── INCIDENT METRICS ──────────────────────────────────────────

| Metric          | Value        |
|-----------------|--------------|
| Severity        | SEV2         |
| Detection Time  | 14:25 UTC    |
| Start Time      | 14:23 UTC    |
| Resolution Time | 14:45 UTC    |
| Total Duration  | 22 minutes   |
| MTTD            | 2 minutes    |
| MTTR            | 20 minutes   |

─── IMPACT ────────────────────────────────────────────────────

User Impact:
  • ~12% of payment transactions experienced latency > 5 seconds
  • 3% of transactions failed with timeout errors
  • Client SDK automatic retry succeeded for 99.2% of failed txs
  • 47 users contacted support about slow payments

Business Impact:
  • Revenue: No direct loss (retries succeeded)
  • SLA: Payment latency SLO (p95 < 300ms) breached
  • Error budget consumed: 15% of monthly availability budget
  • Reputation: 3 negative app store reviews mentioning "slow"

─── TIMELINE ──────────────────────────────────────────────────

| Time (UTC) | Source | Event |
|------------|--------|-------|
| 14:23:00 | deployment | v2.4.0 deployed to production by deploy-bot |
| 14:25:12 | alert | 🔥 Alert [PaymentAPI_LatencyP95High] firing |
| 14:25:45 | alert | 🔥 Alert [PaymentAPI_ErrorRateElevated] firing |
| 14:26:00 | human | 👤 oncall-alice: Acknowledged alert, checking dashboard |
| 14:28:00 | human | 👤 oncall-alice: Latency spike correlates with v2.4.0 deploy |
| 14:30:00 | human | 👤 oncall-alice: Opened #incident-pay-20240115 war room |
| 14:32:00 | human | 👤 sre-bob: Joined war room, analyzing database metrics |
| 14:33:00 | human | 👤 sre-bob: Connection pool at 100/100 connections, all active |
| 14:35:00 | human | 👤 oncall-alice: Initiated rollback to v2.3.9 |
| 14:40:00 | deployment | 🚀 Rollback to v2.3.9 completed |
| 14:42:00 | human | 👤 sre-bob: Connection pool draining, latency dropping |
| 14:45:00 | alert | ✅ Alert [PaymentAPI_LatencyP95High] resolved |
| 14:45:00 | alert | ✅ Alert [PaymentAPI_ErrorRateElevated] resolved |
| 14:50:00 | human | 👤 oncall-alice: All metrics normal, closing war room |

─── ROOT CAUSE ANALYSIS ───────────────────────────────────────

Proximate Cause:
  A new analytics query added to v2.4.0 (PR #1847) performed a
  full table scan on the transactions table. In staging (100K
  rows), the query completed in <100ms. In production (12M rows),
  it took 8-12 seconds and held a database connection for the
  entire duration.

Root Cause:
  The CI pipeline does not include performance testing with
  production-like data volumes. Queries are only tested against
  a small staging dataset, which masks performance issues that
  only appear at scale.

Contributing Factors:
  1. The connection pool max size (100) had not been reviewed in
     18 months despite 3x traffic growth during that period.
  2. The deployment dashboard showed v2.4.0 deploy succeeded
     (green checkmark) even though the canary analysis only
     checked for 5xx errors, not latency regression.
  3. The query timeout was not configured, so connections would
     wait indefinitely for slow queries to complete.

5 Whys Analysis:
  Why was latency high? → Connection pool exhausted.
  Why was pool exhausted? → Slow query held connections open.
  Why was query slow? → Full table scan on 12M row table.
  Why wasn't this caught? → No performance testing at scale.
  Why no perf testing? → Not prioritized; no infrastructure for it.

─── LESSONS LEARNED ───────────────────────────────────────────

What Went Well:
  • MTTD was only 2 minutes: Datadog synthetic monitors caught
    the latency spike immediately.
  • The rollback procedure worked smoothly and completed in 5 min.
  • The client SDK's automatic retry prevented significant
    user-visible failures.
  • The war room was opened quickly and communication was clear.

What Went Poorly:
  • The canary analysis did not detect the latency regression.
    The canary only checked error rate, not p95 latency.
  • No query timeout meant a single slow query could exhaust
    the entire pool instead of being killed.
  • The connection pool size had not been reviewed as traffic grew.

Where We Got Lucky:
  • The incident happened during business hours when the full SRE
    team was available. If this had occurred at 3 AM, MTTR would
    have been significantly longer.
  • The client SDK retry logic masked most of the impact. Without
    it, the user-visible failure rate would have been much higher.

─── ACTION ITEMS ──────────────────────────────────────────────

| # | Action | Owner | Priority | Due | Status |
|---|--------|-------|----------|-----|--------|
| 1 | Add 30s query timeout to all DB connection pools | sre-team | P0 | Jan 22 | DONE |
| 2 | Add latency check to canary analysis (>50% regression = block) | platform-team | P0 | Jan 29 | OPEN |
| 3 | Build perf test stage in CI with 10M+ row dataset | platform-team | P1 | Feb 15 | OPEN |
| 4 | Review and increase connection pool size from 100 to 200 | sre-team | P1 | Jan 29 | DONE |
| 5 | Add connection pool utilization alert (>80% for 5 min) | sre-team | P1 | Feb 5 | OPEN |

─── RELATED INCIDENTS ─────────────────────────────────────────

• INC-2023-0089: Similar connection pool exhaustion on Knox API
  (different service, same root cause: no perf testing at scale)
• Pattern match: Both incidents involved deployment of queries
  that passed staging but failed in production due to data volume.

─── APPENDIX ──────────────────────────────────────────────────

Datadog Dashboard (snapshot): https://app.datadoghq.com/dashboard/...
Grafana Dashboard: https://grafana.internal/d/payment-api
PR #1847 (reverted): https://github.com/org/payment-api/pull/1847
Deploy pipeline run: https://ci.internal/payment-api/build/48291

─── SIGN-OFF ──────────────────────────────────────────────────

SRE Lead:        Bob Johnson      Date: 2024-01-17
Eng Manager:     Carol Williams   Date: 2024-01-17
Incident IC:     Alice Smith      Date: 2024-01-17

Repeat Incident Tracking

One of the key metrics for SRE effectiveness is the rate of repeat incidents, that is, incidents with the same root cause as a previous incident. Track this metric and target a reduction of 30%+ year over year.

# Repeat Incident Analysis (quarterly)
# Query: How many incidents this quarter had a root cause
# matching a previous incident's root cause?

# Classification tags for root causes:
# - deployment-bug          Code bug introduced in deployment
# - config-change           Configuration change caused issue
# - capacity-limit          Insufficient capacity for load
# - dependency-failure      Third-party or internal dependency failed
# - data-volume             Issue only visible at production data scale
# - cascading-failure       Failure spread to other systems
# - human-error             Process/tooling allowed human mistake
# - security                Security-related incident
# - unknown                 Root cause not determined

# Quarterly report:
# Q1 2024: 23 total incidents, 4 repeat incidents (17.4%)
# Q2 2024: 19 total incidents, 3 repeat incidents (15.8%)
# Q3 2024: 18 total incidents, 2 repeat incidents (11.1%)  ← Target met

Pro Tip: At Samsung, we reduced repeat incidents by 35% over 12 months by implementing two practices: (1) mandatory cross-team pattern review, in which all SEV1/2 post-mortems from all teams (Knox, Pay, SmartThings) were reviewed monthly to identify recurring themes; and (2) action item tracking with weekly executive review of P0 items, where visibility at the leadership level ensured action items actually got done.
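
The quarterly repeat rate shown above is a simple ratio, and counting repeats amounts to checking whether a root cause has been seen before. The sketch below illustrates both steps. It assumes each incident carries a root-cause fingerprint that is more specific than the classification tags (the fingerprint strings shown are hypothetical), and the function names are illustrative. The quarterly totals are the ones from the report above.

```python
def repeat_rate(total_incidents: int, repeat_incidents: int) -> float:
    """Percentage of incidents whose root cause matched an earlier incident."""
    if total_incidents <= 0:
        raise ValueError("total_incidents must be positive")
    return round(100 * repeat_incidents / total_incidents, 1)


def count_repeats(fingerprints: list) -> int:
    """Count incidents whose fingerprint appeared earlier in the ordered list."""
    seen = set()
    repeats = 0
    for fingerprint in fingerprints:
        if fingerprint in seen:
            repeats += 1
        seen.add(fingerprint)
    return repeats


# Figures from the quarterly report above.
for quarter, total, repeats in [("Q1", 23, 4), ("Q2", 19, 3), ("Q3", 18, 2)]:
    print(quarter, repeat_rate(total, repeats))

# Hypothetical fingerprints, in chronological order.
history = ["data-volume/no-perf-test", "config-change/missing-validation",
           "data-volume/no-perf-test"]
print(count_repeats(history))  # 1
```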

Last updated: June 2026 | Author: SRE Team