SLO / SLI / SLA

Service Level Objectives, Indicators, and Agreements form the foundation of SRE practice. This guide covers how to define, measure, and enforce reliability targets with concrete examples, burn rate calculations, alerting rules, and production-ready templates.

Definitions

SLI: Service Level Indicator

An SLI is a quantitative measure of some aspect of the service level. It answers the question: "What do we measure?" SLIs should be directly tied to user experience: if the SLI degrades, users should notice.

SLO: Service Level Objective

An SLO is a target value for an SLI over a specific time period. It answers: "What target do we aim for?" SLOs are internal goals used to make engineering decisions. They should be achievable but meaningful: setting SLOs too tight causes unnecessary toil; too loose erodes user trust.

SLA: Service Level Agreement

An SLA is a contract with your users (internal or external) that specifies consequences for failing to meet SLOs. It answers: "What happens if we miss?" Not every SLO needs an SLA, only those with business-negotiated penalties.

Relationship: SLAs are built on SLOs. SLOs are built on SLIs. You should have many SLIs, a focused set of SLOs (3-5 per service), and SLAs only for business-critical commitments. A good rule: if there's no business consequence for missing it, it's an SLO, not an SLA.

Choosing Good SLIs

Good SLIs share these characteristics: they are user-centric (measure what users care about), measurable (can be instrumented automatically), aggregatable (can be combined across instances), and actionable (degradation triggers a clear response).

Common SLI Categories

| Category | Definition | Typical Measurement | Applies To |
|----------|------------|---------------------|------------|
| Availability | The proportion of requests that result in a successful response | good_requests / total_requests | All request-driven services |
| Latency | The time it takes to process a request | Distribution (p50, p95, p99) | Interactive services |
| Throughput | The rate at which the service processes requests | Requests per second | Data pipelines, queues |
| Error Rate | The proportion of requests that result in an error | error_requests / total_requests | All request-driven services |
| Durability | The proportion of data that is not lost | 1 - (lost_records / total_records) | Storage systems |
| Freshness | How current the data served by the system is | max(time_since_last_update) | Caches, replicas, ETL |
| Correctness | The proportion of responses that contain correct data | correct_responses / total_responses | Financial, data services |
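
As a rough illustration of the ratio-style and distribution-style measurements in the table, the sketch below computes availability, error rate, and a latency percentile from a list of request records. The `Request` record shape, the choice to treat any non-5xx response as "good", and the nearest-rank percentile method are assumptions made for this example; the table itself does not prescribe them.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code (hypothetical record shape)
    latency_ms: float  # time taken to process the request


def availability(requests):
    """good_requests / total_requests; "good" here means a non-5xx response."""
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def error_rate(requests):
    """error_requests / total_requests; an error here is a 5xx response."""
    errors = sum(1 for r in requests if r.status >= 500)
    return errors / len(requests)


def latency_percentile(requests, pct):
    """Nearest-rank latency percentile; pct is an integer such as 50, 95, or 99."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


# 98 successful requests with latencies 10, 20, ..., 980 ms, plus two failures.
sample = [Request(200, 10.0 * i) for i in range(1, 99)]
sample += [Request(500, 990.0), Request(503, 1000.0)]

print(availability(sample))            # 0.98
print(error_rate(sample))              # 0.02
print(latency_percentile(sample, 95))  # 950.0
```

In a real pipeline these values would usually come from the monitoring system rather than from raw records, but the arithmetic is the same.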

Mapping SLIs to User Journeys

Start with user journeys, not system metrics. A user journey is a sequence of steps a user takes to achieve a goal. For each step, identify what could go wrong from the user's perspective.

User Journey: "Purchase an item on Samsung Pay"

Step 1: Open Samsung Pay app
  └─ SLI: App availability (successful opens / attempts)
  └─ SLI: App load time (time to interactive)

Step 2: Select payment card
  └─ SLI: Card list API availability
  └─ SLI: Card list API latency (p95)

Step 3: Authenticate (fingerprint/PIN)
  └─ SLI: Authentication success rate
  └─ SLI: Authentication latency

Step 4: Complete payment
  └─ SLI: Payment API availability
  └─ SLI: Payment API latency (p99; financial, needs tight SLO)
  └─ SLI: Payment processing error rate

Step 5: Receive confirmation
  └─ SLI: Push notification delivery rate
  └─ SLI: Notification latency (time to receive)

SLI Specification Template

Every SLI should be documented with a formal specification that eliminates ambiguity about what is measured, how it's calculated, and what constitutes "good."

═══════════════════════════════════════════════════════════════════
                    SLI SPECIFICATION
═══════════════════════════════════════════════════════════════════

SLI ID: PAY-API-LAT-001
Service: Samsung Pay - Payment Processing API
SLI Name: Payment Request Latency (p95)

─── DEFINITION ───────────────────────────────────────────────────

Description: The 95th percentile latency of successful payment
             processing requests as measured at the API gateway.

User Impact: Users experiencing payment delays may abandon
             transactions, leading to lost revenue.

─── MEASUREMENT ──────────────────────────────────────────────────

Event Source: API Gateway access logs (AWS ALB / Kong / Nginx)

Good Event: A request to POST /v1/payments that:
  1. Returns HTTP 2xx status code
  2. Has response_time <= 300ms
  3. Is not a health check request (User-Agent != health-checker)

Valid Event: Any request to POST /v1/payments from a
            non-internal source IP that is not a health check.

Exclusions:
  - Health check requests
  - Requests from Samsung internal monitoring tools
  - Requests during scheduled maintenance windows (pre-announced)

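Calculation:
  SLI value = percentile(response_time of valid events with HTTP 2xx, 95)
  Apply the 300ms threshold after taking the percentile; the p95 of good
  events alone is <= 300ms by construction.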

─── TARGET ───────────────────────────────────────────────────────

SLO Target: p95 latency <= 300ms over 30-day rolling window

Error Budget: 5% of requests may exceed 300ms

Measurement Window: 30 days, rolling

─── DATA SOURCES ─────────────────────────────────────────────────

Primary: Datadog APM trace metrics (trace.http.request.duration)
Backup:  ALB access logs in S3, processed by Athena query

─── ALERTING ─────────────────────────────────────────────────────

Fast Burn:   p95 > 600ms for 10 minutes     → PAGE
Medium Burn: p95 > 300ms for 1 hour         → PAGE
Slow Burn:   p95 > 300ms for 6 hours        → TICKET (next business day)

─── REVIEW ───────────────────────────────────────────────────────

Last Reviewed: [DATE]
Review Cycle: Quarterly
Next Review: [DATE]
Approved By: [NAME]

─── CHANGE LOG ───────────────────────────────────────────────────

YYYY-MM-DD: Initial SLI definition (v1.0)
YYYY-MM-DD: Changed threshold from 500ms to 300ms after performance
            improvements in v2.3.1 (v1.1)
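
The good-event and valid-event rules in this specification can be sketched as two predicates. The example below then reports the share of valid events that are good, which is the quantity the 5% error budget counts against. The `Event` fields, including the `internal_source` and `in_maintenance` flags, are hypothetical names chosen for illustration; a real implementation would derive them from the gateway logs named under Event Source.

```python
from dataclasses import dataclass

LATENCY_THRESHOLD_MS = 300  # threshold from the specification


@dataclass
class Event:
    method: str
    path: str
    status: int
    response_time_ms: float
    user_agent: str
    internal_source: bool = False  # hypothetical: request came from an internal IP
    in_maintenance: bool = False   # hypothetical: inside a pre-announced window


def is_valid(e):
    """Valid event: a payment request that is not excluded by the spec."""
    return (
        e.method == "POST"
        and e.path == "/v1/payments"
        and e.user_agent != "health-checker"
        and not e.internal_source
        and not e.in_maintenance
    )


def is_good(e):
    """Good event: a valid event with a 2xx status within the latency threshold."""
    return (
        is_valid(e)
        and 200 <= e.status < 300
        and e.response_time_ms <= LATENCY_THRESHOLD_MS
    )


def good_ratio(events):
    """Fraction of valid events that are good, or None if nothing is valid."""
    valid = [e for e in events if is_valid(e)]
    if not valid:
        return None
    return sum(1 for e in valid if is_good(e)) / len(valid)


events = [
    Event("POST", "/v1/payments", 200, 120.0, "app"),             # good
    Event("POST", "/v1/payments", 200, 450.0, "app"),             # too slow
    Event("POST", "/v1/payments", 500, 80.0, "app"),              # server error
    Event("POST", "/v1/payments", 200, 5.0, "health-checker"),    # excluded
    Event("POST", "/v1/payments", 200, 90.0, "monitor", internal_source=True),  # excluded
]
print(good_ratio(events))  # 1 good out of 3 valid events
```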

SLO Calculation with Burn Rate

Burn rate measures how fast you're consuming your error budget relative to the ideal consumption rate. A burn rate of 1.0 means you'll exactly exhaust your budget at the end of the compliance period. Burn rate is the key metric for SLO-based alerting.

# Burn Rate Calculation
#
# burn_rate = (actual_error_rate) / (error_budget_rate)
#
# Where:
#   actual_error_rate   = errors in window / total requests in window
#   error_budget_rate   = 1 - SLO (e.g., 0.001 for 99.9%)

# Example: 99.9% availability SLO (0.1% error budget)
error_budget_rate = 0.001  # 0.1%

# Scenario 1: Normal operation
errors_1h = 10
total_1h = 1_000_000
actual_error_rate = errors_1h / total_1h  # 0.00001
burn_rate = actual_error_rate / error_budget_rate  # 0.01
# Status: Healthy (burning 1% of budget rate)

# Scenario 2: Degraded service
errors_1h = 500
total_1h = 1_000_000
actual_error_rate = errors_1h / total_1h  # 0.0005
burn_rate = actual_error_rate / error_budget_rate  # 0.5
# Status: Concerning (burning 50% of budget rate)

# Scenario 3: Outage
errors_1h = 5000
total_1h = 1_000_000
actual_error_rate = errors_1h / total_1h  # 0.005
burn_rate = actual_error_rate / error_budget_rate  # 5.0
# Status: CRITICAL (will exhaust the 30-day budget in 6 days, i.e. 144 hours, if sustained)

# Time to exhaust budget at various burn rates:
# burn_rate 1.0  → 30 days (entire compliance period)
# burn_rate 2.0  → 15 days
# burn_rate 6.0  → 5 days
# burn_rate 14.4 → 2 days
# burn_rate 30.0 → 1 day
# burn_rate 72.0 → 10 hours

time_to_exhaust_hours = 30 * 24 / burn_rate

Complete Example: Define SLOs for a REST API

Here is a complete, production-ready SLO definition for a hypothetical e-commerce payment API. This demonstrates the full specification format used in enterprise environments.

═══════════════════════════════════════════════════════════════════
        SLO DOCUMENT: Payment Processing API v2
        Service: payment-api.prod.internal
        Owner: Pay Platform Team
        SRE Owner: Pay SRE Team
        Effective Date: 2024-01-01
        Review Cycle: Quarterly
═══════════════════════════════════════════════════════════════════

─── SERVICE OVERVIEW ─────────────────────────────────────────────

The Payment Processing API handles payment authorization, capture,
and refund operations for Samsung Pay merchants. It is a
user-facing critical service; any degradation directly impacts
revenue and customer trust.

Dependencies:
  - payment-db.primary (PostgreSQL 14)
  - payment-db.replica (PostgreSQL 14 read replica)
  - fraud-service.internal
  - token-service.internal
  - redis-cache.internal (session state)
  - kafka.internal (event publishing)

─── SLIs AND SLOs ────────────────────────────────────────────────

┌─────────────────┬──────────────────────┬────────────────────────┐
│ SLI             │ Target (30d)         │ Rationale              │
├─────────────────┼──────────────────────┼────────────────────────┤
│ Availability    │ 99.95%               │ Payment failures cause │
│                 │ (21.6 min downtime)  │ immediate user-visible │
│                 │                      │ errors and revenue loss│
├─────────────────┼──────────────────────┼────────────────────────┤
│ Latency (p50)   │ ≤ 100ms              │ 50% of requests should │
│                 │                      │ feel instant           │
├─────────────────┼──────────────────────┼────────────────────────┤
│ Latency (p95)   │ ≤ 300ms              │ 95% of requests within │
│                 │                      │ acceptable UX threshold│
├─────────────────┼──────────────────────┼────────────────────────┤
│ Latency (p99)   │ ≤ 1000ms             │ Worst-case acceptable  │
│                 │                      │ for complex operations │
├─────────────────┼──────────────────────┼────────────────────────┤
│ Error Rate      │ ≤ 0.1%               │ < 1 in 1000 requests   │
│                 │                      │ may fail               │
├─────────────────┼──────────────────────┼────────────────────────┤
│ Throughput      │ ≥ 5000 req/s peak    │ Sustained Black Friday │
│                 │                      │ traffic in 2023        │
└─────────────────┴──────────────────────┴────────────────────────┘

─── ERROR BUDGET ─────────────────────────────────────────────────

Primary Error Budget (Availability):
  SLO: 99.95%
  Budget: 0.05% failed requests
  30-day budget: 50,000 failed requests per 100M requests

Secondary Error Budget (Latency p95):
  SLO: ≀ 300ms
  Budget: 5% of requests may exceed 300ms

─── BUDGET CONSUMPTION TRACKING ──────────────────────────────────

Dashboard: https://datadog.internal/dashboard/payment-api-slo

Alerting Thresholds:
  • 2% budget consumed in 1 hour   → Page (fast burn)
  • 5% budget consumed in 6 hours  → Page (medium burn)
  • 10% budget consumed in 3 days  → Ticket (slow burn)
  • 50% budget consumed total      → Freeze non-critical releases
  • 75% budget consumed total      → Emergency reliability sprint

─── EXCLUSIONS ───────────────────────────────────────────────────

The following do not consume error budget:
  1. Scheduled maintenance windows (approved 48h in advance)
  2. Client errors (HTTP 4xx): these are not service failures
  3. Requests exceeding client timeout (client-side abandonment)
  4. Force majeure events (AWS region outage beyond our control)

─── ESCALATION ───────────────────────────────────────────────────

Budget consumed > 75%:   Escalate to Engineering Director
Budget consumed > 90%:   Escalate to VP Engineering + Product
Budget consumed = 100%:  Full release freeze, emergency retro

─── APPROVALS ────────────────────────────────────────────────────

Tech Lead:      _______________________  Date: _______________
SRE Lead:       _______________________  Date: _______________
Product Manager: _______________________ Date: _______________
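
The error budget figures in a document like this follow directly from the SLO target. As a minimal sketch, the helpers below convert a target into allowed downtime and allowed failed requests, and map the fraction of budget consumed to the actions listed under BUDGET CONSUMPTION TRACKING and ESCALATION. The function names and the "Normal operations" label are assumptions introduced for this example.

```python
def allowed_downtime_minutes(slo, window_days=30):
    """Minutes of full outage that exactly use up the budget in the window."""
    return (1 - slo) * window_days * 24 * 60


def allowed_failures(slo, total_requests):
    """Number of failed requests the budget permits for a given request volume."""
    return (1 - slo) * total_requests


def budget_action(consumed):
    """Map the fraction of budget consumed (0.0 to 1.0+) to a policy action."""
    if consumed >= 1.0:
        return "Full release freeze, emergency retro"
    if consumed >= 0.75:
        return "Emergency reliability sprint"
    if consumed >= 0.50:
        return "Freeze non-critical releases"
    return "Normal operations"  # hypothetical label for the default state


print(round(allowed_downtime_minutes(0.9995), 1))   # 21.6
print(round(allowed_failures(0.9995, 100_000_000)))  # 50000
print(budget_action(0.35))                           # Normal operations
print(budget_action(0.80))                           # Emergency reliability sprint
```

Rounding is applied only for display; floating-point subtraction such as `1 - 0.9995` is not exact.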

SLO Compliance Dashboard Specification

A well-designed SLO dashboard provides at-a-glance visibility into service health. Here is the specification for a Datadog dashboard layout.

Datadog Dashboard: "SLO Overview: Payment API"

Row 1: SERVICE HEALTH SUMMARY (3 widgets)
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│   Availability  │  │   Latency p95   │  │   Error Rate    │
│                 │  │                 │  │                 │
│     99.97%      │  │     245ms       │  │    0.04%        │
│   ↑ 0.02%       │  │   ↓ 12ms        │  │   ↓ 0.01%       │
│   [GREEN]       │  │   [GREEN]       │  │   [GREEN]       │
│   vs SLO: 99.95%│  │   vs SLO: 300ms │  │   vs SLO: 0.1%  │
└─────────────────┘  └─────────────────┘  └─────────────────┘

Row 2: ERROR BUDGET (2 widgets)
┌─────────────────────────────┐  ┌─────────────────────────────┐
│   Availability Error Budget │  │   Latency Error Budget      │
│                             │  │                             │
│   [████████░░░░░░░░░] 35%   │  │   [████░░░░░░░░░░░░░] 18%   │
│   consumed (65% remaining)  │  │   consumed (82% remaining)  │
│                             │  │                             │
│   Projected: 18 days left   │  │   Projected: 25 days left   │
│   at current burn rate      │  │   at current burn rate      │
└─────────────────────────────┘  └─────────────────────────────┘

Row 3: BURN RATE TRENDS (1 widget)
┌─────────────────────────────────────────────────────────────┐
│   Burn Rate (7-day rolling)                                 │
│                                                             │
│   10 ─                                          ╱╲          │
│    8 ─                                       ╱╱   ╲╲        │
│    6 ─                              OUTAGE  ╱╱       ╲      │
│    4 ─    ╱╲   ╱╲        ╱╲              ╱╱                 │
│    2 ─╱╲╱  ╲╲╱  ╲╲╱╲╱╲╱╲╱  ╲╲╱╲╱╲╱╲╱╲╱╲╱╱                   │
│    0 ┼────────────────────────────────────────────────      │
│       MTWTFSSMTWTFSSMTWTFSSMTWTFSSMTWTFSSMTWTFSS            │
│                                                             │
│   Alert threshold: 14.4 (dashed line)                       │
└─────────────────────────────────────────────────────────────┘

Row 4: REQUEST BREAKDOWN (3 widgets)
┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
│   Requests/sec  │  │   Status Code   │  │   Latency       │
│                 │  │   Distribution  │  │   Distribution  │
│   2.4K avg      │  │   200: 99.2%    │  │   p50: 87ms     │
│   [timeseries]  │  │   4xx: 0.6%     │  │   p95: 245ms    │
│                 │  │   5xx: 0.2%     │  │   p99: 412ms    │
│                 │  │   [pie chart]   │  │   [heatmap]     │
└─────────────────┘  └─────────────────┘  └─────────────────┘

Error Budget Burn Rate Alerting

Google's Site Reliability Workbook defines the multi-window, multi-burn-rate alerting approach. This method pages on significant events while avoiding alert fatigue from minor blips.

Alerting Principles

  • Page alerts (immediate human response): Require intervention within minutes. Trigger on fast burn rates over short windows.
  • Ticket alerts (next business day): Indicate trends requiring attention but not immediate action. Trigger on slow burn rates over long windows.
  • Threshold-based alerts are a backup for symptoms that burn-rate alerts might miss (e.g., total outage where no requests succeed).

Recommended Burn Rate Thresholds

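| Burn Rate | Alert Window | Budget Consumed | Action | Use Case |
|-----------|--------------|-----------------|--------|----------|
| 14.4x | 1 hour | 2% | Page immediately | Fast-burn outage detection |
| 6x | 6 hours | 5% | Page | Sustained degradation |
| 1x | 3 days | 10% | Ticket | Slow-burn trend |
| 1x | 30 days | 100% | Review in planning | Budget exhaustion warning |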
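
Each burn rate in the table follows from the share of the 30-day budget that the alert should catch and the window it is measured over: burn rate = budget fraction × period / window. The sketch below shows that arithmetic for the two paging rows and converts a burn rate into the error-ratio threshold an alert rule would compare against. The function names are illustrative assumptions.

```python
def burn_rate_threshold(budget_fraction, window_hours, period_days=30):
    """Burn rate that consumes `budget_fraction` of the budget in `window_hours`."""
    return budget_fraction * (period_days * 24) / window_hours


def error_ratio_threshold(burn_rate, slo):
    """Error ratio at which an alert for `burn_rate` should fire."""
    return burn_rate * (1 - slo)


fast = burn_rate_threshold(0.02, 1)    # 2% of the budget in 1 hour
medium = burn_rate_threshold(0.05, 6)  # 5% of the budget in 6 hours
print(round(fast, 2), round(medium, 2))               # 14.4 6.0
print(round(error_ratio_threshold(14.4, 0.9995), 6))  # 0.0072
```

For a 99.95% SLO, the 14.4x row therefore corresponds to an error ratio of about 0.72% sustained over one hour.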

SLO-Based Alerting Rules

Prometheus Recording and Alerting Rules

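# slo-rules.yml: Prometheus recording and alerting rules for SLO-based alerting
# Load with: list this file under rule_files in prometheus.yml, then reload Prometheus
# (with the Prometheus Operator, wrap these groups in a PrometheusRule resource before kubectl apply)
# Validate with: promtool check rules slo-rules.yml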

groups:
  - name: payment_api_slo_availability
    interval: 60s
    rules:
      # Recording rule: error rate over 5m
      - record: job:slo_errors_per_request:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{job="payment-api",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="payment-api"}[5m]))

      # Recording rule: error rate over 1h
      - record: job:slo_errors_per_request:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{job="payment-api",status=~"5.."}[1h]))
          /
          sum(rate(http_requests_total{job="payment-api"}[1h]))

      # Recording rule: error rate over 6h
      - record: job:slo_errors_per_request:ratio_rate6h
        expr: |
          sum(rate(http_requests_total{job="payment-api",status=~"5.."}[6h]))
          /
          sum(rate(http_requests_total{job="payment-api"}[6h]))

      # Recording rule: error rate over 3d
      - record: job:slo_errors_per_request:ratio_rate3d
        expr: |
          sum(rate(http_requests_total{job="payment-api",status=~"5.."}[3d]))
          /
          sum(rate(http_requests_total{job="payment-api"}[3d]))

      # Burn rate calculation
      # SLO = 99.95% → error budget = 0.05% = 0.0005
      - record: job:slo_availability_burn_rate:ratio
        expr: |
          job:slo_errors_per_request:ratio_rate1h / 0.0005

  - name: payment_api_alerts
    rules:
      # Fast burn: 14.4x over 1 hour → page immediately
      # The 5m short window lets the alert clear soon after the burn stops.
      - alert: PaymentAPI_FastBurn
        expr: |
          job:slo_errors_per_request:ratio_rate1h > (14.4 * 0.0005)
          and
          job:slo_errors_per_request:ratio_rate5m > (14.4 * 0.0005)
        for: 2m
        labels:
          severity: critical
          team: pay-sre
          slo: availability
        annotations:
          summary: "Payment API fast error budget burn"
          description: >-
            Error rate is {{ $value | humanizePercentage }} over 1h, above the
            14.4x burn-rate threshold. At 14.4x the 30-day error budget is
            exhausted in about 2 days.
          runbook_url: "https://wiki.internal/runbooks/payment-api-fast-burn"
          dashboard: "https://grafana.internal/d/payment-api"

      # Medium burn: 6x over 6 hours → page
      - alert: PaymentAPI_MediumBurn
        expr: job:slo_errors_per_request:ratio_rate6h > (6 * 0.0005)
        for: 5m
        labels:
          severity: warning
          team: pay-sre
          slo: availability
        annotations:
          summary: "Payment API medium error budget burn"
          description: >-
            Error rate is {{ $value | humanizePercentage }} over 6h.
            At 6x the 30-day error budget is exhausted in 5 days.
          runbook_url: "https://wiki.internal/runbooks/payment-api-medium-burn"

      # Slow burn: 1x over 3 days (10% of the 30-day budget) → ticket
      - alert: PaymentAPI_SlowBurn
        expr: job:slo_errors_per_request:ratio_rate3d > (1 * 0.0005)
        for: 1h
        labels:
          severity: info
          team: pay-sre
          slo: availability
        annotations:
          summary: "Payment API slow error budget burn"
          description: >-
            Error rate is {{ $value | humanizePercentage }} over 3d.
            Investigate during business hours.

      # Latency SLO: p95 > 300ms
      - alert: PaymentAPI_LatencyP95High
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{job="payment-api"}[5m])) by (le)
          ) > 0.3
        for: 10m
        labels:
          severity: warning
          team: pay-sre
          slo: latency
        annotations:
          summary: "Payment API p95 latency exceeds 300ms"
          description: "p95 latency is {{ $value }}s. SLO threshold: 300ms."

      # Total budget exhaustion check
      - alert: PaymentAPI_ErrorBudgetExhausted
        expr: |
          (
            sum(increase(http_requests_total{job="payment-api",status=~"5.."}[30d]))
            /
            sum(increase(http_requests_total{job="payment-api"}[30d]))
          ) > 0.0005
        for: 1h
        labels:
          severity: critical
          team: pay-sre
          slo: availability
        annotations:
          summary: "Payment API 30-day error budget exhausted"
          description: "30-day error budget has been consumed. Release freeze in effect."
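
The multi-window approach from the Site Reliability Workbook pairs each long window with a much shorter one (for example 1 hour with 5 minutes) and fires only when both exceed the same threshold, so an alert stops soon after the burn ends. The sketch below models that decision in plain Python as an illustration; `should_fire` and its arguments are hypothetical names, and the ratios would come from recording rules like the ones above.

```python
ERROR_BUDGET = 0.0005  # 1 - 0.9995, matching the rules above


def should_fire(long_window_ratio, short_window_ratio, burn_rate,
                budget=ERROR_BUDGET):
    """Fire only if both windows burn faster than `burn_rate` times the budget."""
    threshold = burn_rate * budget
    return long_window_ratio > threshold and short_window_ratio > threshold


# Ongoing incident: both the 1h and the 5m error ratios are above 0.72%.
print(should_fire(0.010, 0.009, burn_rate=14.4))   # True

# Incident already over: the 1h ratio is still high, the 5m ratio has recovered.
print(should_fire(0.010, 0.0001, burn_rate=14.4))  # False

# Brief blip: the 5m ratio spikes but the 1h ratio stays low.
print(should_fire(0.001, 0.020, burn_rate=14.4))   # False
```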

Datadog Monitors for SLO Alerting

# Datadog monitors can be defined via Terraform for infrastructure-as-code
# This example shows SLO-based monitors using Datadog's SLO resource

# terraform/slo-monitors.tf

# First, create the SLO in Datadog
resource "datadog_service_level_objective" "payment_api_availability" {
  name        = "Payment API Availability"
  description = "Availability SLO for Payment Processing API"
  type        = "metric"
  query {
    numerator   = "sum:trace.http.request.hits{service:payment-api,status:2xx}.as_count()"
    denominator = "sum:trace.http.request.hits{service:payment-api}.as_count()"
  }

  thresholds {
    timeframe = "30d"
    target    = 99.95
    warning   = 99.97
  }

  thresholds {
    timeframe = "7d"
    target    = 99.9
    warning   = 99.95
  }

  tags = ["team:pay-sre", "service:payment-api", "env:prod"]
}

# Fast burn rate monitor (pages immediately)
resource "datadog_monitor" "payment_api_fast_burn" {
  name    = "[CRIT] Payment API Fast Error Budget Burn"
  type    = "metric alert"
  message = <<-EOT
    {{#is_alert}}
    Payment API is burning error budget faster than 14.4x the sustainable rate.
    
    Current 1h error ratio: {{value}}
    Error budget: 0.05% (99.95% SLO)
    
    This will exhaust the 30-day budget in ~2 days if sustained.
    
    @pagerduty-pay-sre-critical
    @slack-alerts-pay-sre
    {{/is_alert}}
    
    Runbook: https://wiki.internal/runbooks/payment-api-fast-burn
    Dashboard: https://app.datadoghq.com/dashboard/payment-api
  EOT

  query = <<-EOT
    avg(last_1h):
      (
        sum:trace.http.request.errors{service:payment-api}.as_rate()
        /
        sum:trace.http.request.hits{service:payment-api}.as_rate()
      ) > 0.0072
  EOT

  monitor_thresholds {
    critical = 0.0072  # 14.4 × 0.0005
    warning  = 0.0036  # 7.2 × 0.0005
  }

  notify_no_data    = false
  renotify_interval = 30

  tags = ["team:pay-sre", "service:payment-api", "alert-type:fast-burn", "slo:availability"]
}

# Latency SLO monitor
resource "datadog_monitor" "payment_api_latency_p95" {
  name    = "[WARN] Payment API p95 Latency SLO Breach"
  type    = "metric alert"
  message = <<-EOT
    {{#is_alert}}
    Payment API p95 latency is {{value}}s, exceeding the 300ms SLO.
    
    SLO target: p95 <= 300ms (0.3s)
    Current p95: {{value}}s
    
    Check for:
    - Database slow queries
    - Upstream service degradation
    - Cache hit rate drops
    - Increased request volume
    
    @slack-alerts-pay-sre
    {{/is_alert}}
  EOT

  # The threshold in the query must equal monitor_thresholds.critical (0.5s).
  query = "avg(last_10m): percentile(trace.http.request.duration{service:payment-api}, 95) > 0.5"

  monitor_thresholds {
    critical = 0.5   # 500ms
    warning  = 0.3   # 300ms
  }

  notify_no_data    = false
  renotify_interval = 60

  tags = ["team:pay-sre", "service:payment-api", "slo:latency"]
}

Quarterly SLO Review Process

SLOs are not static. They must be reviewed quarterly to ensure they remain aligned with user needs, business requirements, and engineering capabilities.

Quarterly SLO Review Agenda

  1. SLO Achievement Report: Review each SLO's achievement rate over the quarter. Highlight any misses and trends.
  2. Error Budget Analysis: How much budget was consumed? What were the top contributors? Was the budget too generous or too tight?
  3. User Feedback Review: Correlate SLO misses with user complaints, support tickets, and NPS feedback.
  4. Dependency Impact: Did upstream or downstream services contribute to SLO misses?
  5. SLO Proposals: Engineering proposes SLO changes with justification. Tightening requires reliability investment; loosening requires business sign-off.
  6. SLI Refinement: Are SLIs still measuring what matters? Any new user journeys to cover?
  7. Approval: SRE lead, engineering lead, and product manager approve updated SLOs.
═══════════════════════════════════════════════════════════════════
            QUARTERLY SLO REVIEW - TEMPLATE
═══════════════════════════════════════════════════════════════════

Quarter: Q[X] 20XX
Service: [SERVICE_NAME]
Review Date: [DATE]
Attendees: [SRE Lead, Eng Lead, Product Manager]

─── SLO ACHIEVEMENT SUMMARY ──────────────────────────────────────

┌──────────────┬─────────┬──────────┬──────────┬──────────────┐
│ SLO          │ Target  │ Achieved │ Budget   │ Status       │
├──────────────┼─────────┼──────────┼──────────┼──────────────┤
│ Availability │ 99.95%  │ 99.97%   │ 35% used │ ✓ Met        │
│ Latency p95  │ 300ms   │ 245ms    │ 18% used │ ✓ Met        │
│ Latency p99  │ 1000ms  │ 890ms    │ 22% used │ ✓ Met        │
│ Error Rate   │ 0.1%    │ 0.04%    │ 40% used │ ✓ Met        │
└──────────────┴─────────┴──────────┴──────────┴──────────────┘

─── KEY EVENTS THIS QUARTER ──────────────────────────────────────

Date        Event                           Impact on SLO
─────────────────────────────────────────────────────────────────
Jan 15      Database failover test          No impact (planned)
Feb 3       CDN degradation (Akamai)        -0.02% availability
Feb 22      Release v2.4.0 latency regress  p95 spike to 450ms
Mar 8       Redis cache failure             5 min degraded perf

─── USER IMPACT CORRELATION ──────────────────────────────────────

SLO Misses vs. User Complaints:
  • Feb 3 CDN event: 12 support tickets, 2 Twitter mentions
  • Feb 22 latency regress: 8 support tickets, app store review

NPS Trend: 42 → 45 (slight improvement, latency work paid off)

─── PROPOSED CHANGES ─────────────────────────────────────────────

1. TIGHTEN availability SLO: 99.95% → 99.97%
   Rationale: Consistently achieving 99.97%+; infrastructure
              improvements (multi-AZ) justify tighter target.
   Engineering Cost: 2 sprint points for additional automation
   Business Impact: Improved merchant trust
   Decision: [ ] Approve [ ] Reject [ ] Defer

2. LOOSEN latency p99: 1000ms → 1500ms
   Rationale: p99 covers only the slowest 1% of requests; current SLO
              drives excessive optimization work for minimal user benefit.
   Engineering Cost: Frees ~3 sprint points per quarter
   Business Impact: Minimal; p95 (300ms) covers 95% of requests
   Decision: [ ] Approve [ ] Reject [ ] Defer

─── ACTION ITEMS ─────────────────────────────────────────────────

#   Action Item                    Owner      Due
─────────────────────────────────────────────────────────────────
1   Implement p99 SLO change       eng-lead   20XX-XX-XX
2   Update Datadog dashboards      sre-lead   20XX-XX-XX
3   Communicate SLA changes        pm         20XX-XX-XX
4   Update runbook thresholds      sre-lead   20XX-XX-XX

─── SIGN-OFF ─────────────────────────────────────────────────────

SRE Lead:       _______________________  Date: _______________
Engineering Lead: _____________________  Date: _______________
Product Manager: ______________________  Date: _______________

Sample SLO Document Template

This is a blank template ready for adaptation to any service. Copy this template, fill in each section, and store it in your service repository under docs/SLO.md.

═══════════════════════════════════════════════════════════════════
                    SLO DOCUMENT TEMPLATE
═══════════════════════════════════════════════════════════════════

# Service Name: [SERVICE_NAME]
# Team: [TEAM_NAME]
# SRE Owner: [SRE_OWNER]
# Last Updated: [DATE]
# Review Cycle: Quarterly
# Status: [DRAFT / ACTIVE / DEPRECATED]

─── SERVICE DESCRIPTION ──────────────────────────────────────────

[Describe what this service does, its criticality level,
and who its users are. List all user-facing endpoints.]

─── DEPENDENCIES ─────────────────────────────────────────────────

[Map all upstream and downstream dependencies with criticality
ratings. Include external dependencies (CDNs, payment processors,
cloud providers).]

Upstream:   [Service] → [Criticality: HIGH/MED/LOW]
Downstream: [Service] → [Criticality: HIGH/MED/LOW]

─── SLIs AND SLOs ────────────────────────────────────────────────

| Category | SLI Name | Measurement | SLO Target | Window |
|----------|----------|-------------|------------|--------|
| [avail/  | [name]   | [metric]    | [target]   | [30d]  |
|  latency/ |          |             |            |        |
|  errors]  |          |             |            |        |

─── ERROR BUDGET ─────────────────────────────────────────────────

Primary Error Budget:
  SLO: [TARGET]
  Budget: [1 - SLO] = [BUDGET]
  Measurement: [request-based / time-based]

─── ALERTING RULES ───────────────────────────────────────────────

[Reference alerting rule files or Datadog monitor IDs]

─── EXCLUSIONS ───────────────────────────────────────────────────

[List all scenarios that do not consume error budget]

─── ESCALATION POLICY ────────────────────────────────────────────

[Budget thresholds and corresponding actions]

─── APPROVALS ────────────────────────────────────────────────────

Tech Lead:      _______________________  Date: _______________
SRE Lead:       _______________________  Date: _______________
Product Manager: ______________________  Date: _______________

Avoid These SLO Anti-Patterns:
  • Aspirational SLOs: Setting SLOs based on what you wish the service did, not what it actually does. This creates constant paging.
  • Too many SLOs: More than 5 SLOs per service creates noise and dilutes focus. Pick the metrics that matter most to users.
  • SLIs you can't measure: Every SLI must have a concrete, automated measurement pipeline. If you can't measure it, you can't alert on it.
  • SLAs without SLOs: Never commit to an SLA without first establishing internal SLOs and proving you can meet them.
  • Static SLOs: SLOs that are never reviewed become meaningless. Schedule quarterly reviews.
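
Several of these anti-patterns can be checked mechanically. The sketch below is one possible lint pass over a service's SLO list, under stated assumptions: the `Slo` fields are hypothetical, targets are higher-is-better ratios, a target above last quarter's measured value is treated as a sign of an aspirational SLO, and a 90-day cycle stands in for the quarterly review.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    name: str
    target: float                 # higher-is-better ratio, e.g. 0.9995
    achieved_last_quarter: float  # measured value for the same SLI
    has_automated_sli: bool       # an automated measurement pipeline exists
    last_reviewed_days_ago: int


def lint_slos(slos, max_slos=5, review_cycle_days=90):
    """Return human-readable findings for the anti-patterns listed above."""
    findings = []
    if len(slos) > max_slos:
        findings.append(f"too many SLOs: {len(slos)} > {max_slos}")
    for s in slos:
        if s.target > s.achieved_last_quarter:
            findings.append(f"{s.name}: target above measured performance")
        if not s.has_automated_sli:
            findings.append(f"{s.name}: SLI has no automated measurement")
        if s.last_reviewed_days_ago > review_cycle_days:
            findings.append(f"{s.name}: review overdue")
    return findings


slos = [
    Slo("availability", 0.9995, 0.9997, True, 30),
    Slo("correctness", 0.9999, 0.9990, False, 200),
]
for finding in lint_slos(slos):
    print(finding)
```

A single missed quarter does not by itself prove a target is aspirational, so findings like these are prompts for the quarterly review rather than automatic verdicts.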

Last updated: June 2026 | Author: SRE Team