SLO / SLI / SLA
Service Level Objectives, Indicators, and Agreements form the foundation of SRE practice. This guide covers how to define, measure, and enforce reliability targets with concrete examples, burn rate calculations, alerting rules, and production-ready templates.
Definitions
SLI: Service Level Indicator
An SLI is a quantitative measure of some aspect of the service level. It answers the question: "What do we measure?" SLIs should be tied directly to user experience: if the SLI degrades, users should notice.
SLO: Service Level Objective
An SLO is a target value for an SLI over a specific time period. It answers: "What target do we aim for?" SLOs are internal goals used to make engineering decisions. They should be achievable but meaningful: SLOs that are too tight cause unnecessary toil, and SLOs that are too loose erode user trust.
SLA: Service Level Agreement
An SLA is a contract with your users (internal or external) that specifies consequences for failing to meet SLOs. It answers: "What happens if we miss?" Not every SLO needs an SLA, only those with business-negotiated penalties.
Choosing Good SLIs
Good SLIs share these characteristics: they are user-centric (measure what users care about), measurable (can be instrumented automatically), aggregatable (can be combined across instances), and actionable (degradation triggers a clear response).
Common SLI Categories
| Category | Definition | Typical Measurement | Applies To |
|---|---|---|---|
| Availability | The proportion of requests that result in a successful response | good_requests / total_requests | All request-driven services |
| Latency | The time it takes to process a request | Distribution (p50, p95, p99) | Interactive services |
| Throughput | The rate at which the service processes requests | Requests per second | Data pipelines, queues |
| Error Rate | The proportion of requests that result in an error | error_requests / total_requests | All request-driven services |
| Durability | The proportion of data that is not lost | 1 - (lost_records / total_records) | Storage systems |
| Freshness | How current the data served by the system is | max(time_since_last_update) | Caches, replicas, ETL |
| Correctness | The proportion of responses that contain correct data | correct_responses / total_responses | Financial, data services |
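As a rough illustration of how the ratio-style and distribution-style measurements in the table can be computed, here is a minimal Python sketch. The request records are made-up sample data, and treating any 5xx status as a failed request is an assumption; your own definition of a good request may differ.

```python
from statistics import quantiles

# Hypothetical request records: (HTTP status code, response time in ms).
requests = [(200, 80), (200, 120), (500, 40), (200, 310), (201, 95),
            (200, 60), (503, 30), (200, 150), (200, 90), (200, 110)]

total = len(requests)
good = sum(1 for status, _ in requests if status < 500)

availability = good / total            # good_requests / total_requests
error_rate = (total - good) / total    # error_requests / total_requests

# Latency is reported as a distribution rather than a single average.
latencies = sorted(ms for _, ms in requests)
cuts = quantiles(latencies, n=100, method="inclusive")  # 99 cut points
p50, p95 = cuts[49], cuts[94]

print(f"availability={availability:.2%} error_rate={error_rate:.2%}")
print(f"p50={p50:.1f}ms p95={p95:.1f}ms")
```

With only ten samples the percentiles are interpolated and not very meaningful; in practice they would come from histogram buckets or a metrics backend.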
Mapping SLIs to User Journeys
Start with user journeys, not system metrics. A user journey is a sequence of steps a user takes to achieve a goal. For each step, identify what could go wrong from the user's perspective.
User Journey: "Purchase an item on Samsung Pay"
Step 1: Open Samsung Pay app
├── SLI: App availability (successful opens / attempts)
└── SLI: App load time (time to interactive)
Step 2: Select payment card
├── SLI: Card list API availability
└── SLI: Card list API latency (p95)
Step 3: Authenticate (fingerprint/PIN)
├── SLI: Authentication success rate
└── SLI: Authentication latency
Step 4: Complete payment
├── SLI: Payment API availability
├── SLI: Payment API latency (p99; financial, needs tight SLO)
└── SLI: Payment processing error rate
Step 5: Receive confirmation
├── SLI: Push notification delivery rate
└── SLI: Notification latency (time to receive)
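One way to reason about a journey like the one above is to combine the per-step availability SLIs into a single end-to-end figure. The sketch below is illustrative only: the step names and availability values are hypothetical, and it assumes every step must succeed and that step failures are independent, which real systems only approximate.

```python
from math import prod

# Hypothetical measured availability for each step of the journey.
step_availability = {
    "open_app": 0.9995,
    "card_list_api": 0.9990,
    "authentication": 0.9990,
    "payment_api": 0.9995,
    "push_notification": 0.9950,
}

# If every step must succeed and failures are independent, the journey
# succeeds only when all steps do, so the availabilities multiply.
journey_availability = prod(step_availability.values())

# The step with the lowest availability is the first candidate for investment.
weakest_step = min(step_availability, key=step_availability.get)

print(f"journey availability: {journey_availability:.4%}")
print(f"weakest step: {weakest_step}")
```

Under these assumptions the journey is less available than its weakest step, which is one reason to set per-step SLOs from the journey target rather than the other way around.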
SLI Specification Template
Every SLI should be documented with a formal specification that eliminates ambiguity about what is measured, how it's calculated, and what constitutes "good."
═════════════════════════════════════════════════════════════════
SLI SPECIFICATION
═════════════════════════════════════════════════════════════════
SLI ID: PAY-API-LAT-001
Service: Samsung Pay - Payment Processing API
SLI Name: Payment Request Latency (p95)
─── DEFINITION ──────────────────────────────────────────────────
Description: The 95th percentile latency of successful payment
processing requests as measured at the API gateway.
User Impact: Users experiencing payment delays may abandon
transactions, leading to lost revenue.
─── MEASUREMENT ─────────────────────────────────────────────────
Event Source: API Gateway access logs (AWS ALB / Kong / Nginx)
Good Event: A request to POST /v1/payments that:
1. Returns HTTP 2xx status code
2. Has response_time <= 300ms
3. Is not a health check request (User-Agent != health-checker)
Valid Event: Any request to POST /v1/payments from a
non-internal source IP that is not a health check.
Exclusions:
- Health check requests
- Requests from Samsung internal monitoring tools
- Requests during scheduled maintenance windows (pre-announced)
Calculation:
SLI value = percentile(successful_valid_events.response_time, 95)
where successful_valid_events = valid events returning HTTP 2xx.
Do not take the percentile over good_events: they are <= 300ms
by definition, so their p95 could never breach the target.
─── TARGET ──────────────────────────────────────────────────────
SLO Target: p95 latency <= 300ms over 30-day rolling window
Error Budget: 5% of requests may exceed 300ms
Measurement Window: 30 days, rolling
─── DATA SOURCES ────────────────────────────────────────────────
Primary: Datadog APM trace metrics (trace.http.request.duration)
Backup: ALB access logs in S3, processed by Athena query
─── ALERTING ────────────────────────────────────────────────────
Fast Burn: p95 > 600ms for 10 minutes → PAGE
Medium Burn: p95 > 300ms for 1 hour → PAGE
Slow Burn: p95 > 300ms for 6 hours → TICKET (next business day)
─── REVIEW ──────────────────────────────────────────────────────
Last Reviewed: [DATE]
Review Cycle: Quarterly
Next Review: [DATE]
Approved By: [NAME]
─── CHANGE LOG ──────────────────────────────────────────────────
YYYY-MM-DD: Initial SLI definition (v1.0)
YYYY-MM-DD: Changed threshold from 500ms to 300ms after performance
improvements in v2.3.1 (v1.1)
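The good-event and valid-event rules in a specification like this translate fairly directly into code. The following Python sketch shows one possible classification, expressed in ratio form (good events divided by valid events). The `Request` record and its `internal` flag are hypothetical stand-ins for whatever fields your access logs provide, and maintenance-window exclusions are left out for brevity.

```python
from dataclasses import dataclass

@dataclass
class Request:
    method: str
    path: str
    status: int
    response_time_ms: float
    user_agent: str
    internal: bool  # hypothetical flag: request came from an internal source IP

LATENCY_THRESHOLD_MS = 300

def is_valid(r: Request) -> bool:
    """Valid event: POST /v1/payments, external source, not a health check."""
    return (r.method == "POST" and r.path == "/v1/payments"
            and not r.internal and r.user_agent != "health-checker")

def is_good(r: Request) -> bool:
    """Good event: a valid event that returned 2xx within the threshold."""
    return (is_valid(r) and 200 <= r.status < 300
            and r.response_time_ms <= LATENCY_THRESHOLD_MS)

def sli_ratio(requests: list[Request]) -> float:
    """Fraction of valid events that were good."""
    valid = [r for r in requests if is_valid(r)]
    if not valid:
        return 1.0  # assumption: no valid traffic counts as meeting the target
    return sum(is_good(r) for r in valid) / len(valid)

sample = [
    Request("POST", "/v1/payments", 200, 120, "app/1.0", False),
    Request("POST", "/v1/payments", 200, 450, "app/1.0", False),        # too slow
    Request("POST", "/v1/payments", 503, 90, "app/1.0", False),         # failed
    Request("POST", "/v1/payments", 200, 50, "health-checker", False),  # excluded
    Request("POST", "/v1/payments", 201, 280, "app/1.0", False),
]
ratio = sli_ratio(sample)  # 2 good events out of 4 valid events
print(f"good/valid = {ratio:.0%}")
```

Note that in this ratio form a failed request counts against the SLI even if it was fast, which is slightly stricter than a pure latency percentile.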
SLO Calculation with Burn Rate
Burn rate measures how fast you're consuming your error budget relative to the ideal consumption rate. A burn rate of 1.0 means you'll exactly exhaust your budget at the end of the compliance period. Burn rate is the key metric for SLO-based alerting.
# Burn Rate Calculation
#
# burn_rate = (actual_error_rate) / (error_budget_rate)
#
# Where:
# actual_error_rate = errors in window / total requests in window
# error_budget_rate = 1 - SLO (e.g., 0.001 for 99.9%)
# Example: 99.9% availability SLO (0.1% error budget)
error_budget_rate = 0.001 # 0.1%
# Scenario 1: Normal operation
errors_1h = 10
total_1h = 1_000_000
actual_error_rate = errors_1h / total_1h # 0.00001
burn_rate = actual_error_rate / error_budget_rate # 0.01
# Status: Healthy (burning 1% of budget rate)
# Scenario 2: Degraded service
errors_1h = 500
total_1h = 1_000_000
actual_error_rate = errors_1h / total_1h # 0.0005
burn_rate = actual_error_rate / error_budget_rate # 0.5
# Status: Concerning (burning 50% of budget rate)
# Scenario 3: Outage
errors_1h = 5000
total_1h = 1_000_000
actual_error_rate = errors_1h / total_1h # 0.005
burn_rate = actual_error_rate / error_budget_rate # 5.0
# Status: CRITICAL - will exhaust budget in 6 days (144 hours) if sustained
# Time to exhaust budget at various burn rates:
# burn_rate 1.0 -> 30 days (entire compliance period)
# burn_rate 2.0 -> 15 days
# burn_rate 6.0 -> 5 days
# burn_rate 14.4 -> ~2 days (50 hours)
# burn_rate 30.0 -> 1 day
# burn_rate 72.0 -> 10 hours
time_to_exhaust_hours = 30 * 24 / burn_rate
Complete Example: Define SLOs for a REST API
Here is a complete, production-ready SLO definition for a hypothetical e-commerce payment API. This demonstrates the full specification format used in enterprise environments.
═════════════════════════════════════════════════════════════════
SLO DOCUMENT: Payment Processing API v2
Service: payment-api.prod.internal
Owner: Pay Platform Team
SRE Owner: Pay SRE Team
Effective Date: 2024-01-01
Review Cycle: Quarterly
═════════════════════════════════════════════════════════════════
─── SERVICE OVERVIEW ────────────────────────────────────────────
The Payment Processing API handles payment authorization, capture,
and refund operations for Samsung Pay merchants. It is a
user-facing critical service: any degradation directly impacts
revenue and customer trust.
Dependencies:
- payment-db.primary (PostgreSQL 14)
- payment-db.replica (PostgreSQL 14 read replica)
- fraud-service.internal
- token-service.internal
- redis-cache.internal (session state)
- kafka.internal (event publishing)
─── SLIs AND SLOs ───────────────────────────────────────────────
| SLI | Target (30d) | Rationale |
|---|---|---|
| Availability | 99.95% (21.6 min downtime) | Payment failures cause immediate user-visible errors and revenue loss |
| Latency (p50) | ≤ 100ms | 50% of requests should feel instant |
| Latency (p95) | ≤ 300ms | 95% of requests within acceptable UX threshold |
| Latency (p99) | ≤ 1000ms | Worst-case acceptable for complex operations |
| Error Rate | ≤ 0.1% | < 1 in 1000 requests may fail |
| Throughput | ≥ 5000 req/s peak | Sustained Black Friday traffic in 2023 |
─── ERROR BUDGET ────────────────────────────────────────────────
Primary Error Budget (Availability):
SLO: 99.95%
Budget: 0.05% failed requests
30-day budget: 50,000 failed requests per 100M requests
Secondary Error Budget (Latency p95):
SLO: ≤ 300ms
Budget: 5% of requests may exceed 300ms
─── BUDGET CONSUMPTION TRACKING ─────────────────────────────────
Dashboard: https://datadog.internal/dashboard/payment-api-slo
Alerting Thresholds:
• 2% budget consumed in 1 hour → Page (fast burn)
• 5% budget consumed in 6 hours → Page (medium burn)
• 10% budget consumed in 3 days → Ticket (slow burn)
• 50% budget consumed total → Freeze non-critical releases
• 75% budget consumed total → Emergency reliability sprint
─── EXCLUSIONS ──────────────────────────────────────────────────
The following do not consume error budget:
1. Scheduled maintenance windows (approved 48h in advance)
2. Client errors (HTTP 4xx); these are not service failures
3. Requests exceeding client timeout (client-side abandonment)
4. Force majeure events (AWS region outage beyond our control)
─── ESCALATION ──────────────────────────────────────────────────
Budget consumed > 75%: Escalate to Engineering Director
Budget consumed > 90%: Escalate to VP Engineering + Product
Budget consumed 100%: Full release freeze, emergency retro
─── APPROVALS ───────────────────────────────────────────────────
Tech Lead: _______________________ Date: _______________
SRE Lead: _______________________ Date: _______________
Product Manager: _______________________ Date: _______________
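The error budget figures and consumption policy in a document like this can be derived mechanically from the SLO target. The sketch below is one hedged way to do so in Python; the function names are hypothetical, and the policy mapping simply mirrors the 50%, 75%, and 100% steps listed in the example document.

```python
def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full outage the error budget allows within the window."""
    return (1 - slo) * window_days * 24 * 60

def allowed_failures(slo: float, total_requests: int) -> float:
    """Number of failed requests the error budget allows."""
    return (1 - slo) * total_requests

def budget_action(consumed: float) -> str:
    """Map the consumed budget fraction (0.0 to 1.0+) to a policy step."""
    if consumed >= 1.0:
        return "Full release freeze, emergency retro"
    if consumed >= 0.75:
        return "Emergency reliability sprint"
    if consumed >= 0.50:
        return "Freeze non-critical releases"
    return "No action"

failures = allowed_failures(0.9995, 100_000_000)  # about 50,000 requests
downtime = allowed_downtime_minutes(0.9995)
action = budget_action(0.6)

print(f"allowed failures per 100M requests: {failures:,.0f}")
print(f"allowed full-outage minutes per 30 days: {downtime:.1f}")
print(f"action at 60% consumed: {action}")
```

Deriving these numbers from the target, rather than typing them by hand, keeps the document consistent when the SLO changes at a quarterly review.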
SLO Compliance Dashboard Specification
A well-designed SLO dashboard provides at-a-glance visibility into service health. Here is the specification for a Datadog dashboard layout.
Datadog Dashboard: "SLO Overview - Payment API"
Row 1: SERVICE HEALTH SUMMARY (3 widgets)
| Widget | Current | Change | Status | SLO |
|---|---|---|---|---|
| Availability | 99.97% | 0.02% | GREEN | 99.95% |
| Latency p95 | 245ms | 12ms | GREEN | 300ms |
| Error Rate | 0.04% | 0.01% | GREEN | 0.1% |
Row 2: ERROR BUDGET (2 widgets)
| Widget | Consumed | Remaining | Projected at current burn rate |
|---|---|---|---|
| Availability Error Budget | 35% | 65% | 18 days left |
| Latency Error Budget | 18% | 82% | 25 days left |
Row 3: BURN RATE TRENDS (1 widget)
[Timeseries: "Burn Rate (7-day rolling)". Y-axis: burn rate, 0 to 10. X-axis: days of the week across six weeks. One spike is annotated OUTAGE. Alert threshold: 14.4 (dashed line).]
Row 4: REQUEST BREAKDOWN (3 widgets)
| Widget | Visualization | Sample values |
|---|---|---|
| Requests/sec | timeseries | 2.4K avg |
| Status Code Distribution | pie chart | 200: 99.2%, 4xx: 0.6%, 5xx: 0.2% |
| Latency Distribution | heatmap | p50: 87ms, p95: 245ms, p99: 412ms |
Error Budget Burn Rate Alerting
Google's Site Reliability Workbook defines the multi-window, multi-burn-rate alerting approach. This method pages on significant events while avoiding alert fatigue from minor blips.
Alerting Principles
- Page alerts (immediate human response): Require intervention within minutes. Trigger on fast burn rates over short windows.
- Ticket alerts (next business day): Indicate trends requiring attention but not immediate action. Trigger on slow burn rates over long windows.
- Threshold-based alerts are a backup for symptoms that burn-rate alerts might miss (e.g., total outage where no requests succeed).
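The multi-window idea can be sketched as a simple condition: alert only when both a long window and a much shorter window exceed the same burn-rate threshold, so that the alert stops firing soon after the problem ends. The Python below is a minimal illustration under that assumption; the function names and sample error rates are hypothetical, and the Workbook's suggestion of a short window about 1/12 the length of the long one is a guideline, not a requirement.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_rate / (1 - slo)

def should_page(long_window_error_rate: float,
                short_window_error_rate: float,
                slo: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows are above the burn-rate threshold.

    The long window (e.g. 1h) shows the burn is significant; the short
    window (e.g. 5m) shows it is still happening right now.
    """
    return (burn_rate(long_window_error_rate, slo) >= threshold
            and burn_rate(short_window_error_rate, slo) >= threshold)

SLO = 0.9995
ongoing = should_page(0.010, 0.012, SLO)     # both windows burning fast
recovered = should_page(0.010, 0.0001, SLO)  # spike is already over

print(f"ongoing incident pages: {ongoing}")
print(f"recovered incident pages: {recovered}")
```

The second case is the one a single-window rule gets wrong: the 1-hour average stays high long after the service has recovered.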
Recommended Burn Rate Thresholds
| Burn Rate | Alert Window | Budget Consumed | Action | Use Case |
|---|---|---|---|---|
| 14.4x | 1 hour | 2% | Page immediately | Fast-burn outage detection |
| 6x | 6 hours | 5% | Page | Sustained degradation |
| 1x | 3 days | 10% | Ticket | Slow-burn trend |
| 1x | 30 days | 100% | Review in planning | Budget exhaustion warning |
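The relationship between burn rate, alert window, and budget consumed in the table follows from one formula: burn rate equals the budget fraction consumed multiplied by the compliance period and divided by the alert window. The Python sketch below works this through for a 30-day period; the function names are hypothetical, and the 99.95% SLO is taken from the payment API example.

```python
PERIOD_HOURS = 30 * 24  # 30-day compliance period

def burn_rate_for(budget_fraction: float, window_hours: float,
                  period_hours: float = PERIOD_HOURS) -> float:
    """Burn rate that consumes `budget_fraction` of the budget in the window."""
    return budget_fraction * period_hours / window_hours

def error_rate_threshold(burn_rate: float, slo: float) -> float:
    """Error rate at which the given burn rate is reached."""
    return burn_rate * (1 - slo)

fast = burn_rate_for(0.02, window_hours=1)    # 2% of budget in 1 hour
medium = burn_rate_for(0.05, window_hours=6)  # 5% of budget in 6 hours
slow = burn_rate_for(0.10, window_hours=72)   # 10% of budget in 3 days

fast_threshold = error_rate_threshold(fast, slo=0.9995)

for name, rate in [("fast", fast), ("medium", medium), ("slow", slow)]:
    print(f"{name}: burn rate {rate:.1f}x")
print(f"fast-burn error rate threshold: {fast_threshold:.4f}")
```

The last value is the raw error rate an alert expression compares against, so the same helper can generate thresholds for any SLO target.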
SLO-Based Alerting Rules
Prometheus Recording and Alerting Rules
# slo-rules.yml: Prometheus recording and alerting rules for SLO-based alerting
# Load with: add this file to rule_files in prometheus.yml and reload Prometheus.
# On Kubernetes with the Prometheus Operator, wrap these groups in a
# PrometheusRule resource before running kubectl apply.
# Validate with: promtool check rules slo-rules.yml
groups:
  - name: payment_api_slo_availability
    interval: 60s
    rules:
      # Recording rule: error rate over 5m
      - record: job:slo_errors_per_request:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{job="payment-api",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="payment-api"}[5m]))
      # Recording rule: error rate over 1h
      - record: job:slo_errors_per_request:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{job="payment-api",status=~"5.."}[1h]))
          /
          sum(rate(http_requests_total{job="payment-api"}[1h]))
      # Recording rule: error rate over 6h
      - record: job:slo_errors_per_request:ratio_rate6h
        expr: |
          sum(rate(http_requests_total{job="payment-api",status=~"5.."}[6h]))
          /
          sum(rate(http_requests_total{job="payment-api"}[6h]))
      # Recording rule: error rate over 3d
      - record: job:slo_errors_per_request:ratio_rate3d
        expr: |
          sum(rate(http_requests_total{job="payment-api",status=~"5.."}[3d]))
          /
          sum(rate(http_requests_total{job="payment-api"}[3d]))
      # Burn rate calculation
      # SLO = 99.95% -> error budget = 0.05% = 0.0005
      - record: job:slo_availability_burn_rate:ratio
        expr: |
          job:slo_errors_per_request:ratio_rate1h / 0.0005
  - name: payment_api_alerts
    rules:
      # Fast burn: 14.4x over 1 hour -> page immediately
      - alert: PaymentAPI_FastBurn
        expr: job:slo_errors_per_request:ratio_rate1h > (14.4 * 0.0005)
        for: 2m
        labels:
          severity: critical
          team: pay-sre
          slo: availability
        annotations:
          summary: "Payment API fast error budget burn"
          description: >-
            Error rate is {{ $value | humanizePercentage }} over 1h,
            above the 14.4x burn rate threshold. At 14.4x, the 30-day
            error budget is exhausted in about 2 days.
          runbook_url: "https://wiki.internal/runbooks/payment-api-fast-burn"
          dashboard: "https://grafana.internal/d/payment-api"
      # Medium burn: 6x over 6 hours -> page
      - alert: PaymentAPI_MediumBurn
        expr: job:slo_errors_per_request:ratio_rate6h > (6 * 0.0005)
        for: 5m
        labels:
          severity: warning
          team: pay-sre
          slo: availability
        annotations:
          summary: "Payment API medium error budget burn"
          description: >-
            Error rate is {{ $value | humanizePercentage }} over 6h.
            At a sustained 6x burn rate, the 30-day error budget is
            exhausted in 5 days.
          runbook_url: "https://wiki.internal/runbooks/payment-api-medium-burn"
      # Slow burn: 1x over 3 days (10% of budget) -> ticket
      - alert: PaymentAPI_SlowBurn
        expr: job:slo_errors_per_request:ratio_rate3d > (1 * 0.0005)
        for: 1h
        labels:
          severity: info
          team: pay-sre
          slo: availability
        annotations:
          summary: "Payment API slow error budget burn"
          description: >-
            Error rate is {{ $value | humanizePercentage }} over 3d.
            Investigate during business hours.
      # Latency SLO: p95 > 300ms
      - alert: PaymentAPI_LatencyP95High
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{job="payment-api"}[5m])) by (le)
          ) > 0.3
        for: 10m
        labels:
          severity: warning
          team: pay-sre
          slo: latency
        annotations:
          summary: "Payment API p95 latency exceeds 300ms"
          description: "p95 latency is {{ $value }}s. SLO threshold: 300ms."
      # Total budget exhaustion check
      - alert: PaymentAPI_ErrorBudgetExhausted
        expr: |
          (
            sum(increase(http_requests_total{job="payment-api",status=~"5.."}[30d]))
            /
            sum(increase(http_requests_total{job="payment-api"}[30d]))
          ) > 0.0005
        for: 1h
        labels:
          severity: critical
          team: pay-sre
          slo: availability
        annotations:
          summary: "Payment API 30-day error budget exhausted"
          description: "30-day error budget has been consumed. Release freeze in effect."
Datadog Monitors for SLO Alerting
# Datadog monitors can be defined via Terraform for infrastructure-as-code
# This example shows SLO-based monitors using Datadog's SLO resource
# terraform/slo-monitors.tf
# First, create the SLO in Datadog
resource "datadog_service_level_objective" "payment_api_availability" {
name = "Payment API Availability"
description = "Availability SLO for Payment Processing API"
type = "metric"
query {
numerator = "sum:trace.http.request.hits{service:payment-api,status:2xx}.as_count()"
denominator = "sum:trace.http.request.hits{service:payment-api}.as_count()"
}
threshold {
timeframe = "30d"
target = 99.95
warning = 99.97
}
threshold {
timeframe = "7d"
target = 99.9
warning = 99.95
}
tags = ["team:pay-sre", "service:payment-api", "env:prod"]
}
# Fast burn rate monitor (pages immediately)
resource "datadog_monitor" "payment_api_fast_burn" {
name = "[CRIT] Payment API Fast Error Budget Burn"
type = "metric alert"
message = <<-EOT
{{#is_alert}}
Payment API is burning error budget at more than 14.4x the sustainable rate.
Current 1h error rate: {{value}}
Error budget: 0.05% (99.95% SLO)
This will exhaust the 30-day budget in ~2 days if sustained.
@pagerduty-pay-sre-critical
@slack-alerts-pay-sre
{{/is_alert}}
Runbook: https://wiki.internal/runbooks/payment-api-fast-burn
Dashboard: https://app.datadoghq.com/dashboard/payment-api
EOT
query = <<-EOT
avg(last_1h):
(
sum:trace.http.request.errors{service:payment-api}.as_rate()
/
sum:trace.http.request.hits{service:payment-api}.as_rate()
) > 0.0072
EOT
monitor_thresholds {
critical = 0.0072 # 14.4 x 0.0005
warning = 0.0036 # 7.2 x 0.0005
}
notify_no_data = false
renotify_interval = 30
tags = ["team:pay-sre", "service:payment-api", "alert-type:fast-burn", "slo:availability"]
}
# Latency SLO monitor
resource "datadog_monitor" "payment_api_latency_p95" {
name = "[WARN] Payment API p95 Latency SLO Breach"
type = "metric alert"
message = <<-EOT
{{#is_alert}}
Payment API p95 latency is {{value}}s, exceeding the 300ms (0.3s) SLO.
SLO target: p95 <= 300ms
Current p95: {{value}}s
Check for:
- Database slow queries
- Upstream service degradation
- Cache hit rate drops
- Increased request volume
@slack-alerts-pay-sre
{{/is_alert}}
EOT
query = "avg(last_10m): percentile(trace.http.request.duration{service:payment-api}, 95) > 0.3"
monitor_thresholds {
critical = 0.3 # 300ms; must equal the threshold used in the query
}
notify_no_data = false
renotify_interval = 60
tags = ["team:pay-sre", "service:payment-api", "slo:latency"]
}
Quarterly SLO Review Process
SLOs are not static. They must be reviewed quarterly to ensure they remain aligned with user needs, business requirements, and engineering capabilities.
Quarterly SLO Review Agenda
- SLO Achievement Report: Review each SLO's achievement rate over the quarter. Highlight any misses and trends.
- Error Budget Analysis: How much budget was consumed? What were the top contributors? Was the budget too generous or too tight?
- User Feedback Review: Correlate SLO misses with user complaints, support tickets, and NPS feedback.
- Dependency Impact: Did upstream or downstream services contribute to SLO misses?
- SLO Proposals: Engineering proposes SLO changes with justification. Tightening requires reliability investment; loosening requires business sign-off.
- SLI Refinement: Are SLIs still measuring what matters? Any new user journeys to cover?
- Approval: SRE lead, engineering lead, and product manager approve updated SLOs.
═════════════════════════════════════════════════════════════════
QUARTERLY SLO REVIEW - TEMPLATE
═════════════════════════════════════════════════════════════════
Quarter: Q[X] 20XX
Service: [SERVICE_NAME]
Review Date: [DATE]
Attendees: [SRE Lead, Eng Lead, Product Manager]
─── SLO ACHIEVEMENT SUMMARY ─────────────────────────────────────
| SLO | Target | Achieved | Budget | Status |
|---|---|---|---|---|
| Availability | 99.95% | 99.97% | 35% used | ✓ Met |
| Latency p95 | 300ms | 245ms | 18% used | ✓ Met |
| Latency p99 | 1000ms | 890ms | 22% used | ✓ Met |
| Error Rate | 0.1% | 0.04% | 40% used | ✓ Met |
─── KEY EVENTS THIS QUARTER ─────────────────────────────────────
Date     Event                               Impact on SLO
─────────────────────────────────────────────────────────────────
Jan 15   Database failover test              No impact (planned)
Feb 3    CDN degradation (Akamai)            -0.02% availability
Feb 22   Release v2.4.0 latency regression   p95 spike to 450ms
Mar 8    Redis cache failure                 5 min degraded perf
─── USER IMPACT CORRELATION ─────────────────────────────────────
SLO Misses vs. User Complaints:
• Feb 3 CDN event: 12 support tickets, 2 Twitter mentions
• Feb 22 latency regression: 8 support tickets, app store review
NPS Trend: 42 → 45 (slight improvement, latency work paid off)
─── PROPOSED CHANGES ────────────────────────────────────────────
1. TIGHTEN availability SLO: 99.95% → 99.99%
Rationale: Consistently achieving 99.97%+; infrastructure
improvements (multi-AZ) justify tighter target.
Engineering Cost: 2 sprint points for additional automation
Business Impact: Improved merchant trust
Decision: [ ] Approve [ ] Reject [ ] Defer
2. LOOSEN latency p99: 1000ms → 1500ms
Rationale: p99 only affects <1% of users; current SLO drives
excessive optimization work for minimal user benefit.
Engineering Cost: Frees ~3 sprint points per quarter
Business Impact: Minimal; p95 (300ms) covers 95% of users
Decision: [ ] Approve [ ] Reject [ ] Defer
─── ACTION ITEMS ────────────────────────────────────────────────
#   Action Item                    Owner      Due
─────────────────────────────────────────────────────────────────
1   Implement p99 SLO change       eng-lead   20XX-XX-XX
2   Update Datadog dashboards      sre-lead   20XX-XX-XX
3   Communicate SLA changes        pm         20XX-XX-XX
4   Update runbook thresholds      sre-lead   20XX-XX-XX
─── SIGN-OFF ────────────────────────────────────────────────────
SRE Lead: _______________________ Date: _______________
Engineering Lead: _____________________ Date: _______________
Product Manager: ______________________ Date: _______________
Sample SLO Document Template
This is a blank template ready for adaptation to any service. Copy this template, fill in each section, and store it in your service repository under docs/SLO.md.
═════════════════════════════════════════════════════════════════
SLO DOCUMENT TEMPLATE
═════════════════════════════════════════════════════════════════
# Service Name: [SERVICE_NAME]
# Team: [TEAM_NAME]
# SRE Owner: [SRE_OWNER]
# Last Updated: [DATE]
# Review Cycle: Quarterly
# Status: [DRAFT / ACTIVE / DEPRECATED]
─── SERVICE DESCRIPTION ─────────────────────────────────────────
[Describe what this service does, its criticality level,
and who its users are. List all user-facing endpoints.]
─── DEPENDENCIES ────────────────────────────────────────────────
[Map all upstream and downstream dependencies with criticality
ratings. Include external dependencies (CDNs, payment processors,
cloud providers).]
Upstream: [Service] - [Criticality: HIGH/MED/LOW]
Downstream: [Service] - [Criticality: HIGH/MED/LOW]
─── SLIs AND SLOs ───────────────────────────────────────────────
| Category | SLI Name | Measurement | SLO Target | Window |
|----------|----------|-------------|------------|--------|
| [avail/latency/errors] | [name] | [metric] | [target] | [30d] |
─── ERROR BUDGET ────────────────────────────────────────────────
Primary Error Budget:
SLO: [TARGET]
Budget: [1 - SLO] = [BUDGET]
Measurement: [request-based / time-based]
─── ALERTING RULES ──────────────────────────────────────────────
[Reference alerting rule files or Datadog monitor IDs]
─── EXCLUSIONS ──────────────────────────────────────────────────
[List all scenarios that do not consume error budget]
─── ESCALATION POLICY ───────────────────────────────────────────
[Budget thresholds and corresponding actions]
─── APPROVALS ───────────────────────────────────────────────────
Tech Lead: _______________________ Date: _______________
SRE Lead: _______________________ Date: _______________
Product Manager: ______________________ Date: _______________
Common Pitfalls
- Aspirational SLOs: Setting SLOs based on what you wish the service did, not what it actually does. This creates constant paging.
- Too many SLOs: Having more than 5 SLOs per service creates noise and dilutes focus. Pick the metrics that matter most to users.
- SLIs you can't measure: Every SLI must have a concrete, automated measurement pipeline. If you can't measure it, you can't alert on it.
- SLAs without SLOs: Never commit to an SLA without first establishing internal SLOs and proving you can meet them.
- Static SLOs: SLOs that are never reviewed become meaningless. Schedule quarterly reviews.
Related Topics
- SRE Principles: Core SRE fundamentals and error budget concepts
- Monitoring & Observability: Instrumentation for SLI measurement
- Incident Management: Responding to SLO breaches
- Google SRE Workbook: Alerting on SLOs (Chapter 5)
Last updated: June 2026 | Author: SRE Team