SRE Principles

Site Reliability Engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. SRE teams use software as a tool to manage systems, solve problems, and automate operations tasks. SRE is what happens when you ask a software engineer to design an operations function.

What is SRE?

SRE was pioneered at Google starting in 2003 when Ben Treynor Sloss was tasked with running a production engineering team. The fundamental insight of SRE is that running large-scale distributed systems requires both software engineering and operations expertise — and that software engineering approaches can dramatically improve the reliability and efficiency of operations work.

The SRE model treats operations as a software problem. Instead of manually configuring servers, restarting services, and responding to pages ad hoc, SREs write software to automate these tasks, design systems that fail less often, and build tooling that makes the entire organization more effective.

Core Thesis: SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labor.

SRE vs. Traditional Operations vs. DevOps

Understanding how SRE differs from traditional operations and relates to DevOps is essential for organizations adopting SRE practices. Each approach has distinct characteristics, though they share common goals.

Dimension	Traditional Operations	DevOps	SRE
Primary Focus	Keeping systems running through manual intervention	Breaking down silos between dev and ops; culture of collaboration	Applying software engineering to operations; reliability through engineering
Reliability Target	"Five nines" without explicit measurement	Defined by team consensus	Explicit SLOs with error budgets
Toil Approach	Accepted as job responsibility	Reduced through automation culture	Explicitly measured; capped at 50% of time
Deployment	Scheduled maintenance windows, change advisory boards	CI/CD pipelines, frequent deployments	Progressive rollouts, canary deployments, automated rollbacks
Incident Response	Ad hoc heroics	Shared on-call rotation	Structured incident command, blameless post-mortems
Skills Profile	System administration, scripting	Full-stack, infrastructure as code	Software engineering (50%+ coding), systems design
Origin	ITIL, enterprise IT	2009 movement (Patrick Debois, John Willis)	Google (Ben Treynor Sloss, 2003)

Key Insight: SRE implements DevOps principles through concrete practices. DevOps is primarily a cultural movement, while SRE is prescriptive about method: error budgets, SLOs, toil budgets, and defined team structures. In Ben Treynor Sloss's words, SRE can be viewed as "a specific implementation of DevOps with some idiosyncratic extensions."

Core SRE Principles

These five principles form the foundation of SRE practice. Every SRE team should internalize and apply these principles in their daily work.

1. Embrace Risk

Reliability is not an absolute goal. 100% reliability is impossible and often undesirable — the cost of achieving perfect reliability exceeds the business value. SRE quantifies acceptable risk through error budgets and makes explicit tradeoffs between reliability and velocity.

A service with 99.99% availability (four nines) can be down only 52.6 minutes per year. A service with 99.999% (five nines) can be down only 5.26 minutes per year. The engineering effort and cost to bridge that gap is typically enormous.
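
The downtime figures above follow directly from the availability target. A minimal sketch of the arithmetic, assuming a 365.25-day year; the helper name `allowed_downtime_minutes` is illustrative and not taken from any library:

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes, averaging in leap years


def allowed_downtime_minutes(availability: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime an availability target permits over a period."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * period_minutes


if __name__ == "__main__":
    # Each extra nine divides the allowed downtime by ten.
    for target in (0.99, 0.999, 0.9999, 0.99999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target:.3%} -> {minutes:.2f} min/year")
```

Each additional nine cuts the allowance by a factor of ten, which is why the step from four to five nines is rarely cheap.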

2. SLO-Driven Reliability

Service Level Objectives (SLOs) are the primary tool for aligning technical reliability with business needs. SRE teams define measurable SLOs, track against them, and use them to make release and prioritization decisions. Every engineering decision should trace back to SLOs.

3. Eliminate Toil

Toil is manual, repetitive, automatable work that scales linearly with service growth. SRE teams explicitly measure toil and aim to eliminate it through automation. Google's stated goal is to keep toil below 50% of each SRE's time, leaving at least 50% for project work that improves the service through engineering.

4. Monitor Everything

Monitoring is the foundation of observability. SREs instrument services to emit metrics, logs, and traces; define alerts on SLO-based, user-visible symptoms rather than on arbitrary internal thresholds; and build dashboards that provide meaningful insight. Alerts should be actionable, novel, and require human judgment.

5. Automate Everything

Automation is the primary lever for eliminating toil and improving reliability. SREs automate provisioning, deployment, configuration, failure recovery, and incident response. The ideal is to have systems that self-heal without human intervention.

Service Hierarchy: Service → SLI → SLO → SLA

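The reliability hierarchy provides a structured way to reason about service reliability. Each level builds on the one before it: SLIs measure the service, SLOs set targets for SLIs, and SLAs attach consequences to missing those targets.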

┌─────────────────────────────────────────────────────────────────┐
│  Service: "User Authentication Service"                         │
│  Description: Handles user login, token issuance, validation    │
├─────────────────────────────────────────────────────────────────┤
│  SLI (Service Level Indicator) — WHAT to measure                │
│  • Request latency (p50, p95, p99)                              │
│  • Error rate (5xx responses / total responses)                 │
│  • Availability (successful requests / total requests)          │
│  • Throughput (requests per second)                             │
├─────────────────────────────────────────────────────────────────┤
│  SLO (Service Level Objective) — TARGET for each SLI            │
│  • Latency p95 < 200ms                                          │
│  • Error rate < 0.1% (99.9% success rate)                       │
│  • Availability ≥ 99.9% (43.8 min downtime/month)               │
│  • Throughput > 10,000 req/s at peak                            │
├─────────────────────────────────────────────────────────────────┤
│  SLA (Service Level Agreement) — CONTRACT with consequences     │
│  • 99.9% availability: < 1% monthly credit for breach           │
│  • 99.5% availability: 5% monthly credit                        │
│  • 99.0% availability: 10% monthly credit + escalation          │
└─────────────────────────────────────────────────────────────────┘

Concept	Definition	Example	Audience
Service	A system or component that provides functionality to users	Payment API	Engineers, Product
SLI	A quantitative measure of some aspect of service level	Request latency at 95th percentile	Engineers, SRE
SLO	A target value for an SLI over a measurement period	Latency p95 < 300ms over 30 days	Engineers, SRE, Leadership
SLA	A contract with business consequences for missing SLOs	99.9% availability or 10% service credit	Customers, Legal, Leadership
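
To make the SLI/SLO distinction concrete, here is a minimal sketch that computes two SLIs from raw request records and compares them with SLO targets. The `Request` record, the nearest-rank percentile method, and the default targets (99.9% availability, p95 under 200 ms, taken from the authentication example) are assumptions for illustration; a real system would normally read these values from its metrics backend.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code returned to the client
    latency_ms: float  # server-side latency in milliseconds


def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests that did not fail with a 5xx status."""
    if not requests:
        raise ValueError("need at least one request")
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_percentile(requests: list[Request], pct: float) -> float:
    """Nearest-rank percentile of latency, in milliseconds."""
    if not requests:
        raise ValueError("need at least one request")
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_slo(requests: list[Request],
              availability_target: float = 0.999,
              p95_target_ms: float = 200.0) -> dict[str, bool]:
    """Compare each measured SLI with its SLO target."""
    return {
        "availability": availability_sli(requests) >= availability_target,
        "latency_p95": latency_percentile(requests, 95) < p95_target_ms,
    }
```

The SLI functions only measure; the targets live in `meets_slo`, so an SLO can be tightened later without touching the measurement code.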

Error Budgets: Concept and Calculation

An error budget is the acceptable amount of unreliability within a compliance period. It is the complement of the SLO (1 minus the SLO): if your availability SLO is 99.9%, your error budget is 0.1%. Error budgets provide a concrete, quantitative mechanism for balancing reliability against feature velocity.

Error Budget Calculation

# Error Budget Calculation Formula
# 
# Error Budget = (1 - SLO) × Total Events in Compliance Period
#
# Example: 99.9% availability SLO over 30 days

availability_slo = 0.999    # 99.9%
error_budget_ratio = 1 - availability_slo  # 0.001 or 0.1%

# For request-based SLO
total_requests_30d = 100_000_000  # 100 million requests
error_budget_requests = total_requests_30d * error_budget_ratio
# = 100,000 acceptable failed requests per 30 days

# For time-based SLO
minutes_in_30d = 30 * 24 * 60  # 43,200 minutes
downtime_budget_minutes = minutes_in_30d * error_budget_ratio
# = 43.2 minutes of acceptable downtime per 30 days

# Daily burn rate analysis
# Burn rate 1.0 = consuming budget at exactly the rate to exhaust in 30 days
# Burn rate 2.0 = will exhaust budget in 15 days
# Burn rate 30.0 = will exhaust budget in 1 day

daily_budget = downtime_budget_minutes / 30  # ~1.44 minutes/day
burn_rate = 1.0  # normal consumption

# Alert thresholds (multi-window approach from Google's SRE Workbook):
# Fast burn (high burn rate, short window):
#   - Burn rate > 14.4 over 1 hour: page immediately
#   - Burn rate > 6 over 6 hours: page
# Slow burn (lower burn rate, longer window):
#   - Burn rate > 1 over 3 days: ticket (non-urgent)
#   - Burn rate > 1 over 30 days: budget exhausted; review in next planning
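
The fast-burn thresholds are not arbitrary: each is the burn rate that would spend a chosen fraction of the budget within the alert window. A minimal sketch of that relationship follows. The budget fractions used in the usage lines (2% in 1 hour, 5% in 6 hours) are the values I recall from the SRE Workbook and should be treated as assumptions; the function names are illustrative.

```python
def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """Budget spend speed relative to the pace that lasts the whole period."""
    budget_ratio = 1.0 - slo
    if budget_ratio <= 0:
        raise ValueError("a 100% SLO leaves no error budget")
    return observed_error_ratio / budget_ratio


def days_to_exhaustion(rate: float, period_days: float = 30.0) -> float:
    """Days until a full budget is gone if the burn rate stays constant."""
    return float("inf") if rate <= 0 else period_days / rate


def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        period_days: float = 30.0) -> float:
    """Burn rate that spends `budget_fraction` of the budget in `window_hours`."""
    return budget_fraction * (period_days * 24.0) / window_hours


if __name__ == "__main__":
    print(burn_rate_threshold(0.02, 1))        # 2% of budget in 1 hour
    print(burn_rate_threshold(0.05, 6))        # 5% of budget in 6 hours
    print(days_to_exhaustion(burn_rate(0.002, 0.999)))
```

Read this way, a threshold is a statement about how much budget the team is willing to lose before a human is notified.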

Error Budget Policy Template

This policy governs how error budgets are consumed and what actions are taken at different consumption rates. Adapt thresholds to your organization's reliability requirements.

═══════════════════════════════════════════════════════════════════
                    ERROR BUDGET POLICY
                    Template v1.0
═══════════════════════════════════════════════════════════════════

SERVICE: _______________________
SLO: _______________________
COMPLIANCE PERIOD: 30 days / 90 days
EFFECTIVE DATE: _______________________
APPROVED BY: _______________________

─── BUDGET CONSUMPTION THRESHOLDS ──────────────────────────────

┌─────────────┬────────────────┬────────────────────────────────┐
│ Consumption │ Status         │ Required Action                │
├─────────────┼────────────────┼────────────────────────────────┤
│ < 50%       │ Healthy        │ Normal operations, continue    │
│             │                │ feature development pace       │
├─────────────┼────────────────┼────────────────────────────────┤
│ 50-75%      │ Caution        │ SRE review at next standup;    │
│             │                │ identify top error sources     │
├─────────────┼────────────────┼────────────────────────────────┤
│ 75-100%     │ At Risk        │ Freeze non-critical releases;  │
│             │                │ dedicate 50% sprint capacity   │
│             │                │ to reliability improvements    │
├─────────────┼────────────────┼────────────────────────────────┤
│ 100%+       │ Exhausted      │ Complete release freeze;       │
│             │                │ all engineering to reliability │
│             │                │ work until budget recovers     │
└─────────────┴────────────────┴────────────────────────────────┘

─── RELEASE GATES ──────────────────────────────────────────────

Remaining > 50%:    Normal release process
Remaining 25-50%:   Releases require SRE approval
Remaining 10-25%:   Critical bugfixes only; requires director approval
Remaining < 10%:    Emergency releases only; full release freeze

─── ESCALATION ─────────────────────────────────────────────────

If the budget is fully consumed before 50% of the compliance period has elapsed:
  → Immediate escalation to Engineering Director
  → Emergency reliability sprint initiated
  → Weekly executive review until recovery

─── RECOVERY ───────────────────────────────────────────────────

Budget replenishes in full at the start of each compliance period.
Recovery within a period is not automatic: with a rolling window,
budget is "earned back" only as older errors age out, which requires
sustaining an error rate below the rate the SLO allows.

─── EXCEPTIONS ─────────────────────────────────────────────────

Planned maintenance windows (approved 48h in advance) do not
consume error budget. Customer-caused issues (malformed requests,
rate limit violations) are excluded from budget calculations.

Signed: _______________________ Date: _______________________
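
A policy like this is easier to enforce when a release pipeline can evaluate it mechanically. The following is a minimal sketch of the template's thresholds as a function. The names `PolicyDecision` and `evaluate_policy` are hypothetical, and the handling of exact boundary values (for example, whether exactly 50% remaining counts as "normal") is my assumption, since the template does not specify it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyDecision:
    status: str
    release_gate: str
    escalate: bool


def evaluate_policy(consumed: float, period_elapsed: float) -> PolicyDecision:
    """Map budget consumption to the template's status and release gate.

    consumed:       fraction of the error budget spent (1.0 means all of it)
    period_elapsed: fraction of the compliance period that has passed
    """
    if consumed < 0.50:
        status = "Healthy"
    elif consumed < 0.75:
        status = "Caution"
    elif consumed < 1.0:
        status = "At Risk"
    else:
        status = "Exhausted"

    remaining = 1.0 - consumed
    if remaining > 0.50:
        gate = "Normal release process"
    elif remaining >= 0.25:
        gate = "Releases require SRE approval"
    elif remaining >= 0.10:
        gate = "Critical bugfixes only; requires director approval"
    else:
        gate = "Emergency releases only; full release freeze"

    # Escalate when the whole budget is gone in the first half of the period.
    escalate = consumed >= 1.0 and period_elapsed < 0.50
    return PolicyDecision(status, gate, escalate)
```

A deploy job could call `evaluate_policy` with values from the error budget report and refuse to proceed unless the returned gate permits the release type.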

Toil Identification and Automation

Google's SRE book defines toil as the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as the service grows.

Identifying Toil

Track all activities in your weekly work log and categorize them. Common sources of toil include:

Toil Category	Examples	Automation Approach
Ticket-driven interrupts	Password resets, access requests, quota increases	Self-service portals, automated provisioning
Production changes	Configuration updates, flag flips, certificate rotations	GitOps, automated rollout pipelines
Pager responses	Restart stuck services, clear disk space, kill runaway jobs	Self-healing automation, runbook automation
Data recovery	Restore from backup, replay failed transactions	Automated backup verification, replay systems
Capacity management	Add servers, scale resources, rebalance load	Horizontal pod autoscaling, cluster autoscaler
Incident tasks	Run diagnostic commands, collect logs, generate reports	Runbook automation, chatops bots
Onboarding/offboarding	Create accounts, provision access, configure environments	IAM automation, infrastructure as code

The 50% Engineering Time Rule

Google's SRE practice caps toil at 50% of an SRE's time. The remaining 50% must be spent on engineering project work that improves the service — automation, reliability improvements, tooling, and feature work that reduces future toil.

┌────────────────────────────────────────────────────────────────┐
│                    SRE TIME ALLOCATION                         │
│                                                                │
│   ┌──────────────────────────────────────────────────┐         │
│   │ ████████████████████████████████████████████████ │ 50%     │
│   │ PROJECT WORK (Engineering)                       │         │
│   │ • Automation development                         │         │
│   │ • Reliability improvements                       │         │
│   │ • Tooling and platform work                      │         │
│   │ • Capacity planning and architecture             │         │
│   └──────────────────────────────────────────────────┘         │
│                                                                │
│   ┌──────────────────────────────────────────────────┐         │
│   │ ████████████████████████████████████████████████ │ 50%     │
│   │ TOIL (Operational Work)                          │         │
│   │ • On-call rotations and incident response        │         │
│   │ • Tickets and interrupts                         │         │
│   │ • Manual deployments and changes                 │         │
│   │ • Routine maintenance                            │         │
│   └──────────────────────────────────────────────────┘         │
│                                                                │
│   TOIL CAP: 50% maximum                                        │
│   ACTION: If toil exceeds 50% for 2 consecutive quarters,      │
│   the team must halt new features and prioritize automation.   │
└────────────────────────────────────────────────────────────────┘
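
The action rule in the diagram (toil above the cap for two consecutive quarters) is simple to check from quarterly toil audits. A minimal sketch, assuming toil is recorded as one fraction per quarter in chronological order; the function name is illustrative:

```python
TOIL_CAP = 0.50  # maximum share of time spent on toil


def must_prioritize_automation(quarterly_toil: list[float],
                               cap: float = TOIL_CAP,
                               streak: int = 2) -> bool:
    """True when the most recent `streak` quarters all exceed the toil cap."""
    if len(quarterly_toil) < streak:
        return False
    return all(fraction > cap for fraction in quarterly_toil[-streak:])


if __name__ == "__main__":
    history = [0.45, 0.55, 0.62]  # hypothetical audit results, oldest first
    print(must_prioritize_automation(history))
```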

Toil Calculation Formula

# Toil Percentage Calculation (quarterly)
# Track for each team member and aggregate

# Example: SRE Engineer quarterly time breakdown
on_call_hours = 168           # 7 days × 24 h primary on-call per quarter
ticket_hours = 120            # ~2 hours/day on tickets (65 working days)
manual_deployments = 24       # Routine deployment work
maintenance_windows = 16      # Planned maintenance
pager_incidents = 32          # Off-hours incident response

total_toil_hours = (on_call_hours + ticket_hours + manual_deployments
                    + maintenance_windows + pager_incidents)  # = 360 hours

total_work_hours = 520        # ~13 weeks × 40 hours

toil_percentage = (total_toil_hours / total_work_hours) * 100
# = 69.2% → EXCEEDS 50% CAP → Automation required

# Target reduction:
target_toil = total_work_hours * 0.50  # 260 hours max
reduction_needed = total_toil_hours - target_toil  # 100 hours

# Prioritize by ROI: automate highest-frequency, highest-time tasks first
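
One way to act on the ROI guidance is to rank toil sources by how quickly automation pays for itself. The sketch below is one possible heuristic, not a prescribed method. It assumes automation removes a task's hours entirely, and the automation cost estimates in the usage lines are hypothetical numbers; only the per-quarter hours come from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToilTask:
    name: str
    hours_per_quarter: float      # time currently spent by hand
    automation_cost_hours: float  # one-off engineering estimate


def payback_quarters(task: ToilTask) -> float:
    """Quarters until the automation effort pays for itself."""
    if task.hours_per_quarter <= 0:
        return float("inf")
    return task.automation_cost_hours / task.hours_per_quarter


def rank_by_roi(tasks: list[ToilTask]) -> list[ToilTask]:
    """Shortest payback first; ties go to the task that frees more hours."""
    return sorted(tasks, key=lambda t: (payback_quarters(t), -t.hours_per_quarter))


def plan_reduction(tasks: list[ToilTask], reduction_needed: float) -> list[ToilTask]:
    """Pick tasks in ROI order until the reduction target is covered."""
    chosen: list[ToilTask] = []
    saved = 0.0
    for task in rank_by_roi(tasks):
        if saved >= reduction_needed:
            break
        chosen.append(task)
        saved += task.hours_per_quarter
    return chosen


if __name__ == "__main__":
    tasks = [
        ToilTask("Manual deployments", 24, 40),   # cost estimates are hypothetical
        ToilTask("Ticket handling", 120, 80),
        ToilTask("Maintenance windows", 16, 60),
    ]
    for task in plan_reduction(tasks, reduction_needed=100):
        print(task.name, round(payback_quarters(task), 2))
```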

SRE Team Topologies

There is no single correct way to structure an SRE team. The right topology depends on organizational size, maturity, and culture. Most organizations evolve through these models over time.

Centralized / Platform SRE

A dedicated SRE team provides platform-level services (monitoring, CI/CD, incident management) to multiple product teams. Product teams remain responsible for their own on-call.

Best for: Organizations starting their SRE journey, shared infrastructure needs.

Pros: Economies of scale, consistent tooling, clear career path for SREs.

Cons: Risk of becoming a silo, may lack domain expertise for specific services.

Embedded / Product SRE

SREs are embedded within product teams, sharing on-call rotations and owning reliability for specific services. SREs report into the SRE organization but work day-to-day with product engineering.

Best for: Mature organizations with critical services requiring dedicated reliability focus.

Pros: Deep domain expertise, strong alignment with product goals, shared ownership.

Cons: Harder to maintain SRE culture, risk of SREs being pulled into feature work, duplicated tooling.

Consulting / Guild SRE

A small SRE team acts as consultants — defining standards, providing tooling, and helping teams establish SRE practices without taking on-call for those services.

Best for: Large organizations with many teams at varying maturity levels.

Pros: Scales expertise across many teams, empowers development teams.

Cons: Limited direct impact, teams may not adopt practices, no shared on-call burden.

Hybrid Model

A central SRE team handles platform and shared services, while embedded SREs (or "SRE ambassadors") sit with critical product teams. The central team maintains standards and tooling; embedded SREs handle service-specific reliability.

Best for: Large, diverse organizations (e.g., Samsung with Knox, Pay, SmartThings).

Pros: Combines platform efficiency with domain expertise; flexible scaling.

Cons: Complex to manage; requires strong coordination; potential for unclear boundaries.

┌─────────────────────────────────────────────────────────────────┐
│              HYBRID SRE ORGANIZATION (Example)                  │
│                                                                 │
│   ┌─────────────────────────────────────────────────────┐       │
│   │              SRE LEADERSHIP / STANDARDS             │       │
│   │       (Error budget policies, SLO frameworks,       │       │
│   │        incident management standards)               │       │
│   └────────────────────────────┬────────────────────────┘       │
│                                │                                │
│            ┌───────────────────┼───────────────────┐            │
│            ▼                   ▼                   ▼            │
│   ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐   │
│   │ PLATFORM        │ │ PRODUCT         │ │ TOOLS &         │   │
│   │ SRE TEAM        │ │ SREs            │ │ AUTOMATION      │   │
│   │                 │ │ (Embedded)      │ │ TEAM            │   │
│   │ • K8s infra     │ │                 │ │                 │   │
│   │ • CI/CD         │ │ • Knox SRE      │ │ • Monitoring    │   │
│   │ • Networking    │ │ • Pay SRE       │ │ • Alerting      │   │
│   │ • Security      │ │ • SmartThings   │ │ • Dashboards    │   │
│   │ • Observability │ │   SRE           │ │ • Runbooks      │   │
│   └─────────────────┘ └─────────────────┘ └─────────────────┘   │
│                                                                 │
│   Each product team has 1-2 embedded SREs who participate in    │
│   their on-call rotation and own service-specific reliability.  │
│   Platform SRE maintains shared infrastructure.                 │
└─────────────────────────────────────────────────────────────────┘

Key SRE Practices Overview

Practice	Description	Key Tool/Artifact	Frequency
SLO Review	Evaluate SLO achievement and error budget consumption	Error budget report	Weekly
Post-Mortem	Blameless analysis of significant incidents	Post-mortem document	Per SEV1-2 incident
Production Readiness Review	Verify service meets operational standards before launch	PRR checklist	Per service launch
Capacity Planning	Forecast resource needs based on growth projections	Capacity plan document	Quarterly
Disaster Recovery Drill	Test failover procedures and backup restoration	DR runbook	Quarterly
Game Day / Chaos	Inject failures to validate resilience	Chaos experiment plan	Monthly
Toil Audit	Measure and categorize operational toil	Toil tracking spreadsheet	Quarterly
On-Call Retro	Review on-call week for trends and improvements	On-call retro notes	Weekly
Dependency Mapping	Maintain map of service dependencies and critical paths	Dependency graph	Monthly
Runbook Review	Update operational runbooks based on incidents	Runbook repository	Monthly

Getting Started with SRE

For organizations beginning their SRE journey, here is a practical 90-day roadmap:

Phase 1: Foundation (Days 1-30)

Identify your critical user-facing services
Define SLIs for each service (start with availability and latency)
Set initial SLOs based on current performance (not aspirational targets)
Implement basic monitoring and alerting for those SLOs
Establish an on-call rotation with escalation paths

Phase 2: Process (Days 31-60)

Calculate error budgets and establish the error budget policy
Write post-mortems for all significant incidents using the blameless template
Catalog toil sources and pick one high-impact item to automate
Create runbooks for the most common alert types
Schedule your first game day

Phase 3: Culture (Days 61-90)

Conduct first SLO review meeting with leadership
Establish the toil budget (target 50% max)
Begin production readiness reviews for new services
Create dependency maps for critical services
Document and socialize SRE principles across engineering

Common Pitfall: Do not start with aspirational SLOs. Set your initial SLOs to reflect your service's current performance. If you routinely achieve 99.5% availability, set that as your SLO. Raising SLOs without engineering investment just creates constant paging and erodes trust. Improve the service first, then tighten the SLO.
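
A starting SLO can be derived from measured history rather than chosen by feel. The sketch below is a minimal illustration, assuming you can export daily counts of good and total requests. The function name and the choice to round down to a multiple of 0.05% are assumptions; integer basis points (hundredths of a percent) are used to avoid floating-point rounding surprises.

```python
def initial_slo_percent(daily_counts: list[tuple[int, int]],
                        step_bp: int = 5) -> float:
    """Suggest a starting availability SLO at or just below measured performance.

    daily_counts: (good_requests, total_requests) for each day of history
    step_bp:      rounding step in basis points (5 bp = 0.05%)
    """
    good = sum(g for g, _ in daily_counts)
    total = sum(t for _, t in daily_counts)
    if total <= 0:
        raise ValueError("history contains no requests")
    measured_bp = good * 10_000 // total  # floor of availability in basis points
    # Round down so the service already meets the suggested target.
    return (measured_bp // step_bp) * step_bp / 100


if __name__ == "__main__":
    history = [(9_960, 10_000), (9_940, 10_000), (9_971, 10_000)]  # sample data
    print(f"Suggested initial SLO: {initial_slo_percent(history)}%")
```

Rounding down rather than to the nearest step keeps the first SLO achievable from day one, which matches the advice to improve the service before tightening the target.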

References and Further Reading

Site Reliability Engineering: How Google Runs Production Systems (O'Reilly, 2016) — Free online
The Site Reliability Workbook (O'Reilly, 2018) — Free online
Building Secure & Reliable Systems (O'Reilly, 2020) — Free online
SLO / SLI / SLA — Detailed guide on defining reliability targets
Incident Management — Complete incident response framework
Monitoring & Observability — Instrumentation and alerting practices

Last updated: June 2026 | Author: SRE Team