
DevOps Fundamentals

DevOps is a cultural and professional movement that emphasizes collaboration between development and operations teams, infrastructure automation, and rapid, reliable software delivery. Born from the Agile movement and refined through years of large-scale production experience, it represents a fundamental shift in how organizations build, deploy, and maintain software systems.

What is DevOps

DevOps is not a single tool, a role, or a team. It is a set of practices, cultural philosophies, and technical patterns that aim to shorten the systems development lifecycle while delivering features, fixes, and updates frequently and in close alignment with business objectives. The term was coined by Patrick Debois around 2009 and has since evolved into a mature discipline practiced by leading technology organizations worldwide.

At Samsung, where I led infrastructure for Knox, Pay, and SmartThings, DevOps was the difference between monthly release trains and multiple daily deployments to production. The transformation did not happen overnight; it required deliberate investment in automation, cultural change, and tooling.

The CALMS Framework

The CALMS framework provides a structured way to evaluate DevOps adoption. It began as CAMS, coined by Damon Edwards and John Willis; Jez Humble later added the L for Lean:

| Letter | Principle | Description | Key Practices |
|--------|-----------|-------------|---------------|
| C | Culture | Shared ownership, blameless postmortems, cross-functional teams | Breaking down silos, shared on-call rotations, blameless culture |
| A | Automation | Automating repetitive tasks to reduce human error and increase speed | CI/CD pipelines, IaC, automated testing, self-service platforms |
| L | Lean | Minimizing waste, optimizing flow, delivering value continuously | Value stream mapping, small batch sizes, eliminating bottlenecks |
| M | Measurement | Data-driven decisions through comprehensive observability | DORA metrics, SLIs/SLOs/SLAs, monitoring, distributed tracing |
| S | Sharing | Open knowledge transfer, shared tooling, community building | Internal wikis, demo days, open source contributions, chat ops |

Tip: Culture is the hardest and most important component of CALMS. You can buy tools, but you cannot buy culture. At Samsung, the turning point was when developers began participating in production on-call rotations alongside SREs, creating genuine shared ownership.

DevOps vs. SRE: Complementary Disciplines

Site Reliability Engineering (SRE), pioneered at Google, is often discussed alongside DevOps. The two are not competing approaches but complementary disciplines with significant overlap:

| Aspect | DevOps | SRE |
|--------|--------|-----|
| Origin | Movement focused on cultural transformation | Engineering discipline with concrete practices |
| Primary Goal | Bridge dev and ops through collaboration | Apply software engineering to operations problems |
| Error Budget | Emphasizes speed and stability balance | Formalizes error budgets as engineering contracts |
| Implementation | Broader cultural and process changes | Specific engineering practices and tooling |
| Metrics | DORA metrics (deployment frequency, lead time) | SLIs, SLOs, SLAs, error budgets, toil reduction |
| Role Definition | Can be a role, team, or cultural practice | Clearly defined engineering role with coding requirements |

Ben Treynor Sloss, VP of Engineering at Google, described SRE as "what happens when you ask a software engineer to design an operations team." In practice, mature organizations adopt both: DevOps as the cultural foundation and SRE as the engineering implementation.
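To make the error budget row of the comparison concrete, here is a minimal Python sketch of how a request-based error budget might be computed. The class name, field names, and the "pause releases when the budget is spent" policy are illustrative assumptions, not a description of any particular team's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Request-based error budget for one SLO window (hypothetical model)."""

    slo_target: float      # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int    # requests served during the window
    failed_requests: int   # requests that violated the SLI

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means an untouched budget; 0.0 or below means it is spent.
        if self.allowed_failures == 0:
            return 0.0  # a 100% target or zero traffic leaves nothing to spend
        return 1.0 - self.failed_requests / self.allowed_failures

    def can_release(self) -> bool:
        # Assumed policy: risky releases pause once the budget is exhausted.
        return self.remaining_fraction > 0.0


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"remaining budget: {budget.remaining_fraction:.0%}")
    print(f"release allowed: {budget.can_release()}")
```

The point of the contract is that the same number drives both sides: developers may spend the remaining budget on releases, and operations can point to an exhausted budget as an agreed reason to slow down.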

DevOps Lifecycle Phases

The DevOps lifecycle is often visualized as an infinite loop with eight phases. Each phase feeds into the next, creating a continuous improvement cycle:

1. Plan

Requirements gathering, sprint planning, task tracking. Tools include Jira, Azure DevOps Boards, Linear, and Confluence. Infrastructure planning happens here too, using architecture decision records (ADRs) to document choices.

2. Code

Application development with version control, code review, and branch protection. Git is the standard VCS. Feature branching with pull requests, trunk-based development, and GitFlow are common branching strategies. Code quality gates include linting, static analysis, and peer review requirements.

3. Build

Compilation, packaging, and artifact creation. CI servers (GitHub Actions, Jenkins, CircleCI) trigger builds on every commit. Docker images are built, versioned with semantic tags, and pushed to registries (ECR, GCR, ACR, Docker Hub).

4. Test

Automated testing at multiple levels: unit tests, integration tests, end-to-end tests, security scans (SAST/DAST), and performance tests. Shift-left testing integrates quality checks into the earliest stages of the pipeline.

5. Release

Artifact promotion through environments, changelog generation, and release orchestration. Git tagging, GitHub Releases, and semantic versioning (SemVer) provide traceability from code to deployed artifact.
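As an illustration of how semantic versioning can drive release tagging, the following Python sketch parses a `MAJOR.MINOR.PATCH` tag and computes the next version from the kind of change being released. The change labels (`breaking`, `feature`, `fix`) are assumed names for this example, and pre-release and build-metadata suffixes are deliberately not handled.

```python
import re
from typing import NamedTuple

# Accepts "1.4.2" or "v1.4.2"; pre-release/build suffixes are out of scope here.
_SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


class Version(NamedTuple):
    major: int
    minor: int
    patch: int

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"


def parse(tag: str) -> Version:
    """Turn a Git tag such as 'v1.4.2' into a comparable Version."""
    match = _SEMVER.match(tag)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH tag: {tag!r}")
    return Version(*(int(part) for part in match.groups()))


def bump(version: Version, change: str) -> Version:
    """Return the next version for a change type (labels are illustrative)."""
    if change == "breaking":
        return Version(version.major + 1, 0, 0)
    if change == "feature":
        return Version(version.major, version.minor + 1, 0)
    if change == "fix":
        return Version(version.major, version.minor, version.patch + 1)
    raise ValueError(f"unknown change type: {change!r}")


if __name__ == "__main__":
    current = parse("v1.4.2")
    print(bump(current, "feature"))
```

Because `Version` is a tuple of integers, versions compare numerically, so `1.10.0` sorts after `1.9.3`, which plain string comparison would get wrong.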

6. Deploy

Infrastructure provisioning and application deployment. Infrastructure-as-Code (Terraform, Pulumi, CloudFormation) provisions resources. Blue/green deployments, canary releases, and rolling updates minimize risk.
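A canary release reduces risk by comparing a small slice of traffic on the new version against the stable baseline before promoting it. The sketch below shows one possible decision rule in Python; the thresholds (`min_requests`, `max_ratio`, `absolute_floor`) and the three verdict strings are assumptions chosen for illustration, and real canary analysis tools typically evaluate more signals than error rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Traffic observed for one version during the analysis window."""

    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(
    baseline: Sample,
    canary: Sample,
    min_requests: int = 500,        # assumed minimum traffic before judging
    max_ratio: float = 1.5,         # canary may be at most 1.5x the baseline rate
    absolute_floor: float = 0.001,  # tolerate up to 0.1% even if baseline is ~0
) -> str:
    """Return 'wait', 'promote', or 'rollback' for the canary."""
    if canary.requests < min_requests:
        return "wait"  # too little data to decide either way
    allowed = max(baseline.error_rate * max_ratio, absolute_floor)
    return "promote" if canary.error_rate <= allowed else "rollback"


if __name__ == "__main__":
    stable = Sample(requests=10_000, errors=20)
    print(canary_verdict(stable, Sample(requests=1_000, errors=2)))
    print(canary_verdict(stable, Sample(requests=1_000, errors=10)))
```

The absolute floor matters when the baseline is nearly error-free: without it, a single failed request on the canary would exceed any multiple of a zero baseline rate and force a rollback.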

7. Operate

Running production systems: configuration management, secrets rotation, database maintenance, capacity planning. This is where SRE practices heavily intersect with DevOps.

8. Monitor

Observability through metrics, logs, and traces. Alerting on SLO breaches, anomaly detection, and feedback into planning. Tools include Prometheus, Grafana, Datadog, New Relic, and Jaeger.
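Logs are easiest to search and correlate with traces when they are emitted as structured records rather than free text. The following is a minimal sketch using only Python's standard `logging` and `json` modules; the field names (`ts`, `level`, `logger`, `message`) and the convention of passing request metadata under a `context` key are assumptions for this example, not a standard.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed as extra={"context": {...}} are merged into the record
        # (the "context" key is a convention assumed by this sketch).
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]  # replace existing handlers to avoid duplicates
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")
    log.info("order %s placed", "A1", extra={"context": {"trace_id": "abc123"}})
```

Carrying a trace identifier in every log line is what lets a log search in one tool be joined with a trace view in another.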

Lifecycle Integration: The loop closes when monitoring insights feed back into planning. A production incident may result in a Jira ticket for the next sprint. A performance regression triggers an architecture review. This feedback loop is what makes DevOps genuinely continuous.

DORA Metrics: Measuring DevOps Performance

The DevOps Research and Assessment (DORA) team, now part of Google Cloud, identified four key metrics that measure software delivery performance:

| Metric | Description | Elite performers | How to measure |
|--------|-------------|------------------|----------------|
| Deployment Frequency | How often code is deployed to production | Multiple times per day | Count of production deployments per day/week |
| Lead Time for Changes | Time from commit to production deployment | Less than 1 hour | Git commit timestamp to deployment completion timestamp |
| Mean Time to Recovery (MTTR) | Time to recover from a production failure | Less than 1 hour | Incident detection timestamp to service restoration timestamp |
| Change Failure Rate | Percentage of deployments causing production failures | Less than 5% | Failed deployments / total deployments |

Warning: Do not optimize DORA metrics in isolation. A team could game deployment frequency by making trivial changes. The metrics must be considered together: frequent deployments with high failure rates indicate instability; low failure rates with infrequent deployments indicate excessive caution. Balance is essential.
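The "How to measure" column above can be turned into a small calculation over deployment records. The Python sketch below assumes a hypothetical `Deployment` record with four timestamps; in practice these would come from your CI/CD system and incident tracker. Using the median for lead time (to dampen outliers) is a choice made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    """One production deployment (hypothetical record shape)."""

    committed_at: datetime
    deployed_at: datetime
    failure_detected_at: Optional[datetime] = None  # set if the deploy caused an incident
    restored_at: Optional[datetime] = None          # set once service was restored

    @property
    def failed(self) -> bool:
        return self.failure_detected_at is not None


def dora_summary(deployments: list[Deployment], window_days: int) -> dict:
    """Compute the four DORA metrics for deployments within a time window."""
    if not deployments or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    # Only incidents that have been resolved contribute to recovery time.
    recoveries = [
        d.restored_at - d.failure_detected_at
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deployments_per_day": len(deployments) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_recovery": (
            sum(recoveries, timedelta()) / len(recoveries) if recoveries else None
        ),
    }


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 9, 0)
    history = [
        Deployment(t0, t0 + timedelta(minutes=30)),
        Deployment(t0, t0 + timedelta(minutes=45)),
        Deployment(
            t0,
            t0 + timedelta(minutes=60),
            failure_detected_at=t0 + timedelta(minutes=70),
            restored_at=t0 + timedelta(minutes=90),
        ),
    ]
    for name, value in dora_summary(history, window_days=1).items():
        print(f"{name}: {value}")
```

Reporting all four values from one function is deliberate: it keeps speed (frequency, lead time) and stability (failure rate, recovery time) side by side, which makes it harder to optimize one in isolation.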

In my experience migrating 50+ microservices at Samsung, focusing on these four metrics provided a clear North Star. We reduced lead time from 2 weeks to 45 minutes and cut the change failure rate from 12% to 3% through investment in automated testing, canary deployments, and feature flags.

DevOps Toolchain Landscape

The DevOps toolchain is organized by lifecycle phase. The following diagram represents a production-grade toolchain as used at enterprise scale:

┌────────────┬────────────┬────────────┬────────────┬────────────┬────────────┬────────────┬────────────┐
│ Plan       │ Code       │ Build      │ Test       │ Release    │ Deploy     │ Operate    │ Monitor    │
├────────────┼────────────┼────────────┼────────────┼────────────┼────────────┼────────────┼────────────┤
│ Jira       │ GitHub     │ GitHub     │ SonarQube  │ GitHub     │ Terraform  │ Ansible    │ Datadog    │
│ Confluence │ GitLab     │ Actions    │ Jest/JUnit │ Releases   │ AWS        │ Kubernetes │ Prometheus │
│ Linear     │ Bitbucket  │ Jenkins    │ Cypress    │ Semantic   │ CodeDeploy │ Helm       │ Grafana    │
│ Miro       │ VS Code    │ CircleCI   │ Snyk       │ Version    │ ArgoCD     │ Puppet     │ Jaeger     │
│            │            │ Docker     │ Trivy      │ tags       │ Flux       │ Chef       │ PagerDuty  │
└────────────┴────────────┴────────────┴────────────┴────────────┴────────────┴────────────┴────────────┘

Tool Selection Principles

  • API-first: Tools must expose APIs for automation and integration
  • Git-native: Configuration stored in version control for auditability and rollback
  • Open standards: Prefer OpenTelemetry, OCI, CNCF-graduated projects
  • Composability: Tools should integrate via webhooks, events, or APIs
  • Exit strategy: Consider migration cost before adopting any tool

DevOps Best Practices Summary

| Practice | Description | Implementation |
|----------|-------------|----------------|
| Everything as Code | Define all infrastructure and configuration in version-controlled code | Terraform, Ansible, Pulumi |
| Immutable Infrastructure | Never modify running servers; replace them with new versions | Container images, blue/green deployments |
| Shift-Left Security | Integrate security scanning into the earliest pipeline stages | Snyk, Trivy, SonarQube, Checkov |
| Trunk-Based Development | Short-lived branches merged to main frequently | Feature flags, mainline development |
| Observability by Design | Applications emit structured logs, metrics, and traces from day one | OpenTelemetry, Prometheus, structured logging |
| Self-Service Platforms | Developer portals enable provisioning without tickets | Backstage, Terraform modules, service catalog |
| Blameless Postmortems | Incident reviews focus on system improvement, not individual blame | Documented runbooks, incident tracking |
| Automated Rollbacks | Failed deployments automatically revert to the last known good state | Argo Rollouts, canary analysis |
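Trunk-based development depends on feature flags to merge unfinished work safely: the code ships dark and is enabled for a growing share of users. Below is a minimal Python sketch of a percentage rollout using stable hashing. The function names and the flag name `new-checkout` are hypothetical, and production systems usually delegate this to a dedicated flag service.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    # The first 4 bytes are plenty to spread users evenly across 100 buckets.
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Enable the flag for roughly rollout_percent of users, deterministically."""
    return bucket(flag, user_id) < rollout_percent


if __name__ == "__main__":
    users = [f"user-{n}" for n in range(1000)]
    enabled = sum(is_enabled("new-checkout", u, 10) for u in users)
    print(f"{enabled} of {len(users)} users see the new checkout")
```

Hashing the flag name together with the user ID keeps each user's experience stable across requests, gives different flags independent user populations, and guarantees that raising the percentage only ever adds users rather than reshuffling them.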

Getting Started: Implementation Roadmap

For organizations beginning their DevOps journey, the following phased approach has proven effective across multiple enterprise migrations:

  1. Assessment: Map current state value stream, identify bottlenecks, establish DORA baseline
  2. Foundation: Implement version control for all code, establish CI pipelines for builds
  3. Automation: Deploy Infrastructure-as-Code, automate testing, implement artifact management
  4. CD Pipeline: Build deployment pipelines with environment promotion, implement blue/green or canary
  5. Observability: Deploy monitoring, logging, and alerting; define initial SLOs
  6. Governance: Implement policy-as-code, secrets management, compliance automation
  7. Platform: Build internal developer platform with self-service capabilities
  8. Optimize: Continuous improvement through DORA measurement, cost optimization, chaos engineering

Insider Advice: Do not attempt to implement all practices simultaneously. Start with version control and CI builds. The biggest mistake I have seen is teams trying to deploy Kubernetes and GitOps before they have reliable builds. Build the foundation first, then add complexity.