SRE (Site Reliability Engineering) is a way of running software services in which operations is treated as a software engineering problem.
Instead of relying mostly on manual work (“keep the site up”), SRE teams use software, automation, and data to make services reliable, scalable, and efficient, while still shipping features at a healthy pace.
What SRE actually does
- Defines reliability targets using SLOs (Service Level Objectives), like “99.9% successful requests” or “p95 latency < 300ms”.
- Measures reality with monitoring/metrics/logs/traces and incident reviews.
- Prevents outages through automation, safer releases, testing, capacity planning, and resilience patterns.
- Responds to incidents (on-call), restores service fast, and then fixes the root causes so the same failure doesn’t recur.
- Reduces toil: eliminates repetitive manual tasks by automating them.
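To make the "define targets, then measure reality" steps concrete, here is a minimal Python sketch that computes a success rate and a p95 latency from a batch of requests and checks them against the two example targets above. The `Request` type, the function names, and the nearest-rank percentile method are assumptions made for illustration; a real setup would usually get these numbers from its monitoring system rather than from an in-memory list.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request (hypothetical shape, for illustration only)."""
    ok: bool
    latency_ms: float


def success_rate(requests):
    """Fraction of requests that succeeded."""
    if not requests:
        raise ValueError("need at least one request")
    return sum(r.ok for r in requests) / len(requests)


def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def meets_slos(requests, min_success=0.999, max_p95_ms=300.0):
    """Check the two example targets: success rate and p95 latency."""
    p95 = percentile([r.latency_ms for r in requests], 95)
    return success_rate(requests) >= min_success and p95 < max_p95_ms
```

For example, `meets_slos([Request(True, 100.0)] * 19 + [Request(True, 400.0)])` returns `True`: every request succeeded, and the single slow request sits above the 95th percentile, so it does not move p95.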
Key SRE concepts (the core vocabulary)
- SLI (Service Level Indicator): what you measure (e.g., success rate, latency).
- SLO (Service Level Objective): your reliability goal for an SLI (e.g., 99.9% of requests succeed over a 30-day window).
- SLA (Service Level Agreement): external promise/contract (often with penalties).
- Error Budget: the allowed “unreliability”, i.e. 100% minus the SLO (e.g., a 99.9% SLO leaves a 0.1% budget for failures). If you’re burning the budget too fast, you slow down releases and focus on reliability work.
- Toil: repetitive, manual, automatable work SRE tries to eliminate.
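The error budget arithmetic is easier to see in code. Here is a minimal Python sketch, assuming a request-based SLO; the function names are made up for illustration, and the `threshold` default in `should_slow_releases` is an assumed policy knob, not a standard value. The ratio of the observed error rate to the budgeted error rate is often called the burn rate: above 1.0, the budget runs out before the window ends.

```python
def error_budget(slo):
    """Allowed failure fraction, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo


def allowed_downtime_minutes(slo, window_days=30):
    """Budget expressed as minutes of full outage within the window."""
    return error_budget(slo) * window_days * 24 * 60


def budget_consumed(total_requests, failed_requests, slo):
    """Fraction of the error budget already spent (1.0 means all of it)."""
    allowed_failures = error_budget(slo) * total_requests
    return failed_requests / allowed_failures


def burn_rate(total_requests, failed_requests, slo):
    """Observed error rate divided by the budgeted error rate."""
    observed_error_rate = failed_requests / total_requests
    return observed_error_rate / error_budget(slo)


def should_slow_releases(total_requests, failed_requests, slo, threshold=1.0):
    """Assumed policy: slow down once the burn rate exceeds the threshold."""
    return burn_rate(total_requests, failed_requests, slo) > threshold
```

With a 99.9% SLO over 30 days, `allowed_downtime_minutes(0.999)` comes out to roughly 43.2 minutes. If 20 of 10,000 requests fail, the burn rate is about 2, meaning the budget is being spent about twice as fast as the SLO allows.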
SRE vs DevOps (simple difference)
- DevOps is a culture/mindset: dev + ops collaborate, automate, ship faster.
- SRE is a concrete implementation model: reliability is managed with engineering + SLOs + error budgets and clear operational practices.
If you tell me what kind of system you’re working with (Kubernetes/EKS, web apps, data pipelines, etc.), I can give a practical example of SLOs, error budgets, and what an SRE playbook looks like for it.