Apply SRE principles to build and maintain highly reliable, scalable systems
Site Reliability Engineering is what happens when you ask a software engineer to design an operations team. SRE applies software engineering principles to infrastructure and operations problems.
Embrace Risk
Chasing 100% reliability is impractical and prohibitively expensive. Find the right balance between reliability and cost.
Service Level Objectives
Define reliability targets based on user happiness, not perfection.
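To make the link between an SLO and acceptable risk concrete, here is a minimal sketch of an error budget calculation. It assumes an availability SLO measured in minutes of downtime over a rolling window; the function names and the 30-day default are illustrative choices, not part of any standard tool.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) an availability SLO allows over the window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Return the unspent error budget; a negative value means the SLO was missed."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


if __name__ == "__main__":
    # A 99.9% SLO over 30 days leaves roughly 43.2 minutes of allowed downtime.
    print(round(error_budget_minutes(99.9), 1))
    # After a hypothetical 10-minute outage, about 33.2 minutes remain.
    print(round(budget_remaining(99.9, 10.0), 1))
```

A team could use the remaining budget as a release gate: while budget remains, ship as usual; once it is spent, prioritize reliability work.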
Eliminate Toil
Automate repetitive, manual work. SREs should spend less than 50% of their time on toil.
Monitor Everything
You can't improve what you don't measure.
Automate This Year's Job Away
Continuously automate yourself out of operational work.
Move Fast by Reducing Risk
Good practices enable faster, safer deployments.
SRE is a specific implementation of DevOps principles. DevOps is the philosophy; SRE is the practice. As the saying goes: "class SRE implements interface DevOps."
Incidents are inevitable. What matters is how quickly and effectively you respond. A good incident management process minimizes impact and maximizes learning.
SEV-1 (Critical)
Complete service outage. All hands on deck. Page everyone.
Example: Website down, payment processing failed
SEV-2 (High)
Major feature broken. Significant user impact. Page on-call.
Example: Login broken, search not working
SEV-3 (Medium)
Minor feature degraded. Limited impact. Notify on-call.
Example: Slow page load, minor UI bug
SEV-4 (Low)
Cosmetic issue. No user impact. Fix during business hours.
Example: Typo, minor styling issue
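The severity levels above can be encoded so that alerting tools apply them consistently. The following is a rough sketch only: the `classify` inputs and the `POLICIES` table are hypothetical names that restate the four levels, and a real triage process would weigh more signals than three booleans.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # Critical
    SEV2 = 2  # High
    SEV3 = 3  # Medium
    SEV4 = 4  # Low


@dataclass(frozen=True)
class ResponsePolicy:
    description: str
    action: str


# Response for each level, taken from the definitions above.
POLICIES = {
    Severity.SEV1: ResponsePolicy("Complete service outage", "page everyone"),
    Severity.SEV2: ResponsePolicy("Major feature broken", "page on-call"),
    Severity.SEV3: ResponsePolicy("Minor feature degraded", "notify on-call"),
    Severity.SEV4: ResponsePolicy("Cosmetic issue", "fix during business hours"),
}


def classify(full_outage: bool, major_feature_broken: bool,
             user_impact: bool) -> Severity:
    """Pick the most severe level that matches the observed impact."""
    if full_outage:
        return Severity.SEV1
    if major_feature_broken:
        return Severity.SEV2
    if user_impact:
        return Severity.SEV3
    return Severity.SEV4


if __name__ == "__main__":
    # Hypothetical case: login is broken, but the site is otherwise up.
    sev = classify(full_outage=False, major_feature_broken=True, user_impact=True)
    print(sev.name, "->", POLICIES[sev].action)
```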
# Incident Update Template
**Status:** [INVESTIGATING | IDENTIFIED | MONITORING | RESOLVED]
**Severity:** SEV-X
**Impact:** [Description of user impact]
**Started:** [Timestamp]
## What's Happening
[Brief description of the issue]
## Current Status
[What we're doing right now]
## User Impact
[How this affects users]
## Next Update
[When we'll provide next update]
## Example:
**Status:** INVESTIGATING
**Severity:** SEV-1
**Impact:** Users unable to log in
**Started:** 2024-01-15 14:30 UTC
## What's Happening
Login service is returning 500 errors for all authentication attempts.
## Current Status
Team is investigating database connection issues.
Rollback to previous version in progress.
## User Impact
All users unable to log in. Existing sessions unaffected.
## Next Update
In 15 minutes or when status changes.
Being on-call means you're responsible for responding to incidents. Good on-call practices prevent burnout and ensure reliable incident response.
Runbooks are step-by-step guides for handling common incidents. They enable faster response and reduce cognitive load during stressful situations.
# Runbook: High API Latency
## Symptoms
- API response time > 2 seconds
- Alert: "HighLatency" firing
## Impact
- Slow user experience
- Potential timeouts
## Diagnosis Steps
1. Check Grafana dashboard: api-performance
2. Look for spike in traffic (DDoS?)
3. Check database query times
4. Review recent deployments
5. Check external service status
## Mitigation
### If traffic spike:
- Enable rate limiting
- Scale up API servers
- Contact security team if DDoS
### If database slow:
- Check slow query log
- Kill long-running queries
- Scale read replicas
### If recent deployment:
- Rollback to previous version
- Check deployment logs
## Escalation
If not resolved in 30 minutes:
- Page: @backend-lead
- Slack: #incidents
## Related
- Dashboard: https://grafana.com/d/api
- Logs: https://kibana.com/app/logs
- Previous incidents: INC-123, INC-456
Capacity planning ensures you have enough resources to handle current and future load. It prevents outages from resource exhaustion and optimizes costs.
CPU Utilization
Target: 60-70% average, 90% peak
Memory Usage
Target: 70-80% average, 95% peak
Disk I/O
Monitor IOPS and throughput
Network Bandwidth
Track ingress/egress rates
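As an illustration of how the CPU and memory targets above might be checked automatically, here is a small sketch. The `TARGETS` table copies the percentages from the list; the function name and the wording of the findings are assumptions made for this example.

```python
# Average band and peak ceiling (percent), copied from the targets above.
TARGETS = {
    "cpu": {"average": (60, 70), "peak": 90},
    "memory": {"average": (70, 80), "peak": 95},
}


def check_utilization(resource: str, average: float, peak: float) -> list[str]:
    """Compare measured utilization with the target band and return any findings."""
    target = TARGETS[resource]
    low, high = target["average"]
    findings = []
    if average > high:
        findings.append("average above target band")
    elif average < low:
        findings.append("average below target band")  # possibly over-provisioned
    if peak > target["peak"]:
        findings.append("peak above target")
    return findings


if __name__ == "__main__":
    # Hypothetical measurements for one service.
    print(check_utilization("cpu", average=65, peak=85))
    print(check_utilization("memory", average=85, peak=97))
```

An empty list means the resource sits inside its target band; anything else is a prompt to scale up or, for a low average, to consider scaling down.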
Simple Linear Projection:
1. Measure current usage and growth rate
2. Project forward 6-12 months
3. Add buffer (20-30%) for spikes
4. Plan capacity additions before hitting limits
5. Review quarterly and adjust
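The projection steps above can be sketched in a few lines. This assumes strictly linear growth and a single scalar resource (for example, GB of storage); the function names, the 25% default buffer, and the sample numbers are illustrative.

```python
import math
from typing import Optional


def project_usage(current: float, monthly_growth: float, months: int,
                  buffer: float = 0.25) -> float:
    """Project usage forward linearly, then add a fractional buffer for spikes."""
    return (current + monthly_growth * months) * (1 + buffer)


def months_until_limit(current: float, monthly_growth: float, capacity: float,
                       buffer: float = 0.25) -> Optional[int]:
    """Return the first month in which buffered usage reaches capacity.

    Returns None when usage is flat or shrinking, and 0 when the
    buffered usage already meets or exceeds capacity.
    """
    usable = capacity / (1 + buffer)  # capacity left once the buffer is reserved
    if current >= usable:
        return 0
    if monthly_growth <= 0:
        return None
    return math.ceil((usable - current) / monthly_growth)


if __name__ == "__main__":
    # Hypothetical service: 400 GB used today, growing 50 GB per month, 1000 GB provisioned.
    print(project_usage(400, 50, months=12))   # capacity needed in 12 months, buffer included
    print(months_until_limit(400, 50, 1000))   # months before capacity must be added
```

Rerunning the projection at each quarterly review with fresh measurements keeps the plan aligned with actual growth.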
Chaos Engineering is the practice of intentionally breaking things in production to build confidence in system resilience. "Break things on purpose before they break by accident."
Verify Kubernetes restarts failed pods and traffic routes correctly
Test timeout handling and circuit breakers
Verify failover to replica and graceful degradation
Test auto-scaling and resource limits
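For readers unfamiliar with the circuit breakers mentioned above, the following is a minimal sketch of the pattern such an experiment would exercise. It is a simplified, single-threaded illustration; the class, its parameters, and the injectable `clock` are assumptions for this example rather than the API of any particular library.

```python
import time


class CircuitBreaker:
    """Reject calls for a cool-down period after repeated failures."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: call rejected")
            # Cool-down elapsed: let one trial call through (half-open state).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A chaos experiment that injects dependency failures should show calls being rejected quickly while the circuit is open, then recovering once the dependency is healthy again.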
# Kill random pods
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-example
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api
  # Note: the inline `scheduler` field is Chaos Mesh 1.x syntax;
  # Chaos Mesh 2.x uses a separate Schedule resource instead.
  scheduler:
    cron: '@every 10m'
---
# Network delay
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api
  delay:
    latency: "100ms"
    correlation: "100"
    jitter: "0ms"
  duration: "5m"
You've learned Site Reliability Engineering practices:
Complete your DevOps & SRE journey by learning security practices. Discover DevSecOps, container security, and compliance automation.