Module 6: Site Reliability Engineering

Apply SRE principles to build and maintain highly reliable systems at scale

🎯 What is SRE?

Site Reliability Engineering is what happens when you ask a software engineer to design an operations team. SRE applies software engineering principles to infrastructure and operations problems.

Core SRE Principles

Embrace Risk

Aiming for 100% reliability is both unrealistic and prohibitively expensive. Find the right balance between reliability and the pace of change.

Service Level Objectives

Define reliability targets based on user happiness, not perfection.

Eliminate Toil

Automate repetitive, manual work. SREs should spend less than 50% of their time on toil.

Monitor Everything

You can't improve what you don't measure.

Automate This Year's Job Away

Continuously automate yourself out of operational work.

Move Fast by Reducing Risk

Good practices enable faster, safer deployments.
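
The first two principles meet in the error budget: the SLO fixes how much unreliability users will tolerate, and whatever is left over is budget you can spend on releases and experiments. Here's a minimal sketch (in Python, not from this module) of the arithmetic for a 30-day window:

# A minimal sketch (not from this module) of the error-budget arithmetic:
# how many minutes of unavailability a given SLO allows, and how much of
# that budget is left after some downtime.
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability over the window for a given SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means it's blown)."""
    return 1 - downtime_minutes / error_budget_minutes(slo, window_days)

print(f"{error_budget_minutes(0.999):.1f} min/30d")      # 43.2 min for a 99.9% SLO
print(f"{budget_remaining(0.999, 10):.0%} budget left")   # ~77% after 10 min of downtime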

💡 SRE vs DevOps:

SRE is a concrete implementation of DevOps principles: DevOps is the philosophy, SRE is the practice. As Google puts it, "class SRE implements interface DevOps."

🚨 Incident Management

Incidents are inevitable. What matters is how quickly and effectively you respond. A good incident management process minimizes impact and maximizes learning.

Incident Severity Levels

SEV-1 (Critical)

Complete service outage. All hands on deck. Page everyone.

Example: Website down, payment processing failed

SEV-2 (High)

Major feature broken. Significant user impact. Page on-call.

Example: Login broken, search not working

SEV-3 (Medium)

Minor feature degraded. Limited impact. Notify on-call.

Example: Slow page load, minor UI bug

SEV-4 (Low)

Cosmetic issue. No user impact. Fix during business hours.

Example: Typo, minor styling issue

Incident Response Process

1. Detection & Alert

  • Monitoring system detects issue
  • Alert sent to on-call engineer
  • Acknowledge alert within 5 minutes

2. Triage & Assess

  • Determine severity level
  • Assess user impact
  • Decide if escalation needed

3. Communicate

  • Create incident channel (Slack)
  • Assign incident commander
  • Update status page
  • Notify stakeholders

4. Mitigate

  • Focus on restoring service first
  • Roll back if possible
  • Apply temporary fixes
  • Document actions taken

5. Resolve & Learn

  • Verify service restored
  • Close incident
  • Schedule postmortem
  • Create action items
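
To learn from steps 1-5, record a timestamp for each stage as it happens; time-to-acknowledge and time-to-resolve then fall out directly for the postmortem. Here's a minimal, illustrative sketch (the class and field names are assumptions, not part of any tool):

# A minimal, illustrative sketch (names are assumptions, not from any tool)
# of recording the incident lifecycle above and deriving response metrics.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Incident:
    severity: str                                      # e.g. "SEV-1"
    detected_at: datetime                              # step 1: detection
    acknowledged_at: datetime | None = None            # step 1: on-call ack
    resolved_at: datetime | None = None                # step 5: verified restore
    actions: list[str] = field(default_factory=list)   # step 4: actions taken

    def log(self, action: str) -> None:
        """Append a timestamped note while mitigating."""
        self.actions.append(f"{datetime.now(timezone.utc).isoformat()} {action}")

    @property
    def time_to_acknowledge(self):
        """Detection to acknowledgment (target above: within 5 minutes)."""
        return self.acknowledged_at - self.detected_at if self.acknowledged_at else None

    @property
    def time_to_resolve(self):
        """Detection to verified resolution; feeds MTTR reporting."""
        return self.resolved_at - self.detected_at if self.resolved_at else None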

Incident Communication Template

# Incident Update Template

**Status:** [INVESTIGATING | IDENTIFIED | MONITORING | RESOLVED]
**Severity:** SEV-X
**Impact:** [Description of user impact]
**Started:** [Timestamp]

## What's Happening
[Brief description of the issue]

## Current Status
[What we're doing right now]

## User Impact
[How this affects users]

## Next Update
[When we'll provide next update]

## Example:
**Status:** INVESTIGATING
**Severity:** SEV-1
**Impact:** Users unable to log in
**Started:** 2024-01-15 14:30 UTC

## What's Happening
Login service is returning 500 errors for all authentication attempts.

## Current Status
Team is investigating database connection issues. 
Rollback to previous version in progress.

## User Impact
All users unable to log in. Existing sessions unaffected.

## Next Update
In 15 minutes or when status changes.

📱 On-Call Best Practices

Being on-call means you're responsible for responding to incidents. Good on-call practices prevent burnout and ensure reliable incident response.

On-Call Rotation

✅ Good Practices:

  • Rotate weekly or bi-weekly
  • Primary and secondary on-call
  • Compensate with time off
  • Limit consecutive rotations
  • Share load across team

❌ Avoid:

  • Same person always on-call
  • No backup coverage
  • Unpaid on-call duty
  • Noisy alerts and alert fatigue
  • No handoff process
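
A fair rotation is easy to generate mechanically. Here's a minimal sketch, assuming a four-person team, weekly handoffs, and next week's primary serving as this week's secondary (the names and dates are made-up examples):

# A minimal sketch of a fair weekly rotation with primary and secondary roles.
from datetime import date, timedelta

def weekly_rotation(team, start, weeks):
    """Yield (week_start, primary, secondary); next week's primary serves as
    this week's secondary, which doubles as a built-in handoff."""
    for week in range(weeks):
        primary = team[week % len(team)]
        secondary = team[(week + 1) % len(team)]
        yield start + timedelta(weeks=week), primary, secondary

team = ["alice", "bob", "carol", "dave"]            # hypothetical engineers
for week_start, primary, secondary in weekly_rotation(team, date(2024, 1, 1), 8):
    print(f"{week_start}  primary={primary:<6}  secondary={secondary}")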

Runbooks

Runbooks are step-by-step guides for handling common incidents. They enable faster response and reduce cognitive load during stressful situations.

# Runbook: High API Latency

## Symptoms
- API response time > 2 seconds
- Alert: "HighLatency" firing

## Impact
- Slow user experience
- Potential timeouts

## Diagnosis Steps
1. Check Grafana dashboard: api-performance
2. Look for spike in traffic (DDoS?)
3. Check database query times
4. Review recent deployments
5. Check external service status

## Mitigation
### If traffic spike:
- Enable rate limiting
- Scale up API servers
- Contact security team if DDoS

### If database slow:
- Check slow query log
- Kill long-running queries
- Scale read replicas

### If recent deployment:
- Rollback to previous version
- Check deployment logs

## Escalation
If not resolved in 30 minutes:
- Page: @backend-lead
- Slack: #incidents

## Related
- Dashboard: https://grafana.com/d/api
- Logs: https://kibana.com/app/logs
- Previous incidents: INC-123, INC-456

📈 Capacity Planning

Capacity planning ensures you have enough resources to handle current and future load. It prevents outages from resource exhaustion and optimizes costs.

Key Metrics to Track

CPU Utilization

Target: 60-70% average, 90% peak

Memory Usage

Target: 70-80% average, 95% peak

Disk I/O

Monitor IOPS and throughput

Network Bandwidth

Track ingress/egress rates
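
One way to act on these targets is to flag any resource whose utilization eats into the headroom they leave. A minimal sketch, assuming the ceilings above (the function name is just for illustration):

# A minimal sketch that flags resources running out of headroom; the
# ceilings mirror the targets listed above.
TARGETS = {
    # resource: (average ceiling, peak ceiling) as fractions of capacity
    "cpu":    (0.70, 0.90),
    "memory": (0.80, 0.95),
}

def check_headroom(resource: str, average: float, peak: float) -> list[str]:
    """Return warnings when observed utilization exceeds the target ceilings."""
    avg_ceiling, peak_ceiling = TARGETS[resource]
    warnings = []
    if average > avg_ceiling:
        warnings.append(f"{resource}: average {average:.0%} exceeds target {avg_ceiling:.0%}")
    if peak > peak_ceiling:
        warnings.append(f"{resource}: peak {peak:.0%} exceeds target {peak_ceiling:.0%}")
    return warnings

print(check_headroom("cpu", average=0.75, peak=0.85))
# ['cpu: average 75% exceeds target 70%']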

Growth Forecasting

Simple Linear Projection (see the sketch after these steps):

1. Measure current usage and growth rate

2. Project forward 6-12 months

3. Add buffer (20-30%) for spikes

4. Plan capacity additions before hitting limits

5. Review quarterly and adjust
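
Here's a minimal sketch of that projection, assuming six months of usage history with made-up numbers and a 25% buffer:

# A minimal sketch of the projection above: fit a line to recent monthly
# usage, project 12 months forward, and add a 25% buffer.
from statistics import linear_regression   # Python 3.10+

months = [0, 1, 2, 3, 4, 5]
usage  = [400, 430, 455, 490, 520, 555]    # e.g. storage in GB per month

slope, intercept = linear_regression(months, usage)

horizon = 12                               # months to project forward
projected = intercept + slope * (months[-1] + horizon)
with_buffer = projected * 1.25             # 25% buffer for spikes

print(f"projected usage in {horizon} months: {projected:.0f} GB")
print(f"plan capacity for at least:          {with_buffer:.0f} GB")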

💥 Chaos Engineering

Chaos Engineering is the practice of intentionally breaking things in production to build confidence in system resilience. "Break things on purpose before they break by accident."

Chaos Experiments

1. Kill Random Pods

Verify Kubernetes restarts failed pods and traffic routes correctly

2. Inject Network Latency

Test timeout handling and circuit breakers

3. Simulate Database Failure

Verify failover to replica and graceful degradation

4. CPU/Memory Stress

Test auto-scaling and resource limits

Chaos Mesh Example

# Kill random pods
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-example
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api
  scheduler:
    cron: '@every 10m'

---
# Network delay
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - production
    labelSelectors:
      app: api
  delay:
    latency: "100ms"
    correlation: "100"
    jitter: "0ms"
  duration: "5m"

⚠️ Chaos Engineering Rules:

  • Start small (dev/staging first)
  • Have rollback plan ready
  • Run during business hours initially
  • Monitor closely during experiments
  • Document findings and improvements
  • Get buy-in from leadership

📝 Module Summary

You've learned Site Reliability Engineering practices:

SRE Practices:

  • ✓ SRE principles and philosophy
  • ✓ Incident management process
  • ✓ On-call best practices
  • ✓ Runbook creation

Advanced Topics:

  • ✓ Capacity planning
  • ✓ Performance optimization
  • ✓ Chaos engineering
  • ✓ Disaster recovery

🎯 Next Steps

Complete your DevOps & SRE journey by learning security practices. Discover DevSecOps, container security, and compliance automation.

Continue to Module 7: Security & Compliance →