
Module 5: Monitoring & Observability

Build comprehensive observability with metrics, logs, and traces using Prometheus, Grafana, and the ELK stack

🎯 Monitoring vs Observability

Monitoring tells you when something is wrong. Observability helps you understand why it's wrong and how to fix it. Modern systems require both.

Monitoring

  • Predefined metrics and alerts
  • Known failure modes
  • "Is the system up?"
  • Dashboards and graphs

Observability

  • Explore unknown unknowns
  • Debug novel issues
  • "Why is it behaving this way?"
  • Metrics, logs, and traces

The Three Pillars of Observability

1. Metrics

Numerical measurements over time (CPU, memory, request rate, latency)

2. Logs

Timestamped records of discrete events (errors, requests, state changes)

3. Traces

Request journey through distributed systems (service-to-service calls)

📊 Prometheus & Grafana

Prometheus is a time-series database and monitoring system; Grafana visualizes its data through dashboards. Together they are the de facto standard for metrics in cloud-native environments.

Prometheus Architecture

Pull Model: Prometheus scrapes metrics from targets

Time Series DB: Stores metrics with timestamps

PromQL: Query language for metrics

Alertmanager: Handles alerts and notifications

Instrumenting Your Application

// Node.js with prom-client
const express = require('express');
const client = require('prom-client');

const app = express();

// Create a Registry
const register = new client.Registry();

// Add default metrics
client.collectDefaultMetrics({ register });

// Custom metrics
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.1, 0.5, 1, 2, 5]
});

const httpRequestTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

register.registerMetric(httpRequestDuration);
register.registerMetric(httpRequestTotal);

// Middleware to track metrics
app.use((req, res, next) => {
  const start = Date.now();
  
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    
    httpRequestDuration
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .observe(duration);
    
    httpRequestTotal
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .inc();
  });
  
  next();
});

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(3000);
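
With the app running, Prometheus scrapes http://localhost:3000/metrics on its pull schedule. The response uses the plain-text exposition format and looks roughly like this (illustrative values, default metrics omitted):

# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",route="/api/users/:id",status_code="200"} 42

# HELP http_request_duration_seconds Duration of HTTP requests in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{method="GET",route="/api/users/:id",status_code="200",le="0.1"} 40
http_request_duration_seconds_bucket{method="GET",route="/api/users/:id",status_code="200",le="+Inf"} 42
http_request_duration_seconds_sum{method="GET",route="/api/users/:id",status_code="200"} 3.1
http_request_duration_seconds_count{method="GET",route="/api/users/:id",status_code="200"} 42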

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# Load rules
rule_files:
  - "alerts.yml"

# Scrape configurations
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
  
  - job_name: 'api'
    static_configs:
      - targets: ['api:3000']
    metrics_path: '/metrics'
  
  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

Alert Rules

# alerts.yml
groups:
  - name: api_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          sum by (instance) (rate(http_requests_total{status_code=~"5.."}[5m]))
            / sum by (instance) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} on {{ $labels.instance }}"
      
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, 
            rate(http_request_duration_seconds_bucket[5m])
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "95th percentile latency is {{ $value }}s"
      
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.instance }} has been down for 1 minute"

PromQL Examples

rate(http_requests_total[5m]) - Request rate

avg(cpu_usage) by (instance) - Avg CPU by instance

sum(rate(errors[1m])) - Total error rate

histogram_quantile(0.95, ...) - 95th percentile
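
These queries are usually run from Grafana, but you can also send them to Prometheus's HTTP API (/api/v1/query) from your own tooling. A minimal Node.js sketch, assuming Prometheus is reachable at http://localhost:9090:

// Query Prometheus's HTTP API with a PromQL expression (Node 18+ for global fetch)
async function queryPrometheus(promql) {
  const url = 'http://localhost:9090/api/v1/query?query=' + encodeURIComponent(promql);
  const res = await fetch(url);
  const body = await res.json();

  if (body.status !== 'success') {
    throw new Error(`Prometheus query failed: ${JSON.stringify(body)}`);
  }
  // Instant queries return a vector: one { metric, value: [timestamp, "value"] } per series
  return body.data.result;
}

// Example: per-instance request rate over the last 5 minutes
queryPrometheus('sum by (instance) (rate(http_requests_total[5m]))')
  .then(series => {
    for (const s of series) {
      console.log(s.metric.instance, Number(s.value[1]).toFixed(2), 'req/s');
    }
  })
  .catch(console.error);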

📝 ELK Stack (Elasticsearch, Logstash, Kibana)

The ELK stack is one of the most widely used solutions for log aggregation and analysis: collect logs from every service in one place, search them, and visualize patterns.

Components

Elasticsearch

Distributed search and analytics engine. Stores and indexes logs.

Logstash

Data processing pipeline. Ingests, transforms, and sends logs to Elasticsearch.

Kibana

Visualization layer. Create dashboards and explore logs.

Filebeat (Bonus)

Lightweight shipper. Forwards logs from servers to Logstash/Elasticsearch.

Structured Logging

// Use structured logging (JSON format)
const winston = require('winston');

const logger = winston.createLogger({
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  transports: [
    new winston.transports.Console(),
    new winston.transports.File({ filename: 'app.log' })
  ]
});

// Good: Structured log
logger.info('User login', {
  userId: '123',
  email: 'user@example.com',
  ip: '192.168.1.1',
  duration: 245
});

// Output:
{
  "level": "info",
  "message": "User login",
  "timestamp": "2024-01-15T10:30:00.000Z",
  "userId": "123",
  "email": "user@example.com",
  "ip": "192.168.1.1",
  "duration": 245
}
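
Rather than sprinkling logger calls by hand, you can emit one structured line per request from middleware. A minimal sketch using the logger defined above, assuming an Express app like the one in the Prometheus section (field names are just a suggestion):

// Express middleware: log one structured entry per completed request
app.use((req, res, next) => {
  const start = Date.now();

  res.on('finish', () => {
    const level = res.statusCode >= 500 ? 'error' : 'info';
    logger.log(level, 'http_request', {
      method: req.method,
      path: req.originalUrl,
      statusCode: res.statusCode,
      durationMs: Date.now() - start,
      ip: req.ip
    });
  });

  next();
});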

Logstash Pipeline

# logstash.conf
input {
  beats {
    port => 5044
  }
}

filter {
  # Parse JSON logs
  json {
    source => "message"
  }
  
  # Add geoip data
  geoip {
    source => "ip"
  }
  
  # Parse timestamps
  date {
    match => [ "timestamp", "ISO8601" ]
  }
  
  # Add tags based on log level
  if [level] == "error" {
    mutate {
      add_tag => [ "error" ]
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}
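
With logs flowing into the daily logs-* indices, you would normally explore them in Kibana, but querying Elasticsearch directly shows what is actually stored. A hedged Node.js sketch against the plain _search API (index pattern and field names follow the pipeline above; adjust for your mapping):

// Fetch the last 15 minutes of error-level logs via Elasticsearch's _search API
async function recentErrors() {
  const res = await fetch('http://elasticsearch:9200/logs-*/_search', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      size: 20,
      sort: [{ '@timestamp': 'desc' }],
      query: {
        bool: {
          filter: [
            // Depending on your index mapping you may need "level.keyword" here
            { term: { level: 'error' } },
            { range: { '@timestamp': { gte: 'now-15m' } } }
          ]
        }
      }
    })
  });

  const body = await res.json();
  return body.hits.hits.map(hit => hit._source);
}

recentErrors().then(logs => logs.forEach(l => console.log(l['@timestamp'], l.message)));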

🔍 Distributed Tracing

Distributed tracing tracks requests as they flow through microservices. It helps identify bottlenecks and understand service dependencies.

OpenTelemetry Example

// Node.js with OpenTelemetry
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { SpanStatusCode } = require('@opentelemetry/api');

// Initialize tracer
const provider = new NodeTracerProvider();

// Configure Jaeger exporter
const exporter = new JaegerExporter({
  endpoint: 'http://jaeger:14268/api/traces',
});

provider.addSpanProcessor(
  new BatchSpanProcessor(exporter)
);

provider.register();

// Auto-instrument HTTP
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
  ],
});

// Manual instrumentation
const tracer = provider.getTracer('my-service');

app.get('/api/users/:id', async (req, res) => {
  const span = tracer.startSpan('get-user');
  
  try {
    span.setAttribute('user.id', req.params.id);
    
    const user = await db.getUser(req.params.id);
    span.addEvent('user-fetched');
    
    res.json(user);
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw error;
  } finally {
    span.end();
  }
});

💡 What Tracing Shows:

  • Request flow across services
  • Time spent in each service
  • Service dependencies
  • Bottlenecks and slow operations
  • Error propagation
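
Traces only connect across services because the trace context travels with each outbound request (the W3C traceparent header). The HttpInstrumentation above injects it automatically for the http module; if you build a request by hand, you can inject the context yourself. A sketch using the OpenTelemetry API (the downstream URL is illustrative):

const { context, propagation } = require('@opentelemetry/api');

async function callDownstream(path) {
  // Copy the active trace context (trace ID, span ID) into outgoing headers
  const headers = {};
  propagation.inject(context.active(), headers);

  // The downstream service extracts the same context and continues the trace
  return fetch(`http://orders-service:8080${path}`, { headers });
}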

🎯 SLIs, SLOs, and SLAs

Service Level Indicators, Objectives, and Agreements define and measure reliability. They're the foundation of SRE practices.

SLI (Service Level Indicator)

A quantitative measure of service level. What you measure.

Examples:

  • Request latency (95th percentile < 200ms)
  • Availability (% of successful requests)
  • Error rate (% of failed requests)
  • Throughput (requests per second)
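
SLIs are computed from data you already collect. As a back-of-the-envelope illustration, here is how availability and p95 latency fall out of a window of request records (a sketch with made-up data; in practice you would derive these from PromQL):

// Compute two SLIs from a window of request records
const requests = [
  { statusCode: 200, durationMs: 120 },
  { statusCode: 200, durationMs: 180 },
  { statusCode: 503, durationMs: 950 },
  { statusCode: 200, durationMs: 90 }
  // ...in reality, thousands of records
];

const good = requests.filter(r => r.statusCode < 500).length;
const availability = good / requests.length;                   // 0.75 for this sample

const sorted = requests.map(r => r.durationMs).sort((a, b) => a - b);
const p95 = sorted[Math.min(sorted.length - 1, Math.floor(0.95 * sorted.length))];

console.log(`availability: ${(availability * 100).toFixed(2)}%`);
console.log(`p95 latency: ${p95}ms`);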

SLO (Service Level Objective)

Target value for an SLI. What you promise internally.

Examples:

  • 99.9% of requests succeed (availability)
  • 95% of requests complete in < 200ms
  • Error rate < 0.1%

SLA (Service Level Agreement)

Contract with consequences. What you promise customers.

Example:

99.95% uptime or customer gets 10% credit

⚠️ Error Budget

If your SLO is 99.9% uptime, you have a 0.1% error budget. That's ~43 minutes of downtime per month.

Use the error budget to balance feature velocity against reliability: while budget remains, keep shipping features; once it is exhausted, shift focus to reliability work.
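
The arithmetic is simple enough to automate. A small sketch that turns an availability SLO into a monthly error budget and tracks how much has been burned (the 99.9% case reproduces the ~43 minutes above):

// Turn an availability SLO into a monthly error budget, and track consumption
function errorBudget(sloPercent, windowDays = 30) {
  const windowMinutes = windowDays * 24 * 60;                  // 43,200 for 30 days
  const budgetMinutes = windowMinutes * (1 - sloPercent / 100);
  return { windowMinutes, budgetMinutes };
}

const { budgetMinutes } = errorBudget(99.9);                   // ≈ 43.2 minutes
const downtimeSoFar = 12;                                      // minutes of downtime this month (example)

console.log(`budget: ${budgetMinutes.toFixed(1)} min`);
console.log(`remaining: ${(budgetMinutes - downtimeSoFar).toFixed(1)} min`);
console.log(`burned: ${((downtimeSoFar / budgetMinutes) * 100).toFixed(0)}%`);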

📝 Module Summary

You've learned comprehensive observability practices:

Tools:

  • ✓ Prometheus for metrics
  • ✓ Grafana for visualization
  • ✓ ELK stack for logs
  • ✓ Jaeger for tracing

Concepts:

  • ✓ Three pillars of observability
  • ✓ SLIs, SLOs, SLAs
  • ✓ Error budgets
  • ✓ Alerting best practices

🎯 Next Steps

Now that you can monitor and observe systems, let's dive deep into Site Reliability Engineering practices including incident management and chaos engineering.

Continue to Module 6: Site Reliability Engineering →