Build comprehensive observability with metrics, logs, and traces using Prometheus, Grafana, and the ELK stack
Monitoring tells you when something is wrong. Observability helps you understand why it's wrong and how to fix it. Modern systems require both.
Metrics: Numerical measurements over time (CPU, memory, request rate, latency)
Logs: Timestamped records of discrete events (errors, requests, state changes)
Traces: Request journeys through distributed systems (service-to-service calls)
Prometheus is a time-series database and monitoring system. Grafana visualizes the data with beautiful dashboards. Together, they're the industry standard for metrics.
Pull Model: Prometheus scrapes metrics from targets
Time Series DB: Stores metrics with timestamps
PromQL: Query language for metrics
Alertmanager: Handles alerts and notifications
// Node.js with prom-client
const express = require('express');
const client = require('prom-client');
const app = express();
// Create a Registry
const register = new client.Registry();
// Add default metrics
client.collectDefaultMetrics({ register });
// Custom metrics
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.1, 0.5, 1, 2, 5]
});

const httpRequestTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code']
});

register.registerMetric(httpRequestDuration);
register.registerMetric(httpRequestTotal);

// Middleware to track metrics
app.use((req, res, next) => {
  const start = Date.now();

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;

    httpRequestDuration
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .observe(duration);

    httpRequestTotal
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .inc();
  });

  next();
});

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
app.listen(3000);
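With this instrumentation in place, each Prometheus scrape of /metrics returns plain-text samples in the exposition format. A sketch of the output (abridged; values and labels are illustrative):

# Example /metrics output (abridged, illustrative values)
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",route="/api/users",status_code="200"} 1027
http_requests_total{method="GET",route="/api/users",status_code="500"} 3
# HELP http_request_duration_seconds Duration of HTTP requests in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{method="GET",route="/api/users",status_code="200",le="0.5"} 994
http_request_duration_seconds_sum{method="GET",route="/api/users",status_code="200"} 212.4
http_request_duration_seconds_count{method="GET",route="/api/users",status_code="200"} 1027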
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# Load rules
rule_files:
  - "alerts.yml"

# Scrape configurations
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'api'
    static_configs:
      - targets: ['api:3000']
    metrics_path: '/metrics'

  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
# alerts.yml
groups:
  - name: api_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status_code=~"5.."}[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} for {{ $labels.instance }}"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "95th percentile latency is {{ $value }}s"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.instance }} has been down for 1 minute"

Common PromQL queries:
rate(http_requests_total[5m]) - Request rate
avg(cpu_usage) by (instance) - Avg CPU by instance
sum(rate(errors[1m])) - Total error rate
histogram_quantile(0.95, ...) - 95th percentile
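These building blocks compose into more targeted queries. For example (a sketch using the http_requests_total and http_request_duration_seconds metrics instrumented earlier):

# Share of requests returning 5xx over the last 5 minutes
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# 95th-percentile latency broken down by route
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))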
The ELK stack is one of the most popular solutions for log aggregation and analysis: collect logs from all services, search them, and visualize patterns.
Elasticsearch
Distributed search and analytics engine. Stores and indexes logs.
Logstash
Data processing pipeline. Ingests, transforms, and sends logs to Elasticsearch.
Kibana
Visualization layer. Create dashboards and explore logs.
Filebeat (Bonus)
Lightweight shipper. Forwards logs from servers to Logstash/Elasticsearch.
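As a sketch, a minimal Filebeat configuration that ships a JSON log file to the Logstash beats input (port 5044, matching the logstash.conf shown later in this module) might look like this; the log path is an assumption:

# filebeat.yml (sketch – adjust paths to your deployment)
filebeat.inputs:
  - type: log
    paths:
      - /var/log/app/app.log    # assumed location of the app's JSON logs
    json.keys_under_root: true  # parse each line as JSON
    json.add_error_key: true

output.logstash:
  hosts: ["logstash:5044"]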
// Use structured logging (JSON format)
const winston = require('winston');
const logger = winston.createLogger({
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  transports: [
    new winston.transports.Console(),
    new winston.transports.File({ filename: 'app.log' })
  ]
});
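
// Bad (hypothetical counter-example): an unstructured string is hard to filter or aggregate in Elasticsearch
logger.info(`User 123 (user@example.com) logged in from 192.168.1.1 in 245ms`);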
// Good: Structured log
logger.info('User login', {
  userId: '123',
  email: 'user@example.com',
  ip: '192.168.1.1',
  duration: 245
});
// Output:
{
  "level": "info",
  "message": "User login",
  "timestamp": "2024-01-15T10:30:00.000Z",
  "userId": "123",
  "email": "user@example.com",
  "ip": "192.168.1.1",
  "duration": 245
}
# logstash.conf
input {
  beats {
    port => 5044
  }
}

filter {
  # Parse JSON logs
  json {
    source => "message"
  }

  # Add geoip data
  geoip {
    source => "ip"
  }

  # Parse timestamps
  date {
    match => [ "timestamp", "ISO8601" ]
  }

  # Add tags based on log level
  if [level] == "error" {
    mutate {
      add_tag => [ "error" ]
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}

Distributed tracing tracks requests as they flow through microservices. It helps identify bottlenecks and understand service dependencies.
// Node.js with OpenTelemetry
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { SpanStatusCode } = require('@opentelemetry/api');

// Initialize tracer
const provider = new NodeTracerProvider();

// Configure Jaeger exporter
const exporter = new JaegerExporter({
  endpoint: 'http://jaeger:14268/api/traces',
});

provider.addSpanProcessor(
  new BatchSpanProcessor(exporter)
);
provider.register();

// Auto-instrument HTTP
registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
  ],
});

// Manual instrumentation
const tracer = provider.getTracer('my-service');

app.get('/api/users/:id', async (req, res) => {
  const span = tracer.startSpan('get-user');
  try {
    span.setAttribute('user.id', req.params.id);
    const user = await db.getUser(req.params.id);
    span.addEvent('user-fetched');
    res.json(user);
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw error;
  } finally {
    span.end();
  }
});

Service Level Indicators, Objectives, and Agreements define and measure reliability. They're the foundation of SRE practices.
SLI (Service Level Indicator): A quantitative measure of service level. What you measure.
Examples: request error rate, 95th-percentile latency, availability
SLO (Service Level Objective): Target value for an SLI. What you promise internally.
Examples: 99.9% of requests succeed, p95 latency under 2 seconds
SLA (Service Level Agreement): Contract with consequences. What you promise customers.
Example:
99.95% uptime or customer gets 10% credit
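As a concrete sketch, an availability SLI for the API above could be measured with the http_requests_total metric from earlier (one of several reasonable definitions):

# Availability SLI: fraction of requests over 30 days that did not return 5xx
sum(rate(http_requests_total{status_code!~"5.."}[30d]))
  / sum(rate(http_requests_total[30d]))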
If your SLO is 99.9% uptime, you have a 0.1% error budget. That's ~43 minutes of downtime per month.
Use the error budget to balance feature velocity with reliability: while budget remains, ship features; once it is exhausted, prioritize reliability work.
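The arithmetic behind those numbers is simple; a small helper (hypothetical function name) makes it concrete:

// errorBudget.js – minutes of downtime allowed by an availability SLO
function errorBudgetMinutes(slo, days = 30) {
  // The budget is the (1 - SLO) share of the period, expressed in minutes
  return (1 - slo) * days * 24 * 60;
}

console.log(errorBudgetMinutes(0.999));   // ≈ 43.2 minutes per 30-day month
console.log(errorBudgetMinutes(0.9995));  // ≈ 21.6 minutes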
You've learned comprehensive observability practices: metrics with Prometheus and Grafana, log aggregation with the ELK stack, distributed tracing with OpenTelemetry and Jaeger, and reliability targets with SLIs, SLOs, and error budgets.
Now that you can monitor and observe systems, let's dive deep into Site Reliability Engineering practices including incident management and chaos engineering.
Continue to Module 6: Site Reliability Engineering →