Keep systems running, detect issues before users do, and respond to incidents with confidence. The SRE playbook for modern engineering teams.
Observability is the ability to understand the internal state of a system from its external outputs. You can't fix what you can't see. The three pillars — metrics, logs, and traces — together give you deep visibility into your production systems.
**Metrics.** Numeric measurements over time. Aggregated and efficient to store. Perfect for dashboards, alerting, and capacity planning.
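The idea can be sketched as a tiny in-process metric (the `LatencyMetric` class here is hypothetical; real systems use a Prometheus or StatsD client library, which also handles aggregation and export):

```python
from collections import deque

class LatencyMetric:
    """Illustrative sketch: keep a bounded rolling window of latency
    samples and compute percentiles for dashboards and alerts."""

    def __init__(self, max_samples=10_000):
        # deque(maxlen=...) keeps memory bounded: old samples fall off
        self.samples = deque(maxlen=max_samples)

    def observe(self, millis):
        self.samples.append(millis)

    def percentile(self, p):
        ordered = sorted(self.samples)
        idx = min(len(ordered) - 1, int(len(ordered) * p / 100))
        return ordered[idx]

m = LatencyMetric()
for ms in [12, 15, 11, 300, 14, 13]:
    m.observe(ms)
print(m.percentile(99))  # the tail is where latency problems hide
```

Percentiles (p95, p99) matter more than averages here: a mean of these six samples looks healthy while one user waited 300 ms.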
**Logs.** Timestamped, structured records of events: the "why" behind the numbers. Essential for debugging production issues.
```
// ✅ Structured JSON logging
{
  "level": "error",
  "msg": "Payment failed",
  "userId": "u_abc123",
  "amount": 99.99,
  "error": "card_declined",
  "traceId": "abc-xyz-789",
  "timestamp": "2025-03-16T10:30:00Z"
}

// Tools: ELK Stack, Loki, Datadog Logs
```
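Emitting records like the one above takes little code. A sketch using only Python's standard library (the `JsonFormatter` class and the `ctx` field are illustrative; in practice teams often reach for structlog or python-json-logger):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Illustrative formatter: render each log record as one JSON line."""

    def format(self, record):
        entry = {
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
        }
        # Merge per-request fields (userId, traceId, ...) if attached
        entry.update(getattr(record, "ctx", {}))
        return json.dumps(entry)

logger = logging.getLogger("payments")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

logger.error("Payment failed",
             extra={"ctx": {"userId": "u_abc123", "error": "card_declined"}})
```

One JSON object per line is what log pipelines like ELK and Loki expect, so fields become filterable without regex gymnastics.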
**Traces.** Follow a request as it flows through multiple services. Essential for microservices debugging — see exactly where latency comes from.
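The core mechanism is propagating one trace ID across every hop. A minimal sketch, assuming a hypothetical `X-Trace-Id` header (OpenTelemetry does this for you, plus span timing and export):

```python
import contextvars
import uuid

# Request-scoped storage: each concurrent request sees its own value
trace_id = contextvars.ContextVar("trace_id", default=None)

def start_request():
    """At the service entry point, mint a trace ID (or inherit one
    from the incoming request's headers)."""
    trace_id.set(str(uuid.uuid4()))

def call_downstream(headers):
    """Every outbound call carries the same trace ID, so the tracing
    backend can stitch all the spans into one request timeline."""
    headers["X-Trace-Id"] = trace_id.get()
    return headers

start_request()
h = call_downstream({})
print(h["X-Trace-Id"])
```

Note the `traceId` field in the log example above — with the same ID in logs and traces, you can jump from a slow span straight to its error logs.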
Reliability targets are not optional — they define your team's promises to users and guide engineering priorities. Understanding the difference between SLI, SLO, and SLA is fundamental to SRE.
| Term | Definition | Example |
|---|---|---|
| SLI (Service Level Indicator) | The actual metric you measure | 99.2% of requests responded in <200ms over the last 30 days |
| SLO (Service Level Objective) | Your internal target for the SLI | 99.5% of requests must respond in <200ms — internal goal |
| SLA (Service Level Agreement) | The contractual promise to customers (with penalties) | 99.9% uptime guaranteed — if breached, 10% service credit |
| Error Budget | How much unreliability you can "spend" (100% - SLO) | 0.1% error budget = 43.8 min/month downtime allowed |
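The error-budget arithmetic in the last row is easy to check. A sketch, assuming an average month of 43,800 minutes (a 365-day year divided by 12):

```python
def error_budget_minutes(slo_percent, period_minutes=43_800):
    """Error budget = the (100% - SLO) slice of the period.
    Default period: an average month (43,800 minutes)."""
    return (100 - slo_percent) / 100 * period_minutes

print(round(error_budget_minutes(99.9), 1))   # ≈ 43.8 min/month
print(round(error_budget_minutes(99.99), 1))  # ≈ 4.4 min/month
```

Note how each extra "nine" divides the budget by ten — which is why a 99.99% target demands far more engineering investment than 99.9%.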
```
# Common SLO targets by service tier

## Tier 1: Core business (payments, auth)
Availability: 99.99%   # 52 min/year downtime allowed
Latency p99:  <500ms

## Tier 2: Important (main product features)
Availability: 99.9%    # 8.7 hours/year downtime allowed
Latency p99:  <1000ms

## Tier 3: Non-critical (analytics, reports)
Availability: 99.5%    # 43.8 hours/year downtime allowed
Latency p99:  <3000ms
```
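A target alone pages nobody; teams typically alert on burn rate, i.e. how fast the error budget is being consumed relative to the SLO window. A sketch of the arithmetic (14.4 is the commonly cited fast-burn paging threshold for a 30-day window):

```python
def burn_rate(observed_error_ratio, slo_percent):
    """Budget consumption speed. 1.0 means the budget lasts exactly
    the SLO window; 14.4 sustained for an hour burns a 30-day
    budget's worth of errors in about two days."""
    budget_ratio = 1 - slo_percent / 100
    return observed_error_ratio / budget_ratio

# A Tier 2 service (99.9% SLO) currently failing 1.44% of requests:
print(round(burn_rate(0.0144, 99.9), 1))  # → 14.4, page immediately
```

Alerting on burn rate rather than raw error rate keeps the paging threshold proportional to the tier's SLO: the same code serves all three tiers above.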
Incidents will happen. The best teams aren't those who never have incidents — they're the ones who respond fast, communicate clearly, and learn from every outage.
```
# INCIDENT RUNBOOK: High Error Rate on /api/expenses
Triggered by: Error rate > 5% for 5 minutes (Prometheus alert)

1. DETECT
   □ Check Grafana dashboard: grafana.ganaat.app
   □ Check error logs: logs.ganaat.app → filter by ERROR level
   □ Check recent deploys: was anything deployed in last 2 hours?

2. MITIGATE (stop the bleeding first)
   □ If recent deploy: roll back immediately
     → docker compose up -d --build  (redeploy previous image)
   □ If DB issue: check connections
     → docker exec ganaat-db psql -c "SELECT count(*) FROM pg_stat_activity"
   □ If memory: restart container → docker restart ganaat-backend

3. COMMUNICATE
   □ Post to #incidents Slack channel every 30 min
   □ Update status page (status.ganaat.app)
   □ Template: "We are investigating elevated error rates on [feature].
     Impact: [X% of users]. ETA: [time]. Updates every 30 min."

4. RESOLVE & POSTMORTEM
   □ Root cause identified and fixed
   □ Write postmortem within 48 hours (blameless)
   □ Action items assigned with owners and deadlines
```
Deliberately inject failures into your system to find weaknesses before production does. "If it hurts, do it more often" — the philosophy behind resilience engineering.
Netflix's Chaos Monkey randomly terminates EC2 instances in production. If killing a single instance takes down the service, the system is not resilient.
Add artificial latency or drop packets between services. Does your circuit breaker trigger? Do timeouts work? What happens when the DB is unreachable?
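One pattern such an experiment should exercise is the circuit breaker: after repeated timeouts, fail fast instead of stacking up blocked calls. A minimal sketch (real deployments use a resilience library or a service mesh rather than hand-rolling this):

```python
class CircuitBreaker:
    """Illustrative sketch: after `threshold` consecutive failures the
    circuit opens and subsequent calls fail immediately, protecting
    callers from piling up on an unreachable dependency."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def call(self, fn):
        if self.failures >= self.threshold:
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
            self.failures = 0  # a success closes the circuit again
            return result
        except Exception:
            self.failures += 1
            raise

breaker = CircuitBreaker(threshold=3)

def flaky():
    raise TimeoutError("upstream unreachable")

for _ in range(3):          # three timeouts trip the breaker...
    try:
        breaker.call(flaky)
    except TimeoutError:
        pass

try:                        # ...so the fourth call never hits the network
    breaker.call(flaky)
except RuntimeError as e:
    print(e)  # circuit open: failing fast
```

Production breakers also add a half-open state that periodically lets one probe request through to detect recovery; that is omitted here for brevity.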
Fill disk, consume all CPU, max out memory. Does the app crash gracefully? Do health checks fail and trigger restarts? Does it alert before users notice?