⚙️

Production Operations

Keep systems running, detect issues before users do, and respond to incidents with confidence. The SRE playbook for modern engineering teams.

The 3 Pillars of Observability

Observability is the ability to understand the internal state of a system from its external outputs. You can't fix what you can't see. The three pillars give you complete visibility into your production systems.

📊 Metrics

Numeric measurements over time. Aggregated, efficient to store. Perfect for dashboards, alerting, and capacity planning.

RED Method (for services):
Rate — requests per second
Errors — error rate percentage
Duration — request latency (p50, p95, p99)

USE Method (for resources):
Utilization — % time resource is busy
Saturation — queue depth
Errors — error count

Tools: Prometheus, Grafana, Datadog
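The RED numbers fall out of raw request records with a few lines of arithmetic. A minimal sketch in plain Python (no Prometheus client; the sample data and 10-second window are made up for illustration):

```python
from statistics import quantiles

def red_metrics(records, window_seconds):
    """Compute RED metrics from (latency_ms, ok) request records."""
    rate = len(records) / window_seconds                      # Rate: requests/s
    err_pct = 100 * sum(1 for _, ok in records if not ok) / len(records)  # Errors: %
    cuts = quantiles(sorted(ms for ms, _ in records), n=100)  # Duration: percentile cut points
    return rate, err_pct, cuts[49], cuts[94], cuts[98]        # p50, p95, p99

# Made-up 10-second window of (latency_ms, success) samples
records = [(12, True), (85, True), (430, False), (23, True), (95, True),
           (310, False), (44, True), (18, True), (260, True), (71, True)]
rate, err, p50, p95, p99 = red_metrics(records, window_seconds=10)
print(f"rate={rate:.1f}/s errors={err:.0f}% p50={p50:.0f}ms p95={p95:.0f}ms p99={p99:.0f}ms")
```

In production you would record these as Prometheus histograms and let the server compute quantiles; the point here is only that Rate, Errors, and Duration are all derivable from the same request stream.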

📝 Logs

Timestamped, structured records of events. The "why" behind the numbers. Essential for debugging production issues.

// ✅ Structured JSON logging
{
  "level": "error",
  "msg": "Payment failed",
  "userId": "u_abc123",
  "amount": 99.99,
  "error": "card_declined",
  "traceId": "abc-xyz-789",
  "timestamp": "2025-03-16T10:30:00Z"
}
// Tools: ELK Stack, Loki, Datadog Logs
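Logs in this shape can be produced with Python's stdlib `logging` module and a small JSON formatter. A sketch (the `extra_fields` attribute name is our own convention, not a stdlib one):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line."""
    def format(self, record):
        entry = {
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
        }
        entry.update(getattr(record, "extra_fields", {}))  # merge structured context
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("payments")
log.addHandler(handler)

# `extra` attaches attributes to the record; we nest context under one key
log.error("Payment failed", extra={"extra_fields": {
    "userId": "u_abc123", "error": "card_declined", "traceId": "abc-xyz-789"}})
```

One JSON object per line is what log pipelines like Loki and the ELK stack expect, which is why the formatter avoids multi-line output.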

🔍 Traces

Follow a request as it flows through multiple services. Essential for microservices debugging — see exactly where latency comes from.

POST /api/checkout [120ms total]
├─ auth-service [8ms]
├─ inventory-service [15ms]
├─ payment-service [85ms] ⚠️ slow
│  └─ stripe-api [82ms] ← bottleneck
└─ notification-service [12ms]

Tools: Jaeger, Zipkin, Tempo, Datadog APM
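A real tracer (OpenTelemetry, a Jaeger client) propagates context across processes; the core idea can be shown in-process with a toy span recorder. A purely illustrative sketch, not any tracing library's API:

```python
import time
from contextlib import contextmanager

spans = []   # (name, depth, duration_ms), appended as each span finishes
_depth = 0

@contextmanager
def span(name):
    """Time a named operation, tracking nesting depth like a trace span."""
    global _depth
    start = time.perf_counter()
    _depth += 1
    try:
        yield
    finally:
        _depth -= 1
        spans.append((name, _depth, (time.perf_counter() - start) * 1000))

with span("POST /api/checkout"):
    with span("auth-service"):
        time.sleep(0.008)
    with span("payment-service"):
        with span("stripe-api"):   # nested call: the bottleneck shows up here
            time.sleep(0.082)

for name, depth, ms in spans:    # children complete (and appear) before parents
    print("  " * depth + f"{name} [{ms:.0f}ms]")
```

Real tracers emit spans in this same completion order and reassemble the tree afterwards from trace and parent IDs.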

SLI / SLO / SLA

Reliability targets are not optional — they define your team's promises to users and guide engineering priorities. Understanding the difference between SLI, SLO, and SLA is fundamental to SRE.

SLI — Service Level Indicator
The actual metric you measure. Example: 99.2% of requests responded in <200ms over the last 30 days.

SLO — Service Level Objective
Your internal target for the SLI. Example: 99.5% of requests must respond in <200ms (internal goal).

SLA — Service Level Agreement
The contractual promise to customers, with penalties. Example: 99.9% uptime guaranteed; if breached, 10% service credit.

Error Budget
How much unreliability you can "spend" (100% − SLO). Example: a 0.1% error budget allows 43.8 min/month of downtime.
✅ Google SRE's Error Budget Policy
When a team's error budget is exhausted, new feature development freezes until reliability is restored. This creates a productive tension: move fast (burn budget) vs. stay reliable (save budget). The SRE team at Google doesn't just monitor — they own reliability like developers own features.
# Common SLO targets by service tier

## Tier 1: Core business (payments, auth)
Availability: 99.99%  # 52 min/year downtime allowed
Latency p99:  <500ms

## Tier 2: Important (main product features)
Availability: 99.9%   # 8.7 hours/year downtime allowed
Latency p99:  <1000ms

## Tier 3: Non-critical (analytics, reports)
Availability: 99.5%   # 43.8 hours/year downtime allowed
Latency p99:  <3000ms
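The downtime figures in the comments above follow directly from the SLO percentages. A quick check of the arithmetic (525,600 minutes per year, months taken as 1/12 of a year):

```python
MIN_PER_YEAR = 365 * 24 * 60   # 525,600 minutes

def downtime_budget(slo_pct):
    """Allowed downtime (minutes per year, per average month) for an availability SLO."""
    budget = 1 - slo_pct / 100            # fraction of time you may be down
    per_year = budget * MIN_PER_YEAR
    return per_year, per_year / 12        # average month = 43,800 minutes

for slo in (99.99, 99.9, 99.5):
    per_year, per_month = downtime_budget(slo)
    print(f"{slo}% -> {per_year:.1f} min/year, {per_month:.1f} min/month")
```

This is also where the 0.1% error budget's 43.8 min/month comes from: a 99.9% SLO leaves 525.6 minutes per year, or 43.8 per month.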

Incident Response

Incidents will happen. The best teams aren't those who never have incidents — they're the ones who respond fast, communicate clearly, and learn from every outage.

Severity Levels

P1 — SEV1
Critical — Production Down
Complete outage affecting all users. Revenue impact. Response: 15 min. Wake up on-call engineer immediately. Notify all stakeholders. War room opened.
P2 — SEV2
Major — Core Feature Degraded
Significant feature broken, major performance degradation, partial data loss. Response: 30 min. On-call notified, team lead notified.
P3 — SEV3
Minor — Non-critical Issue
Minor feature broken, workaround available, small % of users affected. Response: 2 hours. Create ticket, fix in next sprint.

Incident Runbook Template

# INCIDENT RUNBOOK: High Error Rate on /api/expenses

Triggered by: Error rate > 5% for 5 minutes (Prometheus alert)

1. DETECT
  □ Check Grafana dashboard: grafana.ganaat.app
  □ Check error logs: logs.ganaat.app → filter by ERROR level
  □ Check recent deploys: was anything deployed in last 2 hours?

2. MITIGATE (stop the bleeding first)
  □ If recent deploy: roll back immediately
     → git checkout <last-good-tag> && docker compose up -d --build  (rebuild from the previous release; --build alone would just rebuild the current, broken code)
  □ If DB issue: check connections
     → docker exec ganaat-db psql -U postgres -c "SELECT count(*) FROM pg_stat_activity"
  □ If memory: restart container
     → docker restart ganaat-backend

3. COMMUNICATE
  □ Post to #incidents Slack channel every 30 min
  □ Update status page (status.ganaat.app)
  □ Template: "We are investigating elevated error rates on [feature].
    Impact: [X% of users]. ETA: [time]. Updates every 30 min."

4. RESOLVE & POSTMORTEM
  □ Root cause identified and fixed
  □ Write postmortem within 48 hours (blameless)
  □ Action items assigned with owners and deadlines
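The alert that triggers this runbook ("error rate > 5% for 5 minutes") uses the standard for-duration pattern: the condition must hold continuously before paging, so a single noisy scrape doesn't wake anyone. In Prometheus this is the `for:` field of an alerting rule; a toy version of the logic:

```python
def should_fire(samples, threshold_pct=5.0, for_minutes=5):
    """Fire only if every sample in the trailing window breaches the threshold.

    samples: error-rate percentages, one per minute, oldest first.
    """
    if len(samples) < for_minutes:
        return False
    return all(s > threshold_pct for s in samples[-for_minutes:])

print(should_fire([2, 9, 3, 8, 2]))       # flapping, never 5 min straight -> False
print(should_fire([3, 6, 7, 8, 9, 6]))    # sustained breach -> True, page on-call
```

The trade-off is detection delay: a longer `for` window means fewer false pages but slower response, which is why P1-tier alerts often use shorter windows than the rest.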

💡 Blameless Postmortems
Google's SRE book introduced the concept of blameless postmortems. The goal is to understand the systemic failures that allowed the incident, not to find a person to blame. Ask "why did the system allow this to happen?" not "who made this mistake?" Teams with a blameless culture report incidents 3× faster because engineers aren't afraid of consequences.

Chaos Engineering

Deliberately inject failures into your system to find weaknesses before production does. "If it hurts, do it more often" — the philosophy behind resilience engineering.

💀

Kill random instances

Netflix's Chaos Monkey randomly terminates EC2 instances in production. If killing a single instance takes down the service, the system is not resilient.

🌐

Network partition simulation

Add artificial latency or drop packets between services. Does your circuit breaker trigger? Do timeouts work? What happens when the DB is unreachable?

💾

Resource exhaustion

Fill disk, consume all CPU, max out memory. Does the app crash gracefully? Do health checks fail and trigger restarts? Does it alert before users notice?

⚠️ Start in Staging, Not Production
Never run chaos experiments in production without: (1) a way to stop immediately, (2) monitoring in place, (3) a small blast radius. Netflix only let Chaos Monkey loose in production after building the redundancy and automation to survive instance loss. Start small: kill one instance of your least critical service during business hours, when your whole team is watching.
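A latency-injection experiment with a deliberately small blast radius can be as simple as a wrapper that delays a configurable fraction of calls. A sketch (names and parameters are illustrative, not from any chaos toolkit):

```python
import random
import time

def chaos_latency(fn, probability=0.05, delay_ms=100, rng=random.random):
    """Wrap a call so a small fraction of invocations get artificial latency.

    Small blast radius: low probability, one call path, trivially removable.
    """
    def wrapped(*args, **kwargs):
        if rng() < probability:           # only a slice of traffic is affected
            time.sleep(delay_ms / 1000)   # inject the delay
        return fn(*args, **kwargs)
    return wrapped

# Wrap one non-critical call and watch whether timeouts and p99 alerts behave.
fetch_report = chaos_latency(lambda: "ok", probability=0.05, delay_ms=100)
```

Injecting the `rng` makes the experiment deterministic in tests, and deleting one line of wiring turns the chaos off, which satisfies the "stop immediately" requirement above.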