A practical playbook for detecting, responding to, and learning from production incidents — before they affect millions of users.
API response time degrades from 120ms to 8s during peak traffic. Users experience timeouts and abandoned requests.
Payment service returns 503 errors for 12 minutes. Checkout flow completely unavailable — zero transactions processed.
A bad deployment corrupts user account balances. Silent but critical data integrity failure.
P1: Complete service failure. All users affected. Revenue at immediate risk. Wake the on-call — war room now.
P2: Significant feature broken. A workaround may exist. Engineering lead and on-call notified immediately.
P3: Non-critical feature affected, workaround available, small percentage of users impacted. Fix in the current sprint.
P4: Minimal impact, cosmetic issue, or affects <0.1% of users. Log as a ticket, prioritize in the next sprint.
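As a minimal sketch, the triage rules above can be encoded so incident tooling suggests a severity automatically. The thresholds and field names here are illustrative assumptions, and the P3/P4 labels extrapolate from the P1/P2 labels used later in this playbook:

```python
from dataclasses import dataclass

@dataclass
class Impact:
    """Snapshot of incident impact; field names are illustrative."""
    service_down: bool             # complete outage
    revenue_blocked: bool          # e.g. checkout/payments unavailable
    workaround_exists: bool
    affected_user_fraction: float  # 0.0 .. 1.0

def suggest_severity(impact: Impact) -> str:
    """Map an impact snapshot to P1..P4, mirroring the rules above."""
    if impact.service_down or impact.revenue_blocked:
        return "P1"  # war room now
    if not impact.workaround_exists:
        return "P2"  # notify engineering lead and on-call
    if impact.affected_user_fraction >= 0.001:
        return "P3"  # fix in current sprint
    return "P4"      # ticket for next sprint

print(suggest_severity(Impact(False, True, False, 0.3)))  # P1
```

A human still declares the final severity; the function only removes guesswork under pressure.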
Alert fires or user report triggers the incident process
Assign commander, open war room, assess impact
Temporary fix — rollback, flag off, scale up
Root cause fixed, service restored, stakeholders notified
Blameless postmortem, action items, prevent recurrence
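The five steps above form a linear state machine, and a tracking tool can enforce that no stage is skipped. A minimal sketch — the stage names are labels inferred from the descriptions above, not fixed terminology:

```python
from enum import Enum

class Stage(Enum):
    # Illustrative labels for the five steps described above.
    DETECT = 1    # alert fires or user report
    TRIAGE = 2    # assign commander, assess impact
    MITIGATE = 3  # rollback, flag off, scale up
    RESOLVE = 4   # root cause fixed, service restored
    LEARN = 5     # blameless postmortem, action items

class Incident:
    """Tracks an incident's progress and forbids skipping stages."""
    def __init__(self) -> None:
        self.stage = Stage.DETECT

    def advance(self) -> Stage:
        if self.stage is Stage.LEARN:
            raise ValueError("incident already closed out")
        self.stage = Stage(self.stage.value + 1)
        return self.stage

inc = Incident()
inc.advance()  # TRIAGE
inc.advance()  # MITIGATE
```

Forcing strictly ordered transitions is what keeps a team from jumping to a root-cause fix before mitigation has stopped the bleeding.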
Acknowledge the alert to stop the escalation timer. Declare severity. Open a dedicated #incident-YYYY-MM-DD channel.
One person coordinates — not fixes. Delegates investigation to subject matter experts while maintaining the big picture.
How many users affected? Which features? Is revenue impacted? Which regions? This determines severity and escalation path.
Post a first update within 15 minutes, even without a resolution. Silence is the worst response. Update every 30 minutes after that.
All updates, findings, commands run, and decisions documented here — becomes the authoritative incident timeline.
Commander coordinates · Tech Lead investigates · Comms updates stakeholders
P1: every 15 min · P2: every 30 min · Post when you try something, even if it fails.
Update within 10 min of declaring the incident. Proactive communication prevents support ticket storms.
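The cadence above can be wired into an incident bot so updates are never forgotten. A minimal sketch — the P1/P2 intervals come from this playbook, while the slower P3/P4 cadence is an assumption:

```python
from datetime import datetime, timedelta

# Update intervals per severity; P1/P2 per the playbook, P3/P4 assumed.
UPDATE_INTERVAL = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(minutes=30),
    "P3": timedelta(hours=1),
    "P4": timedelta(hours=1),
}

def next_update_due(severity: str, last_update: datetime) -> datetime:
    """When the next stakeholder update must be posted."""
    return last_update + UPDATE_INTERVAL[severity]

due = next_update_due("P1", datetime(2024, 5, 1, 3, 0))
print(due)  # 2024-05-01 03:15:00
```

A bot that pings the comms lead when the deadline passes turns the cadence from a convention into a guarantee.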
Ask 'why' five times, from the symptom to the systemic root cause. The answer is never 'human error' — it is always a process or system gap.
Root cause is rarely one thing — it's technical failure + process gap + monitoring blind spot. Document all factors.
Add alert for DB connection pool saturation (>80%)
Implement circuit breaker on payment service client
Add canary deployment with auto rollback on error spike
Load test payment flow with 2× peak traffic in staging
Update the runbook with any new mitigation steps discovered
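The circuit-breaker action item above can start very small. A minimal sketch, with an illustrative failure threshold and reset timeout: after too many consecutive errors the breaker opens, and every call fails fast with a fallback (e.g. cached data) instead of hitting the broken downstream.

```python
import time

class CircuitBreaker:
    """Opens after `max_failures` consecutive errors; while open,
    calls fail fast instead of hitting the broken downstream."""

    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds before a retry probe
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, fallback=None):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback          # open: fail fast
            self.opened_at = None        # half-open: allow one probe
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback
        self.failures = 0                # success resets the counter
        return result
```

While the breaker is open, every request returns the fallback immediately, which is what keeps request threads from piling up behind a dead dependency.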
Observe Everything: Instrument everything. Metrics, logs, traces. Alert on SLO burn rate — symptoms, not causes. Tune alerts to eliminate alert fatigue.
Scale Proactively: Horizontal pod autoscaling on CPU and custom metrics. Pre-scale before predicted traffic spikes (sales, launches).
Fail Fast: Stop cascading failures. When a downstream dependency fails, open the circuit: fail fast and return cached data. This prevents thread pool exhaustion.
Test Limits: Run k6 / Artillery against staging. Simulate 2× peak traffic. Find your breaking point before your users do.
Always Revertible: Every deploy must have a one-command rollback. Blue/green deployments, feature flags as kill switches. Test rollbacks in staging.
No Surprises: Document every known failure mode. Step-by-step response guides. At 3am under pressure, no one should be guessing.
High-load systems will fail — that's not the question. The question is: how fast will you detect failure, how quickly can you respond, and what will you learn afterward? The best teams aren't those who never have incidents — they're the ones with the fastest MTTR and the discipline to act on every lesson.