Post-Mortem Process Setup for Incident Analysis
A post-mortem is a structured analysis of an incident after it has been resolved. The goal is to understand why it happened and to prevent recurrence. Without post-mortems, the same incidents repeat again and again because nobody addresses the root causes.
Blameless Post-Mortem Principles
Blameless means the point is not to find guilty parties. If an engineer made a mistake, the cause lies in a system that allowed the mistake without safeguards. The right question is "Why did the system allow this?", not "Who pressed the wrong button?".
A blame culture leads people to hide incidents and deny mistakes, and that is worse than any single incident.
When to Conduct Post-Mortem
- SEV1 incidents: always, within 48 hours
- SEV2 incidents: always, within 72 hours
- SEV3: at the team's discretion, if the incident revealed a systemic issue
- Recurring SEV4: worth conducting once the same symptom appears for the third time
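The policy above is simple enough to encode as a helper, e.g. for an incident bot that nags about missing post-mortems. This is an illustrative sketch; the function name and arguments are assumptions, not part of any described tooling.

```python
def requires_postmortem(severity: str, occurrences: int = 1,
                        systemic: bool = False) -> bool:
    """Decide whether an incident needs a post-mortem, per the policy above.

    severity:    "SEV1".."SEV4"
    occurrences: how many times this symptom has recurred
    systemic:    the team's judgment that a systemic issue surfaced
    """
    if severity in ("SEV1", "SEV2"):
        return True                 # always (within 48h / 72h respectively)
    if severity == "SEV3":
        return systemic             # by team decision
    if severity == "SEV4":
        return occurrences >= 3     # same symptom for the third time
    return False
```

For example, `requires_postmortem("SEV4", occurrences=3)` returns `True`, while a one-off SEV4 does not trigger the process.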
Post-Mortem Document Structure
# Post-Mortem: [Brief Incident Description]
**Date:** 2025-11-15
**Severity:** SEV1
**Duration:** 47 minutes (14:23 - 15:10 UTC)
**Impact:** ~12,000 users couldn't complete payment
## Timeline
| Time | Event |
|---|---|
| 14:23 | PagerDuty alert: error rate > 5% on payment service |
| 14:28 | Engineer acknowledged alert, started investigation |
| 14:35 | Found: DB not accepting new connections |
| 14:42 | Root cause identified: connection pool exhausted |
| 14:55 | Temporary fix applied: restart connection pool manager |
| 15:10 | Service restored, errors gone |
## Root Cause
The connection pool (pgBouncer) was configured with max_client_conn=100.
After the new app version was deployed, the worker count increased from 4 to 8,
and each worker opens up to 15 connections. Peak load: 8 * 15 = 120 > 100.
## What Went Wrong
- pgBouncer max_client_conn was not reviewed when the worker count changed
- No alert for approaching the connection pool limit (the problem would have been visible earlier)
- The deployment checklist was missing a DB connection settings check
## What Went Well
- The alert detected the problem 2 minutes after it started
- The runbook for DB connection issues helped localize the cause quickly
- The team responded within SLA
## Action Items
| Task | Owner | Deadline |
|---|---|---|
| Increase max_client_conn to 300, recalculate for current architecture | @db-team | 2 days |
| Add pgbouncer_clients_active metric and alert > 80% | @ops-team | 3 days |
| Update deployment checklist: DB connections section | @team-lead | 1 week |
| Document connection pool calculation formula | @db-team | 1 week |
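The arithmetic from the Root Cause section above (8 workers * 15 connections = 120 > max_client_conn=100) is easy to turn into a pre-deploy sanity check. A minimal sketch, assuming the deploy pipeline knows the worker count and the per-worker connection limit:

```python
def pool_headroom(workers: int, conns_per_worker: int,
                  max_client_conn: int) -> int:
    """Remaining pgBouncer client-connection headroom at peak load.

    Negative headroom means the pool can be exhausted, as in the incident.
    """
    peak = workers * conns_per_worker
    return max_client_conn - peak

# Before the deploy: 4 workers * 15 conns = 60, fits within 100.
assert pool_headroom(4, 15, 100) == 40
# After the deploy: 8 * 15 = 120 exceeds 100, the incident's root cause.
assert pool_headroom(8, 15, 100) == -20
```

A check like this belongs in the deployment checklist item from the action items: fail the deploy if headroom drops below a safety margin.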
Conducting Post-Mortem Meeting
Participants: everyone who responded to the incident, plus the tech lead and, if needed, the product owner.
Duration: 60-90 minutes. If more time is needed, the problem is in how the document was prepared.
Meeting Structure:
- Timeline review (10 min): everyone should leave with the same understanding of the facts
- Root cause analysis (20-30 min): the 5 Whys technique
- What could have been done better (15 min)
- What went well (5 min): important for morale
- Formulate action items (15 min): each with an owner and a deadline
5 Whys Technique for Root Cause Analysis
Why did the service crash?
→ The connection pool was exhausted
Why was the connection pool exhausted?
→ The connection count exceeded max_client_conn
Why did the connection count exceed the limit?
→ The worker count was increased during a deploy
Why wasn't the worker count increase reflected in the pgBouncer config?
→ There is no process for checking DB config when scaling changes
Why is there no such process?
→ The deployment checklist doesn't cover config dependencies
Root cause: the absence of a process for checking config dependencies during deploys. The action item should target exactly this.
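If post-mortems live in a structured store rather than free-form pages, the 5 Whys chain itself can be recorded as data, so the last answer in the chain is the root cause by construction. A sketch; the structure is an assumption, not a prescribed schema:

```python
# Each entry is a (question, answer) pair from the chain above.
five_whys = [
    ("Why did the service crash?",
     "Connection pool exhausted"),
    ("Why was the connection pool exhausted?",
     "Connection count exceeded max_client_conn"),
    ("Why did the connection count exceed the limit?",
     "Worker count increased during deploy"),
    ("Why wasn't the increase reflected in the pgBouncer config?",
     "No process for checking DB config when scaling changes"),
    ("Why is there no such process?",
     "Deployment checklist doesn't cover config dependencies"),
]

# The final answer is the systemic root cause the action item should target.
root_cause = five_whys[-1][1]
```

Keeping the chain machine-readable also makes the quarterly analytics below easier: the root cause field is always the deepest "why", not whichever sentence the author happened to bold.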
Post-Mortem Storage and Analytics
Documents are stored in Confluence/Notion with tags: severity, service, cause category. Cause categories:
- Configuration (wrong config)
- Deployment (deploy issue)
- Dependency failure (third-party service)
- Capacity (resource shortage)
- Human error (mistaken action)
Quarterly review: look at which categories dominate to decide where to invest reliability effort.
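With cause-category tags in place, the quarterly review reduces to a tally. A sketch with made-up sample data (the records below are illustrative, not real incidents):

```python
from collections import Counter

# Illustrative sample of tagged post-mortems from one quarter.
postmortems = [
    {"severity": "SEV1", "service": "payments", "category": "Configuration"},
    {"severity": "SEV2", "service": "payments", "category": "Configuration"},
    {"severity": "SEV2", "service": "search",   "category": "Capacity"},
    {"severity": "SEV3", "service": "api",      "category": "Deployment"},
]

by_category = Counter(p["category"] for p in postmortems)
dominant, count = by_category.most_common(1)[0]
print(f"Invest in reliability around: {dominant} ({count} incidents)")
# prints "Invest in reliability around: Configuration (2 incidents)"
```

The same Counter can be keyed by `service` to find the most incident-prone component instead of the most common cause.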
Action Items: Avoiding Task Graveyard
A post-mortem is useless if its action items go undone. Mandatory conditions:
- A specific owner (a named person, not "the team")
- A clear deadline
- A Jira/Linear ticket created right at the meeting
- A completion review at the next post-mortem meeting
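These conditions can be checked mechanically before an action item is accepted at the meeting. A minimal validator sketch; the field names (`owner`, `deadline`, `ticket`) are assumptions for illustration, not a prescribed schema:

```python
def is_actionable(item: dict) -> bool:
    """Check an action item against the mandatory conditions above."""
    owner = item.get("owner", "")
    # A specific named owner, not a generic "team" placeholder.
    named_owner = owner.startswith("@") and owner.lower() != "@team"
    return (named_owner
            and bool(item.get("deadline"))   # clear deadline
            and bool(item.get("ticket")))    # Jira/Linear ticket exists

good = {"owner": "@alice", "deadline": "2025-11-20", "ticket": "OPS-123"}
bad = {"owner": "team", "deadline": "", "ticket": None}
assert is_actionable(good) and not is_actionable(bad)
```

A hook like this in the tracker (rejecting ticket creation for incomplete items) removes the "task graveyard" failure mode at the source.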
Process Implementation Timeframes
- Document template + storage — 1 day
- Team training + first test post-mortem — 1-2 days
- Integration with incident tracker — 1 day
