Post-Mortem Process Setup for Incident Analysis

A post-mortem is a structured analysis of an incident conducted after it has been resolved. The goal is to understand why the incident happened and to prevent it from recurring. Without post-mortems, the same incidents happen over and over because nobody addresses the root causes.

Blameless Post-Mortem Principles

Blameless means the analysis is not about finding someone to blame. If an engineer made a mistake, the cause lies in the system that allowed the mistake to happen without safeguards. The right question is "Why did the system allow this?", not "Who pressed the wrong button?".

A blame culture leads to people hiding incidents and not admitting errors, which is worse than the incident itself.

When to Conduct Post-Mortem

  • SEV1 incidents — always, within 48 hours
  • SEV2 incidents — always, within 72 hours
  • SEV3 — by team decision if incident revealed systemic issue
  • Recurring SEV4 — worth conducting if the same symptom appears for the third time

Post-Mortem Document Structure

# Post-Mortem: [Brief Incident Description]

**Date:** 2025-11-15
**Severity:** SEV1
**Duration:** 47 minutes (14:23 - 15:10 UTC)
**Impact:** ~12,000 users couldn't complete payment

## Timeline

| Time | Event |
|---|---|
| 14:23 | PagerDuty alert: error rate > 5% on payment service |
| 14:28 | Engineer acknowledged alert, started investigation |
| 14:35 | Found: DB not accepting new connections |
| 14:42 | Root cause identified: connection pool exhausted |
| 14:55 | Temporary fix applied: restart connection pool manager |
| 15:10 | Service restored, errors gone |

## Root Cause

The connection pool (pgBouncer) was configured with max_client_conn=100.
After the new app version was deployed, the worker count increased from 4 to 8,
and each worker opens up to 15 connections. Peak load: 8 * 15 = 120 > 100.

## What Went Wrong

- pgBouncer max_client_conn was not reviewed when the worker count changed
- No alert for approaching the connection pool limit (it would have made the problem visible earlier)
- The deployment checklist was missing a check of DB connection settings

## What Went Well

- Alert detected problem 2 minutes after start
- DB connection issue runbook helped quickly localize cause
- Team responded within SLA

## Action Items

| Task | Owner | Deadline |
|---|---|---|
| Increase max_client_conn to 300, recalculate for current architecture | @db-team | 2 days |
| Add pgbouncer_clients_active metric and alert > 80% | @ops-team | 3 days |
| Update deployment checklist: DB connections section | @team-lead | 1 week |
| Document connection pool calculation formula | @db-team | 1 week |
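The capacity arithmetic from the Root Cause section (workers times connections per worker versus max_client_conn) can be checked automatically before a deploy. A sketch under the numbers from this incident; the function name and the 80% warning threshold (matching the alert in the action items) are assumptions:

```python
def check_pool_capacity(workers: int, conns_per_worker: int,
                        max_client_conn: int, alert_ratio: float = 0.8) -> str:
    """Compare peak client connections against the pgBouncer limit."""
    peak = workers * conns_per_worker
    if peak > max_client_conn:
        return f"FAIL: peak {peak} exceeds max_client_conn={max_client_conn}"
    if peak > alert_ratio * max_client_conn:
        return f"WARN: peak {peak} is above {alert_ratio:.0%} of the limit"
    return f"OK: peak {peak} fits within max_client_conn={max_client_conn}"

# The incident configuration: 8 workers * 15 connections = 120 > 100.
print(check_pool_capacity(workers=8, conns_per_worker=15, max_client_conn=100))
# After the action item raises the limit to 300:
print(check_pool_capacity(workers=8, conns_per_worker=15, max_client_conn=300))
```

Running a check like this in CI would have turned this SEV1 into a failed pipeline.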

Conducting Post-Mortem Meeting

Participants: everyone who responded to the incident, plus the tech lead and, if needed, the product owner.

Duration: 60-90 minutes. If more is needed, the document was not prepared well enough.

Meeting Structure:

  1. Timeline review (10 min) — everyone should have same understanding of facts
  2. Root cause analysis (20-30 min) — 5 Whys technique
  3. What could be done better (15 min)
  4. What went well (5 min) — important for morale
  5. Form action items (15 min) — with owners and deadlines
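A quick sanity check that a proposed agenda stays inside the 60-90 minute budget. A sketch; the items and time boxes mirror the structure above:

```python
# Agenda items and their time boxes in minutes, per the meeting structure above.
AGENDA = [
    ("Timeline review", 10),
    ("Root cause analysis (5 Whys)", 30),
    ("What could be done better", 15),
    ("What went well", 5),
    ("Form action items", 15),
]

total = sum(minutes for _, minutes in AGENDA)
assert 60 <= total <= 90, f"agenda of {total} min is outside the 60-90 min budget"
print(f"Agenda total: {total} minutes")  # Agenda total: 75 minutes
```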

The 5 Whys Technique for Root Cause

Why did service crash?
→ Connection pool exhausted

Why was connection pool exhausted?
→ Connection count exceeded max_client_conn

Why did connection count exceed limit?
→ Worker count increased during deploy

Why wasn't worker count increase reflected in pgBouncer config?
→ No process for checking DB config when scaling changes

Why is there no such process?
→ Deployment checklist doesn't cover config dependencies

Root cause: the absence of a process for checking config dependencies during deploys. The action item targets exactly this.
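The chain above can be recorded as ordered data in the post-mortem document, so the final answer, not an intermediate symptom, is what gets labeled the root cause. A minimal sketch using the questions from this incident:

```python
# The 5 Whys chain from the incident, in order. The last answer is the
# root cause; everything before it is a symptom or contributing factor.
five_whys = [
    ("Why did the service crash?",
     "Connection pool exhausted"),
    ("Why was the connection pool exhausted?",
     "Connection count exceeded max_client_conn"),
    ("Why did the connection count exceed the limit?",
     "Worker count increased during deploy"),
    ("Why wasn't the increase reflected in the pgBouncer config?",
     "No process for checking DB config when scaling changes"),
    ("Why is there no such process?",
     "Deployment checklist doesn't cover config dependencies"),
]

root_cause = five_whys[-1][1]
print(f"Root cause: {root_cause}")
```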

Post-Mortem Storage and Analytics

Documents are stored in Confluence/Notion with tags for severity, service, and cause category. The cause categories:

  • Configuration (wrong config)
  • Deployment (deploy issue)
  • Dependency failure (third-party service)
  • Capacity (resource shortage)
  • Human error (mistaken action)

A quarterly review shows which categories dominate, and therefore where to invest in reliability.

Action Items: Avoiding Task Graveyard

A post-mortem is useless if its action items go undone. Mandatory conditions:

  • Specific owner (a named person, not "the team")
  • Clear deadline
  • Jira/Linear ticket created immediately at meeting
  • Completion review at next post-mortem meeting
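The conditions above can be enforced mechanically when the ticket is filed. A sketch; the field names, the `@`-handle convention for owners, and the validation rules are assumptions modeled on the list above:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    task: str
    owner: str       # a specific person, e.g. "@alice", not a team
    deadline: date
    ticket_id: str   # Jira/Linear ticket created at the meeting

    def validate(self, as_of: date) -> list[str]:
        """Return violations of the mandatory conditions (empty if OK)."""
        problems = []
        if not self.owner.startswith("@") or "team" in self.owner.lower():
            problems.append("owner must be a named person, not a team")
        if self.deadline <= as_of:
            problems.append("deadline must be a future date")
        if not self.ticket_id:
            problems.append("ticket must be created immediately at the meeting")
        return problems

good = ActionItem("Update deployment checklist", "@alice",
                  date(2025, 11, 22), "OPS-142")
print(good.validate(as_of=date(2025, 11, 15)))  # []
```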

Process Implementation Timeframes

  • Document template + storage — 1 day
  • Team training + first test post-mortem — 1-2 days
  • Integration with incident tracker — 1 day