Implementing an Incident Management Process for a Web Application
Incident management is a process, not a tool. Tools (PagerDuty, OpsGenie, Jira) without a process are just sources of noise. A process without tools turns into chaos in chat messengers. Together they make the team's behavior during an outage predictable.
Incident Definition and Priorities
Not every error is an incident. An incident is an unplanned degradation of service quality that affects users.
| Severity | Criteria | RTO | Example |
|---|---|---|---|
| SEV1 | Service completely unavailable | 30 min | Site returns 503 for all users |
| SEV2 | Critical function degraded | 2 hours | Payment system not working |
| SEV3 | Non-critical function broken | 8 hours | Slow report loading |
| SEV4 | Minor issue | 24+ hours | Typo on page |
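
The severity matrix is easier to apply consistently when it also exists as data that bots and dashboards can read. The following is a minimal sketch, not a prescribed implementation: the names `Severity`, `SeverityPolicy`, `SEVERITY_MATRIX`, and `classify` are hypothetical, the three yes/no triage questions are a simplification of the criteria column, and the `immediate_escalation` flag assumes that only SEV1-2 pull in extra people right away.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class SeverityPolicy:
    criteria: str
    rto: timedelta              # target time to restore service
    immediate_escalation: bool  # assumption: only SEV1-2 escalate at once


# Values mirror the table above; SEV4 is "24+ hours", so 24h is a lower bound.
SEVERITY_MATRIX = {
    Severity.SEV1: SeverityPolicy("Service completely unavailable", timedelta(minutes=30), True),
    Severity.SEV2: SeverityPolicy("Critical function degraded", timedelta(hours=2), True),
    Severity.SEV3: SeverityPolicy("Non-critical function broken", timedelta(hours=8), False),
    Severity.SEV4: SeverityPolicy("Minor issue", timedelta(hours=24), False),
}


def classify(service_down: bool, critical_function_affected: bool,
             function_broken: bool) -> Severity:
    """Map three triage questions to a severity, most severe first."""
    if service_down:
        return Severity.SEV1
    if critical_function_affected:
        return Severity.SEV2
    if function_broken:
        return Severity.SEV3
    return Severity.SEV4
```

For example, `classify(False, True, False)` yields `Severity.SEV2`, and `SEVERITY_MATRIX[Severity.SEV2].rto` gives the two-hour target.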
Incident Roles
Incident Commander (IC). Coordinates the response and makes decisions, but does not debug code. One per incident.
Technical Lead. Directs investigation and resolution. A wide-reaching incident may have several.
Communications Lead. Updates the Status Page, answers questions from the business, and posts updates in the Slack incident channel.
Role separation is critical: one person cannot debug and answer the CEO's questions at the same time.
Incident Lifecycle
Detection → Triage → Escalation → Response → Resolution → Post-mortem
Detection: Monitoring raises an alert, and Alertmanager / PagerDuty routes it to the on-call engineer.
Triage (5-10 minutes): The on-call engineer assesses severity, creates an incident ticket, and opens a Slack channel named #incident-YYYY-MM-DD-brief-description.
Escalation: For SEV1-2, bring in an IC and an additional engineer immediately. The on-call rotation determines who is on duty.
Response: Work happens in the dedicated Slack channel, with updates every 20-30 minutes. All significant actions are logged in the incident thread (who, what, when).
Resolution: The service is restored, users are notified, and the incident is closed.
Post-mortem: Held within 48 hours of resolution.
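
The lifecycle can be modeled as a small state machine so that tooling rejects out-of-order steps and records who did what and when. This is only a sketch under stated assumptions: the `Stage`, `TRANSITIONS`, and `Incident` names are hypothetical, and letting triage go straight to response is an assumption based on escalation being required only for SEV1-2.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Stage(Enum):
    DETECTION = "detection"
    TRIAGE = "triage"
    ESCALATION = "escalation"
    RESPONSE = "response"
    RESOLUTION = "resolution"
    POST_MORTEM = "post-mortem"


# Allowed forward transitions. Assumption: SEV3-4 incidents may skip
# escalation and move from triage directly to response.
TRANSITIONS = {
    Stage.DETECTION: {Stage.TRIAGE},
    Stage.TRIAGE: {Stage.ESCALATION, Stage.RESPONSE},
    Stage.ESCALATION: {Stage.RESPONSE},
    Stage.RESPONSE: {Stage.RESOLUTION},
    Stage.RESOLUTION: {Stage.POST_MORTEM},
    Stage.POST_MORTEM: set(),
}


@dataclass
class Incident:
    title: str
    stage: Stage = Stage.DETECTION
    log: list = field(default_factory=list)  # (when, who, what) tuples

    def record(self, who: str, what: str) -> None:
        """Append a timeline entry: who did what, and when (UTC)."""
        self.log.append((datetime.now(timezone.utc), who, what))

    def advance(self, new_stage: Stage, who: str) -> None:
        """Move to the next stage, refusing transitions that skip steps."""
        if new_stage not in TRANSITIONS[self.stage]:
            raise ValueError(
                f"cannot move from {self.stage.value} to {new_stage.value}"
            )
        self.record(who, f"{self.stage.value} -> {new_stage.value}")
        self.stage = new_stage
```

The `log` list doubles as the raw timeline for the post-mortem, since every stage change is recorded with a timestamp and an author.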
Tooling
Slack/Teams integration. A bot automatically creates the incident channel, invites participants, and posts the incident ticket template.
Runbooks. Each alert links to a specific runbook in Confluence/Notion: what to do for this error, which commands to run, whom to call.
Shared terminal (tmux/screen). For remote work, use tmate or Teleport to share console access without sharing credentials.
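
The rule that every alert links to a runbook can be checked mechanically, for instance in CI. The sketch below assumes alert rules have already been parsed into dictionaries shaped like Prometheus alerting rules, and that the link lives in an annotation named `runbook_url`. That key is a common convention rather than a requirement, and the function name and sample data are hypothetical.

```python
def alerts_missing_runbook(alert_rules):
    """Return the names of alert rules that carry no runbook link."""
    missing = []
    for rule in alert_rules:
        url = rule.get("annotations", {}).get("runbook_url", "").strip()
        if not url:
            missing.append(rule["alert"])
    return missing


# Hypothetical rules, shaped like parsed alerting-rule definitions.
rules = [
    {
        "alert": "HighErrorRate",
        "annotations": {
            "runbook_url": "https://wiki.example.com/runbooks/high-error-rate"
        },
    },
    {"alert": "DiskAlmostFull", "annotations": {}},
]

print(alerts_missing_runbook(rules))  # ['DiskAlmostFull']
```

Failing the build when this list is non-empty keeps new alerts from shipping without instructions for the on-call engineer.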
Example Slack Bot for Incident Creation
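
    # Slash command: /incident create sev=1 "Payment system down"
    # Assumes `app` is a Slack Bolt app, and that parse_severity, parse_title,
    # INCIDENT_TEMPLATE, update_status_page and the pagerduty client are
    # defined elsewhere in the project.
    from datetime import date, datetime, timezone

    from slugify import slugify  # python-slugify

    @app.command("/incident")
    def create_incident(ack, command, client):
        ack()  # acknowledge right away; Slack expects a reply within 3 seconds
        severity = parse_severity(command["text"])
        title = parse_title(command["text"])

        # Slack channel names must be lowercase and at most 80 characters
        name = f"incident-{date.today()}-{slugify(title)}"[:80]
        channel = client.conversations_create(name=name)

        # Post the incident template; the command's author becomes commander
        client.chat_postMessage(
            channel=channel["channel"]["id"],
            text=INCIDENT_TEMPLATE.format(
                severity=severity,
                title=title,
                commander=command["user_id"],
                started_at=datetime.now(timezone.utc).isoformat(),
            ),
        )

        # Update Status Page
        update_status_page(severity, title)
        # PagerDuty: create incident
        pagerduty.create_incident(severity, title)
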
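
The handler above relies on `parse_severity` and `parse_title`, which the example does not define. One possible sketch follows. The regular expressions, the fallback to SEV3 when no `sev=N` token is present, and the "untitled incident" placeholder are all assumptions, not part of the original bot.

```python
import re

SEVERITY_RE = re.compile(r"\bsev=([1-4])\b", re.IGNORECASE)
# Accept straight quotes and the curly quotes some clients substitute.
TITLE_RE = re.compile(r'["\u201c](.+?)["\u201d]')


def parse_severity(text: str, default: int = 3) -> int:
    """Extract N from a `sev=N` token; fall back to `default` (assumption)."""
    match = SEVERITY_RE.search(text)
    return int(match.group(1)) if match else default


def parse_title(text: str) -> str:
    """Return the quoted title, or the remaining words if nothing is quoted."""
    match = TITLE_RE.search(text)
    if match:
        return match.group(1).strip()
    # No quotes: drop the sev=N token and a leading "create" subcommand.
    remainder = SEVERITY_RE.sub("", text)
    remainder = re.sub(r"^\s*create\b", "", remainder, flags=re.IGNORECASE)
    return " ".join(remainder.split()) or "untitled incident"
```

With the command text `create sev=1 "Payment system down"`, these return `1` and `Payment system down` respectively.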
Communication During Incident
Within the team, technical details go in the incident channel. For the business, send simple updates every 30 minutes: "Problem detected, working on fix. Next update in 30 minutes." For users, update the Status Page.
Never say "it will be fixed soon" without a time estimate. It is better to say "expect recovery in 2 hours" and refine the estimate later.
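
A small helper can make every business update state when the next one is due and, when known, the expected recovery time. This is a sketch only: the function name, the wording, and the use of UTC clock times are assumptions, while the 30-minute interval comes from the cadence described above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

UPDATE_INTERVAL = timedelta(minutes=30)  # cadence for business updates


def business_update(status: str, now: datetime,
                    eta: Optional[datetime] = None) -> str:
    """Build a short update that always names the next update time."""
    next_update = now + UPDATE_INTERVAL
    parts = [status.rstrip(".") + "."]
    if eta is not None:
        parts.append(f"Expected recovery by {eta:%H:%M} UTC.")
    parts.append(f"Next update at {next_update:%H:%M} UTC.")
    return " ".join(parts)


now = datetime(2024, 1, 15, 10, 0, tzinfo=timezone.utc)
print(business_update("Problem detected, working on fix", now,
                      eta=now + timedelta(hours=2)))
# Problem detected, working on fix. Expected recovery by 12:00 UTC. Next update at 10:30 UTC.
```

Stating an absolute time rather than "in 30 minutes" avoids ambiguity when the message is read late.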
Process Metrics
- MTTD (Mean Time to Detect): average time from the start of an incident to its detection
- MTTA (Mean Time to Acknowledge): average time from alert to acknowledgment by the on-call engineer
- MTTR (Mean Time to Resolve): average time to resolve an incident
- Incident Frequency: number of incidents per period, broken down by severity
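
These metrics can be computed from four timestamps per incident. The sketch below is illustrative: the `IncidentRecord` fields are hypothetical, and measuring MTTR from the start of the incident (rather than from detection) is an assumption, since teams define it differently.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class IncidentRecord:
    severity: int
    started_at: datetime       # when the problem actually began
    detected_at: datetime      # when the alert fired
    acknowledged_at: datetime  # when on-call accepted the alert
    resolved_at: datetime      # when service was restored


def _mean_minutes(deltas):
    """Average a list of timedeltas and express the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def process_metrics(records):
    """Compute MTTD, MTTA, MTTR (minutes) and incident counts by severity."""
    return {
        "mttd_min": _mean_minutes([r.detected_at - r.started_at for r in records]),
        "mtta_min": _mean_minutes([r.acknowledged_at - r.detected_at for r in records]),
        # Assumption: MTTR is measured from incident start to resolution.
        "mttr_min": _mean_minutes([r.resolved_at - r.started_at for r in records]),
        "count_by_severity": dict(Counter(r.severity for r in records)),
    }
```

Tracking these per severity level, and not only in aggregate, keeps a few long SEV4 tickets from masking slow SEV1 response.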
Implementation Timeline
- Define process, roles, and severity matrix: 2-3 days
- Set up PagerDuty/OpsGenie and on-call rotation: 1-2 days
- Slack integration and templates: 1-2 days
- Runbooks for the top 10 alerts: 3-5 days
- Team training and test drill: 1 day







