PagerDuty Integration for Incident Management
PagerDuty is an incident management platform that ingests events from monitoring tools, determines who is on call, notifies the right person through the right channel, and tracks the response. Integrating PagerDuty into existing infrastructure typically takes a few days of work and produces tangible results (see Integration Timeframes for a breakdown).
PagerDuty Architecture
Services: logical units (backend API, payment service, database). Each service has its own escalation policy and on-call schedule.
Integrations: event sources such as Prometheus/Alertmanager, Datadog, CloudWatch, Grafana, Uptime Robot, and custom webhooks. Each integration gets its own unique integration key.
Escalation Policies: rules that define who receives an alert, after how many minutes an unacknowledged alert escalates, and to whom it escalates.
Schedules: on-call schedules with rotations.
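To make the relationship between these pieces concrete, here is a minimal Python sketch of how an escalation policy could be evaluated. The class and field names (EscalationRule, EscalationPolicy, delay_minutes) and the target names are illustrative assumptions, not PagerDuty's data model or API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EscalationRule:
    target: str          # hypothetical schedule or user name
    delay_minutes: int   # minutes this level has to acknowledge before escalating


@dataclass
class EscalationPolicy:
    rules: List[EscalationRule]

    def current_target(self, minutes_unacknowledged: int) -> str:
        """Return who should be notified after the given unacknowledged time."""
        elapsed = 0
        for rule in self.rules:
            elapsed += rule.delay_minutes
            if minutes_unacknowledged < elapsed:
                return rule.target
        # Every level timed out: stay on the last level.
        return self.rules[-1].target


policy = EscalationPolicy(rules=[
    EscalationRule("primary-on-call", 15),
    EscalationRule("secondary-on-call", 15),
    EscalationRule("engineering-manager", 30),
])
```

With this policy, an alert left unacknowledged for 20 minutes would be routed to the secondary on-call level.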
Connecting Prometheus Alertmanager
# alertmanager.yml
route:
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'pagerduty-critical'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
    - match:
        severity: warning
      receiver: 'slack-warnings'
receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - routing_key: '<PAGERDUTY_INTEGRATION_KEY>'
        description: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
        severity: '{{ .CommonLabels.severity }}'
        details:
          firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
  # The 'slack-warnings' receiver referenced above must be defined as well;
  # the webhook URL and channel below are placeholders.
  - name: 'slack-warnings'
    slack_configs:
      - api_url: '<SLACK_WEBHOOK_URL>'
        channel: '#alerts-warnings'
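When routing_key is set, Alertmanager delivers alerts through PagerDuty's Events API v2, and a custom event source can call the same API directly. The following is a minimal sketch of that, not a complete client: the helper names build_event and send_event, the sample summary, and the dedup_key value are assumptions made for illustration.

```python
import json
import urllib.request

EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"


def build_event(routing_key, summary, source, severity="critical", dedup_key=None):
    """Build the body of an Events API v2 'trigger' event."""
    event = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,  # one of: critical, error, warning, info
        },
    }
    if dedup_key is not None:
        # Events with the same dedup_key are treated as the same alert.
        event["dedup_key"] = dedup_key
    return event


def send_event(event, timeout=10):
    """POST the event to PagerDuty and return the HTTP status code."""
    req = urllib.request.Request(
        EVENTS_API_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


# Example values only; replace the placeholder with a real integration key.
event = build_event(
    "<PAGERDUTY_INTEGRATION_KEY>",
    "Disk usage above 90% on db-1",
    source="db-1",
    dedup_key="disk-db-1",
)
```

Calling send_event(event) would submit the alert; it is left out of the example so the snippet can be run without network access or a valid key.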
Connecting Datadog
In Datadog, go to Integrations → PagerDuty and add the API key. Then reference the PagerDuty service in the monitor's notification message:
@pagerduty-MyService
Alternatively, use the Datadog Webhooks integration for finer control over the payload.
Event Intelligence and Noise Suppression
PagerDuty Event Intelligence (paid plan) provides automatic noise suppression:
- Alert Grouping: related alerts merge into one incident. During a DB outage you don't get 50 alerts from every service that can't connect, only one incident.
- Intelligent Alert Grouping: an ML model groups alerts based on historical patterns.
- Suppression Rules: temporary alert suppression during planned maintenance.
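The idea behind alert grouping can be illustrated with a simple time-window rule. This is only a sketch of the general technique, not PagerDuty's grouping algorithm: the Incident class, the group_alerts function, the grouping key, and the 300-second window are all assumptions, and the sample alerts are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Incident:
    key: str
    opened_at: float
    alerts: List[str] = field(default_factory=list)


def group_alerts(alerts, window_seconds=300):
    """Group (timestamp, key, message) alerts that share a key within a time window."""
    open_incidents: Dict[str, Incident] = {}
    incidents: List[Incident] = []
    for ts, key, message in sorted(alerts):
        current = open_incidents.get(key)
        # Open a new incident if none exists for this key or the window has expired.
        if current is None or ts - current.opened_at > window_seconds:
            current = Incident(key=key, opened_at=ts)
            open_incidents[key] = current
            incidents.append(current)
        current.alerts.append(message)
    return incidents


# Made-up sample alerts: three services failing on the same database, plus one unrelated alert.
alerts = [
    (0, "db-primary", "api: cannot connect to database"),
    (5, "db-primary", "payments: cannot connect to database"),
    (12, "db-primary", "worker: cannot connect to database"),
    (20, "cdn", "elevated 5xx rate"),
]
incidents = group_alerts(alerts)
```

Here the three database alerts collapse into a single incident, while the unrelated CDN alert opens its own.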
Webhooks and Automation
PagerDuty webhooks send an event when an incident is triggered, updated, or resolved:
from flask import Flask, request

app = Flask(__name__)

# create_incident_channel, archive_incident_channel and update_status_page
# are your own Slack and status page helpers; they are not shown here.

@app.route('/pd-webhook', methods=['POST'])
def pagerduty_webhook():
    data = request.get_json()
    event_type = data['event']['event_type']
    incident = data['event']['data']
    if event_type == 'incident.triggered':
        # Create Slack channel
        create_incident_channel(incident['title'], incident['id'])
        # Update Status Page
        update_status_page('major_outage', incident['title'])
    elif event_type == 'incident.resolved':
        # Archive Slack channel
        archive_incident_channel(incident['id'])
        # Restore Status Page
        update_status_page('operational', '')
    return '', 200
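A publicly reachable webhook endpoint should check that requests really come from PagerDuty. The sketch below assumes the v3 webhook signing scheme, in which the X-PagerDuty-Signature header carries one or more comma-separated v1=<hex digest> values computed as an HMAC-SHA256 of the raw request body with the subscription's secret. The function name verify_signature and the sample secret and body are illustrative; confirm the header format against the current PagerDuty documentation before relying on it.

```python
import hashlib
import hmac


def verify_signature(secret: str, raw_body: bytes, signature_header: str) -> bool:
    """Return True if any v1 signature in the header matches the body's HMAC-SHA256."""
    digest = hmac.new(secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()
    expected = ("v1=" + digest).encode("utf-8")
    candidates = [part.strip().encode("utf-8") for part in signature_header.split(",")]
    # compare_digest gives a constant-time comparison to avoid timing leaks.
    return any(hmac.compare_digest(expected, candidate) for candidate in candidates)


# Illustrative values only.
secret = "example-webhook-secret"
body = b'{"event": {"event_type": "incident.triggered"}}'
header = "v1=" + hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
is_valid = verify_signature(secret, body, header)
```

In the Flask handler, the raw body is available via request.get_data() and the header via request.headers.get('X-PagerDuty-Signature', ''); a request that fails the check can be rejected with a 401 before any Slack or status page action runs.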
PagerDuty + Jira/Linear Integration
Tickets can be created automatically for SEV1/SEV2 incidents:
- Native Jira integration: when an incident triggers, a Jira issue of type Incident is created.
- When the incident resolves, the Jira issue transitions to Done with a comment recording the duration.
Runbook Automation
PagerDuty Runbook Automation (formerly Rundeck): when an alert fires, a runbook executes automatically, for example to restart a service, clear disk space, or scale an ASG. If the runbook fixes the problem, the incident closes automatically without paging anyone.
Analytics and Reports
PagerDuty Analytics provides:
- MTTA/MTTR by teams and services
- Responder health score (who is overloaded)
- Noise ratio (how many alerts are actionable vs noise)
- Business impact (time without major incidents)
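MTTA and MTTR are simple averages over incident timestamps, which the following sketch makes explicit. The dictionary keys (triggered, acknowledged, resolved) and the sample incidents are assumptions made for illustration; they are not PagerDuty's export format or real data.

```python
from datetime import datetime
from statistics import mean


def mtta_mttr(incidents):
    """Return (MTTA, MTTR) in minutes for incidents with trigger/ack/resolve times."""
    time_to_ack = [
        (i["acknowledged"] - i["triggered"]).total_seconds() / 60 for i in incidents
    ]
    time_to_resolve = [
        (i["resolved"] - i["triggered"]).total_seconds() / 60 for i in incidents
    ]
    return mean(time_to_ack), mean(time_to_resolve)


# Made-up sample incidents.
incidents = [
    {
        "triggered": datetime(2024, 1, 1, 10, 0),
        "acknowledged": datetime(2024, 1, 1, 10, 4),
        "resolved": datetime(2024, 1, 1, 10, 34),
    },
    {
        "triggered": datetime(2024, 1, 2, 22, 0),
        "acknowledged": datetime(2024, 1, 2, 22, 8),
        "resolved": datetime(2024, 1, 2, 23, 10),
    },
]
mtta, mttr = mtta_mttr(incidents)  # 6.0 and 52.0 minutes for this sample
```

Grouping the incidents by team or service before calling the function gives the per-team and per-service breakdown mentioned above.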
Integration Timeframes
- Creating services + escalation policies + schedules — 1 day
- Connecting Prometheus/Datadog/CloudWatch — 1 day
- Webhooks + Slack/Jira automation — 1-2 days
- Testing + team training — 1 day







