On-Call Rotation Setup for Support Team
On-call rotation is a system where responsibility for incident response outside business hours is distributed among team members in turns. Without rotation, a single on-call engineer burns out within a month; with a properly configured rotation, the load is balanced and response times are predictable.
On-Call Scheme Structure
Primary on-call: The first level; receives all alerts and must acknowledge within 5-15 minutes.
Secondary on-call (backup): If the primary doesn't acknowledge within 10-15 minutes, the alert escalates to the secondary, who also covers the primary's unavailability.
Escalation path: Primary → Secondary → Engineering Manager → CTO. Each level adds another 10-15 minutes before escalating further.
Rotation period: Usually 1 week. It can be 2 weeks for mature teams with low alert noise; shorter than a week means too much context switching.
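The escalation path above can be sketched as a timeline calculation. This is an illustrative example, not a PagerDuty API call; the level names and timeouts simply mirror the scheme described here:

```python
from datetime import timedelta

# Each level is paged only if the previous one has not acknowledged
# within its timeout (values taken from the escalation path above).
ESCALATION_PATH = [
    ("primary", timedelta(minutes=0)),            # paged immediately
    ("secondary", timedelta(minutes=15)),         # if primary hasn't acked
    ("engineering manager", timedelta(minutes=15)),
    ("cto", timedelta(minutes=15)),
]

def notify_times(path):
    """Return cumulative offsets at which each level gets paged."""
    schedule, elapsed = [], timedelta(0)
    for level, delay in path:
        elapsed += delay
        schedule.append((level, elapsed))
    return schedule

for level, offset in notify_times(ESCALATION_PATH):
    print(f"T+{int(offset.total_seconds() // 60):>2} min: page {level}")
```

With these numbers, an incident nobody acknowledges reaches the CTO 45 minutes after the first page; that is the worst-case silence window the team is accepting.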
Requirements for On-Call Engineer
Being on-call means:
- Reachable within 15 minutes (not half an hour away in the shower)
- Laptop and VPN access always at hand
- Sober and capable of technical reasoning
- Familiar with the runbooks for typical incidents
On-call doesn't mean working 24/7. If no alerts fire at night, the engineer sleeps. The goal is readiness, not constant work.
PagerDuty Configuration
Service → Escalation Policy:
  Level 1: On-Call schedule (primary)
    Notify after: immediately
    Escalate after: 15 minutes
  Level 2: On-Call schedule (secondary)
    Notify after: escalation from Level 1
    Escalate after: 15 minutes
  Level 3: Engineering Manager
    Notify after: escalation from Level 2

Schedule (Primary):
  Rotation type: Weekly
  Handoff time: Monday 10:00 local time
  Restrictions: None (24/7 coverage)
  Layer 1: [Engineer A, Engineer B, Engineer C, Engineer D]
Handing off during business hours (not at 00:00) lets the incoming engineer take the shift in a calm environment and review open incidents.
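The weekly rotation logic can be checked with a small sketch. This is not the PagerDuty API; the engineer names come from the layer above, and the anchor date is an assumption (any Monday 10:00 works):

```python
from datetime import datetime, timedelta

# Layer order from the schedule above; engineers cycle weekly.
ENGINEERS = ["Engineer A", "Engineer B", "Engineer C", "Engineer D"]
# Assumed anchor: 2024-01-01 was a Monday, handoff at 10:00 local time.
ROTATION_START = datetime(2024, 1, 1, 10, 0)

def current_primary(now, engineers=ENGINEERS, start=ROTATION_START):
    """Return the engineer on primary duty at `now`."""
    if now < start:
        raise ValueError("rotation has not started yet")
    weeks_elapsed = (now - start) // timedelta(weeks=1)
    return engineers[weeks_elapsed % len(engineers)]
```

For example, `current_primary(datetime(2024, 1, 8, 9, 59))` still returns Engineer A, and one minute later, at the Monday 10:00 handoff, the shift passes to Engineer B.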
Compensation for On-Call Duty
On-call is additional load that should be compensated:
- Bonus for on-call week
- Day off after heavy week with night incidents
- Compensation for each night call (if they are frequent)
A team without compensation is a team that sabotages the rotation, or leaves.
Reducing Alert Fatigue
On-call works only if alerts are meaningful. If 50 alerts fire during an on-call week and 45 of them are noise, the team stops responding within a month.
Tools to fight noise:
- Alert grouping: multiple related alerts → one incident
- Smart notifications: different channels for day (Slack) and night (call)
- Alert review: weekly review of fired alerts, disable false ones
- SLO-based alerting: alerts on burn rate, not thresholds
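Alert grouping, the first tool on the list, can be sketched as follows. The alert fields ("service", "summary") are assumptions for illustration; real tools group by a configurable dedup key:

```python
from collections import defaultdict

def group_alerts(alerts):
    """Collapse related alerts into one incident per service,
    so a cascade of pages becomes a single notification."""
    incidents = defaultdict(list)
    for alert in alerts:
        incidents[alert["service"]].append(alert["summary"])
    return dict(incidents)

alerts = [
    {"service": "api", "summary": "latency p99 > 2s"},
    {"service": "api", "summary": "error rate > 5%"},
    {"service": "db", "summary": "replication lag high"},
]
# Three raw alerts collapse into two incidents: "api" and "db".
```

The on-call engineer gets two pages instead of three here; with a real cascading failure the ratio is far better, which is the point of grouping.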
Handoff Procedure
Each shift change includes a context transfer:
- List of open/recent incidents
- What infrastructure is unstable right now
- Planned changes this week (deploys, migrations)
- "Hot" areas requiring attention
Keep a handoff-note template in Slack/Confluence, filled in by the outgoing on-call engineer.
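The handoff note can be generated from the four checklist items above. The function and field names are illustrative, not a fixed template:

```python
def handoff_note(outgoing, incoming, open_incidents, unstable, planned, hot_areas):
    """Render a plain-text handoff note from the four context items."""
    def section(title, items):
        body = [f"  - {item}" for item in items] or ["  - none"]
        return [f"{title}:"] + body

    lines = [f"On-call handoff: {outgoing} -> {incoming}"]
    lines += section("Open/recent incidents", open_incidents)
    lines += section("Unstable infrastructure", unstable)
    lines += section("Planned changes this week", planned)
    lines += section("Hot areas", hot_areas)
    return "\n".join(lines)

note = handoff_note(
    "Engineer A", "Engineer B",
    open_incidents=["INC-42: intermittent 502s on checkout"],
    unstable=[],
    planned=["db schema migration on Wednesday"],
    hot_areas=["payments service after last deploy"],
)
```

Empty sections render as "- none" rather than disappearing, so the incoming engineer can see the outgoing one actually considered each item.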
On-Call Health Metrics
- Incidents per week per engineer: if > 5, reduce alert noise
- After-hours incidents %: the share of incidents that happen at night or on weekends
- MTTA (Mean Time to Acknowledge): if > 15 min, the escalation policy or notification channels need tightening
- Fatigue score: subjective load assessment collected from each on-call engineer
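MTTA is straightforward to compute from (triggered, acknowledged) timestamp pairs. A sketch, with made-up incident timestamps for illustration:

```python
from datetime import datetime, timedelta

def mtta(incidents):
    """Mean time to acknowledge, as a timedelta, over
    (triggered, acknowledged) timestamp pairs."""
    deltas = [acked - triggered for triggered, acked in incidents]
    return sum(deltas, timedelta(0)) / len(deltas)

# Example data: one 8-minute and one 4-minute acknowledgement.
incidents = [
    (datetime(2024, 5, 1, 2, 0), datetime(2024, 5, 1, 2, 8)),
    (datetime(2024, 5, 3, 14, 0), datetime(2024, 5, 3, 14, 4)),
]
# mtta(incidents) -> 6 minutes, well under the 15-minute target.
```

Comparing the result against the 15-minute threshold from the metric above turns a vague "response feels slow" into a number you can trend week over week.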
Setup Timeframes
- PagerDuty/OpsGenie + schedules + escalation policy — 1-2 days
- Integration with Prometheus/Datadog alerts — 1-2 days
- Notification channels setup (Slack, call, SMS) — 1 day
- Process documentation + team training — 1-2 days







