Conducting Disaster Recovery Drills
A DR drill is a planned verification of the team's ability to restore an application from a standby state. Organizations that have never run a drill discover runbook problems only during real incidents, which is the worst possible time to find them.
Why Regular Drills Are Needed
Unverified backups provide only an illusion of security. Typical surprises during a first DR drill:
- The backup exists, but recovery takes 6 hours instead of the expected 30 minutes
- Config files are stored only on the primary server and are not included in the backup
- Secrets (API keys, certificates) live in people's heads, not in Vault/Secrets Manager
- The runbook describes outdated infrastructure
- The team doesn't know who decides on DR activation
Types of Drills
Tabletop Exercise. The team discusses a scenario without executing anything, walking through the runbook steps verbally. This reveals gaps in documentation and in the assignment of responsibility. It takes 2-3 hours and carries no risk.
Functional Exercise. The team actually executes individual DR components: restoring a DB from backup, verifying DNS failover, spinning up an environment from Terraform. The risk is minimal if done in staging.
Full-Scale Drill. A complete disaster simulation: the primary infrastructure is intentionally shut down and traffic is fully switched over to the DR site. It is conducted during off-hours (usually at night or on weekends) within a pre-agreed maintenance window.
Drill Preparation
At least one week before the drill:
- Refresh runbook (verify commands work on current infrastructure)
- Confirm backups are fresh and uncorrupted
- Assign roles: incident manager, technical lead, observer (records time and deviations)
- Coordinate window with business (for full-scale drill)
- Prepare success metrics: target RTO/RPO
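The "fresh and uncorrupted" check from the list above can be partly automated. The following is a minimal sketch, assuming each backup is a single file and that a SHA-256 checksum was recorded when the backup was taken. The function names, the 24-hour threshold, and the file layout are illustrative assumptions, not part of any specific backup tool.

```python
import hashlib
import time
from pathlib import Path
from typing import List, Optional

MAX_AGE_HOURS = 24  # assumed freshness threshold; derive it from your target RPO


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large backups need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_backup(
    path: Path,
    expected_sha256: str,
    max_age_hours: float = MAX_AGE_HOURS,
    now: Optional[float] = None,
) -> List[str]:
    """Return a list of problems; an empty list means the backup passed."""
    if not path.is_file():
        return [f"{path}: missing"]
    problems = []
    now = time.time() if now is None else now
    # Freshness: compare the file's modification time with the threshold.
    age_hours = (now - path.stat().st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(f"{path}: stale ({age_hours:.1f} h old)")
    # Integrity: the stored file must match the checksum recorded at backup time.
    if sha256_of(path) != expected_sha256:
        problems.append(f"{path}: checksum mismatch")
    return problems
```

A matching checksum only shows that the file has not changed since it was written. Whether it can actually be restored is what the drill itself has to prove.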
Testing Scenarios
| Scenario | What We Test |
|---|---|
| Primary DB failure, promote replica | Promotion time, application correctness |
| Primary server loss | DNS failover, switch time |
| Data corruption (accidental delete) | PITR recovery, RPO |
| Complete region/DC loss | Spin up from IaC + data from DR Site |
| Secret compromise | Rotation of all credentials, time to complete |
Full-Scale Drill Process
Before start (T-60 minutes):
- All participants online, roles assigned
- Monitoring in separate window for observer
- Initial state documented (dashboard screenshots)
Scenario activation (T=0):
- Observer starts timer
- Simulate failure (per scenario)
Recovery:
- Team follows runbook (no improvisation)
- Observer records each step with timestamp
- Deviations from runbook documented
Verification (after switchover):
- Smoke tests of critical app functions
- Data integrity check
- Record actual RTO/RPO
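One way to approach the data integrity check above is to compare per-table fingerprints between a reference copy and the restored database. The sketch below uses SQLite from the standard library as a stand-in for the real database. The function names and the table-to-key mapping are hypothetical, and hashing every row is only practical for small or sampled tables.

```python
import hashlib
import sqlite3
from typing import Dict, Tuple

Fingerprint = Tuple[int, str]  # (row count, SHA-256 over ordered rows)


def table_fingerprint(conn: sqlite3.Connection, table: str, key: str) -> Fingerprint:
    """Count the rows and hash them in a deterministic order."""
    digest = hashlib.sha256()
    count = 0
    # Table and key names are assumed to come from a trusted list in the
    # runbook, never from user input, because they are interpolated into SQL.
    for row in conn.execute(f'SELECT * FROM "{table}" ORDER BY "{key}"'):
        digest.update(repr(row).encode("utf-8"))
        count += 1
    return count, digest.hexdigest()


def compare_tables(
    reference: sqlite3.Connection,
    restored: sqlite3.Connection,
    tables: Dict[str, str],
) -> Dict[str, Tuple[Fingerprint, Fingerprint]]:
    """Return the tables whose fingerprints differ, with both fingerprints."""
    mismatches = {}
    for table, key in tables.items():
        ref = table_fingerprint(reference, table, key)
        got = table_fingerprint(restored, table, key)
        if ref != got:
            mismatches[table] = (ref, got)
    return mismatches
```

An empty result means every listed table matched. A differing row count points at missing data, while an equal count with a different hash points at altered rows.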
Restore initial state (after successful verification).
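The observer's timestamps can be turned into actual RTO/RPO figures with simple arithmetic. This sketch assumes that RTO is measured from the simulated failure to the moment smoke tests pass, and RPO from the newest write that survived recovery to the moment of failure. The class and field names are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillTimeline:
    failure_at: datetime               # T=0, when the observer starts the timer
    service_restored_at: datetime      # smoke tests pass after switchover
    last_recovered_write_at: datetime  # newest record present after recovery

    @property
    def actual_rto(self) -> timedelta:
        """Downtime: from simulated failure to verified service."""
        return self.service_restored_at - self.failure_at

    @property
    def actual_rpo(self) -> timedelta:
        """Data loss window: writes after this point were not recovered."""
        return self.failure_at - self.last_recovered_write_at


def evaluate(timeline: DrillTimeline, target_rto: timedelta, target_rpo: timedelta) -> dict:
    """Compare measured values with the targets agreed before the drill."""
    return {
        "rto": timeline.actual_rto,
        "rto_met": timeline.actual_rto <= target_rto,
        "rpo": timeline.actual_rpo,
        "rpo_met": timeline.actual_rpo <= target_rpo,
    }
```

For example, a failure at 02:00 with service verified at 02:45 and the last recovered write at 01:56 gives an actual RTO of 45 minutes and an actual RPO of 4 minutes. Against targets of 30 and 5 minutes, the RTO target is missed and the RPO target is met.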
Post-Drill Analysis
Hold a team meeting 24-48 hours after the drill:
- What went as planned
- What went wrong (without blame)
- Concrete tasks to improve runbook
- Updated target RTO/RPO for next drill
Record the drill results in a document stored alongside the runbook.
Drill Frequency
- Tabletop exercise — quarterly
- Functional exercise — twice yearly
- Full-scale drill — annually (or after major infrastructure changes)
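The cadence above is easy to let slip, so it can help to compute due dates from the date of the last drill. This is a minimal sketch: the intervals are approximated in days (quarterly as 91, twice yearly as 182, annually as 365), and the names are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List

# Intervals from the frequency list above, approximated in days (assumption).
INTERVAL_DAYS = {"tabletop": 91, "functional": 182, "full_scale": 365}


def next_due(drill_type: str, last_run: date) -> date:
    """Date by which the next drill of this type should take place."""
    return last_run + timedelta(days=INTERVAL_DAYS[drill_type])


def drills_due(
    last_runs: Dict[str, date],
    today: date,
    major_infra_change: bool = False,
) -> List[str]:
    """List drill types that are due on or before today."""
    due = [t for t, last in last_runs.items() if next_due(t, last) <= today]
    # A major infrastructure change triggers a full-scale drill early.
    if major_infra_change and "full_scale" not in due:
        due.append("full_scale")
    return sorted(due)
```

The `major_infra_change` flag models the "or after major infrastructure changes" rule. How such a change is detected is left to the team.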
Organization Timeline
- Prepare first drills (tabletop) — 2-3 days
- Organize functional exercise — 3-5 days
- Prepare full-scale drill — 1-2 weeks







