Setting Up a Disaster Recovery (DR) Site for a Web Application
A DR site is a complete copy of the infrastructure in a separate physical or cloud data center, ready to take traffic if the primary site fails catastrophically. The term "disaster recovery site" covers several readiness levels, from cold standby (hours to bring up) to hot standby (minutes).
DR Site Classification by Readiness
Cold Standby. Infrastructure is not running. Data is replicated, and configuration is stored as IaC. On failure: spin up the environment from Terraform → restore data from backup → start the application. RTO: 2-8 hours.
Warm Standby. Core infrastructure runs at reduced size (for example, 1 instance instead of 10). Data is kept current via replication. On failure: scale to production size → switch DNS. RTO: 15-60 minutes.
Hot Standby. A full copy of the infrastructure runs continuously. Data is synchronized with a lag under 1 minute. On failure: switch DNS or the load balancer. RTO: 1-5 minutes.
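As a rough illustration of how these tiers map to a recovery target, the sketch below picks the cheapest tier whose worst-case RTO still fits. The function and constant names are hypothetical; the RTO figures are the upper bounds of the ranges listed above.

```python
from typing import Optional

# Worst-case RTO per tier in minutes (upper bounds of the ranges above).
WORST_CASE_RTO_MIN = {
    "cold": 8 * 60,
    "warm": 60,
    "hot": 5,
}

# Tiers ordered from cheapest to most expensive to keep running.
CHEAPEST_FIRST = ["cold", "warm", "hot"]


def pick_tier(target_rto_min: float) -> Optional[str]:
    """Return the cheapest tier whose worst-case RTO meets the target.

    Returns None when even hot standby cannot meet the target.
    """
    for tier in CHEAPEST_FIRST:
        if WORST_CASE_RTO_MIN[tier] <= target_rto_min:
            return tier
    return None
```

A target of 90 minutes, for instance, rules out cold standby and lands on warm standby; a target below 5 minutes returns None, which suggests the requirement needs an active-active design rather than a standby site.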
Selecting DR Site Location
Key requirements:
- Physically independent power grid and internet uplinks
- Minimum 100 km from primary site (protection from regional disasters)
- Legal compliance (for example, data of users in the Russian Federation must stay in the Russian Federation; GDPR applies to data of EU users)
Options:
- Second AWS/GCP/Azure region (simplest)
- Different cloud provider (protection from vendor outage)
- Own or leased co-location (for regulated industries)
Data Replication
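PostgreSQL → DR Site: streaming replication with an asynchronous standby in DR. For critical data, set synchronous_commit = remote_apply and list the standby in synchronous_standby_names: a commit then returns only after the standby has applied it, so acknowledged transactions survive a primary failure, at the cost of higher write latency.
Monitor replication lag (run on the standby):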
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;
Alert if lag > 30 seconds.
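The alert rule can be sketched as a small classifier over the query result. This is a minimal sketch that assumes the database driver returns the interval as a Python `timedelta` (or `None` for SQL NULL); the function name and the status strings are hypothetical.

```python
from datetime import timedelta
from typing import Optional

# Alert threshold from the text above.
LAG_ALERT_THRESHOLD = timedelta(seconds=30)


def lag_status(lag: Optional[timedelta],
               threshold: timedelta = LAG_ALERT_THRESHOLD) -> str:
    """Classify the value returned by the replication-lag query.

    `lag` is None when the query returns NULL, which happens when no
    transaction has been replayed on this server (for example, when
    the query is run on a primary instead of a standby).
    """
    if lag is None:
        return "unknown"
    if lag > threshold:
        return "alert"
    return "ok"
```

Treating NULL as a separate "unknown" state avoids silently reporting a healthy replica when the check is pointed at the wrong server. Note also that this query measures time since the last replayed transaction, so the value grows on a quiet primary even when the standby is fully caught up.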
File Storage:
- S3 Cross-Region Replication (AWS): automatic; RPO under 15 minutes when S3 Replication Time Control is enabled
- Rclone sync on a schedule: for infrequently changing objects
- Lsyncd: near-real-time filesystem sync between servers
Redis: Redis Sentinel with a replica in DR, or Redis Cluster with geo-distribution.
Infrastructure as Code for DR
The entire DR site is described in Terraform. The primary and standby environments are separate workspaces or separate config directories, parameterized via variables:
module "app_cluster" {
  source        = "./modules/app"
  region        = var.region
  instance_type = var.dr_mode ? "t3.medium" : "c6i.2xlarge"
  replica_count = var.dr_mode ? 1 : 5
}
Cold standby: run terraform apply only on DR activation. Warm standby: run terraform apply up front with dr_mode = true.
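For runbook automation, the Terraform invocation can be assembled programmatically. The sketch below only builds the argument list and does not run anything; the helper name is hypothetical, the region value in the usage note is an example, and `-var` and `-auto-approve` are standard `terraform apply` options.

```python
from typing import List


def terraform_apply_cmd(dr_mode: bool, region: str,
                        auto_approve: bool = False) -> List[str]:
    """Build the `terraform apply` argument list for the DR environment.

    dr_mode=True keeps the reduced warm-standby size; dr_mode=False
    selects the production-size values from the module above.
    """
    cmd = [
        "terraform", "apply",
        "-var", f"dr_mode={'true' if dr_mode else 'false'}",
        "-var", f"region={region}",
    ]
    if auto_approve:
        # Skips the interactive confirmation; use only in a tested runbook.
        cmd.append("-auto-approve")
    return cmd
```

With this helper, provisioning a warm standby would use `terraform_apply_cmd(True, "eu-west-1")`, and scaling it to production size during activation would use the same call with `dr_mode=False`. The resulting list can be passed to `subprocess.run` from the DR configuration directory.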
DR Site Activation Procedure
The runbook is documented with exact commands: specific steps, not general descriptions.
- Confirm primary site failure (not false alarm)
- Declare DR incident, assign incident manager
- Check DB replication lag before switching
- If warm/hot: promote DB replica (pg_promote())
- Update DNS (Route 53 / Cloudflare) to DR addresses
- Verify functionality via DR Site
- Notify team and, if needed, users
- Record RTO
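The steps above can be sketched as an ordered list of checks that stops at the first failure and records the elapsed time. This is an illustrative skeleton only: `run_runbook` and the step callables are hypothetical placeholders, and a real runbook would wrap the actual commands for each step.

```python
import time
from typing import Callable, List, Tuple

# A step is a name plus a callable that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_runbook(steps: List[Step],
                clock: Callable[[], float] = time.monotonic
                ) -> Tuple[List[str], float]:
    """Run steps in order and stop at the first one that fails.

    Returns the names of the completed steps and the elapsed seconds,
    which serves as the measured RTO when every step completes.
    """
    started = clock()
    completed: List[str] = []
    for name, action in steps:
        if not action():
            break  # do not continue past a failed step
        completed.append(name)
    return completed, clock() - started
```

Stopping at the first failure matters here because later steps depend on earlier ones: switching DNS before the database replica has been promoted would send users to a read-only site.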
Network Connectivity
A dedicated channel between the primary site and the DR site is needed for data replication:
- AWS VPC Peering or Transit Gateway (within AWS)
- AWS Direct Connect / GCP Interconnect (on-premises to cloud)
- Site-to-site VPN (budget option, less reliable)
The replication channel must be isolated from user traffic so that peak application load does not affect replication.
DR Site Cost
| Type | Ongoing Cost | Example (AWS) |
|---|---|---|
| Cold Standby | Storage + replication | $50-200/month |
| Warm Standby | ~30% of prod | $500-2000/month |
| Hot Standby | ~80-100% of prod | $2000-8000/month |
For most web applications, warm standby is the best fit: reasonable cost with an RTO of 15-60 minutes.
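The percentages in the table translate into a quick estimate from current production spend. The sketch below is arithmetic only, with hypothetical names; cold standby is left out because its cost depends on storage and replication volume rather than on production compute.

```python
from typing import Tuple

# Ongoing DR cost as a percentage of production spend (low, high),
# taken from the table above.
COST_PERCENT = {
    "warm": (30, 30),
    "hot": (80, 100),
}


def dr_cost_range(prod_monthly_usd: float, tier: str) -> Tuple[float, float]:
    """Estimate the monthly DR cost range for a warm or hot standby."""
    low_pct, high_pct = COST_PERCENT[tier]
    return (prod_monthly_usd * low_pct / 100,
            prod_monthly_usd * high_pct / 100)
```

For a production environment costing $5,000 per month, `dr_cost_range(5000, "warm")` gives about $1,500, while `dr_cost_range(5000, "hot")` gives $4,000 to $5,000. The estimate is only as accurate as the rough percentages it is built on.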
Implementation Timeline
- Analyze current infrastructure and choose strategy — 2-3 days
- Configure data replication — 3-7 days
- Deploy DR infrastructure in IaC — 5-10 days
- Network connectivity and security — 2-5 days
- Procedures, runbook, testing — 3-5 days
Total: 2-5 weeks depending on infrastructure complexity and DR type.







