Setting Up Automatic Failover for Primary Server Failure
Automatic failover switches traffic to a backup server without human intervention. The goal: reduce RTO (Recovery Time Objective) from "until someone wakes up" to 30-120 seconds. For e-commerce or SaaS, this is the difference between losing 5 minutes of revenue and losing an hour.
Failover Levels and Where Each Applies
DNS-level (Route 53 Health Checks, Cloudflare Failover). The simplest approach. A health check monitors the primary every 10-30 seconds; on failure, it switches the DNS record to the backup server's IP. Latency: TTL + detection time = 60-300 seconds. Suitable for most web applications.
Load Balancer (AWS ALB/NLB, nginx upstream). Health checks at balancer level, switching in 5-30 seconds. Requires both servers in the same cloud or region.
VRRP / Keepalived (bare metal / VPS). Virtual IP moves between servers on master failure. Switching in 2-5 seconds. Classic for on-premise and dedicated setups.
Database failover. Separate concern — application must know about new primary DB. Patroni (PostgreSQL), MHA (MySQL), AWS RDS Multi-AZ handle this automatically.
Implementation on AWS Route 53
Route 53 Failover Policy:
Primary record → 1.2.3.4 (primary server)
Health check: HTTPS GET /health, port 443
Failure threshold: 3 consecutive failures
Request interval: 10 seconds
Secondary record → 5.6.7.8 (backup server)
Evaluate target health: Yes
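The same record pair can be created programmatically. A hedged boto3 sketch: the helper below only builds the ChangeBatch payload (the hosted zone ID, domain, and health check ID are placeholders, and the function name is illustrative, not an AWS API):

```python
def failover_records(domain, primary_ip, secondary_ip, health_check_id):
    """Build a Route 53 ChangeBatch for a PRIMARY/SECONDARY failover pair."""
    def record(set_id, role, ip, hc=None):
        rrs = {
            "Name": domain, "Type": "A", "TTL": 60,
            "SetIdentifier": set_id, "Failover": role,
            "ResourceRecords": [{"Value": ip}],
        }
        if hc:
            # Only the primary record is gated by the health check
            rrs["HealthCheckId"] = hc
        return {"Action": "UPSERT", "ResourceRecordSet": rrs}

    return {"Changes": [
        record("primary", "PRIMARY", primary_ip, health_check_id),
        record("secondary", "SECONDARY", secondary_ip),
    ]}

# Applying it (requires AWS credentials; zone ID is a placeholder):
# import boto3
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="Z0000000EXAMPLE",
#     ChangeBatch=failover_records("app.example.com.", "1.2.3.4", "5.6.7.8", hc_id))
```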
The /health endpoint in the application should check real state: DB accessible, cache responding, disk space not exhausted. Return 200 only when the node is fully operational.
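A minimal sketch of such an endpoint using only Python's standard library (check_db and check_cache are stubs you would replace with real probes, e.g. a SELECT 1 against the database and a Redis PING):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import shutil

def check_db():
    return True   # stub: replace with a real probe, e.g. SELECT 1

def check_cache():
    return True   # stub: replace with a real probe, e.g. Redis PING

def check_disk(path="/", max_used=0.90):
    # Disk space not exhausted: less than 90% of the volume used
    total, used, _free = shutil.disk_usage(path)
    return used / total < max_used

def healthy():
    return check_db() and check_cache() and check_disk()

class Health(BaseHTTPRequestHandler):
    def do_GET(self):
        # 200 only when every dependency is operational; 503 otherwise
        self.send_response(200 if self.path == "/health" and healthy() else 503)
        self.end_headers()

# To serve: HTTPServer(("0.0.0.0", 8080), Health).serve_forever()
```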
Keepalived for Bare Metal / VPS
# /etc/keepalived/keepalived.conf on PRIMARY
vrrp_script check_app {
    script "/usr/local/bin/check_app.sh"
    interval 5
    weight -20
    fall 2
    rise 2
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        192.168.1.100/24
    }
    track_script {
        check_app
    }
}
The check_app.sh script verifies application availability locally. After two consecutive failed checks (fall 2), the weight -20 drops the master's priority from 100 to 80, below the BACKUP server's 90, and the backup takes over the virtual IP.
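For completeness, a sketch of the BACKUP side: only the role and priority differ, and virtual_router_id must match the master (the interface name is an assumption about your setup):

```
# /etc/keepalived/keepalived.conf on BACKUP
vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51    # must match the MASTER
    priority 90             # below the master's 100, above its degraded 80
    advert_int 1
    virtual_ipaddress {
        192.168.1.100/24
    }
}
```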
Data Synchronization Between Servers
Failover is meaningless without current data on the backup server:
- Database: master-slave replication (PostgreSQL streaming replication, MySQL GTID replication). Monitor replication lag and alert if it exceeds 30 seconds
- Files: lsyncd (realtime rsync) or S3-compatible storage as shared point
- Sessions: Redis with replication or sticky sessions through balancer
- Configuration: Ansible pull from shared git repository
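The 30-second lag alert can be sketched as a small helper (the function names are illustrative; obtaining the last-replayed timestamp, e.g. from pg_last_xact_replay_timestamp() on a PostgreSQL replica, is left to the monitoring agent):

```python
from datetime import datetime, timedelta, timezone

LAG_THRESHOLD = timedelta(seconds=30)

def replication_lag(last_replayed_at, now=None):
    """Time elapsed since the replica last replayed a transaction."""
    return (now or datetime.now(timezone.utc)) - last_replayed_at

def should_alert(last_replayed_at, now=None):
    """True when replication lag exceeds the 30-second threshold."""
    return replication_lag(last_replayed_at, now) > LAG_THRESHOLD
```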
Testing Failover
Regular drills are mandatory. Failover that hasn't been tested is failover that won't work when needed.
Check protocol:
- Verify monitoring captures baseline state
- Simulate failure: systemctl stop nginx or iptables -I INPUT -p tcp --dport 80 -j DROP on the primary
- Record time until switchover
- Verify functionality through backup server
- Restore primary, verify switchback
Target metrics: detection time < 30s, switch time < 60s, total RTO < 120s.
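The "record time until switchover" step can be automated with a small polling loop (the URL and probe below are placeholders for your own public endpoint):

```python
import time
import urllib.request

def probe(url="https://example.com/health", timeout=2):
    """Placeholder probe: adapt the URL to your service's public endpoint."""
    try:
        return urllib.request.urlopen(url, timeout=timeout).status == 200
    except OSError:
        return False

def time_to_recovery(check=probe, limit=120, interval=1.0, sleep=time.sleep):
    """Seconds until check() succeeds again; None if the RTO budget is blown."""
    start = time.monotonic()
    while time.monotonic() - start < limit:
        if check():
            return time.monotonic() - start
        sleep(interval)
    return None

# Run it right after simulating the failure and log the result
# against the < 60s switchover target.
```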
"Split-brain" State and How to Avoid It
Issue: both servers think they're primary. In Keepalived, solved through fencing (STONITH) — on conflict, weaker node is forcibly shut down. In PostgreSQL/Patroni — through DCS (etcd, Consul, ZooKeeper) as arbiter.
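The arbiter idea reduces to a single-writer lease. A toy illustration (not Patroni's actual API): a node may act as primary only while it holds the lease in the DCS, so two nodes can never both hold it:

```python
import threading

class Arbiter:
    """Single-writer lease, standing in for etcd/Consul/ZooKeeper."""
    def __init__(self):
        self._lock = threading.Lock()
        self.holder = None

    def acquire(self, node):
        # Grant the lease only if it is free or already ours
        with self._lock:
            if self.holder in (None, node):
                self.holder = node
                return True
            return False

    def release(self, node):
        # Only the current holder may give the lease up
        with self._lock:
            if self.holder == node:
                self.holder = None

dcs = Arbiter()
assert dcs.acquire("node-a")        # node-a becomes primary
assert not dcs.acquire("node-b")    # node-b must stay standby: no split-brain
dcs.release("node-a")
assert dcs.acquire("node-b")        # failover: node-b takes over
```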
Setup Timeline
- DNS failover (Route 53 / Cloudflare) — 1-2 days
- Keepalived + data synchronization — 3-5 days
- Full scheme with DB failover (Patroni) — 5-10 days
- Testing and documentation — 1-2 days
Monitoring Failover Events
Each switchover is an incident requiring investigation. Alertmanager or PagerDuty captures the event, and a ticket is created automatically in Jira/Linear. Post-incident, run a root cause analysis: why did the primary fail?