Setting Up RTO and RPO for Critical Systems
RTO (Recovery Time Objective) is the maximum acceptable downtime after a failure. RPO (Recovery Point Objective) is the maximum acceptable data loss, expressed as a window of time. Together these two parameters determine the entire backup architecture: the lower the values, the more expensive the infrastructure.
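To make the two definitions concrete, here is a minimal sketch that measures one incident against both targets. The function name and all timestamps are hypothetical and serve only as illustration.

```python
from datetime import datetime, timedelta


def check_objectives(
    failure_at: datetime,
    restored_at: datetime,
    last_recoverable_at: datetime,
    rto: timedelta,
    rpo: timedelta,
) -> dict:
    """Compare a single incident against RTO/RPO targets."""
    downtime = restored_at - failure_at            # the quantity RTO limits
    data_loss = failure_at - last_recoverable_at   # the quantity RPO limits
    return {
        "downtime": downtime,
        "data_loss_window": data_loss,
        "rto_met": downtime <= rto,
        "rpo_met": data_loss <= rpo,
    }


# Hypothetical incident: failure at 12:00, service restored at 12:40,
# last recoverable transaction committed at 11:57.
result = check_objectives(
    failure_at=datetime(2024, 1, 1, 12, 0),
    restored_at=datetime(2024, 1, 1, 12, 40),
    last_recoverable_at=datetime(2024, 1, 1, 11, 57),
    rto=timedelta(hours=1),
    rpo=timedelta(minutes=5),
)
print(result["rto_met"], result["rpo_met"])  # True True
```

Here 40 minutes of downtime fits a 1-hour RTO, and 3 minutes of lost transactions fits a 5-minute RPO.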
Cost dependency on RTO/RPO
| RTO | RPO | Architecture | Approx. Cost |
|---|---|---|---|
| 24h | 24h | Daily backup to S3 | $50–200/month |
| 4h | 1h | Hourly backup + hot standby | $300–800/month |
| 1h | 15min | Streaming replication + autofailover | $800–2000/month |
| 15min | 5min | Patroni + WAL archiving + active standby | $2000–5000/month |
| 5min | 0 | Multi-region active-active | $8000+/month |
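The table can be read as a lookup: walk the tiers from cheapest to most expensive and stop at the first one whose RTO and RPO are both within the requirement. The sketch below encodes the rows as data; the `Tier` class and `cheapest_tier` function are hypothetical names, and the cost strings are copied from the table as approximations.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Tier:
    rto: timedelta
    rpo: timedelta
    architecture: str
    monthly_cost_usd: str  # approximate range, as given in the table


# Rows of the table, ordered from cheapest to most expensive.
TIERS = [
    Tier(timedelta(hours=24), timedelta(hours=24), "Daily backup to S3", "$50–200"),
    Tier(timedelta(hours=4), timedelta(hours=1), "Hourly backup + hot standby", "$300–800"),
    Tier(timedelta(hours=1), timedelta(minutes=15),
         "Streaming replication + autofailover", "$800–2000"),
    Tier(timedelta(minutes=15), timedelta(minutes=5),
         "Patroni + WAL archiving + active standby", "$2000–5000"),
    Tier(timedelta(minutes=5), timedelta(0), "Multi-region active-active", "$8000+"),
]


def cheapest_tier(required_rto: timedelta, required_rpo: timedelta) -> Optional[Tier]:
    """Return the first (cheapest) tier that satisfies both requirements."""
    for tier in TIERS:
        if tier.rto <= required_rto and tier.rpo <= required_rpo:
            return tier
    return None  # no listed tier is strict enough


choice = cheapest_tier(timedelta(hours=2), timedelta(minutes=30))
print(choice.architecture)  # Streaming replication + autofailover
```

A requirement of RTO 2h and RPO 30min skips the first two rows (their RTO is too long) and lands on streaming replication with autofailover.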
Determine business requirements
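# Calculate downtime cost to define acceptable RTO
class RtoCalculator:
    # Flat hourly estimate for the engineers handling the recovery (USD)
    RECOVERY_LABOR_PER_HOUR = 500

    def calculate_downtime_cost(
        self,
        hourly_revenue: float,
        customer_churn_per_hour: float,  # percent of customers lost per hour of downtime
        penalty_per_sla_violation: float,  # treated as one violation per hour of downtime
        avg_customer_lifetime_value: float,
        total_customers: int,
    ) -> dict:
        costs_per_hour = {
            'lost_revenue': hourly_revenue,
            'customer_churn': (customer_churn_per_hour / 100) * total_customers
                              * avg_customer_lifetime_value,
            'sla_penalties': penalty_per_sla_violation,
            'recovery_labor': self.RECOVERY_LABOR_PER_HOUR,
        }
        total_per_hour = sum(costs_per_hour.values())
        return {
            'cost_per_hour': total_per_hour,
            'recommended_rto': self._recommend_rto(total_per_hour),
        }

    def _recommend_rto(self, cost_per_hour: float) -> str:
        # Illustrative thresholds (assumed, USD per hour); tune them to the business
        if cost_per_hour >= 10_000:
            return '15min'
        if cost_per_hour >= 1_000:
            return '1h'
        if cost_per_hour >= 100:
            return '4h'
        return '24h'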
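One way to turn an hourly downtime cost into an RTO decision is to multiply it by each candidate RTO, which gives the worst-case cost of a single outage that lasts the full recovery window. The sketch below assumes a hypothetical cost of $5,000 per hour and takes the candidate RTO values from the cost table; it is an illustration, not a pricing model.

```python
# Candidate RTO values in hours, taken from the cost table.
CANDIDATE_RTO_HOURS = {
    "24h": 24.0,
    "4h": 4.0,
    "1h": 1.0,
    "15min": 0.25,
    "5min": 5 / 60,
}


def worst_case_outage_cost(cost_per_hour: float) -> dict:
    """Cost of one outage lasting exactly the RTO, for each candidate RTO."""
    return {
        label: round(cost_per_hour * hours, 2)
        for label, hours in CANDIDATE_RTO_HOURS.items()
    }


# Hypothetical input: every hour of downtime costs $5,000.
costs = worst_case_outage_cost(5000.0)
for label, cost in costs.items():
    print(f"RTO {label:>5}: up to ${cost:,.2f} per outage")
```

Comparing these figures with the monthly infrastructure cost of each tier shows where a stricter RTO stops paying for itself.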
PostgreSQL configuration for RPO = 5 minutes
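# postgresql.conf: WAL archiving for PITR
wal_level = replica
archive_mode = on
archive_command = 'pgbackrest --stanza=main archive-push %p'
# Force a WAL segment switch at least every 5 minutes so that a quiet
# server still archives within the RPO window
archive_timeout = 5min
# Checkpoint frequency (affects crash-recovery time, not the RPO)
checkpoint_timeout = 5min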
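With WAL archiving, the interval between archived segments is bounded by `archive_timeout`, which forces a switch to a new segment after the given time; `checkpoint_timeout` does not do this. The following is a minimal sketch of a pre-deployment check that reads `postgresql.conf` text and reports settings that would break a given RPO. The parser is deliberately simple (it does not handle `#` inside quoted values), and the function names are hypothetical.

```python
import re

# Settings that WAL archiving for PITR depends on, with acceptable values.
REQUIRED = {"wal_level": {"replica", "logical"}, "archive_mode": {"on", "always"}}
UNIT_SECONDS = {"ms": 0.001, "s": 1, "min": 60, "h": 3600, "d": 86400}


def parse_conf(text: str) -> dict:
    """Parse 'key = value' lines, ignoring comments and blank lines."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip("'")
    return settings


def to_seconds(value: str) -> float:
    """Convert a duration such as '5min' or '300' to seconds."""
    match = re.fullmatch(r"(\d+)\s*(ms|s|min|h|d)?", value)
    if match is None:
        raise ValueError(f"unrecognised duration: {value!r}")
    number, unit = match.groups()
    return int(number) * UNIT_SECONDS[unit or "s"]  # bare numbers are seconds


def check_for_rpo(text: str, rpo_seconds: int) -> list:
    """Return a list of human-readable problems; empty means the check passed."""
    settings = parse_conf(text)
    problems = []
    for key, allowed in REQUIRED.items():
        if settings.get(key) not in allowed:
            problems.append(f"{key} should be one of {sorted(allowed)}")
    if not settings.get("archive_command"):
        problems.append("archive_command is not set")
    timeout = to_seconds(settings.get("archive_timeout", "0"))
    if timeout == 0 or timeout > rpo_seconds:
        problems.append("archive_timeout is unset or exceeds the RPO target")
    return problems


# Hypothetical config that omits archive_timeout.
SAMPLE = """
wal_level = replica
archive_mode = on
archive_command = 'pgbackrest --stanza=main archive-push %p'
checkpoint_timeout = 5min
"""
print(check_for_rpo(SAMPLE, rpo_seconds=300))
# ['archive_timeout is unset or exceeds the RPO target']
```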
Patroni: automatic failover (RTO of roughly 30 sec)
# /etc/patroni/patroni.yml
scope: postgres-cluster
namespace: /service/
name: pg-node-1
bootstrap:
  dcs:
    ttl: 30  # leader key TTL: failover starts after 30s without heartbeat
    maximum_lag_on_failover: 1048576  # bytes (1 MiB)
    postgresql:
      parameters:
        wal_level: replica
        hot_standby: "on"
        max_wal_senders: 10
        archive_mode: "on"
        archive_command: 'pgbackrest --stanza=main archive-push %p'
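The `maximum_lag_on_failover` setting limits which replicas may be promoted: a replica that is more than that many bytes behind is excluded from the election. The sketch below approximates that eligibility rule over a hypothetical payload shaped like the member list returned by Patroni's `GET /cluster` REST endpoint. It is a simplified illustration, not Patroni's actual election logic, and the sample values are invented.

```python
MAXIMUM_LAG_ON_FAILOVER = 1048576  # bytes, same value as in the config


def failover_candidates(cluster: dict, max_lag: int = MAXIMUM_LAG_ON_FAILOVER) -> list:
    """Names of healthy replicas whose lag is within the allowed limit."""
    candidates = []
    for member in cluster.get("members", []):
        if member.get("role") != "replica":
            continue  # the current leader is not a candidate
        if member.get("state") not in ("running", "streaming"):
            continue  # skip members that are stopped or still starting
        lag = member.get("lag")
        # Lag may be missing or non-numeric when it is not known.
        if isinstance(lag, int) and lag <= max_lag:
            candidates.append(member["name"])
    return candidates


# Hypothetical cluster state: pg-node-3 is 5 MiB behind the leader.
sample_cluster = {
    "members": [
        {"name": "pg-node-1", "role": "leader", "state": "running"},
        {"name": "pg-node-2", "role": "replica", "state": "running", "lag": 0},
        {"name": "pg-node-3", "role": "replica", "state": "running", "lag": 5242880},
    ]
}
print(failover_candidates(sample_cluster))  # ['pg-node-2']
```

A tighter limit reduces potential data loss on failover but also shrinks the pool of promotable replicas, so it trades RPO against availability.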
HAProxy: routing by role
# haproxy.cfg
frontend postgres_write
    mode tcp
    bind *:5432
    default_backend postgres_primary

backend postgres_primary
    mode tcp
    # Patroni REST API (port 8008) answers 200 only on the current primary
    option httpchk GET /master
    http-check expect status 200
    server pg-node-1 pg-node-1-ip:5432 check port 8008
    server pg-node-2 pg-node-2-ip:5432 check port 8008
    server pg-node-3 pg-node-3-ip:5432 check port 8008

frontend postgres_read
    mode tcp
    bind *:5433
    default_backend postgres_replicas

backend postgres_replicas
    mode tcp
    balance roundrobin
    # Answers 200 only on healthy replicas
    option httpchk GET /replica
    http-check expect status 200
    server pg-node-1 pg-node-1-ip:5432 check port 8008
    server pg-node-2 pg-node-2-ip:5432 check port 8008
    server pg-node-3 pg-node-3-ip:5432 check port 8008
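From the application's side, role-based routing reduces to choosing a port: 5432 reaches whichever node is currently primary, 5433 is balanced across replicas. A minimal sketch of that choice follows; the host name `haproxy.internal`, the database name, and the user are hypothetical placeholders.

```python
WRITE_PORT = 5432  # frontend postgres_write -> current primary
READ_PORT = 5433   # frontend postgres_read  -> replicas, round-robin


def build_dsn(host: str, dbname: str, user: str, read_only: bool = False) -> str:
    """Build a PostgreSQL connection URI that targets the right HAProxy frontend."""
    port = READ_PORT if read_only else WRITE_PORT
    return f"postgresql://{user}@{host}:{port}/{dbname}"


# Hypothetical application settings.
write_dsn = build_dsn("haproxy.internal", "app", "app_user")
read_dsn = build_dsn("haproxy.internal", "app", "app_user", read_only=True)
print(write_dsn)  # postgresql://app_user@haproxy.internal:5432/app
print(read_dsn)   # postgresql://app_user@haproxy.internal:5433/app
```

Because the application never names a specific node, a failover needs no configuration change on the client. Reads sent to the replica port can lag behind the primary, so queries that must see their own writes should use the write DSN.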
Monitoring RTO/RPO metrics
# Prometheus alerting for SLA violations
- alert: ReplicationLagCritical
  expr: postgresql_replication_lag_seconds > 300
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "RPO at risk: replica lag {{ $value }}s > 5min RPO target"
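The `for: 2m` clause means the alert does not fire on the first sample above the threshold: the condition has to hold at every evaluation for two minutes, and any sample at or below the threshold resets the timer. The sketch below emulates that behavior over a list of samples. It is a simplified model of the rule's semantics, and the sample data is invented.

```python
from typing import Iterable, Optional, Tuple

THRESHOLD_SECONDS = 300  # expr: lag > 300
FOR_SECONDS = 120        # for: 2m


def first_firing_time(samples: Iterable[Tuple[int, float]]) -> Optional[int]:
    """Return the timestamp at which the alert would fire, or None.

    samples: (timestamp_seconds, lag_seconds) pairs in chronological order,
    one per rule evaluation.
    """
    pending_since = None
    for timestamp, lag in samples:
        if lag > THRESHOLD_SECONDS:
            if pending_since is None:
                pending_since = timestamp   # alert enters the pending state
            if timestamp - pending_since >= FOR_SECONDS:
                return timestamp            # condition held for the full 2m
        else:
            pending_since = None            # condition cleared, timer resets
    return None


# Hypothetical evaluations every 30 seconds; lag dips below 300 at t=90.
samples = [
    (0, 100), (30, 320), (60, 350), (90, 280),
    (120, 310), (150, 330), (180, 340), (210, 360), (240, 400),
]
print(first_firing_time(samples))  # 240
```

In this trace the lag first exceeds the threshold at t=30, but the dip at t=90 resets the timer, so the alert fires only at t=240. By then the replica has been beyond the 5-minute RPO for two minutes, which is why a lower threshold may be preferable when earlier warning is needed.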
Timeline
Setting up Patroni + pgBackRest + HAProxy to achieve RTO < 30 min and RPO < 5 min takes 3–5 business days.







