Blockchain Node High Availability Setup
A single node is a single point of failure. For a production service that depends on blockchain data (a dApp, payment processor, or trading bot), node downtime means product downtime. High availability (HA) is more than "run two nodes": it is a deliberate architecture with failover and health checking, built on an understanding of exactly what can break and how.
Typical Reasons for Node Unavailability
Before building HA, you need to understand what you're protecting against:
- Node lagging behind the chain tip (Ethereum: resync after a crash; Solana: slot lag > 100)
- RPC overload: a single instance can't handle the request load
- Client update: the node is unavailable while the client software is upgraded and restarted
- Hardware failure: disk, RAM, network card
- Snapshot corruption after an unexpected power loss
Architecture: Active-Active Behind Load Balancer
The most practical scheme for RPC nodes:
    Client requests
           │
  ┌────────▼────────┐
  │ HAProxy / Nginx │  ← health check every 5s
  └────────┬────────┘
           │
      ┌────┴────┐
      ▼         ▼
   Node-1    Node-2   ← different AZ / datacenters
      │         │
      └────┬────┘
           │
       Shared or
  independent storage
Active-active is a better fit than active-passive for RPC: both nodes take traffic, the load is distributed, and failover requires no secondary promotion. The balancer simply stops routing to a node once its health checks fail.
HAProxy Configuration for Ethereum RPC
# /etc/haproxy/haproxy.cfg
global
    maxconn 50000
    log stdout format raw daemon
defaults
    mode http
    timeout connect 5s
    timeout client 60s
    timeout server 60s
    timeout tunnel 1h    # keeps quiet WebSocket subscriptions open after the upgrade
    option http-server-close
    option forwardfor
frontend ethereum_rpc
    bind *:8545
    default_backend ethereum_nodes
backend ethereum_nodes
    balance leastconn
    # JSON-RPC health check; http-check send requires HAProxy 2.2+ and sets Content-Length itself
    option httpchk
    http-check send meth POST uri / ver HTTP/1.1 hdr Host localhost hdr Content-Type application/json body '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}'
    http-check expect string '"result":false'    # node is in sync when eth_syncing returns false
    server node1 10.0.1.10:8545 check inter 5s fall 2 rise 3
    server node2 10.0.1.11:8545 check inter 5s fall 2 rise 3
frontend ethereum_ws
    bind *:8546    # WebSocket
    default_backend ethereum_ws_nodes
backend ethereum_ws_nodes
    balance source    # WebSocket: pick the node by source IP
    # Sticky sessions for WebSocket (a subscription can't move to another node)
    stick-table type ip size 100k expire 30m
    stick on src
    server node1 10.0.1.10:8546 check inter 10s fall 2 rise 3
    server node2 10.0.1.11:8546 check inter 10s fall 2 rise 3
Critical point for WebSocket: subscriptions (eth_subscribe, Solana slotSubscribe) live on a stateful connection. After a failover, the WebSocket client must reconnect and re-create its subscriptions. In the load balancer, use sticky sessions by source IP so that a client always reaches the same node for as long as that node is alive.
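The resubscribe-on-reconnect behavior can be sketched as a small client loop. This is a minimal illustration, not a production client: it assumes the third-party websocat CLI is installed, the balancer address `ws://rpc.example.internal:8546` is hypothetical, and `next_backoff` and `run_subscriber` are names invented for this example.

```shell
#!/bin/bash
# Sketch: re-create an eth_subscribe subscription after every reconnect.
# Assumptions: websocat is installed; WS_URL is a hypothetical balancer address.
WS_URL="${WS_URL:-ws://rpc.example.internal:8546}"
SUBSCRIBE='{"jsonrpc":"2.0","id":1,"method":"eth_subscribe","params":["newHeads"]}'

# Double the reconnect delay, capped at a maximum (both in seconds).
next_backoff() {
  local current=$1 max=$2
  local next=$((current * 2))
  if [ "$next" -gt "$max" ]; then
    next=$max
  fi
  echo "$next"
}

# Connect, subscribe, stream notifications; on disconnect, wait and repeat.
run_subscriber() {
  local delay=1 started
  while true; do
    started=$SECONDS
    # -n keeps the socket open after stdin reaches EOF, so notifications keep arriving.
    echo "$SUBSCRIBE" | websocat -n "$WS_URL"
    # A connection that lasted more than a minute counts as healthy: reset the delay.
    if [ $((SECONDS - started)) -gt 60 ]; then
      delay=1
    fi
    echo "Connection lost, resubscribing in ${delay}s..." >&2
    sleep "$delay"
    delay=$(next_backoff "$delay" 30)
  done
}
```

Calling `run_subscriber` starts the loop. Events emitted while the client was disconnected are not replayed by the new subscription, so a real client would likely also backfill the gap with regular RPC calls.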
Health Check: What to Check
A standard HTTP health check (status 200) is insufficient: a node can respond to HTTP while being 1000 blocks behind the tip. A correct check looks at both sync state and block age:
#!/bin/bash
# /etc/haproxy/scripts/check_eth_node.sh [host]
# Returns 0 if the node is healthy, 1 if not. Host defaults to localhost.
NODE_URL="http://${1:-localhost}:8545"
MAX_BLOCK_AGE=180  # seconds
# Send a JSON-RPC request body to the node and print the response
rpc() {
  curl -sf --max-time 5 -X POST "$NODE_URL" \
    -H "Content-Type: application/json" \
    -d "$1"
}
# 1. Check that the node is not syncing (a failed request also counts as unhealthy)
SYNCING=$(rpc '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}' | jq -r '.result')
if [ "$SYNCING" != "false" ]; then
  exit 1
fi
# 2. Check that the latest block is no older than 3 minutes (180 seconds)
BLOCK_HEX=$(rpc '{"jsonrpc":"2.0","method":"eth_getBlockByNumber","params":["latest",false],"id":1}' | jq -r '.result.timestamp // empty')
# A missing or malformed timestamp means the request failed
if [[ ! "$BLOCK_HEX" =~ ^0x[0-9a-fA-F]+$ ]]; then
  exit 1
fi
BLOCK_TIME=$((16#${BLOCK_HEX#0x}))
NOW=$(date +%s)
AGE=$((NOW - BLOCK_TIME))
if [ "$AGE" -gt "$MAX_BLOCK_AGE" ]; then
  exit 1
fi
exit 0
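Similar logic applies to Solana: compare the node's getSlot with the slot reported by a trusted reference endpoint (for example absoluteSlot from its getEpochInfo), with a tolerance of 50–100 slots.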
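One way to implement the slot comparison is to query getSlot on both the local node and a reference endpoint. The sketch below makes several assumptions: the node serves RPC on the default port 8899, the public mainnet endpoint is an acceptable reference, and the function names (`get_slot`, `slot_lag_ok`, `check_solana_node`) are invented for this example. Solana nodes also expose a getHealth method that performs a similar lag check internally, which may be enough on its own.

```shell
#!/bin/bash
# Sketch of a Solana health check: local slot vs. a reference endpoint.
# Assumptions: REFERENCE_URL is a trusted RPC of your choice; 8899 is the default RPC port.
NODE_URL="${NODE_URL:-http://localhost:8899}"
REFERENCE_URL="${REFERENCE_URL:-https://api.mainnet-beta.solana.com}"
MAX_SLOT_LAG=100

# Print the current slot reported by an RPC endpoint (empty output on failure).
get_slot() {
  curl -sf --max-time 5 -X POST "$1" \
    -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","id":1,"method":"getSlot"}' | jq -r '.result // empty'
}

# Succeed when the local slot is within the tolerance of the reference slot.
slot_lag_ok() {
  local local_slot=$1 reference_slot=$2 tolerance=$3
  [ $((reference_slot - local_slot)) -le "$tolerance" ]
}

# Exit status 0 = healthy, 1 = lagging or unreachable.
check_solana_node() {
  local local_slot reference_slot
  local_slot=$(get_slot "$NODE_URL")
  reference_slot=$(get_slot "$REFERENCE_URL")
  [ -n "$local_slot" ] && [ -n "$reference_slot" ] || return 1
  slot_lag_ok "$local_slot" "$reference_slot" "$MAX_SLOT_LAG"
}
```

Note that an unreachable reference endpoint marks the node unhealthy here; a production check might prefer to fail open in that case.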
Rolling Update Without Downtime
Client updates are the most common reason for planned downtime. With HA, nodes are updated one at a time while the other keeps serving traffic:
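#!/bin/bash
# rolling_update.sh
set -euo pipefail
PIDFILE=/var/run/haproxy.pid
# Gracefully reload HAProxy with the given config; old processes finish their connections.
# $(cat ...) is left unquoted on purpose: the pid file may list several PIDs.
reload_haproxy() {
  haproxy -D -p "$PIDFILE" -f "$1" -sf $(cat "$PIDFILE")
}
# Upgrade geth on a host, then block until it passes the health check
update_node() {
  ssh "$1" "systemctl stop geth && apt-get update && apt-get install -y --only-upgrade ethereum && systemctl start geth"
  until /etc/haproxy/scripts/check_eth_node.sh "$1"; do
    echo "Waiting for $1 to sync..."
    sleep 30
  done
}
# Step 1: Remove node1 from rotation
reload_haproxy /etc/haproxy/haproxy_node2_only.cfg
# Step 2: Wait for existing connections to drain
sleep 30
# Step 3: Update node1 and wait for it to sync
update_node node1
# Step 4: Return node1 and remove node2 from rotation
# (haproxy_node1_only.cfg is assumed to exist and mirror haproxy_node2_only.cfg)
reload_haproxy /etc/haproxy/haproxy_node1_only.cfg
sleep 30
# Step 5: Update node2, then restore the full configuration
update_node node2
reload_haproxy /etc/haproxy/haproxy.cfg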
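As an alternative to swapping configuration files, a node can be taken out of rotation through the HAProxy runtime API without reloading the process. The sketch below assumes that the global section of haproxy.cfg contains `stats socket /var/run/haproxy.sock mode 600 level admin` (not part of the configuration shown earlier) and that socat is installed; the helper names are illustrative.

```shell
#!/bin/bash
# Sketch: drain and restore a backend server through the HAProxy runtime API.
# Assumptions: haproxy.cfg has "stats socket /var/run/haproxy.sock mode 600 level admin"
# in its global section, and socat is installed.
HAPROXY_SOCKET="${HAPROXY_SOCKET:-/var/run/haproxy.sock}"

# Build the runtime API command that sets a server's state (ready, drain or maint).
server_state_cmd() {
  local backend=$1 server=$2 state=$3
  echo "set server ${backend}/${server} state ${state}"
}

# Send one command to the runtime API socket and print the reply.
haproxy_cmd() {
  echo "$1" | socat stdio "$HAPROXY_SOCKET"
}

# Remove a server from load balancing; established connections are left to finish.
drain_server() {
  haproxy_cmd "$(server_state_cmd "$1" "$2" drain)"
}

# Put a server back into rotation.
enable_server() {
  haproxy_cmd "$(server_state_cmd "$1" "$2" ready)"
}
```

For example, `drain_server ethereum_nodes node1` followed later by `enable_server ethereum_nodes node1` would bracket the update of node1; the WebSocket backend would need the same treatment.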
Monitoring and Alerts
Prometheus + Grafana is the standard stack. Key metrics:
| Metric | Alert Threshold | Criticality |
|---|---|---|
| eth_block_age_seconds | > 120s | Critical |
| haproxy_backend_active_servers | < 1 | Critical |
| haproxy_backend_response_time_ms | > 2000ms | Warning |
| node_disk_io_time_percent | > 80% | Warning |
| node_memory_available_bytes | < 10% | Warning |
Alerts go to PagerDuty or Telegram. For haproxy_backend_active_servers < 1 (all nodes down), wake the on-call engineer immediately.
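A metric like eth_block_age_seconds is not exported by the node itself, so something has to produce it. One possible approach, sketched below, is a small script that writes the value for the node_exporter textfile collector. It assumes node_exporter runs with `--collector.textfile.directory` pointing at the directory used here; the directory path and function names are illustrative.

```shell
#!/bin/bash
# Sketch: export eth_block_age_seconds via the node_exporter textfile collector.
# Assumptions: node_exporter runs with --collector.textfile.directory=$TEXTFILE_DIR;
# the directory path and NODE_URL below are illustrative.
NODE_URL="${NODE_URL:-http://localhost:8545}"
TEXTFILE_DIR="${TEXTFILE_DIR:-/var/lib/node_exporter/textfile}"

# Turn a hex block timestamp and the current Unix time into a Prometheus sample line.
format_block_age_metric() {
  local block_hex=$1 now=$2
  local block_time=$((16#${block_hex#0x}))
  echo "eth_block_age_seconds $((now - block_time))"
}

# Fetch the latest block timestamp and write the metric file atomically.
write_block_age_metric() {
  local block_hex tmp
  block_hex=$(curl -sf --max-time 5 -X POST "$NODE_URL" \
    -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","method":"eth_getBlockByNumber","params":["latest",false],"id":1}' |
    jq -r '.result.timestamp // empty')
  [ -n "$block_hex" ] || return 1
  # Write to a temp file first so the collector never reads a partial file.
  tmp=$(mktemp "$TEXTFILE_DIR/eth_block_age.XXXXXX")
  format_block_age_metric "$block_hex" "$(date +%s)" > "$tmp"
  chmod 644 "$tmp"
  mv "$tmp" "$TEXTFILE_DIR/eth_block_age.prom"
}
```

Running `write_block_age_metric` from cron or a systemd timer keeps the value fresh. If the node stops answering, the file is left untouched, so a staleness alert on the file itself would be a sensible companion.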
What's Included
- Deployment of a second node in a separate AZ/datacenter
- HAProxy or Nginx setup with sync-aware health checks
- Rolling update scripts for zero-downtime upgrades
- Prometheus metrics, Grafana dashboard, alerts
- Documentation of failover and recovery procedures







