Blockchain Node High-Availability Setup

Medium

~3-5 days

A single node is a single point of failure. For a production service that depends on blockchain data (dApp, payment processor, trading bot), node downtime means product downtime. High availability (HA) is more than "run two nodes": it is a deliberate architecture with failover, health checking, and a clear understanding of what can break and how.

Typical Reasons for Node Unavailability

Before building HA, you need to understand what we're protecting against:

Tip lag: the node falls behind the chain tip (Ethereum: resync after a crash; Solana: slot lag > 100)
RPC overload: one instance can't handle the request load
Client update: the node is unavailable while its client is upgraded and restarted
Hardware failure: disk, RAM, network card
Snapshot corruption: typically after an unexpected power loss

Architecture: Active-Active Behind Load Balancer

The most practical scheme for RPC nodes:

     Client requests
            │
  ┌─────────▼─────────┐
  │  HAProxy / Nginx  │   ← health check every 5s
  └─────────┬─────────┘
            │
       ┌────┴────┐
       ▼         ▼
    Node-1    Node-2        ← different AZs / datacenters
       │         │
       └────┬────┘
            │
        Shared or
   independent storage

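Active-active suits RPC better than active-passive: both nodes take traffic, the load is distributed, and failover needs no promotion of a secondary. A failed node simply drops out of rotation once its health checks fail (about 10 seconds with the inter 5s fall 2 settings used below).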

HAProxy Configuration for Ethereum RPC

# /etc/haproxy/haproxy.cfg
global
    maxconn 50000
    log stdout format raw daemon

defaults
    mode http
    timeout connect 5s
    timeout client 60s
    timeout server 60s
    option http-server-close
    option forwardfor

frontend ethereum_rpc
    bind *:8545   # HTTP JSON-RPC only; port 8546 is bound by the ethereum_ws frontend
    default_backend ethereum_nodes

backend ethereum_nodes
    balance leastconn
    # Health check: POST eth_syncing. http-check send requires HAProxy 2.2+
    # and adds the Content-Length header automatically.
    option httpchk
    http-check send meth POST uri / hdr Content-Type application/json body '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}'
    http-check expect string '"result":false'   # node in sync if eth_syncing = false

    server node1 10.0.1.10:8545 check inter 5s fall 2 rise 3
    server node2 10.0.1.11:8545 check inter 5s fall 2 rise 3

    # No stick-table here: JSON-RPC over HTTP is stateless, and stickiness
    # would defeat leastconn. WebSocket stickiness lives in ethereum_ws_nodes.

frontend ethereum_ws
    bind *:8546
    default_backend ethereum_ws_nodes

backend ethereum_ws_nodes
    balance source        # WebSocket: pick the node by source IP for stickiness
    timeout tunnel 1h     # keep idle upgraded connections open beyond the 60s defaults
    server node1 10.0.1.10:8546 check inter 10s fall 2 rise 3
    server node2 10.0.1.11:8546 check inter 10s fall 2 rise 3

Critical point for WebSocket: subscriptions (eth_subscribe, Solana slotSubscribe) live on a stateful connection. On failover, the WebSocket client must reconnect and recreate its subscriptions. On the load balancer, use sticky sessions by source IP so that a client stays on one node for as long as that node is alive.

Health Check: What to Check

A standard HTTP health check (status 200) is insufficient: a node can answer HTTP requests while being 1000 blocks behind the tip. A correct check:

#!/bin/bash
# /etc/haproxy/scripts/check_eth_node.sh [host]
# Returns 0 if the node is healthy, 1 if not

# Optional first argument: host to check (defaults to localhost)
NODE_HOST="${1:-localhost}"
NODE_URL="http://${NODE_HOST}:8545"

# POST a JSON-RPC request; give up after 5 seconds so the check cannot hang
rpc() {
  curl -sf --max-time 5 -X POST "$NODE_URL" \
    -H "Content-Type: application/json" \
    -d "$1"
}

# 1. Check that the node is not syncing
SYNCING=$(rpc '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}' | \
  jq -r '.result')

if [ "$SYNCING" != "false" ]; then
  exit 1
fi

# 2. Check that the latest block is not older than 3 minutes (180 seconds)
BLOCK_TS_HEX=$(rpc '{"jsonrpc":"2.0","method":"eth_getBlockByNumber","params":["latest",false],"id":1}' | \
  jq -r '.result.timestamp')

# An empty or non-hex value means the request failed
if ! [[ "$BLOCK_TS_HEX" =~ ^0x[0-9a-fA-F]+$ ]]; then
  exit 1
fi

BLOCK_TIME=$((16#${BLOCK_TS_HEX#0x}))
NOW=$(date +%s)
AGE=$((NOW - BLOCK_TIME))

if [ "$AGE" -gt 180 ]; then
  exit 1
fi

exit 0
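
The script above lives under /etc/haproxy/scripts/, which suggests running it as an HAProxy external check. As far as the HAProxy manual describes it, an external check (enabled with `external-check` in the global section, plus `option external-check` and `external-check command` in the backend) is invoked with four positional arguments: proxy address, proxy port, server address, server port. A minimal sketch of a wrapper that turns those arguments into a node URL follows; the wrapper name and the function `node_url_from_check_args` are hypothetical, not part of the original setup.

```shell
#!/bin/bash
# external_check_wrapper.sh (hypothetical name)
# HAProxy runs external checks as:
#   <command> <proxy_address> <proxy_port> <server_address> <server_port>

# node_url_from_check_args: build the node URL from HAProxy's four arguments.
# Returns 1 if fewer than four arguments were passed.
node_url_from_check_args() {
  [ "$#" -ge 4 ] || return 1
  printf 'http://%s:%s\n' "$3" "$4"
}

# Example wiring (commented out so the sketch has no side effects):
# NODE_URL=$(node_url_from_check_args "$@") || exit 1
```

With the arguments `0.0.0.0 8545 10.0.1.10 8545`, the function prints `http://10.0.1.10:8545`, which can then replace a hard-coded node URL in the check.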

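Similar logic applies to Solana: compare the node's getSlot with the slot reported by a trusted reference endpoint (getSlot, or absoluteSlot from getEpochInfo) and allow a tolerance of 50–100 slots.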
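
A minimal sketch of such a Solana check, assuming a second RPC endpoint is available as a reference. The function names, the reference URL and the default tolerance of 100 slots are illustrative assumptions; `getSlot` is the standard Solana JSON-RPC method.

```shell
#!/bin/bash
# Sketch of a Solana slot-lag check (names and defaults are assumptions).

# get_slot <rpc_url>: print the current slot reported by an RPC endpoint.
get_slot() {
  curl -sf --max-time 5 -X POST "$1" \
    -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","id":1,"method":"getSlot"}' | jq -r '.result'
}

# is_uint <value>: succeed only for a non-empty string of digits.
is_uint() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;
    *) return 0 ;;
  esac
}

# slot_lag_ok <local_slot> <reference_slot> <max_lag>
# Returns 0 when the local node is at most max_lag slots behind the reference.
# Empty or non-numeric input (for example, a failed RPC call) counts as unhealthy.
slot_lag_ok() {
  is_uint "$1" && is_uint "$2" && is_uint "$3" || return 1
  [ $(( $2 - $1 )) -le "$3" ]
}

# check_solana_node <local_url> <reference_url> [max_lag]
check_solana_node() {
  slot_lag_ok "$(get_slot "$1")" "$(get_slot "$2")" "${3:-100}"
}
```

For example, `check_solana_node http://localhost:8899 "$REFERENCE_URL"` would exit with 0 while the local node stays within 100 slots of the reference; `REFERENCE_URL` stands for whatever trusted endpoint is available.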

Rolling Update Without Downtime

Client updates are the most common reason for planned downtime. With HA, nodes are updated one at a time while the other keeps serving traffic:

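#!/bin/bash
# rolling_update.sh: update both nodes one at a time, keeping one in rotation
set -euo pipefail

PIDFILE=/var/run/haproxy.pid
CHECK=/etc/haproxy/scripts/check_eth_node.sh

# Reload HAProxy with the given config; -sf lets the old process finish
# its existing connections before it exits.
reload_haproxy() {
  haproxy -D -p "$PIDFILE" -f "$1" -sf $(cat "$PIDFILE")
}

# update_node <node> <config that serves only the other node>
update_node() {
  local node="$1" other_only_cfg="$2"

  # Step 1: Remove the node from rotation
  reload_haproxy "$other_only_cfg"

  # Step 2: Wait for existing connections to drain
  sleep 30

  # Step 3: Update the client
  ssh "$node" "systemctl stop geth && apt-get update && apt-get install --only-upgrade -y ethereum && systemctl start geth"

  # Step 4: Wait until the node is back in sync
  while ! "$CHECK" "$node"; do
    echo "Waiting for $node to sync..."
    sleep 30
  done

  # Step 5: Return the node to rotation
  reload_haproxy /etc/haproxy/haproxy.cfg
}

update_node node1 /etc/haproxy/haproxy_node2_only.cfg
# haproxy_node1_only.cfg is assumed to exist alongside haproxy_node2_only.cfg
update_node node2 /etc/haproxy/haproxy_node1_only.cfg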
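
An alternative to keeping a separate config file per node is HAProxy's runtime API, which can drain a single server without a reload. The sketch below assumes the global section exposes an admin socket (for example `stats socket /var/run/haproxy.sock mode 600 level admin`) and that socat is installed; the socket path and the helper names are assumptions, while `set server <backend>/<server> state <ready|drain|maint>` is the runtime API command itself.

```shell
#!/bin/bash
# Sketch: take a server out of rotation through the HAProxy runtime API.
HAPROXY_SOCK="${HAPROXY_SOCK:-/var/run/haproxy.sock}"   # assumed socket path

# server_state_cmd <backend> <server> <ready|drain|maint>
# Prints the runtime API command after validating the requested state.
server_state_cmd() {
  case "$3" in
    ready|drain|maint)
      printf 'set server %s/%s state %s\n' "$1" "$2" "$3" ;;
    *)
      echo "unknown state: $3" >&2
      return 1 ;;
  esac
}

# set_server_state <backend> <server> <state>
# Sends the command to the HAProxy admin socket via socat.
set_server_state() {
  state_cmd=$(server_state_cmd "$1" "$2" "$3") || return 1
  echo "$state_cmd" | socat stdio "unix-connect:$HAPROXY_SOCK"
}
```

Under these assumptions, `set_server_state ethereum_nodes node1 drain` would stop new connections to node1 before the update, and `set_server_state ethereum_nodes node1 ready` would return it afterwards. The same two calls would be needed for the `ethereum_ws_nodes` backend.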

Monitoring and Alerts

Prometheus and Grafana are the standard choice. Key metrics:

Metric	Alert Threshold	Criticality
`eth_block_age_seconds`	> 120s	Critical
`haproxy_backend_active_servers`	< 1	Critical
`haproxy_backend_response_time_ms`	> 2000ms	Warning
`node_disk_io_time_percent`	> 80%	Warning
`node_memory_MemAvailable_bytes`	< 10% of total memory	Warning

Route alerts to PagerDuty or Telegram. When haproxy_backend_active_servers < 1 (all nodes are down), wake the on-call engineer immediately.
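
The `eth_block_age_seconds` metric in the table does not appear to come from a standard exporter, so it has to be produced somewhere. One possible approach, sketched here as an assumption rather than the setup actually used, is to render it in the Prometheus text format and let node_exporter's textfile collector (`--collector.textfile.directory`) pick it up. The function name and the output path in the usage note are hypothetical.

```shell
#!/bin/bash
# Sketch: render eth_block_age_seconds in the Prometheus text format.

# render_block_age <block_timestamp_hex> <now_epoch_seconds>
# The first argument is the hex timestamp from eth_getBlockByNumber
# (for example 0x64); returns 1 if it is not valid hex.
render_block_age() {
  ts_hex="${1#0x}"
  case "$ts_hex" in
    ''|*[!0-9a-fA-F]*) return 1 ;;
  esac
  age=$(( $2 - 0x$ts_hex ))
  printf '# HELP eth_block_age_seconds Age of the latest block in seconds.\n'
  printf '# TYPE eth_block_age_seconds gauge\n'
  printf 'eth_block_age_seconds %d\n' "$age"
}
```

A cron job or systemd timer could call `render_block_age "$BLOCK_TS_HEX" "$(date +%s)"`, write the output to a temporary file, and then `mv` it to something like /var/lib/node_exporter/textfile/eth.prom, so that the collector never reads a half-written file.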

What's Included

Deployment of second node in separate AZ/datacenter
HAProxy or Nginx setup with smart health checks
Rolling update scripts for updates without downtime
Prometheus metrics, Grafana dashboard, alerts
Documentation of failover and recovery procedures