Blockchain Node High-Availability Setup


A single node is a single point of failure. For a production service that depends on blockchain data (dApp, payment processor, trading bot), node downtime is product downtime. High availability (HA) is not just "run two nodes": it is a deliberate architecture with failover, health checking, and a clear understanding of what can break and how.

Typical Reasons for Node Unavailability

Before building HA, you need to understand what we're protecting against:

  • Node lags behind the chain tip (Ethereum: resync after a crash; Solana: slot lag > 100)
  • RPC overloaded: one instance can't handle the request load
  • Client update: the node is unavailable while its client is being upgraded
  • Hardware failure: disk, RAM, network card
  • Snapshot corruption: after an unexpected power loss

Architecture: Active-Active Behind Load Balancer

The most practical scheme for RPC nodes:

     Client requests
            │
┌───────────▼───────────┐
│    HAProxy / Nginx    │   ← health check every 5s
└───────────┬───────────┘
            │
       ┌────┴────┐
       ▼         ▼
    Node-1    Node-2        ← different AZs / datacenters
       │         │
       └────┬────┘
            │
        Shared or
   independent storage

Active-active is better than active-passive for RPC: both nodes take traffic, the load is distributed, and failover does not wait for a secondary to be promoted. An unhealthy node is simply dropped from rotation once its health checks fail.

HAProxy Configuration for Ethereum RPC

# /etc/haproxy/haproxy.cfg
global
    maxconn 50000
    log stdout format raw daemon

defaults
    mode http
    log global
    timeout connect 5s
    timeout client 60s
    timeout server 60s
    timeout tunnel 1h          # idle timeout for upgraded (WebSocket) connections; 1h is a suggested value
    option http-server-close
    option forwardfor

frontend ethereum_rpc
    bind *:8545                # HTTP JSON-RPC only; port 8546 belongs to the WebSocket frontend below
    default_backend ethereum_nodes

backend ethereum_nodes
    balance leastconn
    # JSON-RPC health check; "http-check send" needs HAProxy 2.2+ and sets Content-Length itself
    option httpchk
    http-check send meth POST uri / ver HTTP/1.1 hdr Host localhost hdr Content-Type application/json body '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}'
    http-check expect string '"result":false'   # node in sync if eth_syncing = false

    server node1 10.0.1.10:8545 check inter 5s fall 2 rise 3
    server node2 10.0.1.11:8545 check inter 5s fall 2 rise 3

frontend ethereum_ws
    bind *:8546
    default_backend ethereum_ws_nodes

backend ethereum_ws_nodes
    # Sticky by source IP: a subscription can't switch nodes mid-stream
    balance source
    server node1 10.0.1.10:8546 check inter 10s fall 2 rise 3
    server node2 10.0.1.11:8546 check inter 10s fall 2 rise 3

Critical point for WebSocket: subscriptions (eth_subscribe, Solana slotSubscribe) live on stateful connections. On failover the WebSocket client must reconnect and recreate its subscriptions. On the load balancer, use sticky sessions by source IP so that a client always goes to the same node while that node is alive.

Health Check: What to Check

A standard HTTP health check (status 200) is insufficient: a node can respond over HTTP while being 1000 blocks behind the tip. A correct check:

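#!/bin/bash
# /etc/haproxy/scripts/check_eth_node.sh [node URL or hostname]
# Returns 0 if node healthy, 1 if not

# Optional argument: full URL or bare hostname; defaults to the local node
TARGET="${1:-localhost}"
case "$TARGET" in
  http://*|https://*) NODE_URL="$TARGET" ;;
  *)                  NODE_URL="http://$TARGET:8545" ;;
esac
MAX_BLOCK_AGE=180   # seconds

# Send a JSON-RPC request body and print the raw response
rpc() {
  curl -sf --max-time 3 -X POST "$NODE_URL" \
    -H "Content-Type: application/json" \
    -d "$1"
}

# 1. Check that node is not syncing (eth_syncing returns false when in sync)
SYNCING=$(rpc '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}' | jq -r '.result')

if [ "$SYNCING" != "false" ]; then
  exit 1
fi

# 2. Check that the latest block is not older than 3 minutes (180 seconds)
TIMESTAMP_HEX=$(rpc '{"jsonrpc":"2.0","method":"eth_getBlockByNumber","params":["latest",false],"id":1}' | jq -r '.result.timestamp')

# Reject an empty or malformed timestamp before doing arithmetic on it
if ! [[ "$TIMESTAMP_HEX" =~ ^0x[0-9a-fA-F]+$ ]]; then
  exit 1
fi

BLOCK_TIME=$((16#${TIMESTAMP_HEX#0x}))
NOW=$(date +%s)
AGE=$((NOW - BLOCK_TIME))

if [ "$AGE" -gt "$MAX_BLOCK_AGE" ]; then
  exit 1
fi

exit 0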

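Solana uses similar logic: compare the node's getSlot with the slot reported by a trusted reference endpoint (for example, absoluteSlot from its getEpochInfo), with a tolerance of 50–100 slots. Both values must not come from the node under test, otherwise the lag is always zero.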
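For Solana, the slot comparison could be sketched as follows. This is a minimal sketch, assuming curl and jq are installed and that a trusted reference RPC endpoint is available. The function names, the REFERENCE_RPC_URL variable and the 100-slot default are illustrative assumptions, not part of any standard tooling.

```bash
#!/bin/bash
# check_sol_node.sh (hypothetical name): slot-lag health check sketch.

# Print the current slot reported by the RPC endpoint in $1, or fail.
get_slot() {
  curl -sf --max-time 3 -X POST "$1" \
    -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","id":1,"method":"getSlot"}' | jq -er '.result'
}

# Return 0 if the local slot is at most <tolerance> slots behind the reference.
# Usage: slot_lag_ok <local_slot> <reference_slot> [tolerance]
slot_lag_ok() {
  local local_slot=$1 ref_slot=$2 tolerance=${3:-100}
  local lag=$((ref_slot - local_slot))
  [ "$lag" -le "$tolerance" ]
}

# Example wiring (both URLs are placeholders):
#   local_slot=$(get_slot "http://localhost:8899") || exit 1
#   ref_slot=$(get_slot "$REFERENCE_RPC_URL")      || exit 1
#   slot_lag_ok "$local_slot" "$ref_slot" 100      || exit 1
```

Keeping the comparison in a separate function makes the tolerance easy to tune per cluster. A node that is ahead of the reference yields a negative lag and is treated as healthy.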

Rolling Update Without Downtime

Client updates are the most common reason for planned downtime. With HA, nodes are updated one at a time while the other keeps serving traffic:

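#!/bin/bash
# rolling_update.sh
set -euo pipefail

PIDFILE=/var/run/haproxy.pid
CHECK=/etc/haproxy/scripts/check_eth_node.sh

# Start HAProxy with the given config; the old process finishes its open connections and exits
reload_haproxy() {
  haproxy -D -f "$1" -p "$PIDFILE" -sf $(cat "$PIDFILE")
}

# update_node <node> <RPC URL of that node> <config that serves only the other node>
update_node() {
  local node=$1 url=$2 other_only_cfg=$3

  # Step 1: Remove the node from rotation
  reload_haproxy "$other_only_cfg"

  # Step 2: Wait for drain of existing connections
  sleep 30

  # Step 3: Update the client (only the ethereum package, not the whole system)
  ssh "$node" "systemctl stop geth && apt-get update && apt-get install -y --only-upgrade ethereum && systemctl start geth"

  # Step 4: Wait until the node is back in sync
  until "$CHECK" "$url"; do
    echo "Waiting for $node to sync..."
    sleep 30
  done

  # Step 5: Return the node to rotation
  reload_haproxy /etc/haproxy/haproxy.cfg
}

update_node node1 http://10.0.1.10:8545 /etc/haproxy/haproxy_node2_only.cfg
# haproxy_node1_only.cfg is assumed to exist as the counterpart of haproxy_node2_only.cfg
update_node node2 http://10.0.1.11:8545 /etc/haproxy/haproxy_node1_only.cfg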
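An alternative to swapping configuration files is the HAProxy runtime API, which can put a single server into drain mode without a reload. The sketch below assumes an admin socket is enabled in the global section (for example "stats socket /var/run/haproxy.sock mode 600 level admin") and that socat is installed. The socket path and the helper function names are assumptions for illustration.

```bash
#!/bin/bash
# Sketch: take a server out of rotation through the HAProxy runtime API
# instead of reloading HAProxy with an alternate config file.

# Placeholder path; must match the "stats socket" line in haproxy.cfg.
HAPROXY_SOCK="${HAPROXY_SOCK:-/var/run/haproxy.sock}"

# Build the runtime command. State must be one of: ready, drain, maint.
# Usage: server_state_cmd <backend> <server> <state>
server_state_cmd() {
  local backend=$1 server=$2 state=$3
  case "$state" in
    ready|drain|maint)
      printf 'set server %s/%s state %s\n' "$backend" "$server" "$state" ;;
    *)
      echo "unknown state: $state" >&2
      return 1 ;;
  esac
}

# Send the command to the admin socket.
set_server_state() {
  local cmd
  cmd=$(server_state_cmd "$@") || return 1
  echo "$cmd" | socat stdio "$HAPROXY_SOCK"
}

# Example:
#   set_server_state ethereum_nodes node1 drain
#   ...upgrade node1 and wait for the health check script to pass...
#   set_server_state ethereum_nodes node1 ready
```

In drain mode the server receives no new connections while existing ones are allowed to finish. If WebSocket traffic uses a separate backend, the same command would need to be issued for that backend as well.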

Monitoring and Alerts

Prometheus + Grafana is the standard stack. Key metrics:

Metric                             Alert Threshold   Criticality
eth_block_age_seconds              > 120s            Critical
haproxy_backend_active_servers     < 1               Critical
haproxy_backend_response_time_ms   > 2000ms          Warning
node_disk_io_time_percent          > 80%             Warning
node_memory_available_bytes        < 10% of total    Warning

Alerts go to PagerDuty or Telegram. When haproxy_backend_active_servers < 1 (all nodes down), page the on-call engineer immediately.
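The text does not say how eth_block_age_seconds is produced. One possible way, sketched below, is the node_exporter textfile collector: a small script run from cron or a systemd timer writes the metric to a .prom file. This assumes node_exporter is started with --collector.textfile.directory pointing at TEXTFILE_DIR and that curl and jq are installed. The directory, file name and function names are assumptions.

```bash
#!/bin/bash
# Sketch: export eth_block_age_seconds via node_exporter's textfile collector.

# Placeholders; adjust to the actual node_exporter setup and node address.
TEXTFILE_DIR="${TEXTFILE_DIR:-/var/lib/node_exporter/textfile}"
NODE_URL="${NODE_URL:-http://localhost:8545}"

# Print the block age in seconds.
# Usage: block_age <0x-prefixed hex timestamp> <now in epoch seconds>
block_age() {
  local ts_hex=$1 now=$2
  echo $(( now - 16#${ts_hex#0x} ))
}

# Render the metric in the Prometheus text exposition format.
render_metric() {
  printf '# HELP eth_block_age_seconds Age of the latest block in seconds.\n'
  printf '# TYPE eth_block_age_seconds gauge\n'
  printf 'eth_block_age_seconds %s\n' "$1"
}

# Fetch the latest block timestamp and write the metric file atomically.
export_block_age() {
  local ts_hex tmp
  ts_hex=$(curl -sf --max-time 3 -X POST "$NODE_URL" \
    -H "Content-Type: application/json" \
    -d '{"jsonrpc":"2.0","method":"eth_getBlockByNumber","params":["latest",false],"id":1}' \
    | jq -er '.result.timestamp') || return 1
  tmp=$(mktemp "$TEXTFILE_DIR/eth_block_age.XXXXXX") || return 1
  render_metric "$(block_age "$ts_hex" "$(date +%s)")" > "$tmp"
  chmod 644 "$tmp"   # mktemp creates the file as 0600; the exporter must be able to read it
  mv "$tmp" "$TEXTFILE_DIR/eth_block_age.prom"
}

# Example (from cron or a systemd timer): export_block_age
```

Writing to a temporary file and renaming it keeps the collector from reading a half-written file. The run interval should be well below the 120s alert threshold so the exported value stays current.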

What's Included

  • Deployment of second node in separate AZ/datacenter
  • HAProxy or Nginx setup with smart health checks
  • Rolling update scripts for updates without downtime
  • Prometheus metrics, Grafana dashboard, alerts
  • Documentation of failover and recovery procedures