Multi-Node Monitoring System Development
Development of Multi-Node Monitoring System

Monitoring blockchain nodes isn't a matter of "set up Prometheus and forget it". Blockchain-specific metrics differ fundamentally from standard server metrics: a node can be completely alive from a process perspective yet lag 10,000 blocks behind the chain and quietly serve stale data to clients. A standard uptime monitor won't catch this.

The task gets harder with multi-network infrastructure: an Ethereum full node, a BSC validator, a Solana RPC node, a Cosmos validator — each has its own telemetry, its own RPC methods for state checks, and its own set of critical metrics.

What Actually Needs Monitoring

Blockchain-Specific Metrics

Block height lag — how far the node trails the network head. This is the most critical metric. A node can be alive but lagging: for an RPC service this is critical (clients receive stale data), and for a validator it means slashing risk.

// Check lag for EVM-compatible node
async function checkBlockLag(nodeRpc: string, referenceRpc: string): Promise<number> {
    const [nodeBlock, referenceBlock] = await Promise.all([
        getBlockNumber(nodeRpc),
        getBlockNumber(referenceRpc),  // public endpoint as reference
    ]);
    return referenceBlock - nodeBlock;
}

async function getBlockNumber(rpc: string): Promise<number> {
    const response = await fetch(rpc, {
        method: "POST",
        body: JSON.stringify({ jsonrpc: "2.0", method: "eth_blockNumber", id: 1 }),
        headers: { "Content-Type": "application/json" },
        signal: AbortSignal.timeout(5000),
    });
    const { result } = await response.json();
    return parseInt(result, 16);
}

Peer count — the number of connected peers. A low peer count (< 5) signals sync problems and a potentially isolated node. For Ethereum: net_peerCount. For Cosmos: /net_info via RPC.

Sync status — whether the node is still syncing or already in sync. For Ethereum, eth_syncing returns false (synced) or an object with sync progress. A node that is still syncing shouldn't receive production traffic.

Mempool depth — the pending transaction count. For RPC nodes, a large mempool can indicate processing issues. For Ethereum: txpool_status.
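A minimal sketch of polling these status checks over JSON-RPC against a standard EVM endpoint; the helper names here are illustrative, not from any specific library:

```python
import json
import urllib.request

def rpc_call(url: str, method: str) -> dict:
    """POST one JSON-RPC 2.0 request and return the parsed response."""
    payload = json.dumps(
        {"jsonrpc": "2.0", "method": method, "params": [], "id": 1}
    ).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)

def parse_status(peer_count_hex: str, syncing, txpool: dict) -> dict:
    """Combine net_peerCount, eth_syncing and txpool_status results."""
    return {
        "peers": int(peer_count_hex, 16),           # net_peerCount: hex string
        "synced": syncing is False,                 # eth_syncing: False == synced
        "pending_txs": int(txpool["pending"], 16),  # txpool_status: hex counts
    }
```

Separating the network call from the parsing keeps the metric logic unit-testable without a live node.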

Validator-specific metrics (Cosmos, Ethereum PoS):

  • Missed blocks / attestations — missed signatures lead to slashing
  • Validator balance (ETH) — below threshold triggers validator ejection
  • Double sign risk — monitoring double-sign attempts
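A sketch of mapping a Cosmos-style missed-block counter onto severity tiers; the string-encoded counter mirrors how Cosmos LCD endpoints encode integers, and the default thresholds are illustrative:

```python
def missed_severity(missed_counter: str, window: int,
                    warn: float = 0.01, crit: float = 0.05) -> str:
    """Classify a validator's missed-signature ratio.

    `missed_counter` arrives as a string (Cosmos LCD convention);
    `window` is the signed-blocks window size.
    """
    ratio = int(missed_counter) / window
    if ratio > crit:
        return "critical"
    if ratio > warn:
        return "warning"
    return "ok"
```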

Infrastructure Metrics with Blockchain Context

Standard CPU/RAM/disk metrics remain critical, but they're interpreted differently in a blockchain context. An Ethereum full node consumes 1–2 TB on NVMe (not HDD). A sharp I/O increase may signal an active resync. An Ethereum node under full RPC load consumes 16–32 GB of RAM — that's normal, not a leak.
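The disk-space side of this can be checked with the standard library; a hedged sketch where the data directory path and the free-space thresholds are assumptions:

```python
import shutil

def disk_severity(free_fraction: float) -> str:
    """< 20% free -> warning, < 10% free -> critical (illustrative tiers)."""
    if free_fraction < 0.10:
        return "critical"
    if free_fraction < 0.20:
        return "warning"
    return "ok"

def check_datadir(path: str = "/var/lib/ethereum") -> str:
    # shutil.disk_usage returns (total, used, free) in bytes
    usage = shutil.disk_usage(path)
    return disk_severity(usage.free / usage.total)
```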

Monitoring System Architecture

Collector Layer

For each node type there is a specialized collector that translates blockchain-specific telemetry into a unified format (Prometheus metrics).

// Collector for EVM-compatible nodes (Go)
type EVMNodeCollector struct {
    nodeRPC      string
    referenceRPC string
    nodeName     string
    chainID      string
}

func (c *EVMNodeCollector) Describe(ch chan<- *prometheus.Desc) {
    ch <- blockLagDesc
    ch <- peerCountDesc
    ch <- syncStatusDesc
    ch <- mempoolSizeDesc
}

func (c *EVMNodeCollector) Collect(ch chan<- prometheus.Metric) {
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    lag, err := c.getBlockLag(ctx)
    if err != nil {
        ch <- prometheus.NewInvalidMetric(blockLagDesc, err)
        return
    }

    ch <- prometheus.MustNewConstMetric(
        blockLagDesc,
        prometheus.GaugeValue,
        float64(lag),
        c.nodeName, c.chainID,
    )
    // ... remaining metrics
}

For Cosmos-based nodes — parse /status, /net_info, /validators via RPC. For Solana — JSON-RPC methods getHealth, getSlot, getVoteAccounts. For Bitcoin — getblockchaininfo, getpeerinfo.
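For the Cosmos case, the relevant fields live under result.sync_info in the /status response (CometBFT/Tendermint RPC encodes heights as strings); a parsing sketch:

```python
def parse_cosmos_status(status: dict) -> tuple[int, bool]:
    """Extract (latest block height, catching_up) from a /status payload."""
    sync = status["result"]["sync_info"]
    # heights are string-encoded in CometBFT responses
    return int(sync["latest_block_height"]), bool(sync["catching_up"])
```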

Off-the-shelf exporters vs. custom:

  • ethereum-exporter (open source) covers basic EVM metrics
  • cosmos-validator-exporter (Frens Validator) — for the Cosmos ecosystem
  • For non-standard protocols (TON, Solana with custom metrics) — write a custom exporter in Go

Aggregation and Storage

Prometheus plus VictoriaMetrics for long-term storage. VictoriaMetrics is preferable for multi-network operations: it compresses time series better and supports federated scraping from multiple Prometheus instances.

# prometheus.yml — scrape config for multi-node environment
scrape_configs:
  - job_name: 'ethereum-nodes'
    scrape_interval: 15s
    scrape_timeout: 10s
    static_configs:
      - targets:
          - 'eth-node-1:9090'
          - 'eth-node-2:9090'
          - 'eth-node-3:9090'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance

  - job_name: 'cosmos-validators'
    scrape_interval: 30s  # Cosmos block ~6 sec, 30 sec sufficient
    static_configs:
      - targets: ['cosmos-val-1:26660', 'cosmos-val-2:26660']

  - job_name: 'solana-rpc'
    scrape_interval: 10s  # Solana ~400ms slot, frequent checks needed
    static_configs:
      - targets: ['solana-rpc-1:9101']
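To get these series into VictoriaMetrics for long-term storage, Prometheus can remote-write to the single-node VictoriaMetrics write endpoint (default port 8428); the hostname here is an assumption:

```yaml
# prometheus.yml — remote_write to VictoriaMetrics (hostname illustrative)
remote_write:
  - url: http://victoriametrics:8428/api/v1/write
    queue_config:
      max_samples_per_send: 10000
```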

Alerting

Grafana Alerting or Alertmanager. Key principle: different severities for different metrics. Not everything requires an immediate response.

Metric           | Warning     | Critical    | Action
-----------------|-------------|-------------|-----------------------------------
Block lag (EVM)  | > 10 blocks | > 50 blocks | Auto-restart or traffic switch
Peer count       | < 10        | < 3         | Check firewall/network
Disk space       | < 20% free  | < 10% free  | Expand or prune
Validator missed | > 1%        | > 5%        | Immediate response (slashing risk)
Memory usage     | > 80%       | > 95%       | Check for leaks, restart

# Prometheus alerting rules (routed through Alertmanager)
groups:
  - name: blockchain-nodes
    rules:
      - alert: ValidatorMissedBlocks
        expr: rate(cosmos_validator_missed_blocks_total[5m]) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Validator {{ $labels.validator }} missing >5% blocks"
          description: "Slashing risk. Immediate action required."

      - alert: NodeBlockLagHigh
        expr: blockchain_block_lag{chain="ethereum"} > 50
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Ethereum node {{ $labels.instance }} lagging {{ $value }} blocks"

Automatic Response

Passive monitoring isn't sufficient for 24/7 production. Critical scenarios need automatic remediation actions.

Auto-failover for RPC nodes. A load balancer (HAProxy/nginx) checks each node's health endpoint and automatically removes failed nodes from rotation. A blockchain node's health check must include a block lag check, not just HTTP 200.

# Health check script for HAProxy (called as external check)
import sys
import asyncio
from web3 import AsyncWeb3

MAX_LAG = 20  # maximum acceptable lag in blocks

async def check_node_health(node_url: str, reference_url: str) -> bool:
    try:
        w3_node = AsyncWeb3(AsyncWeb3.AsyncHTTPProvider(node_url, request_kwargs={"timeout": 3}))
        w3_ref = AsyncWeb3(AsyncWeb3.AsyncHTTPProvider(reference_url, request_kwargs={"timeout": 3}))

        node_block, ref_block = await asyncio.gather(
            w3_node.eth.block_number,
            w3_ref.eth.block_number,
        )
        return (ref_block - node_block) <= MAX_LAG
    except Exception:
        return False

if not asyncio.run(check_node_health(sys.argv[1], sys.argv[2])):
    sys.exit(1)
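On the load-balancer side, the script above can be hooked into HAProxy via its external-check mechanism. A sketch under assumptions: backend and server names are illustrative, and since HAProxy passes the server address and port as arguments to the command, a small wrapper script would map those to the two URLs the Python check expects:

```
# haproxy.cfg fragment (names and addresses illustrative)
global
    external-check

backend eth_rpc
    balance roundrobin
    option external-check
    external-check command /usr/local/bin/check-node-health.sh
    server eth-node-1 10.0.0.11:8545 check inter 10s fall 2 rise 3
    server eth-node-2 10.0.0.12:8545 check inter 10s fall 2 rise 3
```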

Auto-restart on hang. A node can hang without crashing. Watchdog: if block height is unchanged for N minutes, restart the service via systemd or a Kubernetes restart policy.

Dashboards

Grafana dashboards by structure:

  • Overview — all nodes, all networks, status at a glance
  • Per-network deep dive — detailed metrics for each network
  • Validator performance — for staking nodes, including APR and slashing risks
  • Infrastructure — CPU/RAM/Disk per node

For public RPC services — additional: request metrics (RPS, latency, error rate), rate limiting stats, top methods by load.

Development Timeline

Component                                    | Timeline
---------------------------------------------|----------
Basic exporters (EVM + 1–2 other networks)   | 1–2 weeks
Prometheus + VictoriaMetrics + Grafana setup | 3–5 days
Alert rules + PagerDuty/Telegram integration | 2–3 days
Auto-failover for RPC                        | 1 week
Dashboards + documentation                   | 1 week

Monitoring for 3–5 networks with basic dashboards and alerts — 3–4 weeks. Extended system with auto-remediation and custom exporters for non-standard protocols — 6–8 weeks.