Market Data Replication System Development
Market data replication distributes a market feed reliably from primary sources (exchanges) to multiple consumers in different geographic locations or isolated environments. It is more than simple copying: the data stream has to stay consistent, fault tolerant, and scalable.
Why Replication is Needed
Trading systems often have multiple components operating in different environments:
- Production trading: on servers co-located with the exchanges or otherwise close to them, for low latency
- Research/backtesting: in data centers with large storage volumes
- Risk management: in protected networks with restricted access
- Analytics dashboards: accessible to a broad audience
Each environment should receive identical data without overloading the source.
Replication Topologies
Hub-and-Spoke: one primary aggregator collects data from the exchanges, and multiple replica nodes subscribe to it. It is simple to implement, but the hub is a single point of failure.
Chain Replication: data passes through a chain of nodes (exchange → primary → secondary → tertiary). The load on the primary source stays low, but latency accumulates with every hop.
Pub-Sub via Kafka: the primary writes to Kafka, and any number of consumer groups read independently. This is the most flexible option for production.
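To make the latency trade-off between the first two topologies concrete, here is a minimal sketch. The hop delays are invented numbers for illustration, not measurements.

```python
def chain_latencies(hop_ms):
    """Delay seen at each node of a chain: the running sum of the hop delays."""
    total = 0.0
    delays = []
    for hop in hop_ms:
        total += hop
        delays.append(total)
    return delays


def hub_latencies(exchange_to_hub_ms, hub_to_replica_ms):
    """Delay seen at each replica of a hub: one shared hop plus its own spoke."""
    return [exchange_to_hub_ms + spoke for spoke in hub_to_replica_ms]


# Hypothetical delays in milliseconds: exchange -> primary, then two more hops.
print(chain_latencies([2.0, 5.0, 5.0]))   # [2.0, 7.0, 12.0]
print(hub_latencies(2.0, [5.0, 5.0]))     # [7.0, 7.0]
```

In the chain, the last node pays for every hop before it; in the hub layout, every replica pays for exactly two hops, at the cost of concentrating all fan-out on the hub.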
Kafka as Replication Backbone
Kafka is well suited to market data replication:
- Durability: data is stored on disk, so consumers can replay any period that is still within the retention window
- Scalability: horizontal scaling through partitioning
- Consumer groups: different consumer groups read the same topic independently
- Exactly-once semantics: available when idempotent producers are combined with transactions
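The durability and consumer-group properties can be illustrated with a toy in-memory model. This is not Kafka client code: `TopicLog` and the group names are hypothetical, and the model covers a single partition only.

```python
class TopicLog:
    """Append-only log for one partition, with one read offset per consumer group."""

    def __init__(self):
        self._records = []
        self._offsets = {}  # group name -> next offset to read

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def poll(self, group, max_records=100):
        """Return the next batch for this group and advance only its offset."""
        start = self._offsets.get(group, 0)
        batch = self._records[start:start + max_records]
        self._offsets[group] = start + len(batch)
        return batch

    def seek(self, group, offset):
        """Rewind (or skip ahead) to replay retained data."""
        self._offsets[group] = offset

    def lag(self, group):
        return len(self._records) - self._offsets.get(group, 0)


log = TopicLog()
for trade_id in range(5):
    log.append({"trade_id": trade_id})

log.poll("trading-bot")                 # reads all 5 records
log.poll("analytics", max_records=2)    # reads 2, unaffected by the other group
print(log.lag("trading-bot"), log.lag("analytics"))  # 0 3
```

Because each group tracks its own offset, a slow analytics consumer never holds back the trading consumer, and `seek` models replaying history.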
Topic: market.trades.binance.BTCUSDT
Partition 0: trades (all, ordered by time)
Topic: market.orderbook.binance.BTCUSDT
Partition 0: snapshots + diffs (ordered by update_id)
Topic: market.candles.binance.BTCUSDT.1m
Partition 0: 1-minute OHLCV (ordered by candle time)
Topic naming: market.{data_type}.{exchange}.{symbol}, with an optional .{interval} suffix for candles. This convention makes it easy to filter for the data a consumer needs.
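A small helper can keep topic names consistent with the examples above. This is a sketch: the function names are hypothetical, and it assumes that none of the components contains a dot.

```python
from typing import NamedTuple, Optional

PREFIX = "market"  # common prefix used by the example topics


class TopicName(NamedTuple):
    data_type: str
    exchange: str
    symbol: str
    interval: Optional[str] = None


def build_topic(data_type, exchange, symbol, interval=None):
    """Join the components into a dotted topic name."""
    parts = [PREFIX, data_type, exchange, symbol]
    if interval:
        parts.append(interval)
    return ".".join(parts)


def parse_topic(topic):
    """Split a topic name back into its components, rejecting other layouts."""
    parts = topic.split(".")
    if len(parts) not in (4, 5) or parts[0] != PREFIX:
        raise ValueError(f"unexpected topic name: {topic!r}")
    return TopicName(*parts[1:])


print(build_topic("candles", "binance", "BTCUSDT", "1m"))
# market.candles.binance.BTCUSDT.1m
print(parse_topic("market.trades.binance.BTCUSDT").exchange)
# binance
```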
Delivery Guarantees
Market data systems typically use at-least-once delivery: duplicate messages are preferable to lost data. Consumers therefore need to be idempotent, deduplicating by trade_id or update_id.
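One way to make a consumer idempotent is a bounded set of recently seen ids. The sketch below assumes each message is a dict with a `trade_id` field; the `Deduplicator` class and its capacity are illustrative choices, not part of any Kafka API.

```python
from collections import OrderedDict


class Deduplicator:
    """Remembers the most recent `capacity` ids and reports whether an id is new."""

    def __init__(self, capacity=100_000):
        self._seen = OrderedDict()
        self._capacity = capacity

    def is_new(self, message_id):
        if message_id in self._seen:
            return False
        self._seen[message_id] = None
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # evict the oldest id
        return True


def process(trades, dedup):
    """Return the trades whose trade_id has not been seen recently."""
    return [trade for trade in trades if dedup.is_new(trade["trade_id"])]


trades = [
    {"trade_id": 1, "price": 50000.0},
    {"trade_id": 2, "price": 50001.0},
    {"trade_id": 1, "price": 50000.0},  # redelivered duplicate
    {"trade_id": 3, "price": 50002.5},
]
print([t["trade_id"] for t in process(trades, Deduplicator())])  # [1, 2, 3]
```

The bounded window trades memory for safety: a duplicate that arrives after its id has been evicted will be processed again, so the capacity should comfortably exceed the largest expected redelivery gap.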
For critical components (risk management, position accounting), use exactly-once processing via Kafka transactions: a transactional, idempotent producer with acks=all, and consumers that read with isolation.level=read_committed.
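The settings involved can be written down as plain client configuration dictionaries. The property names are standard Kafka client settings; the server address, group id, and transactional id are placeholders, and `exactly_once_problems` is a hypothetical helper rather than a library function.

```python
PRODUCER_CONFIG = {
    "bootstrap.servers": "localhost:9092",   # placeholder address
    "enable.idempotence": True,              # no duplicates from producer retries
    "acks": "all",                           # wait for all in-sync replicas
    "transactional.id": "risk-writer-1",     # hypothetical id, stable per producer instance
}

CONSUMER_CONFIG = {
    "bootstrap.servers": "localhost:9092",   # placeholder address
    "group.id": "risk-management",           # hypothetical group name
    "isolation.level": "read_committed",     # skip records from aborted transactions
    "enable.auto.commit": False,             # offsets are committed inside the transaction
}


def exactly_once_problems(producer, consumer):
    """List the settings that would break the exactly-once setup described above."""
    problems = []
    if producer.get("enable.idempotence") is not True:
        problems.append("producer: enable.idempotence must be true")
    if str(producer.get("acks")) not in ("all", "-1"):
        problems.append("producer: acks must be all")
    if not producer.get("transactional.id"):
        problems.append("producer: transactional.id is required")
    if consumer.get("isolation.level") != "read_committed":
        problems.append("consumer: isolation.level must be read_committed")
    return problems


print(exactly_once_problems(PRODUCER_CONFIG, CONSUMER_CONFIG))  # []
```

A check like this could run at service start-up so that a misconfigured risk component fails fast instead of silently falling back to at-least-once behavior.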
Cross-Datacenter Replication
Kafka MirrorMaker 2 replicates topics between clusters. For example, a cluster in an EU data center can receive replicas of all market.* topics from the primary cluster with a small delay (typically 50–200 ms for transatlantic replication).
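One practical detail is what the replicated topics are called on the target cluster. Assuming MirrorMaker 2's default replication policy, which prefixes each mirrored topic with the source cluster alias, the mapping can be sketched as follows. The alias `us` and the helper name are assumptions made for illustration.

```python
import re

# Topics selected for mirroring: everything under the market. prefix.
TOPIC_PATTERN = re.compile(r"market\..*")


def mirrored_topics(source_alias, topics, pattern=TOPIC_PATTERN, separator="."):
    """Map each selected source topic to its name on the target cluster."""
    return {
        topic: f"{source_alias}{separator}{topic}"
        for topic in topics
        if pattern.fullmatch(topic)
    }


source_topics = ["market.trades.binance.BTCUSDT", "internal.audit"]
print(mirrored_topics("us", source_topics))
# {'market.trades.binance.BTCUSDT': 'us.market.trades.binance.BTCUSDT'}
```

Under that assumption, consumers on the target cluster subscribe to the prefixed names rather than to the original ones.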
Retention and Compression Management
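Market data accumulates rapidly, so retention is set per data type. The log.* keys below are broker-wide defaults; to apply a policy to one class of topics only, set the topic-level equivalents (retention.ms, retention.bytes, cleanup.policy) instead:
# Tick data: keep 7 days, then delete
log.retention.hours=168
# Daily OHLCV: no time limit, bounded by size instead (10 GB per partition)
log.retention.ms=-1
log.retention.bytes=10737418240
# Order book: log compaction keeps only the latest record per key (here, per price level)
log.cleanup.policy=compact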
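The effect of log compaction on an order book topic can be sketched with a simplified model. It assumes records are keyed by price level and that a `None` value acts as a tombstone; real Kafka compaction runs in the background, leaves the active segment untouched, and retains tombstones for a while, none of which is modeled here.

```python
def compact(log):
    """Keep only the latest record per key, dropping keys whose latest value is None."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later records overwrite earlier ones
    survivors = sorted(
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None
    )
    return [(key, value) for _, key, value in survivors]


# Hypothetical order book updates: key = side and price level, value = quantity.
updates = [
    ("bid:50000", 1.5),
    ("ask:50010", 2.0),
    ("bid:50000", 0.7),   # newer quantity for the same level
    ("ask:50010", None),  # level removed
    ("bid:49990", 3.0),
]
print(compact(updates))  # [('bid:50000', 0.7), ('bid:49990', 3.0)]
```

A consumer that reads the compacted topic from the beginning ends up with the current state of the book without replaying every intermediate update.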
Kafka-level compression: compression.type=zstd typically shrinks market data by 40–70% at a modest CPU cost.
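A back-of-the-envelope estimate shows how retention and compression interact. The message rate and average size below are invented for illustration, and the estimate ignores the replication factor and index overhead.

```python
def retained_bytes(msgs_per_sec, avg_msg_bytes, retention_hours, compression_saving=0.0):
    """Approximate on-disk size of one topic for a given retention window."""
    raw = msgs_per_sec * avg_msg_bytes * retention_hours * 3600
    return raw * (1.0 - compression_saving)


# Hypothetical feed: 2,000 messages/s of about 200 bytes each, kept for 7 days.
raw = retained_bytes(2000, 200, 168)
compressed = retained_bytes(2000, 200, 168, compression_saving=0.5)
print(f"{raw / 1e9:.1f} GB raw, {compressed / 1e9:.1f} GB at 50% saving")
# 241.9 GB raw, 121.0 GB at 50% saving
```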
Replication Monitoring
Key metrics:
| Metric | Shows |
|---|---|
| Consumer lag | Consumer delay behind producer |
| Replication latency | Delay between primary and replica clusters |
| Producer send rate | Publishing speed (messages/sec) |
| Bytes in/out rate | Throughput |
| Under-replicated partitions | Partitions with insufficient replication |
Consumer lag above N minutes is a critical alert for a trading bot; for an analytics system, the same lag is only a warning.
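The per-consumer severity rule can be expressed as a small policy table. The thresholds and group names here are placeholders standing in for the unspecified N, and the lag is measured as the time difference between the newest produced event and the last consumed one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LagPolicy:
    threshold_seconds: float
    severity: str


# Hypothetical thresholds; tune them per consumer.
POLICIES = {
    "trading-bot": LagPolicy(threshold_seconds=5.0, severity="critical"),
    "analytics": LagPolicy(threshold_seconds=300.0, severity="warning"),
}


def check_lag(group, newest_event_ts, last_consumed_ts, policies=POLICIES):
    """Return the alert level for a consumer group, or 'ok' if it is within bounds."""
    lag = max(0.0, newest_event_ts - last_consumed_ts)
    policy = policies[group]
    return policy.severity if lag > policy.threshold_seconds else "ok"


# The same 10-second lag is critical for trading but acceptable for analytics.
print(check_lag("trading-bot", 1000.0, 990.0))  # critical
print(check_lag("analytics", 1000.0, 990.0))    # ok
```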
Schema Registry and Format Compatibility
As schemas evolve (fields added, types changed), consumers must not break. Confluent Schema Registry with Avro supports schema evolution with compatibility checks, so backward-compatible changes such as adding optional fields with defaults are accepted while incompatible ones are rejected.
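The idea behind a backward-compatibility check can be sketched in a few lines. This is a deliberately simplified model and not the Schema Registry's actual algorithm: it treats a schema as a flat mapping of field names to specs, and it flags every type change, ignoring Avro's type promotion rules.

```python
def backward_compatibility_problems(old_fields, new_fields):
    """Check whether a reader using new_fields can read data written with old_fields."""
    problems = []
    for name, spec in new_fields.items():
        if name not in old_fields:
            # A new field is only readable from old data if it has a default.
            if "default" not in spec:
                problems.append(f"new field '{name}' has no default")
        elif spec["type"] != old_fields[name]["type"]:
            problems.append(f"field '{name}' changed type")
    return problems


# Hypothetical trade schema, version 1 and two candidate successors.
v1 = {"trade_id": {"type": "long"}, "price": {"type": "double"}}
v2_ok = {**v1, "venue": {"type": ["null", "string"], "default": None}}
v2_bad = {**v1, "venue": {"type": "string"}}

print(backward_compatibility_problems(v1, v2_ok))   # []
print(backward_compatibility_problems(v1, v2_bad))  # ["new field 'venue' has no default"]
```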







