Development of AI-based Order Flow Analysis Model
Order flow analysis studies executed trades rather than resting orders. Where the order book shows intentions, order flow shows real actions: who is buying or selling aggressively and removing liquidity from the book. This is the foundation for understanding "smart money" and supply/demand imbalance.
Key Order Flow Concepts
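Aggressor vs. Passive: Each trade is initiated either by a buyer (market buy, lifts the ask) or by a seller (market sell, hits the bid). When the feed does not report the aggressor side, it is inferred with the tick rule or with the Lee-Ready algorithm, which first compares the trade price with the quote midpoint and falls back to the tick rule for trades at the midpoint. The tick rule:
- Trade at price > previous → buyer-initiated
- Trade at price < previous → seller-initiated
- Same price → reuse the direction of the last price change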
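The tick rule above can be sketched in a few lines of plain Python. The function name `classify_tick_rule` and the choice to mark trades before the first price change as 0 (unknown) are assumptions made for this illustration, not part of any standard library.

```python
def classify_tick_rule(prices):
    """Return +1 (buyer-initiated), -1 (seller-initiated) or 0 (unknown) per trade."""
    signs = []
    last_sign = 0      # assumption: no direction is known before the first price change
    prev_price = None
    for price in prices:
        if prev_price is not None:
            if price > prev_price:
                last_sign = 1
            elif price < prev_price:
                last_sign = -1
            # equal price: keep last_sign, i.e. reuse the last price change
        signs.append(last_sign)
        prev_price = price
    return signs


prices = [100.0, 100.5, 100.5, 100.0, 100.0, 100.25]
signs = classify_tick_rule(prices)
print(signs)
```

When the exchange feed already carries an aggressor flag, that flag should be preferred; the tick rule is only an inference from prices.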
Delta and Cumulative Volume Delta (CVD):
Delta = Buyer_Volume - Seller_Volume (per trade or per bar)
CVD = Σ Delta over the period
Positive CVD with rising price = trend confirmation. Negative CVD with rising price = divergence (potential reversal).
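As a rough illustration of the delta, CVD, and divergence logic, the sketch below works on per-bar buy and sell volumes. The bar values, the function names, and the simple start-to-end comparison are assumptions chosen for the example; a real divergence detector would usually compare swing highs and lows.

```python
from itertools import accumulate


def cumulative_volume_delta(bars):
    """bars: list of (buy_volume, sell_volume, close). Returns the running CVD per bar."""
    deltas = [buy - sell for buy, sell, _ in bars]
    return list(accumulate(deltas))


def classify_cvd_vs_price(price_start, price_end, cvd_start, cvd_end):
    """Compare the sign of the price change with the sign of the CVD change."""
    product = (price_end - price_start) * (cvd_end - cvd_start)
    if product > 0:
        return "confirmation"
    if product < 0:
        return "divergence"
    return "neutral"


# Hypothetical bars: price drifts up while sellers dominate the later bars
bars = [(220, 100, 100.0), (150, 180, 100.5), (200, 120, 101.0),
        (100, 300, 101.2), (90, 240, 101.5)]
cvd = cumulative_volume_delta(bars)
signal = classify_cvd_vs_price(bars[0][2], bars[-1][2], cvd[0], cvd[-1])
print(cvd, signal)
```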
Absorption: a large passive participant "absorbs" aggressive orders without the price moving. This marks a support/resistance level defended by a major player.
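One simple way to flag absorption is to look for bars with a large one-sided delta and almost no price movement. The sketch below is a heuristic under stated assumptions: the bar layout, the function name `detect_absorption`, and both thresholds are hypothetical and would need calibration per instrument.

```python
def detect_absorption(bars, min_delta, max_price_move):
    """bars: list of dicts with 'open', 'close', 'buy_volume', 'sell_volume'.

    Returns (bar_index, absorbing_side) for bars where the aggressive side
    was large but the price barely moved.
    """
    events = []
    for i, bar in enumerate(bars):
        delta = bar["buy_volume"] - bar["sell_volume"]
        price_move = bar["close"] - bar["open"]
        if abs(delta) < min_delta or abs(price_move) > max_price_move:
            continue
        # Heavy aggressive selling that fails to move price implies a passive buyer,
        # and heavy aggressive buying that fails to move price implies a passive seller.
        events.append((i, "passive_buyer" if delta < 0 else "passive_seller"))
    return events


bars = [
    {"open": 100.00, "close": 100.50, "buy_volume": 500, "sell_volume": 100},
    {"open": 100.50, "close": 100.49, "buy_volume": 50, "sell_volume": 900},
    {"open": 100.49, "close": 100.50, "buy_volume": 200, "sell_volume": 180},
]
events = detect_absorption(bars, min_delta=300, max_price_move=0.05)
print(events)
```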
Feature Engineering from Order Flow
Trade-level features:
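import numpy as np
import pandas as pd

def compute_order_flow_features(trades_df, window_seconds=60):
    """Expects a sorted DatetimeIndex and columns 'side' ('buy'/'sell') and 'volume'."""
    window = f'{window_seconds}s'
    volume = trades_df['volume'].astype(float)
    # Buyer/Seller classification
    is_buy = trades_df['side'] == 'buy'
    buy_vol, sell_vol = volume.where(is_buy, 0.0), volume.where(~is_buy, 0.0)
    # Large trades: above the 95th percentile of trade size (full-sample quantile,
    # which looks ahead; use a trailing quantile for live features)
    large_buy_vol = buy_vol.where(volume > volume.quantile(0.95), 0.0)
    features = pd.DataFrame(index=trades_df.index)
    # Rolling window aggregations
    features['buy_volume'] = buy_vol.rolling(window).sum()
    features['sell_volume'] = sell_vol.rolling(window).sum()
    total_volume = features['buy_volume'] + features['sell_volume']
    features['cvd'] = features['buy_volume'] - features['sell_volume']
    features['trade_imbalance'] = features['cvd'] / total_volume
    # Trade size distribution (NaN where the window has no trades on that side)
    buy_count = is_buy.astype(float).rolling(window).sum().replace(0, np.nan)
    sell_count = (~is_buy).astype(float).rolling(window).sum().replace(0, np.nan)
    features['avg_buy_size'] = features['buy_volume'] / buy_count
    features['avg_sell_size'] = features['sell_volume'] / sell_count
    features['large_buy_ratio'] = large_buy_vol.rolling(window).sum() / total_volume
    return features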
Volume Profile: A histogram of traded volume by price level over a period. The VPOC (Volume Point of Control) is the level with the maximum volume and is commonly used as a support/resistance level.
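A volume profile and its VPOC can be computed by bucketing trades to a price grid. In this minimal sketch the `tick_size` parameter, the function names, and the sample trades are assumptions for illustration.

```python
from collections import defaultdict


def volume_profile(trades, tick_size):
    """trades: iterable of (price, volume). Returns {price_level: total_volume}."""
    profile = defaultdict(float)
    for price, volume in trades:
        # Snap the price to the nearest tick; the outer round removes float noise
        level = round(round(price / tick_size) * tick_size, 10)
        profile[level] += volume
    return dict(profile)


def vpoc(profile):
    """Price level with the maximum traded volume."""
    return max(profile, key=profile.get)


trades = [(100.00, 5), (100.25, 12), (100.26, 3), (100.50, 7), (100.25, 4)]
profile = volume_profile(trades, tick_size=0.25)
print(profile, vpoc(profile))
```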
Time and Sales analysis: Patterns in the trade sequence. For example, a cluster of large buys within a short time suggests a major player entering a position.
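The "cluster of large buys" pattern can be approximated with a sliding window over the time-and-sales tape. The sketch below is one possible heuristic; the size threshold, window length, minimum count, and the function name are all assumed values for the example.

```python
def find_large_buy_clusters(trades, size_threshold, window_seconds, min_trades):
    """trades: list of (timestamp_seconds, side, volume), sorted by time.

    Returns (first_ts, last_ts) for every window position that contains at
    least min_trades large buys; consecutive positions may overlap.
    """
    large_buy_times = [ts for ts, side, vol in trades
                       if side == "buy" and vol >= size_threshold]
    clusters = []
    start = 0
    for end in range(len(large_buy_times)):
        # Shrink the window from the left until it spans at most window_seconds
        while large_buy_times[end] - large_buy_times[start] > window_seconds:
            start += 1
        if end - start + 1 >= min_trades:
            clusters.append((large_buy_times[start], large_buy_times[end]))
    return clusters


trades = [(0, "buy", 10), (1, "buy", 500), (3, "sell", 600),
          (4, "buy", 700), (8, "buy", 650), (60, "buy", 900)]
clusters = find_large_buy_clusters(trades, size_threshold=500,
                                   window_seconds=10, min_trades=3)
print(clusters)
```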
Footprint Chart as Input Data
Footprint (or Cluster Chart): the price-level view of the order book combined with order flow:
- Each candle is divided into price levels
- Each level shows [buyer_volume × seller_volume]
- Divergences become visible: many buys at a level but the price did not rise → absorption
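The footprint structure described above can be built directly from classified trades. This is a minimal sketch; the tuple layout of `trades`, the `bar_seconds` and `tick_size` parameters, and the nested-dict output format are assumptions for illustration.

```python
from collections import defaultdict


def build_footprint(trades, bar_seconds, tick_size):
    """trades: iterable of (timestamp_seconds, price, volume, side).

    Returns {bar_start: {price_level: [buy_volume, sell_volume]}}.
    """
    footprint = defaultdict(lambda: defaultdict(lambda: [0.0, 0.0]))
    for ts, price, volume, side in trades:
        bar_start = int(ts // bar_seconds) * bar_seconds
        level = round(round(price / tick_size) * tick_size, 10)
        # Index 0 accumulates buyer-initiated volume, index 1 seller-initiated
        footprint[bar_start][level][0 if side == "buy" else 1] += volume
    return {bar: dict(levels) for bar, levels in footprint.items()}


trades = [(5, 100.00, 3, "buy"), (20, 100.00, 2, "sell"), (45, 100.25, 4, "buy"),
          (61, 100.25, 1, "sell"), (70, 100.25, 6, "sell")]
footprint = build_footprint(trades, bar_seconds=60, tick_size=0.25)
print(footprint)
```

Stacking the per-bar dictionaries onto a fixed price grid gives a dense array with one axis each for time, price level, and side, which is the shape a convolutional model expects.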
ML on footprint data:
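# Footprint as tensor: [time_bins × price_levels × 2 (buy/sell)]
# For example: 100 one-minute bars × 20 price levels × 2
import torch
import torch.nn as nn

class FootprintCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv3d(1, 32, kernel_size=(3, 3, 2))  # spans buy and sell together
        self.conv2 = nn.Conv3d(32, 64, kernel_size=(3, 3, 1))
        self.flatten = nn.Flatten()
        self.fc = nn.LazyLinear(1)  # input size is inferred from the first batch

    def forward(self, x):  # x: [batch, 1, time_bins, price_levels, 2]
        x = torch.relu(self.conv1(x))  # ReLU is an assumed choice of activation
        x = torch.relu(self.conv2(x))
        return self.fc(self.flatten(x))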
Volume Weighted Average Price (VWAP) Analysis
VWAP is the benchmark for institutional execution. The deviation of price from VWAP is read together with the buy/sell volume split:
VWAP deviation signal:
- Price above VWAP + large buy volumes → trend confirmed
- Price above VWAP + seller dominates in volume → potential reversal
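The two rules above translate into a short function once VWAP is computed as volume-weighted mean price. The sketch below assumes trades arrive as (price, volume) pairs and returns "no_signal" for cases the rules do not cover; the function names and labels are illustrative.

```python
def vwap(trades):
    """trades: list of (price, volume). Volume-weighted average price."""
    total_volume = sum(volume for _, volume in trades)
    return sum(price * volume for price, volume in trades) / total_volume


def vwap_deviation_signal(last_price, vwap_value, buy_volume, sell_volume):
    """Apply the two price-above-VWAP rules; other cases are left undecided."""
    if last_price > vwap_value:
        if buy_volume > sell_volume:
            return "trend_confirmed"
        if sell_volume > buy_volume:
            return "potential_reversal"
    return "no_signal"


session_trades = [(100.0, 10), (101.0, 30), (102.0, 10)]
session_vwap = vwap(session_trades)
confirmed = vwap_deviation_signal(101.5, session_vwap, buy_volume=800, sell_volume=300)
reversal = vwap_deviation_signal(101.5, session_vwap, buy_volume=300, sell_volume=800)
print(session_vwap, confirmed, reversal)
```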
TWAP vs. VWAP execution: For large orders, the task is to predict the optimal execution timing so that market impact is minimized. An RL agent can be trained to optimize the execution strategy.
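For context, the two static schedules that a learned execution policy is usually compared against can be written as follows. This sketch does not implement an RL agent; the function names and the expected intraday volume profile are hypothetical.

```python
def twap_schedule(total_quantity, n_slices):
    """TWAP: split the parent order into equal child orders over time."""
    return [total_quantity / n_slices] * n_slices


def vwap_schedule(total_quantity, expected_volume_profile):
    """VWAP: size each child order in proportion to expected market volume."""
    total_expected = sum(expected_volume_profile)
    return [total_quantity * v / total_expected for v in expected_volume_profile]


# Hypothetical U-shaped volume profile: busy open and close, quiet midday
twap = twap_schedule(1000, 4)
vwap_slices = vwap_schedule(1000, [40, 10, 10, 40])
print(twap, vwap_slices)
```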
Prediction Tasks
Short-term (1-10 min):
- Direction of mid-price over next N trades
- Probability of significant move in next X seconds
- Estimate of immediate market impact from order placement
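For the direction task, training labels can be derived from the mid-price series itself. The sketch below assumes one mid-price per trade and a three-class target; the `min_move` dead band is an added assumption that keeps tiny moves from being labeled as direction.

```python
def direction_labels(mid_prices, horizon, min_move=0.0):
    """Label each point by the mid-price change over the next `horizon` trades.

    Returns +1 (up), -1 (down) or 0 (flat); the last `horizon` points get no label.
    """
    labels = []
    for i in range(len(mid_prices) - horizon):
        change = mid_prices[i + horizon] - mid_prices[i]
        if change > min_move:
            labels.append(1)
        elif change < -min_move:
            labels.append(-1)
        else:
            labels.append(0)
    return labels


mids = [100.0, 100.5, 100.5, 100.0, 101.0]
labels = direction_labels(mids, horizon=2)
print(labels)
```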
Execution Quality:
- Slippage: predict execution deviation from theoretical price
- Optimal order sizing: maximum size without significant market impact
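The slippage target can be defined as the signed deviation of the average fill price from a reference price, in basis points. In this sketch the reference price (for example the arrival mid-price) and the sign convention, where positive means worse than reference, are assumptions.

```python
def slippage_bps(side, reference_price, fills):
    """fills: list of (price, quantity). Positive result = worse than reference."""
    quantity = sum(q for _, q in fills)
    avg_price = sum(p * q for p, q in fills) / quantity
    direction = 1 if side == "buy" else -1   # paying more is bad for buys, good for sells
    return direction * (avg_price - reference_price) / reference_price * 1e4


fills = [(100.0, 50), (100.1, 50)]
buy_slippage = slippage_bps("buy", 100.0, fills)
sell_slippage = slippage_bps("sell", 100.0, fills)
print(buy_slippage, sell_slippage)
```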
Stacked Imbalance and Major Levels
"Stacked imbalance" means several adjacent price levels dominated by buys or by sells. Such zones often act as support/resistance levels.
ML-detection of significant levels:
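def detect_imbalance_levels(footprint_data, threshold=0.7):
    """
    Level is significant if buy_vol / (buy_vol + sell_vol) > threshold
    OR sell_vol / (buy_vol + sell_vol) > threshold
    AND total volume in top-20% of daily
    Assumes a pandas DataFrame with one row per price level and columns 'buy_vol', 'sell_vol'.
    """
    total = footprint_data['buy_vol'] + footprint_data['sell_vol']
    buy_share = footprint_data['buy_vol'] / total  # NaN (never significant) if total is 0
    imbalanced = (buy_share > threshold) | ((1 - buy_share) > threshold)
    return footprint_data[imbalanced & (total >= total.quantile(0.8))]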
Data and Infrastructure
Tick data sources:
- Dukascopy (forex): free historical tick data
- Kinetick (futures): real-time tick data, $50/month
- IQFeed: comprehensive US market data
- Binance WebSocket: crypto trade stream and order book depth (L2) updates
Storage: ClickHouse suits tick data well: columnar storage and fast aggregate queries over billions of rows (actual latency depends on schema, hardware, and query shape). TimescaleDB is an alternative built on PostgreSQL.
Timeline: order flow feature engineering plus a baseline regression takes 3-4 weeks. A footprint CNN with backtesting and a production pipeline takes 3-4 months.







