Monitoring and Data Visualization System Development
A monitoring system collects metrics, events, and logs from various sources (servers, applications, IoT devices, business processes) and provides tools for anomaly detection, alert configuration, and real-time visualization.
Monitoring stack
Developing a custom stack makes sense when ready-made solutions (Grafana + Prometheus) are insufficient: you need non-standard visualization, business context (not just technical metrics), or built-in monitoring as a product feature (embedded monitoring).
Architectural components:
Data sources (servers, IoT, API)
↓ push/pull
Collector (Vector, Fluentd, custom agent)
↓
Time-Series DB (InfluxDB, TimescaleDB, ClickHouse)
↓
Query Engine + Alert Engine
↓
Visualization (WebSocket → React dashboard)
Time-Series databases
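TimescaleDB is a PostgreSQL extension. A hypertable automatically partitions data into chunks by time:
-- Creating a time-series table (column types are assumed from the insert below)
CREATE TABLE metrics (
  time TIMESTAMPTZ NOT NULL,
  device_id TEXT NOT NULL,
  temperature DOUBLE PRECISION,
  humidity DOUBLE PRECISION
);
-- Converting it into a hypertable partitioned by the time column
SELECT create_hypertable('metrics', 'time');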
-- Fast insertion
INSERT INTO metrics (time, device_id, temperature, humidity)
VALUES (NOW(), 'sensor_42', 23.5, 61.2);
-- Aggregation with time_bucket
SELECT time_bucket('5 minutes', time) AS bucket,
AVG(temperature) AS avg_temp
FROM metrics
WHERE device_id = 'sensor_42'
AND time > NOW() - INTERVAL '24 hours'
GROUP BY bucket ORDER BY bucket;
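To make the time_bucket aggregation above concrete, here is a minimal in-memory sketch of the same idea in JavaScript. It is not TimescaleDB code: the function name averageByBucket and the sample rows are hypothetical, timestamps are assumed to be epoch milliseconds, and buckets are aligned to the Unix epoch, which is a simplification.

```javascript
// Illustrative in-memory equivalent of time_bucket + AVG (not TimescaleDB code).
// Rows use the same column names as the metrics table; `time` is epoch ms.
function averageByBucket(rows, bucketMs) {
  const buckets = new Map(); // bucket start (ms) -> { sum, count }
  for (const { time, temperature } of rows) {
    // Round the timestamp down to the start of its bucket.
    const start = Math.floor(time / bucketMs) * bucketMs;
    const acc = buckets.get(start) ?? { sum: 0, count: 0 };
    acc.sum += temperature;
    acc.count += 1;
    buckets.set(start, acc);
  }
  // Emit buckets in chronological order, like ORDER BY bucket.
  return [...buckets.entries()]
    .sort(([a], [b]) => a - b)
    .map(([start, { sum, count }]) => ({ bucket: start, avgTemp: sum / count }));
}

const FIVE_MINUTES = 5 * 60 * 1000;
const sampleRows = [
  { time: 0, temperature: 20 },
  { time: 60_000, temperature: 22 },
  { time: 300_000, temperature: 30 },
];
const bucketed = averageByBucket(sampleRows, FIVE_MINUTES);
console.log(bucketed);
```

The first two samples fall into the same five-minute bucket and are averaged; the third starts a new bucket. In production this work stays in the database, since the query only returns one row per bucket instead of every raw sample.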
InfluxDB is a specialized TSDB. Depending on the version, it is queried with InfluxQL (SQL-like) or Flux (a functional scripting language). Good for IoT scenarios.
ClickHouse is a columnar database, optimal for high event volumes (billions of rows).
Real-time updates
Real-time visualization requires WebSocket or SSE:
// WebSocket subscription to a metric
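const ws = new WebSocket('wss://monitor.example.com/stream');
// send() throws while the socket is still connecting, so subscribe once it is open
ws.onopen = () => {
  ws.send(JSON.stringify({
    subscribe: ['cpu_usage', 'memory_usage'],
    device_id: 'server_01',
    interval: 5000
  }));
};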
ws.onmessage = ({ data }) => {
const metric = JSON.parse(data);
updateChart(metric.name, metric.value, metric.timestamp);
};
The server publishes new values from the time-series DB every N seconds via Redis Pub/Sub or Apache Kafka.
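The server-side fan-out described above can be sketched as follows. This is an assumption-laden illustration, not a prescribed design: the MetricHub class is hypothetical and keeps subscriptions in memory, standing in for Redis Pub/Sub or Kafka, and the send callback stands in for a real WebSocket connection. The subscription fields match the client message shown earlier.

```javascript
// Minimal in-memory stand-in for the broker (Redis Pub/Sub or Kafka in production).
class MetricHub {
  constructor() {
    this.subscribers = new Set(); // each entry: { deviceId, metrics, send }
  }

  // Register a client using the fields from its subscription message.
  subscribe({ subscribe, device_id }, send) {
    const sub = { deviceId: device_id, metrics: new Set(subscribe), send };
    this.subscribers.add(sub);
    return () => this.subscribers.delete(sub); // call to unsubscribe
  }

  // Deliver one sample to every client that asked for this device and metric.
  publish(sample) {
    let delivered = 0;
    for (const sub of this.subscribers) {
      if (sub.deviceId === sample.device_id && sub.metrics.has(sample.name)) {
        sub.send(JSON.stringify(sample));
        delivered += 1;
      }
    }
    return delivered;
  }
}

const hub = new MetricHub();
const received = [];
const unsubscribe = hub.subscribe(
  { subscribe: ['cpu_usage', 'memory_usage'], device_id: 'server_01' },
  (payload) => received.push(JSON.parse(payload))
);

hub.publish({ device_id: 'server_01', name: 'cpu_usage', value: 71.5, timestamp: 1700000000000 });
hub.publish({ device_id: 'server_01', name: 'disk_io', value: 12, timestamp: 1700000000000 }); // metric not subscribed
hub.publish({ device_id: 'server_02', name: 'cpu_usage', value: 40, timestamp: 1700000000000 }); // other device
unsubscribe();
hub.publish({ device_id: 'server_01', name: 'cpu_usage', value: 80, timestamp: 1700000005000 }); // after unsubscribe
console.log(received);
```

Only the first sample reaches the client. Filtering on the server keeps each connection's traffic proportional to what its dashboard actually displays.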
Alert system
An alert is a rule: "if metric X exceeds threshold Y for Z minutes → notify". Alert types:
- Threshold: cpu_usage > 90% for 5 minutes
- Anomaly detection: deviation from historical normal by N standard deviations
- Absence: metric not received for N minutes (device not responding)
- Rate of change: metric increased/decreased by more than X% per period
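The four alert types above could be evaluated roughly as follows. This is a sketch under stated assumptions, not a real alerting library: all function names are hypothetical, samples are assumed to be sorted by time with epoch-millisecond timestamps, and the anomaly check uses a plain mean and standard deviation over a history window, which is only one of several possible baselines.

```javascript
// Samples are sorted by time: [{ t: epochMs, v: number }, ...].

// Threshold: the latest run of samples above `limit` has lasted at least `forMs`.
function thresholdBreached(samples, limit, forMs, now) {
  if (samples.length === 0 || samples[samples.length - 1].v <= limit) return false;
  // Walk back to the first sample of the current breaching run.
  let i = samples.length - 1;
  while (i > 0 && samples[i - 1].v > limit) i -= 1;
  return now - samples[i].t >= forMs;
}

// Anomaly: value deviates from the historical mean by more than nSigma std devs.
function isAnomaly(history, value, nSigma) {
  if (history.length === 0) return false; // no baseline yet
  const mean = history.reduce((a, b) => a + b, 0) / history.length;
  const variance = history.reduce((a, b) => a + (b - mean) ** 2, 0) / history.length;
  const std = Math.sqrt(variance);
  if (std === 0) return value !== mean; // flat history: any change is unusual
  return Math.abs(value - mean) > nSigma * std;
}

// Absence: nothing received from the device for longer than maxSilenceMs.
function isAbsent(lastSeenMs, maxSilenceMs, now) {
  return now - lastSeenMs > maxSilenceMs;
}

// Rate of change: relative change between two readings exceeds maxPercent.
function rateOfChangeExceeded(previous, current, maxPercent) {
  if (previous === 0) return current !== 0; // percentage undefined at zero
  return Math.abs((current - previous) / previous) * 100 > maxPercent;
}

const MIN = 60_000;
const cpu = [0, 1, 2, 3, 4, 5].map((m) => ({ t: m * MIN, v: 95 }));
console.log(thresholdBreached(cpu, 90, 5 * MIN, 5 * MIN)); // true
console.log(isAnomaly([10, 12, 11, 9, 10, 11, 10, 9], 20, 3)); // true
console.log(isAbsent(0, 10 * MIN, 11 * MIN)); // true
console.log(rateOfChangeExceeded(100, 130, 25)); // true
```

Requiring the breach to persist for the whole duration, rather than firing on a single sample, is what keeps short spikes from producing notifications.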
Notification channels: email, Telegram, Slack, PagerDuty, SMS, webhook.
Annotations and context
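Annotations are markers on time-series charts that explain anomalies: a new version deployment, planned maintenance, an incident. They give context such as "here the metric spiked because we deployed v2.3".
-- Column types are assumed for illustration
CREATE TABLE annotations (
  id BIGSERIAL PRIMARY KEY,
  title TEXT NOT NULL,
  description TEXT,
  tags TEXT[],
  start_time TIMESTAMPTZ NOT NULL,
  end_time TIMESTAMPTZ, -- NULL for point-in-time events
  created_by TEXT
);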
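When a chart is rendered, the dashboard needs the annotations that overlap the visible time range. A minimal sketch of that lookup, assuming annotation objects that mirror the table columns, epoch-millisecond times, and a null end_time for point-in-time events such as a deployment (the function name and sample data are hypothetical):

```javascript
// Return annotations whose [start_time, end_time] interval overlaps [from, to].
function annotationsInRange(annotations, from, to) {
  return annotations.filter((a) => {
    // Treat a missing end_time as a zero-length event at start_time.
    const end = a.end_time ?? a.start_time;
    // Interval overlap: starts before the window ends and ends after it starts.
    return a.start_time <= to && end >= from;
  });
}

const annotations = [
  { title: 'Deployed v2.3', tags: ['deploy'], start_time: 1000, end_time: null },
  { title: 'Planned maintenance', tags: ['maintenance'], start_time: 2000, end_time: 5000 },
  { title: 'Incident', tags: ['incident'], start_time: 9000, end_time: 9500 },
];

const visible = annotationsInRange(annotations, 3000, 6000);
console.log(visible.map((a) => a.title));
```

With the window 3000 to 6000, only the maintenance interval is returned, because it is the only one that overlaps the window. The same overlap condition translates directly into a WHERE clause on start_time and end_time.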
IoT visualization
For industrial monitoring and IoT scenarios, specialized visualizations are often needed:
- Heatmap — distribution of values over time
- Gauge — current value with color zones (green/yellow/red)
- Topology map — object layout with color status (SCADA-like)
- Geo-map — device locations with color status
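The color status shared by the gauge, topology map, and geo-map can come from one small mapping function. The sketch below assumes hypothetical zone boundaries for a temperature reading; real limits depend on the sensor and the process being monitored.

```javascript
// Hypothetical zone boundaries, ordered from lowest to highest upper bound.
const temperatureZones = [
  { upTo: 60, color: 'green' },
  { upTo: 80, color: 'yellow' },
  { upTo: Infinity, color: 'red' },
];

// Return the color of the first zone whose upper bound contains the value.
function gaugeColor(value, zones) {
  const zone = zones.find((z) => value <= z.upTo);
  return zone ? zone.color : 'gray'; // gray: no zone matched, e.g. NaN reading
}

console.log(gaugeColor(23.5, temperatureZones)); // green
console.log(gaugeColor(75, temperatureZones)); // yellow
console.log(gaugeColor(95, temperatureZones)); // red
```

Keeping the zones as data rather than hard-coded conditions lets each device type carry its own thresholds while the widgets share the same rendering logic.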
Data retention
Time-series data grows fast. Retention policy:
- Raw data (every second): 7 days
- Aggregated by minute: 90 days
- Aggregated by hour: 2 years
- Daily aggregates: unlimited
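In TimescaleDB, retention for the raw hypertable is configured via add_retention_policy; the aggregated tiers are usually built as continuous aggregates, each with its own retention policy: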
SELECT add_retention_policy('metrics', INTERVAL '7 days');
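With tiered retention, the query layer has to decide which tier can still serve a requested time range. A minimal sketch of that choice, using the retention periods listed above; the table names metrics_1m, metrics_1h, and metrics_1d and the function tierFor are hypothetical:

```javascript
const DAY = 24 * 60 * 60 * 1000;

// Tiers from the retention policy, finest resolution first.
const tiers = [
  { table: 'metrics', resolution: '1s', keepMs: 7 * DAY },
  { table: 'metrics_1m', resolution: '1m', keepMs: 90 * DAY },
  { table: 'metrics_1h', resolution: '1h', keepMs: 2 * 365 * DAY },
  { table: 'metrics_1d', resolution: '1d', keepMs: Infinity },
];

// Pick the finest-grained tier that still holds data back to `from`.
function tierFor(from, now) {
  const age = now - from;
  return tiers.find((t) => age <= t.keepMs);
}

const now = Date.UTC(2024, 0, 1);
console.log(tierFor(now - 3 * DAY, now).table); // metrics
console.log(tierFor(now - 30 * DAY, now).table); // metrics_1m
console.log(tierFor(now - 400 * DAY, now).table); // metrics_1h
```

A dashboard showing the last year would therefore read hourly aggregates, which also keeps the number of points sent to the browser manageable.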
Timeline
MVP (metric collection, basic charts, threshold alerts): 6–8 weeks. Full system with anomaly detection, IoT visualization, custom dashboards and PagerDuty integration: 4–6 months.







