Setting Up Alerts on Web Application Metrics
Alerting without thought turns into noise: 200 notifications overnight, half of them self-resolved within 2 minutes. Teams learn to ignore alerts, and that is exactly when real problems slip through. The goal: alert only on situations that require human action.
Principles Before Configuration
Alert on symptoms, not causes. An alert on "site unavailable to users" matters more than "CPU > 80%". High CPU is a cause that may never affect users.
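A minimal sketch of a symptom-based rule, assuming the Blackbox Exporter's probe_success metric is being scraped (the job name and URL are placeholders):

```yaml
# Symptom-based alert: the user-facing endpoint is down,
# regardless of which internal cause (CPU, disk, deploy) triggered it.
- alert: SiteDown
  expr: probe_success{job="blackbox", instance="https://example.com"} == 0
  for: 2m
  labels:
    severity: critical
```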
Four Golden Signals (Google SRE Book):
- Latency — response time
- Traffic — rps/rpm
- Errors — error percentage
- Saturation — resource utilization
Start with the first three.
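Assuming the application exports the conventional http_requests_total counter and http_request_duration_seconds histogram, the first three signals map to PromQL queries like these:

```promql
# Latency: p95 response time over the last 5 minutes
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Traffic: requests per second
sum(rate(http_requests_total[5m]))

# Errors: share of 5xx responses among all responses
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
```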
Burn rate instead of thresholds. "Error rate > 5% for 5 minutes" is better than "1 error per minute". Burn rate shows how fast you're consuming your SLO error budget.
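The burn-rate arithmetic itself is simple: divide the observed error rate by the error budget the SLO allows. A small sketch (the numbers are illustrative):

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """Observed error rate divided by the SLO's allowed error rate.

    A burn rate of 1.0 means the error budget is being spent exactly
    on schedule; 10.0 means a 30-day budget is gone in 3 days.
    """
    budget = 1.0 - slo  # e.g. a 99.9% SLO leaves a 0.1% error budget
    return error_rate / budget

# With a 99.9% SLO, a steady 1% error rate burns the budget 10x faster
# than planned.
print(round(burn_rate(0.01, 0.999), 2))  # -> 10.0
```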
Stack: Prometheus + Alertmanager + Grafana
Alert rules in Prometheus:
groups:
  - name: web-app
    rules:
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses; a bare rate() would
        # give requests per second, not a percentage.
        expr: sum by (instance) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (instance) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate {{ $value | humanizePercentage }} for {{ $labels.instance }}"
      - alert: SlowResponses
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p95 latency is {{ $value }}s"
      - alert: DatabaseConnections
        expr: pg_stat_activity_count > 90
        for: 5m
        labels:
          severity: warning
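Routing by severity happens in Alertmanager. A minimal sketch: critical alerts page the on-call, warnings go to chat. The receiver names, channel, and key placeholders are illustrative, not a prescribed setup:

```yaml
route:
  receiver: slack-warnings          # default: everything goes to chat
  group_by: ['alertname', 'instance']
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-oncall    # critical alerts page a human

receivers:
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>
  - name: slack-warnings
    slack_configs:
      - api_url: <slack-webhook-url>
        channel: '#alerts'
```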
Timeline
Basic alerts for core metrics: 1 day. Refined thresholds and cross-service correlation: 2-3 days.