Setting Up SLA Monitoring for a Web Application
SLA monitoring measures whether a system meets its commitments on availability and quality. Without instrumentation, an SLA remains a declaration of intent; with monitoring, it becomes a measurable, verifiable agreement.
What We Measure in an SLA
Availability. The percentage of time the service works correctly. Formula: (total_time - downtime) / total_time * 100%. A 99.9% SLA allows about 8.76 hours of downtime per year; 99.99% allows about 52.6 minutes.
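The availability formula can also be inverted to get the downtime a given target permits. The following is a minimal Python sketch; the function names are illustrative and do not come from any monitoring tool.

```python
from datetime import timedelta


def availability_percent(total: timedelta, downtime: timedelta) -> float:
    """Availability = (total_time - downtime) / total_time * 100."""
    return (total - downtime) / total * 100


def allowed_downtime(slo_percent: float, period: timedelta) -> timedelta:
    """Maximum downtime that still meets slo_percent over the period."""
    return period * (1 - slo_percent / 100)


year = timedelta(days=365)
print(allowed_downtime(99.9, year))   # roughly 8.76 hours
print(allowed_downtime(99.99, year))  # roughly 52.6 minutes
```

The same function works for any reporting period, for example timedelta(days=30) for a monthly SLA.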
Response Time. P95 and P99 response times matter more than the average, because the average hides the tail of slow requests that users complain about. Typical targets for a web app: P95 < 500 ms, P99 < 2 s.
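To illustrate why the average is misleading, the sketch below compares the mean with nearest-rank percentiles on a made-up latency sample. The sample values and the percentile helper are assumptions for illustration; production systems usually estimate percentiles from histograms rather than raw samples.

```python
import math
from statistics import mean


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in ms: 90 fast requests, 8 slow ones, 2 very slow ones.
latencies_ms = [120.0] * 90 + [900.0] * 8 + [4000.0] * 2

print(mean(latencies_ms))            # 260.0, looks healthy against a 500 ms target
print(percentile(latencies_ms, 95))  # 900.0, breaches P95 < 500 ms
print(percentile(latencies_ms, 99))  # 4000.0, breaches P99 < 2 s
```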
Error Rate. The percentage of responses with a 5xx status. Target: < 0.1% in production.
Throughput. If the SLA covers throughput, it is expressed as RPS (requests per second) or transactions per unit of time.
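Error rate and throughput are simple ratios over a measurement window. A minimal sketch, assuming the status codes for one window are already available as a list (the helper names and sample numbers are hypothetical):

```python
def error_rate_percent(status_codes: list[int]) -> float:
    """Share of responses with a 5xx status, in percent."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors * 100 / len(status_codes)


def throughput_rps(request_count: int, window_seconds: float) -> float:
    """Average requests per second over the window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return request_count / window_seconds


# Hypothetical one-minute window: 5997 successful responses and 3 server errors.
codes = [200] * 5997 + [503] * 3
print(error_rate_percent(codes))       # 0.05, within the < 0.1% target
print(throughput_rps(len(codes), 60))  # 100.0
```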
Metrics Collection Tools
Prometheus + Grafana: the standard self-hosted stack. Prometheus scrapes metrics every 15-30 seconds; Grafana visualizes them and tracks SLIs against SLOs.
Datadog / New Relic: managed solutions with a quick start and built-in SLO dashboards.
Uptime Robot / Freshping: external availability monitoring with checks from locations around the world; supplements internal monitoring.
SLI/SLO Setup in Prometheus
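# Rule entries; place them under a rule group's "rules:" key.
# Availability SLI (SLO target: 99.9%), aggregated per job
- record: job:availability:ratio_rate5m
  expr: |
    1 - (
      sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
      /
      sum by (job) (rate(http_requests_total[5m]))
    )
# Alert: SLO at risk (burn rate > 14.4x over 1 hour, confirmed by the 5m window).
# 14.4 * 0.001 = 0.0144 is the error ratio that burns a 99.9% budget at 14.4x.
- alert: SLOBurnRateTooHigh
  expr: |
    job:availability:ratio_rate5m < (1 - 14.4 * 0.001)
    and
    (
      sum by (job) (rate(http_requests_total{status=~"5.."}[1h]))
      /
      sum by (job) (rate(http_requests_total[1h]))
    ) > (14.4 * 0.001)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "SLO availability at risk"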
Error budget is the key concept. For a 99.9% SLO over 30 days, the error budget is 0.1% of the window, or 43.2 minutes. Monitoring should show how much of the error budget has been spent and the current burn rate.
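The arithmetic behind the error budget and the 14.4x burn rate in the alert comment can be sketched in Python. The function names are illustrative. The assumption that 14.4x corresponds to spending 2% of a 30-day budget within one hour follows the commonly used burn-rate alerting scheme and is not stated in the rules above.

```python
from datetime import timedelta


def error_budget(slo_percent: float, window: timedelta) -> timedelta:
    """Total time the service may be failing within the window."""
    return window * (1 - slo_percent / 100)


def burn_rate(observed_error_ratio: float, slo_percent: float) -> float:
    """How many times faster than sustainable the budget is being spent.

    A value of 1.0 means the budget runs out exactly at the end of the window.
    """
    return observed_error_ratio / (1 - slo_percent / 100)


def burn_rate_threshold(budget_fraction: float, window: timedelta,
                        alert_window: timedelta) -> float:
    """Burn rate that spends budget_fraction of the budget within alert_window."""
    return budget_fraction * (window / alert_window)


window = timedelta(days=30)
print(error_budget(99.9, window))                             # roughly 43.2 minutes
print(burn_rate(0.0144, 99.9))                                # roughly 14.4
print(burn_rate_threshold(0.02, window, timedelta(hours=1)))  # roughly 14.4
```

At a sustained burn rate of 14.4, a 30-day budget would be exhausted in about 50 hours, which is why such a rate is usually treated as page-worthy.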
External Availability Checks
Internal metrics may be green while users cannot reach the service, for example during a DNS or CDN failure. External HTTP checks from multiple geographic locations cover this gap:
- Pingdom, Uptime Robot, Checkly: 1-minute checks from 5-20 locations worldwide
- Blackbox Exporter (Prometheus): probes over HTTP, TCP, and ICMP from your own infrastructure
Minimum check set: homepage, login page, API health endpoint, and a post-authentication page (to verify that the database path works).
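What a single external HTTP check does can be sketched with the Python standard library. This is an illustration of the idea, not a replacement for the tools listed above; the ProbeResult type, the probe function, and the example.com URLs are hypothetical. The fetch step is injectable so the logic can be exercised without network access.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    url: str
    ok: bool
    status: Optional[int]
    seconds: float


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # urllib raises on 4xx/5xx; the code is still a valid result


def probe(url: str, fetch: Callable[[str], int] = http_status) -> ProbeResult:
    """Run one check; network failures (DNS, timeout, refused) count as down."""
    started = time.monotonic()
    try:
        status: Optional[int] = fetch(url)
    except OSError:  # URLError and socket timeouts are OSError subclasses
        status = None
    elapsed = time.monotonic() - started
    ok = status is not None and 200 <= status < 400
    return ProbeResult(url, ok, status, elapsed)


# Demonstration with a stand-in fetcher, so the sketch runs offline.
fake_responses = {
    "https://example.com/": 200,
    "https://example.com/login": 200,
    "https://example.com/api/health": 503,
}
for url in fake_responses:
    result = probe(url, fetch=fake_responses.__getitem__)
    print(url, "UP" if result.ok else "DOWN", result.status)
```

Calling probe(url) without the fetch argument performs a real request. Running the same script from several regions approximates what the hosted services do.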
SLA Reporting
An automated monthly report for the business should include:
- Actual uptime vs target
- Incident list with duration and cause
- Error budget usage
- Trend: improving or degrading
Grafana (Enterprise and Cloud editions) can generate PDF reports on a schedule. For enterprise setups, Datadog SLO Reports or Statuspage are alternatives.
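The report fields listed above can be derived from an incident log. The sketch below assumes non-overlapping incidents and a hypothetical Incident record; the trend line would additionally need the same figures from previous months.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    start: datetime
    end: datetime
    cause: str

    @property
    def duration(self) -> timedelta:
        return self.end - self.start


def monthly_report(incidents: list[Incident], period: timedelta,
                   slo_percent: float) -> dict:
    """Summarize uptime and error budget usage for one reporting period."""
    downtime = sum((i.duration for i in incidents), timedelta())
    budget = period * (1 - slo_percent / 100)
    actual = (period - downtime) / period * 100
    return {
        "target_uptime_percent": slo_percent,
        "actual_uptime_percent": actual,
        "slo_met": actual >= slo_percent,
        "downtime": downtime,
        "error_budget_used_percent": downtime / budget * 100,
        "incidents": [(i.start, i.duration, i.cause) for i in incidents],
    }


# Hypothetical incidents for a 30-day period.
incidents = [
    Incident(datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 20), "bad deploy"),
    Incident(datetime(2024, 1, 18, 2, 0), datetime(2024, 1, 18, 2, 10), "database failover"),
]
report = monthly_report(incidents, timedelta(days=30), 99.9)
print(f"uptime {report['actual_uptime_percent']:.3f}% vs target {report['target_uptime_percent']}%")
print(f"error budget used: {report['error_budget_used_percent']:.1f}%")
```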
Setup Timeline
- Prometheus + Grafana + basic SLI: 2-3 days
- SLO rules + error budget dashboard: 1-2 days
- External checks + alerts: 1 day
- Report configuration: 1-2 days







