Implementing SLA Dashboard (Uptime, Response Time, Error Rate)
SLA dashboard is single window where business and development see same numbers about service state. Key requirement: dashboard must answer "are we meeting SLA right now" in 5 seconds of viewing.
SLA Dashboard Structure
Good dashboard has three detail levels:
Top panel (status right now):
- Current uptime for month (e.g., 99.94%)
- Remaining error budget in minutes/hours
- Service status: OK / DEGRADED / DOWN (large colored indicator)
Middle panel (trends over period):
- Uptime graph over last 30/90 days
- P50/P95/P99 response time — time series
- Error rate — time series with incident annotations
Bottom panel (details):
- Breakdown by endpoints: which are slowest
- Breakdown by regions/data centers
- Recent incidents with duration
Implementation in Grafana
{
"panels": [
{
"title": "SLO Availability (30d)",
"type": "stat",
"targets": [{
"expr": "avg_over_time(job:availability:ratio_rate5m[30d]) * 100",
"legendFormat": "Availability %"
}],
"thresholds": [
{"color": "red", "value": 99.0},
{"color": "yellow", "value": 99.9},
{"color": "green", "value": 99.95}
]
},
{
"title": "Error Budget Remaining",
"type": "gauge",
"targets": [{
"expr": "slo_error_budget_remaining_minutes"
}]
}
]
}
Dashboard variables for filtering: $service, $environment, $time_range. One dashboard for all services.
Key Metrics and Calculation
Uptime %:
(1 - sum(increase(http_requests_total{status=~"5.."}[30d]))
/ sum(increase(http_requests_total[30d]))) * 100
P95 Response Time:
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket[5m])
)
Error Budget Burn Rate (1h):
(
rate(http_requests_total{status=~"5.."}[1h])
/ rate(http_requests_total[1h])
) / (1 - 0.999)
Burn rate > 14.4 means: at current pace, entire monthly error budget burns in 2 days.
Dashboard for Different Audiences
Technical dashboard (for developers): detailed metrics, breakdown by services and endpoints, stack traces from Sentry/Jaeger, correlation with deploys.
Management dashboard (for business): uptime in percent, incident count, trend. Minimum numbers, maximum context. Can be read-only Grafana snapshot, updated daily.
Public Status Page (for users) — separate implementation (Cachet, Statuspage.io, self-hosted).
Integration with Alerting
Dashboard should show: active alerts right now, alert history over period. Grafana Alerting or Alertmanager (with Prometheus) integrates directly. Each alert on dashboard — annotation on graph (vertical line with description).
Implementation Timeline
- Basic panels (uptime, response time, error rate) — 1-2 days
- Error budget + burn rate — 1 day
- Incident annotations + history — 1 day
- Management dashboard — 1-2 days







