SLA dashboard with uptime, response time, and error rate

Implementing an SLA Dashboard (Uptime, Response Time, Error Rate)

An SLA dashboard is a single window where business and development see the same numbers about the state of the service. Key requirement: the dashboard must answer "are we meeting the SLA right now?" within 5 seconds of viewing.

SLA Dashboard Structure

A good dashboard has three levels of detail:

Top panel (status right now):

  • Current uptime for the month (e.g., 99.94%)
  • Remaining error budget in minutes/hours
  • Service status: OK / DEGRADED / DOWN (large colored indicator)
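
The remaining error budget in the top panel follows from simple arithmetic. The sketch below assumes a 99.9% SLO and a full 30-day window (both are illustrative assumptions, and the function names are hypothetical); it converts the example uptime of 99.94% into remaining budget minutes.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total downtime the SLO allows over the window, in minutes."""
    return (1.0 - slo) * window_days * MINUTES_PER_DAY


def downtime_minutes(uptime_percent: float, window_days: int = 30) -> float:
    """Downtime implied by a measured uptime percentage over the window."""
    return (1.0 - uptime_percent / 100.0) * window_days * MINUTES_PER_DAY


def remaining_budget_minutes(slo: float, uptime_percent: float,
                             window_days: int = 30) -> float:
    """Budget left; a negative value means the SLO is already violated."""
    return (error_budget_minutes(slo, window_days)
            - downtime_minutes(uptime_percent, window_days))


# Assumed SLO of 99.9% and the example uptime of 99.94% over 30 days.
print(f"budget:    {error_budget_minutes(0.999):.2f} min")             # 43.20
print(f"used:      {downtime_minutes(99.94):.2f} min")                 # 25.92
print(f"remaining: {remaining_budget_minutes(0.999, 99.94):.2f} min")  # 17.28
```

A negative result is a natural trigger for switching the status indicator away from OK.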

Middle panel (trends over period):

  • Uptime graph over last 30/90 days
  • P50/P95/P99 response time — time series
  • Error rate — time series with incident annotations

Bottom panel (details):

  • Breakdown by endpoints: which are slowest
  • Breakdown by regions/data centers
  • Recent incidents with duration

Implementation in Grafana

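{
  "panels": [
    {
      "title": "SLO Availability (30d)",
      "type": "stat",
      "targets": [{
        "expr": "avg_over_time(job:availability:ratio_rate5m[30d]) * 100",
        "legendFormat": "Availability %"
      }],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "thresholds": {
            "mode": "absolute",
            "steps": [
              {"color": "red", "value": null},
              {"color": "yellow", "value": 99.9},
              {"color": "green", "value": 99.95}
            ]
          }
        }
      }
    },
    {
      "title": "Error Budget Remaining",
      "type": "gauge",
      "targets": [{
        "expr": "slo_error_budget_remaining_minutes"
      }]
    }
  ]
}

Both queries assume that the series job:availability:ratio_rate5m and slo_error_budget_remaining_minutes already exist in Prometheus, for example as recording rules. Neither is a built-in metric.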

Define dashboard variables for filtering ($service, $environment, $time_range) and reference them in the panel queries, so that one dashboard serves all services.

Key Metrics and Calculation

Uptime % (request-based: share of requests over 30 days that did not return 5xx):

(1 - sum(increase(http_requests_total{status=~"5.."}[30d]))
   / sum(increase(http_requests_total[30d]))) * 100
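
To make the ratio concrete, here is a small sketch of the same calculation on raw request counts. The function name and the sample numbers are illustrative assumptions. It handles a window with no traffic explicitly, because the ratio is undefined in that case.

```python
from typing import Optional


def availability_percent(total_requests: int,
                         error_requests: int) -> Optional[float]:
    """Share of requests that did not fail with 5xx, in percent.

    Returns None when there was no traffic, because the ratio is undefined.
    """
    if total_requests <= 0:
        return None
    return (1.0 - error_requests / total_requests) * 100.0


# Hypothetical month: 1,000,000 requests, 600 of them answered with 5xx.
print(availability_percent(1_000_000, 600))
print(availability_percent(0, 0))  # None: no traffic in the window
```

Decide up front how the panel should render the no-traffic case (for example as "no data" rather than 0% or 100%).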

P95 Response Time (aggregated across all series of the service):

histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
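
histogram_quantile estimates a quantile by linear interpolation inside the bucket that contains the target rank. The sketch below reimplements that idea on plain cumulative bucket counts to show where the number comes from. It is a simplified illustration under stated assumptions (classic cumulative buckets, lowest bucket starting at 0), not Prometheus's actual code, and the sample buckets are made up.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) buckets.

    The bucket list must include the +Inf bucket, as Prometheus histograms do.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls into the open-ended bucket: the best available
                # answer is the highest finite upper bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            in_bucket = count - prev_count
            # Linear interpolation between the bucket's lower and upper bound.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Made-up cumulative counts for buckets le=0.1, 0.25, 0.5, 1.0, +Inf (seconds).
sample = [(0.1, 800), (0.25, 940), (0.5, 990), (1.0, 999), (math.inf, 1000)]
print(histogram_quantile(0.95, sample))  # about 0.3 s
```

Because the result is interpolated, its precision depends on the bucket boundaries: the estimate can never be finer than the bucket that contains the quantile.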

Error Budget Burn Rate (1h, for a 99.9% SLO):

(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
  / sum(rate(http_requests_total[1h]))
) / (1 - 0.999)

A burn rate above 14.4 means that, at the current pace, the entire 30-day error budget is consumed in about 2 days (30 / 14.4 ≈ 2.1).
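
The relationship between error ratio, burn rate, and time to exhaustion can be checked with a few lines of arithmetic. This is a minimal sketch assuming a 99.9% SLO and a 30-day budget window; the function names are hypothetical.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return error_ratio / (1.0 - slo)


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is gone if the burn rate stays constant."""
    return window_days / rate


def budget_consumed_fraction(rate: float, hours: float,
                             window_days: float = 30.0) -> float:
    """Fraction of the window's budget spent after `hours` at this rate."""
    return rate * hours / (window_days * 24.0)


# A 1.44% error ratio against a 99.9% SLO gives a burn rate of about 14.4.
rate = burn_rate(0.0144, 0.999)
print(f"burn rate: {rate:.1f}")                                           # 14.4
print(f"budget gone in: {days_to_exhaustion(rate):.2f} days")             # 2.08
print(f"spent in 1h: {budget_consumed_fraction(rate, 1.0) * 100:.1f}%")   # 2.0
```

In other words, one hour at a burn rate of 14.4 spends about 2% of the 30-day budget.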

Dashboard for Different Audiences

Technical dashboard (for developers): detailed metrics, breakdown by services and endpoints, error stack traces from Sentry, request traces from Jaeger, correlation with deploys.

Management dashboard (for business): uptime in percent, incident count, trend. Minimum numbers, maximum context. This can be a read-only Grafana snapshot that is refreshed daily.

Public status page (for users): a separate implementation (Cachet, Statuspage.io, or a self-hosted solution).

Integration with Alerting

The dashboard should show the alerts that are active right now and the alert history over the selected period. Grafana Alerting and Alertmanager (with Prometheus) both integrate directly. Show each alert as an annotation on the graphs (a vertical line with a description).
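
If alerts come from a system that Grafana does not annotate automatically, annotations can be created through Grafana's HTTP API. The sketch below only builds the JSON body. It assumes the annotations endpoint (POST /api/annotations) with the fields time, timeEnd, tags, text, and dashboardUID; check these against the API reference for your Grafana version. The helper name, alert name, and dashboard UID are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Optional


def alert_annotation(alert_name: str, description: str,
                     started_at: datetime,
                     ended_at: Optional[datetime] = None,
                     dashboard_uid: Optional[str] = None) -> dict:
    """Build a request body for an annotation (assumed field names)."""
    def to_ms(dt: datetime) -> int:
        # The API is assumed to expect epoch timestamps in milliseconds.
        return int(dt.timestamp() * 1000)

    payload = {
        "time": to_ms(started_at),
        "tags": ["alert", alert_name],
        "text": description,
    }
    if ended_at is not None:
        payload["timeEnd"] = to_ms(ended_at)  # renders as a region, not a line
    if dashboard_uid is not None:
        payload["dashboardUID"] = dashboard_uid  # omit for an org-wide annotation
    return payload


start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
body = alert_annotation("HighErrorRate", "5xx ratio above 1% for 10 minutes",
                        start, start + timedelta(minutes=30),
                        dashboard_uid="sla-overview")
print(json.dumps(body, indent=2))
```

The resulting body would be sent with an authenticated POST request; that step is left out here because it depends on how the Grafana instance is secured.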

Implementation Timeline

  • Basic panels (uptime, response time, error rate) — 1-2 days
  • Error budget + burn rate — 1 day
  • Incident annotations + history — 1 day
  • Management dashboard — 1-2 days