Setting up custom monitoring dashboards (Grafana)
Grafana dashboards are the visual language of infrastructure state. Default community dashboards are often overloaded with panels you don't need and fail to answer questions specific to your service. Build custom dashboards around the questions your team actually asks, not around whatever metrics happen to be available.
Principles of an effective dashboard
Information hierarchy. The top row carries the most important answer: is the service working or not. Details go below. Don't make eyes search for status.
Actionable metrics. Each panel answers a question that drives a decision. "CPU 67%" is not actionable. "CPU 67%, target 60%, trending up, 3 instances scaling" is actionable.
Time variables. Built-in variables such as $__range and $__interval let viewers change the time window while preserving graph resolution.
Dashboard structure for web application
Row 1: Service Health (large stat panels)
[Error Rate %] [P95 Latency ms] [Uptime %] [Active Users]
Row 2: Traffic & Performance
[RPS - timeseries] [Response time P50/P95/P99 - timeseries] [HTTP status breakdown]
Row 3: Infrastructure
[CPU % per host] [Memory % per host] [Disk I/O] [Network I/O]
Row 4: Database
[DB Connections active/max] [Query latency P95] [Slow queries count]
Row 5: Cache
[Redis hit rate %] [Redis memory usage] [Evictions per sec]
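A layout like this can also be generated programmatically instead of clicked together in the UI. A minimal Python sketch (panel titles mirror the rows above; the sizes are assumptions, based on Grafana's 24-unit-wide gridPos grid):

```python
# Sketch: emit the five-row layout above as Grafana dashboard JSON.
# Grafana lays panels on a 24-column grid; each row splits it evenly.
import json

def make_row(titles, y, height=8):
    width = 24 // len(titles)  # split the 24-unit grid evenly
    return [
        {
            "title": title,
            "type": "timeseries",
            "gridPos": {"x": i * width, "y": y, "w": width, "h": height},
        }
        for i, title in enumerate(titles)
    ]

rows = [
    ["Error Rate %", "P95 Latency ms", "Uptime %", "Active Users"],
    ["RPS", "Response time P50/P95/P99", "HTTP status breakdown"],
    ["CPU % per host", "Memory % per host", "Disk I/O", "Network I/O"],
    ["DB Connections", "Query latency P95", "Slow queries"],
    ["Redis hit rate %", "Redis memory usage", "Evictions per sec"],
]

panels = []
for row_index, titles in enumerate(rows):
    panels.extend(make_row(titles, y=row_index * 8))

dashboard_json = json.dumps({"title": "Application Overview", "panels": panels})
```

Generating the layout this way keeps panel sizing consistent and makes adding a row a one-line change.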
Prometheus queries for key panels
Error Rate:
sum(rate(http_requests_total{status=~"5..", job="app"}[5m]))
/
sum(rate(http_requests_total{job="app"}[5m]))
* 100
P95 Latency:
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket{job="app"}[5m])) by (le)
)
Active DB Connections:
pg_stat_activity_count{datname="mydb", state="active"}
Redis Hit Rate:
rate(redis_keyspace_hits_total[5m])
/
(rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))
* 100
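These expressions can be sanity-checked outside Grafana against Prometheus's instant-query endpoint (/api/v1/query). A sketch, assuming a reachable Prometheus server (the localhost address is an assumption):

```python
# Sketch: evaluate a panel's PromQL expression via the Prometheus HTTP API.
# /api/v1/query is the standard instant-query endpoint.
import json
import urllib.parse
import urllib.request

ERROR_RATE_QUERY = (
    'sum(rate(http_requests_total{status=~"5..", job="app"}[5m])) '
    '/ sum(rate(http_requests_total{job="app"}[5m])) * 100'
)

def build_query_url(base_url: str, promql: str) -> str:
    """URL-encode a PromQL expression for the instant-query endpoint."""
    return f"{base_url}/api/v1/query?{urllib.parse.urlencode({'query': promql})}"

def instant_query(base_url: str, promql: str) -> float:
    """Run an instant query and return the first sample's value."""
    with urllib.request.urlopen(build_query_url(base_url, promql)) as resp:
        body = json.load(resp)
    # Prometheus wraps results as {"status": "ok", "data": {"result": [...]}}
    return float(body["data"]["result"][0]["value"][1])
```

When a panel shows "No data", running the same expression through instant_query quickly tells you whether the problem is the query or the panel configuration.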
Dashboard variables
Variables make a dashboard reusable across environments and instances:
{
  "templating": {
    "list": [
      {
        "name": "environment",
        "type": "custom",
        "options": [
          {"text": "production", "value": "production"},
          {"text": "staging", "value": "staging"}
        ]
      },
      {
        "name": "instance",
        "type": "query",
        "query": "label_values(up{job='app', env='$environment'}, instance)"
      }
    ]
  }
}
Use them in queries: {job="app", env="$environment", instance=~"$instance"}. The regex match (=~) is needed when the instance variable allows multiple selections.
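With multiple selections, Grafana joins the chosen values with | so a regex matcher picks them all up. An illustrative Python sketch of that expansion (the function name is hypothetical, mimicking Grafana's substitution):

```python
# Illustration: a multi-value template variable expanding into a PromQL
# regex label matcher, the way Grafana pipe-joins selected values.
def expand_multi_value(label: str, values: list) -> str:
    return f'{label}=~"' + "|".join(values) + '"'

# expand_multi_value("instance", ["10.0.0.1:9100", "10.0.0.2:9100"])
# → 'instance=~"10.0.0.1:9100|10.0.0.2:9100"'
```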
Deployment annotations
A vertical line on every graph at the moment of each deployment lets you quickly see the correlation between a deploy and a degradation:
# CI/CD: send an annotation to Grafana after each deploy
import time

import requests

def create_grafana_annotation(grafana_url: str, api_key: str, text: str, tags: list):
    response = requests.post(
        f"{grafana_url}/api/annotations",
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "text": text,
            "tags": tags,
            "time": int(time.time() * 1000),  # Grafana expects epoch milliseconds
        },
        timeout=10,
    )
    response.raise_for_status()
# In CI/CD pipeline after successful deploy:
create_grafana_annotation(
GRAFANA_URL, API_KEY,
text=f"Deploy v{VERSION} to production",
tags=["deploy", "production"]
)
Dashboard as Code (Grafonnet / Terraform)
Store dashboards in git, not just in the UI:
// Grafonnet: dashboard as code
local grafana = import 'grafonnet/grafana.libsonnet';
local dashboard = grafana.dashboard;
local graphPanel = grafana.graphPanel;

dashboard.new(
  'Application Overview',
  time_from='now-1h',
  refresh='30s',
)
.addPanel(
  graphPanel.new(
    'Error Rate',
    datasource='Prometheus',
  )
  .addTarget(
    grafana.prometheus.target(
      'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100',
      legendFormat='Error Rate %',
    )
  ),
  gridPos={ x: 0, y: 0, w: 12, h: 8 }
)
Or use the Terraform Grafana provider: resource "grafana_dashboard" "app".
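Whichever tool renders the JSON, CI can push the result through Grafana's dashboard API (POST /api/dashboards/db). A sketch, assuming the URL and API key come from CI secrets:

```python
# Sketch: build the request that uploads dashboard JSON to Grafana.
# POST /api/dashboards/db with "overwrite": true replaces the existing
# version, which is what a git-driven workflow wants on every deploy.
import json
import urllib.request

def build_push_request(grafana_url: str, api_key: str, dashboard: dict):
    payload = json.dumps({"dashboard": dashboard, "overwrite": True}).encode()
    return urllib.request.Request(
        f"{grafana_url}/api/dashboards/db",
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# urllib.request.urlopen(build_push_request(url, key, dashboard)) sends it.
```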
Sharing and access
- Read-only public URL: a status board for an office screen
- Snapshot: share the current state with someone who has no Grafana access
- Embedded panels: embed individual panels in an internal team portal
Creation timeline
- Basic panels (error rate, latency, traffic) — 1-2 days
- Full application dashboard (all layers) — 3-5 days
- Dashboard as code + git workflow — 1-2 days
- Deploy annotations — 1 day