Setting up multi-region availability monitoring
Internal monitoring shows that the service works from inside your own infrastructure. But users in Tokyo, Frankfurt, or Novosibirsk may see a different picture. Multi-region availability monitoring acts as your users' eyes worldwide.
Why monitor from multiple points
Scenarios internal monitoring won't catch:
- BGP routing issue in a specific region—traffic from Europe reroutes through 15 hops
- CDN endpoint in Asia returns 502—your origin in the USA is fine
- DDoS traffic absorbed by Cloudflare at one PoP—users from that region get timeouts
- DNS cache poisoning in a specific network—your server isn't even involved
Managed solutions
Pingdom. Checks from 100+ worldwide points, 1-minute intervals, alerts on failure from specific region. Transaction checks for multi-step scenarios (login → purchase).
Checkly. Playwright-based checks—real browser, not just HTTP ping. Very accurate user simulation. Checks from 20+ regions.
Better Uptime / Freshping. More affordable, basic functionality.
Datadog Synthetic Monitoring. Integration with rest of Datadog stack, API and Browser tests, CI/CD integration.
Self-hosted: Prometheus Blackbox Exporter in multiple regions
Deploy Blackbox Exporter in multiple cloud regions:
# Terraform: an EC2 instance with Blackbox Exporter in each region
# (data.aws_ami lookups and networking are omitted for brevity)
provider "aws" {
  alias  = "eu-west-1"
  region = "eu-west-1"
}

provider "aws" {
  alias  = "ap-southeast-1"
  region = "ap-southeast-1"
}

resource "aws_instance" "monitor_eu" {
  provider      = aws.eu-west-1
  ami           = data.aws_ami.ubuntu_eu.id
  instance_type = "t3.micro"
  user_data     = file("blackbox-setup.sh")
  tags          = { Name = "blackbox-eu-west-1" }
}

resource "aws_instance" "monitor_ap" {
  provider      = aws.ap-southeast-1
  ami           = data.aws_ami.ubuntu_ap.id
  instance_type = "t3.micro"
  user_data     = file("blackbox-setup.sh")
  tags          = { Name = "blackbox-ap-southeast-1" }
}
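The contents of `blackbox-setup.sh` are not shown here; among other things, it needs to install a `blackbox.yml` defining the probe modules. A minimal sketch of the `http_2xx` module (field values are illustrative defaults, not prescriptions):

```yaml
# blackbox.yml: probe module configuration that blackbox-setup.sh would install
modules:
  http_2xx:
    prober: http
    timeout: 10s
    http:
      method: GET
      valid_status_codes: []        # empty list defaults to any 2xx
      preferred_ip_protocol: ip4
      fail_if_not_ssl: true         # fail the probe if the page is served over plain HTTP
```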
Metrics from each Blackbox Exporter are collected by a central Prometheus via federation or remote_write:
# prometheus.yml on the central server
scrape_configs:
  - job_name: 'blackbox-us-east-1'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: ['https://example.com']
    relabel_configs:
      # Pass the target URL to the exporter as the ?target= query parameter
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      # Actually scrape the Blackbox Exporter in that region
      - target_label: __address__
        replacement: blackbox-us-east-1.internal:9115
      - target_label: region
        replacement: us-east-1

  - job_name: 'blackbox-eu-west-1'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: ['https://example.com']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-eu-west-1.internal:9115
      - target_label: region
        replacement: eu-west-1
Metrics and alerts
Grafana Worldmap Panel (or the newer Geomap panel)—visualize latency by region on a world map and spot the problematic region at a glance.
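A panel query along these lines (assuming the job names and `region` label from the scrape config above) aggregates probe latency per region:

```promql
# Average probe latency per monitoring region
avg by (region) (probe_duration_seconds{job=~"blackbox-.*"})
```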
# Alert: availability down in a specific region
- alert: ServiceDownInRegion
  expr: probe_success == 0
  for: 3m
  labels:
    severity: critical
  annotations:
    summary: "Service unavailable from {{ $labels.region }}: {{ $labels.instance }}"

# Alert: high latency from a specific region
- alert: HighLatencyInRegion
  expr: probe_duration_seconds > 3.0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Response time from {{ $labels.region }} is {{ $value | humanizeDuration }}"
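To tell a regional network problem apart from a global outage, a quorum-style rule can fire only when several regions fail simultaneously. A sketch (the threshold of 2 is an assumption—tune it to however many probe regions you run):

```yaml
# Alert: service down from 2+ regions at once — likely the origin, not the network
- alert: ServiceDownGlobally
  expr: count(probe_success{job=~"blackbox-.*"} == 0) >= 2
  for: 3m
  labels:
    severity: critical
  annotations:
    summary: "Service unavailable from {{ $value }} regions — check the origin"
```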
What to check
Not just the homepage. A typical check set:
- GET / (homepage)
- GET /api/health (backend health check)
- POST /api/auth/login (with test credentials)
- GET /cdn/asset.jpg (CDN availability)
- GET /sitemap.xml (periodic for SEO monitoring)
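With Blackbox Exporter, the POST check from the list above needs its own module. A sketch—the endpoint, body fields, and credentials are placeholders, not a real API:

```yaml
# blackbox.yml: module for the POST /api/auth/login check
modules:
  http_post_login:
    prober: http
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{"login": "synthetic-monitor", "password": "REDACTED"}'
      valid_status_codes: [200]
```

Reference the module in the scrape job via `params: { module: [http_post_login] }`, and keep the test account isolated from real user data.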
Correlation with server metrics
On an alert from a region, automatically pull up:
- Traceroute from that region (for network issues)
- CloudFront / CDN metrics for that PoP
- Origin server metrics
Grafana annotations: when an alert fires for a specific region, drop an annotation on all dashboards.
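Annotations can be created through Grafana's HTTP API; posting without a `dashboardId` makes the annotation global, so any dashboard can display it by tag. A sketch (the URL and token are placeholders):

```shell
# Post a region-tagged global annotation to Grafana (URL/token are assumptions)
curl -s -X POST "https://grafana.internal/api/annotations" \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"tags": ["region:eu-west-1", "availability"], "text": "probe_success=0 from eu-west-1"}'
```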
Setup timeline
- Pingdom / Checkly / Better Uptime (managed) — 1-2 hours
- Self-hosted Blackbox in 3 regions + Prometheus — 2-3 days
- Grafana World Map + alerts — 1-2 days
- Transaction (multi-step) checks — 1-2 days