Setting up custom metrics and alerts (Prometheus / CloudWatch)
Default metrics (CPU, memory, disk) are the baseline: they describe the infrastructure, not the business logic. Custom metrics answer "what is happening in the app," not "how loaded are the servers."
Types of custom metrics
Business metrics:
- Orders created per minute
- Checkout funnel conversion
- Active user sessions
Application-level technical metrics:
- Task processing queue size
- Cache hit rate
- Specific operation execution time
- Error count by type
External dependencies:
- Latency to third-party APIs
- Payment gateway availability
- Integration status
Prometheus: custom metrics in application
Python (FastAPI):
from fastapi import FastAPI
from prometheus_client import Counter, Histogram, Gauge
from prometheus_fastapi_instrumentator import Instrumentator

app = FastAPI()
Instrumentator().instrument(app).expose(app)  # default HTTP metrics + /metrics endpoint

# Counter: monotonically increasing value
order_counter = Counter(
    'orders_created_total',
    'Total orders created',
    ['status', 'payment_method']
)

# Histogram: for percentiles
checkout_duration = Histogram(
    'checkout_duration_seconds',
    'Time spent in checkout process',
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
)

# Gauge: current value, can go up and down
queue_size = Gauge(
    'task_queue_size',
    'Current size of processing queue'
)

# Usage in code
async def create_order(order_data: dict):
    with checkout_duration.time():  # measure processing time
        result = await process_order(order_data)
    order_counter.labels(
        status=result.status,
        payment_method=order_data['payment_method']
    ).inc()
    return result
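To make the `buckets` parameter concrete: a Prometheus histogram stores one cumulative counter per bucket bound (plus a running sum and count), and every observation increments all buckets whose upper bound it fits under. A toy pure-Python model of that behavior (`MiniHistogram` is illustrative, not part of prometheus_client):

```python
import bisect

class MiniHistogram:
    """Toy model of a Prometheus histogram: one cumulative counter per
    bucket bound, plus a running sum and count of all observations."""
    def __init__(self, buckets):
        self.bounds = sorted(buckets) + [float('inf')]  # +Inf bucket is implicit
        self.counts = [0] * len(self.bounds)
        self.total = 0.0
        self.count = 0

    def observe(self, value):
        # increment every bucket whose upper bound is >= value (le semantics)
        start = bisect.bisect_left(self.bounds, value)
        for i in range(start, len(self.bounds)):
            self.counts[i] += 1
        self.total += value
        self.count += 1

h = MiniHistogram([0.1, 0.5, 1.0, 2.0, 5.0, 10.0])
for v in [0.3, 0.7, 1.5, 12.0]:
    h.observe(v)
# counts per bound: le=0.1→0, le=0.5→1, le=1.0→2, le=2.0→3, ..., le=+Inf→4
```

This is why bucket choice matters: percentiles are later interpolated from these counters, so the bounds should bracket your latency targets.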
Node.js (prom-client):
const client = require('prom-client')

// prom-client's startTimer() reports elapsed time in seconds, so the
// metric is named and bucketed in seconds (the Prometheus convention)
const httpDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'code'],
  buckets: [0.001, 0.005, 0.015, 0.05, 0.1, 0.2, 0.5, 1, 2]
})

app.use((req, res, next) => {
  const end = httpDuration.startTimer()
  res.on('finish', () => {
    end({ method: req.method, route: req.route?.path, code: res.statusCode })
  })
  next()
})
Prometheus Recording Rules
Pre-compute expensive queries for fast dashboards:
groups:
  - name: app_slo
    interval: 30s
    rules:
      # Fraction of requests answered with 5xx, per job (a raw rate()
      # would give errors/sec, but the alerts treat this as a percentage)
      - record: job:request_errors:rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (job) (rate(http_requests_total[5m]))
      - record: job:request_duration_p95:rate5m
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
          )
      # Availability over a sliding hour
      - record: job:availability:ratio1h
        expr: |
          1 - (
            sum by (job) (rate(http_requests_total{status=~"5.."}[1h]))
            / sum by (job) (rate(http_requests_total[1h]))
          )
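What `histogram_quantile` computes can be sketched in a few lines: find the bucket containing the q-th observation and interpolate linearly inside it. A simplified model (ignores PromQL edge cases such as NaN handling and non-conformant bucket data):

```python
def histogram_quantile(q, buckets):
    """Simplified PromQL histogram_quantile.
    `buckets` is a sorted list of (upper_bound, cumulative_count),
    ending with the +Inf bucket."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float('inf'):
                return prev_bound  # cannot interpolate into +Inf
            # linear interpolation inside the bucket that holds the rank
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# 100 requests: 50 under 0.1s, 90 under 0.5s, 99 under 1.0s
p95 = histogram_quantile(0.95, [(0.1, 50), (0.5, 90), (1.0, 99), (2.0, 100), (float('inf'), 100)])
# p95 ≈ 0.778s: the 95th observation sits 5/9 of the way through the 0.5-1.0 bucket
```

The interpolation is why p95 accuracy depends on bucket layout: inside a bucket, Prometheus only knows the count, not the actual values.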
Alerting rules
groups:
  - name: app_alerts
    rules:
      - alert: HighErrorRate
        expr: job:request_errors:rate5m > 0.05
        for: 2m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "Error rate {{ $value | humanizePercentage }} on {{ $labels.job }}"
          runbook_url: "https://wiki.company.com/runbooks/high-error-rate"
      - alert: SlowResponseTime
        expr: job:request_duration_p95:rate5m > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P95 latency {{ $value | humanizeDuration }} > 1s"
      - alert: QueueBacklog
        expr: task_queue_size > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Task queue has {{ $value }} pending items"
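The `for:` clause means the expression must hold continuously for the whole duration before the alert transitions from pending to firing; a single healthy evaluation resets the timer. A toy model of that state machine (`ForAlert` is illustrative, not a Prometheus API):

```python
from datetime import datetime, timedelta

class ForAlert:
    """Toy model of a Prometheus alert with a `for:` clause."""
    def __init__(self, for_duration):
        self.for_duration = for_duration
        self.pending_since = None

    def evaluate(self, condition_true, now):
        if not condition_true:
            self.pending_since = None  # one healthy sample resets the timer
            return "inactive"
        if self.pending_since is None:
            self.pending_since = now   # condition just became true
        if now - self.pending_since >= self.for_duration:
            return "firing"
        return "pending"

alert = ForAlert(timedelta(minutes=2))
t0 = datetime(2024, 1, 1, 12, 0)
# condition true at t0, t0+30s, ..., t0+120s (five evaluation cycles)
states = [alert.evaluate(True, t0 + timedelta(seconds=30 * i)) for i in range(5)]
# ["pending", "pending", "pending", "pending", "firing"]
```

This is the trade-off behind `for: 2m` on HighErrorRate: it suppresses one-scrape blips at the cost of two extra minutes of detection latency.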
AWS CloudWatch Custom Metrics
import boto3

cw = boto3.client('cloudwatch')

def put_metric(name: str, value: float, unit: str = 'Count',
               dimensions: dict | None = None):
    metric_data = {
        'MetricName': name,
        'Value': value,
        'Unit': unit
    }
    if dimensions:
        metric_data['Dimensions'] = [
            {'Name': k, 'Value': v} for k, v in dimensions.items()
        ]
    cw.put_metric_data(
        Namespace='MyApp/Business',
        MetricData=[metric_data]
    )

# Usage
put_metric('OrdersCreated', 1, 'Count', {'Environment': 'production'})
put_metric('CheckoutDuration', 0.85, 'Seconds', {'PaymentMethod': 'card'})
put_metric('QueueDepth', queue.size(), 'Count')
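Calling `put_metric` once per event means one HTTP request (and one PutMetricData charge) per data point. Since PutMetricData accepts multiple MetricData entries per call, buffering is usually worthwhile. A sketch with a hypothetical `MetricBuffer` class; the stub client stands in for boto3, and 20 is a conservative batch size within the API's per-call limit:

```python
class MetricBuffer:
    """Batch metric points and send them in one put_metric_data call.
    `client` is anything with a put_metric_data method (boto3 in production)."""
    def __init__(self, client, namespace, max_batch=20):
        self.client = client
        self.namespace = namespace
        self.max_batch = max_batch
        self.buffer = []

    def add(self, name, value, unit='Count'):
        self.buffer.append({'MetricName': name, 'Value': value, 'Unit': unit})
        if len(self.buffer) >= self.max_batch:
            self.flush()

    def flush(self):
        if self.buffer:
            self.client.put_metric_data(Namespace=self.namespace,
                                        MetricData=self.buffer)
            self.buffer = []

class StubClient:
    """Test double that records calls instead of hitting AWS."""
    def __init__(self):
        self.calls = []
    def put_metric_data(self, **kwargs):
        self.calls.append(kwargs)

stub = StubClient()
buf = MetricBuffer(stub, 'MyApp/Business', max_batch=3)
for _ in range(7):
    buf.add('OrdersCreated', 1)
buf.flush()  # flush the remainder (e.g. at shutdown)
# stub.calls now holds 3 requests carrying 3, 3 and 1 data points
```

Remember to flush on shutdown (or on a timer) so trailing data points are not lost.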
CloudWatch Alarm on custom metric:
resource "aws_cloudwatch_metric_alarm" "queue_depth" {
  alarm_name          = "high-queue-depth"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "QueueDepth"
  namespace           = "MyApp/Business"
  period              = 60
  statistic           = "Maximum"
  threshold           = 1000
  alarm_description   = "Task queue is backed up"

  dimensions = {
    Environment = "production"
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]
}
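With `evaluation_periods = 3` and `period = 60`, the alarm enters ALARM only after three consecutive one-minute datapoints breach the threshold. A toy model of that rule (real CloudWatch additionally supports `datapoints_to_alarm` for M-of-N evaluation and configurable missing-data treatment):

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """ALARM only when the last `evaluation_periods` datapoints
    all breach the threshold (consecutive-breach model)."""
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    return "ALARM" if all(d > threshold for d in recent) else "OK"

state_bad = alarm_state([900, 1200, 1300, 1400], 1000, 3)  # "ALARM"
state_ok = alarm_state([1200, 900, 1300, 1400], 1000, 3)   # "OK": 900 breaks the run
```

This is the CloudWatch analogue of Prometheus `for:`: both trade detection latency for resistance to single-sample spikes.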
Embedded metrics (Lambda → CloudWatch)
AWS EMF (Embedded Metrics Format): structured log lines that CloudWatch automatically converts into metrics, with no PutMetricData API calls:
import time

from aws_embedded_metrics import metric_scope

@metric_scope
async def handler(event, context, metrics):
    metrics.set_namespace("MyApp/Lambda")
    metrics.put_dimensions({"FunctionName": context.function_name})
    start = time.time()
    result = await process(event)
    metrics.put_metric("ProcessingTime", (time.time() - start) * 1000, "Milliseconds")
    metrics.put_metric("ItemsProcessed", len(result), "Count")
    return result
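Under the hood the library writes a JSON log line whose `_aws` envelope tells CloudWatch which top-level keys to extract as metrics. An illustrative sketch of that record shape (`emf_log_line` is a hypothetical helper, not part of aws_embedded_metrics):

```python
import json
import time

def emf_log_line(namespace, dimensions, metrics, units):
    """Build a log line in the shape of an EMF record."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                # one dimension set: the list of dimension key names
                "Dimensions": [list(dimensions)],
                "Metrics": [{"Name": n, "Unit": units[n]} for n in metrics],
            }],
        },
    }
    record.update(dimensions)  # dimension values live at the top level
    record.update(metrics)     # metric values live at the top level too
    return json.dumps(record)

line = emf_log_line(
    "MyApp/Lambda",
    {"FunctionName": "orders-worker"},
    {"ProcessingTime": 42.0, "ItemsProcessed": 3},
    {"ProcessingTime": "Milliseconds", "ItemsProcessed": "Count"},
)
```

Because the metrics ride on ordinary stdout logs, EMF adds no network calls or latency inside the handler; CloudWatch Logs does the extraction asynchronously.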
Setup timeline
- Prometheus metrics in application (Python/Node.js) — 1-3 days
- Recording rules + alert rules — 1-2 days
- CloudWatch custom metrics — 1-2 days
- Alertmanager / SNS routing + notifications — 1 day