Implementing Logging and Traffic Monitoring through API Gateway
The API gateway is the single entry point for all traffic. Without proper logging, you're flying blind: no visibility into who's calling the API, what latency they see, which endpoints are failing, or why. Building observability into the gateway up front is far cheaper than reconstructing incidents by hand later.
What to Log
Minimal fields for each request:
| Field | Example | Purpose |
|---|---|---|
| request_id | uuid4 | End-to-end tracing across services |
| consumer_id | client_abc | Who's making the request |
| method + path | GET /api/v2/orders | Endpoint statistics |
| status_code | 429 | Error monitoring |
| latency_ms | 143 | Performance |
| upstream_latency_ms | 138 | Where time is spent |
| request_size | 1024 | Traffic anomalies |
| response_size | 4096 | — |
| ip | 1.2.3.4 | Security |
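The field set above maps naturally onto one JSON object per request. A minimal sketch of such a log record, using only the standard library (field values are illustrative):

```python
import json
import time
import uuid

def access_log_record(method: str, path: str, status: int,
                      latency_ms: int, upstream_latency_ms: int,
                      consumer_id: str, ip: str,
                      request_size: int, response_size: int) -> str:
    """Serialize one request as a single JSON log line."""
    return json.dumps({
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S%z"),
        # in practice the request_id is taken from / propagated via X-Request-ID
        "request_id": str(uuid.uuid4()),
        "consumer_id": consumer_id,
        "method": method,
        "path": path,
        "status_code": status,
        "latency_ms": latency_ms,
        "upstream_latency_ms": upstream_latency_ms,
        "request_size": request_size,
        "response_size": response_size,
        "ip": ip,
    })

line = access_log_record("GET", "/api/v2/orders", 200, 143, 138,
                         "client_abc", "1.2.3.4", 1024, 4096)
print(line)
```

One JSON object per line keeps the log greppable and lets any collector (Logstash, Loki, CloudWatch) parse it without a custom grammar.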
Never log request body by default — it may contain passwords, tokens, PAN data. Use a separate debug flag at the route level.
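If body logging is ever switched on for debugging, sensitive fields should be masked before the body reaches the log pipeline. A minimal sketch (the key list is illustrative, not exhaustive):

```python
import json

# illustrative deny-list; extend for your own payloads
SENSITIVE_KEYS = {"password", "token", "authorization", "card_number", "cvv"}

def redact(obj):
    """Recursively replace sensitive values with a mask, leaving the rest intact."""
    if isinstance(obj, dict):
        return {k: ("***" if k.lower() in SENSITIVE_KEYS else redact(v))
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [redact(v) for v in obj]
    return obj

body = {"user": "alice", "password": "s3cret",
        "card": {"card_number": "4111111111111111"}}
print(json.dumps(redact(body)))
```

Note that this masks by key name only; PAN data hidden in free-text fields still needs pattern-based scrubbing.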
Setup in Kong Gateway
Kong is one of the most widely deployed self-hosted gateways. Request logging is handled by the http-log plugin, which ships log entries to an HTTP endpoint in batches:

```yaml
plugins:
  - name: http-log
    config:
      http_endpoint: http://logstash:5044/kong
      method: POST
      timeout: 1000    # ms
      keepalive: 1000  # ms
      queue:
        max_batch_size: 200        # entries shipped per batch
        max_coalescing_delay: 1    # seconds to wait while batching
        max_entries: 10000         # buffer size before entries are dropped
```

(The pre-3.x `flush_timeout` and `retry_count` parameters are deprecated; the `queue` block replaces them.)
Prometheus metrics require a separate plugin:

```yaml
plugins:
  - name: prometheus
    config:
      per_consumer: true
      status_code_metrics: true
      latency_metrics: true
      bandwidth_metrics: true
      upstream_health_metrics: true
```
After this, the /metrics endpoint on Kong's Admin API (or the Status API, if enabled) exports all metrics in Prometheus format. A 15-second scrape interval is a sensible default.
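A matching scrape job on the Prometheus side might look like this (the target assumes Kong's Status API is enabled on port 8100; adjust to wherever /metrics is exposed in your deployment):

```yaml
scrape_configs:
  - job_name: kong
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ["kong:8100"]
```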
Setup in AWS API Gateway
In AWS, logging is configured at the Stage level via CloudWatch:

```json
{
  "loggingLevel": "INFO",
  "dataTraceEnabled": false,
  "metricsEnabled": true,
  "accessLogDestinationArn": "arn:aws:logs:us-east-1:123456789:log-group:api-gateway-access",
  "accessLogFormat": "{\"requestId\":\"$context.requestId\",\"ip\":\"$context.identity.sourceIp\",\"caller\":\"$context.identity.caller\",\"user\":\"$context.identity.user\",\"requestTime\":\"$context.requestTime\",\"httpMethod\":\"$context.httpMethod\",\"resourcePath\":\"$context.resourcePath\",\"status\":\"$context.status\",\"protocol\":\"$context.protocol\",\"responseLength\":\"$context.responseLength\",\"integrationLatency\":\"$context.integrationLatency\",\"responseLatency\":\"$context.responseLatency\"}"
}
```
Never enable dataTraceEnabled in production — it logs request bodies.
CloudWatch Insights query for p95 latency by endpoint:
```
fields @timestamp, resourcePath, responseLatency
| filter ispresent(responseLatency)
| stats pct(responseLatency, 95) as p95 by resourcePath
| sort p95 desc
| limit 20
```
Nginx API Gateway + OpenTelemetry
If the gateway runs on Nginx (NGINX Plus or OpenResty), logging is configured via log_format:
```nginx
log_format api_json escape=json
  '{'
    '"timestamp":"$time_iso8601",'
    '"request_id":"$request_id",'
    '"method":"$request_method",'
    '"path":"$uri",'
    '"status":$status,'
    '"latency_s":$request_time,'
    '"upstream_latency_s":"$upstream_response_time",'
    '"bytes_sent":$bytes_sent,'
    '"consumer":"$http_x_consumer_id",'
    '"ip":"$remote_addr"'
  '}';

access_log /var/log/nginx/api_access.log api_json buffer=32k flush=5s;
```

Note that $request_time and $upstream_response_time are reported in seconds with millisecond resolution (hence the _s suffix), and $upstream_response_time is quoted because it becomes a comma-separated list when a request is retried across several upstreams.
For distributed tracing, the opentelemetry-cpp-contrib nginx module is loaded as a dynamic module; the OTLP exporter endpoint is configured in a separate TOML file rather than inline:

```nginx
load_module modules/otel_ngx_module.so;

# inside the http block:
opentelemetry_config /etc/nginx/otel-nginx.toml;
opentelemetry on;
opentelemetry_propagate;
opentelemetry_operation_name "$request_method $uri";
```
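With the opentelemetry-cpp-contrib nginx module, exporter settings typically live in a separate TOML configuration file loaded by the module rather than in nginx.conf. A sketch pointing at the same collector (service name and batch values are illustrative):

```toml
exporter = "otlp"
processor = "batch"

[exporters.otlp]
# gRPC endpoint of the OpenTelemetry Collector
host = "otel-collector"
port = 4317

[processors.batch]
max_queue_size = 2048
schedule_delay_millis = 5000

[service]
name = "api-gateway"
```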
Stack for Collection and Visualization
Two common approaches:
ELK Stack:
- Logstash collects logs from the gateway
- Elasticsearch stores and indexes
- Kibana — dashboards, alerts
Grafana Stack:
- Loki — log storage (cheaper than Elasticsearch; indexes only labels, not message content)
- Prometheus — metrics
- Grafana — unified UI for logs and metrics
For most projects, Grafana Stack is simpler and cheaper.
Key dashboards to build:
- Traffic overview: RPS, error rate, p50/p95/p99 latency — last 15 min and 24 hours
- By consumer: top request generators, who gets 4xx/5xx
- By endpoint: slowest, most erroring
- Upstream health: latency to backend services
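For the Kong setup above, the overview panels map to PromQL along these lines (metric and label names follow the Kong 3.x Prometheus plugin; adjust for other gateways):

```promql
# Requests per second
sum(rate(kong_http_requests_total[1m]))

# 5xx error rate over the last 5 minutes
sum(rate(kong_http_requests_total{code=~"5.."}[5m]))
  / sum(rate(kong_http_requests_total[5m]))

# p95 latency per route
histogram_quantile(0.95,
  sum(rate(kong_request_latency_ms_bucket[5m])) by (le, route))
```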
Alerting
Minimal alert set (Prometheus AlertManager / Grafana Alerting):
```yaml
- alert: APIHighErrorRate
  expr: |
    sum(rate(kong_http_requests_total{code=~"5.."}[5m]))
      / sum(rate(kong_http_requests_total[5m])) > 0.05
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Error rate > 5% over the last 5 minutes"

- alert: APIHighLatency
  expr: |
    histogram_quantile(0.95,
      sum(rate(kong_request_latency_ms_bucket[5m])) by (le, route)
    ) > 2000
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "p95 latency > 2s for route {{ $labels.route }}"
```

Note that the Kong Prometheus plugin exposes the HTTP status under the code label.
Timeline
Basic logging and dashboards: 2–3 days. Full stack with alerting, tracing, and retrospective analysis: 1–2 weeks depending on infrastructure maturity.