Setting up Serverless Monitoring (Lumigo / Datadog Serverless)
Standard monitoring works poorly for serverless workloads: there are no continuously running processes, instances are ephemeral, cold starts add a distinct kind of latency, and there is no direct access to the underlying infrastructure. Specialized tools fill these gaps.
Problems with Standard Monitoring for Lambda
Out of the box, CloudWatch provides per-function metrics such as Invocations, Errors, Duration, and Throttles. This is insufficient:
- No distinction between cold start vs warm start latency
- No tracing between functions and downstream services
- No visibility into specific errors with context
- No correlation between logs from different functions in one request
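The first gap can be narrowed by hand before adopting a dedicated tool. The following is a minimal sketch, assuming a Python handler: a module-level flag marks the first invocation in each execution environment as a cold start and emits one structured log line per invocation. The log field names are illustrative, not a format that CloudWatch, Lumigo, or Datadog requires.

```python
import json

# Module-level code runs once per execution environment, so this flag is True
# only until the first invocation in that environment has been handled.
_COLD_START = True


def handler(event, context):
    """Hypothetical handler that tags each invocation as cold or warm."""
    global _COLD_START
    is_cold = _COLD_START
    _COLD_START = False  # later invocations in this environment are warm

    # One structured log line per invocation; field names are illustrative.
    print(json.dumps({
        "metric": "invocation",
        "cold_start": is_cold,
        "request_id": getattr(context, "aws_request_id", None),
    }))
    return {"cold_start": is_cold}
```

With provisioned concurrency the environment is initialized ahead of time, so the first invocation is still flagged here even though the caller sees no initialization latency.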
Lumigo
Lumigo is a serverless-first observability platform. Installation via Lambda Layer without code changes:
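resource "aws_lambda_function" "api" {
  # function_name, role, handler, runtime, etc. omitted for brevity

  layers = [
    # Lambda layer ARNs must end in a numeric version; ":latest" is not accepted.
    # Replace <version> with the current version published by Lumigo.
    "arn:aws:lambda:us-east-1:114300393969:layer:lumigo-python-tracer:<version>"
  ]

  environment {
    variables = {
      LUMIGO_TRACER_TOKEN = var.lumigo_token
      LUMIGO_DEBUG        = "false"
    }
  }
}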
For Python via decorator (if customization is needed):
import requests
import lumigo_tracer

@lumigo_tracer.lumigo_tracer(token="your-token")
def handler(event, context):
    # Automatically traces HTTP, boto3, psycopg2 calls
    response = requests.get("https://api.external.com/data")
    # process() is the application's own function, defined elsewhere
    return process(response.json())
What Lumigo provides:
- Automatic distributed tracing (Lambda → SQS → Lambda → DynamoDB)
- Timeline for each invocation: initialization, cold start, handler, downstream calls
- Payload inspector: incoming/outgoing data for each call
- Smart alerts: anomalies without manual threshold configuration
- Cost analysis: cost by functions, memory optimization estimates
Datadog Serverless
A broader platform with serverless-specific capabilities:
# serverless.yml (Serverless Framework)
plugins:
  - serverless-plugin-datadog

custom:
  datadog:
    apiKey: ${env:DD_API_KEY}
    enableXrayTracing: true
    enableDDTracing: true
    enableMergedXrayTraces: true
    captureLambdaPayload: true
    logLevel: WARN
Or via Terraform with Lambda Layer:
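resource "aws_lambda_function" "api" {
  # function_name, role, handler, runtime, etc. omitted for brevity

  layers = [
    # Lambda layer ARNs must end in a numeric version; ":latest" is not accepted.
    # Replace <version> with the current version published by Datadog.
    "arn:aws:lambda:us-east-1:464622532012:layer:Datadog-Python312:<version>"
  ]

  environment {
    variables = {
      DD_API_KEY            = var.datadog_api_key
      DD_SITE               = "datadoghq.com"
      DD_ENHANCED_METRICS   = "true"
      DD_TRACE_ENABLED      = "true"
      DD_COLD_START_TRACING = "true"
    }
  }
}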
Datadog's enhanced Lambda metrics go beyond the standard CloudWatch set: a breakdown by cold and warm invocations, estimated cost, and out-of-memory events.
Key Metrics for Serverless Monitoring
Latency breakdown:
- Cold start duration (p50, p95, p99)
- Initialization time (runtime startup and code that runs outside the handler)
- Handler duration
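These latency figures can also be recovered from the REPORT line that Lambda writes to CloudWatch Logs at the end of each invocation. The sketch below assumes the usual REPORT layout, in which an "Init Duration" field appears only on cold starts; `parse_report` and `percentile` are hypothetical helper names, and the percentile uses the simple nearest-rank method.

```python
import math
import re

# "Duration" must not be preceded by "Billed " or "Init ".
_DURATION = re.compile(r"(?<!Billed )(?<!Init )Duration: ([\d.]+) ms")
# "Init Duration" is only present when the invocation was a cold start.
_INIT = re.compile(r"Init Duration: ([\d.]+) ms")


def parse_report(line):
    """Extract handler and init durations from a Lambda REPORT log line."""
    if not line.startswith("REPORT"):
        return None
    duration = _DURATION.search(line)
    init = _INIT.search(line)
    return {
        "duration_ms": float(duration.group(1)) if duration else None,
        "init_ms": float(init.group(1)) if init else None,
        "cold_start": init is not None,
    }


def percentile(values, p):
    """Nearest-rank percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]
```

Feeding the `init_ms` values of cold invocations into `percentile` with 50, 95, and 99 gives the cold start figures listed above.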
Reliability:
- Error rate by function
- Timeout rate
- Throttle rate
- Concurrent executions vs limit
Cost:
- GB-seconds consumption
- Invocation count
- Estimated monthly cost
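The cost metrics are related by simple arithmetic: GB-seconds are memory (in GB) times billed duration (in seconds) times invocation count. A minimal sketch follows; the function names are hypothetical, and prices are passed in as parameters because they vary by region and architecture and are not stated in this article.

```python
def gb_seconds(memory_mb, avg_billed_ms, invocations):
    """Compute consumption in GB-seconds for a number of invocations."""
    return (memory_mb / 1024) * (avg_billed_ms / 1000) * invocations


def estimated_monthly_cost(memory_mb, avg_billed_ms, invocations,
                           price_per_gb_second, price_per_million_requests):
    """Estimate monthly cost; both prices must come from current AWS pricing."""
    compute = gb_seconds(memory_mb, avg_billed_ms, invocations) * price_per_gb_second
    requests = invocations / 1_000_000 * price_per_million_requests
    return compute + requests
```

For example, a 512 MB function billed 200 ms on average over one million invocations consumes 0.5 × 0.2 × 1,000,000 = 100,000 GB-seconds. The estimate ignores any free tier.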
Distributed Tracing for Serverless
When a request flows Lambda → SQS → Lambda → RDS, a single trace should span every hop:
# First Lambda: add trace context to SQS message
import json
import os

import boto3
from opentelemetry import trace
from opentelemetry.propagate import inject

sqs = boto3.client("sqs")
# Assumption: the queue URL is supplied via an environment variable
QUEUE_URL = os.environ["QUEUE_URL"]

def handler(event, context):
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("process-order"):
        # Inject trace context into SQS message attributes
        headers = {}
        inject(headers)
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"orderId": "123"}),
            MessageAttributes={
                "trace_context": {
                    "StringValue": json.dumps(headers),
                    "DataType": "String"
                }
            }
        )
Lumigo and Datadog do this automatically for AWS SDK calls.
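When propagating manually, the second Lambda has to read the context back out of the message before starting its own span. The following is a minimal sketch of that consumer side, assuming the producer stored the carrier as JSON in a message attribute named `trace_context`; `carrier_from_record` is a hypothetical helper, and it uses only the standard library.

```python
import json


def carrier_from_record(record, attribute_name="trace_context"):
    """Return the propagation carrier stored in one SQS event record.

    In the Lambda SQS event, attributes sit under "messageAttributes" and the
    value under "stringValue". Returns {} if the attribute is missing or invalid.
    """
    attribute = record.get("messageAttributes", {}).get(attribute_name)
    if not attribute:
        return {}
    try:
        carrier = json.loads(attribute.get("stringValue") or "")
    except json.JSONDecodeError:
        return {}
    return carrier if isinstance(carrier, dict) else {}
```

The returned dict can be passed to `opentelemetry.propagate.extract`, and the resulting context given to `tracer.start_as_current_span(..., context=ctx)` so that the consumer's span joins the producer's trace.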
Alerts for Serverless
# Datadog monitor via Terraform
resource "datadog_monitor" "lambda_error_rate" {
  name    = "Lambda High Error Rate"
  type    = "metric alert"
  message = "Error rate on {{functionname.name}} > 5%. @pagerduty-oncall"
  query   = "sum(last_5m):sum:aws.lambda.errors{env:production} by {functionname}.as_rate() / sum:aws.lambda.invocations{env:production} by {functionname}.as_rate() > 0.05"

  # Datadog provider 3.x syntax; older 2.x releases used a `thresholds` map
  monitor_thresholds {
    critical = 0.05
    warning  = 0.02
  }
}
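The monitor's logic is a ratio compared against two thresholds. A minimal sketch of the same evaluation in Python can be useful for checking threshold choices against historical counts; the function names are hypothetical, and treating zero invocations as a rate of 0.0 is an assumption of this sketch rather than how Datadog handles missing data.

```python
def error_rate(errors, invocations):
    """Errors divided by invocations; 0.0 when there were no invocations."""
    return errors / invocations if invocations else 0.0


def alert_state(errors, invocations, warning=0.02, critical=0.05):
    """Classify an error rate using the monitor's warning and critical thresholds."""
    rate = error_rate(errors, invocations)
    if rate > critical:
        return "critical"
    if rate > warning:
        return "warning"
    return "ok"
```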
Setup Timeline
- Lumigo (Layer + no code changes) — 0.5-1 day
- Datadog Serverless (Layer + config) — 1-2 days
- Custom metrics + alerts — 1-2 days
- Distributed tracing setup — 1-2 days







