Setting Up Distributed Tracing (Jaeger/Zipkin) for Microservices
Distributed Tracing tracks the path of a request through multiple microservices as a single "trace". When a user's HTTP request passes through API Gateway → Order Service → Inventory Service → Payment Service, the trace shows time spent in each service, which database queries were executed, and where slowdowns occurred.
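As a rough illustration of how a trace reveals where time went, the sketch below computes per-service "self time" from a flat list of timed records (spans, defined in the next section). The `SpanRecord` shape, the service names, and the timings are hypothetical, and the calculation assumes child operations do not overlap.

```typescript
// Hypothetical, simplified span record; real tracing backends store more fields.
interface SpanRecord {
  spanId: string;
  parentSpanId?: string; // undefined for the root span
  service: string;
  name: string;
  startMs: number;
  endMs: number;
}

// Self time = span duration minus the time covered by its direct children.
// Assumes children run sequentially (no overlap) to keep the sketch simple.
function selfTimeByService(spans: SpanRecord[]): Map<string, number> {
  const childTime = new Map<string, number>();
  for (const s of spans) {
    if (s.parentSpanId !== undefined) {
      const prev = childTime.get(s.parentSpanId) ?? 0;
      childTime.set(s.parentSpanId, prev + (s.endMs - s.startMs));
    }
  }
  const result = new Map<string, number>();
  for (const s of spans) {
    const self = s.endMs - s.startMs - (childTime.get(s.spanId) ?? 0);
    result.set(s.service, (result.get(s.service) ?? 0) + self);
  }
  return result;
}

// Made-up timings for the gateway -> order -> inventory/payment path.
const exampleTrace: SpanRecord[] = [
  { spanId: 'a', service: 'api-gateway', name: 'POST /orders', startMs: 0, endMs: 120 },
  { spanId: 'b', parentSpanId: 'a', service: 'order-service', name: 'createOrder', startMs: 5, endMs: 115 },
  { spanId: 'c', parentSpanId: 'b', service: 'inventory-service', name: 'reserve', startMs: 10, endMs: 30 },
  { spanId: 'd', parentSpanId: 'b', service: 'payment-service', name: 'charge', startMs: 35, endMs: 110 },
];

const breakdown = selfTimeByService(exampleTrace);
// api-gateway: 10 ms, order-service: 15 ms, inventory-service: 20 ms, payment-service: 75 ms
console.log(Object.fromEntries(breakdown));
```

In this made-up trace the payment call accounts for 75 of the 120 ms, which is the kind of slowdown a trace view makes visible at a glance.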
Key Concepts
Trace — the complete path of a request from start to finish. Consists of spans.
Span — a unit of work (HTTP request, database call, external API call). Each span contains: operation name, start/end time, tags (key-value), logs, reference to parent span.
Context Propagation — trace-id and span-id are passed in HTTP headers (traceparent in W3C Trace Context or X-B3-TraceId in Zipkin).
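To make context propagation concrete, here is a minimal sketch of reading and writing the W3C `traceparent` header, whose version 00 layout is `version-traceid-spanid-flags`. The helper names are hypothetical, and the sketch accepts only the four-field layout; in practice the OpenTelemetry propagators handle this for you.

```typescript
interface TraceParent {
  version: string;
  traceId: string; // 32 lowercase hex characters
  spanId: string;  // 16 lowercase hex characters (the caller's span)
  sampled: boolean;
}

const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null for anything that is not a well-formed four-field header.
function parseTraceParent(header: string): TraceParent | null {
  const m = TRACEPARENT_RE.exec(header.trim());
  if (m === null) return null;
  const [, version, traceId, spanId, flags] = m;
  if (version === 'ff') return null; // version ff is invalid
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null; // all-zero ids are invalid
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

function formatTraceParent(tp: TraceParent): string {
  return `${tp.version}-${tp.traceId}-${tp.spanId}-${tp.sampled ? '01' : '00'}`;
}

// A downstream call keeps the trace id and puts its own span id in the header.
function childHeader(incoming: TraceParent, newSpanId: string): string {
  return formatTraceParent({ ...incoming, spanId: newSpanId });
}

const incoming = parseTraceParent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
if (incoming !== null) {
  // Same trace id, new span id: this is what links the spans of both services.
  console.log(childHeader(incoming, 'b7ad6b7169203331'));
}
```

Because every service forwards the same trace id, the backend can stitch spans from different processes into one trace.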
OpenTelemetry — Instrumentation Standard
OpenTelemetry (OTel) is a vendor-neutral instrumentation standard with APIs and SDKs for many languages. You instrument your code once, then export the data to Jaeger, Zipkin, Datadog, or any other compatible backend.
// tracing.ts: initialization; import this before any other module
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { Resource } from '@opentelemetry/resources';
import { SEMRESATTRS_SERVICE_NAME } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [SEMRESATTRS_SERVICE_NAME]: 'order-service',
  }),
  traceExporter: new OTLPTraceExporter({
    // `url` is used as given, so it must be the full traces endpoint
    // (including /v1/traces); hence the signal-specific variable.
    url: process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT || 'http://jaeger:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
      '@opentelemetry/instrumentation-redis': { enabled: true },
    }),
  ],
});

sdk.start();
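Auto-instrumentations intercept Express, pg, redis, and outgoing HTTP calls without any changes to application code. Calls made with axios are covered through the http instrumentation, because axios uses Node's http module under the hood.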
Manual Span Creation
For business operations not intercepted by auto-instrumentation:
import { trace, SpanStatusCode, context } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');

// loadOrder, validateOrder, reserveInventory and chargePayment are the service's own functions
async function processOrder(orderId: string): Promise<void> {
  const span = tracer.startSpan('processOrder', {
    attributes: {
      'order.id': orderId,
      'service.operation': 'process',
    },
  });
  try {
    // Make the span active so that spans created inside become its children
    await context.with(trace.setSpan(context.active(), span), async () => {
      const order = await loadOrder(orderId); // child span created automatically
      await validateOrder(order);
      await reserveInventory(order); // call another service with propagation
      await chargePayment(order);
    });
    span.setStatus({ code: SpanStatusCode.OK });
  } catch (error) {
    // A caught value is `unknown` in strict TypeScript, so normalize it first
    const err = error instanceof Error ? error : new Error(String(error));
    span.recordException(err);
    span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
    throw error;
  } finally {
    span.end();
  }
}
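The try/catch/finally pattern above tends to repeat for every business operation, so it can be factored into a small helper. The sketch below is written against a hypothetical `SpanLike` interface that lists only the methods it calls; the assumption is that the `Span` type from `@opentelemetry/api` satisfies it. `withSpan` and `FakeSpan` are illustrative names, and making the span active in the context is left to the caller.

```typescript
// Hypothetical structural view of a span: only the methods this helper needs.
interface SpanLike {
  recordException(error: Error): void;
  setStatus(status: { code: number; message?: string }): unknown;
  end(): void;
}

// Numeric values of SpanStatusCode.OK and SpanStatusCode.ERROR in @opentelemetry/api.
const STATUS_OK = 1;
const STATUS_ERROR = 2;

// Runs `work`, records the outcome on the span, and always ends the span.
async function withSpan<T>(span: SpanLike, work: () => Promise<T>): Promise<T> {
  try {
    const result = await work();
    span.setStatus({ code: STATUS_OK });
    return result;
  } catch (error) {
    const err = error instanceof Error ? error : new Error(String(error));
    span.recordException(err);
    span.setStatus({ code: STATUS_ERROR, message: err.message });
    throw error; // rethrow so callers still see the failure
  } finally {
    span.end();
  }
}

// In-memory fake, used here only to show the order of calls.
class FakeSpan implements SpanLike {
  readonly calls: string[] = [];
  recordException(error: Error): void {
    this.calls.push(`exception:${error.message}`);
  }
  setStatus(status: { code: number; message?: string }): void {
    this.calls.push(`status:${status.code}`);
  }
  end(): void {
    this.calls.push('end');
  }
}

const demoSpan = new FakeSpan();
withSpan(demoSpan, async () => 'charged').then((value) => {
  console.log(value, demoSpan.calls);
});
```

With a real tracer, the first argument would be the span returned by `tracer.startSpan(...)`.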
Installing Jaeger via Docker
# docker-compose.yml
services:
  jaeger:
    image: jaegertracing/all-in-one:1.52
    ports:
      - "16686:16686" # UI
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
    environment:
      COLLECTOR_OTLP_ENABLED: "true"
      SPAN_STORAGE_TYPE: "badger" # use Elasticsearch in production
Jaeger with Elasticsearch (Production)
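# The elasticsearch service itself is assumed to be defined elsewhere in this file
services:
  jaeger-collector:
    image: jaegertracing/jaeger-collector:1.52
    environment:
      SPAN_STORAGE_TYPE: elasticsearch
      ES_SERVER_URLS: http://elasticsearch:9200
      ES_INDEX_PREFIX: jaeger
    depends_on:
      - elasticsearch
  jaeger-query:
    image: jaegertracing/jaeger-query:1.52
    ports:
      - "16686:16686"
    environment:
      SPAN_STORAGE_TYPE: elasticsearch
      ES_SERVER_URLS: http://elasticsearch:9200
      ES_INDEX_PREFIX: jaeger # must match the collector, or the UI finds no traces
    depends_on:
      - elasticsearch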
Sampling
In production, tracing 100% of requests is expensive. Sampling strategies:
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

// Trace 10% of new (root) requests; always follow the parent's decision if one exists
const sampler = new ParentBasedSampler({
  root: new TraceIdRatioBasedSampler(0.1),
});
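Head-based sampling: the decision is made when the trace starts. It is cheap, but it can miss rare errors, because the outcome of the request is not yet known at that point.
Tail-based sampling: the decision is made after the entire trace has been received, so all error traces can be kept. This requires a collector tier that buffers complete traces, for example the OpenTelemetry Collector with its tail_sampling processor; it is not done in the Jaeger Agent, which only forwards spans.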
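The difference between the two decision points can be sketched in a few lines. This is an illustrative model, not the algorithm of any particular SDK or collector: the bucketing scheme, the `FinishedSpan` shape, and all function names are assumptions.

```typescript
// Head-based: decide from the trace id alone, before any span is recorded.
// The first 8 hex characters are read as a 32-bit number and compared with
// a threshold derived from the ratio, so the decision is deterministic per trace.
function headSample(traceId: string, ratio: number): boolean {
  const bucket = parseInt(traceId.slice(0, 8), 16); // 0 .. 2^32 - 1
  return bucket < Math.floor(ratio * 0x100000000);
}

// Parent-based: follow the caller's decision if there is one, otherwise
// fall back to the ratio-based decision for a new root trace.
function parentBasedSample(traceId: string, ratio: number, parentSampled?: boolean): boolean {
  return parentSampled ?? headSample(traceId, ratio);
}

// Hypothetical shape of a span after it has finished.
interface FinishedSpan {
  name: string;
  error: boolean;
}

// Tail-based: decide once the whole trace is available. Every trace that
// contains an error is kept; the rest fall back to the ratio.
function tailSample(traceId: string, spans: FinishedSpan[], ratio: number): boolean {
  return spans.some((s) => s.error) || headSample(traceId, ratio);
}

const unluckyTraceId = '4bf92f3577b34da6a3ce929d0e0e4736'; // falls outside the 10% bucket
const failedTrace: FinishedSpan[] = [
  { name: 'processOrder', error: false },
  { name: 'chargePayment', error: true },
];

console.log(headSample(unluckyTraceId, 0.1));              // false: dropped at the head
console.log(tailSample(unluckyTraceId, failedTrace, 0.1)); // true: kept because of the error
```

The same failing request is lost under head-based sampling and kept under tail-based sampling, at the cost of buffering every span until the trace completes.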
Zipkin vs Jaeger
| | Zipkin | Jaeger |
|---|---|---|
| Storage | MySQL, Elasticsearch, Cassandra | Elasticsearch, Cassandra, Kafka (as a buffer) |
| UI | Basic | Richer |
| OTel support | Yes | Native |
| Sampling | Basic | Advanced |
For new projects, use Jaeger. Choose Zipkin if it is already in use or if compatibility with existing Zipkin tooling is required.
Implementation Timeline
- OpenTelemetry SDK + Jaeger + auto-instrumentation for 3–5 services — 3–5 days
- Manual instrumentation of business operations + sampling setup — another 3–5 days
- Alert configuration for p95 latency via Prometheus + Grafana — 2–3 days







