Distributed Tracing
Follow a single request as it travels through dozens of microservices.
1. The Package Tracking Analogy
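Distributed tracing follows a request as it flows through multiple services and records a timeline (a trace) of every operation along the way, much as a tracking number lets you follow a package through each facility it passes. It is essential for debugging latency and errors in microservices.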
2. Core Concepts
Trace
The entire journey of a request. Contains multiple spans. Has a unique trace ID.
Span
A single operation within a trace. Has a start time, a duration, and a reference to its parent span (the root span has no parent). Example: 'database query' or 'HTTP call to user-service'.
Trace ID
Unique identifier propagated across all services. Ties everything together.
Span ID
Unique identifier for each span. Each span also records its parent's span ID, which creates the hierarchy.
Context Propagation
Passing the trace ID and the current span ID in HTTP headers (e.g., X-Trace-Id, or the W3C standard traceparent header) so each service can continue the trace.
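The concepts above can be sketched in a few lines of Python. This is a minimal illustration, not any particular tracer's API: the `Span` class, the `inject` and `start_span` helpers, and the `X-Parent-Span-Id` header name are all assumptions made for the example (only `X-Trace-Id` appears in the text).

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Span:
    """One operation within a trace."""
    name: str
    trace_id: str                       # shared by every span in the trace
    parent_id: Optional[str] = None     # None marks the root span
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    start: float = field(default_factory=time.time)
    duration_ms: Optional[float] = None

    def finish(self) -> None:
        """Record how long the operation took."""
        self.duration_ms = (time.time() - self.start) * 1000.0


def inject(span: Span, headers: Dict[str, str]) -> Dict[str, str]:
    """Copy the trace context into outgoing HTTP headers."""
    headers["X-Trace-Id"] = span.trace_id
    headers["X-Parent-Span-Id"] = span.span_id  # hypothetical header name
    return headers


def start_span(name: str, headers: Optional[Dict[str, str]] = None) -> Span:
    """Continue the caller's trace if the headers carry one, else start a new trace."""
    headers = headers or {}
    trace_id = headers.get("X-Trace-Id") or secrets.token_hex(16)
    return Span(name=name, trace_id=trace_id,
                parent_id=headers.get("X-Parent-Span-Id"))
```

Calling `start_span("gateway")` creates a root span with a fresh trace ID. Passing `inject(root, {})` as the headers of the next `start_span` call yields a child that shares the trace ID and points back at the root through `parent_id`.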
3. How It Works
| Step | Service | Action |
|---|---|---|
| 1 | API Gateway | Create trace ID: abc123. Start span: "gateway" |
| 2 | API Gateway | Call Order Service with header: X-Trace-Id: abc123 |
| 3 | Order Service | Extract trace ID, create child span: "process-order" |
| 4 | Order Service | Call Payment Service with same trace ID |
| 5 | Payment Service | Create child span: "charge-card" (duration: 150ms) |
| 6 | Trace Collector | Assembles all spans into complete trace visualization |
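The six steps in the table can be simulated in a single process. The sketch below replaces real HTTP calls with plain function calls and uses an in-memory list as the collector. The function names and the `X-Parent-Span-Id` header are hypothetical, and the trace ID is generated randomly rather than being the literal `abc123`.

```python
import secrets
from collections import defaultdict
from typing import Dict, List, Optional

COLLECTOR: List[dict] = []  # stands in for the trace collector's storage


def record_span(name: str, service: str, headers: Dict[str, str]) -> Dict[str, str]:
    """Record a span and return the headers to send on the next outgoing call."""
    span = {
        "trace_id": headers.get("X-Trace-Id") or secrets.token_hex(16),
        "span_id": secrets.token_hex(8),
        "parent_id": headers.get("X-Parent-Span-Id"),  # hypothetical header
        "name": name,
        "service": service,
    }
    COLLECTOR.append(span)
    return {"X-Trace-Id": span["trace_id"], "X-Parent-Span-Id": span["span_id"]}


def payment_service(headers: Dict[str, str]) -> None:   # step 5
    record_span("charge-card", "Payment Service", headers)


def order_service(headers: Dict[str, str]) -> None:     # steps 3-4
    outgoing = record_span("process-order", "Order Service", headers)
    payment_service(outgoing)


def api_gateway() -> str:                                # steps 1-2
    outgoing = record_span("gateway", "API Gateway", {})
    order_service(outgoing)
    return outgoing["X-Trace-Id"]


def render_trace(trace_id: str) -> List[str]:            # step 6
    """Assemble collected spans into an indented tree, one line per span."""
    children = defaultdict(list)
    for span in COLLECTOR:
        if span["trace_id"] == trace_id:
            children[span["parent_id"]].append(span)
    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in children[parent_id]:
            lines.append("  " * depth + f'{span["name"]} ({span["service"]})')
            walk(span["span_id"], depth + 1)

    walk(None, 0)
    return lines
```

`print("\n".join(render_trace(api_gateway())))` prints the three spans nested under one another, gateway first. The collector needs only the trace ID to group spans and the parent ID to order them, which is why both travel in the headers.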
4. Real-World Dry Run
Scenario: Checkout taking 5 seconds (should be 500ms)
Without tracing: "Something is slow somewhere."
With tracing: "The check_stock_query in Inventory Service took 4400ms due to a missing index."
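One way to reach that conclusion mechanically is to compute each span's self time (its duration minus the time spent in its direct children) and pick the largest. The sketch below assumes children run sequentially. The trace data is illustrative: only the 5 second total, the 4400ms `check_stock_query`, and the 150ms `charge-card` come from this page, and every other span and number is made up.

```python
from typing import Dict, List, NamedTuple, Optional


class Span(NamedTuple):
    span_id: str
    parent_id: Optional[str]
    service: str
    name: str
    duration_ms: int


# Illustrative trace for the slow checkout (mostly invented numbers, see above).
TRACE = [
    Span("a", None, "API Gateway", "POST /checkout", 5000),
    Span("b", "a", "Order Service", "process-order", 4900),
    Span("c", "b", "Inventory Service", "check-stock", 4450),
    Span("d", "c", "Inventory Service", "check_stock_query", 4400),
    Span("e", "b", "Payment Service", "charge-card", 150),
]


def self_times(spans: List[Span]) -> Dict[str, int]:
    """Time spent in each span itself, excluding its direct children.

    Assumes children run one after another; overlapping children would
    need interval arithmetic on start and end timestamps instead.
    """
    result = {s.span_id: s.duration_ms for s in spans}
    for s in spans:
        if s.parent_id is not None:
            result[s.parent_id] -= s.duration_ms
    return result


def bottleneck(spans: List[Span]) -> Span:
    """Return the span with the largest self time."""
    own = self_times(spans)
    return max(spans, key=lambda s: own[s.span_id])
```

Here `bottleneck(TRACE)` returns the `check_stock_query` span. Ranking by self time rather than total duration matters: the gateway span is the longest overall, but almost all of its time is spent waiting on children.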
5. Sampling Strategies
Tracing every request is expensive. Use sampling:
Head-based
Decide at request start, for example by sampling 1% of all requests. Simple, but might miss rare errors.
Tail-based
Decide after the request completes. Keep all errors and slow requests. More expensive, because every span must be buffered until the decision is made.
Priority
Always trace certain endpoints (checkout, payment). Sample others.
Adaptive
Increase the sampling rate when traffic is low. Decrease it during peak.
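Each of the four strategies reduces to a small decision function. The sketch below is one possible shape, not a real tracer's sampler API: the function names, the endpoint paths, the 1000ms slow threshold, and the target of 10 traces per second are all assumptions chosen for illustration.

```python
import hashlib


def head_sample(trace_id: str, rate: float = 0.01) -> bool:
    """Head-based: decide at request start, deterministically from the trace ID.

    Hashing the ID lets every service reach the same decision without coordination.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")   # uniform in [0, 2**64)
    return bucket < rate * 2**64


ALWAYS_TRACE = {"/checkout", "/payment"}   # hypothetical endpoint paths


def priority_sample(path: str, trace_id: str, rate: float = 0.01) -> bool:
    """Priority: always trace critical endpoints, head-sample the rest."""
    return path in ALWAYS_TRACE or head_sample(trace_id, rate)


def tail_sample(has_error: bool, duration_ms: float, trace_id: str,
                slow_ms: float = 1000.0, rate: float = 0.01) -> bool:
    """Tail-based: decide after completion; keep all errors and slow traces."""
    return has_error or duration_ms >= slow_ms or head_sample(trace_id, rate)


def adaptive_rate(requests_per_sec: float,
                  target_traces_per_sec: float = 10.0) -> float:
    """Adaptive: choose a rate that keeps roughly a fixed number of traces per second."""
    if requests_per_sec <= 0:
        return 1.0
    return min(1.0, target_traces_per_sec / requests_per_sec)
```

With these defaults, `adaptive_rate(5.0)` keeps everything (rate 1.0) while `adaptive_rate(1000.0)` drops to 1%. Note that `tail_sample` needs the error flag and the duration as inputs, which are only known once the request has finished. This is what makes tail-based sampling more costly to run.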
6. Common Tools
Jaeger
Popular open-source tracing backend. Originally from Uber, now a CNCF project.
Zipkin
Originally from Twitter. Simple and mature.
OpenTelemetry
Vendor-neutral CNCF standard for instrumentation, formed from the merger of OpenTracing and OpenCensus. Supported by most tracing backends.
AWS X-Ray
AWS native. Integrates with Lambda, ECS, etc.
Datadog APM
SaaS. Automatic instrumentation for many languages.
Honeycomb
Observability platform. Great for high-cardinality data.
7. Key Takeaways
Quiz
1. A request calls 5 services. How many spans minimum?
2. To keep all error traces while sampling, you need: