Module 7 - Observability

Distributed Tracing

Follow a single request as it travels through dozens of microservices.

1. The Package Tracking Analogy

Simple Analogy
When you order a package, you get a tracking number. You can see: warehouse → truck → distribution center → local post → delivered. Each step is recorded with a timestamp. Distributed tracing does the same thing for API requests traveling through your services.

Distributed tracing tracks a request as it flows through multiple services, creating a timeline (trace) of all operations. Essential for debugging latency and errors in microservices.

2. Core Concepts

Trace

The entire journey of a request. Contains multiple spans. Has a unique trace ID.

Span

A single operation within a trace. Has a start time, a duration, and a parent span (the root span is the only one without a parent). Example: 'database query' or 'HTTP call to user-service'.

Trace ID

Unique identifier propagated across all services. Ties everything together.

Span ID

Unique identifier for each span. Parent span ID creates the hierarchy.
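
To make the relationship between these identifiers concrete, here is a minimal sketch. The Span record and its field names are hypothetical and not taken from any tracing library; the ID sizes follow the W3C Trace Context convention (16-byte trace ID, 8-byte span ID).

```python
import secrets
from dataclasses import dataclass, field
from typing import Optional


def new_trace_id() -> str:
    """Random 16-byte ID rendered as 32 hex characters."""
    return secrets.token_hex(16)


def new_span_id() -> str:
    """Random 8-byte ID rendered as 16 hex characters."""
    return secrets.token_hex(8)


@dataclass
class Span:
    """Hypothetical span record; field names are illustrative only."""
    trace_id: str                          # shared by every span in the trace
    name: str                              # e.g. "database query"
    parent_span_id: Optional[str] = None   # None marks the root span
    span_id: str = field(default_factory=new_span_id)

    def child(self, name: str) -> "Span":
        """Create a child span: same trace ID, this span as the parent."""
        return Span(trace_id=self.trace_id, name=name, parent_span_id=self.span_id)


root = Span(trace_id=new_trace_id(), name="gateway")
query = root.child("database query")
```

Every span carries the same trace ID, while the parent span ID is what lets a collector rebuild the hierarchy later.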

Context Propagation

Passing the trace ID and the current span ID in HTTP headers (e.g., a custom X-Trace-Id header or the W3C traceparent header) so each downstream service can continue the same trace.
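
As an illustration of propagation, the sketch below writes and reads a W3C traceparent header, whose layout is version-traceid-parentid-flags. The inject and extract helper names are hypothetical, and the validation is simplified: it assumes lowercase header keys and does not reject all-zero IDs, which the specification treats as invalid.

```python
import re
from typing import Dict, Optional, Tuple

# Version 00 layout: 32 hex trace ID, 16 hex parent span ID, 2 hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers: Dict[str, str], trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Write the current trace context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: Dict[str, str]) -> Optional[Tuple[str, str, bool]]:
    """Read (trace_id, parent_span_id, sampled) from incoming headers.

    Returns None when the header is missing or malformed; a receiving
    service would then start a new trace instead of continuing one.
    """
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    return trace_id, parent_span_id, bool(int(flags, 16) & 0x01)


outgoing: Dict[str, str] = {}
inject(outgoing, trace_id="4bf92f3577b34da6a3ce929d0e0e4736", span_id="00f067aa0ba902b7")
context = extract(outgoing)
```

The caller injects before each outbound call, and the callee extracts on arrival and uses the received span ID as the parent of its own span.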

3. How It Works

Step | Service         | Action
1    | API Gateway     | Create trace ID: abc123. Start span: "gateway"
2    | API Gateway     | Call Order Service with header: X-Trace-Id: abc123
3    | Order Service   | Extract trace ID, create child span: "process-order"
4    | Order Service   | Call Payment Service with same trace ID
5    | Payment Service | Create child span: "charge-card" (duration: 150ms)
6    | Trace Collector | Assembles all spans into complete trace visualization
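
The six steps can be sketched as a small in-process simulation. Everything here is an assumption made for illustration: the services are plain functions, the collector is a list, span IDs are sequential strings, and X-Parent-Span-Id is a made-up header name used alongside X-Trace-Id.

```python
import itertools
from typing import Dict, List, Optional

collector: List[dict] = []            # stand-in for the trace collector
_span_counter = itertools.count(1)    # sequential span IDs, for readability


def record_span(trace_id: str, name: str, parent_id: Optional[str]) -> str:
    """Report one span to the collector and return its span ID."""
    span_id = f"s{next(_span_counter)}"
    collector.append(
        {"trace_id": trace_id, "span_id": span_id, "parent_id": parent_id, "name": name}
    )
    return span_id


def payment_service(headers: Dict[str, str]) -> None:
    # Step 5: create the child span "charge-card".
    record_span(headers["X-Trace-Id"], "charge-card", headers["X-Parent-Span-Id"])


def order_service(headers: Dict[str, str]) -> None:
    # Step 3: extract the trace ID and create the child span "process-order".
    span_id = record_span(headers["X-Trace-Id"], "process-order", headers["X-Parent-Span-Id"])
    # Step 4: call Payment Service with the same trace ID.
    payment_service({"X-Trace-Id": headers["X-Trace-Id"], "X-Parent-Span-Id": span_id})


def api_gateway() -> str:
    # Step 1: create the trace ID and start the root span.
    trace_id = "abc123"
    span_id = record_span(trace_id, "gateway", None)
    # Step 2: call Order Service, passing the context in headers.
    order_service({"X-Trace-Id": trace_id, "X-Parent-Span-Id": span_id})
    return trace_id


def assemble(trace_id: str) -> List[str]:
    """Step 6: order span names root-first by following parent links."""
    spans = [s for s in collector if s["trace_id"] == trace_id]
    by_parent = {s["parent_id"]: s for s in spans}  # one child per parent in this linear flow
    chain, current = [], by_parent.get(None)
    while current is not None:
        chain.append(current["name"])
        current = by_parent.get(current["span_id"])
    return chain


trace_id = api_gateway()
path = assemble(trace_id)   # ["gateway", "process-order", "charge-card"]
```

Real services report spans asynchronously and possibly out of order, which is why the collector relies on the parent links rather than on arrival order.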

4. Real-World Dry Run

Scenario: Checkout taking 5 seconds (should be 500ms)

1. Get slow request's trace ID from logs
   trace_id: tr_abc123
2. Open trace in Jaeger/Zipkin
   See complete waterfall diagram
3. Analyze span durations
   Gateway: 10ms, Order: 50ms, Inventory: 4500ms (!), Payment: 200ms
4. Drill into Inventory span
   Child span: 'check_stock_query' took 4400ms
5. Root cause identified
   Missing index on product_id. Query doing full table scan.

Without tracing: "Something is slow somewhere." With tracing: "The check_stock_query in Inventory Service took 4400ms due to a missing index on product_id."
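
The drill-down in steps 3 and 4 can also be done programmatically. The sketch below uses the durations from the dry run. The span layout is a simplification made for the example: span names double as IDs, and the four service spans are treated as siblings.

```python
from typing import Dict, List, Optional

# Durations from the dry run; the flat parent/child layout is an assumption.
spans: List[Dict] = [
    {"name": "gateway", "parent": None, "duration_ms": 10},
    {"name": "order", "parent": None, "duration_ms": 50},
    {"name": "inventory", "parent": None, "duration_ms": 4500},
    {"name": "payment", "parent": None, "duration_ms": 200},
    {"name": "check_stock_query", "parent": "inventory", "duration_ms": 4400},
]


def drill_down(spans: List[Dict], parent: Optional[str] = None) -> List[str]:
    """Follow the slowest span at each level until one has no children."""
    path: List[str] = []
    while True:
        children = [s for s in spans if s["parent"] == parent]
        if not children:
            return path
        slowest = max(children, key=lambda s: s["duration_ms"])
        path.append(slowest["name"])
        parent = slowest["name"]


bottleneck_path = drill_down(spans)   # ["inventory", "check_stock_query"]
```

This is what a waterfall view lets you do by eye: pick the widest bar, then the widest bar inside it.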

5. Sampling Strategies

Tracing every request is expensive. Use sampling:

Head-based

Decide at request start. Sample 1% of all requests. Simple but might miss rare errors.
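
A minimal sketch of a head-based decision, assuming the decision is derived from a hash of the trace ID (one common approach, not the only one). The function name and the 1% default are illustrative.

```python
import hashlib


def head_sample(trace_id: str, rate: float = 0.01) -> bool:
    """Decide at request start, using only the trace ID.

    Hashing the ID instead of calling random() means every service
    that sees the same trace ID reaches the same decision.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32   # value in [0, 1)
    return bucket < rate


# Roughly 1% of these synthetic trace IDs should be kept.
kept = sum(head_sample(f"trace-{i}") for i in range(100_000))
```

Because the decision is made before anything is known about the outcome, an error in an unsampled request leaves no trace behind.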

Tail-based

Decide after the request completes. Keep all errors and slow requests. More expensive, because every span must be buffered until the trace is finished.
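
A tail-based decision might look like the sketch below, which runs only once all spans of a trace have been collected. The span fields (start_ms, end_ms, error) and the 1000ms threshold are assumptions made for the example.

```python
from typing import Dict, List


def tail_sample(spans: List[Dict], slow_ms: float = 1000.0) -> bool:
    """Keep the finished trace if any span errored or the trace was slow."""
    has_error = any(s.get("error", False) for s in spans)
    total_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    return has_error or total_ms >= slow_ms


fast_ok = [{"start_ms": 0, "end_ms": 120}]
fast_error = [{"start_ms": 0, "end_ms": 120}, {"start_ms": 10, "end_ms": 40, "error": True}]
slow_ok = [{"start_ms": 0, "end_ms": 5000}]

decisions = [tail_sample(t) for t in (fast_ok, fast_error, slow_ok)]   # [False, True, True]
```

In practice a tail sampler usually also keeps a small random share of normal traces so that a baseline remains for comparison.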

Priority

Always trace certain endpoints (checkout, payment). Sample others.
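
As a sketch, priority sampling can be a prefix check placed in front of an ordinary probabilistic sampler. The endpoint prefixes, the function name, and the injectable rng parameter are all hypothetical.

```python
import random
from typing import Callable

ALWAYS_TRACE = ("/checkout", "/payment")   # hypothetical high-priority prefixes


def priority_sample(
    path: str,
    default_rate: float = 0.01,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Always trace priority endpoints; sample everything else at default_rate."""
    if path.startswith(ALWAYS_TRACE):
        return True
    return rng() < default_rate


always = priority_sample("/checkout/confirm")   # True regardless of the random draw
```

Passing rng as a parameter is only a convenience for testing the sampler with fixed draws.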

Adaptive

Increase sampling when traffic is low. Decrease during peak.
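
One simple way to realize adaptive sampling is to aim for a fixed number of traces per second and derive the rate from current traffic. This is a sketch under that assumption; the target of 10 traces per second is arbitrary.

```python
def adaptive_rate(requests_per_sec: float, target_traces_per_sec: float = 10.0) -> float:
    """Return the sampling rate that yields about target_traces_per_sec.

    Low traffic is traced in full; the rate shrinks as traffic grows.
    """
    if requests_per_sec <= target_traces_per_sec:
        return 1.0
    return target_traces_per_sec / requests_per_sec


quiet = adaptive_rate(5)      # 1.0: trace everything
peak = adaptive_rate(1000)    # 0.01: trace about 1 in 100
```

The returned rate can then be fed into a head-based sampler, so the volume of stored traces stays roughly constant across quiet and peak periods.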

6. Common Tools

Jaeger

Popular open-source tracing backend, originally from Uber. Now a CNCF project.

Zipkin

Originally from Twitter. Simple and mature.

OpenTelemetry

Vendor-neutral standard for instrumentation. Instrument once and export to backends such as Jaeger or Zipkin.

AWS X-Ray

AWS native. Integrates with Lambda, ECS, etc.

Datadog APM

SaaS. Automatic instrumentation for many languages.

Honeycomb

Observability platform. Great for high-cardinality data.

7. Key Takeaways

1. Trace = full request journey, Span = single operation
2. Trace ID propagated via headers (X-Trace-Id, traceparent)
3. Essential for debugging latency and errors in microservices
4. Sampling reduces cost: head-based (simple) vs tail-based (keeps errors)
5. OpenTelemetry is the emerging standard for instrumentation

Quiz

1. A request calls 5 services. How many spans minimum?

2. To keep all error traces while sampling, you need: