Module 7 - Observability

Distributed Tracing

Follow a single request as it travels through dozens of microservices.

1. The Package Tracking Analogy

Simple Analogy

When you order a package, you get a tracking number. You can see: warehouse → truck → distribution center → local post → delivered. Each step is recorded with a timestamp. Distributed tracing does exactly this, but for your API requests traveling through services.

Distributed tracing tracks a request as it flows through multiple services, creating a timeline (a trace) of all the operations involved. It is essential for debugging latency and errors in microservices.

2. Core Concepts

Trace

The entire journey of a request. Contains multiple spans. Has a unique trace ID.

Span

A single operation within a trace. Has a start time, a duration, and (except for the root span) a parent span. Example: 'database query' or 'HTTP call to user-service'.

Trace ID

Unique identifier propagated across all services. Ties everything together.

Span ID

Unique identifier for each span. Parent span ID creates the hierarchy.

Context Propagation

Passing the trace ID and the current span ID in HTTP headers (e.g., a custom X-Trace-Id header or the W3C traceparent header) so each downstream service can continue the trace.
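Context propagation can be sketched in a few lines. The example below is a simplified illustration, not a full implementation: it writes and reads a header in the W3C traceparent layout (version, trace ID, parent span ID, flags), and the names SpanContext, inject, extract, and child_of are hypothetical helpers chosen for this sketch.

```python
import re
import secrets
from typing import Dict, NamedTuple, Optional


class SpanContext(NamedTuple):
    trace_id: str   # 32 hex chars, shared by every span in the trace
    span_id: str    # 16 hex chars, unique to one span
    sampled: bool   # whether this trace is being recorded


# Layout: version "00", trace ID, parent span ID, flags (simplified validation).
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_root_context(sampled: bool = True) -> SpanContext:
    """Start a new trace at the edge of the system (e.g. the API gateway)."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8), sampled)


def inject(ctx: SpanContext, headers: Dict[str, str]) -> None:
    """Write the context into outgoing HTTP headers."""
    flags = "01" if ctx.sampled else "00"
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def extract(headers: Dict[str, str]) -> Optional[SpanContext]:
    """Read the caller's context from incoming headers, if present and valid."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 1))


def child_of(parent: SpanContext) -> SpanContext:
    """Same trace ID, fresh span ID: this is what keeps the trace connected."""
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)
```

A calling service would run inject before sending a request; the receiving service would run extract and then child_of to start its own span. If extract returns None, the receiver would typically start a new trace.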

3. How It Works

Step	Service	Action
1	API Gateway	Create trace ID: `abc123`. Start span: "gateway"
2	API Gateway	Call Order Service with header: `X-Trace-Id: abc123`
3	Order Service	Extract trace ID, create child span: "process-order"
4	Order Service	Call Payment Service with same trace ID
5	Payment Service	Create child span: "charge-card" (duration: 150ms)
6	Trace Collector	Assembles all spans into complete trace visualization
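The steps in the table can be simulated in a single process. This is a rough sketch under stated assumptions: plain function calls stand in for HTTP requests, a list stands in for the trace collector, and the X-Parent-Span-Id header is a hypothetical addition (the table only shows X-Trace-Id) so that each child span can record its parent.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    service: str


collector: List[Span] = []        # stands in for the trace collector
_span_ids = itertools.count(1)    # simple readable span IDs for the demo


def start_span(name: str, service: str, trace_id: str,
               parent_id: Optional[str] = None) -> Span:
    span = Span(trace_id, f"span-{next(_span_ids)}", parent_id, name, service)
    collector.append(span)        # every service reports its spans
    return span


def outgoing_headers(span: Span) -> Dict[str, str]:
    # X-Parent-Span-Id is a hypothetical header name used only in this sketch.
    return {"X-Trace-Id": span.trace_id, "X-Parent-Span-Id": span.span_id}


def payment_service(headers: Dict[str, str]) -> None:
    # Step 5: create a child span under the caller's span.
    start_span("charge-card", "payment",
               headers["X-Trace-Id"], headers["X-Parent-Span-Id"])


def order_service(headers: Dict[str, str]) -> None:
    # Steps 3-4: extract the trace ID, create a child span, call Payment.
    span = start_span("process-order", "order",
                      headers["X-Trace-Id"], headers["X-Parent-Span-Id"])
    payment_service(outgoing_headers(span))


def api_gateway() -> str:
    # Steps 1-2: create the trace ID, start the root span, call Order.
    span = start_span("gateway", "api-gateway", trace_id="abc123")
    order_service(outgoing_headers(span))
    return span.trace_id


def render(trace_id: str) -> List[str]:
    """Step 6: assemble the spans of one trace into an indented tree."""
    children: Dict[Optional[str], List[Span]] = {}
    for span in collector:
        if span.trace_id == trace_id:
            children.setdefault(span.parent_id, []).append(span)
    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in children.get(parent_id, []):
            lines.append("  " * depth + f"{span.name} ({span.service})")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines
```

Calling render(api_gateway()) returns the three spans as a tree: gateway at the top, process-order nested under it, and charge-card nested under process-order.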

4. Real-World Dry Run

Scenario: Checkout taking 5 seconds (should be 500ms)

1. Get the slow request's trace ID from the logs: trace_id: tr_abc123

2. Open the trace in Jaeger/Zipkin and see the complete waterfall diagram.

3. Analyze span durations: Gateway: 10ms, Order: 50ms, Inventory: 4500ms (!), Payment: 200ms

4. Drill into the Inventory span: child span 'check_stock_query' took 4400ms.

5. Root cause identified: missing index on product_id. The query is doing a full table scan.

Without tracing: "Something is slow somewhere." With tracing: "The check_stock_query in Inventory Service took 4400ms due to missing index."
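The "find the slowest span, then drill into it" routine from the dry run can also be automated. The sketch below uses the durations from the scenario. The layout is an assumption made for illustration: the four service spans are treated as top-level entries, and only check_stock_query is modeled as a child (of Inventory), since that is the only parent/child relationship the scenario states. The Span class and drill_down function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    name: str
    duration_ms: int
    parent: Optional[str] = None   # name of the parent span, None at top level


# Durations taken from the checkout scenario.
CHECKOUT_TRACE = [
    Span("Gateway", 10),
    Span("Order", 50),
    Span("Inventory", 4500),
    Span("Payment", 200),
    Span("check_stock_query", 4400, parent="Inventory"),
]


def drill_down(spans: List[Span]) -> List[Span]:
    """Follow the slowest span at each level until there are no children."""
    path: List[Span] = []
    parent: Optional[str] = None
    while True:
        level = [s for s in spans if s.parent == parent]
        if not level:
            return path
        slowest = max(level, key=lambda s: s.duration_ms)
        path.append(slowest)
        parent = slowest.name


path = drill_down(CHECKOUT_TRACE)
print(" -> ".join(f"{s.name} ({s.duration_ms}ms)" for s in path))
# prints: Inventory (4500ms) -> check_stock_query (4400ms)
```

This mirrors what you do by eye in a waterfall view: the last span on the path is where to start looking for the root cause.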

5. Sampling Strategies

Tracing every request is expensive. Use sampling:

Head-based

Decide at request start. Sample 1% of all requests. Simple but might miss rare errors.

Tail-based

Decide after the request completes. Keep all errors and slow requests. More expensive, because every span must be buffered until the trace finishes.

Priority

Always trace certain endpoints (checkout, payment). Sample others.

Adaptive

Increase sampling when traffic is low. Decrease during peak.
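The first three strategies can be sketched as small decision functions. This is a minimal illustration, not how any particular tracer implements sampling: the function names, the 1000ms slow threshold, and the endpoint prefixes are assumptions. The head-based decision hashes the trace ID so that every service reaches the same answer for the same trace without coordinating.

```python
import hashlib
from typing import Tuple


def head_sample(trace_id: str, rate: float = 0.01) -> bool:
    """Head-based: decide at request start, using only the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
    return bucket < rate


def tail_sample(trace_id: str, had_error: bool, duration_ms: float,
                slow_ms: float = 1000, baseline_rate: float = 0.01) -> bool:
    """Tail-based: decide after the request completes.

    Keeps every error and every slow trace; the rest fall back to a
    baseline rate. slow_ms is an assumed threshold for this sketch.
    """
    if had_error or duration_ms >= slow_ms:
        return True
    return head_sample(trace_id, baseline_rate)


# Hypothetical endpoint prefixes that should always be traced.
ALWAYS_TRACE: Tuple[str, ...] = ("/checkout", "/payment")


def priority_sample(trace_id: str, path: str, rate: float = 0.01) -> bool:
    """Priority: always trace critical endpoints, sample everything else."""
    if path.startswith(ALWAYS_TRACE):
        return True
    return head_sample(trace_id, rate)
```

The trade-off shows up in the signatures: head_sample needs nothing but the trace ID, so it can run before any work is done, while tail_sample needs the outcome of the request, which means all spans have to be held somewhere until that outcome is known.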

6. Common Tools

Jaeger

Popular open-source CNCF project. Originally from Uber.

Zipkin

Originally from Twitter. Simple and mature.

OpenTelemetry

Vendor-neutral standard for instrumentation and a CNCF project. Increasingly the default choice for new instrumentation.

AWS X-Ray

AWS native. Integrates with Lambda, ECS, etc.

Datadog APM

SaaS. Automatic instrumentation for many languages.

Honeycomb

Observability platform. Great for high-cardinality data.

7. Key Takeaways

1. Trace = full request journey, Span = single operation

2. Trace ID propagated via headers (X-Trace-Id, traceparent)

3. Essential for debugging latency and errors in microservices

4. Sampling reduces cost: head-based (simple) vs tail-based (keeps errors)

5. OpenTelemetry is the emerging standard for instrumentation

Quiz

1. A request calls 5 services. How many spans minimum?

2. To keep all error traces while sampling, you need: