Module 3 - Async Processing

Batch vs Stream Processing

Two paradigms for processing data at scale: in chunks or continuously.

1. A Simple Analogy

Batch Processing: Like a bakery that bakes the whole day's bread at 4 AM. Efficient, but you can't get fresh bread at noon.

Stream Processing: Like a sushi chef making rolls on demand. Each piece is fresh, but there's more overhead per item.

2. Key Differences

| Aspect     | Batch                    | Stream                   |
|------------|--------------------------|--------------------------|
| Latency    | Minutes to hours         | Milliseconds to seconds  |
| Data       | Bounded (fixed dataset)  | Unbounded (continuous)   |
| Processing | Complete dataset at once | Event by event           |
| Throughput | Higher (optimized)       | Lower per event          |
| Complexity | Simpler                  | More complex             |
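
To make the contrast concrete, here is a toy sketch in plain Python (hypothetical data, not any particular framework): the batch function sees the whole bounded dataset at once, while the stream function keeps running state and emits a result per event.

```python
def batch_total(orders):
    # Batch: the entire (bounded) dataset is available up front,
    # so one optimized pass over all records is enough.
    return sum(orders)

def stream_totals(order_events):
    # Stream: events arrive one at a time from a source that is,
    # in principle, unbounded; keep running state, emit per event.
    total = 0.0
    for amount in order_events:
        total += amount
        yield total

print(batch_total([10.0, 25.5, 7.25]))           # 42.75, once, after the fact
for running in stream_totals(iter([10.0, 25.5, 7.25])):
    print(running)                               # 10.0 -> 35.5 -> 42.75, live
```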

3. When to Use Each

Choose Batch

  • ✓ Daily/weekly reports
  • ✓ ETL pipelines
  • ✓ ML model training
  • ✓ Data warehouse loading
  • ✓ Cost optimization (spot instances)

Choose Stream

  • ✓ Real-time dashboards
  • ✓ Fraud detection
  • ✓ Live recommendations
  • ✓ Alerting systems
  • ✓ Event-driven microservices

4. Popular Tools

Batch

Apache Spark, Hadoop MapReduce, AWS Glue, dbt
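
As an illustration, a daily sales report as a PySpark batch job might look like the minimal sketch below; the sales.csv input, its date/amount columns, and the output path are assumptions for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-sales-report").getOrCreate()

# Read the full, bounded dataset in one go (hypothetical file/columns).
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Aggregate over the complete dataset at once.
daily = sales.groupBy("date").agg(F.sum("amount").alias("total_sales"))

# Write the report; the whole job runs to completion and exits.
daily.write.mode("overwrite").parquet("daily_sales_report/")
spark.stop()
```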

Stream

Apache Kafka Streams, Apache Flink, Apache Spark Structured Streaming
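
For a streaming counterpart, here is a minimal Spark Structured Streaming sketch that consumes a Kafka topic event by event; the broker address and the transactions topic are assumptions, and the spark-sql-kafka connector package must be available at runtime.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("txn-stream").getOrCreate()

# Subscribe to an unbounded source (hypothetical broker and topic).
txns = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "localhost:9092")
             .option("subscribe", "transactions")
             .load())

# Kafka delivers raw bytes; cast the payload to a string.
events = txns.selectExpr("CAST(value AS STRING) AS value")

# Process records continuously as they arrive; runs until stopped.
query = (events.writeStream
               .format("console")
               .outputMode("append")
               .start())
query.awaitTermination()
```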

Unified

Apache Beam (batch + stream with same API)
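
The sketch below illustrates the unified model: the transform chain is identical whether the source is bounded (a local text file, as here) or unbounded (e.g., swapping the read for a Pub/Sub source); the file names are hypothetical.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | "Read"   >> beam.io.ReadFromText("events.txt")        # bounded source
     | "Key"    >> beam.Map(lambda line: line.split(",")[0]) # extract a key
     | "Count"  >> beam.combiners.Count.PerElement()         # count per key
     | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
     | "Write"  >> beam.io.WriteToText("counts"))            # sharded output
```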

5. Lambda vs Kappa Architecture

Lambda Architecture

Maintains separate batch and stream layers: the batch layer periodically recomputes accurate results over the full dataset, while the speed (stream) layer covers the most recent data. Powerful, but complex to maintain because the same logic lives in two systems.
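
A toy illustration of the Lambda serving pattern in plain Python (all names hypothetical): at query time, a stale-but-accurate batch view is merged with a fresh speed view.

```python
# Rebuilt periodically (e.g., nightly) from the full dataset: accurate, stale.
batch_view = {"user_1": 100}

# Updated per event since the last batch run: fresh, approximate.
speed_view = {"user_1": 3}

def query(user_id: str) -> int:
    # Serving layer: merge the two views to answer queries.
    return batch_view.get(user_id, 0) + speed_view.get(user_id, 0)

print(query("user_1"))  # 103
```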

Kappa Architecture

Stream-only: a single streaming pipeline handles everything, and batch-style recomputation is done by replaying the event log from the beginning. Simpler (one codebase), but it requires durable, replayable streaming infrastructure.
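
A minimal sketch of the replay idea using the kafka-python client (assumed installed; the broker address and events topic are hypothetical): batch-style recomputation is just consuming the durable log again from the earliest offset.

```python
from kafka import KafkaConsumer

# No group_id plus auto_offset_reset="earliest" means this consumer
# replays the topic from the start of retention every time it runs.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop once caught up (for the demo)
)

total = 0
for record in consumer:
    total += len(record.value)  # stand-in for real per-event logic
print("recomputed from full replay:", total)
```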

6. Key Takeaways

1. Batch: high latency, high throughput, bounded data
2. Stream: low latency, event-by-event, unbounded data
3. Use batch for reports/ETL, stream for real-time
4. Lambda = batch + stream; Kappa = stream only
5. Apache Beam provides a unified batch+stream API

Quiz

1. Daily sales report aggregation is best done with which paradigm?

2. Fraud detection on credit card transactions needs which paradigm?