Module 3 - Async Processing
Batch vs Stream Processing
Two paradigms for processing data at scale: in chunks or continuously.
1. The Factory Analogy
Simple Analogy
Batch Processing: Like a bakery that bakes the whole day's bread at 4 AM. Efficient, but you can't get a fresh loaf at noon.
Stream Processing: Like a sushi chef making rolls on demand. Each piece is fresh, but there's more overhead per item.
2. Key Differences
| Aspect | Batch | Stream |
|---|---|---|
| Latency | Minutes to hours | Milliseconds to seconds |
| Data | Bounded (fixed dataset) | Unbounded (continuous) |
| Processing | Complete dataset at once | Event by event |
| Throughput | Higher (overhead amortized across the batch) | Lower (overhead paid per event) |
| Complexity | Simpler | More complex |
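To make the "complete dataset at once" versus "event by event" rows concrete, here is a minimal sketch in plain Python. The `Event` shape, the store names, and both function names are hypothetical and chosen only for illustration; a real system would read from files or a message broker rather than an in-memory list.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Event = Tuple[str, float]  # hypothetical shape: (store id, sale amount)


def batch_totals(events: List[Event]) -> Dict[str, float]:
    """Batch: the whole bounded dataset exists before processing starts."""
    totals: Dict[str, float] = defaultdict(float)
    for store, amount in events:
        totals[store] += amount
    return dict(totals)  # one result, available only after the full pass


def stream_totals(events: Iterable[Event]) -> Iterator[Dict[str, float]]:
    """Stream: update state per event; the input may never end."""
    totals: Dict[str, float] = defaultdict(float)
    for store, amount in events:
        totals[store] += amount
        yield dict(totals)  # a fresh snapshot after every single event


sales = [("north", 10.0), ("south", 5.0), ("north", 2.5)]

print(batch_totals(sales))  # {'north': 12.5, 'south': 5.0}
for snapshot in stream_totals(iter(sales)):
    print(snapshot)  # three snapshots, one per event
```

Both functions reach the same final totals. The difference is when results become visible: the batch version answers once at the end, while the stream version answers after each event at the cost of doing bookkeeping per item.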
3. When to Use Each
Choose Batch
- ✓ Daily/weekly reports
- ✓ ETL pipelines
- ✓ ML model training
- ✓ Data warehouse loading
- ✓ Cost optimization (spot instances)
Choose Stream
- ✓ Real-time dashboards
- ✓ Fraud detection
- ✓ Live recommendations
- ✓ Alerting systems
- ✓ Event-driven microservices
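As an illustration of why a case like fraud detection leans toward streaming, the sketch below evaluates a rule on each transaction as it arrives instead of waiting for a nightly job. The `VelocityRule` class, its thresholds, and the card id are assumptions made up for this example, and it assumes timestamps arrive in order for each card.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class VelocityRule:
    """Flags a card with more than `max_txns` transactions inside `window_s` seconds."""

    def __init__(self, max_txns: int = 3, window_s: float = 60.0) -> None:
        self.max_txns = max_txns
        self.window_s = window_s
        # Per-card timestamps that are still inside the sliding window.
        self._recent: Dict[str, Deque[float]] = defaultdict(deque)

    def observe(self, card_id: str, timestamp: float) -> bool:
        """Process one event; return True if the card should be flagged now."""
        recent = self._recent[card_id]
        recent.append(timestamp)
        # Evict timestamps that have fallen out of the window.
        while recent and recent[0] <= timestamp - self.window_s:
            recent.popleft()
        return len(recent) > self.max_txns


rule = VelocityRule(max_txns=3, window_s=60.0)
flags = [rule.observe("card-1", t) for t in (0, 10, 20, 30, 120)]
print(flags)  # [False, False, False, True, False]
```

The fourth transaction is flagged the moment it arrives. A batch job running the same rule would produce the same flag, but only after the next scheduled run.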
4. Popular Tools
Batch
Apache Spark, Hadoop MapReduce, AWS Glue, dbt
Stream
Kafka Streams (part of Apache Kafka), Apache Flink, Spark Structured Streaming
Unified
Apache Beam (batch + stream with same API)
5. Lambda vs Kappa Architecture
Lambda Architecture
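Runs separate batch and stream layers over the same data. The batch layer recomputes complete, accurate results; the stream layer covers recent events for speed. Queries merge both views. Complex to maintain, because the same logic lives in two code paths.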
Kappa Architecture
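Stream-only. Everything runs through one streaming job, and batch-like reprocessing is done by replaying the stored event stream through that same code. Simpler to maintain, but it requires a robust streaming platform that retains events long enough to replay them.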
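The contrast between the two architectures can be sketched with a toy page-view counter. This is only a sketch under simplifying assumptions: the function names and the in-memory `log` list are hypothetical stand-ins for real batch jobs, stream jobs, and a durable event log.

```python
from collections import Counter
from typing import Iterable, List


def batch_layer(all_events: List[str]) -> Counter:
    """Lambda batch layer: recompute an accurate view from the full dataset."""
    return Counter(all_events)


def speed_layer(view: Counter, event: str) -> Counter:
    """Lambda speed layer: update a view incrementally, one new event at a time."""
    view[event] += 1
    return view


def lambda_query(batch_view: Counter, speed_view: Counter, key: str) -> int:
    """Lambda serving step: merge both views to answer a query."""
    return batch_view[key] + speed_view[key]


def kappa_process(log: Iterable[str]) -> Counter:
    """Kappa: a single streaming job; reprocessing replays the log through it."""
    view: Counter = Counter()
    for event in log:
        view[event] += 1
    return view


log = ["home", "cart", "home", "home"]  # hypothetical page-view events

# Lambda: the last batch run covered the first two events; the rest arrived later.
batch_view = batch_layer(log[:2])
speed_view: Counter = Counter()
for event in log[2:]:
    speed_layer(speed_view, event)

print(lambda_query(batch_view, speed_view, "home"))  # 3
print(kappa_process(log)["home"])                    # 3
```

Both paths give the same count, but the Lambda side needs two processing functions plus a merge step, while the Kappa side has one function and relies on the log being replayable from the start.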
6. Key Takeaways
1. Batch: high latency, high throughput, bounded data
2. Stream: low latency, event-by-event, unbounded data
3. Use batch for reports/ETL, stream for real-time
4. Lambda = batch + stream; Kappa = stream only
5. Apache Beam provides unified batch+stream API
Quiz
1. Daily sales report aggregation is best done with:
2. Fraud detection on credit card transactions needs: