Module 3 - Async Processing

Batch vs Stream Processing

Two paradigms for processing data at scale: in chunks or continuously.

1. A Simple Analogy

Batch Processing: Like a bakery that bakes the whole day's bread at 4 AM. Efficient, but you can't get fresh bread at noon.

Stream Processing: Like a sushi chef making rolls on demand. Each piece is fresh, but there's more overhead per item.

2. Key Differences

| Aspect     | Batch                    | Stream                   |
|------------|--------------------------|--------------------------|
| Latency    | Minutes to hours         | Milliseconds to seconds  |
| Data       | Bounded (fixed dataset)  | Unbounded (continuous)   |
| Processing | Complete dataset at once | Event by event           |
| Throughput | Higher (optimized)       | Lower per event          |
| Complexity | Simpler                  | More complex             |
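
To make the contrast concrete, here is a toy sketch in plain Python (hypothetical data, not any particular framework): the batch function sees the whole bounded dataset at once, while the stream function keeps running state and emits a result per event.

```python
def batch_total(orders):
    # Batch: the entire (bounded) dataset is available up front,
    # so one optimized pass over all records is enough.
    return sum(orders)

def stream_totals(order_events):
    # Stream: events arrive one at a time from a source that is,
    # in principle, unbounded; keep running state, emit per event.
    total = 0.0
    for amount in order_events:
        total += amount
        yield total

print(batch_total([10.0, 25.5, 7.25]))           # 42.75, once, after the fact
for running in stream_totals(iter([10.0, 25.5, 7.25])):
    print(running)                               # 10.0 -> 35.5 -> 42.75, live
```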

3. When to Use Each

Choose Batch

  • ✓ Daily/weekly reports
  • ✓ ETL pipelines
  • ✓ ML model training
  • ✓ Data warehouse loading
  • ✓ Cost optimization (spot instances)

Choose Stream

  • ✓ Real-time dashboards
  • ✓ Fraud detection
  • ✓ Live recommendations
  • ✓ Alerting systems
  • ✓ Event-driven microservices

4. Popular Tools

Batch

Apache Spark, Hadoop MapReduce, AWS Glue, dbt
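
As an illustration, a daily sales report as a PySpark batch job might look like the minimal sketch below; the sales.csv input, its date/amount columns, and the output path are assumptions for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-sales-report").getOrCreate()

# Read the full, bounded dataset in one go (hypothetical file/columns).
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Aggregate over the complete dataset at once.
daily = sales.groupBy("date").agg(F.sum("amount").alias("total_sales"))

# Write the report; the whole job runs to completion and exits.
daily.write.mode("overwrite").parquet("daily_sales_report/")
spark.stop()
```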

Stream

Apache Kafka Streams, Apache Flink, Apache Spark Structured Streaming
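
For a streaming counterpart, here is a minimal Spark Structured Streaming sketch that consumes a Kafka topic event by event; the broker address and the transactions topic are assumptions, and the spark-sql-kafka connector package must be available at runtime.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("txn-stream").getOrCreate()

# Subscribe to an unbounded source (hypothetical broker and topic).
txns = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "localhost:9092")
             .option("subscribe", "transactions")
             .load())

# Kafka delivers raw bytes; cast the payload to a string.
events = txns.selectExpr("CAST(value AS STRING) AS value")

# Process records continuously as they arrive; runs until stopped.
query = (events.writeStream
               .format("console")
               .outputMode("append")
               .start())
query.awaitTermination()
```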

Unified

Apache Beam (batch + stream with same API)
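
The sketch below illustrates the unified model: the transform chain is identical whether the source is bounded (a local text file, as here) or unbounded (e.g., swapping the read for a Pub/Sub source); the file names are hypothetical.

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | "Read"   >> beam.io.ReadFromText("events.txt")        # bounded source
     | "Key"    >> beam.Map(lambda line: line.split(",")[0]) # extract a key
     | "Count"  >> beam.combiners.Count.PerElement()         # count per key
     | "Format" >> beam.Map(lambda kv: f"{kv[0]},{kv[1]}")
     | "Write"  >> beam.io.WriteToText("counts"))            # sharded output
```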

5. Lambda vs Kappa Architecture

Lambda Architecture

Maintains separate batch and stream layers: the batch layer periodically recomputes accurate results over the full dataset, while the speed (stream) layer covers the most recent data. Powerful, but complex to maintain because the same logic lives in two systems.
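
A toy illustration of the Lambda serving pattern in plain Python (all names hypothetical): at query time, a stale-but-accurate batch view is merged with a fresh speed view.

```python
# Rebuilt periodically (e.g., nightly) from the full dataset: accurate, stale.
batch_view = {"user_1": 100}

# Updated per event since the last batch run: fresh, approximate.
speed_view = {"user_1": 3}

def query(user_id: str) -> int:
    # Serving layer: merge the two views to answer queries.
    return batch_view.get(user_id, 0) + speed_view.get(user_id, 0)

print(query("user_1"))  # 103
```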

Kappa Architecture

Stream-only: a single streaming pipeline handles everything, and batch-style recomputation is done by replaying the event log from the beginning. Simpler (one codebase), but it requires durable, replayable streaming infrastructure.
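
A minimal sketch of the replay idea using the kafka-python client (assumed installed; the broker address and events topic are hypothetical): batch-style recomputation is just consuming the durable log again from the earliest offset.

```python
from kafka import KafkaConsumer

# No group_id plus auto_offset_reset="earliest" means this consumer
# replays the topic from the start of retention every time it runs.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop once caught up (for the demo)
)

total = 0
for record in consumer:
    total += len(record.value)  # stand-in for real per-event logic
print("recomputed from full replay:", total)
```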

6. Key Takeaways

1. Batch: high latency, high throughput, bounded data
2. Stream: low latency, event-by-event, unbounded data
3. Use batch for reports/ETL, stream for real-time
4. Lambda = batch + stream; Kappa = stream only
5. Apache Beam provides a unified batch+stream API

Quiz

1. Daily sales report aggregation is best done with which paradigm?

2. Fraud detection on credit card transactions needs which paradigm?