About today’s class
This week we move from batch to real-time data processing with Spark Structured Streaming. We’ll start with the historical context of DStreams (Spark’s legacy streaming API) and its limitations, then dive into Structured Streaming — a unified programming model that lets you write the same DataFrame code for both batch and streaming workloads.
Key topics include the unbounded-table programming model, output modes (Append, Update, Complete), the five fundamental steps of a streaming query, stateless vs. stateful transformations, event-time windows (tumbling and sliding), and watermarks for handling late-arriving data.
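To build intuition before class, the event-time window and watermark ideas can be sketched in plain Python. The snippet below is a toy simulation, not actual Spark code: it counts out-of-order events into tumbling windows and drops events that arrive behind the watermark. This simplifies Spark's real semantics (Spark tracks watermarks per query and retains window state until the window's end passes the watermark), but the mechanics of "max event time minus allowed lateness" are the same.

```python
from collections import defaultdict

def tumbling_window(ts, size):
    """Map an event timestamp to the start of its tumbling window."""
    return ts - (ts % size)

def process_stream(events, window_size, watermark_delay):
    """Incrementally count events per event-time window, dropping
    events that arrive later than the watermark allows.
    (A simplification of Spark's actual watermark semantics.)"""
    counts = defaultdict(int)   # window start -> running count (the state)
    max_event_time = 0          # highest event time observed so far

    for ts in events:
        max_event_time = max(max_event_time, ts)
        watermark = max_event_time - watermark_delay
        if ts < watermark:      # too late: treat this window's state as expired
            continue
        counts[tumbling_window(ts, window_size)] += 1
    return dict(counts)

# Events arrive out of order; timestamps are event time in seconds.
events = [1, 3, 12, 14, 2, 25, 4]
print(process_stream(events, window_size=10, watermark_delay=5))
# → {0: 2, 10: 2, 20: 1}: the late events 2 and 4 are dropped once
#   the watermark has advanced past them.
```

Notice that the state (`counts`) can only be bounded because the watermark lets us stop waiting for arbitrarily late data; without it, every window would have to be kept forever.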
Readings
Readings for this lecture (to be completed before class):
- Learning Spark, 2nd Edition by Damji et al. (2020). Chapter 8: Structured Streaming. (Free PDF from Databricks)
- Spark Structured Streaming Programming Guide
Slides
The slides for this week are available online.
Lab
In this week’s lab, you will implement a Spark Structured Streaming pipeline: consume a data stream, apply windowed aggregations over event time, and write the results to an output sink.
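The overall shape of the lab pipeline (source, transformation, sink, output mode) can be previewed with a toy micro-batch loop in plain Python. This is a hedged sketch, not the Structured Streaming API you will use in the lab: a running word count whose changed rows are emitted to the sink after each micro-batch, mimicking Spark's Update output mode.

```python
from collections import defaultdict

def run_query(micro_batches, sink):
    """Toy micro-batch loop: maintain a running word count and emit
    only the rows that changed in each batch (like Update mode)."""
    state = defaultdict(int)                 # word -> running count
    for batch in micro_batches:              # step 1: read a batch from the source
        changed = set()
        for word in batch:                   # step 2: apply the transformation
            state[word] += 1
            changed.add(word)
        for word in sorted(changed):         # step 3: write updated rows to the sink
            sink.append((word, state[word]))

sink = []
run_query([["a", "b", "a"], ["b", "c"]], sink)
print(sink)
# → [('a', 2), ('b', 1), ('b', 2), ('c', 1)]
```

In Complete mode the loop would instead re-emit the entire `state` table each batch, and in Append mode only rows that can never change again would be written; seeing all three against the same loop is a useful exercise before the lab.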
Assignment
Details for this week’s assignment will be announced in class and posted to the course website.