About today’s class
This week we begin our multi-week exploration of Apache Spark, the dominant framework for large-scale distributed data processing. We’ll start from first principles: why in-memory computing matters, how Spark’s architecture differs from Hadoop MapReduce, and how to work with Spark’s two primary data abstractions — RDDs (Resilient Distributed Datasets) and DataFrames.
By the end of class you’ll be able to write PySpark programs that read data, apply transformations, and perform aggregations with Spark SQL, and you’ll understand when to prefer RDDs over DataFrames (and vice versa).
Readings
Readings for this lecture (to be completed before class):
- Learning Spark, 2nd Edition by Damji et al. (2020). Chapters 1–3: Introduction to Apache Spark, Downloading and Getting Started, Apache Spark’s Structured APIs. (Free PDF from Databricks)
Slides
The slides for this week are available online.
Lab
This week’s lab introduces PySpark on an AWS EC2 VM, where you’ll work with RDDs and DataFrames on a real dataset.
Assignment
Details for this week’s assignment will be announced in class and posted to the course website.