About today’s class
This week we go deeper into PySpark, covering two important topics: diagnosing Spark job performance and extending Spark with custom logic via User-Defined Functions (UDFs).
The Spark UI is an essential tool for understanding how the cluster is actually executing your jobs — we’ll walk through reading the event timeline, DAG visualization, executor status, and query plans. We’ll then cover how to write Python UDFs and the more performant vectorized Pandas UDFs, which use Apache Arrow for efficient columnar data transfer between the JVM and Python.
Readings
Readings for this lecture (to be completed before class):
- Learning Spark, 2nd Edition by Damji et al. (2020). Chapter 7: Optimizing and Tuning Spark Applications. (Free PDF from Databricks)
- PySpark UDF documentation
- Pandas UDF documentation
Slides
The slides for this week are available online.
Lab
This week’s lab will walk through how to set up a Spark cluster on EC2 instances, both manually and with an automated script. Traditionally Spark runs on EMR clusters, which usually take around 45 minutes to spin up; launching plain EC2 instances is considerably faster.