About today’s class

This week, we look at two big data tools that work well together for data analytics: Polars and DuckDB. Both are meant for big-ish data, both read the Parquet file format, and both work well with data stored on AWS S3. With them, we can process TB-level data without setting up a cluster. We will explore these tools over the next two weeks, mostly with Parquet files that (often) live on S3.

All our work will be done on AWS, so make sure you’re set up following Lab 3.

Readings

Readings for this lecture (to be completed before this class):

Slides

The slides for this week are available online.

Lab

This week's formal lab explores one year of the NYC Taxi dataset using Parquet, Polars, and DuckDB, showing the advantages of columnar formatting, smart partitioning, and no-copy analytics.

The GitHub Classroom link for the lab is https://classroom.github.com/a/0Y73uVoM. The lab is due February 12.

Assignment