About today’s class

This week, we continue to look at big data tools that work well together for data analytics. These tools are designed for big-ish data and leverage the Parquet file format; with them we can process terabyte-scale data efficiently, and they integrate well with AWS S3 storage.

All our work will be done on AWS, so make sure you’re set up following Lab 3.

Readings

No new readings for this week, but you should have completed last week's readings before this class.

Slides

The slides for this week are available online.

Lab

This week the formal lab will explore one year of the NYC Taxi dataset using Parquet and Polars, demonstrating the advantages of columnar formatting, smart partitioning, and zero-copy analytics.

The GitHub Classroom link is https://classroom.github.com/a/IYlzNqy1, and the lab is due February 19.

Assignment

There are two assignments this week:

  1. Assignment 5: This assignment will focus on using Polars and DuckDB for data analysis on Parquet datasets. The assignment will be released on Friday, February 13th, and will be due on Sunday, February 22nd.
  2. Project Milestone 1: The first milestone for the final project will be due on Wednesday, February 21st. This milestone will require you to submit a project proposal outlining your research question, data sources, and proposed analysis methods. The proposal should be approximately 1-2 pages in length and should demonstrate a clear understanding of the data and methods you plan to use for your final project. If you are not using the Reddit dataset, your proposal will also need to include a data acquisition plan detailing how you will obtain and prepare your dataset for analysis; this plan must be approved by the instructor before you proceed with your project.

Information about the Reddit dataset is available here.