Lecture 4

Parquet, DuckDB, Polars: A holy trinity

Agenda and Goals for Today

Lecture

  • Distributed file systems
  • Modern file types
  • Working with large tabular data on a single node
    • DuckDB
    • Polars

Lab

  • Run a similar task with Pandas, Polars, and DuckDB

Review

  • We’ve introduced the cloud environment on AWS (or a version of it)
  • We’ve learned to move data to it
  • We’ve learned that, for some problems, parallelization can help speed up computation

Looking forward

  • Today we’ll start looking at how to store data for efficiency, and tools to compute efficiently on it

  • Some parallelization happens under the hood; we benefit from it without having to manage it explicitly

  • We’ll start seeing the advantages of lazy evaluation, which requires thought and planning

  • Compute takes time and infrastructure, but the final product is often small

  • OLAP tools can work for a lot of the ETL work you’d need. We’ll spend today and next week on that

  • We’ll then proceed to how to actually analyze big data using models, leading us to Apache Spark

  • A cool part of the tools we’re using is that they don’t actually copy the data (saving time, I/O cost)

Filesystems

Raw ingredients of storage systems

  • Disk drives (magnetic HDDs or SSDs)
  • RAM
  • Networking and CPU
  • Serialization
  • Compression
  • Caching

Single machine vs. distributed storage

Single machine

  • Local filesystems are commonly used for storing operating system files, application files, and user data files.
  • Filesystems are also used in databases to store data files, transaction logs, and backups.

Distributed storage

  • A distributed filesystem is a type of filesystem that spans multiple computers.
  • It provides a unified view of files across all the computers in the system.
  • Distributed filesystems predate the cloud

File storage

A file is a data entity with specific read, write, and reference characteristics used by software and operating systems.

Local disk

  • Operating system–managed filesystems on a local disk partition of an SSD or magnetic disk:
    • NTFS (Windows)
    • HFS+ (macOS)
    • ext4 (Linux)

Network-attached (NAS)

  • A file storage system served to clients over a network
  • Typically provides redundancy and reliability, fine-grained control of resources, storage pooling across multiple disks for large virtual volumes, and file sharing across multiple machines

Cloud filesystems

  • Not object store (more on that later)
  • Not the virtual hard drive attached to a virtual machine
  • Fully managed filesystem which takes care of networking, managing disk clusters, failures, and configuration (Azure Files, Amazon Elastic Filesystem)
  • Backed by Object Store

Object stores

The term object storage is somewhat confusing because object has several meanings in computer science. In this context, we’re talking about a specialized file-like construct that can hold pretty much any type of file: TXT, CSV, JSON, images, videos, or audio.

  • Contains objects of all shapes and sizes.
  • Every object gets a unique identifier
  • Objects are immutable; they cannot be modified in place (unlike files on a local FS)
  • Distributed by design
  • Massively scalable REST API access
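To make these properties concrete, here is a toy in-memory sketch of object-store semantics (the class and method names are invented for illustration, not a real client library):

```python
# Toy model of object-store semantics (invented class, not a real client):
# a flat mapping of key -> bytes. Keys like "raw/2023/06/trips.csv" look
# hierarchical, but the "directories" are just a naming convention.
class ToyObjectStore:
    def __init__(self):
        self._objects = {}  # flat namespace

    def put(self, key: str, data: bytes) -> None:
        # No in-place edits: a put replaces the whole object
        self._objects[key] = bytes(data)

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def list_prefix(self, prefix: str) -> list:
        # "Listing a folder" is really a prefix scan over flat keys
        return [k for k in self._objects if k.startswith(prefix)]

store = ToyObjectStore()
store.put("raw/2023/06/trips.csv", b"id,fare\n1,10.5\n")
store.put("raw/2023/07/trips.csv", b"id,fare\n2,8.0\n")
print(store.list_prefix("raw/2023/"))
```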

Distributed FS vs Object Store

|              | Distributed File System                | Object Storage                                                                  |
|--------------|----------------------------------------|---------------------------------------------------------------------------------|
| Organization | Files in hierarchical directories      | Flat organization (though there can be overlays to provide a hierarchical file structure) |
| Method       | POSIX file operations                  | REST API                                                                        |
| Immutability | None: random writes anywhere in a file | Immutable: must replace/append the entire object                                |
| Scalability  | Millions of files                      | Billions of objects                                                             |

Both provide:

  • Fault tolerance
  • Availability and consistency

Before: Data locality (for Hadoop)

Today: de-coupling storage from compute

Data on-disk formats

Plain Text (CSV, TDF, FWF)

  • Pay attention to encodings!
  • Lines end in a linefeed, a carriage return, or both, depending on the OS that generated the file
  • Typically, a single line of text contains a single record

JSON

Warning

JSON files have two flavors: JSON Lines vs. JSON. Typically, when we say data is in JSON format, we imply it’s JSON Lines, which means there is a single JSON object per line, across multiple lines.

JSON Lines

Four records, one per line. No ending comma.

{"id":1, "name":"marck", "last_name":"vaisman"}
{"id":2, "name":"anderson", "last_name":"monken"}
{"id":3, "name":"amit", "last_name":"arora"}
{"id":4, "name":"abhijit", "last_name":"dasgupta"}

JSON

Four records enclosed in a JSON Array

[
  {"id":1, "name":"marck", "last_name":"vaisman"},
  {"id":2, "name":"anderson", "last_name":"monken"},
  {"id":3, "name":"amit", "last_name":"arora"},
  {"id":4, "name":"abhijit", "last_name":"dasgupta"}
]
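With the standard library, the two flavors are parsed differently: JSON Lines is one `json.loads` per line (so it streams well), while a JSON array must be parsed as a whole document:

```python
import json

json_lines = """{"id":1, "name":"marck", "last_name":"vaisman"}
{"id":2, "name":"anderson", "last_name":"monken"}"""

json_array = """[{"id":1, "name":"marck", "last_name":"vaisman"},
 {"id":2, "name":"anderson", "last_name":"monken"}]"""

# JSON Lines: one json.loads per line (each line is a standalone record)
records_jsonl = [json.loads(line) for line in json_lines.splitlines()]

# JSON array: the whole document must be parsed at once
records_array = json.loads(json_array)

assert records_jsonl == records_array  # same records either way
```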

Binary files

Issues with common file formats, particularly CSVs:

  • Still ubiquitous and highly error prone (even in 2023)
  • The default delimiter is also one of the most familiar characters in the English language—the comma
  • Not a uniform format
    • Delimiter (comma, tab, semi-colon, custom)
    • Quote characters (single or double quote)
    • Escaping to appropriately handle string data
  • Doesn’t natively encode schema information
  • No direct support for nested structures
  • Encoding and schema information must be configured in the target system to ensure appropriate ingestion
  • Autodetection is a convenience feature provided in many cloud environments but is inappropriate for production ingestion, and can be painfully slow
  • Data engineers are often forced to work with CSV data and then build robust exception handling and error detection to ensure data quality on ingestion
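A small illustration of the dialect problem: the same record written in two CSV dialects, each requiring different reader settings (Python's `csv` module shown):

```python
import csv
import io

# One logical record, two CSV dialects: different delimiter and quoting
comma_quoted = 'id,comment\n1,"Hello, world"\n'
semicolon    = "id;comment\n1;Hello, world\n"

rows_a = list(csv.reader(io.StringIO(comma_quoted)))              # default dialect
rows_b = list(csv.reader(io.StringIO(semicolon), delimiter=";"))  # custom delimiter

# Both parse to the same record, but only with the right settings
assert rows_a[1] == rows_b[1] == ["1", "Hello, world"]
```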

Introducing Apache Parquet

Apache Parquet is an open-source columnar storage file format designed for efficient data storage and retrieval.

  • Created: 2013 by Twitter and Cloudera
  • Type: Columnar storage format
  • Purpose: Optimized for analytics workloads
  • License: Apache 2.0 (open source)
  • Ecosystem: Hadoop, Spark, Pandas, Polars, DuckDB, and more

Key Innovation: Store data by column instead of by row

Row vs Column Storage

Row-Oriented Storage (CSV, JSON)


Record 1: [ID=1, Name="Alice", Age=30, Salary=75000]
Record 2: [ID=2, Name="Bob",   Age=25, Salary=65000]
Record 3: [ID=3, Name="Carol", Age=35, Salary=85000]

Stored as: Row1 → Row2 → Row3 → …

Column-Oriented Storage (Parquet)


ID Column:     [1, 2, 3, ...]
Name Column:   ["Alice", "Bob", "Carol", ...]
Age Column:    [30, 25, 35, ...]
Salary Column: [75000, 65000, 85000, ...]

Stored as: All IDs → All Names → All Ages → All Salaries

Why Columnar Storage?

Analytics Query Pattern

SELECT AVG(salary), MAX(salary)
FROM employees
WHERE age > 30;

Row Storage: Must read ALL columns for ALL rows

Column Storage: Only read age and salary columns

Result: 10-100x faster for analytical queries!

Parquet File Structure

┌─────────────────────────────────────────┐
│         Parquet File                    │
├─────────────────────────────────────────┤
│  4-byte Magic Number: PAR1              │
├─────────────────────────────────────────┤
│                                         │
│  ┌───────────────────────────────────┐  │
│  │  Row Group 1 (default: 128 MB)   │  │
│  │  ┌─────────┬─────────┬─────────┐ │  │
│  │  │Column A │Column B │Column C │ │  │
│  │  │(Pages)  │(Pages)  │(Pages)  │ │  │
│  │  └─────────┴─────────┴─────────┘ │  │
│  └───────────────────────────────────┘  │
│                                         │
│  ┌───────────────────────────────────┐  │
│  │  Row Group 2                      │  │
│  │  ┌─────────┬─────────┬─────────┐ │  │
│  │  │Column A │Column B │Column C │ │  │
│  │  └─────────┴─────────┴─────────┘ │  │
│  └───────────────────────────────────┘  │
│                                         │
├─────────────────────────────────────────┤
│  Footer (Metadata)                      │
│  • File schema                          │
│  • Column metadata                      │
│  • Row group metadata                   │
│  • Column statistics (min/max/nulls)    │
│  • Encoding information                 │
├─────────────────────────────────────────┤
│  4-byte Magic Number: PAR1              │
│  4-byte Footer Length                   │
└─────────────────────────────────────────┘

Key Components

1. Row Groups

  • Purpose: Horizontal partitioning of data
  • Default Size: 128 MB (configurable)
  • Benefit: Enable parallel processing and selective reading

2. Column Chunks

  • Purpose: All data for one column in one row group
  • Organization: Divided into pages
  • Benefit: Read only columns you need

3. Pages

  • Purpose: Unit of compression and encoding
  • Types: Data pages, dictionary pages, index pages
  • Benefit: Fine-grained access and efficient compression

Metadata, compression, and dictionary encoding

Metadata and Statistics

File-Level Metadata

  • Schema definition (column names, types, nesting)
  • Number of rows
  • Created by (writer version)
  • Key-value metadata

Column-Level Statistics

  • Min/Max values: Skip row groups outside query range
  • Null count: Optimize null handling
  • Distinct count: Cardinality estimation
  • Total compressed/uncompressed size: Storage metrics

Query Optimization: Read metadata first, skip irrelevant data!

Compression in Parquet

Supported Codecs

| Codec        | Ratio | Speed     | Use Case                    |
|--------------|-------|-----------|-----------------------------|
| Snappy       | 2-3x  | Very Fast | Default, balanced           |
| GZIP         | 4-6x  | Slow      | Cold storage, archival      |
| LZ4          | 2-3x  | Very Fast | Similar to Snappy           |
| ZSTD         | 3-5x  | Fast      | Best overall (if available) |
| Brotli       | 4-7x  | Medium    | Web delivery                |
| Uncompressed | 1x    | Fastest   | Hot data, local SSD         |

Recommendation: Start with Snappy, test ZSTD for better compression

Apache Arrow: The Perfect Companion

What is Apache Arrow?

Apache Arrow is an in-memory columnar data format and set of libraries for efficient data interchange.

  • Created: 2016
  • Purpose: Zero-copy data sharing between tools
  • Format: Standardized columnar memory layout
  • Language Support: C++, Python, R, Java, JavaScript, Rust, Go, and more

Parquet vs. Arrow

┌─────────────────────────────────────────────────────┐
│                  STORAGE                            │
│              (Parquet Files)                        │
│  • Compressed                                       │
│  • On disk / S3                                     │
│  • Optimized for I/O                                │
└─────────────────┬───────────────────────────────────┘
                  │
                  │ Read/Write
                  │
                  ▼
┌─────────────────────────────────────────────────────┐
│                  MEMORY                             │
│              (Arrow Format)                         │
│  • Uncompressed                                     │
│  • In RAM                                           │
│  • Optimized for CPU processing                     │
└─────────────────┬───────────────────────────────────┘
                  │
                  │ Zero-Copy Share
                  │
        ┌─────────┼─────────┐
        │         │         │
        ▼         ▼         ▼
    ┌───────┐ ┌───────┐ ┌───────┐
    │Polars │ │DuckDB │ │Pandas │
    └───────┘ └───────┘ └───────┘

Parquet ↔︎ Arrow Relationship

Complementary Roles

Parquet (Storage):

  • Persistent storage format
  • Compressed for efficiency
  • Includes rich metadata
  • Optimized for I/O operations

Arrow (Memory):

  • In-memory processing format
  • Uncompressed for speed
  • Standardized layout
  • Optimized for CPU operations

Key Benefit: Convert between formats with minimal overhead

Zero-Copy Data Sharing

Traditional Approach (With Copying)

import pandas as pd

# Read Parquet → Pandas
df_pandas = pd.read_parquet("data.parquet")  # Copy 1

# Convert to NumPy
array = df_pandas.to_numpy()  # Copy 2

# Pass to ML library
model.fit(array)  # Copy 3

# Total: 3 copies in memory!

Arrow Approach (Zero-Copy)

import pyarrow.parquet as pq
import polars as pl

# Read Parquet → Arrow
table = pq.read_table("data.parquet")

# Share with Polars (no copy)
df_polars = pl.from_arrow(table)

# Share with Pandas (no copy)
df_pandas = table.to_pandas(zero_copy_only=True)

# Total: 1 copy in memory!

Zero-Copy Example

Parquet Strengths Summary

┌─────────────────────────────────────────────────┐
│         PARQUET STRENGTHS                       │
├─────────────────────────────────────────────────┤
│  ✓ 70-90% compression ratio                     │
│  ✓ 10-100x faster analytical queries            │
│  ✓ Read only needed columns                     │
│  ✓ Skip irrelevant data (predicate pushdown)    │
│  ✓ Rich metadata for optimization               │
│  ✓ Cloud-native (S3, Azure, GCS)                │
│  ✓ Universal ecosystem support                  │
│  ✓ Schema evolution                             │
│  ✓ Parallel processing                          │
│  ✓ Self-describing format                       │
└─────────────────────────────────────────────────┘

Parquet Weaknesses Summary

┌─────────────────────────────────────────────────┐
│         PARQUET WEAKNESSES                      │
├─────────────────────────────────────────────────┤
│  ✗ Slower write performance (2-5x vs CSV)       │
│  ✗ Inefficient for single-row lookups           │
│  ✗ No native update/delete support              │
│  ✗ Immutable (requires rewrite to modify)       │
│  ✗ Overhead for small files (< 10 MB)           │
│  ✗ Complex format (steep learning curve)        │
│  ✗ Limited streaming support                    │
│  ✗ Not ideal for transactional workloads        │
│  ✗ Schema evolution requires planning           │
└─────────────────────────────────────────────────┘

Before Arrow

After Arrow

Arrow Compatibility

Use Arrow to read/write CSVs and Parquet

Python

Use the pyarrow library or straight from pandas

import pandas as pd
df = pd.read_csv("data.csv", engine="pyarrow")  # Arrow-backed CSV reader
df = pd.read_parquet("data.parquet")            # uses pyarrow by default

import pyarrow.csv
table = pyarrow.csv.read_csv("data.csv")

import pyarrow.parquet
table = pyarrow.parquet.read_table("data.parquet")

R

Use the arrow package

library(arrow)

read_csv_arrow()
read_parquet()
read_json_arrow()

write_csv_arrow()
write_parquet()

Recommendation: save your intermediate and analytical datasets as Parquet!

When to Use Parquet

✅ Ideal Use Cases

  • Data warehousing and analytics
  • Data lakes (S3, Azure Blob, GCS)
  • Machine learning datasets (training data)
  • ETL pipelines (batch processing)
  • Historical data archival
  • Read-heavy workloads (10:1 read/write ratio or higher)
  • Large datasets (> 1 GB)
  • Column-based queries (aggregations, filtering)

When NOT to Use Parquet

❌ Poor Fit For

  • Transactional databases (OLTP)
  • Frequent updates/deletes
  • Real-time streaming (use Avro, Arrow IPC)
  • Small files (< 10 MB)
  • Row-based access patterns (single record lookups)
  • Write-heavy workloads
  • Append-only logs (use line-delimited JSON)
  • Immediate consistency requirements

Parquet vs. Other Formats

| Feature        | Parquet   | CSV  | JSON     | Avro      | ORC       |
|----------------|-----------|------|----------|-----------|-----------|
| Compression    | Excellent | Poor | Poor     | Good      | Excellent |
| Read Speed     | Very Fast | Slow | Slow     | Fast      | Very Fast |
| Write Speed    | Medium    | Fast | Fast     | Fast      | Medium    |
| Schema         | Yes       | No   | Flexible | Yes       | Yes       |
| Nested Data    | Yes       | No   | Yes      | Yes       | Yes       |
| Human Readable | No        | Yes  | Yes      | No        | No        |
| Analytics      | Excellent | Poor | Poor     | Good      | Excellent |
| Streaming      | Poor      | Good | Good     | Excellent | Poor      |

Polars + DuckDB + Parquet

The Modern Local Analytics Stack

Fast, flexible analytics on your laptop—no cluster required

The Challenge

Modern data science faces critical bottlenecks:

  • Volume: Datasets exceeding available RAM are increasingly common
  • Distribution: Data stored across cloud object storage (S3, Azure Blob, GCS)
  • Performance: Traditional tools (Pandas, SQLite) struggle with scale
  • Cost: Inefficient data processing translates to higher cloud costs

We need tools designed for modern data infrastructure from the ground up.

DuckDB

In‑process SQL OLAP engine with vectorized execution. Reads/writes Parquet and excels at joins, aggregations, and window functions—fast analytics without a server.

Side road: OLAP & OLTP

What is OLAP?

Online Analytical Processing

Systems designed for complex analytical queries

  • Handles business intelligence and reporting
  • Complex queries across large datasets
  • Primarily read-intensive with periodic bulk loads
  • Column-oriented storage optimized for analytics
  • Sub-second to minutes response time

What is OLTP?

Online Transaction Processing

Systems designed for real-time operational transactions

  • Handles daily business operations (sales, orders, payments)
  • Fast, short transactions with immediate results
  • Frequent inserts, updates, and deletes
  • Row-oriented storage optimized for writes
  • Millisecond response times with ACID guarantees

Head-to-Head Comparison

| Aspect    | OLTP                   | OLAP                      |
|-----------|------------------------|---------------------------|
| Purpose   | Transaction processing | Data analysis             |
| Data Type | Current operational    | Historical aggregated     |
| Queries   | Simple, few rows       | Complex, millions of rows |
| Workload  | Heavy writes           | Heavy reads               |
| Storage   | Row-based              | Column-based              |
| Users     | Thousands+ concurrent  | Dozens of analysts        |
| Response  | Milliseconds           | Seconds to minutes        |

OLAP Use Cases

Analytical Workloads

  • Business Intelligence: Sales trend analysis and revenue forecasting
  • Data Warehousing: Multi-dimensional reporting across departments
  • Financial Analysis: Quarterly performance aggregations and comparisons
  • Marketing Analytics: Campaign effectiveness across customer segments
  • Supply Chain: Historical demand patterns and inventory optimization

OLTP Use Cases

Operational Workloads

  • E-commerce: Processing customer orders and payments in real-time
  • Banking: ATM transactions, account updates, wire transfers
  • Inventory Management: Real-time stock level tracking and updates
  • CRM Systems: Customer data entry, modifications, and lookups
  • Booking Systems: Hotel/flight reservations with immediate confirmation

Back to normal programming

Polars

Rust‑based DataFrame with lazy and eager APIs, query optimization, and parallel execution. Arrow‑native for fast wrangling, grouping, joins, and analytics pipelines.

How They Interplay

Parquet dataset → [Scan + Pushdown] → DuckDB (SQL) ↔︎ Arrow ↔︎ Polars (Expressions) → Parquet (results)

  • Columnar synergy: skip unused columns
  • Row‑group pruning via statistics
  • Zero/low‑copy Arrow interchange
  • In‑process OLAP on a single machine

Why It’s Useful

  • Near‑warehouse performance
  • Efficient, larger‑than‑RAM reads
  • Interoperable across the data stack
  • Switch between SQL and expressions
  • Simple, low‑cost, no servers
  • Reproducible Parquet snapshots

Typical Workflows

  1. Explore: Query partitioned Parquet with DuckDB SQL for fast summaries.
  2. Transform: Build Polars lazy pipelines; push filters/projections into scans.
  3. Combine: Use DuckDB for heavy joins; hand off via Arrow to Polars.
  4. Persist: Save curated results/features back to Parquet for training or sharing.

Practical Tips

  • Partitioning: Use selective keys (e.g., dt=YYYY‑MM‑DD, region) and keep stats.
  • Column selection: Project only needed columns for faster scans.
  • File sizing: Target 100–512 MB files; row groups ~64–256 MB.
  • Schema evolution: Track types; avoid expensive casts and mixed dtypes.
  • I/O locality: Cache or work on local copies for iterative dev.
  • Interchange: Use Arrow for minimal copies between DuckDB and Polars.

Polars

Lightning-fast DataFrame library for Rust and Python

Before we begin…

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

  1. Pandas is slow: well, yes, but also not so much if you use it the right way.
  2. Pandas 2.0 and the Arrow revolution

Polars

Why is Polars faster than Pandas?

  1. Polars is written in Rust. Rust is a compiled language; Python is an interpreted language.

    • Compiled language: you generate the machine code only once then run it, subsequent runs do not need the compilation step.
    • Interpreted language: code has to be parsed, interpreted and converted into machine code every single time.
  2. Parallelization: Vectorized operations that can be executed in parallel on multiple CPU cores.

  3. Lazy evaluation: Polars supports two APIs, lazy as well as eager evaluation (the latter is what Pandas uses). In lazy evaluation, a query is executed only when required; in eager evaluation, a query is executed immediately.

  4. Polars uses Arrow as its in-memory representation of data. Similar to how pandas uses Numpy (although Pandas 2.0 does allow using Arrow as the backend in addition to Numpy).

    • [Excerpt from this post from Ritchie Vink, author of Polars] Arrow provides the efficient data structures and some compute kernels, like a SUM, a FILTER, a MAX etc. Arrow is not a query engine. Polars is a DataFrame library on top of arrow that has implemented efficient algorithms for JOINS, GROUPBY, PIVOTs, MELTs, QUERY OPTIMIZATION, etc. (the things you expect from a DF lib).
    • Polars could be best described as an in-memory DataFrame library with a query optimizer.

Ease of use

  1. Familiar API for users of Pandas: there are differences in syntax (polars != pandas), but it is still a DataFrame API, making it straightforward to perform common operations such as filtering, aggregating, and joining data. See migrating from Pandas

    Reading data

    # must install s3fs -> "pip install s3fs"
    
    # Using Polars
    import polars as pl
    polars_df = pl.read_parquet("s3://nyc-tlc/trip data/yellow_tripdata_2023-06.parquet")
    
    # using Pandas
    import pandas as pd
    pandas_df = pd.read_parquet("s3://nyc-tlc/trip data/yellow_tripdata_2023-06.parquet")

    Selecting columns (see Pushdown optimization)

    # Using Polars
    selected_columns_polars = polars_df[['column1', 'column2']]
    
    # Using Pandas
    selected_columns_pandas = pandas_df[['column1', 'column2']]

Ease of use (contd.)

  1. Familiar API for users of Pandas:

    Filtering data

    # Using Polars
    filtered_polars = polars_df[polars_df['column1'] > 10]
    
    # Using Pandas
    filtered_pandas = pandas_df[pandas_df['column1'] > 10]

    Even though you can write Polars code that looks like Pandas, it is better to write idiomatic Polars code that takes advantage of the unique features Polars offers.

  2. Migrating from Apache Spark: Whereas the Spark DataFrame is analogous to a collection of rows, a Polars DataFrame is closer to a collection of columns.

Installation, data loading and basic operations

Install polars via pip.

pip install polars

Import polars in your Python code and read data as usual

import polars as pl
df = pl.read_parquet("s3://nyc-tlc/trip data/yellow_tripdata_2023-06.parquet")
df.head()

shape: (5, 19)
┌──────────┬──────────────────────┬───────────────────────┬─────────────────┬─────┬───────────────────────┬──────────────┬──────────────────────┬─────────────┐
│ VendorID ┆ tpep_pickup_datetime ┆ tpep_dropoff_datetime ┆ passenger_count ┆ ... ┆ improvement_surcharge ┆ total_amount ┆ congestion_surcharge ┆ Airport_fee │
│ ---      ┆ ---                  ┆ ---                   ┆ ---             ┆     ┆ ---                   ┆ ---          ┆ ---                  ┆ ---         │
│ i32      ┆ datetime[ns]         ┆ datetime[ns]          ┆ i64             ┆     ┆ f64                   ┆ f64          ┆ f64                  ┆ f64         │
╞══════════╪══════════════════════╪═══════════════════════╪═════════════════╪═════╪═══════════════════════╪══════════════╪══════════════════════╪═════════════╡
│ 1        ┆ 2023-06-01 00:08:48  ┆ 2023-06-01 00:29:41   ┆ 1               ┆ ... ┆ 1.0                   ┆ 33.6         ┆ 2.5                  ┆ 0.0         │
│ 1        ┆ 2023-06-01 00:15:04  ┆ 2023-06-01 00:25:18   ┆ 0               ┆ ... ┆ 1.0                   ┆ 23.6         ┆ 2.5                  ┆ 0.0         │
│ 1        ┆ 2023-06-01 …         ┆ 2023-06-01 …          ┆ 1               ┆ ... ┆ 1.0                   ┆ 60.05        ┆ 0.0                  ┆ 1.75        │
└──────────┴──────────────────────┴───────────────────────┴─────────────────┴─────┴───────────────────────┴──────────────┴──────────────────────┴─────────────┘

A Polars DataFrame processing pipeline example

Here is an example that we will run as part of the lab in a little bit.

Think how you would code this same pipeline in Pandas…

Polars pipeline


Lazy vs Eager Evaluation

Motivation: The Performance Problem

Scenario: Analyzing Customer Data

import pandas as pd

# Load 10 GB of customer data
customers = pd.read_csv("customers.csv")  # Takes 45 seconds

# Filter for active customers
active = customers[customers['status'] == 'active']  # 30 seconds

# Select relevant columns
subset = active[['id', 'revenue', 'region']]  # 15 seconds

# Calculate regional averages
result = subset.groupby('region')['revenue'].mean()  # 20 seconds

# Total time: 110 seconds

Question: Could we do better?

The Eager Evaluation Problem

Execution Flow

Code
graph TB
    Start[Start] --> Read[Read entire CSV<br/>10 GB in memory]
    Read --> Filter[Filter rows<br/>Create new DataFrame 6 GB]
    Filter --> Select[Select columns<br/>Create new DataFrame 2 GB]
    Select --> Group[Group and aggregate<br/>Final result]
    Group --> End[End]
    
    Read -.->|Peak Memory| Mem1[10 GB]
    Filter -.->|Peak Memory| Mem2[16 GB total]
    Select -.->|Peak Memory| Mem3[18 GB total]
    
    style Read fill:#ff6b6b,stroke:#c92a2a,stroke-width:2px,color:#fff
    style Filter fill:#ff6b6b,stroke:#c92a2a,stroke-width:2px,color:#fff
    style Select fill:#ff6b6b,stroke:#c92a2a,stroke-width:2px,color:#fff
    style Group fill:#ffa94d,stroke:#e67700,stroke-width:2px,color:#fff


Problem: Each operation executes immediately and creates intermediate results

Enter Lazy Evaluation

The Same Pipeline, Reimagined

import polars as pl

# Build execution plan (no computation yet)
result = (pl.scan_csv("customers.csv")           # 0.001 seconds
    .filter(pl.col('status') == 'active')        # 0.001 seconds
    .select(['id', 'revenue', 'region'])         # 0.001 seconds
    .group_by('region')
    .agg(pl.col('revenue').mean())
    .collect())                                   # 12 seconds

# Total time: 12 seconds

Key Insight: Operations are planned first, then executed optimally

Lazy Evaluation Flow

Code
graph TB
    Start[Start] --> Plan1[Build Plan: Scan CSV]
    Plan1 --> Plan2[Build Plan: Filter]
    Plan2 --> Plan3[Build Plan: Select]
    Plan3 --> Plan4[Build Plan: GroupBy]
    Plan4 --> Optimize[Query Optimizer]
    
    Optimize --> Exec[Optimized Execution:<br/>Single Pass<br/>Streaming]
    Exec --> End[Result]
    
    Optimize -.->|Pushdown| Push1[Filter at source]
    Optimize -.->|Pushdown| Push2[Select only needed columns]
    Optimize -.->|Fusion| Fuse[Combine operations]
    
    style Plan1 fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style Plan2 fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style Plan3 fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style Plan4 fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style Optimize fill:#fff3e0,stroke:#f57c00,stroke-width:3px
    style Exec fill:#c8e6c9,stroke:#388e3c,stroke-width:2px,color:#000


Key Concepts: Computation Graph (DAG)

Lazy Evaluation Builds a Directed Acyclic Graph

Code
graph TB
    Result[Result DataFrame]
    GroupBy[GroupBy Operation<br/>groupby 'region'<br/>agg mean 'revenue']
    Select[Select Operation<br/>columns: id, revenue, region]
    Filter[Filter Operation<br/>status == 'active']
    Scan[Scan Operation<br/>customers.csv]
    
    Result --> GroupBy
    GroupBy --> Select
    Select --> Filter
    Filter --> Scan
    
    style Result fill:#c8e6c9,stroke:#388e3c,stroke-width:2px
    style GroupBy fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style Select fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style Filter fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style Scan fill:#fff3e0,stroke:#f57c00,stroke-width:2px


Directed Acyclic Graph (DAG): Represents dependencies between operations

Query Optimization: The Power of Lazy

Optimization Techniques

Code
mindmap
  root((Query<br/>Optimization))
    Pushdown
      Predicate Pushdown
        Move filters to source
        Skip irrelevant data
      Projection Pushdown
        Read only needed columns
        Reduce I/O
      Partition Pruning
        Skip entire partitions
        Metadata-based filtering
    Fusion
      Operation Fusion
        Combine multiple ops
        Single pass execution
      Filter Fusion
        Merge multiple filters
        Reduce overhead
    Reordering
      Join Reordering
        Optimize join order
        Minimize intermediates
      Predicate Reordering
        Most selective first
        Early reduction
    Elimination
      Common Subexpression
        Reuse computed values
        Avoid redundancy
      Constant Folding
        Pre-compute constants
        Simplify expressions


Optimization Example: Predicate Pushdown

Code
graph TB
    subgraph "Without Optimization (Eager)"
        E1[Read ALL data<br/>10 GB, 50 columns]
        E2[Filter in memory<br/>year == 2025]
        E3[Result: 1 GB]
        E1 --> E2 --> E3
    end
    
    subgraph "With Optimization (Lazy)"
        L1[Read with filter<br/>Only year == 2025<br/>1 GB, 50 columns]
        L2[Result: 1 GB]
        L1 --> L2
    end
    
    style E1 fill:#ff6b6b,stroke:#c92a2a,stroke-width:2px,color:#fff
    style E2 fill:#ff6b6b,stroke:#c92a2a,stroke-width:2px,color:#fff
    style L1 fill:#51cf66,stroke:#2f9e44,stroke-width:2px,color:#fff
    style L2 fill:#51cf66,stroke:#2f9e44,stroke-width:2px,color:#fff

Benefit: 10x less data read from disk/network

Optimization Example: Projection Pushdown

Code
graph LR
    subgraph "Without Optimization"
        W1[Read 50 columns<br/>5 GB total]
        W2[Select 2 columns<br/>Still read 5 GB]
        W3[Result: 200 MB]
        W1 --> W2 --> W3
    end
    
    subgraph "With Optimization"
        O1[Read 2 columns only<br/>200 MB total]
        O2[Result: 200 MB]
        O1 --> O2
    end
    
    style W1 fill:#ff6b6b,stroke:#c92a2a,stroke-width:2px,color:#fff
    style W2 fill:#ff6b6b,stroke:#c92a2a,stroke-width:2px,color:#fff
    style O1 fill:#51cf66,stroke:#2f9e44,stroke-width:2px,color:#fff
    style O2 fill:#51cf66,stroke:#2f9e44,stroke-width:2px,color:#fff

Benefit: 25x less data read (columnar format)

Lazy Evaluation in Practice: Polars

Eager API

import polars as pl

# Eager: Immediate execution
df = pl.read_csv("data.csv")                    # Executes
filtered = df.filter(pl.col('value') > 100)     # Executes
result = filtered.select(['id', 'value'])       # Executes

# Each operation creates materialized DataFrame

Lazy API

# Lazy: Deferred execution
lazy_df = pl.scan_csv("data.csv")               # Plan only
filtered = lazy_df.filter(pl.col('value') > 100)  # Plan only
result = filtered.select(['id', 'value'])       # Plan only

# Nothing executed yet!

# Trigger execution
materialized = result.collect()  # Optimized execution

Streaming Execution

Processing Larger-than-Memory Data

Code
graph TB
    File[File: 100 GB]
    
    File --> C1[Chunk 1: 100 MB]
    File --> C2[Chunk 2: 100 MB]
    File --> C3[Chunk 3: 100 MB]
    File --> CN[Chunk N: 100 MB]
    
    C1 --> P1[Process:<br/>Filter → Select → Partial Agg]
    C2 --> P2[Process:<br/>Filter → Select → Partial Agg]
    C3 --> P3[Process:<br/>Filter → Select → Partial Agg]
    CN --> PN[Process:<br/>Filter → Select → Partial Agg]
    
    P1 --> Combine[Combine Partial Results]
    P2 --> Combine
    P3 --> Combine
    PN --> Combine
    
    Combine --> Final[Final Result]
    
    style File fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style C1 fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    style C2 fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    style C3 fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    style CN fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    style Combine fill:#c8e6c9,stroke:#388e3c,stroke-width:2px
    style Final fill:#4caf50,stroke:#2e7d32,stroke-width:3px,color:#fff

Peak Memory: ~500 MB (chunk + partials) instead of 100 GB

Streaming Execution Code

import polars as pl

# Process 100 GB file with 8 GB RAM
result = (pl.scan_parquet("huge_file.parquet")
    .filter(pl.col('year') == 2025)
    .group_by('category')
    .agg(pl.col('revenue').sum())
    .collect(streaming=True))  # ← Streaming mode

# Data processed in chunks, never fully materialized

How it works:

  1. Read chunk of data (e.g., 100 MB)
  2. Apply filter, select, partial aggregation
  3. Discard processed chunk
  4. Repeat until complete
  5. Combine partial aggregations
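The five steps above can be sketched in pure Python (data and function name invented): only one chunk is alive at a time, and partial sums are combined at the end.

```python
# A sketch of chunked streaming: filter + partial aggregation per chunk,
# then combine the partials. Mirrors steps 1-5 above.
def streamed_sum(rows, chunk_size):
    partials = []
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]            # 1. read a chunk
        kept = [r for r in chunk if r["year"] == 2025]    # 2. filter
        partials.append(sum(r["revenue"] for r in kept))  # 2. partial agg
        # 3. chunk goes out of scope here; 4. loop repeats
    return sum(partials)                                  # 5. combine partials

rows = [
    {"year": 2025, "revenue": 10.0},
    {"year": 2024, "revenue": 99.0},
    {"year": 2025, "revenue": 20.0},
]
print(streamed_sum(rows, chunk_size=2))  # 30.0
```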

When to Use Eager Evaluation

Decision Tree

Code
graph TD
    Start{Evaluate<br/>Requirements}
    
    Start -->|Small data| Size{Dataset size?}
    Size -->|< 100 MB| Eager1[Use Eager]
    Size -->| ≥ 100 MB| Next1{Pipeline complexity?}
    
    Start -->|Interactive| Context{Context?}
    Context -->|Exploration| Eager2[Use Eager]
    Context -->|Production| Next2{Memory constraints?}
    
    Next1 -->|Simple 1-2 ops| Eager3[Use Eager]
    Next1 -->|Complex 3+ ops| Lazy1[Use Lazy]
    
    Next2 -->|Abundant RAM| Maybe[Consider Eager]
    Next2 -->|Limited RAM| Lazy2[Use Lazy]
    
    style Eager1 fill:#51cf66,stroke:#2f9e44,stroke-width:2px,color:#fff
    style Eager2 fill:#51cf66,stroke:#2f9e44,stroke-width:2px,color:#fff
    style Eager3 fill:#51cf66,stroke:#2f9e44,stroke-width:2px,color:#fff
    style Lazy1 fill:#667eea,stroke:#2c3e50,stroke-width:2px,color:#fff
    style Lazy2 fill:#667eea,stroke:#2c3e50,stroke-width:2px,color:#fff

Eager Evaluation Advantages

Code
mindmap
  root((Eager<br/>Evaluation))
    Simplicity
      Immediate feedback
      Easy debugging
      What you see is what you get
      Step-by-step execution
    Interactivity
      Jupyter notebooks
      REPL environments
      Quick exploration
      Instant results
    Small Data
      Low overhead
      No optimization needed
      Fast for < 100 MB
      Simple operations
    Stateful Operations
      Maintain state
      Inspect intermediates
      Conditional logic
      Complex dependencies

When to Use Lazy Evaluation

Use Case Matrix

Code
quadrantChart
    title Lazy vs Eager Evaluation: Use Case Matrix
    x-axis Low Complexity --> High Complexity
    y-axis Small Data --> Large Data
    quadrant-1 Lazy Ideal
    quadrant-2 Lazy Beneficial
    quadrant-3 Either Works
    quadrant-4 Eager Sufficient
    Small exploration: [0.3, 0.2]
    Simple ETL: [0.4, 0.5]
    Complex analytics: [0.7, 0.8]
    ML feature engineering: [0.8, 0.9]
    Production pipelines: [0.85, 0.85]
    Interactive notebook: [0.2, 0.3]
    Data quality checks: [0.5, 0.7]

Lazy Evaluation Advantages

Code
mindmap
  root((Lazy<br/>Evaluation))
    Performance
      Query optimization
      Predicate pushdown
      Projection pushdown
      Operation fusion
    Scalability
      Streaming execution
      Larger-than-memory
      Cloud-native
      Minimal data transfer
    Efficiency
      Memory efficient
      Parallel execution
      Reduced I/O
      Single pass processing
    Production
      Predictable performance
      Reproducible results
      Resource optimization
      Cost effective

Further reading

DuckDB

An in-process SQL OLAP database management system

DuckDB

  1. DuckDB is an in-process SQL OLAP database management system.

It is like SQLite, but for analytics. What does this mean? It means that your database runs inside your process: there are no servers to manage and no remote system to connect to. It is very snappy and easy to experiment with, using SQL-like syntax.

  2. DuckDB does vectorized processing, i.e., it loads chunks of data into memory (trying to keep everything in the CPU’s L1 and L2 caches) and is thus able to handle datasets bigger than the available RAM.

  3. DuckDB supports Python, R, and a host of other languages.

DuckDB - quick introduction

DuckDB is an in-process SQL OLAP database management system

pip install duckdb

Duck DB

Key Features

  1. Columnar Storage & Vectorized Query Processing: DuckDB contains a columnar-vectorized query execution engine, where queries are still interpreted, but a large batch of values (a “vector”) is processed in one operation.

    • Most analytical queries (think group by and summarize), and even data retrieval for training ML models, require a subset of columns rather than entire rows; columnar storage makes this faster.
  2. In-Memory Processing: all data needed for processing is brought within the process memory (recall that the columnar storage format helps with this), making queries run faster (no call to a database over the network).

  3. SQL Support: a highly Postgres-compatible dialect of SQL.

  4. ACID Compliance: transactional guarantees (ACID properties) through bulk-optimized Multi-Version Concurrency Control (MVCC).

Use-cases for DuckDB

  1. Data Warehousing
  2. Business Intelligence
  3. Real-time Analytics
  4. IoT Data Processing

DuckDB in the wild

  1. How We Silently Switched Mode’s In-Memory Data Engine to DuckDB To Boost Visual Data Exploration Speed

  2. Why we built Rill with DuckDB

  3. Leveraging DuckDB for enhanced performance in dbt projects

How might you think about DuckDB - DIY Version

Datalake and DuckDB
  1. Using DuckDB in AWS Lambda

  2. Modern Data Stack in a Box with DuckDB: DuckDB, Meltano, Dbt, Apache Superset

How might you think about DuckDB - Fully-managed

MotherDuck Architecture
  1. Architecture and capabilities

  2. Seamlessly analyze data, whether it sits on your laptop, in the cloud, or split between the two. Hybrid execution automatically plans each part of your query and determines where it is best computed. If you use DuckDB with data lakes in S3, it will be much faster to run your analyses on MotherDuck.

DuckDB Vs MotherDuck

Setting Up DuckDB

  1. Configuration and Initialization: DuckDB is deeply integrated into Python and R for efficient interactive data analysis. DuckDB provides APIs for Java, C, C++, Julia, Swift, and others.

    pip install duckdb
    # OR
    conda install python-duckdb -c conda-forge
  2. Connecting to DuckDB

    import duckdb
    # directly query a Pandas DataFrame
    import pandas as pd
    df = pd.read_csv("https://raw.githubusercontent.com/anly503/datasets/main/EconomistData.csv")
    duckdb.sql('SELECT * FROM df')

Setting Up DuckDB (contd.)

  1. Supported data formats: DuckDB can ingest data from a wide variety of formats – both on-disk and in-memory. See the data ingestion page for more information.

    import duckdb
    duckdb.read_csv('example.csv')                # read a CSV file into a Relation
    duckdb.read_parquet('example.parquet')        # read a Parquet file into a Relation
    duckdb.read_json('example.json')              # read a JSON file into a Relation
    
    duckdb.sql('SELECT * FROM "example.csv"')     # directly query a CSV file
    duckdb.sql('SELECT * FROM "example.parquet"') # directly query a Parquet file
    duckdb.sql('SELECT * FROM "example.json"')    # directly query a JSON file

Querying DuckDB

Essential reading: Friendlier SQL with DuckDB, SQL Introduction

Basic SQL Queries:

import duckdb
import pandas as pd
babynames = pd.read_parquet("https://github.com/anly503/datasets/raw/main/babynames.parquet.zstd")
duckdb.sql("select count(*)  from babynames where Name='John'")

Aggregations and Grouping

duckdb.sql("select State, Name, count(*) as count  from babynames group by State, Name order by State desc, count desc") 

Querying DuckDB (contd.)

Essential reading: From & Join clauses

Joins and Subqueries

# join two tables together
duckdb.sql("SELECT * FROM table_name JOIN other_table ON (table_name.key = other_table.key)")

Window Functions

powerplants = pd.read_csv("https://raw.githubusercontent.com/anly503/datasets/main/powerplants.csv", parse_dates=["date"])
q = """
SELECT "plant", "date",
    AVG("MWh") OVER (
        PARTITION BY "plant"
        ORDER BY "date" ASC
        RANGE BETWEEN INTERVAL 3 DAYS PRECEDING
                  AND INTERVAL 3 DAYS FOLLOWING)
        AS "MWh 7-day Moving Average"
FROM powerplants
ORDER BY 1, 2;
"""
duckdb.sql(q)

Using the DuckDB CLI & Shell

  1. Install the DuckDB CLI (download link) OR use it in your browser via shell.duckdb.org/ for easy data exploration using just SQL.

  2. Once installed you can import a local file into the shell and run queries.

    • You can download powerplants.csv from here.
C:\Users\<username>\Downloads\duckdb_cli-windows-amd64>duckdb
v0.8.1 6536a77232
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D CREATE TABLE powerplants AS SELECT * FROM read_csv_auto('powerplants.csv');
D DESCRIBE powerplants;
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬───────┐
│ column_name │ column_type │  null   │   key   │ default │ extra │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ int32 │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼───────┤
│ plant       │ VARCHAR     │ YES     │         │         │       │
│ date        │ DATE        │ YES     │         │         │       │
│ MWh         │ BIGINT      │ YES     │         │         │       │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴───────┘
D  SELECT * from powerplants where plant='Boston' and date='2019-01-02';
┌─────────┬────────────┬────────┐
│  plant  │    date    │  MWh   │
│ varchar │    date    │ int64  │
├─────────┼────────────┼────────┤
│ Boston  │ 2019-01-02 │ 564337 │
└─────────┴────────────┴────────┘
D

Profiling in DuckDB

  1. Query Optimization: Use the EXPLAIN & ANALYZE keywords to understand how your query is being executed (see Query Plan) and the time being spent in individual steps of your query.

    D EXPLAIN ANALYZE SELECT * from powerplants where plant='Boston' and date='2019-01-02';
  2. DuckDB will use all the cores available on the underlying compute, but you can adjust this (scenario: your process is not the only application on that VM and you want to limit the amount of resources it gets). Full configuration available here.

    D select current_setting('threads');
    ┌────────────────────────────┐
    │ current_setting('threads') │
    │           int64            │
    ├────────────────────────────┤
    │                          8 │
    └────────────────────────────┘
    D SET threads=4;
    D select current_setting('threads');
    ┌────────────────────────────┐
    │ current_setting('threads') │
    │           int64            │
    ├────────────────────────────┤
    │                          4 │
    └────────────────────────────┘
    D

Benchmarks and Comparisons

  1. This is a tricky topic: in general, you can make your chosen solution look better by focusing on the metrics where it provides better results.
    • In general, TPC-H and TPC-DS are considered the standard benchmarks for data processing.
  2. Some links to explore further:

TPC-DS

A simple example of using DuckDB and Apache Arrow using NYC Taxi dataset

This notebook reads the NYC taxi dataset files for the year 2021 (about ~29 million rows) and runs some analytics operations on this dataset. This dataset is too big to fit into memory.

  1. We read the data from S3 using Apache Arrow (pyarrow).

  2. The zero-copy integration between DuckDB and Apache Arrow allows for rapid analysis of larger-than-memory datasets in Python and R using either SQL or relational APIs.

  3. We create a DuckDB instance in memory and, using the connection to this in-memory database, run some simple analytics operations with SQL syntax.

Also see https://duckdb.org/2021/12/03/duck-arrow.html

Further reading

  1. Parallel Grouped Aggregation in DuckDB
  2. Meta queries
  3. Profiling queries in DuckDB
  4. DuckDB tutorial for beginners
  5. DuckDB CLI API
  6. Using DuckDB in AWS Lambda
  7. Revisiting the Poor Man’s Data Lake with MotherDuck
  8. Supercharge your data processing with DuckDB
  9. Friendlier SQL with DuckDB
  10. Building and deploying data apps with DuckDB and Streamlit

GitHub Classroom

GitHub Classroom Link