Lecture 5

Unified Data Access with Narwhals and Ibis

Agenda and Goals for Today

Lecture

  • Review: The modern data stack
  • The DataFrame fragmentation problem
  • Narwhals: Unified DataFrame API
  • Ibis: Unified database interface
  • Building portable data pipelines

Lab

  • Write code that works across Pandas, Polars

Review: Where We’ve Been

  • Week 2: Parallelization in Python
  • Week 4: The holy trinity - Parquet, DuckDB, Polars
    • Columnar storage formats
    • OLAP databases
    • Modern DataFrame libraries

Important

Blackboard is strange!! We need you to make a “submission” for each of the earlier labs/assignments. This submission will be the URL of the respective GitHub repository.

This “submission” ensures that your grades show up on Blackboard. It will not affect late policies, which are determined by the timestamp of your GitHub commits.

This will be a requirement for all labs/assignments going forward.

Project progress

Only 15/24 students have joined groups!!

Fix this before tonight!!!

First milestone is due next week. See here for details.

First, a digression

Writing Good GitHub Commits

Best Practices for Clear Version Control

Why Commits Matter

  • Communication with your future self and teammates
  • History of project evolution
  • Debugging by identifying when issues were introduced
  • Documentation of decision-making

Anatomy of a Good Commit

<type>: <subject>

<body>

<footer>

The Subject Line

Keep it short and descriptive (50 characters or less)

Good examples:

Fix login button alignment on mobile
Add user authentication middleware
Update README with installation steps

Subject Line Rules

  1. Use imperative mood (“Add feature” not “Added feature”)
  2. Capitalize the first letter
  3. No period at the end
  4. Be specific about what changed

❌ Bad Subject Lines

Fixed stuff
Updated files
Changes
WIP
asdfasdf

These tell us nothing about what actually changed!

✅ Good Subject Lines

Fix memory leak in data processing loop
Add validation for email input fields
Refactor database query for performance
Remove deprecated API endpoints

Clear, concise, and descriptive!

The Body (Optional)

Use when you need to explain:

  • Why the change was made
  • What problem it solves
  • How it differs from previous behavior
  • Any side effects or limitations

Body Example

Fix duplicate entries in user search results

The search query was not properly deduplicating
results when users had multiple roles. Added
DISTINCT clause to SQL query and updated tests
to verify unique results.

Fixes #234

Commit Best Practices

Do

  • One logical change per commit
  • Test before committing
  • Commit often
  • Write for others

Don’t

  • Mix unrelated changes
  • Commit broken code
  • Use vague messages
  • Commit sensitive data

Atomic Commits

Each commit should be a single, complete unit of work

✅ Good:

  • Commit 1: Add user model
  • Commit 2: Add user controller
  • Commit 3: Add user routes

❌ Bad:

  • Commit 1: Add user feature, fix typo in README, update dependencies

Conventional Commits

A standardized format for commit messages:

feat: add user profile page
fix: resolve null pointer in checkout
docs: update API documentation
style: format code with prettier
refactor: simplify error handling
test: add unit tests for auth service
chore: update dependencies

Common Types

  • feat: A new feature
  • fix: A bug fix
  • docs: Documentation only changes
  • style: Code style changes (formatting, etc.)
  • refactor: Code restructuring without changing behavior
  • test: Adding or updating tests
  • chore: Maintenance tasks
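The conventional format above is regular enough to check mechanically. Here is an illustrative sketch (the function name, regex, and the 50-character limit are this example's choices, combining the conventional types with the subject-line rules from earlier slides):

```python
import re

# Matches "<type>(<optional scope>): <subject>"
SUBJECT_RE = re.compile(
    r'^(feat|fix|docs|style|refactor|test|chore)'  # type
    r'(\([\w-]+\))?'                               # optional scope, e.g. feat(auth)
    r': \S.*$'                                     # ": " then a non-empty subject
)

def is_good_subject(subject: str) -> bool:
    """Return True if the subject follows the conventional format."""
    return (
        bool(SUBJECT_RE.match(subject))
        and len(subject) <= 50          # keep it short
        and not subject.endswith('.')   # no period at the end
    )

print(is_good_subject('feat: add user profile page'))   # True
print(is_good_subject('Fixed stuff'))                   # False
```

A check like this can run as a Git commit-msg hook to enforce the convention automatically.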

When to Commit?

Commit when you’ve completed a logical unit of work:

  • ✅ Finished a function
  • ✅ Fixed a bug
  • ✅ Added a test
  • ✅ Refactored a component

❌ Not at end of day “just to save”

The Golden Rule

Write commit messages that will help someone (including future you) understand what changed and why, without having to look at the diff.

Quick Checklist

Before committing, ask yourself:

  • Is the subject line imperative, specific, and 50 characters or less?
  • Is this a single logical change?
  • Have I tested the change?
  • Am I accidentally committing sensitive data?
  • Will this message make sense to someone reading it in six months?


Git Collaboration Models

Different workflows for different team sizes and needs:

  • Centralized Workflow
  • Feature Branch Workflow
  • Gitflow Workflow
  • Forking Workflow

Centralized Workflow

Best for: Small teams, simple projects

main branch
    |
    *---*---*---*
  • Everyone commits directly to main
  • Pull before you push
  • Linear history
  • Simple but risky

Feature Branch Workflow

Best for: Most teams

main     *---*---*-------*
              \         /
feature        *---*---*
  • Create branch for each feature
  • Merge back to main when done
  • Use Pull Requests for review
  • Keeps main stable

Feature Branch Example

# Create and switch to feature branch
git checkout -b feature/user-auth

# Make changes and commit
git add .
git commit -m "feat: add login endpoint"

# Push and create Pull Request
git push origin feature/user-auth

Gitflow Workflow

Best for: Scheduled releases, larger teams

main      *---------*---------*
           \       / \       /
develop     *--*--*---*--*--*
             \   /     \   /
feature       *-*       *-*
  • main: production-ready code
  • develop: integration branch
  • feature/*: new features
  • release/*: preparing releases
  • hotfix/*: emergency fixes

Forking Workflow

Best for: Open source projects

  1. Fork repository to your account
  2. Clone your fork locally
  3. Create feature branch
  4. Push to your fork
  5. Open Pull Request to original repo

Maintainers control what gets merged.

Pull Request Best Practices

  • Clear title following commit conventions
  • Description explaining the change
  • Link related issues (Fixes #123)
  • Request reviewers explicitly
  • Keep PRs small and focused
  • Respond to feedback promptly

Code Review Tips

As Author:

  • Write clear PR description
  • Self-review before requesting
  • Be open to feedback

As Reviewer:

  • Be constructive and specific
  • Ask questions, don’t demand
  • Approve when ready, not perfect

Merge Strategies

Merge Commit: Preserves all history

*---*---*-------*  main
         \     /
          *---*    feature

Squash: Combines commits into one

*---*---*---*  main

Rebase: Linear history

*---*---*---*---*  main

Handling Conflicts

When merges conflict:

  1. Pull latest changes
  2. Resolve conflicts in files
  3. Test the resolution
  4. Commit the merge
  5. Push changes

Tip: Communicate with your team to avoid conflicts!

Branch Protection Rules

Protect important branches on GitHub:

  • Require Pull Request reviews
  • Require status checks to pass
  • Require signed commits
  • Restrict who can push
  • Require linear history

Team Workflow Tips

  • Agree on a workflow as a team
  • Document conventions in CONTRIBUTING.md
  • Use branch naming conventions (feature/, fix/, docs/)
  • Keep branches short-lived
  • Communicate about major changes

Thank You!

Remember: Good commits make collaboration easier and your project history more valuable.

Questions?

Zero-Copy Data Access

The Traditional Problem

Old Approach

# CSV: Read entire file
df = pd.read_csv("data.csv")
# Copy into memory
# Parse text → objects
# High memory overhead

Consequences

  • Multiple copies in memory
  • Slow parsing (text → typed data)
  • Memory spikes during load
  • Limited by RAM size

What is Zero-Copy Access?

Zero-copy means reading data without duplicating it in memory

Key Idea

  • Data stays on disk (memory-mapped)
  • Direct access to bytes
  • No intermediate parsing
  • Share data between processes

Benefits

  • 🚀 Faster startup
  • 💾 Lower memory usage
  • 🔄 Efficient sharing
  • 📊 Query before loading

Apache Parquet: Designed for Zero-Copy

Columnar binary format with metadata

import polars as pl

# Metadata tells us structure without reading data
schema = pl.scan_parquet("data.parquet").schema

# Read only what you need
df = pl.scan_parquet("data.parquet").select("col1", "col2").collect()

Why Parquet enables zero-copy:

  • Binary format (no text parsing)
  • Columnar layout (read only needed columns)
  • Built-in statistics (skip chunks via metadata)
  • Apache Arrow compatible

Apache Arrow: The Memory Standard

Arrow is the common in-memory format

┌─────────────┐
│   Parquet   │ (on disk)
└──────┬──────┘
       │ zero-copy map
       ↓
┌─────────────┐
│Arrow Memory │ (columnar buffers)
└──────┬──────┘
       ├───→ Polars (zero-copy)
       ├───→ DuckDB (zero-copy)
       └───→ PyArrow (zero-copy)

No serialization between tools!

Polars: Zero-Copy Native

Built on Arrow from the ground up

import polars as pl

# Lazy scan: no data loaded yet
lazy_df = pl.scan_parquet("taxi_*.parquet")

# Build query plan (still no copy)
query = (lazy_df
    .filter(pl.col("distance") > 5)
    .group_by("borough")
    .agg(pl.col("fare").mean())
)

# Execute: stream through data, minimal copies
result = query.collect()

Advantages: Query optimization, predicate pushdown, parallel streaming

DuckDB: Zero-Copy SQL

Analytical database designed for zero-copy

import duckdb

# Query parquet directly (no load step)
result = duckdb.sql("""
    SELECT borough, AVG(fare) as avg_fare
    FROM 'taxi_*.parquet'
    WHERE distance > 5
    GROUP BY borough
""").pl()  # Returns Polars DataFrame (zero-copy)

DuckDB → Polars is zero-copy (both use Arrow)

Memory Mapping in Action

How it works under the hood:

  1. mmap() system call maps file to virtual memory
  2. OS loads pages on-demand (lazy loading)
  3. Multiple processes share same physical pages
  4. OS manages paging (better than manual reads)
# Conceptually what happens:
file_bytes = mmap.mmap(fd, length, access=mmap.ACCESS_READ)
# Arrow views these bytes directly as typed arrays
# No copy, no parsing!
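The conceptual snippet above can be made runnable with only the standard library. A small sketch (the file contents and temp path are demo data only):

```python
import mmap
import os
import tempfile

# Write a small binary file to map (demo data only)
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'columnar bytes on disk')

# Map the file: the OS pages bytes in on demand, no explicit read
with open(path, 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    view = mm[:8]          # slicing copies only the requested bytes
    print(view)            # b'columnar'
    mm.close()

os.remove(path)
```

Arrow readers do essentially this, then interpret the mapped bytes directly as typed column buffers.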

Real Performance Impact

Example: 1GB Parquet file

Operation       Traditional   Zero-Copy   Speedup
Load time       8.2 s         0.3 s       27x
Memory use      2.5 GB        1.1 GB      2.3x
Filter          3.1 s         0.4 s       7.8x
Column select   8.2 s         0.05 s      164x

Zero-copy reads only what’s needed, when it’s needed

Interoperability Benefits

Seamless data sharing:

import polars as pl
import duckdb

# Polars reads parquet (zero-copy)
df = pl.scan_parquet("data.parquet").collect()

# Pass to DuckDB (zero-copy)
result = duckdb.sql("SELECT * FROM df WHERE x > 10").pl()

# Convert to pandas (copy needed, but only once)
pandas_df = result.to_pandas()

Arrow is the lingua franca of modern data tools

Best Practices

To maximize zero-copy benefits:

  1. Use Parquet for storage (not CSV)
  2. Use lazy evaluation when possible (scan_parquet, not read_parquet)
  3. Filter early (predicate pushdown)
  4. Select columns before collecting (projection pushdown)
  5. Partition data for better selectivity
  6. Stay in Arrow ecosystem (Polars/DuckDB/PyArrow)

When Copies Still Happen

Zero-copy isn’t always possible:

  • Converting to pandas (different memory layout)
  • Complex string operations (may need materialization)
  • Writing results to disk
  • Network transfers
  • Some aggregations (need intermediate buffers)

But: Even when copies happen, they’re minimized and strategic

Summary

Zero-copy data access advantages:

  • ✅ Performance: Faster loads, lower latency
  • ✅ Efficiency: Reduced memory footprint
  • ✅ Scalability: Work with larger-than-RAM datasets
  • ✅ Interoperability: Share data between tools seamlessly
  • ✅ Simplicity: Query before loading, explore without cost

The ecosystem: Parquet (storage) + Arrow (memory) + Polars/DuckDB (compute)

Resources

Learn More:

Key concept: Modern data tools separate storage from compute, enabling zero-copy workflows

The DataFrame Fragmentation Problem

The Landscape in 2025:

  • Pandas: The original, still widely used
  • Polars: Fast, Rust-based, growing adoption
  • PyArrow: Direct Arrow Table manipulation
  • Dask: Distributed computing
  • Modin: Parallel Pandas
  • cuDF: GPU-accelerated DataFrames

Each has different APIs, syntax, and capabilities

Why This Matters

For Library Authors:

def process_data(df):  # What type is df?
    # Works with Pandas
    result = df.groupby('category')['value'].mean()
    
    # Breaks with Polars
    # Polars uses: df.group_by('category').agg(pl.col('value').mean())

For Data Scientists:

  • Lock-in to specific libraries
  • Hard to switch when requirements change
  • Code becomes non-portable
  • Organizational silos

Enter: Narwhals 🦄

“Write once, run on any DataFrame”

A lightweight compatibility layer providing a unified API across:

Full API support: cuDF, Modin, pandas, Polars, PyArrow.
Lazy-only support: Daft, Dask, DuckDB, Ibis, PySpark, SQLFrame.

Narwhals Philosophy

Design Principles:

  1. Lightweight: Minimal overhead, lazy when possible
  2. Expressive: Polars-like syntax (a modern, consistent subset of the Polars API)
  3. Interoperable: Easy conversion to/from native types
  4. Type-safe: Full type hints for better IDE support
  5. Zero-cost: Abstraction that doesn’t sacrifice performance

Key insight: You work with native objects, Narwhals just provides a common interface

Narwhals: Basic Example

import narwhals as nw
import pandas as pd
import polars as pl

# Works with Pandas DataFrame
df_pandas = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Works with Polars DataFrame  
df_polars = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Same code for both!
def process(df):
    df_nw = nw.from_native(df)
    result = df_nw.filter(nw.col('a') > 1).select(nw.col('b').sum())
    return nw.to_native(result)

process(df_pandas)  # Returns pandas DataFrame
process(df_polars)  # Returns polars DataFrame

Narwhals: Key Operations

import narwhals as nw

def analyze_data(df):
    df = nw.from_native(df)
    
    result = (
        df
        .filter(nw.col('age') > 18)
        .with_columns([
            (nw.col('salary') * 1.1).alias('salary_adjusted'),
            nw.col('name').str.to_uppercase().alias('name_upper')
        ])
        .group_by('department')
        .agg([
            nw.col('salary_adjusted').mean().alias('avg_salary'),
            nw.col('age').max().alias('max_age')
        ])
        .sort('avg_salary', descending=True)
    )
    
    return nw.to_native(result)

Narwhals: Lazy Evaluation

Narwhals supports both eager and lazy execution modes:

import narwhals as nw
import pandas as pd
import polars as pl

# Eager mode (like Pandas)
df_pandas = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df_nw = nw.from_native(df_pandas)
result = df_nw.filter(nw.col('a') > 1).select('b')  # Executes immediately

# Lazy mode (like Polars LazyFrame)
df_lazy = pl.LazyFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df_nw = nw.from_native(df_lazy)
result = df_nw.filter(nw.col('a') > 1).select('b')  # Builds query plan
final = result.collect()  # Execute optimized query

Narwhals: Lazy Execution Benefits

Why use lazy evaluation?

  1. Query optimization: Backend can reorder and optimize operations
  2. Memory efficiency: Only materialize what’s needed
  3. Predicate pushdown: Filters applied early
  4. Projection pushdown: Only read required columns
# Lazy mode allows optimization
df_lazy = nw.from_native(pl.scan_parquet('large_file.parquet'))

result = (
    df_lazy
    .select(['date', 'amount', 'category'])  # Only read these columns
    .filter(nw.col('date').dt.year() == 2024)  # Filter at read time
    .group_by('category')
    .agg(nw.col('amount').sum())
)

# Optimized plan is executed only on collect()
final = result.collect()

Narwhals: Checking Execution Mode

import narwhals as nw

def process_data(df):
    df_nw = nw.from_native(df)
    
    # Check if lazy or eager
    if hasattr(df_nw, 'collect'):
        print("Lazy evaluation supported!")
        result = (
            df_nw
            .filter(nw.col('value') > 100)
            .select(['id', 'value'])
        )
        return result.collect()  # Materialize result
    else:
        print("Eager evaluation")
        result = (
            df_nw
            .filter(nw.col('value') > 100)
            .select(['id', 'value'])
        )
        return nw.to_native(result)

# Works with both Pandas (eager) and Polars LazyFrame (lazy)

Narwhals: When to Use It

✅ Great for:

  • Building reusable data processing libraries
  • Writing code for teams with different tool preferences
  • Prototyping with flexibility to switch backends
  • Teaching/documentation that should work everywhere

❌ Not ideal for:

  • Performance-critical inner loops (use native APIs)
  • Features specific to one backend
  • When you know you’ll only use one library

Moving to Databases: The SQL Problem

Challenge: Each database has slightly different SQL dialects

-- PostgreSQL
SELECT * FROM table LIMIT 10;

-- SQL Server
SELECT TOP 10 * FROM table;

-- Oracle
SELECT * FROM table WHERE ROWNUM <= 10;

Plus: Many more differences in date functions, string operations, window functions, etc.

Enter: Ibis

“The portable Python dataframe library”

A unified Python interface to:

  • 20+ database backends (DuckDB, PostgreSQL, BigQuery, Snowflake, etc.)
  • Local DataFrames (Pandas, Polars)
  • Cloud data warehouses
  • File formats (Parquet, CSV)

Ibis Philosophy

Design Principles:

  1. Lazy Evaluation: Build expression trees, execute when needed
  2. Backend Agnostic: Same code works on any database
  3. Pythonic: No SQL string manipulation
  4. Type-safe: Column types are known and checked
  5. Optimized: Pushes computation to the database

Key insight: Let databases do what they’re good at, orchestrate from Python

Ibis: Basic Example

import ibis

# Connect to different backends with same interface
duckdb_con = ibis.duckdb.connect('data.ddb')
postgres_con = ibis.postgres.connect(host='localhost', database='mydb')
bigquery_con = ibis.bigquery.connect(project_id='my-project')

# Same query works on all!
def analyze(con):
    table = con.table('sales')
    
    result = (
        table
        .filter(table.amount > 100)
        .group_by('product')
        .aggregate(total=table.amount.sum())
        .order_by(ibis.desc('total'))
    )
    
    return result.execute()  # Returns pandas DataFrame

Ibis: Rich Expression API

import ibis
from ibis import _

con = ibis.duckdb.connect()
orders = con.table('orders')

result = (
    orders
    .filter(_.order_date.year() == 2024)
    .mutate(
        month=_.order_date.month(),
        revenue=_.quantity * _.price,
        is_large=_.quantity > 100
    )
    .group_by(['month', 'category'])
    .agg(
        total_revenue=_.revenue.sum(),
        avg_quantity=_.quantity.mean(),
        order_count=_.count()
    )
    .filter(_.total_revenue > 10000)
    .order_by([_.month, ibis.desc(_.total_revenue)])
)

# This generates optimized SQL for the backend
result.execute()

Ibis: Supported Backends

Databases:

  • PostgreSQL, MySQL, SQLite
  • DuckDB, DataFusion
  • Snowflake, BigQuery, Redshift
  • ClickHouse, Trino
  • SQL Server, Oracle

DataFrames:

  • Pandas, Polars, PyArrow

Files:

  • Parquet, CSV (via DuckDB backend)

All with the same Python API!

Ibis: Lazy Evaluation

import ibis

con = ibis.duckdb.connect()
table = con.table('large_table')

# Build the query (no execution yet)
query = (
    table
    .filter(table.amount > 1000)
    .group_by('category')
    .aggregate(total=table.amount.sum())
)

# Inspect the generated SQL
print(query.compile())

# Execute when ready
result = query.execute()  # This runs the query

# Or stream for large results
for batch in query.to_pyarrow_batches():
    process(batch)

Combining Narwhals and Ibis

Use them together for maximum portability!

import ibis
import narwhals as nw

# Extract from database with Ibis
con = ibis.duckdb.connect('warehouse.ddb')
sales = con.table('sales').execute()  # pandas DataFrame

# Transform with Narwhals (works with pandas or polars)
def transform(df):
    df = nw.from_native(df)
    return (
        df
        .with_columns([(nw.col('amount') * 1.1).alias('amount_tax')])
        .filter(nw.col('amount_tax') > 100)
        .to_native()
    )

transformed = transform(sales)

# Load back to database with Ibis
con.create_table('sales_processed', transformed)

Real-World Pipeline Example

import ibis
from ibis import _
import narwhals as nw

def etl_pipeline(source_db, dest_db, frame_backend='pandas'):
    # EXTRACT: Read from any database
    source = ibis.connect(source_db)
    raw_data = source.table('events').filter(_.date >= '2024-01-01').execute()
    
    # TRANSFORM: Process with any DataFrame library
    def transform(df):
        df = nw.from_native(df)
        return (
            df
            .with_columns([
                nw.col('event_time').dt.hour().alias('hour'),
                nw.col('user_id').cast(nw.String).alias('user_id_str')
            ])
            .group_by(['hour', 'event_type'])
            .agg([nw.col('user_id').n_unique().alias('unique_users')])
            .to_native()
        )
    
    processed = transform(raw_data)
    
    # LOAD: Write to any database
    dest = ibis.connect(dest_db)
    dest.create_table('event_summary', processed, overwrite=True)

Performance Considerations

Narwhals:

  • Negligible overhead (~1-2% typically)
  • Works with native objects, just wraps the API
  • No data copying unless converting between backends

Ibis:

  • Pushes computation to database (often faster than local)
  • Lazy evaluation means efficient query planning
  • Can generate suboptimal SQL in complex cases (rare)

Best practice: Profile both native and unified APIs for your use case

When to Use Each Tool

Use Narwhals when:

  • Writing library code for others
  • Building data processing functions
  • Wanting flexibility in DataFrame backend
  • Working across team with different preferences

Use Ibis when:

  • Querying multiple databases
  • Building cloud-agnostic pipelines
  • Need to switch data warehouses
  • Want Python instead of SQL strings

Use both when:

  • Building full ETL/ELT pipelines
  • Maximum portability is required

Getting Started: Installation

# Install narwhals
pip install narwhals

# Install ibis with specific backends
pip install 'ibis-framework[duckdb]'
pip install 'ibis-framework[postgres]'
pip install 'ibis-framework[bigquery]'

# Or install multiple backends
pip install 'ibis-framework[duckdb,postgres,snowflake]'

# Narwhals automatically detects installed DataFrame libraries
# No additional configuration needed!

Code Migration Strategy

Step 1: Identify portable operations

# Before (pandas-specific)
df.groupby('col')['value'].mean()

# After (narwhals)
nw.from_native(df).group_by('col').agg(nw.col('value').mean())

Step 2: Wrap in functions

def my_analysis(df):
    df = nw.from_native(df)
    # ... narwhals operations ...
    return nw.to_native(result)

Step 3: Gradually migrate critical paths

Best Practices

  1. Start with the interface layer: Wrap your data access in Narwhals/Ibis
  2. Document backend requirements: Some features may be backend-specific
  3. Test across backends: CI/CD should test multiple DataFrame/DB engines
  4. Profile before optimizing: Unified APIs are usually fast enough
  5. Use type hints: Both libraries support full typing
  6. Leverage lazy evaluation: Build queries, execute when ready

Limitations to Know

Narwhals:

  • Not all DataFrame features are supported (by design)
  • Some operations may be slightly different across backends
  • Extension types (categorical, etc.) may behave differently

Ibis:

  • SQL dialect differences can still leak through
  • Some backends have limited feature support
  • Requires connection setup for each backend

Solution: Read the docs, test thoroughly, have fallback plans

Community and Resources

Narwhals:

  • GitHub: narwhals-dev/narwhals
  • Docs: narwhals-dev.github.io/narwhals
  • Discord community for support

Ibis:

  • GitHub: ibis-project/ibis
  • Docs: ibis-project.org
  • Active Zulip chat

Both projects are actively maintained and growing!

Key Takeaways

Remember:

  1. Narwhals = Unified DataFrame API for local computation
  2. Ibis = Unified interface to databases and warehouses
  3. Together = Portable, backend-agnostic data pipelines
  4. Trade-off = Slight abstraction cost for massive flexibility
  5. Future-proof = Code survives technology changes

Write portable code today, thank yourself tomorrow

Examples

See this page for more code examples and patterns using Narwhals and Ibis!

Hive-Style Partitioning for Parquet

Directory-based data organization pattern:

data/
├── year=2024/
│   ├── month=01/
│   │   ├── day=01/
│   │   │   └── data.parquet
│   │   └── day=02/
│   │       └── data.parquet
│   └── month=02/
│       └── day=01/
│           └── data.parquet
└── year=2025/
    └── month=01/
        └── day=01/
            └── data.parquet

Partition keys are encoded in directory structure, not file contents
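Because the keys live in the path, partition values can be recovered without opening any file. A stdlib sketch (the helper name is ours) of how readers parse these paths:

```python
from pathlib import PurePosixPath

def parse_hive_path(path: str) -> dict:
    """Extract key=value partition pairs encoded in a Hive-style path."""
    pairs = {}
    for segment in PurePosixPath(path).parts:
        if '=' in segment:
            key, value = segment.split('=', 1)
            pairs[key] = value
    return pairs

print(parse_hive_path('data/year=2024/month=01/day=02/data.parquet'))
# {'year': '2024', 'month': '01', 'day': '02'}
```

This is why a filter like `year == 2024` can skip whole directories: the value is known before any Parquet bytes are read.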

Hive Partitioning: How It Works

Key concept: Organize files by column values in nested directories

# Writing partitioned data
import polars as pl

df = pl.DataFrame({
    'date': ['2024-01-01', '2024-01-02', '2024-02-01'],
    'product': ['A', 'B', 'A'],
    'sales': [100, 200, 150]
})

# Create partitions by date components
df = df.with_columns([
    pl.col('date').str.to_date().dt.year().alias('year'),
    pl.col('date').str.to_date().dt.month().alias('month')
])

# Write with Hive partitioning
df.write_parquet('data', partition_by=['year', 'month'])

Result: Files organized in year=2024/month=01/ directories

Why Hive Partitioning Matters

🚀 Performance Benefits:

  1. Partition pruning: Skip entire directories when filtering
  2. Parallel I/O: Read only relevant partitions concurrently
  3. Reduced memory: Don’t load unnecessary data
  4. Faster queries: Orders of magnitude speedup on large datasets
# Without partitioning: scan entire dataset
df = pl.scan_parquet('data/*.parquet')
result = df.filter(pl.col('year') == 2024).collect()

# With partitioning: only read year=2024/ directory
df = pl.scan_parquet('data/**/*.parquet', hive_partitioning=True)
result = df.filter(pl.col('year') == 2024).collect()  # Much faster!

Partition Pruning in Action

Example: 1TB dataset, 3 years of data

import ibis

# Connect to DuckDB (great Parquet support)
con = ibis.duckdb.connect()

# Query with automatic partition pruning
table = con.read_parquet(
    'data/**/*.parquet',
    hive_partitioning=True
)

# Only reads ~340GB (1 year) instead of 1TB
recent = table.filter(table.year == 2024).execute()

Benchmark: 3x-10x faster queries on partitioned data

Choosing Good Partition Keys

✅ Good partition keys:

  • High cardinality but not too high: 10-1000 partitions ideal
  • Common filter columns: Date/time, region, category
  • Evenly distributed data: Avoid skew
  • Stable values: Don’t change over time

❌ Avoid:

  • Too many partitions: User ID, transaction ID (millions of dirs)
  • Unbalanced partitions: Status with 99% “active”, 1% “inactive”
  • Rarely queried columns: Columns you never filter on

Partition Strategy Examples

Time-series data:

year=2024/month=12/day=15/data.parquet  # Fine-grained
year=2024/month=12/data.parquet          # Medium
year=2024/data.parquet                    # Coarse

Multi-dimensional:

region=us/category=electronics/data.parquet
region=eu/category=books/data.parquet

Best practice: Start with coarser partitions, refine if needed
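The granularity trade-off is easy to quantify with quick arithmetic. A sketch (the 3-year horizon and ~30 days/month are assumed round numbers):

```python
from math import prod

# Approximate leaf-directory counts for 3 years of data
granularities = {
    'year':           (3,),
    'year/month':     (3, 12),
    'year/month/day': (3, 12, 30),
}
counts = {name: prod(dims) for name, dims in granularities.items()}
print(counts)  # {'year': 3, 'year/month': 36, 'year/month/day': 1080}
```

Daily partitions over a few years already produce over a thousand directories, which is why starting coarse and refining later is the safer default.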

Reading Partitioned Data

All modern tools support Hive partitioning:

# Polars
df = pl.scan_parquet('data/**/*.parquet', hive_partitioning=True)

# DuckDB
con.execute("SELECT * FROM read_parquet('data/**/*.parquet', hive_partitioning=true)")

# Ibis
table = con.read_parquet('data/**/*.parquet', hive_partitioning=True)

# PyArrow
import pyarrow.parquet as pq
ds = pq.ParquetDataset('data/', partitioning='hive')

Partition columns automatically added to DataFrame

Writing Partitioned Data

Create partitions when writing:

# Polars
df.write_parquet('data', partition_by=['year', 'month'])

# PyArrow
import pyarrow.parquet as pq
pq.write_to_dataset(
    table,
    root_path='data',
    partition_cols=['year', 'month']
)

# Ibis with DuckDB
con.create_table('output', df, temp=False)
con.raw_sql("""
    COPY (SELECT * FROM output)
    TO 'data' (FORMAT PARQUET, PARTITION_BY (year, month))
""")

Partitioning Best Practices

  1. Partition size: Target 100MB-1GB per file (not too small/large)
  2. Partition count: Keep under 1000 directories if possible
  3. Schema evolution: Partition keys should be stable
  4. Metadata: Consider partition discovery overhead
  5. Compression: Works with partitioning (zstd recommended)
# Good: 365 partitions (daily for 1 year)
partition_by=['year', 'month', 'day']

# Bad: Millions of partitions
partition_by=['user_id']  # Too many!

Real-World Example: Sales Analytics

import polars as pl
from datetime import datetime

# Generate sales data
sales = pl.DataFrame({
    'timestamp': [...],  # 10M rows
    'product_id': [...],
    'amount': [...],
    'region': [...]
})

# Prepare for partitioning ('region' is already a column)
sales = sales.with_columns([
    pl.col('timestamp').dt.year().alias('year'),
    pl.col('timestamp').dt.month().alias('month')
])

# Write partitioned (100s of files, organized by year/month/region)
sales.write_parquet('sales_data', partition_by=['year', 'month', 'region'])

# Fast queries: only scan relevant partitions
q = pl.scan_parquet('sales_data/**/*.parquet', hive_partitioning=True)
result = (
    q.filter(
        (pl.col('year') == 2024) &
        (pl.col('region') == 'US')
    )
    .group_by('product_id')
    .agg(pl.col('amount').sum())
    .collect()  # execute the optimized plan
)

Partitioning + Column Pruning = Maximum Speed

Combine two optimizations:

# Only read specific columns from specific partitions
q = (
    pl.scan_parquet('data/**/*.parquet', hive_partitioning=True)
    .select(['product_id', 'amount'])  # Column pruning
    .filter(pl.col('year') == 2024)     # Partition pruning
)

# Reads minimal data:
# ✓ Only 2024 directories (not 2022, 2023)  
# ✓ Only 2 columns (not all 50)
# ✓ Compressed Parquet format

Result: Queries on TBs feel like queries on MBs

When NOT to Partition

Skip partitioning if:

  • Dataset is small (<10GB)
  • No clear filter patterns
  • Data changes frequently (many small updates)
  • You need flexibility to change partition keys

In these cases, flat Parquet files may be simpler and just as fast

Understanding Null vs NaN in Python

Two distinct concepts often confused:

  • None / null: Absence of a value (missing data)
  • NaN: “Not a Number” - a specific floating-point value (IEEE 754)
import math
import numpy as np

# None is Python's null
x = None
print(type(x))  # <class 'NoneType'>

# NaN is a float value
y = float('nan')
print(type(y))  # <class 'float'>
print(y == y)   # False (NaN != NaN by IEEE 754)
print(math.isnan(y))  # True

Key difference: NaN is a value; None represents no value

Null vs NaN: Semantic Meaning

When to use each:

None / null:

  • Data wasn’t collected
  • Not applicable for this record
  • Missing in the source system
  • Any data type (strings, integers, dates)

NaN:

  • Mathematical undefined result (0/0, ∞ − ∞)
  • Propagates through numeric calculations
  • Specific to floating-point arithmetic
  • Only for numeric types

# Appropriate uses
age = None          # Person didn't provide age
price = float('nan')  # Invalid calculation result
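The propagation difference called out above is easy to see with the standard library alone:

```python
import math

nan = float('nan')

# NaN silently propagates through float arithmetic
print(math.isnan(nan + 1))   # True
print(nan == nan)            # False (IEEE 754)

# None does not: using it in arithmetic raises immediately
try:
    None + 1
except TypeError as e:
    print('TypeError:', e)
```

This is why NaN can slip through a pipeline unnoticed while None fails fast.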

The Pandas Problem

Pandas historically conflated None and NaN:

import pandas as pd
import numpy as np

# Creating a Series with None
s = pd.Series([1, None, 3])
print(s)
# 0    1.0
# 1    NaN  # None became NaN!
# 2    3.0

# Type was converted to float64
print(s.dtype)  # float64

# Can't represent integer + null without nullable types
# Integer columns become float just to handle missing data

Problem: Lost distinction between missing data and invalid calculations

Pandas: Type Coercion Issues

The old Pandas behavior:

import pandas as pd

# Integer column with missing value
data = pd.Series([1, 2, None, 4])
print(data.dtype)  # float64 (not int64!)

# String column with missing value  
names = pd.Series(['Alice', None, 'Charlie'])
print(names.dtype)  # object

# Even boolean columns!
flags = pd.Series([True, False, None])
print(flags.dtype)  # object (not bool!)

Why?: NumPy arrays don’t support null for most types, so Pandas used NaN and float64
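The coercion described above comes from NumPy itself, as a quick check shows (assuming NumPy is installed; the values are demo data):

```python
import numpy as np

# No integer null exists: None forces a generic object array
arr = np.array([1, 2, None, 4])
print(arr.dtype)   # object

# The only numeric "missing" marker is NaN, which is float-only
arr2 = np.array([1.0, 2.0, np.nan, 4.0])
print(arr2.dtype)  # float64
print(np.isnan(arr2).sum())  # 1
```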

Modern Solution: Nullable Types

Pandas 1.0+ introduced proper nullable types:

import pandas as pd

# Use nullable integer
data = pd.Series([1, 2, None, 4], dtype='Int64')  # Capital I
print(data.dtype)  # Int64
print(data)
# 0       1
# 1       2
# 2    <NA>  # Proper null marker
# 3       4

# Nullable boolean
flags = pd.Series([True, False, None], dtype='boolean')
print(flags.dtype)  # boolean

# Nullable string
names = pd.Series(['Alice', None, 'Charlie'], dtype='string')
print(names.dtype)  # string

<NA> is Pandas’ null sentinel, distinct from NaN

Polars: Clear Null Semantics

Polars has null handling right from the start:

import polars as pl

# Integer column with null - no type coercion!
data = pl.Series('values', [1, 2, None, 4])
print(data.dtype)  # Int64 (not Float64!)

# Nulls work with any type
names = pl.Series('names', ['Alice', None, 'Charlie'])
print(names.dtype)  # Utf8

flags = pl.Series('flags', [True, False, None])
print(flags.dtype)  # Boolean

# NULL is a distinct value, not NaN
print(data.null_count())  # 1

Polars: null is null, NaN is NaN, types are preserved

Checking for Missing Values

Different approaches across tools:

import pandas as pd
import polars as pl
import numpy as np

# Pandas (checks for both None and NaN)
s = pd.Series([1, None, float('nan'), 4])
print(s.isna())   # [False, True, True, False]
print(s.isnull()) # Same as isna()

# Polars (explicit about nulls)
s = pl.Series([1, None, float('nan'), 4])
print(s.is_null())     # [False, True, False, False]
print(s.is_nan())      # [False, null, True, False] (null propagates)

# NumPy (only checks for NaN)
arr = np.array([1, np.nan, 4])
print(np.isnan(arr))  # [False, True, False]

Polars distinguishes null from NaN; Pandas often doesn’t

Narwhals: Handling Nulls Portably

Narwhals follows Polars semantics:

import narwhals as nw
import pandas as pd
import polars as pl

def handle_missing(df):
    df = nw.from_native(df)
    # Two common strategies (pick one):
    dropped = df.filter(~nw.col('value').is_null())         # Drop null rows
    filled = df.with_columns(nw.col('value').fill_null(0))  # Replace nulls with 0
    return nw.to_native(filled)

# Works consistently across backends
df_pandas = pd.DataFrame({'value': [1, None, 3]})
df_polars = pl.DataFrame({'value': [1, None, 3]})

handle_missing(df_pandas)  # Treats None/NaN as null
handle_missing(df_polars)  # Native null handling

Recommendation: Use Narwhals/Polars semantics for clarity

Best Practices: Null Handling

Modern recommendations:

  1. Use nullable dtypes in Pandas: Int64, boolean, string
  2. Prefer Polars for new projects (clean null semantics)
  3. Avoid mixing None and NaN in the same context
  4. Be explicit: Use is_null() / is_nan() separately
  5. Document assumptions: What does missing mean in your data?
# Good: Clear intent
df = pl.DataFrame({
    'age': pl.Series([25, None, 30], dtype=pl.Int64),  # Missing
    'score': pl.Series([1.5, float('nan'), 2.0])       # Invalid calculation
})

# Bad: Ambiguous
df = pd.DataFrame({
    'age': [25, np.nan, 30],  # Is this missing or invalid?
})

Next Week

  • Deep dive into Apache Spark
  • Distributed computing at scale
  • When local tools aren’t enough
  • How Spark relates to the tools we’ve learned

The tools we’ve covered are powerful for single-node workloads—Spark takes us to multi-node clusters

GitHub Classroom

GitHub Classroom Link