Unified Data Access with Narwhals and Ibis
Important
Blackboard is strange!! We need you to make a “submission” for each of the earlier labs/assignments. This submission will be the URL of the respective Github repository.
This “submission” ensures that your grades show up on Blackboard. It will not affect late policies, which are determined by the timestamp of your GitHub commits.
This will be a requirement for all labs/assignments going forward.
Only 15/24 students have joined groups!!
Fix this before tonight!!!
First milestone is due next week. See here for details for now.
Best Practices for Clear Version Control
<type>: <subject>
<body>
<footer>
Keep it short and descriptive (50 characters or less)
Good examples:
Fix login button alignment on mobile
Add user authentication middleware
Update README with installation steps
Fixed stuff
Updated files
Changes
WIP
asdfasdf
These tell us nothing about what actually changed!
Fix memory leak in data processing loop
Add validation for email input fields
Refactor database query for performance
Remove deprecated API endpoints
Clear, concise, and descriptive!
Use when you need to explain:
Fix duplicate entries in user search results
The search query was not properly deduplicating
results when users had multiple roles. Added
DISTINCT clause to SQL query and updated tests
to verify unique results.
Fixes #234
Each commit should be a single, complete unit of work
✅ Good:
❌ Bad:
A standardized format for commit messages:
feat: add user profile page
fix: resolve null pointer in checkout
docs: update API documentation
style: format code with prettier
refactor: simplify error handling
test: add unit tests for auth service
chore: update dependencies
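As a sketch, here is a Conventional Commits-style message in practice. The repo and file below are throwaway examples created just for the demo:

```shell
# Create a throwaway repo so the example is self-contained
cd "$(mktemp -d)"
git init -q
git config user.email "dev@example.com"
git config user.name "Dev"

# Stage a change and commit with a type-prefixed subject
echo "TODO" > auth.py
git add auth.py
git commit -q -m "feat: add user authentication middleware"

# The subject line is what teammates and tooling see first
git log -1 --pretty=%s
```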
Commit when you’ve completed a logical unit of work:
❌ Not at end of day “just to save”
Write commit messages that will help someone (including future you) understand what changed and why, without having to look at the diff.
Before committing, ask yourself:
Different workflows for different team sizes and needs:
Best for: Small teams, simple projects
main branch
|
*---*---*---*
main
Best for: Most teams
main *---*---*-------*
\ /
feature *---*---*
Merge the feature branch into main when done; main stays stable.
Best for: Scheduled releases, larger teams
main *---------*---------*
\ / \ /
develop *--*--*---*--*--*
\ / \ /
feature *-* *-*
main: production-ready code
develop: integration branch
feature/*: new features
release/*: preparing releases
hotfix/*: emergency fixes
Best for: Open source projects
Maintainers control what gets merged.
As Author:
As Reviewer:
Merge Commit: Preserves all history
*---*---*-------* main
\ /
*---* feature
Squash: Combines commits into one
*---*---*---* main
Rebase: Linear history
*---*---*---*---* main
When merges conflict:
Tip: Communicate with your team to avoid conflicts!
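When both branches touch the same lines, Git writes conflict markers into the file; a sketch (file contents illustrative):

```
<<<<<<< HEAD
timeout = 30        # your branch's version
=======
timeout = 60        # the incoming branch's version
>>>>>>> feature/retry-logic
```

Edit the file to keep the version you want (or a blend of both), delete the markers, then `git add` the file and commit to finish the merge.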
Protect important branches on GitHub:
Remember: Good commits make collaboration easier and your project history more valuable.
Questions?
Zero-copy means reading data without duplicating it in memory
Key Idea
Benefits
Columnar binary format with metadata
Why Parquet enables zero-copy:
Arrow is the common in-memory format
┌─────────────┐
│ Parquet │ (on disk)
└──────┬──────┘
│ zero-copy map
↓
┌─────────────┐
│Arrow Memory │ (columnar buffers)
└──────┬──────┘
├───→ Polars (zero-copy)
├───→ DuckDB (zero-copy)
└───→ PyArrow (zero-copy)
No serialization between tools!
Built on Arrow from the ground up
import polars as pl
# Lazy scan: no data loaded yet
lazy_df = pl.scan_parquet("taxi_*.parquet")
# Build query plan (still no copy)
query = (lazy_df
.filter(pl.col("distance") > 5)
.group_by("borough")
.agg(pl.col("fare").mean())
)
# Execute: stream through data, minimal copies
result = query.collect()
Advantages: Query optimization, predicate pushdown, parallel streaming
Analytical database designed for zero-copy
DuckDB → Polars is zero-copy (both use Arrow)
How it works under the hood:
Example: 1GB Parquet file
| Operation | Traditional | Zero-Copy | Speedup |
|---|---|---|---|
| Load time | 8.2s | 0.3s | 27x |
| Memory use | 2.5 GB | 1.1 GB | 2.3x |
| Filter | 3.1s | 0.4s | 7.8x |
| Column select | 8.2s | 0.05s | 164x |
Zero-copy reads only what’s needed, when it’s needed
Seamless data sharing:
Arrow is the lingua franca of modern data tools
To maximize zero-copy benefits:
Use lazy scans (scan_parquet, not read_parquet)
Zero-copy isn’t always possible:
But: Even when copies happen, they’re minimized and strategic
Zero-copy data access advantages:
✅ Performance: Faster loads, lower latency
✅ Efficiency: Reduced memory footprint
✅ Scalability: Work with larger-than-RAM datasets
✅ Interoperability: Share data between tools seamlessly
✅ Simplicity: Query before loading, explore without cost
The ecosystem: Parquet (storage) + Arrow (memory) + Polars/DuckDB (compute)
Learn More:
Key concept: Modern data tools separate storage from compute, enabling zero-copy workflows
The Landscape in 2025:
Each has different APIs, syntax, and capabilities
For Library Authors:
For Data Scientists:
“Write once, run on any DataFrame”
A lightweight compatibility layer providing a unified API across:
Full API support: cuDF, Modin, pandas, Polars, PyArrow.
Lazy-only support: Daft, Dask, DuckDB, Ibis, PySpark, SQLFrame.
Design Principles:
Key insight: You work with native objects, Narwhals just provides a common interface
import narwhals as nw
import pandas as pd
import polars as pl
# Works with Pandas DataFrame
df_pandas = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
# Works with Polars DataFrame
df_polars = pl.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
# Same code for both!
def process(df):
df_nw = nw.from_native(df)
result = df_nw.filter(nw.col('a') > 1).select(nw.col('b').sum())
return nw.to_native(result)
process(df_pandas) # Returns pandas DataFrame
process(df_polars) # Returns polars DataFrame
import narwhals as nw
def analyze_data(df):
df = nw.from_native(df)
result = (
df
.filter(nw.col('age') > 18)
.with_columns([
(nw.col('salary') * 1.1).alias('salary_adjusted'),
nw.col('name').str.to_uppercase().alias('name_upper')
])
.group_by('department')
.agg([
nw.col('salary_adjusted').mean().alias('avg_salary'),
nw.col('age').max().alias('max_age')
])
.sort('avg_salary', descending=True)
)
return nw.to_native(result)
Narwhals supports both eager and lazy execution modes:
import narwhals as nw
import polars as pl
# Eager mode (like Pandas)
df_pandas = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df_nw = nw.from_native(df_pandas)
result = df_nw.filter(nw.col('a') > 1).select('b') # Executes immediately
# Lazy mode (like Polars LazyFrame)
df_lazy = pl.LazyFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df_nw = nw.from_native(df_lazy)
result = df_nw.filter(nw.col('a') > 1).select('b') # Builds query plan
final = result.collect() # Execute optimized query
Why use lazy evaluation?
# Lazy mode allows optimization
df_lazy = nw.from_native(pl.scan_parquet('large_file.parquet'))
result = (
df_lazy
.select(['date', 'amount', 'category']) # Only read these columns
.filter(nw.col('date').dt.year() == 2024) # Filter at read time
.group_by('category')
.agg(nw.col('amount').sum())
)
# Optimized plan is executed only on collect()
final = result.collect()
import narwhals as nw
def process_data(df):
df_nw = nw.from_native(df)
# Check if lazy or eager
if hasattr(df_nw, 'collect'):
print("Lazy evaluation supported!")
result = (
df_nw
.filter(nw.col('value') > 100)
.select(['id', 'value'])
)
return result.collect() # Materialize result
else:
print("Eager evaluation")
result = (
df_nw
.filter(nw.col('value') > 100)
.select(['id', 'value'])
)
return nw.to_native(result)
# Works with both Pandas (eager) and Polars LazyFrame (lazy)
✅ Great for:
❌ Not ideal for:
Challenge: Each database has slightly different SQL dialects
Plus: Many more differences in date functions, string operations, window functions, etc.
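As an illustration, here is the same "extract the year" operation spelled three ways (the Postgres and DuckDB strings are illustrative; only the SQLite one is executed, via the standard library):

```python
import sqlite3

# The same logical operation differs across dialects:
#   Postgres: EXTRACT(YEAR FROM order_date)
#   DuckDB:   year(order_date)
#   SQLite:   strftime('%Y', order_date)
con = sqlite3.connect(":memory:")
year = con.execute("SELECT strftime('%Y', '2024-03-15')").fetchone()[0]
print(year)  # '2024'
```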
“The portable Python dataframe library”
A unified Python interface to:
Design Principles:
Key insight: Let databases do what they’re good at, orchestrate from Python
import ibis
# Connect to different backends with same interface
duckdb_con = ibis.duckdb.connect('data.ddb')
postgres_con = ibis.postgres.connect(host='localhost', database='mydb')
bigquery_con = ibis.bigquery.connect(project_id='my-project')
# Same query works on all!
def analyze(con):
table = con.table('sales')
result = (
table
.filter(table.amount > 100)
.group_by('product')
.aggregate(total=table.amount.sum())
.order_by(ibis.desc('total'))
)
return result.execute() # Returns pandas DataFrame
import ibis
from ibis import _
con = ibis.duckdb.connect()
orders = con.table('orders')
result = (
orders
.filter(_.order_date.year() == 2024)
.mutate(
month=_.order_date.month(),
revenue=_.quantity * _.price,
is_large=_.quantity > 100
)
.group_by(['month', 'category'])
.agg(
total_revenue=_.revenue.sum(),
avg_quantity=_.quantity.mean(),
order_count=_.count()
)
.filter(_.total_revenue > 10000)
.order_by([_.month, ibis.desc(_.total_revenue)])
)
# This generates optimized SQL for the backend
result.execute()
Databases:
DataFrames:
Files:
All with the same Python API!
import ibis
con = ibis.duckdb.connect()
table = con.table('large_table')
# Build the query (no execution yet)
query = (
table
.filter(table.amount > 1000)
.group_by('category')
.aggregate(total=table.amount.sum())
)
# Inspect the generated SQL
print(query.compile())
# Execute when ready
result = query.execute() # This runs the query
# Or stream for large results
for batch in query.to_pyarrow_batches():
process(batch)
Use them together for maximum portability!
import ibis
import narwhals as nw
# Extract from database with Ibis
con = ibis.duckdb.connect('warehouse.ddb')
sales = con.table('sales').execute() # pandas DataFrame
# Transform with Narwhals (works with pandas or polars)
def transform(df):
df = nw.from_native(df)
return (
df
.with_columns([(nw.col('amount') * 1.1).alias('amount_tax')])
.filter(nw.col('amount_tax') > 100)
.to_native()
)
transformed = transform(sales)
# Load back to database with Ibis
con.create_table('sales_processed', transformed)
import ibis
import narwhals as nw
from ibis import _
def etl_pipeline(source_db, dest_db, frame_backend='pandas'):
# EXTRACT: Read from any database
source = ibis.connect(source_db)
raw_data = source.table('events').filter(_.date >= '2024-01-01').execute()
# TRANSFORM: Process with any DataFrame library
def transform(df):
df = nw.from_native(df)
return (
df
.with_columns([
nw.col('event_time').dt.hour().alias('hour'),
nw.col('user_id').cast(nw.String).alias('user_id_str')
])
.group_by(['hour', 'event_type'])
.agg([nw.col('user_id').n_unique().alias('unique_users')])
.to_native()
)
processed = transform(raw_data)
# LOAD: Write to any database
dest = ibis.connect(dest_db)
dest.create_table('event_summary', processed, overwrite=True)
Narwhals:
Ibis:
Best practice: Profile both native and unified APIs for your use case
Use Narwhals when:
Use Ibis when:
Use both when:
# Install narwhals
pip install narwhals
# Install ibis with specific backends
pip install 'ibis-framework[duckdb]'
pip install 'ibis-framework[postgres]'
pip install 'ibis-framework[bigquery]'
# Or install multiple backends
pip install 'ibis-framework[duckdb,postgres,snowflake]'
# Narwhals automatically detects installed DataFrame libraries
# No additional configuration needed!
Step 1: Identify portable operations
Step 2: Wrap in functions
Step 3: Gradually migrate critical paths
Narwhals:
Ibis:
Solution: Read the docs, test thoroughly, have fallback plans
Narwhals:
narwhals-dev/narwhals
narwhals-dev.github.io/narwhals
Ibis:
ibis-project/ibis
ibis-project.org
Both projects are actively maintained and growing!
Remember:
Write portable code today, thank yourself tomorrow
See this page for more code examples and patterns using Narwhals and Ibis!
Directory-based data organization pattern:
data/
├── year=2024/
│ ├── month=01/
│ │ ├── day=01/
│ │ │ └── data.parquet
│ │ └── day=02/
│ │ └── data.parquet
│ └── month=02/
│ └── day=01/
│ └── data.parquet
└── year=2025/
└── month=01/
└── day=01/
└── data.parquet
Partition keys are encoded in directory structure, not file contents
Key concept: Organize files by column values in nested directories
# Writing partitioned data
import polars as pl
df = pl.DataFrame({
'date': ['2024-01-01', '2024-01-02', '2024-02-01'],
'product': ['A', 'B', 'A'],
'sales': [100, 200, 150]
})
# Create partitions by date components
df = df.with_columns([
pl.col('date').str.to_date().dt.year().alias('year'),
pl.col('date').str.to_date().dt.month().alias('month')
])
# Write with Hive partitioning
df.write_parquet('data', partition_by=['year', 'month'])
Result: Files organized in year=2024/month=01/ directories
🚀 Performance Benefits:
# Without partitioning: scan entire dataset
df = pl.scan_parquet('data/*.parquet')
result = df.filter(pl.col('year') == 2024).collect()
# With partitioning: only read year=2024/ directory
df = pl.scan_parquet('data/**/*.parquet', hive_partitioning=True)
result = df.filter(pl.col('year') == 2024).collect() # Much faster!
Example: 1TB dataset, 3 years of data
Benchmark: 3x-10x faster queries on partitioned data
✅ Good partition keys:
❌ Avoid:
Time-series data:
year=2024/month=12/day=15/data.parquet # Fine-grained
year=2024/month=12/data.parquet # Medium
year=2024/data.parquet # Coarse
Multi-dimensional:
region=us/category=electronics/data.parquet
region=eu/category=books/data.parquet
Best practice: Start with coarser partitions, refine if needed
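One quick sanity check on a candidate partition key is its cardinality; a sketch with pandas (column names illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["us", "eu", "us", "apac"],   # few distinct values: good key
    "user_id": [101, 102, 103, 104],        # unique per row: avoid as key
})

# Low cardinality keeps the number of directories manageable
print(df["region"].nunique())   # 3
print(df["user_id"].nunique())  # 4
```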
All modern tools support Hive partitioning:
# Polars
df = pl.scan_parquet('data/**/*.parquet', hive_partitioning=True)
# DuckDB
con.execute("SELECT * FROM read_parquet('data/**/*.parquet', hive_partitioning=true)")
# Ibis
table = con.read_parquet('data/**/*.parquet', hive_partitioning=True)
# PyArrow
import pyarrow.parquet as pq
ds = pq.ParquetDataset('data/', partitioning='hive')
Partition columns automatically added to DataFrame
Create partitions when writing:
# Polars
df.write_parquet('data', partition_by=['year', 'month'])
# PyArrow
import pyarrow.parquet as pq
pq.write_to_dataset(
table,
root_path='data',
partition_cols=['year', 'month']
)
# Ibis with DuckDB
con.create_table('output', df, temp=False)
con.execute("""
COPY (SELECT * FROM output)
TO 'data' (FORMAT PARQUET, PARTITION_BY (year, month))
""")
import polars as pl
from datetime import datetime
# Generate sales data
sales = pl.DataFrame({
'timestamp': [...], # 10M rows
'product_id': [...],
'amount': [...],
'region': [...]
})
# Prepare for partitioning
sales = sales.with_columns([
pl.col('timestamp').dt.year().alias('year'),
pl.col('timestamp').dt.month().alias('month'),
pl.col('region').alias('region')
])
# Write partitioned (100s of files, organized by year/month/region)
sales.write_parquet('sales_data', partition_by=['year', 'month', 'region'])
# Fast queries: only scan relevant partitions
q = pl.scan_parquet('sales_data/**/*.parquet', hive_partitioning=True)
q.filter(
(pl.col('year') == 2024) &
(pl.col('region') == 'US')
).group_by('product_id').agg(pl.col('amount').sum())
Combine two optimizations:
# Only read specific columns from specific partitions
q = (
pl.scan_parquet('data/**/*.parquet', hive_partitioning=True)
.select(['product_id', 'amount']) # Column pruning
.filter(pl.col('year') == 2024) # Partition pruning
)
# Reads minimal data:
# ✓ Only 2024 directories (not 2022, 2023)
# ✓ Only 2 columns (not all 50)
# ✓ Compressed Parquet format
Result: Queries on TBs feel like queries on MBs
Skip partitioning if:
In these cases, flat Parquet files may be simpler and just as fast
Two distinct concepts often confused:
None / null: absence of a value (missing data)
NaN: “Not a Number”, a specific floating-point value (IEEE 754)
Key difference: NaN is a value; None represents no value
When to use each:
None / null:
- Data wasn’t collected
- Not applicable for this record
- Missing in the source system
- Any data type (strings, integers, dates)
NaN:
- Mathematically undefined result (0/0, ∞ - ∞)
- Propagates through numeric calculations
- Specific to floating-point arithmetic
- Only for numeric types
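The distinction is visible in plain Python, with no DataFrame library involved:

```python
import math

nan = float("nan")       # a real float value
print(type(nan))         # <class 'float'>
print(nan == nan)        # False: NaN is unequal even to itself (IEEE 754)
print(math.isnan(nan))   # True

print(None is None)      # True: None is the absence of a value, any type
# None has no numeric behavior; arithmetic with it raises TypeError
```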
Pandas historically conflated None and NaN:
import pandas as pd
import numpy as np
# Creating a Series with None
s = pd.Series([1, None, 3])
print(s)
# 0 1.0
# 1 NaN # None became NaN!
# 2 3.0
# Type was converted to float64
print(s.dtype) # float64
# Can't represent integer + null without nullable types
# Integer columns become float just to handle missing data
Problem: Lost distinction between missing data and invalid calculations
The old Pandas behavior:
import pandas as pd
# Integer column with missing value
data = pd.Series([1, 2, None, 4])
print(data.dtype) # float64 (not int64!)
# String column with missing value
names = pd.Series(['Alice', None, 'Charlie'])
print(names.dtype) # object
# Even boolean columns!
flags = pd.Series([True, False, None])
print(flags.dtype) # object (not bool!)
Why?: NumPy arrays don’t support null for most types, so Pandas used NaN and float64
Pandas 1.0+ introduced proper nullable types:
import pandas as pd
# Use nullable integer
data = pd.Series([1, 2, None, 4], dtype='Int64') # Capital I
print(data.dtype) # Int64
print(data)
# 0 1
# 1 2
# 2 <NA> # Proper null marker
# 3 4
# Nullable boolean
flags = pd.Series([True, False, None], dtype='boolean')
print(flags.dtype) # boolean
# Nullable string
names = pd.Series(['Alice', None, 'Charlie'], dtype='string')
print(names.dtype) # string
<NA> is Pandas’ null sentinel, distinct from NaN
Polars gets null handling right from the start:
import polars as pl
# Integer column with null - no type coercion!
data = pl.Series('values', [1, 2, None, 4])
print(data.dtype) # Int64 (not Float64!)
# Nulls work with any type
names = pl.Series('names', ['Alice', None, 'Charlie'])
print(names.dtype) # Utf8
flags = pl.Series('flags', [True, False, None])
print(flags.dtype) # Boolean
# NULL is a distinct value, not NaN
print(data.null_count()) # 1
Polars: null is null, NaN is NaN, types are preserved
Different approaches across tools:
import pandas as pd
import polars as pl
import numpy as np
# Pandas (checks for both None and NaN)
s = pd.Series([1, None, float('nan'), 4])
print(s.isna()) # [False, True, True, False]
print(s.isnull()) # Same as isna()
# Polars (explicit about nulls)
s = pl.Series([1, None, float('nan'), 4])
print(s.is_null()) # [False, True, False, False]
print(s.is_nan()) # [False, False, True, False]
# NumPy (only checks for NaN)
arr = np.array([1, np.nan, 4])
print(np.isnan(arr)) # [False, True, False]
Polars distinguishes null from NaN; Pandas often doesn’t
Narwhals follows Polars semantics:
import narwhals as nw
import pandas as pd
import polars as pl
def handle_missing(df):
df = nw.from_native(df)
return (
df
.filter(~nw.col('value').is_null()) # Remove nulls
.with_columns(
nw.col('value').fill_null(0) # Replace nulls with 0
)
)
# Works consistently across backends
df_pandas = pd.DataFrame({'value': [1, None, 3]})
df_polars = pl.DataFrame({'value': [1, None, 3]})
handle_missing(df_pandas) # Treats None/NaN as null
handle_missing(df_polars) # Native null handling
Recommendation: Use Narwhals/Polars semantics for clarity
Modern recommendations:
Use Pandas nullable dtypes (Int64, boolean, string)
Don’t mix None and NaN in the same context
Check is_null() / is_nan() separately
The tools we’ve covered are powerful for single-node workloads; Spark takes us to multi-node clusters