Distributed File Systems and HDFS
Modern applications generate data at rates that far exceed what any single machine can store or process. Big data processing systems are therefore built from commodity machines that fail frequently; the software must treat failure as a routine event, not an exception.
A distributed file system (DFS) stores data across many machines while presenting a single, unified namespace to clients.
Google’s 2003 paper described a production file system built from commodity hardware, designed around the assumption that component failures are the norm rather than the exception.
GFS proved that a reliable, large-scale distributed file system could be built cheaply.
The Google File System was a proprietary product; nobody outside Google could use it.
Hadoop is open source (Apache License). Anyone can use, modify, and deploy it.
“Who has heard of Colossus?” (Google’s successor to GFS — almost nobody outside Google)
Open source → ecosystem → community → adoption.
Network is the bottleneck, not CPU.
Old model: copy data to where the code runs. HDFS/MapReduce model: run code where the data already lives.
When possible, the scheduler assigns work to a node that already holds the relevant block. This eliminates most network traffic for data-intensive workloads.
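The scheduling idea above can be sketched in a few lines. This is a toy illustration, not HDFS's actual scheduler: the node and block names are hypothetical, and real placement also weighs rack locality and load.

```python
# Toy sketch of locality-aware task scheduling (illustrative only).
# Which DataNodes hold a replica of each block (replication factor 3).
block_locations = {
    "block_1": {"node_a", "node_b", "node_c"},
    "block_2": {"node_b", "node_d", "node_e"},
}

def pick_node(block_id: str, free_nodes: set) -> tuple:
    """Prefer a free node that already stores the block (data-local);
    otherwise fall back to any free node and pay the network cost."""
    local = block_locations[block_id] & free_nodes
    if local:
        return sorted(local)[0], True      # data-local: no block transfer
    return sorted(free_nodes)[0], False    # remote read over the network

node, is_local = pick_node("block_1", {"node_c", "node_d"})
```

When a replica sits on a free node, the task reads its block from local disk; only when no replica-holding node is free does the block cross the network.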
| Pattern | HDFS handles well | HDFS handles poorly |
|---|---|---|
| File size | Large files (GB–TB) | Millions of small files |
| Access pattern | Sequential reads | Random reads/writes |
| Write pattern | Write-once, append-only | Frequent overwrites |
| Latency | High throughput | Low-latency access |
| Clients | Batch jobs | Interactive queries |
Note
HDFS is optimized for the analytics batch processing use case — reading entire datasets sequentially — not for serving web requests or online transactions.
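The small-files row in the table deserves a number. The NameNode keeps every file and block object in heap memory; the ~150 bytes per object used below is a commonly cited rule of thumb, not an exact figure, and the 128 MB block size is the HDFS default.

```python
# Rough illustration of the small-files problem (assumed constants).
BYTES_PER_OBJECT = 150          # rule-of-thumb heap cost per namespace object
BLOCK_SIZE = 128 * 1024**2      # default HDFS block size: 128 MB

def namenode_heap_bytes(num_files: int, file_size: int) -> int:
    """Each file costs one file object plus one object per block."""
    blocks_per_file = max(1, -(-file_size // BLOCK_SIZE))  # ceiling division
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# The same 1 TB of data, stored two ways:
one_big   = namenode_heap_bytes(1, 1024**4)         # one 1 TB file
many_tiny = namenode_heap_bytes(1024**2, 1024**2)   # ~1M files of 1 MB each
```

Under these assumptions, one large file costs the NameNode about 1.2 MB of heap, while a million tiny files holding the same data cost about 300 MB, roughly 250× more, which is why HDFS handles millions of small files poorly.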
As cloud computing matured, object stores (S3, Azure Blob, GCS) became a popular alternative to HDFS for large-scale storage.
| | HDFS | Cloud Object Store |
|---|---|---|
| Scalability | Bound by NameNode | Virtually unlimited |
| Coupling | Storage + compute together | Storage separate from compute |
| Cost | CapEx (servers) | OpEx (pay per GB) |
| Fault tolerance | 3× replication | Built in (provider-managed) |
| Interface | POSIX-like CLI / Java API | REST API (s3://, abfs://) |
| Data locality | Yes | No (network always) |
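The fault-tolerance row of the table has a direct storage cost. A short sketch of what 3× replication means for capacity planning (the 12 TB-per-node disk size below is a hypothetical example value):

```python
# Raw disk consumed under HDFS's default 3x replication.
REPLICATION_FACTOR = 3

def raw_storage_tb(dataset_tb: float) -> float:
    """Every block is stored on three DataNodes."""
    return dataset_tb * REPLICATION_FACTOR

def datanodes_needed(dataset_tb: float, disk_per_node_tb: float) -> int:
    """Minimum node count just to hold the replicated data."""
    raw = raw_storage_tb(dataset_tb)
    return -(-int(raw) // int(disk_per_node_tb))  # ceiling division

# 100 TB of data -> 300 TB raw -> at least 25 nodes with 12 TB of disk each.
```

An object store hides this overhead behind its per-GB price, which is one reason the cost models in the table are hard to compare line by line.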
Storage and compute are separate, independently scalable systems.
Benefits:
- Pay only for compute while jobs run
- Store data indefinitely at low cost
- Mix and match compute engines
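The "pay only for compute while jobs run" benefit can be made concrete with a back-of-the-envelope comparison. All prices and cluster sizes below are hypothetical, chosen only to show the shape of the calculation:

```python
# Back-of-the-envelope monthly cost comparison (all figures hypothetical).
HOURS_PER_MONTH = 730

def coupled_monthly(nodes: int, node_per_hour: float) -> float:
    """Storage and compute together: the whole cluster stays up 24/7
    because the data lives on the nodes."""
    return nodes * node_per_hour * HOURS_PER_MONTH

def decoupled_monthly(tb_stored: float, per_tb_month: float,
                      nodes: int, node_per_hour: float,
                      job_hours: float) -> float:
    """Object store holds the data; ephemeral nodes run only during jobs."""
    return tb_stored * per_tb_month + nodes * node_per_hour * job_hours

always_on = coupled_monthly(nodes=20, node_per_hour=0.50)
on_demand = decoupled_monthly(tb_stored=100, per_tb_month=23.0,
                              nodes=20, node_per_hour=0.50, job_hours=60)
```

With these assumed numbers the always-on cluster costs $7,300/month while the decoupled setup costs $2,900/month; the gap grows the less often jobs run, since storage is the only always-on cost.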
HDFS exposes a shell interface that mirrors Unix filesystem commands:
```shell
# List files
hdfs dfs -ls /user/hadoop/data/

# Copy a local file to HDFS
hdfs dfs -put localfile.csv /user/hadoop/data/

# Copy from HDFS to local
hdfs dfs -get /user/hadoop/data/output/ ./output/

# Print the contents of a file
hdfs dfs -cat /user/hadoop/data/results/part-00000

# Create a directory
hdfs dfs -mkdir /user/hadoop/staging

# Remove a file or directory recursively
hdfs dfs -rm -r /user/hadoop/staging
```

Full reference: HDFS Shell Commands
HDFS is the storage foundation; dozens of tools have been built on top of it. Live reference: https://hadoopecosystemtable.github.io/
HDFS proved these ideas work at internet scale. Even as cloud object stores have largely supplanted HDFS for new deployments, every modern data lake — S3, Azure Data Lake, GCS — inherits the same core principles: