Design a Distributed File System: System Design 2026

·17 min read
system-designdistributed-file-systemchunked-storagereplicationarchitectureinterview-preparation

A distributed file system is the foundation many other distributed systems quietly stand on - the storage layer under MapReduce and Spark, the persistence underneath Bigtable, and the architectural ancestor of every modern object store. Its difficulty is not that any one operation is complex but that the system must, on commodity hardware that constantly fails, deliver petabytes of capacity, gigabytes-per-second of aggregate throughput, and a coherent file abstraction at the same time. The design that answers all three at once is the Google File System design, and it makes one famous architectural bet: a single master that knows everything but touches nothing.

This walkthrough assumes the 6-step system design framework and applies it at senior-plus depth. It is Part 16 of an extended system design series, closing Tier 5.

Table of Contents

  1. The Problem
  2. Step 1 - Clarify Requirements
  3. Step 2 - Estimate Scale
  4. Step 3 - API and Data Model
  5. Step 4 - High-Level Design
  6. Step 5 - Deep Dive: Master, Chunks, and Replication
  7. Step 6 - Bottlenecks and Trade-offs
  8. Reference Architecture
  9. Common Mistakes in the Interview
  10. Quick Reference
  11. Related Articles

The Problem

We are designing a distributed file system: hierarchical paths, large files split across many commodity machines, with replication that survives constant hardware failure and throughput that scales linearly with cluster size. The canonical examples are the Google File System (GFS), HDFS (its open-source descendant), and the chunked storage layers behind newer systems like Colossus.

The senior framing is that this design is shaped by the workload it serves: large files, mostly sequential reads, appends rather than random writes. That workload lets the system give up the things a POSIX filesystem promises - strong per-byte consistency, fine-grained random writes, small-file efficiency - in exchange for an order of magnitude more throughput and a tractable architecture. The deep dive is largely the story of those deliberate trades.


Step 1 - Clarify Requirements

Functional requirements:

  • Hierarchical namespace: create, read, write/append, delete, snapshot.
  • Large files (GB to TB) with mostly sequential access.
  • Atomic record append - the workhorse for log-style pipelines.
  • Many concurrent writers to the same file via append.

Out of scope (name, then defer): full POSIX semantics, low-latency random small writes, millions of tiny files (a different system - a key-value store - is the right tool for those).

Non-functional requirements:

  • Petabyte-scale capacity, growing without redesign.
  • High aggregate throughput - many GB/s across the cluster.
  • Durability through replication on commodity hardware that fails constantly.
  • Sequential read and append performance prioritised over random IO.
  • Single-datacenter in the canonical design; cross-cluster replication is a separate layer above.

The defining clarifying question: what workload? Large sequential reads, large appends, no small random writes. State this explicitly, because every design choice - chunk size, single master, the consistency model - is calibrated to that workload and would be wrong for a general-purpose filesystem.


Step 2 - Estimate Scale

Total capacity. 100 PB raw, replication factor 3 -> 300 PB of physical storage. With ~10 TB usable per chunkserver, the cluster runs at ~30,000 chunkservers.

Chunk count. At 64 MB per chunk, 100 PB = 100 x 10^15 / 64 x 10^6~1.6 billion chunks of unique data, or ~4.8 billion chunk replicas.

Metadata. Each chunk costs ~64 bytes of master metadata (chunk ID, current locations, version). 1.6 billion x 64 bytes ≈ ~100 GB - fits in the master's memory on a beefy machine. This single number is what makes a single-master design viable.

Throughput. Per chunkserver ~100 MB/s, and the cluster's aggregate ceiling is the chunkserver count - multiple TB/s is realistic at this size.

The shape: enormous total bytes, very few but large operations, and a metadata budget that fits in a single host's RAM.


Step 3 - API and Data Model

The client interacts with two parties:

# Metadata - to the master
open(path)              -> fileHandle
lookup(fileHandle, off) -> [(chunkId, version, [chunkserver locations])]
create(path) / delete(path) / snapshot(path)
 
# Data - directly to chunkservers
read(chunkId, version, offset, length)   -> bytes
write(chunkId, version, offset, bytes)
recordAppend(chunkId, bytes)             -> offset  (atomic)
Held by masterContents
NamespacePath tree (logged + in-memory)
File -> chunksOrdered list of chunk IDs per file
Chunk -> locationsCurrent chunkserver replicas (volatile - reconstructed from heartbeats)
Chunk versionsMonotonic version per chunk - what is current vs stale
Held by chunkserversContents
ChunksThe actual data, as local files
Chunk metadataPer-chunk checksums
HeartbeatsPeriodic report to master of chunks held

A subtle point: the master persists the namespace and file-to-chunks mapping durably (operation log + checkpoints), but chunk-to-location mapping is volatile and rebuilt from chunkserver heartbeats at startup. Locations are authoritative at the chunkservers; the master mirrors them. This is what keeps the metadata small enough to fit in memory and the master log small enough to replay quickly.


Step 4 - High-Level Design

flowchart TD
    Client([Client])
    Master[Master<br/>metadata in memory<br/>+ persistent op log]
    Shadow[Shadow Master<br/>replays op log]
    subgraph CS["Chunkservers - thousands of nodes"]
        C1[(Chunkserver)]
        C2[(Chunkserver)]
        C3[(Chunkserver)]
    end
    Client -->|metadata lookup| Master
    Master -.replicates op log.-> Shadow
    Client -->|read / write chunks directly| C1
    Client --> C2
    Client --> C3
    C1 -.heartbeat: chunks held.-> Master
    C2 -.heartbeat.-> Master
    C3 -.heartbeat.-> Master

Figure 1. The architecture splits cleanly into two tiers: a single master holding all metadata in memory plus a durable operation log, and many commodity chunkservers holding the replicated chunked data. The client consults the master once for metadata, then bypasses it entirely for data - the bandwidth never touches the master, which is what makes a single-master design viable at petabyte scale.

The architecture is two tiers separated by a deliberate split between metadata and data. The client consults the master once for metadata, caches the result, then talks to chunkservers directly for the bytes. The master never sees data. The shadow master receives the operation log and is ready to take over.


Step 5 - Deep Dive: Master, Chunks, and Replication

This is the core. Four ideas cooperate: the single master, fixed-size chunking, primary-driven replicated writes, and the failure-handling that keeps the cluster at full replication.

Part A - Why a single master

A single master sounds like a single point of failure and a scaling ceiling, and it is - deliberately. Three observations make it viable:

  • The master holds metadata only. ~100 GB at petabyte scale fits in one host's RAM.
  • Clients cache chunk locations after the first lookup, so steady-state read traffic does not touch the master. The master sees one query per file open, not per byte.
  • Data flow bypasses the master entirely - reads and writes go straight from client to chunkservers.

So the master's load is bounded by the rate of file opens and writes that need new chunks, not by data throughput. One machine handles that for very large clusters. The simplification is enormous: the namespace, file-to-chunk mapping, and chunk version are authoritative in one place, with one ordering, so the consistency story above this layer becomes "what the master decided". When metadata eventually outgrows one machine, designs federate the master (HDFS Federation) or shard it across namespaces (Colossus) - but the single-master baseline is where the architecture starts and what most candidates underestimate.

Failover: a shadow master replays the persistent operation log and takes over on the primary's death; the cluster pauses on metadata operations during the brief switch but keeps reading and writing existing chunks because the data path does not need the master.

Part B - Chunking

Files are split into fixed-size chunks - typically 64 MB - each with a globally unique 64-bit chunk ID. The size is a calibration:

  • Larger chunks reduce metadata per byte stored - 64 MB vs 4 KB means ~16,000x less metadata.
  • Larger chunks also let one TCP connection do long sustained transfers, amortising overhead and saturating the disk.
  • Smaller chunks would balance load on hot files better and waste less space on small files.

64 MB is a deliberate choice for large-file, sequential-access workloads. The cost is that a 4 KB file still costs one chunk's metadata, and a workload of a million small files puts unnecessary pressure on the master. Small files are the canonical antipattern for this design; aggregate them into a larger container at the application layer, or use a key-value store.

Part C - Replication and the primary-lease write path

Each chunk is replicated to N chunkservers (typically 3), placed on distinct racks so a rack failure cannot lose any chunk. The master grants a lease on each chunk to one replica - the primary - which serialises writes. The lease has a TTL; the primary renews while alive, and a failed primary's lease expires so the master can grant a new one.

A write involves two flows that the design deliberately separates:

sequenceDiagram
    participant C as Client
    participant M as Master
    participant P as Primary Replica
    participant S1 as Secondary 1
    participant S2 as Secondary 2
 
    C->>M: lookup chunk -> primary + secondaries
    M-->>C: P, S1, S2 (current version)
    Note over C,S2: Data flow - pipelined
    C->>S1: push data
    S1->>S2: forward
    S2->>P: forward
    Note over C,P: Control flow - serialise the write
    C->>P: write (commit)
    P->>S1: apply at offset X (chosen by P)
    P->>S2: apply at offset X
    S1-->>P: ack
    S2-->>P: ack
    P-->>C: ack write complete

Figure 2. A write's two flows. Data is pushed through a pipeline of replicas - total bandwidth used is one chunk size, not N times - exploiting network topology. Control flow goes separately to the primary, which picks an order for concurrent writes, tells the secondaries to apply in that order, and acknowledges only when every replica has applied. Splitting data from control is what lets the system use bandwidth optimally while still serialising writes correctly.

Data flow is pipelined: the client pushes the bytes to the nearest replica, which forwards to the next - so total bandwidth used is roughly the chunk size, not N x chunkSize, and the data ride the network topology efficiently. Control flow happens separately: the client tells the primary to commit, the primary picks an order (for concurrent writes), tells the secondaries to apply in that order, and acknowledges only when every replica has applied. Splitting data and control is what lets the system use bandwidth optimally and still serialise the writes correctly.

Part D - Atomic record append

The most important write primitive is not "write to offset X" but recordAppend. The client supplies only the bytes; the primary picks the offset at the file's current end and applies the append atomically across replicas, returning the chosen offset. This is the cornerstone of log-style pipelines - many concurrent writers append records, the system serialises them, every record lands intact at some offset.

The semantics are at-least-once, not exactly-once: a partial failure can cause the same record to appear twice, at different offsets. Consumers must therefore be idempotent - the Part 11 idea applied to log records. The system gives up exactly-once because the price - global agreement on every append - would destroy the throughput the whole design exists to provide. This trade-off is, again, exactly the one Part 3 made.

Part E - Failure handling

The cluster runs on commodity hardware that fails continuously - this is not the exceptional case, it is the steady state.

  • Chunkserver crash. Heartbeats stop; after a timeout the master marks chunks held there as under-replicated and queues background re-replication tasks that copy surviving replicas onto fresh chunkservers, restoring N. The cluster self-heals without client action.
  • Stale replica. Every chunk has a monotonic version number bumped on each write. A chunkserver returning after a partition reports its versions; replicas behind the master's recorded version are recognised as stale and garbage-collected. There is no merge - the master's version is authoritative.
  • Bit rot. Chunkservers checksum each chunk; a checksum failure on a read returns an error to the client and reports the corruption to the master, which triggers re-replication from a clean copy.
  • Master crash. The shadow master replays the operation log and is promoted; clients pause briefly on metadata operations and resume.
  • Network partition. A chunkserver the master cannot reach may still serve cached clients; on partition heal, its version numbers reconcile with the master's, and stale chunks are reclaimed.

Consistency model

GFS makes a deliberate trade. Namespace operations (creating, deleting, renaming files) are strongly consistent through the master. Chunk data has a weaker model:

  • After a successful write, replicas are consistent (all hold the same bytes).
  • If writes by multiple clients overlapped, the result is consistent but undefined: all replicas agree on the bytes, but the order does not match any single client's intent. Applications that care use record append or write-then-rename instead.
  • Partial failures can leave the file in an inconsistent state briefly, repaired by re-replication.
  • Record append gives at-least-once semantics, with duplicates an application concern.

This is far less than POSIX promises and far more than nothing - it is calibrated to the workload it serves.

Multi-cluster / multi-region

A single GFS cluster is a single-datacenter design. Multi-region runs multiple independent clusters with a higher-level replication system copying files between them - the standard pattern is "each region runs its own filesystem, and applications choose where data lives". Federating the master across regions has been attempted in successor systems but is rare; the simpler answer wins almost always.

Evolution path

StageApproach
LaunchOne server with a local filesystem - simple, won't scale
GrowthDistribute manually, replicate files by hand - painful but works for a while
ScaleGFS-style: single master + chunkservers + chunk replication + record append
MassiveFederate or shard the master (HDFS Federation, Colossus); per-region clusters above

Build chunking and the master-versus-data-path separation from day one - retrofitting them means rewriting everything that touches storage. Build chunk version numbers from day one - they are how stale replicas are detected without merge logic. Defer master federation until even metadata outgrows one machine.

Observability

Track total chunk count, under-replicated chunk count (alert - this is the headline durability signal), re-replication queue depth and rate, master metadata RAM usage and growth, master operation log size, heartbeat lag distribution, per-chunkserver read/write throughput, checksum-failure rate (corruption), and stale-replica count. A reasonable SLO: under-replicated count returns to zero within a bounded window after any single-machine failure.


Step 6 - Bottlenecks and Trade-offs

  • Master memory is the architectural ceiling - federate when 100 GB of metadata is no longer enough.
  • Re-replication bandwidth after a failure must complete fast enough to keep up with the next failure - the cluster is in a race.
  • Hot files concentrate read load on the chunkservers holding their replicas; pre-replicate hot files to more replicas, or layer a cache above.
  • Small files are an antipattern - aggregate them at the application level.
  • POSIX semantics are deliberately traded away; designs that need them are designing the wrong system.

Reference Architecture

The pattern this problem teaches, reusable beyond file systems:

A single metadata authority that holds the whole picture in memory, paired with many commodity data nodes that hold replicated, chunked data; clients consult the authority once and then bypass it; the authority orchestrates re-replication and rebalancing in the background.

flowchart LR
    subgraph Meta["Metadata authority - one machine"]
        M[Master / NameNode]
    end
    subgraph Data["Data plane - thousands of nodes"]
        N1[(Node)] --- N2[(Node)] --- N3[(Node)]
    end
    Client([Client]) -->|metadata, once| Meta
    Client -->|all data flow| Data
    Meta -.orchestrate replication.-> Data

Figure 3. The reference architecture for any "metadata authority + data nodes" split: keep the authority small, in-memory, and out of the byte path; let data nodes carry all the bandwidth. The same shape recurs in HDFS NameNode + DataNodes, MapReduce master + workers, Kafka brokers + controller, and many distributed databases - the architecture scales with the data nodes because the authority is deliberately on the side.

The same shape is everywhere "metadata + data nodes" is the right split: HDFS's NameNode plus DataNodes, MapReduce's master plus workers, Kafka brokers plus a controller, many distributed databases' coordinator plus storage. Keep the authority small and out of the bytes, and the architecture scales with the data nodes.


Common Mistakes in the Interview

  • Putting data through the master, defeating the whole reason a single master is acceptable.
  • Assuming POSIX semantics and designing for fine-grained random writes that the workload does not need.
  • Ignoring the small-file antipattern and proposing the design as a general-purpose filesystem.
  • No story for chunkserver failure - re-replication and version numbers must be in the design, not bolted on.
  • Mixing metadata into the data nodes, losing the simple "one source of truth" the architecture relies on.
  • Promising exactly-once record append instead of at-least-once with idempotent consumers.
  • Per-chunk strong consistency at scale - the consistency model is deliberately relaxed for a reason.

Quick Reference

TopicKey Point
Core patternSingle master + chunkservers; chunked, replicated data; clients bypass master
MasterMetadata in memory (~100 GB at PB scale); operation log + shadow for failover
Chunk sizeFixed (typically 64 MB); large for sequential workloads; small files antipattern
ReplicationN replicas across distinct racks; primary lease serialises writes
Write pathPipelined data flow + separate primary-driven control flow
Record appendAt-least-once at a primary-chosen offset; idempotent consumers
Chunk versionMonotonic per chunk; stale replicas detected and garbage-collected
Failure recoveryHeartbeat-driven re-replication restores N silently
ConsistencyStrong on namespace; consistent-but-undefined on overlapping writes; at-least-once append
Multi-regionMultiple independent clusters with replication above; no cross-region master

This is Part 16, the close of Tier 5 and of the extended system design series. The distributed file system makes a fitting end because it inverts almost every default the rest of the series accepted - a single master instead of partitioned ownership, deliberate weakening of consistency instead of strengthening, large sequential operations instead of small fast ones - and shows that the right architecture is always the one calibrated to the workload, not the one that ticks the most "modern distributed systems" boxes. Return to the series roadmap to revisit any pattern.

Ready to ace your interview?

Get 550+ interview questions with detailed answers in our comprehensive PDF guides.

View PDF Guides