Why does a GFS-style file system use one master instead of many?

The single master simplifies the design enormously and works because the master only holds metadata, never data. All file metadata - the namespace, file-to-chunk mappings, and chunk locations - fits in the master's memory at sizes well into the petabyte range, and clients cache chunk locations so the master sits out the data path. A shadow master receives the operation log and takes over on failure. Federating or sharding the master comes later, when even metadata exceeds one machine.

Why are files broken into fixed-size chunks?

Fixed-size chunks - typically 64 MB - reduce metadata per file dramatically, allow large parallel reads and writes across many chunkservers, and amortise TCP and connection overhead with a single sustained transfer per chunk. Each chunk gets a globally unique 64-bit ID and is replicated to several chunkservers. The cost is that many small files waste an entire chunk's metadata each, which is why GFS-style systems are a poor fit for workloads of millions of tiny files.

How does a write actually flow through chunkservers?

Two flows happen separately. The client pushes data to all chunk replicas in a pipeline - each replica forwards to the next - which uses network topology efficiently. Then the client tells the chunk's primary replica, designated by a lease from the master, to commit; the primary picks an order for concurrent writes, tells the secondaries to apply in that order, and acknowledges. Data flow uses bandwidth, control flow gets the ordering right.

What is an atomic record append?

An atomic record append is a write where the client specifies only the data, not the offset; the primary replica picks the offset at the file's current end and applies the write atomically across replicas. The result is at-least-once: the append definitely happens at some specific offset, but a partial failure can cause the same record to appear more than once, so consuming applications need to be idempotent. It is the cornerstone primitive for log-style workloads.

How does the system recover from a chunkserver failure?

The master detects a failed chunkserver through missed heartbeats, marks the chunks held there as under-replicated, and queues background re-replication tasks that copy surviving replicas of those chunks to other chunkservers. Replicas carry a version number so stale data from a returning chunkserver is recognised and discarded. The cluster stays at full replication factor without any client action.

What consistency does a GFS-style file system provide?

File-namespace operations are strongly consistent through the single master. Chunk data after a successful write is the same on all replicas, but concurrent overlapping writes can leave the data 'consistent but undefined' - all replicas agree on the bytes but the order may not match any single client's intent. Atomic record appends provide at-least-once semantics. POSIX-style guarantees are explicitly traded away in exchange for the throughput the design buys.

Design a Distributed File System: System Design 2026

A distributed file system is the foundation many other distributed systems quietly stand on - the storage layer under MapReduce and Spark, the persistence underneath Bigtable, and the architectural ancestor of every modern object store. Its difficulty is not that any one operation is complex but that the system must, on commodity hardware that constantly fails, deliver petabytes of capacity, gigabytes-per-second of aggregate throughput, and a coherent file abstraction at the same time. The design that answers all three at once is the Google File System design, and it makes one famous architectural bet: a single master that knows everything but touches nothing.

This walkthrough assumes the 6-step system design framework and applies it at senior-plus depth. It is Part 16 of an extended system design series, closing Tier 5.

The Problem
Step 1 - Clarify Requirements
Step 2 - Estimate Scale
Step 3 - API and Data Model
Step 4 - High-Level Design
Step 5 - Deep Dive: Master, Chunks, and Replication
Step 6 - Bottlenecks and Trade-offs
Reference Architecture
Common Mistakes in the Interview
Quick Reference
Related Articles

The Problem

We are designing a distributed file system: hierarchical paths, large files split across many commodity machines, with replication that survives constant hardware failure and throughput that scales linearly with cluster size. The canonical examples are the Google File System (GFS), HDFS (its open-source descendant), and the chunked storage layers behind newer systems like Colossus.

The senior framing is that this design is shaped by the workload it serves: large files, mostly sequential reads, appends rather than random writes. That workload lets the system give up the things a POSIX filesystem promises - strong per-byte consistency, fine-grained random writes, small-file efficiency - in exchange for an order of magnitude more throughput and a tractable architecture. The deep dive is largely the story of those deliberate trades.

Step 1 - Clarify Requirements

Functional requirements:

Hierarchical namespace: create, read, write/append, delete, snapshot.
Large files (GB to TB) with mostly sequential access.
Atomic record append - the workhorse for log-style pipelines.
Many concurrent writers to the same file via append.

Out of scope (name, then defer): full POSIX semantics, low-latency random small writes, millions of tiny files (a different system - a key-value store - is the right tool for those).

Non-functional requirements:

Petabyte-scale capacity, growing without redesign.
High aggregate throughput - many GB/s across the cluster.
Durability through replication on commodity hardware that fails constantly.
Sequential read and append performance prioritised over random IO.
Single-datacenter in the canonical design; cross-cluster replication is a separate layer above.

The defining clarifying question: what workload? Large sequential reads, large appends, no small random writes. State this explicitly, because every design choice - chunk size, single master, the consistency model - is calibrated to that workload and would be wrong for a general-purpose filesystem.

Step 2 - Estimate Scale

Total capacity. 100 PB raw, replication factor 3 -> 300 PB of physical storage. With ~10 TB usable per chunkserver, the cluster runs at ~30,000 chunkservers.

Chunk count. At 64 MB per chunk, 100 PB = 100 x 10^15 / 64 x 10^6 ≈ ~1.6 billion chunks of unique data, or ~4.8 billion chunk replicas.

Metadata. Each chunk costs ~64 bytes of master metadata (chunk ID, current locations, version). 1.6 billion x 64 bytes ≈ ~100 GB - fits in the master's memory on a beefy machine. This single number is what makes a single-master design viable.

Throughput. Per chunkserver ~100 MB/s, and the cluster's aggregate ceiling is the chunkserver count - multiple TB/s is realistic at this size.

The shape: enormous total bytes, very few but large operations, and a metadata budget that fits in a single host's RAM.

Step 3 - API and Data Model

The client interacts with two parties:

# Metadata - to the master
open(path)              -> fileHandle
lookup(fileHandle, off) -> [(chunkId, version, [chunkserver locations])]
create(path) / delete(path) / snapshot(path)
 
# Data - directly to chunkservers
read(chunkId, version, offset, length)   -> bytes
write(chunkId, version, offset, bytes)
recordAppend(chunkId, bytes)             -> offset  (atomic)

Held by master	Contents
Namespace	Path tree (logged + in-memory)
File -> chunks	Ordered list of chunk IDs per file
Chunk -> locations	Current chunkserver replicas (volatile - reconstructed from heartbeats)
Chunk versions	Monotonic version per chunk - what is current vs stale

Held by chunkservers	Contents
Chunks	The actual data, as local files
Chunk metadata	Per-chunk checksums
Heartbeats	Periodic report to master of chunks held

A subtle point: the master persists the namespace and file-to-chunks mapping durably (operation log + checkpoints), but chunk-to-location mapping is volatile and rebuilt from chunkserver heartbeats at startup. Locations are authoritative at the chunkservers; the master mirrors them. This is what keeps the metadata small enough to fit in memory and the master log small enough to replay quickly.

Step 4 - High-Level Design

flowchart TD
    Client([Client])
    Master[Master<br/>metadata in memory<br/>+ persistent op log]
    Shadow[Shadow Master<br/>replays op log]
    subgraph CS["Chunkservers - thousands of nodes"]
        C1[(Chunkserver)]
        C2[(Chunkserver)]
        C3[(Chunkserver)]
    end
    Client -->|metadata lookup| Master
    Master -.replicates op log.-> Shadow
    Client -->|read / write chunks directly| C1
    Client --> C2
    Client --> C3
    C1 -.heartbeat: chunks held.-> Master
    C2 -.heartbeat.-> Master
    C3 -.heartbeat.-> Master

Figure 1. The architecture splits cleanly into two tiers: a single master holding all metadata in memory plus a durable operation log, and many commodity chunkservers holding the replicated chunked data. The client consults the master once for metadata, then bypasses it entirely for data - the bandwidth never touches the master, which is what makes a single-master design viable at petabyte scale.

The architecture is two tiers separated by a deliberate split between metadata and data. The client consults the master once for metadata, caches the result, then talks to chunkservers directly for the bytes. The master never sees data. The shadow master receives the operation log and is ready to take over.

Step 5 - Deep Dive: Master, Chunks, and Replication

This is the core. Four ideas cooperate: the single master, fixed-size chunking, primary-driven replicated writes, and the failure-handling that keeps the cluster at full replication.

Part A - Why a single master

A single master sounds like a single point of failure and a scaling ceiling, and it is - deliberately. Three observations make it viable:

The master holds metadata only. ~100 GB at petabyte scale fits in one host's RAM.
Clients cache chunk locations after the first lookup, so steady-state read traffic does not touch the master. The master sees one query per file open, not per byte.
Data flow bypasses the master entirely - reads and writes go straight from client to chunkservers.

So the master's load is bounded by the rate of file opens and writes that need new chunks, not by data throughput. One machine handles that for very large clusters. The simplification is enormous: the namespace, file-to-chunk mapping, and chunk version are authoritative in one place, with one ordering, so the consistency story above this layer becomes "what the master decided". When metadata eventually outgrows one machine, designs federate the master (HDFS Federation) or shard it across namespaces (Colossus) - but the single-master baseline is where the architecture starts and what most candidates underestimate.

Failover: a shadow master replays the persistent operation log and takes over on the primary's death; the cluster pauses on metadata operations during the brief switch but keeps reading and writing existing chunks because the data path does not need the master.

Part B - Chunking

Files are split into fixed-size chunks - typically 64 MB - each with a globally unique 64-bit chunk ID. The size is a calibration:

Larger chunks reduce metadata per byte stored - 64 MB vs 4 KB means ~16,000x less metadata.
Larger chunks also let one TCP connection do long sustained transfers, amortising overhead and saturating the disk.
Smaller chunks would balance load on hot files better and waste less space on small files.

64 MB is a deliberate choice for large-file, sequential-access workloads. The cost is that a 4 KB file still costs one chunk's metadata, and a workload of a million small files puts unnecessary pressure on the master. Small files are the canonical antipattern for this design; aggregate them into a larger container at the application layer, or use a key-value store.

Part C - Replication and the primary-lease write path

Each chunk is replicated to N chunkservers (typically 3), placed on distinct racks so a rack failure cannot lose any chunk. The master grants a lease on each chunk to one replica - the primary - which serialises writes. The lease has a TTL; the primary renews while alive, and a failed primary's lease expires so the master can grant a new one.

A write involves two flows that the design deliberately separates:

sequenceDiagram
    participant C as Client
    participant M as Master
    participant P as Primary Replica
    participant S1 as Secondary 1
    participant S2 as Secondary 2
 
    C->>M: lookup chunk -> primary + secondaries
    M-->>C: P, S1, S2 (current version)
    Note over C,S2: Data flow - pipelined
    C->>S1: push data
    S1->>S2: forward
    S2->>P: forward
    Note over C,P: Control flow - serialise the write
    C->>P: write (commit)
    P->>S1: apply at offset X (chosen by P)
    P->>S2: apply at offset X
    S1-->>P: ack
    S2-->>P: ack
    P-->>C: ack write complete

Figure 2. A write's two flows. Data is pushed through a pipeline of replicas - total bandwidth used is one chunk size, not N times - exploiting network topology. Control flow goes separately to the primary, which picks an order for concurrent writes, tells the secondaries to apply in that order, and acknowledges only when every replica has applied. Splitting data from control is what lets the system use bandwidth optimally while still serialising writes correctly.

Data flow is pipelined: the client pushes the bytes to the nearest replica, which forwards to the next - so total bandwidth used is roughly the chunk size, not N x chunkSize, and the data ride the network topology efficiently. Control flow happens separately: the client tells the primary to commit, the primary picks an order (for concurrent writes), tells the secondaries to apply in that order, and acknowledges only when every replica has applied. Splitting data and control is what lets the system use bandwidth optimally and still serialise the writes correctly.

Part D - Atomic record append

The most important write primitive is not "write to offset X" but recordAppend. The client supplies only the bytes; the primary picks the offset at the file's current end and applies the append atomically across replicas, returning the chosen offset. This is the cornerstone of log-style pipelines - many concurrent writers append records, the system serialises them, every record lands intact at some offset.

The semantics are at-least-once, not exactly-once: a partial failure can cause the same record to appear twice, at different offsets. Consumers must therefore be idempotent - the Part 11 idea applied to log records. The system gives up exactly-once because the price - global agreement on every append - would destroy the throughput the whole design exists to provide. This trade-off is, again, exactly the one Part 3 made.

Part E - Failure handling

The cluster runs on commodity hardware that fails continuously - this is not the exceptional case, it is the steady state.

Chunkserver crash. Heartbeats stop; after a timeout the master marks chunks held there as under-replicated and queues background re-replication tasks that copy surviving replicas onto fresh chunkservers, restoring N. The cluster self-heals without client action.
Stale replica. Every chunk has a monotonic version number bumped on each write. A chunkserver returning after a partition reports its versions; replicas behind the master's recorded version are recognised as stale and garbage-collected. There is no merge - the master's version is authoritative.
Bit rot. Chunkservers checksum each chunk; a checksum failure on a read returns an error to the client and reports the corruption to the master, which triggers re-replication from a clean copy.
Master crash. The shadow master replays the operation log and is promoted; clients pause briefly on metadata operations and resume.
Network partition. A chunkserver the master cannot reach may still serve cached clients; on partition heal, its version numbers reconcile with the master's, and stale chunks are reclaimed.

Consistency model

GFS makes a deliberate trade. Namespace operations (creating, deleting, renaming files) are strongly consistent through the master. Chunk data has a weaker model:

After a successful write, replicas are consistent (all hold the same bytes).
If writes by multiple clients overlapped, the result is consistent but undefined: all replicas agree on the bytes, but the order does not match any single client's intent. Applications that care use record append or write-then-rename instead.
Partial failures can leave the file in an inconsistent state briefly, repaired by re-replication.
Record append gives at-least-once semantics, with duplicates an application concern.

This is far less than POSIX promises and far more than nothing - it is calibrated to the workload it serves.

Multi-cluster / multi-region

A single GFS cluster is a single-datacenter design. Multi-region runs multiple independent clusters with a higher-level replication system copying files between them - the standard pattern is "each region runs its own filesystem, and applications choose where data lives". Federating the master across regions has been attempted in successor systems but is rare; the simpler answer wins almost always.

Evolution path

Stage	Approach
Launch	One server with a local filesystem - simple, won't scale
Growth	Distribute manually, replicate files by hand - painful but works for a while
Scale	GFS-style: single master + chunkservers + chunk replication + record append
Massive	Federate or shard the master (HDFS Federation, Colossus); per-region clusters above

Build chunking and the master-versus-data-path separation from day one - retrofitting them means rewriting everything that touches storage. Build chunk version numbers from day one - they are how stale replicas are detected without merge logic. Defer master federation until even metadata outgrows one machine.

Observability

Track total chunk count, under-replicated chunk count (alert - this is the headline durability signal), re-replication queue depth and rate, master metadata RAM usage and growth, master operation log size, heartbeat lag distribution, per-chunkserver read/write throughput, checksum-failure rate (corruption), and stale-replica count. A reasonable SLO: under-replicated count returns to zero within a bounded window after any single-machine failure.

Step 6 - Bottlenecks and Trade-offs

Master memory is the architectural ceiling - federate when 100 GB of metadata is no longer enough.
Re-replication bandwidth after a failure must complete fast enough to keep up with the next failure - the cluster is in a race.
Hot files concentrate read load on the chunkservers holding their replicas; pre-replicate hot files to more replicas, or layer a cache above.
Small files are an antipattern - aggregate them at the application level.
POSIX semantics are deliberately traded away; designs that need them are designing the wrong system.

Reference Architecture

The pattern this problem teaches, reusable beyond file systems:

A single metadata authority that holds the whole picture in memory, paired with many commodity data nodes that hold replicated, chunked data; clients consult the authority once and then bypass it; the authority orchestrates re-replication and rebalancing in the background.

flowchart LR
    subgraph Meta["Metadata authority - one machine"]
        M[Master / NameNode]
    end
    subgraph Data["Data plane - thousands of nodes"]
        N1[(Node)] --- N2[(Node)] --- N3[(Node)]
    end
    Client([Client]) -->|metadata, once| Meta
    Client -->|all data flow| Data
    Meta -.orchestrate replication.-> Data

Figure 3. The reference architecture for any "metadata authority + data nodes" split: keep the authority small, in-memory, and out of the byte path; let data nodes carry all the bandwidth. The same shape recurs in HDFS NameNode + DataNodes, MapReduce master + workers, Kafka brokers + controller, and many distributed databases - the architecture scales with the data nodes because the authority is deliberately on the side.

The same shape is everywhere "metadata + data nodes" is the right split: HDFS's NameNode plus DataNodes, MapReduce's master plus workers, Kafka brokers plus a controller, many distributed databases' coordinator plus storage. Keep the authority small and out of the bytes, and the architecture scales with the data nodes.

Common Mistakes in the Interview

Putting data through the master, defeating the whole reason a single master is acceptable.
Assuming POSIX semantics and designing for fine-grained random writes that the workload does not need.
Ignoring the small-file antipattern and proposing the design as a general-purpose filesystem.
No story for chunkserver failure - re-replication and version numbers must be in the design, not bolted on.
Mixing metadata into the data nodes, losing the simple "one source of truth" the architecture relies on.
Promising exactly-once record append instead of at-least-once with idempotent consumers.
Per-chunk strong consistency at scale - the consistency model is deliberately relaxed for a reason.

Quick Reference

Topic	Key Point
Core pattern	Single master + chunkservers; chunked, replicated data; clients bypass master
Master	Metadata in memory (~100 GB at PB scale); operation log + shadow for failover
Chunk size	Fixed (typically 64 MB); large for sequential workloads; small files antipattern
Replication	N replicas across distinct racks; primary lease serialises writes
Write path	Pipelined data flow + separate primary-driven control flow
Record append	At-least-once at a primary-chosen offset; idempotent consumers
Chunk version	Monotonic per chunk; stale replicas detected and garbage-collected
Failure recovery	Heartbeat-driven re-replication restores N silently
Consistency	Strong on namespace; consistent-but-undefined on overlapping writes; at-least-once append
Multi-region	Multiple independent clusters with replication above; no cross-region master

System Design Interview Problems: A Senior's Roadmap - the full series index and pattern library.
System Design Interview Guide: The 6-Step Framework - the method this walkthrough applies.
Design Video Streaming - Part 10; the chunked-blob storage that sits a layer above this kind of file system.
Design a Unique ID Generator - Part 13; the global chunk IDs that index every replica.
Design a Notification Service - Part 3; at-least-once delivery with idempotent consumers, exactly the contract record append offers.

This is Part 16, the close of Tier 5 and of the extended system design series. The distributed file system makes a fitting end because it inverts almost every default the rest of the series accepted - a single master instead of partitioned ownership, deliberate weakening of consistency instead of strengthening, large sequential operations instead of small fast ones - and shows that the right architecture is always the one calibrated to the workload, not the one that ticks the most "modern distributed systems" boxes. Return to the series roadmap to revisit any pattern.

Design a Distributed File System: System Design 2026

Table of Contents

The Problem

Step 1 - Clarify Requirements

Step 2 - Estimate Scale

Step 3 - API and Data Model

Step 4 - High-Level Design

Step 5 - Deep Dive: Master, Chunks, and Replication

Part A - Why a single master

Part B - Chunking

Part C - Replication and the primary-lease write path

Part D - Atomic record append

Part E - Failure handling

Consistency model

Multi-cluster / multi-region

Evolution path

Observability

Step 6 - Bottlenecks and Trade-offs

Reference Architecture

Common Mistakes in the Interview

Quick Reference

Ready to ace your interview?

Table of Contents

The Problem

Step 1 - Clarify Requirements

Step 2 - Estimate Scale

Step 3 - API and Data Model

Step 4 - High-Level Design

Step 5 - Deep Dive: Master, Chunks, and Replication

Part A - Why a single master

Part B - Chunking

Part C - Replication and the primary-lease write path

Part D - Atomic record append

Part E - Failure handling

Consistency model

Multi-cluster / multi-region

Evolution path

Observability

Step 6 - Bottlenecks and Trade-offs

Reference Architecture

Common Mistakes in the Interview

Quick Reference

Related Articles

Ready to ace your interview?