Design a Distributed Cache: System Design Interview 2026

·17 min read
system-designconsistent-hashingcachingarchitecturebackendinterview-preparation

A distributed cache is the component every other system in this series quietly depends on - Part 1's URL shortener and Part 2's rate limiter both assumed a fast shared store existed. This post designs that store itself: an in-memory tier, spread across many machines, that absorbs read load so the database does not have to. The problem looks like a hash map until you ask what happens when you add the hundredth node, when one key goes viral, or when a node dies at peak traffic. Those three questions are the interview.

This walkthrough assumes the 6-step system design framework and applies it at senior depth. It is Part 4 of a system design series.

Table of Contents

  1. The Problem
  2. Step 1 - Clarify Requirements
  3. Step 2 - Estimate Scale
  4. Step 3 - API and Data Model
  5. Step 4 - High-Level Design
  6. Step 5 - Deep Dive: Consistent Hashing and Cache Correctness
  7. Step 6 - Bottlenecks and Trade-offs
  8. Reference Architecture
  9. Common Mistakes in the Interview
  10. Quick Reference
  11. Related Articles

The Problem

We are designing a distributed in-memory cache - the infrastructure behind Memcached or Redis Cluster - that many application servers share to hold hot data and offload the database. We are designing the cache system, not using one.

The senior framing rests on a single liberating fact: the cache is not the source of truth. The database is. The cache is allowed to lose data and to be briefly stale, because anything it loses can be re-read from the database. That relaxed correctness requirement is what makes the whole design tractable - and a candidate who tries to make the cache durable and strongly consistent has misread the problem.


Step 1 - Clarify Requirements

Functional requirements:

  • GET, SET, and DELETE by key.
  • Per-key TTL / expiration.
  • Eviction when a node runs out of memory.

Non-functional requirements:

  • Latency: sub-millisecond p99 for GET. The cache exists to be fast; if it is not, it is pointless.
  • Throughput: millions of operations per second across the cluster.
  • Horizontal scalability: adding nodes adds capacity, without flushing the cluster.
  • Availability: losing one node should cost only that node's share of the data, never the whole cache.
  • Data loss is acceptable. It is a cache. No durability is promised.

The key clarifying question: look-aside or read-through? In a look-aside cache (Memcached, the common case) the application owns the cache-aside logic - it checks the cache, falls back to the database, and repopulates. In a read-through cache the cache itself fetches from the database on a miss. We will design a look-aside cache and note where read-through differs. Also confirm there is no persistence requirement: treating the cache as disposable is a feature, not a gap.


Step 2 - Estimate Scale

Unlike most systems in this series, a cache is sized by memory, not by throughput.

Memory and node count. Suppose the hot working set is 10 TB. With 64 GB of usable RAM per node:

  • 10 TB / 64 GB ≈ ~160 nodes; round up to ~200 for headroom and eviction slack.

Throughput per node. At 10 million operations/sec across 200 nodes, each node serves ~50,000 ops/sec - comfortable for an in-memory store, confirming the cluster is memory-bound, not CPU-bound.

Latency budget. An intra-datacenter network round trip is ~0.2-0.5 ms, and an in-memory lookup is microseconds, so a sub-millisecond p99 is achievable - provided the client reaches the right node in one hop. That single constraint - one hop - is what forces the routing design in Step 5.


Step 3 - API and Data Model

The external API is deliberately tiny:

GET(key)              -> value | MISS
SET(key, value, ttl)  -> ok
DELETE(key)           -> ok

Values are opaque bytes with a size cap (commonly ~1 MB) so one huge value cannot distort a node's memory.

Inside each node, three structures cooperate:

StructurePurpose
Hash mapThe key-to-value store itself, for O(1) lookup
Eviction indexAn LRU list or LFU counters, to choose a victim when memory is full
Expiry trackingTTL metadata, swept lazily on access and actively in the background

The hard part is not any single node - it is which node a key lives on, and what happens to that mapping as the cluster changes. That is the deep dive.


Step 4 - High-Level Design

Clients reach cache nodes through a topology-aware routing layer - either a smart client library linked into every application server, or a thin proxy tier. A separate membership service tracks which nodes are alive and publishes one authoritative view of the ring.

flowchart TD
    App[Application Servers] --> Router[Smart Client / Proxy<br/>consistent-hashing router]
    Router --> N1[(Cache Node A)]
    Router --> N2[(Cache Node B)]
    Router --> N3[(Cache Node C)]
    Member[Membership Service<br/>authoritative ring view] -.topology.-> Router
    N1 -.heartbeat.-> Member
    N2 -.heartbeat.-> Member
    N3 -.heartbeat.-> Member
    App -.miss falls back to.-> DB[(Database - source of truth)]

Figure 1. The cache cluster as seen from a client. The smart-client or proxy turns a key into a node in one hop using the ring topology published by the membership service, and the application falls back to the database on miss. The membership service is small but vital - it is the single source of truth for who owns what, which is the answer to the split-brain routing bug.

The router turns a key into a node in one hop. The membership service exists so that every client agrees on the same topology - a detail that, skipped, produces the split-brain routing bug discussed below.


Step 5 - Deep Dive: Consistent Hashing and Cache Correctness

This is the core. Two themes carry it: how keys are partitioned across nodes so the cluster can grow without flushing, and how the cache stays useful and correct under hot keys, expiry storms, and node failure.

Part A - Why not modulo

The obvious partitioning is node = hash(key) % N. It works until N changes. Add one node to a 100-node cluster and the divisor becomes 101, so hash(key) % 100 and hash(key) % 101 disagree for roughly 99% of keys. Almost every key now maps to a different node, every lookup misses, and the entire read load slams the database at once. A routine capacity increase becomes an outage. Modulo hashing makes the cluster size effectively un-changeable.

Part B - Consistent hashing

Consistent hashing fixes this by mapping both nodes and keys onto a hash ring - a circular space, say 0 to 2^32. Each key is owned by the first node encountered moving clockwise from the key's position.

When a node is added, it inserts itself at one point on the ring and takes over only the keys between itself and its predecessor. When a node is removed, only its keys move, to its successor. Either way, only about 1/N of keys remap - adding a node to a 100-node ring disturbs ~1% of keys, not 99%. The cluster becomes freely resizable.

lookup(key):
    pos  = hash(key)
    node = first node clockwise from pos on the ring
    return node

Part C - Virtual nodes

Plain consistent hashing has two weaknesses with real (small) node counts: the random ring positions distribute unevenly, so some nodes own far more keyspace than others; and when a node dies, all of its load lands on its single successor, which can then topple.

Virtual nodes solve both. Each physical node is assigned many ring positions - typically 100-200 - so the ring interleaves them:

flowchart LR
    A1((A)) --- B1((B)) --- C1((C)) --- A2((A)) --- B2((B)) --- C2((C)) --- A3((A)) --- W(("...wraps to A1"))

Figure 2. Virtual nodes interleave each physical node many times around the hash ring. With three physical nodes (A, B, C) appearing at multiple ring positions each, load distributes evenly and the data owned by any failed node fans out to many successors instead of crashing onto one - which is why vnodes turn a node loss from a thundering herd into a smooth handover.

Now load distributes evenly, and when node B fails its many virtual positions are absorbed by many different successors, spreading the load instead of concentrating it. The trade-off is more ring metadata to track - cheap, and well worth it. More virtual nodes means smoother balance at the cost of a larger ring map.

Part D - Replication

With no replication, losing a node loses its ~1/N of the cached data, and those keys all miss until repopulated. Whether that is acceptable is a quantitative decision: a node loss produces an instant 1/N spike in database read traffic. If the database can absorb that spike, skip replication - this is exactly why Memcached classically does not replicate. If it cannot, replicate each key to the next R nodes clockwise on the ring.

Replication factorNode-loss behaviourMemory cost
R = 1 (none)1/N of keys miss; DB absorbs the spike1x
R = 2A replica serves the lost keys; no miss spike2x

Replication doubles memory to buy availability. For a pure cache backed by a healthy database, R = 1 is often the right, deliberate choice; for a cache whose miss storm would itself take down the database, R = 2 is mandatory. Name the trade-off rather than defaulting.

Cache correctness: consistency model

The cache is eventually consistent with the database, and that is by design. The well-known hazard is the cache-aside write race: if a writer updates the database and then updates the cache, a concurrent reader can repopulate the old value in between, leaving the cache permanently stale. The standard fix is to delete the key on write, never update it - the next read repopulates from the database - and to keep a short TTL as a backstop so any missed invalidation self-heals. Across cache replicas, a SET reaches the primary synchronously and replicas asynchronously, so a replica read can be briefly stale. For a cache, all of this is acceptable; the point is to state it, not to pretend the cache is authoritative.

Cache correctness: hot keys

Consistent hashing balances keys across nodes - it does nothing for load within a single key. One viral key sends all of its traffic to one node, and that node hotspots no matter how good the ring is. Three mitigations stack:

  • An L1 client cache. A tiny in-process LRU on each application server fronts the distributed L2 cache and absorbs the few hottest keys with zero network hops.
  • Hot-key replication. Detect hot keys and replicate them across several nodes; clients read from a random replica, fanning the load out.
  • Key-splitting. Store the value under K suffixed keys (key#1 ... key#K) so it deliberately lands on K different nodes.

Cache correctness: the stampede

When a popular key expires, every concurrent request misses simultaneously and they all hit the database together - a cache stampede.

sequenceDiagram
    participant C1 as Client 1
    participant C2 as Client 2
    participant CN as Client N
    participant Cache as Cache shard
    participant DB as Database
 
    Note over Cache: hot key K just expired
    C1->>Cache: GET K
    Cache-->>C1: miss
    C2->>Cache: GET K
    Cache-->>C2: miss
    CN->>Cache: GET K
    Cache-->>CN: miss
    Note over Cache: single-flight - elect one leader for K, others wait
    C1->>DB: read K (leader)
    DB-->>C1: value
    C1->>Cache: SET K = value (TTL)
    Cache-->>C2: value (woken)
    Cache-->>CN: value (woken)
    Note over C1,CN: 1 DB query, not N - stampede absorbed

Figure 3. A cache stampede neutralised by single-flight. N concurrent clients all miss on a just-expired key; instead of N parallel database reads, one client is elected leader and the rest wait for its result. This is what stops a single popular key's TTL from cascading into a database outage.

Three defences:

  • Request coalescing (single-flight). On each node, only the first miss for a key recomputes it; concurrent requests for the same key wait for that one result.
  • Probabilistic early expiration. Refresh the key at a random moment slightly before its TTL, so its expiry is staggered across requests instead of being a cliff everyone falls off at once.
  • Stale-while-revalidate. The first miss serves the last known (stale) value while one background task refreshes it.

Failure modes

  • Node down. The membership service detects the lost heartbeat, removes the node's virtual positions from the ring, and publishes the new topology; clients re-route. With R = 1, expect a 1/N miss spike; with R = 2, a replica covers it seamlessly.
  • Split-brain routing. If each client independently decides a node is dead, clients disagree on the ring, and the same key gets written to two different nodes - silent inconsistency. This is why membership is owned by one authoritative service, not negotiated per client.
  • Memory pressure. Heavy eviction shrinks the effective working set, the hit ratio falls, and database load climbs - which is why hit ratio is the headline metric to watch.

Eviction

When a node is full it must evict. LRU (least recently used) is the default: cheap, and well matched to temporal locality. LFU (least frequently used) wins when a stable set of keys is persistently popular but accessed in bursts - a large scan can evict a genuinely hot key under LRU - but LFU needs frequency counters with aging, so a key that was hot last week decays rather than squatting forever. TTL expiry runs lazily on access and actively via a background sampler, because lazy-only expiry leaves dead keys occupying memory until something happens to touch them.

stateDiagram-v2
    [*] --> Cached: SET / repopulate on miss
    Cached --> Cached: GET (refresh recency / frequency)
    Cached --> Expired: TTL elapsed
    Cached --> Evicted: memory full (LRU / LFU victim)
    Cached --> [*]: DELETE on write
    Expired --> [*]
    Evicted --> [*]

Figure 4. The state machine of a single cache entry. The two outbound transitions to the terminal state on the right - DELETE on write and Evicted under memory pressure - explain why "cache invalidation is hard" reduces here to "delete the key on write": entries cannot transition back from terminal, so there is no merge or stale-data race to worry about.

Multi-region

A cache is regional. Cross-region latency destroys the sub-millisecond budget, so each region runs its own independent cache cluster, populated from that region's database replica. The cross-region question is invalidation: a write in region A must not leave region B serving a stale value. Use TTL for the bulk of keys - simple, self-healing, briefly stale - and a cross-region invalidation stream (a pub/sub topic carrying delete events) only for the keys whose staleness genuinely matters.

Evolution path

StageApproach
LaunchA single cache node, or an in-process cache
GrowthA handful of nodes with client-side consistent hashing
ScaleMany nodes with virtual nodes, an authoritative membership service, hot-shard replication, an L1 client cache, per-region clusters

Adopt consistent hashing from day one - switching off modulo later means a full cache flush - and design the client behind a routing abstraction. Defer replication, the L1 tier, and multi-region until load demands them.

Observability

The headline metric is hit ratio; a falling hit ratio directly forecasts rising database load. Track also p99 GET/SET latency, evictions per second, per-node memory utilisation, per-key request rate (for hot-key detection), and the rate of membership changes. Alert on a hit-ratio drop and on any single node hotspotting.


Step 6 - Bottlenecks and Trade-offs

  • Hot keys are the bottleneck consistent hashing cannot fix - it balances keys, not per-key load - so an L1 cache or hot-key replication is required separately.
  • A node loss is a 1/N database miss spike; the database must either survive it or the cache must replicate.
  • Memory is the ceiling, and eviction quality decides how much of that memory is actually useful.
  • Expiry stampedes turn a popular key's TTL into a synchronised database flood unless coalescing or jittered expiry is in place.
  • Membership consistency is non-negotiable: clients that disagree on the ring corrupt the cache silently.

Reference Architecture

The pattern this problem teaches, reusable well beyond caching:

A horizontally partitioned stateful tier addressed by consistent hashing with virtual nodes, fronted by a small per-client L1, backed by an authoritative store it is always allowed to lose data to.

flowchart LR
    subgraph Client["Per client"]
        L1[L1 in-process cache]
    end
    subgraph Ring["Consistent-hashing ring"]
        direction TB
        R1[(Node + vnodes)]
        R2[(Node + vnodes)]
        R3[(Node + vnodes)]
    end
    L1 -->|miss| Ring
    Ring -->|miss| Auth[(Authoritative store)]
    Member[Membership service] -.one ring view.-> Client

Figure 5. The reference architecture distils the design into three layers: a per-client L1 cache absorbs the hottest keys at zero network cost, the consistent-hashing ring with vnodes provides horizontal scale, and an authoritative store underneath catches everything the cache cannot. The membership service - shared by all clients - keeps the ring view consistent and is what prevents split-brain routing.

The same shape recurs in any partitioned stateful system: sharded session stores, the sharded counter store behind Part 2's rate limiter, sharded search indexes. Whenever state must be split across nodes and the node set will change over time, consistent hashing with virtual nodes is the default partitioning scheme.


Common Mistakes in the Interview

  • Partitioning with hash(key) % N, which turns every cluster resize into a near-total cache flush.
  • Omitting virtual nodes, leaving load imbalanced and a node loss dumping entirely onto one successor.
  • Treating consistent hashing as a hot-key fix - it balances keys, never load within a key.
  • No cache-stampede story, so a popular key's expiry floods the database.
  • Letting each client decide membership independently, producing split-brain routing and silent inconsistency.
  • Designing the cache as durable and strongly consistent, ignoring that the database is the source of truth.
  • Picking LRU or LFU by reflex, without reasoning from the access pattern.

Quick Reference

TopicKey Point
Core patternConsistent hashing on a ring; a key is owned by the next node clockwise
Modulo hashinghash(key) % N remaps ~99% of keys on resize - never use it
Virtual nodes100-200 ring positions per node; even load, graceful node loss
ReplicationR = 1 lets the DB absorb a 1/N miss spike; R = 2 avoids it at 2x memory
ConsistencyEventually consistent by design; delete keys on write, short TTL backstop
Hot keysL1 client cache, hot-key replication, or key-splitting
StampedeRequest coalescing, probabilistic early expiration, stale-while-revalidate
EvictionLRU by default; LFU (with aging) for bursty, persistently popular keys
MembershipOne authoritative service publishes the ring - never per-client
Multi-regionIndependent regional clusters; TTL plus a cross-region invalidation stream

This is Part 4 of a 12-part system design series where each post solves one problem around one core pattern. Next: Design a News Feed.

Ready to ace your interview?

Get 550+ interview questions with detailed answers in our comprehensive PDF guides.

View PDF Guides