What is consistent hashing and why does a distributed cache need it?

Consistent hashing maps both cache nodes and keys onto a hash ring, where each key is owned by the next node clockwise. Its value is what happens when the cluster changes size: adding or removing a node remaps only about 1/N of the keys, instead of the near-total remap that plain hash(key) modulo N causes. That difference is the gap between a cluster resize being routine and being a database-flooding cache flush.

What are virtual nodes in consistent hashing?

A virtual node is one of many ring positions assigned to a single physical node - typically 100 to 200 per node. With only one position each, the ring is uneven and removing a node dumps its entire load onto one successor. Spreading each physical node across many virtual positions smooths key distribution and ensures that when a node leaves, its load is absorbed by many nodes rather than one.

Should you use LRU or LFU cache eviction?

LRU evicts the least recently used key and is the sensible default - it is cheap and matches workloads with strong temporal locality. LFU evicts the least frequently used key and is better when a stable set of keys is persistently popular but accessed in bursts, since a large scan can wrongly evict a hot key under LRU. LFU needs frequency counters with aging so once-popular keys decay over time.

How do you prevent a cache stampede?

A cache stampede happens when a popular key expires and thousands of concurrent requests all miss and hit the database at once. Prevent it with request coalescing, where one request recomputes the key while the rest wait for its result; with probabilistic early expiration, which refreshes a key slightly before its TTL at a random time; or with a lock that lets the first miss recompute while others serve stale data.

Does a distributed cache need to be strongly consistent?

No. A cache is not the source of truth - the database is - so it is allowed to lose data and to be briefly stale. This relaxed requirement is what makes the design tractable: replicas can be eventually consistent, no durability is promised, and on a write you simply delete the cached key and let the next read repopulate it, using a short TTL as a backstop.

How do you handle a hot key in a distributed cache?

Consistent hashing balances keys across nodes, but it cannot balance load within a single key, so one viral key still overwhelms its owning node. Mitigate it with a small in-process L1 cache on each client that absorbs the hottest keys, by replicating the hot key across several nodes and reading from a random one, or by key-splitting the value into K copies under suffixed keys.

Design a Distributed Cache: System Design Interview 2026

A distributed cache is the component every other system in this series quietly depends on - Part 1's URL shortener and Part 2's rate limiter both assumed a fast shared store existed. This post designs that store itself: an in-memory tier, spread across many machines, that absorbs read load so the database does not have to. The problem looks like a hash map until you ask what happens when you add the hundredth node, when one key goes viral, or when a node dies at peak traffic. Those three questions are the interview.

This walkthrough assumes the 6-step system design framework and applies it at senior depth. It is Part 4 of a system design series.

The Problem
Step 1 - Clarify Requirements
Step 2 - Estimate Scale
Step 3 - API and Data Model
Step 4 - High-Level Design
Step 5 - Deep Dive: Consistent Hashing and Cache Correctness
Step 6 - Bottlenecks and Trade-offs
Reference Architecture
Common Mistakes in the Interview
Quick Reference
Related Articles

The Problem

We are designing a distributed in-memory cache - the infrastructure behind Memcached or Redis Cluster - that many application servers share to hold hot data and offload the database. We are designing the cache system, not using one.

The senior framing rests on a single liberating fact: the cache is not the source of truth. The database is. The cache is allowed to lose data and to be briefly stale, because anything it loses can be re-read from the database. That relaxed correctness requirement is what makes the whole design tractable - and a candidate who tries to make the cache durable and strongly consistent has misread the problem.

Step 1 - Clarify Requirements

Functional requirements:

GET, SET, and DELETE by key.
Per-key TTL / expiration.
Eviction when a node runs out of memory.

Non-functional requirements:

Latency: sub-millisecond p99 for GET. The cache exists to be fast; if it is not, it is pointless.
Throughput: millions of operations per second across the cluster.
Horizontal scalability: adding nodes adds capacity, without flushing the cluster.
Availability: losing one node should cost only that node's share of the data, never the whole cache.
Data loss is acceptable. It is a cache. No durability is promised.

The key clarifying question: look-aside or read-through? In a look-aside cache (Memcached, the common case) the application owns the cache-aside logic - it checks the cache, falls back to the database, and repopulates. In a read-through cache the cache itself fetches from the database on a miss. We will design a look-aside cache and note where read-through differs. Also confirm there is no persistence requirement: treating the cache as disposable is a feature, not a gap.

Step 2 - Estimate Scale

Unlike most systems in this series, a cache is sized by memory, not by throughput.

Memory and node count. Suppose the hot working set is 10 TB. With 64 GB of usable RAM per node:

10 TB / 64 GB ≈ ~160 nodes; round up to ~200 for headroom and eviction slack.

Throughput per node. At 10 million operations/sec across 200 nodes, each node serves ~50,000 ops/sec - comfortable for an in-memory store, confirming the cluster is memory-bound, not CPU-bound.

Latency budget. An intra-datacenter network round trip is ~0.2-0.5 ms, and an in-memory lookup is microseconds, so a sub-millisecond p99 is achievable - provided the client reaches the right node in one hop. That single constraint - one hop - is what forces the routing design in Step 5.

Step 3 - API and Data Model

The external API is deliberately tiny:

GET(key)              -> value | MISS
SET(key, value, ttl)  -> ok
DELETE(key)           -> ok

Values are opaque bytes with a size cap (commonly ~1 MB) so one huge value cannot distort a node's memory.

Inside each node, three structures cooperate:

Structure	Purpose
Hash map	The key-to-value store itself, for O(1) lookup
Eviction index	An LRU list or LFU counters, to choose a victim when memory is full
Expiry tracking	TTL metadata, swept lazily on access and actively in the background

The hard part is not any single node - it is which node a key lives on, and what happens to that mapping as the cluster changes. That is the deep dive.

Step 4 - High-Level Design

Clients reach cache nodes through a topology-aware routing layer - either a smart client library linked into every application server, or a thin proxy tier. A separate membership service tracks which nodes are alive and publishes one authoritative view of the ring.

flowchart TD
    App[Application Servers] --> Router[Smart Client / Proxy<br/>consistent-hashing router]
    Router --> N1[(Cache Node A)]
    Router --> N2[(Cache Node B)]
    Router --> N3[(Cache Node C)]
    Member[Membership Service<br/>authoritative ring view] -.topology.-> Router
    N1 -.heartbeat.-> Member
    N2 -.heartbeat.-> Member
    N3 -.heartbeat.-> Member
    App -.miss falls back to.-> DB[(Database - source of truth)]

Figure 1. The cache cluster as seen from a client. The smart-client or proxy turns a key into a node in one hop using the ring topology published by the membership service, and the application falls back to the database on miss. The membership service is small but vital - it is the single source of truth for who owns what, which is the answer to the split-brain routing bug.

The router turns a key into a node in one hop. The membership service exists so that every client agrees on the same topology - a detail that, skipped, produces the split-brain routing bug discussed below.

Step 5 - Deep Dive: Consistent Hashing and Cache Correctness

This is the core. Two themes carry it: how keys are partitioned across nodes so the cluster can grow without flushing, and how the cache stays useful and correct under hot keys, expiry storms, and node failure.

Part A - Why not modulo

The obvious partitioning is node = hash(key) % N. It works until N changes. Add one node to a 100-node cluster and the divisor becomes 101, so hash(key) % 100 and hash(key) % 101 disagree for roughly 99% of keys. Almost every key now maps to a different node, every lookup misses, and the entire read load slams the database at once. A routine capacity increase becomes an outage. Modulo hashing makes the cluster size effectively un-changeable.

Part B - Consistent hashing

Consistent hashing fixes this by mapping both nodes and keys onto a hash ring - a circular space, say 0 to 2^32. Each key is owned by the first node encountered moving clockwise from the key's position.

When a node is added, it inserts itself at one point on the ring and takes over only the keys between itself and its predecessor. When a node is removed, only its keys move, to its successor. Either way, only about 1/N of keys remap - adding a node to a 100-node ring disturbs ~1% of keys, not 99%. The cluster becomes freely resizable.

lookup(key):
    pos  = hash(key)
    node = first node clockwise from pos on the ring
    return node

Part C - Virtual nodes

Plain consistent hashing has two weaknesses with real (small) node counts: the random ring positions distribute unevenly, so some nodes own far more keyspace than others; and when a node dies, all of its load lands on its single successor, which can then topple.

Virtual nodes solve both. Each physical node is assigned many ring positions - typically 100-200 - so the ring interleaves them:

flowchart LR
    A1((A)) --- B1((B)) --- C1((C)) --- A2((A)) --- B2((B)) --- C2((C)) --- A3((A)) --- W(("...wraps to A1"))

Figure 2. Virtual nodes interleave each physical node many times around the hash ring. With three physical nodes (A, B, C) appearing at multiple ring positions each, load distributes evenly and the data owned by any failed node fans out to many successors instead of crashing onto one - which is why vnodes turn a node loss from a thundering herd into a smooth handover.

Now load distributes evenly, and when node B fails its many virtual positions are absorbed by many different successors, spreading the load instead of concentrating it. The trade-off is more ring metadata to track - cheap, and well worth it. More virtual nodes means smoother balance at the cost of a larger ring map.

Part D - Replication

With no replication, losing a node loses its ~1/N of the cached data, and those keys all miss until repopulated. Whether that is acceptable is a quantitative decision: a node loss produces an instant 1/N spike in database read traffic. If the database can absorb that spike, skip replication - this is exactly why Memcached classically does not replicate. If it cannot, replicate each key to the next R nodes clockwise on the ring.

Replication factor	Node-loss behaviour	Memory cost
R = 1 (none)	1/N of keys miss; DB absorbs the spike	1x
R = 2	A replica serves the lost keys; no miss spike	2x

Replication doubles memory to buy availability. For a pure cache backed by a healthy database, R = 1 is often the right, deliberate choice; for a cache whose miss storm would itself take down the database, R = 2 is mandatory. Name the trade-off rather than defaulting.

Cache correctness: consistency model

The cache is eventually consistent with the database, and that is by design. The well-known hazard is the cache-aside write race: if a writer updates the database and then updates the cache, a concurrent reader can repopulate the old value in between, leaving the cache permanently stale. The standard fix is to delete the key on write, never update it - the next read repopulates from the database - and to keep a short TTL as a backstop so any missed invalidation self-heals. Across cache replicas, a SET reaches the primary synchronously and replicas asynchronously, so a replica read can be briefly stale. For a cache, all of this is acceptable; the point is to state it, not to pretend the cache is authoritative.

Cache correctness: hot keys

Consistent hashing balances keys across nodes - it does nothing for load within a single key. One viral key sends all of its traffic to one node, and that node hotspots no matter how good the ring is. Three mitigations stack:

An L1 client cache. A tiny in-process LRU on each application server fronts the distributed L2 cache and absorbs the few hottest keys with zero network hops.
Hot-key replication. Detect hot keys and replicate them across several nodes; clients read from a random replica, fanning the load out.
Key-splitting. Store the value under K suffixed keys (key#1 ... key#K) so it deliberately lands on K different nodes.

Cache correctness: the stampede

When a popular key expires, every concurrent request misses simultaneously and they all hit the database together - a cache stampede.

sequenceDiagram
    participant C1 as Client 1
    participant C2 as Client 2
    participant CN as Client N
    participant Cache as Cache shard
    participant DB as Database
 
    Note over Cache: hot key K just expired
    C1->>Cache: GET K
    Cache-->>C1: miss
    C2->>Cache: GET K
    Cache-->>C2: miss
    CN->>Cache: GET K
    Cache-->>CN: miss
    Note over Cache: single-flight - elect one leader for K, others wait
    C1->>DB: read K (leader)
    DB-->>C1: value
    C1->>Cache: SET K = value (TTL)
    Cache-->>C2: value (woken)
    Cache-->>CN: value (woken)
    Note over C1,CN: 1 DB query, not N - stampede absorbed

Figure 3. A cache stampede neutralised by single-flight. N concurrent clients all miss on a just-expired key; instead of N parallel database reads, one client is elected leader and the rest wait for its result. This is what stops a single popular key's TTL from cascading into a database outage.

Three defences:

Request coalescing (single-flight). On each node, only the first miss for a key recomputes it; concurrent requests for the same key wait for that one result.
Probabilistic early expiration. Refresh the key at a random moment slightly before its TTL, so its expiry is staggered across requests instead of being a cliff everyone falls off at once.
Stale-while-revalidate. The first miss serves the last known (stale) value while one background task refreshes it.

Failure modes

Node down. The membership service detects the lost heartbeat, removes the node's virtual positions from the ring, and publishes the new topology; clients re-route. With R = 1, expect a 1/N miss spike; with R = 2, a replica covers it seamlessly.
Split-brain routing. If each client independently decides a node is dead, clients disagree on the ring, and the same key gets written to two different nodes - silent inconsistency. This is why membership is owned by one authoritative service, not negotiated per client.
Memory pressure. Heavy eviction shrinks the effective working set, the hit ratio falls, and database load climbs - which is why hit ratio is the headline metric to watch.

Eviction

When a node is full it must evict. LRU (least recently used) is the default: cheap, and well matched to temporal locality. LFU (least frequently used) wins when a stable set of keys is persistently popular but accessed in bursts - a large scan can evict a genuinely hot key under LRU - but LFU needs frequency counters with aging, so a key that was hot last week decays rather than squatting forever. TTL expiry runs lazily on access and actively via a background sampler, because lazy-only expiry leaves dead keys occupying memory until something happens to touch them.

stateDiagram-v2
    [*] --> Cached: SET / repopulate on miss
    Cached --> Cached: GET (refresh recency / frequency)
    Cached --> Expired: TTL elapsed
    Cached --> Evicted: memory full (LRU / LFU victim)
    Cached --> [*]: DELETE on write
    Expired --> [*]
    Evicted --> [*]

Figure 4. The state machine of a single cache entry. The two outbound transitions to the terminal state on the right - DELETE on write and Evicted under memory pressure - explain why "cache invalidation is hard" reduces here to "delete the key on write": entries cannot transition back from terminal, so there is no merge or stale-data race to worry about.

Multi-region

A cache is regional. Cross-region latency destroys the sub-millisecond budget, so each region runs its own independent cache cluster, populated from that region's database replica. The cross-region question is invalidation: a write in region A must not leave region B serving a stale value. Use TTL for the bulk of keys - simple, self-healing, briefly stale - and a cross-region invalidation stream (a pub/sub topic carrying delete events) only for the keys whose staleness genuinely matters.

Evolution path

Stage	Approach
Launch	A single cache node, or an in-process cache
Growth	A handful of nodes with client-side consistent hashing
Scale	Many nodes with virtual nodes, an authoritative membership service, hot-shard replication, an L1 client cache, per-region clusters

Adopt consistent hashing from day one - switching off modulo later means a full cache flush - and design the client behind a routing abstraction. Defer replication, the L1 tier, and multi-region until load demands them.

Observability

The headline metric is hit ratio; a falling hit ratio directly forecasts rising database load. Track also p99 GET/SET latency, evictions per second, per-node memory utilisation, per-key request rate (for hot-key detection), and the rate of membership changes. Alert on a hit-ratio drop and on any single node hotspotting.

Step 6 - Bottlenecks and Trade-offs

Hot keys are the bottleneck consistent hashing cannot fix - it balances keys, not per-key load - so an L1 cache or hot-key replication is required separately.
A node loss is a 1/N database miss spike; the database must either survive it or the cache must replicate.
Memory is the ceiling, and eviction quality decides how much of that memory is actually useful.
Expiry stampedes turn a popular key's TTL into a synchronised database flood unless coalescing or jittered expiry is in place.
Membership consistency is non-negotiable: clients that disagree on the ring corrupt the cache silently.

Reference Architecture

The pattern this problem teaches, reusable well beyond caching:

A horizontally partitioned stateful tier addressed by consistent hashing with virtual nodes, fronted by a small per-client L1, backed by an authoritative store it is always allowed to lose data to.

flowchart LR
    subgraph Client["Per client"]
        L1[L1 in-process cache]
    end
    subgraph Ring["Consistent-hashing ring"]
        direction TB
        R1[(Node + vnodes)]
        R2[(Node + vnodes)]
        R3[(Node + vnodes)]
    end
    L1 -->|miss| Ring
    Ring -->|miss| Auth[(Authoritative store)]
    Member[Membership service] -.one ring view.-> Client

Figure 5. The reference architecture distils the design into three layers: a per-client L1 cache absorbs the hottest keys at zero network cost, the consistent-hashing ring with vnodes provides horizontal scale, and an authoritative store underneath catches everything the cache cannot. The membership service - shared by all clients - keeps the ring view consistent and is what prevents split-brain routing.

The same shape recurs in any partitioned stateful system: sharded session stores, the sharded counter store behind Part 2's rate limiter, sharded search indexes. Whenever state must be split across nodes and the node set will change over time, consistent hashing with virtual nodes is the default partitioning scheme.

Common Mistakes in the Interview

Partitioning with hash(key) % N, which turns every cluster resize into a near-total cache flush.
Omitting virtual nodes, leaving load imbalanced and a node loss dumping entirely onto one successor.
Treating consistent hashing as a hot-key fix - it balances keys, never load within a key.
No cache-stampede story, so a popular key's expiry floods the database.
Letting each client decide membership independently, producing split-brain routing and silent inconsistency.
Designing the cache as durable and strongly consistent, ignoring that the database is the source of truth.
Picking LRU or LFU by reflex, without reasoning from the access pattern.

Quick Reference

Topic	Key Point
Core pattern	Consistent hashing on a ring; a key is owned by the next node clockwise
Modulo hashing	`hash(key) % N` remaps ~99% of keys on resize - never use it
Virtual nodes	100-200 ring positions per node; even load, graceful node loss
Replication	R = 1 lets the DB absorb a 1/N miss spike; R = 2 avoids it at 2x memory
Consistency	Eventually consistent by design; delete keys on write, short TTL backstop
Hot keys	L1 client cache, hot-key replication, or key-splitting
Stampede	Request coalescing, probabilistic early expiration, stale-while-revalidate
Eviction	LRU by default; LFU (with aging) for bursty, persistently popular keys
Membership	One authoritative service publishes the ring - never per-client
Multi-region	Independent regional clusters; TTL plus a cross-region invalidation stream

System Design Interview Problems: A Senior's Roadmap - the full series index and pattern library.
System Design Interview Guide: The 6-Step Framework - the method this walkthrough applies.
Design a URL Shortener - Part 1; the cache-aside read path this cache sits behind.
Design a News Feed - Part 5; the feed store and post cache build directly on this distributed cache.
Redis Interview Questions - eviction policies, clustering, and replication in a real cache.

This is Part 4 of a 12-part system design series where each post solves one problem around one core pattern. Next: Design a News Feed.

Design a Distributed Cache: System Design Interview 2026

Table of Contents

The Problem

Step 1 - Clarify Requirements

Step 2 - Estimate Scale

Step 3 - API and Data Model

Step 4 - High-Level Design

Step 5 - Deep Dive: Consistent Hashing and Cache Correctness

Part A - Why not modulo

Part B - Consistent hashing

Part C - Virtual nodes

Part D - Replication

Cache correctness: consistency model

Cache correctness: hot keys

Cache correctness: the stampede

Failure modes

Eviction

Multi-region

Evolution path

Observability

Step 6 - Bottlenecks and Trade-offs

Reference Architecture

Common Mistakes in the Interview

Quick Reference

Ready to ace your interview?

Table of Contents

The Problem

Step 1 - Clarify Requirements

Step 2 - Estimate Scale

Step 3 - API and Data Model

Step 4 - High-Level Design

Step 5 - Deep Dive: Consistent Hashing and Cache Correctness

Part A - Why not modulo

Part B - Consistent hashing

Part C - Virtual nodes

Part D - Replication

Cache correctness: consistency model

Cache correctness: hot keys

Cache correctness: the stampede

Failure modes

Eviction

Multi-region

Evolution path

Observability

Step 6 - Bottlenecks and Trade-offs

Reference Architecture

Common Mistakes in the Interview

Quick Reference

Related Articles

Ready to ace your interview?