Search autocomplete has the tightest latency budget in this series: the suggestion list must appear before the user presses the next key. That is tens of milliseconds, end to end, for a service answering billions of requests a day - one for nearly every keystroke on the platform. The problem is interesting precisely because that budget rules out doing any real work at request time. The entire design is an exercise in moving the work somewhere else: into an offline build, into a precomputed data structure, and into caches at the edge.
This walkthrough assumes the 6-step system design framework and applies it at senior depth. It is Part 9 of a system design series.
Table of Contents
- The Problem
- Step 1 - Clarify Requirements
- Step 2 - Estimate Scale
- Step 3 - API and Data Model
- Step 4 - High-Level Design
- Step 5 - Deep Dive: The Trie and Precomputed Top-K
- Step 6 - Bottlenecks and Trade-offs
- Reference Architecture
- Common Mistakes in the Interview
- Quick Reference
- Related Articles
The Problem
We are designing search autocomplete - the typeahead that, as a user types into a search box, shows the top few completions of what they are likely searching for. The canonical examples are the Google, YouTube, and Amazon search boxes.
The senior framing is that this is a read-dominated lookup whose answer can be fully precomputed. Suggestions come from aggregated search history, which changes slowly; the read side changes constantly and must be instant. When the answer to a query changes far more slowly than the query is asked, the design writes itself: precompute the answer, and make the read a trivial lookup.
Step 1 - Clarify Requirements
Functional requirements:
- Given a typed prefix, return the top-k completion suggestions (k is small - around 5 to 10).
- Suggestions are ranked by popularity.
- Suggestions are drawn from what people actually search - aggregated query logs - not arbitrary text.
Out of scope (name, then defer): the search results themselves, spelling correction, and personalisation - though we note where personalisation re-enters.
Non-functional requirements:
- Extreme low latency. A request fires on (nearly) every keystroke and must return before the next one - a p99 well under ~100 ms.
- Massive read volume - billions of requests per day.
- Freshness is relaxed. Suggestions lagging real searches by the rebuild interval is acceptable, with one exception for trending terms.
- High availability.
The clarifying questions: matching is prefix matching (true typeahead), not fuzzy full-text - fuzzy is an extension. Suggestions are whole popular query phrases from query logs. And k is small. Establishing that the corpus is "popular past queries" is what makes precomputation viable.
Step 2 - Estimate Scale
Read volume. Assume 5 billion searches/day. With debounce, a single search still triggers ~10 autocomplete requests, so 5B x 10 = 50 billion requests/day ≈ ~580,000/sec average, with peaks past 2 million/sec. This is one of the highest-read systems in the series.
Corpus. After discarding rare queries, perhaps ~100 million distinct phrases are worth suggesting. That set is the trie.
Trie memory. 100M phrases, ~20 characters each, with a precomputed top-k list cached at prefix nodes, runs to tens to hundreds of GB - so the trie is held in memory and sharded across nodes.
Write side. Query logs are aggregated in batch - hourly, say - to rebuild the trie. There is no real-time write path. The read-to-write asymmetry is total: billions of lookups against a handful of scheduled rebuilds.
Step 3 - API and Data Model
The API is a single endpoint:
GET /api/autocomplete?q=<prefix>
200 OK { "suggestions": ["...", "...", "..."] }The data model has two halves:
| Element | Contents |
|---|---|
| Trie node | One character; a node representing a prefix also stores that prefix's precomputed top-k completions |
| Query frequency | Per phrase: an aggregated search-count score, produced from query logs |
The decisive modelling choice is storing the top-k at each prefix node. A lookup then walks to the prefix node and reads its stored list - it never traverses the subtree of completions. That single decision is what meets the latency budget, and the rest of the deep dive explains why.
Step 4 - High-Level Design
flowchart TD
subgraph Build["Offline build path - periodic"]
Logs[(Search Query Logs)] --> Agg[Aggregation Job<br/>frequency + ranking]
Agg --> Builder[Trie Builder<br/>precompute top-k per node]
Builder --> Publish[Versioned Trie Snapshot]
end
subgraph Serve["Online serve path - per keystroke"]
Client([Client<br/>debounced]) --> CDN[CDN / Edge Cache]
CDN -->|miss| ACS[Autocomplete Service]
ACS --> Trie[(In-Memory Trie<br/>sharded by prefix)]
end
Publish -.load.-> TrieFigure 1. The architecture splits cleanly into an offline build path and an online serve path. The build aggregates query logs, ranks, and publishes a versioned trie snapshot; the serve path answers keystrokes from that snapshot via an edge cache and a sharded in-memory trie. The two paths are coupled only by snapshot publishes, so a broken build leaves the last good trie serving.
The build path turns query logs into a ranked, versioned trie snapshot on a schedule. The serve path answers keystrokes from a sharded in-memory trie, fronted by an edge cache. The dashed line is the only coupling - the serve path loads a published snapshot - so a broken build leaves the last good trie serving.
Step 5 - Deep Dive: The Trie and Precomputed Top-K
This is the core. Three ideas carry it: why the trie, why precompute the top-k, and how caching and the build pipeline make it production-ready.
Part A - Why a trie
The query is "given prefix P, return its best completions". The naive option scans all 100M phrases for those starting with P - O(N) per keystroke, hopeless at 580,000 requests/sec. A sorted array with binary search is better: the matching prefixes form a contiguous range, found in O(log N). But for a short prefix that range is enormous - "a" matches millions of phrases - and binary search still leaves all of them to rank.
A trie (prefix tree) is a tree in which each root-to-node path spells a prefix. Finding the node for a typed prefix is a walk of O(prefix length) - completely independent of corpus size. Every completion of that prefix lives in the node's subtree.
flowchart TD
Root(( )) --> A["a"]
A --> AM["am - top-k: amazon, amazon prime, amc"]
AM --> AMA["ama - top-k: amazon, amazon prime, amazon music"]
AMA --> AMAZ["amaz - top-k: amazon, amazon prime, amazon music"]
AMAZ --> AMAZO["amazo - top-k: amazon, amazon music, amazon mx"]Figure 2. The trie with the top-k completions stored at every prefix node. Walking from the root to "amazo" is O(prefix length); reading its top-k list is O(1). The whole latency story rests on the observation that ranking happens once, offline, so the request never has to traverse the subtree below.
Part B - Precomputing the top-k
A plain trie still has an expensive read: to return the best k completions of a short prefix, you would traverse its whole (huge) subtree and rank it - at request time. That breaks the latency budget.
The fix is to precompute the top-k completions at every prefix node during the offline build. A request then walks to the prefix node - O(prefix length) - and reads the stored list. No subtree traversal happens at request time at all. Total request work is a handful of pointer hops.
The cost is paid in two acceptable currencies: memory, because every node carries up to k suggestions, and offline build time, because ranking every node's subtree is heavy. For a system with billions of reads against hourly rebuilds, spending memory and batch compute to make the read trivial is not a trade-off so much as the obvious answer.
Part C - Ranking
The score of a phrase is its search popularity, taken from aggregated query-log frequency, usually with time decay so a surging term outranks a stale one. Crucially, ranking runs offline during the build - that is precisely what makes the per-node top-k a fixed, precomputable list.
Personalisation and location break that, because the best completions become user-dependent. The senior resolution mirrors Part 5's feed ranking: the trie stores the globally-ranked top-k of a slightly larger size (say top-20), and a lightweight re-rank applies personal signals to that bounded candidate set at request time. The expensive ranking stays offline; only a cheap reorder of 20 items is per-request.
Part D - The build pipeline and freshness
Query logs are aggregated in batch, the trie is rebuilt, and a versioned snapshot is published to the serving tier. A full rebuild each hour is simpler than incremental updates and, for a partitioned trie, fast enough.
Freshness is then bounded by the rebuild interval - about an hour. That is fine for the bulk of suggestions and too slow for breaking, trending terms. The standard answer is two-tier freshness: the main trie rebuilds hourly, and a small trending overlay of currently-spiking terms updates every few minutes and is merged into results. Most queries do not need real-time freshness; only the hot tail does, and the overlay serves exactly that tail cheaply.
Part E - Caching and the request path
Reads follow a steep popularity curve - a, am, ama and other short, common prefixes dominate - so caching is decisive.
- Client-side debounce. The client waits ~50-100 ms for a pause in typing before sending a request, collapsing a 20-keystroke search into a handful of calls. It also caches: deleting characters back to an already-seen prefix needs no request.
- Edge caching. A popular prefix's top-k changes only on rebuild, so it is effectively static between builds. Cache it at the CDN with a TTL matching the rebuild interval, and a large fraction of traffic is served from the edge, never reaching the trie service.
- Sharded in-memory trie. The service holds the trie in memory, sharded by prefix. Because "a"-prefixed queries vastly outnumber "z", sharding on the raw first letter is badly skewed - shard on a hash of the first few characters, or size shards adaptively by load.
The full request lifecycle - a debounced client hitting a popular prefix (served from the edge) then a rare one (falling through to the trie service):
sequenceDiagram
participant U as User typing
participant C as Client
participant CDN as CDN edge
participant S as Autocomplete service
participant T as Sharded trie
U->>C: keystroke "a"
U->>C: keystroke "am"
Note over C: debounce ~80ms, no more keys
C->>CDN: GET /autocomplete?q=am
Note over CDN: popular prefix - cache hit
CDN-->>C: top-k
C-->>U: render suggestions
U->>C: keystroke "amazxr" (rare prefix)
Note over C: debounce ~80ms
C->>CDN: GET /autocomplete?q=amazxr
Note over CDN: rare prefix - cache miss
CDN->>S: forward
S->>T: walk to prefix node
T-->>S: stored top-k
S-->>CDN: top-k (cached for next time)
CDN-->>C: top-k
C-->>U: render suggestionsFigure 3. The full request lifecycle: a debounced client hits a popular prefix served from the CDN edge, then a rare one that falls through to the trie service. Most traffic is absorbed at the edge - which is what lets the in-memory trie sit behind it with modest real throughput rather than carrying the full keystroke firehose.
Consistency model
The trie is eventually consistent with live searches, lagging by the rebuild interval, with the trending overlay narrowing the gap for hot terms. And the serving trie is derived data - rebuildable from the query logs, which are the real source of truth - exactly like the rebuildable stores of Part 4 and Part 5.
Failure modes
- Trie shard loss. The shard's in-memory data is gone, but it is reloaded from the last published snapshot onto a fresh node. Because the trie is read-only between builds, shards replicate trivially for availability.
- Build pipeline failure. The serve path keeps answering from the last published trie - stale, not down. Decoupling the two paths is what buys this.
- A bad build. A new snapshot is validated before publish (non-empty, sane size, spot-checked); a bad one is rejected and the previous version keeps serving. Versioned publishes make rollback instant.
- Traffic spike. The CDN absorbs popular prefixes and debounce caps per-user load, so spikes mostly never reach the service.
Multi-region
This is one of the easiest multi-region stories in the series: the trie is read-only between builds, so the published snapshot replicates to every region and each region serves from a local copy. Build once, centrally, from globally aggregated logs; publish everywhere. Where query popularity varies by region or language, build per-region tries from regional logs for better local relevance.
Evolution path
| Stage | Approach |
|---|---|
| Launch | Sorted array with binary search, ranking the range on read |
| Growth | A trie with precomputed top-k, edge caching, client debounce |
| Scale | Sharded trie, trending overlay, per-region tries, personalisation re-rank |
Build the precomputed top-k (the entire latency story), client debounce, and versioned snapshot publishing from day one. Defer the trending overlay, personalisation, and per-region tries.
Observability
Track request rate, end-to-end and trie-service p99 latency, CDN hit ratio, trie build duration and success, freshness lag (trend appearance time), suggestion click-through rate as a quality signal, and per-shard load skew. Reasonable SLOs: 99% of responses under 100 ms, and a successful trie rebuild every hour.
Step 6 - Bottlenecks and Trade-offs
- Request volume is enormous and is absorbed by debounce, edge caching, and an in-memory sharded trie - long before requests reach a database.
- Request-time ranking cost is eliminated entirely by precomputing the top-k during the build.
- Trie memory runs to hundreds of GB and is handled by sharding.
- Freshness versus cost is resolved with a batch rebuild plus a fast trending overlay, rather than a costly real-time trie.
- Shard skew from uneven prefix popularity is fixed by hashing the prefix or sizing shards by load.
Reference Architecture
The pattern this problem teaches, reusable well beyond autocomplete:
Precompute the answer at build time so the read is a trivial lookup; serve the read-only derived result from an in-memory, sharded-and-replicated tier behind an edge cache; refresh it in batch, with a small fast overlay for the hot tail.
flowchart LR
subgraph Offline["Offline - periodic"]
L[(Source data)] --> P[Precompute + rank]
P --> S[Versioned snapshot]
end
subgraph Online["Online - per request"]
E[Edge cache] --> T[(In-memory sharded tier)]
end
S -.publish.-> T
O[Fast trending overlay] -.merge.-> OnlineFigure 4. The reference architecture distils the design into "precompute once offline, serve from a fast cached tier online". Whenever the question is asked far more often than its answer changes - leaderboards, related-items panels, recommendation caches, DNS-style resolution - this shape applies. A small trending overlay handles the hot tail that the batch build is too slow for.
The same shape recurs wherever reads dominate and the answer changes slowly: leaderboards, "related items" panels, recommendation caches, even DNS-style resolution. Whenever the question is asked far more often than its answer changes, precompute the answer and make the read a lookup.
Common Mistakes in the Interview
- Ranking matches at request time by traversing a prefix's subtree - fatal for short prefixes.
- Scanning the corpus instead of using a prefix structure.
- Ignoring client-side debounce, turning every keystroke into a request.
- No edge caching of popular prefixes, sending head traffic all the way to the service.
- Treating the trie as a source of truth rather than rebuildable derived data.
- Assuming real-time freshness is required everywhere - only the trending tail needs it.
- Sharding on the raw first letter, producing badly skewed shards.
Quick Reference
| Topic | Key Point |
|---|---|
| Core pattern | Trie with top-k precomputed at every prefix node |
| Why a trie | Prefix lookup is O(prefix length), independent of corpus size |
| Precompute | Rank offline; the read becomes a walk plus a stored-list read |
| Ranking | Popularity from query logs with time decay; computed during the build |
| Personalisation | Lightweight re-rank of a bounded candidate set at request time |
| Freshness | Hourly rebuild + a few-minute trending overlay merged into results |
| Caching | Client debounce + cache; CDN edge cache for popular prefixes |
| Sharding | Hash the prefix - sharding on the raw first letter is heavily skewed |
| Consistency | Eventually consistent; the trie is rebuildable derived data |
| Multi-region | Read-only snapshot replicates everywhere; per-region tries for relevance |
Related Articles
- System Design Interview Problems: A Senior's Roadmap - the full series index and pattern library.
- System Design Interview Guide: The 6-Step Framework - the method this walkthrough applies.
- Design a Web Crawler - Part 8; the crawl that ultimately feeds a search corpus.
- Design Video Streaming - Part 10; another precompute-heavy, CDN-served, derived-artifact system.
- Design a Distributed Cache - Part 4; the edge and in-memory caching this design leans on.
- Design a News Feed - Part 5; ranking a bounded candidate set, reused here for personalisation.
This is Part 9 of a 12-part system design series where each post solves one problem around one core pattern. Next: Design Video Streaming.
