Why use a trie for search autocomplete?

A trie is a prefix tree where each path from the root spells a prefix, so finding the node for a typed prefix takes time proportional to the prefix length, independent of how many phrases the corpus holds. Scanning a list or even binary-searching a sorted array still leaves a large matching range to rank for short prefixes. The trie turns prefix lookup into a short, constant-shaped walk.

How does autocomplete return results fast enough to feel instant?

The key trick is to precompute the top-k suggestions at every prefix node when the trie is built offline, rather than ranking matches at request time. A request then only walks to the prefix node and reads its stored list - work proportional to the prefix length, with no traversal of the subtree. This trades extra memory and offline build time for near-instant reads.

How are autocomplete suggestions ranked?

Suggestions are ranked by popularity derived from aggregated search query logs, often with time decay so trending terms rise. Ranking is done offline during the trie build, which is exactly what makes the per-prefix top-k precomputable. Personalisation or location signals, which depend on the user, are applied as a lightweight re-rank of a slightly larger candidate set at request time.

How fresh do autocomplete suggestions need to be?

For the bulk of suggestions, lagging real searches by the rebuild interval - an hour or so - is perfectly acceptable. The exception is breaking or trending terms, which an hourly rebuild surfaces too slowly. A common design adds a small trending overlay updated every few minutes and merged into results, giving two-tier freshness without rebuilding the whole trie constantly.

Why is client-side debounce important for autocomplete?

Without debounce, every keystroke fires a request, so one search of twenty characters generates twenty calls. Debouncing waits for a brief pause in typing - around 50 to 100 milliseconds - before sending a request, which cuts request volume dramatically. The client also caches results, so deleting characters back to an already-seen prefix needs no new request.

Is the autocomplete trie a source of truth?

No. The serving trie is derived data, rebuilt from the search query logs, which are the actual source of truth. This means a lost trie shard is reloaded from the last published snapshot rather than recovered, the read-only trie replicates trivially across regions, and a failed build simply leaves the previous version serving.

Design Search Autocomplete: System Design Interview 2026

Search autocomplete has the tightest latency budget in this series: the suggestion list must appear before the user presses the next key. That is tens of milliseconds, end to end, for a service answering billions of requests a day - one for nearly every keystroke on the platform. The problem is interesting precisely because that budget rules out doing any real work at request time. The entire design is an exercise in moving the work somewhere else: into an offline build, into a precomputed data structure, and into caches at the edge.

This walkthrough assumes the 6-step system design framework and applies it at senior depth. It is Part 9 of a system design series.

The Problem
Step 1 - Clarify Requirements
Step 2 - Estimate Scale
Step 3 - API and Data Model
Step 4 - High-Level Design
Step 5 - Deep Dive: The Trie and Precomputed Top-K
Step 6 - Bottlenecks and Trade-offs
Reference Architecture
Common Mistakes in the Interview
Quick Reference
Related Articles

The Problem

We are designing search autocomplete - the typeahead that, as a user types into a search box, shows the top few completions of what they are likely searching for. The canonical examples are the Google, YouTube, and Amazon search boxes.

The senior framing is that this is a read-dominated lookup whose answer can be fully precomputed. Suggestions come from aggregated search history, which changes slowly; the read side changes constantly and must be instant. When the answer to a query changes far more slowly than the query is asked, the design writes itself: precompute the answer, and make the read a trivial lookup.

Step 1 - Clarify Requirements

Functional requirements:

Given a typed prefix, return the top-k completion suggestions (k is small - around 5 to 10).
Suggestions are ranked by popularity.
Suggestions are drawn from what people actually search - aggregated query logs - not arbitrary text.

Out of scope (name, then defer): the search results themselves, spelling correction, and personalisation - though we note where personalisation re-enters.

Non-functional requirements:

Extreme low latency. A request fires on (nearly) every keystroke and must return before the next one - a p99 well under ~100 ms.
Massive read volume - billions of requests per day.
Freshness is relaxed. Suggestions lagging real searches by the rebuild interval is acceptable, with one exception for trending terms.
High availability.

The clarifying questions: matching is prefix matching (true typeahead), not fuzzy full-text - fuzzy is an extension. Suggestions are whole popular query phrases from query logs. And k is small. Establishing that the corpus is "popular past queries" is what makes precomputation viable.

Step 2 - Estimate Scale

Read volume. Assume 5 billion searches/day. With debounce, a single search still triggers ~10 autocomplete requests, so 5B x 10 = 50 billion requests/day ≈ ~580,000/sec average, with peaks past 2 million/sec. This is one of the highest-read systems in the series.

Corpus. After discarding rare queries, perhaps ~100 million distinct phrases are worth suggesting. That set is the trie.

Trie memory. 100M phrases, ~20 characters each, with a precomputed top-k list cached at prefix nodes, runs to tens to hundreds of GB - so the trie is held in memory and sharded across nodes.

Write side. Query logs are aggregated in batch - hourly, say - to rebuild the trie. There is no real-time write path. The read-to-write asymmetry is total: billions of lookups against a handful of scheduled rebuilds.

Step 3 - API and Data Model

The API is a single endpoint:

GET /api/autocomplete?q=<prefix>
  200 OK   { "suggestions": ["...", "...", "..."] }

The data model has two halves:

Element	Contents
Trie node	One character; a node representing a prefix also stores that prefix's precomputed top-k completions
Query frequency	Per phrase: an aggregated search-count score, produced from query logs

The decisive modelling choice is storing the top-k at each prefix node. A lookup then walks to the prefix node and reads its stored list - it never traverses the subtree of completions. That single decision is what meets the latency budget, and the rest of the deep dive explains why.

Step 4 - High-Level Design

flowchart TD
    subgraph Build["Offline build path - periodic"]
        Logs[(Search Query Logs)] --> Agg[Aggregation Job<br/>frequency + ranking]
        Agg --> Builder[Trie Builder<br/>precompute top-k per node]
        Builder --> Publish[Versioned Trie Snapshot]
    end
    subgraph Serve["Online serve path - per keystroke"]
        Client([Client<br/>debounced]) --> CDN[CDN / Edge Cache]
        CDN -->|miss| ACS[Autocomplete Service]
        ACS --> Trie[(In-Memory Trie<br/>sharded by prefix)]
    end
    Publish -.load.-> Trie

Figure 1. The architecture splits cleanly into an offline build path and an online serve path. The build aggregates query logs, ranks, and publishes a versioned trie snapshot; the serve path answers keystrokes from that snapshot via an edge cache and a sharded in-memory trie. The two paths are coupled only by snapshot publishes, so a broken build leaves the last good trie serving.

The build path turns query logs into a ranked, versioned trie snapshot on a schedule. The serve path answers keystrokes from a sharded in-memory trie, fronted by an edge cache. The dashed line is the only coupling - the serve path loads a published snapshot - so a broken build leaves the last good trie serving.

Step 5 - Deep Dive: The Trie and Precomputed Top-K

This is the core. Three ideas carry it: why the trie, why precompute the top-k, and how caching and the build pipeline make it production-ready.

Part A - Why a trie

The query is "given prefix P, return its best completions". The naive option scans all 100M phrases for those starting with P - O(N) per keystroke, hopeless at 580,000 requests/sec. A sorted array with binary search is better: the matching prefixes form a contiguous range, found in O(log N). But for a short prefix that range is enormous - "a" matches millions of phrases - and binary search still leaves all of them to rank.

A trie (prefix tree) is a tree in which each root-to-node path spells a prefix. Finding the node for a typed prefix is a walk of O(prefix length) - completely independent of corpus size. Every completion of that prefix lives in the node's subtree.

flowchart TD
    Root(( )) --> A["a"]
    A --> AM["am - top-k: amazon, amazon prime, amc"]
    AM --> AMA["ama - top-k: amazon, amazon prime, amazon music"]
    AMA --> AMAZ["amaz - top-k: amazon, amazon prime, amazon music"]
    AMAZ --> AMAZO["amazo - top-k: amazon, amazon music, amazon mx"]

Figure 2. The trie with the top-k completions stored at every prefix node. Walking from the root to "amazo" is O(prefix length); reading its top-k list is O(1). The whole latency story rests on the observation that ranking happens once, offline, so the request never has to traverse the subtree below.

Part B - Precomputing the top-k

A plain trie still has an expensive read: to return the best k completions of a short prefix, you would traverse its whole (huge) subtree and rank it - at request time. That breaks the latency budget.

The fix is to precompute the top-k completions at every prefix node during the offline build. A request then walks to the prefix node - O(prefix length) - and reads the stored list. No subtree traversal happens at request time at all. Total request work is a handful of pointer hops.

The cost is paid in two acceptable currencies: memory, because every node carries up to k suggestions, and offline build time, because ranking every node's subtree is heavy. For a system with billions of reads against hourly rebuilds, spending memory and batch compute to make the read trivial is not a trade-off so much as the obvious answer.

Part C - Ranking

The score of a phrase is its search popularity, taken from aggregated query-log frequency, usually with time decay so a surging term outranks a stale one. Crucially, ranking runs offline during the build - that is precisely what makes the per-node top-k a fixed, precomputable list.

Personalisation and location break that, because the best completions become user-dependent. The senior resolution mirrors Part 5's feed ranking: the trie stores the globally-ranked top-k of a slightly larger size (say top-20), and a lightweight re-rank applies personal signals to that bounded candidate set at request time. The expensive ranking stays offline; only a cheap reorder of 20 items is per-request.

Part D - The build pipeline and freshness

Query logs are aggregated in batch, the trie is rebuilt, and a versioned snapshot is published to the serving tier. A full rebuild each hour is simpler than incremental updates and, for a partitioned trie, fast enough.

Freshness is then bounded by the rebuild interval - about an hour. That is fine for the bulk of suggestions and too slow for breaking, trending terms. The standard answer is two-tier freshness: the main trie rebuilds hourly, and a small trending overlay of currently-spiking terms updates every few minutes and is merged into results. Most queries do not need real-time freshness; only the hot tail does, and the overlay serves exactly that tail cheaply.

Part E - Caching and the request path

Reads follow a steep popularity curve - a, am, ama and other short, common prefixes dominate - so caching is decisive.

Client-side debounce. The client waits ~50-100 ms for a pause in typing before sending a request, collapsing a 20-keystroke search into a handful of calls. It also caches: deleting characters back to an already-seen prefix needs no request.
Edge caching. A popular prefix's top-k changes only on rebuild, so it is effectively static between builds. Cache it at the CDN with a TTL matching the rebuild interval, and a large fraction of traffic is served from the edge, never reaching the trie service.
Sharded in-memory trie. The service holds the trie in memory, sharded by prefix. Because "a"-prefixed queries vastly outnumber "z", sharding on the raw first letter is badly skewed - shard on a hash of the first few characters, or size shards adaptively by load.

The full request lifecycle - a debounced client hitting a popular prefix (served from the edge) then a rare one (falling through to the trie service):

sequenceDiagram
    participant U as User typing
    participant C as Client
    participant CDN as CDN edge
    participant S as Autocomplete service
    participant T as Sharded trie
 
    U->>C: keystroke "a"
    U->>C: keystroke "am"
    Note over C: debounce ~80ms, no more keys
    C->>CDN: GET /autocomplete?q=am
    Note over CDN: popular prefix - cache hit
    CDN-->>C: top-k
    C-->>U: render suggestions
    U->>C: keystroke "amazxr" (rare prefix)
    Note over C: debounce ~80ms
    C->>CDN: GET /autocomplete?q=amazxr
    Note over CDN: rare prefix - cache miss
    CDN->>S: forward
    S->>T: walk to prefix node
    T-->>S: stored top-k
    S-->>CDN: top-k (cached for next time)
    CDN-->>C: top-k
    C-->>U: render suggestions

Figure 3. The full request lifecycle: a debounced client hits a popular prefix served from the CDN edge, then a rare one that falls through to the trie service. Most traffic is absorbed at the edge - which is what lets the in-memory trie sit behind it with modest real throughput rather than carrying the full keystroke firehose.

Consistency model

The trie is eventually consistent with live searches, lagging by the rebuild interval, with the trending overlay narrowing the gap for hot terms. And the serving trie is derived data - rebuildable from the query logs, which are the real source of truth - exactly like the rebuildable stores of Part 4 and Part 5.

Failure modes

Trie shard loss. The shard's in-memory data is gone, but it is reloaded from the last published snapshot onto a fresh node. Because the trie is read-only between builds, shards replicate trivially for availability.
Build pipeline failure. The serve path keeps answering from the last published trie - stale, not down. Decoupling the two paths is what buys this.
A bad build. A new snapshot is validated before publish (non-empty, sane size, spot-checked); a bad one is rejected and the previous version keeps serving. Versioned publishes make rollback instant.
Traffic spike. The CDN absorbs popular prefixes and debounce caps per-user load, so spikes mostly never reach the service.

Multi-region

This is one of the easiest multi-region stories in the series: the trie is read-only between builds, so the published snapshot replicates to every region and each region serves from a local copy. Build once, centrally, from globally aggregated logs; publish everywhere. Where query popularity varies by region or language, build per-region tries from regional logs for better local relevance.

Evolution path

Stage	Approach
Launch	Sorted array with binary search, ranking the range on read
Growth	A trie with precomputed top-k, edge caching, client debounce
Scale	Sharded trie, trending overlay, per-region tries, personalisation re-rank

Build the precomputed top-k (the entire latency story), client debounce, and versioned snapshot publishing from day one. Defer the trending overlay, personalisation, and per-region tries.

Observability

Track request rate, end-to-end and trie-service p99 latency, CDN hit ratio, trie build duration and success, freshness lag (trend appearance time), suggestion click-through rate as a quality signal, and per-shard load skew. Reasonable SLOs: 99% of responses under 100 ms, and a successful trie rebuild every hour.

Step 6 - Bottlenecks and Trade-offs

Request volume is enormous and is absorbed by debounce, edge caching, and an in-memory sharded trie - long before requests reach a database.
Request-time ranking cost is eliminated entirely by precomputing the top-k during the build.
Trie memory runs to hundreds of GB and is handled by sharding.
Freshness versus cost is resolved with a batch rebuild plus a fast trending overlay, rather than a costly real-time trie.
Shard skew from uneven prefix popularity is fixed by hashing the prefix or sizing shards by load.

Reference Architecture

The pattern this problem teaches, reusable well beyond autocomplete:

Precompute the answer at build time so the read is a trivial lookup; serve the read-only derived result from an in-memory, sharded-and-replicated tier behind an edge cache; refresh it in batch, with a small fast overlay for the hot tail.

flowchart LR
    subgraph Offline["Offline - periodic"]
        L[(Source data)] --> P[Precompute + rank]
        P --> S[Versioned snapshot]
    end
    subgraph Online["Online - per request"]
        E[Edge cache] --> T[(In-memory sharded tier)]
    end
    S -.publish.-> T
    O[Fast trending overlay] -.merge.-> Online

Figure 4. The reference architecture distils the design into "precompute once offline, serve from a fast cached tier online". Whenever the question is asked far more often than its answer changes - leaderboards, related-items panels, recommendation caches, DNS-style resolution - this shape applies. A small trending overlay handles the hot tail that the batch build is too slow for.

The same shape recurs wherever reads dominate and the answer changes slowly: leaderboards, "related items" panels, recommendation caches, even DNS-style resolution. Whenever the question is asked far more often than its answer changes, precompute the answer and make the read a lookup.

Common Mistakes in the Interview

Ranking matches at request time by traversing a prefix's subtree - fatal for short prefixes.
Scanning the corpus instead of using a prefix structure.
Ignoring client-side debounce, turning every keystroke into a request.
No edge caching of popular prefixes, sending head traffic all the way to the service.
Treating the trie as a source of truth rather than rebuildable derived data.
Assuming real-time freshness is required everywhere - only the trending tail needs it.
Sharding on the raw first letter, producing badly skewed shards.

Quick Reference

Topic	Key Point
Core pattern	Trie with top-k precomputed at every prefix node
Why a trie	Prefix lookup is O(prefix length), independent of corpus size
Precompute	Rank offline; the read becomes a walk plus a stored-list read
Ranking	Popularity from query logs with time decay; computed during the build
Personalisation	Lightweight re-rank of a bounded candidate set at request time
Freshness	Hourly rebuild + a few-minute trending overlay merged into results
Caching	Client debounce + cache; CDN edge cache for popular prefixes
Sharding	Hash the prefix - sharding on the raw first letter is heavily skewed
Consistency	Eventually consistent; the trie is rebuildable derived data
Multi-region	Read-only snapshot replicates everywhere; per-region tries for relevance

System Design Interview Problems: A Senior's Roadmap - the full series index and pattern library.
System Design Interview Guide: The 6-Step Framework - the method this walkthrough applies.
Design a Web Crawler - Part 8; the crawl that ultimately feeds a search corpus.
Design Video Streaming - Part 10; another precompute-heavy, CDN-served, derived-artifact system.
Design a Distributed Cache - Part 4; the edge and in-memory caching this design leans on.
Design a News Feed - Part 5; ranking a bounded candidate set, reused here for personalisation.

This is Part 9 of a 12-part system design series where each post solves one problem around one core pattern. Next: Design Video Streaming.

Design Search Autocomplete: System Design Interview 2026

Table of Contents

The Problem

Step 1 - Clarify Requirements

Step 2 - Estimate Scale

Step 3 - API and Data Model

Step 4 - High-Level Design

Step 5 - Deep Dive: The Trie and Precomputed Top-K

Part A - Why a trie

Part B - Precomputing the top-k

Part C - Ranking

Part D - The build pipeline and freshness

Part E - Caching and the request path

Consistency model

Failure modes

Multi-region

Evolution path

Observability

Step 6 - Bottlenecks and Trade-offs

Reference Architecture

Common Mistakes in the Interview

Quick Reference

Ready to ace your interview?

Table of Contents

The Problem

Step 1 - Clarify Requirements

Step 2 - Estimate Scale

Step 3 - API and Data Model

Step 4 - High-Level Design

Step 5 - Deep Dive: The Trie and Precomputed Top-K

Part A - Why a trie

Part B - Precomputing the top-k

Part C - Ranking

Part D - The build pipeline and freshness

Part E - Caching and the request path

Consistency model

Failure modes

Multi-region

Evolution path

Observability

Step 6 - Bottlenecks and Trade-offs

Reference Architecture

Common Mistakes in the Interview

Quick Reference

Related Articles

Ready to ace your interview?