Design a Web Crawler: System Design Interview 2026

·15 min read
system-designweb-crawlerdistributed-queuearchitecturebackendinterview-preparation

A web crawler is a graph traversal where the graph has tens of billions of nodes, no map, an adversarial fraction of it, and a strict etiquette you must follow or get banned. The core loop - fetch a page, extract its links, follow them - fits in a sentence. Everything difficult is in the supporting machinery: the queue that decides what to crawl next without overwhelming any one site, and the deduplication that keeps the crawler from drowning in URLs it has already seen.

This walkthrough assumes the 6-step system design framework and applies it at senior depth. It is Part 8 of a system design series.

Table of Contents

  1. The Problem
  2. Step 1 - Clarify Requirements
  3. Step 2 - Estimate Scale
  4. Step 3 - API and Data Model
  5. Step 4 - High-Level Design
  6. Step 5 - Deep Dive: The URL Frontier and Deduplication
  7. Step 6 - Bottlenecks and Trade-offs
  8. Reference Architecture
  9. Common Mistakes in the Interview
  10. Quick Reference
  11. Related Articles

The Problem

We are designing a web crawler: starting from seed URLs, it fetches pages, extracts their links, and follows them to build a large corpus of web content - the input a search index is later built from. The canonical example is Googlebot.

The senior framing is that this is a distributed work queue with a feedback loop. Worker processes consume URLs and fetch pages; the link parser produces new URLs back into the queue. Two forces keep that loop from collapsing: a frontier that orders the work while respecting politeness, and a deduplication layer that stops the queue from re-ingesting the billions of URLs it has already processed.


Step 1 - Clarify Requirements

Functional requirements:

  • Crawl from a set of seed URLs, following discovered links.
  • Fetch page content and store it for downstream indexing.
  • Obey robots.txt.
  • Re-crawl pages over time to keep content fresh.

Out of scope (name, then defer): building the search index itself (that is Part 9), ranking, and the details of HTML parsing - assume a parser exists.

Non-functional requirements:

  • Web scale - billions of pages.
  • Politeness - never overwhelm a single domain; obey robots.txt. This is both courtesy and a hard operational requirement, since rude crawlers get banned.
  • Deduplication - never crawl the same URL twice.
  • Trap resistance - survive infinite URL spaces.
  • Fault tolerance - a full crawl runs for days or weeks and must survive worker failures.

The clarifying questions: this is a continuous crawl, not a one-shot - the web changes, so freshness matters. Traversal is breadth-first, with the frontier as the queue, because BFS discovers the web broadly rather than tunnelling into one site. And politeness is non-negotiable.


Step 2 - Estimate Scale

Crawl rate. Target a corpus of 10 billion pages refreshed monthly: 10B / 30 days ≈ 330M pages/day ≈ ~3,800 pages/sec average; design for a peak near 10,000/sec.

Content storage. At ~100 KB per page, 10 billion pages is ~1 PB of content, which lives in object storage or a distributed file system.

The deduplication set. Each page links to ~50 URLs, so the crawler sees an enormous URL stream, and the seen-set must hold 10 billion-plus URLs. At ~100 bytes per URL, an exact set is over 1 TB - the number that makes a Bloom filter not an optimisation but a necessity.

Workers. A fetch is I/O-bound at ~500 ms including DNS and latency. One worker at concurrency 100 manages ~200 fetches/sec, so 10,000/sec needs on the order of 50 worker processes, scaled wide.


Step 3 - API and Data Model

A crawler has no public API - it is an internal pipeline. Its state lives in four stores:

StoreContents
URL frontierPrioritised, per-domain queues of URLs pending crawl - durable
Seen-URL setDeduplication structure - a Bloom filter, optionally backed by an exact store
Content storeURL -> fetched content, fetch time, HTTP status, content hash
Domain metadataPer domain: cached robots.txt, crawl-delay, last-crawl time

A URL travelling through the system carries its url, a priority, the depth at which it was discovered, and a discoveredAt timestamp. The depth and a per-domain URL count are what later bound crawler traps.


Step 4 - High-Level Design

The architecture is the crawl loop made concrete, with the parser closing the feedback cycle back to the frontier.

flowchart TD
    Seeds[Seed URLs] --> Frontier[URL Frontier<br/>prioritised + per-domain]
    Frontier --> Fetcher[Fetcher Workers]
    Fetcher -->|resolve| DNS[(DNS Cache)]
    Fetcher -->|check| Robots[(robots.txt Cache)]
    Fetcher -->|store page| Content[(Content Store)]
    Fetcher --> Parser[Link Extractor]
    Parser -->|extracted URLs| Dedup{Seen before?}
    Dedup -->|no - normalise + add| Frontier
    Dedup -->|yes - discard| Drop[Discard]

Figure 1. The crawler architecture with the parser closing the feedback cycle. Workers fetch from the frontier, store content, extract links, and feed deduplicated discoveries back into the frontier - a producer-consumer loop where consumers also produce. The DNS and robots.txt caches are what keep that loop fast despite its enormous fan-out.

The frontier hands URLs to fetcher workers. A worker resolves DNS from cache, checks the cached robots.txt, fetches the page, and stores it. The parser extracts links; each one is normalised and checked against the seen-set, and only genuinely new URLs re-enter the frontier. Every component except the frontier is stateless and scales horizontally.


Step 5 - Deep Dive: The URL Frontier and Deduplication

This is the core. The frontier decides what to crawl next; deduplication decides what not to crawl at all.

Part A - The frontier is not a plain queue

A naive design uses one FIFO queue of URLs. It fails on two counts. It has no politeness: many workers pulling from one queue will, by chance, fetch the same domain concurrently and hammer it. And it has no prioritisation: a high-value news homepage is treated identically to an obscure forum post.

A real frontier is a two-layer structure, the classic Mercator design:

flowchart TD
    In[New URLs] --> PQ[Front: priority queues]
    PQ -->|high| H[High-priority queue]
    PQ -->|low| L[Low-priority queue]
    H --> Router{Domain router}
    L --> Router
    Router --> D1[Back: queue for domain A]
    Router --> D2[Back: queue for domain B]
    Router --> D3[Back: queue for domain C]
    D1 --> W1[Worker - one domain at a time]
    D2 --> W2[Worker]
    D3 --> W3[Worker]

Figure 2. The Mercator frontier in detail. The front layer prioritises URLs by value (a news homepage above an obscure forum), and the back layer routes by domain so that at most one worker fetches from any one site at any moment. Politeness is therefore structural - encoded in the shape of the queue - not a check bolted on afterwards.

The front is a set of priority queues - a URL's priority comes from signals such as the host's importance or estimated page value, so valuable pages are crawled sooner and re-crawled more often. The back is a set of per-domain queues, and a router guarantees each domain queue is served by at most one worker at a time, with an enforced delay between consecutive fetches. Politeness is therefore structural - it is the shape of the queue, not a check bolted on afterward. This is a distributed work queue with a feedback loop, the producer-consumer pattern of Part 3 applied to graph traversal.

Part B - Politeness

Beyond per-domain serialisation, the crawler fetches each site's robots.txt once, caches it, and obeys its Disallow rules and Crawl-delay. The crawl achieves high aggregate throughput not by hitting any one site hard but by crawling thousands of domains in parallel - throughput comes from breadth, not depth. A crawler that ignores this gets its IP ranges blocked, which is why politeness is a hard requirement rather than a nicety.

sequenceDiagram
    participant Q as Per-domain queue (example.com)
    participant W as Fetcher worker
    participant R as robots.txt cache
    participant S as example.com
 
    W->>Q: next URL?
    Q-->>W: /page1
    W->>R: rules for example.com?
    R-->>W: Crawl-delay: 2s, Disallow: /admin
    W->>S: GET /page1
    S-->>W: 200 OK
    Note over W: enforced wait - 2 seconds (Crawl-delay)
    W->>Q: next URL?
    Q-->>W: /page2
    W->>S: GET /page2
    S-->>W: 200 OK
    Note over Q,W: one worker per domain + Crawl-delay = polite by construction

Figure 3. Politeness in action. A worker consults the cached robots.txt for a domain's rules, fetches, and then enforces the Crawl-delay before its next request to the same site. One-worker-per-domain plus Crawl-delay is what keeps the crawler from being IP-banned at scale - aggregate throughput comes from breadth across many domains, never depth into one.

Part C - Deduplication and the Bloom filter

Before a discovered URL enters the frontier, the crawler must answer "have I seen this already?" - billions of times, against a set of billions. An exact hash set of 10B URLs costs over a terabyte of RAM.

A Bloom filter answers the same question probabilistically in a fraction of the space. It is a bit array with k hash functions: add(url) sets k bits, and contains(url) reports "present" only if all k bits are set.

flowchart LR
    U["URL"] --> H1[hash 1]
    U --> H2[hash 2]
    U --> H3[hash 3]
    H1 --> B["Bit array"]
    H2 --> B
    H3 --> B
    B --> R{All k bits set?}
    R -->|no| New["Definitely new - add to frontier"]
    R -->|yes| Seen["Probably seen - discard"]

Figure 4. The Bloom filter's structure. A URL is hashed by k independent functions, each setting or testing a bit in the array; the filter reports "probably seen" only when all k bits are set. The asymmetric answer - no false negatives, ~1% false positives - is what trades a tiny fraction of skipped pages for two orders of magnitude less memory than an exact set.

Its asymmetry is the whole point. A Bloom filter has no false negatives - if it says "new", the URL is genuinely new - but it does have false positives, occasionally reporting a fresh URL as seen. The cost is calibrated: ~10 bits per element gives a ~1% false-positive rate, so 10B URLs fit in ~12 GB instead of 1 TB. The price of that 1% is skipping about 1% of pages - and for a web crawl that is acceptable, because a skipped page is usually reachable through some other link anyway. Naming this trade-off explicitly is the senior signal. Where exactness genuinely matters, the Bloom filter becomes a fast pre-filter in front of an exact backing store: only a "probably seen" answer triggers the more expensive confirming lookup.

Deduplication also depends on URL normalisation - lowercasing the host, removing default ports, sorting query parameters, stripping fragments - so the same page under different URL spellings is recognised as one. Content-level deduplication complements it: hash the fetched body (or use shingling for near-duplicates) so different URLs serving identical content are not stored twice.

Part D - Crawler traps

Some URL spaces are effectively infinite: a calendar linking to the next month forever, infinitely deep path nesting, or session IDs that make every link unique. Defences: a maximum crawl depth and a maximum URL count per domain, URL normalisation that strips session parameters, and pattern detection that recognises repetitive low-value URLs.

Part E - Freshness and re-crawl

A continuous crawler re-admits crawled URLs to the frontier - but not uniformly. It estimates each page's change frequency by watching how often its content hash changes between crawls, then schedules a news homepage for frequent re-crawl and a static archive page for rare re-crawl. This spends finite crawl capacity where content actually moves.

Consistency model

The corpus is eventually consistent with the live web - at any instant it is a stale snapshot, which is inherent to crawling. Deduplication is approximate by design, via the Bloom filter's false positives. Only the content store is strongly durable, because it is the pipeline's output.

Failure modes

  • Worker crash mid-fetch. The frontier hands out URLs with at-least-once semantics: an un-acknowledged URL becomes visible again after a timeout and is re-crawled. The rare resulting double-crawl is harmless, and workers are stateless.
  • Frontier node loss. The frontier is the crawl's progress, so it must be durable and partitioned, not in-memory - losing it loses days of work.
  • Bloom filter loss. Losing it only causes some re-crawling - wasteful, never incorrect, since there are no false negatives - but it is still snapshotted periodically to avoid a large redo.
  • Slow or hostile site. Per-domain timeouts and a per-domain circuit breaker stop one bad site from tying up workers.

Multi-region

Crawl from several data centres, partitioning the URL space by domain - a hash of the hostname - so that each domain is owned by exactly one regional frontier shard. That ownership is what keeps politeness correct: two regions must never independently hammer the same domain. The content store is geo-distributed, and a site is ideally crawled from a region near it.

Evolution path

StageApproach
LaunchSingle-machine BFS crawler, in-memory queue and seen-set
GrowthDistributed durable frontier, many worker machines, a Bloom filter for dedup
ScaleMercator priority + per-domain frontier, adaptive re-crawl, multi-region, content dedup

Build URL normalisation, the seen-check abstraction, and per-domain politeness from day one - retrofitting politeness into a running crawler is painful. Defer adaptive freshness, content near-duplicate detection, and multi-region.

Observability

Track pages crawled/sec, frontier size per priority, the dedup hit rate, fetch error rate per domain, the robots.txt-blocked rate, the re-crawl rate, crawler-trap detections, and bandwidth. A reasonable SLO: crawl 99% of the target corpus within the monthly refresh window.


Step 6 - Bottlenecks and Trade-offs

  • Deduplication memory would be a terabyte as an exact set; the Bloom filter cuts it to ~12 GB at the cost of ~1% missed pages.
  • Politeness caps per-domain throughput, so aggregate speed must come from crawling many domains in parallel.
  • The frontier must be durable - it holds the crawl's entire progress.
  • DNS resolution at billions of lookups demands aggressive caching.
  • Crawler traps are bounded by depth and count limits plus normalisation.
  • Network I/O is the physical ceiling: the crawl is I/O-bound, so workers scale wide.

Reference Architecture

The pattern this problem teaches, reusable well beyond crawling:

A distributed work queue with a feedback loop - consumers process items and produce more items back into the queue - fronted by a probabilistic membership filter that keeps the work set from exploding.

flowchart LR
    subgraph Loop["Work queue with feedback"]
        Q[(Durable frontier)] --> Workers[Stateless workers]
        Workers -->|new items| Filter{Bloom filter:<br/>seen before?}
        Filter -->|no| Q
        Filter -->|yes| X[Discard]
    end
    Workers --> Out[(Output store)]

Figure 5. The reference architecture stripped to its loop. A durable work queue feeds stateless workers, which feed discoveries back through a probabilistic membership filter to suppress revisits. The same shape recurs in dependency resolvers, recursive scrapers, and build systems - whenever a huge graph must be traversed without visiting anything twice.

The same shape recurs in any large-scale graph traversal: dependency resolution in a package manager, recursive scraping, social-graph exploration, build-system task graphs. A durable work queue, stateless workers that feed discoveries back, and a probabilistic filter to suppress revisits is the toolkit for "traverse a huge graph without traversing anything twice".


Common Mistakes in the Interview

  • A single FIFO frontier with no politeness, hammering domains until the crawler is banned.
  • An exact hash set for deduplication, demanding a terabyte of memory.
  • Misstating the Bloom filter trade-off - it has no false negatives; false positives cause a small, acceptable fraction of skipped pages.
  • No URL normalisation, so one page is crawled under many URL spellings.
  • Ignoring crawler traps, letting an infinite URL space consume the crawl.
  • Treating the frontier as ephemeral, losing all crawl progress on a failure.
  • No freshness strategy, crawling every page once and never revisiting.

Quick Reference

TopicKey Point
Core patternDistributed work queue with a feedback loop - the URL frontier
FrontierFront priority queues + back per-domain queues (Mercator design)
PolitenessOne worker per domain, enforced delay, obey robots.txt - structural
ThroughputFrom breadth - many domains in parallel - never from hammering one
DeduplicationBloom filter: ~12 GB vs 1 TB; no false negatives, ~1% false positives
NormalisationCanonicalise URLs before the seen-check; hash content for near-dups
Crawler trapsMax depth, max URLs per domain, normalisation, pattern detection
FreshnessAdaptive re-crawl from observed content-hash change frequency
Frontier durabilityMust be durable and partitioned - it is the crawl's progress
Multi-regionPartition by domain so each domain has one owner - keeps politeness correct

This is Part 8 of a 12-part system design series where each post solves one problem around one core pattern. Next: Design Search Autocomplete.

Ready to ace your interview?

Get 550+ interview questions with detailed answers in our comprehensive PDF guides.

View PDF Guides