What is a URL frontier in a web crawler?

The URL frontier is the structured queue of URLs waiting to be crawled - far more than a plain FIFO. It has a prioritisation layer, so high-value pages are crawled sooner, and a politeness layer of per-domain queues, so no two workers hammer the same domain at once. Workers consume URLs from it, and the link parser feeds newly discovered URLs back into it, making it a work queue with a feedback loop.

How does a crawler avoid crawling the same URL twice?

It checks every discovered URL against a seen-set before adding it to the frontier. At web scale the seen-set holds billions of URLs, so an exact hash set would need a terabyte of memory. Instead a Bloom filter answers the membership check in a fraction of the space, optionally backed by an exact store for the cases where certainty matters.

Why use a Bloom filter in a web crawler?

A Bloom filter stores set membership in roughly ten bits per URL instead of the hundred-plus bytes an exact entry needs - around 12 GB for ten billion URLs versus over a terabyte. It never produces false negatives, so it never claims an uncrawled URL was seen; it can produce false positives, which cause a small fraction of pages to be skipped. For a web crawl, missing about 1% of pages is an acceptable price for the memory saved.

How does a web crawler stay polite to servers?

Politeness is built into the frontier rather than bolted on. Each domain has its own queue assigned to at most one worker at a time, with a delay enforced between consecutive requests to that domain. The crawler also fetches and caches each site's robots.txt and obeys its Disallow rules and Crawl-delay directive. Without this, a crawler overwhelms small sites and gets its IP addresses banned.

How does a crawler keep its content fresh?

The web changes constantly, so pages are re-crawled - but not uniformly. The crawler estimates each page's change frequency from how often its content hash changes over time, then re-admits frequently-changing pages such as news homepages to the frontier far more often than static archive pages. This adaptive scheduling spends limited crawl capacity where content actually moves.

Design a Web Crawler: System Design Interview 2026

Q: What is a crawler trap and how do you handle it?

A crawler trap is a region of a site that generates effectively infinite URLs - a calendar that links to the next month forever, infinitely deep path nesting, or session IDs that make every link unique. Defences include a maximum crawl depth and URL count per domain, URL normalisation that strips session parameters, and detecting repetitive low-value URL patterns.

A web crawler is a graph traversal where the graph has tens of billions of nodes, no map, an adversarial fraction of it, and a strict etiquette you must follow or get banned. The core loop - fetch a page, extract its links, follow them - fits in a sentence. Everything difficult is in the supporting machinery: the queue that decides what to crawl next without overwhelming any one site, and the deduplication that keeps the crawler from drowning in URLs it has already seen.

This walkthrough assumes the 6-step system design framework and applies it at senior depth. It is Part 8 of a system design series.

The Problem
Step 1 - Clarify Requirements
Step 2 - Estimate Scale
Step 3 - API and Data Model
Step 4 - High-Level Design
Step 5 - Deep Dive: The URL Frontier and Deduplication
Step 6 - Bottlenecks and Trade-offs
Reference Architecture
Common Mistakes in the Interview
Quick Reference
Related Articles

The Problem

We are designing a web crawler: starting from seed URLs, it fetches pages, extracts their links, and follows them to build a large corpus of web content - the input a search index is later built from. The canonical example is Googlebot.

The senior framing is that this is a distributed work queue with a feedback loop. Worker processes consume URLs and fetch pages; the link parser produces new URLs back into the queue. Two forces keep that loop from collapsing: a frontier that orders the work while respecting politeness, and a deduplication layer that stops the queue from re-ingesting the billions of URLs it has already processed.

Step 1 - Clarify Requirements

Functional requirements:

Crawl from a set of seed URLs, following discovered links.
Fetch page content and store it for downstream indexing.
Obey robots.txt.
Re-crawl pages over time to keep content fresh.

Out of scope (name, then defer): building the search index itself (that is Part 9), ranking, and the details of HTML parsing - assume a parser exists.

Non-functional requirements:

Web scale - billions of pages.
Politeness - never overwhelm a single domain; obey robots.txt. This is both courtesy and a hard operational requirement, since rude crawlers get banned.
Deduplication - never crawl the same URL twice.
Trap resistance - survive infinite URL spaces.
Fault tolerance - a full crawl runs for days or weeks and must survive worker failures.

The clarifying questions: this is a continuous crawl, not a one-shot - the web changes, so freshness matters. Traversal is breadth-first, with the frontier as the queue, because BFS discovers the web broadly rather than tunnelling into one site. And politeness is non-negotiable.

Step 2 - Estimate Scale

Crawl rate. Target a corpus of 10 billion pages refreshed monthly: 10B / 30 days ≈ 330M pages/day ≈ ~3,800 pages/sec average; design for a peak near 10,000/sec.

Content storage. At ~100 KB per page, 10 billion pages is ~1 PB of content, which lives in object storage or a distributed file system.

The deduplication set. Each page links to ~50 URLs, so the crawler sees an enormous URL stream, and the seen-set must hold 10 billion-plus URLs. At ~100 bytes per URL, an exact set is over 1 TB - the number that makes a Bloom filter not an optimisation but a necessity.

Workers. A fetch is I/O-bound at ~500 ms including DNS and latency. One worker at concurrency 100 manages ~200 fetches/sec, so 10,000/sec needs on the order of 50 worker processes, scaled wide.

Step 3 - API and Data Model

A crawler has no public API - it is an internal pipeline. Its state lives in four stores:

Store	Contents
URL frontier	Prioritised, per-domain queues of URLs pending crawl - durable
Seen-URL set	Deduplication structure - a Bloom filter, optionally backed by an exact store
Content store	URL -> fetched content, fetch time, HTTP status, content hash
Domain metadata	Per domain: cached `robots.txt`, crawl-delay, last-crawl time

A URL travelling through the system carries its url, a priority, the depth at which it was discovered, and a discoveredAt timestamp. The depth and a per-domain URL count are what later bound crawler traps.

Step 4 - High-Level Design

The architecture is the crawl loop made concrete, with the parser closing the feedback cycle back to the frontier.

flowchart TD
    Seeds[Seed URLs] --> Frontier[URL Frontier<br/>prioritised + per-domain]
    Frontier --> Fetcher[Fetcher Workers]
    Fetcher -->|resolve| DNS[(DNS Cache)]
    Fetcher -->|check| Robots[(robots.txt Cache)]
    Fetcher -->|store page| Content[(Content Store)]
    Fetcher --> Parser[Link Extractor]
    Parser -->|extracted URLs| Dedup{Seen before?}
    Dedup -->|no - normalise + add| Frontier
    Dedup -->|yes - discard| Drop[Discard]

Figure 1. The crawler architecture with the parser closing the feedback cycle. Workers fetch from the frontier, store content, extract links, and feed deduplicated discoveries back into the frontier - a producer-consumer loop where consumers also produce. The DNS and robots.txt caches are what keep that loop fast despite its enormous fan-out.

The frontier hands URLs to fetcher workers. A worker resolves DNS from cache, checks the cached robots.txt, fetches the page, and stores it. The parser extracts links; each one is normalised and checked against the seen-set, and only genuinely new URLs re-enter the frontier. Every component except the frontier is stateless and scales horizontally.

Step 5 - Deep Dive: The URL Frontier and Deduplication

This is the core. The frontier decides what to crawl next; deduplication decides what not to crawl at all.

Part A - The frontier is not a plain queue

A naive design uses one FIFO queue of URLs. It fails on two counts. It has no politeness: many workers pulling from one queue will, by chance, fetch the same domain concurrently and hammer it. And it has no prioritisation: a high-value news homepage is treated identically to an obscure forum post.

A real frontier is a two-layer structure, the classic Mercator design:

flowchart TD
    In[New URLs] --> PQ[Front: priority queues]
    PQ -->|high| H[High-priority queue]
    PQ -->|low| L[Low-priority queue]
    H --> Router{Domain router}
    L --> Router
    Router --> D1[Back: queue for domain A]
    Router --> D2[Back: queue for domain B]
    Router --> D3[Back: queue for domain C]
    D1 --> W1[Worker - one domain at a time]
    D2 --> W2[Worker]
    D3 --> W3[Worker]

Figure 2. The Mercator frontier in detail. The front layer prioritises URLs by value (a news homepage above an obscure forum), and the back layer routes by domain so that at most one worker fetches from any one site at any moment. Politeness is therefore structural - encoded in the shape of the queue - not a check bolted on afterwards.

The front is a set of priority queues - a URL's priority comes from signals such as the host's importance or estimated page value, so valuable pages are crawled sooner and re-crawled more often. The back is a set of per-domain queues, and a router guarantees each domain queue is served by at most one worker at a time, with an enforced delay between consecutive fetches. Politeness is therefore structural - it is the shape of the queue, not a check bolted on afterward. This is a distributed work queue with a feedback loop, the producer-consumer pattern of Part 3 applied to graph traversal.

Part B - Politeness

Beyond per-domain serialisation, the crawler fetches each site's robots.txt once, caches it, and obeys its Disallow rules and Crawl-delay. The crawl achieves high aggregate throughput not by hitting any one site hard but by crawling thousands of domains in parallel - throughput comes from breadth, not depth. A crawler that ignores this gets its IP ranges blocked, which is why politeness is a hard requirement rather than a nicety.

sequenceDiagram
    participant Q as Per-domain queue (example.com)
    participant W as Fetcher worker
    participant R as robots.txt cache
    participant S as example.com
 
    W->>Q: next URL?
    Q-->>W: /page1
    W->>R: rules for example.com?
    R-->>W: Crawl-delay: 2s, Disallow: /admin
    W->>S: GET /page1
    S-->>W: 200 OK
    Note over W: enforced wait - 2 seconds (Crawl-delay)
    W->>Q: next URL?
    Q-->>W: /page2
    W->>S: GET /page2
    S-->>W: 200 OK
    Note over Q,W: one worker per domain + Crawl-delay = polite by construction

Figure 3. Politeness in action. A worker consults the cached robots.txt for a domain's rules, fetches, and then enforces the Crawl-delay before its next request to the same site. One-worker-per-domain plus Crawl-delay is what keeps the crawler from being IP-banned at scale - aggregate throughput comes from breadth across many domains, never depth into one.

Part C - Deduplication and the Bloom filter

Before a discovered URL enters the frontier, the crawler must answer "have I seen this already?" - billions of times, against a set of billions. An exact hash set of 10B URLs costs over a terabyte of RAM.

A Bloom filter answers the same question probabilistically in a fraction of the space. It is a bit array with k hash functions: add(url) sets k bits, and contains(url) reports "present" only if all k bits are set.

flowchart LR
    U["URL"] --> H1[hash 1]
    U --> H2[hash 2]
    U --> H3[hash 3]
    H1 --> B["Bit array"]
    H2 --> B
    H3 --> B
    B --> R{All k bits set?}
    R -->|no| New["Definitely new - add to frontier"]
    R -->|yes| Seen["Probably seen - discard"]

Figure 4. The Bloom filter's structure. A URL is hashed by k independent functions, each setting or testing a bit in the array; the filter reports "probably seen" only when all k bits are set. The asymmetric answer - no false negatives, ~1% false positives - is what trades a tiny fraction of skipped pages for two orders of magnitude less memory than an exact set.

Its asymmetry is the whole point. A Bloom filter has no false negatives - if it says "new", the URL is genuinely new - but it does have false positives, occasionally reporting a fresh URL as seen. The cost is calibrated: ~10 bits per element gives a ~1% false-positive rate, so 10B URLs fit in ~12 GB instead of 1 TB. The price of that 1% is skipping about 1% of pages - and for a web crawl that is acceptable, because a skipped page is usually reachable through some other link anyway. Naming this trade-off explicitly is the senior signal. Where exactness genuinely matters, the Bloom filter becomes a fast pre-filter in front of an exact backing store: only a "probably seen" answer triggers the more expensive confirming lookup.

Deduplication also depends on URL normalisation - lowercasing the host, removing default ports, sorting query parameters, stripping fragments - so the same page under different URL spellings is recognised as one. Content-level deduplication complements it: hash the fetched body (or use shingling for near-duplicates) so different URLs serving identical content are not stored twice.

Part D - Crawler traps

Some URL spaces are effectively infinite: a calendar linking to the next month forever, infinitely deep path nesting, or session IDs that make every link unique. Defences: a maximum crawl depth and a maximum URL count per domain, URL normalisation that strips session parameters, and pattern detection that recognises repetitive low-value URLs.

Part E - Freshness and re-crawl

A continuous crawler re-admits crawled URLs to the frontier - but not uniformly. It estimates each page's change frequency by watching how often its content hash changes between crawls, then schedules a news homepage for frequent re-crawl and a static archive page for rare re-crawl. This spends finite crawl capacity where content actually moves.

Consistency model

The corpus is eventually consistent with the live web - at any instant it is a stale snapshot, which is inherent to crawling. Deduplication is approximate by design, via the Bloom filter's false positives. Only the content store is strongly durable, because it is the pipeline's output.

Failure modes

Worker crash mid-fetch. The frontier hands out URLs with at-least-once semantics: an un-acknowledged URL becomes visible again after a timeout and is re-crawled. The rare resulting double-crawl is harmless, and workers are stateless.
Frontier node loss. The frontier is the crawl's progress, so it must be durable and partitioned, not in-memory - losing it loses days of work.
Bloom filter loss. Losing it only causes some re-crawling - wasteful, never incorrect, since there are no false negatives - but it is still snapshotted periodically to avoid a large redo.
Slow or hostile site. Per-domain timeouts and a per-domain circuit breaker stop one bad site from tying up workers.

Multi-region

Crawl from several data centres, partitioning the URL space by domain - a hash of the hostname - so that each domain is owned by exactly one regional frontier shard. That ownership is what keeps politeness correct: two regions must never independently hammer the same domain. The content store is geo-distributed, and a site is ideally crawled from a region near it.

Evolution path

Stage	Approach
Launch	Single-machine BFS crawler, in-memory queue and seen-set
Growth	Distributed durable frontier, many worker machines, a Bloom filter for dedup
Scale	Mercator priority + per-domain frontier, adaptive re-crawl, multi-region, content dedup

Build URL normalisation, the seen-check abstraction, and per-domain politeness from day one - retrofitting politeness into a running crawler is painful. Defer adaptive freshness, content near-duplicate detection, and multi-region.

Observability

Track pages crawled/sec, frontier size per priority, the dedup hit rate, fetch error rate per domain, the robots.txt-blocked rate, the re-crawl rate, crawler-trap detections, and bandwidth. A reasonable SLO: crawl 99% of the target corpus within the monthly refresh window.

Step 6 - Bottlenecks and Trade-offs

Deduplication memory would be a terabyte as an exact set; the Bloom filter cuts it to ~12 GB at the cost of ~1% missed pages.
Politeness caps per-domain throughput, so aggregate speed must come from crawling many domains in parallel.
The frontier must be durable - it holds the crawl's entire progress.
DNS resolution at billions of lookups demands aggressive caching.
Crawler traps are bounded by depth and count limits plus normalisation.
Network I/O is the physical ceiling: the crawl is I/O-bound, so workers scale wide.

Reference Architecture

The pattern this problem teaches, reusable well beyond crawling:

A distributed work queue with a feedback loop - consumers process items and produce more items back into the queue - fronted by a probabilistic membership filter that keeps the work set from exploding.

flowchart LR
    subgraph Loop["Work queue with feedback"]
        Q[(Durable frontier)] --> Workers[Stateless workers]
        Workers -->|new items| Filter{Bloom filter:<br/>seen before?}
        Filter -->|no| Q
        Filter -->|yes| X[Discard]
    end
    Workers --> Out[(Output store)]

Figure 5. The reference architecture stripped to its loop. A durable work queue feeds stateless workers, which feed discoveries back through a probabilistic membership filter to suppress revisits. The same shape recurs in dependency resolvers, recursive scrapers, and build systems - whenever a huge graph must be traversed without visiting anything twice.

The same shape recurs in any large-scale graph traversal: dependency resolution in a package manager, recursive scraping, social-graph exploration, build-system task graphs. A durable work queue, stateless workers that feed discoveries back, and a probabilistic filter to suppress revisits is the toolkit for "traverse a huge graph without traversing anything twice".

Common Mistakes in the Interview

A single FIFO frontier with no politeness, hammering domains until the crawler is banned.
An exact hash set for deduplication, demanding a terabyte of memory.
Misstating the Bloom filter trade-off - it has no false negatives; false positives cause a small, acceptable fraction of skipped pages.
No URL normalisation, so one page is crawled under many URL spellings.
Ignoring crawler traps, letting an infinite URL space consume the crawl.
Treating the frontier as ephemeral, losing all crawl progress on a failure.
No freshness strategy, crawling every page once and never revisiting.

Quick Reference

Topic	Key Point
Core pattern	Distributed work queue with a feedback loop - the URL frontier
Frontier	Front priority queues + back per-domain queues (Mercator design)
Politeness	One worker per domain, enforced delay, obey `robots.txt` - structural
Throughput	From breadth - many domains in parallel - never from hammering one
Deduplication	Bloom filter: ~12 GB vs 1 TB; no false negatives, ~1% false positives
Normalisation	Canonicalise URLs before the seen-check; hash content for near-dups
Crawler traps	Max depth, max URLs per domain, normalisation, pattern detection
Freshness	Adaptive re-crawl from observed content-hash change frequency
Frontier durability	Must be durable and partitioned - it is the crawl's progress
Multi-region	Partition by domain so each domain has one owner - keeps politeness correct

System Design Interview Problems: A Senior's Roadmap - the full series index and pattern library.
System Design Interview Guide: The 6-Step Framework - the method this walkthrough applies.
Design a Notification Service - Part 3; the durable queue and producer-consumer pattern.
Design a Distributed Cache - Part 4; the DNS and robots.txt caches that keep the crawl loop fast.
Design Search Autocomplete - Part 9; the search corpus a crawl ultimately feeds.
Apache Kafka Interview Questions - durable queues and consumer groups for the frontier.

This is Part 8 of a 12-part system design series where each post solves one problem around one core pattern. Next: Design Search Autocomplete.

Design a Web Crawler: System Design Interview 2026

Table of Contents

The Problem

Step 1 - Clarify Requirements

Step 2 - Estimate Scale

Step 3 - API and Data Model

Step 4 - High-Level Design

Step 5 - Deep Dive: The URL Frontier and Deduplication

Part A - The frontier is not a plain queue

Part B - Politeness

Part C - Deduplication and the Bloom filter

Part D - Crawler traps

Part E - Freshness and re-crawl

Consistency model

Failure modes

Multi-region

Evolution path

Observability

Step 6 - Bottlenecks and Trade-offs

Reference Architecture

Common Mistakes in the Interview

Quick Reference

Ready to ace your interview?

Table of Contents

The Problem

Step 1 - Clarify Requirements

Step 2 - Estimate Scale

Step 3 - API and Data Model

Step 4 - High-Level Design

Step 5 - Deep Dive: The URL Frontier and Deduplication

Part A - The frontier is not a plain queue

Part B - Politeness

Part C - Deduplication and the Bloom filter

Part D - Crawler traps

Part E - Freshness and re-crawl

Consistency model

Failure modes

Multi-region

Evolution path

Observability

Step 6 - Bottlenecks and Trade-offs

Reference Architecture

Common Mistakes in the Interview

Quick Reference

Related Articles

Ready to ace your interview?