How do you avoid running the same scheduled job twice?

Coordinate ownership before firing: the job set is sharded, and each shard has exactly one scheduler that owns it through a leader-election lock in a consensus-backed coordination service. Firing a job is an atomic claim-and-advance in the database that flips its next-run-at timestamp to the next occurrence in the same transaction as dispatch. Even with this, execution is at-least-once - the job itself must be idempotent so a retried dispatch causes no harm.

What is leader election in a distributed scheduler?

Leader election is the mechanism by which the scheduler nodes agree that exactly one of them owns a given shard of the job set at a given time. It runs through a strongly-consistent coordination service like etcd, ZooKeeper, or Consul - which themselves implement consensus protocols such as Raft - using leases with a TTL. The owner renews its lease periodically; if it dies, the lease expires and another node claims the shard, recovering automatically.

What is a fencing token and why does a scheduler need one?

A fencing token is a monotonically increasing version number issued whenever shard ownership changes. Writes to the database carry the token, and a write with a stale token is rejected. This prevents split-brain: if a network partition makes two nodes briefly believe they own the same shard, only the latest owner's writes succeed, so a partition cannot cause duplicate dispatches even momentarily.

How do you handle scheduled jobs that were missed during an outage?

Missed schedules are settled by an explicit per-job misfire policy, not silently. The reasonable options are to catch up by firing every missed run, to skip and resume from the next scheduled time, or to fire once for the most recent missed run and resume. The right choice depends on the job - a billing run may need a catch-up, a per-minute heartbeat should skip - and a senior design makes it configurable rather than picking one globally.

How do you scale a job scheduler horizontally?

The job set is partitioned into many shards, hashed by job ID for an even distribution. Each shard is owned by exactly one scheduler node via leader election, so adding scheduler nodes adds capacity proportional to the number of shards each one owns. Workers are independent and scale on their own queue. The coordination service is sized by shard count and lease renewal rate, never by job count.

Can a job scheduler guarantee exactly-once execution?

No - exactly-once execution end to end is unachievable for the same reason as in messaging and payments: the worker call and its acknowledgement cannot be made atomic across an unreliable network. The contract is at-least-once dispatch plus an idempotent job, which gives the same user-observable result. The scheduler supplies a natural idempotency key per fire - the (jobId, scheduledAt) pair - so workers and downstream effects can deduplicate on it.

Design a Job Scheduler: System Design Interview 2026

A distributed job scheduler looks like cron, scales like a distributed database, and fails like a coordination problem. Its job description sounds dull - fire each scheduled task at its time, on some pool of machines - until you account for what happens when two scheduler nodes both think they should fire the same job, when one of them is down for ten minutes and a thousand jobs were due in that window, or when an aligned cron expression makes ten thousand jobs come due at the same second. None of those questions has a "just throw more servers at it" answer, which is what makes this the natural closing problem of a system design series: the answer composes nearly every pattern that came before.

This walkthrough assumes the 6-step system design framework and applies it at senior-plus depth. It is Part 12 - and the final part - of a system design series.

The Problem
Step 1 - Clarify Requirements
Step 2 - Estimate Scale
Step 3 - API and Data Model
Step 4 - High-Level Design
Step 5 - Deep Dive: Leader Election, Sharded Scheduling, and At-Least-Once Execution
Step 6 - Bottlenecks and Trade-offs
Reference Architecture
Common Mistakes in the Interview
Quick Reference
Related Articles

The Problem

We are designing a distributed job scheduler - the engine behind cron jobs at platform scale, delayed retries, scheduled emails, periodic ETL, billing runs, and any other "run this at time T" workload. The canonical examples are Quartz, Kubernetes CronJobs, AWS EventBridge Scheduler, and the scheduler underneath any workflow engine like Airflow.

The senior framing has two halves. The scheduling decision - "is this job due, who fires it" - must be coordinated, because uncoordinated schedulers all fire the same job. The execution must be at-least-once with idempotent jobs, because the dispatch-then-execute boundary cannot be made exactly-once. Almost every design decision is either a coordination mechanism or an idempotency mechanism, and they together produce a scheduler that never silently loses a job and almost never runs one twice.

Step 1 - Clarify Requirements

Functional requirements:

Register a job: a definition plus a schedule (cron expression, fixed delay, or a one-shot runAt).
Trigger jobs at the right time and dispatch them to workers.
Track job runs and their outcomes.
Retry failed jobs with backoff.
Cancel, pause, and modify registered jobs.

Out of scope (name, then defer): the job logic itself (we assume idempotent workers), workflow DAGs (Airflow is a scheduler plus a workflow engine - we design only the scheduler), and the authoring UI.

Non-functional requirements:

Do not run a job twice when not asked to - and accept that this is "minimise duplicates", not "eliminate", because of at-least-once execution.
Do not silently miss a scheduled run - and have an explicit policy for what to do about runs missed during an outage.
Scale to millions of jobs and peaks of tens of thousands of triggers per second.
Time precision to seconds; sub-second is a different problem.
High availability, so a node failure does not stop firing.
Tolerate clock skew between machines.

The decisive clarifying question, established the same way it was in Part 3 and Part 11: the execution guarantee is at-least-once with idempotency, never exactly-once. The scheduler's job is to make duplicate dispatches rare and to make missed schedules detectable and recoverable; the worker's job is to be idempotent so a duplicate dispatch has no effect.

Step 2 - Estimate Scale

Registered jobs. Assume 10 million across the platform - a mix of recurring crons and one-shot delayed jobs.

Trigger rate. Peak around ~10,000 triggers/sec, the peak coming from aligned cron expressions: "every hour at :00" makes thousands of jobs come due at the same instant. The scheduler must absorb that herd rather than serialise on it.

State storage. 10M jobs at ~500 bytes ≈ 5 GB of metadata. Job-run history at ~200 bytes per run, retained 30 days, runs to a few TB - partitioned by shard, time-tiered.

The interesting number is not the totals but the spike-to-average ratio: the scheduler is mostly idle and occasionally has to fire ten thousand jobs at the same second. Sharding is what spreads that spike.

Step 3 - API and Data Model

POST /api/jobs
  body: { jobId, schedule, payload, target, retryPolicy, misfirePolicy }
  201 Created
 
DELETE /api/jobs/{id}
GET    /api/jobs/{id}/runs

The data model is small but carefully indexed:

Entity	Key fields
Job	`jobId`, `shard`, `schedule`, `target`, `retryPolicy`, `misfirePolicy`, `status`, `nextRunAt`
Job run	`runId`, `jobId`, `scheduledAt`, `dispatchedAt`, `status`, `attempts`, `workerId`
Shard ownership	`shard` -> `(ownerNode, leaseExpiresAt, fencingToken)` - in the coordination service

Two fields carry the design. nextRunAt is the precomputed firing time, indexed by (shard, nextRunAt) so a shard owner can answer "what fires in the next second" with a tight indexed scan, regardless of how many jobs exist. misfirePolicy is the per-job knob that decides what happens when a scheduler outage causes a job to miss its scheduled time - catch up, skip to next, or fire once and skip.

Step 4 - High-Level Design

flowchart TD
    Cli([Client]) -->|register / cancel| API[Scheduler API]
    API --> Store[(Job Store<br/>indexed by shard, nextRunAt)]
    subgraph Sched["Scheduler tier - one owner per shard"]
        S1[Scheduler Node A]
        S2[Scheduler Node B]
        S3[Scheduler Node C]
    end
    Coord[Coordination Service<br/>etcd / ZooKeeper / Consul]
    S1 <-->|lease + fencing token| Coord
    S2 <-->|lease + fencing token| Coord
    S3 <-->|lease + fencing token| Coord
    S1 -->|tick: claim + advance due jobs| Store
    S2 --> Store
    S3 --> Store
    S1 -->|dispatch trigger| Q[Worker Queue]
    S2 --> Q
    S3 --> Q
    Q --> W[Worker Pool<br/>idempotent jobs]
    W -->|run result| Store

Figure 1. The architecture: multiple scheduler nodes coordinate through a consensus-backed coordination service for shard ownership, then run their tick loop against the indexed job store. The coordination service is the only strongly-consistent dependency outside the job store, and it is small - sized by shard count, not by job count.

The scheduler tier is multiple nodes; shard ownership lives in the coordination service, which provides leases and fencing tokens. Each owner runs a tick loop on its shards: query the job store for jobs due in the shard, atomically claim and advance them, and dispatch a trigger onto the worker queue. Workers consume the queue, execute the job (idempotently), and write the run result back. The coordination service is the only strongly-consistent piece outside the job store - and it is small.

Step 5 - Deep Dive: Leader Election, Sharded Scheduling, and At-Least-Once Execution

This is the core. Three mechanisms compose: shard ownership through leader election with fencing, the atomic claim-and-advance tick loop, and at-least-once dispatch onto idempotent workers - with an explicit policy for what happens when an outage means a job missed its time.

Part A - Why coordination is needed

Run N scheduler nodes that each scan the whole job set every second and the result is N duplicate fires per job. Coordination must guarantee that for each job, exactly one scheduler is responsible at any given moment - and two structures can deliver that:

A single global leader. One node fires everything; the rest stand by for failover. Simple, but the throughput ceiling is one node, and the single point of contention is the scheduler itself.
Sharded ownership with one leader per shard. Partition the job set into many shards (typically by hash(jobId)); each shard has exactly one owner; scheduler capacity grows with shard count.

The second option is the senior answer. The system scales horizontally by spreading shards across nodes - which is also how Kafka brokers own partitions, how database leaders own primary key ranges, and how every other partitioned-leader system in the wild works.

Part B - Leader election and fencing tokens

Shard ownership is delegated to a strongly-consistent coordination service - etcd, ZooKeeper, or Consul - that implements consensus internally (Raft, Zab) and exposes leases as a primitive. Each scheduler node tries to acquire a lease on a shard's key; the holder periodically renews it; a crashed holder's lease expires and another node grabs it. Recovery is automatic and bounded by the lease TTL.

flowchart TD
    Start([Node attempts to own shard S]) --> Try[Compare-and-swap on shard-S key]
    Try -->|success| Own["Owner: lease with TTL + fencing token N"]
    Try -->|already owned| Wait[Wait for lease to expire]
    Wait --> Try
    Own --> Renew[Periodic renew]
    Renew -->|ok| Renew
    Renew -->|crash or partition| Expire[Lease expires]
    Expire --> Try
    Own --> Work["Run scheduler tick for shard S<br/>writes carry fencing token N"]

Figure 2. The lease lifecycle for one shard. A node tries compare-and-swap, becomes owner with a TTL and a fencing token, renews periodically, and loses the lease on crash or partition; another node then claims it. Recovery is automatic and bounded by the TTL - the simplest correct primitive for "exactly one owner at a time".

What still has to be handled is split-brain. A network partition can briefly leave two nodes each convinced they own a shard - the old owner has not noticed the partition, the new owner has already taken the lease. Without protection, both would dispatch the same jobs.

The defence is a fencing token. The coordination service issues a monotonically increasing version number every time ownership changes; writes to the job store carry that token, and the store rejects any write whose token is older than the most recent one it has seen. So even if two nodes believe they own a shard, only the latest one can mutate state - the old one's claim-and-advance silently fails. Fencing tokens are one of the few cleanly-correct mechanisms for split-brain in a distributed system and are worth naming explicitly.

sequenceDiagram
    participant A as Scheduler A (former owner)
    participant Co as Coordination service
    participant B as Scheduler B (new owner)
    participant DB as Job store
 
    Note over A: owns shard, fencing token = 7
    Note over A,Co: network partition - A cannot renew
    Note over Co: A's lease expires
    B->>Co: claim shard S
    Co-->>B: granted, fencing token = 8
    Note over A: partition heals, A still thinks it owns S
    A->>DB: claim-and-advance, token = 7
    DB-->>A: REJECTED - latest token is 8
    B->>DB: claim-and-advance, token = 8
    DB-->>B: OK
    Note over A,B: only B's writes land - brief split-brain produces no duplicates

Figure 3. The split-brain scenario neutralised by a fencing token. During the partition the new owner takes the lease with token 8; when the partition heals, the former owner's writes still carry the stale token 7 and are silently rejected by the store. Brief disagreement about ownership therefore produces zero duplicate dispatches - the core correctness property of the whole design.

Part C - The tick loop and atomic claim-and-advance

Each scheduler owns a set of shards. For every owned shard, every ~second:

Query: SELECT * WHERE shard = S AND nextRunAt <= now AND status = active LIMIT N. The (shard, nextRunAt) index makes this cheap.
Atomically claim and advance, in one transaction: insert a job_run row with scheduledAt, update the job's nextRunAt to its next occurrence, and pass the shard's fencing token in the write predicate.
Dispatch a trigger event onto the worker queue.

The atomic claim-and-advance is the mechanism that makes "no missed schedules" true. If a scheduler crashes after step 2 but before step 3, the dispatch is lost - but the run row is already recorded, and a watchdog (or the next scheduler) sees a job_run without a worker outcome and re-dispatches. If the scheduler crashes before step 2, no state changed and the next owner sees the same jobs as still due. Either way, the job fires - at-least-once, never silently skipped.

The atomic claim+advance is also the Part 2 atomic-counter idea generalised: a single indivisible operation prevents the race between detecting work and recording that you took it.

The aligned cron herd (ten thousand jobs all due at :00) is handled by sharding: that ten-thousand-job spike is split across, say, 100 shards owned by many scheduler nodes, so each tick fires ~100 jobs and the worker queue absorbs the rest.

Part D - At-least-once execution and idempotency

Once a trigger is on the worker queue, the message follows queue semantics - at-least-once delivery (Part 3). A worker crash mid-execution causes the message to be redelivered; without protection, the job runs twice. Exactly-once is not achievable, for the same reason as in messaging and payments: the worker call and its acknowledgement cannot be wrapped in an atomic bracket.

The contract is therefore at-least-once dispatch plus an idempotent job. The scheduler hands the worker a natural idempotency key - the (jobId, scheduledAt) pair - and workers either consult a recent-executions cache to dedup on it or write their effects under that key (the Part 11 discipline applied to scheduled work). For workloads where redoing the job is truly unacceptable, the job itself must be designed around this contract; the scheduler will not save you.

Job lifecycle

stateDiagram-v2
    [*] --> SCHEDULED: register
    SCHEDULED --> DISPATCHED: tick claim+advance
    DISPATCHED --> RUNNING: worker picks up
    RUNNING --> SUCCEEDED: ok
    RUNNING --> RETRYING: transient failure
    RETRYING --> DISPATCHED: backoff elapsed
    RETRYING --> DEAD_LETTER: max attempts reached
    SCHEDULED --> SCHEDULED: recurring -> next nextRunAt
    SUCCEEDED --> [*]
    DEAD_LETTER --> [*]

Figure 4. The job lifecycle as a state machine. A recurring job loops through SCHEDULED -> DISPATCHED -> RUNNING -> SUCCEEDED and re-enters SCHEDULED with a new nextRunAt; failures retry with backoff and ultimately land in DEAD_LETTER. Every transition is durable, which is what makes "never silently lost" true.

A recurring job moves back to SCHEDULED with its new nextRunAt set during the same claim-and-advance transaction. Failed attempts retry with backoff (the Part 3 exponential-with-jitter pattern); a job that exhausts its retries lands in a dead-letter state for inspection.

Part E - Missed schedules and time

A scheduler outage of ten minutes leaves jobs whose nextRunAt has passed. On recovery the new owner could fire every missed run (catch up), skip to next (resume), or fire once and skip (acknowledge once, then resume). The right choice is per job: a monthly billing run wants catch-up, a per-minute heartbeat wants skip, a "send the daily report" wants fire-once-and-skip. The senior answer makes this a misfirePolicy on the job definition rather than a global decision.

Clock skew is the other time hazard. Workers must not decide whether a job is due; only the scheduler does, based on the job store's clock or a single authoritative time source. NTP keeps machine clocks close, but not closer than a scheduler-level decision can require, and disagreement about "now" between two would-be deciders is exactly how a sloppy design produces duplicates.

Consistency model

Shard ownership and the (shard, nextRunAt) -> claim transaction are strongly consistent. Execution is at-least-once with idempotency, observably effectively-once. Jobs may run slightly late (within seconds, by design) and occasionally twice (workers dedup), and they are never silently lost thanks to the durable job store and at-least-once dispatch. State all three out loud.

Failure modes

Scheduler crash. Its lease expires within the TTL, another node claims the shards, scans the store, and fires anything due - automatic recovery.
Coordination service down. New leases cannot be acquired, but already-leased shards keep working until lease expiry; the coordination service is itself HA via consensus, so this is rare and brief.
Worker queue down. Dispatches fail; the scheduler stops emitting, but the durable nextRunAt index still records what is due, and the queue drains its backlog when it returns.
Worker crash mid-execution. At-least-once redelivery plus idempotent jobs cover it.
Network partition between owner and store. The owner's writes fail their fencing-token check and are silently rejected; the new owner takes over after lease expiry, with no duplicate dispatches.
Poison job that always fails. Retries with cap then dead-letter, the Part 3 DLQ pattern. Alert on DLQ growth.

Multi-region

Partition shards by region with affinity by jobId, so each region owns its own slice; a region failure causes another region to absorb its shards via the same lease mechanism. A global job that must fire exactly once worldwide at a given time has one designated owner region - because doing it everywhere is exactly the split-brain a fencing token is designed to prevent, only stretched across continents.

Evolution path

Stage	Approach
Launch	Single scheduler node, a simple cron table in the DB, in-process workers
Growth	Many schedulers with naive shard claim, a worker queue, retries with backoff
Scale	Per-shard leader election with fencing tokens, configurable misfire policies, multi-region

The non-negotiable day-one investments are idempotent jobs as a contract, the (jobId, scheduledAt) idempotency key, the durable job store, and the atomic claim-and-advance. Defer multi-region, elaborate rebalancing, and per-job priority.

Observability

Track scheduler tick latency, jobs fired/sec, the missed-schedule count (alert - it should be zero outside of an outage), the duplicate-execution count, dispatch-to-execute queue lag, worker p99 execution latency, lease churn rate (frequent ownership changes hint at flaky nodes), and per-shard load skew. Reasonable SLOs: 99.9% of jobs fired within 5 s of their scheduled time; missed-schedule rate under 0.01%.

Step 6 - Bottlenecks and Trade-offs

The coordination service is sized by shard count and lease renewal rate, never by job count - keep shards in the thousands, not millions.
The (shard, nextRunAt) index keeps each tick's query bounded regardless of job count - the wrong index turns the scheduler into a full-table scanner.
The aligned cron herd is the worst-case load and is absorbed by spreading it across many shards on many nodes.
Lease churn - frequent ownership changes - signals flaky nodes or too-tight TTLs and is itself a metric.
The duplicate-execution rate is the headline correctness signal; a non-trivial value means a worker is not actually idempotent.

Reference Architecture

The pattern this problem teaches, reusable far beyond schedulers:

Shard the work, elect one owner per shard through a consensus-backed coordination service with fencing tokens, perform every state change as an atomic claim-and-advance in a durable store, and dispatch to idempotent at-least-once consumers.

flowchart LR
    subgraph Coord["Coordination - one owner per shard"]
        L[Lease + fencing token]
    end
    subgraph Tick["Owner's tick loop"]
        S[Scheduler] -->|atomic claim + advance| D[(Durable job store)]
        S --> Q[Worker queue]
    end
    Q --> W[Idempotent workers]
    L -.protects.-> S

Figure 5. The reference architecture for any "one writer per partition at a time" system: a coordination layer with leases and fencing protects the owner's tick loop, which atomically claims work in the durable store and dispatches to idempotent workers. The same composition underlies Kafka brokers, primary databases, and partitioned stream processors.

The same shape recurs in any "one writer per partition at a time" system: Kafka brokers owning partitions, primary databases owning shards, distributed lock services, partitioned stream processors. Whenever the answer to "who is responsible for this slice of state right now" must be exactly one node, this composition - leases plus fencing plus atomic-claim plus at-least-once delivery to idempotent consumers - is the toolkit.

This is also the convergence post of the series: the atomic claim comes from Part 2, the durable queue and at-least-once + DLQ from Part 3, the producer-consumer feedback from Part 8, and idempotency at every boundary from Part 11. The scheduler exists at the seam where all of those patterns meet, which is why it is the right closing problem.

Common Mistakes in the Interview

No coordination, so every scheduler fires every job and duplicates explode.
A single global leader as the only design, making the scheduler itself the bottleneck.
Leader election without fencing tokens, leaving split-brain duplicate dispatch under a partition.
Claiming exactly-once execution instead of at-least-once plus idempotency.
No misfire policy, so a long outage silently skips runs or fires thousands of catch-ups.
Letting workers decide whether a job is due, exposing the design to clock skew.
Storing scheduler state in process memory rather than the durable job store, losing it on every restart.

Quick Reference

Topic	Key Point
Core pattern	Sharded ownership + leader election + fencing + atomic claim-and-advance
Coordination	etcd / ZooKeeper / Consul leases with TTL; one owner per shard
Fencing token	Monotonic version on every write; rejects stale owners under partition
Tick loop	Query `(shard, nextRunAt)`, atomically claim + advance, dispatch
Execution	At-least-once dispatch + idempotent job; `(jobId, scheduledAt)` as the key
Missed schedules	Explicit per-job misfire policy: catch-up, skip, or fire-once-and-skip
Time	Scheduler decides "due", never the worker - clock skew otherwise duplicates
Aligned cron herd	Spread the spike across many shards; worker queue absorbs the rest
Failure recovery	Lease expiry -> new owner -> store-driven re-dispatch; nothing silently lost
Multi-region	Shards owned per region; truly global jobs pinned to one designated region

System Design Interview Problems: A Senior's Roadmap - the full series index and pattern library.
System Design Interview Guide: The 6-Step Framework - the method this walkthrough applies.
Design a Rate Limiter - Part 2; the atomic-claim primitive generalised here to claim-and-advance.
Design a Notification Service - Part 3; at-least-once dispatch, retries with backoff, and the dead-letter queue.
Design a Payment System - Part 11; idempotency at every boundary, applied here to scheduled work.
Design a Unique ID Generator - Part 13; the Tier 5 expansion opens with decentralised ID generation.

This is Part 12, the close of the core 12-part track in a system design series where each post solves one problem around one core pattern. The job scheduler sits where the series' threads converge - leases, atomic claims, durable queues, dead letters, and idempotency at every boundary - which is why it makes a fitting close. Continue with the Tier 5 expansion, or return to the series roadmap to revisit any pattern.

Design a Job Scheduler: System Design Interview 2026

Table of Contents

The Problem

Step 1 - Clarify Requirements

Step 2 - Estimate Scale

Step 3 - API and Data Model

Step 4 - High-Level Design

Step 5 - Deep Dive: Leader Election, Sharded Scheduling, and At-Least-Once Execution

Part A - Why coordination is needed

Part B - Leader election and fencing tokens

Part C - The tick loop and atomic claim-and-advance

Part D - At-least-once execution and idempotency

Job lifecycle

Part E - Missed schedules and time

Consistency model

Failure modes

Multi-region

Evolution path

Observability

Step 6 - Bottlenecks and Trade-offs

Reference Architecture

Common Mistakes in the Interview

Quick Reference

Ready to ace your interview?

Table of Contents

The Problem

Step 1 - Clarify Requirements

Step 2 - Estimate Scale

Step 3 - API and Data Model

Step 4 - High-Level Design

Step 5 - Deep Dive: Leader Election, Sharded Scheduling, and At-Least-Once Execution

Part A - Why coordination is needed

Part B - Leader election and fencing tokens

Part C - The tick loop and atomic claim-and-advance

Part D - At-least-once execution and idempotency

Job lifecycle

Part E - Missed schedules and time

Consistency model

Failure modes

Multi-region

Evolution path

Observability

Step 6 - Bottlenecks and Trade-offs

Reference Architecture

Common Mistakes in the Interview

Quick Reference

Related Articles

Ready to ace your interview?