A distributed job scheduler looks like cron, scales like a distributed database, and fails like a coordination problem. Its job description sounds dull - fire each scheduled task at its time, on some pool of machines - until you account for what happens when two scheduler nodes both think they should fire the same job, when one of them is down for ten minutes and a thousand jobs were due in that window, or when an aligned cron expression makes ten thousand jobs come due at the same second. None of those questions has a "just throw more servers at it" answer, which is what makes this the natural closing problem of a system design series: the answer composes nearly every pattern that came before.
This walkthrough assumes the 6-step system design framework and applies it at senior-plus depth. It is Part 12 - and the final part - of a system design series.
Table of Contents
- The Problem
- Step 1 - Clarify Requirements
- Step 2 - Estimate Scale
- Step 3 - API and Data Model
- Step 4 - High-Level Design
- Step 5 - Deep Dive: Leader Election, Sharded Scheduling, and At-Least-Once Execution
- Step 6 - Bottlenecks and Trade-offs
- Reference Architecture
- Common Mistakes in the Interview
- Quick Reference
- Related Articles
The Problem
We are designing a distributed job scheduler - the engine behind cron jobs at platform scale, delayed retries, scheduled emails, periodic ETL, billing runs, and any other "run this at time T" workload. The canonical examples are Quartz, Kubernetes CronJobs, AWS EventBridge Scheduler, and the scheduler underneath any workflow engine like Airflow.
The senior framing has two halves. The scheduling decision - "is this job due, who fires it" - must be coordinated, because uncoordinated schedulers all fire the same job. The execution must be at-least-once with idempotent jobs, because the dispatch-then-execute boundary cannot be made exactly-once. Almost every design decision is either a coordination mechanism or an idempotency mechanism, and they together produce a scheduler that never silently loses a job and almost never runs one twice.
Step 1 - Clarify Requirements
Functional requirements:
- Register a job: a definition plus a schedule (cron expression, fixed delay, or a one-shot
runAt). - Trigger jobs at the right time and dispatch them to workers.
- Track job runs and their outcomes.
- Retry failed jobs with backoff.
- Cancel, pause, and modify registered jobs.
Out of scope (name, then defer): the job logic itself (we assume idempotent workers), workflow DAGs (Airflow is a scheduler plus a workflow engine - we design only the scheduler), and the authoring UI.
Non-functional requirements:
- Do not run a job twice when not asked to - and accept that this is "minimise duplicates", not "eliminate", because of at-least-once execution.
- Do not silently miss a scheduled run - and have an explicit policy for what to do about runs missed during an outage.
- Scale to millions of jobs and peaks of tens of thousands of triggers per second.
- Time precision to seconds; sub-second is a different problem.
- High availability, so a node failure does not stop firing.
- Tolerate clock skew between machines.
The decisive clarifying question, established the same way it was in Part 3 and Part 11: the execution guarantee is at-least-once with idempotency, never exactly-once. The scheduler's job is to make duplicate dispatches rare and to make missed schedules detectable and recoverable; the worker's job is to be idempotent so a duplicate dispatch has no effect.
Step 2 - Estimate Scale
Registered jobs. Assume 10 million across the platform - a mix of recurring crons and one-shot delayed jobs.
Trigger rate. Peak around ~10,000 triggers/sec, the peak coming from aligned cron expressions: "every hour at :00" makes thousands of jobs come due at the same instant. The scheduler must absorb that herd rather than serialise on it.
State storage. 10M jobs at ~500 bytes ≈ 5 GB of metadata. Job-run history at ~200 bytes per run, retained 30 days, runs to a few TB - partitioned by shard, time-tiered.
The interesting number is not the totals but the spike-to-average ratio: the scheduler is mostly idle and occasionally has to fire ten thousand jobs at the same second. Sharding is what spreads that spike.
Step 3 - API and Data Model
POST /api/jobs
body: { jobId, schedule, payload, target, retryPolicy, misfirePolicy }
201 Created
DELETE /api/jobs/{id}
GET /api/jobs/{id}/runsThe data model is small but carefully indexed:
| Entity | Key fields |
|---|---|
| Job | jobId, shard, schedule, target, retryPolicy, misfirePolicy, status, nextRunAt |
| Job run | runId, jobId, scheduledAt, dispatchedAt, status, attempts, workerId |
| Shard ownership | shard -> (ownerNode, leaseExpiresAt, fencingToken) - in the coordination service |
Two fields carry the design. nextRunAt is the precomputed firing time, indexed by (shard, nextRunAt) so a shard owner can answer "what fires in the next second" with a tight indexed scan, regardless of how many jobs exist. misfirePolicy is the per-job knob that decides what happens when a scheduler outage causes a job to miss its scheduled time - catch up, skip to next, or fire once and skip.
Step 4 - High-Level Design
flowchart TD
Cli([Client]) -->|register / cancel| API[Scheduler API]
API --> Store[(Job Store<br/>indexed by shard, nextRunAt)]
subgraph Sched["Scheduler tier - one owner per shard"]
S1[Scheduler Node A]
S2[Scheduler Node B]
S3[Scheduler Node C]
end
Coord[Coordination Service<br/>etcd / ZooKeeper / Consul]
S1 <-->|lease + fencing token| Coord
S2 <-->|lease + fencing token| Coord
S3 <-->|lease + fencing token| Coord
S1 -->|tick: claim + advance due jobs| Store
S2 --> Store
S3 --> Store
S1 -->|dispatch trigger| Q[Worker Queue]
S2 --> Q
S3 --> Q
Q --> W[Worker Pool<br/>idempotent jobs]
W -->|run result| StoreFigure 1. The architecture: multiple scheduler nodes coordinate through a consensus-backed coordination service for shard ownership, then run their tick loop against the indexed job store. The coordination service is the only strongly-consistent dependency outside the job store, and it is small - sized by shard count, not by job count.
The scheduler tier is multiple nodes; shard ownership lives in the coordination service, which provides leases and fencing tokens. Each owner runs a tick loop on its shards: query the job store for jobs due in the shard, atomically claim and advance them, and dispatch a trigger onto the worker queue. Workers consume the queue, execute the job (idempotently), and write the run result back. The coordination service is the only strongly-consistent piece outside the job store - and it is small.
Step 5 - Deep Dive: Leader Election, Sharded Scheduling, and At-Least-Once Execution
This is the core. Three mechanisms compose: shard ownership through leader election with fencing, the atomic claim-and-advance tick loop, and at-least-once dispatch onto idempotent workers - with an explicit policy for what happens when an outage means a job missed its time.
Part A - Why coordination is needed
Run N scheduler nodes that each scan the whole job set every second and the result is N duplicate fires per job. Coordination must guarantee that for each job, exactly one scheduler is responsible at any given moment - and two structures can deliver that:
- A single global leader. One node fires everything; the rest stand by for failover. Simple, but the throughput ceiling is one node, and the single point of contention is the scheduler itself.
- Sharded ownership with one leader per shard. Partition the job set into many shards (typically by
hash(jobId)); each shard has exactly one owner; scheduler capacity grows with shard count.
The second option is the senior answer. The system scales horizontally by spreading shards across nodes - which is also how Kafka brokers own partitions, how database leaders own primary key ranges, and how every other partitioned-leader system in the wild works.
Part B - Leader election and fencing tokens
Shard ownership is delegated to a strongly-consistent coordination service - etcd, ZooKeeper, or Consul - that implements consensus internally (Raft, Zab) and exposes leases as a primitive. Each scheduler node tries to acquire a lease on a shard's key; the holder periodically renews it; a crashed holder's lease expires and another node grabs it. Recovery is automatic and bounded by the lease TTL.
flowchart TD
Start([Node attempts to own shard S]) --> Try[Compare-and-swap on shard-S key]
Try -->|success| Own["Owner: lease with TTL + fencing token N"]
Try -->|already owned| Wait[Wait for lease to expire]
Wait --> Try
Own --> Renew[Periodic renew]
Renew -->|ok| Renew
Renew -->|crash or partition| Expire[Lease expires]
Expire --> Try
Own --> Work["Run scheduler tick for shard S<br/>writes carry fencing token N"]Figure 2. The lease lifecycle for one shard. A node tries compare-and-swap, becomes owner with a TTL and a fencing token, renews periodically, and loses the lease on crash or partition; another node then claims it. Recovery is automatic and bounded by the TTL - the simplest correct primitive for "exactly one owner at a time".
What still has to be handled is split-brain. A network partition can briefly leave two nodes each convinced they own a shard - the old owner has not noticed the partition, the new owner has already taken the lease. Without protection, both would dispatch the same jobs.
The defence is a fencing token. The coordination service issues a monotonically increasing version number every time ownership changes; writes to the job store carry that token, and the store rejects any write whose token is older than the most recent one it has seen. So even if two nodes believe they own a shard, only the latest one can mutate state - the old one's claim-and-advance silently fails. Fencing tokens are one of the few cleanly-correct mechanisms for split-brain in a distributed system and are worth naming explicitly.
sequenceDiagram
participant A as Scheduler A (former owner)
participant Co as Coordination service
participant B as Scheduler B (new owner)
participant DB as Job store
Note over A: owns shard, fencing token = 7
Note over A,Co: network partition - A cannot renew
Note over Co: A's lease expires
B->>Co: claim shard S
Co-->>B: granted, fencing token = 8
Note over A: partition heals, A still thinks it owns S
A->>DB: claim-and-advance, token = 7
DB-->>A: REJECTED - latest token is 8
B->>DB: claim-and-advance, token = 8
DB-->>B: OK
Note over A,B: only B's writes land - brief split-brain produces no duplicatesFigure 3. The split-brain scenario neutralised by a fencing token. During the partition the new owner takes the lease with token 8; when the partition heals, the former owner's writes still carry the stale token 7 and are silently rejected by the store. Brief disagreement about ownership therefore produces zero duplicate dispatches - the core correctness property of the whole design.
Part C - The tick loop and atomic claim-and-advance
Each scheduler owns a set of shards. For every owned shard, every ~second:
- Query:
SELECT * WHERE shard = S AND nextRunAt <= now AND status = active LIMIT N. The(shard, nextRunAt)index makes this cheap. - Atomically claim and advance, in one transaction: insert a
job_runrow withscheduledAt, update the job'snextRunAtto its next occurrence, and pass the shard's fencing token in the write predicate. - Dispatch a trigger event onto the worker queue.
The atomic claim-and-advance is the mechanism that makes "no missed schedules" true. If a scheduler crashes after step 2 but before step 3, the dispatch is lost - but the run row is already recorded, and a watchdog (or the next scheduler) sees a job_run without a worker outcome and re-dispatches. If the scheduler crashes before step 2, no state changed and the next owner sees the same jobs as still due. Either way, the job fires - at-least-once, never silently skipped.
The atomic claim+advance is also the Part 2 atomic-counter idea generalised: a single indivisible operation prevents the race between detecting work and recording that you took it.
The aligned cron herd (ten thousand jobs all due at :00) is handled by sharding: that ten-thousand-job spike is split across, say, 100 shards owned by many scheduler nodes, so each tick fires ~100 jobs and the worker queue absorbs the rest.
Part D - At-least-once execution and idempotency
Once a trigger is on the worker queue, the message follows queue semantics - at-least-once delivery (Part 3). A worker crash mid-execution causes the message to be redelivered; without protection, the job runs twice. Exactly-once is not achievable, for the same reason as in messaging and payments: the worker call and its acknowledgement cannot be wrapped in an atomic bracket.
The contract is therefore at-least-once dispatch plus an idempotent job. The scheduler hands the worker a natural idempotency key - the (jobId, scheduledAt) pair - and workers either consult a recent-executions cache to dedup on it or write their effects under that key (the Part 11 discipline applied to scheduled work). For workloads where redoing the job is truly unacceptable, the job itself must be designed around this contract; the scheduler will not save you.
Job lifecycle
stateDiagram-v2
[*] --> SCHEDULED: register
SCHEDULED --> DISPATCHED: tick claim+advance
DISPATCHED --> RUNNING: worker picks up
RUNNING --> SUCCEEDED: ok
RUNNING --> RETRYING: transient failure
RETRYING --> DISPATCHED: backoff elapsed
RETRYING --> DEAD_LETTER: max attempts reached
SCHEDULED --> SCHEDULED: recurring -> next nextRunAt
SUCCEEDED --> [*]
DEAD_LETTER --> [*]Figure 4. The job lifecycle as a state machine. A recurring job loops through SCHEDULED -> DISPATCHED -> RUNNING -> SUCCEEDED and re-enters SCHEDULED with a new nextRunAt; failures retry with backoff and ultimately land in DEAD_LETTER. Every transition is durable, which is what makes "never silently lost" true.
A recurring job moves back to SCHEDULED with its new nextRunAt set during the same claim-and-advance transaction. Failed attempts retry with backoff (the Part 3 exponential-with-jitter pattern); a job that exhausts its retries lands in a dead-letter state for inspection.
Part E - Missed schedules and time
A scheduler outage of ten minutes leaves jobs whose nextRunAt has passed. On recovery the new owner could fire every missed run (catch up), skip to next (resume), or fire once and skip (acknowledge once, then resume). The right choice is per job: a monthly billing run wants catch-up, a per-minute heartbeat wants skip, a "send the daily report" wants fire-once-and-skip. The senior answer makes this a misfirePolicy on the job definition rather than a global decision.
Clock skew is the other time hazard. Workers must not decide whether a job is due; only the scheduler does, based on the job store's clock or a single authoritative time source. NTP keeps machine clocks close, but not closer than a scheduler-level decision can require, and disagreement about "now" between two would-be deciders is exactly how a sloppy design produces duplicates.
Consistency model
Shard ownership and the (shard, nextRunAt) -> claim transaction are strongly consistent. Execution is at-least-once with idempotency, observably effectively-once. Jobs may run slightly late (within seconds, by design) and occasionally twice (workers dedup), and they are never silently lost thanks to the durable job store and at-least-once dispatch. State all three out loud.
Failure modes
- Scheduler crash. Its lease expires within the TTL, another node claims the shards, scans the store, and fires anything due - automatic recovery.
- Coordination service down. New leases cannot be acquired, but already-leased shards keep working until lease expiry; the coordination service is itself HA via consensus, so this is rare and brief.
- Worker queue down. Dispatches fail; the scheduler stops emitting, but the durable
nextRunAtindex still records what is due, and the queue drains its backlog when it returns. - Worker crash mid-execution. At-least-once redelivery plus idempotent jobs cover it.
- Network partition between owner and store. The owner's writes fail their fencing-token check and are silently rejected; the new owner takes over after lease expiry, with no duplicate dispatches.
- Poison job that always fails. Retries with cap then dead-letter, the Part 3 DLQ pattern. Alert on DLQ growth.
Multi-region
Partition shards by region with affinity by jobId, so each region owns its own slice; a region failure causes another region to absorb its shards via the same lease mechanism. A global job that must fire exactly once worldwide at a given time has one designated owner region - because doing it everywhere is exactly the split-brain a fencing token is designed to prevent, only stretched across continents.
Evolution path
| Stage | Approach |
|---|---|
| Launch | Single scheduler node, a simple cron table in the DB, in-process workers |
| Growth | Many schedulers with naive shard claim, a worker queue, retries with backoff |
| Scale | Per-shard leader election with fencing tokens, configurable misfire policies, multi-region |
The non-negotiable day-one investments are idempotent jobs as a contract, the (jobId, scheduledAt) idempotency key, the durable job store, and the atomic claim-and-advance. Defer multi-region, elaborate rebalancing, and per-job priority.
Observability
Track scheduler tick latency, jobs fired/sec, the missed-schedule count (alert - it should be zero outside of an outage), the duplicate-execution count, dispatch-to-execute queue lag, worker p99 execution latency, lease churn rate (frequent ownership changes hint at flaky nodes), and per-shard load skew. Reasonable SLOs: 99.9% of jobs fired within 5 s of their scheduled time; missed-schedule rate under 0.01%.
Step 6 - Bottlenecks and Trade-offs
- The coordination service is sized by shard count and lease renewal rate, never by job count - keep shards in the thousands, not millions.
- The
(shard, nextRunAt)index keeps each tick's query bounded regardless of job count - the wrong index turns the scheduler into a full-table scanner. - The aligned cron herd is the worst-case load and is absorbed by spreading it across many shards on many nodes.
- Lease churn - frequent ownership changes - signals flaky nodes or too-tight TTLs and is itself a metric.
- The duplicate-execution rate is the headline correctness signal; a non-trivial value means a worker is not actually idempotent.
Reference Architecture
The pattern this problem teaches, reusable far beyond schedulers:
Shard the work, elect one owner per shard through a consensus-backed coordination service with fencing tokens, perform every state change as an atomic claim-and-advance in a durable store, and dispatch to idempotent at-least-once consumers.
flowchart LR
subgraph Coord["Coordination - one owner per shard"]
L[Lease + fencing token]
end
subgraph Tick["Owner's tick loop"]
S[Scheduler] -->|atomic claim + advance| D[(Durable job store)]
S --> Q[Worker queue]
end
Q --> W[Idempotent workers]
L -.protects.-> SFigure 5. The reference architecture for any "one writer per partition at a time" system: a coordination layer with leases and fencing protects the owner's tick loop, which atomically claims work in the durable store and dispatches to idempotent workers. The same composition underlies Kafka brokers, primary databases, and partitioned stream processors.
The same shape recurs in any "one writer per partition at a time" system: Kafka brokers owning partitions, primary databases owning shards, distributed lock services, partitioned stream processors. Whenever the answer to "who is responsible for this slice of state right now" must be exactly one node, this composition - leases plus fencing plus atomic-claim plus at-least-once delivery to idempotent consumers - is the toolkit.
This is also the convergence post of the series: the atomic claim comes from Part 2, the durable queue and at-least-once + DLQ from Part 3, the producer-consumer feedback from Part 8, and idempotency at every boundary from Part 11. The scheduler exists at the seam where all of those patterns meet, which is why it is the right closing problem.
Common Mistakes in the Interview
- No coordination, so every scheduler fires every job and duplicates explode.
- A single global leader as the only design, making the scheduler itself the bottleneck.
- Leader election without fencing tokens, leaving split-brain duplicate dispatch under a partition.
- Claiming exactly-once execution instead of at-least-once plus idempotency.
- No misfire policy, so a long outage silently skips runs or fires thousands of catch-ups.
- Letting workers decide whether a job is due, exposing the design to clock skew.
- Storing scheduler state in process memory rather than the durable job store, losing it on every restart.
Quick Reference
| Topic | Key Point |
|---|---|
| Core pattern | Sharded ownership + leader election + fencing + atomic claim-and-advance |
| Coordination | etcd / ZooKeeper / Consul leases with TTL; one owner per shard |
| Fencing token | Monotonic version on every write; rejects stale owners under partition |
| Tick loop | Query (shard, nextRunAt), atomically claim + advance, dispatch |
| Execution | At-least-once dispatch + idempotent job; (jobId, scheduledAt) as the key |
| Missed schedules | Explicit per-job misfire policy: catch-up, skip, or fire-once-and-skip |
| Time | Scheduler decides "due", never the worker - clock skew otherwise duplicates |
| Aligned cron herd | Spread the spike across many shards; worker queue absorbs the rest |
| Failure recovery | Lease expiry -> new owner -> store-driven re-dispatch; nothing silently lost |
| Multi-region | Shards owned per region; truly global jobs pinned to one designated region |
Related Articles
- System Design Interview Problems: A Senior's Roadmap - the full series index and pattern library.
- System Design Interview Guide: The 6-Step Framework - the method this walkthrough applies.
- Design a Rate Limiter - Part 2; the atomic-claim primitive generalised here to claim-and-advance.
- Design a Notification Service - Part 3; at-least-once dispatch, retries with backoff, and the dead-letter queue.
- Design a Payment System - Part 11; idempotency at every boundary, applied here to scheduled work.
- Design a Unique ID Generator - Part 13; the Tier 5 expansion opens with decentralised ID generation.
This is Part 12, the close of the core 12-part track in a system design series where each post solves one problem around one core pattern. The job scheduler sits where the series' threads converge - leases, atomic claims, durable queues, dead letters, and idempotency at every boundary - which is why it makes a fitting close. Continue with the Tier 5 expansion, or return to the series roadmap to revisit any pattern.
