How do you prevent double charges in a payment system?

The client generates an idempotency key per logical payment, and the service stores the request and response keyed by it before any side effect. A retry with the same key returns the stored result rather than re-processing. Idempotency is also propagated to the external provider via its idempotency token, so a retried provider call is safe at the bank boundary too. Without these layers, any network hiccup can produce a duplicate charge.

What is an idempotency key?

An idempotency key is a unique identifier the client attaches to a request so the server can recognise retries as the same logical operation. The server records the key, the response, and the resulting status, and any subsequent request with that key returns the original response without doing the work again. Keys are kept for a long TTL - days to weeks - to cover all realistic retry windows.

What is the transactional outbox pattern?

The outbox solves the dual-write problem: a service cannot atomically write to its database and publish to a message bus, so doing one then the other risks dropping events on failure. Instead the service writes an event row to an outbox table inside the same database transaction as the state change, and a separate publisher polls the outbox and emits the events. The event is durably enqueued atomically with the state change, so a committed state change can never lose its event.

When should you use a saga instead of two-phase commit?

Two-phase commit requires every participant to speak the same coordination protocol and to hold resources locked during the prepare phase - neither of which is feasible across external providers like a bank. A saga models a cross-service workflow as a sequence of local transactions, each with a compensating action that undoes its effect. If a step fails, compensations run in reverse to back out the work, giving correctness without a global commit.

Why use a double-entry ledger as the source of truth for money?

A double-entry, append-only ledger records every money movement as a balanced set of debit and credit entries summing to zero, which makes the system auditable, replayable, and self-checking. Balances are derived from the ledger rather than stored as mutable numbers, so a corrupted column cannot quietly destroy money. Corrections become new compensating entries, not edits to history.

What is reconciliation in a payment system?

Reconciliation periodically compares the platform's ledger with the external source of truth - the payment processor's records or bank statements - to detect drift. Discrepancies are alerted on and repaired with compensating ledger entries, never by editing history. It exists because no in-system mechanism can fully eliminate the chance of a timed-out provider call leaving the two sides in disagreement, so the system must check itself against external ground truth.

Design a Payment System: System Design Interview 2026

A payment system inverts almost every priority the rest of this series operates under. Throughput is modest. Latency is dictated by an external bank. The data model fits on a napkin. What is non-negotiable is correctness: a payment must not be silently lost and must never be duplicated. Money is the one resource where "eventually consistent and probably right" is not an acceptable answer, which is why the design is dominated by mechanisms that make every operation idempotent, every cross-system step recoverable, and every internal record cross-checkable against external ground truth.

This walkthrough assumes the 6-step system design framework and applies it at senior-plus depth. It is Part 11 of a system design series.

The Problem
Step 1 - Clarify Requirements
Step 2 - Estimate Scale
Step 3 - API and Data Model
Step 4 - High-Level Design
Step 5 - Deep Dive: Idempotency, the Outbox, and Sagas
Step 6 - Bottlenecks and Trade-offs
Reference Architecture
Common Mistakes in the Interview
Quick Reference
Related Articles

The Problem

We are designing a payment system: a customer pays a merchant, the platform takes a fee, and the money moves through an external payment provider that ultimately settles with the bank rails. The canonical examples are Stripe and the payment subsystem inside any marketplace or platform.

The senior framing is that correctness dominates every other concern. The throughput is modest, the data model is small, and the external provider is slow - none of those is the difficulty. The difficulty is that any retry, network glitch, or partial failure must not turn into a duplicate charge or a lost payment, even when the layer below you - the payment processor - times out without telling you whether the charge happened. Everything in the deep dive is a mechanism for making that true.

Step 1 - Clarify Requirements

Functional requirements:

Process a payment: debit a customer, credit a merchant, take a platform fee.
Issue refunds.
Integrate with an external payment provider that talks to the bank rails.
Expose transaction history.

Out of scope (name, then defer): fraud detection, KYC, fiat conversion, the customer-facing payment UI, and dispute or chargeback handling beyond a basic refund flow.

Non-functional requirements:

Correctness over availability. Money must not be lost or duplicated. This priority is the opposite of every other system in this series.
Idempotency at every boundary that can retry.
Auditability. Every state change recorded, no in-place edits.
Strong consistency around balances.
Modest throughput, but every operation matters.
Regulatory constraints - PCI-DSS, audit logs, long retention.

The clarifying question worth surfacing first: the user-observable guarantee is exactly-once in effect, but, as Part 3 established, exactly-once is not achievable end to end. What we actually deliver is at-least-once with idempotency at every boundary plus a reconciliation safety net against external ground truth. State this up front - the rest of the design follows.

Step 2 - Estimate Scale

The numbers here are deliberately small compared to the rest of the series, because the difficulty is per-operation, not per-second.

Throughput. Assume 10 million payments/day: ~115/sec average, ~500/sec at peak. Tiny by any other system's standards.

Storage. Each payment writes ~2 KB across payment record, ledger entries, and audit data: ~20 GB/day, retained for years for regulatory reasons. The cost is in retention and immutability, not volume.

Latency. The external provider call dominates at 100 to 500 ms, so the provider, not any internal hop, sets the per-payment latency.

The arithmetic confirms that bandwidth, CPU, and storage are not constraints. Correctness is the constraint.
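The estimates above can be reproduced in a few lines. This is only a back-of-envelope sketch: the inputs are the assumptions stated in this step (10 million payments/day, ~2 KB per payment), and the yearly figure is a simple extrapolation that ignores indexes and replication.

```python
# Back-of-envelope figures for Step 2; inputs are the assumptions from the text.
PAYMENTS_PER_DAY = 10_000_000
BYTES_PER_PAYMENT = 2_000          # ~2 KB across payment, ledger, and audit rows
SECONDS_PER_DAY = 24 * 60 * 60     # 86,400

avg_per_sec = PAYMENTS_PER_DAY / SECONDS_PER_DAY
storage_gb_per_day = PAYMENTS_PER_DAY * BYTES_PER_PAYMENT / 1e9
storage_tb_per_year = storage_gb_per_day * 365 / 1_000

print(f"average throughput: {avg_per_sec:.1f} payments/sec")
print(f"storage growth: {storage_gb_per_day:.0f} GB/day, "
      f"{storage_tb_per_year:.1f} TB/year")
```

Even several years of retention lands in the tens of terabytes, which is why the cost sits in immutability and retention policy rather than raw volume.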

Step 3 - API and Data Model

POST /api/payments
  Idempotency-Key: <uuid>
  body: { customerId, merchantId, amount, currency, methodToken }
  201 Created   { paymentId, status: "captured" | "pending" | "failed" }
 
POST /api/payments/{id}/refund
  Idempotency-Key: <uuid>
  body: { amount }
 
GET  /api/payments/{id}   -> the payment state
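A client-side sketch of the contract above, using only the Python standard library. The host name and the field values are hypothetical; the point it illustrates is that the key is generated once per logical payment and reused on every retry, not regenerated per attempt.

```python
import json
import uuid
import urllib.request

BASE_URL = "https://payments.example.com"   # hypothetical host, for illustration


def build_payment_request(body: dict, idempotency_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST /api/payments request."""
    return urllib.request.Request(
        url=f"{BASE_URL}/api/payments",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Idempotency-Key": idempotency_key,
        },
    )


# Generate the key once per logical payment, then reuse it for every retry.
key = str(uuid.uuid4())
body = {"customerId": "c_1", "merchantId": "m_1", "amount": 500,
        "currency": "USD", "methodToken": "tok_x"}
attempt_1 = build_payment_request(body, key)
attempt_2 = build_payment_request(body, key)   # a retry carries the same key
```

A client that generates a fresh key inside its retry loop defeats the whole mechanism, because the server then sees each attempt as a new payment.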

The data model has four pieces, and the relationships between them are the design:

Entity	Role
Payment	`paymentId`, `status` state machine, `amount`, parties, `providerRef`
Idempotency record	`idempotencyKey` -> `paymentId` + cached response, long TTL
Ledger entries	Append-only, double-entry - the authoritative record of money movements
Outbox	Pending events committed atomically with state changes

Two modelling choices are non-obvious and structural. First, balances are not a column. They are derived from the ledger, because a mutable balance column can be corrupted silently and has no audit trail. Every money movement is one set of debit and credit entries that sum to zero, the standard double-entry pattern, and a correction is always a new entry, never an edit. Second, the payment status is an explicit state machine - CREATED -> AUTHORIZED -> CAPTURED -> REFUNDED plus FAILED and PENDING - which makes "we don't yet know whether the provider charged the customer" a first-class state instead of an undefined gap.
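The status state machine can be sketched as an explicit transition table. The state names come from the text; the exact set of allowed edges is an assumption made for illustration (the article names the states but does not enumerate every transition), and partial refunds are ignored.

```python
from enum import Enum


class Status(Enum):
    CREATED = "created"
    PENDING = "pending"        # provider outcome not yet known
    AUTHORIZED = "authorized"
    CAPTURED = "captured"
    REFUNDED = "refunded"
    FAILED = "failed"


# Assumed transition table, for illustration only.
ALLOWED = {
    Status.CREATED: {Status.PENDING, Status.AUTHORIZED, Status.FAILED},
    Status.PENDING: {Status.AUTHORIZED, Status.CAPTURED, Status.FAILED},
    Status.AUTHORIZED: {Status.PENDING, Status.CAPTURED, Status.FAILED},
    Status.CAPTURED: {Status.REFUNDED},
    Status.REFUNDED: set(),
    Status.FAILED: set(),
}


class IllegalTransition(Exception):
    pass


def transition(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the edge is not in the table."""
    if target not in ALLOWED[current]:
        raise IllegalTransition(f"{current.name} -> {target.name}")
    return target
```

Rejecting unknown edges in one place means a bug or a replayed message cannot move a refunded payment back to captured.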

Step 4 - High-Level Design

flowchart TD
    Client([Client]) -->|idempotency key| PS[Payment Service]
    PS -->|check + claim key| Idem[(Idempotency Store)]
    PS -->|atomic: payment + ledger + outbox| DB[(Transactional DB)]
    PS -->|provider idempotency token| Prov[External Payment Provider]
    DB --> Pub[Outbox Publisher]
    Pub --> Bus[Event Bus]
    Bus --> Notif[Notification Service]
    Bus --> Fulfil[Fulfilment Service]
    Bus --> Analytics[Analytics]
    Recon[Reconciliation Service] -.read.-> DB
    Recon -.read provider reports.-> Prov
    Recon -->|drift -> compensating entries| DB

Figure 1. The payment service is the orchestrator at the centre: it claims the idempotency key, atomically writes payment + ledger + outbox in one DB transaction, then talks to the provider with the provider's own idempotency token. The outbox publisher decouples downstream events; the reconciliation service cross-checks against external ground truth and writes compensating entries when drift is found.

The payment service is the orchestrator. It claims the idempotency key first, writes the payment state, ledger entries, and outbox events in one database transaction, then talks to the external provider with the provider's own idempotency token. A separate publisher drains the outbox onto the event bus for downstream consumers. A reconciliation service periodically cross-checks the platform's ledger against the provider's reports and emits compensating ledger entries when drift is found.

Step 5 - Deep Dive: Idempotency, the Outbox, and Sagas

This is the core. Five mechanisms together make the system correct: idempotency keys at every retryable boundary, an append-only ledger as the source of truth, the transactional outbox for reliable event publication, sagas with compensations for cross-service work, and reconciliation against external ground truth as the final safety net.

Part A - Idempotency at every boundary

A retry that re-processes is the most common way payment systems double-charge. The fix is to make every boundary that can be retried idempotent, which means recognising a retry as the same logical operation and returning the original result instead of redoing the work.

Client to service. The client generates an Idempotency-Key per logical payment. The service's first action - before any side effect - is to claim the key (insert it; succeed only if not already present) and bind it to a payment record. A subsequent request with the same key returns the cached prior response. Keys live with a long TTL (days to weeks) to comfortably cover any retry window.

sequenceDiagram
    participant C as Client
    participant PS as Payment Service
    participant Idem as Idempotency Store
    participant Prov as Payment Provider
 
    C->>PS: POST /payments (key=K)
    PS->>Idem: claim K
    Idem-->>PS: claimed -> proceed
    PS->>Prov: charge (provider token T)
    Prov-->>PS: 200 OK (chargeId)
    PS->>Idem: store response for K
    PS-->>C: 201 Created
    Note over C: network timeout - retry
    C->>PS: POST /payments (key=K, again)
    PS->>Idem: claim K
    Idem-->>PS: already exists -> return cached
    PS-->>C: 201 Created (same response, no re-charge)

Figure 2. The idempotency contract in action. The first request claims the key, charges the provider, and stores the response; the retry hits the already-claimed key and returns the cached response without re-charging. This is the canonical defence against the most common cause of double charges - a network timeout that the client retries.
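A minimal sketch of the claim-then-process flow, using an in-memory SQLite table as the idempotency store. The table layout, the `handle_payment` function, and the `charges` list standing in for the provider call are all hypothetical; a production store would also need the TTL and a policy for claims whose first attempt crashed.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE idempotency ("
    " idem_key TEXT PRIMARY KEY,"
    " response TEXT)"          # NULL until the first attempt finishes
)

charges = []  # stands in for the side effect (the provider call)


def handle_payment(key: str, amount: int) -> dict:
    try:
        # Claim: the primary key makes this insert succeed at most once per key.
        with db:
            db.execute("INSERT INTO idempotency (idem_key) VALUES (?)", (key,))
    except sqlite3.IntegrityError:
        row = db.execute(
            "SELECT response FROM idempotency WHERE idem_key = ?", (key,)
        ).fetchone()
        if row[0] is None:
            # First attempt is still in flight, or crashed before storing a result.
            return {"status": "pending"}
        return json.loads(row[0])

    charges.append(amount)  # the side effect runs only for the claimant
    response = {"paymentId": f"pay_{len(charges)}", "status": "captured"}
    with db:
        db.execute(
            "UPDATE idempotency SET response = ? WHERE idem_key = ?",
            (json.dumps(response), key),
        )
    return response


first = handle_payment("K", 500)
retry = handle_payment("K", 500)   # same key: cached response, no second charge
```

The uniqueness constraint does the real work here: two concurrent requests with the same key race on the insert, and the database guarantees only one of them wins.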

Service to provider. The provider call itself is made idempotent using the provider's own idempotency token (Stripe accepts one). If a retry hits the provider, the provider either re-runs the operation safely or returns its cached prior response. Never retry a provider call without an idempotency token - that is the canonical way platforms double-charge customers.

This is the same at-least-once-plus-idempotency reasoning from Part 3, but here it is layered at every boundary instead of just one, and the consequence of getting it wrong is real money rather than a duplicate email.
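The service-to-provider boundary can be illustrated with a fake provider whose first response is lost after the charge has already happened. `FakeProvider` and `charge_with_retry` are hypothetical names, not any real provider's SDK, and a real retry loop would add backoff between attempts.

```python
class FakeProvider:
    """Hypothetical provider that deduplicates on the idempotency token."""

    def __init__(self, drop_first_response: bool = True):
        self.charges = {}                    # token -> charge id
        self._drop_next = drop_first_response

    def charge(self, token: str, amount: int) -> str:
        if token not in self.charges:        # only the first call creates a charge
            self.charges[token] = f"ch_{len(self.charges) + 1}"
        if self._drop_next:
            self._drop_next = False
            # The charge happened, but the response is lost on the way back.
            raise TimeoutError("no response from provider")
        return self.charges[token]


def charge_with_retry(provider, token, amount, attempts=3):
    for _ in range(attempts):
        try:
            return provider.charge(token, amount)
        except TimeoutError:
            continue   # outcome unknown: retry with the SAME token
    return None        # still unknown: leave PENDING for reconciliation


provider = FakeProvider()
charge_id = charge_with_retry(provider, token="T", amount=500)
```

Had the retry used a fresh token, the fake provider would have recorded two charges for one payment, which is exactly the failure the token exists to prevent.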

Part B - The append-only double-entry ledger

A real payment system never stores "balance" as a mutable number. Instead, every money movement is committed to an append-only ledger as a set of double-entry rows - debits and credits that sum to zero. A balance is then derived by aggregating ledger entries (optionally cached as a materialised snapshot).

This carries three properties no mutable column gives you:

Auditability. Every change is a row, with a reason, and history is replayable.
Self-checking. The sum of all entries should always be zero; an imbalance is an immediate bug or fraud signal.
Correctability without rewriting history. A mistake is fixed with new compensating entries that reverse the bad ones - the bad entries stay, annotated.

This is the standard pattern in real financial systems and is the same idea event-sourced architectures generalise to other domains.
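A small in-memory sketch of the idea. The sign convention (debits negative, credits positive, amounts in minor units), the account names, and the fee split are assumptions chosen for illustration; a real ledger would live in the transactional database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    txn_id: str
    account: str
    amount: int   # minor units (cents); negative = debit, positive = credit


class Ledger:
    def __init__(self):
        self._entries = []   # append-only: no update or delete method exists

    def post(self, txn_id, legs):
        """Append one balanced transaction; legs maps account -> signed amount."""
        if sum(legs.values()) != 0:
            raise ValueError(f"unbalanced transaction {txn_id}")
        self._entries.extend(Entry(txn_id, acct, amt) for acct, amt in legs.items())

    def balance(self, account):
        # Balances are derived by aggregation, never stored as a mutable number.
        return sum(e.amount for e in self._entries if e.account == account)

    def total(self):
        # Self-check: must be zero at all times.
        return sum(e.amount for e in self._entries)


ledger = Ledger()
ledger.post("pay_1", {"customer": -10_000, "merchant": 9_700, "platform_fee": 300})
merchant_after_payment = ledger.balance("merchant")

# A full refund is a new compensating transaction; the original rows stay.
ledger.post("refund_1", {"customer": 10_000, "merchant": -9_700, "platform_fee": -300})
```

Because `post` rejects any transaction whose legs do not sum to zero, the global invariant holds by construction rather than by a later audit.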

Part C - The transactional outbox

Once a payment is committed, downstream services need to know - notifications, fulfilment, analytics. The naive approach commits the payment, then publishes to a message bus. If the publish fails, the database and the bus disagree forever (a dual-write problem), and there is no way to atomically span the two.

The transactional outbox sidesteps this. In the same database transaction that writes the payment and the ledger entries, the service writes one row per outgoing event to an outbox table. A separate outbox publisher polls the outbox and emits the events to the bus, marking each as published.

Because the event is in the transaction, it is durably enqueued atomically with the state change. If the publisher crashes mid-way, it re-reads the outbox and republishes; consumers deduplicate. The result is reliable cross-system event publication with no two-phase commit: events may be delivered late or more than once, but a committed state change can never lose its event. That single guarantee is why the outbox is one of the most useful patterns in this whole series.
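The pattern can be sketched with SQLite standing in for the transactional database and a plain list standing in for the bus. Table names, the event shape, and the function names are hypothetical; a production publisher would also batch, lock rows against concurrent publishers, and prune published rows.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE payments (id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
""")


def capture_payment(payment_id: str) -> None:
    # One transaction: either both rows commit or neither does.
    with db:
        db.execute("INSERT INTO payments VALUES (?, 'captured')", (payment_id,))
        db.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "payment.captured", "paymentId": payment_id}),),
        )


def publish_pending(send) -> int:
    """Poll unpublished rows in order; mark each only after send() returns."""
    rows = db.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        send(json.loads(payload))   # may repeat after a crash: at-least-once
        with db:
            db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


bus = []                      # stand-in for the message bus
capture_payment("pay_1")
sent = publish_pending(bus.append)
```

Marking the row as published only after `send` returns is the deliberate choice: a crash between the two steps causes a duplicate publish, which consumers can deduplicate, rather than a lost event, which nobody can recover.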

Part D - Sagas for cross-service work

A single payment can touch several services and external systems: charge the provider, credit the merchant balance, notify, update fulfilment. A single ACID transaction cannot span all of them, and two-phase commit is the wrong tool - it requires every participant to speak the same protocol and blocks resources during the prepare phase, which is hopeless across an external bank.

A saga models the workflow as a sequence of local transactions, each with a compensating action that undoes its effect. If a step fails, compensations run in reverse to back out the work the saga already did.

flowchart TD
    S1[1. Authorize charge<br/>at provider] -->|ok| S2[2. Capture funds<br/>at provider]
    S2 -->|ok| S3[3. Credit merchant<br/>ledger]
    S3 -->|ok| S4[4. Publish payment<br/>complete event]
    S2 -.fail.-> C1[Compensate 1:<br/>release authorization]
    S3 -.fail.-> C2[Compensate 2:<br/>refund at provider]
    S4 -.fail.-> C3[Compensate 3:<br/>debit merchant]

Figure 3. A multi-step payment as a saga, with each step's compensating action wired in. A failure at step 2, 3, or 4 triggers the compensations of all previously completed steps, running in reverse - the alternative to two-phase commit, which is impossible across an external bank. Compensations must themselves be idempotent, because the saga coordinator may retry them after a crash.

The compensation must itself be idempotent, because the saga coordinator may retry it after a crash. Choreography (services react to each other's events) scales but is hard to reason about for compensations; orchestration (a central coordinator drives the steps) is clearer for payment flows where the compensation graph matters, and is the usual senior choice.
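An orchestrated saga reduces to a loop that remembers what it has completed and unwinds it in reverse on failure. This is an in-memory sketch with hypothetical step names and no-op actions; a real coordinator would persist its progress so it can resume after a crash, which is why the compensations need to be idempotent.

```python
def run_saga(steps):
    """steps: list of (name, action, compensation). Returns (ok, log)."""
    log, completed = [], []
    for name, action, compensate in steps:
        try:
            action()
            log.append(f"done:{name}")
            completed.append((name, compensate))
        except Exception:
            log.append(f"failed:{name}")
            # Back out already-completed steps, most recent first.
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            return False, log
    return True, log


def fail():
    raise RuntimeError("ledger unavailable")


def noop():
    pass


ok, log = run_saga([
    ("authorize", noop, noop),        # compensation: release authorization
    ("capture", noop, noop),          # compensation: refund at provider
    ("credit_merchant", fail, noop),  # fails, so the two steps above are undone
    ("publish_event", noop, noop),    # never reached
])
```

Keeping the step list and its compensations in one place is what makes the orchestrated form easier to review than a choreography spread across several event handlers.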

Part E - Reconciliation - the safety net

Even with all of the above, drift can still happen. A provider call times out but actually succeeded. An outbox row is somehow lost. A bug skews the ledger. The only defence is to refuse to trust your own state alone for money and cross-check it against external ground truth.

A reconciliation job runs on a schedule, pulling the provider's reports (and, where relevant, bank statements) and comparing them to the platform's ledger. Any discrepancy is alerted on and repaired by appending compensating entries - never by editing history. A non-zero drift count is by definition an incident.
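The comparison at the heart of a reconciliation run can be sketched as a diff of two maps keyed by provider reference. The input shape, the drift labels, and the sample data are assumptions for illustration; real provider reports arrive as files or API pages and need parsing and currency handling first.

```python
def reconcile(ledger_charges: dict, provider_charges: dict) -> list:
    """Both inputs map a provider reference to an amount in minor units."""
    drift = []
    for ref in sorted(ledger_charges.keys() | provider_charges.keys()):
        ours = ledger_charges.get(ref)
        theirs = provider_charges.get(ref)
        if ours is None:
            drift.append((ref, "missing_in_ledger", theirs))
        elif theirs is None:
            drift.append((ref, "missing_at_provider", ours))
        elif ours != theirs:
            drift.append((ref, "amount_mismatch", theirs - ours))
    return drift


ledger_view = {"ch_1": 500, "ch_2": 1200}
# ch_3 models a provider call that timed out locally but actually succeeded.
provider_report = {"ch_1": 500, "ch_2": 1000, "ch_3": 700}
drift = reconcile(ledger_view, provider_report)
```

Each drift item then becomes an alert plus a compensating ledger entry; the function itself never mutates either side.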

Failure modes

Provider call times out. The canonical hard case - did the charge happen or not? Mark the payment PENDING and retry with the same provider idempotency token; if that is impossible, leave it for reconciliation to resolve. Never silently assume failure and retry without a token - that double-charges.
DB write fails after a successful provider charge. Write the payment record with status PENDING before calling the provider; update it to CAPTURED after the provider confirms. If the post-charge update itself fails, the payment stays in PENDING and reconciliation will resolve it.
Outbox publisher down. Events accumulate in the outbox; on recovery they publish. Consumers dedup. Nothing is lost - the outbox is durable.
Concurrent updates on one account. Per-account serialisation (a partition key or row-level optimistic locking) prevents lost updates.
Discovered fraud or bug. Append compensating entries; never edit historical rows.

Multi-region

Payments are jurisdictional and processors are often per-region, so partition by customer with region affinity. A cross-region transaction becomes a saga that spans regions. The ledger is replicated synchronously to a hot standby - money cannot tolerate even seconds of recent-write loss, so the disaster-recovery RPO is effectively zero. This is materially stricter than the eventual replication every prior system in this series happily accepted.

Evolution path

Stage	Approach
Launch	Synchronous payment service, idempotency keys, append-only ledger from day one
Growth	Transactional outbox, orchestrated saga, automated reconciliation
Scale	Multi-region with region-affine partitioning, synchronous DR replication

The non-negotiable day-one investments are the idempotency keys, the append-only ledger, the status state machine, and the outbox - none of these can be retrofitted without rewriting history. Defer the elaborate saga orchestrator and multi-region.

Observability

Track payment success rate per provider, end-to-end payment latency p50/p99, provider latency and error rate, outbox lag (events waiting to publish - the headline backpressure metric), the reconciliation drift count (anything non-zero is an incident), idempotency-hit rate (a proxy for retry traffic), saga compensation rate, and a continuous ledger-balance check (the sum of all entries must always be zero). Reasonable SLOs: 99.9% of payments complete within 5 s; reconciliation drift = 0.

Step 6 - Bottlenecks and Trade-offs

Provider latency dominates per-payment time and is largely out of your control - keep the rest of the path lean.
Provider rate limits call for a per-provider token-bucket limiter - exactly the primitive from Part 2.
Per-account contention is bounded by serialising operations on a single account through a partition key.
Outbox lag is the visible health signal for the event pipeline; alert on growth.
Reconciliation drift is the visible health signal for cross-system correctness; any drift is an incident.
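The per-provider token bucket mentioned above can be sketched in a few lines. The rate and capacity values and the provider name are made-up numbers, and the injectable `clock` parameter is an addition that makes the behaviour testable without sleeping.

```python
import time


class TokenBucket:
    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self._clock = clock
        self._tokens = capacity     # start full
        self._last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill for the elapsed time, capped at capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


# One bucket per provider; the limits here are illustrative only.
limiters = {"provider_a": TokenBucket(rate=100, capacity=200)}


def may_call(provider_name: str) -> bool:
    return limiters[provider_name].try_acquire()
```

When `may_call` returns false the payment should wait or be queued rather than dropped, since a rejected call here is a throttling decision and not a payment failure.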

Reference Architecture

The pattern this problem teaches, reusable far beyond payments:

Atomically write state and outgoing events in one transaction (outbox), process retries idempotently at every boundary, model cross-service operations as sagas with idempotent compensations, and treat external ground truth as the final reconciliation source.

flowchart LR
    subgraph Atomic["Atomic transaction"]
        A1[State change] --- A2[Ledger entries] --- A3[Outbox event]
    end
    Atomic --> Pub[Outbox publisher]
    Pub --> Bus[Event bus]
    Atomic <-->|saga steps + compensations| Ext[External systems]
    Recon[Reconciliation] -.compare.-> Atomic
    Recon -.compare.-> Ext
    Recon -->|compensating entries| Atomic

Figure 4. The reference architecture for correctness-first systems: atomically write state + ledger + outbox in one transaction, drain the outbox to downstream consumers, model cross-service work as sagas with idempotent compensations, and use reconciliation against external ground truth as the final safety net. The same toolkit applies to any system where being approximately right is unacceptable.

The same shape recurs in any cross-service workflow where correctness matters more than throughput: order processing, inventory, accounting, healthcare records. Idempotent boundaries, an append-only authoritative store, the outbox for atomic state-plus-event publication, sagas for cross-system steps, and reconciliation against an external source of truth are the toolkit for "this must be right, not fast".

Common Mistakes in the Interview

Claiming exactly-once delivery end to end, instead of at-least-once + idempotency + reconciliation.
Dual-write - committing to the DB and then publishing to a bus separately - opening a permanent inconsistency window.
Proposing two-phase commit across services and an external provider, which the provider does not actually support.
Storing a mutable balance column instead of deriving balances from an append-only ledger.
Retrying a provider call without an idempotency token, the canonical way to double-charge.
No reconciliation against the external source of truth.
Treating a provider timeout as a definite failure, then retrying and double-charging.
A saga without compensations, or with non-idempotent ones that fail on retry.

Quick Reference

Topic	Key Point
Core principle	Correctness over availability; never lose or duplicate money
Idempotency	At every boundary: client to service AND service to provider, with TTL
Ledger	Append-only, double-entry, balances derived - never a mutable column
Outbox	State change and event row written in one transaction - no dual-write
Saga	Cross-service workflow as local transactions with idempotent compensations
2PC	Wrong tool here - external providers do not participate
Provider timeout	Mark `PENDING`, retry with the provider's idempotency token, reconcile
Reconciliation	Cross-check ledger against the provider's reports; fix drift with compensating entries
Consistency	Internally strong; across boundaries at-least-once + dedup + reconcile
Multi-region	Region-affine partitioning; ledger replicated synchronously (RPO ≈ 0)

Related Articles

System Design Interview Problems: A Senior's Roadmap - the full series index and pattern library.
System Design Interview Guide: The 6-Step Framework - the method this walkthrough applies.
Design a Notification Service - Part 3; at-least-once delivery and idempotency at one boundary - here generalised to every boundary.
Design a Rate Limiter - Part 2; the per-provider token bucket reused to respect provider limits.
Design a Job Scheduler - Part 12; idempotency at every boundary applied to scheduled work, with leader election added.
SQL Interview Questions - transactions, isolation levels, and the database guarantees the outbox depends on.

This is Part 11 of a 12-part system design series where each post solves one problem around one core pattern. Next: Design a Job Scheduler.
