Design a Payment System: System Design Interview 2026


A payment system inverts almost every priority the rest of this series operates under. Throughput is modest. Latency is dictated by an external bank. The data model fits on a napkin. What is non-negotiable is correctness: a payment must not be silently lost and must never be duplicated. Money is the one resource where "eventually consistent and probably right" is not an acceptable answer, which is why the design is dominated by mechanisms that make every operation idempotent, every cross-system step recoverable, and every internal record cross-checkable against external ground truth.

This walkthrough assumes the 6-step system design framework and applies it at senior-plus depth. It is Part 11 of a system design series.

Table of Contents

  1. The Problem
  2. Step 1 - Clarify Requirements
  3. Step 2 - Estimate Scale
  4. Step 3 - API and Data Model
  5. Step 4 - High-Level Design
  6. Step 5 - Deep Dive: Idempotency, the Outbox, and Sagas
  7. Step 6 - Bottlenecks and Trade-offs
  8. Reference Architecture
  9. Common Mistakes in the Interview
  10. Quick Reference
  11. Related Articles

The Problem

We are designing a payment system: a customer pays a merchant, the platform takes a fee, and the money moves through an external payment provider that ultimately settles with the bank rails. The canonical examples are Stripe and the payment subsystem inside any marketplace or platform.

The senior framing is that correctness dominates every other concern. The throughput is modest, the data model is small, and the external provider is slow - none of those is the difficulty. The difficulty is that any retry, network glitch, or partial failure must not turn into a duplicate charge or a lost payment, even when the layer below you - the payment processor - times out without telling you whether the charge happened. Everything in the deep dive is a mechanism for making that true.


Step 1 - Clarify Requirements

Functional requirements:

  • Process a payment: debit a customer, credit a merchant, take a platform fee.
  • Issue refunds.
  • Integrate with an external payment provider that talks to the bank rails.
  • Expose transaction history.

Out of scope (name, then defer): fraud detection, KYC, currency conversion, the customer-facing payment UI, and dispute or chargeback handling beyond a basic refund flow.

Non-functional requirements:

  • Correctness over availability. Money must not be lost or duplicated. This inverts the priority of every other system in this series.
  • Idempotency at every boundary that can retry.
  • Auditability. Every state change recorded, no in-place edits.
  • Strong consistency around balances.
  • Modest throughput, but every operation matters.
  • Regulatory constraints - PCI-DSS, audit logs, long retention.

The clarifying question worth surfacing first: the user-observable guarantee is exactly-once in effect, but, as Part 3 established, exactly-once is not achievable end to end. What we actually deliver is at-least-once with idempotency at every boundary plus a reconciliation safety net against external ground truth. State this up front - the rest of the design follows.


Step 2 - Estimate Scale

The numbers here are deliberately small compared to the rest of the series, because the difficulty is per-operation, not per-second.

Throughput. Assume 10 million payments/day: ~115/sec average, ~500/sec at peak. Tiny by any other system's standards.

Storage. Each payment writes ~2 KB across payment record, ledger entries, and audit data: ~20 GB/day, retained for years for regulatory reasons. The cost is in retention and immutability, not volume.

Latency. The external provider call dominates - 100 to 500 ms - and that bound is the per-payment latency, not any internal hop.

The arithmetic confirms that bandwidth, CPU, and storage are not constraints. Correctness is the constraint.
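The estimate above can be reproduced in a few lines. This is only a back-of-envelope sketch: the 7-year retention period is an assumption chosen for illustration, since the text says only "retained for years".

```python
# Back-of-envelope figures from the estimate above.
PAYMENTS_PER_DAY = 10_000_000
BYTES_PER_PAYMENT = 2_000        # ~2 KB across payment, ledger, and audit rows
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
PEAK_PER_SEC = 500               # peak rate quoted in the text
RETENTION_YEARS = 7              # assumption: illustrative retention period

avg_per_sec = PAYMENTS_PER_DAY / SECONDS_PER_DAY
peak_factor = PEAK_PER_SEC / avg_per_sec
daily_gb = PAYMENTS_PER_DAY * BYTES_PER_PAYMENT / 1e9
retained_tb = daily_gb * 365 * RETENTION_YEARS / 1e3

print(f"average rate: {avg_per_sec:.1f} payments/sec")   # 115.7
print(f"peak is {peak_factor:.1f}x the average")          # 4.3x
print(f"daily writes: {daily_gb:.0f} GB")                 # 20 GB
print(f"retained over {RETENTION_YEARS} years: {retained_tb:.1f} TB")  # 51.1 TB
```

Even the multi-year total fits comfortably on commodity storage, which is the point: the cost lies in keeping the data immutable and auditable, not in its size.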


Step 3 - API and Data Model

POST /api/payments
  Idempotency-Key: <uuid>
  body: { customerId, merchantId, amount, currency, methodToken }
  201 Created   { paymentId, status: "captured" | "pending" | "failed" }
 
POST /api/payments/{id}/refund
  Idempotency-Key: <uuid>
  body: { amount }
 
GET  /api/payments/{id}   -> the payment state

The data model has four pieces, and the relationships between them are the design:

| Entity | Role |
| --- | --- |
| Payment | paymentId, status state machine, amount, parties, providerRef |
| Idempotency record | idempotencyKey -> paymentId + cached response, long TTL |
| Ledger entries | Append-only, double-entry - the authoritative record of money movements |
| Outbox | Pending events committed atomically with state changes |
Two modelling choices are non-obvious and structural. First, balances are not a column. They are derived from the ledger, because a mutable balance column can be corrupted silently and has no audit trail. Every money movement is one set of debit and credit entries that sum to zero, the standard double-entry pattern, and a correction is always a new entry, never an edit. Second, the payment status is an explicit state machine - CREATED -> AUTHORIZED -> CAPTURED -> REFUNDED plus FAILED and PENDING - which makes "we don't yet know whether the provider charged the customer" a first-class state instead of an undefined gap.
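The status state machine can be sketched as an explicit transition table. The happy path comes from the text; the edges into and out of PENDING and FAILED are assumptions made for illustration, since the article does not spell them out.

```python
from enum import Enum


class Status(Enum):
    CREATED = "created"
    PENDING = "pending"        # provider outcome not yet known
    AUTHORIZED = "authorized"
    CAPTURED = "captured"
    REFUNDED = "refunded"
    FAILED = "failed"


# Assumed transition table: only the CREATED -> AUTHORIZED -> CAPTURED ->
# REFUNDED path is given in the text; the other edges are illustrative.
ALLOWED = {
    Status.CREATED: {Status.PENDING, Status.AUTHORIZED, Status.FAILED},
    Status.PENDING: {Status.AUTHORIZED, Status.CAPTURED, Status.FAILED},
    Status.AUTHORIZED: {Status.PENDING, Status.CAPTURED, Status.FAILED},
    Status.CAPTURED: {Status.REFUNDED},
    Status.REFUNDED: set(),    # terminal
    Status.FAILED: set(),      # terminal
}


class IllegalTransition(Exception):
    pass


def transition(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise IllegalTransition(f"{current.name} -> {target.name}")
    return target


status = transition(Status.CREATED, Status.AUTHORIZED)
status = transition(status, Status.CAPTURED)
```

Rejecting anything outside the table means a buggy retry cannot, for example, move a REFUNDED payment back to CAPTURED.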


Step 4 - High-Level Design

flowchart TD
    Client([Client]) -->|idempotency key| PS[Payment Service]
    PS -->|check + claim key| Idem[(Idempotency Store)]
    PS -->|atomic: payment + ledger + outbox| DB[(Transactional DB)]
    PS -->|provider idempotency token| Prov[External Payment Provider]
    DB --> Pub[Outbox Publisher]
    Pub --> Bus[Event Bus]
    Bus --> Notif[Notification Service]
    Bus --> Fulfil[Fulfilment Service]
    Bus --> Analytics[Analytics]
    Recon[Reconciliation Service] -.read.-> DB
    Recon -.read provider reports.-> Prov
    Recon -->|drift -> compensating entries| DB

Figure 1. The payment service is the orchestrator at the centre: it claims the idempotency key, atomically writes payment + ledger + outbox in one DB transaction, then talks to the provider with the provider's own idempotency token. The outbox publisher decouples downstream events; the reconciliation service cross-checks against external ground truth and writes compensating entries when drift is found.

The payment service is the orchestrator. It claims the idempotency key first, writes the payment state, ledger entries, and outbox events in one database transaction, then talks to the external provider with the provider's own idempotency token. A separate publisher drains the outbox onto the event bus for downstream consumers. A reconciliation service periodically cross-checks the platform's ledger against the provider's reports and emits compensating ledger entries when drift is found.


Step 5 - Deep Dive: Idempotency, the Outbox, and Sagas

This is the core. Five mechanisms together make the system correct: idempotency keys at every retryable boundary, an append-only ledger as the source of truth, the transactional outbox for reliable event publication, sagas with compensations for cross-service work, and reconciliation against external ground truth as the final safety net.

Part A - Idempotency at every boundary

A retry that re-processes is the most common way payment systems double-charge. The fix is to make every boundary that can be retried idempotent, which means recognising a retry as the same logical operation and returning the original result instead of redoing the work.

Client to service. The client generates an Idempotency-Key per logical payment. The service's first action - before any side effect - is to claim the key (insert it; succeed only if not already present) and bind it to a payment record. A subsequent request with the same key returns the cached prior response. Keys live with a long TTL (days to weeks) to comfortably cover any retry window.

sequenceDiagram
    participant C as Client
    participant PS as Payment Service
    participant Idem as Idempotency Store
    participant Prov as Payment Provider
 
    C->>PS: POST /payments (key=K)
    PS->>Idem: claim K
    Idem-->>PS: claimed -> proceed
    PS->>Prov: charge (provider token T)
    Prov-->>PS: 200 OK (chargeId)
    PS->>Idem: store response for K
    PS-->>C: 201 Created
    Note over C: network timeout - retry
    C->>PS: POST /payments (key=K, again)
    PS->>Idem: claim K
    Idem-->>PS: already exists -> return cached
    PS-->>C: 201 Created (same response, no re-charge)

Figure 2. The idempotency contract in action. The first request claims the key, charges the provider, and stores the response; the retry hits the already-claimed key and returns the cached response without re-charging. This is the canonical defence against the most common cause of double charges - a network timeout that the client retries.
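A minimal sketch of the claim-then-cache contract, using an in-memory SQLite table as a stand-in for the idempotency store. The table layout, the `handle_payment` name, and the `charges` list that simulates the provider side effect are all hypothetical; TTL expiry is omitted.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE idempotency (key TEXT PRIMARY KEY, response TEXT)")

charges = []  # stands in for the side effect at the provider


def handle_payment(key: str, amount: int) -> dict:
    # Claim the key before any side effect. The PRIMARY KEY constraint makes
    # the insert succeed for exactly one request per key.
    try:
        with conn:
            conn.execute("INSERT INTO idempotency (key) VALUES (?)", (key,))
    except sqlite3.IntegrityError:
        row = conn.execute(
            "SELECT response FROM idempotency WHERE key = ?", (key,)
        ).fetchone()
        if row[0] is None:
            # The original request claimed the key but has not finished yet.
            return {"status": "pending"}
        return json.loads(row[0])  # replay the cached response

    charges.append(amount)  # the one and only charge for this key
    response = {"paymentId": f"pay_{len(charges)}", "status": "captured"}
    with conn:
        conn.execute(
            "UPDATE idempotency SET response = ? WHERE key = ?",
            (json.dumps(response), key),
        )
    return response


first = handle_payment("K", 1000)
retry = handle_payment("K", 1000)  # client retry after a timeout
```

Here `retry` equals `first` and `charges` holds a single entry. The claim relies on a uniqueness constraint rather than a read-then-write check, so two concurrent requests with the same key cannot both proceed.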

Service to provider. The provider call itself is made idempotent using the provider's own idempotency token (Stripe accepts one). If a retry hits the provider, the provider either re-runs the operation safely or returns its cached prior response. Never retry a provider call without an idempotency token - that is the canonical way platforms double-charge customers.

This is the same at-least-once-plus-idempotency reasoning from Part 3, but here it is layered at every boundary instead of just one, and the consequence of getting it wrong is real money rather than a duplicate email.
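The service-to-provider boundary can be sketched the same way. `FakeProvider` below is a hypothetical stand-in, not any real provider's API: it deduplicates by token and can simulate the hard case where the charge succeeds but the response is lost.

```python
import uuid


class ProviderTimeout(Exception):
    """The call timed out; the charge may or may not have happened."""


class FakeProvider:
    """Hypothetical provider that deduplicates charges by idempotency token."""

    def __init__(self):
        self.charges = {}               # token -> charge id
        self.lose_next_response = False

    def charge(self, token: str, amount: int) -> str:
        if token not in self.charges:   # first time this token is seen
            self.charges[token] = f"ch_{len(self.charges) + 1}"
        if self.lose_next_response:
            self.lose_next_response = False
            raise ProviderTimeout()     # charge recorded, response lost
        return self.charges[token]


def charge_with_retry(provider, amount, attempts=3):
    # Generate the token once per logical charge and reuse it on every retry.
    # A real service would persist it with the payment row before calling out,
    # so that a restarted process retries with the same token.
    token = str(uuid.uuid4())
    for _ in range(attempts):
        try:
            return provider.charge(token, amount)
        except ProviderTimeout:
            continue
    return None  # outcome still unknown: leave PENDING for reconciliation


provider = FakeProvider()
provider.lose_next_response = True      # first call charges, then "times out"
charge_id = charge_with_retry(provider, 1000)
```

The retry returns the original charge id and the provider holds exactly one charge. Generating a fresh token inside the loop would have produced two.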

Part B - The append-only double-entry ledger

A real payment system never stores "balance" as a mutable number. Instead, every money movement is committed as a set of double-entry rows - one debit and one or more credits that sum to zero - to an append-only ledger. A balance is then derived by aggregating ledger entries (optionally cached as a materialised snapshot).

This carries three properties no mutable column gives you:

  • Auditability. Every change is a row, with a reason, and history is replayable.
  • Self-checking. The sum of all entries should always be zero; an imbalance is an immediate bug or fraud signal.
  • Correctability without rewriting history. A mistake is fixed with new compensating entries that reverse the bad ones - the bad entries stay, annotated.

This is the standard pattern in real financial systems and is the same idea event-sourced architectures generalise to other domains.
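A minimal ledger sketch, again on in-memory SQLite. The schema, the account naming, and the sign convention (negative for a debit, positive for a credit, amounts in minor units) are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE ledger (
        id      INTEGER PRIMARY KEY AUTOINCREMENT,
        txn_id  TEXT    NOT NULL,
        account TEXT    NOT NULL,
        amount  INTEGER NOT NULL,  -- minor units; negative = debit
        reason  TEXT    NOT NULL
    )
    """
)


def post(txn_id, reason, entries):
    """Append one balanced set of entries; rows are never updated or deleted."""
    if sum(amount for _, amount in entries) != 0:
        raise ValueError("entries must sum to zero")
    with conn:  # all rows of the movement commit together or not at all
        conn.executemany(
            "INSERT INTO ledger (txn_id, account, amount, reason) VALUES (?, ?, ?, ?)",
            [(txn_id, account, amount, reason) for account, amount in entries],
        )


def balance(account):
    """Derive a balance by aggregating the account's entries."""
    return conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM ledger WHERE account = ?", (account,)
    ).fetchone()[0]


# A 100.00 payment with a 3% platform fee.
post("pay_1", "payment", [("customer:42", -10_000), ("merchant:7", 9_700), ("platform:fees", 300)])
merchant_after_payment = balance("merchant:7")

# A correction is a new set of reversing entries, not an edit.
post("rev_1", "reversal of pay_1", [("customer:42", 10_000), ("merchant:7", -9_700), ("platform:fees", -300)])

ledger_total = conn.execute("SELECT SUM(amount) FROM ledger").fetchone()[0]
```

After both postings the ledger holds six rows, `ledger_total` is zero, and the merchant balance is back to zero, while the full history of what happened remains queryable.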

Part C - The transactional outbox

Once a payment is committed, downstream services need to know - notifications, fulfilment, analytics. The naive approach commits the payment, then publishes to a message bus. If the publish fails, the database and the bus disagree forever (a dual-write problem), and there is no way to atomically span the two.

The transactional outbox sidesteps this. In the same database transaction that writes the payment and the ledger entries, the service writes one row per outgoing event to an outbox table. A separate outbox publisher polls the outbox and emits the events to the bus, marking each as published.

Because the event row is part of the transaction, it is durably enqueued atomically with the state change. If the publisher crashes midway, it re-reads the outbox and republishes; consumers deduplicate. The result is reliable cross-system event publication with no two-phase commit and no permanent inconsistency: an event may be published late, but it is never lost - and that single guarantee is why the outbox is one of the most useful patterns in this whole series.
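The pattern can be sketched with two tables and a polling publisher. The schema, function names, and the `bus` list standing in for the event bus are hypothetical; the outbox row id doubles as an event id that consumers could deduplicate on.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE payments (id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,
        event_type TEXT    NOT NULL,
        payload    TEXT    NOT NULL,
        published  INTEGER NOT NULL DEFAULT 0
    );
    """
)


def capture_payment(payment_id):
    # State change and event row commit in ONE transaction: either both
    # exist afterwards or neither does.
    with conn:
        conn.execute(
            "INSERT INTO payments (id, status) VALUES (?, 'CAPTURED')", (payment_id,)
        )
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("payment.captured", json.dumps({"paymentId": payment_id})),
        )


def drain_outbox(publish):
    """Publish pending events in order; returns how many were pending."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for event_id, event_type, payload in rows:
        # If publish raises, the row stays unpublished and is retried on the
        # next poll, so delivery is at-least-once.
        publish(event_id, event_type, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (event_id,))
    return len(rows)


bus = []  # stand-in for the event bus
capture_payment("pay_1")
pending = drain_outbox(lambda eid, etype, body: bus.append((eid, etype, body)))
```

A crash between `publish` and the `UPDATE` republishes the event on the next poll, which is why consumers must deduplicate.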

Part D - Sagas for cross-service work

A single payment can touch several services and external systems: charge the provider, credit the merchant balance, notify, update fulfilment. A single ACID transaction cannot span all of them, and two-phase commit is the wrong tool - it requires every participant to speak the same protocol and blocks resources during the prepare phase, which is hopeless across an external bank.

A saga models the workflow as a sequence of local transactions, each with a compensating action that undoes its effect. If a step fails, compensations run in reverse to back out the work the saga already did.

flowchart TD
    S1[1. Authorize charge<br/>at provider] -->|ok| S2[2. Capture funds<br/>at provider]
    S2 -->|ok| S3[3. Credit merchant<br/>ledger]
    S3 -->|ok| S4[4. Publish payment<br/>complete event]
    S2 -.fail.-> C1[Compensate 1:<br/>release authorization]
    S3 -.fail.-> C2[Compensate 2:<br/>refund at provider]
    S4 -.fail.-> C3[Compensate 3:<br/>debit merchant]

Figure 3. A multi-step payment as a saga, with each step's compensating action wired in. A failure at step 2, 3, or 4 triggers the compensations of all previously completed steps, running in reverse - the alternative to two-phase commit, which is impossible across an external bank. Compensations must themselves be idempotent, because the saga coordinator may retry them after a crash.

The compensation must itself be idempotent, because the saga coordinator may retry it after a crash. Choreography (services react to each other's events) scales but is hard to reason about for compensations; orchestration (a central coordinator drives the steps) is clearer for payment flows where the compensation graph matters, and is the usual senior choice.
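An orchestrated saga reduces to a loop that remembers which steps completed and unwinds them in reverse on failure. The sketch below is in-memory only and the step names mirror Figure 3; a real coordinator would persist its progress so that it can resume compensations after a crash.

```python
class SagaFailed(Exception):
    pass


def run_saga(steps):
    """steps: list of (name, action, compensation) tuples, run in order."""
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception as exc:
            # Undo every step that already completed, most recent first.
            for _, undo in reversed(completed):
                undo()
            raise SagaFailed(f"step '{name}' failed") from exc
        completed.append((name, compensate))


log = []


def credit_merchant():
    raise RuntimeError("ledger unavailable")  # simulated failure at step 3


steps = [
    ("authorize", lambda: log.append("authorize"), lambda: log.append("release authorization")),
    ("capture", lambda: log.append("capture"), lambda: log.append("refund at provider")),
    ("credit merchant", credit_merchant, lambda: log.append("debit merchant")),
    ("publish event", lambda: log.append("publish"), lambda: None),
]

outcome = None
try:
    run_saga(steps)
except SagaFailed as exc:
    outcome = str(exc)
```

With step 3 failing, `log` ends as authorize, capture, refund at provider, release authorization: the two completed steps are compensated in reverse order, and the failed step's own compensation never runs because the step never took effect.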

Part E - Reconciliation - the safety net

Even with all of the above, drift can still happen. A provider call times out but actually succeeded. An outbox row is somehow lost. A bug skews the ledger. The only defence is to refuse to trust your own state alone for money and cross-check it against external ground truth.

A reconciliation job runs on a schedule, pulling the provider's reports (and, where relevant, bank statements) and comparing them to the platform's ledger. Any discrepancy is alerted on and repaired by appending compensating entries - never by editing history. A non-zero drift count is by definition an incident.
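At its core the reconciliation job is a keyed comparison between two views of the same charges. The sketch below assumes both sides can be reduced to a mapping from provider reference to amount in minor units; the sample data and the drift labels are made up for illustration.

```python
def reconcile(ledger_charges, provider_report):
    """Compare two {provider_ref: amount} mappings and list every discrepancy."""
    drift = []
    for ref in sorted(set(ledger_charges) | set(provider_report)):
        ours = ledger_charges.get(ref)
        theirs = provider_report.get(ref)
        if ours == theirs:
            continue
        if ours is None:
            kind = "missing_in_ledger"    # e.g. timed-out call that succeeded
        elif theirs is None:
            kind = "missing_at_provider"  # we recorded a charge that never happened
        else:
            kind = "amount_mismatch"
        drift.append({"ref": ref, "kind": kind, "ledger": ours, "provider": theirs})
    return drift


ledger_charges = {"ch_1": 1000, "ch_2": 2500, "ch_4": 700}
provider_report = {"ch_1": 1000, "ch_2": 2000, "ch_3": 1200}

drift = reconcile(ledger_charges, provider_report)
for item in drift:
    print(item["ref"], item["kind"])
# ch_2 amount_mismatch
# ch_3 missing_in_ledger
# ch_4 missing_at_provider
```

Each item would be alerted on and repaired by appending compensating ledger entries; the job itself only reads.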

Failure modes

  • Provider call times out. The canonical hard case - did the charge happen or not? Mark the payment PENDING and retry with the same provider idempotency token; if that is impossible, leave it for reconciliation to resolve. Never silently assume failure and retry without a token - that double-charges.
  • DB write fails after a successful provider charge. Guard against this by writing the payment record with status PENDING before calling the provider, then updating it to CAPTURED once the provider confirms. If that post-charge update itself fails, the payment stays PENDING and reconciliation resolves it against the provider's records.
  • Outbox publisher down. Events accumulate in the outbox; on recovery they publish. Consumers dedup. Nothing is lost - the outbox is durable.
  • Concurrent updates on one account. Per-account serialisation (a partition key or row-level optimistic locking) prevents lost updates.
  • Discovered fraud or bug. Append compensating entries; never edit historical rows.
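For the concurrent-update case, row-level optimistic locking can be sketched with a per-account version counter that must match for a write to go through. The `accounts` table, its `version` column, and the function names are hypothetical; the point is only the compare-and-swap in the `UPDATE`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, version INTEGER NOT NULL)")
conn.execute("CREATE TABLE ledger (account TEXT NOT NULL, amount INTEGER NOT NULL)")
conn.execute("INSERT INTO accounts (id, version) VALUES ('merchant:7', 0)")
conn.commit()


class StaleVersion(Exception):
    pass


def read_version(account):
    return conn.execute(
        "SELECT version FROM accounts WHERE id = ?", (account,)
    ).fetchone()[0]


def append_entry(account, amount, expected_version):
    """Append a ledger row only if nobody else has written to the account."""
    with conn:  # version bump and ledger row commit together
        cur = conn.execute(
            "UPDATE accounts SET version = version + 1 WHERE id = ? AND version = ?",
            (account, expected_version),
        )
        if cur.rowcount != 1:
            raise StaleVersion(account)  # rolls the transaction back
        conn.execute(
            "INSERT INTO ledger (account, amount) VALUES (?, ?)", (account, amount)
        )


v = read_version("merchant:7")        # two writers read the same version
append_entry("merchant:7", 500, v)    # the first writer wins
try:
    append_entry("merchant:7", -500, v)  # the second writer is now stale
    second_applied = True
except StaleVersion:
    second_applied = False
```

The losing writer re-reads the account, re-evaluates its decision (for example, whether the derived balance still covers a payout), and retries with the new version.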

Multi-region

Payments are jurisdictional and processors are often per-region, so partition by customer with region affinity. A cross-region transaction becomes a saga that spans regions. The ledger is replicated synchronously to a hot standby - money cannot tolerate even seconds of recent-write loss, so the disaster-recovery RPO is effectively zero. This is materially stricter than the eventual replication every prior system in this series happily accepted.

Evolution path

| Stage | Approach |
| --- | --- |
| Launch | Synchronous payment service, idempotency keys, append-only ledger from day one |
| Growth | Transactional outbox, orchestrated saga, automated reconciliation |
| Scale | Multi-region with region-affine partitioning, synchronous DR replication |

The non-negotiable day-one investments are the idempotency keys, the append-only ledger, and the status state machine - none of these can be retrofitted without rewriting history. The outbox follows as soon as downstream consumers depend on payment events. Defer the elaborate saga orchestrator and multi-region.

Observability

Track payment success rate per provider, end-to-end payment latency p50/p99, provider latency and error rate, outbox lag (events waiting to publish - the headline backpressure metric), the reconciliation drift count (anything non-zero is an incident), idempotency-hit rate (a proxy for retry traffic), saga compensation rate, and a continuous ledger-balance check (the sum of all entries must always be zero). Reasonable SLOs: 99.9% of payments complete within 5 s; reconciliation drift = 0.


Step 6 - Bottlenecks and Trade-offs

  • Provider latency dominates per-payment time and is largely out of your control - keep the rest of the path lean.
  • Provider rate limits call for a per-provider token-bucket limiter - exactly the primitive from Part 2.
  • Per-account contention is bounded by serialising operations on a single account through a partition key.
  • Outbox lag is the visible health signal for the event pipeline; alert on growth.
  • Reconciliation drift is the visible health signal for cross-system correctness; any drift is an incident.
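The per-provider limiter mentioned above can be sketched as a token bucket. The class below takes the current time as an argument so the example is deterministic; a real limiter would read `time.monotonic()`. The provider name, rate, and capacity are made-up values.

```python
class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_sec` tokens/second."""

    def __init__(self, rate_per_sec, capacity, now=0.0):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)  # start full
        self.last = now

    def allow(self, now):
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# One bucket per provider, so a slow provider cannot starve the others.
buckets = {"provider_a": TokenBucket(rate_per_sec=2, capacity=2)}

results = [buckets["provider_a"].allow(now=t) for t in (0.0, 0.0, 0.0, 0.5, 0.5)]
print(results)  # [True, True, False, True, False]
```

A rejected call is not a failed payment: the payment stays in its current state and the provider call is retried later with the same idempotency token.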

Reference Architecture

The pattern this problem teaches, reusable far beyond payments:

Atomically write state and outgoing events in one transaction (outbox), process retries idempotently at every boundary, model cross-service operations as sagas with idempotent compensations, and treat external ground truth as the final reconciliation source.

flowchart LR
    subgraph Atomic["Atomic transaction"]
        A1[State change] --- A2[Ledger entries] --- A3[Outbox event]
    end
    Atomic --> Pub[Outbox publisher]
    Pub --> Bus[Event bus]
    Atomic <-->|saga steps + compensations| Ext[External systems]
    Recon[Reconciliation] -.compare.-> Atomic
    Recon -.compare.-> Ext
    Recon -->|compensating entries| Atomic

Figure 4. The reference architecture for correctness-first systems: atomically write state + ledger + outbox in one transaction, drain the outbox to downstream consumers, model cross-service work as sagas with idempotent compensations, and use reconciliation against external ground truth as the final safety net. The same toolkit applies to any system where being approximately right is unacceptable.

The same shape recurs in any cross-service workflow where correctness matters more than throughput: order processing, inventory, accounting, healthcare records. Idempotent boundaries, an append-only authoritative store, the outbox for atomic state-plus-event publication, sagas for cross-system steps, and reconciliation against an external source of truth together form the toolkit for "this must be right, not fast".


Common Mistakes in the Interview

  • Claiming exactly-once delivery end to end, instead of at-least-once + idempotency + reconciliation.
  • Dual-write - committing to the DB and then publishing to a bus separately - opening a permanent inconsistency window.
  • Proposing two-phase commit across services and an external provider, which the provider does not actually support.
  • Storing a mutable balance column instead of deriving balances from an append-only ledger.
  • Retrying a provider call without an idempotency token, the canonical way to double-charge.
  • No reconciliation against the external source of truth.
  • Treating a provider timeout as a definite failure, then retrying and double-charging.
  • A saga without compensations, or with non-idempotent ones that fail on retry.

Quick Reference

| Topic | Key Point |
| --- | --- |
| Core principle | Correctness over availability; never lose or duplicate money |
| Idempotency | At every boundary: client to service AND service to provider, with TTL |
| Ledger | Append-only, double-entry, balances derived - never a mutable column |
| Outbox | State change and event row written in one transaction - no dual-write |
| Saga | Cross-service workflow as local transactions with idempotent compensations |
| 2PC | Wrong tool here - external providers do not participate |
| Provider timeout | Mark PENDING, retry with the provider's idempotency token, reconcile |
| Reconciliation | Cross-check ledger against the provider's reports; fix drift with compensating entries |
| Consistency | Internally strong; across boundaries at-least-once + dedup + reconcile |
| Multi-region | Region-affine partitioning; ledger replicated synchronously (RPO ≈ 0) |

This is Part 11 of a 12-part system design series where each post solves one problem around one core pattern. Next: Design a Job Scheduler.
