Saga Pattern for Distributed Transactions
- Reading time
- 8 min read
- Word count
- 1544 words
- Diagram count
- 0 diagrams
Source: Victor Bona's Obsidian Compendium snapshot, Knowledge base/Design Patterns/Saga Pattern for Distributed Transactions.md.
Distributed systems are allergic to “one big transaction.”
If your workflow spans multiple services or multiple database shards, you don’t get the luxury of a single ACID transaction covering everything. The Saga pattern is a practical, battle-tested way to coordinate work across boundaries while staying highly available and failure-tolerant.
A saga gives you a disciplined structure for:
- Splitting a distributed transaction into local transactions.
- Advancing progress using events/commands.
- Recovering from failure with compensating actions.
- Achieving correctness via eventual consistency (instead of global atomicity).
1) What Problem Sagas Solve
1.1 Why “regular transactions” break in distributed systems
A classic database transaction works because:
- All writes happen in one place.
- The DB can lock, validate, and commit atomically.
In microservices and sharded databases:
- Each service/shard owns its data.
- Network calls can fail mid-flight.
- A global lock or two-phase commit (2PC) is slow, brittle, and often avoided.
So the real question becomes: > How do we build multi-step operations that are correct even when components fail, retry, and run in partial states?
1.2 The Saga answer
A Saga is a sequence of local transactions, one per service/shard.
Each step commits locally. If a later step fails, earlier steps are “undone” via compensating transactions.
This replaces “roll back everything atomically” with:
- “Make progress step-by-step”
- “If needed, compensate explicitly”
- “Always be able to resume/retry safely”
2) Core Concepts
2.1 Local transaction
A transaction that only touches one service’s database (or one shard).
Example: “Debit User A balance on Shard 3.”
2.2 Saga step
A saga step typically includes:
- A local transaction (write to DB)
- Emission of a message/event/command to trigger the next step
2.3 Compensating transaction
A compensating transaction is the “undo” action for a completed step.
Important nuance:
- Compensation is often not a perfect rollback.
- It’s a domain-level correction (“refund”, “release reservation”, “cancel order”).
2.4 Eventual consistency
During a saga, the system may temporarily be inconsistent:
- Money debited but not credited (yet)
- Inventory reserved but order not confirmed (yet)
But the saga guarantees:
- It eventually completes successfully, or
- It eventually compensates to a valid state
3) Two Saga Styles
3.1 Choreography (event-driven, decentralized)
There is no central coordinator. Each participant reacts to events and emits new events.
Pros
- No single orchestrator to maintain
- Naturally fits event-driven systems
Cons
- Harder to trace/understand end-to-end
- Complex when many steps exist
- Business logic becomes “emergent” across services
Flow sketch
- Service A emits
ACompleted - Service B listens and does work, emits
BCompleted - Service C listens and does work, emits
CCompleted - Failures emit
BFailedetc. which trigger compensations
3.2 Orchestration (command-driven, centralized)
A central Saga Orchestrator coordinates steps by sending commands.
Pros
- Clear flow control in one place
- Easier to reason about and debug
- Explicit state machine
Cons
- Orchestrator becomes a critical component
- Still must design for retries/failures
Flow sketch
- Orchestrator ->
DebitAccount - Shard/Service replies ->
DebitSucceededorDebitFailed - Orchestrator ->
CreditAccount - On failure, orchestrator triggers compensations
4) Where to Use the Saga Pattern
Use sagas when
- A business process spans multiple services/shards
- You can define compensations
- Eventual consistency is acceptable
- You need high availability and resilience
Common domains
- Payments / Transfers
- Orders (order -> inventory -> payment -> shipping)
- Booking systems (reserve -> confirm)
- Cross-shard operations (transfer between users on different shards)
Avoid sagas when
- You require strict global atomicity at all times
- Compensations are impossible or unacceptable
- A single DB transaction would suffice (keep it simple)
5) Applying Sagas to Sharded Databases
5.1 The cross-shard transaction problem
Example: Transfer 10 coins from:
- User A on Shard 3
- to User B on Shard 9
A single DB transaction cannot span shards (in most shard designs). So you need a distributed approach.
5.2 A typical saga approach (transfer example)
Steps:
- Debit A (Shard 3) – local commit
- Credit B (Shard 9) – local commit
- Mark transfer as Completed
If credit fails:
- Compensate by refunding A
5.3 Why this works (in practice)
Because you design the workflow so:
- Every step is idempotent
- Every message can be retried
- Progress is tracked via a saga state machine
- Side effects are controlled via outbox/inbox patterns
6) Design Principles You Must Follow
6.1 Idempotency (non-negotiable)
Messages will be redelivered. Consumers will restart. You must make each step safe to run multiple times.
Rule
- Every command/event handler must be idempotent based on a unique key like
transferId/sagaId.
6.2 State machine (explicitly track progress)
Track states like:
INITDEBITEDCREDITEDCOMPLETEDCOMPENSATINGCOMPENSATEDFAILED
This state machine belongs either in:
- the orchestrator (orchestration), or
- the participating services (choreography)
6.3 Outbox pattern (don’t lose events)
The #1 failure mode in event-driven workflows: > DB commit succeeds, but publishing the event fails
Outbox pattern fixes this:
- Within the same local transaction:
- write business data
- write an outbox record
- A publisher reliably reads outbox rows and publishes events
This guarantees: if the DB commit happened, the event will eventually be published.
6.4 Inbox / de-duplication (avoid double handling)
Store processed message IDs (transferId, eventId) so duplicates don’t apply twice.
6.5 Observability
Without tracing and logs:
- sagas become “distributed ghost stories”
Track:
- sagaId / transferId in every log line
- span tracing across services
- metrics: retries, compensations, time-to-complete
7) Implementation Example: Cross-Shard Transfer (Orchestration)
7.1 Data model (simplified)
Transfers table (central or dedicated “transfer” service DB)
transfer_id(PK)from_user_idto_user_idamountstatus(INIT, DEBITED, CREDITED, COMPLETED, COMPENSATING, COMPENSATED, FAILED)created_at,updated_at
Ledger entries (per shard)
entry_id(PK)transfer_id(unique constraint per (transfer_id, type))user_idtype(DEBIT, CREDIT, REFUND, REVERSAL, etc.)amountcreated_at
Outbox (per DB that publishes events)
outbox_id(PK)event_typepayload_jsoncreated_atpublished_atnullable
7.2 Orchestrator state machine (pseudo-code)
// TypeScript-like pseudo-code
enum TransferStatus {
INIT = "INIT",
DEBITED = "DEBITED",
CREDITED = "CREDITED",
COMPLETED = "COMPLETED",
COMPENSATING = "COMPENSATING",
COMPENSATED = "COMPENSATED",
FAILED = "FAILED",
}
async function startTransfer(transferId: string) {
// Load transfer
const t = await transfers.get(transferId);
if (t.status !== TransferStatus.INIT) return; // idempotent
// Step 1: Debit A
await sendCommand("DebitUser", { transferId, userId: t.fromUserId, amount: t.amount });
// Orchestrator will continue upon receiving DebitSucceeded / DebitFailed
}
async function onDebitSucceeded(evt: { transferId: string }) {
const t = await transfers.get(evt.transferId);
if (t.status !== TransferStatus.INIT) return; // idempotent guard
await transfers.updateStatus(evt.transferId, TransferStatus.DEBITED);
// Step 2: Credit B
await sendCommand("CreditUser", { transferId: evt.transferId, userId: t.toUserId, amount: t.amount });
}
async function onCreditSucceeded(evt: { transferId: string }) {
const t = await transfers.get(evt.transferId);
if (t.status !== TransferStatus.DEBITED) return;
await transfers.updateStatus(evt.transferId, TransferStatus.CREDITED);
// Step 3: Complete
await transfers.updateStatus(evt.transferId, TransferStatus.COMPLETED);
}
async function onCreditFailed(evt: { transferId: string; reason: string }) {
const t = await transfers.get(evt.transferId);
if (t.status !== TransferStatus.DEBITED) return;
await transfers.updateStatus(evt.transferId, TransferStatus.COMPENSATING);
// Compensation: refund A
await sendCommand("RefundUser", { transferId: evt.transferId, userId: t.fromUserId, amount: t.amount });
}
async function onRefundSucceeded(evt: { transferId: string }) {
const t = await transfers.get(evt.transferId);
if (t.status !== TransferStatus.COMPENSATING) return;
await transfers.updateStatus(evt.transferId, TransferStatus.COMPENSATED);
}
Key points:
- Each handler checks state and exits if already progressed.
- Retries won’t double-apply because of state guards + idempotency on shards.
7.3 Debit/Credit handlers on each shard (idempotent)
Each shard service does a local transaction:
- ensure we haven’t already processed this
transferId - write ledger entry
- update balance
- write outbox event
-- Pseudo SQL (inside a local DB transaction)
-- 1) Idempotency check via unique constraint
-- Ledger has a unique constraint on (transfer_id, type)
INSERT INTO ledger_entries (transfer_id, user_id, type, amount, created_at)
VALUES (:transfer_id, :user_id, 'DEBIT', :amount, now());
-- 2) Apply balance update
UPDATE accounts
SET balance = balance - :amount
WHERE user_id = :user_id
AND balance >= :amount;
-- If no rows updated -> insufficient funds -> throw error, rollback transaction
-- 3) Write outbox event
INSERT INTO outbox (event_type, payload_json, created_at)
VALUES ('DebitSucceeded', json_build_object('transferId', :transfer_id), now());
If the same command is retried:
- the ledger insert fails (unique constraint), so you can treat it as “already done”
- or you check first and return success
7.4 Handling “insufficient funds”
Debit step can fail cleanly:
- Saga ends in FAILED (no compensation needed)
- Because nothing else happened
8) Implementation Example: Choreography (Event-Driven)
8.1 Event flow
- TransferRequested
- DebitSucceeded (or DebitFailed)
- CreditSucceeded (or CreditFailed)
- RefundSucceeded (if needed)
Each service subscribes to events and emits the next event.
8.2 Example (high-level)
-
Transfer service emits
TransferRequested -
Shard A listens:
- debits
- emits
DebitSucceeded
-
Shard B listens:
- credits
- emits
CreditSucceeded
-
Transfer service listens:
- marks completed
-
On credit failure:
- Shard B emits
CreditFailed - Shard A listens and refunds
- emits
RefundSucceeded
- Shard B emits
Choreography works well when:
- flows are short and stable
- you have strong observability
- teams agree on event contracts
9) Practical Patterns That Pair With Sagas
9.1 Saga log / durable state
Store saga progress durably so you can resume after crashes.
9.2 Timeouts and dead-letter queues
A saga step might stall:
- message lost
- consumer down
- external dependency degraded
You need:
- step timeouts (e.g., “if no CreditSucceeded within 5 minutes, compensate”)
- dead-letter queues for manual intervention
9.3 Reconciliation / repair jobs
Even with good design, weird things happen. Periodic reconciliation is a superpower:
- find transfers stuck in DEBITED for too long
- re-drive credit
- or compensate
9.4 Ledger-based design (especially for money)
For transfers, a ledger is ideal:
- append-only, auditable
- balance derived or updated consistently
- makes recovery and debugging easier
10) Testing Strategy (Don’t Skip This)
10.1 Unit tests
- state machine transitions
- idempotency logic
10.2 Integration tests
Simulate failures:
- credit service down after debit
- duplicate events
- out-of-order delivery
- slow consumers
10.3 Chaos / resilience tests
- kill orchestrator mid-saga
- kill shard consumer mid-transaction
- inject network failures
Goal:
- saga still converges to a valid outcome
11) “Rules of Thumb” Cheat Sheet
- Prefer a saga when you need cross-service or cross-shard workflows.
- If you can show “PENDING” to users, you’re already 80% saga-compatible.
- Make every step idempotent.
- Use outbox/inbox patterns to avoid lost events and double-processing.
- Track saga state explicitly (state machine).
- Build observability early (trace IDs everywhere).
- Accept that compensation is domain logic, not a perfect rollback.
12) Minimal Reference Architecture (Cross-Shard Transfer)
Components
- Transfer Orchestrator (or event choreography)
- Shard A “Account Service”
- Shard B “Account Service”
- Message bus (Kafka/NATS/Rabbit/etc.)
- Outbox publisher on each DB
- Optional reconciliation worker
Data
- Transfers table (or saga log)
- Ledger entries per shard
- Outbox tables per DB
Properties
- No global locks
- No 2PC
- Safe retries
- Eventual correctness
13) Final Perspective
Sagas don’t give you global ACID. They give you something more valuable in distributed systems:
> A controlled, observable, recoverable path through failure.
For cross-shard “transactions” (like user-to-user transfers), the Saga pattern is one of the most practical ways to stay correct without sacrificing availability or building a fragile distributed locking system.
The key is to treat “pending” as a first-class state and build your system so retries and compensation are normal, not exceptional.