// act 0: assume it works

Act 0: Assume It Works

You place an order online. Behind that one request, five separate services each do their part: one holds the item in stock, one creates your order record, one charges your card, one notifies the warehouse to pick and pack, one books the delivery slot. Each has its own database. There's no single undo that covers all five.

In a single database, a failure midway through a transaction rolls everything back automatically. That's what transactions are for. But this isn't one database. It's five teams, five services, five separate commits. When step four fails, the first three have already gone through. Your card is charged. The warehouse hasn't been told. Nothing ships.

Nobody designed a fix for that. That's the problem. A saga is how you design for it upfront: every step gets an explicit undo, and one thing keeps track of what's succeeded so it knows exactly what to reverse when something doesn't.

// act 1: someone has to drive

Act 1: Someone Has to Drive

With five services that need to work together, you have two ways to connect them.

Choreography. Each service reacts to events from the others. Service A finishes, fires an event, Service B picks it up. Nobody has the full picture.

Orchestration. One coordinator calls each step directly, waits for the response, and tracks what's happened. It's the only thing that knows the complete sequence.

Both work fine until something breaks. When an event gets dropped in a choreographed flow, nobody notices. The inventory is still held, the payment is still charged, three services are sitting in a state that needs undoing and none of them know it.

The orchestrator tracks every step. When something fails, it knows exactly which compensations to run, and in which order. If it breaks at 3am, there's one place to look.
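Here is a minimal sketch of that tracking, assuming an in-memory orchestrator and made-up step names for the five services. `Step` and `run_saga` are illustrative names, not any framework's API. A real orchestrator would also persist its progress so it survives a crash, which this sketch does not attempt.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]      # forward call into a service
    compensate: Callable[[], None]  # explicit reverse action for that step


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # Only steps that actually succeeded get reversed, newest first.
            for done in reversed(completed):
                done.compensate()
            return False
        completed.append(step)
    return True


log: List[str] = []


def fail_notify() -> None:
    # Hypothetical failure at step four.
    raise RuntimeError("warehouse unavailable")


steps = [
    Step("reserve-stock", lambda: log.append("reserve"), lambda: log.append("release")),
    Step("create-order", lambda: log.append("create"), lambda: log.append("cancel")),
    Step("charge-payment", lambda: log.append("charge"), lambda: log.append("refund")),
    Step("notify-warehouse", fail_notify, lambda: log.append("unnotify")),
    Step("book-delivery", lambda: log.append("book"), lambda: log.append("unbook")),
]

ok = run_saga(steps)
print(ok, log)
```

Step four fails, so the orchestrator runs the refund, then the order cancellation, then the stock release. Step five never starts, and step four's own compensation never runs because step four never succeeded.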

The orchestrator's only job is sequencing: call this, wait, call that. Logic stays in your services: what a valid charge looks like, what "available" means, how to handle a pick request. The moment any of that moves into the orchestrator, it becomes hard to test and invisible to the teams that own it.

// act 2: clean up after yourself

Act 2: Clean Up After Yourself

Think about returning something to a shop. The original sale doesn't get erased. It already happened. Instead, two new things occur: you get a refund, the item goes back on the shelf. The reversal isn't undoing the past. It's doing something new in the opposite direction.

The same principle applies here. When step four fails, the first three steps have already happened, permanently, in three separate services. There's no global undo button. Each service already wrote its changes to its own database. You can't reach in and erase them.

What you can do is write an explicit reverse action for each step upfront. If the inventory was reserved, the reverse is releasing that hold. If the payment was charged, the reverse is issuing a refund. These are called compensating transactions. They touch real data and can fail, so write them like production code, not an afterthought.
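One way to picture "doing something new in the opposite direction" is an append-only ledger. This is a sketch under assumptions: `PaymentLedger` and `LedgerEntry` are hypothetical names, and a real payment service would sit on a database and a payment provider rather than a list.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LedgerEntry:
    order_id: str
    kind: str          # "charge" or "refund"
    amount_cents: int  # refunds are stored as negative amounts


class PaymentLedger:
    """Append-only: a refund is a new entry, the charge is never deleted."""

    def __init__(self) -> None:
        self.entries: List[LedgerEntry] = []

    def charge(self, order_id: str, amount_cents: int) -> None:
        self.entries.append(LedgerEntry(order_id, "charge", amount_cents))

    def refund(self, order_id: str) -> None:
        # Compensation: offset whatever is still outstanding for this order.
        outstanding = self.balance(order_id)
        if outstanding > 0:
            self.entries.append(LedgerEntry(order_id, "refund", -outstanding))

    def balance(self, order_id: str) -> int:
        return sum(e.amount_cents for e in self.entries if e.order_id == order_id)


ledger = PaymentLedger()
ledger.charge("ord-8412", 4999)
ledger.refund("ord-8412")
ledger.refund("ord-8412")  # repeating the compensation adds nothing
print(ledger.balance("ord-8412"), len(ledger.entries))
```

The charge stays on record and the refund sits next to it. Because the refund only offsets what is still outstanding, running it twice is harmless.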

What if the compensation itself fails? The refund returns 500. Your saga is now in an inconsistent state with no automatic recovery. Compensating transactions must be idempotent, retried with backoff, and alerted on if they don't eventually succeed. Don't swallow those errors.
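A rough sketch of that retry loop, assuming exponential backoff and an `alert` callback you supply (a pager, a dead-letter queue, whatever your team uses). `compensate_with_retry` and its parameters are illustrative names, and the delays are placeholders.

```python
import time
from typing import Callable, List, Optional


def compensate_with_retry(
    compensate: Callable[[], None],
    alert: Callable[[str], None],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry a compensation with exponential backoff; alert if it never succeeds."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            compensate()
            return True
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...
    # Never swallow this: a human or a recovery job has to pick it up.
    alert(f"compensation failed after {max_attempts} attempts: {last_error!r}")
    return False


calls = {"n": 0}


def flaky_refund() -> None:
    # Hypothetical refund that fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("refund returned 500")


alerts: List[str] = []
delays: List[float] = []
ok = compensate_with_retry(flaky_refund, alerts.append, sleep=delays.append)
print(ok, delays, alerts)
```

Retrying is only safe because the compensation is idempotent. If the refund could double-pay on a repeat, this loop would make things worse.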

// act 3: retries aren't free

Act 3: Retries Aren't Free

Imagine tapping your card at a terminal. The screen freezes, you tap again, and both payments go through. You've been charged twice for the same thing.

In a distributed system, retries are unavoidable. When you don't hear back, you have to try again. The problem is that "try again" can mean "charge the card again" if the first request actually worked but the confirmation got lost on the way back.

The fix is an idempotency key: a unique identifier for the operation, sent with every attempt, including retries. If the service has seen that key before, it skips the work and returns the original result.
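On the service side, that check can be sketched like this. `PaymentService` is a hypothetical class with an in-memory dictionary standing in for durable storage. In production the lookup and the write would need to be atomic and survive restarts, which this sketch ignores.

```python
from typing import Dict


class PaymentService:
    def __init__(self) -> None:
        self._results: Dict[str, dict] = {}  # idempotency key -> first result
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # Seen this key before: skip the work, return the original result.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {"charge_id": f"ch-{self.charges_made}", "amount_cents": amount_cents}
        self._results[idempotency_key] = result
        return result


service = PaymentService()
first = service.charge("key-for-this-step", 4999)
retry = service.charge("key-for-this-step", 4999)  # confirmation got lost, so we retry
print(first == retry, service.charges_made)
```

The retry gets the same charge back and the card is only charged once. Without the key, the second call would be a second charge.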

The key has three parts: saga: (namespace), ord-8412 (order ID), charge-payment (step name).

Unique to this order and this step. Reuse the order ID alone and two different steps would share a key. A retry sends the exact same string; the service recognises it and returns the cached result.
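Building that key is a one-liner. A small sketch, assuming the three parts are joined with ":" (the separator is a guess; any unambiguous delimiter works) and using `reserve-stock` as a made-up second step name:

```python
def idempotency_key(order_id: str, step_name: str, namespace: str = "saga") -> str:
    # The ":" separator is an assumption, not something the key format requires.
    return f"{namespace}:{order_id}:{step_name}"


charge_key = idempotency_key("ord-8412", "charge-payment")
retry_key = idempotency_key("ord-8412", "charge-payment")    # same step, retried
reserve_key = idempotency_key("ord-8412", "reserve-stock")   # same order, other step
print(charge_key, charge_key == retry_key, charge_key == reserve_key)
```

The key is derived, not random, so a retry can rebuild it without the orchestrator having to remember anything extra.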

// act 4: the slot's already taken

Act 4: The Slot's Already Taken

You've seen this happen on a ticketing site. One seat left for a sold-out show. Two people on separate laptops both see it as available, both click "Buy", both fill in their payment details within a second of each other. Both get order confirmations. There's still only one seat.

The website showed both of them the same availability because neither purchase had been written back yet when they both read it. By the time either transaction committed, the other had already started. This is a read-before-write race, and sagas run into it because nothing isolates one saga's steps from another's.

Two orders arrive at the same moment. Both sagas check stock, both see one unit available, both proceed, and neither knows the other is doing the same thing.

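A semantic lock is an application-level flag. Before your saga does anything with a record, it marks it as PENDING. Any other saga that reads PENDING knows to back off. The mark has to be set atomically (a conditional update, not a read followed by a write), or the same race simply moves to the flag. Nothing in your framework does this. You write it yourself.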
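A sketch of that flag, assuming a single process where a `threading.Lock` stands in for the database's conditional update. `Inventory`, `try_mark_pending`, and the SKU and saga IDs are all hypothetical.

```python
import threading
from typing import Dict


class Inventory:
    def __init__(self, units: Dict[str, int]) -> None:
        self._units = units
        self._pending: Dict[str, str] = {}  # sku -> saga holding the PENDING mark
        self._guard = threading.Lock()      # stands in for a DB conditional update

    def try_mark_pending(self, sku: str, saga_id: str) -> bool:
        """Atomically mark the record PENDING, or report that it is unavailable."""
        with self._guard:
            if sku in self._pending or self._units.get(sku, 0) < 1:
                return False
            self._pending[sku] = saga_id
            return True

    def commit(self, sku: str, saga_id: str) -> None:
        # Saga succeeded: consume the unit and clear the mark.
        with self._guard:
            if self._pending.get(sku) == saga_id:
                self._units[sku] -= 1
                del self._pending[sku]

    def release(self, sku: str, saga_id: str) -> None:
        # Compensation: clear the mark without consuming the unit.
        with self._guard:
            if self._pending.get(sku) == saga_id:
                del self._pending[sku]


inventory = Inventory({"sku-1": 1})
results: Dict[str, bool] = {}


def place(saga_id: str) -> None:
    results[saga_id] = inventory.try_mark_pending("sku-1", saga_id)


threads = [threading.Thread(target=place, args=(s,)) for s in ("saga-a", "saga-b")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

Exactly one saga gets the mark. The other sees PENDING and can wait or fail fast. Which one wins depends on thread scheduling, so the sketch makes no promise about that.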

// key takeaways

The rules don't change across platforms.

Step Functions, Temporal, Durable Functions, a hand-rolled orchestrator. The implementation changes. These don't.

Write the undo path before the happy path. You'll skip it otherwise. Every step needs an explicit compensating transaction. It touches real data and can fail, so write it like production code.

Every step must be idempotent. The orchestrator will retry on timeout. Scope the idempotency key to the saga ID and step name. The second call returns the cached result.

Choreography breaks silently. When one of five services drops an event, nobody knows who needs to compensate. With orchestration, one thing tracks the full state.

Mark records PENDING before you touch them. Other sagas that see PENDING know to wait or fail fast, not proceed and sell stock they don't have. Your framework won't do this. You write it.

The orchestrator controls when. Services control what. Business logic in a proprietary DSL is hard to test, hard to port, and invisible until it breaks. Keep it in your services.

// see your sagas in action

Trace your saga steps in production.

Datadog APM tracks distributed calls across every service in a saga, so you can see exactly which step failed, which compensating transactions ran, and how long each step took. Broken sagas become obvious before a customer notices.
