You place an order online. Behind that one request, five separate services each do their part: one holds the item in stock, one creates your order record, one charges your card, one notifies the warehouse to pick and pack, one books the delivery slot. Each has its own database. There's no single undo that covers all five.
On a normal database, a failure mid-way rolls everything back automatically. That's what transactions are for. But this isn't one database. It's five teams, five services, five separate commits. When step four fails, the first three have already gone through. Your card is charged. The warehouse hasn't been told. Nothing ships.
Nobody designed a fix for that. That's the problem. A saga is how you design for it upfront: every step gets an explicit undo, and one thing keeps track of what's succeeded so it knows exactly what to reverse when something doesn't.
With five services that need to work together, you have two ways to connect them.
Choreography. Each service reacts to events from the others. Service A finishes, fires an event, Service B picks it up. Nobody has the full picture.
Orchestration. One coordinator calls each step directly, waits for the response, and tracks what's happened. It's the only thing that knows the complete sequence.
Both work fine until something breaks. When an event gets dropped in a choreographed flow, nobody notices. The inventory is still held, the payment is still charged, three services are sitting in a state that needs undoing and none of them know it.
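To make the dropped-event failure concrete, here is a minimal in-process sketch of choreography. The `EventBus` class, the `drop` parameter, and the event names are all hypothetical and stand in for a real message broker. The point is only that a lost event stalls the chain without any service raising an error.

```python
from collections import defaultdict
from typing import Callable


class EventBus:
    """Hypothetical in-memory bus; a real system would use a message broker."""

    def __init__(self, drop: frozenset = frozenset()):
        self.handlers = defaultdict(list)
        self.drop = drop  # event names to lose silently, simulating a dropped message

    def subscribe(self, event: str, handler: Callable[[str], None]) -> None:
        self.handlers[event].append(handler)

    def publish(self, event: str, order_id: str) -> None:
        if event in self.drop:
            return  # lost in transit: no exception, no log entry
        for handler in self.handlers[event]:
            handler(order_id)


def run_order(bus: EventBus) -> list[str]:
    """Wire four services together by events and return what actually happened."""
    done: list[str] = []

    def reserve_stock(order_id: str) -> None:
        done.append("inventory_reserved")
        bus.publish("inventory_reserved", order_id)

    def create_order(order_id: str) -> None:
        done.append("order_created")
        bus.publish("order_created", order_id)

    def charge_card(order_id: str) -> None:
        done.append("payment_charged")
        bus.publish("payment_charged", order_id)

    def notify_warehouse(order_id: str) -> None:
        done.append("warehouse_notified")

    bus.subscribe("order_placed", reserve_stock)
    bus.subscribe("inventory_reserved", create_order)
    bus.subscribe("order_created", charge_card)
    bus.subscribe("payment_charged", notify_warehouse)
    bus.publish("order_placed", "order-1")
    return done


healthy = run_order(EventBus())
broken = run_order(EventBus(drop=frozenset({"payment_charged"})))
print(healthy)
print(broken)
```

In the second run the card is charged and the warehouse never hears about it. No service in the sketch holds enough state to notice.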
The orchestrator tracks every step. When something fails, it knows exactly which compensations to run, and in which order. If it breaks at 3am, there's one place to look.
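A rough sketch of that bookkeeping, assuming a hypothetical `Step` type and `run_saga` function rather than any particular framework. It keeps state in memory only. A production orchestrator would persist the list of completed steps so it can resume after a crash.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[], None]      # the forward call to a service
    compensate: Callable[[], None]  # the explicit undo for that call


def run_saga(steps: list[Step]) -> list[str]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: list[str] = []
    completed: list[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            log.append(f"{step.name}: failed")
            # Undo only what actually succeeded, newest first.
            for finished in reversed(completed):
                finished.compensate()
                log.append(f"{finished.name}: compensated")
            return log
        completed.append(step)
        log.append(f"{step.name}: ok")
    return log


def noop() -> None:
    pass


def warehouse_down() -> None:
    raise RuntimeError("warehouse unavailable")


steps = [
    Step("reserve_stock", noop, noop),
    Step("create_order", noop, noop),
    Step("charge_card", noop, noop),
    Step("notify_warehouse", warehouse_down, noop),
    Step("book_delivery", noop, noop),
]
saga_log = run_saga(steps)
print("\n".join(saga_log))
```

With step four failing, the log shows three successes, one failure, then compensations for `charge_card`, `create_order`, and `reserve_stock` in that order. `book_delivery` never runs, so it needs no undo.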
The orchestrator's only job is sequencing: call this, wait, call that. Logic stays in your services: what a valid charge looks like, what "available" means, how to handle a pick request. The moment any of that moves into the orchestrator, it becomes hard to test in isolation and invisible to the teams that own it.
Think about returning something to a shop. The original sale doesn't get erased. It already happened. Instead, two new things occur: you get a refund, the item goes back on the shelf. The reversal isn't undoing the past. It's doing something new in the opposite direction.
The same principle applies here. When step four fails, the first three steps have already happened, permanently, in three separate services. There's no global undo button. Each service already wrote its changes to its own database. You can't reach in and erase them.
What you can do is write an explicit reverse action for each step upfront. If the inventory was reserved, the reverse is releasing that hold. If the payment was charged, the reverse is issuing a refund. These are called compensating transactions. They touch real data and can fail, so write them like production code, not an afterthought.
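One way to picture a compensating transaction is an append-only ledger, where a refund is a new entry rather than a deleted charge. The `PaymentLedger` and `LedgerEntry` names below are made up for illustration and do not describe any real payment API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LedgerEntry:
    order_id: str
    kind: str  # "charge" or "refund"
    amount_cents: int


class PaymentLedger:
    """Hypothetical append-only ledger: entries are added, never erased."""

    def __init__(self) -> None:
        self.entries: list[LedgerEntry] = []

    def _total(self, order_id: str, kind: str) -> int:
        return sum(
            e.amount_cents
            for e in self.entries
            if e.order_id == order_id and e.kind == kind
        )

    def charge(self, order_id: str, amount_cents: int) -> None:
        self.entries.append(LedgerEntry(order_id, "charge", amount_cents))

    def refund(self, order_id: str) -> None:
        """Compensate by appending an opposite entry; the charge stays on record."""
        outstanding = self._total(order_id, "charge") - self._total(order_id, "refund")
        if outstanding > 0:  # nothing left to refund means nothing to do
            self.entries.append(LedgerEntry(order_id, "refund", outstanding))

    def balance(self, order_id: str) -> int:
        return self._total(order_id, "charge") - self._total(order_id, "refund")


ledger = PaymentLedger()
ledger.charge("order-1", 4999)
ledger.refund("order-1")
ledger.refund("order-1")  # a repeated compensation adds nothing
print(ledger.entries)
print(ledger.balance("order-1"))
```

The ledger ends with two entries and a balance of zero. The sale is still visible in history, which matches the shop-return picture: the past stays, and a new action cancels its effect.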
What if the compensation itself fails? The refund returns 500. Your saga is now in an inconsistent state with no automatic recovery. Compensating transactions must be idempotent and retried with backoff, and they must raise an alert if they still don't succeed. Don't swallow those errors.
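A minimal sketch of retrying a compensation with exponential backoff and alerting when it gives up. The function name, the default attempt count and delay, and the `alert` hook are all assumptions. A real system would page someone or write to a dead-letter queue rather than call `print`.

```python
import time
from typing import Callable, Optional


class CompensationFailed(Exception):
    """Raised when a compensation still fails after every retry."""


def compensate_with_retry(
    compensate: Callable[[], None],
    *,
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    alert: Callable[[str], None] = print,
) -> int:
    """Call `compensate` until it succeeds; return the number of attempts used.

    `compensate` must be idempotent, because it may run more than once.
    """
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            compensate()
            return attempt + 1
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    # Out of retries: surface the failure loudly instead of swallowing it.
    alert(f"compensation failed after {attempts} attempts: {last_error!r}")
    raise CompensationFailed(str(last_error)) from last_error


calls = {"n": 0}


def flaky_refund() -> None:
    # Simulated refund endpoint that fails twice, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("refund service returned 500")


delays: list[float] = []
attempts_used = compensate_with_retry(flaky_refund, sleep=delays.append)
print(attempts_used, delays)
```

Passing `sleep` and `alert` as parameters is a design choice for testability: the demo records the delays instead of actually waiting.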
Imagine tapping your card at a terminal. The screen freezes, you tap again, and both payments go through. You've been charged twice for the same thing.
In a distributed system, retries are unavoidable. When you don't hear back, you have to try again. The problem is that "try again" can mean "charge the card again" if the first request actually worked but the confirmation got lost on the way back.
The fix is an idempotency key, a unique identifier sent with every request. If the service has seen that key before, it skips the work and returns the original result.
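The idea can be sketched in a few lines. `PaymentService` here is a hypothetical in-memory stand-in, and the shape of the stored result is invented for the example.

```python
import uuid


class PaymentService:
    """Hypothetical service that remembers the result for each idempotency key."""

    def __init__(self) -> None:
        self.results: dict[str, dict] = {}  # idempotency key -> original result
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        if idempotency_key in self.results:
            # Seen before: skip the work and return the original result.
            return self.results[idempotency_key]
        self.charges_made += 1
        result = {"charge_id": f"ch_{self.charges_made}", "amount_cents": amount_cents}
        self.results[idempotency_key] = result
        return result


service = PaymentService()
key = str(uuid.uuid4())  # generated once per logical request, reused on every retry

first = service.charge(key, 4999)
retry = service.charge(key, 4999)  # the confirmation was lost, so the client retries
print(first == retry, service.charges_made)
```

The retry returns the same result and the card is charged once. In a real service the lookup and the store would need to be atomic, for example through a unique constraint on the key, so that two simultaneous retries cannot both pass the check.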
You've seen this happen on a ticketing site. One seat left for a sold-out show. Two people on separate laptops both see it as available, both click "Buy", both fill in their payment details within a second of each other. Both get order confirmations. There's still only one seat.
The website showed both of them the same availability because neither purchase had been written back yet when they both read it. Each buyer checked the seat before the other's write landed, then acted on a result that was already stale. This is a check-then-act race, and it's exactly what sagas run into: each step commits on its own, so nothing isolates one saga from another.
Two orders arrive at the same moment. Both sagas check stock, both see one unit available, both proceed, and neither knows the other is doing the same thing.
The fix is a semantic lock: the first saga to claim the unit marks the record PENDING. Any other saga that reads PENDING knows to back off. Nothing in your framework does this. You write it yourself.
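A minimal sketch of such a PENDING marker, assuming a hypothetical `StockRecord` class. The `threading.Lock` stands in for an atomic conditional update in the database, which is what makes check-and-mark a single step.

```python
import threading
from typing import Optional


class StockRecord:
    """Hypothetical stock row with a PENDING flag that other sagas must respect."""

    def __init__(self, units: int) -> None:
        self.units = units
        self.state = "AVAILABLE"
        self.held_by: Optional[str] = None
        self._lock = threading.Lock()  # stand-in for an atomic conditional update

    def try_reserve(self, saga_id: str) -> bool:
        with self._lock:  # check and mark happen as one step
            if self.state == "PENDING" or self.units < 1:
                return False  # another saga holds it, or nothing is left: back off
            self.state = "PENDING"
            self.held_by = saga_id
            return True

    def confirm(self, saga_id: str) -> None:
        """The owning saga finished: consume the unit and clear the flag."""
        with self._lock:
            if self.held_by == saga_id:
                self.units -= 1
                self.state = "AVAILABLE"
                self.held_by = None

    def release(self, saga_id: str) -> None:
        """Compensation: the owning saga failed, so clear the flag and keep the unit."""
        with self._lock:
            if self.held_by == saga_id:
                self.state = "AVAILABLE"
                self.held_by = None


seat = StockRecord(units=1)
results: dict[str, bool] = {}


def buy(saga_id: str) -> None:
    results[saga_id] = seat.try_reserve(saga_id)


threads = [threading.Thread(target=buy, args=(s,)) for s in ("saga-a", "saga-b")]
for t in threads:
    t.start()
for t in threads:
    t.join()

winners = [s for s, ok in results.items() if ok]
print(winners, seat.state)
```

Exactly one saga wins the reservation, whichever thread gets there first. The loser sees PENDING and can retry later or fail its order cleanly instead of selling the same unit twice.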
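Step Functions, Temporal, Durable Functions, a hand-rolled orchestrator. The implementation changes. The principles don't: every step gets an explicit compensation, every request carries an idempotency key, and contended records are marked PENDING.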
Datadog APM tracks distributed calls across every service in a saga, so you can see exactly which step failed, which compensating transactions ran, and how long each step took. Broken sagas become obvious before a customer notices.