// act 0: assume the best

Act 0: Assume The Best

The simplest distributed system assumes the downstream will always be there. Three clients call an API, and the API calls a single payment service instance. Everything works, right up until that instance disappears.

// live simulation
// act 1: route around it

Act 1: Route Around It

The first question is usually: won't the load balancer just handle this? Sometimes, yes. Health checks can remove a dead instance and route traffic to a healthy one. But that only helps when there is another healthy path to use — it does not fix a shared downstream dependency, a partial outage, or callers all retrying into the same bottleneck.
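
A minimal sketch of health-aware routing, assuming a hypothetical in-process balancer. The `Instance` type and `HealthAwareBalancer` class are illustrative names, not any real load balancer's API. It round-robins across instances currently marked healthy and returns `null` when none are left, which is the case where routing alone cannot help.

```typescript
// Hypothetical types for illustration only.
type Instance = { id: string; healthy: boolean };

class HealthAwareBalancer {
  private readonly instances: Instance[];
  private cursor = 0;

  constructor(instances: Instance[]) {
    this.instances = instances;
  }

  // A real health checker would probe each instance on a timer;
  // in this sketch the caller reports the result directly.
  markHealth(id: string, healthy: boolean): void {
    const target = this.instances.find((i) => i.id === id);
    if (target) target.healthy = healthy;
  }

  // Round-robin over healthy instances only.
  pick(): Instance | null {
    const healthy = this.instances.filter((i) => i.healthy);
    if (healthy.length === 0) return null; // no healthy path left
    const chosen = healthy[this.cursor % healthy.length];
    this.cursor += 1;
    return chosen ?? null;
  }
}
```

If every instance depends on the same failing downstream, each one may still pass its own health check while requests fail, so `pick()` keeps returning instances that cannot actually serve the call.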

// failover simulation
// act 2: try again

Act 2: Try Again

When routing cannot save you, the next obvious fix is to retry inside the API. The client still makes one request, but the API can try the downstream more than once. If every API handler retries immediately, though, the retries all land at the same instant and produce a second spike right behind the first.

That is a retry storm: a small outage gets amplified by well-meaning retries. When many callers wake up and retry at the same time, it becomes the classic thundering herd problem — the recovery attempt becomes the load spike that keeps the service down.
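
To put rough numbers on the amplification, here is a small worked sketch. The function names and the retry counts are assumptions chosen for illustration. It computes the worst case, where the downstream fails every attempt and each request uses all of its retries.

```typescript
// Worst case: every attempt fails, so each request makes
// one original call plus maxRetries retries.
function worstCaseDownstreamCalls(requests: number, maxRetries: number): number {
  return requests * (1 + maxRetries);
}

// If several layers each retry on their own (for example the client
// and the API), the worst-case factors multiply rather than add.
function amplificationFactor(retriesPerLayer: number[]): number {
  return retriesPerLayer.reduce((factor, retries) => factor * (1 + retries), 1);
}

// Three clients, API handlers retrying up to 3 times each:
const calls = worstCaseDownstreamCalls(3, 3); // 12 calls instead of 3

// Client retries 3 times and the API retries 3 times:
const factor = amplificationFactor([3, 3]); // 16x the original load
```

With immediate retries, those extra calls arrive within moments of the original failures, which is why the struggling service sees a spike instead of a steady trickle.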

// retry storm simulation
// act 3: wait your turn

Act 3: Wait Your Turn

Immediate retries create the burst from Act 2. Backoff spreads attempts over time by making each retry wait longer than the last. Jitter randomizes each wait so that clients do not all retry at the same moment.

// backoff timing comparison

The green dashed line marks when the service recovers. Fixed retries keep bunching up. Exponential backoff reduces pressure. Jitter de-correlates clients so the recovery window is not hit by one synchronized wave.

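// exponential backoff with full jitter
// attempt 0 yields a delay in [0, 1s), attempt 1 in [0, 2s), and so on up to the cap
function getDelay(attempt: number): number {
  const base = 1000; // 1s base delay, in milliseconds
  const cap = 30_000; // 30s ceiling, in milliseconds
  const exp = Math.min(cap, base * 2 ** attempt); // exponential growth, capped
  return Math.random() * exp; // full jitter: uniform in [0, exp)
}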
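
One way the delay function could be wired into an actual retry loop is sketched below. The names `retryWithBackoff`, `fullJitterDelay`, and `Sleep` are hypothetical, and the default of 5 attempts is an assumption. `fullJitterDelay` repeats the same formula as `getDelay` so the block stands alone, and the `sleep` and `random` parameters are injectable only to make the sketch easy to test.

```typescript
type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Same formula as getDelay: capped exponential growth, then full jitter.
function fullJitterDelay(
  attempt: number,
  baseMs = 1000,
  capMs = 30_000,
  random: () => number = Math.random,
): number {
  const exp = Math.min(capMs, baseMs * 2 ** attempt);
  return random() * exp;
}

// Runs operation up to maxAttempts times, sleeping a jittered,
// growing delay between failures. Rethrows the last error if all fail.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxAttempts = 5,
  sleep: Sleep = realSleep,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // No point sleeping after the final failed attempt.
      if (attempt < maxAttempts - 1) {
        await sleep(fullJitterDelay(attempt));
      }
    }
  }
  throw lastError;
}
```

Capping the number of attempts matters as much as the delay: an unbounded loop would keep adding load for as long as the outage lasts.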
// act 4: calm yourself

Act 4: Calm Yourself

Backoff and jitter reduce pressure, but they still keep calling. A circuit breaker watches the recent error rate. When the dependency is clearly unhealthy, it stops new network calls and fails fast locally. After a cooldown, it allows a probe to check recovery.
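
The three states described above (closed, open, and a single-probe half-open state) can be sketched as follows. This is a simplified illustration, not a production implementation: the class name, the count-based window, and the default thresholds are all assumptions, and the clock is injectable only so the behavior can be tested without waiting.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private recentFailures: boolean[] = []; // true = that call failed
  private openedAt = 0;
  private readonly windowSize: number;
  private readonly errorThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(
    windowSize = 10,
    errorThreshold = 0.5,
    cooldownMs = 5000,
    now: () => number = Date.now,
  ) {
    this.windowSize = windowSize;
    this.errorThreshold = errorThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // Ask before each network call; false means fail fast locally.
  allowRequest(): boolean {
    if (this.state === "closed") return true;
    if (this.state === "open" && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = "half-open"; // cooldown elapsed: let one probe through
      return true;
    }
    return false; // still cooling down, or a probe is already in flight
  }

  // Report the outcome of every call that was allowed.
  record(success: boolean): void {
    if (this.state === "half-open") {
      if (success) {
        this.state = "closed"; // probe succeeded: resume normal traffic
        this.recentFailures = [];
      } else {
        this.trip(); // probe failed: back to open for another cooldown
      }
      return;
    }
    if (this.state === "open") return; // late result from before the trip
    this.recentFailures.push(!success);
    if (this.recentFailures.length > this.windowSize) this.recentFailures.shift();
    const failures = this.recentFailures.filter(Boolean).length;
    const windowFull = this.recentFailures.length === this.windowSize;
    if (windowFull && failures / this.windowSize >= this.errorThreshold) this.trip();
  }

  getState(): BreakerState {
    return this.state;
  }

  private trip(): void {
    this.state = "open";
    this.openedAt = this.now();
    this.recentFailures = [];
  }
}
```

A caller would check `allowRequest()` first, return an error immediately when it is `false`, and otherwise make the call and pass the outcome to `record()`. Waiting for a full window before tripping is one design choice among several; it avoids opening the circuit on a single early failure.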

The incoming request rate is fixed in this simulation. You control the service's capacity. Pull capacity below the incoming rate and the service starts failing; the breaker trips, stops calling, and probes again after the cooldown.

// circuit breaker simulation
// the short version

One principle.
Four patterns.

Every service has a capacity. Stay under it, or fail gracefully.

01. Route around it. Health checks plus failover, when there is a healthy path.
02. Retry with backoff and jitter. Spread retries out so they don’t become the next outage.
03. Trip the circuit. Stop calling a service that’s clearly over capacity. Probe later.
04. Know your capacity. You can’t respect a limit you can’t see. Measure request and error rates per service in production.
// see your own capacity

Watch these patterns in your own apps.

Datadog APM surfaces request rates, error rates, retry storms, and circuit-breaker trips across every service — so resilience issues are obvious before they page you at 3am.
