The simplest distributed system assumes the downstream will be there. Three clients call an API, and the API calls one payment service. Everything works, right up until the payment service instance disappears.
The first question is usually: won't the load balancer just handle this? Sometimes, yes. Health checks can remove a dead instance and route traffic to a healthy one. But that only helps when there is another healthy path to use — it does not fix a shared downstream dependency, a partial outage, or callers all retrying into the same bottleneck.
When routing cannot save you, the next obvious fix is to retry inside the API. The client still makes one request, but the API can try the downstream more than once. If every API handler retries immediately, though, they all hit the downstream again at the same instant and create a second spike together.
That is a retry storm: a small outage gets amplified by well-meaning retries. When many callers wake up and retry at the same time, it becomes the classic thundering herd problem — the recovery attempt becomes the load spike that keeps the service down.
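The amplification is easy to put numbers on. The following is a minimal sketch, not a model of any real system: `downstream_calls`, the handler count, and the attempt limit are all hypothetical values chosen for illustration. It counts how many calls reach the downstream when every handler retries immediately.

```python
def downstream_calls(handlers: int, max_attempts: int, service_up: bool) -> int:
    """Count calls that reach the downstream when every handler retries immediately.

    All parameters are illustrative assumptions, not measurements.
    """
    calls = 0
    for _ in range(handlers):
        for _attempt in range(max_attempts):
            calls += 1
            if service_up:
                break  # the call succeeded, so this handler stops retrying
    return calls


healthy = downstream_calls(handlers=100, max_attempts=3, service_up=True)
outage = downstream_calls(handlers=100, max_attempts=3, service_up=False)
print(healthy, outage)  # 100 300
```

With three attempts per handler, an outage triples the load on a service that is already struggling, and because the retries are immediate, the extra calls arrive in the same instant as the originals.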
Immediate retries create the burst described above. Backoff spreads attempts over time. Jitter adds randomness to each delay so clients do not all retry at the same moment.
The green dashed line marks when the service recovers. Fixed retries keep bunching up. Exponential backoff reduces pressure. Jitter de-correlates clients so the recovery window is not hit by one synchronized wave.
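The three strategies can be compared as delay functions. This is a minimal sketch under assumed values: the function names, the 1-second base, and the 30-second cap are hypothetical, and the randomized variant shown (pick a uniform delay between zero and the exponential value, often called "full jitter") is only one of several jitter schemes.

```python
import random


def fixed_delay(attempt: int, base: float = 1.0) -> float:
    """Same wait before every retry, so clients that failed together retry together."""
    return base


def exponential_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Wait doubles with each attempt (attempt 0, 1, 2, ...) up to a cap."""
    return min(cap, base * 2 ** attempt)


def full_jitter_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                      rng=random) -> float:
    """Random wait between 0 and the exponential delay, so clients spread out."""
    return rng.uniform(0.0, exponential_delay(attempt, base, cap))


# Five hypothetical clients that all failed at the same moment, on attempt 3.
for client in range(5):
    print(client, fixed_delay(3), exponential_delay(3), round(full_jitter_delay(3), 2))
```

The fixed and exponential columns are identical for every client, which is the synchronized wave; only the jittered column differs from client to client.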
Backoff and jitter reduce pressure, but every client still keeps calling a dependency that may be down. A circuit breaker watches the recent error rate. When the dependency is clearly unhealthy, it stops new network calls and fails fast locally. After a cooldown, it allows a probe to check recovery.
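A breaker of this kind can be sketched in a few dozen lines. The version below is a simplified, single-threaded illustration, not a production implementation: the class name, window size, threshold, and cooldown are all hypothetical, and it does not limit the probe to a single concurrent caller the way a real library typically would.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised locally when the breaker is open and no network call is made."""


class CircuitBreaker:
    def __init__(self, window=20, error_threshold=0.5, cooldown=5.0,
                 clock=time.monotonic):
        self.results = deque(maxlen=window)  # recent outcomes, True = failure
        self.error_threshold = error_threshold
        self.cooldown = cooldown
        self.clock = clock
        self.opened_at = None  # None means the breaker is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise CircuitOpenError("failing fast; dependency marked unhealthy")
            # Cooldown elapsed: let this call through as a probe.
            try:
                result = fn()
            except Exception:
                self.opened_at = self.clock()  # probe failed, restart the cooldown
                raise
            self.opened_at = None  # probe succeeded, close the breaker
            self.results.clear()
            return result
        try:
            result = fn()
        except Exception:
            self._record(failed=True)
            raise
        self._record(failed=False)
        return result

    def _record(self, failed):
        self.results.append(failed)
        window_full = len(self.results) == self.results.maxlen
        error_rate = sum(self.results) / len(self.results)
        if window_full and error_rate >= self.error_threshold:
            self.opened_at = self.clock()  # trip: stop sending calls downstream
```

A handler would wrap each downstream request as `breaker.call(lambda: ...)` and treat `CircuitOpenError` as an immediate local failure. Waiting for a full window before tripping is a design choice here to avoid opening on a single early error.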
The incoming request rate is fixed in this simulation. You control the service's capacity. Pull capacity below incoming and the service starts failing; the breaker trips, stops calling, and probes again later.
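The same experiment can be approximated offline with a toy tick-based model. Everything here is an assumption made for illustration: the `simulate` function, the 50% threshold, the three-tick cooldown, and the rule that a probe sends exactly one request. It is not the logic behind the interactive simulation.

```python
def simulate(incoming, capacities, threshold=0.5, cooldown_ticks=3):
    """Return one label per tick: 'closed', 'open', or 'probe'.

    incoming   -- fixed number of requests arriving each tick (must be > 0)
    capacities -- requests the service can handle in each tick
    """
    states = []
    open_ticks_left = 0
    for capacity in capacities:
        if open_ticks_left > 0:
            states.append("open")  # fail fast locally, send nothing downstream
            open_ticks_left -= 1
            continue
        probing = bool(states) and states[-1] == "open"
        sent = 1 if probing else incoming  # a probe sends a single request
        failed = max(0, sent - capacity)   # anything over capacity fails
        if failed / sent >= threshold:
            open_ticks_left = cooldown_ticks  # trip; takes effect next tick
        states.append("probe" if probing else "closed")
    return states


# Capacity drops below the incoming rate of 10 for a while, then recovers.
print(simulate(incoming=10, capacities=[12, 12, 4, 4, 4, 4, 4, 12, 12]))
```

In this model the tick that trips the breaker is still labeled `closed`, because its calls were sent before the error rate was known. A single-request probe can also succeed while the service is still under-provisioned, in which case the breaker closes, full traffic resumes, and it trips again.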
Every service has a capacity. Stay under it, or fail gracefully.
Datadog APM surfaces request rates, error rates, retry storms, and circuit-breaker trips across every service — so resilience issues are obvious before they page you at 3am.