Resilience Patterns: The Software Industry's Most Decorated Design Debt
The distributed systems industry has spent two decades normalising compensatory design as operational maturity. If Netflix open-sourced it, if the SRE book mentions it, if the conference talk has a thousand views — the pattern is accepted without interrogation. Teams reach for a pattern because it solved someone else's problem once, and naming it gave it permanence.
This post makes a different claim: most resilience patterns, as commonly applied, are not techniques for building reliable systems — they are runtime evidence of design decisions made without reasoning about the physical constraints of distributed computation. Some patterns are legitimate encodings of genuine physical constraints. Many are not. The difference is derivable, not a matter of taste, and the inability to tell them apart is itself a design failure.
Physical Properties
Before diagnosing any pattern, we need precision about what "physical constraint" means here.
There are four structural properties of distributed systems that are not negotiable. They are properties of the universe your software executes in, and your design either acknowledges them or gets corrected by them at runtime.
Causality. Effects have causes. Causes precede effects. Causal chains are directional and real. A system that doesn't reason about causal structure — what depends on what, for which property, under which condition, and what happens when that dependency cannot be satisfied — will discover its causal structure by watching failures propagate in directions it didn't predict.
Boundaries. A boundary either contains something or it doesn't. Containment is physical — it applies to failure, to state ownership, to resource consumption, to the propagation of pressure. A boundary that fails to contain what it claims is not a boundary. It is a membrane with holes, and your system will behave as if the membrane doesn't exist under the conditions that matter most: high load and partial failure.
Constraints. Every component has a real capacity ceiling derivable from its resource model. A system that doesn't encode its own constraints — in its interfaces, in its protocols, in the design of its boundaries — will discover those constraints in production. Constraints that aren't encoded become silent assumptions that govern behavior and break when conditions change.
Temporal Ordering. Events have order, that order is not global, and concurrent clients do not naturally coordinate their temporal behavior. Unmanaged temporal ordering produces thundering herds, retry storms, split-brain states, and the class of failures that looks like random flakiness but is structurally determined. Every protocol between components has temporal semantics — the order in which things are committed, observed, and acted on — and those semantics are either explicitly designed or implicitly assumed.
These four properties are the physics. Every resilience pattern, correctly examined, is either an honest encoding of one of these constraints — or a runtime patch for having violated one in design.
Properties Combine
Here is the diagnostic error that makes resilience pattern analysis shallow: treating each property in isolation. Real systems don't violate one property at a time. The interesting failure modes — the ones that produce the resilience patterns that feel most justified — are almost always products of two or more properties interacting. Understanding which combination is in play, and how the properties reinforce each other, is what produces design insight rather than just pattern critique.
The most consequential combination is Causality × Temporal Ordering.
A causal chain is not just a dependency graph. It is a dependency graph that executes through time. Service A calls B calls C calls D. Each hop has a latency cost. Each hop can fail. But the critical observation is this: every hop in that chain executes within the time budget that remains before the original request stops being useful to the caller who made it. That time budget is a property of the causal chain as a whole. It is not a property of any individual hop.
When resilience mechanisms are designed hop-by-hop — each service with its own timeout, its own retry policy, its own circuit breaker — each service is making locally rational decisions that are globally incoherent.
Consider what happens when D fails. C retries D, because its local policy says to. Meanwhile B, waiting for C, observes a slow response and retries C. A, waiting for B, observes a slow response and retries B. The failure of D has produced multiplicative load amplification up the entire causal chain.
This is the retry cascade, and it is a direct product of treating Causality and Temporal Ordering as separate concerns. The causal chain is also the temporal chain. A response from D that arrives after A's caller has given up is not a recovery — it is wasted work that generates load with no corresponding value.
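The amplification is easy to quantify. A minimal sketch, with illustrative chain depth and retry counts: if each hop independently makes up to 1 + r attempts, the worst-case number of calls reaching a service k hops down the chain is (1 + r)^k.

```python
# Worst-case retry amplification in a call chain A -> B -> C -> D,
# assuming each hop retries independently and every attempt at every
# hop exhausts its retry budget. Values are illustrative.

def calls_reaching_depth(depth: int, retries_per_hop: int) -> int:
    """Worst-case number of calls reaching a service `depth` hops down."""
    return (1 + retries_per_hop) ** depth

# Three hops above D, each with "just" two retries: one user request
# becomes 27 calls against the already-failing D.
assert calls_reaching_depth(3, 2) == 27
```

The exponent is the chain depth, which is why the cascade gets worse as architectures get deeper while every individual retry policy still looks reasonable in isolation.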
The correct instrument for this combination is the deadline: a temporal constraint that is a property of the causal chain and propagates through every hop. Every component that receives a request within a causal chain knows the remaining budget. If the budget is exhausted, it does not initiate downstream calls because it knows the result cannot be used. If the budget is nearly exhausted, it does not retry because the temporal cost of a retry exceeds the value it could produce. The deadline makes temporal constraint causally available at every hop.
A deadline is not a resilience pattern. It is a design property of the causal chain, and it eliminates entire classes of behavior that resilience patterns then try to patch.
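A minimal sketch of a deadline carried as a property of the chain rather than as a per-hop timeout. The `Deadline` class and function names here are hypothetical, not a real library API:

```python
import time

class Deadline:
    """Remaining temporal budget of a causal chain, checked at every hop."""
    def __init__(self, budget_s: float):
        self.expires_at = time.monotonic() + budget_s

    def remaining(self) -> float:
        return max(0.0, self.expires_at - time.monotonic())

def call_downstream(deadline: Deadline, typical_cost_s: float, do_call):
    # A result that would arrive after the deadline is waste, not recovery:
    # don't initiate work the caller can no longer use.
    if deadline.remaining() < typical_cost_s:
        raise TimeoutError("insufficient budget; failing fast")
    return do_call()
```

In practice the remaining budget travels with the request itself (gRPC's deadline propagation works this way), so every hop can make the same check against the chain's budget rather than a local timeout.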
The combination of Causality and Boundaries produces a different insight: what a component can observe about the causal chain inside another component is determined by the boundary between them. This is the structural source of idempotency requirements and of the limits on retry safety.
The combination of Constraints and Boundaries determines whether capacity information can propagate upstream through the system or whether it can only manifest as rejection — the structural source of the rate limiting problem.
Almost every interesting design question in distributed systems is a question about which combination of properties is in play and how they interact at the specific point of failure.
Hedging
Hedged requests — firing a duplicate to a second replica before the first has responded — have a legitimate use and a category error use that look identical from the outside. The difference is a direct product of the Causality × Boundaries combination: what is the causal source of the latency variance, and is the hedge target causally independent of it?
When hedging encodes physics
Consider reads distributed across JVM-based replicas. The p50 is 4ms. The p99 is 180ms. That variance is the consequence of stop-the-world GC, which is real, bounded, and local to each replica. At any moment, the replicas are in genuinely different states: one is mid-pause, one is warm, one is under compaction pressure. These are causally independent physical processes.
A hedged request sent to a different replica does not encounter the same GC event. The causal source of the latency is local to the first replica; the boundary between replicas means the second replica has independent exposure. The duplicate goes somewhere causally independent. Hedging in this case earns its place because the variance it compensates for is bounded, local, and genuinely independent across the hedge target.
When hedging makes things worse
Now consider hedging against shared infrastructure — a coordination service, a metadata store, a configuration backend that multiple services call through. A slow response triggers a hedge. The duplicate goes to the same infrastructure under the same load.
Here the Causality × Boundaries analysis gives the opposite answer. The causal source of the latency is load-induced queuing on shared state. The boundary between services and the shared infrastructure does not isolate them from this cause. The hedge does not go somewhere causally independent. It adds load to the same causal source producing the slowness, worsening the condition it claims to address.
The deeper point is diagnostic. Shared infrastructure that produces correlated latency across callers is a Boundaries problem: the dependency is not isolated, its load isn't owned, its capacity isn't allocated.
The question that produces the correct answer: does my hedge target have causally independent exposure to the source of my latency variance? If the source is shared state or shared infrastructure, the answer is no, and hedging is contraindicated.
Rate Limiting
Rate limiting is called by the same name whether it sits at the edge of your system facing the internet or between two services inside your infrastructure. These are categorically different operations, and the category is determined by the Constraints × Boundaries combination.
Admission control at a trust boundary
An API facing the internet receives traffic from clients whose behavior you do not control. The boundary here is real: external traffic is on one side, your system's finite resources on the other. The constraint is real: those resources have a capacity ceiling. Admission control at this boundary is legitimate — it encodes the constraint at the correct boundary, preventing uncontrolled external behavior from consuming internal resources without bound.
The parameters of this admission control should be derived: what is the maximum rate that represents plausible legitimate use for a single external client, and what is the total capacity you're protecting? The answers produce a limit. The category — enforcing a constraint at a trust boundary against uncontrolled external behavior — is legitimate independent of the specific value.
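A minimal token-bucket sketch of admission control at such a boundary. The rate and burst are the parameters that should come from a derivation; the injectable clock is only for testability:

```python
import time

class TokenBucket:
    """Admission control at a trust boundary: `rate_per_s` and `burst`
    should be derived from a capacity model, not chosen by folklore."""
    def __init__(self, rate_per_s: float, burst: float, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate_per_s, burst, clock
        self.tokens, self.last = float(burst), clock()

    def admit(self) -> bool:
        now = self.clock()
        # Refill proportionally to elapsed time, capped at burst capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

The category stays legitimate regardless of the specific values, but the values themselves still need the derivation: plausible per-client rate for `rate_per_s`, tolerable burst against protected capacity for `burst`.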
Internal rate limiting
When a rate limiter sits between two services inside your own infrastructure, you need to ask what it is actually doing. The answer is almost always: it is dropping work.
When the rate limit fires, the caller receives a rejection. The work the caller wanted to do still needs to be done. Now the caller must decide what to do with that rejection — fail upstream, retry, queue locally. In every case, the total cost to the system has increased: the work hasn't gone away, but now the coordination overhead of managing the rejection has been added to it. You have not solved a capacity problem. You have shuffled it upstream and added friction.
This is what "dropping work instead of controlling flow" means precisely. The correct response to a capacity constraint is not rejection — it is propagating the constraint upstream so that the producer coordinates its demand with the consumer's capacity. When the producer knows how much the consumer can absorb, it adjusts. No work is dropped. No coordination cost is added. The constraint is honored through coordination rather than through rejection.
This is backpressure, and it is a property of the boundary between the producer and consumer. TCP's window sizing is backpressure: the receiver signals how much it can absorb, and the sender adjusts. Reactive streams' demand signaling is backpressure: the consumer signals readiness, and the producer emits only what was requested. In both cases, the capacity constraint propagates upstream through the boundary as a flow control signal. The producer doesn't need to discover the constraint through rejection — it receives it through the protocol.
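A minimal sketch of demand signaling in that style: the consumer states how much it can absorb, the producer emits only that much, and nothing is dropped. The class and function names are illustrative:

```python
from collections import deque

class BoundedConsumer:
    """Consumer that signals its remaining capacity as explicit demand."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.buffer = deque()

    def demand(self) -> int:
        # The flow-control signal: how many items can cross the boundary now.
        return self.capacity - len(self.buffer)

    def receive(self, item) -> None:
        self.buffer.append(item)

def produce(consumer: BoundedConsumer, pending: deque) -> int:
    """Emit only what the consumer demanded; the remainder stays queued
    at the producer instead of being dropped or rejected."""
    sent = 0
    while pending and consumer.demand() > 0:
        consumer.receive(pending.popleft())
        sent += 1
    return sent
```

The essential property is that the constraint crosses the boundary as a number, before any work is sent, rather than as a rejection after the work has already arrived.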
Internal rate limiting exists because the boundary between the two services was designed with request-response semantics — HTTP, synchronous RPC — which have no native channel for the consumer to signal capacity to the producer. The only signal available in that direction is the response: success, or rejection, or timeout. So when the consumer is full, the only thing it can tell the producer is "go away." The rate limiter is what you build when your boundary doesn't carry the signal that backpressure requires.
The diagnosis is a Constraints × Boundaries product: the capacity constraint is real, but the boundary design prevents it from propagating to where it could be honored. The internal rate limiter is the compensatory mechanism for a boundary that doesn't carry flow control. The correct design is a boundary that does.
Retries
The retry problem is more complex than it appears, and most of that complexity comes from the combination of Causality and Temporal Ordering. Getting the analysis right requires starting from the physical reality rather than from the conventional framing.
The fundamental impossibility
When a caller issues a request to a remote system and receives no response within its timeout window, it faces a problem that is not solvable with certainty: it cannot distinguish a slow system from a dead one.
Given any finite timeout — and all practical timeouts are finite — the absence of a response within that window is consistent with two completely different states of the remote system: it is still executing and will eventually respond, or it has failed and will never respond. No information available to the caller before it must decide can distinguish these two cases with certainty.
A retry is therefore not "handling a transient error." It is placing a bet about which failure mode you're in — a bet whose prior depends on your historical failure distribution, your system's failure characteristics, and the time elapsed since the original request. The bet may be correct. It may be wrong. Retries are probabilistic interventions, not corrective mechanisms. Systems designed with the expectation that retries handle transient errors are systems that have not reckoned with the actual semantics of what a retry does.
Defining transient precisely
"Transient errors" is the most commonly cited justification for retry logic and among the least examined. An error is transient if the condition that produced it resolves on its own within some window. But that definition is only useful if you can specify the window — and the window that matters is not abstract. It is the remaining temporal budget available in the current causal context.
A network disruption that lasts 200ms is transient relative to a request with a 5-second deadline and wasted work relative to a request with a 150ms deadline. A GC pause that lasts 3s is transient relative to a background job and a hard failure relative to an interactive user request. Whether an error is transient is not a property of the error in isolation. It is a temporal relationship between the failure duration and the caller's deadline budget.
This precision matters for retry design because "retry on transient errors" without this definition produces retry logic that retries regardless of whether the remaining budget could absorb a retry. A retry that consumes the remaining deadline budget and delivers a late result is not a recovery — it is a waste of capacity that could have been admitted to a different request. Retries should be conditioned not just on the error type but on whether the remaining temporal budget is large enough to make a retry meaningful. This is Causality × Temporal Ordering: the retry decision is a property of the causal context (what is the deadline of this causal chain) intersecting with temporal ordering (where are we in the execution of that chain).
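The definition can be made operational in a few lines. A sketch, with illustrative numbers drawn from the network-disruption example above:

```python
def is_transient(expected_recovery_s: float,
                 expected_attempt_s: float,
                 remaining_budget_s: float) -> bool:
    """An error is transient *for this caller* only if the condition can
    plausibly resolve and one more attempt can still complete within the
    remaining deadline budget of the causal chain."""
    return expected_recovery_s + expected_attempt_s <= remaining_budget_s

# A 200ms network disruption against a 5s budget is transient...
assert is_transient(0.2, 0.05, 5.0)
# ...and the same disruption against a 150ms budget is a hard failure.
assert not is_transient(0.2, 0.05, 0.15)
```

The same error classifies differently depending on the chain it occurs in, which is exactly the point: transience is a relationship, not an error attribute.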
Deadlines
The timeout and the deadline are not the same thing, and the distinction is architecturally significant.
A timeout is a local decision: this service will wait at most N milliseconds for this downstream call. It is set at a single hop and has no relationship to what has already been spent in the causal chain above or what is needed in the causal chain below.
A deadline is a property of the entire causal chain: this request must complete by time T, and every hop in the chain that serves this request inherits the remaining budget. The temporal constraint flows through the causal chain, making every component's decisions consistent with the global context. gRPC's deadline propagation is an implementation of this.
The retry cascade analysis becomes clean under deadlines. When D fails, C checks the remaining deadline. If it's insufficient for a retry, C does not retry D — it returns a typed failure to B. B checks the remaining deadline. If it's insufficient to try an alternative, it returns a typed failure to A. The failure propagates cleanly up the causal chain. No retry amplification. No multiplicative load. The causal chain terminates coherently because every component has access to the temporal constraint of the chain it's serving.
Without deadlines, every component makes its retry decision in local temporal isolation. The retry cascade is the structural consequence of that isolation: each component acts rationally given what it knows, and the aggregate behavior is irrational.
Causal opacity and idempotency
The second major dimension of the retry problem is what happens when a retry is placed: is re-execution safe?
The Kafka consumer model makes the derivation concrete. The broker delivers a message to a consumer. The consumer processes it — writes state, calls an API, updates a record. The broker has zero observability into what happens inside the consumer. It cannot observe whether the database write succeeded. It cannot observe whether the API call completed. The only coordination signal available across the broker-consumer boundary is the offset commit: when the consumer commits, the broker treats the message as processed.
This creates a forced choice with two failure modes. Commit before processing completes: if the consumer crashes mid-processing, the offset is already committed. The broker will not re-deliver. The work is lost. The failure is invisible across the boundary — the only signal says everything is fine. Commit after processing completes: if the commit itself fails — network partition, broker restart, crash after the write but before the commit — the broker will re-deliver. The consumer processes again.
The second failure mode is a justified retry derived from the boundary structure, not from a policy choice. The broker cannot distinguish "not yet processed" from "processed but commit failed" — the boundary between them allows only the offset signal to cross, and that signal is absent in both cases. Given that ambiguity, re-delivery is correct: it prevents data loss at the cost of possible duplicate processing.
The idempotency key is the design response to this specific causal opacity. It doesn't prevent re-delivery. It makes re-delivery safe by giving the consumer a mechanism to detect that it has already processed this message — even though that fact cannot cross the boundary to the broker. The consumer maintains enough state to answer "have I already done this?" and the answer makes re-delivery harmless.
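A minimal consumer-side sketch of that mechanism. The key scheme and the in-memory `seen` set are illustrative; a real consumer would persist the key atomically with the side effect itself:

```python
class IdempotentConsumer:
    """Makes broker re-delivery safe: duplicates are detected locally,
    even though 'already processed' can never cross to the broker."""
    def __init__(self):
        self.seen = set()   # deduplication window; bounded in practice
        self.applied = []   # stands in for the real side effect

    def process(self, message_key: str, payload) -> bool:
        """Return True if the effect was applied, False for a duplicate."""
        if message_key in self.seen:
            return False    # duplicate delivery: harmless by design
        self.applied.append(payload)   # the side effect...
        self.seen.add(message_key)     # ...and the key, atomically in practice
        return True
```

The atomicity of effect plus key is where real implementations earn their keep: if the key can be recorded without the effect, or vice versa, the mechanism reintroduces the very ambiguity it exists to remove.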
The derivation from the four properties: Causality — the broker cannot observe the causal chain inside the consumer. Boundaries — the boundary allows only the offset signal to cross; processing state cannot. Temporal Ordering — the commit-relative-to-processing ordering determines everything; the failure window is the gap between processing completion and successful commit. Constraints — the idempotency key storage and deduplication window are constrained by acceptable duplicate-processing rate, which is derivable.
The general principle from this derivation: wherever the caller cannot determine the callee's internal state from the failure signal — wherever there is causal opacity at a boundary — re-execution safety must be explicitly designed. The mechanism follows from the structure of the opacity: idempotency keys for operations where the state of execution is unobservable, conditional writes for operations where precondition mismatch is detectable, saga compensation for operations where partial execution requires defined rollback. These are not general good practices. They are specific design responses to specific causal structures.
Where the failure signal is precise enough to distinguish "not started" from "completed" from "partially completed," the idempotency requirement changes and potentially simplifies. Investing in precise failure signaling at the boundary reduces the idempotency burden. This is the Causality × Boundaries product at work: better boundary design directly reduces the compensatory mechanism required.
Derivation
Each pattern section above has applied the same underlying mechanism. Here it is made explicit and structured so it can be applied to any inter-service interaction before a resilience pattern is introduced.
The four questions are not a sequential checklist. They are probes of the same interaction from four angles, and their answers interact. The design insight almost always lives in the interaction between answers, not in any individual answer.
Question 1: What does the caller causally need from the callee, and when?
The caller needs something from the callee to make progress. The question is whether it needs it before it can produce any output at all — a synchronous causal dependency — or whether it needs it eventually and could proceed in the interim.
Most dependencies are eventual on honest examination. A synchronous call made out of implementation habit rather than causal necessity is a Causality violation: the design has imposed synchrony that the problem doesn't require. When the dependency is eventual, the resilience question changes entirely — the right design is async, and the resilience property is the durability and ordering guarantee of the channel, not the availability of the callee at the moment of the call.
When the dependency is genuinely synchronous, the failure behavior is a design artifact to be specified, not a runtime default. What does the caller return when the callee cannot respond? This must be designed before the system is built.
Question 2: What is the remaining deadline budget in this causal context?
This question has no meaning for a single-hop system designed in isolation. It becomes essential the moment the interaction is a hop in a causal chain.
How much of the original request's deadline budget has been consumed by the time this call is initiated? How much does the callee typically need? Is a retry feasible within the remaining budget? Is the result of this call useful if it arrives after the deadline?
If the remaining budget is insufficient for the intended behavior — too small for a retry, too small for the callee's normal execution — then the correct action is to fail fast and return a typed failure up the causal chain. Not because of a circuit breaker. Because a result that arrives after the deadline is waste, not recovery. This is the Causality × Temporal Ordering product: the decision at this hop should be made relative to the temporal envelope of the causal chain, not relative to the hop's local timeout alone.
Question 3: What does the boundary between caller and callee need to carry?
A request-response boundary carries requests and responses. That is sufficient for the happy path. For the cases that matter, the boundary also needs to carry: typed failure responses that distinguish failure modes (slow vs unavailable vs invalid vs overloaded), backpressure signals that propagate capacity constraints upstream, and deadline information that makes the temporal context of the causal chain available at every hop.
If the boundary is missing any of these, the system compensates with mechanisms that approximate them: circuit breakers approximate typed failure detection, rate limiters approximate capacity signaling, local timeouts approximate deadline propagation. In each case, the boundary design generated the need for the compensatory mechanism.
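Of the three signals, typed failure responses are the cheapest to add. A sketch of the distinctions a bare timeout erases, with illustrative enum values and an illustrative retry rule:

```python
from enum import Enum

class FailureMode(Enum):
    """Failure modes a boundary should distinguish; each implies a
    different correct caller response."""
    SLOW = "slow"                # still executing: a late result may arrive
    UNAVAILABLE = "unavailable"  # not executing: a retry may be meaningful
    INVALID = "invalid"          # will never succeed: do not retry
    OVERLOADED = "overloaded"    # capacity constraint: back off, don't add load

def should_retry(mode: FailureMode, remaining_budget_s: float,
                 attempt_cost_s: float) -> bool:
    # Retry only failure modes where re-execution can help, and only
    # when the chain's remaining budget can absorb another attempt.
    return (mode is FailureMode.UNAVAILABLE
            and remaining_budget_s >= attempt_cost_s)
```

When the callee returns `OVERLOADED` instead of an undifferentiated 500, the caller's backoff decision stops being a guess about the causal state behind the boundary.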
Question 4: What can the caller observe about the callee's internal state through the boundary, and what is causally opaque?
For any failure the caller might observe — timeout, connection reset, error response, silence — what can the caller conclude about what happened inside the callee?
A timeout is consistent with several states: the callee is still executing, it failed before starting, or it failed partway through. These are different situations that require different responses, and the caller cannot distinguish them. The answer to this question determines the idempotency requirement, the retry safety, and the required precision of the failure signaling. Where causal opacity is wide — the caller can learn little about the callee's internal state — idempotency must be designed in and failure signals must be treated as inherently ambiguous. Where the boundary design produces precise failure signals, the opacity narrows and the compensatory design simplifies.
The answers to these four questions in combination produce the correct design. When Question 1 reveals an eventual dependency that was implemented as synchronous, the design is wrong and no resilience pattern makes it right. When Questions 2 and 3 together reveal that a retry would exceed the deadline budget and the boundary has no way to communicate that, the correct response is deadline propagation through the boundary, not a more sophisticated retry policy. When Question 4 reveals causal opacity at a boundary, the required mechanism follows from the specific structure of what can and cannot cross that boundary.
Conclusion
For any resilience mechanism in your system, three questions should be answerable without consulting the person who implemented it:
What physical constraint, or combination of constraints, is this encoding? Name it precisely. Not "service B might be slow" but the specific resource, the specific condition, the specific observable behavior, and which of the four physical properties — alone or in combination — produces it.
Where did this parameter come from? Trace the rate, timeout, threshold, or budget back to a derivation — a capacity model, a latency budget analysis, a formal property of the operation, a deadline calculation. If the derivation doesn't exist, the parameter is folklore, and folklore fails in ways that are hard to predict and harder to diagnose.
What does this mechanism do when it fires, and what is its own failure mode? The designed behavior, not the library default. What happens when the circuit breaker stays open against a recovered service. What happens when the deadline budget is set incorrectly and truncates valid requests. What happens when the retry policy amplifies load during a partial outage. Every mechanism has a failure mode, and that failure mode should be designed for, not discovered.
If these questions have answers, the mechanism is a design artifact. If they don't, it is accumulated operational response to failures that were never fully understood — and accumulated response, unlike derived design, gets harder to reason about and more expensive to operate as the system grows.
The discomfort you feel looking at a system dense with resilience layers is a correct signal. It is telling you that the system's failure behavior is not designed. It is telling you that somewhere in the causal chain, a boundary doesn't carry the right signals, a constraint isn't encoded where it should be, or a temporal relationship between components is implicit rather than explicit. The path from that observation to a better system is the four questions applied honestly — and the answers almost always reveal that fewer mechanisms are needed, those that remain are placed correctly, and the design that results can be reasoned about when things go wrong.
Build systems whose failure behavior you designed. Not systems whose failure behavior you discovered.