Backpressure Is a Class of Problems, Not a Solution

A producer that emits work faster than its consumer can complete it will eventually exhaust something — memory, file descriptors, downstream connection pools, or the patience of callers who've already timed out. That failure mode has one root cause and four distinct questions. Conflating them produces systems that handle the easy case and fall apart on the interesting ones.

The root cause

Service time has a distribution. Under light load, your P99 is acceptable. Under heavy load, queuing theory gives you a non-linear result: as utilization approaches 1, mean wait time approaches infinity. The service isn't slower — it's just that requests spend longer waiting behind other requests.
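
To make the non-linearity concrete, here is a minimal sketch that assumes the simplest textbook model, an M/M/1 queue (Poisson arrivals, exponentially distributed service times, one server). The model and the function name are assumptions for illustration; real services rarely match it exactly, but the shape of the curve carries over.

```python
def mean_queue_wait(utilization: float, mean_service_time: float) -> float:
    """Mean time a request waits in queue before service starts (M/M/1).

    W_q = rho / (1 - rho) * S, where rho is utilization and S is the
    mean service time.
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1); at 1 the queue never drains")
    return utilization / (1.0 - utilization) * mean_service_time


if __name__ == "__main__":
    # Service time is fixed at 10 ms; only utilization changes.
    for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
        wait_ms = mean_queue_wait(rho, 0.010) * 1000
        print(f"utilization={rho:.2f}  mean wait={wait_ms:.0f} ms")
```

Under this model the wait equals one service time at 50% utilization and roughly 99 service times at 99%, even though the service itself is no slower.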

Rate limiting and load balancing don't change this. Rate limiting bounds arrival rate, not resource consumption: a request that misses cache and falls through to a database query can consume orders of magnitude more backend capacity than one that hits cache, and your rate limiter counts them identically. Load balancing distributes requests across backends, but if those backends share a database, a cache tier, or a downstream dependency, you've distributed the arrival rate, not the bottleneck. You need a mechanism that propagates the consumer's actual processing capacity back to the producer. That mechanism is backpressure. But "backpressure" names a class of problems, not a single solution. The four questions below are structurally distinct, and each demands a different answer.

Question 1: Signaling — how does the consumer communicate capacity to the producer?

The consumer knows something the producer doesn't: how much work it can actually accept. The signaling question is how that information crosses the boundary.

TCP flow control uses a receive window — a credit the receiver advertises, indicating how many bytes the sender may transmit before waiting for an acknowledgment. The sender cannot exceed the window. This is backpressure at the transport layer: the receiver's buffer capacity is directly reflected in the sender's allowed throughput, with no additional protocol machinery required.

Reactive Streams formalizes this for application-level streams. The subscriber calls request(n), signaling that it can accept up to n more items. The publisher may emit no more than the total outstanding demand and must then stop until more is requested. Demand propagates upstream, not downstream: the producer emits only against demand the consumer has already signaled. This inversion, demand-driven rather than supply-driven, is the fundamental shift.
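
A minimal, synchronous sketch of demand-driven emission follows. It is not the Reactive Streams API: the Subscription class and its fields are hypothetical, and the real specification also covers asynchronous delivery, cancellation, and error signals that are left out here.

```python
from typing import Callable, Iterator


class Subscription:
    """Hypothetical demand-driven link between one source and one subscriber."""

    def __init__(self, source: Iterator[int], on_next: Callable[[int], None]):
        self._source = source
        self._on_next = on_next
        self.delivered = 0

    def request(self, n: int) -> None:
        """Signal demand for up to n more items; the source emits no more than that."""
        if n <= 0:
            raise ValueError("demand must be positive")
        for _ in range(n):
            try:
                item = next(self._source)
            except StopIteration:
                return  # source exhausted before the demand was used up
            self.delivered += 1
            self._on_next(item)


if __name__ == "__main__":
    received = []
    sub = Subscription(iter(range(100)), received.append)
    sub.request(3)  # the consumer has room for three items
    print(received)
    sub.request(2)  # the consumer frees two more slots
    print(received)
```

The source holds 100 items, but nothing leaves it until the consumer asks, and never more than was asked for.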

gRPC flow control operates at the HTTP/2 frame level. Each stream and each connection has a flow control window. DATA frames consume the window; WINDOW_UPDATE frames replenish it. A slow server-side handler that stops reading stops returning window credit, so the client's send window drains to zero and the client is blocked from sending more data, with no application-level code needed to enforce it.
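
The window accounting can be sketched in a few lines. This is an illustration of credit-based flow control in general, not of any HTTP/2 or gRPC library: the SendWindow class is hypothetical, and a real implementation would split a payload into smaller frames rather than refuse it outright.

```python
class SendWindow:
    """Hypothetical sender-side view of a credit window (sizes in bytes)."""

    def __init__(self, initial: int):
        self.available = initial

    def try_send(self, payload: bytes) -> bool:
        """Spend credit for a DATA-like frame; refuse if it would exceed the window."""
        if len(payload) > self.available:
            return False  # sender must wait for the receiver to grant more credit
        self.available -= len(payload)
        return True

    def window_update(self, increment: int) -> None:
        """Receiver has consumed data and returns credit (WINDOW_UPDATE-like)."""
        if increment <= 0:
            raise ValueError("increment must be positive")
        self.available += increment


if __name__ == "__main__":
    window = SendWindow(initial=10)
    print(window.try_send(b"x" * 6))  # fits within the window
    print(window.try_send(b"x" * 6))  # would overrun the remaining credit
    window.window_update(6)           # receiver read the first frame
    print(window.try_send(b"x" * 6))  # credit restored, send proceeds
```

The sender never consults the receiver's load directly. It only sees credit, and credit only comes back when the receiver actually reads.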

What these have in common: the signal is structural, not advisory. It's not a header saying "I'm busy." It's a constraint built into the protocol, one the producer cannot exceed without violating the protocol itself. Advisory signals, such as 429 responses and Retry-After headers, require the producer to cooperate. Structural signals enforce compliance.

The cost of structural signaling is coupling: producer and consumer must speak the same protocol. Which leads directly to question 2.

Question 2: Composition — how does the signal cross an async boundary?

Most systems aren't two services talking directly. They're pipelines: an ingest layer writes to a queue, a processing service reads from the queue and writes to a database, the database writes to a replication log consumed downstream. Backpressure propagates cleanly within a protocol boundary — TCP window, gRPC flow control, Reactive Streams demand. It breaks at async boundaries, because the signal has no channel to cross them.

Consider: a gRPC service receives requests and writes results to Kafka. gRPC flow control will propagate backpressure from the gRPC server to the gRPC client. But a Kafka topic is an append-only log whose retention is governed by time and size settings, not by consumer progress, so the broker never signals the producer that consumers are slow. The pressure stops at Kafka. Downstream consumer lag grows invisibly until the consumer catches up, the retention window expires and data is lost, or someone tries to scale out consumers and finds that parallelism is capped by the topic's partition count.

Crossing an async boundary requires explicit wiring. Three mechanisms:

Consumer lag as a feedback signal. Measure consumer lag on the queue. When lag exceeds a threshold, instruct producers to reduce emit rate — either via a rate control API or by reducing the number of producer threads pulling from upstream. This is advisory, not structural, so it requires producer cooperation, but it works across protocol boundaries.
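
One way to turn lag into an advisory rate is a simple ramp. The sketch below assumes lag is measured elsewhere and that producers honor the returned rate; the function name, the thresholds, and the linear shape are all illustrative choices, not something a broker provides.

```python
def target_emit_rate(lag: int, max_rate: float, lag_threshold: int,
                     lag_ceiling: int, min_rate: float = 0.0) -> float:
    """Advisory emit rate for producers, derived from measured consumer lag.

    At or below lag_threshold producers run at max_rate. Between the threshold
    and lag_ceiling the rate falls linearly. At or above the ceiling it is
    min_rate.
    """
    if lag_ceiling <= lag_threshold:
        raise ValueError("lag_ceiling must be greater than lag_threshold")
    if lag <= lag_threshold:
        return max_rate
    if lag >= lag_ceiling:
        return min_rate
    fraction = (lag_ceiling - lag) / (lag_ceiling - lag_threshold)
    return min_rate + fraction * (max_rate - min_rate)


if __name__ == "__main__":
    for lag in (0, 10_000, 30_000, 50_000, 80_000):
        rate = target_emit_rate(lag, max_rate=1000.0,
                                lag_threshold=10_000, lag_ceiling=50_000)
        print(f"lag={lag:>6}  target rate={rate:.0f} msg/s")
```

A non-zero min_rate keeps a trickle of traffic flowing so that recovery remains observable. Whether that is appropriate depends on the workload.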

Bounded queue with blocking producers. If the queue is in-process (a java.util.concurrent.BlockingQueue, a Go buffered channel), set a capacity and block the producer when it's full. The producer now feels the consumer's slowness directly. This works only for in-process async boundaries — not across network hops.
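
The same idea in Python uses the standard library's queue.Queue with a maxsize. This is a minimal sketch; run_pipeline and its parameters are made up for the example, and the sleep stands in for real processing.

```python
import queue
import threading
import time
from typing import List, Tuple


def run_pipeline(items: int, capacity: int, consume_delay: float) -> Tuple[List[int], int]:
    """Run a producer and a slow consumer joined by a bounded queue.

    Returns the consumed items and the largest queue depth the producer saw.
    """
    buf: queue.Queue = queue.Queue(maxsize=capacity)
    consumed: List[int] = []
    max_depth = 0

    def consumer() -> None:
        while True:
            item = buf.get()
            if item is None:  # sentinel: the producer is done
                return
            time.sleep(consume_delay)  # simulate slow processing
            consumed.append(item)

    worker = threading.Thread(target=consumer)
    worker.start()
    for i in range(items):
        buf.put(i)  # blocks while the queue is full
        max_depth = max(max_depth, buf.qsize())
    buf.put(None)
    worker.join()
    return consumed, max_depth


if __name__ == "__main__":
    consumed, depth = run_pipeline(items=20, capacity=4, consume_delay=0.005)
    print(len(consumed), "items consumed; max queue depth", depth)
```

However fast the producer loop runs, the queue depth never exceeds the capacity, so the producer's pace is set by the consumer.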

Back-channel signaling. Some systems implement explicit feedback channels: the consumer publishes capacity tokens to a topic the producer subscribes to. The producer only emits when it holds a token. This is structurally similar to Reactive Streams demand signaling, rebuilt over a message broker. It's operationally heavy but it crosses arbitrary async boundaries.
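
A sketch of the token mechanism, with two in-memory deques standing in for broker topics. Everything here is hypothetical scaffolding: a real deployment would use the broker's client library and would have to deal with lost or duplicated tokens, which this ignores.

```python
from collections import deque


class TokenGatedProducer:
    """Emits to a data topic only while holding capacity tokens from the consumer."""

    def __init__(self, data_topic: deque, token_topic: deque):
        self._data = data_topic
        self._tokens = token_topic
        self._held = 0

    def poll_tokens(self) -> None:
        """Drain the back-channel and bank every grant the consumer has published."""
        while self._tokens:
            self._held += self._tokens.popleft()

    def try_emit(self, message: str) -> bool:
        """Emit one message if a token is held; otherwise leave it upstream."""
        self.poll_tokens()
        if self._held == 0:
            return False  # no capacity granted yet
        self._held -= 1
        self._data.append(message)
        return True


if __name__ == "__main__":
    data, tokens = deque(), deque()
    producer = TokenGatedProducer(data, tokens)
    print(producer.try_emit("a"))  # nothing granted yet
    tokens.append(2)               # consumer grants capacity for two messages
    print([producer.try_emit(m) for m in ("a", "b", "c")])
    print(list(data))
```

The consumer side is a single append to the token topic whenever it finishes work, which is the request(n) pattern carried over a broker.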

The important realization: end-to-end backpressure across a heterogeneous pipeline is not automatic. It requires deliberate design at every async boundary. A system that achieves it within each stage but not across stages has only partially solved the problem — the pressure accumulates at the boundaries.

Question 3: Policy — what does the producer do when it receives the signal?

Receiving the signal is not the same as knowing how to respond. The producer has three options, and they have different failure modes.

Block. The producer waits until the consumer can accept more work. This is the purest form of backpressure — the pressure propagates transitively. If A produces for B, and B produces for C, and C is slow, C backs up B, which backs up A. The whole pipeline slows to the speed of its slowest stage. The failure mode is deadlock in cyclic graphs: if A is waiting for B, and B is waiting for A (say, for a response before it can release a slot), the system halts. Blocking also converts a downstream latency problem into an upstream latency problem — callers of A now experience the slowness of C, which they may not be expecting and may not have timeout budgets for.

Reject. The producer returns an error immediately — HTTP 503, a RejectedExecutionException, a failed channel send. The caller gets a fast failure instead of a slow one. The work is lost unless the caller retries. The failure mode is work loss: if callers don't retry, or retry against a system that's still overloaded, the signal is correct but the outcome is data loss. Rejection is load shedding, not backpressure — it's a response to the signal, not a propagation of it.
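
With an in-process queue, blocking and rejecting differ only in how long the producer is willing to wait. A minimal sketch using Python's queue.Queue follows; the Overloaded exception and the function names are assumptions for the example.

```python
import queue


class Overloaded(Exception):
    """Raised to the caller when no capacity is available."""


def submit_blocking(work_queue: queue.Queue, job: object, timeout_s: float) -> None:
    """Block policy: wait for room, but only up to the caller's timeout budget."""
    try:
        work_queue.put(job, timeout=timeout_s)
    except queue.Full:
        raise Overloaded(f"no capacity within {timeout_s}s") from None


def submit_or_reject(work_queue: queue.Queue, job: object) -> None:
    """Reject policy: never wait; fail immediately if the queue is full."""
    try:
        work_queue.put_nowait(job)
    except queue.Full:
        # At a service boundary this would map to HTTP 503 or similar.
        raise Overloaded("queue full, retry later") from None


if __name__ == "__main__":
    q: queue.Queue = queue.Queue(maxsize=1)
    submit_or_reject(q, "first")
    try:
        submit_or_reject(q, "second")
    except Overloaded as exc:
        print("rejected:", exc)
```

Note that the blocking variant still ends in an error once the timeout expires. A bounded wait postpones the reject decision; it does not remove it.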

Shed by priority. Rather than rejecting indiscriminately, the producer drops low-priority work when capacity is constrained. This requires that requests carry priority information, and that the queue be ordered by priority rather than arrival time. The failure mode is starvation: if high-priority work is sustained, low-priority work never executes. Priority inversion — a low-priority request holding a lock that a high-priority request needs — can make this worse rather than better.
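
A minimal sketch of a bounded queue that sheds by priority. The SheddingQueue class is hypothetical, lower numbers are assumed to mean more important, and the linear scan for the victim is chosen for clarity rather than speed.

```python
import heapq
import itertools
from typing import Any, List, Optional, Tuple


class SheddingQueue:
    """Bounded queue that, when full, drops the least important job it holds.

    Lower priority number means more important (0 is the most critical).
    """

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._capacity = capacity
        self._items: List[Tuple[int, int, Any]] = []
        self._seq = itertools.count()  # tie-breaker that preserves arrival order

    def offer(self, priority: int, job: Any) -> Optional[Any]:
        """Add a job. Returns the job that was shed, or None if nothing was dropped."""
        entry = (priority, next(self._seq), job)
        if len(self._items) < self._capacity:
            heapq.heappush(self._items, entry)
            return None
        worst = max(self._items)  # O(n) scan; acceptable for a sketch
        if entry >= worst:
            return job  # the newcomer is the least important: shed it
        self._items.remove(worst)
        heapq.heapify(self._items)
        heapq.heappush(self._items, entry)
        return worst[2]

    def take(self) -> Any:
        """Remove and return the most important job (FIFO within a priority)."""
        return heapq.heappop(self._items)[2]


if __name__ == "__main__":
    q = SheddingQueue(capacity=2)
    q.offer(5, "report")
    q.offer(1, "payment")
    print("shed:", q.offer(9, "thumbnail"))  # newcomer is least important
    print("shed:", q.offer(0, "failover"))   # evicts the queued low-priority job
    print("next:", q.take())
```

Nothing here prevents starvation. If low-priority work must eventually run, it needs aging or a reserved share of capacity on top of this.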

The choice is determined by the semantics of the work:

Work that is idempotent and retriable → reject. Fast failure, let the caller retry when capacity recovers.
Work that must complete → block, with a timeout. Accept the latency propagation, but don't lose the work.
Work with heterogeneous cost or criticality → priority-based shedding, provided you can attach meaningful priority at the boundary.

A single system may need all three, applied at different boundaries. The ingestion boundary might reject. The processing queue might block. Critical background jobs might use priority shedding. Choosing a single policy system-wide is a simplification that will fail at the boundary where it's wrong.

Question 4: Steady-state overload versus transient spikes

For a transient spike with a bounded duration, buffering is often correct: absorb the spike, process the backlog when the downstream recovers, return to steady state. The question is whether you can bound the buffer. If the spike lasts D seconds, arrivals peak at A requests/second, and your sustainable throughput is T requests/second, the backlog grows at A − T requests/second, so you need a buffer of roughly D×(A − T) requests to absorb it. If that buffer is feasible, buffer. If it isn't, shed: reject at the boundary, return fast failures, and let callers retry when capacity recovers. Shedding contains the condition. Backpressure propagates it. For a transient spike you expect to self-resolve, propagation is the wrong choice, because you'd be signaling a capacity problem that doesn't exist to upstream systems that can't do anything useful with that information.
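
The sizing arithmetic is worth running with real numbers. The sketch below assumes constant rates during and after the spike, and it sizes the buffer on the excess of arrivals over what the consumer drains while the spike lasts; the peak and baseline arrival rates are inputs introduced for the example.

```python
import math
from typing import Tuple


def spike_buffer(peak_rate: float, service_rate: float, spike_seconds: float,
                 baseline_rate: float) -> Tuple[float, float]:
    """Estimate the backlog a spike leaves behind and how long it takes to drain.

    peak_rate:     arrivals per second during the spike
    service_rate:  sustainable throughput per second
    baseline_rate: arrivals per second once the spike ends
    Returns (backlog_in_requests, drain_seconds).
    """
    backlog = max(0.0, (peak_rate - service_rate) * spike_seconds)
    headroom = service_rate - baseline_rate
    if backlog == 0.0:
        return 0.0, 0.0
    if headroom <= 0:
        return backlog, math.inf  # no spare capacity: the backlog never drains
    return backlog, backlog / headroom


if __name__ == "__main__":
    backlog, drain = spike_buffer(peak_rate=1500, service_rate=1000,
                                  spike_seconds=60, baseline_rate=800)
    print(f"buffer needed: {backlog:.0f} requests, drains in {drain:.0f} s")
```

The drain time matters as much as the buffer size. A backlog that takes minutes to clear means every buffered request carries that latency, which may already exceed the callers' timeouts.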

The operational question you need to answer before choosing: is this overload structural or transient? Structural overload means you need more capacity or less load. Backpressure is the right signal to make that visible — it converts a latency problem that hides in queues into a throughput problem that's impossible to ignore. Transient overload means you need buffer capacity and a recovery path. Backpressure applied to a transient spike may make it worse by propagating a local condition into adjacent systems that would otherwise be unaffected.

Concretely: a spike in cache miss rate caused by a cache flush is transient — the cache will warm. Applying backpressure here propagates a self-resolving condition to all upstream producers. A sustained increase in request volume because a product launched successfully is structural — the cache won't help, the backend is at capacity, and backpressure is the correct signal.

What this means for system design

A system that handles backpressure correctly has made explicit decisions on all four questions:

At each service boundary: what is the signaling mechanism, and is it structural or advisory?
At each async boundary: is there a wiring that propagates the signal, or does pressure accumulate there silently?
At each producer: what is the policy when the signal arrives — block, reject, or shed — and does that policy match the semantics of the work?
For each class of overload: is this structural or transient, and is the chosen mechanism appropriate for that class?

A system that hasn't answered these explicitly hasn't solved backpressure. It's managed the easy case — a single slow consumer with a blocking producer — and left the rest to chance.
