
The Engineering Discipline Your Observability Stack Is Missing

Mid-scale engineering teams routinely pay $30–50k/month for telemetry infrastructure and still cannot answer the questions that matter during an incident. The dashboards exist. The traces exist. And when an incident affects 0.3% of users, nobody can find it.

The conventional response is to instrument more. The actual problem is the opposite: telemetry has accumulated the same architectural debt as application code, and nobody is treating it that way.


Instrumentation As A Discipline

Telemetry accumulates the same way technical debt does. An incident occurs, instrumentation gets added. An SLO gets introduced, metrics follow. Nobody comes back to clean up because there's no forcing function. Bad code slows delivery. Redundant telemetry just quietly inflates the bill.

Two years in, the payment service has more telemetry than anyone can reason about. 4,000 metric names, but only 200 appear in any dashboard. Eighteen months of trace retention, but no incident investigation has ever looked back beyond 72 hours. The team is paying $44,000/month to store signals they cannot audit, cannot attribute to an owner, and cannot safely remove without fear of breaking an alert nobody knew existed.

The payment.processed metric exists in three variants with subtly different cardinalities — payment_processed, payment.processed, payments_processed_total. Three engineers, three incidents, no documentation. Nobody knows which one is correct. Nobody knows if any of them are.


Sampling

The payment service runs distributed tracing at 50,000 requests per minute. To manage cost, the team implements head-based sampling at 1%. Cost drops 85%. This looks like a win.

It is a correctness problem dressed as a budget decision.

Head-based sampling decides whether to keep a trace at the start of a request, before it completes and blind to its outcome. A 15ms success and a 4,200ms failure have equal probability of being sampled. For average-case analysis, that's acceptable. For tail-latency investigation, the primary use case for distributed tracing, it's exactly wrong. The failure affecting 0.3% of users is already rare before sampling. At 1%, the expected trace coverage for that failure mode approaches zero.
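The arithmetic is easy to check directly. A quick sketch using the numbers above:

```python
# Head-based sampling at 1% against a failure affecting 0.3% of users.
# Numbers are the ones from the scenario above.

sample_rate = 0.01      # head-based sampling keeps 1% of requests
failure_rate = 0.003    # the incident affects 0.3% of users
rpm = 50_000            # requests per minute

# The sampler never saw the outcome, so for any single failed request
# the chance its trace was kept is just the sample rate.
p_trace_exists = sample_rate
print(f"P(a given failed request has a trace) = {p_trace_exists:.0%}")

# Expected failure traces kept per minute across all traffic:
expected = rpm * failure_rate * sample_rate
print(f"Expected failure traces per minute: {expected:.1f}")
```

Roughly one and a half failure traces per minute survive, and for any specific failed payment an engineer is chasing, there is a 99% chance no trace exists at all.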

Sampling strategy is an architectural claim about what your telemetry can see. Head-based sampling at 1% claims: "we care about average traffic, not our tail." Most teams making that claim don't know they're making it.

Tail-based sampling inverts this: the decision is made after the request completes, biased toward outcomes that matter. Errors always sampled. High-latency requests always sampled. Uninteresting happy-path traffic sampled at a low rate. The trace store reflects what actually went wrong, not a random cross-section of traffic.

Tail-based sampling is better for debuggability, not universally better. It requires buffering full traces in memory until the sampling decision can be made. At very high throughput, that buffer demands a dedicated processing pipeline, often Kafka-backed, and it adds latency to ingestion that can matter for real-time alerting. The operational investment is real.

High-throughput systems converge on hybrid strategies: head-based sampling for the bulk of traffic, tail-based rules applied to outcomes that matter: errors always kept, latency outliers always captured, specific endpoints fully sampled. Dynamic sampling, adjusting rates at runtime based on observed error rates, is where the industry is moving.
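A hybrid decision function fits in a few lines. The thresholds, field names, and rates here are illustrative, not a standard:

```python
import random

# Sketch of a hybrid sampling decision, evaluated after the request
# completes. Thresholds and trace fields are hypothetical.

LATENCY_SLO_MS = 1000   # illustrative: keep anything slower than this
BASE_RATE = 0.01        # low rate for uninteresting happy-path traffic

def should_keep(trace: dict, rng=random.random) -> bool:
    if trace.get("error"):                      # errors: always kept
        return True
    if trace["duration_ms"] > LATENCY_SLO_MS:   # outliers: always kept
        return True
    return rng() < BASE_RATE                    # everything else: sampled

print(should_keep({"error": True, "duration_ms": 15}))     # True
print(should_keep({"error": False, "duration_ms": 4200}))  # True
```

The design choice is visible in the code: the happy path is the only branch left to chance, so the trace store is biased toward exactly the requests an incident investigation will ask about.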

The point is not "use tail-based sampling." It is: make the sampling decision deliberately, document what it makes invisible, and revisit it when incidents reveal the blind spots.


The Correlation Problem

The payment service has metrics in Prometheus, traces in Tempo, logs in Loki. During an incident, the on-call engineer sees the failure rate spike, finds the error in the logs, then tries to pull the trace for a specific failed payment. That's where it breaks.

The trace ID from the log line doesn't resolve — it was sampled out. Metrics tell you that something is wrong. Logs tell you what errored. Traces show where in the call chain it originated. Stitching all three together for a specific request requires a common identifier — the trace ID — flowing through every signal, present in every signal, and resolving to a retained trace. One broken link in that chain and the correlation collapses.

This is the most common source of observability failure in systems that are technically well-instrumented.

Three prerequisites for correlation to hold:

Context propagation must be universal. Every request must carry a trace context header — W3C traceparent or equivalent propagated by every service, background job, and queue consumer in the call chain. A single service that drops the context breaks it for every downstream signal. Auto-instrumentation handles HTTP boundaries; it does not handle async workers or queue consumers.

Logs must be structured and carry the trace ID as a field. "Show me all logs for trace ID X" should be a single query, not a grep.

The trace must actually exist. A sampling strategy that discards error traces breaks correlation exactly when you need it most.
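The first prerequisite, universal context propagation, amounts to carrying one header everywhere. A minimal sketch of injecting and extracting a W3C traceparent on a hypothetical queue message — real services would use OpenTelemetry's propagators rather than hand-rolling this:

```python
import re

# The W3C traceparent format: version, 32-hex trace ID, 16-hex span ID, flags.
TRACEPARENT = re.compile(r"^00-[0-9a-f]{32}-[0-9a-f]{16}-[0-9a-f]{2}$")

def inject(headers: dict, trace_id: str, span_id: str) -> None:
    # version 00, sampled flag 01
    headers["traceparent"] = f"00-{trace_id}-{span_id}-01"

def extract(headers: dict):
    value = headers.get("traceparent", "")
    if not TRACEPARENT.match(value):
        return None  # dropped or malformed context: correlation breaks here
    return value.split("-")[1]  # the trace ID

# Hypothetical queue message: the producer injects, the consumer extracts.
message = {"body": "charge TXN-8821", "headers": {}}
inject(message["headers"], "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
print(extract(message["headers"]))
```

The `extract` returning `None` is the failure mode described above: one producer that forgets to inject, and every downstream signal for that request is orphaned.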

The payment service, fully correlated: every request gets a trace ID at ingress. That ID propagates through every service call, database query, and queue message. Every log line carries trace_id as a structured field. Alerts fire with the trace ID of a representative failing request in the alert body. The on-call engineer goes from alert to correlated trace in under 90 seconds. Without this, the same engineer spends 20 minutes triangulating disconnected signals and may still not find the specific failing request.
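Structured logs carrying trace_id as a first-class field need nothing beyond the standard library. A minimal sketch, with illustrative field names:

```python
import io
import json
import logging

# A JSON formatter that emits trace_id as a queryable field, not as
# text buried inside the message string.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "msg": record.getMessage(),
            "level": record.levelname,
            "trace_id": getattr(record, "trace_id", None),
        })

buf = io.StringIO()                       # stand-in for stdout/shipper
handler = logging.StreamHandler(buf)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("payments")
log.addHandler(handler)

log.error("gateway timeout",
          extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"})

line = json.loads(buf.getvalue())
print(line["trace_id"])
```

With this in place, "show me all logs for trace ID X" really is a single indexed query in Loki or any structured log store.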


Auto-Instrumentation

Drop in an OpenTelemetry agent and you get spans for every HTTP call, database query, and cache operation with zero code changes. Complete coverage. Genuinely useful.

Also not telemetry design.

Auto-instrumentation tells you a database query happened, how long it took, and whether it errored. It does not tell you the query was for an enterprise customer's payment on a retry through the EU gateway. Business context, the information that makes telemetry operationally useful rather than merely infrastructurally descriptive, lives in your application, not in any framework.

An auto-instrumented payment service and a deliberately instrumented one will both show a 2,400ms database query. Only the second tells you it was payment_id: TXN-8821, customer_tier: enterprise, gateway: stripe-eu, retry attempt 2. That context is the difference between "something was slow" and "EU enterprise payments are slow on retry, here is the gateway call."

The two-layer model that works: auto-instrumentation as the baseline for coverage and correlation anchoring, explicit business-context instrumentation as the signal layer. The first you enable. The second you design. A service with only the first is a codebase with no architecture.
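The signal layer is ordinary application code. A sketch using a stand-in Span class so it runs standalone — with OpenTelemetry this would be `trace.get_current_span().set_attribute(...)` on the real active span:

```python
# Stand-in for the active span, to show the shape of the signal layer.
class Span:
    def __init__(self):
        self.attributes = {}
    def set_attribute(self, key, value):
        self.attributes[key] = value

def charge(span, payment):
    # Layer 1 (auto-instrumentation) already records duration and status.
    # Layer 2: the business context auto-instrumentation cannot know.
    span.set_attribute("payment.id", payment["id"])
    span.set_attribute("customer.tier", payment["tier"])
    span.set_attribute("gateway", payment["gateway"])
    span.set_attribute("retry.attempt", payment["attempt"])

span = Span()
charge(span, {"id": "TXN-8821", "tier": "enterprise",
              "gateway": "stripe-eu", "attempt": 2})
print(span.attributes["customer.tier"])
```

Four attributes, and the 2,400ms query becomes "EU enterprise payments are slow on retry" instead of "something was slow."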


How To Define Your Telemetry Architecture

What question does this signal answer?

Most high-cost signals should justify themselves with a specific operational question. "This metric detects whether EU payment authorization rates are degrading" is a signal with a purpose.

"We might need this someday" is how you end up with 4,000 metric names and 200 in use.

The constraint applies most strictly to high-cardinality metrics and long-retained traces, where the cost of speculative instrumentation is highest. Logs are different: they carry debugging information for failure modes you haven't encountered yet. Over-constraining log instrumentation is its own failure mode. The framing that holds: high-cost signals justify themselves with a question; low-cost signals can be insurance.

What is the tiered retention contract?

Traces are most useful within 72 hours for active incident debugging, but not exclusively. Fraud investigation and long-range regression analysis can require weeks. The practical model is tiered storage: hot retention for 72 hours, warm (sampled or compressed) for 30 days, cold for compliance duration. Each tier carries an explicit cost and query latency contract. The $44k/month bill drops significantly when 18 months of hot trace retention becomes a tiered model.
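The tiered model reduces to a small routing rule. The ages and per-tier promises below are illustrative, not a recommendation:

```python
# Sketch of a tiered retention contract: each tier carries an explicit
# cost and query-latency expectation. All numbers are illustrative.

TIERS = [
    # (max age in hours, tier name, what the tier promises)
    (72,            "hot",  "full traces, sub-second queries"),
    (30 * 24,       "warm", "sampled/compressed, seconds to query"),
    (18 * 30 * 24,  "cold", "compliance archive, minutes to restore"),
]

def tier_for(age_hours: float) -> str:
    for max_age, name, _promise in TIERS:
        if age_hours <= max_age:
            return name
    return "expired"

print(tier_for(24))       # a day old: hot
print(tier_for(10 * 24))  # ten days old: warm
```

The point of writing it down, even this crudely, is that each boundary is now an explicit decision someone can challenge, instead of 18 months of hot storage nobody chose.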

What does your sampling strategy make invisible?

A 1% head-based sample makes rare failure modes invisible. A tail-based sampler with a 5-minute buffer makes real-time alerting on trace data impractical. A hybrid that samples all errors but discards slow successes misses a specific class of degradation. Document the blind spots explicitly. Revisit them when incidents reveal new ones.

What is the cardinality budget for this metric?

High-cardinality labels — user_id, payment_id, session_id — generate millions of distinct time series, each billed separately. The three variants of payment_processed in the payment service exist because an engineer added payment_id as a label without understanding cardinality, the backend rejected it, and they created a new metric name to work around the rejection rather than understand the constraint. Cardinality budgets per metric, enforced in CI, prevent this.
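A CI-time check for this is small. The budget numbers and forbidden-label list below are illustrative policy, not a standard:

```python
# Sketch of a cardinality budget check, runnable in CI against a
# declared metric schema. Budgets and labels are illustrative.

BUDGETS = {"payment_processed_total": 500}   # max allowed time series
FORBIDDEN_LABELS = {"user_id", "payment_id", "session_id"}

def check(metric: str, labels: dict) -> list:
    """labels maps label name -> estimated distinct values."""
    errors = [f"{metric}: forbidden label {name}"
              for name in labels if name in FORBIDDEN_LABELS]
    series = 1
    for distinct in labels.values():
        series *= distinct               # series count is the product
    if series > BUDGETS.get(metric, float("inf")):
        errors.append(f"{metric}: {series} series exceeds budget")
    return errors

# payment_id alone would mint one time series per payment:
print(check("payment_processed_total", {"gateway": 3, "payment_id": 1_000_000}))
```

The engineer who added payment_id as a label would have seen the failure in a pull request, with the budget stated, instead of a silent backend rejection and a third metric name.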


Logs Done Right

The instinct when facing an 800GB/day log bill is to log less. More often, the problem is structure and indexing, not volume.

Every search on unstructured logs is a full-text scan across the entire corpus. The same query on structured logs — payment_status=failed AND gateway=stripe-eu AND customer_tier=enterprise — is a millisecond indexed lookup. Same data, different cost profile, radically different debugging capability.

The model that works: index the high-value fields (trace ID, service name, error code, business identifiers), store but don't index the verbose payload (stack traces, request bodies), and tier storage by access pattern. The volume question becomes a retention and indexing question. "Should we log less" is usually the wrong question.
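The index/store split can be sketched as a partition of each log record. The field list is illustrative:

```python
# Sketch: split each log record into indexed fields (cheap, fast to
# query) and a stored-but-unindexed payload. Field names illustrative.

INDEXED = {"trace_id", "service", "error_code",
           "payment_status", "gateway", "customer_tier"}

def split(record: dict):
    indexed = {k: v for k, v in record.items() if k in INDEXED}
    payload = {k: v for k, v in record.items() if k not in INDEXED}
    return indexed, payload

indexed, payload = split({
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "payment_status": "failed",
    "gateway": "stripe-eu",
    "customer_tier": "enterprise",
    "stack_trace": "GatewayTimeout: upstream did not respond",  # stored only
})
print(sorted(indexed))
```

The stack trace still exists and is retrievable once the record is found; it just never inflates the index that every query has to traverse.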


Conclusion

Telemetry maturity is not about vendor choice, collector sophistication, or OpenTelemetry adoption. It is whether, during an incident, you can answer:

"Show me every failed payment in the last 30 minutes, grouped by upstream service, filtered to specific customers, with the full correlated trace for the three slowest ones."

Thirty seconds to a useful answer: the observability layer is working. Twenty minutes of navigation across disconnected tools with no clear result: it is not, regardless of what you're paying for it.

