Understanding Your Data at Scale: Quantiles, Heavy Hitters, and Why Both Matter
Imagine you're the manager of a busy restaurant. You want to understand how well it's performing, so you calculate the average wait time for food. The number looks fine. But here's the problem: two restaurants can have the same average and represent completely different realities.
In the first, most customers wait 10 minutes — but a handful wait 45 because of kitchen mistakes. In the second, every customer waits exactly 15 minutes. Same average. Completely different experience. The first restaurant has a problem the number isn't showing you.
This is why understanding the full distribution of your data matters.
What Quantiles Actually Tell You
Quantiles divide your data into equal-sized groups and let you ask specific questions about specific positions in the distribution. Instead of one number that hides everything, you get a picture.
Going back to the restaurant:
- 50th percentile (median): 8 minutes — half your customers wait less, half wait more
- 75th percentile: 12 minutes — three-quarters of customers are served within this time
- 90th percentile: 18 minutes — 90% of customers are within this window
- 95th percentile: 25 minutes — only 5% wait longer
- 99th percentile: 40 minutes — only 1% experience this
Most customers have a reasonable experience. But 5% are waiting over 25 minutes — enough to generate complaints, bad reviews, lost regulars. That 1% at 40 minutes points directly at something broken: a kitchen mistake pattern, a staffing gap, an order routing problem. The average of 12 minutes would have told you everything was fine.
The Problem at Scale
Now scale the restaurant to a food delivery app handling millions of orders a day. You still want to understand delivery times — but suddenly the math becomes impossible to do exactly.
To calculate exact quantiles, you need to sort all your data. Sorting 10 million numbers takes time and memory. Worse, delivery data arrives continuously — new orders complete every second. You can't pause the world, sort everything, and start again. And if you tried to store every delivery time to sort later, you'd need gigabytes just for one day's data.
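To make the baseline concrete, here's a minimal sketch of the exact approach in Python: keep every observation, sort it, and index into the sorted list. The data is simulated and the nearest-rank rule is just one of several common percentile conventions. The computation itself is simple; the problem is doing it continuously, at scale, while the data keeps arriving.

```python
import math
import random

def exact_percentile(values, p):
    """Nearest-rank percentile: the smallest value whose rank
    is at least ceil(p/100 * n) in the sorted data."""
    ordered = sorted(values)                           # O(n log n) every time
    rank = max(1, math.ceil(p / 100 * len(ordered)))   # 1-based rank
    return ordered[rank - 1]

# Simulated delivery times in minutes (illustrative only).
deliveries = [random.lognormvariate(2.5, 0.5) for _ in range(10_000)]
for p in (50, 75, 90, 95, 99):
    print(f"p{p}: {exact_percentile(deliveries, p):.1f} min")
```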
Greenwald-Khanna Algorithm
Imagine you've just started tracking delivery times for your food app. The first delivery takes 12 minutes. You write it down. The second takes 28 minutes. Third is 9 minutes — you re-sort: 9, 12, 28. You know exactly where everyone stands.
At five deliveries, ten deliveries, this is fine. But at ten thousand deliveries, storing and sorting everything becomes untenable. You need a smarter structure — one that stays small without losing the ability to answer percentile questions accurately.
The algorithm's approach: insert every new delivery time into a summary structure, but periodically compress that summary by merging nearby entries. Every entry you insert doesn't live as a raw number — it carries three things:
- Its value — the delivery time itself
- g — how many data points this entry definitely represents
- Δ — the uncertainty in its rank: how many additional data points might sit near it in the true sorted order
Together, g and Δ define a rank window — the range of positions this entry could occupy in the true sorted list. An entry with a minimum rank of 450 and a Δ of 30 sits somewhere between rank 450 and rank 480. You don't know exactly where — but you know it's within that window.
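Here's a rough sketch of what one summary entry and its rank window might look like in code. The names (Entry, rank_window) are illustrative, not from any particular library, and the numbers mirror the example above.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    value: float   # the delivery time this entry stands for
    g: int         # data points this entry definitely represents
    delta: int     # extra uncertainty in this entry's rank

def rank_window(summary, i):
    """(min_rank, max_rank) for entry i; summary is kept sorted by value."""
    min_rank = sum(e.g for e in summary[: i + 1])   # g values up to and including i
    return min_rank, min_rank + summary[i].delta

# Illustrative summary of three compressed entries.
summary = [Entry(9.0, 450, 30), Entry(14.0, 300, 40), Entry(31.0, 250, 0)]
print(rank_window(summary, 0))   # (450, 480): the window from the example above
```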
What ε controls
Before the algorithm starts, you choose one number: ε (epsilon) — your error tolerance. It's the fraction of the total dataset you're willing to be wrong by when answering a percentile query. Set ε = 0.1 and you're saying: "My answers can be off by up to 10% of n in rank terms." With a million deliveries, a query for the 90th percentile might return a value whose true rank lies anywhere between p80 and p100: a deliberate, bounded imprecision.
This single number drives one invariant the algorithm enforces for every entry in the summary:
g + Δ ≤ 2εn
As long as this holds, no percentile query can be answered incorrectly beyond your stated tolerance.
Merging
As more deliveries arrive, the summary grows. Periodically, the algorithm looks for neighbouring entries whose rank windows are close enough that collapsing them into one still respects the invariant.
Say you have two adjacent entries: a 31-minute delivery and a 33-minute delivery. Merging them produces a single entry with value 33: the two g counts are added together, and the surviving entry keeps its Δ. The merged entry is less precise than either original, because it now stands for a wider band of ranks. But if g + Δ of the result still stays within 2εn, the merge is safe. Two entries become one, and the guarantee holds.
This compression is what keeps the summary size bounded. And here's the key scaling property: as n grows, 2εn grows with it. A larger dataset allows each entry to carry more uncertainty before violating the invariant, which means more merges become legal, which means the summary compresses more aggressively. The summary size ends up depending on ε — not on how many data points you've seen.
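A simplified sketch of the merge rule, assuming the summary is a list of (value, g, delta) tuples kept sorted by value. The full algorithm adds further bookkeeping about which entries are allowed to absorb which; this shows only the invariant-preserving part.

```python
def compress(summary, eps, n):
    """Merge an entry into its right neighbour whenever the combined
    entry still satisfies g + delta <= 2 * eps * n.
    summary is a list of (value, g, delta) tuples sorted by value."""
    threshold = 2 * eps * n
    out = []
    for value, g, delta in summary:
        if out:
            _, prev_g, _ = out[-1]
            # The merged entry keeps this entry's value and delta,
            # and absorbs the previous entry's g.
            if prev_g + g + delta <= threshold:
                out[-1] = (value, prev_g + g, delta)
                continue
        out.append((value, g, delta))
    return out

# The two adjacent entries from the example: 31 and 33 minutes.
print(compress([(31.0, 70, 20), (33.0, 50, 40)], eps=0.1, n=1_000))
# [(33.0, 120, 40)] -- 70 + 50 + 40 = 160 <= 200, so the merge is safe
```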
At scale
At a million deliveries with ε = 0.1, a single entry can represent up to 200,000 adjacent delivery times — an enormous slice of the distribution compressed into one (value, g, Δ) triple, still carrying the guarantee that any query landing in its range will be answered within ε × n ranks of the truth.
Tighten ε and the summary grows — more entries, smaller windows, more precise answers. Loosen it and the summary compresses further — fewer entries, wider windows, coarser answers. But the trade-off is always explicit. You chose ε upfront. The algorithm delivers exactly what you asked for, nothing more, nothing less.
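And a sketch of answering a percentile query against such a summary: walk the entries accumulating the minimum rank, and return the first value whose rank window sits within ε × n of the target rank. Again, names and structure are illustrative.

```python
def query(summary, q, eps, n):
    """Return a value whose true rank is within eps * n of rank q * n.
    summary is a list of (value, g, delta) tuples sorted by value."""
    target, allowed = q * n, eps * n
    min_rank = 0
    for value, g, delta in summary:
        min_rank += g
        # This entry's whole rank window must fit inside the tolerance band.
        if target - allowed <= min_rank and min_rank + delta <= target + allowed:
            return value
    return summary[-1][0]    # fall back to the largest tracked value

summary = [(9.0, 150, 30), (11.0, 180, 10), (14.0, 170, 20),
           (18.0, 160, 30), (25.0, 180, 0), (40.0, 160, 0)]
print(query(summary, q=0.5, eps=0.1, n=1_000))
# 14.0 -- its rank window (500, 520) is within 100 ranks of rank 500
```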
Heavy Hitters
Quantiles describe the shape of your distribution. They don't tell you what's driving it.
Heavy hitters are the items in your data that appear far more frequently than you'd expect if everything were evenly distributed. In a music streaming service, a small number of songs get played vastly more than others. In an e-commerce site, one product might account for 15% of all searches during a sale. In a banking system, one merchant might process 8% of all credit card transactions.
These dominant patterns matter for several reasons:
Resource management: If one user is consuming 20% of your server capacity, you need to know — for capacity planning, for rate limiting, for provisioning decisions.
Security: If one IP address is making 10,000 requests per minute while most make fewer than 10, that's not a usage pattern — it's an attack.
Performance: If one database query pattern represents 50% of all queries, optimizing that one pattern could transform your system's overall performance. Optimizing anything else is noise.
Anomaly detection: When heavy hitter patterns change suddenly, something significant has happened — a marketing campaign landed, a system component started misbehaving, a trending piece of content went viral.
Finding Heavy Hitters
The obvious way to find heavy hitters is to keep a counter for every item you see. Every time a search term arrives, increment its counter. At the end, look for counters above your threshold.
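In code, the naive version is only a few lines (illustrative data, hypothetical threshold):

```python
from collections import Counter

searches = ["pizza", "sushi", "pizza", "tacos", "pizza", "ramen", "sushi"]
counts = Counter(searches)                    # one counter per distinct item
threshold = len(searches) * 0.3               # "heavy" here = more than 30% of the stream
print({term: c for term, c in counts.items() if c > threshold})   # {'pizza': 3}
```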
This works perfectly at small scale. At large scale it breaks down — not because the counting is hard, but because the number of distinct items is unbounded. A search engine might see hundreds of millions of distinct queries in a day. Keeping a counter for each one means keeping hundreds of millions of counters in memory, continuously. That's the same problem as storing every delivery time — you're back to O(n) memory.
What you actually need is a fixed-memory structure that tracks frequency — one that doesn't grow as new distinct items appear.
The core insight
The algorithm maintains exactly k slots — fixed, regardless of how many distinct items appear. Each slot holds an item and a count.
When a new item arrives:
- If it's already tracked, increment its count
- If there's an empty slot, insert it with count 1
- Otherwise, decrement every slot by 1, remove any that hit zero, and if a slot opens up the new item may take it — otherwise it is dropped
That last step is the key. Each decrement event removes k counted occurrences from the summary, so across a stream of n items there can be at most n/k such events, and each event can cost any single item at most one occurrence. An item that appears more than n/k times therefore can never be driven to zero, no matter how the stream is ordered. It simply occurs too often to be eliminated, so it must survive in the final set.
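Here's a minimal sketch of that slot-based scheme, which is essentially the classic Misra-Gries algorithm. The variant below lets a new item claim a slot freed by the decrement, matching the description above; the names and data are illustrative.

```python
def update(slots, item, k):
    """Process one stream item against at most k tracked (item -> count) slots."""
    if item in slots:
        slots[item] += 1                      # already tracked: bump its count
    elif len(slots) < k:
        slots[item] = 1                       # free slot: start tracking it
    else:
        # All slots full: decrement every count, drop slots that hit zero.
        for tracked in list(slots):
            slots[tracked] -= 1
            if slots[tracked] == 0:
                del slots[tracked]
        # If the decrement freed a slot, the new item takes it;
        # otherwise this occurrence is simply dropped.
        if len(slots) < k:
            slots[item] = 1

slots = {}
stream = ["a", "b", "a", "c", "a", "a", "d", "a", "b", "a"]
for item in stream:
    update(slots, item, k=2)
print(slots)   # {'a': 3, 'b': 1}: 'a' appears 6 times (> n/k = 5), so it must survive
```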
What the guarantee actually means
Set k = 100 and process a stream of a million searches. Any search term appearing more than 10,000 times — more than 1% of all searches — is guaranteed to appear in the final 100 slots.
The reverse is not guaranteed. Some items in those slots may not actually exceed n/k — they are candidates, not confirmed heavy hitters. The stored counts are not exact either: each is a lower bound on the true frequency, with an error of at most n/k. If exact counts are needed, a second pass over the data can verify them.
The memory cost is fixed: k counters, independent of the number of distinct items or the size of the stream.
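If you do need exact figures, the second pass is cheap because it counts only the surviving candidates, so memory stays at k counters. A sketch, reusing the illustrative stream from above:

```python
from collections import Counter

def verify_candidates(stream, candidates, threshold):
    """Second pass: exact counts, but only for the candidate items."""
    exact = Counter(item for item in stream if item in candidates)
    return {item: c for item, c in exact.items() if c > threshold}

stream = ["a", "b", "a", "c", "a", "a", "d", "a", "b", "a"]
candidates = {"a", "b"}                       # survivors from the first pass
print(verify_candidates(stream, candidates, threshold=len(stream) / 2))
# {'a': 6} -- 'b' was only a candidate, not a true heavy hitter
```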
Why Both Matter
Quantile analysis tells you the shape of the problem: 90% of users are fine, 5% are struggling, 1% are having a completely broken experience. That's essential signal — but it's incomplete. A spike in your 99th percentile latency tells you something is wrong. It doesn't tell you where, for whom, or why.
Heavy hitters give you the second half. When your 95th percentile response time jumps from 800ms to 3 seconds, the question is: is this affecting everyone uniformly, or is it concentrated? If one endpoint accounts for 40% of your traffic and is showing 10× its normal error rate, the percentile spike is a symptom. The heavy hitter is the cause.
Return to the restaurant one more time. Quantile analysis reveals that 5% of your customers are waiting over 25 minutes — an experience bad enough to drive them away. Heavy hitter analysis reveals that 60% of those slow orders are coming from one section of the kitchen, or one category of dish, or one time window in the evening rush. The first tells you the problem exists and how bad it is. The second tells you exactly where to look.
Both approaches work by accepting a small, bounded inaccuracy in exchange for the ability to operate at scale. Exact answers to both questions exist — they're just too expensive to compute continuously in a live system.
In your system right now: how long does it take to go from "the percentiles look bad" to knowing specifically what's driving it?