
Approximate Membership: Understanding Bloom and Quotient Filters

Imagine you're a security guard at a massive corporate building with millions of employees. Your job is to quickly decide whether each person trying to enter is authorized. You could keep a complete list of every employee — but searching through millions of names for every person who walks through the door is too slow. You need something faster.

What if your system worked like this: it never lets unauthorized people through, but occasionally flags an authorized employee for additional verification? Most of the time it's right. When it's wrong, it's always wrong in the same direction — it never waves through someone who shouldn't be there, it only stops someone who should be.

That asymmetry is the key insight behind Bloom filters and quotient filters. They trade perfect accuracy for dramatic gains in speed and memory — but only in one direction. They can tell you with certainty that something is not in the set. They can only tell you probabilistically that something is.


The Problem They're Solving

Consider a web crawler that has already visited 500 million pages and needs to avoid revisiting them. The obvious solution — store every URL in a hash table — would require around 25 gigabytes just for the URLs themselves. At a billion pages, that becomes 50 gigabytes. The data structure that checks for duplicates starts to cost more than the data itself.

What you actually need is a structure that can answer "have I seen this URL before?" using a tiny fraction of that memory. The trade-off you're willing to make: occasionally the structure says "yes, I've seen this" when it actually hasn't — causing you to skip a page you should have crawled. That's acceptable. What's not acceptable is the reverse: saying "no, I haven't seen this" when you have, causing you to crawl the same page twice.

This is the one-sided error guarantee both structures provide. False positives are possible — false negatives are not.


Bloom Filters

A Bloom filter doesn't store the items themselves. It stores fingerprints of items — multiple compact signatures spread across a large array of bits, all initially set to zero.

When you add a URL to the filter, you run it through several independent hash functions — say, three. Each hash function maps the URL to a position in the bit array, and you set those three positions to 1. You're not storing the URL. You're marking the positions its fingerprints land on.

When you later ask "have I seen this URL?", you run it through the same three hash functions, get the same three positions, and check whether all three are set to 1. If any position is still 0, the URL was definitely never added, because adding it would have set all three. If all three positions are 1, the URL was probably added — but you can't be certain.
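The add-and-check mechanics above fit in a few lines. This is a minimal sketch, not a production implementation: the array size and hash count are illustrative, and the k "independent" hash functions are simulated by salting a single SHA-256 hash, a common trick.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions per item over an m-bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _positions(self, item):
        # Simulate k independent hash functions by salting one hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        # Mark the positions this item's fingerprints land on.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # All k bits set: "probably yes". Any bit clear: "definitely no".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("https://example.com/a")
print(bf.might_contain("https://example.com/a"))  # True
print(bf.might_contain("https://example.com/b"))  # almost certainly False
```

Note that `add` and `might_contain` never touch the item itself after hashing; the filter stores only bit positions, which is exactly why it is so compact and why it cannot enumerate its contents.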

Why false positives happen

The uncertainty comes from collisions. Each position in the bit array is shared across every URL in the filter. Over time, different URLs set different positions — and by coincidence, the three positions that a new URL would use might already be set by three different previously-added URLs. The filter sees all three positions marked and concludes the URL has been seen, even though it hasn't.

The more items you add, the more positions get marked, and the more likely these coincidental collisions become. A nearly-full Bloom filter starts saying "probably yes" to almost everything — its false positive rate degrades as it fills up.
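This degradation can be quantified. With an array of m bits, k hash functions, and n inserted items, the classic approximation for the false positive rate is (1 − e^(−kn/m))^k. A quick sketch (the specific sizes are illustrative):

```python
import math

def bloom_fp_rate(m_bits, k_hashes, n_items):
    """Classic approximation of the false positive rate: (1 - e^(-k*n/m))^k."""
    return (1 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

# 10 bits per item with 3 hashes: roughly a 1.7% false positive rate.
light_load = bloom_fp_rate(m_bits=10_000_000, k_hashes=3, n_items=1_000_000)

# The same array twice as full: the rate climbs to roughly 9%.
heavy_load = bloom_fp_rate(m_bits=10_000_000, k_hashes=3, n_items=2_000_000)
```

Doubling the item count far more than doubles the error rate, which is why sizing a Bloom filter for its expected load matters so much.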

A concrete example: browser security

Google Chrome uses a Bloom filter to protect users from malicious websites. Chrome keeps a compact filter — a few megabytes — containing fingerprints of known dangerous URLs. When you visit a site, Chrome first checks this local filter. If the filter says "definitely not dangerous," the page loads immediately with no network call. If it says "possibly dangerous," Chrome contacts Google's servers for a full verification.

The result: 99% of safe websites load instantly. The occasional false positive — a safe site that coincidentally matches a dangerous site's fingerprints — gets an unnecessary server check. That's a minor cost. The alternative — checking every single site against remote servers — would add hundreds of milliseconds to every page load.

What Bloom filters can't do

Two limitations follow directly from how the structure works.

You cannot remove items. Removing a URL would mean clearing the three bit positions its hash functions land on. But those positions might be shared with other URLs. Clearing them would corrupt the filter — other URLs would suddenly appear to never have been added. Once a bit is set, you can't unset it safely.

You cannot enumerate what's stored. The filter holds no record of which items set which bits. It knows only "set" or "not set" for each position. There's no way to reconstruct the original items from that information.


Quotient Filters

Quotient filters solve both limitations — deletion and enumeration — by storing fingerprints more systematically. Instead of scattering bits randomly across the array, they give each fingerprint a home address.

The filing cabinet model

Think of a quotient filter as a filing cabinet where each document gets assigned to a specific drawer based on part of its identifier. When you add "facebook.com" to the filter, you compute its hash and split it into two parts: the quotient and the remainder. The quotient tells you which drawer this item belongs in. The remainder is what you actually store in that drawer — a compact fingerprint of the item.

If the assigned drawer is already occupied, the remainder gets placed in the next available drawer, with metadata indicating where it originally belonged. Lookups work by going to the assigned drawer and scanning forward through any nearby occupied drawers, checking remainders as you go.

This organisation produces three properties Bloom filters don't have.

Deletion works

Because each item's remainder is stored in a predictable location, you can find it and remove it without disturbing anything else. Removing facebook.com's remainder from its drawer doesn't affect google.com's remainder in a different drawer. The shared-position problem that makes Bloom filter deletion impossible simply doesn't exist here.
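The filing-cabinet model, including deletion, can be sketched as follows. This is a deliberate simplification: a real quotient filter packs all remainders into one contiguous array with three metadata bits per slot and shifts colliding entries into adjacent slots, while here each "drawer" is simply a set, which keeps the quotient/remainder split and the add/lookup/delete logic easy to see.

```python
import hashlib

class SimplifiedQuotientFilter:
    """Sketch of the quotient/remainder idea. The hash of each item is split
    into a quotient (which drawer) and a remainder (the fingerprint stored
    in that drawer). Not a real quotient filter: collisions are held in
    per-drawer sets instead of a contiguous slot array with metadata bits."""

    def __init__(self, quotient_bits=8, remainder_bits=16):
        self.q_bits = quotient_bits
        self.r_bits = remainder_bits
        self.drawers = {}  # quotient -> set of remainders

    def _split(self, item):
        digest = hashlib.sha256(str(item).encode()).digest()
        fingerprint = int.from_bytes(digest[:8], "big") % (1 << (self.q_bits + self.r_bits))
        # High bits pick the drawer; low bits are the stored fingerprint.
        return fingerprint >> self.r_bits, fingerprint & ((1 << self.r_bits) - 1)

    def add(self, item):
        q, r = self._split(item)
        self.drawers.setdefault(q, set()).add(r)

    def might_contain(self, item):
        q, r = self._split(item)
        return r in self.drawers.get(q, set())

    def remove(self, item):
        # Deletion removes only this item's remainder from its own drawer,
        # leaving every other item's fingerprint untouched.
        q, r = self._split(item)
        self.drawers.get(q, set()).discard(r)

    def fingerprints(self):
        # Enumeration: every stored fingerprint can be reconstructed.
        for q, remainders in self.drawers.items():
            for r in remainders:
                yield (q << self.r_bits) | r

qf = SimplifiedQuotientFilter()
qf.add("facebook.com")
qf.add("google.com")
qf.remove("facebook.com")
print(qf.might_contain("facebook.com"))  # False (barring a rare fingerprint collision)
print(qf.might_contain("google.com"))    # True
```

The same structure also illustrates the enumeration property discussed below: `fingerprints()` walks the drawers and reconstructs everything stored, something a Bloom filter's anonymous bit array cannot do.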

Lookups are more cache-friendly

A Bloom filter lookup jumps to several random positions across a large array — positions that are unlikely to be in the CPU's cache. A quotient filter lookup starts at one drawer and scans forward through a small contiguous region. That locality means the memory your lookup needs is far more likely to already be in cache, which matters enormously at scale.

Enumeration is possible

Since remainders are stored in predictable locations, you can scan through the array and reconstruct the fingerprints of everything stored. You can produce a list of all items in the filter — something a Bloom filter structurally cannot do.

A concrete example: database query optimisation

A database receiving a query for a customer ID that doesn't exist would normally need to check its storage files — potentially a slow disk read — before concluding the record isn't there. With a quotient filter over the stored customer IDs, the database first checks the filter. If the filter says "definitely not here," the disk read never happens. If it says "probably here," the database proceeds with the normal lookup.

Since a significant proportion of real-world queries target records that don't exist — typos, stale references, speculative lookups — eliminating those disk reads dramatically reduces the load on the storage layer.


Which One to Use

The choice comes down to three questions.

Do you need to remove items? If yes, Bloom filters are ruled out. Their structure makes deletion impossible without rebuilding. Quotient filters handle deletion cleanly.

How static is your dataset? Bloom filters work well when you load a dataset once and then perform many lookups — a spell checker, a malware signature database, a URL blacklist. Quotient filters are better suited to datasets that evolve continuously — a crawler's visited-URL set, a cache's key index, anything where items are regularly added and removed.

How full will the structure get? Bloom filters degrade gracefully at low fill rates but their false positive rate rises steeply as they approach capacity. Quotient filters maintain more consistent performance across fill rates and can be resized more gracefully when the dataset grows beyond initial estimates.

Both structures give you the same fundamental guarantee: false positives are possible, false negatives are not. The question is which trade-offs around mutability, cache behaviour, and fill-rate sensitivity fit your use case.


The deeper pattern here is the same one behind approximate quantiles and heavy hitter sketches. Perfect accuracy at scale is expensive — sometimes prohibitively so. Structures that accept small, bounded, one-directional errors can operate at orders of magnitude less cost. The engineering skill is knowing which errors are acceptable, which direction they need to be bounded in, and building that contract explicitly into the system rather than discovering it accidentally under load.


What membership checks in your system are paying the full cost of exact lookups — and would a bounded false positive rate be an acceptable trade?
