Bloom filters are a great tool for membership checks when disk is the enemy and a probabilistic answer is acceptable. But production reality has a sharp edge: a standard Bloom filter is append-only. As soon as you need deletions, many engineers reach for a counting Bloom filter. On paper it fixes deletions. In practice, it often pushes the structure out of CPU cache and turns each lookup into a string of cache misses.
This log explains when Bloom filters work, why counting variants are often a trap, and what to do instead.
## Bloom filter refresher: what it guarantees (and what it does not)
A Bloom filter answers one question: is x definitely not in the set, or maybe in the set?
- No false negatives: if the filter says “not present”, it is correct.
- False positives exist: if the filter says “present”, it means “maybe”.
The structure:
- A bit array of size `m`.
- `k` hash functions.
- Insert sets `k` bits.
- Lookup checks those same `k` bits.
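A minimal sketch of that structure (the double-hashing scheme and the `BloomFilter` class here are illustrative choices, not the only way to build one):

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: an m-bit array and k hash functions."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # m bits, packed into bytes

    def _positions(self, item: str):
        # Derive k positions via double hashing from one SHA-256 digest
        # (an illustrative choice; production code often uses faster hashes).
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item: str) -> bool:
        # False is definitive ("not present"); True only means "maybe".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))
```

Note there is no `remove` method: the structure has no way to know whether a bit is safe to clear.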
## The operational problem: standard Bloom filters do not support deletion
Standard Bloom filters are append-only because bits are shared across items:
- Item `A` sets bits at positions `p1`, `p2`, `p3`.
- Item `B` also sets some overlapping bit, say `p2`.
If you delete A and try to unset p2, you might break B and introduce a false negative. That is unacceptable for the Bloom filter contract.
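A tiny illustration of that failure mode (the bucket positions are hand-picked to force the overlap, not real hash output):

```python
# Illustrative only: hand-picked bucket positions, not real hash output.
bits = [0] * 8

A = [1, 2, 3]  # buckets item A hashes to
B = [2, 5, 6]  # buckets item B hashes to (overlaps A at bucket 2)

for p in A + B:
    bits[p] = 1

# Naive delete of A: unset all of A's bits, including the shared bucket 2.
for p in A:
    bits[p] = 0

# B was never deleted, but its lookup now fails: a false negative.
b_present = all(bits[p] for p in B)  # False
```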
## Counting Bloom filters: how the deletion trick works
Counting Bloom filters replace each bit with a small counter.
| Structure | Bucket type | Insert | Delete |
|---|---|---|---|
| Bloom filter | 1-bit | set bit to 1 | not supported |
| Counting Bloom filter | 4-bit or 8-bit counter | increment counter | decrement counter |
Conceptually:
- Insert: `counter[i] += 1` for each hashed bucket.
- Delete: `counter[i] -= 1` for each hashed bucket.
- Membership: treat `counter[i] > 0` as "bit is set".
It is a clean solution at the algorithm level.
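The same idea with 8-bit counters might look like this (a sketch: counter width, the saturation rule, and the hash scheme are simplifications):

```python
import hashlib


class CountingBloomFilter:
    """Counting Bloom filter sketch: one 8-bit counter per bucket."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.counters = bytearray(m)  # 8x the footprint of a 1-bit array

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        for p in self._positions(item):
            if self.counters[p] < 255:  # saturate instead of overflowing (a simplification)
                self.counters[p] += 1

    def remove(self, item: str) -> None:
        # Caller must only delete items that were previously added.
        for p in self._positions(item):
            if self.counters[p] > 0:
                self.counters[p] -= 1

    def might_contain(self, item: str) -> bool:
        return all(self.counters[p] > 0 for p in self._positions(item))
```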
## Mechanical sympathy: why counting Bloom filters often lose in production
The performance story changes when you look at cache residency and random memory access.
### Cache residency is the whole game
A well-tuned Bloom filter for millions of items can be only a few megabytes.
That matters because:
- A small filter can sit in L3 cache.
- L3 access is on the order of 10 to 15 ns on modern CPUs.
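The standard sizing formulas make "a few megabytes" concrete (the numbers below are one illustrative configuration):

```python
import math


def bloom_sizing(n: int, p: float):
    """Optimal bit count m and hash count k for n items at false-positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))
    k = round((m / n) * math.log(2))
    return m, k


# Example: 1 million items at a 1% false-positive rate.
m, k = bloom_sizing(1_000_000, 0.01)
size_mb = m / 8 / 1_000_000  # ~1.2 MB: small enough for L3 on many CPUs
```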
When you move from 1-bit buckets to 4-bit or 8-bit counters, the footprint multiplies:
| Variant | Bucket size | Relative memory |
|---|---|---|
| Bloom | 1 bit | 1x |
| Counting Bloom (4-bit) | 4 bits | 4x |
| Counting Bloom (8-bit) | 8 bits | 8x |
Cross the cache size boundary and the filter falls into main memory.
### Random access becomes expensive
Bloom filters depend on hash functions that behave like uniform random mappings.
So a single lookup touches k non-contiguous buckets:
- Typical `k`: 3 to 5
- Typical pattern: `k` random reads and comparisons
If the structure is in RAM, not L3, each random read is likely a cache miss.
Rough order-of-magnitude latencies:
| Location | Typical access time |
|---|---|
| L3 cache | 10 to 15 ns |
| RAM | ~100 ns |
Example: k = 4 hashes and the filter lives in RAM.
4 random loads * 100 ns ≈ 400 ns
That is 400 ns to return a probabilistic “maybe.”
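As a back-of-envelope model (using the rough latency figures above, not measurements):

```python
def lookup_cost_ns(k: int, access_ns: float) -> float:
    """Approximate memory time per lookup: k random reads at a given latency."""
    return k * access_ns


cache_resident = lookup_cost_ns(4, 12.5)   # ~50 ns when the filter sits in L3
ram_resident = lookup_cost_ns(4, 100.0)    # ~400 ns once it spills to RAM
```

Same algorithm, same `k`, roughly an 8x difference in memory time per lookup.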
At scale, this becomes a CPU stall factory:
- You burn cycles waiting on memory.
- The benefit of avoiding disk I/O is real, but the cache miss rate can dominate CPU time.
This is where counting Bloom filters quietly become a throughput bottleneck.
## When not to use a counting Bloom filter
Avoid counting Bloom filters when any of these are true:
- You need high QPS membership checks and the filter no longer fits in cache.
- Your hot path budget is tight (e.g., low millisecond or sub-millisecond tail latency targets).
- Deletions are frequent and you are constantly touching random counters.
- You only used the Bloom filter to avoid a read, but now the filter itself is your CPU bottleneck.
## What to do instead
### Option 1: use an exact hash table if memory is not the real constraint
If you need:
- exact membership, and
- true deletion,
then a real hash set is often the simplest answer.
It costs more memory, but it can win in practice because:
- you avoid `k` random accesses across a large array,
- and modern hash tables can be tuned for locality and load factor.
If RAM is available, do not over-optimize with a probabilistic structure that becomes slower.
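In Python, for instance, the "real hash set" is just the built-in `set`: exact membership, true deletion, no counters to maintain (keys here are illustrative):

```python
seen = set()

seen.add("user:42")
seen.add("user:99")

present = "user:42" in seen  # exact answer, no false positives
seen.discard("user:42")      # true deletion, safe even if the key is absent
```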
### Option 2: use a Cuckoo filter for deletions with better locality
If you truly need a probabilistic structure with deletions and good performance, consider Cuckoo filters:
- They store small fingerprints in buckets.
- They support insertion, lookup, and deletion.
- Their memory access pattern can be more cache-friendly than counting Bloom filters.
They are not a universal replacement, but they often behave better when cache pressure is the dominant cost.
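A simplified sketch of the idea (the class, table sizes, and hash choices are illustrative; real implementations pack fingerprints into fixed-width slots rather than Python lists):

```python
import hashlib
import random


class CuckooFilter:
    """Cuckoo filter sketch: each item stores a small fingerprint in one
    of two candidate buckets. num_buckets must be a power of two so the
    partial-key trick (alt_index = index XOR hash(fingerprint)) is its
    own inverse."""

    def __init__(self, num_buckets: int = 1 << 10, bucket_size: int = 4, max_kicks: int = 500):
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    def _fingerprint(self, item: str) -> int:
        digest = hashlib.sha256(item.encode()).digest()
        return digest[0] | 1  # 8-bit fingerprint, never zero

    def _hash_fp(self, f: int) -> int:
        return int.from_bytes(hashlib.sha256(bytes([f])).digest()[:8], "big")

    def _index_pair(self, item: str):
        digest = hashlib.sha256(item.encode()).digest()
        f = self._fingerprint(item)
        i1 = int.from_bytes(digest[1:9], "big") % self.num_buckets
        i2 = (i1 ^ self._hash_fp(f)) % self.num_buckets
        return f, i1, i2

    def add(self, item: str) -> bool:
        f, i1, i2 = self._index_pair(item)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(f)
                return True
        # Both candidate buckets full: evict fingerprints until one lands.
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = random.randrange(self.bucket_size)
            f, self.buckets[i][slot] = self.buckets[i][slot], f
            i = (i ^ self._hash_fp(f)) % self.num_buckets  # evictee's other bucket
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(f)
                return True
        return False  # table too full

    def might_contain(self, item: str) -> bool:
        # At most two bucket reads per lookup, versus k scattered reads.
        f, i1, i2 = self._index_pair(item)
        return f in self.buckets[i1] or f in self.buckets[i2]

    def remove(self, item: str) -> bool:
        # Caller must only delete items that were previously added.
        f, i1, i2 = self._index_pair(item)
        for i in (i1, i2):
            if f in self.buckets[i]:
                self.buckets[i].remove(f)
                return True
        return False
```

The locality win is visible in `might_contain`: a lookup touches at most two buckets, and each bucket is a small contiguous run of fingerprints, instead of `k` scattered single-bit (or single-counter) reads.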
## Architecture view: where the time goes
```mermaid
graph TD
    Q["Membership<br/>query"] --> H["Compute<br/>k hashes"]
    H --> BF{"Bloom filter<br/>fits in cache?"}
    BF -->|"Yes"| L3["L3 hits:<br/>fast random reads"]
    BF -->|"No"| RAM["RAM misses:<br/>slow random reads"]
    RAM --> STALL["CPU stall and<br/>throughput loss"]
    L3 --> FAST["Low overhead<br/>probabilistic check"]
```
The same algorithm can be fast or slow depending on one question: does the structure fit in cache?
## Key takeaways
- Bloom filters are great when they stay small and cache-resident. The primary win is fast probabilistic membership without touching disk.
- Standard Bloom filters do not support deletion because unsetting bits breaks the no-false-negative guarantee.
- Counting Bloom filters make deletion possible but can multiply the memory footprint by 4x to 8x, pushing the filter out of cache.
- Once the filter lives in RAM, `k` random accesses can cost roughly `k * 100 ns`, which adds up quickly at high QPS.
- If you need deletions, consider:
  - an exact hash table (when memory is acceptable), or
  - a Cuckoo filter (when you need probabilistic membership with deletion and better locality).