Bloom Filters
Probabilistic data structure: "Definitely not" or "Probably yes."
1The Nightclub Bouncer Analogy
Bloom filter is a space-efficient probabilistic data structure that tests whether an element is in a set. It can have false positives (says "yes" when "no") but never false negatives.
2How It Works
Visual Example
Bit Array (m=10): [0,0,0,0,0,0,0,0,0,0]
Add "hello": hash1→2, hash2→5, hash3→9
[0,0,1,0,0,1,0,0,0,1]
Add "world": hash1→1, hash2→5, hash3→7
[0,1,1,0,0,1,0,1,0,1]
Query "hello": positions 2,5,9 all = 1 → PROBABLY YES
Query "foo": positions 3,6,8 → 0 found → DEFINITELY NO
Query "bar": positions 1,5,9 all = 1 → PROBABLY YES (false positive!)3Trade-offs
Pros
- ✓Extremely space efficient
- ✓O(k) add and query (constant)
- ✓No false negatives ever
- ✓Parallelizable
Cons
- ✗False positives possible
- ✗Can't delete elements
- ✗Can't retrieve elements
- ✗Need to tune m and k
False positive rate ≈ (1 - e^(-kn/m))^k where n = elements, m = bits, k = hash functions. More bits = fewer false positives.
4Use Cases
Cache: Check Before Expensive Lookup
Before querying database, check Bloom filter. If 'definitely no', skip the query entirely.
Google Bigtable, HBase, Cassandra use Bloom filters for SSTable lookups
Duplicate Detection
Have I seen this URL/email/username before?
Web crawlers, spam filters, one-click unsubscribe
Password Breach Check
Is this password in a leaked password database?
HaveIBeenPwned uses Bloom filters
CDN: Cache Lookup
Is this object in cache? Check filter before disk lookup.
Akamai, Cloudflare
5Sizing a Bloom Filter
Optimal Parameters
Given: n = expected elements, p = desired false positive rate
Bits needed: m = -n * ln(p) / (ln(2))^2
Hash functions: k = (m/n) * ln(2)
| Elements | FP Rate | Bits | Size |
|---|---|---|---|
| 1 million | 1% | 9.6M | 1.2 MB |
| 1 million | 0.1% | 14.4M | 1.8 MB |
| 10 million | 1% | 96M | 12 MB |
| 1 billion | 1% | 9.6B | 1.2 GB |
6Variants
Counting Bloom Filter
Use counters instead of bits. Supports deletion.
Tradeoff: 4x more space
Cuckoo Filter
Supports deletion and has better space efficiency for low FP rates.
Tradeoff: More complex
Quotient Filter
Cache-friendly, supports merging and deletion.
Tradeoff: Slightly slower inserts
Scalable Bloom Filter
Grows dynamically as elements are added.
Tradeoff: Multiple underlying filters
7Key Takeaways
?Quiz
1. Bloom filter says 'no'. What do you know?
2. When is Bloom filter most useful?