Module 10 - Specialized Databases

Bloom Filters

Probabilistic data structure: "Definitely not" or "Probably yes."

1The Nightclub Bouncer Analogy

Simple Analogy
A bouncer at a club has a mental list of banned people. He can quickly say "definitely not on the list" or "might be on the list, let me check the full list." A Bloom filter works the same way-fast rejection of non-members, occasional false positives for members.

Bloom filter is a space-efficient probabilistic data structure that tests whether an element is in a set. It can have false positives (says "yes" when "no") but never false negatives.

2How It Works

1
Initialize
Create a bit array of size m, all set to 0. Choose k hash functions.
2
Add Element
Hash the element k times. Set those k bit positions to 1.
3
Query
Hash the element k times. If ALL k bits are 1 → 'probably yes'. If ANY bit is 0 → 'definitely no'.
Visual Example
Bit Array (m=10): [0,0,0,0,0,0,0,0,0,0]

Add "hello": hash1→2, hash2→5, hash3→9
              [0,0,1,0,0,1,0,0,0,1]

Add "world": hash1→1, hash2→5, hash3→7
              [0,1,1,0,0,1,0,1,0,1]

Query "hello": positions 2,5,9 all = 1 → PROBABLY YES
Query "foo":   positions 3,6,8 → 0 found → DEFINITELY NO
Query "bar":   positions 1,5,9 all = 1 → PROBABLY YES (false positive!)

3Trade-offs

Pros
  • Extremely space efficient
  • O(k) add and query (constant)
  • No false negatives ever
  • Parallelizable
Cons
  • False positives possible
  • Can't delete elements
  • Can't retrieve elements
  • Need to tune m and k

False positive rate ≈ (1 - e^(-kn/m))^k where n = elements, m = bits, k = hash functions. More bits = fewer false positives.

4Use Cases

Cache: Check Before Expensive Lookup

Before querying database, check Bloom filter. If 'definitely no', skip the query entirely.

Google Bigtable, HBase, Cassandra use Bloom filters for SSTable lookups

Duplicate Detection

Have I seen this URL/email/username before?

Web crawlers, spam filters, one-click unsubscribe

Password Breach Check

Is this password in a leaked password database?

HaveIBeenPwned uses Bloom filters

CDN: Cache Lookup

Is this object in cache? Check filter before disk lookup.

Akamai, Cloudflare

5Sizing a Bloom Filter

Optimal Parameters

Given: n = expected elements, p = desired false positive rate

Bits needed: m = -n * ln(p) / (ln(2))^2

Hash functions: k = (m/n) * ln(2)

ElementsFP RateBitsSize
1 million1%9.6M1.2 MB
1 million0.1%14.4M1.8 MB
10 million1%96M12 MB
1 billion1%9.6B1.2 GB

6Variants

Counting Bloom Filter

Use counters instead of bits. Supports deletion.

Tradeoff: 4x more space

Cuckoo Filter

Supports deletion and has better space efficiency for low FP rates.

Tradeoff: More complex

Quotient Filter

Cache-friendly, supports merging and deletion.

Tradeoff: Slightly slower inserts

Scalable Bloom Filter

Grows dynamically as elements are added.

Tradeoff: Multiple underlying filters

7Key Takeaways

1Bloom filter: "Definitely no" or "Probably yes" with no false negatives
2O(k) constant time for add and query, regardless of set size
3Space efficient: 1M elements with 1% FP = only 1.2 MB
4Use to avoid expensive lookups: cache, database, disk
5Can't delete from standard Bloom filter. Use Counting Bloom Filter for that.

?Quiz

1. Bloom filter says 'no'. What do you know?

2. When is Bloom filter most useful?