Skip to main content

🌸 Bloom Filters

  1. Suppose we want to query some existence but we are okay with some false positives. For example, Netflix doesn't want to show you already watched movies.
  2. It is a kind of probabilistic data structure which allows us to quickly check whether an element might be in a set.

Components

  1. Bit Array
    1. A bit array of size n where initially all bits are set to zero.
  2. Hash Functions
    1. A hash function takes an input and maps it to an index of the bit array.
    2. In filters, we use k hash functions, so for each value there will be k indexes to be mapped in the array.

Working

  1. For each value, we get k indexes and mark bits as 1 there.
  2. In lookup, we see if all bits for a given value's hash are 1, then the value is present in the set.
  3. If any of the indexes is off, then the value is not present.
  4. There can be false positives as well.

Things to Consider

  1. There is one drawback: traditional Bloom filters don't support deletion.
  2. Size n must be chosen correctly, as large can reduce false positives but can increase memory usage.
  3. Higher k improves accuracy but slows down insertions and lookups.
  4. For more dynamic datasets, we can use Scalable Bloom filters, etc.

Where Not to Use

  1. Where false positives are not acceptable.
  2. If exact checks are needed.
  3. If deleting an element is necessary.
  4. Storage is not an issue.

Real World Applications

Use CaseCompany/ServiceWhy Bloom Filters?
Database IndexingGoogle Bigtable, HBase, CassandraAvoids unnecessary disk lookups
Web CachingFacebook, MemcachedPrevents unnecessary database queries
CDN OptimizationAkamaiReduces backend fetches
URL BlacklistGoogle Chrome Safe BrowsingFaster security checks
BlockchainBitcoin SPV WalletsReduces storage & bandwidth usage
Password Leak ChecksHave I Been Pwned?Checks without exposing passwords
Spam FilteringGmail, Yahoo MailSpeeds up blocklist lookups

Traditional vs Counting vs Scalable Bloom Filter

FeatureTraditional Bloom Filter (BF)Counting Bloom Filter (CBF)Scalable Bloom Filter (SBF)
Memory UsageLowHigher (stores counters)Dynamic (expands as needed)
False Positive RateFixed, increases with sizeFixed, similar to BFControlled, grows with data
Supports Deletions?NoYes (uses counters)No
Handles Growing Data?No (fixed size)No (fixed size)Yes (expands dynamically)
Use CaseFast membership checksCache eviction, DB indexingWeb crawling, large datasets
Hash Functions UsedMultipleMultipleMultiple (per sub-filter)
Example UsageCaching, DB queriesMemcached, DB indexingGoogle Bigtable, log storage
  1. Counting stores data in an array and supports deletion, as in each insertion count is increased and in deletion reduced.
  2. Scalable uses a dynamic array.