A bloom filter is a simple space-efficient probabilistic data structure designed to test whether an element is present in a set.
It is blazingly fast and use minimal memory at the cost of potential false positives. False positive matches are possible, but false negatives are not.
Bloom proposed the technique for applications where the amount of source data would require an impractically large amount of memory if conventional error-free hashing techniques were applied.
The base of a bloom filter is a bit-array associated with following variables:
The Bloom filter takes an array of bits, initially all set to 0, and for up to different elements, either test or set bits using positions chosen using hash functions. If all bits are set, the element probably already exists, with a false positive rate of ; if any of the bits are not set, the element certainly does not exist.
These variables, , , and , should be picked based on how acceptable false positives are. If the values are picked and the resulting probability is too high, the values should be tweaked and the probability re-calculated.
It’s a nice property of Bloom filters that you can modify the false positive rate of your filter. A larger filter will have less false positives, and a smaller one more.
The formula to calculate probablity of a false positives is:
The hash functions should be fast, independent, and uniformly distributed. The more hash functions you have, the slower your Bloom filter, and the quicker it fills up. If you have too few, however, you may suffer too many false positives.
Putting this constraint aside, for a given and , the value of that minimizes the false positive probability is:
Cryptographic hash functions provide stability and guarantee but are expensive in calculation. With increase in , bloom filter become slow. Non-cryptographic hash functions do not provide guarantee but provide major performance improvement.
Bloom filter only supports the insertion and search operations. You can’t iterate through the items in the set or delete items.
Bloom filters take up space, regardless of the number of items inserted. Both insertion and search operations have time-complexity of .
During insertion, hash functions, are used to create hashes of the input. These hash functions output indexes. At every index received, we simply change the value in our bloom filter to 1.
Bloom filters can only definitively identify true negatives. They cannot identify true positives. When you search an item in a bloom filter, the possible answers are:
During a search, the same hash functions are called and used to hash the input. We then check if the indexes received all have a value of 1 inside our bloom filter. If they all have a value of 1, we know that the bloom filter may have had the value previously inserted.
However, it’s not certain, because it’s possible that other values previously inserted flipped the values to 1. The values aren’t necessarily 1 due to the item currently being searched for. Absolute certainty is impossible unless only a single item has previously been inserted.
While checking the bloom filter for the indexes returned by our hash functions, if even one of them has a value of 0, we definitively know that the item was not previously inserted.
The Bloom filter has many variants, some deal with deletion and others don’t, some use the memory efficiently but increase the where others pay the trade-off in the reversed way.
Typical use cases of Bloom filter are content summaries and sets that would usually grow too large in fields such as networking, distributed systems, databases and analytics.
The Bloom filter’s simplicity, ease of use, and excellent performance make it a standard data structure that is and will continue to be of great use in many applications.