BloomFilter
Categories: Engineering
We recently had a situation where we had to search a big list of 500 million hashes against a list of 40 million hashes. The 500M hashes were stored in flat, unsorted text files on 5 DVDs, so there was no easy way to search that list. The 40M hashes were stored in a MySQL database. Some benchmarking showed that it would take something like 20 days to run every one of those 500M hashes against our database, and that just wasn't acceptable.
So what could we do to reduce the database load and speed up the whole process in general? Use a Bloom filter.
Bloom Filters are a data structure that can be used to compactly represent a very large set of items. I won't go into the details of how they work, but the important takeaway point is that they produce no false negatives, so you can use them to save lots of time when it comes to searching large datasets. If your filter tells you that an item isn't in the dataset, then it just isn't there, and you can save yourself a potentially much more expensive query against the actual datastore.
We dumped the 40M list of hashes to a file and computed a Bloom filter on it. The resulting file was a 100MB bit array, distilled down from 1.6GB of hashes. Then, we checked every one of the 500M hashes against the filter, producing a list of possible matches. This took only ~8 hours, a significant speed improvement. The resulting list had only about 19M entries, meaning about 50% of the 40M set were found in the 500M set.
Of course, due to the properties of Bloom filters, some of the items in that list were false positives. At this point, we verified the results against the actual database. At the end, 98% of the positives were true positives.
When it's all said and done, Bloom filters saved a LOT of time. Portions of this implementation are based off of BloominSimple, as implemented by Peter Cooper. The major differences between my implementation and Peter's is that I improved the speed of the bit field by optimizing for write-only operations. (Peter's implementation of BitField allowed you to set and unset the values of the field; mine only does setting.) I also fixed a small issue with the way that a single SHA1 was being turned into four independent hashes.
You can install our BloomFilter implementation via:
gem install bloomfilter
That is, if the gem has propagated yet. If not, head on over to RubyForge and download it directly.

Rock. I may use this in iHate to implement the /off-topic command. Currently I'm using a Java lib that supports vector and bayes filters. The idea is that as your chatting, if someone mentions something you think is off-topic you just use this command to register it. When potentially off-topic stuff is said, it gets shrunk down so you aren't bothered by it.
A bloom filter might be more appropriate, especially since that could be stored and would map more to "messages with these words are off-topic".
Thanks again.
This is great
I was using the old BloominSimple thing last year. Glad to see something packaged now.
Do you have any plans to make these blooms serializable in an Internet-transport-friendly form, eg. base64? The old LOAF project had something in that vein, see http://loaf.cantbedone.org/download.htm and I'm wondering about including "who my buddies are" blooms in FOAF files if we can figure an encoding, per http://www.perl.com/pub/a/2004/04/08/bloom_filters.html?page=2
Unfortunately bit-level stuff makes my head hurt! I'll get there eventually but wondered what your plans are...
@Dan: Right now I have no plans to make "Internet-serializable" filters, but it would really be a trivial thing to do. My implementation uses a byte array under the hood, so all you'd have to do is base64 each byte in the array. You're welcome to submit a patch
Hi Bryan,
Since there's an upper-limit to how many items you can store in an initial BloomFilter, how would you recommend growing the BloomFilter once you've exceeded the original size?
Could you create new, larger BloomFilter and then somehow UNION the two so that you don't lose your history from the first BloomFilter? Any examples would be helpful.
@dkindlund:
Sadly, there's no way of growing a bloom filter to a larger size, since you lose the original items and hash values. When your first filter gets too full, you have to make a whole new one. For this reason, I recommend making your filters big enough to accommodate as many items as you could possibly imagine being in the set. This usually doesn't lead to impossibly large filters and it will save you a real headache down the road.
Bryan, I think there's a way to fake growing an filter, by automating the processing of creating a new one. I'm not actually quite sure how it works, but check out this Python "scalable bloom filter" library: http://github.com/jaybaird/python-bloomfilter/tree/master
Quote from the readme:
pybloom is a module that includes a Bloom Filter data structure along with
an implmentation of Scalable Bloom Filters as discussed in:
P. Almeida, C.Baquero, N. Preguiça, D. Hutchison, Scalable Bloom Filters,
(GLOBECOM 2007), IEEE, 2007.
Bloom filters are great if you understand what amount of bits you need to set
aside early to store your entire set. Scalable Bloom Filters allow your bloom
filter bits to grow as a function of false positive probability and size.
You shouldn't have used SHA1.
The only hashing criterion you need for a Bloom Filter is a uniform distribution over the hash output space.
While a cryptographic hash function most certainly satisfies this [it has to, otherwise it would be terribly insecure], it's also a littlebit like killing a fly with a bazooka. Next time try it with the FNV hash, which only uses one XOR and one MUL per byte of hashed data, making it a few orders of magnitude faster than SHA1.
(The FNV hash is not cryptographically secure, and has slightly imperfect avalanche behaviour, but will give you a uniform distribution over output space. If you think "naw this seems too easy it can't possibly work", then apply it twice just to be safe, and I guarantee it will be just fine for a Bloom Filter)
Would making the filter as big as possible have any effect on the speed and performance?