BloomFilter

We recently had a situation where we had to search a big list of 500 million hashes against a list of 40 million hashes. The 500M hashes were stored in flat, unsorted text files on 5 DVDs, so there was no easy way to search that list. The 40M hashes were stored in a MySQL database. Some benchmarking showed that it would take something like 20 days to run every one of those 500M hashes against our database, and that just wasn't acceptable.

So what could we do to reduce the database load and speed up the whole process in general? Use a Bloom filter.

Bloom Filters are a data structure that can be used to compactly represent a very large set of items. I won't go into the details of how they work, but the important takeaway point is that they produce no false negatives, so you can use them to save lots of time when it comes to searching large datasets. If your filter tells you that an item isn't in the dataset, then it just isn't there, and you can save yourself a potentially much more expensive query against the actual datastore.

We dumped the 40M list of hashes to a file and computed a Bloom filter on it. The resulting file was a 100MB bit array, distilled down from 1.6GB of hashes. Then, we checked every one of the 500M hashes against the filter, producing a list of possible matches. This took only ~8 hours, a significant speed improvement. The resulting list had only about 19M entries, meaning about 50% of the 40M set were found in the 500M set.

Of course, due to the properties of Bloom filters, some of the items in that list were false positives. At this point, we verified the results against the actual database. At the end, 98% of the positives were true positives.

When it's all said and done, Bloom filters saved a LOT of time. Portions of this implementation are based off of BloominSimple, as implemented by Peter Cooper. The major differences between my implementation and Peter's is that I improved the speed of the bit field by optimizing for write-only operations. (Peter's implementation of BitField allowed you to set and unset the values of the field; mine only does setting.) I also fixed a small issue with the way that a single SHA1 was being turned into four independent hashes.

You can install our BloomFilter implementation via:
gem install bloomfilter

That is, if the gem has propagated yet. If not, head on over to RubyForge and download it directly.

Share Button

14 Responses to “BloomFilter”

  1. Zed A. Shaw Sep 12, 2007 at 2:19 pm #

    Rock. I may use this in iHate to implement the /off-topic command. Currently I'm using a Java lib that supports vector and bayes filters. The idea is that as your chatting, if someone mentions something you think is off-topic you just use this command to register it. When potentially off-topic stuff is said, it gets shrunk down so you aren't bothered by it.

    A bloom filter might be more appropriate, especially since that could be stored and would map more to "messages with these words are off-topic".

    Thanks again.

  2. Dan Brickley Jan 8, 2008 at 9:34 am #

    This is great :) I was using the old BloominSimple thing last year. Glad to see something packaged now.

    Do you have any plans to make these blooms serializable in an Internet-transport-friendly form, eg. base64? The old LOAF project had something in that vein, see http://loaf.cantbedone.org/download.htm and I'm wondering about including "who my buddies are" blooms in FOAF files if we can figure an encoding, per http://www.perl.com/pub/a/2004/04/08/bloom_filters.html?page=2

    Unfortunately bit-level stuff makes my head hurt! I'll get there eventually but wondered what your plans are...

  3. bryan Jan 8, 2008 at 10:55 am #

    @Dan: Right now I have no plans to make "Internet-serializable" filters, but it would really be a trivial thing to do. My implementation uses a byte array under the hood, so all you'd have to do is base64 each byte in the array. You're welcome to submit a patch :)

  4. dkindlund Jul 9, 2009 at 8:03 am #

    Hi Bryan,
    Since there's an upper-limit to how many items you can store in an initial BloomFilter, how would you recommend growing the BloomFilter once you've exceeded the original size?

    Could you create new, larger BloomFilter and then somehow UNION the two so that you don't lose your history from the first BloomFilter? Any examples would be helpful.

  5. bryan Jul 9, 2009 at 8:16 am #

    @dkindlund:
    Sadly, there's no way of growing a bloom filter to a larger size, since you lose the original items and hash values. When your first filter gets too full, you have to make a whole new one. For this reason, I recommend making your filters big enough to accommodate as many items as you could possibly imagine being in the set. This usually doesn't lead to impossibly large filters and it will save you a real headache down the road.

  6. ehsanul Aug 21, 2009 at 12:36 pm #

    Bryan, I think there's a way to fake growing an filter, by automating the processing of creating a new one. I'm not actually quite sure how it works, but check out this Python "scalable bloom filter" library: http://github.com/jaybaird/python-bloomfilter/tree/master

    Quote from the readme:

    pybloom is a module that includes a Bloom Filter data structure along with
    an implmentation of Scalable Bloom Filters as discussed in:

    P. Almeida, C.Baquero, N. Preguiça, D. Hutchison, Scalable Bloom Filters,
    (GLOBECOM 2007), IEEE, 2007.

    Bloom filters are great if you understand what amount of bits you need to set
    aside early to store your entire set. Scalable Bloom Filters allow your bloom
    filter bits to grow as a function of false positive probability and size.

  7. tripzilch Nov 8, 2010 at 3:05 am #

    You shouldn't have used SHA1.

    The only hashing criterion you need for a Bloom Filter is a uniform distribution over the hash output space.

    While a cryptographic hash function most certainly satisfies this [it has to, otherwise it would be terribly insecure], it's also a littlebit like killing a fly with a bazooka. Next time try it with the FNV hash, which only uses one XOR and one MUL per byte of hashed data, making it a few orders of magnitude faster than SHA1.

    (The FNV hash is not cryptographically secure, and has slightly imperfect avalanche behaviour, but will give you a uniform distribution over output space. If you think "naw this seems too easy it can't possibly work", then apply it twice just to be safe, and I guarantee it will be just fine for a Bloom Filter)

  8. Filters for fridges Nov 6, 2011 at 8:02 am #

    Would making the filter as big as possible have any effect on the speed and performance?

Trackbacks/Pingbacks

  1. GSIY … Ruby-Rails Portal - Sep 10, 2007

    [...] BloomFilter is a new Ruby library (available as a gem - gem install bloomfilter) by Bryan Duxbury that provides operations to create and query “bloom filters“, an extremely space-efficient “probablistic data structure” that makes it quick and easy to test set membership. This all sounds incredibly geeky and uninteresting until you discover how bloom filters can be used to make things like ultra-fast, low-overhead spell checkers, spam filters, stop word removers, and other tools that require checking two sets of data against each other. Bryan’s use for bloom filters (and BloomFilter) was to search a big list of 500 million hashes against a set of 40 million hashes. Given the size of each hash, coupled with the large quantity, a major, memory hungry task was looming. Bloom filters enabled the task to reduce from a 20 day running time to just 8 hours. [...]

  2. Bloom Filter: una nuova libreria Ruby - Ruby on Rails, AJAX, WEB 2.0, CSS, MAC - Rails on the road - Sep 11, 2007

    [...] ricerche con un dataset di grandi dimensioni. I filtri di Bloom esistono in più tipi, la gemma BloomFilter è l’implementazione di un filtro di Bloom semplice e potete installarla semplicemente [...]

  3. On My Rails Radar » Damon Clinkscales - Sep 11, 2007

    [...] bloom filters (rapleaf): bit array for searching large datasets [...]

  4. Twitter Updates for 2007-11-17 at blogwi.se - Nov 17, 2007

    [...] Filter for ruby (a gem that is): http://blog.liveramp.com/?p=6 [...]

  5. Bloom Filter – It’s Variants- and their Applications « Appolo85's Blog - Aug 3, 2010

    [...] This post from Rapleaf  was an interesting read. [...]

  6. Bloom filters demystified :P | padlobatch - Dec 8, 2011

    [...] http://blog.liveramp.com/2007/09/05/bloomfilter/ Some random guy used it. It has pretty good benefits in real life scenarios. It just makes things easy-peasy. [...]

Leave a Reply