We recently had a situation where we had to search a big list of 500 million hashes against a list of 40 million hashes. The 500M hashes were stored in flat, unsorted text files on 5 DVDs, so there was no easy way to search that list. The 40M hashes were stored in a MySQL database. Some benchmarking showed that it would take something like 20 days to run every one of those 500M hashes against our database, and that just wasn't acceptable.
So what could we do to reduce the database load and speed up the whole process in general? Use a Bloom filter.
Bloom Filters are a data structure that can be used to compactly represent a very large set of items. I won't go into the details of how they work, but the important takeaway point is that they produce no false negatives, so you can use them to save lots of time when it comes to searching large datasets. If your filter tells you that an item isn't in the dataset, then it just isn't there, and you can save yourself a potentially much more expensive query against the actual datastore.
We dumped the 40M list of hashes to a file and computed a Bloom filter on it. The resulting file was a 100MB bit array, distilled down from 1.6GB of hashes. Then, we checked every one of the 500M hashes against the filter, producing a list of possible matches. This took only ~8 hours, a significant speed improvement. The resulting list had only about 19M entries, meaning about 50% of the 40M set were found in the 500M set.
Of course, due to the properties of Bloom filters, some of the items in that list were false positives. At this point, we verified the results against the actual database. At the end, 98% of the positives were true positives.
When it's all said and done, Bloom filters saved a LOT of time. Portions of this implementation are based off of BloominSimple, as implemented by Peter Cooper. The major differences between my implementation and Peter's is that I improved the speed of the bit field by optimizing for write-only operations. (Peter's implementation of BitField allowed you to set and unset the values of the field; mine only does setting.) I also fixed a small issue with the way that a single SHA1 was being turned into four independent hashes.
You can install our BloomFilter implementation via:
gem install bloomfilter
That is, if the gem has propagated yet. If not, head on over to RubyForge and download it directly.