Announcing Hank: A Fast, Open-Source, Batch-Updatable, Distributed Key-Value Store

We're really excited to announce the open-source debut of a cool piece of LiveRamp's internal infrastructure, a distributed database project we call Hank.

Our use case is very particular: we have tons of data that needs to get processed, producing a lot of data points for individual browsers, which then need to be made randomly accessible so they can be served through our APIs. You can think of it as the "process and publish" pattern.

For the processing component, Hadoop and Cascading were an obvious choice. However, making our results randomly accessible for the API was more challenging. We couldn't find an existing solution that was fast, scalable, and perhaps most importantly, wouldn't degrade performance during updates. Our APIs need to have lightning-fast responses so that our customers can use them in realtime, and it's just not acceptable for us to have periods where reads contend with writes while we're updating.

We boiled this all down to the following key requirements:

  1. Random reads need to be fast - reliably on the order of a few milliseconds.
  2. Datastores need to scale to terabytes, with keys and values on the order of kilobytes.
  3. We need to be able to push out hundreds of millions of updates a day, but they don't have to happen in realtime. Most will come from our Hadoop cluster.
  4. Read performance should not suffer while updates are in progress.

Additionally, we identified a few non-requirements:

  1. During the update process, it doesn't matter if there is more than one version of our datastores available. Our application is tolerant of this inconsistency.
  2. We have no need for random writes.

The system we came up with is tailored to meet these needs. It consists of a fast, read-only data server backed by a custom-designed batch-updatable file format, a set of tools for writing these files from Hadoop, and a special daemon process that manages the deploy of data from the Hadoop cluster to the actual server machines. Clients of Hank are aware of ongoing updates and avoid connecting to servers that are busy. When the time comes to push out a new version of our data, the data deployer allows only a fraction of the data servers to perform an update at a time, making sure that sufficient data serving capacity remains online.
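To make the rolling update idea concrete, here is a minimal sketch in Python. It is not Hank's actual code or API: the names `Server`, `plan_update_waves`, `pick_server`, and `rolling_update` are hypothetical, and the whole thing runs in memory. It only illustrates the two behaviors described above, assuming that the deployer updates servers in fixed-size waves and that clients simply skip any server flagged as updating.

```python
import zlib
from dataclasses import dataclass
from math import floor


@dataclass
class Server:
    """Hypothetical stand-in for one read-only data server."""
    name: str
    version: int
    updating: bool = False


def plan_update_waves(servers, max_fraction):
    """Split servers into waves so that at most max_fraction update at once."""
    if not 0 < max_fraction <= 1:
        raise ValueError("max_fraction must be in (0, 1]")
    # Always update at least one server per wave so the rollout makes progress.
    wave_size = max(1, floor(len(servers) * max_fraction))
    return [servers[i:i + wave_size] for i in range(0, len(servers), wave_size)]


def pick_server(servers, key):
    """Client-side choice: only consider servers that are not busy updating."""
    candidates = [s for s in servers if not s.updating]
    if not candidates:
        raise RuntimeError("no data server is currently available")
    # Deterministic choice among the available servers for a given key.
    return candidates[zlib.crc32(key) % len(candidates)]


def rolling_update(servers, new_version, max_fraction):
    """Update all servers wave by wave; return the lowest number seen online."""
    min_online = len(servers)
    for wave in plan_update_waves(servers, max_fraction):
        for server in wave:
            server.updating = True  # clients will now avoid this server
        online = sum(1 for s in servers if not s.updating)
        min_online = min(min_online, online)
        for server in wave:
            server.version = new_version  # stands in for fetching new files
            server.updating = False
    return min_online


servers = [Server(f"host{i}", version=1) for i in range(6)]
lowest_online = rolling_update(servers, new_version=2, max_fraction=0.5)
```

With six servers and a fraction of 0.5, the sketch updates two waves of three, so three servers stay online at every point in the rollout. During the update, some servers hold the old version and some the new one, which is exactly the inconsistency listed as acceptable in the non-requirements.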

There's a more detailed look at the architecture and infrastructure of the project, and you can find the code on GitHub, where it is shared under the Apache Software License. This codebase is still a work in progress - our older, internal version was in need of a serious refactor - but most of the necessary pieces are there, and we're going to finish the development in the open. We'd love to hear your thoughts on the project and would doubly love to get your contributions, whatever form they might take.

7 Responses to “Announcing Hank: A Fast, Open-Source, Batch-Updatable, Distributed Key-Value Store”

  1. claudio martella Mar 16, 2011 at 5:42 am #

    Nice project! What was the problem with HBase? It provides good latency on random reads and builds smoothly on Hadoop MapReduce jobs. To me, it meets all four of your requirements.

    • Bryan Duxbury Mar 16, 2011 at 8:20 am #

      @Claudio - A few things about HBase stood in the way. First of all, at the time, the stability and performance just weren't there, and the M/R updates had only been proposed in a far-off manner. Second, even if HBase had performed as needed, there's still the issue that it runs on the same nodes as our Hadoop cluster, and we aren't able to cope with that kind of contention. (Our Hadoop cluster is very heavily used.) If we were starting over again today, HBase might be more of an option.

  2. Ted Dunning Mar 16, 2011 at 7:44 am #

    A fast read-only batch-updatable KV store?

    That sounds a lot like Voldemort. What made that unacceptable?

    • Bryan Duxbury Mar 16, 2011 at 8:25 am #

      @Ted - Hank and Voldemort are a lot alike. While they share many of the same concepts, I think the implementation differs substantially. I'm hoping that both projects will be able to grow a lot as a result of whatever "competition" might arise.

      That said, we started working on Hank at about the same time as LinkedIn started on Voldemort, and we just didn't know it existed. If we were starting over, we'd probably be contributing to Voldemort.

  3. Jay Mar 25, 2011 at 9:44 am #

    Do you have any benchmarks? It would be great to see how well Hank holds up to a scaling load.

    • Bryan Duxbury Mar 26, 2011 at 8:06 pm #

      @Jay - I don't have any benchmarks yet, but I'm planning to get them, and I'll publish them when I have them.

  4. lokeshk May 2, 2011 at 6:58 pm #

    How about Riak? The ring architecture sounds like an Amazon Dynamo-based system.