We're really excited to announce the open-source debut of a cool piece of LiveRamp's internal infrastructure, a distributed database project we call Hank.
Our use case is quite particular: we have tons of data that need to be processed, producing a huge number of data points for individual browsers, which then need to be made randomly accessible so they can be served through our APIs. You can think of it as a "process and publish" pattern.
For the processing component, Hadoop and Cascading were an obvious choice. However, making our results randomly accessible for the API was more challenging. We couldn't find an existing solution that was fast, scalable, and perhaps most importantly, wouldn't degrade performance during updates. Our APIs need to have lightning-fast responses so that our customers can use them in realtime, and it's just not acceptable for us to have periods where reads contend with writes while we're updating.
We boiled this all down to the following key requirements:
- Random reads need to be fast - reliably on the order of a few milliseconds.
- Datastores need to scale to terabytes, with keys and values on the order of kilobytes.
- We need to be able to push out hundreds of millions of updates a day, but they don't have to happen in realtime. Most will come from our Hadoop cluster.
- Read performance should not suffer while updates are in progress.
Additionally, we identified a few non-requirements:
- During the update process, it doesn't matter if there is more than one version of our datastores available. Our application is tolerant of this inconsistency.
- We have no need for random writes.
The system we came up with is tailored to meet these needs. It consists of a fast, read-only data server backed by a custom-designed batch-updatable file format, a set of tools for writing these files from Hadoop, and a special daemon process that manages the deployment of data from the Hadoop cluster to the actual server machines. Clients of Hank are aware of ongoing updates and avoid connecting to servers that are busy. When the time comes to push out a new version of our data, the data deployer allows only a fraction of the data servers to perform an update at a time, making sure that sufficient data serving capacity remains online.
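To make the two coordination behaviors above concrete, here is a minimal sketch in Java. The names (`Host`, `HostState`, `pickAvailableHost`, `updateBatches`) are illustrative, not Hank's actual API: the first class shows a client skipping replicas that are mid-update, and the second shows a deployer splitting replicas into batches so only a fraction update at once.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical types for illustration; not Hank's real classes.
enum HostState { SERVING, UPDATING, OFFLINE }

class Host {
    final String address;
    final HostState state;

    Host(String address, HostState state) {
        this.address = address;
        this.state = state;
    }
}

class HostSelector {
    // A client consults each replica's advertised state and connects
    // only to hosts that are actively serving, skipping any host that
    // is busy applying a new data version.
    static Optional<Host> pickAvailableHost(List<Host> replicas) {
        return replicas.stream()
                .filter(h -> h.state == HostState.SERVING)
                .findFirst();
    }
}

class DataDeployer {
    // Split the replicas into batches of at most ceil(fraction * n)
    // hosts, so that while one batch updates, the rest keep serving.
    static List<List<Host>> updateBatches(List<Host> hosts, double fraction) {
        int batchSize = Math.max(1, (int) Math.ceil(hosts.size() * fraction));
        List<List<Host>> batches = new ArrayList<>();
        for (int i = 0; i < hosts.size(); i += batchSize) {
            batches.add(hosts.subList(i, Math.min(i + batchSize, hosts.size())));
        }
        return batches;
    }
}
```

With a fraction of 0.25, for example, a ring of four replicas would be updated one host at a time, leaving three hosts serving reads throughout the rollout.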
There's a more detailed look at the architecture and infrastructure of the project, and you can find the code, shared under the Apache Software License, on GitHub. This codebase is still a work in progress - our older, internal version needed a serious refactor - but most of the necessary pieces are there, and we're going to finish development in the open. We'd love to hear your thoughts on the project, and we'd doubly love your contributions, whatever form they might take.