LiveRamp makes extensive use of Thrift's CompactProtocol to save space in long-term data storage and in communication between services. However, this summer, star LiveRamp intern Armaan Sarkar took us to a new level of compactness with his work on the new TupleProtocol.
While we are completely happy with the CompactProtocol for permanent data storage and RPC, there is one place where it's not a perfect fit. Since all our objects are Thrift structs, we pass those same structs around during our Cascading flows. We've built a connector (extremely similar to this one) that serializes these Thrift structs with the CompactProtocol whenever Cascading serializes Tuples between mappers and reducers, and this all works great. However, as part of our never-ending performance efforts, we are always striving to decrease the amount of data shipped between mappers and reducers. You're probably thinking, isn't the CompactProtocol... compact? The answer is yes. It's the smallest representation we could come up with for a general-purpose Thrift protocol.
But wait! Recall that Thrift as a framework is designed to allow smooth transitions between users running different versions of a schema (like an old client contacting an updated server). Thrift does this by making sure that serialized data is sufficient to fully describe itself: serialized messages contain markers for field IDs and field data types. Using these markers, the Thrift library can skip over fields it doesn't recognize. But this ability comes at the cost of a few bytes of overhead here and there, and these bytes can really add up when you have 200 billion records to move around.
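To make the idea concrete, here is a minimal sketch of a self-describing encoding in Python. It is a toy format, not Thrift's actual wire layout: the type tags, the one-byte field header, and the names `write_struct` and `read_struct` are all assumptions made up for illustration. What it shows is the mechanism described above: because every field carries its own ID and type, a reader with an older schema can step over a field it has never heard of.

```python
import struct

# Toy type tags (hypothetical values, not Thrift's real wire constants).
T_STOP, T_I32, T_STRING = 0, 1, 2


def write_struct(fields):
    """Encode (field_id, type_tag, value) triples, each with its own header."""
    buf = bytearray()
    for field_id, type_tag, value in fields:
        buf.extend(struct.pack(">BB", type_tag, field_id))  # per-field overhead
        if type_tag == T_I32:
            buf.extend(struct.pack(">i", value))
        elif type_tag == T_STRING:
            data = value.encode("utf-8")
            buf.extend(struct.pack(">H", len(data)) + data)
        else:
            raise ValueError("unsupported type tag: %r" % type_tag)
    buf.append(T_STOP)  # end-of-struct marker
    return bytes(buf)


def read_struct(data, known_fields):
    """Decode a struct, keeping only IDs present in known_fields (id -> name)."""
    result, pos = {}, 0
    while True:
        type_tag = data[pos]
        pos += 1
        if type_tag == T_STOP:
            return result
        field_id = data[pos]
        pos += 1
        # The type tag tells us how many bytes to consume, even for unknown IDs.
        if type_tag == T_I32:
            (value,) = struct.unpack_from(">i", data, pos)
            pos += 4
        elif type_tag == T_STRING:
            (length,) = struct.unpack_from(">H", data, pos)
            pos += 2
            value = data[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError("unsupported type tag: %r" % type_tag)
        if field_id in known_fields:
            result[known_fields[field_id]] = value
        # Otherwise the value is silently dropped: the field was skipped.


# A newer writer emits field 3, which the older reader does not know about.
message = write_struct([(1, T_STRING, "foo"), (2, T_I32, 7), (3, T_STRING, "new")])
old_reader_view = read_struct(message, {1: "name", 2: "count"})
print(old_reader_view)
```

In this toy format, the two header bytes per field are exactly the kind of overhead the paragraph above is talking about.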
Look back at our imperfect use case: intermediate serialization during a Cascading flow. When would the Thrift schema ever change during a single job? The answer is that it wouldn't. This means that the extra bytes the CompactProtocol spends on supporting changing schemas are 100% waste. It would be great if we didn't have to pay the cost of a feature we don't use.
This is exactly what the TupleProtocol does. Instead of being a general-purpose protocol, TupleProtocol is a purpose-built protocol designed specifically for the case where you know beyond a shadow of a doubt that the schema for a record cannot change. With this precondition, it can avoid writing out the type markers and field IDs and just use the metadata implicit in the code itself to guide deserialization. For instance, it knows that what's coming up next is going to be a string field called "foo" because it can assume the writer of the data wrote it exactly to that spec. By dumping this extra overhead, users of the TupleProtocol can see a significant decrease in the size of serialized objects. One sample job we tested saw about a 5% size decrease. The actual savings you see will vary a lot depending on what kind of data you have in your structs, but in general, the more non-string fields you have, the more benefit you'll see, because the per-field overhead is a bigger fraction of a small numeric value than of a long string. For us, just based on the 5% figure, it amounts to tens of gigabytes less shuffle in our biggest jobs.
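As a rough illustration of where the savings come from, the Python sketch below encodes the same record two ways: once with a per-field header (type tag plus field ID) and once with values only, in schema order. This is a toy format and not the real TupleProtocol or CompactProtocol byte layout. The schema, the field names, the tag values, and the function names are all hypothetical, and the sketch ignores optional fields, which a real protocol has to handle somehow.

```python
import struct

# Hypothetical schema known to both writer and reader: ordered (name, type).
SCHEMA = [("user_id", "i32"), ("clicks", "i32"), ("foo", "string")]
TYPE_TAGS = {"i32": 1, "string": 2}  # toy tags, not Thrift's constants


def encode_value(kind, value):
    """Encode one value: 4-byte int, or 2-byte length followed by UTF-8 bytes."""
    if kind == "i32":
        return struct.pack(">i", value)
    data = value.encode("utf-8")
    return struct.pack(">H", len(data)) + data


def encode_self_describing(record):
    """Each field is preceded by a type tag and field ID; ends with a stop byte."""
    out = bytearray()
    for field_id, (name, kind) in enumerate(SCHEMA, start=1):
        out.extend(struct.pack(">BB", TYPE_TAGS[kind], field_id))
        out.extend(encode_value(kind, record[name]))
    out.append(0)  # stop marker
    return bytes(out)


def encode_tuple_style(record):
    """Values only, in schema order: no tags, no IDs, no stop marker."""
    return b"".join(encode_value(kind, record[name]) for name, kind in SCHEMA)


def decode_tuple_style(data):
    """The schema, not the bytes, says what comes next."""
    record, pos = {}, 0
    for name, kind in SCHEMA:
        if kind == "i32":
            (record[name],) = struct.unpack_from(">i", data, pos)
            pos += 4
        else:
            (length,) = struct.unpack_from(">H", data, pos)
            pos += 2
            record[name] = data[pos:pos + length].decode("utf-8")
            pos += length
    return record


record = {"user_id": 42, "clicks": 7, "foo": "bar"}
described = encode_self_describing(record)
tuple_style = encode_tuple_style(record)
print(len(described), len(tuple_style))
```

Note the trade-off the sketch makes visible: `decode_tuple_style` only works if the reader's `SCHEMA` matches the writer's exactly, which is the precondition described above. The percentage saved in this toy is much larger than what you should expect in practice, since the record is tiny and the baseline here is not the CompactProtocol.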
The TupleProtocol is currently committed to Thrift TRUNK and will be released as part of Thrift 0.8. If you'd like to try it out now, we'd love to hear your feedback!