This is Varun from Pinterest.<p>We did look at a few options before building this. ElephantDB seemed a bit heavy-handed: we would have had to modify the ring configuration every time we added or removed servers, and modify the domain spec YAML files for newly added data sets. It also did not let us easily change the number of shards across different versions of the data, something our developers do often to make their jobs run faster. Also, it does not GC older versions, and since our workflows write new versions every day, this was a problem.<p>We did look at Cassandra, but we did not want to operate yet another data store. We also definitely wanted to get the data loaded fast, i.e. through simple file copy operations. We found that, for this option, Cassandra had similar issues to HBase, i.e. having to do major compactions to get rid of older data versions. Tweaking the number of reduce shards was also harder.<p>With Terrapin, we essentially tried to build a serving system on top of HDFS, given the recent improvements in HDFS performance when there is data locality. We felt that HDFS was rock solid and the best storage system (in terms of scalability & ease of operation) for immutable data sets. On top of that, we built versioning, cheap garbage collection, extensible serving formats etc., as mentioned in the blog.<p>As for Apache Drill, it is more suited to running analyst queries with latencies ranging from hundreds of milliseconds up to seconds. This is not acceptable for web-scale workloads, where latencies must be < 10ms for lower-level serving systems like Terrapin.
Either more tools like this are going to pop up, or the existing ones will mature, as more people adopt Lambda-style architectures, I'd imagine.<p>While building one out, we looked at VoldemortDB, SploutSQL, and ElephantDB to serve bulk data coming out of Hadoop in batches. Voldemort turned out to be much rougher around the edges than expected, ElephantDB looked very bleeding edge, and SploutSQL wasn't as general purpose. In the end we turned to Cassandra and this tool: <a href="https://github.com/spotify/hdfs2cass" rel="nofollow">https://github.com/spotify/hdfs2cass</a>.<p>Good to see Pinterest open sourcing this.
The URL <a href="https://engineering.pinterest.com/blog/open-sourcing-terrapin-serving-system-batch-generated-data-0" rel="nofollow">https://engineering.pinterest.com/blog/open-sourcing-terrapi...</a> gives more info; this article links through to it.
Bulk uploads into KV stores are slow, so Terrapin allows KV access over immutable HDFS files.<p>This is a very typical use case for recommendation systems etc. We face similar problems with latencies on HBase (at Groupon).<p>So this solution seems interesting. It would be good to have a comparison with the other solutions Pinterest tried before building this, e.g. loading data into Cassandra instead of HBase.<p>In a nutshell: a very specific use case, but one that comes up very often.