Pinterest open-sources Terrapin, a tool for serving data from Hadoop

31 points by gexos over 9 years ago

5 comments

varunsharma over 9 years ago
This is Varun from Pinterest.

We did look at a few options before building this. ElephantDB seemed a bit heavy-handed: we would have had to modify the ring configuration every time we added/removed servers, and also modify domain spec YAML files for newly added data sets. It did not allow us to easily change the number of shards across different versions of the data, something our developers do often to make their jobs run faster. It also does not GC older versions, and since our workflows write new versions every day, this was a problem.

We did look at Cassandra, but we did not want to operate another data store. However, we definitely wanted to get the data loaded fast, i.e. through simple file copy operations. We found that for this option Cassandra had similar issues to HBase, i.e. having to do major compactions to get rid of older data versions. Tweaking the number of reduce shards was also harder.

With Terrapin, we essentially tried to build a serving system on top of HDFS, given the recent improvements in HDFS performance when there is data locality. We felt that HDFS was rock solid and the best storage system (in terms of scalability and ease of operation) for immutable data sets. On top of that, we built versioning, cheap garbage collection, extensible serving formats etc., as mentioned in the blog.

As for Apache Drill, it is more suited to running analyst queries with latencies ranging from hundreds of milliseconds up to seconds. This is not acceptable for web-scale workloads, where latencies must be < 10ms for lower-level serving systems like Terrapin.
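To make the versioning and shard-count points above concrete, here is a minimal in-memory sketch in Python. It is not Terrapin's API or implementation: the class `VersionedFileSet`, the helper `shard_for`, the `versions_to_keep` parameter, and the use of CRC32 as the partitioning hash are all hypothetical names and choices made for illustration. The sketch only shows why immutable, whole-version loads make it cheap to change the number of shards per version and to garbage-collect old versions without compaction.

```python
import zlib
from typing import Dict, List, Optional


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard with a stable hash (CRC32 is an illustrative choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


class VersionedFileSet:
    """Toy stand-in for a data set made of immutable, sharded versions."""

    def __init__(self, versions_to_keep: int = 2) -> None:
        if versions_to_keep < 1:
            raise ValueError("versions_to_keep must be at least 1")
        self.versions_to_keep = versions_to_keep
        # version number -> list of shards, each shard a read-only key/value map
        self.versions: Dict[int, List[Dict[str, str]]] = {}
        self.live_version: Optional[int] = None

    def load_version(self, version: int, data: Dict[str, str], num_shards: int) -> None:
        """Write a complete new version; each version picks its own shard count."""
        shards: List[Dict[str, str]] = [{} for _ in range(num_shards)]
        for key, value in data.items():
            shards[shard_for(key, num_shards)][key] = value
        self.versions[version] = shards
        # Serve the highest version number once it is fully loaded.
        self.live_version = max(self.versions)
        self._gc()

    def _gc(self) -> None:
        """Drop whole old versions; nothing is rewritten or compacted."""
        for old in sorted(self.versions)[:-self.versions_to_keep]:
            del self.versions[old]

    def get(self, key: str) -> Optional[str]:
        """Look a key up in the live version only."""
        if self.live_version is None:
            return None
        shards = self.versions[self.live_version]
        return shards[shard_for(key, len(shards))].get(key)
```

Because a lookup only ever consults the live version and uses that version's own shard count, a job can write version 2 with five shards after writing version 1 with two, and deleting version 1 is just removing its files.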
optimusclimb over 9 years ago
Either more tools like this are going to pop up, or the existing ones will mature, as more people adopt Lambda-style architectures, I'd imagine.

While building one out, we looked at VoldemortDB, SploutSQL, and ElephantDB to serve bulk data coming out of Hadoop in batches. Voldemort turned out to be much rougher around the edges than expected, ElephantDB looked very bleeding edge, and SploutSQL wasn't as general purpose. In the end we turned to Cassandra and this tool: https://github.com/spotify/hdfs2cass.

Good to see Pinterest open sourcing this.
kevinbowman over 9 years ago
The URL https://engineering.pinterest.com/blog/open-sourcing-terrapin-serving-system-batch-generated-data-0 gives more info, which this article links through to.
ameyamk over 9 years ago
Bulk uploads into KV stores are slow, so Terrapin allows KV access over immutable HDFS files.

This is a very typical use case for recommendation systems etc. We face similar problems with latencies on HBase (at Groupon).

So this solution seems interesting. It would be good to have a comparison with the other solutions Pinterest tried before building this, e.g. loading data into Cassandra instead of HBase.

In a nutshell: a very specific use case, but one that comes up very often.
arthurcolle over 9 years ago
I wonder if the authors are Maryland alumni!