Launch HN: Quilt (YC W16) – A versioned data portal for S3

177 points by akarve over 5 years ago

We're Aneesh and Kevin of Quilt (<a href="https://open.quiltdata.com/" rel="nofollow">https://open.quiltdata.com/</a>). Quilt is a versioned data portal for S3 that makes it easier to share, discover, model, and decide based on data at scale. It consists of a Python client, web catalog, and lambda functions (all open source), plus a suite of backend containers and CloudFormation templates for businesses to run their own stacks. Public data are free. Private stacks are available for a flat monthly licensing fee.

Try searching for anything on <a href="https://open.quiltdata.com/" rel="nofollow">https://open.quiltdata.com/</a> and let us know how search works for you. We kind of surprised ourselves with a Google-like experience that returns primary data instead of links to web pages. We've got over 1M Jupyter notebooks, 100M Amazon reviews, and many more public S3 objects on over a dozen topics indexed in ElasticSearch.

The best example, so far, of "S3 bucket as data repo" is from the Allen Institute for Cell Science: <a href="https://open.quiltdata.com/b/allencell/tree/" rel="nofollow">https://open.quiltdata.com/b/allencell/tree/</a>.

Kevin and I met in grad school. We started with the belief that if data could be "managed like code," data would be easier to access, more accurate, and could serve as the foundation for smarter decisions. While we loved databases and systems, we found that technical and cost barriers kept data out of the hands of the people who needed it the most: NGOs, citizens, and non-technical users. That led to three distinct iterations of Quilt over as many years and has now culminated in open.quiltdata.com, where we've made a few petabytes of public data in S3 easy to search, browse, visualize, and summarize.

In earlier versions of Quilt, we focused on writing new software to version and package data. We also attempted to host private user data in our own cloud.
For reasons that we would soon realize, these were mistakes:

* Few users were willing to copy data (especially sensitive and large data) into Quilt
* It was difficult to gather a critical mass of interesting and useful data that would keep users coming back
* Data are consumed in teams that include a variety of non-technical users
* Even in 2019, it's unnecessarily difficult and expensive to host and share large files. (GitHub, Dropbox, and Google Drive all have quotas and performance limitations, and none of them can serve as a distributed backend for an application.)
* It's difficult for a small team to build both "git for data" (core tech) and "GitHub for data" (website + network effect) at the same time

On the plus side, our users confirmed that "immutable data dependencies" (something Quilt still does) went a long way towards making analysis reproducible and traceable.

Put all of the above together, and we had the realization that if we viewed S3 as "git for data", it would solve a lot of problems at once: S3 supports object versioning, a huge chunk of public and customer data are already there (no copying), and it keeps users in direct control of their own data. Looking forward, the S3 interface is general enough (especially with tools like min.io) to abstract away any storage layer, and we want to bring Quilt to other clouds and even to on-prem volumes. We repurposed our "immutable dataset abstraction" (Quilt packages) to solve a problem that S3 object versioning doesn't: taking an immutable snapshot of an entire directory, bucket, or collection of buckets.

We believe that public data should be free and open to all, with no competing interests from advertisers; that private data should be secure; and that all data should remain under the direct control of its creators. We feel that a "federated network of S3 buckets" offers the foundations on which to achieve such a vision.

All of that said, wow do we have a long way to go.
We ran into all kinds of challenges scaling and sharding ElasticSearch to accommodate the 10 billion objects on open.quiltdata.com, and we are still researching the best way to fork and merge datasets. (The Quilt package manifests are JSONL, so our leading theory is to check these into git so that diffs and merges can be accomplished over S3 key metadata, without the need to diff or even touch primary data in S3, which are too large to fit into git anyway.)

Your comments, design suggestions, and open source contributions to any of the above topics are welcome.
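To make the manifest idea above concrete, here is a minimal sketch of diffing two JSONL manifests using only per-key metadata, so the primary data in S3 is never read. The field names (`logical_key`, `hash`, `size`) and the sample entries are assumptions for illustration and may not match the real Quilt manifest schema.

```python
import json


def load_manifest(jsonl_text):
    """Parse a JSONL manifest into a dict keyed by logical key.

    Assumes one JSON object per line with a "logical_key" field
    (a hypothetical schema used only for this sketch).
    """
    entries = {}
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        entries[entry["logical_key"]] = entry
    return entries


def diff_manifests(old, new):
    """Compare two parsed manifests by key and content hash only."""
    added = sorted(k for k in new if k not in old)
    removed = sorted(k for k in old if k not in new)
    modified = sorted(
        k for k in new if k in old and new[k]["hash"] != old[k]["hash"]
    )
    return {"added": added, "removed": removed, "modified": modified}


# Two made-up snapshots of the same package.
OLD = "\n".join([
    json.dumps({"logical_key": "README.md", "hash": "aaa", "size": 10}),
    json.dumps({"logical_key": "data/cells.csv", "hash": "bbb", "size": 2048}),
    json.dumps({"logical_key": "data/old.csv", "hash": "ccc", "size": 512}),
])
NEW = "\n".join([
    json.dumps({"logical_key": "README.md", "hash": "aaa", "size": 10}),
    json.dumps({"logical_key": "data/cells.csv", "hash": "ddd", "size": 4096}),
    json.dumps({"logical_key": "data/new.csv", "hash": "eee", "size": 128}),
])

if __name__ == "__main__":
    print(diff_manifests(load_manifest(OLD), load_manifest(NEW)))
```

Because each manifest line describes one object, a line-oriented tool such as git can also show these changes directly as text diffs, which is the appeal of the JSONL format here.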

18 comments

timsehn over 5 years ago

Congratulations to the Quilt team on the launch!

Quilt reached out to me and suggested I chime in to point people interested in versioning data to Dolt (<a href="https://github.com/liquidata-inc/dolt" rel="nofollow">https://github.com/liquidata-inc/dolt</a>) and DoltHub (<a href="https://www.dolthub.com" rel="nofollow">https://www.dolthub.com</a>).

We've taken the Git and GitHub for data analogy a lot more literally than Quilt has :-) We are a SQL database with native Git semantics. Instead of versioning files like Git, we version table rows. This allows for diff and conflict detection down to the cell level. We are built on top of another open source project called Noms (<a href="https://github.com/attic-labs/noms" rel="nofollow">https://github.com/attic-labs/noms</a>).

We think there is a ton of room in this space for a bunch of tools: Quilt, Noms, QRI (<a href="https://qri.io/" rel="nofollow">https://qri.io/</a>), Pachyderm (<a href="https://www.pachyderm.io/" rel="nofollow">https://www.pachyderm.io/</a>), and even Git. We're excited to see so many bright minds trying to solve this problem.

We're going to be populating DoltHub with a bunch of datasets we harvest from the open data community to show off the capabilities of Dolt. The coolest one so far is the Google open images dataset: <a href="https://www.dolthub.com/repositories/Liquidata/open-images" rel="nofollow">https://www.dolthub.com/repositories/Liquidata/open-images</a>.

breck over 5 years ago

This is great. Thank you for so openly sharing your strategic thinking and lessons learned.

I follow about 100 projects in this "GitHub for data" space and haven't yet seen a breakout hit. Yours looks like it has potential. I like the simplicity and the "objects by file extension". Lots of these sites, I think, get too complex too quickly.

At the UH Cancer Center we routinely deal with datasets in the TB to PB range, and that kind of size definitely makes this problem qualitatively different. Your splitting of the storage (S3) from the front end is the correct technical decision, IMO.

I've worked in this space for about 10 years. My open source project is called Ohayo. I used to try to do both front end and back end, and then similarly decided to drop the data storage back end and instead focus on my strength, which is front-end exploratory data analysis.

I think adding a "quilt" keyword to Ohayo, and access to the Quilt datasets directly in Ohayo, may be mutually beneficial. Ohayo is just a single dumb web app (no online storage, no tracking, the full program source code is stored in the URL) and pulls in data via HTTP. Here's an example program that shows the post history of the 2 Quilt founders on Hacker News: <a href="https://ohayo.computer?filename=hncomparison.flow&yi=~&xi=_&data=hackernews.submissions_100_akarve_kevinemoore~_tables.basic~_hidden~_filter.where_by_!%253D_~__hidden~__filter.where_type_%253D_story~___hidden~___vega.scatter~____xColumn_time~____yColumn_score~____colorColumn_by~layout_column" rel="nofollow">https://ohayo.computer?filename=hncomparison.flow&yi=~&xi=_&...</a>

We use Vega for visualization. You could imagine allowing fast, simple EDA on these Quilt datasets through simple Ohayo links. Ohayo version 14 is a substantial improvement that I hope to ship in the next week or two, and then I would love to add Quilt to the picture.

kevinemoore over 5 years ago

Aneesh's co-founder here. I just want to add a word of thanks to Jed Sundwall and the AWS Registry of Open Data. The support of AWS makes publishing data at this scale possible. I also want to thank Jackson Brown and everyone else who worked so hard to compile, document and annotate these large and extremely valuable datasets.

lichtenberger over 5 years ago

So you basically store S3 buckets in ElasticSearch and you're using Git for versioning a hierarchy of buckets, right?

It's interesting that versioning now finally seems to be getting some traction in mainstream database systems (even though it is not really optimal in these systems, in my opinion) and, for instance, also in your data store. You position this as a Dropbox or Google Drive replacement, right? :-)

I'm asking all these questions because I'm engineering a temporal, versioned open source storage system myself (since I studied at the University of Konstanz until 2012), possibly on a much more database-oriented level, currently for storing both XML and JSON data in a binary format.

A resource in this storage system basically stores a huge tree of database pages, where an UberPage is the main entry point (reminiscent of ZFS's UberPage; SirixDB borrows some ideas from ZFS and applies them at the sub-file level), consisting of various more or less hash-array-based subtrees as in ZFS. Thus, levels of indirect pages are added if more data needs to be stored. I've added some optimizations from in-memory hash-array-based tries.

Each revision is indexed. SirixDB stores per-revision and per-page deltas based on a copy-on-write log structure.

I've thought about storing each database page fragment in an S3 storage backend as another storage option, and using Apache BookKeeper directly or Apache Pulsar for distributing an in-memory intent log (it doesn't need to be persisted before committing to the data files, as the UberPage just needs to be swapped atomically for consistency).

For the interested reader: <a href="https://sirix.io" rel="nofollow">https://sirix.io</a> and <a href="https://github.com/sirixdb/sirix" rel="nofollow">https://github.com/sirixdb/sirix</a>
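As a rough illustration of the copy-on-write, append-only idea described in this comment, here is a toy in-memory sketch. It is not SirixDB's design or API: the class and method names are hypothetical, and it stores whole pages per revision rather than per-page deltas, which is a deliberate simplification.

```python
class CowStore:
    """Toy copy-on-write page store (hypothetical, for illustration only)."""

    def __init__(self):
        # Append-only log of page contents; entries are never overwritten.
        self.log = []
        # One page table per revision: {page_id: index into self.log}.
        # Revision 0 is the empty store.
        self.revisions = [{}]

    def commit(self, changes):
        """Write changed pages to the log and publish a new revision."""
        # Copy the page table (the "root"), not the pages themselves.
        root = dict(self.revisions[-1])
        for page_id, content in changes.items():
            self.log.append(content)
            root[page_id] = len(self.log) - 1
        # Publishing the new root is the single step that makes the
        # revision visible, standing in for an atomic root swap.
        self.revisions.append(root)
        return len(self.revisions) - 1

    def read(self, page_id, revision=None):
        """Read a page as of a given revision (latest by default)."""
        root = self.revisions[-1 if revision is None else revision]
        return self.log[root[page_id]]


if __name__ == "__main__":
    store = CowStore()
    r1 = store.commit({"a": "v1", "b": "x"})
    r2 = store.commit({"a": "v2"})
    print(store.read("a", r1), store.read("a", r2), store.read("b", r2))
```

Unchanged pages are shared between revisions through the copied page table, so every old revision stays readable without copying its data.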

JoshTriplett over 5 years ago

While naming conflicts aren't necessarily always a problem, given that you specifically describe aspects of the problem as a "git for X", you should know that "quilt" is already the name of a popular piece of version control software.

heinrichhartman over 5 years ago

Excited to see this being relaunched. "git for data" ranks pretty high on my all-time list of tech I want to see succeed.

I find the business model very interesting: a kind of "middle layer" SaaS, where you provide a new front end for an existing service. I have not seen that very often. It certainly helps with the data privacy issues. Rapid onboarding is another immediate benefit.

rabidrat over 5 years ago

Is there a way to get a "dataset of datasets"? That is, all datasets you have, in downloadable tabular form with metadata for each dataset?

gidim over 5 years ago

Really excited to see this relaunched. Every DS team has issues around dataset management. We previously shared a tutorial on how to get a fully reproducible pipeline with Quilt + Comet.ml: <a href="https://blog.quiltdata.com/building-a-fully-reproducible-machine-learning-pipeline-with-comet-ml-and-quilt-c0e682b8e25" rel="nofollow">https://blog.quiltdata.com/building-a-fully-reproducible-mac...</a>

trailerfins over 5 years ago

I also really appreciated your lessons learned — pretty compelling. The showcase buckets on the site are awesome. What's the mechanism by which the public data ends up in S3, just out of curiosity?

lyal over 5 years ago

I'm very excited to see this -- data portability and management is a primary struggle we're trying to map out. Would love to see an engineering post on what you did for ElasticSearch.

codetrotter over 5 years ago

> Try searching for anything on <a href="https://open.quiltdata.com/" rel="nofollow">https://open.quiltdata.com/</a> and let us know how search works for you.

I suggest adding the ability to search for exact matches with quotation marks, and also ensuring that it works with the quotation marks that the default keyboard on iOS produces.

For example, I want to search for “Irish Setter” and only see results that include those two words next to each other like that.
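One way the suggestion above could be handled is to normalize typographic quotes before parsing the query. The sketch below is an illustration only, not how the Quilt catalog parses queries; `parse_query` and `matches` are hypothetical helpers, and the substring check stands in for a real search backend's phrase query.

```python
import re

# Map typographic double quotes (such as those typed by the default iOS
# keyboard) to the plain ASCII double quote.
SMART_QUOTES = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u201e": '"',  # double low-9 quotation mark
})


def parse_query(raw):
    """Split a raw query into (exact phrases, loose terms)."""
    normalized = raw.translate(SMART_QUOTES)
    # Text between a pair of double quotes is an exact phrase.
    phrases = re.findall(r'"([^"]+)"', normalized)
    # Everything outside quoted spans is treated as individual terms.
    remainder = re.sub(r'"[^"]*"', " ", normalized)
    terms = [t.strip('"') for t in remainder.split() if t.strip('"')]
    return phrases, terms


def matches(text, phrases, terms):
    """Case-insensitive check: every phrase and every term must appear."""
    haystack = text.lower()
    return all(p.lower() in haystack for p in phrases) and all(
        t.lower() in haystack for t in terms
    )


if __name__ == "__main__":
    phrases, terms = parse_query("\u201cIrish Setter\u201d puppy")
    print(phrases, terms)
    print(matches("An Irish Setter puppy", phrases, terms))
    print(matches("Setter, Irish puppy", phrases, terms))
```

In a real deployment the extracted phrases would more likely be passed to the search engine's own phrase matching (for example, an Elasticsearch `match_phrase` query) instead of a substring test.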

diegoscara over 5 years ago

I'm really excited to start exploring these datasets through Quilt and building ML models. Thanks to the whole Quilt team and everyone involved for making this possible!

DTE over 5 years ago

Congrats to Aneesh and team! We (Paperspace, YC W15) are big fans and have been following these guys for a while now!

FanaHOVA over 5 years ago

Congrats on the launch, guys! Excited to read that you've already connected with Tim. More and more smart people tackling this problem is always a plus for everyone :)

foxhop over 5 years ago

Does it work with DigitalOcean Spaces?

digitaltrees over 5 years ago

Congrats. You are an awesome team.

antman over 5 years ago

Is there any tool in this space that also handles permissions e.g. per column or table?

admirethemeyer over 5 years ago

Thanks for sharing this and driving Quilt forward @Kevin!