Only 39 days since the last "GitHub for data" was announced: https://news.ycombinator.com/item?id=22375774

I'll say what I said in February: I started a company with the same premise 9 years ago, during the prime "big data" hype cycle. We burned through a lot of investor money only to realize that there was not a market opportunity to capture. That is, many people thought it was cool - we even did co-sponsored data contests with The Economist - but at the end of the day, we couldn't find anyone with an urgent problem that they were willing to pay to solve.

I wish these folks luck! Perhaps things have changed; we were part of a flock of 5 or 10 similar projects and I'm pretty sure the only one still around today is Kaggle.

https://www.youtube.com/watch?v=EWMjQhhxhQ4
Very cool! The world needs better version control for data.

How does this compare to something like Pachyderm?

How does it work under the covers? What is a splice and what does it mean when it overlaps? https://github.com/liquidata-inc/dolt/blob/84d9eded517167eb2b1f76073df88e85665eec1d/go/store/merge/three_way_list_test.go#L137

Is it feasible to use Conflict-free Replicated Data Types (CRDTs) for this?
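For context, here is a minimal sketch of one CRDT flavor, a last-writer-wins register per cell, in Python. This is purely illustrative and not how Dolt merges; it assumes each write carries a (timestamp, replica id) pair so concurrent edits resolve deterministically:

    # Illustrative LWW-register CRDT for a single table cell; not Dolt's
    # actual merge strategy. Assumes writes are stamped with
    # (timestamp, replica_id) to break ties deterministically.
    class LWWCell:
        def __init__(self, value, timestamp, replica_id):
            self.value = value
            self.stamp = (timestamp, replica_id)  # total order on writes

        def merge(self, other):
            # Conflict-free: both replicas converge to the write with the
            # greatest stamp, regardless of merge order.
            return self if self.stamp >= other.stamp else other

    a = LWWCell("alice@example.com", 100, "replica-a")
    b = LWWCell("alice@new.example.com", 105, "replica-b")
    assert a.merge(b).value == b.merge(a).value == "alice@new.example.com"

The trade-off is that LWW silently drops one of two concurrent writes, which is exactly the kind of conflict a three-way merge would instead surface to the user.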
> *Dolt is the only database with branches*

There's also litetree, whose slogan is simply "SQLite with branches":

https://github.com/aergoio/litetree
Any reason or history behind the name? It means "a stupid person", which seems like a bad choice IMHO: https://www.merriam-webster.com/dictionary/dolt
So, we ingest a third-party dataset that changes daily. One of our problems is that we need to retrospectively measure arbitrary metrics (how many X had condition Y on days 1 through 180 of the current year?). Imagine the external data like this:

UUID,CategoryA,CategoryACount,CategoryB,CategoryBCount,BooleanC,BooleanD...etc

When we ingest a new UUID, we add a column "START_DATE" which is the first date the UUID's metrics were valid. When any of the metric counts changes, we add "END_DATE" to the row and add a new row for that UUID with an updated START_DATE.

It works, but it sucks to analyse because you have to partition the database by the days each row was valid and do your aggregations on those partitions. And it sucks to get a snapshot of how a dataset looked on a particular day. It would be *much* easier if we could just access the daily diffs, which seems like a task Dolt would accomplish.

I mean it has a better chance of working than getting the third party to implement versioning on their data feed.
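For concreteness, a minimal sketch of the daily close-out-and-reinsert step described above (the classic "slowly changing dimension, type 2" pattern). Column names follow the comment; the in-memory representation is an assumption:

    # Daily ingest sketch (SCD type 2): when a UUID's metrics change, close
    # the current row with END_DATE and open a new row. Rows are dicts;
    # column names follow the comment above, storage is assumed.
    import datetime

    def ingest(current_rows, feed_rows, today=None):
        today = today or datetime.date.today().isoformat()
        open_rows = {r["UUID"]: r for r in current_rows
                     if r.get("END_DATE") is None}
        out = list(current_rows)
        for new in feed_rows:
            old = open_rows.get(new["UUID"])
            if old is None:  # first time we see this UUID
                out.append({**new, "START_DATE": today, "END_DATE": None})
            elif any(old[k] != new[k] for k in new):  # a metric changed
                old["END_DATE"] = today  # close out the old version
                out.append({**new, "START_DATE": today, "END_DATE": None})
        return out

With daily diffs stored by something like Dolt, this whole function (and the partition-by-validity analysis pain) would go away.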
A year or so ago I looked into "git for data" for medical research data curation. At the time I found a couple of promising solutions based on wrapping git and git-annex:

GIN: https://gin.g-node.org/
datalad: https://www.datalad.org/

At the time GIN looked really promising as something potentially simple enough for end users in the lab but with a lot of power behind it. (Unfortunately we never got it deployed due to organizational constraints... but that's a separate story.)
I think they could find funding and use cases if they had something like licensing and terms of use baked into the data to track lineage.
E.g. "this columns contains emails" and is revokable. Or when you publish data, "this column needs hashing/anonymizing/...".
And if you track data across versions and can version relations, you can create lineage.

Overall, I've seen many of these lately and am waiting for one to really shine. Not because I think it's a grand problem, since I can version my DDL/DML and even code, but because I see some need for it: I have a lot of non-tech people working with data, throwing it left and right and expecting me to clean up after them.
Comparison to Dat?

https://docs.dat.foundation/docs/intro
Eh, I worked on a database with branches for 3 years starting in 2002, while I was at ESRI. It is called a versioned system... Here is how it works, from an answer I gave several years back on gis.stackexchange: https://gis.stackexchange.com/questions/15203/when-versioning-with-arcsde-can-posted-edits-be-cancelled-or-rejected
Seems like a lot of work went into this, and there are very smart people behind it. However, I can't help feeling that this will lead to so many unintentional data leaks.

Nevertheless, starred. Let's see where it goes.
It's a cool idea. There's also https://quiltdata.com/ but I haven't heard anything about them in a long time.
Really interesting. It would be nice to see documentation. All their examples show modifying the database by running command-line SQL queries; does it spin up a normal MySQL instance or just emulate one? Are hooks available in Go? I'm surprised they don't market it as a blockchain database. I'm building a dapp right now and this could be really useful.
I think data (as in raw, collected / measured / surveyed data) doesn't really change, but you get more of it. Some data may occasionally supersede old data. Maybe the schema of the data changes, so your first set of data is in one form, and subsequent data might have more information, or recorded in a different way.
Maybe not a killer app, but there are certain kinds of collaborative 'CRUD' apps that could benefit greatly from having versioning built into the database as a service.

For instance, how much of a functional wiki could one assemble from off-the-shelf parts? Editing, display, account management, templating, etc. could all be handled with existing libraries in a wide array of programming languages.

The logic around the edit history is likely to contain the plurality, if not the majority, of the custom code.
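As a sketch of how thin that custom layer could get, here is what a wiki save and history lookup might look like if the database itself versioned every write. The commit/history API below is hypothetical, not a real Dolt interface; the point is only that the edit-history logic nearly disappears:

    # Hypothetical sketch: wiki "save" and "history" on a database that
    # versions every write. db.commit(...) and the history table are
    # assumed APIs, not real Dolt features.
    def save_page(db, slug, body, author):
        db.execute("REPLACE INTO pages (slug, body) VALUES (?, ?)",
                   (slug, body))
        db.commit(message=f"edit {slug}", author=author)  # assumed API

    def page_history(db, slug):
        # One query against the store's commit log instead of a
        # hand-rolled revisions table; columns are assumed.
        return db.execute(
            "SELECT commit_id, author, message, date"
            " FROM history WHERE slug = ?", (slug,))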
Looks like they are a fork of noms (https://github.com/attic-labs/noms). The object store has the telling name `.dolt/noms`.

Inside are a bunch of binary files. It would be interesting to know more about the on-disk layout of the stored tables.

I was not able to find any documentation. Does someone know more about this? Pointers would be appreciated.
Ever since Wil Shipley's presentation "Git as a Document Format" (AltConf, 2015 [1]), the idea of using git to track data has stuck with me.

Cool to see another approach to this.

At first look, I miss the representation of data as plain old text files, but I guess that's somewhat in competition with the goal of getting performance on larger data sets.

Anyway, I am wondering: did somebody here try using plain git as a database to track data in a repository?

[1] https://academy.realm.io/posts/altconf-wil-shipley-git-document-format/
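To make the question concrete, the naive version looks something like this: one JSON file per record, one commit per write, shelling out to the git CLI. It works for small data but gets slow quickly, which is presumably the performance trade-off mentioned above:

    # Naive "plain git as a database": one JSON file per record, one
    # commit per write. Assumes git is installed and `repo` is an
    # initialized repository.
    import json, pathlib, subprocess

    def put(repo, key, record):
        path = pathlib.Path(repo) / f"{key}.json"
        path.write_text(json.dumps(record, indent=2, sort_keys=True) + "\n")
        subprocess.run(["git", "-C", repo, "add", path.name], check=True)
        subprocess.run(["git", "-C", repo, "commit", "-m", f"put {key}"],
                       check=True)

    def history(repo, key):
        # One line per version: commit hash and commit date.
        out = subprocess.run(
            ["git", "-C", repo, "log", "--format=%H %cI", "--",
             f"{key}.json"],
            check=True, capture_output=True, text=True)
        return out.stdout.splitlines()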
The idea is good, and the product may be good too (I can't find any whitepapers or anything about the underlying technology). But some of their marketing is suspiciously unprofessional, like "Better Database Backups". In the DB world, you can't call something a "backup" if it can't restore all of your DB files bit-for-bit, or if it's non-deterministic. You can call it a "dump", an "export", or whatever, but not a backup.

I don't think they plan to compete in the DB backup storage market. So please don't mislead your potential customers.
I use a Python-based CMS called CodeRedCMS for my website. They store all their content in a file called db.sqlite3. I use PythonAnywhere for hosting the site, and they read the website files from GitHub. So whenever I update my site (including the blog), I just push the latest version of the db.sqlite3 file to GitHub and pull it into PythonAnywhere.

So, as I understand it, as long as the DB can be converted into files, it will work like anything else on Git and GitHub. What am I missing?
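One thing missing: git treats db.sqlite3 as an opaque binary blob, so you get no meaningful diffs or merges, and binary files tend to delta poorly, bloating history. A common workaround is to commit a text dump alongside (or instead of) the binary; a minimal sketch using the standard library's iterdump(), with file names matching the comment above:

    # Commit a text dump of db.sqlite3 so git can show line-level diffs.
    # Uses sqlite3.Connection.iterdump() from the standard library.
    import sqlite3

    def dump_to_sql(db_path="db.sqlite3", out_path="db.sql"):
        con = sqlite3.connect(db_path)
        with open(out_path, "w") as f:
            for line in con.iterdump():  # CREATE TABLE / INSERT statements
                f.write(line + "\n")
        con.close()

    dump_to_sql()  # then: git add db.sql && git commit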
Non-binary data can be saved as text; for example, you can have an SQL database dump. You can put that text into git. What does this solution add to that simple idea?
Dolt is not Git for data.

Git takes existing files and allows you to version them.

Git for data would take existing tables or rows and allow you to version them: a uniform, drop-in, open-source way to have a history of rows, merge them, restore them, etc., that works for Postgres, MySQL, or Oracle in the same way, and is compatible with migrations.

You can have a history if you use Bigtable or CouchDB; there's no need for Dolt if it's about using a specific product.
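For comparison, the hand-rolled version of per-row history looks like the sketch below: every write also appends the new row state to a history table. Real deployments usually do this with triggers or temporal-table extensions; sqlite3 and this schema are only for a self-contained example:

    # Hand-rolled row history: each write appends the new state to a
    # history table. Triggers or temporal-table extensions do this better;
    # sqlite3 and the schema here are just for a runnable example.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
        CREATE TABLE users_history (
            id INTEGER, email TEXT,
            valid_from TEXT DEFAULT CURRENT_TIMESTAMP
        );
    """)

    def set_email(user_id, email):
        con.execute("REPLACE INTO users (id, email) VALUES (?, ?)",
                    (user_id, email))
        con.execute("INSERT INTO users_history (id, email) VALUES (?, ?)",
                    (user_id, email))

    set_email(1, "a@example.com")
    set_email(1, "b@example.com")
    print(con.execute("SELECT * FROM users_history WHERE id = 1").fetchall())

What this approach lacks, and what a real "git for data" would add, is branching and merging of those histories rather than a single linear log.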
Whether it works or not, I find the introduction confusing.

Compare and contrast it with the clarity of these introductions:

- https://git-lfs.github.com/ (Git Large File Storage)

- http://paulfitz.github.io/daff/ ("data diff for tables")
It looks like Daff (align and compare tables):

https://github.com/paulfitz/daff

and Coopy (distributed spreadsheets with intelligent merges):

https://github.com/paulfitz/coopy
Slightly related: how does ML track new data input and ensure that the data hasn't introduced a regression?

I would assume there's an automated test suite, but also some way of diffing large amounts of input data and visualizing those input additions relative to model classifications?

What are the common tools for this?
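One common homegrown approach, sketched below: fingerprint each incoming batch so data changes are identifiable, and gate acceptance on an evaluation metric over a fixed holdout set. The evaluate callable and threshold are assumptions; dedicated tools such as DVC cover the data-versioning half of this:

    # Sketch: fingerprint each data batch and reject it if the model's
    # holdout metric regresses. evaluate() is assumed to retrain/score
    # with the new batch and return a metric on a fixed holdout set.
    import hashlib, json

    def batch_fingerprint(rows):
        blob = json.dumps(rows, sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()  # stable ID for the batch

    def accept_batch(rows, evaluate, baseline, tolerance=0.01):
        score = evaluate(rows)
        if score < baseline - tolerance:
            raise ValueError(
                f"batch {batch_fingerprint(rows)[:12]} regressed: "
                f"{score:.4f} < {baseline:.4f}")
        return batch_fingerprint(rows), score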
Looks interesting; depending on performance, this could neatly cover a few use cases I have at the moment without needing to build as much myself. At least Dolt on its own; whether we would need the hub is another matter, but I guess it depends on uptake.
Recently I was working with some open data and needed a tool that transforms those CSV/JSON files into something standardized that I can run queries against, and that lets me patch the data. Maybe this is a use case for Dolt.
> With Dolt, you can view a human-readable diff of the data you received last time versus the data you received this time.<p>How is this accomplished if the data is binary?<p>Also, how does this compare to git lfs?
Is there a way to page SQL results? Also, it would be awesome if I could use rlwrap with `dolt sql`, so I can use the shortcuts I'm used to in a REPL environment.
Pardon my ignorance, but is data copyrightable? Or can it be owned? Obviously someone can get into trouble uploading proprietary code to git. Is there proprietary data?
Can you give some more information about what you're doing with your cloud infrastructure? Would be intrigued to hear about what you're running.
An example use case that "git for data" seems to break: storing data for medical research where the participants are allowed to withdraw from the study after the fact. Then their data must be deleted retroactively, not just at the head. I don't know of a good methodology for dealing with this at all, as it breaks backups, for example.

The problem extends beyond medical research due to privacy laws like the GDPR. A participant or user must be able to delete their data, not merely hide it, so as to protect themselves from data breaches. Suggestions welcome.
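One known mitigation is crypto-shredding: encrypt each participant's rows under a per-participant key stored outside the versioned history and the backups, then delete the key on withdrawal so every historical copy becomes unreadable. A minimal sketch assuming the Python `cryptography` package; real key management is far more involved:

    # Crypto-shredding sketch: per-participant keys live OUTSIDE the
    # versioned store, so deleting a key "deletes" all historical copies
    # (commits, backups) of that participant's data at once.
    from cryptography.fernet import Fernet

    keys = {}  # key store; must NOT be committed or backed up with the data

    def store(participant_id, record: bytes) -> bytes:
        key = keys.setdefault(participant_id, Fernet.generate_key())
        return Fernet(key).encrypt(record)  # ciphertext goes in the repo

    def withdraw(participant_id):
        del keys[participant_id]  # shreds every historical version

    def load(participant_id, ciphertext: bytes) -> bytes:
        return Fernet(keys[participant_id]).decrypt(ciphertext)

Whether regulators accept key deletion as erasure varies, but it is the standard engineering answer to "delete from immutable history".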