Only 39 days since the last "GitHub for data" was announced: https://news.ycombinator.com/item?id=22375774

I'll say what I said in February: I started a company with the same premise 9 years ago, during the prime "big data" hype cycle. We burned through a lot of investor money only to realize that there was not a market opportunity to capture. That is, many people thought it was cool - we even did co-sponsored data contests with The Economist - but at the end of the day, we couldn't find anyone with an urgent problem that they were willing to pay to solve.

I wish these folks luck! Perhaps things have changed; we were part of a flock of 5 or 10 similar projects and I'm pretty sure the only one still around today is Kaggle.

https://www.youtube.com/watch?v=EWMjQhhxhQ4
Very cool! The world needs better version control for data.

How does this compare to something like Pachyderm?

How does it work under the covers? What is a splice and what does it mean when it overlaps? https://github.com/liquidata-inc/dolt/blob/84d9eded517167eb2b1f76073df88e85665eec1d/go/store/merge/three_way_list_test.go#L137

Is it feasible to use Conflict-free Replicated Data Types (CRDTs) for this?
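For context, here is a minimal sketch of one CRDT flavor, a last-writer-wins register per cell, in Python. This is purely illustrative and not how Dolt merges; it assumes each write carries a (timestamp, replica id) pair so concurrent edits resolve deterministically:

    # Illustrative LWW-register CRDT for a single table cell; not Dolt's
    # actual merge strategy. Assumes writes are stamped with
    # (timestamp, replica_id) to break ties deterministically.
    class LWWCell:
        def __init__(self, value, timestamp, replica_id):
            self.value = value
            self.stamp = (timestamp, replica_id)  # total order on writes

        def merge(self, other):
            # Conflict-free: both replicas converge to the write with the
            # greatest stamp, regardless of merge order.
            return self if self.stamp >= other.stamp else other

    a = LWWCell("alice@example.com", 100, "replica-a")
    b = LWWCell("alice@new.example.com", 105, "replica-b")
    assert a.merge(b).value == b.merge(a).value == "alice@new.example.com"

The trade-off is that LWW silently drops one of two concurrent writes, which is exactly the kind of conflict a three-way merge would instead surface to the user.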
> *Dolt is the only database with branches*

There's also litetree, whose slogan is simply "SQLite with branches":

https://github.com/aergoio/litetree
Any reason or history behind the name? It means "a stupid person", which seems like a bad choice IMHO: https://www.merriam-webster.com/dictionary/dolt
So, we ingest a third-party dataset that changes daily. One of our problems is that we need to retrospectively measure arbitrary metrics (how many X had condition Y on days 1 through 180 of the current year?). Imagine the external data like this:

UUID,CategoryA,CategoryACount,CategoryB,CategoryBCount,BooleanC,BooleanD...etc

When we ingest a new UUID, we add a column "START_DATE" which is the first date the UUID's metrics were valid. When any of the metric counts changes, we add "END_DATE" to the row and add a new row for that UUID with an updated START_DATE.

It works, but it sucks to analyse because you have to partition the database by the days each row was valid and do your aggregations on those partitions. And it sucks to get a snapshot of how a dataset looked on a particular day. It would be *much* easier if we could just access the daily diffs, which seems like a task Dolt would accomplish.

I mean it has a better chance of working than getting the third party to implement versioning on their data feed.
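For concreteness, a minimal sketch of the daily close-out-and-reinsert step described above (the classic "slowly changing dimension, type 2" pattern). Column names follow the comment; the in-memory representation is an assumption:

    # Daily ingest sketch (SCD type 2): when a UUID's metrics change, close
    # the current row with END_DATE and open a new row. Rows are dicts;
    # column names follow the comment above, storage is assumed.
    import datetime

    def ingest(current_rows, feed_rows, today=None):
        today = today or datetime.date.today().isoformat()
        open_rows = {r["UUID"]: r for r in current_rows
                     if r.get("END_DATE") is None}
        out = list(current_rows)
        for new in feed_rows:
            old = open_rows.get(new["UUID"])
            if old is None:  # first time we see this UUID
                out.append({**new, "START_DATE": today, "END_DATE": None})
            elif any(old[k] != new[k] for k in new):  # a metric changed
                old["END_DATE"] = today  # close out the old version
                out.append({**new, "START_DATE": today, "END_DATE": None})
        return out

With daily diffs stored by something like Dolt, this whole function (and the partition-by-validity analysis pain) would go away.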
A year or so ago I looked into "git for data" for medical research data curation. At the time I found a couple of promising solutions based on wrapping git and git-annex:

GIN: https://gin.g-node.org/
datalad: https://www.datalad.org/

At the time GIN looked really promising as something potentially simple enough for end users in the lab but with a lot of power behind it. (Unfortunately we never got it deployed due to organizational constraints... but that's a separate story.)
I think they could find funding and use cases if they had something like licensing and terms of use baked into the data to track lineage.
E.g. "this columns contains emails" and is revokable. Or when you publish data, "this column needs hashing/anonymizing/...".
And if you track data across versions and can version relations, you can create lineage.

Overall, I've seen many of these lately and am waiting for one to really shine. Not because I think it's a grand problem, since I can version my DDL/DML and even code, but because I see some need for it: I have a lot of non-tech people working with data, throwing it left and right and expecting me to clean up after them.
Comparison to Dat?

https://docs.dat.foundation/docs/intro
Eh, I worked on a database with branches for 3 years starting in 2002, while I was at ESRI. It is called a versioned system... Here is how it works, from an answer I gave several years back on gis.stackexchange: https://gis.stackexchange.com/questions/15203/when-versioning-with-arcsde-can-posted-edits-be-cancelled-or-rejected
Seems like a lot of work went into this, and there are very smart people behind it. However, I can't help feeling that this will lead to so many unintentional data leaks.

Nevertheless, starred. Let's see where it goes.
It's a cool idea. There's also https://quiltdata.com/ but I haven't heard anything about them in a long time.
Really interesting. It would be nice to see documentation. All their examples show modifying the database by running command-line SQL queries; does it spin up a normal MySQL instance or just emulate one? Are hooks available in Go? I'm surprised they don't market it as a blockchain database. I'm building a dapp right now and this could be really useful.
I think data (as in raw, collected / measured / surveyed data) doesn't really change, but you get more of it. Some data may occasionally supersede old data. Maybe the schema of the data changes, so your first set of data is in one form, and subsequent data might have more information, or recorded in a different way.
Maybe not a killer app, but there are certain kinds of collaborative 'CRUD' apps that could benefit greatly from having versioning built into the database as a service.

For instance, how much of a functional wiki could one assemble from off-the-shelf parts? Editing, display, account management, templating, etc. could all be handled with existing libraries in a wide array of programming languages.

The logic around the edit history is likely to contain the plurality, if not the majority, of the custom code.
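As a sketch of how thin that custom layer could get, here is what a wiki save and history lookup might look like if the database itself versioned every write. The commit/history API below is hypothetical, not a real Dolt interface; the point is only that the edit-history logic nearly disappears:

    # Hypothetical sketch: wiki "save" and "history" on a database that
    # versions every write. db.commit(...) and the history table are
    # assumed APIs, not real Dolt features.
    def save_page(db, slug, body, author):
        db.execute("REPLACE INTO pages (slug, body) VALUES (?, ?)",
                   (slug, body))
        db.commit(message=f"edit {slug}", author=author)  # assumed API

    def page_history(db, slug):
        # One query against the store's commit log instead of a
        # hand-rolled revisions table; columns are assumed.
        return db.execute(
            "SELECT commit_id, author, message, date"
            " FROM history WHERE slug = ?", (slug,))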
Looks like they are a fork of noms (https://github.com/attic-labs/noms). The object store has the telling name `.dolt/noms`.

Inside are a bunch of binary files. It would be interesting to know more about the on-disk layout of the stored tables.

I was not able to find any documentation. Does someone know more about this? Pointers would be appreciated.
Ever since Wil Shipley's presentation "Git as a Document Format" (AltConf, 2015 [1]), the idea of using git to track data has stuck with me.

Cool to see another approach to this.

At first look, I miss the representation of data as plain old text files, but I guess that's somewhat in competition with the goal of getting performance on larger data sets.

Anyway, I am wondering: did somebody here try using plain git as a database to track data in a repository?

[1] https://academy.realm.io/posts/altconf-wil-shipley-git-document-format/
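To make the question concrete, the naive version looks something like this: one JSON file per record, one commit per write, shelling out to the git CLI. It works for small data but gets slow quickly, which is presumably the performance trade-off mentioned above:

    # Naive "plain git as a database": one JSON file per record, one
    # commit per write. Assumes git is installed and `repo` is an
    # initialized repository.
    import json, pathlib, subprocess

    def put(repo, key, record):
        path = pathlib.Path(repo) / f"{key}.json"
        path.write_text(json.dumps(record, indent=2, sort_keys=True) + "\n")
        subprocess.run(["git", "-C", repo, "add", path.name], check=True)
        subprocess.run(["git", "-C", repo, "commit", "-m", f"put {key}"],
                       check=True)

    def history(repo, key):
        # One line per version: commit hash and commit date.
        out = subprocess.run(
            ["git", "-C", repo, "log", "--format=%H %cI", "--",
             f"{key}.json"],
            check=True, capture_output=True, text=True)
        return out.stdout.splitlines()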
The idea is good, and the product may be good too (I can't find any whitepapers or anything about the underlying technology). But some of their marketing is suspiciously unprofessional, like "Better Database Backups". In the DB world, you can't call something a "backup" if it can't restore all of your DB files bit-for-bit, or if it's non-deterministic. You can call it a "dump", an "export", or whatever, but not a backup.

I don't think they plan to compete in the DB backup storage market. So please don't mislead your potential customers.
I use a Python-based CMS called CodeRedCMS for my website. They store all their content in a file called db.sqlite3. I use PythonAnywhere for hosting the site, and they read the website files from GitHub. So whenever I update my site (including the blog), I just push the latest version of the db.sqlite3 file to GitHub and pull it into PythonAnywhere.

So, as I understand it, as long as the DB can be converted into files, it will work like anything else on Git and GitHub. What am I missing?
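One thing missing: git treats db.sqlite3 as an opaque binary blob, so you get no meaningful diffs or merges, and binary files tend to delta poorly, bloating history. A common workaround is to commit a text dump alongside (or instead of) the binary; a minimal sketch using the standard library's iterdump(), with file names matching the comment above:

    # Commit a text dump of db.sqlite3 so git can show line-level diffs.
    # Uses sqlite3.Connection.iterdump() from the standard library.
    import sqlite3

    def dump_to_sql(db_path="db.sqlite3", out_path="db.sql"):
        con = sqlite3.connect(db_path)
        with open(out_path, "w") as f:
            for line in con.iterdump():  # CREATE TABLE / INSERT statements
                f.write(line + "\n")
        con.close()

    dump_to_sql()  # then: git add db.sql && git commit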
Non-binary data can be saved as text; for example, you can have an SQL database dump. You can put that text into git. What does this solution add to that simple idea?
Dolt is not Git for data.

Git takes existing files and allows you to version them.

Git for data would take existing tables or rows and allow you to version them: a uniform, drop-in, open-source way to have a history of rows, merge them, restore them, etc., that works for Postgres, MySQL, or Oracle in the same way, and is compatible with migrations.

You can have a history if you use Bigtable or CouchDB; there's no need for Dolt if it's about using a specific product.
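For comparison, the hand-rolled version of per-row history looks like the sketch below: every write also appends the new row state to a history table. Real deployments usually do this with triggers or temporal-table extensions; sqlite3 and this schema are only for a self-contained example:

    # Hand-rolled row history: each write appends the new state to a
    # history table. Triggers or temporal-table extensions do this better;
    # sqlite3 and the schema here are just for a runnable example.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
        CREATE TABLE users_history (
            id INTEGER, email TEXT,
            valid_from TEXT DEFAULT CURRENT_TIMESTAMP
        );
    """)

    def set_email(user_id, email):
        con.execute("REPLACE INTO users (id, email) VALUES (?, ?)",
                    (user_id, email))
        con.execute("INSERT INTO users_history (id, email) VALUES (?, ?)",
                    (user_id, email))

    set_email(1, "a@example.com")
    set_email(1, "b@example.com")
    print(con.execute("SELECT * FROM users_history WHERE id = 1").fetchall())

What this approach lacks, and what a real "git for data" would add, is branching and merging of those histories rather than a single linear log.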
Whether it works or not, I find the introduction confusing.

Compare and contrast it with the clarity of these introductions:

- https://git-lfs.github.com/ (Git Large File Storage)

- http://paulfitz.github.io/daff/ ("data diff for tables")
It looks like Daff (align and compare tables):

https://github.com/paulfitz/daff

and Coopy (distributed spreadsheets with intelligent merges):

https://github.com/paulfitz/coopy
Slightly related: how does ML track new data input and ensure that the data hasn't introduced a regression?

I would assume there's an automated test suite, but also some way of diffing large amounts of input data and visualizing those input additions relative to model classifications?

What are the common tools for this?
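One common homegrown approach, sketched below: fingerprint each incoming batch so data changes are identifiable, and gate acceptance on an evaluation metric over a fixed holdout set. The evaluate callable and threshold are assumptions; dedicated tools such as DVC cover the data-versioning half of this:

    # Sketch: fingerprint each data batch and reject it if the model's
    # holdout metric regresses. evaluate() is assumed to retrain/score
    # with the new batch and return a metric on a fixed holdout set.
    import hashlib, json

    def batch_fingerprint(rows):
        blob = json.dumps(rows, sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()  # stable ID for the batch

    def accept_batch(rows, evaluate, baseline, tolerance=0.01):
        score = evaluate(rows)
        if score < baseline - tolerance:
            raise ValueError(
                f"batch {batch_fingerprint(rows)[:12]} regressed: "
                f"{score:.4f} < {baseline:.4f}")
        return batch_fingerprint(rows), score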
Looks interesting; depending on performance, this could neatly cover a few use cases I have at the moment without needing to build as much myself. At least Dolt on its own; whether we would need the hub is another matter, but I guess it depends on uptake.
Recently I was working with some open data and needed a tool that transforms those CSV/JSON files into something standardized that I can run queries against, and that lets me patch the data. Maybe this is a use case for Dolt.
> With Dolt, you can view a human-readable diff of the data you received last time versus the data you received this time.<p>How is this accomplished if the data is binary?<p>Also, how does this compare to git lfs?
Is there a way to page SQL results? Also, it would be awesome if I could use rlwrap with `dolt sql`, so I can use the shortcuts I'm used to in a REPL environment.
Pardon my ignorance, but is data copyrightable? Or can it be owned? Obviously someone can get into trouble uploading proprietary code to git. Is there proprietary data?
Can you give some more information about what you're doing with your cloud infrastructure? Would be intrigued to hear about what you're running.
An example use case that "git for data" seems to break: storing data for medical research where the participants are allowed to withdraw from the study after the fact. Then their data must be deleted retroactively, not just at the head. I don't know of a good methodology for dealing with this at all, as it breaks backups, for example.

The problem extends beyond medical research due to privacy laws like the GDPR. A participant or user must be able to delete their data, not merely hide it, so as to protect themselves from data breaches. Suggestions welcome.
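One known mitigation is crypto-shredding: encrypt each participant's rows under a per-participant key stored outside the versioned history and the backups, then delete the key on withdrawal so every historical copy becomes unreadable. A minimal sketch assuming the Python `cryptography` package; real key management is far more involved:

    # Crypto-shredding sketch: per-participant keys live OUTSIDE the
    # versioned store, so deleting a key "deletes" all historical copies
    # (commits, backups) of that participant's data at once.
    from cryptography.fernet import Fernet

    keys = {}  # key store; must NOT be committed or backed up with the data

    def store(participant_id, record: bytes) -> bytes:
        key = keys.setdefault(participant_id, Fernet.generate_key())
        return Fernet(key).encrypt(record)  # ciphertext goes in the repo

    def withdraw(participant_id):
        del keys[participant_id]  # shreds every historical version

    def load(participant_id, ciphertext: bytes) -> bytes:
        return Fernet(keys[participant_id]).decrypt(ciphertext)

Whether regulators accept key deletion as erasure varies, but it is the standard engineering answer to "delete from immutable history".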