Show HN: Noms – A new decentralized database based on ideas from Git

508 points by ahl, nearly 9 years ago

34 comments

nartz, nearly 9 years ago

So, I realize this project is early, but it would be EXTREMELY helpful to walk through someone's use case. Who is the target here? A business analyst who iterates on cleaning and analyzing small Excel CSVs? Or someone else?

After watching the screencast, all I saw was a bunch of commands explained (I could have read the docs for that). Instead, I'd like to walk through a use case where this solves someone's problem.

im_down_w_otp, nearly 9 years ago

I can see the shape of a solution for GC, since you can use something like a per-object DVVset to determine the minimum set of unresolved histories required to avoid losing data during conflicts without unnecessarily ballooning the size of the dataset.

However, the inner-object conflict-resolution problem seems a lot harder to solve, given that there's no obvious join-semilattice for arbitrary fields/data. Can you discuss what conflict-resolution strategies you're working on for auto-resolution, and/or what metadata you intend to provide to the end user if you're going to punt resolution to them?

Given this is supposed to be for collaborative workloads, conflict resolution seems to be a cornerstone. Git handles this by inserting sibling sections into the documents and forcing the end user to fix problems manually, which is often fraught with pain and peril, and doesn't seem like a strategy that would work for a database (as opposed to a workflow).
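
For readers unfamiliar with the term: a join-semilattice is a merge operation that is commutative, associative, and idempotent, so replicas converge no matter the order in which they merge. Below is a minimal Python sketch, assuming every field is a grow-only set (an assumption chosen for illustration). The `State` type and `join` function are hypothetical and are not Noms' API or its merge strategy.

```python
from typing import Dict, FrozenSet

# Hypothetical replica state: each field name maps to a grow-only set.
State = Dict[str, FrozenSet[str]]


def join(a: State, b: State) -> State:
    """Least upper bound of two replica states: per-field set union."""
    return {
        key: a.get(key, frozenset()) | b.get(key, frozenset())
        for key in a.keys() | b.keys()
    }


left: State = {"tags": frozenset({"db", "git"})}
right: State = {"tags": frozenset({"db", "sync"}), "owners": frozenset({"ann"})}
merged = join(left, right)
```

A free-text field edited on both replicas has no such natural join, which is the difficulty the comment raises.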

zphds, nearly 9 years ago

Going through the SDK docs, why was a scheme like 'http://localhost:8000::people' chosen instead of the plain old 'http://localhost:8000/people'? Are there any benefits? If so, I'm curious to know what they are.

pinko, nearly 9 years ago

My question is about scalability. You say "large datasets" on the website. What is large? 1x/10x/100x terabytes? 1x/10x/100x petabytes? What kind of access rates? Etc.

Very general answers are okay. I'm trying to wrap my head around whether this is even in the right ballpark for my world.

Distinguishing current proof-of-concept scale from design-goal scale is okay too.

Thanks!

tlb, nearly 9 years ago

Strawman marketing alert: "The most common way to share data today is to post CSV files on a website". Maybe there are a bunch of people that still do that somewhere, but if so, they ain't early adopters of decentralized database technology and so not your target customers. It's always better to talk about what your most likely customers are doing now.

joshmarlow, nearly 9 years ago

Very cool. I would love to have something like this production ready. Some day...

Anyone who finds this interesting may also be intrigued by Irmin [0], a library for applications to persist data in a git-compatible format.

[0] - https://github.com/mirage/irmin

latortuga, nearly 9 years ago

At first glance, this reminds me a little bit of datomic - all data history is preserved/deduplicated, fork/decentralization features. Can you comment on how it compares?

lachenmayer, nearly 9 years ago

This looks really exciting, congrats to the team for launching!

Could you tell us a bit about how this compares to dat? http://dat-data.com/

aboodman, nearly 9 years ago

Hi all. I'm one of the creators of Noms. Happy to answer any questions!

was_boring, nearly 9 years ago

It's an interesting idea.

The HN title suggested it's a database, which made me really curious, as I could finally stop using history tables (or WAL logging, or the other myriad ways of seeing a point in time). However, that doesn't seem to be the case here?

That said, the idea of "git as a datastore" does seem akin to "blockchain as data verification". Combine those two ideas, get PwC involved, and you have multimillion-dollar deals coming in for audit protection.

pinko, nearly 9 years ago

Here's a relevant (albeit 4-year-old) StackExchange thread, "Is there a Git for data?": http://opendata.stackexchange.com/questions/748/is-there-a-git-for-data

kragen, nearly 9 years ago

I've been wanting something like Noms for a while. Prolly trees sound really promising.

In intro.md, you suggest, "If you wanted to find all the people of a particular age AND having a particular hair color, you could construct a second map having type Map<String, Set<Person>>, and intersect the two sets." In that case, how should I keep the two maps in sync? Do I need to atomically update the logic of all the instances of the application to modify both maps instead of just one? Or do I keep the second map (the hair color index) in a separate index database and update the index whenever I pull changes from a remote database? (What does the API look like for getting notified of new changes that haven't been indexed yet?)

I see that "noms sync" does both push and pull. Does that mean I can't pull data from a database I can't write to? How does that work over HTTP: do I need to use a special HTTP server that knows how to accept and authenticate write requests, or can I just dump a Noms dataset in a directory and serve it up with Apache?

Forgive me if these questions are obvious. I've read the docs I could find, but I haven't read any of the code beyond the hr sample.
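
A minimal Python sketch of the two-index lookup quoted from intro.md may make the question concrete. Plain dicts and sets stand in for Noms maps and sets; the `Person` fields and the `build_index` helper are hypothetical, and none of this is the Noms API. Deriving both indexes from the same snapshot is one way (an assumption, not something the docs confirm) to keep them consistent with each other.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, Iterable, Set


@dataclass(frozen=True)
class Person:
    name: str
    age: int
    hair: str


def build_index(people: Iterable[Person],
                key: Callable[[Person], Hashable]) -> Dict[Hashable, Set[Person]]:
    """Group people by key(person); rebuilt from a single snapshot."""
    index: Dict[Hashable, Set[Person]] = defaultdict(set)
    for person in people:
        index[key(person)].add(person)
    return dict(index)


snapshot = [
    Person("Ann", 30, "red"),
    Person("Bob", 30, "brown"),
    Person("Cy", 41, "red"),
]

by_age = build_index(snapshot, lambda p: p.age)
by_hair = build_index(snapshot, lambda p: p.hair)

# People who are 30 AND have red hair: intersect the two sets.
matches = by_age.get(30, set()) & by_hair.get("red", set())
```

If the two indexes were instead updated separately, a reader could see one updated and the other stale, which is the synchronization concern raised above.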

tombert, nearly 9 years ago

I'm surprised you didn't use a functional language like Haskell or OCaml or Rust to do this, since the article talks about love for functional programming.

I'm not criticizing Go at all; it's just not really a functional language.

nathancahill, nearly 9 years ago

Excellent! This has been on my "things to build someday" list for a while now. Excited to start playing with it.

paxcoder, nearly 9 years ago

Pretty impressive work, but it seems like reinventing wheels. Why wasn't it built upon existing tech?

I think the docs should enumerate the most important differences and the use cases for which it should be a better fit.

fizzbatter, nearly 9 years ago

This is really interesting! What are some ideal use cases for the current implementation? I've seen Git is considered a competitor, but Noms also appears to be a generic database, so I would just like to hear some basic use cases, if possible.

E.g., if used as a database, what applications would benefit from Noms? Could/should this be used for personal storage? Could/should this be used for code versioning (i.e., like Git)?

robzyb, nearly 9 years ago

Wow, this could be quite interesting.

Firstly, it would be cool if this could be a single gateway to "all the data in the world". Right now it's a pain to find, say, energy generation statistics for, say, Portugal, but it would be great if I could do something like:

    noms get statistics.industry.energy.portugal.all();

Secondly, the versioning idea could have some really cool applications. For example, I work in data analytics, and sometimes I want to transform some data in an SQL table.

Doing transformations nicely is a bit difficult. Either I'm doing the calculations in a column of a view, with the associated performance hit, or I'm tacking columns onto the table, which quickly leads to a mess, especially during the initial stages of analysis.

It would be so cool if I could treat the database as a constantly evolving git tree.

juol, nearly 9 years ago

Your mascot looks like it's giving an 'air' blowjob.

Otherwise it looks like a cool project, keep up the good work!

shruubi, nearly 9 years ago

I really like the idea in theory, but seeing it in practice I feel the whole thing is too concerned with being a wrapper around git handling for their dataset files. I would much rather see diffs based around the records themselves, and not so much the structure of the data.

phantom_oracle, nearly 9 years ago

I don't want to downplay this idea; it really is nice to see people doing different/unique things with technology.

However, one question I have is: couldn't you just put CSV/JSON files behind a VCS?

E.g., drop my CSV/JSON files onto github.com and then they will be version-controlled?

chenster, nearly 9 years ago

"...inspired by the elegance and power of Git for years.."

Definitely powerful, but elegance?

woodcut, nearly 9 years ago

We've been struggling to manage a collection of periodically updated CSVs and binaries over a few GB in size. We struggled with Git-LFS and gave up, and we were considering (dreading) SVN. This looks really promising. Cheers!

ah-, nearly 9 years ago

Can you elaborate a bit on how the hashing and chunking work? There's a rolling hash for determining chunk boundaries, and also SHA-512/256 somewhere.

Does the same data chunked differently have a different hash?
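
As background for the question, here is a minimal Python sketch of content-defined chunking with a rolling hash. It is an illustration only and not Noms' algorithm: the window size, base, modulus, and boundary rule are all assumptions, and SHA-256 from the standard library stands in for SHA-512/256. A boundary is declared wherever the hash of the last `WINDOW` bytes hits a target value, so boundaries depend only on nearby content.

```python
import hashlib
import random
from typing import List

WINDOW = 16            # bytes covered by the rolling hash (assumed value)
BASE = 257             # polynomial base (assumed value)
MOD = (1 << 31) - 1    # prime modulus (assumed value)
TARGET = 64            # boundary roughly every 64 bytes on random input


def chunk(data: bytes) -> List[bytes]:
    """Split data wherever the rolling hash of the last WINDOW bytes is 0 mod TARGET."""
    top = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window
    h, start, chunks = 0, 0, []
    for i, byte in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD   # drop the oldest byte
        h = (h * BASE + byte) % MOD                  # add the newest byte
        if i >= WINDOW - 1 and h % TARGET == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def chunk_id(piece: bytes) -> str:
    """Content address of one chunk (SHA-256 as a stand-in hash)."""
    return hashlib.sha256(piece).hexdigest()


rng = random.Random(0)
data = bytes(rng.getrandbits(8) for _ in range(4096))
original = chunk(data)
shifted = chunk(b"!" + data)   # same data with one byte inserted at the front
```

In this sketch, a chunk's id depends on exactly which bytes fall inside it, so splitting the same data at different points yields different ids. Because boundaries are a deterministic function of the content, the same data always chunks the same way, and an insertion near the front only disturbs the chunks around it: every chunk of `original` after the first reappears unchanged at the end of `shifted`.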

anilgulecha, nearly 9 years ago

No one's mentioned this yet, but with a good (mongo-like) query interface, this could add an important database to the offline-first movement.

(Right now pouchdb and gundb are the only available options.)

musicmatze, nearly 9 years ago

This looks really interesting. I've been thinking about the problem of distributed issue tracking lately, and the set of sub-problems it has (authorization and authentication, synchronization, and so on). I'm not sure all these problems could be covered by this, but I guess at least the "distributed" part could be covered by something like this.

cdbattags, nearly 9 years ago

I had an idea for this with a buddy in college after doing case study research into Git. I've always considered this the next step into a decentralized world outside of code and non-typed "text". I know CSVs were mentioned a few times; are you looking to narrow down to a few specific file types for a proof of concept?

billconan, nearly 9 years ago

I'm curious about merging.

When there is a conflict, like when a file gets changed by different people, how is merging performed?

kfk, nearly 9 years ago

Those are exactly the kind of ideas the finance world needs to get out of its eternal mess of spreadsheets.

nkohari, nearly 9 years ago

This is really interesting, thanks for sharing it!

I haven't had a chance to dig into the code yet, but I notice that you say two replicas of the same database can be disconnected, altered, and then merged. Could you explain how Noms takes care of that, particularly in the case of collisions?

ianai, nearly 9 years ago

This really piqued my interest and my "next big thing" sense.

pbkhrv, nearly 9 years ago

Something like this could be used as a backing store for package managers like npm or apt or ruby gems or pypi.

sigi45, nearly 9 years ago

How do you handle hash collisions?

mschaef, nearly 9 years ago

First off... I'm excited to see this project. There's a lot of potential here, and this looks like a good implementation of a nice concept. I have at least a bit of authority behind that statement, since a few years ago I had the opportunity to build something similar (although smaller in ambition). A couple of things to think about:

* Type accretion - This doesn't change the fact that database clients need to be able to accept historical data formats if they need to access historical data. The schema can't be changed for the older data objects without changing the hashes for that data, so there's no way to do something like a SQL schema migration. For simple schema changes like adding fields, this might not be so hard to deal with, but some changes will be structural in nature and change the relative paths between objects. (This adds complexity to the code of database clients, as well as testing effort.)

* Security - Is there a way to secure objects stored within Noms? Let's say I store $SECRET into Noms and get back a hash. Does it then become the case that every user with access to the database and the hash can now retrieve the $SECRET? What if permissions need to be granted or revoked for a particular object after it's been stored? Or for a field within a particular object? What if an object shouldn't have been stored in the database at all and needs to be obliterated? (This last problem gets worse if the object to be obliterated contains the only path to data that needs to be retained.)

* Performance - The CAS model effectively takes the stored data, runs it through a blender, and returns you a grey goo of hashes. This is good for replication, but it means you can't get much meaningful information out of a hash. This tends to mean a lot of operations like you might find in an old-school navigational database, and a huge dependency on the time to fetch an object given a hash. Indices can help by reducing the complexity of the traversals you need to do, but only if they're current and you have the index you need.

* Data roll off - How do you expire data so that it doesn't just monotonically increase in volume? Let's say there's an API to mark an object as purgeable; the problem of identifying other purgeable objects then effectively turns into a garbage collection process (git gc, etc.). There's also the issue of the sheer number of objects that can be involved. The system I was involved with had something like 500K objects/day that had to be purged after 120 days in the system (a total of 60MM live objects and around 6TB or so). Identifying 500K objects to purge and then specifying those to the data layer for action is not necessarily an easy thing.

* Querying - Server-side query logic (and an expression language) is basically essential to performance. Otherwise, you wind up with a network round trip for every edge of the graph you follow. Going back to my first point, whatever query language is used has to be flexible enough to handle a schema that might vary over time (through schema accretion).

All five of these bullet points are worthy of a great deal more discussion, and I haven't even broached issues around conflict resolution, differencing, UI concerns, etc. I think there are good approaches to managing lots of these issues, but there's a bunch of engineering involved, as well as some close attention to scope and goals...
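
The "data roll off" point can be sketched as a reachability sweep over a content-addressed store. The Python below is a minimal illustration under stated assumptions: the store is modelled as a hypothetical dict from object hash to the hashes it references, and `reachable` and `sweep` are invented names, not part of Noms or git.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical store layout: object hash -> hashes of objects it references.
Store = Dict[str, List[str]]


def reachable(store: Store, roots: Iterable[str]) -> Set[str]:
    """Mark phase: every object reachable from the retained roots."""
    seen: Set[str] = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in seen or obj not in store:
            continue
        seen.add(obj)
        stack.extend(store[obj])
    return seen


def sweep(store: Store, roots: Iterable[str]) -> Store:
    """Sweep phase: keep only objects that some retained root still needs."""
    live = reachable(store, roots)
    return {obj: refs for obj, refs in store.items() if obj in live}


store: Store = {
    "commit2": ["tree2"],
    "tree2": ["blobA", "blobC"],
    "commit1": ["tree1"],      # expired commit, no longer a root
    "tree1": ["blobA", "blobB"],
    "blobA": [],
    "blobB": [],
    "blobC": [],
}

kept = sweep(store, roots=["commit2"])
```

Here `blobA` survives because the retained commit still references it, while `blobB` is purged along with the expired commit. Marking a single object as purgeable is therefore not enough; the set of objects that can actually be dropped only falls out of a traversal like this one, which is costly at the object counts mentioned above.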

rejschaap, nearly 9 years ago

Interesting project. I would just like to say that the Git workflow isn't that great and CSV isn't that bad.

The Git workflow is quite complicated and will probably not appeal to people who typically just use Excel for everything.

It is true that CSV is messy, but its strength is that it is really simple, and it can easily be fixed.

Also, CSV can be versioned with Git quite well in many cases.
