In the interest of not propagating this common misconception:<p>"The main problem with Git is that binary files are stored “as is” in the history of the project, so that every single revision of a new binary file (even if just a single byte has changed) is stored in full. [...] On the other hand, source files being mostly text, they are more intelligently handled and typically only differences between revisions are stored in the commits."<p>This is false. Git stores the full version of each file in "loose" format and uses compressed incremental diffs (originally based on xdiff) in packfiles (after "git gc") without distinguishing text vs binary in either case. The issue is that binary files are often compressed themselves (so a one-byte semantic change has nonlocal effect) or have positional references (like jump targets in an executable, causing small changes to cascade).<p>These factors explain the inefficient handling of binary files, but improving efficiency requires changing the semantics. LFS follows in the path of a few other tools (based on smudge/clean filters) that try to hide the semantic difference from the casual user, though that difference seems to bite people more frequently than we'd like.
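If you want to see this for yourself, here's a rough sketch using a throwaway repo (paths and sizes are made up for illustration; whether a particular blob actually gets delta'd depends on git's repack heuristics, so treat it as a demonstration, not a guarantee). `git verify-pack -v` prints delta'd objects with a depth and base SHA, and git applies the same delta machinery to this "binary" content as it would to text:

```shell
set -e
cd "$(mktemp -d)"
git init -q
# 256 KiB of incompressible "binary" data
head -c 262144 /dev/urandom > blob.bin
git add blob.bin
git -c user.name=t -c user.email=t@t commit -qm v1
# flip a single byte and commit again
printf 'x' | dd of=blob.bin bs=1 seek=0 conv=notrunc 2>/dev/null
git add blob.bin
git -c user.name=t -c user.email=t@t commit -qm v2
git gc -q
# delta'd blobs show 7 fields (depth + base SHA); full blobs show 5
git verify-pack -v .git/objects/pack/pack-*.idx | awk '$2 == "blob"'
```

Even though the file is random bytes (so zlib compression does nothing), the second blob packs down to a tiny delta against the first, because the byte-wise delta doesn't care that the content is "binary".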
How uncharitable can a single blog post be! The entire post is discredited by the author repeatedly projecting his unfounded opinions onto GitHub, such as<p>"My guess is that some high-level greedy marketing dickwad, completely unaware of the asinine implications of his brilliant idea, signed off on this dumb-as-a-bag-of-rocks pricing model."<p>"All the marketing material pimping GitHub’s LFS support [...]. I do not believe this is unintentional."<p>"This is completely batshit. The side effect of this pernicious, greedy pricing model is to [...]"<p>"I honestly couldn’t believe that GitHub would be willing to do something that shortsighted, visibly motivated by greed from the cash they thought they could extract from some of their users".<p>Charitable explanation for forks not working: they haven't yet written the code to make this work with forks, and it's better to ship something that works early than to wait until it works in all cases.<p>Charitable explanation for charging for bandwidth: bandwidth costs money. (I believe this is a real problem for Dropbox, which doesn't charge for bandwidth but must still pay for it.) All CDNs charge for bandwidth, as does AWS.<p>Overall, while GitHub may be able to support its OSS folks better by changing the pricing on some parts of its product, this post is incredibly uncharitable. I hope the OP will consider removing the unfounded narrative that he's projecting onto GitHub (especially the "marketing dickwad" thing - wtf) and focus on the facts.<p>[Disclaimer: my company partners with GitHub on lots of stuff]
> Case in point: if a very popular Github repository (such as the one for the Linux kernel) decided to start using LFS for some of their files, they would instantly alienate all of their users. They would no longer be able to properly fork the project, or even clone it to get its binary files stored via LFS. Nobody would be able to send a pull request to Linus as a result without considerable effort.<p>Odd example. Linux doesn't use GitHub pull requests.
There are a lot of assumptions here about GitHub being greedy. I've got no idea how much money it costs GitHub to support Open Source projects, but it must easily be in the millions. I think that by this point GitHub deserves the benefit of the doubt before launching into vicious accusations.
The most interesting takeaway for me was that Microsoft seems to provide the only(?) free git hosting that includes LFS.<p>Does anyone know if their repos support forking in combination with LFS too?
This seems like an odd problem, but I'm not as familiar with Git as I should be. Is there not a reasonable way to download only the most recent version of these large binary files on the initial request, and then download the historical versions only in the (likely very rare) case that the user actually wants to use them? That would seem more useful here than hoping that binary diffs keep the repository small enough.
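Shallow clones do part of this already at the history level, and LFS's design is essentially the rest of it: only small pointer files live in history, and the real content is fetched on demand for the current checkout. A minimal local demonstration of the shallow-clone part, with a throwaway repo standing in for a real remote:

```shell
set -e
cd "$(mktemp -d)"
# build a tiny "remote" with two revisions of a file
git init -q src
printf 'version 1\n' > src/big.bin
git -C src add big.bin
git -C src -c user.name=t -c user.email=t@t commit -qm v1
printf 'version 2\n' > src/big.bin
git -C src add big.bin
git -C src -c user.name=t -c user.email=t@t commit -qm v2
# depth 1: fetch only the newest commit and the blobs it references
git clone -q --depth 1 "file://$PWD/src" shallow
git -C shallow rev-list --count HEAD   # prints 1: history is truncated
# older versions can still be pulled later if actually needed:
git -C shallow fetch -q --unshallow
git -C shallow rev-list --count HEAD   # prints 2: full history restored
```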
Edit: I was wrong, however I learned from the conversation so I am leaving it here! Thanks to those who corrected me.<p>> On the other hand, source files being mostly text, they are more intelligently handled and typically only differences between revisions are stored in the commits.<p>This is completely incorrect: git stores whole blobs from one commit to the next.<p>svn stored patches, but git does not. Every version of a file is stored in its entirety in your git tree since the beginning of the repository's existence. This is one of the reasons why git is so fast. You can go through the objects in your .git directory and verify this for yourself[0].<p><pre><code> $ find .git/objects -type f
.git/objects/ff/a5d733354ae6f8bdc67764d58d87c9a3161f66
.git/objects/ff/deb08f4856bd6eb5b31d7f800b3e480ae3e2e0
$ git cat-file -p ffa5d733354ae6f8bdc67764d58d87c9a3161f66
...file contents appear...
</code></pre>
[0] <a href="https://git-scm.com/book/en/v2/Git-Internals-Git-Objects" rel="nofollow">https://git-scm.com/book/en/v2/Git-Internals-Git-Objects</a>
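A self-contained version of the same check (throwaway repo, contents made up) shows two separate blob objects after editing one file, each holding the complete file rather than a patch — at least in loose format, before `git gc` repacks them:

```shell
set -e
cd "$(mktemp -d)"
git init -q
printf 'line 1\n' > f
git add f
git -c user.name=t -c user.email=t@t commit -qm v1
printf 'line 1\nline 2\n' > f
git add f
git -c user.name=t -c user.email=t@t commit -qm v2
# every blob in the object store is the full file content, not a diff:
git rev-list --objects --all | while read -r sha _; do
  if [ "$(git cat-file -t "$sha")" = blob ]; then
    git cat-file -p "$sha"
    echo ---
  fi
done
```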
I suspect setting up the free LFS reference/test server[1] that GitHub provides would have taken less time than writing this post complaining that GitHub isn't free enough.<p>1: <a href="https://github.com/github/lfs-test-server" rel="nofollow">https://github.com/github/lfs-test-server</a>
At GitLab we're working to support LFS. Initial support might or might not work with forks. As with our Git Annex support today, storage will be free with a soft limit of 10GB of disk space per project (this includes Git, Git Annex, and Git LFS data), and there is no bandwidth limit. It will work with public and private projects (both are free).
I'm not sure I understand why artifacts can't be stored in a different service - even an S3 bucket, if not a real repository service - and fetched dynamically via a build process.<p>Is there a reason why binary blobs need to be stored directly next to code in order to be versioned?
I wonder if Perforce Cloud will be able to fill this role at all. Probably not. Open Source isn't their target audience. But it could be a consideration.<p>Has anyone tried the new Perforce/Git stuff? Is it any good? We're still on an older pre-Helix version.
> I honestly couldn’t believe that GitHub would be willing to do something that shortsighted, visibly motivated by greed from the cash they thought they could extract from some of their users.<p>They're a business. C'mon here.