This is similar to what Google uses internally. See <a href="http://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext" rel="nofollow">http://cacm.acm.org/magazines/2016/7/204032-why-google-store...</a>:<p>"Most developers access Piper through a system called Clients in the Cloud, or CitC, which consists of a cloud-based storage backend and a Linux-only FUSE file system. Developers see their workspaces as directories in the file system, including their changes overlaid on top of the full Piper repository. CitC supports code browsing and normal Unix tools with no need to clone or sync state locally. Developers can browse and edit files anywhere across the Piper repository, and only modified files are stored in their workspace. This structure means CitC workspaces typically consume only a small amount of storage (an average workspace has fewer than 10 files) while presenting a seamless view of the entire Piper codebase to the developer."<p>This is a very powerful model when dealing with large code bases, as it solves the issue of downloading all the code to each client. Kudos to Microsoft for open sourcing it, and under the MIT license no less.
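Conceptually, that "changes overlaid on top of the full repository" model is similar to an overlay mount: reads fall through to a read-only full tree, writes land in a small private layer. A rough Linux analogy (this is not what CitC or GVFS actually use, and the paths are hypothetical):<p><pre><code>    # lowerdir: the full read-only tree; upperdir: your private edits
    sudo mount -t overlay overlay \
      -o lowerdir=/repo/full,upperdir=/home/me/edits,workdir=/home/me/.work \
      /home/me/workspace
    # reads fall through to /repo/full; only modified files
    # end up stored in /home/me/edits
</code></pre>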
There is a discussion thread on r/programming where the MS folks who implemented this answer questions. A lot of questions, like why not use multiple repos, why not git-lfs, why not git subtree, etc., are answered there.<p><a href="https://www.reddit.com/r/programming/comments/5rtlk0/git_virtual_file_system_from_microsoft/" rel="nofollow">https://www.reddit.com/r/programming/comments/5rtlk0/git_vir...</a>
It's interesting how all the cool things seem to come from Microsoft these days.<p>I still think we need something better than Git, though.
Git brought some very cool ideas and the inner workings are reasonably understandable, but the UI is atrociously complicated. And yes, dealing with large files is a very sore point.<p>I'd love to see a second attempt at a distributed version control system.<p>But I applaud MS's initiative. Git's got a lot of traction and mind share already, and they'd probably be heavily criticized if they tried to invent their own thing, even if it was open sourced. It will take a long time for Microsoft to overcome its embrace, extend, and extinguish history.
Using git with large repos and large (binary blob) files has been a pain point for quite a while. There have been several attempts to solve the problem, none of which have really taken off. I think all the attempts have been (too) proprietary – without wide support, it doesn’t get adopted.<p>I'll be watching this to see if Microsoft can break the logjam. By open sourcing the client and protocol, there is potential...<p>Other attempts:<p>* <a href="https://github.com/blog/1986-announcing-git-large-file-storage-lfs" rel="nofollow">https://github.com/blog/1986-announcing-git-large-file-stora...</a><p>* <a href="https://confluence.atlassian.com/bitbucketserver/git-large-file-storage-794364846.html" rel="nofollow">https://confluence.atlassian.com/bitbucketserver/git-large-f...</a><p>Article on GitHub’s implementation and issues (2015):
<a href="https://medium.com/@megastep/github-s-large-file-storage-is-no-panacea-for-open-source-quite-the-opposite-12c0e16a9a91" rel="nofollow">https://medium.com/@megastep/github-s-large-file-storage-is-...</a>
It's disappointing that all the comments are so negative. This is a great idea and solves a real problem for a lot of use cases.<p>I remember that years ago Facebook said it had this problem. A lot of the comments back then centered on the idea that you should change your codebase to fit what git can do. I'm glad there's another option now.
I'm immediately reminded of MVFS and ClearCase. Lots of companies still use ClearCase, but IMO it's not the best tool for the job; git is superior in most dimensions. From what this article says, it's not quite the same as ClearCase, but there are certainly some hints of similarities.<p>The biggest PITA with ClearCase was keeping their lousy MVFS kernel module in sync with ever-advancing Linux distros.<p>I really liked ClearCase in 1999; it was an incredible advancement over other offerings then. MVFS was like "yeah! this is how I'd design a sweet revision control system. Transparent revision access according to a ranked set of rules, read-only files until checked out." But with global collaborators, multi-site was too complex IMO. And overall, ClearCase was so different from other revision control systems that training people on it was a headache. Performance for dynamic views would suffer for elements whose vtrees took a lot of branches. Derived objects no longer made sense -- just too slow. Local disk was cheap by then; it got bigger much faster than object files did.<p>> However, we also have a handful of teams with repos of unusual size! ... You can see that in action when you run “git checkout” and it takes up to 3 hours, or even a simple “git status” takes almost 10 minutes to run. That’s assuming you can get past the “git clone”, which takes 12+ hours.<p>This seems like a way-out-there use case, but it's good to know that there are other solutions. I'd be tempted to partition the codebase by decades or something.
The article doesn't directly say it, but are they migrating the Windows source code repository to git? That seems like a big deal.<p>I seem to recall that Microsoft has previously used a custom Perforce "fork" for their larger code bases (Windows, Server, Office, etc.).
If I understand this correctly, unlike git-annex and git-lfs, this is not about extending the git format with special handling for large files, but about changing the algorithms over the current data format.<p>A custom filesystem is indeed the correct approach, and one that git itself should probably have supported long ago. In fact, there should really only be one "repo" per machine, with name-spaced branches and multiple mountpoints a la `git worktree`. In other words, there should be a system daemon managing a single global object store.<p>I wonder/hope IPFS can benefit from this implementation on Windows, where FUSE isn't an option.
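For the per-repo version of that idea, git already lets several checkouts share a single object store via worktrees. A sketch (URL, paths, and branch names hypothetical):<p><pre><code>    git clone https://example.com/big.git big
    cd big
    git worktree add ../big-feature feature-branch
    git worktree add ../big-hotfix hotfix-branch
    # both extra checkouts share big/.git's single object database
</code></pre>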
This is pretty big news. I know that when I was at Adobe, the only reason Perforce was used for things like Acrobat is that it was simply the only source control solution that could handle the size of the repo. Smaller projects were starting to use Git, but the big projects all stuck with Perforce.
I love this approach. From working at Google I appreciate the virtual filesystem; it makes a lot of things a lot easier. However, all my repos are small enough to fit on a single machine, so I wish there were a mode backed by a local repository where the filesystem still lets git avoid tree scans.<p>Basically, most operations in git are O(modified files), but a few are O(working tree size); checkout and status were the examples mentioned in the article. Those operations can also be made O(modified files) if git doesn't have to scan the working tree for changes.<p>So pretty much I would be all over this if:<p>- It worked locally.<p>- It worked on Linux.<p>Maybe I'll look at how it's implemented and see if I could add the features required. I'm really excited for the future of this project.
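As a crude local approximation of what the virtual filesystem would give you, you can tell git to stop stat()-ing paths you know you won't touch, which takes those paths out of the working-tree scan. It's a blunt instrument (it will also hide real edits to those paths), and the path below is hypothetical:<p><pre><code>    # mark everything under a subtree you never modify
    git ls-files -z vendor/ | xargs -0 git update-index --assume-unchanged
    time git status   # no longer stats those files
    # undo later with: git update-index --no-assume-unchanged
</code></pre>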
Assuming the repo was this big from the beginning, I wonder why they ever migrated to git (I'm assuming they did, because they can tell how long a checkout takes). When somebody first tried the migration, wouldn't they have realized that maybe git is not the right tool for them? Or did they actually migrate and then live with "git status" runs that take 10 minutes until they realized they needed to change something?<p>Also, it would have been interesting if the article had mentioned whether they tried the approaches taken by Facebook (Mercurial, AFAIK) or Google.
Did they really need to make a name collision?<p><a href="https://en.wikipedia.org/wiki/GVfs" rel="nofollow">https://en.wikipedia.org/wiki/GVfs</a>
This sounds like a solid use case and a solid extension for that use case, but definitely not the end-all-be-all.<p>For one, it's not really distributed if you're only downloading a file when you need that specific file.<p>But that doesn't change the merits of this at all, I think.
My sysadmin: "we won't switch to git because it can't handle binary files and our code base is too big"<p>Our whole codebase is 800MB.
Just to make sure I have this right: this has to do with the _number_ of files in their repo and not the _size_ of the files? So projects like git-annex and LFS would not help the speed of these git repos?
> <i>when you run “git checkout” and it takes up to 3 hours, or even a simple “git status” takes almost 10 minutes to run. That’s assuming you can get past the “git clone”, which takes 12+ hours.</i><p>How on Earth can anybody work like that?<p>I'd have thought you may as well ditch git at that point, since nobody's going to be <i>using</i> it as a tool, surely?<p><pre><code>    git commit -m "Add today's work - night all!" && git push; shutdown</code></pre>
Or how about we start compartmentalizing your codebase so that you can, like, you know, organize your code and restore sanity to the known universe.<p>I think when the powers that be said that whole thing about geniuses and clutter, they were specifically talking about their living spaces and not their work...
Does anyone know how Microsoft's open source policy works internally? I'm thinking from a governance perspective, as I'm involved in a similar effort at $WORK.
I had a medium-sized Ruby on Rails project as a git repo inside a VM.<p>It was slow to run 'git status' and other common commands. Restarting the RoR app was also slow. I put the repo on a RAM disk, which made the whole experience at least a few times faster.<p>Since it was all in a VM that I rarely restarted, I didn't have to recreate the files on the RAM disk all that often. I synced changes back to the persistent disk with rsync running periodically.
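For anyone wanting to replicate this, a minimal sketch of the setup described (Linux; paths and sizes hypothetical):<p><pre><code>    sudo mount -t tmpfs -o size=2g tmpfs /mnt/ramrepo
    rsync -a ~/projects/myapp/ /mnt/ramrepo/myapp/
    cd /mnt/ramrepo/myapp        # work here; git status is now fast
    # periodically sync back to persistent disk (e.g. from cron)
    rsync -a --delete /mnt/ramrepo/myapp/ ~/projects/myapp/
</code></pre>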
"For example, the Windows codebase has over 3.5 million files and is over 270 GB in size."<p>Okay, so this is a networking issue. Or is it a stick everything in the same branch issue?<p>Whatever the reason here the issue is pure size vs. network pipe, pure and simple. Hum, when can I get a laptop with a 10GBaseT interface?<p>One of the issue with the way they are doing this (only grab files when needed) is you cannot really work offline anymore.
I'm no expert, but if most individual developers only use 5-10% of the codebase in their daily life, wouldn't it make sense to break the project into multiple codebases of about 5% each and use a build pipeline that combines them when needed?<p>I could definitely be wrong, but this sounds a lot like monolith vs. microservices to me.
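One existing way to stitch split repos back together is submodules, though they bring their own pain. A sketch (URLs and paths hypothetical):<p><pre><code>    git submodule add https://example.com/ui.git components/ui
    git submodule add https://example.com/core.git components/core
    git submodule update --init --recursive   # fetch only the pieces you need
</code></pre>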
Microsoft is moving away from Source Depot to git, it seems. I think it's fantastic that a company like Microsoft is adopting git for its big king and queen projects such as Office and Windows. Also, open sourcing the underlying magic says a lot about the new Microsoft. They're really moving away from not-invented-here syndrome.
MS has been doing really neat stuff lately. I've never worked on a project that takes hours to clone; the largest repository I regularly clone is the Linux repo, and it still takes only a few minutes. Yet I can see GVFS being beneficial for me, as I spend most of my time just reading code (so no need to compile) on my laptop.
Could this also help a smaller repo with a long history that makes the total repo size too large?<p>Every developer needs the whole tree (i.e. it's not possible to do a sparse checkout), but there are many gigs of old versions of small binaries that I would prefer to keep only on the server until I need them (which is never).
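The closest workaround today that I know of is a shallow clone, which leaves old history (and old blobs) on the server. It doesn't give you the per-file on-demand fetching GVFS promises, but for the "never need old versions" case it helps (URL hypothetical):<p><pre><code>    git clone --depth 1 https://example.com/repo.git
    # later, if you ever do need the full history:
    git fetch --unshallow
</code></pre>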
And for all those who still try to stick to anything older:<p><a href="https://github.com/Microsoft/gvfs" rel="nofollow">https://github.com/Microsoft/gvfs</a><p>"GVFS requires Windows 10 Anniversary Update or later."
Check out the GVFS back story and details here:
<a href="https://news.ycombinator.com/item?id=13563439" rel="nofollow">https://news.ycombinator.com/item?id=13563439</a>
Is it really that fucking hard to check if your package name is unique?<p>Here is another virtual filesystem with the exact same name: <a href="https://wiki.gnome.org/Projects/gvfs" rel="nofollow">https://wiki.gnome.org/Projects/gvfs</a><p>Debian package for it: <a href="https://packages.debian.org/jessie/gvfs" rel="nofollow">https://packages.debian.org/jessie/gvfs</a>
Why is that so hard to believe? America is run by Donald Trump.<p>The problem with these companies is that developers aren't making the technical decisions; it's executives who know nothing about computer science. That's why Windows 10 is such a mess with spyware and adware.<p>Now they have some FOSS advocate who doesn't really know anything about software or VCS but saw that an internal problem they were trying to solve was making their code base work with git. So he decided it would be really cool for Microsoft's image to develop an open source extension of git, instead of actually solving the underlying problems (because he didn't recognize them). Now he's probably got a promotion at Microsoft for "fixing" their problem with git.
Interesting: M$ is moving to Git, and the rest of the world is pretty much on GitHub and its alternatives, while Facebook and Google are going with Mercurial. I actually liked Mercurial, apart from its name being a little hard to pronounce, but it doesn't seem to get used anywhere.<p>So are DVCSes converging on Git and Git only?