Facebook engineer here, working on this problem with Joshua.<p>What this comes down to is that git uses a lot of essentially O(n) data structures, and when n gets big, that can be painful.<p>A few examples:<p>* There's no secondary index from file or path name to commit hash. This is what slows down operations like "git blame": they have to search every commit to see if it touched a file.<p>* Since git uses lstat to see if files have been changed, the sheer number of system calls on a large filesystem becomes an issue. If the dentry and inode caches aren't warm, you spend a ton of time waiting on disk I/O.<p>An inotify daemon could help, but it's not perfect: it needs a long time to warm up in the case of a reboot or crash. Also, inotify is an incredibly tricky interface to use efficiently and reliably. (I wrote the inotify support in Mercurial, FWIW.)<p>* The index is also a performance problem. On a big repo, it's 100MB+ in size (hence expensive to read), <i>and</i> the whole thing is rewritten from scratch any time it needs to be touched (e.g. a single file's stat entry goes stale).<p>None of these problems is insurmountable, but neither is any of them amenable to an easy solution. (And no, "split up the tree" is not an easy solution.)
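To make the lstat point concrete, here is a minimal sketch (in Python, with hypothetical paths and a simplified index layout; real git tracks more fields per entry) of what a status-style freshness check boils down to: one lstat per tracked file, compared against cached metadata.<p><pre><code> import os

 # Hypothetical cached index: path -> (mtime, size) recorded at the last
 # checkout/commit. Real git also stores ctime, inode, mode, etc., but the
 # principle is the same: one lstat syscall per tracked file.
 cached_index = {
     "www/index.php": (1331668000, 2048),
     "lib/core/db.php": (1331667500, 10240),
     # ... imagine ~1.3 million more entries in a repo this size
 }

 def possibly_modified(repo_root, index):
     """Return paths whose on-disk metadata no longer matches the index."""
     dirty = []
     for path, (mtime, size) in index.items():
         try:
             st = os.lstat(os.path.join(repo_root, path))  # one syscall per file
         except OSError:
             dirty.append(path)          # deleted or unreadable
             continue
         if int(st.st_mtime) != mtime or st.st_size != size:
             dirty.append(path)          # candidate for a content comparison
     return dirty

 # With ~1.3M tracked files this is ~1.3M lstat calls: cheap when the dentry
 # and inode caches are warm, dominated by disk seeks when they are cold.
</code></pre>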
Wow. I was expecting an interesting discussion. I was disappointed. Apparently the consensus on Hacker News is that there exists a repository size N above which the benefits of splitting the repo _always_ outweigh the negatives. And, if that wasn't absurd enough, we've decided that git can already handle N and that the repository in question is clearly above N. And I guess all along we'll ignore the many massive organizations who cannot and will not use git for precisely this issue.<p>So instead of a potentially very enlightening conversation identifying and talking about limitations and possible solutions in git, we've decided that anyone who can't use git because of its perf issues is "doing it wrong".
Yes, it's well known that big companies with big continuously integrated codebases don't manage the entire codebase with Git. It's slow, and splitting repositories means you can't have company-wide atomic commits. It's convenient to have a bunch of separate projects that share no state or code, but also wasteful.<p>So often, the tool used to manage the central repository, which needs to cleanly handle a large codebase, is different from the tool developers use for day-to-day work, which only needs to handle a small subset. At Google, everything is in Perforce, but since I personally need only four or five projects from Perforce for my work, I mirror that to git and interact with git on a day-to-day basis. This model seems to scale fairly well; Google has a big codebase with a lot of reuse, but all my git operations execute instantaneously.<p>Many projects can "shard" their code across repositories, but this is usually an unhappy compromise.<p>People always use the Linux kernel as an example of a big project, but even as open source projects go, it's pretty tiny. Compare the entire CPAN to Linux, for example. It's nice that I can update CPAN modules one at a time, but it would be nicer if I could fix a bug in my module and all modules that depend on it in one commit. But I can't, because CPAN is sharded across many different developers and repositories. This makes working on one module fast but working on a subset of modules impossible.<p>So really, Facebook is not being ridiculous here. Many companies have the same problem and decide not to handle it at all. Facebook realizes they want great developer tools <i>and</i> continuous integration across all their projects. And Git just doesn't work for that.
This looks like it could be of assistance:<p><a href="http://source.android.com/source/version-control.html" rel="nofollow">http://source.android.com/source/version-control.html</a><p><pre><code> Repo is a repository management tool that we built on top
of Git. Repo unifies the many Git repositories when
necessary, does the uploads to our revision control
system, and automates parts of the Android development
workflow. Repo is not meant to replace Git, only to make
it easier to work with Git in the context of Android. The
repo command is an executable Python script that you can
put anywhere in your path. In working with the Android
source files, you will use Repo for across-network
operations. For example, with a single Repo command you
can download files from multiple repositories into your
local working directory.
</code></pre>
<a href="http://google-opensource.blogspot.com/2008/11/gerrit-and-repo-android-source.html" rel="nofollow">http://google-opensource.blogspot.com/2008/11/gerrit-and-rep...</a><p><pre><code> With approximately 8.5 million lines of code (not
including things like the Linux Kernel!), keeping this all
in one git tree would've been problematic for a few reasons:
* We want to delineate access control based on location in the tree.
* We want to be able to make some components replaceable at a later date.
* We needed trivial overlays for OEMs and other projects who either aren't ready or aren't able to embrace open source.
* We don't want our most technical people to spend their time as patch monkeys.
The repo tool uses an XML-based manifest file describing
where the upstream repositories are, and how to merge them
into a single working checkout. repo will recurse across
all the git subtrees and handle uploads, pulls, and other
needed items. repo has built-in knowledge of topic
branches and makes working with them an essential part of
the workflow.
</code></pre>
Looks like it's worth taking a serious look at this repo script, as it's been used in production for Android. Might allow splitting into multiple git repositories for performance while still retaining some of the benefits of a single repository.
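For a feel of what repo does under the hood, here is a rough sketch (not the actual repo tool; the manifest fields are simplified from the real XML format, and the paths are hypothetical) of the core idea: read a manifest of upstream repositories and materialize them into one working directory.<p><pre><code> import os
 import subprocess
 import xml.etree.ElementTree as ET

 def sync_from_manifest(manifest_path, workdir):
     """Clone or update every project listed in a repo-style manifest."""
     tree = ET.parse(manifest_path)
     fetch_base = tree.find("remote").get("fetch")        # base fetch URL
     default_rev = tree.find("default").get("revision")   # e.g. a branch name

     for project in tree.findall("project"):
         name = project.get("name")                       # path on the server
         path = os.path.join(workdir, project.get("path", name))
         rev = project.get("revision", default_rev)
         if not os.path.isdir(os.path.join(path, ".git")):
             subprocess.check_call(["git", "clone", fetch_base + "/" + name, path])
         else:
             subprocess.check_call(["git", "fetch", "origin"], cwd=path)
         subprocess.check_call(["git", "checkout", rev], cwd=path)

 # Usage (hypothetical manifest file):
 # sync_from_manifest("default.xml", "android")
</code></pre>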
Huh, fascinating. Git was initially created for Linux kernel development, and I haven't heard of any issues there. Offhand I would have said that, as a codebase, the Linux kernel would be larger and more complex than Facebook's, but I don't have a great sense of everything involved in either case.<p>So what's the story here: kernel developers put up with longer git times, the kernel is better organized, the scope of Facebook is even more massive than the Linux kernel, or there's some inherent design in git that works better for kernel work than web work?
While I'd be interested in seeing this issue further unfold, just the prospect of a 1.3M-file repo gives me the creeps.<p>I'm not sure what the exact situation at Facebook is with this repository, but I'm positive that if they had to start with a clean slate, this repo would easily find itself broken up into at least a dozen different repos.<p>Not to mention the fact that if _git_ has issues dealing with 1.3M files, I wonder what other (D)VCS they're thinking of as an alternative that would be more performant.
<a href="http://thread.gmane.org/gmane.comp.version-control.git/189776" rel="nofollow">http://thread.gmane.org/gmane.comp.version-control.git/18977...</a><p>They keep every project in a single repo, mystery solved.<p>Edit:<p>> We already have some of the easily separable projects in separate repositories, like HPHP.<p>Yeah, because it makes no sense, it's C++. They probably use for everything PHP i assume then. Is there no good build management tool for it?
Somewhat off-topic, could somebody explain why it's<p><pre><code> echo 3 | tee /proc/sys/vm/drop_caches
</code></pre>
rather than just<p><pre><code> echo 3 > /proc/sys/vm/drop_caches
</code></pre>
Is it because the output to stdout lets you be extra sure that the right data was sent to the kernel?<p>I'm just wondering if this is an idiom with a deeper meaning that I'm not aware of.<p>EDIT: I'm guessing that when you run it in a script (without set -x), rather than on the command line, you can see in the log what it is you sent?
Others have tried, and they keep throwing more and more smart people at a problem they just shouldn't have.<p>MSFT, with a Windows codebase that runs out of several labs and crazy branching and merging infrastructure: they use Source Depot, originally a clone of Perforce.<p>Google, with all their source code in one Perforce repo.<p>Facebook will be on Perforce before we know it.<p>The solution is an internal GitHub, not one giant project.
Large repos bring their own problems and drive design decisions accordingly. For example, Visual Studio itself is 5M+ files, and this shaped some of the initial design decisions (server-side workspaces, in this case) when developing TFS 2005 (the first version) [1]. That decision suits MS well, but not small to medium clients. So they're now offering client-side workspaces as an alternative to that design.<p>It's not wise to tell Facebook to split the repository. It looks like it's time to improve the tool.<p>[1] <a href="http://blogs.msdn.com/b/bharry/archive/2011/08/02/version-control-model-enhancements-in-tfs-11.aspx" rel="nofollow">http://blogs.msdn.com/b/bharry/archive/2011/08/02/version-co...</a>
Having worked with a former Facebook employee, I can believe this. They do not believe in separating or distilling anything into separate repos. Why the fuck would you want to have a 15GB repo?<p>Ideally they should have many small, manageable repositories that are well tested and owned by a specific group/person/whatever. At least something small enough that a single dev or team can get their head around.<p>Sheesh.
The obvious answer, repeatedly mentioned in comments:<p>> factor into modules, one project per repo<p>Where I work we have a project with clear module boundaries, but all in the same repo. We have an "app" and some dependencies, including our platform/web framework. None of these are stable; they're all growing together. Commits on the app require changes in the platform, and in code review it is helpful to see things all together. Porting commits across different branches requires porting both the application change and the dependent platform changes. Often a client-specific branch will require severe one-off changes, so the platform may diverge; it is not practical for us (right now) to continually rebase client branches onto the latest platform.<p>This is just our experience, not Facebook's, but let's face it: real-life software isn't black and white, and discussion that doesn't acknowledge this isn't particularly helpful.
I hope these guys do take the route of developing patches for large-scale performance.<p>Git has so many interesting uses at scale as just a tool that navigates and tracks DAGs over time.
This was actually pretty fascinating to me. On one hand, I am astonished at how long it takes to perform seemingly trivial git operations on repositories at this scale. On the other hand, I'm utterly mystified that a company like Facebook has such monolithic repositories. Even back when I was using SVN a lot, I relied on externals and such to break up large projects into their smaller service-level components.<p>I'd be very interested to see some benchmarks on their current VCS solution for repositories of this scale.
$100B company, maybe they can afford to put some people onto solving this for the open software community (and put the solution into the open), especially since nobody else in the community seems to have this problem.
This is Joshua (who posted the original email). I'm glad to see so much interest in source control scalability. If there are others who have ever contemplated investing a bit of time to improving git, it'd be great to coordinate and see what makes sense to do - even if it turns out that the right answer is just to make the tools that manage multiple repos so good that it feels as easy as a single repo.
There are two issues: the width of the repository (the number of files) and the depth (the number of commits).<p>Since "status" and "commit" perform fairly well after the OS file cache has been warmed up, that can probably be resolved by having background processes that keep it warm. (Also, how long would it take to simply stat that number of files?)<p>The issue of "blame" still taking over 10 minutes: we'd need to know how far back in the history they're searching. What happens if there's one line that hasn't been changed since the initial commit? Are you forced to go back through the whole commit history?<p>How old is the repository? Years? Months? I'm guessing at least years, based on the number of commits (unless the developers are extremely commit-happy).<p>At a certain point, you're going to be better off taking the tip revisions off a branch and starting a fresh repository. It doesn't matter what SCM/VCS tool you're using (I've been the architect and admin on the implementation of a number of commercial tools). Keep the old repository live for a while and then archive it.<p>You'll find that while everyone wants to say that they absolutely need the full revision history of every project, you rarely go back very far (usually just the last major release or two). And if you do need that history, you can pull it from the archives.
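On the "how long would it take to simply stat that number of files" question, a rough measurement is easy to make; here is a sketch (the path below is hypothetical) that times a full lstat pass over a working tree.<p><pre><code> import os
 import time

 def time_stat_pass(root):
     """Walk a tree and lstat every file, returning (file_count, seconds)."""
     count = 0
     start = time.time()
     for dirpath, dirnames, filenames in os.walk(root):
         for name in filenames:
             try:
                 os.lstat(os.path.join(dirpath, name))
                 count += 1
             except OSError:
                 pass
     return count, time.time() - start

 # Run it twice: the first pass shows cold-cache behaviour (disk seeks),
 # the second shows warm dentry/inode caches, which is roughly what a
 # background cache-warming process would buy you.
 files, secs = time_stat_pass("/path/to/worktree")
 print("%d files stat'ed in %.1fs" % (files, secs))
</code></pre>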
Git was designed for the Linux kernel, and that codebase is simply not big: a few tens of thousands of files, broken up into directories of dozens or hundreds of files.<p><a href="http://www.schoenitzer.de/lks/lks_en.html#new_files" rel="nofollow">http://www.schoenitzer.de/lks/lks_en.html#new_files</a>
This is an interesting social AND technical problem. The problem for FB is that it is all too easy for them to just fork git, create the necessary interfaces, and then hope the git maintainers would accept it (they mightn't) or release it into the wild (and incur bad karma and the wrath of OSS developers who'd see this as schism or even heresy).<p>They've reached out to the git developers, and I guess that's a first step.
I'm surprised Facebook and all its peripheral development has that much source. I would expect something like 5-10 million lines of code, not ~100 million lines implied by the example.
I don't think Git was designed to perform well with such a large repo. In this case, the best practice is probably to compartmentalize the code and use Git submodules. The Git submodule interface is a little unfriendly, but I think it does work well for such large repos. I've been using submodules successfully for our development, which tracks source files as well as binary assets.
I think it's a bad practice to keep a giant code base in one repo. Split the code base into purpose-specific modules, just as you would split any project into purpose-specific modules. In fact, those two things might well line up 1:1.<p>If a project depends on other projects, have it reference the other projects. Where appropriate, include exact version numbers and/or commit hashes. Gemfiles are good examples of this good practice at work.<p>Yes, git has submodules for this sort of thing, but after investigating that route, I decided against using git submodules. Use something independent of the VCS instead. Then git won't do weird or unexpected things when you switch branches. Also, you might want to mix in projects that use other version control systems. And really, why unnecessarily couple a project to its version control system?<p>If (when?), even after splitting a megaproject into manageable subprojects, these performance issues creep in, I'd certainly be interested in whatever improvements people are coming up with...
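As a minimal illustration of the "reference other projects by exact commit, independently of the VCS" idea above (the file layout, names, and hashes here are hypothetical, not any particular tool), a project could carry a small pin list and a script that materializes it:<p><pre><code> import os
 import subprocess

 # Hypothetical pin list: dependency name -> (clone URL, exact commit hash).
 PINS = {
     "webframework": ("git@example.com:platform/webframework.git",
                      "0123456789abcdef0123456789abcdef01234567"),   # placeholder sha
     "shared-assets": ("git@example.com:platform/shared-assets.git",
                       "89abcdef0123456789abcdef0123456789abcdef"),  # placeholder sha
 }

 def materialize(deps_dir="deps"):
     """Check out every pinned dependency at its exact recorded commit."""
     for name, (url, commit) in PINS.items():
         path = os.path.join(deps_dir, name)
         if not os.path.isdir(path):
             subprocess.check_call(["git", "clone", url, path])
         subprocess.check_call(["git", "fetch", "origin"], cwd=path)
         subprocess.check_call(["git", "checkout", commit], cwd=path)

 if __name__ == "__main__":
     materialize()
</code></pre>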
I'm curious what their performance numbers look like if they host the .git repo on tmpfs -- 15GB isn't unreasonable on a beefy (24-32GB of ram) machine.
Hey, do you know many sites with such needs? Facebook is about the first, and I could probably count all the sites with such needs on the fingers of one hand. I don't think it's a git issue: everyone uses this system and is happy with it. This is more like a feature request than an issue.
If your git repo were this crazy size, why wouldn't you make a tool that took the repo and versioned it? Keep it in repos that can each stay performant, since most of the time you are working with "time local" information.
My first thought, as suggested by some on the list, was modularization. Redstone's response (that the 1.3 million files are essentially all interdependent) terrifies me.
That's projected growth for two of their projects. Sounds like they have something brewing...<p>Still amazed that breaking it up would do more harm than good when the code isn't even written yet...
Multiple people in this discussion have asserted that code sharing is <i>way</i> easier when all the code is in a single <i>repo</i>, but from my understanding of submodules, it would be a fairly simple matter of setting up your pre/post-commit hooks to update submodules to a branch automatically and get useful company-wide change atomicity (after all, changes should only propagate between teams/projects once they have some stability).<p>Putting aside the question of whether or not an enormous singular repo can be broken up intelligently into modular projects, is there something about the submodule approach that makes it a uniquely unsuitable way to share changes among projects?
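As one sketch of what that hook-based propagation could look like (an assumption about one possible setup, not an established workflow; the submodule paths and branch name are hypothetical), a post-commit hook in a consuming superproject might advance each submodule to the tip of an agreed branch and record the bump:<p><pre><code> #!/usr/bin/env python
 # Hypothetical .git/hooks/post-commit for a superproject that consumes
 # shared code as submodules tracking a "stable" branch.
 import subprocess

 SUBMODULES = ["platform", "shared-ui"]   # hypothetical submodule paths
 BRANCH = "stable"

 def bump_submodules():
     for path in SUBMODULES:
         # Move each submodule to the tip of the agreed branch.
         subprocess.check_call(["git", "fetch", "origin", BRANCH], cwd=path)
         subprocess.check_call(["git", "checkout", "FETCH_HEAD"], cwd=path)
         # Staging the submodule path records its new commit in the superproject.
         subprocess.check_call(["git", "add", path])
     # Commit only if some submodule pointer actually moved.
     if subprocess.call(["git", "diff", "--cached", "--quiet"]) != 0:
         subprocess.check_call(["git", "commit", "-m", "Bump submodules"])

 if __name__ == "__main__":
     last_msg = subprocess.check_output(
         ["git", "log", "-1", "--pretty=%s"]).decode().strip()
     # Guard: don't re-trigger ourselves from the bump commit itself.
     if not last_msg.startswith("Bump submodules"):
         bump_submodules()
</code></pre>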
Git is not memory-efficient by design. I used to push commits of about 1 GB to the server and it would hang forever; I had to abort and push in smaller chunks instead.
Does Facebook really believe that because they have the most users, they should also have the biggest git repo?<p>Amazon and Google have already solved this problem, and the solution is to reorganize things into smaller manageable packages.