Facebook engineer here, working on this problem with Joshua.<p>What this comes down to is that git uses a lot of essentially O(n) data structures, and when n gets big, that can be painful.<p>A few examples:<p>* There's no secondary index from file or path name to commit hash. This is what slows down operations like "git blame": they have to search every commit to see if it touched a file.<p>* Since git uses lstat to see if files have been changed, the sheer number of system calls on a large filesystem becomes an issue. If the dentry and inode caches aren't warm, you spend a ton of time waiting on disk I/O.<p>An inotify daemon could help, but it's not perfect: it needs a long time to warm up in the case of a reboot or crash. Also, inotify is an incredibly tricky interface to use efficiently and reliably. (I wrote the inotify support in Mercurial, FWIW.)<p>* The index is also a performance problem. On a big repo, it's 100MB+ in size (hence expensive to read), <i>and</i> the whole thing is rewritten from scratch any time it needs to be touched (e.g. a single file's stat entry goes stale).<p>None of these problems is insurmountable, but neither is any of them amenable to an easy solution. (And no, "split up the tree" is not an easy solution.)
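To make the lstat point concrete, here is a minimal sketch (in Python, with hypothetical paths and a simplified index layout; real git tracks more fields per entry) of what a status-style freshness check boils down to: one lstat per tracked file, compared against cached metadata.<p><pre><code> import os

 # Hypothetical cached index: path -> (mtime, size) recorded at the last
 # checkout/commit. Real git also stores ctime, inode, mode, etc., but the
 # principle is the same: one lstat syscall per tracked file.
 cached_index = {
     "www/index.php": (1331668000, 2048),
     "lib/core/db.php": (1331667500, 10240),
     # ... imagine ~1.3 million more entries in a repo this size
 }

 def possibly_modified(repo_root, index):
     """Return paths whose on-disk metadata no longer matches the index."""
     dirty = []
     for path, (mtime, size) in index.items():
         try:
             st = os.lstat(os.path.join(repo_root, path))  # one syscall per file
         except OSError:
             dirty.append(path)          # deleted or unreadable
             continue
         if int(st.st_mtime) != mtime or st.st_size != size:
             dirty.append(path)          # candidate for a content comparison
     return dirty

 # With ~1.3M tracked files this is ~1.3M lstat calls: cheap when the dentry
 # and inode caches are warm, dominated by disk seeks when they are cold.
</code></pre>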
Wow. I was expecting an interesting discussion. I was disappointed. Apparently the consensus on Hacker News is that there exists a repository size N above which the benefits of splitting the repo _always_ outweigh the negatives. And, if that wasn't absurd enough, we've decided that git can already handle N and that the repository in question is clearly above N. And I guess all along we'll ignore the many massive organizations who cannot and will not use git for precisely this issue.<p>So instead of a potentially very enlightening conversation identifying and talking about limitations and possible solutions in git, we've decided that anyone who can't use git because of its perf issues is "doing it wrong".
Yes, it's well known that big companies with big continuously integrated codebases don't manage the entire codebase with Git. It's slow, and splitting repositories means you can't have company-wide atomic commits. It's convenient to have a bunch of separate projects that share no state or code, but also wasteful.<p>So often, the tool used to manage the central repository, which needs to cleanly handle a large codebase, is different from the tool developers use for day-to-day work, which only needs to handle a small subset. At Google, everything is in Perforce, but since I personally need only four or five projects from Perforce for my work, I mirror that to git and interact with git on a day-to-day basis. This model seems to scale fairly well; Google has a big codebase with a lot of reuse, but all my git operations execute instantaneously.<p>Many projects can "shard" their code across repositories, but this is usually an unhappy compromise.<p>People always use the Linux kernel as an example of a big project, but even as open source projects go, it's pretty tiny. Compare the entire CPAN to Linux, for example. It's nice that I can update CPAN modules one at a time, but it would be nicer if I could fix a bug in my module and all modules that depend on it in one commit. But I can't, because CPAN is sharded across many different developers and repositories. This makes working on one module fast but working on a subset of modules impossible.<p>So really, Facebook is not being ridiculous here. Many companies have the same problem and decide not to handle it at all. Facebook realizes they want great developer tools <i>and</i> continuous integration across all their projects. And Git just doesn't work for that.
This looks like it could be of assistance:<p><a href="http://source.android.com/source/version-control.html" rel="nofollow">http://source.android.com/source/version-control.html</a><p><pre><code> Repo is a repository management tool that we built on top
of Git. Repo unifies the many Git repositories when
necessary, does the uploads to our revision control
system, and automates parts of the Android development
workflow. Repo is not meant to replace Git, only to make
it easier to work with Git in the context of Android. The
repo command is an executable Python script that you can
put anywhere in your path. In working with the Android
source files, you will use Repo for across-network
operations. For example, with a single Repo command you
can download files from multiple repositories into your
local working directory.
</code></pre>
<a href="http://google-opensource.blogspot.com/2008/11/gerrit-and-repo-android-source.html" rel="nofollow">http://google-opensource.blogspot.com/2008/11/gerrit-and-rep...</a><p><pre><code> With approximately 8.5 million lines of code (not
including things like the Linux Kernel!), keeping this all
in one git tree would've been problematic for a few reasons:
* We want to delineate access control based on location in the tree.
* We want to be able to make some components replaceable at a later date.
* We needed trivial overlays for OEMs and other projects who either aren't ready or aren't able to embrace open source.
* We don't want our most technical people to spend their time as patch monkeys.
The repo tool uses an XML-based manifest file describing
where the upstream repositories are, and how to merge them
into a single working checkout. repo will recurse across
all the git subtrees and handle uploads, pulls, and other
needed items. repo has built-in knowledge of topic
branches and makes working with them an essential part of
the workflow.
</code></pre>
Looks like it's worth taking a serious look at this repo script, as it's been used in production for Android. Might allow splitting into multiple git repositories for performance while still retaining some of the benefits of a single repository.
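For a feel of what repo does under the hood, here is a rough sketch (not the actual repo tool; the manifest fields are simplified from the real XML format, and the paths are hypothetical) of the core idea: read a manifest of upstream repositories and materialize them into one working directory.<p><pre><code> import os
 import subprocess
 import xml.etree.ElementTree as ET

 def sync_from_manifest(manifest_path, workdir):
     """Clone or update every project listed in a repo-style manifest."""
     tree = ET.parse(manifest_path)
     fetch_base = tree.find("remote").get("fetch")        # base fetch URL
     default_rev = tree.find("default").get("revision")   # e.g. a branch name

     for project in tree.findall("project"):
         name = project.get("name")                       # path on the server
         path = os.path.join(workdir, project.get("path", name))
         rev = project.get("revision", default_rev)
         if not os.path.isdir(os.path.join(path, ".git")):
             subprocess.check_call(["git", "clone", fetch_base + "/" + name, path])
         else:
             subprocess.check_call(["git", "fetch", "origin"], cwd=path)
         subprocess.check_call(["git", "checkout", rev], cwd=path)

 # Usage (hypothetical manifest file):
 # sync_from_manifest("default.xml", "android")
</code></pre>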
Huh, fascinating. Git was initially created for Linux kernel development, and I haven't heard of any issues there. Offhand I would have said that, as a codebase, the Linux kernel would be larger and more complex than Facebook's, but I don't have a great sense of everything involved in either case.<p>So what's the story here: kernel developers put up with longer git times, the kernel is better organized, the scope of Facebook is even more massive than the Linux kernel, or there's some inherent design in git that works better for kernel work than web work?
While I'd be interested in seeing this issue further unfold, just the prospect of a 1.3M-file repo gives me the creeps.<p>I'm not sure what the exact situation at Facebook is with this repository, but I'm positive that if they had to start with a clean slate, this repo would easily find itself broken up into at least a dozen different repos.<p>Not to mention the fact that if _git_ has issues dealing with 1.3M files, I wonder what other (D)VCS they're thinking of as an alternative that would be more performant.
<a href="http://thread.gmane.org/gmane.comp.version-control.git/189776" rel="nofollow">http://thread.gmane.org/gmane.comp.version-control.git/18977...</a><p>They keep every project in a single repo, mystery solved.<p>Edit:<p>> We already have some of the easily separable projects in separate repositories, like HPHP.<p>Yeah, because it makes no sense, it's C++. They probably use for everything PHP i assume then. Is there no good build management tool for it?
Somewhat off-topic, could somebody explain why it's<p><pre><code> echo 3 | tee /proc/sys/vm/drop_caches
</code></pre>
rather than just<p><pre><code> echo 3 > /proc/sys/vm/drop_caches
</code></pre>
Is it because the output to stdout lets you be extra sure that the right data was sent to the kernel?<p>I'm just wondering if this is an idiom with a deeper meaning that I'm not aware of.<p>EDIT: I'm guessing that when you run it in a script (without set -x), rather than on the command line, you can see in the log what it is you sent?
Others have tried, and they keep throwing more and more smart people at a problem they just shouldn't have.<p>MSFT, with a Windows codebase that runs out of several labs and crazy branching and merging infrastructure: they use Source Depot, originally a clone of Perforce.<p>Google, with all their source code in one Perforce repo.<p>Facebook will be on Perforce before we know it.<p>The solution is an internal GitHub, not one giant project.
Large repos bring their own problems and drive design decisions accordingly. For example, Visual Studio itself is 5M+ files, and this shaped some of the initial design decisions (server-side workspaces, in this case) when developing TFS 2005 (the first version) [1]. That decision suits MS well, but not small to medium clients. So they're now offering client-side workspaces as an alternative to that design.<p>It's not wise to tell Facebook to split the repository. It looks like it's time to improve the tool.<p>[1] <a href="http://blogs.msdn.com/b/bharry/archive/2011/08/02/version-control-model-enhancements-in-tfs-11.aspx" rel="nofollow">http://blogs.msdn.com/b/bharry/archive/2011/08/02/version-co...</a>
Having worked with a former Facebook employee, I can believe this. They do not believe in separating or distilling anything into separate repos. Why the fuck would you want to have a 15GB repo?<p>Ideally they should have many small, manageable repositories that are well tested and owned by a specific group/person/whatever. At least something small enough that a single dev or team can get their head around.<p>Sheesh.
The obvious answer, repeatedly mentioned in comments:<p>> factor into modules, one project per repo<p>Where I work we have a project with clear module boundaries, but all in the same repo. We have an "app" and some dependencies, including our platform/web framework. None of these are stable; they're all growing together. Commits on the app require changes in the platform, and in code review it is helpful to see things all together. Porting commits across different branches requires porting both the application change and the dependent platform changes. Often a client-specific branch will require severe one-off changes, so the platform may diverge; it is not practical for us (right now) to continually rebase client branches onto the latest platform.<p>This is just our experience, not Facebook's, but let's face it: real-life software isn't black and white, and discussion that doesn't acknowledge this isn't particularly helpful.
I hope these guys do take the route of developing patches for large-scale performance.<p>Git has so many interesting uses at scale as just a tool that navigates and tracks DAGs over time.
This was actually pretty fascinating to me. On one hand, I am astonished at how long it takes to perform seemingly trivial git operations on repositories at this scale. On the other hand, I'm utterly mystified that a company like Facebook has such monolithic repositories. Even back when I was using SVN a lot, I relied on externals and such to break up large projects into their smaller service-level components.<p>I'd be very interested to see some benchmarks on their current VCS solution for repositories of this scale.
$100B company, maybe they can afford to put some people onto solving this for the open software community (and put the solution into the open), especially since nobody else in the community seems to have this problem.
This is Joshua (who posted the original email). I'm glad to see so much interest in source control scalability. If there are others who have ever contemplated investing a bit of time to improving git, it'd be great to coordinate and see what makes sense to do - even if it turns out that the right answer is just to make the tools that manage multiple repos so good that it feels as easy as a single repo.
There are two issues: the width of the repository (the number of files) and the depth (the number of commits).<p>Since "status" and "commit" perform fairly well after the OS file cache has been warmed up, that can probably be resolved by having background processes that keep it warm. (Also, how long would it take to simply stat that number of files?)<p>The issue of "blame" still taking over 10 minutes: we'd need to know how far back in the history they're searching. What happens if there's one line that hasn't been changed since the initial commit? Are you forced to go back through the whole commit history?<p>How old is the repository? Years? Months? I'm guessing at least years, based on the number of commits (unless the developers are extremely commit-happy).<p>At a certain point, you're going to be better off taking the tip revisions off a branch and starting a fresh repository. It doesn't matter what SCM/VCS tool you're using (I've been the architect and admin on the implementation of a number of commercial tools). Keep the old repository live for a while and then archive it.<p>You'll find that while everyone wants to say that they absolutely need the full revision history of every project, you rarely go back very far (usually just the last major release or two). And if you do need that history, you can pull it from the archives.
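On the "how long would it take to simply stat that number of files" question, a rough measurement is easy to make; here is a sketch (the path below is hypothetical) that times a full lstat pass over a working tree.<p><pre><code> import os
 import time

 def time_stat_pass(root):
     """Walk a tree and lstat every file, returning (file_count, seconds)."""
     count = 0
     start = time.time()
     for dirpath, dirnames, filenames in os.walk(root):
         for name in filenames:
             try:
                 os.lstat(os.path.join(dirpath, name))
                 count += 1
             except OSError:
                 pass
     return count, time.time() - start

 # Run it twice: the first pass shows cold-cache behaviour (disk seeks),
 # the second shows warm dentry/inode caches, which is roughly what a
 # background cache-warming process would buy you.
 files, secs = time_stat_pass("/path/to/worktree")
 print("%d files stat'ed in %.1fs" % (files, secs))
</code></pre>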
Git was designed for the Linux kernel, and that codebase is simply not big: a few tens of thousands of files, broken up into directories of dozens or hundreds of files.<p><a href="http://www.schoenitzer.de/lks/lks_en.html#new_files" rel="nofollow">http://www.schoenitzer.de/lks/lks_en.html#new_files</a>
This is an interesting social AND technical problem. The problem for FB is that it is all too easy for them to just fork git, create the necessary interfaces, and then hope the git maintainers would accept it (they mightn't) or release it into the wild (and incur bad karma and the wrath of OSS developers who'd see this as schism or even heresy).<p>They've reached out to the git developers, and I guess that's a first step.
I'm surprised Facebook and all its peripheral development has that much source. I would expect something like 5-10 million lines of code, not ~100 million lines implied by the example.
I don't think Git was designed to perform well with such a large repo. In this case, the best practice is probably to compartmentalize the code and use Git submodules. The Git submodule interface is a little unfriendly, but I think it does work well for such large repos. I've been using submodules successfully for our development, which tracks source files as well as binary assets.
I think it's a bad practice to keep a giant code base in one repo. Split the code base into purpose-specific modules, just as you would split any project into purpose-specific modules. In fact, those two things might well line up 1:1.<p>If a project depends on other projects, have it reference the other projects. Where appropriate, include exact version numbers and/or commit hashes. Gemfiles are good examples of this good practice at work.<p>Yes, git has submodules for this sort of thing, but after investigating that route, I decided against using git submodules. Use something independent of the VCS instead. Then git won't do weird or unexpected things when you switch branches. Also, you might want to mix in projects that use other version control systems. And really, why unnecessarily couple a project to its version control system?<p>If (when?), even after splitting a megaproject into manageable subprojects, these performance issues creep in, I'd certainly be interested in whatever improvements people are coming up with...
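As a minimal illustration of the "reference other projects by exact commit, independently of the VCS" idea above (the file layout, names, and hashes here are hypothetical, not any particular tool), a project could carry a small pin list and a script that materializes it:<p><pre><code> import os
 import subprocess

 # Hypothetical pin list: dependency name -> (clone URL, exact commit hash).
 PINS = {
     "webframework": ("git@example.com:platform/webframework.git",
                      "0123456789abcdef0123456789abcdef01234567"),   # placeholder sha
     "shared-assets": ("git@example.com:platform/shared-assets.git",
                       "89abcdef0123456789abcdef0123456789abcdef"),  # placeholder sha
 }

 def materialize(deps_dir="deps"):
     """Check out every pinned dependency at its exact recorded commit."""
     for name, (url, commit) in PINS.items():
         path = os.path.join(deps_dir, name)
         if not os.path.isdir(path):
             subprocess.check_call(["git", "clone", url, path])
         subprocess.check_call(["git", "fetch", "origin"], cwd=path)
         subprocess.check_call(["git", "checkout", commit], cwd=path)

 if __name__ == "__main__":
     materialize()
</code></pre>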
I'm curious what their performance numbers look like if they host the .git repo on tmpfs -- 15GB isn't unreasonable on a beefy (24-32GB of ram) machine.
Hey, do you know many sites with such needs? Facebook is about the first, and I could probably count all the sites with such needs on the fingers of one hand. I don't think it's a git issue: everyone uses this system and is happy with it. This is more like a feature request than an issue.
If your git repo were this crazy size, why wouldn't you make a tool that took the repo and versioned it? Keep it in repos that can each stay performant, since most of the time you are working with "time local" information.
My first thought, as suggested by some on the list, was modularization. Redstone's response (that the 1.3 million files are essentially all interdependent) terrifies me.
That's projected growth for two of their projects. Sounds like they have something brewing...<p>Still amazed that breaking it up would do more harm than good when the code isn't even written yet...
Multiple people in this discussion have asserted that code sharing is <i>way</i> easier when all the code is in a single <i>repo</i>, but from my understanding of submodules, it would be a fairly simple matter of setting up your pre/post-commit hooks to update submodules to a branch automatically and get useful company-wide change atomicity (after all, changes should only propagate between teams/projects once they have some stability).<p>Putting aside the question of whether or not an enormous singular repo can be broken up intelligently into modular projects, is there something about the submodule approach that makes it a uniquely unsuitable way to share changes among projects?
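As one sketch of what that hook-based propagation could look like (an assumption about one possible setup, not an established workflow; the submodule paths and branch name are hypothetical), a post-commit hook in a consuming superproject might advance each submodule to the tip of an agreed branch and record the bump:<p><pre><code> #!/usr/bin/env python
 # Hypothetical .git/hooks/post-commit for a superproject that consumes
 # shared code as submodules tracking a "stable" branch.
 import subprocess

 SUBMODULES = ["platform", "shared-ui"]   # hypothetical submodule paths
 BRANCH = "stable"

 def bump_submodules():
     for path in SUBMODULES:
         # Move each submodule to the tip of the agreed branch.
         subprocess.check_call(["git", "fetch", "origin", BRANCH], cwd=path)
         subprocess.check_call(["git", "checkout", "FETCH_HEAD"], cwd=path)
         # Staging the submodule path records its new commit in the superproject.
         subprocess.check_call(["git", "add", path])
     # Commit only if some submodule pointer actually moved.
     if subprocess.call(["git", "diff", "--cached", "--quiet"]) != 0:
         subprocess.check_call(["git", "commit", "-m", "Bump submodules"])

 if __name__ == "__main__":
     last_msg = subprocess.check_output(
         ["git", "log", "-1", "--pretty=%s"]).decode().strip()
     # Guard: don't re-trigger ourselves from the bump commit itself.
     if not last_msg.startswith("Bump submodules"):
         bump_submodules()
</code></pre>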
Git is not memory-efficient by design. I used to push commits of about 1 GB to the server and it would hang forever; I had to abort and push in smaller chunks instead.
Does Facebook really believe that because they have the most users, they should also have the biggest git repo?<p>Amazon and Google have already solved this problem, and the solution is to reorganize things into smaller manageable packages.