In terms of engineering tradeoffs, this reminds me of a recent talk by Alan Kay where he says that to build the software of the future, you have to pay extra to get the hardware of the future today. [1] Joel Spolsky called it "throwing money at the problem" when, five years ago, he got SSDs for everybody at Fog Creek just to deal with a slow build. [2]<p>I don't use Facebook, and I'm not suggesting that they're building the software of the future. But surely someone there is smart enough to know that, for this decision, time is on their side.<p>[1] <a href="https://news.ycombinator.com/item?id=7538063" rel="nofollow">https://news.ycombinator.com/item?id=7538063</a><p>[2] <a href="http://www.joelonsoftware.com/items/2009/03/27.html" rel="nofollow">http://www.joelonsoftware.com/items/2009/03/27.html</a>
Although this is large for a company that deals mostly in web-based projects, it's nothing compared to repository sizes in game development.<p>Usually game assets are in one repository (including compiled binaries) and code in another. The repository containing the game itself can grow to hundreds of gigabytes due to tracking revision history on art assets (models, movies, textures, animation data, etc.).<p>I wouldn't doubt there are some larger commercial game projects with repository sizes exceeding 1TB.
Didn't they switch to Mercurial?<p><a href="https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/" rel="nofollow">https://code.facebook.com/posts/218678814984400/scaling-merc...</a>
The worrying point here is the 8GB checkout, as opposed to the history size itself (46GB). If git is fast enough with an SSD, this is hardly anything to worry about.<p>I actually prefer monolithic repos (I realize that the slide posted might be in jest). I have seen projects struggle with submodules and with splitting modules into separate repos. People change something in their module and don't test the modules that depend on it, because it's not their problem anymore. Software in fast-moving companies doesn't work like that. There are always subtle behavioral dependencies (e.g., one module depends on a bug in another, either by mistake or intentionally). I just prefer having all code and tests of all modules in one place.
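If the 46GB of history (rather than the 8GB checkout) ever did become the pain point, a shallow clone is one escape hatch: developers keep only the most recent commits locally. A rough sketch, scripting git from Python; the remote URL and directory name here are made up:<p><pre><code>import subprocess

# Hypothetical remote; substitute the real one.
REPO = "git@example.com:bigco/monorepo.git"

# --depth 1 fetches only the latest commit, so local disk usage stays
# close to the working-tree size instead of the full history.
subprocess.check_call(["git", "clone", "--depth", "1", REPO, "monorepo"])</code></pre>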
FB had previous scaling problems with git which they discussed in 2012 <a href="http://comments.gmane.org/gmane.comp.version-control.git/189776" rel="nofollow">http://comments.gmane.org/gmane.comp.version-control.git/189...</a><p>It appears they are now using Mercurial and working on scaling that (also noted by several others in this discussion): <a href="https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/" rel="nofollow">https://code.facebook.com/posts/218678814984400/scaling-merc...</a>
I bet most of that size is made up of the various dependencies Facebook probably has, though I'm still surprised it's that large. I expected the background worker things, like the facial recognition system for tagging people and the video re-encoding libs, to be housed in separate repositories.<p>I also wonder if that size includes a snapshot of a subset of Facebook's Graph, so that each developer has a "mini-Facebook" to work on that's large enough to be representative of the actual site (so that feed generation and other functionality takes roughly the same time to execute).
> @readyState would massively enjoy that first clone @feross<p>The first clone does not have to go over the wire. Part of git's distributed nature is that you can copy the .git directory to any hard drive and pass it on to someone else. Then...<p>> git checkout .
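In other words, something like this (a sketch of that sneakernet workflow scripted from Python; all paths are invented):<p><pre><code>import shutil
import subprocess

# The .git directory arrives on an external drive; copy it into place.
shutil.copytree("/media/usb/monorepo/.git", "/home/dev/monorepo/.git")

# Rebuild the working tree from the copied object database; no network involved.
subprocess.check_call(["git", "checkout", "."], cwd="/home/dev/monorepo")

# Later fetches of new history can still go to the original remote as usual.
subprocess.check_call(["git", "fetch", "origin"], cwd="/home/dev/monorepo")</code></pre>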
Meh. I'm working on a comparatively small project (~40 developers), and we're over 16GB.<p>Mostly because we want a 100% reproducible build environment, so a complete build environment (compilers + IDE + build system) is all checked into the repo.
Someone recently told me that Facebook had a torrent file that went around the company that people could use to download the entire codebase with a BitTorrent client. Is there any truth to this?<p>I mean, the same guy who told me this also said that the codebase was about 50 times smaller than the one reported in this slide, so it may all be pure speculation.
<p><pre><code> NAFV_P@DEC-PDP9000:~$ python
Python 2.7.3 (default, Feb 27 2014, 19:58:35)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information
>>> t=54*2**30
>>> t
57982058496
# let's assume a char is 2mm wide, 500 chars per meter
>>> t/500.0
115964116.992 #meters of code
# assume 80 chars per line, a char is 5mm high, 200 lines per meter
>>> u=80*200.0
>>> v=t/u
>>> v
3623878.656 # height of code in meters
# 1000 meters per km
>>> v/1000.0
3623.878656 # km of code, it's about 385,000 km from the Earth to the Moon
>>> from sys import stdout
>>> stdout.write("that's a hella lotta code\n")</code></pre>
I thought I had read an article about Facebook switching to Perforce due to their really large git repo. Were they at least thinking about it?<p>A quick Google search comes up with nothing, but I could have SWORN I read that.
Gosh, last time around I had trouble checking in just 8GB of data: it gets very memory-hungry when the data set is big and you check it all in at once. How much memory does the server side need when you 'git add .' an entire 54GB repo?<p>And what about a re-index or something, will that take forever?<p>I worry that at this size the speed will suffer; I get the feeling git is only comfortable with a few GB.<p>Anyway, it's good to know that 54GB is still usable!
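Rather than guessing, you can measure it on a throwaway clone. A rough sketch (the repo path is a placeholder; it reports wall time plus the peak memory of git child processes as Python sees it):<p><pre><code>import resource
import subprocess
import time

REPO = "/path/to/big-checkout"  # placeholder

def run(*cmd):
    # Run a git command in the repo, reporting wall time and the peak RSS
    # of any child process spawned so far (Linux reports ru_maxrss in KB).
    start = time.time()
    subprocess.check_call(("git",) + cmd, cwd=REPO)
    peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    print("git %s: %.1fs, peak child RSS so far ~%d MB"
          % (" ".join(cmd), time.time() - start, peak_kb // 1024))

run("count-objects", "-v")  # how much loose and packed data git is tracking
run("status")               # cost of a full working-tree scan
run("add", "--all")         # cost of staging everything at once</code></pre>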
A company I did a contract for last year has 8MB of (Java) source code and a 52MB SVN repo, and makes £40 million a year out of it... We're doing something wrong.
I hope everyone realizes this is not 54GB of code; more likely it's a very public showing of very poor SCM management. They likely have tons of binaries in there, plus many full-codebase sweeps (whitespace, tabs, line endings, etc.). And that's not even counting how much dead code lives in there.
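That theory is easy to test on any clone: list every object git has ever stored along with its size, and see whether the top of the list is binaries. A sketch using stock git plumbing (nothing here is Facebook-specific):<p><pre><code>import subprocess

# Every object reachable from any ref, annotated with the path it was stored under.
listing = subprocess.check_output(
    ["git", "rev-list", "--objects", "--all"]).decode("utf-8", "replace")

path_for = {}
for line in listing.splitlines():
    parts = line.split(None, 1)
    if len(parts) == 2:          # commits have no path; trees and blobs do
        path_for[parts[0]] = parts[1]

# Ask git for the type and size of each of those objects in one batch.
proc = subprocess.Popen(["git", "cat-file", "--batch-check"],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)
out, _ = proc.communicate(("\n".join(path_for) + "\n").encode("utf-8"))

blobs = []
for line in out.decode("utf-8", "replace").splitlines():
    fields = line.split()
    if len(fields) >= 3 and fields[1] == "blob":
        blobs.append((int(fields[2]), path_for.get(fields[0], "?")))

# The ten largest files ever committed; if SCM hygiene is poor, expect binaries here.
for size, path in sorted(blobs, reverse=True)[:10]:
    print("%10d  %s" % (size, path))</code></pre>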