In terms of engineering tradeoffs, this reminds me of a recent talk by Alan Kay where he says that to build the software of the future, you have to pay extra to get the hardware of the future today. [1] Joel Spolsky called it "throwing money at the problem" when, five years ago, he got SSDs for everybody at Fog Creek just to deal with a slow build. [2]<p>I don't use Facebook, and I'm not suggesting that they're building the software of the future. But surely someone there is smart enough to know that, for this decision, time is on their side.<p>[1] <a href="https://news.ycombinator.com/item?id=7538063" rel="nofollow">https://news.ycombinator.com/item?id=7538063</a><p>[2] <a href="http://www.joelonsoftware.com/items/2009/03/27.html" rel="nofollow">http://www.joelonsoftware.com/items/2009/03/27.html</a>
Although this is large for a company that deals mostly in web-based projects, it's nothing compared to repository sizes in game development.<p>Usually game assets are in one repository (including compiled binaries) and code in another. The repository containing the game itself can grow to hundreds of gigabytes due to tracking revision history on art assets (models, movies, textures, animation data, etc.).<p>I wouldn't doubt there are some larger commercial game projects with repository sizes exceeding 1TB.
Didn't they switch to Mercurial?<p><a href="https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/" rel="nofollow">https://code.facebook.com/posts/218678814984400/scaling-merc...</a>
The worrying point here is the 8GB checkout, as opposed to the history size itself (46GB). If git is fast enough with an SSD, this is hardly anything to worry about.<p>I actually prefer monolithic repos (I realize that the slide posted might be in jest). I have seen projects struggle with submodules and with splitting modules into separate repos. People change something in their module and don't test the modules that depend on it, because it's not their problem anymore. Software in fast-moving companies doesn't work like that. There are always subtle behavioral dependencies (e.g., one module depends on a bug in another, either by mistake or intentionally). I just prefer having all code and tests of all modules in one place.
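If the 46GB of history (rather than the 8GB checkout) ever did become the pain point, a shallow clone is one escape hatch: developers keep only the most recent commits locally. A rough sketch, scripting git from Python; the remote URL and directory name here are made up:<p><pre><code>import subprocess

# Hypothetical remote; substitute the real one.
REPO = "git@example.com:bigco/monorepo.git"

# --depth 1 fetches only the latest commit, so local disk usage stays
# close to the working-tree size instead of the full history.
subprocess.check_call(["git", "clone", "--depth", "1", REPO, "monorepo"])</code></pre>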
FB had previous scaling problems with git which they discussed in 2012 <a href="http://comments.gmane.org/gmane.comp.version-control.git/189776" rel="nofollow">http://comments.gmane.org/gmane.comp.version-control.git/189...</a><p>It appears they are now using Mercurial and working on scaling that (also noted by several others in this discussion): <a href="https://code.facebook.com/posts/218678814984400/scaling-mercurial-at-facebook/" rel="nofollow">https://code.facebook.com/posts/218678814984400/scaling-merc...</a>
I bet most of that size is made up of the various dependencies Facebook probably has, though I'm still surprised it's that large. I expected the background worker things, like the facial recognition system for tagging people and the video re-encoding libs, to be housed in separate repositories.<p>I also wonder if that size includes a snapshot of a subset of Facebook's Graph, so that each developer has a "mini-Facebook" to work on that's large enough to be representative of the actual site (so that feed generation and other functionality takes roughly the same time to execute).
> @readyState would massively enjoy that first clone @feross<p>The first clone does not have to go over the wire. Part of git's distributed nature is that you can copy the .git directory to any hard drive and pass it on to someone else. Then...<p>> git checkout .
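In other words, something like this (a sketch of that sneakernet workflow scripted from Python; all paths are invented):<p><pre><code>import shutil
import subprocess

# The .git directory arrives on an external drive; copy it into place.
shutil.copytree("/media/usb/monorepo/.git", "/home/dev/monorepo/.git")

# Rebuild the working tree from the copied object database; no network involved.
subprocess.check_call(["git", "checkout", "."], cwd="/home/dev/monorepo")

# Later fetches of new history can still go to the original remote as usual.
subprocess.check_call(["git", "fetch", "origin"], cwd="/home/dev/monorepo")</code></pre>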
Meh. I'm working on a comparatively small project (~40 developers), and we're over 16GB.<p>Mostly because we want a 100% reproducible build environment, so a complete build environment (compilers + IDE + build system) is all checked into the repo.
Someone recently told me that Facebook had a torrent file that went around the company that people could use to download the entire codebase with a BitTorrent client. Is there any truth to this?<p>I mean, the same guy who told me this also said that the codebase was about 50 times smaller than the one reported in this slide, so it may all be pure speculation.
<p><pre><code> NAFV_P@DEC-PDP9000:~$ python
Python 2.7.3 (default, Feb 27 2014, 19:58:35)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information
>>> t=54*2**30
>>> t
57982058496
# let's assume a char is 2mm wide, 500 chars per meter
>>> t/500.0
115964116.992 #meters of code
# assume 80 chars per line, a char is 5mm high, 200 lines per meter
>>> u=80*200.0
>>> v=t/u
>>> v
3623878.656 # height of code in meters
# 1000 meters per km
>>> v/1000.0
3623.878656 # km of code, it's about 385,000 km from the Earth to the Moon
>>> from sys import stdout
>>> stdout.write("that's a hella lotta code\n")</code></pre>
I thought I had read an article about Facebook switching to Perforce due to their really large git repo. Were they at least thinking about it?<p>A quick Google search comes up with nothing, but I could have SWORN I read that.
Gosh, last time around I had trouble checking in just 8GB of data: it gets very memory-hungry when the data set is big and you check it all in at once. How much memory does the server side need when you 'git add .' an entire 54GB repo?<p>And what about a re-index or something, will that take forever?<p>I worry that at this size the speed will suffer; I get the feeling git is only comfortable with a few GB.<p>Anyway, it's good to know that 54GB is still usable!
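Rather than guessing, you can measure it on a throwaway clone. A rough sketch (the repo path is a placeholder; it reports wall time plus the peak memory of git child processes as Python sees it):<p><pre><code>import resource
import subprocess
import time

REPO = "/path/to/big-checkout"  # placeholder

def run(*cmd):
    # Run a git command in the repo, reporting wall time and the peak RSS
    # of any child process spawned so far (Linux reports ru_maxrss in KB).
    start = time.time()
    subprocess.check_call(("git",) + cmd, cwd=REPO)
    peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    print("git %s: %.1fs, peak child RSS so far ~%d MB"
          % (" ".join(cmd), time.time() - start, peak_kb // 1024))

run("count-objects", "-v")  # how much loose and packed data git is tracking
run("status")               # cost of a full working-tree scan
run("add", "--all")         # cost of staging everything at once</code></pre>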
A company I did a contract for last year has 8MB of (Java) source code and a 52MB SVN repo, and makes £40 million a year out of it... We're doing something wrong.
I hope everyone realizes this is not 54GB of code; more likely it's a very public showing of very poor SCM management. They likely have tons of binaries in there, plus many full-codebase sweeps (whitespace, tabs, line endings, etc.). And that's not even counting how much dead code lives in there.
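That theory is easy to test on any clone: list every object git has ever stored along with its size, and see whether the top of the list is binaries. A sketch using stock git plumbing (nothing here is Facebook-specific):<p><pre><code>import subprocess

# Every object reachable from any ref, annotated with the path it was stored under.
listing = subprocess.check_output(
    ["git", "rev-list", "--objects", "--all"]).decode("utf-8", "replace")

path_for = {}
for line in listing.splitlines():
    parts = line.split(None, 1)
    if len(parts) == 2:          # commits have no path; trees and blobs do
        path_for[parts[0]] = parts[1]

# Ask git for the type and size of each of those objects in one batch.
proc = subprocess.Popen(["git", "cat-file", "--batch-check"],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)
out, _ = proc.communicate(("\n".join(path_for) + "\n").encode("utf-8"))

blobs = []
for line in out.decode("utf-8", "replace").splitlines():
    fields = line.split()
    if len(fields) >= 3 and fields[1] == "blob":
        blobs.append((int(fields[2]), path_for.get(fields[0], "?")))

# The ten largest files ever committed; if SCM hygiene is poor, expect binaries here.
for size, path in sorted(blobs, reverse=True)[:10]:
    print("%10d  %s" % (size, path))</code></pre>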