I think it takes some real humility to post this. No doubt someone will follow up with an “of course...” or “if you don’t understand the tech you use...” comment.<p>But thank you for this. It takes a bit of courage to point out you’ve been doing something grotesquely inefficient for years and years.
Better title: A one-line change decreased our "git clone" times by 99%.<p>It's a bit misleading to use "build time" to describe this improvement, as it makes people think about build systems, compilers, header files, or caches. On the other hand, the alternative title is descriptive and helpful to all developers, not just build engineers; people who simply need to clone a branch from a large repository can benefit from this tip as well.
This reminds me of my first programming job in 2005, working with Macromedia Flash. They had one other Flash programmer who only worked there every once in a while because he was actually studying in college, and he was working on some kind of project from hell that, among other problems, took about two minutes to build to SWF.<p>Eventually they stopped asking him to come because he couldn't get anything done, and so I had a look at it. In the Movie Clip library of the project I found he had an empty text field somewhere that was configured to include a copy of <i>almost the entire Unicode range</i>, including thousands of CJK characters, so each time you built the SWF it would collect and compress numerous different scripts from different fonts as vectors for use by the program. And it wasn't even being used by anything.<p>Once I removed that one empty text field, builds went down to about 3 seconds.
This is the most I've ever gotten out of Pinterest. Other than this, it's just the "wrong site that Google turns up, that I can't use because it wants me to create an account just to view the image I searched for".
On my first job, 20 years ago, we used a custom Visual C framework that generated one huge .h file that connected all sorts of stuff together. Amongst other things, that .h file contained a list of 10,000 const uints, which were included in every file, and compiled in every file. Compiling that project took hours. At some point I wrote a script that changed all those const uints to #define, which cut our build time to a much more manageable half hour.<p>Project lead called it the biggest productivity improvement in the project; now we could build over lunch instead of over the weekend.<p>If there's a step in your build pipeline that takes an unreasonable amount of time, it's worth checking why. In my current project, the slowest part of our build pipeline is the Cypress tests. (They're also the most unreliable part.)
I sympathise a lot with this post! Git cloning can be shockingly slow.<p>As a personal anecdote, clones of the Rust repository in CI used to be pretty slow, and on investigating we found out that one key problem was cloning the LLVM submodule (which Rust has a fork of).<p>In the end we put in place a hack to download the tar.gz of our LLVM repo from github and just copy it in place of the submodule, rather than cloning it. [0]<p>Also, as a counterpoint to some other comments in this thread - it's really easy to just shrug off CI getting slower. A few minutes here and there adds up. It was only because our CI would hard-fail after 3 hours that the infra team really started digging in (on this and other things) - had we left it, I suspect we might be at around 5 hours by now! Contributors want to do their work, not investigate "what does a git clone really do".<p>p.s. our first take on this was to have the submodules cloned and stored in the CI cache, then use the rather neat `--reference` flag [1] to grab objects from this local cache when initialising the submodule - incrementally updating the CI cache was way cheaper than recloning each time. Sadly the CI provider wasn't great at handling multi-GB caches, so we went with the approach outlined above.<p>[0] <a href="https://github.com/rust-lang/rust/blob/1.47.0/src/ci/init_repo.sh#L50-L68" rel="nofollow">https://github.com/rust-lang/rust/blob/1.47.0/src/ci/init_re...</a><p>[1] <a href="https://github.com/rust-lang/rust/commit/0347ff58230af512c9521bdda7877b8bef9e9d34#diff-a14d83f2e928fc5906d026a42cb16f021b452709b88bc3fd85c63e741cbd9a42R70" rel="nofollow">https://github.com/rust-lang/rust/commit/0347ff58230af512c95...</a>
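For anyone curious, a minimal sketch of what the `--reference` approach looks like for a submodule; the cache path, URL, and submodule path here are illustrative, not necessarily Rust's actual setup:

    # Keep a long-lived object cache on the CI host, refreshed occasionally:
    git clone --mirror https://github.com/rust-lang/llvm-project.git /ci-cache/llvm-project.git
    # When initialising the submodule, borrow objects from the cache
    # instead of downloading them all again:
    git submodule update --init --reference /ci-cache/llvm-project.git src/llvm-project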
> Even though we’re telling Git to do a shallow clone, to not fetch any tags, and to fetch the last 50 commits ...<p>What is the reason for cloning 50 commits? Whenever I clone a repo off GitHub for a quick build and don't care about sending patches back, I always use --depth=1 to avoid any history or stale assets. Is there a reason to get more commits if you don't care about having a local copy of the history? Do automated build pipelines need more info?
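For comparison, a minimal history-free clone looks roughly like this (the URL is just an example); note that --depth already implies --single-branch unless you ask otherwise:

    # Fetch only the tip commit of one branch, no tags, no extra history:
    git clone --depth=1 --no-tags --branch main https://github.com/example/repo.git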
I expected this to be some micro-optimization of moving a thing from taking 10 seconds to 100ms.<p>> Cloning our largest repo, Pinboard went from 40 minutes to 30 seconds.<p>This is both very impressive as well as very disheartening. If a process in my CI was taking 40 minutes, I would have started investigating long before it got anywhere near a 40-minute delay.<p>I don't mean to throw shade on the Pinterest engineering team, but it speaks to an institutional complacency with things like this.<p>I'm sure everyone was happy when the clone took 1 second.<p>I doubt anyone noticed when the clone took 1 minute.<p>Someone probably started to notice when the clone took 5 minutes but didn't look.<p>Someone probably tried to fix it when the clone was taking 10 minutes and failed.<p>I wonder what 'institutional complacencies' we have. Problems we assume are unsolvable but are actually very trivial to solve.
I’ve found as an industry we’ve moved to more complex tools, but haven’t built the expertise in them to truly engineer solutions using them. I think lots of organizations could find major optimizations, but it requires really learning about the technology you’re utilizing.
When I first joined one of my previous jobs, the build process had a checkout stage that blew away the git folder and checked out the whole repo from scratch every time (!). Since the build machine was reserved for that build job, I simply made some changes to do git clean -dfx && git reset --hard && git checkout origin/<branch> instead. It shaved off like 15 minutes of the build time, which was something like 50% of the total build time.
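A sketch of that reuse-the-checkout pattern, with the branch name left as a placeholder:

    # Reuse the existing clone: remove untracked and ignored files,
    # fetch the latest commits, then hard-reset to the branch tip.
    git clean -dfx
    git fetch origin <branch>
    git reset --hard origin/<branch>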
> In the case of Pinboard, that operation would be fetching more than 2,500 branches.<p>Ok, I'll ask: why does a single repository have over 2,500 branches? Why not delete the ones you no longer use?
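For what it's worth, pruning is cheap; a few standard commands cover most of it (the branch name here is an example):

    # Drop local remote-tracking refs for branches already deleted on the server:
    git fetch --prune
    # List remote branches whose work is already merged into master:
    git branch -r --merged origin/master
    # Delete a stale branch on the server:
    git push origin --delete old-feature-branch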
One of the (many) things that drives me batty about Jenkins is that there are two different ways to represent everything. These days the "declarative pipelines" style seems to be the first class citizen, but most of the documentation still shows the old way. I can't take the code in this example and compare it trivially to my pipelines because the exact same logic is represented in a completely different format. I wish they would just deprecate one or the other.
I find the self-congratulatory tone in the post kind of off-putting, akin to "I saved 99% on my heating bill when I started closing doors and windows in the middle of winter."<p>If your repos weigh in at 20GB in size, with 350k commits, subject to 60k pulls in a single day, having someone with half a devops clue take a look at what your Jenkinsfile is doing with git is not exactly rocket science or a needle in a haystack. (Here's hoping they discover branch pruning too; how many of those 2500 branches are active?)<p>As a consultant I've seen plenty of appallingly poor workflows and practices, so this isn't all that remarkable... but for me the post seems kind of pointless.
Can someone explain the intended meaning behind calling six different repositories "monorepos"?<p>It sounds to me like you don't have a monorepo at all and instead have six repositories for six project areas.
I'm a git noob, so I'm sorry if this sounds dumb but wouldn't<p>git clone --single-branch<p>achieve the same thing (i.e, check out only the branch you want to build) ?<p>Also, why would you <i>not</i> only check out one branch when doing CI ?
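For reference, --single-branch works by narrowing the fetch refspec written into the clone's config, which is essentially the same knob the post turns on the Jenkins side; a quick way to see it (URL is an example):

    git clone --single-branch --branch master https://github.com/example/repo.git
    cd repo
    # The clone records a narrowed refspec, so later fetches and pulls
    # only consider that one branch:
    git config --get remote.origin.fetch
    # +refs/heads/master:refs/remotes/origin/master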
I truly appreciate articles like this — it’s heartening to see other companies running into the kinds of issues I’ve run into or had to deal with, and even more so that their culture openly discusses and shares these learnings with the broader community.<p>The most effective organizations I’ve worked at built mechanisms and processes to disseminate these kinds of learnings and held regular brown bags on how a particular problem was solved or how others can apply the lessons.<p>Keep it up, Pinterest engineering folks.
He says that "Pinboard has more than 350K commits and is 20GB in size when cloned fully." I'm not clear though, exactly what "cloned fully" means in context of the unoptimized/optimized situation.<p>He says it went from 40 minutes to 30 seconds. Does this mean they found a way to grab the whole 20GB repo in 30 seconds? seems pretty darn fast to grab 20GB, but maybe on fast internal networks?<p>Or maybe they meant that it was 20GB if you grabbed all of the many thousands of garbage branches, when Jenkins really only needed to test "master", and finding a solution that allowed them to only grab what they needed made things faster.<p>I'm also curious about the incremental vs "cloning fully" aspect of it. Does each run of Jenkins clone the repo from scratch or does it incrementally pull into a directory where it has been cloned before? I could see how in a cloning-from-scratch situation the burden of cloning every branch that ever existed would be large, whereas incrementally I would think it wouldn't matter that much.
My similar story goes like this: We had CRM software that let you set up user-defined menu options. Someone at our organization decided to make a set of nested menu options where you could configure a product, with every possible combination being assigned a value!<p>So if you had a large, blue second generation widget with a foo accessory and option buzz, you were value 30202, and if it was the same one except red, it was 26420...<p>Every time the CRM software started up, it cycled through the options and generated a new XML file with all the results; this took about a minute and created something like a 60MB file.<p>The fix was to basically version the XML file and the options definition file. If someone had already generated that file, just load the XML file instead of parsing and looping through the options file. Started up in 5 seconds!<p>What was the excuse that it took so long in the first place? "The CRM software is written in Java, so it's slow."
Seems like there's a lot of hostility towards the title, which might be considered the engineering blog equivalent of clickbait. If the authors are around: the post was quite informative and interesting to read, but I'm sure it would have been much more palatable with a more descriptive title.<p>But back on topic: does anyone have any insight into when git fetches things, and what it chooses to grab? Is it just "when we were writing git we chose these things as being useful to have a 'please update things before running this command' implicitly run before them"? For example, git pull seems to run a fetch for you, etc.
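On the fetch question, roughly speaking git pull is just a fetch followed by a merge (or a rebase with --rebase); a sketch, assuming the current branch tracks origin/master:

    # 'git pull' is approximately these two steps:
    git fetch origin            # download new objects, update remote-tracking refs
    git merge origin/master     # merge the upstream branch into the current one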
Ok, I'll ask the obvious question: why did setting the branches option to master not already do this?<p>EDIT<p><a href="https://www.jenkins.io/doc/pipeline/steps/workflow-scm-step/" rel="nofollow">https://www.jenkins.io/doc/pipeline/steps/workflow-scm-step/</a> makes it sound like the branches option specifies which branches to monitor for changes, after which all branches are fetched. This still seems like a counter-intuitive design that doesn't fit the most common cases.
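In plain git terms, the change in the post amounts to narrowing the fetch refspec; something like the following (the exact commands the Jenkins plugin generates may differ):

    # Default refspec: fetch every branch on the remote.
    git fetch origin '+refs/heads/*:refs/remotes/origin/*'
    # Narrowed refspec: fetch only master.
    git fetch origin '+refs/heads/master:refs/remotes/origin/master'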
This is good info. Need to check my own build pipelines now and see if we are just blindly cloning everything or not. 40 minutes to do a clone is a pretty long time to wait, though.
Parkinson's Law of builds: "work expands so as to fill the time available for its completion", or in this case the available time is the point at which people can't stand how long the build takes. 30-60 minutes is normal because anything > 1 minute requires you to context-switch anyway, and > 60 minutes means you are now at risk of a build taking a day if you have the work queue of a 1-pizza team. So the [1..60] minute range causes a grumble, but nothing gets done.
Is there any way to do this for GitLab CI [1]? I'm using GIT_DEPTH=1, but I'm not sure how to set refspecs. It's not too important right now since it only takes about 11 seconds to clone the git repo, but maybe it's a quick win as well.<p>[1] <a href="https://docs.gitlab.com/ee/ci/large_repositories/" rel="nofollow">https://docs.gitlab.com/ee/ci/large_repositories/</a>
> For Pinboard alone, we do more than 60K git pulls on business days.<p>Can anyone explain this? Seems ripe for another 99% improvement even with hundreds of devs.
My CI servers have to build branches as well, though. A fresh clone for every build? No wonder it was slow, but even this solution seems inefficient. My preferred general solution is a persistent repository clone per build host, maintained by incremental fetch, and use <i>git worktree add</i>, not <i>git clone</i>, to checkout each build.
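A sketch of that persistent-clone-plus-worktree setup (paths are illustrative, and /builds/repo.git is assumed to be a bare clone made earlier):

    # One long-lived clone per build host, updated incrementally...
    git -C /builds/repo.git fetch origin
    # ...and a throwaway detached worktree per build:
    git -C /builds/repo.git worktree add --detach /tmp/build-1234 origin/master
    # clean up after the build:
    git -C /builds/repo.git worktree remove /tmp/build-1234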
Well, good advice, and good for them, but<p>> Cloning monorepos that have a lot of code and history is time consuming, and we need to do it frequently throughout the day in our continuous integration pipelines.<p><i>No you don't!</i><p>If removing per-build clones was the only way to speed things up, I'm absolutely sure you could figure out how with medium difficulty at most.
This just shows how poor visibility into git is; I hope it gets better.<p>Building a product with poor visibility and ridiculing users for not knowing its internals is the worst practice in computer science.<p>Hadoop did the same, and set a record for the fastest software to become legacy.<p>Super nice to see the great comments here and the nice article.
Looks like Pinterest’s team is confused about Git branches. These are not real full copies of the main branch like in SVN or TFS. A branch in the Git world is simply a pointer to a specific commit in the push history.<p>Having said that, happy to be proven wrong and to learn about it.
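The pointer part is easy to see for yourself: a branch ref is just a 40-character commit id (example output shown below), so the refs themselves are tiny; what a full fetch pays for is the set of objects reachable from each of those tips.

    # A branch is a tiny file holding a commit hash
    # (it may live in .git/packed-refs instead if refs are packed):
    cat .git/refs/heads/master
    # e83c5163316f89bfbde7d9ab23ca2e25604af290
    git rev-parse master
    # e83c5163316f89bfbde7d9ab23ca2e25604af290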
For CI on large repos, you can do much better than this by using a persistent git cache. It takes a little finessing to destroy it if it's corrupt and avoid concurrent modifications, but it's extremely worth it.
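A minimal sketch of a reference-clone cache (paths and URL are illustrative):

    # Shared object cache on the build host, refreshed out of band:
    git clone --mirror https://github.com/example/repo.git /git-cache/repo.git
    git -C /git-cache/repo.git fetch origin
    # Per-build clones borrow objects from the cache; --dissociate copies the
    # borrowed objects in, so the checkout survives if the cache disappears:
    git clone --reference /git-cache/repo.git --dissociate https://github.com/example/repo.git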
Regarding the strife over the 99% claim: if the pull time was 39.9 min (and thus the build took 0.1 min = 6 sec), then a 99% decrease in pull time would result in a 99% decrease in total time, and you would get 30 sec total time in the end (rounding to 0 decimal places).<p>Not that any of this is important for the article to be interesting. In a previous job we had to fight long pull times, and we quickly created a git repo for CI that would sit on a machine next to the CI server and periodically pull from GitHub, so the CI didn't have to do pulls over the Internet.
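That local-mirror setup can be as simple as the following; the hostname and paths are made up, and the CI machines would clone over whatever protocol (ssh, http, git daemon) is convenient internally:

    # On a machine near the CI server: keep a mirror, refresh it from cron:
    git clone --mirror https://github.com/example/repo.git /mirrors/repo.git
    git -C /mirrors/repo.git remote update --prune
    # CI machines clone from the nearby mirror instead of going to GitHub:
    git clone ci-mirror.internal:/mirrors/repo.git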
The title is a bit of a misnomer, isn't it?<p>> This simple one line change reduced our clone times by 99% and significantly reduced our build times as a result.<p>Sounds like it didn't reduce build times quite by 99%.
I'm not impressed by the author of the post, since this is documented in the plugin, which says you should not check out all the branches if you're not interested in them. The default behaviour, of course, is to get all of them.
So git doesn't scale well with wide, deep source histories? That's a failing of git, I think, not of the engineers, who may even have written that line when the source base was far less gnarly.
I once reduced the runtime of our test suite from 10 mins to < 5 minutes by changing 2 characters in 1 line...<p>The bcrypt work factor! It was originally 12; I reduced it to 1 (don’t worry, production is still 12)
Is it a common practice to clone the repo on every build (especially on web apps)? I just have Jenkins navigate to an app folder, run a few git commands (hard reset, pull), and build (webpack).
The article is erroneous in many ways, as others have described, but the main error I see is that it says 'git clone' is run before the fetch.<p>It should be 'git init'.
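For context, the sequence the Jenkins git plugin performs is roughly the following (simplified; URL and refspec are examples):

    # Not a 'git clone' at all: an empty repo, a narrowed fetch, a checkout.
    git init workspace
    cd workspace
    git remote add origin https://github.com/example/repo.git
    git fetch --no-tags origin '+refs/heads/master:refs/remotes/origin/master'
    git checkout -f origin/master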
It is pinteresting that a webapp for making your image-saving obsession easier to satisfy takes hundreds to thousands of developer actions per day and repositories tens of gigabytes in size.
Semi-related for JS developers: if you do `eslint` as part of your build, make sure `node_modules` (and `node_modules` in subfolders if you have a monorepo-ish solution) is excluded.
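One way to enforce that from the command line; the ignore pattern shown is an example, and project setups vary:

    # Lint the workspace while explicitly excluding every node_modules directory:
    npx eslint . --ignore-pattern '**/node_modules/**'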
“We have six main repositories at Pinterest: Pinboard, Optimus, Cosmos, Magnus, iOS, and Android. Each one is a monorepo and houses a large collection of language-specific services.”<p>What is an “iOS monorepo” supposed to be like?