This is technical recruiting done right: a lucid walkthrough of a hard problem, complete with links to the implementation. It must have taken at least a few weeks of engineer-time to write; GitHub is awesome for making this public.

I wish they had talked a little more about the tradeoff they made. They mention that splitting packfiles by fork was space-prohibitive, but they ended up with a solution that must take more space than what they had originally: if the new heuristic refuses to use some objects as delta bases, some of the choices that would have given the best compression are no longer available to git.

The performance win is incredible, but how much space did they give up in the process?
Not specific to the post, but:

> we're sending few objects, all from the tip of the repository, and these objects will usually be delta'ed against older objects that won't be sent. Therefore, Git tries to find new delta bases for these objects.

Why is this the case? Git can send thin packs when the receiver already has the base objects, so why does it still need to find a full base to diff against? (Not counting the case where the original bases come from another fork -- I don't know how often that happens.)

On top of that, as far as I understood the discussion of the pack heuristics (https://git.kernel.org/cgit/git/git.git/tree/Documentation/technical/pack-heuristics.txt?id=HEAD), the latest objects are stored whole and earlier objects are stored as deltas against them. That gives a double benefit: the object you usually want is the most recent one, which is already stored whole, and earlier objects tend to only remove content rather than add it, because "stuff grows over time". So if the objects are already stored as packs, shouldn't things already be in pretty good shape to be sent as-is?
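Not an answer, but for anyone who wants to see those delta chains for themselves: `git verify-pack -v` prints one line per object with its delta depth and base. Here is a rough Python sketch (not from the article; the field layout is assumed from the verify-pack documentation) that tallies objects by chain depth: pass it the path of a pack index such as `.git/objects/pack/pack-*.idx`.

```python
# Tally packed objects by delta-chain depth using `git verify-pack -v`.
# Verbose object lines look like:
#   <sha> <type> <size> <size-in-pack> <offset> [<depth> <base-sha>]
# Lines without the trailing two fields are objects stored whole (depth 0).
import subprocess
import sys
from collections import Counter

def delta_depths(pack_idx_path):
    """Return a Counter mapping delta-chain depth -> number of objects."""
    out = subprocess.run(
        ["git", "verify-pack", "-v", pack_idx_path],
        capture_output=True, text=True, check=True,
    ).stdout
    depths = Counter()
    for line in out.splitlines():
        fields = line.split()
        # Object lines start with a 40-character SHA-1; summary lines don't.
        if len(fields) >= 5 and len(fields[0]) == 40:
            depth = int(fields[5]) if len(fields) >= 7 else 0
            depths[depth] += 1
    return depths

if __name__ == "__main__":
    for depth, count in sorted(delta_depths(sys.argv[1]).items()):
        print(f"depth {depth}: {count} objects")
```

Depth 0 here means the object is stored whole rather than as a delta.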
This is awesome. I absolutely love <company> engineering blogs like this.

There's no real reason for companies to educate external devs/hobbyists/students like this, but some do, and it's really awesome.
Perhaps I'm a bit ignorant of git's storage and protocols, but what's the purpose of this initial count? It seems to me it's traversing the tree twice: once to count and once to send the objects across the network. So why not traverse the tree once, sending the objects that need to be sent and ignoring the ones that don't, *instead* of counting?
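For context (not from the thread): the counting phase enumerates every object reachable from what's being sent but not from what the receiver already has, and git needs that complete set up front, in part because the packfile header records the total number of objects before any object data is streamed. A rough sketch of what the enumeration amounts to, using the rev-list plumbing command (the `origin/master` ref is just an assumed stand-in for the receiver's state):

```python
# Sketch: enumerate the objects a push would have to send, i.e. everything
# reachable from `want` that is not reachable from `have`.
import subprocess

def objects_to_send(want, have):
    """Return the SHAs of commits, trees and blobs reachable from `want`
    but not from `have`, as reported by `git rev-list --objects`."""
    out = subprocess.run(
        ["git", "rev-list", "--objects", want, f"^{have}"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Each line is "<sha>" or "<sha> <path>"; only the SHA matters here.
    return [line.split()[0] for line in out.splitlines() if line.strip()]

if __name__ == "__main__":
    print(len(objects_to_send("HEAD", "origin/master")), "objects to send")
```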
The Azure team responsible for the Local Git implementation needs to read this fantastic article.

I've been putting up with 10-minute deploys due to precisely this issue of counting objects. It's slow because we don't use Local Git as our source-of-record repository (because commits initiate a deployment step), so every deploy involves a clean fetch into a new tmpdir.

At least now I know why our deploys are getting slower and slower.
I wonder why GitHub has a separate domain, githubengineering.com, for this blog instead of a subdomain like engineering.github.com.

I notice that there is an inactive user account called "engineering". If that account ever created a User Page, it would be available at engineering.github.io.
"When you fork a repository on GitHub, we create a shallow copy of it. This copy has no objects of its own, but it has access to all the objects of an alternate ..."<p>So does this mean one could attach a GitHub repository by having a lot of shill accounts cline it and add random objects (possibly having a performance impact on the original)? I understand the engineering need for the use of alternates, but wonder about the lowered degree of isolation.