Clauset, Shalizi and Newman (2007) have not-nice things to say about the classic physicist's idiot trick of fitting power law distributions by drawing a straight line on a log-log graph: it's got huge bias. <a href="https://arxiv.org/abs/0706.1062" rel="nofollow">https://arxiv.org/abs/0706.1062</a><p>However, the other difficult thing about power law distributions is that the amount of data you need to properly establish that something is a power law is occasionally enormous. So their critique bites hard, given how little data people usually have. Even computer systems, with their overflowing reams of data, often don't have enough. Note that the paper I cited up there suggests MLE and then a Kolmogorov-Smirnov test, so it'll say a lot of things aren't power laws that could well be.<p>Another way to look at it is from a more geometric point of view. The metric entropy of a dynamical system is, under suitable conditions (Pesin's identity), the sum of the positive Lyapunov exponents, and as an "entropy" that quantity does have a lot in common with the other entropies. But to have positive Lyapunov exponents is often to have chaotic dynamics, so one could conjecture that the time series of commits and octopus merge sizes in kernel git history is chaotic, so the evolution of the time series will be fractal in nature.<p>But it's also really fucking hard to confirm or deny that one, because there are varied and strange definitions of chaos itself, and the methods that have been suggested for measuring Lyapunov exponents in real systems are arcane and difficult. You could try some synchronization methods, but they remain arcane and crap. Fractal measurement methods are also shitty and full of dark magic.<p>One neat little trick might be to discretize the series, symbolic dynamics-style (it's already discrete, but discretize further, into percentiles or something) and run it through one of the dynamical machine learning dealies to see if there are patterns. There's not much literature on that, but it's a thing that some randoes did in like 2004 or something.
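For reference, the MLE that paper recommends for the continuous case is alpha = 1 + n / sum(ln(x_i / xmin)), taken over the values at or above xmin. A minimal awk sketch of just that estimator, assuming one value per line on stdin and that you have already picked xmin somehow. The function name powerlaw_alpha is made up for illustration, integer data like parent counts really wants the discrete variant from the paper, and this does nothing about the goodness-of-fit step:

```shell
# Continuous power-law MLE: alpha = 1 + n / sum(ln(x_i / xmin)).
# usage: powerlaw_alpha XMIN < values   (one number per line)
# powerlaw_alpha is a hypothetical helper name, not from the paper.
powerlaw_alpha() {
  awk -v xmin="$1" '
    # Only the tail at or above xmin takes part in the fit.
    $1 >= xmin { n++; s += log($1 / xmin) }
    # Prints nothing if every value equals xmin (estimator undefined).
    END { if (s > 0) printf "%.3f\n", 1 + n / s }
  '
}
```

Something like `powerlaw_alpha 2 < sizes.txt` would then give the exponent estimate for the tail starting at 2, where sizes.txt is a placeholder for whatever file holds your data.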
There is a mention of the 66-parent merge from Linus himself:<p><a href="http://marc.info/?l=linux-kernel&m=139033182525831" rel="nofollow">http://marc.info/?l=linux-kernel&m=139033182525831</a>
Another interesting piece of trivia: the very first more-than-two-parent merge in the kernel history is a mistake. The second and third parents are <i>the same commit</i>.<p><pre><code> commit 13e652800d1644dfedcd0d59ac95ef0beb7f3165
Merge: 4332bdd 88d7bd8 88d7bd8
Author: David Woodhouse <dwmw2@shinybook.infradead.org>
Date: Sun May 8 13:23:54 2005 +0100
Merge with master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6.git</code></pre>
Some of my favorite commits come from Rusty Russell, who wrote the lguest toy hypervisor documentation as a story:<p><a href="https://github.com/torvalds/linux/commit/f938d2c892db0d80d144253d4a7b7083efdbedeb#diff-847230dec604827964905e0dfec81e42R1" rel="nofollow">https://github.com/torvalds/linux/commit/f938d2c892db0d80d14...</a>
I don't like OP's definition of divergence. I prefer to take the size of the diff along first-parent instead.<p>Here's how I would do it:<p><pre><code> time git log -m --first-parent --shortstat --pretty="%H" --min-parents=2 |
grep -v '^$\|3e1dd193edefd2a806a0ba6cf0879cf1a95217da' |
sed 's/.* file.* changed,//' |
sed 's/insertion.*,/+/' |
sed 's/deletion.*//' |
sed 's/insertion.*//' |
sed 's/^\ \(.*\)\ $/\$\(\(\1\)\)/' |
xargs -d '\n' -L 2 echo echo |
bash |
sort -k 2,2 -g
</code></pre>
Note: I skip 3e1dd193edefd2a806a0ba6cf0879cf1a95217da because that commit has no diff along first-parent, and thus screws up my xargs result (which depends on every 2nd line having the --shortstat output).<p>Of course "--first-parent" doesn't guarantee that we're walking the mainline (see: <a href="https://developer.atlassian.com/blog/2016/04/stop-foxtrots-now/" rel="nofollow">https://developer.atlassian.com/blog/2016/04/stop-foxtrots-n...</a> ), but we <i>usually</i> are.<p>On my laptop it takes 3 minutes 30 seconds. Here are the 5 biggest merges by this definition:<p><pre><code> 099bfbfc7fbbe22356c02f0caf709ac32e1126ea 463702
3f17ea6dea8ba5668873afa54628a91aaa3fb1c0 466320
ce519e2327bff01d0eb54071e7044e6291a52aa6 500074
7ea61767e41e2baedd6a968d13f56026522e1207 504965
f063a0c0c995d010960efcc1b2ed14b99674f25c 569691
</code></pre>
And here's "git show" for those 5:<p><pre><code> 099bfbfc7fbb 2015-06-26T13:18:51-07:00 Merge branch 'drm-next' of git://people.freedesktop.org/~airlied/linux
3f17ea6dea8b 2014-06-08T11:31:16-07:00 Merge branch 'next' (accumulated 3.16 merge window patches) into master
ce519e2327bf 2009-01-06T17:04:29-08:00 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6
7ea61767e41e 2009-09-16T08:11:54-07:00 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6
f063a0c0c995 2010-10-28T12:13:00-07:00 Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging-2.6</code></pre>
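The sed/xargs/bash stage of the pipeline above could probably be swapped for a single awk pass. This is only a sketch of that idea, not the commenter's method: the function name sum_shortstat is made up, and it assumes the input is exactly what `git log --shortstat --pretty=%H` prints (a 40-character hash line, then optionally a shortstat line). Because it tracks the current hash instead of pairing lines, a commit with no diff comes out as 0 and does not need to be excluded:

```shell
# Sum insertions + deletions per commit from `git log --shortstat --pretty=%H`.
# Reads stdin, prints "<hash> <total>" sorted by total, smallest first.
# sum_shortstat is a hypothetical helper name.
sum_shortstat() {
  awk '
    # A bare 40-character hex line starts a new commit.
    length($0) == 40 && $0 ~ /^[0-9a-f]+$/ {
      if (h != "") print h, t
      h = $0; t = 0; next
    }
    # Shortstat line: the number sits just before "insertion(s)"/"deletion(s)".
    / changed/ {
      for (i = 1; i <= NF; i++)
        if ($i ~ /^(insertion|deletion)/) t += $(i - 1)
    }
    END { if (h != "") print h, t }
  ' | sort -k 2,2 -n
}
```

Usage would be along the lines of `git log -m --first-parent --shortstat --pretty=%H --min-parents=2 | sum_shortstat | tail -n 5`.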
> <i>"Christ, that's not an octopus, that's a Cthulhu merge"</i><p>Perhaps git should throw a warning when you try to do an octopus merge with more parents than an octopus has legs. If you really want to proceed, add the --cthulhu option. The default behavior would be --no-cthulhu.
It only has one parent, but this would be the commit that I'm least proud of (not in Linux, obviously):<p><a href="https://github.com/cyrusimap/cyrus-imapd/commit/fdc0eb3d09bcc2ce916d2790c98839a61d403937" rel="nofollow">https://github.com/cyrusimap/cyrus-imapd/commit/fdc0eb3d09bc...</a><p>Showing 126 changed files with 14,128 additions and 20,617 deletions.<p>(ok, I'm pretty proud of reducing code size by 6k+ lines while improving lots of stuff, but the commit is a shitshow)
I think Gary's commit counts are off:<p><pre><code> $ git log | wc -l
</code></pre>
This should count the number of lines in the entire git log, including metadata (not just commits). I think he means this:<p><pre><code> $ git log --oneline | wc -l
</code></pre>
The number of commits for Rails should be closer to 61,000.
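To see how far apart the two numbers are without changing the git log invocation, you can count only the header lines. A small sketch, assuming the default `git log` format where each commit starts with an unindented "commit &lt;hash&gt;" line and message lines are indented (the function name count_commits_in_log is made up):

```shell
# Count commits in default-format `git log` output read from stdin.
# Only header lines begin with "commit <hash>" in column one; commit
# message lines are indented, so they never match.
# count_commits_in_log is a hypothetical helper name.
count_commits_in_log() {
  grep -c '^commit [0-9a-f]'
}
```

`git log | count_commits_in_log` should then agree with `git log --oneline | wc -l`, and `git rev-list --count HEAD` reports the same number directly.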
<i>Octopuses are more common than you might expect</i><p>The etymologically correct plural is <i>octopodes</i>. (Some people accuse "octopodes" of being pedantic, but as I see it "pedantic" is just a euphemism for "correct in a way I don't like".)
Slight article nitpick: a distribution that 'looks like a straight line' in a log-log plot is often <i>not</i> power-law distributed.<p>One could say that the distribution has a fat one-sided tail though.
I used octopus merges once for a deployment system that I built when my team switched from SVN to Git. Since there were a lot of developers working on different parts, we often needed to test several different changes in parallel in the QA system.<p>I built a small web UI where developers could select and unselect development branches, and it would octopus-merge all selected branches into the master branch, and force-push that state onto the QA branch (and deploy it to QA, of course). So QA would always be master + all development branches that were currently being verified. By using a GitHub webhook, it would update the QA system whenever master or one of the branches being verified was pushed to. I'm not on that team anymore, but I think that deployment tool is still humming along nicely.
That was the worst diagram today.
<1 Commits on the y-axis? Where would 30 be on the x-axis? You can't tell if you only have 3 markers on a log axis.