The argument here, I suppose, is to live on the bleeding edge? Or swap compilers/allocators as necessary? Or use -O3? I'm not entirely sure.

It wraps up with:

> COMPILE YOUR SOFTWARES. It's going to help you understand it, make your app faster, more secure (by removing the mail gateway of NGiNX for example), and it will show you which softwares are easily maintanable and reliable.

The PITA is keeping up with compiler changes and library changes of third-party software. In this case, you're throwing different malloc implementations into the mix too, not to mention different libc implementations. With all those variables for each component you deploy, you're probably less likely to understand what's going on in your software.

Maybe for a small app with 2-3 extra pieces, or something that really needs to be optimized, this is good advice, but it sounds like a lot of work for significant footprints.

It would be nice if the official Docker images had some tags that were better optimized, within reason.
The sad fact is that a lot of software is difficult to compile; isn't documented well (and the build documentation is usually even worse); won't work well if installed in a non-standard way, whether that's the final location, different supporting libs, or a different platform; and can take a long time.

I'm happy nowadays when I see there's a binary available. No mucking around with gcc/clang/llvm - just trying to work out which one, let alone which version! - no diving down a rabbit hole of compiling dependencies that then need other dependencies compiled… no deciphering Makefiles written in a way that only a C guru can grok, with no comments.

Whatever the benefits are, I prefer sanity.
Please label your bar graph axes, with units. It's kind of counterproductive to look at a benchmark graph without knowing whether more or less is better.
Another cool thing you can do if you compile yourself is use features like auto-parallelization [1].

I wouldn't recommend enabling it system-wide because it causes issues with programs that fork() due to limitations in GCC's OpenMP library [2], but other than that it works pretty well. For example, I can fully load my 4C/8T CPU using 3 clang processes because compilation is magically spread over multiple threads. I've seen a "single-threaded" program (qemu-img) suddenly start using more than a single core to convert disk images into other formats, leading to speedups.

Also, things like PGO/FDO in combination with workload-specific profiling data can easily give you 10% or more if you are CPU-bound.

[1]: https://gcc.gnu.org/wiki/AutoParInGCC

[2]: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42624 (There was a patch to fix this, but it never got merged and doesn't apply to the current version any more, sadly)
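To make the auto-parallelization point concrete, here's a minimal C sketch; the file name, thread count, and exact flags are illustrative assumptions, not something from the comment above. It shows the shape of loop GCC's tree parallelizer can split across threads via -ftree-parallelize-loops (whether it actually does depends on its cost model and alias analysis), using the same libgomp runtime the fork() caveat in [2] is about.

    /* autopar_demo.c - hypothetical example.
     *
     * Assumed build commands:
     *   plain:    gcc -O2 autopar_demo.c -o autopar_demo
     *   auto-par: gcc -O2 -ftree-parallelize-loops=4 autopar_demo.c -o autopar_demo
     *
     * The parallelized binary links against libgomp, GCC's OpenMP runtime,
     * which is where the fork() limitation mentioned above comes from.
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define N 50000000L

    int main(void) {
        double *a = malloc(N * sizeof *a);
        if (!a)
            return 1;

        /* Independent iterations with no loop-carried dependency:
         * the kind of loop the tree parallelizer can distribute
         * across multiple threads. */
        for (long i = 0; i < N; i++)
            a[i] = (double)i * 0.5 + 1.0;

        /* A reduction; the auto-parallelizer can handle these too. */
        double sum = 0.0;
        for (long i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %f\n", sum);
        free(a);
        return 0;
    }

For the PGO/FDO side, the usual GCC flow (again an assumption about how you'd apply it, not taken from the comment) is: build with -fprofile-generate, run a representative workload, then rebuild with -fprofile-use.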
Please avoid GIFs and memes in your article. They add nothing to the actual information in the article, but take the seriousness away and make it less readable.
I wonder how x86-64-v3 for Arch (and v2 for Fedora) in the near future will change this calculus. Currently you're basically compiling for a Core 2/Athlon 64 era chip, so there are clear wins to be had, but I wonder how much of the benefit can be had just by using software that requires Haswell/Zen 1 at minimum.
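For reference, x86-64-v3 roughly corresponds to the Haswell-era feature set (AVX, AVX2, FMA, BMI1/2, F16C, LZCNT, MOVBE, XSAVE). A quick way to see whether a machine clears most of that bar is a runtime check like the hypothetical sketch below; it only probes the subset of features named in the comments, not the full level definition.

    /* v3_check.c - hypothetical sketch.
     * Probes a few of the CPU features that the x86-64-v3 level requires
     * (AVX, AVX2, FMA, BMI1, BMI2). The full level also needs F16C, LZCNT,
     * MOVBE and XSAVE, which this sketch doesn't check.
     *
     * Assumed build: gcc -O2 v3_check.c -o v3_check
     */
    #include <stdio.h>

    int main(void) {
        __builtin_cpu_init();  /* initialise GCC's CPU feature detection */

        int ok = 1;
        if (!__builtin_cpu_supports("avx"))  { puts("missing: avx");  ok = 0; }
        if (!__builtin_cpu_supports("avx2")) { puts("missing: avx2"); ok = 0; }
        if (!__builtin_cpu_supports("fma"))  { puts("missing: fma");  ok = 0; }
        if (!__builtin_cpu_supports("bmi"))  { puts("missing: bmi1"); ok = 0; }
        if (!__builtin_cpu_supports("bmi2")) { puts("missing: bmi2"); ok = 0; }

        printf("x86-64-v3 (partial check): %s\n", ok ? "looks OK" : "not met");
        return ok ? 0 : 1;
    }

On the toolchain side, recent GCC and Clang accept -march=x86-64-v2 / -march=x86-64-v3 directly, which is presumably what the distro builds mentioned above would use as their baseline.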
On my side projects I compile everything myself, but I don't completely agree with this post, because Redis is one of the easiest/fastest mainstream databases to compile; in general it can get very time-consuming and the returns are not always there.