I realize one needs a catchy title and some storytelling to get people to read a blog article, but for a summary of the main points:<p>* This is not about a build step that makes the app perform better<p>* The app isn't 10x faster (or faster at all; it's the same binary)<p>* The author ran a benchmark two ways, one of which inadvertently included the time taken to generate sample input data, because it was coming from a pipe<p>* Generating the data before starting the program under test fixes the measurement
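A sketch of that last fix (command and file names hypothetical):<p><pre><code> # before: the app's internal timer starts while the generator is still producing data
 $ ./generate-input | ./app

 # after: generate the data up front, then let the app read a finished file
 $ ./generate-input > input.dat
 $ ./app < input.dat
</code></pre>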
Back in college, a friend of mine decided to learn how to program. He had never programmed before. He picked up the DEC FORTRAN-10 manual and read it cover to cover.<p>He then wrote a program that generated some large amount of data and wrote it to a file. Being much smarter than I am, his first program worked the first time.<p>But it ran terribly slowly. Baffled, he showed it to his friend, who exclaimed: why are you opening the file, appending one character, and closing the file again, all inside a loop? That's going to run incredibly slowly. Instead, open the file once, write all the data, then close it!<p>The reply was "the manual didn't say anything about that, or about how to do I/O efficiently."
I don't want to belittle the author, but I am surprised that people using a low-level language on Linux wouldn't know how Unix pipelines work, or that reading one byte per syscall is quite inefficient. I understand that the author is still learning (aren't we all?), but it feels like pretty fundamental knowledge. At the same time, the author managed to get better performance than the official implementation, so I guess many things only feel fundamental in retrospect.
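For anyone who hasn't felt the cost of one syscall per byte, dd makes a quick demonstration (both commands move the same million bytes):<p><pre><code> $ time dd if=/dev/zero of=/dev/null bs=1 count=1000000   # ~2 million one-byte syscalls
 $ time dd if=/dev/zero of=/dev/null bs=1000000 count=1   # one read, one write
</code></pre>On a typical machine the first takes on the order of a second; the second is effectively instant.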
There is general wisdom about bash pipelines here that I think most people will miss simply because of the title. Interestingly, my mental model of bash piping was wrong too.
I was so confused about why this mattered/made <i>such</i> a difference - until I went back and re-read from the top: OP does the benchmark timing in `main`, inside the Zig app under test.<p>If you don't do that - if you use the `time` CLI, for example - this wouldn't have been a problem in the first place. Sure, you couldn't have compared compiling fresh & running that way, and at least on small inputs you'd have wanted to do the input prep first anyway.<p>But if you put the benchmark code inside the DUT (device under test), you're setting yourself up for all kinds of gotchas like this.
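Worth noting that bash's `time` keyword covers an entire pipeline, so you can time the piped invocation end to end; and a `real` figure far above `user`+`sys` tells you the process spent most of its life blocked on the pipe rather than computing. A sketch (same hypothetical names as above):<p><pre><code> $ time ./generate-input | ./app   # whole pipeline, generation included
 $ time ./app < input.dat          # input prepared beforehand: just the app
</code></pre>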
There seems to be a small misunderstanding of the behavior of pipes here. All the commands in a bash pipeline do start at the same time, and output lands in the pipe's buffer whenever the writing process writes it. There is no specific point at which the "output from jobA is ready".<p>The author's example, "<i>jobA starts, sleeps for three seconds, prints to stdout, sleeps for two more seconds, then exits</i>" and "<i>jobB starts, waits for input on stdin, then prints everything it can read from stdin until stdin closes</i>", measures 5 seconds not because jobB's input is unavailable until jobA terminates, but because jobB keeps reading until the pipe closes, which doesn't happen until jobA exits. That explains the timing of the output:<p><pre><code> $ ./jobA | ./jobB
09:11:53.326 jobA is starting
09:11:53.326 jobB is starting
09:11:53.328 jobB is waiting on input
09:11:56.330 jobB read 'result of jobA is...' from input
09:11:58.331 jobA is terminating
09:11:58.331 jobB read '42' from input
09:11:58.333 jobB is done reading input
09:11:58.335 jobB is terminating
</code></pre>
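A minimal shell reconstruction of the two jobs (the originals were presumably compiled programs, so this is just a sketch) reproduces that timeline:<p><pre><code> # jobA: produce output in two stages
 sleep 3; echo "result of jobA is..."
 sleep 2; echo "42"

 # jobB: echo each line of stdin until the pipe closes (EOF)
 while IFS= read -r line; do echo "jobB read '$line' from input"; done
 echo "jobB is done reading input"
</code></pre>Run them as `sh jobA.sh | sh jobB.sh`: the first line arrives at the 3-second mark, but jobB only sees EOF, and therefore finishes, when jobA exits at 5 seconds.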
The bottom line is that it's important to actually measure what you want to measure.
This post is another example of why I like Zig so much. It seems to get people talking about performance in a way that helps them learn how things work beneath today's heavily abstracted veneer.
If you want to create something like the pipe behaviour the author expected (buffer all the output before sending it to the next command), the sponge command from moreutils can help.
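Using the jobA/jobB example from elsewhere in the thread, it's a one-word change (sponge with no file argument writes to stdout):<p><pre><code> # sponge soaks up all of jobA's output and only writes it onward once jobA exits
 $ ./jobA | sponge | ./jobB
</code></pre>jobB still starts immediately; its stdin simply stays empty until jobA finishes.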
My first guess involved caching, but I was thinking about whether the binary itself had to be read from disk or was already cached in RAM. Great linux-fu post.
If I were trying to optimize my code, I would start by loading the entire benchmark bytecode into memory, and only then start the counter. Otherwise I can't be sure how much time is spent reading from a pipe/file into memory versus how much is spent in my code.<p>Then I would benchmark what happens when it all fits in L1 cache, L2, L3, and main memory.<p>Of course, if the common use case is reading from a file, network, or pipe, maybe you can optimize that, but I would take it step by step.
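In-program timers are the real answer for that kind of separation, but at the shell level you can at least take cold-disk reads out of the timed run (file name hypothetical):<p><pre><code> $ cat input.dat > /dev/null   # fault the file into the page cache first
 $ time ./app < input.dat      # the timed run no longer pays for cold disk I/O
</code></pre>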
> By adding a benchmarking script to my continuous integration and archiving the results, it was easy for me to identify when my measurements changed.<p>This assumes CI runs on the same machine with the same hardware every time, but most CI doesn't do that.
The TL;DR is that the build step masks the wait for input from a shell pipe. With a side dish of "do buffered input" and then a small "avoid memory allocation for fun."
This is an excellent writeup, with interesting ideas and a clear description of the actions taken. My idea of pipelines was flawed, too. Well done!<p>Nothing to do with Zig. Just a nice debugging story.
You can easily hit a similar problem in other languages too. For example, in Rust, std::fs::File isn't buffered, so reading single bytes from it will also be rather slow; the usual fix is to wrap it in std::io::BufReader.