I realize one needs a catchy title and some storytelling to get people to read a blog article, but for a summary of the main points:<p>* This is not about a build step that makes the app perform better<p>* The app isn't 10x faster (or faster at all; it's the same binary)<p>* The author ran a benchmark two ways, one of which inadvertently included the time taken to generate sample input data, because it was coming from a pipe<p>* Generating the data before starting the program under test fixes the measurement
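A sketch of that last fix (command and file names hypothetical):<p><pre><code> # before: the app's internal timer starts while the generator is still producing data
 $ ./generate-input | ./app

 # after: generate the data up front, then let the app read a finished file
 $ ./generate-input > input.dat
 $ ./app < input.dat
</code></pre>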
Back in college, a friend of mine decided to learn how to program. He had never programmed before. He picked up the DEC FORTRAN-10 manual and read it cover to cover.<p>He then wrote a program that generated some large amount of data and wrote it to a file. Being much smarter than I am, his first program worked the first time.<p>But it ran terribly slowly. Baffled, he showed it to his friend, who exclaimed: why are you opening the file, appending one character, and closing the file again, all inside a loop? That's going to run incredibly slowly. Instead, open the file once, write all the data, then close it!<p>The reply was "the manual didn't say anything about that, or about how to do I/O efficiently."
I don't want to belittle the author, but I am surprised that people using a low-level language on Linux wouldn't know how Unix pipelines work, or that reading one byte per syscall is quite inefficient. I understand that the author is still learning (aren't we all?), but it feels like pretty fundamental knowledge. At the same time, the author managed to get better performance than the official implementation, so I guess many things only feel fundamental in retrospect.
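For anyone who hasn't felt the cost of one syscall per byte, dd makes a quick demonstration (both commands move the same million bytes):<p><pre><code> $ time dd if=/dev/zero of=/dev/null bs=1 count=1000000   # ~2 million one-byte syscalls
 $ time dd if=/dev/zero of=/dev/null bs=1000000 count=1   # one read, one write
</code></pre>On a typical machine the first takes on the order of a second; the second is effectively instant.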
There is general wisdom about bash pipelines here that I think most people will miss simply because of the title. Interestingly, my mental model of bash piping was wrong too.
I was so confused about why this mattered/made <i>such</i> a difference - until I went back and re-read from the top: OP does the benchmark timing in `main`, inside the Zig app under test.<p>If you don't do that - if you use the `time` CLI, for example - this wouldn't have been a problem in the first place. Sure, you couldn't have compared compiling fresh & running that way, and at least on small inputs you'd have wanted to do the input prep first anyway.<p>But if you put the benchmark code inside the DUT (device under test), you're setting yourself up for all kinds of gotchas like this.
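Worth noting that bash's `time` keyword covers an entire pipeline, so you can time the piped invocation end to end; and a `real` figure far above `user`+`sys` tells you the process spent most of its life blocked on the pipe rather than computing. A sketch (same hypothetical names as above):<p><pre><code> $ time ./generate-input | ./app   # whole pipeline, generation included
 $ time ./app < input.dat          # input prepared beforehand: just the app
</code></pre>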
There seems to be a small misunderstanding of the behavior of pipes here. All the commands in a bash pipeline do start at the same time, and output lands in the pipe's buffer whenever the writing process writes it. There is no specific point at which the "output from jobA is ready".<p>The author's example, "<i>jobA starts, sleeps for three seconds, prints to stdout, sleeps for two more seconds, then exits</i>" and "<i>jobB starts, waits for input on stdin, then prints everything it can read from stdin until stdin closes</i>", measures 5 seconds not because jobB's input is unavailable until jobA terminates, but because jobB keeps reading until the pipe closes, which doesn't happen until jobA exits. That explains the timing of the output:<p><pre><code> $ ./jobA | ./jobB
09:11:53.326 jobA is starting
09:11:53.326 jobB is starting
09:11:53.328 jobB is waiting on input
09:11:56.330 jobB read 'result of jobA is...' from input
09:11:58.331 jobA is terminating
09:11:58.331 jobB read '42' from input
09:11:58.333 jobB is done reading input
09:11:58.335 jobB is terminating
</code></pre>
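A minimal shell reconstruction of the two jobs (the originals were presumably compiled programs, so this is just a sketch) reproduces that timeline:<p><pre><code> # jobA: produce output in two stages
 sleep 3; echo "result of jobA is..."
 sleep 2; echo "42"

 # jobB: echo each line of stdin until the pipe closes (EOF)
 while IFS= read -r line; do echo "jobB read '$line' from input"; done
 echo "jobB is done reading input"
</code></pre>Run them as `sh jobA.sh | sh jobB.sh`: the first line arrives at the 3-second mark, but jobB only sees EOF, and therefore finishes, when jobA exits at 5 seconds.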
The bottom line is that it's important to actually measure what you want to measure.
This post is another example of why I like Zig so much. It seems to get people talking about performance in a way that helps them learn how things work beneath today's heavily abstracted veneer.
If you want to create something like the pipe behaviour the author expected (buffer all the output before sending it to the next command), the sponge command from moreutils can help.
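Using the jobA/jobB example from elsewhere in the thread, it's a one-word change (sponge with no file argument writes to stdout):<p><pre><code> # sponge soaks up all of jobA's output and only writes it onward once jobA exits
 $ ./jobA | sponge | ./jobB
</code></pre>jobB still starts immediately; its stdin simply stays empty until jobA finishes.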
My first guess involved caching, but I was thinking about whether the binary itself had to be read from disk or was already cached in RAM. Great linux-fu post.
If I were trying to optimize my code, I would start by loading the entire benchmark bytecode into memory, and only then start the counter. Otherwise I can't be sure how much time is spent reading from a pipe/file into memory versus how much is spent in my code.<p>Then I would benchmark what happens when it all fits in L1 cache, L2, L3, and main memory.<p>Of course, if the common use case is reading from a file, network, or pipe, maybe you can optimize that, but I would take it step by step.
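In-program timers are the real answer for that kind of separation, but at the shell level you can at least take cold-disk reads out of the timed run (file name hypothetical):<p><pre><code> $ cat input.dat > /dev/null   # fault the file into the page cache first
 $ time ./app < input.dat      # the timed run no longer pays for cold disk I/O
</code></pre>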
> By adding a benchmarking script to my continuous integration and archiving the results, it was easy for me to identify when my measurements changed.<p>This assumes CI runs on the same machine with the same hardware every time, but most CI doesn't do that.
The TL;DR is that the build step masks the wait for input from a shell pipe. With a side dish of "do buffered input" and then a small "avoid memory allocation for fun."
This is an excellent writeup, with interesting ideas and a clear description of the actions taken. My idea of pipelines was flawed, too. Well done!<p>Nothing to do with Zig. Just a nice debugging story.
You can easily hit a similar problem in other languages too. For example, in Rust, std::fs::File isn't buffered, so reading single bytes from it will also be rather slow; the usual fix is to wrap it in std::io::BufReader.