TechEcho

{n} times faster than C

447 points by 414owen, almost 2 years ago

31 comments

torstenvl, almost 2 years ago
I'm not so sure that the right take-away is "hand-written assembler is 6x faster than C." It's more like "jumps are a lot slower than conditional arithmetic." And that can [edit: often] be achieved easily in C by simply not using switch statements when an if statement or two will work fine.

Rewriting the C function as follows got a 5.5x speedup:

    int run_switches(char *input) {
        int r = 0;
        char c;
        while (1) {
            c = *input++;
            if (c == 's') r++;
            if (c == 'p') r--;
            if (c == '\0') break;
        }
        return r;
    }

Results:

    [16:50:14 user@boxer ~/looptest]
    $ gcc -O3 bench.c loop1.c -o lone
    [16:50:37 user@boxer ~/looptest]
    $ gcc -O3 bench.c loop2.c -o ltwo
    [16:50:47 user@boxer ~/looptest]
    $ time ./lone 1000 1
    449000
    ./lone 1000 1  3.58s user 0.00s system 99% cpu 3.589 total
    [16:50:57 user@boxer ~/looptest]
    $ time ./ltwo 1000 1
    449000
    ./ltwo 1000 1  0.65s user 0.00s system 99% cpu 0.658 total
nwallin, almost 2 years ago
IMHO the original code wasn't written in a way that's particularly friendly to compilers. If you write it like this:

    int run_switches_branchless(const char* s) {
        int result = 0;
        for (; *s; ++s) {
            result += *s == 's';
            result -= *s == 'p';
        }
        return result;
    }

...the compiler will do all the branchless sete/cmov stuff as it sees fit. It will be the same speed as the optimized assembly in the post, +/- something insignificant. However it won't unroll and vectorize the loop. If you write it like this:

    int run_switches_vectorized(const char* s, size_t size) {
        int result = 0;
        for (; size--; ++s) {
            result += *s == 's';
            result -= *s == 'p';
        }
        return result;
    }

It will know the size of the loop, and will unroll it and use AVX-512 instructions if they're available. This will be substantially faster than the first loop for large inputs, although I'm too lazy to benchmark just how much faster it is.

Now, this requires knowing the size of your string in advance, and maybe you're the sort of C programmer who doesn't keep track of how big your strings are. I'm not your coworker, I don't review your code. Do what you want. But you really *really* probably shouldn't.

https://godbolt.org/z/rde51zMd8
Const-me, almost 2 years ago
I'm probably an optimization expert, and I would solve that problem completely differently.

On my computer, the initial C version runs at 389 MB/second. I haven't tested the assembly versions, but if they deliver the same 6.2x speedup, that would result in 2.4 GB/second here.

Here's a C++ version which for long buffers exceeds 24 GB/second on my computer: https://gist.github.com/Const-me/3ade77faad47f0fbb0538965ae7f8e04 That's a 61x speedup compared to the original version, without any assembly, based on AVX2 intrinsics.
Sesse__, almost 2 years ago
This code screams for SIMD! If you can change the prototype to take an explicit length, you could easily read and process 16 bytes at a time (the compares will give you values you can just add and subtract directly). Heck, even calling strlen() at the function's start to get the explicit length would probably be worth it.
camel-cdr, almost 2 years ago
I threw together a quick RISC-V vectorized implementation:

    size_t run(char *str) {
        uint8_t *p = (uint8_t*)str;
        long end = 0;
        size_t res = 0, vl;
        while (1) {
            vl = __riscv_vsetvlmax_e8m8();
            vuint8m8_t v = __riscv_vle8ff_v_u8m8(p, &vl, vl);
            end = __riscv_vfirst_m_b1(__riscv_vmseq_vx_u8m8_b1(v, '\0', vl), vl);
            if (end >= 0) break;
            res += __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 's', vl), vl);
            res -= __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 'p', vl), vl);
            p += vl;
        }
        vl = __riscv_vsetvl_e8m8(end);
        vuint8m8_t v = __riscv_vle8_v_u8m8(p, vl);
        res += __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 's', vl), vl);
        res -= __riscv_vcpop_m_b1(__riscv_vmseq_vx_u8m8_b1(v, 'p', vl), vl);
        return res;
    }

Here are the results from the above, the switch and the table C implementations, run on my MangoPi MQ Pro (C906, in-order rv64gc with RVV 0.7.1, and a 128-bit vector length):

    switch: 0.19 Bytes/Cycle
    tbl:    0.17 Bytes/Cycle
    rvv:    1.57 Bytes/Cycle (dips down to 1.35 after ~30 KiB)

Edit: you can go up to 2/1.7 Bytes/Cycle if you make sure the pointer is page aligned (and vl isn't larger than the page size), see comments
okaleniuk, almost 2 years ago
I think it's a particular quirk of the x86 architecture. Branching is expensive in comparison because not doing branching is super cheap. https://wordsandbuttons.online/challenge_your_performance_intuition_with_cpp_operators.html

However, on other processors, this might not be the case. https://wordsandbuttons.online/using_logical_operators_for_logical_operations_is_good.html

The good question is what we need C for in general. Of course, we can hand-tailor our code to run best on one particular piece of hardware. And we don't need C for that; it would be the wrong tool. We need assembly (and a decent macro system for some sugar).

But the original goal of C was to make translating system-level code from one platform to another easier. And we're expected to lose efficiency in this operation. It's like instead of writing a poem in Hindi and translating it into Urdu, we write one in Esperanto and then translate it to whatever language we want automatically. You don't get two brilliant poems, you only get two poor translations, but you get them fast. That's what C is for.
eklitzke, almost 2 years ago
Rearranging branches (and perhaps blocks too?) will definitely be done if you are building using FDO, because without FDO (or PGO) the compiler has no idea how likely each branch is to be taken. Cmov can also be enabled by FDO in some cases.

However, whether or not using cmov is effective compared to a regular test/jump is highly dependent on how predictable the branch is, with cmov typically performing better when the branch is very unpredictable. Since they got a 6x speedup with cmov, I assume that their test input (which isn't described in the post, and is also not in their GitHub repo) consists of random strings consisting almost entirely of s and p characters. There's nothing wrong with this, but it does make the post seem a little misleading to me, as their clever speedup is mostly about exploiting an unmentioned property of the data that is highly specific to their benchmark.
xoranth, almost 2 years ago
I think I managed to improve on both this post and its sequel, at the cost of specializing the function for the case of a string made only of 's' and 'p'.

The benchmark only tests strings made of 's' and 'p', so I think it is fair.

The idea is as follows. We want to increase `res` by one when the next character is 's'. Naively, we might try something like this:

    res += (c - 'r');  // is `res += 1` when c == 's'

This doesn't work, as 'p' - 'r' == -2, and we'd need it to be -1.

But 'p' - 'r', when viewed as an unsigned integer, underflows, setting the carry flag. It turns out x64 has an instruction (adc) that adds two registers _plus_ the carry flag.

Therefore we can replace two `cmp, cmov` with one `sub, adc`:

    run_switches:
        xor eax, eax               # res = 0
    loop:
        movsx ecx, byte ptr [rdi]
        test ecx, ecx
        je ret
        inc rdi
        sub ecx, 'r'
        adc eax, ecx               # Magic happens here
        jmp loop
    ret:
        ret

Benchmarks are as follows (`bench-x64-8` is the asm above):

    Summary
      '01-six-times-faster-than-c/bench-x64-8 1000 1' ran
        1.08 ± 0.00 times faster than '02-the-same-speed-as-c/bench-c-4-clang 1000 1'
        1.66 ± 0.00 times faster than '01-six-times-faster-than-c/bench-x64-7 1000 1'

Of course, one could improve things further using SWAR/SIMD...
sltkr, almost 2 years ago
How much faster is this:

    int run_switches(const char *buf) {
        size_t len = strlen(buf);
        int res = 0;
        for (size_t i = 0; i < len; ++i) {
            res += (buf[i] == 's') - (buf[i] == 'p');
        }
        return res;
    }

strlen() should be implemented in a pretty fast way, and after the buffer size is known, the compiler can autovectorize the inner loop, which does happen in practice: https://gcc.godbolt.org/z/qYfadPYoq
aidenn0, almost 2 years ago
A while back, I wrote a UTF-8 decoder in Common Lisp, targeting SBCL (it already has one built in, this was an exercise). Pretty much all of the optimization win (after the obvious low-hanging fruit) was structuring the code so that the compiler would generate cmov* instructions rather than branches.
fefe23, almost 2 years ago
First, before optimizing you should consider correctness and security. input should be const and the return value should be ssize_t (so you don't have numeric overflow on 64-bit).

Second, consider this replacement function:

    ssize_t test(const char *input) {
        ssize_t res = 0;
        size_t l = strlen(input);
        size_t i;
        for (i = 0; i < l; ++i) {
            res += (input[i] == 's') - (input[i] == 'p');
        }
        return res;
    }

The timings are (using gcc -O3 -march=native): your function 640 cycles, mine 128 cycles. How can that be? I'm reading the memory twice! I have one call to strlen in there, and memory is slow. Shouldn't this be much slower?

No. strlen is a hack that uses vector instructions even though it may technically read beyond the string length. It makes sure not to cross page boundaries so it will not cause adverse reactions, but valgrind needs a suppression exception to not complain about it.

If you know the length beforehand, the compiler can vectorize and unroll the loop, which it happens to do here. To great effect, if I may say so.

The art of writing fast code is usually to get out of the way of the compiler, which will do a perfectly fine job if you let it.

If you really wanted to, you could get rid of the strlen by hacking your logic into what strlen does. That would make the C code much less readable and not actually help that much. My test string is "abcdefghijklmnopqrstuvxyz", so it's all in the L1 cache.
torstenvl, almost 2 years ago
There's an error in the pseudocode.

    cmp ecx, 's'   # if (c == 's')
    jne loop       # continue
    add eax, 1     # res++
    jmp loop       # continue

should be

    cmp ecx, 's'   # if (c != 's')
    jne loop       # continue
    add eax, 1     # res++
    jmp loop       # continue
414owen, almost 2 years ago
A clickbait title for an in-depth look at hand-optimizing a very simple loop.
jtriangle, almost 2 years ago
It's a cardinal rule that any time someone utters "XYZ is n times faster than C," someone comes along and shows C is actually 2x faster than XYZ.
BoppreH, almost 2 years ago
You can also use math to avoid most of the jumps:

    int run_switches(char *input) {
        int res = 0;
        while (true) {
            char c = *input++;
            if (c == '\0') return res;
            // Here's the trick:
            res += (c == 's') - (c == 'p');
        }
    }

This gives a 3.7x speedup compared to loop-1.c. The lower line count is also nice.
eru, almost 2 years ago
Compare also https://codegolf.stackexchange.com/a/236630/32575 "High throughput Fizz Buzz", where someone uses assembly to generate Fizz Buzz at around 54-56 GiB/s.
gavinray, almost 2 years ago
Fantastic post. I appreciated that the ASM was displayed in tabs as both "standard" and "visual-arrows"-annotated.

Kept me reading into the follow-up article.

Also, I love the UI of this blog.
arun-mani-j, almost 2 years ago
Any guide on how a person who uses Python or JavaScript can learn such things? I mean knowing which assembly code would be better, which algorithm makes better use of the processor, etc. :)

Also, how is such optimization carried out in large-scale software? Like, do you tweak the generated assembly code manually? (Sorry, I'm a very very very beginner at low-level code.)
vardump, almost 2 years ago
I think it's straightforward to optimize to a point where it's maybe about 10x faster than the "optimized" version. The answer is of course SIMD vectorization.
red2awn, almost 2 years ago
I experimented with different optimizations and ended up with a 128x speedup. The improvement mainly comes from manual SIMD intrinsics, but you can go a long way just by making the code more auto-vectorization friendly, as some other comments have mentioned. See:

https://ipthomas.com/blog/2023/07/n-times-faster-than-c-where-n-128/
amm, almost 2 years ago
Back-of-the-envelope approach that should eliminate most branching:

    int table[256] = {0};

    void init() {
        table['s'] = 1;
        table['p'] = -1;
    }

    int run_switches(char *input, int size) {
        int res = 0;
        while (size-- > 0)  /* `size-- >= 0` would read input[-1] */
            res += table[(unsigned char)input[size]];  /* cast avoids negative indices */
        return res;
    }
lukas099, almost 2 years ago
Would it be possible to write a code profiler and compiler that work together to optimize code based on real-world data? The profiler would output data that would feed back into the compiler, telling it which branches were selected most often, which would recompile optimizing for the profile. Would this even work? Has it already been done?
olliej, almost 2 years ago
I see other people have done minor rewrites, but the post does mention reordering branches, so the obvious question is whether there was any attempt to use PGO, which is an obvious first step in optimization.
einpoklum, almost 2 years ago
A very instructional post. I wish more people had such a level of mastery of GPU assembly and its effects, and would post such treatments on outsmarting NVIDIA's (or AMD's) optimizers.
failuser, almost 2 years ago
Having full-blown predicate support is nice, but it interferes with compact instruction encoding.

A bloated ISA like x86 might actually handle predicate support, but who will try such a radical change?
sitkack, almost 2 years ago
This is such a wonderful post! Heavenly.
rajnathani, almost 2 years ago
Really interesting. The recent HN article on branchless binary search also covered cmov: https://news.ycombinator.com/item?id=35737862
RobotToaster, almost 2 years ago
Was the C compiled with optimisation enabled?
kristianpaul, almost 2 years ago
How fast is Forth compared to C these days?
throwaway14356, almost 2 years ago
naive q: could one just count one of the letters and subtract it from the total number of letters?
orlp, almost 2 years ago
I made a variant that is (on my Apple M1 machine) 20x faster than the naive C version in the blog by branchlessly processing the string word-by-word:

    int run_switches(const char* input) {
        int res = 0;

        // Align to word boundary.
        while ((uintptr_t) input % sizeof(size_t)) {
            char c = *input++;
            res += c == 's';
            res -= c == 'p';
            if (c == 0) return res;
        }

        // Process word-by-word.
        const size_t ONES = ((size_t) -1) / 255;   // 0x...01010101
        const size_t HIGH_BITS = ONES << 7;        // 0x...80808080
        const size_t SMASK = ONES * (size_t) 's';  // 0x...73737373
        const size_t PMASK = ONES * (size_t) 'p';  // 0x...70707070
        size_t s_accum = 0;
        size_t p_accum = 0;
        int iters = 0;
        while (1) {
            // Load word and check for zero byte.
            // (w - ONES) & ~w has the top bit set in each byte where that byte is zero.
            size_t w;
            memcpy(&w, input, sizeof(size_t));
            if ((w - ONES) & ~w & HIGH_BITS) break;
            input += sizeof(size_t);

            // We reuse the same trick as before, but XORing with SMASK/PMASK first to get
            // exactly the high bits set where a byte is 's' or 'p'.
            size_t s_high_bits = ((w ^ SMASK) - ONES) & ~(w ^ SMASK) & HIGH_BITS;
            size_t p_high_bits = ((w ^ PMASK) - ONES) & ~(w ^ PMASK) & HIGH_BITS;

            // Shift down and accumulate.
            s_accum += s_high_bits >> 7;
            p_accum += p_high_bits >> 7;

            if (++iters >= 255 / sizeof(size_t)) {
                // To prevent overflow in our byte-wise accumulators we must flush
                // them every so often. We use a trick by noting that 2^8 = 1 (mod 255)
                // and thus a + 2^8 b + 2^16 c + ... = a + b + c (mod 255).
                res += s_accum % 255;
                res -= p_accum % 255;
                iters = s_accum = p_accum = 0;
            }
        }
        res += s_accum % 255;
        res -= p_accum % 255;

        // Process tail.
        while (1) {
            char c = *input++;
            res += c == 's';
            res -= c == 'p';
            if (c == 0) break;
        }
        return res;
    }

Fun fact: the above is still 1.6x slower (on my machine) than the naive two-pass algorithm that gets autovectorized by clang:

    int run_switches(const char* input) {
        size_t len = strlen(input);
        int res = 0;
        for (size_t i = 0; i < len; ++i) {
            char c = input[i];
            res += c == 's';
            res -= c == 'p';
        }
        return res;
    }