
Do not taunt happy fun branch predictor

314 points by mkeeter over 2 years ago

22 comments

jerf over 2 years ago

This is another good example of how our CPUs are in many ways specialized C processors. C is a structured programming language that uses functions, so our processors like functions. If you jump out of that paradigm, even if the assembly instructions nominally seem to allow it, you'll run more slowly. Even when it seems like what you're offering is a shortcut to the CPU.

This is neither praise nor criticism of the current CPU paradigm; it's just something you need to understand if you want the best performance out of our machines.

A different paradigm, like a concatenative-paradigm-based program, might naively be more inclined to compile into code that looks more like what the author tried, jumping between implementations of the stack operators without it actually being "functions". One can imagine processors that would be "happier" with that, and would be bothered by things that look like function returns more. But that's not the CPUs we have.
titzer over 2 years ago

> More specifically, the branch predictor probably keeps an internal stack of function return addresses, which is pushed to whenever a bl is executed. When the branch predictor sees a ret coming down the pipeline, it assumes that you're returning to the address associated with the most recent bl (and begins prefetching / speculative execution / whatever), then pops that top address from its internal stack.

There's no need for "probably" here. The micro-architectural mechanism is known as a return stack buffer[1] and is generally separate from the branch predictor unit, though the processor may make use of indirect branch prediction entries for returns as well.

[1] It is, indeed, a tiny little stack of return addresses and indeed, the article hit performance issues by misaligning it. The (Intel chips') RSB is behind the Retbleed vulnerabilities.
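The return stack buffer titzer describes can be sketched as a toy model (a simplification for illustration only; real RSBs differ in depth and replacement details). A `bl` paired with a `ret` predicts correctly; a `ret` used as an indirect jump, as in the article, pops an address the code never actually returns to:

```python
# Toy return stack buffer: `bl` pushes a return address, `ret` is
# predicted by popping the most recent one. Depth is fixed and small.
class ReturnStackBuffer:
    def __init__(self, depth=16):
        self.depth = depth
        self.stack = []

    def on_call(self, return_addr):      # models a `bl`
        if len(self.stack) == self.depth:
            self.stack.pop(0)            # oldest entry silently lost
        self.stack.append(return_addr)

    def predict_return(self):            # models a `ret`
        return self.stack.pop() if self.stack else None

rsb = ReturnStackBuffer()

# Matched bl/ret pair: the prediction is correct.
rsb.on_call(0x100)
assert rsb.predict_return() == 0x100

# Mismatched pairing: the `ret` is really an indirect jump, so the
# popped address is not where execution actually goes.
rsb.on_call(0x200)
rsb.on_call(0x300)
actual_target = 0x200
pred = rsb.predict_return()   # predicts 0x300 -> misprediction
assert pred != actual_target
```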
FullyFunctional over 2 years ago

Well, *of course*, the Return Address Stack (RAS) predictor maintains its own call stack and you need to understand how it works. However, there's a subtler way to break it: recurse too deeply. The RAS only has a fixed, small, and *implementation dependent* length. If you use deep recursion with non-trivial control flow (in particular multiple call sites), then the RAS will start missing once you return from beyond that limit.

Another consequence of the RAS is that co-routine switching is more expensive than it might appear at first. RISC-V has encoding hints to mark calls (jal)/returns that are actually co-routine switches, but the full cost can't be avoided.
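The deep-recursion failure mode can be shown with the same kind of toy model (a depth of 4 is an arbitrary assumption here; real sizes are implementation dependent): once the call depth exceeds the RAS capacity, the outermost returns have lost their entries and miss.

```python
# Toy fixed-depth RAS: recursion deeper than DEPTH overwrites the
# oldest entries, so returns beyond that depth mispredict.
DEPTH = 4
ras = []

def call(ret_addr):
    if len(ras) == DEPTH:
        ras.pop(0)                 # overflow: oldest entry overwritten
    ras.append(ret_addr)

def ret(actual_addr):
    predicted = ras.pop() if ras else None
    return predicted == actual_addr   # True = correct prediction

# Recurse 10 deep, then unwind.
addrs = list(range(10))
for a in addrs:
    call(a)
results = [ret(a) for a in reversed(addrs)]

assert results[:DEPTH] == [True] * DEPTH   # innermost returns hit
assert not any(results[DEPTH:])            # the rest all miss
```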
ahh over 2 years ago

Interestingly, Matt has invented a variant on the retpoline [1], which _intentionally_ missteers the branch predictor to prevent various speculative attacks. (Invented by my former Google manager.) It's pretty cool how much simpler a retpoline would be in aarch64, since we have explicit control over the link register rather than having to play stupid games with stacks.

(Real retpolines have a little more magic, naturally.)

[1] https://stackoverflow.com/questions/48089426/what-is-a-retpoline-and-how-does-it-work
jefftk over 2 years ago

> *The SIMD code does come with one asterisk, though: because floating-point addition is not associative, and it performs the summation in a different order, it may not get the same result as straight-line code. In retrospect, this is likely why the compiler doesn't generate SIMD instructions to compute the sum!*

What if you set -funsafe-math-optimizations, which allows "optimizations that allow arbitrary reassociations and transformations with no accuracy guarantees"?
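The quoted reasoning is easy to demonstrate: summing the same values left-to-right versus in SIMD-style partial sums (a hand-rolled illustration, not the article's actual codegen) gives bitwise-different answers, which is exactly the reassociation that -funsafe-math-optimizations would permit the compiler to perform.

```python
# Sequential vs. "2-lane SIMD" summation order for ten 0.1s.
data = [0.1] * 10

# Straight-line order: ((((a0 + a1) + a2) + a3) + ...)
seq = 0.0
for x in data:
    seq += x

# SIMD-style order: two strided partial sums, combined at the end.
lane0 = sum(data[0::2])
lane1 = sum(data[1::2])
simd = lane0 + lane1

assert seq != simd   # 0.9999999999999999 vs. 1.0
```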
ogogmad over 2 years ago

Minor erratum: Floating point addition actually *is* commutative; it's in fact non-associative.
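The distinction is easy to check directly:

```python
# IEEE 754 addition: commutative, but not associative.
a, b, c = 0.1, 0.2, 0.3

assert a + b == b + a              # swapping operands never changes the result
assert (a + b) + c != a + (b + c)  # regrouping can: 0.6000000000000001 vs. 0.6
```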
snerbles over 2 years ago

While I was in undergrad I toyed around with abusing the branch predictor on a few different machines, compiling something like the following with optimizations off - it performs an identical computation regardless of branch outcome:

    void branchLoop(unsigned int condition, unsigned int &sum)
    {
        // put something suitably large here
        unsigned int loopCount = 0x0fffffff;
        unsigned int i;

        // compile with -O0 or this gets optimized away
        for (i = 0; i < loopCount; i++)
            if ((i & condition) == 0)
                sum++;
            else
                sum++;
    }

The Core Duo on my Thinkpad T60 had some very distinct slowdowns on certain bit patterns, which were not repeatable on the handful of other CPUs I had access to at the time. I haven't tried this with more modern CPUs, however.
iamsmooney over 2 years ago

Title is a reference to this old SNL skit: https://www.youtube.com/watch?v=GmqeZl8OI2M
londons_explore over 2 years ago

Observation: Almost any code, when micro-optimized, can gain about 10x performance.

So, if we had the time and energy, we could probably make all of computing at least 10x faster.

But we don't have the time or energy to dedicate that much effort to every line of code... But perhaps AI does?
error503 over 2 years ago

Interesting.

I thought it would be interesting to compare the behaviour of (very) different AArch64 processors on this code.

I ran your code on an Oracle Cloud Ampere Altra A1:

    sum_slice                  time: [677.45 ns 684.25 ns 695.67 ns]
    sum_ptr                    time: [689.11 ns 689.42 ns 689.81 ns]
    sum_ptr_asm_matched        time: [1.3773 µs 1.3787 µs 1.3806 µs]
    sum_ptr_asm_mismatched     time: [1.0405 µs 1.0421 µs 1.0441 µs]
    sum_ptr_asm_mismatched_br  time: [699.79 ns 700.38 ns 701.02 ns]
    sum_ptr_asm_branch         time: [695.80 ns 696.61 ns 697.56 ns]
    sum_ptr_asm_simd           time: [131.28 ns 131.42 ns 131.59 ns]

It looks like there's no penalty on this processor, though I would be surprised if it does not have a branch predictor / return stack tracking at all. In general there's less variance here than on the M1. The SIMD version is indeed much faster, but by a smaller factor.

And on the relatively (very) slow Rockchip RK3399 on OrangePi 4 LTS (1.8GHz Cortex-A72):

    sum_slice                  time: [1.7149 µs 1.7149 µs 1.7149 µs]
    sum_ptr                    time: [1.7165 µs 1.7165 µs 1.7166 µs]
    sum_ptr_asm_matched        time: [3.4290 µs 3.4291 µs 3.4292 µs]
    sum_ptr_asm_mismatched     time: [1.7284 µs 1.7294 µs 1.7304 µs]
    sum_ptr_asm_mismatched_br  time: [1.7384 µs 1.7441 µs 1.7519 µs]
    sum_ptr_asm_branch         time: [1.7777 µs 1.7980 µs 1.8202 µs]
    sum_ptr_asm_simd           time: [421.93 ns 422.63 ns 423.30 ns]

Similar to the Ampere processor, but here we pay much more for the extra instructions to create matching pairs. Interesting here that the mismatched branching is *faster* than the single branch.

I guess absolute numbers are not too meaningful here, but a bit interesting that Ampere Altra is also the fastest of the 3 except in SIMD where M1 wins. I would have expected that with 80 of these cores on die they'd be more power constrained than M1, but I guess not.

Edit: I took the liberty of allowing LLVM to do the SIMD vectorization rather than OP's hand-built code (using the fadd_fast intrinsic and fold() instead of sum()). It is considerably faster still:

Ampere Altra:

    sum_slice  time: [86.382 ns 86.515 ns 86.715 ns]

RK3399:

    sum_slice  time: [306.94 ns 306.94 ns 306.95 ns]
efitz over 2 years ago

I think that the days where hand-tuned assembly language outperforms compiler-generated code are largely behind us (let loose the contrary anecdotes).

Compilers and microprocessors are way more complex than they were back in the 80s or 90s, and compiler engineers know way more about how instructions are actually executed than the vast majority of programmers.
water-your-self over 2 years ago

If an HN reader wanted to play around with similar digging, what would be the essential tools to be aware of, and where best could he start?

Assuming prior knowledge of assembly/C but without much experience decompiling or testing speed.
gpderetta over 2 years ago

Yes, never push and ret. Here is something I wrote (/me checks calendar) more than 15 years ago about optimizing coroutine control flow: https://www.crystalclearsoftware.com/soc/coroutine/coroutine/linuxasm.html
nobody9999 over 2 years ago

Thank you for posting this.

Somewhat OT, but I miss Phil Hartman[0][1][2]. You may remember him[3] from shows like Saturday Night Live, The Simpsons and "Planet of the Apes."[4]

[0] https://en.wikipedia.org/wiki/Phil_Hartman

[1] https://en.wikipedia.org/wiki/Happy_Fun_Ball

[2] https://www.youtube.com/watch?v=GmqeZl8OI2M

[3] https://screenrant.com/the-simpsons-funniest-troy-mcclure-quotes/

[4] https://www.youtube.com/watch?v=yOeUXEpxzcc
pranith over 2 years ago

Great investigative work! The stack structure you refer to here is called the Return Address Stack (RAS).
xKingfisher over 2 years ago

An interesting application of this is massaging the branch predictor using tail calls to speed up a parser/interpreter:

https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html
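The linked post is about C with clang's `musttail`; as a rough structural sketch only (a hypothetical mini-interpreter, trampolined because Python has no tail-call elimination), the idea is that each opcode handler dispatches directly to the next handler, giving the predictor one indirect-branch site per opcode instead of a single shared dispatch branch:

```python
# Each handler returns (next_handler, next_pc); the trampoline in
# run() stands in for the tail call at the end of each C handler.
def op_push(prog, pc, stack):
    stack.append(prog[pc + 1])
    return prog[pc + 2], pc + 2

def op_add(prog, pc, stack):
    b, a = stack.pop(), stack.pop()
    stack.append(a + b)
    return prog[pc + 1], pc + 1

def op_halt(prog, pc, stack):
    return None, pc

def run(prog):
    stack = []
    handler, pc = prog[0], 0
    while handler is not None:
        handler, pc = handler(prog, pc, stack)
    return stack[-1]

# program: push 2; push 3; add; halt
prog = [op_push, 2, op_push, 3, op_add, op_halt]
assert run(prog) == 5
```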
sacnoradhq over 2 years ago

After about the Pentium/Pentium Pro era, hand-coded assembly generally is premature optimization (and wasted effort).

Once upon a time(tm), you could write self-modifying code or guess at keeping pipelines occupied by manual instruction reordering, but cache line invalidation and OOOE make these moot.

The problem is that with a pipeline stall (wrong branch predicted) in hyper-deep pipelines, the penalty is enormous: waiting for the other condition calculation to percolate through the pipeline or independent stages.

Processors are optimized for the mainstream, usually the current or last generation of compilers when the processors were designed.

To generate the generally fastest bitcode, it would require an incremental JIT with history that can permute and mutate bitcode from runtime metrics. That's beyond HotSpot(tm), LLVM, or anything of the sort.
Malic over 2 years ago

Ow. My head hurts.

And this is why optimizing compilers are some of the most complex programs there are. (Or so I have been taught.)
CalChris over 2 years ago

It took me a while to understand the mismatched bl/ret pairs because I'm used to reading matched bl/ret pairs. This confusion has to be similar to what the silicon is 'thinking': *Human, I see matched bl/ret pairs all the time and I'm good at them. Why are you giving me mismatched pairs? I'll do the right thing but I'm not so good at them.*

Still, this seems like function inlining. But why not just inline and use a regular branch loop? Is *foo()* also being called from elsewhere? Is space at a premium?
LoganDark over 2 years ago

I love how "rewrite it in Rust" is an actual thing they tried, and it actually performed pretty well given the circumstances.
sylware over 2 years ago

I write assembly mainly _not_ because it is faster, but because I don't depend on an absurdly complex and massive compiler.
ShroudedNight over 2 years ago

I feel like this treatment is incomplete without having tested the scenario where the unmatched ret is replaced with a br lr.

EDIT: Reading the documentation after the fact, it appears that that was what br x30 was - naively I had interpreted the hex as a fixed offset to a label.