I have an 'art/game' project in Rust that uses vectorized expression evaluation to draw random images. It turns the expression tree into a stack machine, then evaluates the stack machine with SIMD instructions:
<a href="https://github.com/jackmott/evolution/blob/master/src/stack_machine.rs" rel="nofollow">https://github.com/jackmott/evolution/blob/master/src/stack_...</a><p>The speedup is really big for this case sometimes because not only do you do math 4x/8x/16x faster (depending on instruction set) but you also traverse the stack machine (or tree if you are pure interpreting) 4x/8x/16x less often. The improvement when traversing a tree is extra extra big because of reduced memory hops.<p>I used a SIMD library I made in Rust, which lets me write the stack machine once, and then run it in SSE2/SSE41 or AVX2 mode. You can select either at runtime or compile time:<p><a href="https://github.com/jackmott/simdeez" rel="nofollow">https://github.com/jackmott/simdeez</a>
I always wondered why people assume that the more high-level abstractions or high-level programming languages you use, the less performance you should expect. C is faster than Java, Java will be faster than Python, etc. The 'interpretation overhead' is supposed to kill your performance.

But there is one weird exception. You have the APL family, which is at a very high level of abstraction but can perform even faster than C, largely because of vectorized processing. When you work on vectors of thousands of items in one instruction, you amortize the language interpretation cost away, and since working with vectors is actually the only natural way to work with computers, you get massive performance from vectorized instructions or even from running on the GPU. (All memory access is naturally linear in computing. Random-access memory is an unnatural computing myth which comes at enormous cost and has to be hardware-accelerated to even be usable.)

Something similar can be said about databases and SQL, especially in OLAP processing, where you can linearize your data tables and columns and vectorize your processing. Because it is near impossible to overcome the von Neumann bottleneck in traditional single-computer languages like C or Java, any SQL or APL will beat the crap out of them if you spread the processing over multiple cores and machines.

The days of single-machine processing are over, and clusters of computers are the future. AWS (and potentially other clouds) is essentially an operating system for such an environment. It'd be nice for open source to catch up, though.
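A tiny sketch of that amortization in the array-language/columnar spirit (names are illustrative, not any particular engine): the interpreter dispatch runs once per column, so its cost per element shrinks with the column length, and the tight inner loops are exactly what auto-vectorizers and GPUs like.

    enum Op {
        AddCols(usize, usize), // push cols[a] + cols[b] as a new column
        ScaleCol(usize, f64),  // multiply a column in place by a constant
    }

    fn run(ops: &[Op], cols: &mut Vec<Vec<f64>>) {
        for op in ops {
            // One interpreter dispatch per column of N values, not per value.
            match op {
                Op::AddCols(a, b) => {
                    let sum: Vec<f64> =
                        cols[*a].iter().zip(&cols[*b]).map(|(x, y)| x + y).collect();
                    cols.push(sum);
                }
                Op::ScaleCol(c, k) => {
                    for x in cols[*c].iter_mut() { *x *= *k; } // tight, vectorizable loop
                }
            }
        }
    }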
> As the table reveals, every time this function performs a multiplication, only 8 out of 82 (9+30+28+8+7=82) instructions are doing the "real" multiplication. That's only about 10% of the total instructions. The other 90% are considered interpretation overhead. Once we vectorized this function, its performance was improved by nearly nine times. See PR #12543.

This is a misleading way to present this data. If I understand it correctly, most of the 90% "interpretation overhead" is time spent evaluating the operands of the multiplication, and that part is *also* vectorized. So it's not that vectorizing the 10% gives you a 9x speedup overall, even though, in my opinion, the text tries to suggest exactly that.

In any case, there must be even more going on here. The data being processed seems to be Float64. On an AVX2 processor like most of us have, you can only fit 4 64-bit floats into a vector register. This means that, even if your entire computation vectorizes very nicely, you should only expect a 4x maximum speedup. Even if they have an AVX-512 server (they don't say) with twice the vector width, 8x would be the expected limit; in practice it would be considerably less, because the processor reduces its frequency to avoid overheating on AVX-512-heavy computations. I'm not aware of hardware that uses even wider vectors.

So an end-to-end 9x improvement for the entire function seems impossible to achieve using vectorization alone. I question both the measurement and the suggestion that vectorization is the *only* thing that changes here. Maybe they accidentally (? they don't seem to understand in detail what's going on) stumbled upon a much more cache-friendly version of the computation, or maybe the old version caused the GC to interfere, or... something. But 9x due to vectorization of a Float64 computation? I'm not buying it.
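To put numbers on that ceiling, the usual Amdahl-style bound with vectorizable fraction p and vector width w is:

    S_max = 1 / ((1 - p) + p / w)

Even with p = 1 (everything vectorizes), w = 4 for Float64 under AVX2 gives S_max = 4, and w = 8 under AVX-512 gives S_max = 8. A measured 9x end-to-end therefore has to involve something beyond the wider registers.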
I’ve seen a fair bit about TiDB, but not much from actual users. Can someone who uses this in production explain why, and what alternatives they evaluated? (We use Citus, so I would be curious to hear.)
Well, what to say. You are just throwing potential performance away if you don't use SSE when you can.

There is an idea that "the renaming and reordering engine can make non-SSE code as fast as SSE code without extra hassle." At least on x86, that can't be true, as you physically can't access all execution ports with non-vector instructions.
i'm quite surprised how many of the HN comments are focused on nitpicking the frontend implementation rather than the content itself.

it's not very likely that the author who works on vectorized execution also implemented the blog system.
You can also get a 10x accessibility increase for viewing your site, by not doing shit like this:

    <div class="center-element" id="page-loader">
      <svg id="hexagon" viewbox="0 0 129.78 150.37"
      ...
    </div>
    <div id="page-content" style="display:none">
      (actual page content here)
This is another one of those pages that shouldn't require JS, but *deliberately hides the content* and then uses JS to un-hide it. *WTF!?* I know this is a little off-topic, but I found it ironic that a post about optimising performance would be presented so outrageously inefficiently and inaccessibly. (I just turned off the CSS to read it.)