It's interesting that SIMD came to the mainstream 25 years ago, and yet compilers, our apps, and PL tech are still quite far from utilizing it effectively outside of nonportable, manually coded SIMD-aware compute kernels written in glorified assembler. There are some exceptions, like ispc and GPU languages such as OpenCL and Futhark (GPU people say "cores" when they mean SIMD lanes!)...
The whole book looks full of interesting low-level techniques. Do high-level JIT compilers like V8 or the JVM apply vectorisation or any of the other optimisations mentioned there, or is that level of fine-tuned performance only possible if you code it manually in C++?
This book is fantastic - it comes at a perfect time too, as I'm getting more work-related projects of the 'make slow code fast' variety.<p>Going to be reading this - and looking forward to any future parts!
Since the performance for array sizes <L1-size and <L2-size is similar, I would like to see an attempt to improve B.
B = L2-size / 2 / sizeof(int) - 16 might produce better results.<p>Note also that _mm_broadcast_ss() is faster on newer processors.
I've seen the author advertise his book in codeforces.com blog posts before, if you want somewhere to reach him: <a href="https://codeforces.com/blog/entry/99790" rel="nofollow">https://codeforces.com/blog/entry/99790</a><p>That might be a better intro than a random chapter of the book, and it contextualizes why you might want to learn SIMD programming (i.e., up to an order of magnitude speed-up vs STL implementations).