> if you want to target multiple ISAs, you need to write multiple algorithms<p>In my experience, these algorithms are similar to each other.
More often than not, they don't require much extra time: a few macros here and there, a few templates, a couple of versions of a small low-level function, etc.<p>> _mm256_hadd_epi32<p>That instruction is slow, e.g. on Ryzen it has a latency of 7 cycles. _mm256_slli_si256 and the bitwise ops have a latency of 1 cycle, and can often do the same job faster.<p>> readability is reduced when compared to the original scalar implementation<p>That's solvable with a library, for example: <a href="https://github.com/Const-me/IntelIntrinsics" rel="nofollow">https://github.com/Const-me/IntelIntrinsics</a>
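<p>To make the first two points concrete, here is a minimal sketch, assuming the goal is simply to sum an array of 32-bit integers. The helper name sumInt32 and the SUM_USE_SSE2 macro are made up for illustration and are not from any library. One macro picks between an SSE2 path and a scalar fallback, and the final reduction uses shuffles plus adds instead of a horizontal-add instruction. It is written for 128-bit SSE2 so it builds without extra compiler flags; an AVX2 version would look much the same, with one extra step that adds the upper 128-bit half to the lower one first.

```cpp
#include <cstddef>
#include <cstdint>

// One macro decides which low-level version gets compiled.
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SUM_USE_SSE2 1
#else
#define SUM_USE_SSE2 0
#endif

// Hypothetical helper: sum of n 32-bit integers, wrapping on overflow.
inline int32_t sumInt32(const int32_t* data, size_t n)
{
    uint32_t total = 0;  // unsigned, so wraparound is well-defined
    size_t i = 0;
#if SUM_USE_SSE2
    // Vertical adds: 4 independent partial sums, one per lane.
    __m128i acc = _mm_setzero_si128();
    for (; i + 4 <= n; i += 4)
    {
        const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        acc = _mm_add_epi32(acc, v);
    }
    // Horizontal reduction with shuffles and adds, no horizontal-add instruction.
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(1, 0, 3, 2)));  // swap 64-bit halves
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(2, 3, 0, 1)));  // swap adjacent lanes
    total = static_cast<uint32_t>(_mm_cvtsi128_si32(acc));
#endif
    // Scalar tail, which is also the whole algorithm when SSE2 is unavailable.
    for (; i < n; ++i)
        total += static_cast<uint32_t>(data[i]);
    return static_cast<int32_t>(total);
}
```

<p>The caller just writes sumInt32(ptr, count) and never sees the intrinsics, which is also roughly how a wrapper library keeps the calling code readable. Whether the shuffle-based reduction actually beats a horizontal add depends on the CPU, so measure on your target.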