numpy, for comparison:

    In [8]: vec = np.random.randint(-200, 200, (100_000_000,))
    In [9]: %timeit vec.sum()
    63 ms ± 4.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The branchless C++ version took 125 ms, and the AVX-512 version took ~9 ms.
After the updates to the article, the takeaway seems to be "you can use AVX-512 dot product instructions to sum an array of bytes to int and get a 15% speedup over more straightforward vector code". That's an interesting point, but it's now buried among irrelevant details like the compressed representation, which mattered only to the article's original point.

It might make sense to resubmit a completely rewritten and pared-down version of the article. The dot product trick is neat; a sketch of it is below.
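For the curious, here's a minimal sketch of what that dot product trick might look like, assuming AVX-512 VNNI is available (the function name and structure are mine, not the article's):

    #include <immintrin.h>
    #include <cstdint>
    #include <cstddef>

    // Sum signed bytes into a 64-bit total with the AVX-512 VNNI
    // dot-product instruction (vpdpbusd). The trick: use a vector of
    // all-ones as the *unsigned* operand and the data as the *signed*
    // operand, so each instruction accumulates sums of 4 adjacent bytes
    // into the 16 int32 lanes. Build with e.g. -mavx512f -mavx512vnni.
    int64_t sum_bytes_vnni(const int8_t* data, size_t n) {
        const __m512i ones = _mm512_set1_epi8(1);
        __m512i acc = _mm512_setzero_si512();
        size_t i = 0;
        // Each int32 lane gains at most ±512 per iteration, so this is
        // safe for roughly 4M iterations (~256 MB of input); beyond that
        // you'd periodically flush acc into a 64-bit accumulator.
        for (; i + 64 <= n; i += 64) {
            __m512i v = _mm512_loadu_si512(data + i);
            acc = _mm512_dpbusd_epi32(acc, ones, v);
        }
        int64_t total = _mm512_reduce_add_epi32(acc); // fold 16 lanes
        for (; i < n; ++i) total += data[i];          // scalar tail
        return total;
    }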
I don’t believe in the bright future of AVX-512, and I don’t have the hardware either; my desktop PC has an AMD Zen 2 CPU.

Here’s how I would do that in AVX2: https://gist.github.com/Const-me/eed10bfe690b5804d2fc8266e0218981#file-simintegersavx2-cpp-L35-L105

I wonder how the performance compares to your version.
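For readers who don't want to click through, the usual AVX2 equivalent of the trick pairs vpmaddubsw with vpmaddwd. A condensed sketch of that general approach (simplified; not necessarily the gist's exact code):

    #include <immintrin.h>
    #include <cstdint>
    #include <cstddef>

    // Simplified sketch, not the gist's exact code: vpmaddubsw
    // multiplies unsigned bytes by signed bytes and sums adjacent pairs
    // into int16; vpmaddwd then widens pairs of int16 into int32 lanes.
    int64_t sum_bytes_avx2(const int8_t* data, size_t n) {
        const __m256i ones8 = _mm256_set1_epi8(1);
        const __m256i ones16 = _mm256_set1_epi16(1);
        __m256i acc = _mm256_setzero_si256();
        size_t i = 0;
        for (; i + 32 <= n; i += 32) {
            __m256i v = _mm256_loadu_si256((const __m256i*)(data + i));
            // All-ones as the unsigned operand lets the signed data
            // bytes pass through: byte pairs -> int16 partial sums.
            __m256i s16 = _mm256_maddubs_epi16(ones8, v);
            // int16 pairs -> int32 lanes, accumulated across iterations.
            acc = _mm256_add_epi32(acc, _mm256_madd_epi16(s16, ones16));
        }
        // Horizontal reduction of the 8 int32 lanes.
        __m128i s = _mm_add_epi32(_mm256_castsi256_si128(acc),
                                  _mm256_extracti128_si256(acc, 1));
        s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
        s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
        int64_t total = _mm_cvtsi128_si32(s);
        for (; i < n; ++i) total += data[i];  // scalar tail
        return total;
    }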
Ehh. I like playing with vectors and have a weird coding style.

My AVX version:

https://github.com/schmide/sumint/blob/main/sumint.cpp