Further optimization potential: the four lines<p><pre><code> sum = Avx2.Add(block0, sum);
sum = Avx2.Add(block1, sum);
sum = Avx2.Add(block2, sum);
sum = Avx2.Add(block3, sum);
</code></pre>
all have a serializing dependency on the sum variable. But (integer) addition is associative and commutative, so you could sum the blocks in a tree-like manner, ending up with only a single serializing dependency:<p><pre><code> sum01 = Avx2.Add(block0, block1);
sum23 = Avx2.Add(block2, block3); // independent of sum01, so these two can run in parallel
sum = Avx2.Add(sum, sum01); // sum01 hopefully ready; parallel with sum23
sum = Avx2.Add(sum, sum23); // sum23 hopefully ready
</code></pre>
Here only the last line serializes with the previous one. Maybe the HW is smart enough to rename the registers and do the same thing internally, but it'd be interesting to benchmark both variants.
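<p>A rough sketch of how the two variants could be compared, not a definitive benchmark. The class and method names (TreeSumSketch, SumSerial, SumTree, HorizontalSum) are made up for illustration, the input length is assumed to be a multiple of 32 ints (any leftover blocks are simply ignored), and the Stopwatch timing is crude; something like BenchmarkDotNet would give more trustworthy numbers.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class TreeSumSketch
{
    // Serial variant: all four adds per iteration are chained through `sum`.
    public static int SumSerial(ReadOnlySpan<int> data)
    {
        ReadOnlySpan<Vector256<int>> blocks = MemoryMarshal.Cast<int, Vector256<int>>(data);
        Vector256<int> sum = Vector256<int>.Zero;
        for (int i = 0; i + 4 <= blocks.Length; i += 4)
        {
            sum = Avx2.Add(blocks[i], sum);
            sum = Avx2.Add(blocks[i + 1], sum);
            sum = Avx2.Add(blocks[i + 2], sum);
            sum = Avx2.Add(blocks[i + 3], sum);
        }
        return HorizontalSum(sum);
    }

    // Tree variant: the two pairwise adds do not depend on `sum`,
    // so only two adds per iteration sit on the chain through `sum`.
    public static int SumTree(ReadOnlySpan<int> data)
    {
        ReadOnlySpan<Vector256<int>> blocks = MemoryMarshal.Cast<int, Vector256<int>>(data);
        Vector256<int> sum = Vector256<int>.Zero;
        for (int i = 0; i + 4 <= blocks.Length; i += 4)
        {
            Vector256<int> sum01 = Avx2.Add(blocks[i], blocks[i + 1]);
            Vector256<int> sum23 = Avx2.Add(blocks[i + 2], blocks[i + 3]);
            sum = Avx2.Add(sum, sum01);
            sum = Avx2.Add(sum, sum23);
        }
        return HorizontalSum(sum);
    }

    // Adds the eight 32-bit lanes of the accumulator (wrapping on overflow).
    static int HorizontalSum(Vector256<int> v)
    {
        int total = 0;
        for (int lane = 0; lane < Vector256<int>.Count; lane++)
            total = unchecked(total + v.GetElement(lane));
        return total;
    }

    static void Main()
    {
        if (!Avx2.IsSupported)
        {
            Console.WriteLine("AVX2 is not supported on this machine.");
            return;
        }

        int[] data = new int[1 << 20]; // multiple of 32 by construction
        for (int i = 0; i < data.Length; i++) data[i] = i % 7;

        // Warm-up so the JIT has compiled both methods before timing.
        int serial = SumSerial(data);
        int tree = SumTree(data);
        Console.WriteLine($"results equal: {serial == tree}");

        const int Reps = 1000;
        int sink = 0; // consumed below so the calls are not optimized away
        var sw = Stopwatch.StartNew();
        for (int r = 0; r < Reps; r++) sink ^= SumSerial(data);
        Console.WriteLine($"serial: {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        for (int r = 0; r < Reps; r++) sink ^= SumTree(data);
        Console.WriteLine($"tree:   {sw.ElapsedMilliseconds} ms (sink {sink})");
    }
}
```

<p>Both variants wrap on overflow in the same way, so they should always produce identical sums. Whether the tree form is actually faster probably depends on whether the loop is bound by add latency or by memory bandwidth, which is exactly what the measurement is meant to find out.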