Nicely done.<p>> There is still a lot of room to micro-optimize both the avx and avx64 implementation<p>I personally couldn't see much. Aligning loads and deferring `_mm256_madd_epi16` are the only ideas that come to mind.
What did you have in mind?