The megahertz-scaling "Free Lunch" was declared dead 15 years ago [http://www.gotw.ca/publications/concurrency-ddj.htm] and it's only been getting deader. People are finally, grudgingly, accepting that they must go parallel unless they want to see software performance stagnate permanently. For most people here, the issue has been obvious since before they learned to program, but they are still putting off learning how to deal with it. The first, obvious answer is threading. But in my experience, SIMD is a bigger bang for the buck, for two reasons: 1) no synchronization problems, and 2) better cache utilization. It's not just that SIMD forces you to work in large, contiguous blocks. Fun fact: when you aren't using SIMD, you are only using a fraction of your L1 cache bandwidth! (There's a scalar-vs-SSE sketch at the end of this comment that makes that concrete.)

A big challenge is that SIMD intrinsic-function APIs are weird, with inscrutable function names and sometimes difficult semantics. What helped me greatly was going through the effort of writing #define wrappers that gave each function in SSE1-3 a name that made sense to me (also sketched below). I don't expect many people to put in that effort, and unfortunately I don't have go-to recommendations for pre-existing libraries. The best I can do is:

https://github.com/VcDevel/Vc is working on being standardized into C++. It's great for processing medium-to-large arrays (sketch below).

https://ispc.github.io/ is great for writing large, complicated SIMD features.

https://github.com/microsoft/DirectXMath is not actually tied to DirectX. It has a huge library of small-vector linear algebra (3D graphics math) functions (sketch below). It used to be pretty tied to MS's compiler, but I believe they've been cleaning it up to be cross-compiler lately.
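
To see how much bandwidth scalar code leaves on the table, here's a minimal sketch (plain SSE1; the function names are mine, and it assumes n is a multiple of 4 and the pointers are 16-byte aligned):

    #include <cstddef>
    #include <xmmintrin.h>  // SSE1

    // Scalar: one 4-byte float moves per load/store.
    void scale_scalar(float* dst, const float* src, size_t n, float k) {
        for (size_t i = 0; i < n; ++i)
            dst[i] = src[i] * k;
    }

    // SSE: one 16-byte vector moves per load/store, four lanes at a time.
    void scale_sse(float* dst, const float* src, size_t n, float k) {
        const __m128 vk = _mm_set1_ps(k);  // broadcast k to all four lanes
        for (size_t i = 0; i < n; i += 4) {
            __m128 v = _mm_load_ps(src + i);           // aligned 128-bit load
            _mm_store_ps(dst + i, _mm_mul_ps(v, vk));  // multiply, aligned store
        }
    }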
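
The wrapper trick is nothing fancier than this; the F4/f4_* names are ones I invented for illustration, and you'd pick whatever reads best to you:

    #include <xmmintrin.h>  // SSE1

    // Readable aliases over SSE1 intrinsics (hypothetical names).
    #define F4             __m128
    #define f4_splat(x)    _mm_set1_ps(x)
    #define f4_load(p)     _mm_load_ps(p)
    #define f4_store(p, v) _mm_store_ps((p), (v))
    #define f4_add(a, b)   _mm_add_ps((a), (b))
    #define f4_mul(a, b)   _mm_mul_ps((a), (b))
    #define f4_sqrt(a)     _mm_sqrt_ps(a)

    // The loop body above then reads as:
    //   F4 v = f4_mul(f4_load(src + i), f4_splat(k));
    //   f4_store(dst + i, v);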
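
For scale, here's roughly what the same loop looks like with Vc (a sketch assuming Vc 1.x's float_v and its aligned load/store interface):

    #include <cstddef>
    #include <Vc/Vc>

    // Processes Vc::float_v::size() floats per iteration; assumes n is a
    // multiple of the vector width and the pointers are suitably aligned.
    void scale_vc(float* dst, const float* src, size_t n, float k) {
        using V = Vc::float_v;
        const V vk = k;  // broadcast the scalar across all lanes
        for (size_t i = 0; i < n; i += V::size()) {
            V v(src + i, Vc::Aligned);             // aligned load
            (v * vk).store(dst + i, Vc::Aligned);  // aligned store
        }
    }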
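
And a taste of DirectXMath's small-vector style (XMVectorSet, XMVector3Dot, and XMVectorGetX are the real API; the Dot3 helper is just mine):

    #include <DirectXMath.h>
    using namespace DirectX;

    // XMVector3Dot returns the dot product replicated in every lane;
    // pull out one lane to get a plain float.
    inline float Dot3(FXMVECTOR a, FXMVECTOR b) {
        return XMVectorGetX(XMVector3Dot(a, b));
    }

    // XMVECTOR a = XMVectorSet(1.0f, 2.0f, 3.0f, 0.0f);
    // XMVECTOR b = XMVectorSet(4.0f, 5.0f, 6.0f, 0.0f);
    // float d = Dot3(a, b);  // 32.0f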