Great post. Vectorization is one of the easiest ways to increase per-thread performance.<p>This gives me an excuse to post one of my favorite articles from Intel. It shows that, for the author's matrix multiply problem, the biggest performance gains come from optimizing memory accesses.<p><a href="http://software.intel.com/en-us/articles/superscalar-programming-101-matrix-multiply-part-1/" rel="nofollow">http://software.intel.com/en-us/articles/superscalar-program...</a><p>Since memory read/write instructions reportedly make up about 40-50% of the x86 code out there (from what I've heard), tuning memory accesses seems like a great way to improve performance.
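<p>To make the memory-access point concrete, here is a minimal sketch of the kind of change I mean. This is not code from the linked article, and the function names matmul_ijk and matmul_ikj are just ones I made up for illustration. Both multiply two n x n row-major matrices; the only difference is the loop order, which changes the stride of the inner loop's accesses to b.

```c
#include <stddef.h>

/* Naive i-j-k order: the inner loop walks b down a column, so
   consecutive iterations touch addresses n doubles apart. */
void matmul_ijk(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* i-k-j order: the inner loop walks b and c along a row, so
   consecutive iterations touch adjacent doubles (stride 1). */
void matmul_ikj(size_t n, const double *a, const double *b, double *c)
{
    /* c is accumulated into, so it must start at zero. */
    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0;

    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < n; k++) {
            double aik = a[i * n + k];
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
    }
}
```

<p>Both do the same arithmetic and give the same result; the second is usually friendlier to the cache, and unit-stride inner loops also tend to be easier for compilers to vectorize. Whether it is actually faster, and by how much, depends on n, the compiler, and the hardware, so measure before believing me.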