The first two examples don't divide the work between the threads; each thread repeats the full work, which isn't just poor OpenMP use, it's wrong use. I would also have used the collapse clause and experimented a bit with the schedule. Finally, iterating over the first index in the innermost loop is a bad idea in general, not only when working with OpenMP.
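To make that concrete, here is a minimal sketch of what I'd expect instead, assuming a row-major n x n matrix. The function name `fill_matrix` and the fill pattern are made up for illustration, and `schedule(static)` is only a starting point to tune from, not a recommendation.

```c
#include <stddef.h>

/* Hypothetical kernel: fill a row-major n x n matrix with a[i][j] = i + j. */
void fill_matrix(double *a, long n)
{
    /*
     * `parallel for` divides the iterations among the threads. A bare
     * `#pragma omp parallel` around the loops would make every thread
     * execute the entire nest instead.
     * collapse(2) merges both loops into one iteration space.
     */
    #pragma omp parallel for collapse(2) schedule(static)
    for (long i = 0; i < n; i++) {
        /* The last index is innermost, so accesses are contiguous in memory. */
        for (long j = 0; j < n; j++) {
            a[(size_t)i * (size_t)n + (size_t)j] = (double)(i + j);
        }
    }
}
```

Build with OpenMP enabled (e.g. `-fopenmp` on GCC or Clang); without it the pragma is ignored and the loop simply runs serially with the same result.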
Comment on the blog that got deleted:<p>I did a similar test in C and got very similar results. When N is around 4000, the thrashing version starts to differ substantially. A 3x difference can already be seen when N is 1000.<p><i>This means if your program is running on two threads over different parts of the matrix, every single iteration requires a request to RAM.</i><p>I'm skeptical about this part. I tried to replicate this behavior but couldn't. Even though the cores share L3, I doubt that a thread will overwrite the entire cache on every iteration.
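For anyone who wants to try this kind of comparison, a minimal single-threaded sketch in C could look like the following. This is not the actual test code from the comment above; the function names are hypothetical, and the matrix is assumed to be a flat row-major array of n x n doubles. Both functions compute the same sum, only the traversal order differs.

```c
#include <stddef.h>

/* Cache-friendly: the inner loop walks along a row, so consecutive
 * accesses touch adjacent memory. */
double sum_row_order(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            sum += a[i * n + j];
    return sum;
}

/* Cache-unfriendly: the inner loop walks down a column, so consecutive
 * accesses are n doubles apart. */
double sum_column_order(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            sum += a[i * n + j];
    return sum;
}
```

Timing each function for increasing n (for instance with `clock()` from `<time.h>`) should show where the two start to diverge on a given machine; the exact threshold depends on cache sizes.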
For all those interested in this subject, I'd like to recommend Nitsan Wakart's blog, <a href="http://psy-lob-saw.blogspot.com/" rel="nofollow">http://psy-lob-saw.blogspot.com/</a>, which is dedicated to mechanical sympathy relating to concurrency and the memory system.