This article is old: March 9, 2009 1:00 AM PDT<p>Nowadays glibc has modern SSE code and the kernel uses "rep movsb". The kernel can save and restore FPU state if the copy is long enough that doing SSE/AVX is worth it. Someone on the Linux kernel mailing list measured that performance depends on whether src and dest are 64-byte aligned relative to each other: if they are, "rep movsb" is faster than SSE.<p>The thread: <a href="https://lkml.org/lkml/2011/9/1/229" rel="nofollow">https://lkml.org/lkml/2011/9/1/229</a><p><a href="http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=arch/x86/lib/memcpy_64.S;hb=HEAD" rel="nofollow">http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git...</a><p><a href="http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/multiarch/memcpy-ssse3.S;hb=HEAD" rel="nofollow">http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_...</a>
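To make "64-byte aligned relative to each other" concrete, here is a minimal sketch of the check as I read it: the two pointers need not sit on a 64-byte boundary themselves, they only need the same offset within a 64-byte block. The function name `same_offset_mod_64` is made up for illustration and does not come from the kernel or glibc sources.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true when dst and src have the same offset
 * within a 64-byte block, i.e. their low six address bits match. */
static bool same_offset_mod_64(const void *dst, const void *src)
{
    /* XOR leaves a 1 in every bit position where the addresses differ;
     * masking with 63 keeps only the low six bits. */
    return (((uintptr_t)dst ^ (uintptr_t)src) & 63u) == 0;
}
```

A memcpy implementation could use a test like this to choose between a "rep movsb" path and an SSE path. Which path wins for which case is exactly what the linked thread measures, so treat any threshold as something to benchmark on your own hardware.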
> the developer communications don't appear on a public list. There is no visible public help forum or mail list<p><a href="http://dir.gmane.org/index.php?prefix=gmane.comp.lib.glibc" rel="nofollow">http://dir.gmane.org/index.php?prefix=gmane.comp.lib.glibc</a><p>Seems public to me.
A couple of years ago, before SSE existed, I wrote a highly optimized memory copy routine. It did more than just use movntq and the like (non-temporal stores are important to avoid cache pollution): for large data, I copied each chunk into a local buffer smaller than one page and then copied it from there to the destination. Sounds crazy? It was actually much faster because of page locality.<p>For small chunks, however, nothing was faster than rep movsb, which moves one byte at a time.
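The staging idea can be sketched in portable C, roughly like this. This is not the original routine, which is not shown and presumably used hand-written assembly with non-temporal stores. The name `staged_copy` and the 2048-byte buffer size are assumptions picked for illustration, with the size chosen to stay under a 4096-byte page.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed staging size: kept below a typical 4096-byte page. */
#define STAGE_BYTES 2048

/* Copy n bytes from src to dst by bouncing each chunk through a small
 * stack buffer. The regions must not overlap (same contract as memcpy). */
static void *staged_copy(void *dst, const void *src, size_t n)
{
    unsigned char stage[STAGE_BYTES];
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n > 0) {
        size_t chunk = n < STAGE_BYTES ? n : STAGE_BYTES;
        assert(chunk <= sizeof stage);

        memcpy(stage, s, chunk);  /* read phase: source -> staging buffer */
        memcpy(d, stage, chunk);  /* write phase: staging buffer -> destination */

        s += chunk;
        d += chunk;
        n -= chunk;
    }
    return dst;
}
```

On current hardware with a tuned libc memcpy, the extra pass through the buffer may well be slower than a direct copy, so this only shows the structure of the technique, not a recommendation.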
Someone tell me if I am mistaken - but it looks like the main difference between GCC's and Intel's memcpy() boils down to gcc using `rep movsl` and icc using `movdqa`, the latter having a shorter decode time and possibly shorter execution time?
I'm sad that computers in this modern age still require me to be in their business. Doesn't it seem like the CPU's own business to move bytes efficiently? Why is the compiler, much less the programmer, involved? The tests being made in the compiler/lib are of factors better known at runtime (overlap, size, alignment) and better handled by microcode.
Agner Fog found this issue a year earlier, in 2008:<p><a href="http://www.cygwin.com/ml/libc-help/2008-08/msg00007.html" rel="nofollow">http://www.cygwin.com/ml/libc-help/2008-08/msg00007.html</a>