TechEcho

6 comments

wolf550e over 13 years ago

This article is old: March 9, 2009 1:00 AM PDT.

Nowadays glibc has modern SSE code and the kernel uses "rep movsb". The kernel can save and restore FPU state if the copy is long enough that doing SSE/AVX is worth it. Someone on the Linux kernel mailing list measured that performance depends on whether src and dest are 64-byte aligned relative to each other: if they are, "rep movsb" is faster than SSE.

The thread: https://lkml.org/lkml/2011/9/1/229

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=arch/x86/lib/memcpy_64.S;hb=HEAD

http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/multiarch/memcpy-ssse3.S;hb=HEAD
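To make "64-byte aligned relative to each other" concrete: it means src and dest sit at the same offset within a 64-byte block, even if neither is itself 64-byte aligned. A minimal sketch of that check follows. The function name `same_offset_mod_64` is made up for illustration and is not taken from the kernel or glibc.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true when src and dst have the same offset
 * within a 64-byte block, i.e. (dst - src) is a multiple of 64.
 * Neither pointer has to be 64-byte aligned on its own.
 */
bool same_offset_mod_64(const void *dst, const void *src)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;

    /* Unsigned wraparound is harmless: 64 divides 2^N. */
    return ((d - s) & 63u) == 0;
}
```

A copy routine could use a test like this to choose between a "rep movsb" path and an SSE path, although the measurements in the thread would need to be repeated on the target CPU before relying on it.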

abrahamsen over 13 years ago

> the developer communications don't appear on a public list. There is no visible public help forum or mail list

http://dir.gmane.org/index.php?prefix=gmane.comp.lib.glibc

Seems public to me.


shin_lao over 13 years ago

A couple of years ago, before SSE existed, I wrote a highly optimized memory copy routine. It was more than just using movntq and the like (non-temporal stores are important to avoid cache pollution). For large data, I copied each chunk into a local buffer smaller than one page and then copied it from there to the destination. Sounds crazy? It actually was much faster because of page locality.

For small chunks, however, nothing was faster than rep movsb, which moves one byte at a time.
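The staging-buffer idea described above can be sketched in portable C. This is only an illustration under assumptions: the name `staged_copy` and the 2048-byte buffer size are invented here, and plain `memcpy` stands in for the non-temporal stores (movntq) that the original routine used.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: any size below one 4 KiB page; 2048 is arbitrary. */
#define STAGE_SIZE 2048

/*
 * Copy n bytes from src to dst in chunks, passing each chunk through
 * a small local buffer. The buffers must not overlap.
 * Plain memcpy is used for both steps; a tuned version would write
 * the second step with non-temporal stores.
 */
void staged_copy(void *dst, const void *src, size_t n)
{
    unsigned char stage[STAGE_SIZE];
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n > 0) {
        size_t chunk = n < STAGE_SIZE ? n : STAGE_SIZE;

        memcpy(stage, s, chunk);   /* source -> local buffer */
        memcpy(d, stage, chunk);   /* local buffer -> destination */

        s += chunk;
        d += chunk;
        n -= chunk;
    }
}
```

Whether this beats a direct copy depends heavily on the CPU and memory system, so it is worth benchmarking rather than assuming.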

memset over 13 years ago

Someone tell me if I am mistaken - but it looks like the main difference between GCC's and Intel's memcpy() boils down to gcc using `rep movsl` and icc using `movdqa`, the latter having a shorter decode time and possibly shorter execution time?
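For readers unfamiliar with the two instructions: `rep movsl` copies 4 bytes per iteration, while `movdqa` moves 16 bytes at once and requires 16-byte-aligned addresses. The sketch below mimics both strategies in C. It is not the actual GCC or ICC code; the function names are invented, and the SSE2 path is only compiled where `__SSE2__` is defined, with a byte loop as the fallback.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Copy 4 bytes per step, roughly the shape of a `rep movsl` loop. */
void copy_words(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= 4) {
        uint32_t w;
        memcpy(&w, s, 4);   /* fixed-size memcpy avoids aliasing issues */
        memcpy(d, &w, 4);
        d += 4; s += 4; n -= 4;
    }
    while (n > 0) { *d++ = *s++; n--; }
}

/*
 * Copy 16 bytes per step with aligned SSE2 loads and stores (movdqa).
 * Both dst and src must be 16-byte aligned.
 */
void copy_blocks16(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

#if defined(__SSE2__)
    while (n >= 16) {
        __m128i v = _mm_load_si128((const __m128i *)s);
        _mm_store_si128((__m128i *)d, v);
        d += 16; s += 16; n -= 16;
    }
#endif
    /* Tail bytes, or the whole copy when SSE2 is unavailable. */
    while (n > 0) { *d++ = *s++; n--; }
}
```

Which of the two is faster is exactly the open question in the comment above; the sketch only shows the difference in granularity and alignment requirements.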


JoeAltmaier over 13 years ago

I'm sad that computers in this modern age still require me to be in their business. Doesn't it seem like the CPU's own business to move bytes efficiently? Why is the compiler, much less the programmer, involved? The tests being made in the compiler/lib are of factors better known at runtime (overlap, size, alignment) and better handled by microcode.
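The three runtime factors named above (overlap, size, alignment) are the kind of checks a library memcpy makes before picking a code path. A rough sketch of such a dispatch follows. Everything in it is hypothetical: the enum, the function name `choose_copy_path`, the 64-byte "small" threshold, and the 16-byte alignment requirement are assumptions for illustration, not values from glibc or any compiler.

```c
#include <stddef.h>
#include <stdint.h>

enum copy_path {
    COPY_OVERLAP,    /* ranges overlap: needs memmove-style handling */
    COPY_SMALL,      /* short copy: a simple byte loop is enough */
    COPY_ALIGNED,    /* both pointers 16-byte aligned */
    COPY_UNALIGNED   /* long copy with at least one unaligned pointer */
};

/* Hypothetical threshold below which setup cost dominates. */
#define SMALL_LIMIT 64

enum copy_path choose_copy_path(const void *dst, const void *src, size_t n)
{
    uintptr_t d = (uintptr_t)dst;
    uintptr_t s = (uintptr_t)src;
    uintptr_t gap = d > s ? d - s : s - d;

    /* [s, s+n) and [d, d+n) overlap exactly when |d - s| < n. */
    if (gap < n)
        return COPY_OVERLAP;
    if (n < SMALL_LIMIT)
        return COPY_SMALL;
    if (((d | s) & 15u) == 0)
        return COPY_ALIGNED;
    return COPY_UNALIGNED;
}
```

Each call pays for these branches, which is part of the argument for letting the hardware decide instead.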


vz0 over 13 years ago

Agner Fog found this issue one year earlier, in 2008:

http://www.cygwin.com/ml/libc-help/2008-08/msg00007.html


Intel's take on GCC's memcpy implementation
