科技回声

6 comments

wolf550e, more than 13 years ago

This article is old: March 9, 2009 1:00 AM PDT.

Nowadays glibc has modern SSE code and the kernel uses "rep movsb". The kernel can save and restore FPU state if the copy is long enough that doing SSE/AVX is worth it. Someone on the Linux kernel mailing list measured that performance depends on whether src and dest are 64-byte aligned relative to each other: if they are, "rep movsb" is faster than SSE.

The thread: https://lkml.org/lkml/2011/9/1/229

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob;f=arch/x86/lib/memcpy_64.S;hb=HEAD

http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/multiarch/memcpy-ssse3.S;hb=HEAD
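To make "aligned relative to each other" concrete, here is a minimal C sketch. The helper names `mutually_aligned` and `head_bytes` and the `LINE_BYTES` constant are made up for illustration; they are not taken from the kernel or glibc sources linked above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 64u /* assumed line size, matching the 64-byte figure above */

/* True when dst and src share the same offset within a 64-byte line,
   so that aligning one of them also aligns the other. */
static int mutually_aligned(const void *dst, const void *src)
{
    return (((uintptr_t)dst ^ (uintptr_t)src) & (LINE_BYTES - 1u)) == 0;
}

/* Number of bytes to copy before p reaches the next `line` boundary
   (0 if p is already on one). `line` must be a power of two. */
static size_t head_bytes(const void *p, size_t line)
{
    assert(line != 0 && (line & (line - 1)) == 0);
    return (size_t)(-(uintptr_t)p & (uintptr_t)(line - 1));
}
```

A copy routine could use a check like this to decide between a "rep movsb" path and an SSE path. Where the crossover actually lies depends on the CPU and has to be measured.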

abrahamsen, more than 13 years ago

> the developer communications don't appear on a public list. There is no visible public help forum or mail list

http://dir.gmane.org/index.php?prefix=gmane.comp.lib.glibc

Seems public to me.

shin_lao, more than 13 years ago

A couple of years ago, before SSE existed, I wrote a highly optimized memory copy routine. It did more than just use movntq and the like (non-temporal stores are important to avoid cache pollution): for large data I copied each chunk into a local buffer smaller than one page, then copied that buffer to the destination. Sounds crazy? It was actually much faster because of page locality.

For small chunks, however, nothing was faster than rep movsb, which moves one byte at a time.
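A rough C sketch of the bounce-buffer idea described above. It shows only the chunking through a sub-page local buffer. The non-temporal stores are left out, `bounce_copy` and `BOUNCE_BYTES` are hypothetical names, and the 2048-byte size is an assumption (below a typical 4096-byte page), not the original routine's value.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BOUNCE_BYTES ((size_t)2048) /* assumed: smaller than one 4096-byte page */

/* Copy n bytes from src to dst by staging each chunk in a small local
   buffer. The regions must not overlap. */
static void *bounce_copy(void *dst, const void *src, size_t n)
{
    unsigned char bounce[BOUNCE_BYTES];
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n == 0 || (dst != NULL && src != NULL));
    while (n > 0) {
        size_t chunk = n < BOUNCE_BYTES ? n : BOUNCE_BYTES;
        memcpy(bounce, s, chunk); /* read phase: touches only source pages */
        memcpy(d, bounce, chunk); /* write phase: touches only destination pages */
        s += chunk;
        d += chunk;
        n -= chunk;
    }
    return dst;
}
```

Whether this beats a direct copy on current hardware is an open question. Treat it as an illustration of the technique, not a recommendation.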

memset, more than 13 years ago

Someone tell me if I am mistaken - but it looks like the main difference between GCC's and Intel's memcpy() boils down to gcc using `rep movsl` and icc using `movdqa`, the latter having a shorter decode time and possibly shorter execution time?
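For readers who want to see the two styles side by side, here is a hedged C sketch. `copy_rep_movs` and `copy_movdqa` are illustrative names, not the GCC or ICC routines. The inline assembly assumes a GCC-compatible compiler in AT&T syntax on x86, with a plain `memcpy` fallback elsewhere. The intrinsic `_mm_load_si128` normally maps to `movdqa`, but the compiler is free to emit an equivalent aligned move.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* String-instruction style: rep movsl for whole dwords, rep movsb for the tail. */
static void copy_rep_movs(void *dst, const void *src, size_t n)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    size_t dwords = n / 4, tail = n % 4;
    /* Both pointers advance in place, so the second statement continues
       where the first stopped. Assumes the direction flag is clear. */
    __asm__ volatile("rep movsl" : "+D"(dst), "+S"(src), "+c"(dwords) : : "memory");
    __asm__ volatile("rep movsb" : "+D"(dst), "+S"(src), "+c"(tail) : : "memory");
#else
    memcpy(dst, src, n); /* fallback on non-x86 targets */
#endif
}

/* SSE2 style: aligned 16-byte loads and stores.
   Requires dst, src and n to all be multiples of 16. */
static void copy_movdqa(void *dst, const void *src, size_t n)
{
    assert((((uintptr_t)dst | (uintptr_t)src | (uintptr_t)n) & 15u) == 0);
#if defined(__SSE2__)
    __m128i *d = dst;
    const __m128i *s = src;
    for (size_t i = 0; i < n / 16; i++)
        _mm_store_si128(d + i, _mm_load_si128(s + i));
#else
    memcpy(dst, src, n); /* fallback when SSE2 is unavailable */
#endif
}
```

Which one is faster depends on copy size, alignment and microarchitecture, so this sketch says nothing about decode or execution time.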

JoeAltmaier, more than 13 years ago

I'm sad that computers in this modern age still require me to be in their business. Doesn't it seem like the cpu's own business to move bytes efficiently? Why is the compiler, much less the programmer, involved? The tests being made in the compiler/lib are of factors better-known at runtime (overlap, size, alignment) and better handled by microcode.
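As an illustration of the kind of runtime tests mentioned above (overlap, size, alignment), here is a small C sketch of a dispatching copy. The names `pick_path` and `dispatch_copy`, the 16-byte thresholds and the choice of paths are all assumptions for the example, not what any particular libc does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum copy_path { PATH_BACKWARD, PATH_SMALL, PATH_ALIGNED, PATH_GENERIC };

/* Classify a copy using only facts known at run time. */
static enum copy_path pick_path(const void *dst, const void *src, size_t n)
{
    uintptr_t d = (uintptr_t)dst, s = (uintptr_t)src;

    if (d > s && d - s < n)
        return PATH_BACKWARD; /* dst starts inside src: copy from the end */
    if (n < 16)
        return PATH_SMALL;    /* assumed threshold for "small" */
    if (((d | s) & 15u) == 0)
        return PATH_ALIGNED;  /* both pointers 16-byte aligned */
    return PATH_GENERIC;
}

static void *dispatch_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n == 0 || (dst != NULL && src != NULL));
    switch (pick_path(dst, src, n)) {
    case PATH_BACKWARD:
        while (n--)
            d[n] = s[n];
        break;
    case PATH_SMALL:
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
        break;
    case PATH_ALIGNED: /* a real routine would use wide aligned moves here */
    case PATH_GENERIC:
        memmove(dst, src, n);
        break;
    }
    return dst;
}
```

Every branch here is a test the library makes on each call. That per-call cost is what the comment suggests the CPU could handle itself.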

vz0, more than 13 years ago

Agner Fog found this issue one year earlier, in 2008: http://www.cygwin.com/ml/libc-help/2008-08/msg00007.html

Intel's take on GCC's memcpy implementation