This is one of those questions where you really, honestly, do need to look at a very low level.<p>Back in the ancient days, I worked at IBM doing benchmarking for an OS project that was never released. We were using PPC601 Sandalfoots (Sandalfeet?) as dev machines. A perennial fight was devs writing their own memcpy using *dst++ = *src++ loops rather than using the one in the library, which was written by one of my coworkers and consisted of 3 pages of assembly that used at least 18 registers.<p>The simple loop cost something like X cycles/byte, while the library version cost P + (Q cycles/byte): a fixed setup cost P plus a smaller per-byte cost Q. The difference was such that the crossover point was about 8 bytes. So scraping the simple memcpy implementations out of the code was about a weekly thing for me.<p>At this point, we discovered that our C compiler would pass structs by value (this was the early-ish days of ANSI C, and it was a surprise to some of my older coworkers) and benchmarked <i>that</i>.<p>And discovered that its copy code was <i>worse</i> than the simple *dst++ = *src++ loops, by about a factor of 4. (The simple loop would be optimized to work with word-sized ints, while the compiler was generating code that copied each byte individually.)<p>If you are doing something where this matters, something like VTune is very important. So is the ability to convince people who do stupid things to stop doing the stupid things.
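<p>For anyone who wants to poke at this themselves, here is a rough sketch in C of the kind of comparison described above. It is not the code we used back then: naive_copy, time_copy and crossover_bytes are names I made up for illustration, the 4096-byte buffers are an arbitrary choice, and clock() is a much cruder tool than a real profiler. The crossover helper just solves X*n = P + Q*n for n, assuming X is larger than Q.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* The hand-rolled loop from the story: one byte per iteration. */
void *naive_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

/* Both naive_copy and the library memcpy fit this pointer type. */
typedef void *(*copy_fn)(void *, const void *, size_t);

/*
 * Hypothetical harness: CPU seconds spent doing `reps` copies of `n` bytes.
 * The call goes through a volatile function pointer so the compiler is
 * less likely to inline the copy or drop the loop entirely. Treat the
 * numbers as a rough guide, not a measurement you would publish.
 */
double time_copy(copy_fn fn, size_t n, long reps)
{
    static unsigned char src[4096], dst[4096];
    copy_fn volatile call = fn;

    assert(n <= sizeof src);
    memset(src, 0xA5, sizeof src);
    memset(dst, 0, sizeof dst);

    clock_t start = clock();
    for (long i = 0; i < reps; i++)
        call(dst, src, n);
    clock_t end = clock();

    /* Sanity check: the copy really happened. */
    assert(memcmp(dst, src, n) == 0);
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/*
 * Cost model from the text: the simple loop costs x*n cycles, the library
 * version costs p + q*n. They are equal at n = p / (x - q); this only
 * makes sense when x > q.
 */
double crossover_bytes(double x, double p, double q)
{
    return p / (x - q);
}
```

Calling time_copy(naive_copy, n, reps) and time_copy(memcpy, n, reps) for a range of small n is one way to see where a crossover lands on a given machine. On a modern compiler the naive loop may well be turned into a memcpy call or vectorized, so check the generated assembly before trusting either number.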