I am surprised to see any speedup in this! I'd expect something trivial like this to be completely memory bound with the CPU sitting almost idle waiting for bytes coming in from memory.<p>Looking at the benchmark code, this is using rdtsc to read the CPU time stamp counter. That does not take waiting for memory into account, does it?<p>I wonder if there's a difference when measured in wall clock time. It's still somewhat beneficial to have the CPU work efficiently to give an opportunity for hyperthreading to take place when waiting for memory.<p>If you really wanted to make something like this faster, you should focus on cache utilization and make use of prefetching instructions. x86 has pretty bad prefetching instructions and pretty good speculative fetching, so don't expect massive speedups but on ARM or Aarch64, you have a finer grained control over cache prefetching (L1 and L2 separately) and you could see <i>much</i> bigger differences.<p>As for benchmarking this kind of problems: you obviously want to measure real world performance, so you need wall clock time as well as time stamp counter, but I'd look for optimization clues in "perf stat" and other CPU perf counters, with an emphasis on cache misses and branch mispredictions.<p>The figure you should be staring at is the total <i>throughput</i> of the algorithm, measured in gigabytes per second. You should be getting figures close to the memory bandwidth available (25-50 GB/s depending on CPU and memory).<p>edit: I measured the wall clock time with clock_gettime before/after all the repeats (using a megabyte sized buffer) and there is indeed no significant difference, here's my results:<p><pre><code> memcpy(tmpbuffer,buffer,N): 0.122945 cycles / ops 1495907352 nsec (1.495907 sec)
countspaces(buffer, N): 3.657322 cycles / ops 1544915395 nsec (1.544915 sec)
despace(buffer, N): 6.521193 cycles / ops 1621204460 nsec (1.621204 sec)
faster_despace(buffer, N): 1.721657 cycles / ops 1500507217 nsec (1.500507 sec)
despace64(buffer, N): 3.595031 cycles / ops 1544993649 nsec (1.544994 sec)
despace_to(buffer, N, tmpbuffer): 6.307885 cycles / ops 1615101563 nsec (1.615102 sec)
avx2_countspaces(buffer, N): 0.190992 cycles / ops 1460961459 nsec (1.460961 sec)
avx2_despace(buffer, N): 5.750583 cycles / ops 1615971010 nsec (1.615971 sec)
sse4_despace(buffer, N): 0.985002 cycles / ops 1482901389 nsec (1.482901 sec)
sse4_despace_branchless(buffer, N): 0.338737 cycles / ops 1460874704 nsec (1.460875 sec)
sse4_despace_trail(buffer, N): 1.950657 cycles / ops 1502268447 nsec (1.502268 sec)
sse42_despace_branchless(buffer, N): 0.562246 cycles / ops 1468638389 nsec (1.468638 sec)
sse42_despace_branchless_lookup(buffer, N): 0.624913 cycles / ops 1472445127 nsec (1.472445 sec)
sse42_despace_to(buffer, N,tmpbuffer): 1.747046 cycles / ops 1507705780 nsec (1.507706 sec)
</code></pre>
Here's the diff to the original: <a href="http://pasteall.org/208511" rel="nofollow">http://pasteall.org/208511</a><p>edit2: surprisingly, Clang is about 10% slower than GCC in my experiments.