When looking at code performance, it's important to remember that conditional branches are almost always cheap <i>inside microbenchmarks</i>, because the CPU's branch predictor quickly learns the pattern when the same branches get taken on every iteration... but they can be far more expensive in the real world, where inputs vary and mispredictions are common. A similar issue applies to cache: your code and data might fit inside the L1 cache in your benchmarks, but when it's used in the real world you get cache misses, since the rest of the program is competing for the same cache.
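To make the branch half of this concrete, here is a rough C sketch, not a rigorous benchmark. It runs the same branchy loop over identical bytes in two orders: sorted, so the branch flips only once, and pseudo-random, so the branch is close to a coin toss. All the names (`count_at_least`, `fill_pseudorandom`, `run_demo`), the buffer size, and the repetition count are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical function under test: one data-dependent branch per element. */
static size_t count_at_least(const uint8_t *data, size_t n, uint8_t threshold) {
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[i] >= threshold) {
            count++;
        }
    }
    return count;
}

/* Deterministic xorshift32 fill, so every run sees the same bytes. */
static void fill_pseudorandom(uint8_t *buf, size_t n, uint32_t seed) {
    uint32_t x = seed ? seed : 1u; /* xorshift32 must not start at 0 */
    for (size_t i = 0; i < n; i++) {
        x ^= x << 13;
        x ^= x >> 17;
        x ^= x << 5;
        buf[i] = (uint8_t)(x >> 24);
    }
}

/* qsort comparator for bytes. */
static int cmp_u8(const void *a, const void *b) {
    return (int)*(const uint8_t *)a - (int)*(const uint8_t *)b;
}

/* Time `reps` passes over the buffer; the volatile sink keeps the calls alive. */
static double time_count(const uint8_t *data, size_t n, int reps, size_t *total) {
    volatile size_t sink = 0;
    clock_t start = clock();
    for (int r = 0; r < reps; r++) {
        sink += count_at_least(data, n, 128);
    }
    clock_t end = clock();
    *total = sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Same bytes, two orders: sorted (predictable branch) vs. shuffled. */
void run_demo(void) {
    enum { N = 1 << 16, REPS = 200 };
    static uint8_t shuffled[N], sorted[N];
    size_t total_sorted, total_shuffled;

    fill_pseudorandom(shuffled, N, 12345u);
    memcpy(sorted, shuffled, N);
    qsort(sorted, N, 1, cmp_u8);

    double t_sorted = time_count(sorted, N, REPS, &total_sorted);
    double t_shuffled = time_count(shuffled, N, REPS, &total_shuffled);

    /* Both orders hold the same bytes, so the counts must match. */
    assert(total_sorted == total_shuffled);
    printf("sorted:   %.4f s\nshuffled: %.4f s\n", t_sorted, t_shuffled);
}
```

Call `run_demo()` from a `main` to try it. Whether you see a gap depends a lot on the compiler and flags: at higher optimization levels the loop may be compiled to branchless or vectorized code, in which case both orders take about the same time. Both buffers are also small enough to stay in cache, so this says nothing about the cache half of the problem.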