You can use the "fancy math" version on all 8 bytes in the string simultaneously:<p><pre><code> #include <cstdint>
 #include <cstring>

 uint32_t convert_hex(const char* s)
 {
     // memcpy instead of reinterpret_cast: avoids the strict-aliasing
     // and alignment UB; compilers turn it into a single 8-byte load.
     uint64_t a;
     std::memcpy(&a, s, sizeof a);
     // Map each ASCII hex digit to its 4-bit value. '0'-'9' have bit 6
     // clear; 'A'-'F' and 'a'-'f' have bit 6 set, so adding 9 to those
     // bytes turns their low nibbles 1..6 into 10..15.
     a = (a & 0x0F0F0F0F0F0F0F0Fu) + 9 * ((a & 0xC0C0C0C0C0C0C0C0u) >> 6);
     // Pack the eight nibble values together, halving the spread each step.
     a = (a & 0x000F000F000F000Fu) | ((a & 0x0F000F000F000F00u) >> 4);
     a = (a & 0x000000FF000000FFu) | ((a & 0x00FF000000FF0000u) >> 8);
     uint32_t b = (a & 0x000000000000FFFFu) | ((a & 0x0000FFFF00000000u) >> 16);
     // The little-endian load left the bytes reversed and the nibbles
     // swapped within each byte; undo both.
     b = __builtin_bswap32(b);              // GCC/clang builtin
     b = (b & 0x0F0F0F0Fu) << 4 | (b & 0xF0F0F0F0u) >> 4;
     return b;
 }
</code></pre>
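For comparison, the straightforward per-character loop that the SWAR version parallelizes would look something like this (a sketch of my own; the name convert_hex_scalar and the loop are not from the thread, but it uses the same digit-mapping trick, one byte at a time):<p><pre><code> #include <cstdint>

 uint32_t convert_hex_scalar(const char* s)
 {
     uint32_t r = 0;
     for (int i = 0; i < 8; ++i) {
         unsigned c = static_cast<unsigned char>(s[i]);
         // Same mapping as the SWAR step: low nibble, plus 9 if bit 6
         // is set (i.e. the character is a letter rather than a digit).
         r = (r << 4) | ((c & 0x0F) + 9 * ((c & 0x40) >> 6));
     }
     return r;
 }
</code></pre>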
Compiles to 30 instructions with clang, so 3.75 instructions per byte (clang is one instruction cleverer than gcc). There is no branching; the only "complicated" instructions are a bswap (from __builtin_bswap32), an lea (the "multiplication" by nine), and one addition. Everything else is plain bit manipulation (moves, shifts, ands, ors). However, I doubt it pipelines well: the data dependencies form a nearly straight chain, so each step has to wait for the previous one.<p>It has <i>very</i> little tolerance for inputs that are anything other than 8-byte hex strings. Do error checking elsewhere.<p>I doubt this would be faster in any meaningful sense in a real-world use case.
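<p>A quick sanity check of the claim (assuming a little-endian machine and a GCC/clang compiler, since the code relies on __builtin_bswap32; the memcpy load replaces the original reinterpret_cast to stay well-defined):<p><pre><code> #include <cassert>
 #include <cstdint>
 #include <cstdlib>
 #include <cstring>

 // SWAR hex parser from above, reproduced so this compiles standalone.
 uint32_t convert_hex(const char* s)
 {
     uint64_t a;
     std::memcpy(&a, s, sizeof a);
     a = (a & 0x0F0F0F0F0F0F0F0Fu) + 9 * ((a & 0xC0C0C0C0C0C0C0C0u) >> 6);
     a = (a & 0x000F000F000F000Fu) | ((a & 0x0F000F000F000F00u) >> 4);
     a = (a & 0x000000FF000000FFu) | ((a & 0x00FF000000FF0000u) >> 8);
     uint32_t b = (a & 0x000000000000FFFFu) | ((a & 0x0000FFFF00000000u) >> 16);
     b = __builtin_bswap32(b);
     b = (b & 0x0F0F0F0Fu) << 4 | (b & 0xF0F0F0F0u) >> 4;
     return b;
 }

 int main()
 {
     assert(convert_hex("DEADBEEF") == 0xDEADBEEFu);
     assert(convert_hex("deadbeef") == 0xDEADBEEFu);   // lowercase works too
     assert(convert_hex("00000000") == 0u);
     assert(convert_hex("ffffffff") == 0xFFFFFFFFu);
     // Cross-check a mixed-case input against the standard library.
     assert(convert_hex("00C0ffee") == std::strtoul("00C0ffee", nullptr, 16));
 }
</code></pre>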