You can use the "fancy math" version on all 8 bytes in the string simultaneously:<p><pre><code> #include &lt;cstdint&gt;
 #include &lt;cstring&gt;

 uint32_t convert_hex(const char* s)
 {
     // memcpy, not reinterpret_cast: the cast is a strict-aliasing (and
     // possibly alignment) violation; memcpy compiles to the same single load
     uint64_t a;
     std::memcpy(&amp;a, s, 8);
     // each byte: low nibble, plus 9 when the 0x40 "letter" bit is set
     a = (a &amp; 0x0F0F0F0F0F0F0F0Fu) + 9 * ((a &amp; 0xC0C0C0C0C0C0C0C0u) &gt;&gt; 6);
     // fold the byte-sized nibbles pairwise into ever-wider lanes
     a = (a &amp; 0x000F000F000F000Fu) | ((a &amp; 0x0F000F000F000F00u) &gt;&gt; 4);
     a = (a &amp; 0x000000FF000000FFu) | ((a &amp; 0x00FF000000FF0000u) &gt;&gt; 8);
     uint32_t b = (a &amp; 0x000000000000FFFFu) | ((a &amp; 0x0000FFFF00000000u) &gt;&gt; 16);
     // undo the byte and nibble reversal caused by the little-endian load
     b = __builtin_bswap32(b);
     b = (b &amp; 0x0F0F0F0Fu) &lt;&lt; 4 | (b &amp; 0xF0F0F0F0u) &gt;&gt; 4;
     return b;
 }
</code></pre>
Compiles to 30 instructions with clang, so 3.75 instructions per byte (clang is one instruction cleverer than gcc). No branching, and the only "complicated" instructions are bswap (__builtin_bswap32), lea (the "multiplication" by nine), and one addition. Other than that it's all bit manipulation: moves, shifts, ands, ors. However, I doubt it will pipeline well; the data dependencies are quite linear.<p>It has <i>very</i> little tolerance for inputs that are anything other than 8-byte hex strings. Do error checking elsewhere.<p>I doubt this would be faster in any meaningful sense in a real-world use case.
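For anyone tracing through it, here's a standalone copy of the same routine with each step's effect on the input "12345678" noted in comments (assuming a little-endian machine, which the final bswap already presumes):<p>

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

uint32_t convert_hex_traced(const char* s)
{
    uint64_t a;
    std::memcpy(&a, s, 8);  // "12345678" -> a = 0x3837363534333231 (little-endian)
    // ASCII -> nibble value in each byte (+9 when the 0x40 "letter" bit is set)
    a = (a & 0x0F0F0F0F0F0F0F0Fu) + 9 * ((a & 0xC0C0C0C0C0C0C0C0u) >> 6);
    //                               a = 0x0807060504030201
    // fold adjacent bytes into 16-bit lanes, two nibbles per lane
    a = (a & 0x000F000F000F000Fu) | ((a & 0x0F000F000F000F00u) >> 4);
    //                               a = 0x0087006500430021
    // fold adjacent 16-bit lanes into 32-bit lanes
    a = (a & 0x000000FF000000FFu) | ((a & 0x00FF000000FF0000u) >> 8);
    //                               a = 0x0000876500004321
    // fold the two 32-bit halves into one 32-bit value
    uint32_t b = (a & 0x000000000000FFFFu) | ((a & 0x0000FFFF00000000u) >> 16);
    //                               b = 0x87654321
    b = __builtin_bswap32(b);     // b = 0x21436587 (reverse bytes)
    b = (b & 0x0F0F0F0Fu) << 4 | (b & 0xF0F0F0F0u) >> 4;
    //                               b = 0x12345678 (swap nibbles in each byte)
    return b;
}
```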
Using an array lookup is a bad way to optimize:
digittoval[src[N]] can take up to 200 cycles if the table isn't in cache.<p>Only synthetic tests small enough that the table stays hot (most of them do nothing but run the code under test) show good results.
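For reference, the lookup pattern being criticized is roughly this (a sketch; this `digittoval` is a 256-entry nibble table along the lines of the one in the post, not its exact code):<p>

```cpp
#include <cassert>
#include <cstdint>

// 256-entry table: ASCII character -> hex digit value, -1 for non-hex bytes
static int8_t digittoval[256];

static void init_digittoval() {
    for (int i = 0; i < 256; i++) digittoval[i] = -1;
    for (int i = 0; i < 10; i++) digittoval['0' + i] = (int8_t)i;
    for (int i = 0; i < 6; i++) {
        digittoval['a' + i] = (int8_t)(10 + i);
        digittoval['A' + i] = (int8_t)(10 + i);
    }
}

// One dependent load per character; a miss on any of them stalls for the
// full memory latency the parent comment is describing.
uint32_t hex8_to_u32_table(const uint8_t* src) {
    uint32_t v = 0;
    for (int i = 0; i < 8; i++)  // assumes valid hex input, like the original
        v = v << 4 | (uint32_t)(uint8_t)digittoval[src[i]];
    return v;
}
```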
The current winner is a SIMD-ish solution using 32-bit math as four 8-bit lanes. This has been added to the repo and is discussed in the comments.<p><a href="https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2019/04/17" rel="nofollow">https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...</a>
I don't have the spare cycles to test this out for myself right now, but I'm curious whether a 64 KB table would be substantially faster. You'd cut it down from 8 indirections, 3 shifts, and 3 ORs to 4 indirections, 1 shift, and 1 OR.<p><pre><code> uint32_t hex_to_u32_lookup(const uint8_t *src) {
uint16_t *wsrc = (uint16_t *) src;
uint32_t v1 = digittoval[wsrc[0]];
uint32_t v2 = digittoval[wsrc[1]];
return v2 << 8 | v1;
}
</code></pre>
(Also, small note: the function is named `hex_to_u32_lookup`, but it only returns a 16-bit result. Might want to clarify in the post.)
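For what it's worth, the full 32-bit version of the 64 KB-table idea might look like this (a sketch, not benchmarked; `digitpairtoval` is a hypothetical 65,536-entry table, and the two-character keys assume a little-endian machine):<p>

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical 64 KB table: maps two ASCII hex chars, packed into a uint16_t
// as they sit in memory on a little-endian machine, to their byte value.
static uint8_t digitpairtoval[65536];

static int hexval(uint8_t c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

static void init_pair_table() {
    for (unsigned first = 0; first < 256; first++)
        for (unsigned second = 0; second < 256; second++) {
            int h = hexval((uint8_t)first), l = hexval((uint8_t)second);
            // key as laid out in memory: first char in the low byte
            uint16_t key = (uint16_t)(first | second << 8);
            digitpairtoval[key] = (h >= 0 && l >= 0) ? (uint8_t)(h << 4 | l) : 0;
        }
}

uint32_t hex_to_u32_lookup16(const uint8_t *src) {
    uint16_t w0, w1, w2, w3;
    std::memcpy(&w0, src + 0, 2);  // memcpy avoids the aliasing issue
    std::memcpy(&w1, src + 2, 2);  // of the (uint16_t *) cast above
    std::memcpy(&w2, src + 4, 2);
    std::memcpy(&w3, src + 6, 2);
    return (uint32_t)digitpairtoval[w0] << 24 |
           (uint32_t)digitpairtoval[w1] << 16 |
           (uint32_t)digitpairtoval[w2] << 8  |
           (uint32_t)digitpairtoval[w3];
}
```

That's 4 indirections, 3 shifts, and 3 ORs, and the loads are independent of each other, so they can at least miss in parallel.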
With modern hardware, for this to be a meaningful percentage of your workload, you'd need to be working with super high data volumes (100 megabytes per second at the very least, maybe an order of magnitude more depending on the exact hardware).<p>A reasonable system designer would never create an interface that accepts only hexadecimal input for bulk data transfers at those rates. That would waste half the I/O, which makes no sense!<p>In this case, the problem with your system isn't the cycle count of your hex conversion function. It's a poorly designed interface that's forced to use hex for high data volumes. You get maybe a fraction of a percent optimizing the hex conversion function, but there's a low-hanging 100% improvement if you just use all 8 bits of your bytes.<p>Or in Donald Knuth's words, "Premature optimization is the root of all evil."