You can use the "fancy math" version on all 8 bytes in the string simultaneously:<p><pre><code> #include &lt;cstdint&gt;
 #include &lt;cstring&gt;

 uint32_t convert_hex(const char* s)
 {
     // memcpy, not reinterpret_cast: the cast is a strict-aliasing (and
     // possibly alignment) violation; memcpy compiles to the same single load
     uint64_t a;
     std::memcpy(&amp;a, s, 8);
     // each byte: low nibble, plus 9 when the 0x40 "letter" bit is set
     a = (a &amp; 0x0F0F0F0F0F0F0F0Fu) + 9 * ((a &amp; 0xC0C0C0C0C0C0C0C0u) &gt;&gt; 6);
     // fold the byte-sized nibbles pairwise into ever-wider lanes
     a = (a &amp; 0x000F000F000F000Fu) | ((a &amp; 0x0F000F000F000F00u) &gt;&gt; 4);
     a = (a &amp; 0x000000FF000000FFu) | ((a &amp; 0x00FF000000FF0000u) &gt;&gt; 8);
     uint32_t b = (a &amp; 0x000000000000FFFFu) | ((a &amp; 0x0000FFFF00000000u) &gt;&gt; 16);
     // undo the byte and nibble reversal caused by the little-endian load
     b = __builtin_bswap32(b);
     b = (b &amp; 0x0F0F0F0Fu) &lt;&lt; 4 | (b &amp; 0xF0F0F0F0u) &gt;&gt; 4;
     return b;
 }
</code></pre>
Compiles to 30 instructions with clang, so 3.75 instructions per byte (clang is one instruction cleverer than gcc). No branching, and the only "complicated" instructions are bswap (__builtin_bswap32), lea (the "multiplication" by nine), and one addition. Other than that it's all bit manipulation: moves, shifts, ands, ors. However, I doubt it will pipeline well; the data dependencies are quite linear.<p>It has <i>very</i> little tolerance for inputs that are anything other than 8-byte hex strings. Do error checking elsewhere.<p>I doubt this would be faster in any meaningful sense in a real-world use case.
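For anyone tracing through it, here's a standalone copy of the same routine with each step's effect on the input "12345678" noted in comments (assuming a little-endian machine, which the final bswap already presumes):<p>

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

uint32_t convert_hex_traced(const char* s)
{
    uint64_t a;
    std::memcpy(&a, s, 8);  // "12345678" -> a = 0x3837363534333231 (little-endian)
    // ASCII -> nibble value in each byte (+9 when the 0x40 "letter" bit is set)
    a = (a & 0x0F0F0F0F0F0F0F0Fu) + 9 * ((a & 0xC0C0C0C0C0C0C0C0u) >> 6);
    //                               a = 0x0807060504030201
    // fold adjacent bytes into 16-bit lanes, two nibbles per lane
    a = (a & 0x000F000F000F000Fu) | ((a & 0x0F000F000F000F00u) >> 4);
    //                               a = 0x0087006500430021
    // fold adjacent 16-bit lanes into 32-bit lanes
    a = (a & 0x000000FF000000FFu) | ((a & 0x00FF000000FF0000u) >> 8);
    //                               a = 0x0000876500004321
    // fold the two 32-bit halves into one 32-bit value
    uint32_t b = (a & 0x000000000000FFFFu) | ((a & 0x0000FFFF00000000u) >> 16);
    //                               b = 0x87654321
    b = __builtin_bswap32(b);     // b = 0x21436587 (reverse bytes)
    b = (b & 0x0F0F0F0Fu) << 4 | (b & 0xF0F0F0F0u) >> 4;
    //                               b = 0x12345678 (swap nibbles in each byte)
    return b;
}
```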
Using an array lookup is a bad way to optimize:
digittoval[src[N]] can take up to 200 cycles if the table isn't in cache.<p>Only synthetic tests small enough that the table stays hot (most of them do nothing but run the code under test) show good results.
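For reference, the lookup pattern being criticized is roughly this (a sketch; this `digittoval` is a 256-entry nibble table along the lines of the one in the post, not its exact code):<p>

```cpp
#include <cassert>
#include <cstdint>

// 256-entry table: ASCII character -> hex digit value, -1 for non-hex bytes
static int8_t digittoval[256];

static void init_digittoval() {
    for (int i = 0; i < 256; i++) digittoval[i] = -1;
    for (int i = 0; i < 10; i++) digittoval['0' + i] = (int8_t)i;
    for (int i = 0; i < 6; i++) {
        digittoval['a' + i] = (int8_t)(10 + i);
        digittoval['A' + i] = (int8_t)(10 + i);
    }
}

// One dependent load per character; a miss on any of them stalls for the
// full memory latency the parent comment is describing.
uint32_t hex8_to_u32_table(const uint8_t* src) {
    uint32_t v = 0;
    for (int i = 0; i < 8; i++)  // assumes valid hex input, like the original
        v = v << 4 | (uint32_t)(uint8_t)digittoval[src[i]];
    return v;
}
```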
The current winner is a SIMD-ish solution using 32-bit math as four 8-bit lanes. This has been added to the repo and is discussed in the comments.<p><a href="https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/tree/master/2019/04/17" rel="nofollow">https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...</a>
I don't have the spare cycles to test this out for myself right now, but I'm curious whether a 64 KB table would be substantially faster. You'd cut it down from 8 indirections, 3 shifts, and 3 ORs to 4 indirections, 1 shift, and 1 OR.<p><pre><code> uint32_t hex_to_u32_lookup(const uint8_t *src) {
uint16_t *wsrc = (uint16_t *) src;
uint32_t v1 = digittoval[wsrc[0]];
uint32_t v2 = digittoval[wsrc[1]];
return v2 << 8 | v1;
}
</code></pre>
(Also, small note: the function is named `hex_to_u32_lookup`, but it only returns a 16-bit result. Might want to clarify in the post.)
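For what it's worth, the full 32-bit version of the 64 KB-table idea might look like this (a sketch, not benchmarked; `digitpairtoval` is a hypothetical 65,536-entry table, and the two-character keys assume a little-endian machine):<p>

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical 64 KB table: maps two ASCII hex chars, packed into a uint16_t
// as they sit in memory on a little-endian machine, to their byte value.
static uint8_t digitpairtoval[65536];

static int hexval(uint8_t c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

static void init_pair_table() {
    for (unsigned first = 0; first < 256; first++)
        for (unsigned second = 0; second < 256; second++) {
            int h = hexval((uint8_t)first), l = hexval((uint8_t)second);
            // key as laid out in memory: first char in the low byte
            uint16_t key = (uint16_t)(first | second << 8);
            digitpairtoval[key] = (h >= 0 && l >= 0) ? (uint8_t)(h << 4 | l) : 0;
        }
}

uint32_t hex_to_u32_lookup16(const uint8_t *src) {
    uint16_t w0, w1, w2, w3;
    std::memcpy(&w0, src + 0, 2);  // memcpy avoids the aliasing issue
    std::memcpy(&w1, src + 2, 2);  // of the (uint16_t *) cast above
    std::memcpy(&w2, src + 4, 2);
    std::memcpy(&w3, src + 6, 2);
    return (uint32_t)digitpairtoval[w0] << 24 |
           (uint32_t)digitpairtoval[w1] << 16 |
           (uint32_t)digitpairtoval[w2] << 8  |
           (uint32_t)digitpairtoval[w3];
}
```

That's 4 indirections, 3 shifts, and 3 ORs, and the loads are independent of each other, so they can at least miss in parallel.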
With modern hardware, for this to be a meaningful percentage of your workload, you'd need to be working with super high data volumes (100 megabytes per second at the very least, maybe an order of magnitude more depending on the exact hardware).<p>A reasonable system designer would never create an interface that accepts only hexadecimal input for bulk data transfers at those rates. That would waste half the I/O, which makes no sense!<p>In this case, the problem with your system isn't the cycle count of your hex conversion function. It's a poorly designed interface that's forced to use hex for high data volumes. You get maybe a fraction of a percent optimizing the hex conversion function, but there's a low-hanging 100% improvement if you just use all 8 bits of your bytes.<p>Or in Donald Knuth's words, "Premature optimization is the root of all evil."