In my experiments, anything using lookup tables was slower than a naive, branching decoder on real-world data. Reading from a lookup table in L1 cache has a latency of ~4 cycles, which is prohibitive for the simple case of mostly-ASCII bytes. You can easily achieve more than 1.5 GB/s with a naive decoder, while all the "smarter" approaches are capped at ~800 MB/s.
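For concreteness, this is roughly the shape of decoder I mean; just a sketch with validation stripped out, not the exact code I benchmarked. The point is that the mostly-ASCII hot path is a single, well-predicted compare per byte:

    #include <stddef.h>
    #include <stdint.h>

    /* Naive branching scan: classify each lead byte with compares and
     * skip the corresponding number of bytes. Validation omitted. */
    size_t count_codepoints(const uint8_t *p, const uint8_t *end) {
        size_t n = 0;
        while (p < end) {
            uint8_t b = *p;
            if (b < 0x80)      p += 1;  /* ASCII: the overwhelmingly common branch */
            else if (b < 0xE0) p += 2;  /* 2-byte sequence (lead 110xxxxx) */
            else if (b < 0xF0) p += 3;  /* 3-byte sequence (lead 1110xxxx) */
            else               p += 4;  /* 4-byte sequence (lead 11110xxx) */
            n++;
        }
        return n;
    }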
The overlong lookup can also be written without a memory lookup as

    0x10000U >> ((0x1531U >> (i*5)) & 31);
On most current x86 chips this has a latency of 3 cycles -- LEA+SHR+SHR -- which is better than an L1 cache hit almost everywhere.
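If I'm reading it right, 0x1531 just packs three 5-bit shift amounts, and the expression unpacks the smallest code point that legitimately needs the sequence length in question. A quick sketch to check that, assuming i is the number of continuation bytes (the surrounding decoder isn't shown here, so that part is a guess):

    #include <stdio.h>

    /* Sketch: unpack the 5-bit shift amounts packed into 0x1531, assuming
     * i is the number of continuation bytes (1..3). */
    int main(void) {
        for (unsigned i = 1; i <= 3; i++) {
            unsigned shift  = (0x1531U >> (i * 5)) & 31;  /* 9, 5, 0 */
            unsigned min_cp = 0x10000U >> shift;          /* 0x80, 0x800, 0x10000 */
            printf("i=%u: minimum code point U+%04X\n", i, min_cp);
        }
        /* A decoded code point below this minimum means an overlong encoding. */
        return 0;
    }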
This looks very similar to the approach we recently used to transcode UTF-8 into UTF-16 using AVX-512: https://arxiv.org/pdf/2212.05098

It's part of simdutf.
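For anyone who wants to try it, a rough usage sketch based on the entry points in the simdutf README (double-check the names and error handling against the current API):

    #include <string>
    #include "simdutf.h"

    // Convert UTF-8 to UTF-16LE via simdutf; returns an empty string on
    // invalid input. Sketch only, written from memory of the README.
    std::u16string to_utf16(const std::string &utf8) {
        auto words = simdutf::utf16_length_from_utf8(utf8.data(), utf8.size());
        std::u16string out(words, u'\0');
        auto written = simdutf::convert_utf8_to_utf16le(utf8.data(), utf8.size(), out.data());
        if (written == 0 && !utf8.empty())
            return std::u16string();  // conversion failed: input was not valid UTF-8
        out.resize(written);
        return out;
    }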
The code is careful not to read past the end of the buffer, but it doesn't explicitly check that there are enough bytes available for the current multibyte character. However, this "end of buffer in middle of character" error is caught later by the check for valid continuation bytes. I thought that was quite neat.
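One way that can shake out in practice (a toy sketch of the idea, not the actual code under discussion): if bytes past the end are treated as 0, they can never pass the 10xxxxxx continuation test, so a sequence cut short by the buffer end fails the same check as any other malformed input.

    #include <stdint.h>

    /* Toy illustration: a byte past the end is treated as 0, which never
     * matches 10xxxxxx, so a truncated multibyte character is rejected by
     * the ordinary continuation-byte check without a separate length test. */
    static int is_continuation(uint8_t b) { return (b & 0xC0) == 0x80; }

    static int tail_is_valid(const uint8_t *p, const uint8_t *end, int len) {
        for (int i = 1; i < len; i++) {
            uint8_t b = (p + i < end) ? p[i] : 0;  /* 0 stands in for "no byte" */
            if (!is_continuation(b))
                return 0;  /* covers bad bytes and end-of-buffer truncation alike */
        }
        return 1;
    }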