In my experiments, anything using lookup tables was slower than a naive, branching decoder on real-world data. Reading from a lookup table in L1 cache has a latency of ~4 cycles, which is prohibitive for the simple case of mostly-ASCII bytes. You can easily achieve more than 1.5 GB/s with a naive decoder, while all the "smarter" approaches are capped at ~800 MB/s.
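For concreteness, this is roughly the shape of decoder I mean; just a sketch with validation stripped out, not the exact code I benchmarked. The point is that the mostly-ASCII hot path is a single, well-predicted compare per byte:

    #include <stddef.h>
    #include <stdint.h>

    /* Naive branching scan: classify each lead byte with compares and
     * skip the corresponding number of bytes. Validation omitted. */
    size_t count_codepoints(const uint8_t *p, const uint8_t *end) {
        size_t n = 0;
        while (p < end) {
            uint8_t b = *p;
            if (b < 0x80)      p += 1;  /* ASCII: the overwhelmingly common branch */
            else if (b < 0xE0) p += 2;  /* 2-byte sequence (lead 110xxxxx) */
            else if (b < 0xF0) p += 3;  /* 3-byte sequence (lead 1110xxxx) */
            else               p += 4;  /* 4-byte sequence (lead 11110xxx) */
            n++;
        }
        return n;
    }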
The overlong lookup can also be written without a memory lookup as

    0x10000U >> ((0x1531U >> (i*5)) & 31);
On most current x86 chips this has a latency of 3 cycles -- LEA+SHR+SHR -- which is better than an L1 cache hit almost everywhere.
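If I'm reading it right, 0x1531 just packs three 5-bit shift amounts, and the expression unpacks the smallest code point that legitimately needs the sequence length in question. A quick sketch to check that, assuming i is the number of continuation bytes (the surrounding decoder isn't shown here, so that part is a guess):

    #include <stdio.h>

    /* Sketch: unpack the 5-bit shift amounts packed into 0x1531, assuming
     * i is the number of continuation bytes (1..3). */
    int main(void) {
        for (unsigned i = 1; i <= 3; i++) {
            unsigned shift  = (0x1531U >> (i * 5)) & 31;  /* 9, 5, 0 */
            unsigned min_cp = 0x10000U >> shift;          /* 0x80, 0x800, 0x10000 */
            printf("i=%u: minimum code point U+%04X\n", i, min_cp);
        }
        /* A decoded code point below this minimum means an overlong encoding. */
        return 0;
    }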
This looks very similar to the approach we recently used to transcode UTF-8 into UTF-16 using AVX-512: https://arxiv.org/pdf/2212.05098

It's part of simdutf.
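For anyone who wants to try it, a rough usage sketch based on the entry points in the simdutf README (double-check the names and error handling against the current API):

    #include <string>
    #include "simdutf.h"

    // Convert UTF-8 to UTF-16LE via simdutf; returns an empty string on
    // invalid input. Sketch only, written from memory of the README.
    std::u16string to_utf16(const std::string &utf8) {
        auto words = simdutf::utf16_length_from_utf8(utf8.data(), utf8.size());
        std::u16string out(words, u'\0');
        auto written = simdutf::convert_utf8_to_utf16le(utf8.data(), utf8.size(), out.data());
        if (written == 0 && !utf8.empty())
            return std::u16string();  // conversion failed: input was not valid UTF-8
        out.resize(written);
        return out;
    }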
The code is careful not to read past the end of the buffer, but it doesn't explicitly check that there are enough bytes available for the current multibyte character. However, this "end of buffer in middle of character" error is caught later by the check for valid continuation bytes. I thought that was quite neat.
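One way that can shake out in practice (a toy sketch of the idea, not the actual code under discussion): if bytes past the end are treated as 0, they can never pass the 10xxxxxx continuation test, so a sequence cut short by the buffer end fails the same check as any other malformed input.

    #include <stdint.h>

    /* Toy illustration: a byte past the end is treated as 0, which never
     * matches 10xxxxxx, so a truncated multibyte character is rejected by
     * the ordinary continuation-byte check without a separate length test. */
    static int is_continuation(uint8_t b) { return (b & 0xC0) == 0x80; }

    static int tail_is_valid(const uint8_t *p, const uint8_t *end, int len) {
        for (int i = 1; i < len; i++) {
            uint8_t b = (p + i < end) ? p[i] : 0;  /* 0 stands in for "no byte" */
            if (!is_continuation(b))
                return 0;  /* covers bad bytes and end-of-buffer truncation alike */
        }
        return 1;
    }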