TE
TechEcho
Home24h TopNewestBestAskShowJobs
GitHubTwitter
Home

TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

GitHubTwitter

Home

HomeNewestBestAskShowJobs

Resources

HackerNews APIOriginal HackerNewsNext.js

© 2025 TechEcho. All rights reserved.

vu128: Efficient variable-length integers

93 pointsby jmillikin12 months ago

9 comments

lifthrasiir12 months ago
As with many other variable-length integer encodings, its efficiency really depends on the expected input distribution. One glaring issue in this regard is that its 2-byte encoding is rather wasted: only 6.25% out of all possible encodings are useful. This can be possibly fixed by making the 2-byte encoding represent 240..495 instead of 0..255. You can even make use of some bits from the first byte, so that for example the sequence goes like: 0=00 1=01 2=02 .. BF=BF C0=C0C0 C1=C0C1 .. 30FF=F0FF 3100=F13100 .. (all in hex).<p>A radically different design, used by for example 7z, is also possible if you put all continuation bits into the first byte. For example, instead of having `1xxxxxxx 1yyyyyyy 0zzzzzzz` for 3-byte encoding, use `110xxxxx yyyyyyyy zzzzzzzz`. This is essentially as compact as more common VLQ and LEB128 encodings, but you can determine the length of the whole sequence from the first byte alone! Unfortunately this is less suitable for integers larger than 64 bits, because then you run out of bits in the first byte. But it ends up with a rather elegant encoding for 64-bit integers, where `11111110` indicates 7 bytes and `11111111` indicates 8 bytes with no additional 0 bit needed.<p>&gt; An afternoon of web searching has failed to uncover a prior name for this particular encoding, so I&#x27;m going to name it vu128 and describe it below. If any reader knows of prior art, please send it my way.<p>I don&#x27;t know if it has other names, but a similar encoding can be seen from X.690 [1] and the &quot;additional information&quot; bit field of CBOR [2].<p>[1] <a href="https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;X.690#Length_octets" rel="nofollow">https:&#x2F;&#x2F;en.wikipedia.org&#x2F;wiki&#x2F;X.690#Length_octets</a><p>[2] <a href="https:&#x2F;&#x2F;www.rfc-editor.org&#x2F;rfc&#x2F;rfc8949.html#section-3" rel="nofollow">https:&#x2F;&#x2F;www.rfc-editor.org&#x2F;rfc&#x2F;rfc8949.html#section-3</a>
评论 #40457448 未加载
评论 #40457530 未加载
评论 #40457694 未加载
评论 #40459217 未加载
评论 #40463183 未加载
评论 #40458623 未加载
terrelln12 months ago
Variable-length integer schemes are great when you are encoding one integer. But when you are encoding a list of integers, you should really consider a different scheme.<p>Variable-length integer schemes generally interleave the control bits &amp; data bits. This means you don&#x27;t know where the next integer starts until you at least partially decode the current integer.<p>When encoding a list of integers, you would rather put all the control information together, and all the data together. This generally allows for significantly more efficient decoding using SIMD.
评论 #40460475 未加载
评论 #40461371 未加载
Const-me12 months ago
When I need them I usually implement the codec used in MKV format: <a href="https:&#x2F;&#x2F;www.rfc-editor.org&#x2F;rfc&#x2F;rfc8794.html" rel="nofollow">https:&#x2F;&#x2F;www.rfc-editor.org&#x2F;rfc&#x2F;rfc8794.html</a><p>Similar performance characteristics, i.e. just the first byte tells the encoded length, no need to test every byte. Luckily, most programming languages have intrinsic functions for BSWAP instruction.
评论 #40457961 未加载
kunley12 months ago
There is also SQLite varint, <a href="https:&#x2F;&#x2F;sqlite.org&#x2F;src4&#x2F;doc&#x2F;trunk&#x2F;www&#x2F;varint.wiki" rel="nofollow">https:&#x2F;&#x2F;sqlite.org&#x2F;src4&#x2F;doc&#x2F;trunk&#x2F;www&#x2F;varint.wiki</a><p>IIRC different from both Leb128 and VLQ, and quite neat. It&#x27;s for 64bit (u)ints, though
danking0012 months ago
If I’m following, this presents an alternative format and implementation to LEB128 which encodes and decodes substantially faster. Notably, the implementation is quite simple. Cool! And agreed that modern CPUs really suffer from branches.<p>Should I interpret the plot to mean the average elapsed wall clock time per integer decoded&#x2F;encoded? And can I conclude the throughput is the reciprocal? So about 100,000 integers per second or around a 1 MB&#x2F;s of decompressed data.<p>I know this is a bit unfair because the implementation is much more complex, but my first thought is why I would use vu128 instead of Lemire’s Stream VByte: <a href="https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1709.08990" rel="nofollow">https:&#x2F;&#x2F;arxiv.org&#x2F;abs&#x2F;1709.08990</a><p>A slight tangent but I stumbled on this library which stores floats XOR’ed with the previous float in the stream: <a href="https:&#x2F;&#x2F;github.com&#x2F;velvia&#x2F;compressed-vec">https:&#x2F;&#x2F;github.com&#x2F;velvia&#x2F;compressed-vec</a> it seems really clever to me! They reference “Gorilla: A Fast, Scalable, In-Memory Time Series Database” which in turn references two 2006 papers: “Fast Lossless Compression of Scientific Floating-Point Data” and “Fast and Efficient Compression of Floating-Point Data”. Frustratingly, the FB paper doesn’t benchmark their XOR-based floating point encoding but the earlier two papers do.
评论 #40460158 未加载
chubot12 months ago
FWIW the lobste.rs thread surfaced this 2016 comment thread about PrefixVarint vs. LEB128 varint in protobufs, which has lots of good info, and makes the UTF-8 analogy:<p><a href="https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=11263378">https:&#x2F;&#x2F;news.ycombinator.com&#x2F;item?id=11263378</a><p>Also sqlite varint:<p><a href="https:&#x2F;&#x2F;sqlite.org&#x2F;src4&#x2F;doc&#x2F;trunk&#x2F;www&#x2F;varint.wiki" rel="nofollow">https:&#x2F;&#x2F;sqlite.org&#x2F;src4&#x2F;doc&#x2F;trunk&#x2F;www&#x2F;varint.wiki</a>
评论 #40457259 未加载
ixau12 months ago
In terms of serializing JSON to a binary representation, which includes variable length integers, check CBOR<p><a href="https:&#x2F;&#x2F;datatracker.ietf.org&#x2F;doc&#x2F;html&#x2F;rfc8949" rel="nofollow">https:&#x2F;&#x2F;datatracker.ietf.org&#x2F;doc&#x2F;html&#x2F;rfc8949</a>
zigzag31212 months ago
How does it compare to vint64[0] performance-wise? For values that fit the 64-bit space.<p>[0] <a href="https:&#x2F;&#x2F;crates.io&#x2F;crates&#x2F;vint64" rel="nofollow">https:&#x2F;&#x2F;crates.io&#x2F;crates&#x2F;vint64</a>
评论 #40461400 未加载
vitiral12 months ago
zig-zag encoding is brilliant. Basically you can determine positive or negative based on odd&#x2F;even and divide by 2 to get the value. I tried to come up with a similar encoding and never thought of this
评论 #40458694 未加载
评论 #40458034 未加载