TE
TechEcho
Home24h TopNewestBestAskShowJobs
GitHubTwitter
Home

TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.

GitHubTwitter

Home

HomeNewestBestAskShowJobs

Resources

HackerNews APIOriginal HackerNewsNext.js

© 2025 TechEcho. All rights reserved.

Let's Make a Varint

26 pointsby saturncoleusabout 9 years ago

5 comments

Strilancabout 9 years ago
This post starts off good, but...<p>- The motivation behind this post is storing generated IDs for a site the author is working on. Generated IDs don&#x27;t have the &quot;smaller values are exponentially more likely&quot; property that make varints useful in the first place.<p>- The post rejects UTF-8&#x27;s variable-length encoding because &quot;it can only store numbers up to 1,112,064&quot;. That&#x27;s the maximum number of unicode code points (...ish). The UTF-8 encoding&#x27;s actual limit is &gt;10^16. A much better reason to not use UTF-8&#x27;s encoding is that it was created under pretty heavy backwards-compatibility restrictions that the author doesn&#x27;t have. For example, there&#x27;s no need for being self-synchronizing or to be compatible with Boyer-Moore substring searches.<p>- The post goes with a length-prefixed encoding, but then uses that length prefix as part of the folder structure (it&#x27;s using the generated IDs as filepaths). Which is a great way to create an exponential distribution in the sizes of your directories, instead of the uniform one that was the goal.<p>It&#x27;s not a <i>bad</i> post, there&#x27;s facts in there, but I wouldn&#x27;t recommend bothering with it.
评论 #11656987 未加载
JonathonWabout 9 years ago
Something probably not particularly important to the author (given that he&#x27;s just using these essentially as a serial number), but worth mentioning for general use:<p>&gt; Varints can represent negative numbers too, by casting them to unsigned numbers first.<p>But, assuming you&#x27;re working with twos-complement numbers, it&#x27;s going to be really inefficient, because small negative numbers are going to end up represented as extremely large positive numbers (e.g. -1 will map to MAX_INT - 1, where MAX_INT is the largest unsigned value representable in your underlying integer type).<p>You also lose the unlimited length property if you need to handle twos-complement integers, and your encoding is dependent on the underlying integer type you&#x27;re serializing from&#x2F;deserializing to (especially problematic in languages where types are architecture-dependent, like C&#x2F;C++).<p>Protobuf handles this by doing what they call &quot;ZigZag encoding&quot; if you declare a value as a signed integer type. This maps small negative numbers to small positive numbers, alternating back and forth like so:<p><pre><code> Original Encoded 0 0 -1 1 1 2 -2 3 2 4 </code></pre> and so on.<p>The catch is that you lose sortability if you do this... if you&#x27;re going to have a bunch of negative numbers and need their encoded versions to sort correctly, you&#x27;ll need to do something else. (Of course, you&#x27;ll need to with twos-complement numbers, too, given that -1 will appear to be greater than 1).<p>Protobuf documentation on ZigZag encoding: <a href="https:&#x2F;&#x2F;developers.google.com&#x2F;protocol-buffers&#x2F;docs&#x2F;encoding#signed-integers" rel="nofollow">https:&#x2F;&#x2F;developers.google.com&#x2F;protocol-buffers&#x2F;docs&#x2F;encoding...</a>
评论 #11656973 未加载
评论 #11657637 未加载
ccleveabout 9 years ago
The references in the article don&#x27;t reflect the latest research. So far as I know, this is close to state of the art:<p><a href="https:&#x2F;&#x2F;github.com&#x2F;lemire&#x2F;JavaFastPFOR" rel="nofollow">https:&#x2F;&#x2F;github.com&#x2F;lemire&#x2F;JavaFastPFOR</a><p>The links to the papers in the readme are useful if you want to learn more.
hyperpapeabout 9 years ago
&quot;No “u”, (and no “o” or “i”) so accidental profanity cannot happen.&quot;<p>h0ly fvcking sh1t!<p>It&#x27;s a good try, and making all the letters lowercase makes the numbers as vowels less legible. But if you&#x27;re creating random strings you will create profanity and slurs.
评论 #11657248 未加载
评论 #11657493 未加载
btraskabout 9 years ago
Varints are widely used in databases. <a href="https:&#x2F;&#x2F;www.sqlite.org&#x2F;src4&#x2F;doc&#x2F;trunk&#x2F;www&#x2F;varint.wiki" rel="nofollow">https:&#x2F;&#x2F;www.sqlite.org&#x2F;src4&#x2F;doc&#x2F;trunk&#x2F;www&#x2F;varint.wiki</a>