This post starts off good, but...<p>- The motivation behind this post is storing generated IDs for a site the author is working on. Generated IDs don't have the "smaller values are exponentially more likely" property that make varints useful in the first place.<p>- The post rejects UTF-8's variable-length encoding because "it can only store numbers up to 1,112,064". That's the maximum number of unicode code points (...ish). The UTF-8 encoding's actual limit is >10^16. A much better reason to not use UTF-8's encoding is that it was created under pretty heavy backwards-compatibility restrictions that the author doesn't have. For example, there's no need for being self-synchronizing or to be compatible with Boyer-Moore substring searches.<p>- The post goes with a length-prefixed encoding, but then uses that length prefix as part of the folder structure (it's using the generated IDs as filepaths). Which is a great way to create an exponential distribution in the sizes of your directories, instead of the uniform one that was the goal.<p>It's not a <i>bad</i> post, there's facts in there, but I wouldn't recommend bothering with it.
Something probably not particularly important to the author (given that he's just using these essentially as a serial number), but worth mentioning for general use:<p>> Varints can represent negative numbers too, by casting them to unsigned numbers first.<p>But, assuming you're working with twos-complement numbers, it's going to be really inefficient, because small negative numbers are going to end up represented as extremely large positive numbers (e.g. -1 will map to MAX_INT - 1, where MAX_INT is the largest unsigned value representable in your underlying integer type).<p>You also lose the unlimited length property if you need to handle twos-complement integers, and your encoding is dependent on the underlying integer type you're serializing from/deserializing to (especially problematic in languages where types are architecture-dependent, like C/C++).<p>Protobuf handles this by doing what they call "ZigZag encoding" if you declare a value as a signed integer type. This maps small negative numbers to small positive numbers, alternating back and forth like so:<p><pre><code> Original Encoded
0 0
-1 1
1 2
-2 3
2 4
</code></pre>
and so on.<p>The catch is that you lose sortability if you do this... if you're going to have a bunch of negative numbers and need their encoded versions to sort correctly, you'll need to do something else. (Of course, you'll need to with twos-complement numbers, too, given that -1 will appear to be greater than 1).<p>Protobuf documentation on ZigZag encoding: <a href="https://developers.google.com/protocol-buffers/docs/encoding#signed-integers" rel="nofollow">https://developers.google.com/protocol-buffers/docs/encoding...</a>
The references in the article don't reflect the latest research. So far as I know, this is close to state of the art:<p><a href="https://github.com/lemire/JavaFastPFOR" rel="nofollow">https://github.com/lemire/JavaFastPFOR</a><p>The links to the papers in the readme are useful if you want to learn more.
"No “u”, (and no “o” or “i”) so accidental profanity cannot happen."<p>h0ly fvcking sh1t!<p>It's a good try, and making all the letters lowercase makes the numbers as vowels less legible. But if you're creating random strings you will create profanity and slurs.
Varints are widely used in databases. <a href="https://www.sqlite.org/src4/doc/trunk/www/varint.wiki" rel="nofollow">https://www.sqlite.org/src4/doc/trunk/www/varint.wiki</a>