See also Firebase push IDs:<p><a href="https://firebase.googleblog.com/2015/02/the-2120-ways-to-ensure-unique_68.html" rel="nofollow">https://firebase.googleblog.com/2015/02/the-2120-ways-to-ens...</a><p>Lexicographically sortable identifiers are critical for any distributed data store if you want anything close to consistency. I've run into the issue of not having them and having to settle for some kind of <autoincrement_id><UUID> key, and it's a huge PITA. How this wasn't considered database 101 decades ago just blows my mind.<p>I'd like to see a spec included in this for synchronizing clocks, or for using Raft/Paxos to generate ULIDs with strong guarantees on sort order.<p>Also a minor gripe - I wish the ULID spec handled collisions at microsecond rather than millisecond granularity, because that would be more useful for realtime networked gaming and simulations.
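As a toy illustration of the sortable property: because the encoded timestamp is the string's most significant prefix, a plain lexicographic sort is already a creation-time sort (the two IDs below are just syntactically valid examples):

```typescript
// The later timestamp prefix ("01BX...") sorts after the earlier one
// ("01AR...") under an ordinary string comparison - no parsing needed.
const ids = [
  "01BX5ZZKBKACTAV9WEVGEMMVS0", // newer
  "01ARZ3NDEKTSV4RRFFQ69G5FAV", // older
];
console.log([...ids].sort()); // ["01ARZ3NDEK...", "01BX5ZZKBK..."] - time order
```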
We have used ULIDs in production for over a year now -- and have generated millions of them.<p>First, the main benefit of ULID is that you can generate the IDs within your own software rather than relying on the database. We can queue them or even reference them before they land in the database. The traditional roundtrip has been eliminated.<p>Secondly, being able to sort ULIDs is a nice plus, although not that big of a deal. It makes it relatively easy to shard or partition databases, and it provides a convenient sort if you're not looking for extreme accuracy.<p>ULIDs are also shorter and slightly more user-friendly than UUIDs.<p>In some circumstances we found the actual implementations to be slightly lacking. For example, the JS library for ULID once returned a 25-character string rather than the standard 26 characters, causing a big ruckus that we had to resolve manually.
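A minimal sketch of the "generate before the roundtrip" point from the first paragraph, assuming the common JavaScript ulid package (the same library mentioned above):

```typescript
// Sketch only: the id exists before any database roundtrip, so it can be
// queued, logged, or referenced by other records immediately.
import { ulid } from "ulid";

const orderId = ulid();                               // e.g. "01ARZ3NDEKTSV4RRFFQ69G5FAV"
const lineItem = { orderId, sku: "ABC-123", qty: 2 }; // references the order pre-insert
```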
There are definitely systems out there whose clocks are not accurate to the millisecond. It's not healthy for systems to encourage false assumptions (e.g., that ids monotonically increase).
The failure on overflow is weird.<p>Since the random component starts at a random point and isn't allowed to wrap, even if you generate only a small number of IDs per millisecond you have a constant ~1/2^79 chance of failing. The chance is small but reachable for a large network (Bitcoin does about 2^88 hashes a day).<p>It could have just wrapped with no problem, because no single node could plausibly generate 2^80 IDs by itself within one millisecond. And if you have multiple nodes, failing instead of wrapping doesn't prevent collisions between them anyway.
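A rough sketch of the behavior being discussed (not the reference implementation): the random component is re-seeded each millisecond and incremented within one, and the spec says to error rather than wrap when the increment overflows.

```typescript
import { randomBytes } from "crypto";

const MAX_RANDOM = (1n << 80n) - 1n; // 80-bit random component

let lastTime = -1;
let lastRandom = 0n;

function nextUlidRandom(now: number): bigint {
  if (now === lastTime) {
    if (lastRandom === MAX_RANDOM) {
      // Spec behavior: fail instead of wrapping back to zero.
      throw new Error("random component overflow within one millisecond");
      // The wrap-around alternative argued for above would be: lastRandom = 0n;
    }
    lastRandom += 1n; // monotonic within the same millisecond
  } else {
    lastTime = now;
    lastRandom = BigInt("0x" + randomBytes(10).toString("hex")); // fresh 80 bits
  }
  return lastRandom;
}
```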
It’s worth noting that, since the ID within the same millisecond is simply incremented, it may be vulnerable to enumeration attacks. So it should not be used to generate one-time tokens or object IDs exposed in URLs.
> <i>Uses Crockford's base32 for better efficiency and readability</i><p>Ugh. Crockford's base32 character set doesn't actually solve any of the problems it sets out to solve. Using it suggests to me some uncritical thinking.<p>It[0] says things like L is excluded, because "[uppercase] L Can be confused with 1". Ignoring the part where that is wildly inaccurate for any font that I've <i>ever</i> seen, why not then also remove G, 6, B, 8, Z, 2, S, or 5?<p>Reducing 1/I/i/L/l to just 1 does little to resolve visual ambiguity for users, because a user could just as easily read l or I instead of 1 or O instead of 0, because users don't know your made-up rules, which causes real problems because you often don't control both sides of the channel.<p>[0] - <a href="https://www.crockford.com/wrmg/base32.html" rel="nofollow">https://www.crockford.com/wrmg/base32.html</a>
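For what it's worth, the aliasing Crockford's spec prescribes happens on the decode side - i/I/l/L are accepted and read as 1, and o/O as 0 - which is exactly the "made-up rules" problem described above. A tiny sketch of that folding (not a full base32 decoder):

```typescript
// Crockford's alphabet drops I, L, O, U; on decode the spec folds the
// lookalike characters back into digits.
const ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";

function decodeChar(c: string): number {
  const folded = c.toUpperCase().replace(/[IL]/, "1").replace("O", "0");
  const value = ALPHABET.indexOf(folded);
  if (value === -1) throw new Error(`invalid character: ${c}`);
  return value;
}

console.log(decodeChar("l"), decodeChar("I"), decodeChar("o")); // 1 1 0
```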
cuid removes more of the randomness and adds a counter and a fingerprint:
<a href="https://github.com/ericelliott/cuid" rel="nofollow">https://github.com/ericelliott/cuid</a><p>The default id in MongoDB does about the same. I always thought the MongoDB identifiers worked well for a lot of use cases.<p>Its also worth mentioning that integer incrementing ids can scale just fine if you reserve them in large blocks and they are no longer guaranteed to match insertion order, e.g: <a href="https://github.com/pingcap/docs/blob/master/sql/mysql-compatibility.md#auto-increment-id" rel="nofollow">https://github.com/pingcap/docs/blob/master/sql/mysql-compat...</a>
I like the idea, but I also feel that 1.21e+24 unique ULIDs per millisecond is kind of defeated by the millisecond accuracy of the timestamp. This means there are effectively two tolerance values for time at play in the design of this spec that conflict with each other. If we want users to be able to generate ULIDs on such a short timescale (implying it's a realistic use case), then it would seem they should also be able to get comparable accuracy on the timestamp itself.
It would sure be nice if git commit IDs could use something like this. It would be really convenient if you could look at two commit IDs and know which one is older.
My experience is that you don't want human-readable, user-visible IDs to be sortable, but rather to have as unique a prefix (and, when you have experienced workers, also a postfix) as possible. So this is certainly useful, but specifying a human-readable representation is somewhat redundant.<p>Another issue is that there are cases when you want to represent the ID as a barcode of reasonable size and readability, which invariably leads to decimal-only Code128 with at most ~30 digits.
Forgive my ignorance. I'm more of a social scientist than a programmer. Questions:<p>1. Why not go for 16-character strings (instead of 26 or 36), with each character representing 8 bits?<p>Sure, you'd need 256 possible characters, but it's almost 2019 and Unicode has been with us for decades now. Surely we could be more cosmopolitan than Americentric ASCII and curate 256 characters for an 8-bit encoding?<p>With a 16-byte string, we could compare and process strings much faster, particularly with SIMD instructions like Intel/AMD's SSE 4.2 string comparison instructions. They're optimized for 16-byte strings and were introduced many years ago in the Nehalem architecture. That's a couple of generations before Sandy Bridge, so any server today is going to support it.<p>2. What does it mean to be "user-friendly" when it comes to these sorts of IDs? What are some scenarios where users interact with them or communicate or share them with someone or some authority? Crockford wanted his 32 character set to be easy to convey on a telephone, which seems like an expiring use case today. It seems like we should be able to use all sorts of non-ASCII characters now, without resorting to the Unicode Klingon or Tengwar blocks. Do we really need to be able to pronounce them all like Crockford anticipated?<p>NOTE: Unicode characters beyond the Basic Latin block take two or more bytes each, so we wouldn't be able to use them encoded as Unicode. What I'm advocating is a 256 character set with each character encoded in one byte, strictly for the purposes of generating these sorts of unique IDs represented by compact 16-character strings. Call it Duarte's Base256. All these other BaseN systems seem orthogonal to character encodings, or they just assume ASCII. I guess my idea would require both a character set and an encoding scheme. The latter would be similar to ISO/IEC 8859-15 and Windows 1252, but more complete with 256 printable characters. A lot of them could probably be emoji.<p>How good or terrible is this idea?
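One wrinkle with the NOTE above, sketched under the assumption that the ID ever has to travel as UTF-8: a 16-"character" string built from non-ASCII code points stops being 16 bytes, while 16 raw bytes (what SSE 4.2-style comparisons actually operate on) stay fixed-size:

```typescript
const raw = Buffer.from("0123456789abcdef0123456789abcdef", "hex"); // 16 raw bytes
const emojiId = "😀".repeat(16);                                     // 16 "characters"

console.log(raw.length);                         // 16
console.log(Buffer.byteLength(emojiId, "utf8")); // 64 - each emoji is 4 UTF-8 bytes
console.log(Buffer.compare(raw, raw));           // 0 - simple byte-wise comparison
```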
How does this compare to Twitter's Snowflake? <a href="https://blog.twitter.com/engineering/en_us/a/2010/announcing-snowflake.html" rel="nofollow">https://blog.twitter.com/engineering/en_us/a/2010/announcing...</a>
What does 128-bit compatibility with UUID mean? I would expect it to mean interoperability at the binary level (e.g. allowing you to store a ULID in a database column of UUID type), but a UUID has type information (version and variant bits) encoded in it - how can ULID address this requirement?
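My reading is that "compatible" just means it's the same 128 bits; a sketch of round-tripping the text form into UUID hex layout, with the caveat that nothing sets the UUID version/variant bits (which is exactly the type-information question above):

```typescript
const ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"; // Crockford base32

// Decode the 26-character ULID back to its 128-bit value and print it in the
// usual 8-4-4-4-12 UUID hex layout (sketch; no version/variant bits are set).
function ulidToUuidHex(ulid: string): string {
  let value = 0n;
  for (const c of ulid.toUpperCase()) {
    const digit = ALPHABET.indexOf(c);
    if (digit === -1) throw new Error(`invalid ULID character: ${c}`);
    value = value * 32n + BigInt(digit);
  }
  const hex = value.toString(16).padStart(32, "0");
  return [hex.slice(0, 8), hex.slice(8, 12), hex.slice(12, 16),
          hex.slice(16, 20), hex.slice(20)].join("-");
}

console.log(ulidToUuidHex("01ARZ3NDEKTSV4RRFFQ69G5FAV"));
```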
So you need good clocks (and timesync) and good entropy. Especially for the lexicographic sorting part, you'll need really good clocks. That's fine, if you can get them. But it's not enough, since you get no origin ID, 1ms is a very long time, and you can't sort events occurring in the same ms.
It's not always a good idea to expose timestamps to third parties. You would then need to obfuscate the url with an API-id, and in that case all the properties of readability and url-compatibility are mostly redundant as the ULID only circulates internally in your cluster.
This seems like it could be useful, but<p>> Cryptographically secure source of randomness, if possible<p>I don't think this should be a goal. If this is for IDs, you usually want to optimize for speed of generation and evenness of distribution (beyond merely reducing collisions). The top answer at the link below has a good list of hash algorithms ranked for uniqueness and speed; none of them are cryptographic.<p><a href="https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed" rel="nofollow">https://softwareengineering.stackexchange.com/questions/4955...</a>
We have been doing something similar for a long time and it works out great. Glad to see more industry adoption around this!<p>We also wrote a decentralized clock sync algorithm that can be used where NTP fails, check out <a href="https://github.com/amark/gun/blob/master/nts.js" rel="nofollow">https://github.com/amark/gun/blob/master/nts.js</a> !<p>I find it a little odd they didn't use a separator symbol, so that the timestamp wouldn't have to overflow after a certain year. Also, then you could have microsecond precision or beyond where it is supported.<p>Overall, good progress getting people on board with this! Solves a lot of problems before they even start.
> Each component is encoded with the Most Significant Byte first (network byte order).<p>This seems a surprising choice. Even the PowerPC now supports little endian. I would guess that 95%+ of all software is running in a little endian system and that any software that would use ULID is going to run in a little endian system. Other than for historical compatibility, I don't think there is any reason to use big endian today, and definitely not for greenfield protocols.