See also Firebase push IDs:<p><a href="https://firebase.googleblog.com/2015/02/the-2120-ways-to-ensure-unique_68.html" rel="nofollow">https://firebase.googleblog.com/2015/02/the-2120-ways-to-ens...</a><p>Lexicographically sortable identifiers are critical for any distributed data store if you want anything close to consistency. I've run into the issue of not having them and having to settle for some kind of <autoincrement_id><UUID> key, and it's a huge PITA. How this wasn't considered database 101 decades ago just blows my mind.<p>I'd like to see a spec included in this for synchronizing clocks, or for using Raft/Paxos to generate ULIDs with strong guarantees on sort order.<p>Also a minor gripe - I wish the ULID spec handled collisions at microsecond rather than millisecond granularity, because that would be more useful for realtime networked gaming and simulations.
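As a toy illustration of the sortable property: because the encoded timestamp is the string's most significant prefix, a plain lexicographic sort is already a creation-time sort (the two IDs below are just syntactically valid examples):

```typescript
// The later timestamp prefix ("01BX...") sorts after the earlier one
// ("01AR...") under an ordinary string comparison - no parsing needed.
const ids = [
  "01BX5ZZKBKACTAV9WEVGEMMVS0", // newer
  "01ARZ3NDEKTSV4RRFFQ69G5FAV", // older
];
console.log([...ids].sort()); // ["01ARZ3NDEK...", "01BX5ZZKBK..."] - time order
```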
We have used ULIDs in production for over a year now -- and have generated millions of them.<p>First, the main benefit of ULID is that you can generate the IDs within your own software rather than relying on the database. We can queue them or even reference them before they land in the database. The traditional roundtrip has been eliminated.<p>Secondly, being able to sort ULIDs is a nice plus, although not that big of a deal. It makes it relatively easy to shard or partition databases, and it provides a convenient sort if you're not looking for extreme accuracy.<p>ULIDs are also shorter and slightly more user-friendly than UUIDs.<p>In some circumstances we found the actual implementations to be slightly lacking. For example, the JS library for ULID once returned a 25-character string rather than the standard 26 characters, causing a big ruckus that we had to resolve manually.
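A minimal sketch of the "generate before the roundtrip" point from the first paragraph, assuming the common JavaScript ulid package (the same library mentioned above):

```typescript
// Sketch only: the id exists before any database roundtrip, so it can be
// queued, logged, or referenced by other records immediately.
import { ulid } from "ulid";

const orderId = ulid();                               // e.g. "01ARZ3NDEKTSV4RRFFQ69G5FAV"
const lineItem = { orderId, sku: "ABC-123", qty: 2 }; // references the order pre-insert
```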
There are definitely systems out there whose clocks are not accurate to the millisecond. It's not healthy for systems to encourage false assumptions (e.g., that ids monotonically increase).
The failure on overflow is weird.<p>Since the random component starts at a random point and isn't allowed to wrap, even if you generate only a small number of IDs per millisecond you have a constant ~1/2^79 chance of failing. The chance is small but reachable for a large network (Bitcoin does about 2^88 hashes a day).<p>It could have just wrapped with no problem, because no single node could plausibly generate 2^80 IDs by itself within one millisecond. And if you have multiple nodes, failing instead of wrapping doesn't prevent collisions between them anyway.
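A rough sketch of the behavior being discussed (not the reference implementation): the random component is re-seeded each millisecond and incremented within one, and the spec says to error rather than wrap when the increment overflows.

```typescript
import { randomBytes } from "crypto";

const MAX_RANDOM = (1n << 80n) - 1n; // 80-bit random component

let lastTime = -1;
let lastRandom = 0n;

function nextUlidRandom(now: number): bigint {
  if (now === lastTime) {
    if (lastRandom === MAX_RANDOM) {
      // Spec behavior: fail instead of wrapping back to zero.
      throw new Error("random component overflow within one millisecond");
      // The wrap-around alternative argued for above would be: lastRandom = 0n;
    }
    lastRandom += 1n; // monotonic within the same millisecond
  } else {
    lastTime = now;
    lastRandom = BigInt("0x" + randomBytes(10).toString("hex")); // fresh 80 bits
  }
  return lastRandom;
}
```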
It’s worth noting that, since the ID within the same millisecond is simply incremented, it may be vulnerable to enumeration attacks. So it should not be used to generate one-time tokens or object IDs exposed in URLs.
> <i>Uses Crockford's base32 for better efficiency and readability</i><p>Ugh. Crockford's base32 character set doesn't actually solve any of the problems it sets out to solve. Using it suggests to me some uncritical thinking.<p>It[0] says things like L is excluded, because "[uppercase] L Can be confused with 1". Ignoring the part where that is wildly inaccurate for any font that I've <i>ever</i> seen, why not then also remove G, 6, B, 8, Z, 2, S, or 5?<p>Reducing 1/I/i/L/l to just 1 does little to resolve visual ambiguity for users, because a user could just as easily read l or I instead of 1 or O instead of 0, because users don't know your made-up rules, which causes real problems because you often don't control both sides of the channel.<p>[0] - <a href="https://www.crockford.com/wrmg/base32.html" rel="nofollow">https://www.crockford.com/wrmg/base32.html</a>
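For what it's worth, the aliasing Crockford's spec prescribes happens on the decode side - i/I/l/L are accepted and read as 1, and o/O as 0 - which is exactly the "made-up rules" problem described above. A tiny sketch of that folding (not a full base32 decoder):

```typescript
// Crockford's alphabet drops I, L, O, U; on decode the spec folds the
// lookalike characters back into digits.
const ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";

function decodeChar(c: string): number {
  const folded = c.toUpperCase().replace(/[IL]/, "1").replace("O", "0");
  const value = ALPHABET.indexOf(folded);
  if (value === -1) throw new Error(`invalid character: ${c}`);
  return value;
}

console.log(decodeChar("l"), decodeChar("I"), decodeChar("o")); // 1 1 0
```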
cuid removes more of the randomness and adds a counter and a fingerprint:
<a href="https://github.com/ericelliott/cuid" rel="nofollow">https://github.com/ericelliott/cuid</a><p>The default id in MongoDB does about the same. I always thought the MongoDB identifiers worked well for a lot of use cases.<p>Its also worth mentioning that integer incrementing ids can scale just fine if you reserve them in large blocks and they are no longer guaranteed to match insertion order, e.g: <a href="https://github.com/pingcap/docs/blob/master/sql/mysql-compatibility.md#auto-increment-id" rel="nofollow">https://github.com/pingcap/docs/blob/master/sql/mysql-compat...</a>
I like the idea, but I also feel that 1.21e+24 unique ULIDs per millisecond is kind of defeated by the millisecond accuracy of the timestamp. This means there are effectively two tolerance values for time at play in the design of this spec that conflict with each other. If we want users to be able to generate ULIDs on such a short timescale (implying it's a realistic use case), then it would seem they should also be able to get comparable accuracy on the timestamp itself.
It would sure be nice if git commit IDs could use something like this. It would be really convenient if you could look at two commit IDs and know which one is older.
My experience is that you don't want human-readable, user-visible IDs to be sortable, but rather to have as unique a prefix (and, when you have experienced workers, also a postfix) as possible. So this is certainly useful, but specifying a human-readable representation is somewhat redundant.<p>Another issue is that there are cases when you want to represent the ID as a barcode of reasonable size and readability, which invariably leads to decimal-only Code128 with at most ~30 digits.
Forgive my ignorance. I'm more of a social scientist than a programmer. Questions:<p>1. Why not go for 16-character strings (instead of 26 or 36), with each character representing 8 bits?<p>Sure, you'd need 256 possible characters, but it's almost 2019 and Unicode has been with us for decades now. Surely we could be more cosmopolitan than Americentric ASCII and curate 256 characters for an 8-bit encoding?<p>With a 16-byte string, we could compare and process strings much faster, particularly with SIMD instructions like Intel/AMD's SSE 4.2 string comparison instructions. They're optimized for 16-byte strings and were introduced many years ago in the Nehalem architecture. That's a couple of generations before Sandy Bridge, so any server today is going to support it.<p>2. What does it mean to be "user-friendly" when it comes to these sorts of IDs? What are some scenarios where users interact with them or communicate or share them with someone or some authority? Crockford wanted his 32 character set to be easy to convey on a telephone, which seems like an expiring use case today. It seems like we should be able to use all sorts of non-ASCII characters now, without resorting to the Unicode Klingon or Tengwar blocks. Do we really need to be able to pronounce them all like Crockford anticipated?<p>NOTE: Unicode characters beyond the Basic Latin block take two or more bytes each, so we wouldn't be able to use them encoded as Unicode. What I'm advocating is a 256 character set with each character encoded in one byte, strictly for the purposes of generating these sorts of unique IDs represented by compact 16-character strings. Call it Duarte's Base256. All these other BaseN systems seem orthogonal to character encodings, or they just assume ASCII. I guess my idea would require both a character set and an encoding scheme. The latter would be similar to ISO/IEC 8859-15 and Windows 1252, but more complete with 256 printable characters. A lot of them could probably be emoji.<p>How good or terrible is this idea?
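One wrinkle with the NOTE above, sketched under the assumption that the ID ever has to travel as UTF-8: a 16-"character" string built from non-ASCII code points stops being 16 bytes, while 16 raw bytes (what SSE 4.2-style comparisons actually operate on) stay fixed-size:

```typescript
const raw = Buffer.from("0123456789abcdef0123456789abcdef", "hex"); // 16 raw bytes
const emojiId = "😀".repeat(16);                                     // 16 "characters"

console.log(raw.length);                         // 16
console.log(Buffer.byteLength(emojiId, "utf8")); // 64 - each emoji is 4 UTF-8 bytes
console.log(Buffer.compare(raw, raw));           // 0 - simple byte-wise comparison
```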
How does this compare to Twitter's Snowflake? <a href="https://blog.twitter.com/engineering/en_us/a/2010/announcing-snowflake.html" rel="nofollow">https://blog.twitter.com/engineering/en_us/a/2010/announcing...</a>
What does 128-bit compatibility with UUID mean? I would expect it to mean interoperability at the binary level (e.g. allowing you to store a ULID in a database column of UUID type), but a UUID has type information (version and variant bits) encoded in it - how can ULID address this requirement?
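My reading is that "compatible" just means it's the same 128 bits; a sketch of round-tripping the text form into UUID hex layout, with the caveat that nothing sets the UUID version/variant bits (which is exactly the type-information question above):

```typescript
const ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"; // Crockford base32

// Decode the 26-character ULID back to its 128-bit value and print it in the
// usual 8-4-4-4-12 UUID hex layout (sketch; no version/variant bits are set).
function ulidToUuidHex(ulid: string): string {
  let value = 0n;
  for (const c of ulid.toUpperCase()) {
    const digit = ALPHABET.indexOf(c);
    if (digit === -1) throw new Error(`invalid ULID character: ${c}`);
    value = value * 32n + BigInt(digit);
  }
  const hex = value.toString(16).padStart(32, "0");
  return [hex.slice(0, 8), hex.slice(8, 12), hex.slice(12, 16),
          hex.slice(16, 20), hex.slice(20)].join("-");
}

console.log(ulidToUuidHex("01ARZ3NDEKTSV4RRFFQ69G5FAV"));
```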
So you need good clocks (and timesync) and good entropy. Especially for the lexicographic sorting part, you'll need really good clocks. That's fine, if you can get them. But it's not enough, since you get no origin ID, 1ms is a very long time, and you can't sort events occurring in the same ms.
It's not always a good idea to expose timestamps to third parties. You would then need to obfuscate the url with an API-id, and in that case all the properties of readability and url-compatibility are mostly redundant as the ULID only circulates internally in your cluster.
This seems like it could be useful, but<p>> Cryptographically secure source of randomness, if possible<p>I don't think this should be a goal. If this is for IDs, you usually want to optimize for speed of generation and evenness of distribution (beyond merely reducing collisions). The top answer at the link below has a good list of hash algorithms ranked for uniqueness and speed; none of them are cryptographic.<p><a href="https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed" rel="nofollow">https://softwareengineering.stackexchange.com/questions/4955...</a>
We have been doing something similar for a long time and it works out great. Glad to see more industry adoption around this!<p>We also wrote a decentralized clock sync algorithm that can be used where NTP fails, check out <a href="https://github.com/amark/gun/blob/master/nts.js" rel="nofollow">https://github.com/amark/gun/blob/master/nts.js</a> !<p>I find it a little odd they didn't use a separator symbol, so that the timestamp wouldn't have to overflow after a certain year. Also, then you could have microsecond precision or beyond where it is supported.<p>Overall, good progress getting people on board with this! Solves a lot of problems before they even start.
> Each component is encoded with the Most Significant Byte first (network byte order).<p>This seems a surprising choice. Even the PowerPC now supports little endian. I would guess that 95%+ of all software is running in a little endian system and that any software that would use ULID is going to run in a little endian system. Other than for historical compatibility, I don't think there is any reason to use big endian today, and definitely not for greenfield protocols.