This seems like a useless optimization that has a high probability of biting you in the ass later on.<p>Using the standard UUID generation facilities in your OS of choice, there's zero chance you get something wrong and screw yourself.<p>UUIDs are great because we can pretty much guarantee global uniqueness. Acquire a company, decide to integrate with someone, need to merge a database, etc? No problem, zero chance of record collisions no matter <i>what</i> happens in the future. (It also means zero chance of accidentally interpreting record #58274 as type A when you meant type C.)<p>Furthermore, a 1 in 1 million chance of collision is far too frequent for my liking, but even if it were acceptable, what happens when your service/product becomes far more popular than you imagined and you blow through your initial estimates?
If you want sequential IDs with no chance of collisions, read up on "flake" IDs<p>* <a href="https://blog.twitter.com/2010/announcing-snowflake" rel="nofollow">https://blog.twitter.com/2010/announcing-snowflake</a><p>* <a href="http://engineering.custommade.com/simpleflake-distributed-id-generation-for-the-lazy/" rel="nofollow">http://engineering.custommade.com/simpleflake-distributed-id...</a><p>* <a href="http://boundary.com/blog/2012/01/12/flake-a-decentralized-k-ordered-unique-id-generator-in-erlang/" rel="nofollow">http://boundary.com/blog/2012/01/12/flake-a-decentralized-k-...</a><p>* <a href="https://github.com/mumrah/flake-java" rel="nofollow">https://github.com/mumrah/flake-java</a>
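The core idea behind these flake-style schemes can be sketched in a few lines. This is an illustrative layout, not Twitter's exact one (the epoch constant and field widths are assumptions):

```javascript
// Sketch of a Snowflake-style 64-bit ID: 41 bits of milliseconds since a
// custom epoch, 10 bits of worker ID, 12 bits of per-millisecond sequence.
// BigInt keeps the full 64 bits without floating-point loss.
const EPOCH = 1288834974657n; // Twitter's epoch; any fixed past instant works

function makeFlakeId(workerId, sequence, nowMs = Date.now()) {
  const timestamp = BigInt(nowMs) - EPOCH;
  return (timestamp << 22n) | (BigInt(workerId) << 12n) | BigInt(sequence);
}

// IDs generated later sort after earlier ones (k-ordered):
const a = makeFlakeId(1, 0, 1700000000000);
const b = makeFlakeId(1, 0, 1700000000001);
console.log(a < b); // true
```

Because the timestamp occupies the high bits, the IDs are roughly time-sortable, which is exactly the locality property the article is after.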
This is interesting, but it plucks from nowhere the equation for the chances of collision. Here's my write-up of where that comes from:<p><a href="http://www.solipsys.co.uk/new/TheBirthdayParadox.html?HN_20141031" rel="nofollow">http://www.solipsys.co.uk/new/TheBirthdayParadox.html?HN_201...</a><p>It's intended to be gentle, but a few people have said it's a bit quick in places. I'd appreciate any feedback.<p><i>Added in edit: I've submitted it as a separate item - it's been a few months since it was discussed here.</i><p><a href="https://news.ycombinator.com/item?id=8540220" rel="nofollow">https://news.ycombinator.com/item?id=8540220</a>
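For anyone who wants to play with the numbers, the standard birthday-bound approximation is easy to compute directly (this is the usual 1 − e^(−n(n−1)/2d) estimate, not anything specific to the article's derivation):

```javascript
// Birthday-bound approximation: with n IDs drawn uniformly at random from a
// space of size d, the collision probability is roughly
//   p ≈ 1 - exp(-n * (n - 1) / (2 * d))
function collisionProbability(n, d) {
  return 1 - Math.exp(-n * (n - 1) / (2 * d));
}

// Classic birthday paradox: 23 people, 365 possible birthdays.
console.log(collisionProbability(23, 365).toFixed(3)); // 0.500
```

Inverting the same formula tells you how many bits of ID you need for a target collision probability at a given record count.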
You can generalize this idea: if you are willing to exercise control over some parameters that go into your UID, you need fewer random bits.<p>For example, you could encode a number that identifies the host (like the last byte or two of the public IP address) and the process ID of the generating process, and as a result you need less entropy to avoid collisions.<p>But you risk that somebody who doesn't know the UID algorithm screws things up. For example, if you use the last byte of the IP address and some network administrator decides to give each host its own IPv6 net, the last byte of the IP might very well be 1 on every host. (OK, that's a bit of a contrived example; maybe PID namespaces are a better one?)<p>Or things outside of your control: your company gets acquired by a much bigger one, and for some reason they decide to use your system for the whole company, or for a huge customer. Now you're facing a factor of 1,000 more records than you ever thought possible. Or a factor of 10,000. History is full of software systems that have been used way beyond what they were originally planned for, and of course nobody revisited all the relevant design decisions.<p>Second point to consider: by making parts of your UIDs deterministic, you also leak information, like when a record was created and on what host. That might be relevant for timing attacks, or other kinds of security nastiness you aren't even thinking about right now.
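A minimal sketch of the "controlled parameters" idea (the field layout here is made up for illustration, not a standard): the host byte and PID are the deterministic parts, so only a small random suffix is needed, and as the comment warns, those deterministic fields leak exactly when and where the ID was minted.

```javascript
// Assumed layout: host byte, PID, millisecond timestamp, 4 chars of
// randomness, all base-36, dash-separated. The deterministic fields
// (host, pid, time) are the bits you "control" -- and the bits you leak.
function makeStructuredId(hostByte, pid) {
  const ts = Date.now().toString(36);
  const rand = Math.floor(Math.random() * 36 ** 4).toString(36).padStart(4, '0');
  return [hostByte.toString(36), (pid & 0xffff).toString(36), ts, rand].join('-');
}

console.log(makeStructuredId(42, process.pid));
```

Two processes can only collide if they share the same host byte, PID, millisecond, and random suffix, which is why so little entropy suffices while the assumptions hold.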
UUIDs have the advantage that they are well-understood and widely supported. If you really need to shave a few bytes here and there, developing your own coding scheme is useful. But for the most part, I don't see the win.<p>Locality is definitely important, but I must be missing something -- if lookup by date, machine ID etc is required, why not create indices on those fields? Why rely on coincidental locality?
I had a similar issue with an app I'm writing now. I wanted short IDs so my URLs wouldn't be fugly, but with a low chance of collisions. The solution I went with (in javascript) is:<p><pre><code> // Make a "pretty unique" ID for this session.
// Since RethinkDB doesn't have a way for us to guarantee a _short_
// random unique value (short of trying the insert and regenerating if it
// doesn't save), we'll just have to rely on the unlikeliness of a collision
// with both this time-based ID and the title-based slug.
// I'm sure this will never ever cause any problems ʘ‿ʘ
var alphabet = "0123456789abcdefghijklmnopqrstuvwxyz";
var id = new Date().getTime().toString()  // millisecond timestamp as digits
  .match(/.{1,2}/g)                       // split into groups of 2 digits
  .map(function(val){ return alphabet[val % alphabet.length]; })
  .join('');
var slugPart = slug((this.title || "").substring(0,60).toLowerCase());
this.url_slug = id + "/" + slugPart;
</code></pre>
That is, get a current timestamp (in milliseconds), and use every group of 2 digits to pull a letter out of an alphabet string. Then append "/title-of-the-thing-made-url-safe". This results in strings that look like "ee7zrm9/something-goes-here", which is then used as the primary key for the document. It's not perfect by any means, but it gets the job done, and I think appending the title makes collisions extremely rare.
As a warning, don't try to be too clever with your ID system. You can get collision bugs that aren't usually visible in testing.<p>I had a catastrophic bug (i.e., private data going to the wrong person) from 96-bit (32-bit segment number, 64-bit random local docid) ID collisions, when the caching code decided to use the docid as the cache key without realizing it was missing a bunch of bits.
The article mentions "friendly" URLs as being a driving factor. That makes it a presentation issue; ie., it's part of the content, and it is wise to consider if you can derive it from the content.<p>For a blog post, for example, there is a title. The classic way of adding a readable date to the URL is useful, if you're reading the URL in the first place. This particular blog post uses that approach: <a href="https://eager.io/blog/how-long-does-an-id-need-to-be/" rel="nofollow">https://eager.io/blog/how-long-does-an-id-need-to-be/</a>.<p>For other objects there might still be useful data. Instead of /invitation/3jdix8jAJm you might have /invitation/myblog/bob@example.com/u7pW, the last part being an auto-generated random component. The benefit is that the ID becomes self-explanatory (self-describing) and very nice for tracing through logs and the like. Of course, one has to be careful about not exposing anything exploitable.
There is a bit of possible misunderstanding/misinformation here: there is a difference between a primary key and a rowid. The reason I point out this distinction is that rows are stored on disk by rowid, meaning that an insert will still usually go to the end. On the flip side, yes, the index will have this problem, but the index shouldn't be very large relative to the table, meaning it shouldn't be as expensive as the OP is thinking. Note: often the database will optimize and use the auto-increment primary key as the rowid, but it won't for a UUID primary key.
I didn't like long UUIDs either, so I wrote this small Python library to re-encode them using a more varied character set:<p><a href="https://github.com/stochastic-technologies/shortuuid" rel="nofollow">https://github.com/stochastic-technologies/shortuuid</a>
><i>When your IDs are random however, each new record has to be placed at a random position. This means a lot more thrashing of data to and from disk.</i> //<p>Just use the GUID externally and have a sequential primary key as the table index?
This may be a bit naive, but are UUIDv4s completely random from 'head to tail'? I mean, given the birthday paradox calculation, couldn't you just take the head of the UUIDv4 (i.e., the first x characters) to arrive at the collision/space-consumption tradeoff you want?
Twitter came up with a different scheme called Snowflake (<a href="https://github.com/twitter/snowflake" rel="nofollow">https://github.com/twitter/snowflake</a>).
I appreciate this article. The mention of the Birthday problem made the calculation reasonable, and the trick to ensure time-locality as a fixed bit pattern was enlightening.
I do something very similar: 64-bit IDs start with a timestamp with a custom epoch, and fill in the rest with random data. I store these in bigints in Postgres.
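A sketch of that scheme (the exact bit split is my assumption, not the commenter's): 41 usable bits of milliseconds since a custom epoch (about 69 years of range) plus 22 random bits stays under 2^63, so it fits a signed Postgres bigint.

```javascript
// 64-bit ID: high bits are a millisecond timestamp past a custom epoch,
// low 22 bits are random. Values stay below 2^63, so they fit a signed
// bigint, and they sort by creation time.
const EPOCH = 1400000000000n; // any fixed instant before first use

function makeId(nowMs = Date.now()) {
  const ts = BigInt(nowMs) - EPOCH;                          // ~41 bits for decades
  const rand = BigInt(Math.floor(Math.random() * 2 ** 22));  // 22 random bits
  return (ts << 22n) | rand;
}

const id = makeId();
console.log(id < 2n ** 63n); // true: fits a signed 64-bit bigint
```

Shifting the ID right by 22 bits recovers the timestamp, which is handy for debugging and for time-range scans.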
Couldn't you generate a UUID and then check the database that it is unique before using it? You could then repeat until you got a unique identifier.
This seems overly complicated. Why not generate an ID out of two numbers, one a server or thread ID and another that auto-increments? Assigning a unique number to each entity that can generate IDs seems like a tractable problem, and the odds of generating a collision can be reduced to zero if you negotiate a new number when the counter wraps around.
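That scheme is easy to sketch (the wrap-around renegotiation is stubbed out here, since it depends on whatever coordination service hands out the worker IDs):

```javascript
// "Worker ID plus local counter": each generator owns a unique workerId,
// so no two generators can ever emit the same (workerId, counter) pair.
function makeCounterIdFactory(workerId) {
  let counter = 0;
  return function nextId() {
    counter += 1;
    // The scheme described above would negotiate a fresh workerId when the
    // counter wraps; this sketch just bails out.
    if (counter > Number.MAX_SAFE_INTEGER) throw new Error('counter wrapped');
    return `${workerId}-${counter}`;
  };
}

const next = makeCounterIdFactory(7);
console.log(next(), next()); // "7-1" "7-2"
```

The catch, relative to random IDs, is that it reintroduces coordination: something has to assign worker IDs and survive restarts without reusing a live one.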
He assumes that the UUIDs are independent (selected independently, uniformly at random). They are not. Trust me, it's very, very hard to get this kind of 'pure' randomness on computers. No matter what source of randomness you use, you will have some correlation, and your odds of getting distinct UUIDs drop rapidly.