Today I was considering how organizations such as bitly, imgur, reddit, and even GitHub handle hash collision prevention.

While for GH it seems easier, since commit hashes are sufficiently long/unique, for others whose hashes only span ~6 characters or so there are bound to be instances where hashes collide from a statistical standpoint. (IIRC I read an article about a hash collision in GH a few years ago here on ycomb.)

To my mind these orgs must have a suite of tools/algorithms that request information from multiple services and check whether or not a hash has already been taken, and those processes have to be optimized for time. (E.g. when a user makes a post, what's a reasonable time for a lookup?)

So, what considerations need to be made algorithmically to check for such collisions while keeping runtime to an acceptable minimum?
Database indexes, most likely. Consider GitHub:

1. Hashes are determined outside of GitHub's context, by git itself.

2. GitHub stores them in a database.

3. Conflicts are so rare that they can be ignored, and they only matter within the same repository; identical hashes in different repos are not a conflict.
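To make point 3 concrete, here's a minimal sketch (not GitHub's actual schema, just a hypothetical objects table): uniqueness is enforced per repository via a composite unique index, so the collision check is a single indexed lookup at insert time.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE objects (
            repo_id INTEGER NOT NULL,
            hash    TEXT    NOT NULL,
            UNIQUE (repo_id, hash)  -- collision check is just an index lookup
        )
    """)

    def store_object(repo_id, obj_hash):
        """Insert an object hash; report whether it collided within the repo."""
        try:
            with conn:
                conn.execute("INSERT INTO objects (repo_id, hash) VALUES (?, ?)",
                             (repo_id, obj_hash))
            return True
        except sqlite3.IntegrityError:
            return False  # same hash already present in this repo

    store_object(1, "d670460b4b4aece5915caf5c68d12f560a9fe3e4")  # True
    store_object(2, "d670460b4b4aece5915caf5c68d12f560a9fe3e4")  # True: different repo
    store_object(1, "d670460b4b4aece5915caf5c68d12f560a9fe3e4")  # False: conflict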
Not sure how these orgs do it, but the system design book by Alex Xu uses timestamp + dataCenterID + machineId + a sequence number that is reset every millisecond. The resulting integer is then converted to base62.
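A rough sketch of that kind of Snowflake-style scheme (the bit widths here are my assumption, not necessarily the book's exact layout): pack the timestamp, datacenter ID, machine ID, and a per-millisecond sequence into one integer, then base62-encode it into a short string.

    import time

    ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

    def to_base62(n):
        if n == 0:
            return ALPHABET[0]
        digits = []
        while n:
            n, r = divmod(n, 62)
            digits.append(ALPHABET[r])
        return "".join(reversed(digits))

    class IdGenerator:
        def __init__(self, datacenter_id, machine_id):
            self.datacenter_id = datacenter_id  # fits in 5 bits (0..31)
            self.machine_id = machine_id        # fits in 5 bits (0..31)
            self.sequence = 0                   # 12 bits, reset every millisecond
            self.last_ms = -1

        def next_id(self):
            now_ms = int(time.time() * 1000)
            if now_ms == self.last_ms:
                # (a real implementation would wait for the next ms on overflow)
                self.sequence = (self.sequence + 1) & 0xFFF
            else:
                self.sequence = 0
                self.last_ms = now_ms
            raw = (now_ms << 22) | (self.datacenter_id << 17) | \
                  (self.machine_id << 12) | self.sequence
            return to_base62(raw)

    gen = IdGenerator(datacenter_id=1, machine_id=3)
    print(gen.next_id())  # short, URL-safe string

The nice property is that uniqueness comes from what's packed into the ID (machine + time + sequence), so no database lookup is needed at generation time.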
If you have different workers generating ids, give each of them its own range in advance. Once a worker has used up its range, it can request a new range for its next N (e.g. 100k) ids.
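A hypothetical sketch of that range-allocation idea: a central allocator hands out disjoint blocks, so a worker only has to coordinate once per N ids, and ids within a block are collision-free by construction.

    import threading

    class RangeAllocator:
        """Central service handing out disjoint [start, start + size) blocks."""
        def __init__(self, block_size=100_000):
            self.block_size = block_size
            self.next_start = 0
            self.lock = threading.Lock()

        def allocate(self):
            with self.lock:
                start = self.next_start
                self.next_start += self.block_size
            return start, start + self.block_size

    class Worker:
        def __init__(self, allocator):
            self.allocator = allocator
            self.current = iter(())  # empty iterator forces the first allocation

        def next_id(self):
            try:
                return next(self.current)
            except StopIteration:
                lo, hi = self.allocator.allocate()  # one round trip per block
                self.current = iter(range(lo, hi))
                return next(self.current)

    allocator = RangeAllocator(block_size=3)  # tiny blocks for the demo
    w1, w2 = Worker(allocator), Worker(allocator)
    print([w1.next_id() for _ in range(4)])  # [0, 1, 2, 3]
    print([w2.next_id() for _ in range(2)])  # [6, 7]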
Do you mean "how do they do unique identifier generation"?

Most use a combination of pre-coordination and a database unique constraint: when an insert fails because the identifier is already taken, they just generate a new one and try again.
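A minimal sketch of that insert-and-retry pattern, assuming a hypothetical short_urls table with the code column as its primary key; the database's uniqueness guarantee does the collision check, so no separate lookup service is needed.

    import secrets
    import sqlite3
    import string

    ALPHABET = string.ascii_letters + string.digits  # base62 alphabet

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE short_urls (code TEXT PRIMARY KEY, target TEXT NOT NULL)")

    def shorten(target, length=6, max_attempts=5):
        for _ in range(max_attempts):
            code = "".join(secrets.choice(ALPHABET) for _ in range(length))
            try:
                with conn:
                    conn.execute("INSERT INTO short_urls (code, target) VALUES (?, ?)",
                                 (code, target))
                return code  # unique index accepted it: no collision
            except sqlite3.IntegrityError:
                continue     # collision: generate a new code and retry
        raise RuntimeError("too many collisions; consider a longer code")

    print(shorten("https://example.com/some/long/path"))

With 6 base62 characters there are 62^6 (about 57 billion) possible codes, so the retry branch is rarely hit until the table gets very large, and the expected runtime stays close to a single indexed insert.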