Cool technology, good explanation. Legitimate questions below.<p>What you're describing (uploading photos + storing metadata) sounds like something Facebook has given tech talks about at multiple venues. Their solution was to use a distributed FS for the images (such as HDFS, though FB uses their internal "Haystack") and HBase for the metadata. To be honest, your solution, while it works now, looks like a weak home-grown HBase that leverages PL/PGSQL for unique IDs. Why not go the snowflake+HBase route? While it may add ops complexity, it's a fairly battle-proven stack, and JVM ops is pretty well documented.<p>Or, if you insist on using an RDBMS for metadata, why not just throw money (and not that much) at the problem and buy an SSD for your DB? Increase your IOPS from 100 or 200 up to 30,000 or 40,000 with a cheap drive, and call it a day. Surely that would be less expensive than the engineering effort that went into (and will continue to go into) this project. It also has the added benefit of having no impact on ops complexity, and should scale to quite a staggering number of QPS.<p>Thanks!
Back in 2003 I was a developer on a team at Microsoft that was responsible for building a new storage backend for all the MSN Messenger and Hotmail contact lists. This store had to hold data for close to 300 million user accounts, which we sharded out to a few hundred SQL Server databases. Our sharding system consisted of a database with a single table holding a row for each user, mapping their 128-bit GUID user ID to the ID of their assigned shard. New-user creation involved generating a new GUID and inserting it into this table. Each read operation involved a select from this table.<p>The database ran on a machine with enough RAM to let SQL Server do its thing and cache almost the whole table in memory. By the time I left the team in 2005, it was executing over 25,000 requests per second, with an average latency of under 3ms. Pretty sure that on modern hardware it would handle much, much more.
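That lookup-table design can be sketched roughly like this. This is a toy sketch only: SQLite stands in for the dedicated SQL Server box, and the names (`user_shard`, `NUM_SHARDS`, the hash-based placement policy) are made up for illustration, not the real system's schema:

```python
import sqlite3
import uuid

# Stand-in for the dedicated lookup database: a single table mapping
# each user's 128-bit GUID to the ID of their assigned shard.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE user_shard (user_id TEXT PRIMARY KEY, shard_id INTEGER)")

NUM_SHARDS = 4  # assumed; the real deployment had a few hundred

def create_user():
    """New-user creation: generate a GUID and insert its shard assignment."""
    user_id = str(uuid.uuid4())
    shard_id = hash(user_id) % NUM_SHARDS  # any placement policy works here
    db.execute("INSERT INTO user_shard VALUES (?, ?)", (user_id, shard_id))
    return user_id

def shard_for(user_id):
    """Every read starts with this single-row lookup."""
    row = db.execute("SELECT shard_id FROM user_shard WHERE user_id = ?",
                     (user_id,)).fetchone()
    return row[0] if row else None

uid = create_user()
print(shard_for(uid) in range(NUM_SHARDS))  # True
```

The whole table fitting in RAM is what makes the single-lookup-per-read design cheap; the lookup DB itself is the one component that is not sharded.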
I work at Flickr, and I see they mentioned Flickr's ticket server idea, (ab)using MySQL's autoincrement and the "REPLACE INTO" trick, and noted that a con was the write bottleneck.<p>We're generating more GUIDs than ever with this system and those boxes are more or less idle on every metric. They're right that we don't meet their time-ordered requirement, but I just wanted to say that writing (or reading) is not a bottleneck.
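For readers unfamiliar with the ticket-server trick, here is a rough sketch of the pattern, using SQLite's AUTOINCREMENT in place of MySQL's AUTO_INCREMENT + LAST_INSERT_ID() (table and column names are illustrative, not Flickr's actual schema):

```python
import sqlite3

# A one-row "ticket" table: REPLACE INTO deletes the existing row (on the
# UNIQUE stub conflict) and inserts a fresh one, so the table always holds
# exactly one row while the autoincrement id keeps climbing.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE tickets64 (
    id   INTEGER PRIMARY KEY AUTOINCREMENT,
    stub CHAR(1) NOT NULL UNIQUE
)""")

def next_ticket():
    cur = db.execute("REPLACE INTO tickets64 (stub) VALUES ('a')")
    return cur.lastrowid  # MySQL equivalent: SELECT LAST_INSERT_ID()

a, b = next_ticket(), next_ticket()
print(a, b)  # 1 2
```

Each call is one tiny write against a one-row table, which is why the boxes sit mostly idle even at high GUID rates; the IDs are strictly increasing per ticket server but not globally time-ordered across servers.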
I think it's a mistake to tie the shard id into the object id. The shard id should be derived from the object id dynamically, based on the placement of the object among the shards. If the shards grow or shrink, or an object is migrated, a different shard id is generated, but the object id doesn't have to change.<p>Edit: I like how constructive criticism got downvoted. Thanks for discouraging technical discussion.
Interesting article. It seems like a pretty good solution to the problem, especially when judged on ease of maintenance.<p>One thing that bothered me was, "Let’s walk through an example: let’s say it’s September 9th, 2011, at 5:00pm and our ‘epoch’ begins on January 1st, 2011. There have been 1387263000 milliseconds since the beginning of our epoch..."<p>The number of seconds between 5pm one day and midnight another obviously doesn't end in 3, so I looked into it. It looks like that number was taken from the epoch used later in the SQL, which starts on Aug 24th.
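A quick check of that arithmetic bears this out:

```python
from datetime import datetime, timedelta

# Subtract the article's 1387263000 ms from Sep 9, 2011 5:00pm
# and see which "epoch" you land on.
ts = datetime(2011, 9, 9, 17, 0, 0)
epoch = ts - timedelta(milliseconds=1387263000)
print(epoch)  # 2011-08-24 15:38:57, i.e. an Aug 24th epoch, not Jan 1st
```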
I've used the "snowflake" like approach in the past with great success. It's really not all that complicated. The reliance on time in the Instagram approach is a bit scary. A few ms off here and there could really hurt this scheme. How do you handle seamlessly transitioning these across machines when your shards move?
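A minimal sketch of a snowflake-style generator of the kind described above, assuming Twitter's 41/10/12 bit split (timestamp / worker id / per-millisecond sequence); the class, epoch constant, and spin-wait handling here are illustrative, not snowflake's actual code:

```python
import threading
import time

EPOCH_MS = 1288834974657  # snowflake's custom epoch; any fixed value works

class Snowflake:
    def __init__(self, worker_id):
        assert 0 <= worker_id < 1024  # 10 bits of worker id
        self.worker_id = worker_id
        self.seq = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = int(time.time() * 1000) - EPOCH_MS
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF  # 12-bit sequence
                if self.seq == 0:  # sequence exhausted this ms: spin to next ms
                    while now <= self.last_ms:
                        now = int(time.time() * 1000) - EPOCH_MS
            else:
                self.seq = 0
            self.last_ms = now
            return (now << 22) | (self.worker_id << 12) | self.seq

gen = Snowflake(worker_id=7)
ids = [gen.next_id() for _ in range(1000)]
print(len(set(ids)) == len(ids) and ids == sorted(ids))  # True
```

Note this sketch, like the commenter's concern, assumes the clock never moves backwards; production snowflake refuses to hand out IDs until the clock catches up.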
I'm sure I missed something, but this doesn't guarantee unique keys across the database, does it?<p>What if you have two tables with the same autoincrement value being updated at the same millisecond by two users with the same UserID%NumShards?<p>Or is there some relationship between physical and logical shards that makes this impossible?
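For context: the post's scheme packs the shard id into the ID itself (41 bits of time, 13 bits of logical shard, 10 bits of per-shard sequence), so identical sequence values in the same millisecond on different logical shards still yield distinct IDs. A toy sketch of the bit packing (`make_id` is a made-up name, not the article's PL/PGSQL function):

```python
def make_id(ms_since_epoch, shard_id, seq):
    # 41 bits of time | 13 bits of logical shard | 10 bits of sequence
    return (ms_since_epoch << 23) | (shard_id << 10) | (seq % 1024)

# Two tables on two different logical shards, same millisecond,
# same auto-increment value: the shard bits keep the IDs distinct.
a = make_id(1387263000, shard_id=5, seq=42)
b = make_id(1387263000, shard_id=6, seq=42)
print(a != b)  # True
```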
Did you evaluate using something like Redis for this? It's got an atomic increment command that guarantees unique IDs, and the performance is stupid fast.
I like it. Question: did you consider using composite keys of shard_id and id to make up a single primary key? If so, what pros/cons did you find with that approach?<p>btw, I'm also a Brazilian living in San Francisco (<a href="https://twitter.com/#!/artilheiro" rel="nofollow">https://twitter.com/#!/artilheiro</a>)
Hi, very helpful post. Thank you.<p>Something still puzzles me, though. You wrote that a sequence is unique to each table in each schema. However, your example seems to use the same sequence name for all tables within a schema.<p>Also, shouldn't your mod operation have only one %?<p>Thanks again!
One nit about snowflake in your article (I'm the author of snowflake): the ZooKeeper integration is optional and only used for sanity-checking the configuration (that the worker ids are distinct, etc).
We use a solution similar to snowflake at Formspring:<p><a href="https://github.com/formspring/flake" rel="nofollow">https://github.com/formspring/flake</a><p>In the 18 months this system has been running, we have never experienced any issues.
Neat.<p>You could avoid having to create a per-schema id function by passing the shard id and the sequence oid as params to the function, so you'd have
<pre><code>  default public.next_id(5, 'insta5.table_id_seq'::regclass)
</code></pre>
Hi Mike, excellent article! One question about sharding: do you find that it reduces high availability overall, since your uptime depends on the vagaries of additional database servers?
What I'd be more curious to hear about is how they deal with super nodes, and whether they're storing a map (or any kind of routing table) or simply using generic modulo hashing on a key.
Holy god. The post mentions "We evaluated a few different NoSQL solutions, but (...)".<p>I couldn't even imagine having to _think_ about this sort of thing; CouchDB makes this an absolute no-brainer. I mean, the act of creating a document assigns it a UUID in the database _by_ _default_.<p>Or, do you want to fetch a UUID before assigning it to any data? localhost:5984/_uuids
Want to fetch 10? localhost:5984/_uuids?count=10
Want to fetch a million? localhost:5984/_uuids?count=1000000<p>Instagram seems like the absolutely perfect candidate for CouchDB -- unstructured data, speed, HTTP querying, big data, attachments...
I just love how startups keep reinventing the wheel to solve problems that haven't been problems for decades.<p>Come on, 25 pics + 90 likes per second... that's almost like... [wait for it]... nothing.<p>I'm pretty sure you got your numbers wrong.
How to do sharding:<p>1. Define your keyspace in bits.<p>2. Shard your keyspace in CIDR notation<p>3. Resharding always splits a keyspace in half<p><pre><code> eg. 0.0.0.0/0 becomes 0.0.0.0/1 and 128.0.0.0/1
</code></pre>
4. Store your shard keyspace in DNS with SRV/TXT records<p>5. Assign IDs randomly<p>This gives a couple of interesting properties: for replication, odds are very high that your corresponding server has a shard of similar size, and if not, it's generally only split over two servers. Servers also only need to speak directly to their one or two replicas in the other data center. The other really nice property, if you're using spindle disks, is that you can copy the entire HD and just delete the half of the keyspace not used after the split.<p>It's all written up here, with a full description of all the pieces, if you want to dig in:
<a href="http://bit.ly/FredPatent" rel="nofollow">http://bit.ly/FredPatent</a>
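The split-in-half and routing mechanics from the numbered steps above can be sketched like so, assuming (for brevity only) a 32-bit keyspace and `(prefix, prefix_len)` shard descriptors; the function names are made up for illustration:

```python
# A shard is (prefix, prefix_len), i.e. CIDR notation over the keyspace.
def split(prefix, plen):
    """Step 3: resharding always splits a keyspace in half,
    e.g. 0.0.0.0/0 -> 0.0.0.0/1 and 128.0.0.0/1."""
    half_bit = 1 << (31 - plen)
    return (prefix, plen + 1), (prefix | half_bit, plen + 1)

def owns(shard, key):
    """Route a randomly assigned 32-bit ID (step 5) to its shard:
    does this shard's prefix cover the key?"""
    prefix, plen = shard
    mask = (0xFFFFFFFF << (32 - plen)) & 0xFFFFFFFF if plen else 0
    return (key & mask) == prefix

lo, hi = split(0, 0)         # the 0.0.0.0/0 -> /1 split from step 3
print(lo, hi)                # (0, 1) (2147483648, 1)
print(owns(hi, 0xDEADBEEF))  # True: the key's top bit is set
```

Because IDs are assigned uniformly at random (step 5), halving a prefix splits its load roughly in half, which is what gives the similar-sized-replica property described above.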