Cool technology, good explanation. Legitimate questions below.<p>What you're describing (uploading photos + storing metadata) sounds like something Facebook has given tech talks about at multiple venues. Their solution was to use a distributed FS for the images (such as HDFS, though FB uses their internal "Haystack") and HBase for the metadata. To be honest, your solution, while it works now, looks like a weak home-grown HBase that leverages PL/PGSQL for unique IDs. Why not go the snowflake+HBase route? While it may add ops complexity, it's a fairly battle-proven stack, and JVM ops is pretty well documented.<p>Or, if you insist on using an RDBMS for metadata, why not just throw money (and not that much) at the problem and buy an SSD for your DB? Increase your IOPS from 100 or 200 up to 30,000 or 40,000 with a cheap drive, and call it a day. Surely that would be less expensive than the engineering effort that went into (and will continue to go into) this project. It also has the added benefit of having no impact on ops complexity, and should scale to quite a staggering number of QPS.<p>Thanks!
Back in 2003 I was a developer on a team at Microsoft that was responsible for building a new storage backend for all the MSN Messenger and Hotmail contact lists. This store had to hold data for close to 300 million user accounts, which we sharded out to a few hundred SQL Server databases. Our sharding system consisted of a database with a single table holding a row for each user, mapping their 128-bit GUID user ID to the ID of their assigned shard. New-user creation involved generating a new GUID and inserting it into this table. Each read operation involved a select from this table.<p>The database ran on a machine with enough RAM to let SQL Server do its thing and cache almost the whole table in memory. By the time I left the team in 2005, it was executing over 25,000 requests per second, with an average latency of under 3ms. Pretty sure that on modern hardware it would handle much, much more.
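That lookup-table design can be sketched roughly like this. This is a toy sketch only: SQLite stands in for the dedicated SQL Server box, and the names (`user_shard`, `NUM_SHARDS`, the hash-based placement policy) are made up for illustration, not the real system's schema:

```python
import sqlite3
import uuid

# Stand-in for the dedicated lookup database: a single table mapping
# each user's 128-bit GUID to the ID of their assigned shard.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE user_shard (user_id TEXT PRIMARY KEY, shard_id INTEGER)")

NUM_SHARDS = 4  # assumed; the real deployment had a few hundred

def create_user():
    """New-user creation: generate a GUID and insert its shard assignment."""
    user_id = str(uuid.uuid4())
    shard_id = hash(user_id) % NUM_SHARDS  # any placement policy works here
    db.execute("INSERT INTO user_shard VALUES (?, ?)", (user_id, shard_id))
    return user_id

def shard_for(user_id):
    """Every read starts with this single-row lookup."""
    row = db.execute("SELECT shard_id FROM user_shard WHERE user_id = ?",
                     (user_id,)).fetchone()
    return row[0] if row else None

uid = create_user()
print(shard_for(uid) in range(NUM_SHARDS))  # True
```

The whole table fitting in RAM is what makes the single-lookup-per-read design cheap; the lookup DB itself is the one component that is not sharded.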
I work at Flickr, and I see they mentioned Flickr's ticket server idea, (ab)using MySQL's autoincrement and the "REPLACE INTO" trick, and noted that a con was the write bottleneck.<p>We're generating more GUIDs than ever with this system and those boxes are more or less idle on every metric. They're right that we don't meet their time-ordered requirement, but I just wanted to say that writing (or reading) is not a bottleneck.
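For readers unfamiliar with the ticket-server trick, here is a rough sketch of the pattern, using SQLite's AUTOINCREMENT in place of MySQL's AUTO_INCREMENT + LAST_INSERT_ID() (table and column names are illustrative, not Flickr's actual schema):

```python
import sqlite3

# A one-row "ticket" table: REPLACE INTO deletes the existing row (on the
# UNIQUE stub conflict) and inserts a fresh one, so the table always holds
# exactly one row while the autoincrement id keeps climbing.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE tickets64 (
    id   INTEGER PRIMARY KEY AUTOINCREMENT,
    stub CHAR(1) NOT NULL UNIQUE
)""")

def next_ticket():
    cur = db.execute("REPLACE INTO tickets64 (stub) VALUES ('a')")
    return cur.lastrowid  # MySQL equivalent: SELECT LAST_INSERT_ID()

a, b = next_ticket(), next_ticket()
print(a, b)  # 1 2
```

Each call is one tiny write against a one-row table, which is why the boxes sit mostly idle even at high GUID rates; the IDs are strictly increasing per ticket server but not globally time-ordered across servers.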
I think it's a mistake to tie the shard id into the object id. The shard id should be derived from the object id dynamically, based on the placement of the object among the shards. If the shards grow or shrink, or an object is migrated, a different shard id is generated, but the object id doesn't have to change.<p>Edit: I like how constructive criticism got downvoted. Thanks for discouraging technical discussion.
Interesting article. It seems like a pretty good solution to the problem, especially when judged on ease of maintenance.<p>One thing that bothered me was, "Let’s walk through an example: let’s say it’s September 9th, 2011, at 5:00pm and our ‘epoch’ begins on January 1st, 2011. There have been 1387263000 milliseconds since the beginning of our epoch..."<p>The number of seconds between 5pm one day and midnight another obviously doesn't end in 3, so I looked into it. It looks like that number was taken from the epoch used later in the SQL, which starts on Aug 24th.
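A quick check of that arithmetic bears this out:

```python
from datetime import datetime, timedelta

# Subtract the article's 1387263000 ms from Sep 9, 2011 5:00pm
# and see which "epoch" you land on.
ts = datetime(2011, 9, 9, 17, 0, 0)
epoch = ts - timedelta(milliseconds=1387263000)
print(epoch)  # 2011-08-24 15:38:57, i.e. an Aug 24th epoch, not Jan 1st
```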
I've used the "snowflake" like approach in the past with great success. It's really not all that complicated. The reliance on time in the Instagram approach is a bit scary. A few ms off here and there could really hurt this scheme. How do you handle seamlessly transitioning these across machines when your shards move?
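A minimal sketch of a snowflake-style generator of the kind described above, assuming Twitter's 41/10/12 bit split (timestamp / worker id / per-millisecond sequence); the class, epoch constant, and spin-wait handling here are illustrative, not snowflake's actual code:

```python
import threading
import time

EPOCH_MS = 1288834974657  # snowflake's custom epoch; any fixed value works

class Snowflake:
    def __init__(self, worker_id):
        assert 0 <= worker_id < 1024  # 10 bits of worker id
        self.worker_id = worker_id
        self.seq = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = int(time.time() * 1000) - EPOCH_MS
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF  # 12-bit sequence
                if self.seq == 0:  # sequence exhausted this ms: spin to next ms
                    while now <= self.last_ms:
                        now = int(time.time() * 1000) - EPOCH_MS
            else:
                self.seq = 0
            self.last_ms = now
            return (now << 22) | (self.worker_id << 12) | self.seq

gen = Snowflake(worker_id=7)
ids = [gen.next_id() for _ in range(1000)]
print(len(set(ids)) == len(ids) and ids == sorted(ids))  # True
```

Note this sketch, like the commenter's concern, assumes the clock never moves backwards; production snowflake refuses to hand out IDs until the clock catches up.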
I'm sure I missed something, but this doesn't guarantee unique keys across the database, does it?<p>What if you have two tables with the same autoincrement value being updated at the same millisecond by two users with the same UserID%NumShards?<p>Or is there some relationship between physical and logical shards that makes this impossible?
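For context: the post's scheme packs the shard id into the ID itself (41 bits of time, 13 bits of logical shard, 10 bits of per-shard sequence), so identical sequence values in the same millisecond on different logical shards still yield distinct IDs. A toy sketch of the bit packing (`make_id` is a made-up name, not the article's PL/PGSQL function):

```python
def make_id(ms_since_epoch, shard_id, seq):
    # 41 bits of time | 13 bits of logical shard | 10 bits of sequence
    return (ms_since_epoch << 23) | (shard_id << 10) | (seq % 1024)

# Two tables on two different logical shards, same millisecond,
# same auto-increment value: the shard bits keep the IDs distinct.
a = make_id(1387263000, shard_id=5, seq=42)
b = make_id(1387263000, shard_id=6, seq=42)
print(a != b)  # True
```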
Did you evaluate using something like Redis for this? It's got an atomic increment command that guarantees unique IDs, and the performance is stupid fast.
I like it. Question: did you consider using composite keys of shard_id and id to make up a single primary key? If so, what pros/cons did you find with that approach?<p>btw, I'm also a Brazilian living in San Francisco (<a href="https://twitter.com/#!/artilheiro" rel="nofollow">https://twitter.com/#!/artilheiro</a>)
Hi, very helpful post. Thank you.<p>Something still puzzles me, though. You wrote that a sequence is unique to each table in each schema. However, your example seems to use the same sequence name for all tables within a schema.<p>Also, shouldn't your mod operation have only one %?<p>Thanks again!
One nit about snowflake in your article (I'm the author of snowflake): the ZooKeeper integration is optional and only used for sanity-checking the configuration (that the worker ids are distinct, etc).
We use a solution similar to snowflake at Formspring:<p><a href="https://github.com/formspring/flake" rel="nofollow">https://github.com/formspring/flake</a><p>In the 18 months this system has been running, we have never experienced any issues.
Neat.<p>You could avoid having to create a per-schema id function by passing the shard id and the sequence oid as params to the function, so you'd have
<pre><code>  default public.next_id(5, 'insta5.table_id_seq'::regclass)
</code></pre>
Hi Mike, excellent article! One question about sharding: do you find that it reduces high availability overall, since your uptime depends on the vagaries of additional database servers?
What I'd be more curious to hear about is how they deal with super nodes, and whether they're storing a map (or any kind of routing table) or simply using generic modulo hashing on a key.
Holy god. The post mentions "We evaluated a few different NoSQL solutions, but (...)".<p>I couldn't even imagine having to _think_ about this sort of thing; CouchDB makes this an absolute no-brainer. I mean, the act of creating a document assigns it a UUID in the database _by_ _default_.<p>Or, do you want to fetch a UUID before assigning it to any data? localhost:5984/_uuids
Want to fetch 10? localhost:5984/_uuids?count=10
Want to fetch a million? localhost:5984/_uuids?count=1000000<p>Instagram seems like the absolutely perfect candidate for CouchDB -- unstructured data, speed, HTTP querying, big data, attachments...
I just love how startups keep reinventing the wheel to solve problems that haven't been problems for decades.<p>Come on, 25 pics + 90 likes per second... that's almost like... [wait for it]... nothing.<p>I'm pretty sure you got your numbers wrong.
How to do sharding:<p>1. Define your keyspace in bits.<p>2. Shard your keyspace in CIDR notation<p>3. Resharding always splits a keyspace in half<p><pre><code> eg. 0.0.0.0/0 becomes 0.0.0.0/1 and 128.0.0.0/1
</code></pre>
4. Store your shard keyspace in DNS with SRV/TXT records<p>5. Assign IDs randomly<p>This gives a couple of interesting properties: for replication, odds are very high that your corresponding server has a shard of similar size, and if not, it's generally only split over two servers. Servers also only need to speak directly to their one or two replicas in the other data center. The other really nice property, if you're using spindle disks, is that you can copy the entire HD and just delete the half of the keyspace not used after the split.<p>It's all written up here, with a full description of all the pieces, if you want to dig in:
<a href="http://bit.ly/FredPatent" rel="nofollow">http://bit.ly/FredPatent</a>
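The split-in-half and routing mechanics from the numbered steps above can be sketched like so, assuming (for brevity only) a 32-bit keyspace and `(prefix, prefix_len)` shard descriptors; the function names are made up for illustration:

```python
# A shard is (prefix, prefix_len), i.e. CIDR notation over the keyspace.
def split(prefix, plen):
    """Step 3: resharding always splits a keyspace in half,
    e.g. 0.0.0.0/0 -> 0.0.0.0/1 and 128.0.0.0/1."""
    half_bit = 1 << (31 - plen)
    return (prefix, plen + 1), (prefix | half_bit, plen + 1)

def owns(shard, key):
    """Route a randomly assigned 32-bit ID (step 5) to its shard:
    does this shard's prefix cover the key?"""
    prefix, plen = shard
    mask = (0xFFFFFFFF << (32 - plen)) & 0xFFFFFFFF if plen else 0
    return (key & mask) == prefix

lo, hi = split(0, 0)         # the 0.0.0.0/0 -> /1 split from step 3
print(lo, hi)                # (0, 1) (2147483648, 1)
print(owns(hi, 0xDEADBEEF))  # True: the key's top bit is set
```

Because IDs are assigned uniformly at random (step 5), halving a prefix splits its load roughly in half, which is what gives the similar-sized-replica property described above.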