Keeping Instagram up with over a million new users in twelve hours

279 points by mikeyk about 13 years ago

21 comments

lenn0x about 13 years ago

What kind of instances are you guys running for Redis/memcached? I am a bit surprised by the numbers here, but to be fair I don't do much in the virtualization world. With low CPU overhead, it sounds like you might be saturating the number of interrupts on the network card if it's not a bandwidth issue. Memcached can usually push 100-300k/s on an 8-core Westmere (could go higher if you removed the big lock). Redis, on the other hand, with processes pinned to each physical core can do about 500,000/s. We (Twitter) saw saturation at around 100,000 on CPU0; what tipped us off was ksoftirqd spinning at 100%. If you have a modern server and network card, just pin the IRQ for every TX/RX queue to an individual physical core.
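
As a rough illustration of the IRQ pinning suggested above: on Linux, the affinity of an interrupt is set by writing a hexadecimal CPU bitmask to `/proc/irq/<irq>/smp_affinity`. The sketch below only computes those masks and prints the commands; it does not touch the system. The IRQ numbers are hypothetical (real ones come from `/proc/interrupts`), and the core list is assumed to hold one logical CPU id per physical core on a machine with fewer than 32 CPUs.

```python
def affinity_mask(core: int) -> str:
    """Hex bitmask selecting a single CPU, as smp_affinity expects."""
    if core < 0 or core >= 32:
        # Larger systems use comma-separated 32-bit groups; not handled here.
        raise ValueError("this sketch only handles cores 0-31")
    return format(1 << core, "x")


def plan_irq_pinning(queue_irqs: list[int], cores: list[int]) -> dict[int, str]:
    """Map each queue IRQ to one core, wrapping around if queues outnumber cores."""
    if not cores:
        raise ValueError("need at least one core")
    return {
        irq: affinity_mask(cores[i % len(cores)])
        for i, irq in enumerate(queue_irqs)
    }


if __name__ == "__main__":
    # Hypothetical IRQ numbers for four TX/RX queues.
    for irq, mask in plan_irq_pinning([64, 65, 66, 67], [0, 1, 2, 3]).items():
        print(f"echo {mask} > /proc/irq/{irq}/smp_affinity")
```

A daemon such as irqbalance may rewrite these settings, so pinning done this way usually assumes it is disabled or configured to leave the NIC's IRQs alone.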

sciurus about 13 years ago

A slight tangent, since I saw that Instagram is using both Graphite and Munin: collectd just added a plugin to send metrics to Graphite. You might want to try it for tracking your machine stats over time.

http://collectd.org/wiki/index.php/Plugin:Write_Graphite

http://collectd.org/

statictype about 13 years ago

Isn't there a risk with EBS snapshots that the snapshot of a live instance could have been taken while your DB engine was in the middle of a transaction, leaving the data in the newly spun-up instance in an inconsistent state? Is it that EBS snapshots are engineered to prevent this? Or just that it's not likely to happen in practice?

0xbadcafebee about 13 years ago

Why use Graphite instead of Ganglia? Ganglia uses RRDs. It's been around forever, it's fairly low on resource use, it's fast, and you can generate custom graphs like with Graphite. I actually ended up doing some graphs with Google Charts and Ganglia last time I messed with it. (Also, nobody has really simple tools to tell you which of your 3,000 cluster nodes has red flags in real time and spit them into a fire-fighting IRC channel, so we had to write those ourselves in Python.)

> Takeaway: if read capacity is likely to be a concern, bringing up read-slaves ahead of time and getting them in rotation is ideal

Sorry, but this is not 'ideal', this is Capacity Planning 101. If you're launching a new product which you expect to be very popular, take your peak traffic, double or quadruple it, and build out infrastructure to handle it ahead of time. I thought this was the whole point of the "cloud"? Add a metric shit-ton of resources for a planned peak and dial it down after.
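
The "double or quadruple your peak" rule above is simple arithmetic, sketched here for read replicas. Every number and name is an assumption for illustration (the thread gives no actual traffic figures): `peak_reads_per_sec` is the expected peak, `per_replica_reads_per_sec` is what one replica can sustain, and `headroom` is the 2x to 4x multiplier.

```python
import math


def replicas_needed(
    peak_reads_per_sec: float,
    per_replica_reads_per_sec: float,
    headroom: float = 2.0,
) -> int:
    """Replicas required to serve the expected peak multiplied by a safety factor."""
    if per_replica_reads_per_sec <= 0:
        raise ValueError("per-replica capacity must be positive")
    if headroom < 1.0:
        raise ValueError("headroom below 1.0 plans for less than the peak")
    target = peak_reads_per_sec * headroom
    # Round up: a fractional replica still means one more machine.
    return math.ceil(target / per_replica_reads_per_sec)


if __name__ == "__main__":
    # Made-up figures: 10,000 reads/s at peak, 3,000 reads/s per replica.
    for factor in (2.0, 4.0):
        print(factor, replicas_needed(10_000, 3_000, headroom=factor))
```

The result is only as good as the per-replica figure, which in practice has to come from load testing rather than a guess.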

terhechte about 13 years ago

Congratulations. Really impressive how solidly you handled the Android onslaught.

gflarity about 13 years ago

We use statsd, Graphite, Redis and Node as well. You might be interested in some of my projects relating to these:

https://github.com/gflarity/nervous

https://github.com/gflarity/response

https://github.com/gflarity/qdis

olegi about 13 years ago

Hello! A question about the quality of Instagram photos on Android.

I have a JPG from an SGS2: http://kia4sale.narod.ru/insta/01.jpg

This is the instaphoto (Earlybird) from the Android version: http://kia4sale.narod.ru/insta/02.jpg

This is the instaphoto from the same SGS2 JPG, but on iPhone 4: http://distilleryimage9.instagram.com/662ade7483ce11e19e4a12313813ffc0_7.jpg

Question: why is the instaphoto from the Android version blurry?

Thanks.

jcastro about 13 years ago

What OS are you deploying on EC2?

zupreme about 13 years ago

Thanks for open-sourcing node2dm. I think I'll take that for a spin this weekend.

nboutelier about 13 years ago

I'm curious to know what kind of EC2 instance they are running the master PostgreSQL on and whether they've had any write bottlenecks. I'm using Postgres for an app, and am worried about running into write issues.

EAMiller about 13 years ago

What sort of hosting do you use for your main Pg (and Redis) instances?

andrewdunstan about 13 years ago

pgFouine is nice, but it needs a major do-over. It would be good to rewrite it with a PL/pgSQL backend running against CSV log files loaded into the database, so that it could handle huge logs, which it can't now.

ganilb about 13 years ago

I am curious to find out why there was a need to develop your own C2DM server - what was lacking in Google's C2DM server? I am a C2DM newbie so pardon my ignorance.

rkurian about 13 years ago

It looks like you guys use Redis for a lot of different functionality. It would be great to see an article on how you guys use Redis.

8ig8 about 13 years ago

> We use the counters to track everything from number of signups per second.

Per second... It must be quite a moment when you reach this point.
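
For readers unfamiliar with the counters being quoted: statsd accepts plain-text UDP datagrams of the form `name:value|c`, and aggregates them into per-interval rates before forwarding to Graphite. The sketch below shows what incrementing a counter might look like from Python. The metric name `signups` and the host are placeholders, not Instagram's actual configuration; port 8125 is statsd's default.

```python
import socket


def counter_packet(name: str, value: int = 1, sample_rate: float = 1.0) -> bytes:
    """Build a statsd counter datagram, e.g. b'signups:1|c'."""
    packet = f"{name}:{value}|c"
    if sample_rate < 1.0:
        # Tells statsd the client only sends this fraction of events.
        packet += f"|@{sample_rate}"
    return packet.encode("ascii")


def send_counter(name: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget increment over UDP; no reply is expected."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(counter_packet(name), (host, port))


if __name__ == "__main__":
    # Hypothetical metric name; assumes a statsd daemon on localhost.
    send_counter("signups")
```

Because the transport is UDP, a missing or overloaded statsd daemon costs the application nothing beyond a dropped packet, which is a large part of why it is cheap to instrument everything.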

jurre about 13 years ago

Very interesting read, but doesn't New Relic do all these things for you? Maybe it's not possible to use with their setup?

bond about 13 years ago

Does anyone have some info on the architecture required to maintain a service like this? Servers, DB, etc.?

kunalmodi about 13 years ago

Are you guys sharding Redis, or does it all fit on a single machine?

nodesocket about 13 years ago

Great stuff, love node2dm, and didn't know about statsd + graphite.

Sujan about 13 years ago

Thought about adding a tool like newrelic.com to your toolset?

drivebyacct2 about 13 years ago

What percentage of processing power is spent on making me look like a hipster?
