A few things caught my attention in your post.<p>Your biggest problem was that your services were not sized/tuned for the hardware resources you have. As a result, your servers became unresponsive, and instead of fixing the problem you had to wait 30+ minutes until they recovered.<p>In your case you should have limited Solr's JVM memory to the amount of RAM your server can actually allocate to it (check your heap settings and possibly the PermGen space allocation).<p>If all services are sized properly, under no circumstances should your server become completely unresponsive; only the overloaded services would be affected. This would allow you or your system administrator to log in and fix the root cause instead of waiting 30+ minutes for the server to recover or be rebooted.<p>The basic principle is that your production servers should never swap (that's why setting the vm.swappiness=0 sysctl is so important). The moment your services start swapping, performance will suffer so much that your server won't be able to handle any requests, and they will keep piling up until a total meltdown.<p>In your case the OOM killer terminating the Java process actually saved you, by allowing you to log in to the server. I wouldn't consider setting the OOM reaction to "panic" a good approach: if a similar problem occurs and the server reboots, you will have no idea what caused the memory usage to grow in the first place.
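For a rough sense of what "sized properly" means for Solr's JVM, a capped launch might look like this (values are hypothetical; pick them so the heap plus every other service still fits in physical RAM):

```shell
# Hypothetical sizing on a 4 GB box that also runs nginx and a database:
# cap the heap well below physical RAM so the OS page cache and the
# other services keep breathing room.
java -Xms512m -Xmx1g \
     -XX:MaxPermSize=256m \
     -jar start.jar
```

With a hard -Xmx cap, a memory leak ends in a Java OutOfMemoryError inside the one process rather than the whole box swapping itself to death.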
1. Reduce keepalive; even with nginx, 60 is too much (unless it's an "expensive" SSL connection).<p>2. Set <i>vm.swappiness = 0</i> to make sure crippling hard-drive swap doesn't start until it absolutely has to.<p>3. Use <i>iptables xt_connlimit</i> to make sure people aren't abusing connections, even by accident: no client should have more than 20 connections to port 80, maybe even as low as 5 if your server is under a "friendly" DDoS. If you are reverse proxying to Apache, connlimit is a MUST.
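As a sketch, points 2 and 3 translate to something like this (the ceiling of 20 connections is the number from above; adjust to taste):

```shell
# 2. Keep the kernel from swapping until it absolutely must.
#    (Put vm.swappiness=0 in /etc/sysctl.conf to survive reboots.)
sysctl -w vm.swappiness=0

# 3. Cap concurrent connections per source IP on port 80.
iptables -A INPUT -p tcp --syn --dport 80 \
    -m connlimit --connlimit-above 20 \
    -j REJECT --reject-with tcp-reset
```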
If anyone owns a blog or site that they suspect <i>may</i> appear on HackerNews (especially if you're posting it), then please take the small amount of time to put an instance of Varnish in front of the site.<p>Then, ensure that Varnish is actually caching every element of the page, and that you are seeing the cache being hit consistently.<p>You should expect over 10,000 unique visitors within 24 hours, with most coming in the 30 minutes to 2 hours after you've hit the front page on HN.<p>You need not do your whole site... but definitely ensure that the key landing page can take the strain.<p>Unless you've put something like Varnish in front of your web servers, there's a good chance your web server is going down, especially if your pages are quite dynamic and require any processing.
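A minimal way to try this (ports and cache size here are illustrative) is to run varnishd in front of the app and confirm that a repeat request is served from cache:

```shell
# Varnish on :80, backend app on :8080, 256 MB in-memory cache.
varnishd -a :80 -b localhost:8080 -s malloc,256m

# Request the landing page twice; on a cache hit, Varnish reports a
# nonzero Age header (and two transaction IDs in X-Varnish).
curl -sI http://localhost/ > /dev/null
curl -sI http://localhost/ | grep -i -e '^age' -e '^x-varnish'
```

Beware that Varnish's default behavior is to not cache responses involving cookies, so "actually caching every element" usually means writing a little VCL to strip cookies from static assets.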
I'd argue the opposite of your headline, that this was a very successful launch. Since HN isn't your target audience having your site fail from the traffic was far better than having it fail from a launch in your market. You shook out some important bugs before you lost real users. Plus you got to do this followup which will bring even more traffic.
First off, best of luck with your project. Secondly, kudos on writing the post-mortem, as I know it takes some guts to own a "failure".<p>I think, however, the need to write something like this speaks to an incorrect assumption: that you need a "launch". Of course, TC and HN can give you a nice bump in traffic and even signups. In the long run, though, this really doesn't accomplish much for you. It gives you the kind of traffic that will likely leave and move on to the next article, skewing your metrics. There are certainly qualified prospects in there, but they're hard to pick out from all the noise.<p>Again, the concept of a "launch" speaks to a poor business model. It mainly benefits businesses where the word "traction" is more important than "revenue". Build a business that provides a service others will pay for, grow as fast as the business can bear, and bring in the visitors who are truly valuable to you.
Thanks for this post; there were some nice tips in there. I do have one nitpick about your writing style, though. Maybe it's just me, but I found that your use of "+ve" instead of just saying "positive" and of "&" instead of "and" did not have the intended effect of speeding up reading, quite the reverse actually.
Running with swap enabled is a terrible idea. The authors mention how it was only once solr crashed that they were able to actually log in and start fixing problems; having swap means that rather than the OOM killer terminating processes, instead your whole system just grinds to a halt.<p>(it's strange that they recommend enabling swap when they also recommend enabling reboot-on-oom, which is pretty much the complete opposite philosophy)
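If you agree with this philosophy, turning swap off is straightforward (also comment out any swap entries in /etc/fstab, or it comes back at the next boot):

```shell
swapoff -a    # disable all swap devices immediately
free -m       # the Swap row should now show 0 total
```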
1. So nginx didn't cache because of cookie?<p>2. Isn't swapping bad? I don't think I've ever had a situation in which swap more than say 100MB was helpful. Once the machine starts swapping, a bigger swap just prolongs the agony.<p>3. If you couldn't ssh, why didn't you just reboot the machine?<p>Edit:<p>1. What did you use for the graphs?<p>2. What is the stack?
Stress test and load test before launch!<p>It doesn't take more than an hour, and you'll quickly know what your upper limits are and where the bottlenecks lie.<p>I use Gatling in favor of JMeter:
<a href="https://github.com/excilys/gatling" rel="nofollow">https://github.com/excilys/gatling</a>
I find it very difficult to believe that this person worked on any sort of performance team, given that what they discovered is pretty much "Handling Load 101".<p>Running everything on one box? Using swap? No caching? It's like a laundry list of junior admin mistakes.
This post-mortem has me thinking about the best way to handle the situation in which you can't SSH into your server. The OP decided to trigger a kernel panic/restart on OOM errors, but I have a couple of concerns about this approach:<p>* If memory serves, when your system runs out of memory, shouldn't the kernel's OOM killer terminate the processes using the most memory? If so, the system should recover from the OOM condition and no restart should be needed.<p>* OOM errors aren't the only way to get a system into a state where you cannot SSH in. It would be great to have a more general solution.<p>* Even if you do restart, unless you had some kind of performance monitoring enabled, the system is no longer in the high-memory state, so it will take a bit of digging to determine the root cause. If OOM errors are logged to syslog or something, I guess this isn't a big deal.<p>I suppose the best fail-safe is to ensure you always have one of the following:<p>* physical access to the system<p>* a way to access the console indirectly (something like vSphere comes to mind)<p>* a provider-level remote restart (services like Linode offer this), which would have been useful in this scenario
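For reference, the panic-and-reboot behavior the OP chose comes down to two sysctls (the 10-second delay is an arbitrary example):

```shell
sysctl -w vm.panic_on_oom=1   # panic on OOM instead of invoking the OOM killer
sysctl -w kernel.panic=10     # auto-reboot 10 seconds after a kernel panic
```

As noted above, the trade-off is losing the live state you'd want for diagnosing what ate the memory.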
I clicked on the link to Cucumbertown and was immediately greeted with a picture of Italian seasoned chicken thighs.<p>I think I really like your website. I really like the simplicity of the presentation to the user.
This is a great way to make lemonade out of the lemon of getting hosed by a lot of traffic. Write an informative post-mortem and resubmit! I know I missed the original submission and clicked through to the site, and there you have it. I'd say being humble and trying again is never a bad idea.
Whatever the <i>real</i> cause of your issues was, Linode's default small swap space is a plague. A system starts to misbehave much more gently if there is enough swap.
[ not that you asked for it here, but I've got some frontpage UI feedback: ]<p>I think you should put a description up front that explains what Cucumbertown is. The main image should be a slider with multiple feature images, and Latest Recipes should be the first section after it. Just my 2c!<p>Screen: <a href="http://cl.ly/image/3R2Y131Z433L" rel="nofollow">http://cl.ly/image/3R2Y131Z433L</a>
I have been having some of the same issues on a site I run ( <a href="http://www.opentestsearch.com/" rel="nofollow">http://www.opentestsearch.com/</a> ). Under heavy load, Solr will grind to a halt if you don't have enough RAM available.<p>Putting a dedicated Varnish server in front of the search servers helped a lot. Using a CDN may also be a viable option, but I haven't tried it myself.
That's why I like to use Heroku/EC2 for launching a new web service. If shit hits the fan, you can jack up the processing power/database/RAM/whatever to meet your demand. Once you have a good idea of the traffic it generates, you can move it to a cheaper service.<p>Obviously, it's easy to say that from the bench. Congratulations on the launch, by the way.
Once memory goes to swap, you've already lost. Personally I rarely configure swap on servers, save for the DB. I would reconfigure your services so they can't grow past physical free memory. After that, you'll have to scale servers horizontally.
lamesauce.<p>1. HN should let you pay them $10 and let them hammer your server(s) before your story goes live. good for you. good for them.<p>2. there's a deal at lowendbox right now for a 2GB VPS for $30 a YEAR. you could have a healthy server farm for pretty cheap.
Is there a way to run simulated traffic to determine how your server will react under heavier load, and roughly how many people it can serve?
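Yes; even a crude ApacheBench run gives a first approximation (the numbers here are arbitrary, and you should watch latency percentiles, not just requests/second):

```shell
# 10,000 requests, 100 concurrent, against the landing page
# (your-site.example is a placeholder for the URL under test).
ab -n 10000 -c 100 http://your-site.example/
```

Tools like siege, JMeter, or Gatling can model more realistic browsing sessions than a single hammered URL.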
Your project, Cucumbertown, is a cooking/recipe site/platform/network. Hacker News is not your audience/customer. Any "launch" on Hacker News is a fail, regardless of downtime.