Why Reddit was down for 6 hours

234 points by meghan · about 14 years ago

12 comments

jamwt · about 14 years ago
I know it's not exactly in vogue these days to tout the merits of bare hardware, but after all the VPS hubbub over the last couple of years, the best progression for your website still seems to be:

1. No traction? Just put it anywhere, 'cause frankly, it doesn't matter. Cheapest reputable VPS possible. Let's say, Linode.

2. Scaling out, high concurrency and rapid growth? DEDICATED hardware from a QUALITY service provider--use Rackspace, SoftLayer et al. Have them rack the servers for you and you'll still get ~3 hour turnarounds on new server orders. That's *plenty* fast for most kinds of growth. No inventory to deal with, and with deployment automation you're really not doing much "sysadmin-y" work or requiring full-timers who know what Cisco switch to buy.

3. Technology megacorp, top-100 site? Staff up on hardcore net admin and sysadmin types, colocate first, and eventually, take control of/design the entire datacenter.

I simply don't understand why so many of these high-traffic services continue to rely on VPSes for phase 2 instead of managed or unmanaged dedicated hosting. The price per concurrent user is competitive or cheaper for bare metal. Most critically, it's insanely hard to predictably scale out database systems with high write loads when you have unpredictable virtualized (or even networked) I/O performance on your nodes.
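That unpredictable virtualized I/O is straightforward to observe for yourself. Here is a minimal sketch that samples synchronous write latencies and reports their spread; the file path, block size, and sample count are arbitrary choices, not anything from the thread:

```python
# Minimal sketch: measure the spread of synchronous write latency,
# the kind of unpredictable virtualized I/O the comment describes.
# Path, block size, and sample count are arbitrary choices.
import os
import statistics
import time

PATH = "/tmp/io_probe.dat"
BLOCK = b"x" * 4096  # one 4 KiB block per write
SAMPLES = 200

def sample_write_latencies() -> list[float]:
    latencies = []
    fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        for _ in range(SAMPLES):
            start = time.perf_counter()
            os.write(fd, BLOCK)
            os.fsync(fd)  # force each write through to the device
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.remove(PATH)
    return latencies

lat = sample_write_latencies()
print(f"median {statistics.median(lat)*1e3:.2f} ms, "
      f"p99 {sorted(lat)[int(0.99 * len(lat))]*1e3:.2f} ms")
```

On dedicated hardware the median and p99 tend to sit close together; on shared virtualized storage the tail can wander by orders of magnitude, which is exactly what makes write-heavy database capacity planning so hard.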
naner · about 14 years ago
http://www.reddit.com/r/blog/comments/g66f0/why_reddit_was_down_for_6_of_the_last_24_hours/c1l6ykx

A former employee is not quite as nice to Amazon.
A1kmm · about 14 years ago
Amazon claims: "Each storage volume is automatically replicated within the same Availability Zone. This prevents data loss due to failure of any single hardware component."

They make it sound like they are already providing RAID or something similar; however, the fact that things like this happen to Reddit, who have built their own RAID on top of Amazon's already replicated volumes, shows that reliability is not a good reason to go with AWS.
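For concreteness, "building your own RAID on top of EBS" typically means striping several attached volumes into one mdadm array. A minimal sketch under assumptions: four attached volumes with hypothetical device names, mdadm installed, run as root. This is illustrative, not Reddit's actual configuration:

```python
# Hypothetical sketch: stripe several attached EBS volumes into one
# software-RAID device with mdadm. Device names, volume count, and the
# mount point are assumptions, not Reddit's real setup.
import subprocess

EBS_DEVICES = ["/dev/xvdf", "/dev/xvdg", "/dev/xvdh", "/dev/xvdi"]  # assumed

def build_raid0(md_device: str = "/dev/md0") -> None:
    """Assemble a RAID-0 stripe over the attached EBS volumes."""
    subprocess.run(
        ["mdadm", "--create", md_device,
         "--level=0",
         f"--raid-devices={len(EBS_DEVICES)}",
         *EBS_DEVICES],
        check=True,
    )
    # Put a filesystem on the array and mount it for the database.
    subprocess.run(["mkfs.ext4", md_device], check=True)
    subprocess.run(["mount", md_device, "/var/lib/postgresql"], check=True)

if __name__ == "__main__":
    build_raid0()
```

The point of the comment stands either way: each underlying EBS volume is already replicated by Amazon, so this striping adds throughput, not an independent durability guarantee.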
kowsik · about 14 years ago
EBS storage aside, they are down to 3 guys? *yikes*
bryanh · about 14 years ago
On that note, I have been meaning to ask HN (even if nothing more than an exercise)...

If you had to run a site like Reddit, what would you do?
duck · about 14 years ago
*Then, something really bad happened. Something which made the earlier outage a comparative walk in the park.*

Murphy's law on St. Patrick's Day. Doesn't get any better than that.
X-Istence · about 14 years ago
I always love seeing a good technical post-mortem of what went wrong and how it could be fixed in the future...

I'm currently working on building a backend service that has to scale massively as well, and it has been a fun challenge trying to understand exactly where things can go wrong and how wrong they can go...
marcamillion · about 14 years ago
Wow... they sound like they are really beating themselves up over it.

I know the community can be demanding, but that just seems stressful.
rgrieselhuber · about 14 years ago
Great writeup. I'd love to hear other people's experiences with workarounds when/if EBS goes down (switching over to RDS for a short time, etc.).

The comment about moving to local storage was interesting. Isn't the local storage on EC2 instances extremely limited (like 10-20 GB)?
PaulHoule · about 14 years ago
I had two machines running in east-1 last night and one of them went down around the same time reddit did. The other one made it through the night O.K.

EBS problems do seem to be the biggest reliability problem in EC2 right now. The most common symptom is that a machine goes to 100% CPU use and 'locks up'. Stopping the instance and restarting usually solves the problem.

The events also appear to be clustered in time. I've had instances go for a month with no problems, then it happens 6 times in the next 24 hours.

My sites are small, but one of them runs VERY big batch jobs periodically that take up a lot of RAM and CPU. Being able to rent a very powerful machine for a short time to get the batch job done without messing up the site is a big plus.
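The stop-and-restart remedy described above is easy to script. A minimal sketch using boto3 (which postdates this 2011 thread; at the time you would have used boto or the EC2 command-line tools). The instance ID and region are placeholders:

```python
# Minimal sketch: stop and restart a hung EC2 instance with boto3.
# boto3 postdates this thread; instance ID and region are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical instance

def bounce_instance(instance_id: str) -> None:
    """Stop a locked-up instance, wait until it is stopped, then start it."""
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

bounce_instance(INSTANCE_ID)
```

Note that a stop/start cycle moves the instance to new underlying hardware (unlike a reboot), which is presumably why it clears the EBS lock-up symptom the comment describes.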
jwcacces · about 14 years ago
This is why you don't outsource your bread and butter, people!

If you want to outsource who makes your lunch, fine, but if your whole business is requests in, data out, you do not put the responsibility of storing your data in someone else's hands.

I get it, Amazon EBS is cheap. But at the end of the day you've got to make sure it's your fingers on the pulse of those servers, not someone else whose priorities and vigilance may not always line up with yours.

(also the cloud is dumb)
tedjdziuba · about 14 years ago
> We could make some speculation about the disks possibly losing writes when Postgres flushed commits to disk, but we have no proof to determine what happened.

If you read between the lines, this says that EBS lies about the result of fsync(), which is horrifying.
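For context on why that would be horrifying: a database's durability guarantee rests entirely on fsync() meaning what it says. Below is a minimal sketch of the write-then-fsync pattern at issue (the path and record contents are illustrative, not Postgres internals):

```python
# Minimal sketch of the durable-write pattern the comment refers to.
# A database trusts that once fsync() returns, the commit record is on
# stable storage. If the storage layer acknowledges the flush while the
# data is not actually durable, committed transactions can vanish.
import os

def durable_append(path: str, record: bytes) -> None:
    """Append a record and force it to stable storage before returning."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, record)
        os.fsync(fd)  # durability hinges on the device honoring this flush
    finally:
        os.close(fd)

durable_append("/tmp/wal.log", b"COMMIT txid=42\n")  # illustrative path/record
```

If the device (or a virtualization layer in front of it) returns from the flush before the data is truly persistent, no amount of application-level care can recover the lost writes, which is exactly the failure mode the quoted postmortem hints at.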