These kinds of domino effects are one reason why scalability is so hard to get right. It reminds me of precipitation in supersaturated solutions. Everything seems normal until you reach some unforeseen tipping point, and then all hell breaks loose.<p>I like his little veiled pitch for Google's services when he talks about how easy it was to bring more request routers online given their elastic architecture. It makes me wonder why that elasticity isn't automated -- more routers should <i>automatically</i> be brought online if any routers hit their maximum load.
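Something like this would be enough as a first pass, I'd think. Purely a sketch with made-up calls (get_router_loads, provision_router) -- no idea what Google's internals actually look like:<p><pre><code>  import time

  MAX_LOAD = 0.8        # fraction of capacity at which we add a router
  CHECK_INTERVAL = 30   # seconds between load checks

  def autoscale(get_router_loads, provision_router):
      # Poll router load and bring another router online
      # whenever any one of them is near its limit.
      while True:
          loads = get_router_loads()   # e.g. {"router-1": 0.42, ...}
          if any(load >= MAX_LOAD for load in loads.values()):
              provision_router()       # spin up one more request router
          time.sleep(CHECK_INTERVAL)
</code></pre>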
Wow, I was impressed by how closely this mea culpa mirrors Amazon's from their big S3 outage.<p>Compare to:<p><a href="http://developer.amazonwebservices.com/connect/message.jspa?messageID=79978#79978" rel="nofollow">http://developer.amazonwebservices.com/connect/message.jspa?...</a>
I admire the transparency, but I don't believe for a second it's the whole story. This happened during work hours, and if they really were notified that quickly, I'm wondering why it took over 90 minutes to recover.<p>Also, the outage, for me anyway, seemed to last much longer than the stated 100 minutes. I seem to remember being unable to access GMail for a span of about 3 hours today.
It's nice to see them being so transparent about what happened and how they plan on fixing it in the future. They're obviously working on anticipating future problems, but what I wonder about are cases like this one, where they thought they were covered. How does one go about finding these failure points in systems that span multiple locations? I hope they follow up with lessons learned on their quest to improve reliability.
This is probably the most glaring flaw in SaaS and cloud computing. Even the giants go down eventually. Couple that with your own ISP's outages and your potential downtime roughly doubles, since the two failure modes stack.
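Back-of-the-envelope with made-up but typical numbers: if the service and your ISP each independently manage 99.9% uptime, the availabilities multiply, so the downtimes roughly add:<p><pre><code>  saas_uptime = 0.999   # ~8.8 hours of downtime per year
  isp_uptime = 0.999    # assume your ISP does about as well
  combined = saas_uptime * isp_uptime   # independent failures multiply uptime
  print("combined uptime: %.4f%%" % (combined * 100))   # ~99.80% -- about double the downtime
</code></pre>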