6:54 PM PDT

Starting at 1:18 PM PDT we experienced connectivity issues to some EC2 instances, increased API error rates, and degraded performance for some EBS volumes within a single Availability Zone in the EU-CENTRAL-1 Region.

At 4:26 PM PDT, network connectivity was restored and the majority of affected instances and EBS volumes began to recover.

At 4:33 PM PDT, increased API error rates and latencies had also returned to normal levels. The issue has been resolved and the service is operating normally. The root cause of this issue was a failure of a control system which disabled multiple air handlers in the affected Availability Zone.
These air handlers move cool air to the servers and equipment, and when they were disabled, ambient temperatures began to rise. Servers and networking equipment in the affected Availability Zone began to power off when unsafe temperatures were reached. Unfortunately, because this issue impacted several redundant network switches, a larger number of EC2 instances in this single Availability Zone lost network connectivity.

While our operators would normally have been able to restore cooling before impact, a fire suppression system activated inside a section of the affected Availability Zone. When this system activates, the data center is evacuated and sealed, and a chemical is dispersed to remove oxygen from the air to extinguish any fire. In order to recover the impacted instances and network equipment, we needed to wait until the fire department was able to inspect the facility.
After the fire department determined that there was no fire in the data center and it was safe to return, the building needed to be re-oxygenated before it was safe for engineers to enter the facility and restore the affected networking gear and servers. The fire suppression system that activated remains disabled. This system is designed to require smoke to activate and should not have discharged. This system will remain inactive until we are able to determine what triggered it improperly.

In the meantime, alternate fire suppression measures are being used to protect the data center. Once cooling was restored and the servers and network equipment were re-powered, affected instances recovered quickly. A small number of instances and volumes that were adversely affected by the increased ambient temperatures and loss of power remain unresolved.

We continue to work to recover those last affected instances and volumes, and have opened notifications for the remaining impacted customers via the Personal Health Dashboard. For immediate recovery of those resources, we recommend replacing any remaining affected instances or volumes if possible.
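For anyone acting on that recommendation, here's a minimal sketch of one way to replace a degraded EBS volume with boto3. The volume ID, target AZ, and volume type are placeholders rather than values from this incident, and it assumes the degraded volume is still readable enough to snapshot.

```python
# Hypothetical sketch: replace a degraded EBS volume by snapshotting it
# and creating a fresh volume from that snapshot. IDs and AZ below are
# placeholders; note a volume must live in the same AZ as the instance it
# attaches to, so choosing a new AZ implies a replacement instance as well.
import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")

degraded_volume_id = "vol-0123456789abcdef0"  # placeholder volume ID
target_az = "eu-central-1b"                   # placeholder target AZ

# Snapshot the degraded volume (only possible while it is still readable).
snapshot = ec2.create_snapshot(
    VolumeId=degraded_volume_id,
    Description="Pre-replacement snapshot after AZ cooling incident",
)
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot["SnapshotId"]])

# Create the replacement volume from the snapshot and wait until it is usable.
replacement = ec2.create_volume(
    SnapshotId=snapshot["SnapshotId"],
    AvailabilityZone=target_az,
    VolumeType="gp3",
)
ec2.get_waiter("volume_available").wait(VolumeIds=[replacement["VolumeId"]])
print("Replacement volume ready:", replacement["VolumeId"])
```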
5:19 PM PDT

We have restored network connectivity within the affected Availability Zone in the EU-CENTRAL-1 Region. The vast majority of affected EC2 instances have now fully recovered, but we’re continuing to work through some EBS volumes that are still experiencing degraded performance. The environmental conditions within the affected Availability Zone have now returned to normal levels. We will provide further details on the root cause in a subsequent post, but can confirm that there was no fire within the facility.
If you have data in Frankfurt, now is the time to test your backups. There's going to be a massive rash of failures in the next few months as hardware that was compromised but limping along dies off.
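One way to act on that advice, assuming your backups are EBS snapshots (the AZ below is a placeholder and the snapshot selection is deliberately simple): restore the most recent completed snapshot to a throwaway volume, attach it to a test instance, and verify the data before you trust the backup.

```python
# Hypothetical sketch: verify an EBS snapshot backup actually restores.
# Snapshot selection and target AZ are assumptions; pagination is ignored.
import boto3

ec2 = boto3.client("ec2", region_name="eu-central-1")

# Find the most recent completed snapshot owned by this account.
snapshots = ec2.describe_snapshots(
    OwnerIds=["self"],
    Filters=[{"Name": "status", "Values": ["completed"]}],
)["Snapshots"]
latest = max(snapshots, key=lambda s: s["StartTime"])

# Restore it to a throwaway volume in an unaffected AZ. Attach it to a
# test instance and mount or checksum the data before trusting the backup.
test_volume = ec2.create_volume(
    SnapshotId=latest["SnapshotId"],
    AvailabilityZone="eu-central-1b",  # placeholder AZ
)
ec2.get_waiter("volume_available").wait(VolumeIds=[test_volume["VolumeId"]])
print("Restored test volume:", test_volume["VolumeId"])
```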
Former controls system guy here, and I've worked in data centers. I'd be concerned about why a control system failure took down multiple air handlers. Units typically have their own controllers and can be configured to run by themselves without input from a "parent" controller.
I'm curious to hear whether anyone's multi-AZ setup (RDS, ECS, etc.) handled this event without much of an issue.

I assume so, but it would be nice to know it's working as expected!
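If anyone wants to sanity-check their own setup before the next AZ event, here's a rough boto3 sketch (region hard-coded as an assumption) that lists which RDS instances are actually Multi-AZ and where the standby lives. It won't prove a failover works, but it does catch databases that were never Multi-AZ in the first place.

```python
# Hypothetical sketch: report the Multi-AZ status of RDS instances in a
# region. The region is an assumption; pagination is ignored for brevity.
import boto3

rds = boto3.client("rds", region_name="eu-central-1")

for db in rds.describe_db_instances()["DBInstances"]:
    print(
        db["DBInstanceIdentifier"],
        "MultiAZ:", db["MultiAZ"],
        "primary AZ:", db.get("AvailabilityZone"),
        "standby AZ:", db.get("SecondaryAvailabilityZone"),
    )
```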