科技回声 (TechEcho) — a tech news platform built with Next.js, providing global tech news and discussion.

Postmortem of the failure of one hosting storage unit on Jan. 8, 2020

120 points · by nachtigall · over 5 years ago

18 comments

kaizendad · over 5 years ago

It's interesting -- they identify the actual cause of the problem up top, and then just zip right past it. The problem wasn't the hardware failure, or the lack of backups, it was *that customers expected them to have backups*.

Gandi goes into some detail on the recovery process and on ways to fix the issue in the future. But, apart from some hand-waving, they don't give any specifics about how they'll communicate expectations better with their customers in the future.

Imagine the counterfactual: Gandi's docs clearly communicate "this service has no backups, you can take a snapshot through this API, you're on your own." Of course customers with data loss would've complained, but, at the end of the day, the message from both Gandi and the community would've been "well, next time buy a service with backups?" Yet there's no explicit plan to improve documentation.
zaroth · over 5 years ago

I'm no ZFS expert, but it must have been incredibly stressful, if not mildly terrifying, going that far down the rabbit hole with customer data on the line.

I have a bad feeling someone is going to read their write-up and tweet at them, "Why didn't you use the -xyz switch? It fixes exactly this issue in 12 seconds."

Indeed, it appears that the option they needed existed, but only in a later version of ZFS than they were running, and part of the fix was moving the broken array to a system that could run a newer version of ZFS, which apparently was itself not trivial.
_nalply · over 5 years ago

Just bad luck. A different story: I had a cheap dedicated host in Atlanta. Their failure was epic. You get what you pay for.

An old high-rise was filled with tens of thousands of old second-hand server blades, floor after floor of equipment prolifically producing waste heat. A sure recipe for disaster?

Sure!

A wrongly installed fuse on one phase in the building made that phase burn out too early. I saw a picture of the archaic breaker equipment. They fixed that.

However, the missing phase destroyed the compressor motors of their cooling systems. The temperature crept up higher and higher. They had to turn off whole floors of servers. When they believed they had fixed the problem, they turned servers back on row by row. Renters then frantically tried to copy what they had on the servers, the half-repaired cooling system was overtaxed, and they had to turn off servers again.

Edit: made some details more specific.
ZeroCool2u · over 5 years ago

This is basically the stuff of nightmares.

You can't really fault them for the ZFS version being so old that the feature they needed wasn't yet implemented, because the machine was literally part of the last batch to be upgraded. The root cause is just some random hardware failure that can't be anticipated.

Just bad luck. Beyond radically changing how their core infrastructure works, it doesn't seem like there was a lot they could have done to prevent this. Kudos for releasing the postmortem, though; at least they've been fairly honest and direct about it.
tmikaeld · over 5 years ago

This actually makes me glad to use ZFS (FreeBSD and ZoL) on all servers; a broken RAID on a different filesystem could have meant complete data loss.

From all of the cases I've read where people were not idiots (not using snapshots and overwriting a dataset...), it's by far the safest filesystem I've seen during my 12 years working with it, and I've yet to lose a single file.

Sure, performance can suffer and RAM is pricey, but the safety of the data is more important.

Considering this was a hardware fault, I think Gandi.net did their best. However, they should offer clients optional ZFS replication as an extra measure.
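The ZFS replication the commenter suggests can be sketched with stock `zfs send`/`zfs recv`; the pool name `tank`, the snapshot labels, and the `backuphost` target below are illustrative placeholders, not Gandi's actual layout, and exact flags vary by ZFS version:

```shell
# Incremental replication sketch (hypothetical dataset and host names).
# Take a new snapshot, then stream only the delta since the previous
# snapshot to a second machine, where it is applied to a backup pool.
zfs snapshot tank/customers@2020-01-08
zfs send -i tank/customers@2020-01-07 tank/customers@2020-01-08 \
  | ssh backuphost zfs recv -F backup/customers
```

Because the receive side replays the stream into an independent pool, a metadata corruption on the primary (as in Gandi's incident) would not propagate to the replica.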
gizmo · over 5 years ago

They don't say they're sorry, because they're not. Instead they minimize their actions by: 1) stating how few customers were affected, 2) saying it's not really their fault because it was a hardware error, 3) it's not really their fault because they had already planned to upgrade the server, 4) it's not really their fault the restore procedure took so long because they had to make backups first, 5) the restore took so long because spinning disks are slow, and they really had no way to know this in advance. And to top it all off, they point out that they're not contractually obligated to provide working snapshots at all, so really it's the customers who are at fault here.

The takeaway here is clear: don't trust Gandi with anything you care about.
marcinzm · over 5 years ago

> But contractually, we don't provide a backup product for customers. That may have not been explained clearly enough in our V5 documentation.

If you have a single point of failure for both data and "snapshots", then you should explain that very clearly to customers. Moreover, as I understand it, competitors like AWS do not have such a single point of failure (i.e., EBS snapshots are stored on S3, not on EBS), so using the same terminology/workflow is going to cause confusion.
generalpass · over 5 years ago

I realize everyone here seems focused on the filesystem, but one of the things stressed by the OpenBSD project is that difficulty in upgrading is the root cause of the biggest problems.

Case in point.
vahomu · over 5 years ago

> As disks are read at 3M/s, we estimate the duration of the operation to be up to 370 hours.

Am I reading this right? That works out to just ~3.8 TiB.

So much drama over, basically, one HDD's worth of data?
loa_in_ · over 5 years ago

Their services have otherwise run flawlessly for me. I appreciate the transparency.
BrentOzar · over 5 years ago

And now their postmortem blog post is down with a 503 error. Doesn't exactly fill me with confidence about their abilities.
acd · over 5 years ago

I once had data loss on a test ZFS system with iSCSI on top. What I learnt from that is that you need to schedule scrubs of your ZFS pools regularly. It's always easy to be wise afterwards but harder to predict beforehand. Not sure if that would have helped here.
iicc · over 5 years ago

> We think it may be due to a hardware problem linked to the server RAM.

Are they using ECC RAM?
cdubzzz · over 5 years ago

Previously: https://news.ycombinator.com/item?id=22001822
perlgeek · over 5 years ago

As a postmortem, this does not inspire confidence. It's a very technical piece, but it doesn't even try to take a customer's perspective.

If you want to learn from such an outage, you have to do a fault analysis that leads to parameters you can control.

Sure, there can be faulty hardware and software, but you are the one selecting and running and monitoring them.

If recovery takes ages, you might want to practice recovery and improve your tooling.

And so on.

Blaming ZFS and faulty hardware and old software all cries "we didn't do anything wrong", so no improvements are in sight.
_eht · over 5 years ago
Being a Gandi customer must be terrifying, generally.
gdm85 · over 5 years ago

Would it be possible to back up the precious metadata separately to mitigate the issue?
dmh2000 · over 5 years ago

At a glance I thought, "Why a storage unit? Where do they get power? How do they cool it? It's not physically secure," etc. Then: oh, *that* kind of storage unit. Yes, I'm dumb.