科技回声

Careful with resizing on Digital Ocean...

A recent resize of a Digital Ocean instance corrupted the filesystem beyond repair. Support have offered me an $80 credit and offered to restore from a backup.

You should probably snapshot your instances before resizing, or migrate to a new instance manually.

(Thankfully I've built a replacement box from Ansible in the meantime.)

The upgrade process appears to have involved copying to different hardware across the network, since the resize operation took over 15 minutes. That copy seems to have been incomplete, which led to a completely unrecoverable filesystem.

On coming back up, the system displayed "DOROOT does not exist after resize". Running a filesystem check scrolled tens of thousands of e2fsck "fixes" over a period of 12 minutes.

As expected, the end result is that all "files" on the filesystem were in /mnt/lost+found with random names, and the data in them is no doubt corrupt too.

Digital Ocean support does not appear to be able to re-copy or review the previous block device to determine the source of the problem. They also don't appear to have logs of the resize operation.

Sure, it's always possible the filesystem was irretrievably corrupted before the reboot, but I think that's pretty unlikely. Given that I've not been doing things like 'dd if=/dev/urandom of=/dev/vda1' on there, that would probably indicate a hardware fault on their side anyway.

It's worth noting that I rebooted the box successfully a few minutes before the resize, so a (journaled) e2fsck ran at that point. The filesystem was at least usable a few minutes before the resize.

(Ticket #633210 in case anyone from Digital Ocean wants to investigate.)
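Whether the files that come back after an operation like this are intact is hard to judge without a baseline taken beforehand. As a rough sketch (not something from the original post), a checksum manifest recorded before the resize could be compared against one built afterwards or from a restored backup. The function names `build_manifest` and `compare_manifests` are hypothetical, and the sketch assumes the data directory is small enough to hash in full.

```python
import hashlib
import os


def build_manifest(root):
    """Map each regular file under root (by relative path) to its SHA-256 hex digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            digest = hashlib.sha256()
            with open(path, "rb") as handle:
                # Read in 1 MiB chunks so large files do not have to fit in memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            manifest[os.path.relpath(path, root)] = digest.hexdigest()
    return manifest


def compare_manifests(before, after):
    """Return (missing, changed): paths that disappeared, and paths whose content differs."""
    missing = sorted(p for p in before if p not in after)
    changed = sorted(p for p in before if p in after and before[p] != after[p])
    return missing, changed
```

The manifest would need to be stored somewhere other than the instance being resized, otherwise it is lost along with everything else.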

1 comment

cat9, about 10 years ago

Nice of them to give you a credit, and a good decision from a customer service standpoint, but I doubt they're at fault in any real way. It could be any number of things, many of which are completely out of their control to do more than mitigate and minimize, and thus part of working with real computing systems at scale.

Cultivate healthy paranoia that systems will fail, because eventually they will, particularly if you run 100 of them or run them for several years or any other "you have to survive 1000 coin tosses to miss the error" combinatoric series. And always make a backup before doing system-changing events like resizing a partition or reprovisioning a VM.
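The "survive 1000 coin tosses" point can be made concrete with a little arithmetic. If each operation (or each machine-month, or whatever the unit is) fails independently with probability p, the chance of seeing at least one failure in n of them is 1 - (1 - p)^n. The numbers below are purely illustrative assumptions, not measured failure rates for any provider.

```python
def prob_at_least_one_failure(p_single, trials):
    """Chance of one or more failures in `trials` independent events,
    each failing with probability `p_single`."""
    return 1.0 - (1.0 - p_single) ** trials


# Assumed, illustrative rate: a 1-in-1000 chance of failure per event.
for trials in (1, 100, 1000):
    chance = prob_at_least_one_failure(0.001, trials)
    print(f"{trials:>5} events: {chance:.1%} chance of at least one failure")
```

Under those assumptions, a failure that looks negligible for a single event becomes more likely than not by the time you reach 1000 events (roughly 63%).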

Warn HN: Resize of Digital Ocean instance corrupted filesystem beyond repair
