"The directory is intended for temporary storage of results before staging them into a more permanent location [...] During the three years that the filesystem has been in operation, it has accumulated 1.7 Petabytes of data in 850 million objects."<p>There needs to be some law about how temporary directories always end up containing vitally important data.
Lots of fun; while backing up the filesystem prior to wiping and rebuilding it, they ran out of IOPS to do it in a reasonable time frame, so after considering other options:<p><i>One obvious solution would be to use a ramdisk, a virtual disk that actually resides in the memory of a node. The problem was that even our biggest system had 1.5TB of memory while we needed at least 3TB.<p>As a workaround we created ramdisks on a number of Taito cluster compute nodes, mounted them via iSCSI over the high-speed InfiniBand network to a server and pooled them together to make a sufficiently large filesystem for our needs.</i><p>A hack they weren't at all sure would work, but it did nicely.
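For anyone curious what that kind of pooling looks like mechanically, here is a minimal sketch, assuming ssh access to the nodes, targetcli/LIO for the iSCSI export, and mdadm striping on the server. The node names, IQNs, sizes, device names and mount point are made up, and the article doesn't say which tools CSC actually used, so treat this as one plausible recipe rather than their procedure:

```python
#!/usr/bin/env python3
# Rough sketch of the "pool RAM disks over iSCSI" trick described above.
# Node names, sizes, IQNs and the targetcli/mdadm choices are illustrative
# assumptions, not CSC's actual procedure.
import subprocess

NODES = ["c101", "c102", "c103"]      # hypothetical compute node hostnames
RAMDISK_KIB = 1024 * 1024 * 1024      # ~1 TiB per node; brd's rd_size is in KiB

def ssh(host, cmd):
    """Run a command on a remote node, aborting on failure."""
    subprocess.run(["ssh", host, cmd], check=True)

# 1. On each compute node: carve a block device out of RAM and export it as
#    an iSCSI target (LIO via targetcli; open access assumes a trusted network).
for node in NODES:
    iqn = f"iqn.2015-02.example:{node}-ram0"          # made-up IQN scheme
    ssh(node, f"modprobe brd rd_nr=1 rd_size={RAMDISK_KIB}")
    ssh(node, "targetcli /backstores/block create name=ram0 dev=/dev/ram0")
    ssh(node, f"targetcli /iscsi create {iqn}")
    ssh(node, f"targetcli /iscsi/{iqn}/tpg1/luns create /backstores/block/ram0")
    ssh(node, f"targetcli /iscsi/{iqn}/tpg1 set attribute authentication=0 "
              "generate_node_acls=1 demo_mode_write_protect=0")

# 2. On the server: log in to every target over the IPoIB interface, then
#    stripe the imported LUNs into one big device and put a filesystem on it.
for node in NODES:
    subprocess.run(["iscsiadm", "-m", "discovery", "-t", "sendtargets",
                    "-p", f"{node}-ib"], check=True)  # "<node>-ib" = assumed IPoIB hostname
subprocess.run(["iscsiadm", "-m", "node", "--login"], check=True)

# Which /dev/sdX names the imported LUNs get depends on the host; sd[b-d] is assumed.
subprocess.run(["mdadm", "--create", "/dev/md0", "--level=0",
                "--raid-devices=3", "/dev/sdb", "/dev/sdc", "/dev/sdd"], check=True)
subprocess.run(["mkfs.xfs", "/dev/md0"], check=True)
subprocess.run(["mkdir", "-p", "/mnt/ramdisk-pool"], check=True)
subprocess.run(["mount", "/dev/md0", "/mnt/ramdisk-pool"], check=True)
```

With everything living in RAM, a single node reboot loses its slice and takes the whole stripe with it, which is presumably part of why they called it a hack they weren't sure would work.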
Current HN Title: 1.7 petabytes and 850M files lost, and how we survived it.<p>Article title: The largest unplanned outage in years and how we survived it<p>Article overview:
"A month ago CSC's high-performance computing services suffered the largest unplanned outage in years. In total approximately 1.7 petabytes and 850 million files were recovered."<p>Although technically correct, the HN title is misleading.
It should be noted that this is about a Lustre filesystem hosted on DDN hardware. It's unclear whether the failed controller contributed to the filesystem corruption, but Lustre is quite capable of accelerating local entropy all by itself. It was designed and spec'd at LLNL as high-performance, short-term scratch/swap space for huge files, and even after 15 years it isn't especially reliable or fit for use outside that domain.
I'm surprised that the copying bottleneck seems to have been entirely at the target rather than the source. Is that because there were multiple copies of the source?<p>I've had to employ the horrible hack of iSCSI from compute nodes, RAIDed and re-exported, but it's not what I'd have tried first. The article doesn't mention the possibility of just spinning up a parallel filesystem on compute node local disks (assuming they have disks); I wonder if that was ruled out. I don't have a good feeling for the numbers, but I'd have tried OrangeFS on a good number of nodes initially.<p>By the way, it's been pointed out that a RAM disk is relatively slow, at least in the context of data rates rather than metadata: <a href="http://mvapich.cse.ohio-state.edu/static/media/publications/slide/rajachan-hpdc13.pdf" rel="nofollow">http://mvapich.cse.ohio-state.edu/static/media/publications/...</a>.
Out of curiosity, why weren't they running the metadata drive in a mirrored RAID? If you have petabytes of data, wouldn't it make sense to spend the ~$100 for a second 3TB drive to mirror your metadata?<p>Or was the inode problem not a local disk problem but a problem in the Lustre fs? I couldn't quite tell from the article.