I designed a similar system 10 years ago at Bytemark which worked for a few thousand VMs and ran for about 12 years. It was called BigV [1]. It might still be running (any customers here still?). I think the new owners tried to shut it down, but customers kept protesting when offered a less-featureful platform :-)

The two architectural differences from fly:

* the VM clusters were split into "head" and "tail" machines, linked on a dedicated 10Gbps LAN. So each customer VM needed its corresponding head & tail machine to be alive in order to run, but qemu could do all of that natively;

* we built our own network storage layer based on NBD, called flexnbd [2]. It served local discs to the heads, managed access control and so on. It could also be put into a "mirror" mode where a VM's disc would start writing its blocks out to another server while continuing to serve, keeping track of "dirty" blocks etc., exactly as described here (a rough sketch of that bookkeeping is at the end of this comment).

It was very handy to be able to sell and directly attach discs with different performance characteristics without having to migrate machines. But I suspect the network (even at 10Gbps) was too much of a bottleneck.

I can't remember whether Linux supported the kind of fancy disc migration we wanted to do back in 2011. If it did, it was hard enough that spending a year getting our own server right seemed worth it.

It *is* a particularly sweet trick to have a suspicion about a server and just say "flush it!", and in 12-24 hours it's no longer in service. We had tools that most of our support team could use to act on a slight suspicion. You do notice a performance dip while migrations are going on, but the decision to use network storage (which reduced performance overall, lol) might have masked that.

Having our discs served from userspace reduced the administration that we needed to do. But it came with the terror of maintaining a piece of C that shuffled our customers' data around. Also - because I was a masochist - customers' discs were files stored on btrfs, and we became reluctant experts. *Overall* the system was reliable, but it took a good 12-18 months of customers tolerating fscks (& us being careful not to anger the filesystem).

I did miss this kind of work in 2022 and interviewed for a support role at fly. I'm not sure how to take being rejected at the screener stage; I'm sure some of my former staff could explain it :)

[1] https://blog.bytemark.co.uk/wp-content/uploads/2012/04/DesignAndImplementationOfBigV.pdf

[2] https://github.com/BytemarkHosting/flexnbd-c
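For the curious, here's roughly what the dirty-block side of mirror mode boils down to. This is a toy sketch, not flexnbd's actual code: the block size, the bitmap layout and the send_block() stub are all invented for illustration, and the real thing also has to cope with writes racing the copy pass, plus a final quiesced pass before cutting over to the destination.

    /* Toy sketch of dirty-block mirroring bookkeeping: a live write path
     * that flags blocks, and a copy pass that sweeps the bitmap and
     * re-sends whatever changed since the last sweep. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define BLOCK_SIZE    4096
    #define NUM_BLOCKS    1024
    #define BITS_PER_WORD 64

    static uint8_t  disc[NUM_BLOCKS][BLOCK_SIZE];       /* the live disc  */
    static uint64_t dirty[NUM_BLOCKS / BITS_PER_WORD];  /* 1 bit / block  */

    static void mark_dirty(size_t block) {
        dirty[block / BITS_PER_WORD] |= 1ULL << (block % BITS_PER_WORD);
    }

    static int test_and_clear_dirty(size_t block) {
        uint64_t mask = 1ULL << (block % BITS_PER_WORD);
        int was = (dirty[block / BITS_PER_WORD] & mask) != 0;
        dirty[block / BITS_PER_WORD] &= ~mask;
        return was;
    }

    /* Guest write path: store the data, then flag the block so the
     * next copy pass knows to re-send it. */
    static void guest_write(size_t block, const uint8_t *data) {
        memcpy(disc[block], data, BLOCK_SIZE);
        mark_dirty(block);
    }

    /* Stand-in for pushing one block to the destination server; real
     * code would write disc[block] down an NBD connection. */
    static void send_block(size_t block) {
        (void)block;
    }

    /* One copy pass over the bitmap. Migration converges when a pass
     * sends (almost) nothing; then the source pauses I/O for one final
     * clean pass and hands the disc over. */
    static size_t mirror_pass(void) {
        size_t sent = 0;
        for (size_t b = 0; b < NUM_BLOCKS; b++)
            if (test_and_clear_dirty(b)) { send_block(b); sent++; }
        return sent;
    }

    int main(void) {
        uint8_t buf[BLOCK_SIZE] = {0};

        /* Start of migration: everything counts as dirty, so the
         * first pass is a full copy of the disc. */
        for (size_t b = 0; b < NUM_BLOCKS; b++)
            mark_dirty(b);
        printf("first pass sent %zu blocks\n", mirror_pass());

        /* The VM keeps writing while the mirror runs; only the blocks
         * it touched need re-sending on the next pass. */
        guest_write(3, buf);
        guest_write(700, buf);
        printf("second pass sent %zu blocks\n", mirror_pass());
        return 0;
    }

The whole point is that second pass: the VM never stops serving, and each sweep only carries the blocks written since the previous one, so the amount left to copy shrinks until the final handover is cheap.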