It's easier to write the system's frontend while paying little attention to the backend and "just" letting a local filesystem do a lot of the work for you, but it doesn't work well. The interesting question is whether the result is also that the frontend-to-backend communication abstraction is good enough to replace the backend with a better solution. I'm not familiar enough with Ceph and BlueStore to have an opinion on that.

I happen to work for a distributed file-system company, and while I don't work on the filesystem part itself, the old saying "it takes software 10 years to mature" is so true in this domain.
It really is true. I spent years of my life wrangling a massive GlusterFS cluster and it was awful. You basically can't do any kind of filesystem operation on it that isn't CRUD on well-known, specific paths. Anything else (traversal, moving/copying, linking, updating permissions) would just hang forever. You're also at the mercy of the kernel driver, which does hate you, personally. You will have nightmares about uninterruptible sleep. Migrating it all to S3 over Ceph was a beautiful thing.
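For anyone facing the same migration: the win is roughly that "traversal" becomes a paginated LIST against an HTTP endpoint instead of a kernel mount that can wedge the whole box. A minimal sketch with boto3 against a Ceph RGW S3 endpoint; the endpoint URL, credentials, bucket, and prefix are all placeholders, not any real setup:

    import boto3

    # Ceph RGW speaks the S3 API, so a plain S3 client works; the endpoint,
    # credentials, bucket, and prefix below are hypothetical placeholders.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://rgw.example.internal",
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_KEY",
    )

    # "Traversal" becomes a paginated list-by-prefix: no kernel client,
    # no uninterruptible sleep, and a hung request is just a timeout to retry.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="my-bucket", Prefix="logs/2024/"):
        for obj in page.get("Contents", []):
            print(obj["Key"], obj["Size"])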
See also "Hierarchical File Systems are Dead" by Margo Seltzer and Nicholas Murphy <a href="https://www.usenix.org/legacy/events/hotos09/tech/full_papers/seltzer/seltzer.pdf" rel="nofollow">https://www.usenix.org/legacy/events/hotos09/tech/full_paper...</a>
Lots of these issues aren't specific to distributed systems; they also bite local single-node systems. Notable examples are PostgreSQL's fsyncgate, and how mail servers struggled in the past (IIRC that was one of the cases where ReiserFS shined).
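The fsyncgate lesson in particular generalizes well beyond Postgres: on Linux a failed fsync() may drop the dirty pages and clear the error state, so a retry can "succeed" without the data ever becoming durable. A rough sketch of the safe reaction (crash and recover from the WAL rather than retry), with a hypothetical helper name:

    import os
    import sys

    def durable_append(path: str, data: bytes) -> None:
        # Hypothetical helper: write then fsync, treating fsync failure as fatal.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        try:
            os.write(fd, data)
            try:
                os.fsync(fd)
            except OSError as exc:
                # Do NOT loop and retry: after a failed fsync the page cache
                # state is unknown, and a later fsync may report success anyway.
                sys.exit(f"fsync failed ({exc}); aborting so recovery replays the WAL")
        finally:
            os.close(fd)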
Noooo, really?<p>It all depends on what you want to do. For things that are already in files like all that data that DeepSeek and other models train on and for which DS open sourced their own distributed file system, it makes sense to go with a distributed file system.<p>For OLTP you need a database with appropriate isolation levels.<p>I know someone will build a distributed file system on top of FoundationDB if they haven’t yet.
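The metadata half of that is pretty natural, at least. A toy sketch of create/rename as FoundationDB transactions, loosely following the patterns from the official Python tutorial; the ("inode", path) key layout and the stored (size, mode) tuple are made up for illustration, not any real project's schema:

    import fdb

    fdb.api_version(710)
    db = fdb.open()  # uses the default cluster file

    @fdb.transactional
    def create(tr, path, size, mode):
        key = fdb.tuple.pack(("inode", path))
        if tr[key].present():
            raise FileExistsError(path)
        tr[key] = fdb.tuple.pack((size, mode))

    @fdb.transactional
    def rename(tr, old_path, new_path):
        old_key = fdb.tuple.pack(("inode", old_path))
        val = tr[old_key]
        if not val.present():
            raise FileNotFoundError(old_path)
        size, mode = fdb.tuple.unpack(val)
        tr[fdb.tuple.pack(("inode", new_path))] = fdb.tuple.pack((size, mode))
        del tr[old_key]  # both mutations commit atomically, or not at all

    create(db, "/data/a.parquet", 0, 0o644)
    rename(db, "/data/a.parquet", "/data/b.parquet")

The hard part a real distributed file system still has to solve is the data path; the KV store only buys you serializable metadata.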
Before Bluestore, we ran Ceph on ZFS with the ZFS Intent Log on NVDIMM (basically non-volatile RAM backed by a battery). The performance was extremely good. Today we run Bluestore on ZVOLs on the same setup, and if the zpool is a "hybrid" pool we put the Ceph OSD databases on an all-NVMe zpool. Ceph's WAL wants a disk slice for each OSD, so we skip the Ceph WAL and instead consolidate incoming writes on the ZIL/SLOG on NVDIMM.
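In case it helps anyone reproduce the shape of that layout, here is a rough Python wrapper around the usual zpool/zfs/ceph-volume commands. Every pool name, device, and size is a placeholder (the real layout isn't given above); the point is only the structure: SLOG on NVDIMM, OSD data on a ZVOL, block.db on an all-NVMe pool, and no separate block.wal device:

    import subprocess

    def run(*cmd):
        # Thin wrapper that echoes and executes each CLI call.
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Hybrid data pool with the intent log (SLOG) on an NVDIMM-backed device.
    # Device names are placeholders; NVDIMMs typically show up as /dev/pmemN.
    run("zpool", "create", "hybridpool", "raidz2",
        "/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd",
        "log", "/dev/pmem0")

    # All-NVMe pool that holds the OSDs' RocksDB (block.db) volumes.
    run("zpool", "create", "nvmepool", "mirror", "/dev/nvme0n1", "/dev/nvme1n1")

    # One ZVOL for OSD data, one for its DB, then hand both to ceph-volume.
    # Note: no --block.wal, so write consolidation happens on the ZFS SLOG.
    run("zfs", "create", "-V", "2T", "hybridpool/osd0")
    run("zfs", "create", "-V", "64G", "nvmepool/osd0-db")
    run("ceph-volume", "lvm", "create",
        "--data", "/dev/zvol/hybridpool/osd0",
        "--block.db", "/dev/zvol/nvmepool/osd0-db")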