Man, reading that made me wish Clustrix (YC '06) had open sourced their database (<a href="https://www.clustrix.com/" rel="nofollow">https://www.clustrix.com/</a>). They had a MySQL-compatible scale-out DB nearly 10 years ago: wire-compatible with MySQL without using any MySQL code, and able to participate in a MySQL replication cluster alongside normal MySQL servers (which made migration easy). It was scale-out shared-nothing, so writes scaled linearly as you added nodes, unlike POLARDB, which is shared-everything with a single master. It used RDMA 10 years ago, and custom PCIe devices because NVMe didn't exist yet.<p>But they didn't open source it, so only a small handful of companies get to use it. Sad.
The hardest part of a distributed file system (and I mean file system here) is managing the metadata (where a file is, where the directory is, who last did something to it).<p>Lustre, GFS2 and GPFS all have centralised metadata stores, which is both a boon and a drawback.<p>What I can't figure out is what they've done here. It appears that metadata is stored in a special shared partition (the "journal")? But there is a control process as well.
I am rather sleep-deprived, so I may have misread things, but this doesn't seem to me to be the best benchmark to evaluate Ceph for database work.<p>From what I understand, best practice in Ceph for databases is to make an RBD image and format that with your filesystem of choice, with the RBD stripe size tuned to your database's write pattern.<p>I believe Ceph RBD supports RDMA, but I cannot find much current detail about it.
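<p>For reference, a rough sketch of what that RBD setup might look like (pool name, image name, sizes and stripe values here are placeholders, not recommendations; they'd need tuning to the actual database write size):<p><pre><code># Create an RBD image with explicit striping (all values are illustrative).
# --object-size is the size of each backing RADOS object; --stripe-unit and
# --stripe-count spread consecutive writes across several objects.
rbd create dbpool/mysql-data --size 500G \
    --object-size 4M --stripe-unit 64K --stripe-count 8

# Map it on the database host and put a regular filesystem on it.
rbd map dbpool/mysql-data      # exposes e.g. /dev/rbd0
mkfs.xfs /dev/rbd0
mount /dev/rbd0 /var/lib/mysql
</code></pre>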