From the blog post:<p>> We looked at this issue earlier. Fundamentally the tension here is that copy-on-write semantics don’t fit with the emerging zone interface semantics.<p>While the paper writes:<p>> It is not surprising that attempts to modify production file systems, such as XFS and ext4, to work with the zone interface have so far been unsuccessful [19, 68], primarily because these are overwrite file systems, whereas the zone interface requires a copy-on-write approach to data management.<p>These seem to contradict each other: the blog says copy-on-write doesn't fit the zone interface, while the paper says the zone interface requires copy-on-write. I'd side with the original paper.
I ran a ~0.5 PB Ceph cluster for a few years, on quite old spinning-disk hardware (bought second-hand). It was great: it just worked, coped very well with hardware failures, and told the operator what was happening. An extremely solid, well-engineered system. My thanks to the Ceph team :)
I was just discussing with a colleague how technology accretes and how no one reevaluates high-level design decisions even after every single factor leading to those decisions has changed.<p>It's weird that basic filesystems today are so out of touch with modern realities that we are <i>universally</i> forced to resort to complex databases even in cases where the logical model of files and directories fits the storage needs really well.<p>It's weird that hierarchical storage is the only universal model available on all OSes and in all languages.<p>The more I think about it, the more I realize that we live in a bizarro world where software runs everything, yet makes little to no sense from a human, modern-hardware, or system-design perspective.
The link to the paper in the article requires an ACM subscription. Here is a link to the version hosted by the authors:<p><a href="https://www.pdl.cmu.edu/PDL-FTP/Storage/ceph-exp-sosp19.pdf" rel="nofollow">https://www.pdl.cmu.edu/PDL-FTP/Storage/ceph-exp-sosp19.pdf</a>
For anyone looking for more information and benchmarks on the performance improvements in recent versions of Ceph (and with Bluestore in particular), here's a write-up that was done as part of testing for infrastructure to support the Human Brain Project: <a href="https://www.stackhpc.com/ceph-on-the-brain-a-year-with-the-human-brain-project.html" rel="nofollow">https://www.stackhpc.com/ceph-on-the-brain-a-year-with-the-h...</a>
The end-to-end principle strikes again: lowest-common-denominator abstractions like filesystems are often incorrect, inefficient, or both for complex applications, and ultimately must be bypassed by custom abstractions tailored to the application.
We're actually facing an issue with our Ceph infrastructure in the 'upgrade' from FileStore to BlueStore: the loss of use of our SSDs.<p>We created our infrastructure with a bunch of hardware that had HDDs for bulk storage and an SSD for async I/O and intent-log duty.<p>The problem is that BlueStore does not seem to have any use for off-to-the-side SSDs AFA(we)CT. So we're left with a bunch of hardware that may not be as performant under the new BlueStore world order.<p>The Ceph mailing list consensus seems to be "don't buy SSDs, but rather buy more spindles for more independent OSDs". That's fine for future purchases, but we have a whole bunch of gear designed for the Old Way of doing things. We could leave things be and continue using FileStore, but it seems the Path Forward is BlueStore.<p>Some of us do not need the speed of an all-SSD setup, but perhaps want something a little faster than HDDs alone. We're running benchmarks now to see how much worse the latency is with BlueStore and no SSD, and whether that latency is good enough for us as-is.<p>Any new storage design that cannot handle a 'hybrid' configuration combining HDDs and SSDs is silly IMHO.<p>I joked that we could tie the HDDs together using a ZFS zvol, with the SSD as the ZIL, and point the OSD(s) there.
I have sympathy with, and am open-minded to, the conclusions of this article, even as a die-hard true believer in the filesystem (esp. ZFS) as a useful foundational building block.<p>However, I hope that these conclusions do not lead to the intentional deprecation of support for filesystems in projects like Ceph. If a non-filesystem backing store is superior, then by all means use it, but I hope the ability to deploy a filesystem-backed endpoint will be retained.<p>In a pinch, it's very flexible and there are a lot of them lying around ...
> “For its next-generation backend, the Ceph community is exploring techniques that reduce the CPU consumption, such as minimizing data serialization-deserialization, and using the SeaStar framework with a shared-nothing model…”<p>Seastar's httpd throughput, as mentioned on their site: with between 5 and 10 CPUs it can achieve 2,000,000 HTTP requests/sec. Just wow. But if you look at the HTTP performance data at the URL below, running a similar configuration in the cloud (AWS etc.) looks costly.<p><a href="http://seastar.io/http-performance/" rel="nofollow">http://seastar.io/http-performance/</a><p>I wonder what the cost of achieving similar performance on a Hadoop stack would be.
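To make "shared-nothing" concrete, here's a minimal thread-per-core sketch in the spirit of Seastar's hello-world tutorial (my own illustration, not code from the paper or from Ceph; the headers and option names are from memory and may vary between Seastar releases). Each shard is one thread pinned to a core that owns its own memory, so cross-core work is handed over by explicit message passing instead of shared locks.<p>
    // Illustrative Seastar-style sketch; link against the Seastar library.
    #include <seastar/core/app-template.hh>
    #include <seastar/core/reactor.hh>
    #include <seastar/core/smp.hh>
    #include <iostream>

    int main(int argc, char** argv) {
        seastar::app_template app;
        return app.run(argc, argv, [] {
            std::cout << "running on " << seastar::smp::count << " shards\n";
            // Hand a task to shard 0's reactor. No lock is taken: the closure
            // is queued onto that shard and runs on its pinned thread,
            // touching only that shard's own data.
            return seastar::smp::submit_to(0, [] {
                std::cout << "hello from shard 0\n";
            });
        });
    }
<p>If I remember the option right, the shard count is chosen at run time with the standard --smp N flag handled by app_template, rather than in code.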
This isn't surprising, but I guess the results need to be put into perspective with the use case for distributed file systems and NFS, e.g. reasonable scaling for static asset serving with excellent modularity, in particular when paired with node-local caches. Of course Ceph etc. won't scale to Google Search or Facebook levels, but it's still damn practical if you're scaling out from a single HTTP server to a load-balanced cluster of them without having to bring in a whole new I/O infrastructure. And it helps against cloud vendor lock-in as well; for example you can use CephFS on DO, OVH, and other providers.
Slightly off topic, but I love Adrian Colyer's blog. Since I never pursued graduate studies in CS I never really got into reading research papers, but I would love to start reading some on my commute to work.<p>Does anyone have any recommendations for finding interesting papers? Do I need to buy subscriptions? Is there a list of “recommended” papers to read, like we have with programming literature, e.g. <i>The Pragmatic Programmer</i>?
Don't most of their problems go away when you fallocate a pile of space and use AIO+O_DIRECT like a database does, to get the buffer cache and most of the filesystem out of the way?<p>CoW filesystems like BTRFS provide ioctls to disable CoW as well, which would be useful here once you've grown your own.<p>XFS has supported shutting off metadata updates like ctime/mtime for ages.<p>If you jump through some hoops, with a fully allocated file, you can get a file on a filesystem to behave very much like a bare block store.
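For reference, a rough sketch of that recipe (my own illustration: the path and sizes are made up, error handling is minimal, and the async-submission half via libaio or io_uring is left out to keep it short): preallocate the file, flag it NOCOW on btrfs, and do aligned I/O through O_DIRECT.<p>
    // Preallocated file + O_DIRECT (+ btrfs NOCOW), used like a raw block store.
    #include <fcntl.h>
    #include <linux/fs.h>    // FS_IOC_GETFLAGS / FS_IOC_SETFLAGS, FS_NOCOW_FL
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    int main() {
        // O_DIRECT keeps the kernel page cache out of the data path.
        int fd = open("/data/store.img", O_RDWR | O_CREAT | O_DIRECT, 0600);
        if (fd < 0) { perror("open"); return 1; }

        // On btrfs, disable copy-on-write while the file is still empty;
        // best-effort ioctl that other filesystems will refuse or ignore.
        int attrs = 0;
        if (ioctl(fd, FS_IOC_GETFLAGS, &attrs) == 0) {
            attrs |= FS_NOCOW_FL;
            ioctl(fd, FS_IOC_SETFLAGS, &attrs);
        }

        // Preallocate the whole extent up front so later writes never hit
        // the filesystem's allocator.
        const off_t size = 1ll << 30;            // 1 GiB, arbitrary
        if (fallocate(fd, 0, 0, size) != 0) { perror("fallocate"); return 1; }

        // O_DIRECT wants sector-aligned buffers, offsets, and lengths.
        void* buf = nullptr;
        if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return 1; }
        memset(buf, 0xAB, 4096);
        if (pwrite(fd, buf, 4096, 0) != 4096) { perror("pwrite"); return 1; }

        free(buf);
        close(fd);
        return 0;
    }
<p>A real backend would drive this fd with libaio or io_uring instead of blocking pwrite calls, which is more or less what database engines already do on top of a filesystem.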
I use ext4 with my distributed async-to-async JSON database: <a href="https://github.com/tinspin/rupy/wiki/Storage" rel="nofollow">https://github.com/tinspin/rupy/wiki/Storage</a><p>You can try it here: <a href="http://root.rupy.se" rel="nofollow">http://root.rupy.se</a><p>The actual syncing is done over HTTP with Java, though, so maybe that's why it works well for me.
I don't think the conventional wisdom of building on top of filesystems really exists. In distributed systems you always naturally gravitate towards using raw storage devices instead of filesystems; it becomes obvious very early on that filesystems suck too much and only create problems. And it's the same with all the embedded database libraries: you really want to write your own, because none of the existing ones were made to address the performance and operational problems that arise even in small distributed systems. But early on you don't yet know most of those problems and don't want to invest time implementing something you don't yet understand well enough, so you end up building on top of filesystems and embedded databases, making plenty of poor choices, and learning from your mistakes.