From the blog post:<p>> We looked at this issue earlier. Fundamentally the tension here is that copy-on-write semantics don’t fit with the emerging zone interface semantics.<p>While the paper writes:<p>> It is not surprising that attempts to modify production file systems, such as XFS and ext4, to work with the zone interface have so far been unsuccessful [19, 68], primarily because these are overwrite file systems, whereas the zone interface requires a copy-on-write approach to data management.<p>These seem to contradict each other: the blog says copy-on-write doesn't fit the zone interface, while the paper says the zone interface requires copy-on-write. I'd side with the original paper.
I ran a ~0.5 PB Ceph cluster for a few years, on quite old spinning-disk hardware (bought second-hand). It was great: it just worked, coped very well with hardware failures, and told the operator what was happening. An extremely solid, well-engineered system. My thanks to the Ceph team :)
I was just discussing with a colleague how technology accretes and how no one reevaluates high-level design decisions even after every single factor leading to those decisions has changed.<p>It's weird that basic filesystems today are so out of touch with modern realities that we are <i>universally</i> forced to resort to complex databases even in cases where the logical model of files and directories fits the storage needs really well.<p>It's weird that hierarchical storage is the only universal model available on all OSes and in all languages.<p>The more I think about it, the more I realize that we live in a bizarro world where software runs everything, yet makes little to no sense from a human, modern-hardware, or system-design perspective.
The link to the paper in the article requires an ACM subscription. Here is a link to the version hosted by the authors:<p><a href="https://www.pdl.cmu.edu/PDL-FTP/Storage/ceph-exp-sosp19.pdf" rel="nofollow">https://www.pdl.cmu.edu/PDL-FTP/Storage/ceph-exp-sosp19.pdf</a>
For anyone looking for more information and benchmarks on the performance improvements in recent versions of Ceph (and with Bluestore in particular), here's a write-up that was done as part of testing for infrastructure to support the Human Brain Project: <a href="https://www.stackhpc.com/ceph-on-the-brain-a-year-with-the-human-brain-project.html" rel="nofollow">https://www.stackhpc.com/ceph-on-the-brain-a-year-with-the-h...</a>
The end-to-end principle strikes again: lowest-common-denominator abstractions like filesystems are often incorrect, inefficient, or both for complex applications, and ultimately must be bypassed by custom abstractions tailored to the application.
We're actually facing an issue with our Ceph infrastructure in the 'upgrade' from FileStore to BlueStore: the loss of use of our SSDs.<p>We created our infrastructure with a bunch of hardware that had HDDs for bulk storage and an SSD for async I/O and intent-log duty.<p>The problem is that BlueStore does not seem to have any use for off-to-the-side SSDs AFA(we)CT. So we're left with a bunch of hardware that may not be as performant under the new BlueStore world order.<p>The Ceph mailing list consensus seems to be "don't buy SSDs, but rather buy more spindles for more independent OSDs". That's fine for future purchases, but we have a whole bunch of gear designed for the Old Way of doing things. We could leave things be and continue using FileStore, but it seems the Path Forward is BlueStore.<p>Some of us do not need the speed of an all-SSD setup, but perhaps want something a little faster than HDDs alone. We're running benchmarks now to see how much worse the latency is with BlueStore and no SSD, and whether that latency is good enough for us as-is.<p>Any new storage design that cannot handle a 'hybrid' configuration combining HDDs and SSDs is silly IMHO.<p>I joked that we could tie the HDDs together using a ZFS zvol, with the SSD as the ZIL, and point the OSD(s) there.
I have sympathy with, and am open-minded to, the conclusions of this article, even as a die-hard true believer in the filesystem (esp. ZFS) as a useful foundational building block.<p>However, I hope that these conclusions do not lead to the intentional deprecation of support for filesystems in projects like Ceph. If a non-filesystem backing store is superior, then by all means use it, but I hope the ability to deploy a filesystem-backed endpoint will be retained.<p>In a pinch, it's very flexible and there are a lot of them lying around ...
> “For its next-generation backend, the Ceph community is exploring techniques that reduce the CPU consumption, such as minimizing data serialization-deserialization, and using the SeaStar framework with a shared-nothing model…”<p>Seastar's httpd throughput, as mentioned on their site: with between 5 and 10 CPUs it can achieve 2,000,000 HTTP requests/sec. Just wow. But if you look at the HTTP performance data at the URL below, running a similar configuration in the cloud (AWS etc.) looks costly.<p><a href="http://seastar.io/http-performance/" rel="nofollow">http://seastar.io/http-performance/</a><p>I wonder what the cost of achieving similar performance on a Hadoop stack would be.
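To make "shared-nothing" concrete, here's a minimal thread-per-core sketch in the spirit of Seastar's hello-world tutorial (my own illustration, not code from the paper or from Ceph; the headers and option names are from memory and may vary between Seastar releases). Each shard is one thread pinned to a core that owns its own memory, so cross-core work is handed over by explicit message passing instead of shared locks.<p>
    // Illustrative Seastar-style sketch; link against the Seastar library.
    #include <seastar/core/app-template.hh>
    #include <seastar/core/reactor.hh>
    #include <seastar/core/smp.hh>
    #include <iostream>

    int main(int argc, char** argv) {
        seastar::app_template app;
        return app.run(argc, argv, [] {
            std::cout << "running on " << seastar::smp::count << " shards\n";
            // Hand a task to shard 0's reactor. No lock is taken: the closure
            // is queued onto that shard and runs on its pinned thread,
            // touching only that shard's own data.
            return seastar::smp::submit_to(0, [] {
                std::cout << "hello from shard 0\n";
            });
        });
    }
<p>If I remember the option right, the shard count is chosen at run time with the standard --smp N flag handled by app_template, rather than in code.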
This isn't surprising, but I guess the results need to be put into perspective with the use case for distributed file systems and NFS, e.g. reasonable scaling for static asset serving with excellent modularity, in particular when paired with node-local caches. Of course Ceph etc. won't scale to Google Search or Facebook levels, but it's still damn practical if you're scaling out from a single HTTP server to a load-balanced cluster of them without having to bring in a whole new I/O infrastructure. And it helps against cloud vendor lock-in as well; for example you can use CephFS on DO, OVH, and other providers.
Slightly off topic, but I love Adrian Colyer's blog. Since I never pursued graduate studies in CS I never really got into reading research papers, but I would love to start reading some on my commute to work.<p>Does anyone have any recommendations for finding interesting papers? Do I need to buy subscriptions? Is there a list of “recommended” papers to read, like we have with programming literature, e.g. <i>The Pragmatic Programmer</i>?
Don't most of their problems go away when you fallocate a pile of space and use AIO+O_DIRECT like a database does, to get the buffer cache and most of the filesystem out of the way?<p>CoW filesystems like BTRFS provide ioctls to disable CoW as well, which would be useful here once you've grown your own.<p>XFS has supported shutting off metadata updates like ctime/mtime for ages.<p>If you jump through some hoops, with a fully allocated file, you can get a file on a filesystem to behave very much like a bare block store.
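For reference, a rough sketch of that recipe (my own illustration: the path and sizes are made up, error handling is minimal, and the async-submission half via libaio or io_uring is left out to keep it short): preallocate the file, flag it NOCOW on btrfs, and do aligned I/O through O_DIRECT.<p>
    // Preallocated file + O_DIRECT (+ btrfs NOCOW), used like a raw block store.
    #include <fcntl.h>
    #include <linux/fs.h>    // FS_IOC_GETFLAGS / FS_IOC_SETFLAGS, FS_NOCOW_FL
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    int main() {
        // O_DIRECT keeps the kernel page cache out of the data path.
        int fd = open("/data/store.img", O_RDWR | O_CREAT | O_DIRECT, 0600);
        if (fd < 0) { perror("open"); return 1; }

        // On btrfs, disable copy-on-write while the file is still empty;
        // best-effort ioctl that other filesystems will refuse or ignore.
        int attrs = 0;
        if (ioctl(fd, FS_IOC_GETFLAGS, &attrs) == 0) {
            attrs |= FS_NOCOW_FL;
            ioctl(fd, FS_IOC_SETFLAGS, &attrs);
        }

        // Preallocate the whole extent up front so later writes never hit
        // the filesystem's allocator.
        const off_t size = 1ll << 30;            // 1 GiB, arbitrary
        if (fallocate(fd, 0, 0, size) != 0) { perror("fallocate"); return 1; }

        // O_DIRECT wants sector-aligned buffers, offsets, and lengths.
        void* buf = nullptr;
        if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return 1; }
        memset(buf, 0xAB, 4096);
        if (pwrite(fd, buf, 4096, 0) != 4096) { perror("pwrite"); return 1; }

        free(buf);
        close(fd);
        return 0;
    }
<p>A real backend would drive this fd with libaio or io_uring instead of blocking pwrite calls, which is more or less what database engines already do on top of a filesystem.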
I use ext4 with my distributed async-to-async JSON database: <a href="https://github.com/tinspin/rupy/wiki/Storage" rel="nofollow">https://github.com/tinspin/rupy/wiki/Storage</a><p>You can try it here: <a href="http://root.rupy.se" rel="nofollow">http://root.rupy.se</a><p>The actual syncing is done over HTTP with Java, though, so maybe that's why it works well for me.
I don't think the conventional wisdom of building on top of filesystems really exists. In distributed systems you always naturally gravitate towards using raw storage devices instead of filesystems; it becomes obvious very early on that filesystems suck too much and only create problems. And it's the same with all the embedded database libraries: you really want to write your own, because none of the existing ones were made to address the performance and operational problems that arise even in small distributed systems. But early on you don't yet know most of those problems and don't want to invest time implementing something you don't yet understand well enough, so you end up building on top of filesystems and embedded databases, making plenty of poor choices, and learning from your mistakes.