
Ceph: A Journey to 1 TiB/s

411 points · by davidmr · over 1 year ago

23 comments

alberth · over 1 year ago
Ceph has an interesting history.

It was created at Dreamhost (DH), for their internal needs, by the founders.

DH was doing effectively IaaS & PaaS before those were industry-coined terms (VPS, managed OS/database/app servers).

They spun Ceph off and Red Hat bought it.

https://en.wikipedia.org/wiki/DreamHost
amadio · over 1 year ago
Nice article! We've also recently reached the mark of 1 TB/s at CERN, but with EOS (https://cern.ch/eos), not Ceph: https://www.home.cern/news/news/computing/exabyte-disk-storage-cern

Our EOS clusters have a lot more nodes, however, and use mostly HDDs. CERN also uses Ceph extensively.
stuff4ben · over 1 year ago
I used to love doing experiments like this. I was afforded that luxury as a tech lead back at Cisco, setting up Kubernetes on bare metal and getting to play with GlusterFS and Ceph just to learn and see which was better. This was back in 2017/2018 if I recall. Good ole days. Loved this writeup!
amluto · over 1 year ago
I wish someone would try to scale the nodes down. The system described here is ~300 W/node for 10 disks/node, so 30 W or so per disk. That's a fair amount of overhead, and it also requires quite a lot of storage to get any redundancy at all.

I bet some engineering effort could divide the whole thing by 10. Build a tiny SBC with 4 PCIe lanes for NVMe, 2x10GbE (as two SFP+ sockets), and a just-fast-enough ARM or RISC-V CPU. Perhaps an eMMC chip or SD slot for boot.

This could scale down to just a few nodes, and it reduces the exposure to a single failure taking out 10 disks at a time.

I bet a lot of copies of this system could fit in a 4U enclosure. Optionally, the same enclosure could contain two entirely independent switches to aggregate the internal nodes.
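
As a quick back-of-the-envelope sketch of the power argument above (the 300 W and 10-disk figures come from the comment; the 30 W budget for the hypothetical SBC node is an assumption for illustration only):

```python
# Per-disk power overhead of the cluster as described vs. a hypothetical
# single-disk SBC node. The SBC budget is an assumption, not a measurement.
node_power_w = 300        # ~300 W per node, per the comment
disks_per_node = 10

watts_per_disk = node_power_w / disks_per_node
print(f"Current design: ~{watts_per_disk:.0f} W of node power per disk")

sbc_power_w = 30          # assumed budget: small CPU + 2x SFP+ + one NVMe drive
print(f"Hypothetical SBC node: ~{sbc_power_w} W per disk, "
      f"and a failure takes out 1 disk instead of {disks_per_node}")
```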
chx · over 1 year ago
There was a point in history when the total amount of digital data stored worldwide reached 1 TiB for the first time. It is extremely likely that day was within the last sixty years.

And here we are moving that amount of data every second on the servers of a fairly random entity. We're not talking about a nation state or a supranational research effort.
kylegalbraith · over 1 year ago
This is a fascinating read. We run a Ceph storage cluster for persisting Docker layer cache [0]. We went from using EBS to Ceph and saw a massive difference in throughput: from 146 MB/s write throughput and 3,000 IOPS to 900 MB/s and 30,000 IOPS.

The best part is that it pretty much just works. Very little babysitting, with the exception of the occasional fstrim or something.

It's been a massive improvement for our caching system.

[0] https://depot.dev/blog/cache-v2-faster-builds
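
For context, a hedged sketch of how throughput/IOPS figures like these are typically gathered with fio; the mount point and workload parameters below are assumptions for illustration, not the benchmark the commenter actually ran:

```python
# Run a small random-write fio job against a mounted volume and report
# bandwidth/IOPS. Assumes fio is installed; /mnt/ceph-vol is a placeholder.
import subprocess

fio_cmd = [
    "fio",
    "--name=randwrite-test",
    "--filename=/mnt/ceph-vol/fio.test",  # hypothetical mount point
    "--ioengine=libaio",
    "--direct=1",           # bypass the page cache
    "--rw=randwrite",
    "--bs=4k",              # small blocks for an IOPS-oriented run
    "--iodepth=32",
    "--numjobs=4",
    "--size=2G",
    "--runtime=60",
    "--time_based",
    "--group_reporting",
]
subprocess.run(fio_cmd, check=True)
```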
MPSimmons · over 1 year ago
The worst problems I've had with in-cluster dynamic storage were never strictly IO related; they were more about the storage controller software in Kubernetes struggling with real-world problems like pods dying and the PVCs not attaching until very long timeouts expired, with the pod sitting in ContainerCreating until the PVC lock was freed.

This has happened in multiple clusters, using rook/ceph as well as Longhorn.
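
A rough diagnostic sketch for that failure mode (pod and namespace names are placeholders; assumes kubectl is pointed at the affected cluster):

```python
# Inspect why a pod is stuck in ContainerCreating waiting on its PVC.
import subprocess

pod, ns = "my-app-0", "default"   # hypothetical pod and namespace

# 1. The pod's events usually show FailedAttachVolume / FailedMount with the timeout.
subprocess.run(["kubectl", "describe", "pod", pod, "-n", ns], check=True)

# 2. VolumeAttachment objects show whether another node still holds the volume.
subprocess.run(["kubectl", "get", "volumeattachments"], check=True)

# 3. Recent events filtered to the pod, to watch the attach/detach retries.
subprocess.run(
    ["kubectl", "get", "events", "-n", ns,
     "--field-selector", f"involvedObject.name={pod}"],
    check=True,
)
```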
matheusmoreira · over 1 year ago
Does anyone have experience running ceph in a home lab? Last time I looked into it, there were quite significant hardware requirements.
mrb · over 1 year ago
I wanted to see how 1 TiB/s compares to the actual theoretical limits of the hardware. Here is what I found:

The cluster has 68 nodes, each a Dell PowerEdge R6615 (https://www.delltechnologies.com/asset/en-us/products/servers/technical-support/poweredge-r6615-technical-guide.pdf). The R6615 configuration they run is the one with 10 U.2 drive bays. The U.2 link carries data over 4 PCIe gen4 lanes. Each PCIe lane is capable of 16 Gbit/s. The lanes have negligible ~1.5% overhead thanks to 128b/130b encoding.

This means each U.2 link has a maximum link bandwidth of 16 * 4 = 64 Gbit/s or 8 Gbyte/s. However, the U.2 NVMe drives they use are Dell 15.36TB Enterprise NVMe Read Intensive AG, which appear to be capable of 7 Gbyte/s read throughput (https://www.serversupply.com/SSD%20W-TRAY/NVMe/15.36TB/DELL/182NW_356114.htm). So they are not bottlenecked by the U.2 link (8 Gbyte/s).

Each node has 10 U.2 drives, so each node can do local read I/O at a maximum of 10 * 7 = 70 Gbyte/s.

However, each node has a network bandwidth of only 200 Gbit/s (2 x 100GbE Mellanox ConnectX-6), which is only 25 Gbyte/s. This implies that remote reads are under-utilizing the drives (capable of 70 Gbyte/s). The network is the bottleneck.

Assuming no additional network bottlenecks (they don't describe the network architecture), this implies the 68 nodes can provide 68 * 25 = 1700 Gbyte/s of network reads. The author benchmarked 1 TiB/s, actually exactly 1025 GiB/s = 1101 Gbyte/s, which is 65% of the maximum theoretical 1700 Gbyte/s. That's pretty decent, but in theory it's still possible to do a bit better, assuming all nodes can truly saturate their 200 Gbit/s network links concurrently.

Reading this whole blog post, I got the impression Ceph's complexity hits the CPU pretty hard. Not compiling a module with -O2 ("Fix Three", linked by the author: https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1894453) reducing performance "up to 5x slower with some workloads" (https://bugs.gentoo.org/733316) is pretty unexpected for a pure I/O workload. Also, what's up with the OSD's threads causing excessive CPU waste grabbing the IOMMU spinlock? I agree with the conclusion that the OSD threading model is suboptimal. A relatively simple synthetic 100% read benchmark should not expose threading contention if that part of Ceph's software architecture were well designed (which is fixable, so I hope the Ceph devs prioritize it).
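
The arithmetic above, condensed into a short script (all figures are taken from the comment itself):

```python
# Theoretical per-drive, per-node, and cluster-wide read limits vs. the
# measured 1025 GiB/s, using the numbers quoted above.
pcie_lane_gbit = 16                     # PCIe gen4, per lane
u2_link_gbyte = pcie_lane_gbit * 4 / 8  # 4 lanes per U.2 link -> 8 GB/s

drive_read_gbyte = 7                    # quoted read throughput of the 15.36TB drive
node_local_read_gbyte = drive_read_gbyte * 10   # 10 drives -> 70 GB/s

node_nic_gbyte = 2 * 100 / 8            # 2x100GbE -> 25 GB/s (the bottleneck)
cluster_network_gbyte = 68 * node_nic_gbyte     # 1700 GB/s across 68 nodes

measured_gbyte = 1025 * 2**30 / 1e9     # 1025 GiB/s ~= 1101 GB/s

print(f"U.2 link limit:        {u2_link_gbyte:.0f} GB/s per drive slot")
print(f"Node-local read limit: {node_local_read_gbyte} GB/s")
print(f"Node network limit:    {node_nic_gbyte:.0f} GB/s")
print(f"Cluster network limit: {cluster_network_gbyte:.0f} GB/s")
print(f"Measured:              {measured_gbyte:.0f} GB/s "
      f"({measured_gbyte / cluster_network_gbyte:.0%} of the network ceiling)")
```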
kaliszad · over 1 year ago
What surprises me is why they went with harder-to-cool 1U nodes with 10 SSDs and 2x100Gb NICs instead of 2U nodes with 24 SSDs and 2x200 or even 400Gb NICs. They could remove the network bottleneck and save on power thanks to larger, lower-speed fans and fewer CPU packages (possibly with more cores per socket, though). Also, having a smaller number of nodes increases the blast radius, but even with 34 nodes that is probably not such a problem. And with fewer nodes they could have had a flatter network with 4 switches or so, too.
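
A small sketch of the drive-to-NIC balance for the chassis options being compared (the 7 GB/s per-drive read figure comes from the earlier comment; the configurations are the ones named here):

```python
# Aggregate SSD read bandwidth vs. NIC bandwidth per node for each option.
def node_balance(ssds, nic_gbit, per_ssd_gbyte=7):
    return ssds * per_ssd_gbyte, nic_gbit / 8

options = [
    ("1U, 10 SSD, 2x100GbE", 10, 200),
    ("2U, 24 SSD, 2x200GbE", 24, 400),
    ("2U, 24 SSD, 2x400GbE", 24, 800),
]
for label, ssds, nic in options:
    drive_gb, nic_gb = node_balance(ssds, nic)
    print(f"{label}: {drive_gb} GB/s of drives behind {nic_gb:.0f} GB/s of network "
          f"(ratio {drive_gb / nic_gb:.1f}:1)")
```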
mobilemidget · over 1 year ago
Cool benchmark, and interesting. However, it would have read a lot better if abbreviations were explained at first use; not everybody is familiar with all the terminology used in the post. Nonetheless, congrats on the results.
one_buggy_boi · over 1 year ago
Is modern Ceph appropriate for transactional database storage? How is the IO latency? I'd like to move to a cheaper clustered file system that can compete with systems like Oracle's clustered file system or DBs backed by something like Veritas. Veritas supports multi-petabyte DBs, and I haven't seen much outside of it or OCFS that similarly scales with acceptable latency.
louwrentius · over 1 year ago
I wrote an intro to Ceph [0] for those who are new to it.

It was featured briefly in a Jeff Geerling video recently :-)

[0]: Understanding Ceph: open-source scalable storage, https://louwrentius.com/understanding-ceph-open-source-scalable-storage.html
rafaelturk · over 1 year ago
I'm playing a lot with MicroCeph. It's an opinionated, low-TCO, friendly way to set up Ceph. Looking forward to additional comments. Planning to use it in production and replace lots of NAS servers.
louwrentius · over 1 year ago
Remember: random IOPS without latency is a meaningless figure.
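
One way to see why, via Little's law (throughput = outstanding requests / latency); the numbers are purely illustrative:

```python
# The same IOPS figure can hide wildly different latencies:
# a deep queue buys throughput at the cost of per-request latency.
def iops(queue_depth, latency_s):
    return queue_depth / latency_s

print(iops(queue_depth=1,   latency_s=0.0001))   # 10,000 IOPS at 100 us each
print(iops(queue_depth=256, latency_s=0.0256))   # 10,000 IOPS at 25.6 ms each
```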
francoismassot · over 1 year ago
Does anyone know how Ceph compares to other object storage engines like MinIO/Garage/...?

I would love to see some benchmarks there.
peter_d_sherman · over 1 year ago
Ceph is interesting... open source software whose only purpose is to implement a distributed file system...

Functionally, Linux implements a file system (well, several!) as well (in addition to many other OS features), but (usually!) only on top of local hardware.

There seems to be some missing software here, if we examine these two paradigms side by side.

For example, what if I want Linux (or more broadly, a general OS), but one that doesn't manage a local file system or local storage at all?

One that operates solely over the network, solely using a distributed file system that Ceph, or software like Ceph, would provide?

Conversely, what if I don't want to run a full OS on a network machine, a network node that manages its own local storage?

The only thing I can think of to solve those types of problems is: *what if the Linux filesystem were written as a completely separate piece of software, like a distributed file system such as Ceph, not dependent on the rest of the kernel source code* (although still compilable into the kernel, as most Linux components normally are)...

A lot of work? Probably!

But there seems to be a need for something between a purely distributed file system, as Ceph is, and a completely monolithic, "everything baked in" (but not distributed!) OS/kernel, as Linux is...

Note that I am just thinking aloud here; I am probably wrong and/or misinformed on one or more fronts!

So, kindly take this random "thinking aloud" post with the proverbial grain of salt! :-)
nghnam · over 1 year ago
My old company ran a public and private cloud with OpenStack and Ceph. We had 20 Supermicro storage nodes (24 disks per server) and a total capacity of 3PB. We learned some lessons; in particular, a single flapping disk degraded the performance of the whole system. The solution was to remove the disk with bad sectors as soon as possible.
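
A rough sketch of that workflow with the ceph CLI (the OSD id is a placeholder; assumes admin access to the cluster):

```python
# Spot a slow/flapping OSD by its latency, then mark it "out" so data
# rebalances away from the suspect disk before it is replaced.
import subprocess

# Per-OSD commit/apply latency; a failing disk usually stands out here.
subprocess.run(["ceph", "osd", "perf"], check=True)

# Take the suspect OSD out of data placement (osd.12 is a placeholder id).
subprocess.run(["ceph", "osd", "out", "osd.12"], check=True)
```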
einpoklum · over 1 year ago
Where can I read about the rationale for Ceph as a project? I'm not familiar with it.
brobinson · over 1 year ago
I'm curious what the performance difference would be on a modern kernel.
riku_iki · over 1 year ago
What router/switch would one use for such speeds?
hinkley · over 1 year ago
Sure would be nice if you defined some acronyms.
up2isomorphism · over 1 year ago
This is an insanely expensive cluster built to show a benchmark: a 68-node cluster serving only 15TB of storage in total.