
Fire-Flyer File System (3FS)

361 points · by wenyuanyu · 3 months ago

17 comments

ammo1662 · 3 months ago
For those who are interested, the design was originally published here (in Chinese): https://www.high-flyer.cn/blog/3fs/

This file system has been developed and used by them for several years.

Compared to traditional file systems, it is focused on model training workloads that involve a lot of random reads. Read caching and prefetching are useless in this case, so they designed the file system without those features to improve performance.

I Google-translated some key parts here:

3FS is a special file system because it is almost only used for batch-reading sample data on compute nodes during AI training, accelerating model training through high-speed interaction between computation and storage. This is a large-scale random-read workload, and the data read will not be reused soon afterwards, so we cannot use the most important tool, the read cache, to optimize file reads; even read-ahead is useless. The implementation of 3FS is therefore quite different from other file systems.

Specifically, as shown in the figure above, 3FS uses the Linux AIO and io_uring interfaces to read samples. In the 3FS scenario the file cache has no effect at all, and would only consume system memory in a way that is hard for users to control, affecting subsequent tasks, so we turned the file cache off and read data only in Direct I/O mode. Note, however, that when reading this way the buffer pointer, offset, and length all need to be aligned. If users were left to do this alignment themselves, it would introduce extra memory copies, so we do the alignment inside the file system, which both improves performance and is more convenient for users.
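The alignment point in the translated passage is easy to see with a plain Direct I/O read on Linux. The sketch below is a generic illustration only, not 3FS code; the file name sample.bin and the 4096-byte alignment are assumptions:

```cpp
// Minimal sketch of a Direct I/O read on Linux (illustration, not 3FS code).
// O_DIRECT bypasses the page cache, but requires the buffer address, file
// offset, and read length to be aligned (4096 bytes assumed here).
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main() {
    constexpr size_t kAlign = 4096;        // assumed logical block size
    constexpr size_t kLen   = 1 << 20;     // 1 MiB, a multiple of kAlign
    const off_t offset      = 8 * kAlign;  // offset must also be aligned

    int fd = open("sample.bin", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void* buf = nullptr;                   // aligned buffer for O_DIRECT
    if (posix_memalign(&buf, kAlign, kLen) != 0) { close(fd); return 1; }

    ssize_t n = pread(fd, buf, kLen, offset);  // EINVAL if anything is misaligned
    if (n < 0) perror("pread");
    else std::printf("read %zd bytes without touching the page cache\n", n);

    free(buf);
    close(fd);
    return 0;
}
```

With O_DIRECT, a misaligned buffer, offset, or length makes the read fail, which is exactly the bookkeeping the translated post says 3FS hides inside the file system so users avoid an extra copy.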
codingwagie · 3 months ago
I think the difference between DeepSeek and OpenAI/Anthropic is the difference between practitioners and academics. Of course there is world-class talent at OpenAI. But there are also a lot of "I went to Harvard and want to work in AI" types, and those people simply don't have the technical exposure to even think of building something like this.
thohj4234234324 · 3 months ago
This is very humbling.

OpenAI et al. have also gone quite deep down the systems rabbit hole (e.g. Triton), but I can't think of anyone else (outside of Google/Facebook) who pays this much attention to these things.

Great work; I hope DeepSeek does even more awesome things going forward.
tetron · 3 months ago
I was curious how they get such performance with a FUSE-based design. It seems that they sort of cheat: FUSE is used to manage metadata, but to get high performance you have to link in the C++ client library and do all your reads and writes through that. So it isn't general purpose; you have to modify your application to take advantage of it. Still, that's a clever trick, and it makes me wonder if there's an LD_PRELOAD strategy that could generalize.
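For what it's worth, the LD_PRELOAD idea would look roughly like the sketch below, which is purely hypothetical and not part of 3FS: a preloaded shared object overrides open(), checks whether the path falls under an assumed /3fs/ prefix that would be routed to a native client library, and otherwise falls through to the real libc call. Reads and writes would need the same treatment.

```cpp
// Hypothetical LD_PRELOAD shim (not part of 3FS). Build sketch:
//   g++ -shared -fPIC shim.cpp -ldl -o libshim.so
// Run sketch: LD_PRELOAD=./libshim.so ./training_job
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <dlfcn.h>
#include <fcntl.h>
#include <cstdarg>
#include <cstring>

using open_fn = int (*)(const char*, int, ...);

extern "C" int open(const char* path, int flags, ...) {
    // Look up the real libc open() so non-intercepted paths keep working.
    static open_fn real_open =
        reinterpret_cast<open_fn>(dlsym(RTLD_NEXT, "open"));

    mode_t mode = 0;
    if (flags & O_CREAT) {  // a mode argument is only present with O_CREAT
        va_list ap;
        va_start(ap, flags);
        mode = static_cast<mode_t>(va_arg(ap, int));
        va_end(ap);
    }

    if (std::strncmp(path, "/3fs/", 5) == 0) {
        // A real shim would hand the path to the native client library here
        // and return a descriptor it tracks; this sketch just falls through.
    }
    return real_open(path, flags, mode);
}
```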
pella · 3 months ago
Related research paper (English, HTML): https://arxiv.org/html/2408.14158v2

arXiv:2408.14158v2 [cs.DC] 31 Aug 2024

"Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning"

Abstract:

"The rapid progress in Deep Learning (DL) and Large Language Models (LLMs) has exponentially increased demands of computational power and bandwidth. This, combined with the high costs of faster computing chips and interconnects, has significantly inflated High Performance Computing (HPC) construction costs. To address these challenges, we introduce the Fire-Flyer AI-HPC architecture, a synergistic hardware-software co-design framework and its best practices. For DL training, we deployed the Fire-Flyer 2 with 10,000 PCIe A100 GPUs, achieved performance approximating the DGX-A100 while reducing costs by half and energy consumption by 40%. We specifically engineered HFReduce to accelerate allreduce communication and implemented numerous measures to keep our Computation-Storage Integrated Network congestion-free. Through our software stack, including HaiScale, 3FS, and HAI-Platform, we achieved substantial scalability by overlapping computation and communication. Our system-oriented experience from DL training provides valuable insights to drive future advancements in AI-HPC."
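As a generic illustration of the "overlapping computation and communication" the abstract mentions (plain MPI here, not HFReduce): a non-blocking allreduce lets gradient reduction for one layer run in the background while the backward pass for another layer is still computing.

```cpp
// Sketch of overlapping an allreduce with ongoing computation using MPI-3's
// non-blocking collective. Plain MPI stands in for HFReduce; the layer sizes
// and the "backward pass" below are placeholders.
#include <mpi.h>
#include <vector>

// Placeholder for the backward pass of an earlier layer.
static void compute_layer1_backward(std::vector<float>& grads) {
    for (float& g : grads) g *= 0.5f;  // stand-in work
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    std::vector<float> grads_layer2(1 << 20, 1.0f);
    std::vector<float> grads_layer1(1 << 20, 1.0f);

    MPI_Request req;
    // Kick off the reduction of layer-2 gradients as soon as they exist...
    MPI_Iallreduce(MPI_IN_PLACE, grads_layer2.data(),
                   static_cast<int>(grads_layer2.size()),
                   MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD, &req);

    // ...and overlap it with the backward pass of the previous layer.
    compute_layer1_backward(grads_layer1);

    MPI_Wait(&req, MPI_STATUS_IGNORE);  // communication completes in the background
    MPI_Finalize();
    return 0;
}
```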
hintymad · 3 months ago
A distributed file system is hailed as one of the trickiest kinds of software to write, and we are usually advised not to write one from scratch (even on top of FUSE), let alone a highly optimized one. While a Silicon Valley company is holding its 100th meeting to align on god-knows-what, a team of fewer than 60 people has already come up with a production-grade, highly efficient parallel file system.

Have we in the valley companies lost touch?
jauntywundrkind · 3 months ago
Man, 6.6 TB/s across 180 nodes is roughly 300 Gbps per node, or about 37 GB/s.

That's with 14 unnamed SSDs per node. I wonder how this would scale with higher-end SSDs, moving from PCIe 4 to PCIe 5 or PCIe 6... and particularly whether one could scale down!
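Taking the quoted figures at face value, the per-node and per-drive arithmetic works out to roughly:

```latex
\frac{6.6\ \text{TB/s}}{180\ \text{nodes}} \approx 36.7\ \text{GB/s per node} \approx 293\ \text{Gbit/s},
\qquad
\frac{36.7\ \text{GB/s}}{14\ \text{SSDs}} \approx 2.6\ \text{GB/s per SSD}
```

That is about 2.6 GB/s of read throughput per SSD, well within what a single PCIe 4.0 NVMe drive can deliver.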
bee_rider · 3 months ago
They sure are productive.

What are we going to see tomorrow? DeepSeek OS or something?
yalogin · 3 months ago
It's not clear to me where and how the current popular systems fall short. Do they talk about it anywhere?

Also, what specifically are the data access patterns for training and inference that differ from traditional use cases?
budududuroiu · 3 months ago
Does anyone know if there's a benefit to porting this to an orchestrator like K8s? It's maybe overkill for training, but the KVCache might be useful when running multiple replicas for inference.
do_not_redeem · 3 months ago
Can someone convince me this isn't NIH syndrome? Why would you use this instead of SeaweedFS, Ceph, or MinIO?
whalesalad · 3 months ago
The throughput on those charts is pretty wild - multiple terabytes per second.
jeffbee · 3 months ago
Interesting that their GraySort result is CPU bound while they are using 3x more CPUs than the record holder from ten years ago.
brcmthrowaway · 3 months ago
What does Anthropic use?
rvz · 3 months ago
Once again, DeepSeek has hit a home run.

Can't wait to see what they release next. DeepSeek should be studied carefully.
WithinReason · 3 months ago
Why is this even necessary? Can you just shard your training set to the training nodes ahead of time instead?
pepsi-not-coke · 3 months ago
I love it. AWS EFS costs too much. The open source solutions are clunky. I'm hoping DS applied their ingenuity to this one, too. Can't wait to trial it.