For those who are interested, the design was originally published here:

(Chinese) https://www.high-flyer.cn/blog/3fs/

They have been developing and using this file system in production for several years.

Compared to traditional file systems, it is focused on model training workloads dominated by large random reads. Read caching and prefetching are useless in this case, so they designed the file system without those features to improve performance.

I Google-translated some key parts here:

3FS is a special file system because it is used almost exclusively for batch reads of sample data on compute nodes during AI training, accelerating model training through fast interaction between compute and storage. This is a large-scale random-read workload, and data that has been read will not be reused again soon, so we cannot rely on the most important tool, the read cache, to optimize file reads; even readahead is useless. As a result, the implementation of 3FS differs considerably from other file systems.

Specifically, as shown in the figure [in the blog post], 3FS uses the Linux AIO and io_uring interfaces to perform sample reads. In the 3FS scenario the file cache provides no benefit at all; it only consumes system memory in a way that is hard for users to control and affects subsequent jobs, so we disabled the file cache and read data exclusively in Direct I/O mode. Note that when reading this way, the buffer pointer, offset, and length must all be aligned. If users had to do this alignment themselves, extra memory copies would be generated, so we do the alignment inside the file system, which both improves performance and is more convenient for users.
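To see what that alignment requirement means in practice, here is a minimal sketch of a plain Linux Direct I/O read (not 3FS code; the 4096-byte block size is an assumption, real code should query the device). Without the file system handling it for you, buffer address, offset, and length all have to be multiples of the block size:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define ALIGN 4096  /* assumed block size; real code should query it */

    int main(int argc, char **argv) {
        if (argc < 2) return 1;

        /* O_DIRECT bypasses the page cache -- the "no file cache" idea
           described in the blog post. */
        int fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        /* Buffer address, offset, and length must all be multiples of ALIGN. */
        void *buf;
        if (posix_memalign(&buf, ALIGN, ALIGN) != 0) { close(fd); return 1; }

        ssize_t n = pread(fd, buf, ALIGN, 0);  /* offset 0 is trivially aligned */
        if (n < 0) perror("pread");
        else printf("read %zd bytes\n", n);

        free(buf);
        close(fd);
        return 0;
    }

Serving arbitrary user offsets this way normally means reading a rounded-out range and copying out the slice that was actually requested; pushing that bookkeeping into the file system is presumably what the post means by avoiding extra memory copies on the user side.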
I think the difference between DeepSeek and OpenAI/Anthropic is partly the difference between practitioners and academics. Of course there is world-class talent at OpenAI. But there are also a lot of "I went to Harvard and want to work in AI" types, and those people simply don't have the technical exposure to even think of building something like this.
This is very humbling.

OpenAI et al. have also gone pretty deep down the systems rabbit hole (e.g. Triton), but I can't think of anyone else (outside of Google/Facebook) who pays this much attention to these things.

Great work; I hope DeepSeek does even more awesome things going forward.
Was curious how they get such performance with a FUSE-based design. It seems that they sort of cheat: FUSE is used to manage metadata, but to get high performance you have to link in the C++ client library and do all your reads and writes through that. So it isn't general purpose; you have to modify your application to take advantage of it. Still, that's a clever trick, and it makes me wonder if there's an LD_PRELOAD strategy that could generalize.
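For what it's worth, the generic shape of that LD_PRELOAD idea would be the usual dlsym(RTLD_NEXT) interposition trick: intercept read() (and friends), detect fds that live under the 3FS mount, and hand those off to the native client instead of the FUSE path. A sketch, where the mount point and the hand-off to the client library are pure assumptions on my part:

    /* Build: gcc -shared -fPIC -o shim.so shim.c -ldl
       Use:   LD_PRELOAD=./shim.so ./your_app */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <limits.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define MOUNT_PREFIX "/3fs/"   /* assumed mount point, purely illustrative */

    static ssize_t (*real_read)(int, void *, size_t);

    /* Does this fd point at a file under the assumed 3FS mount? */
    static int on_3fs(int fd) {
        char link[64], path[PATH_MAX];
        snprintf(link, sizeof link, "/proc/self/fd/%d", fd);
        ssize_t n = readlink(link, path, sizeof path - 1);
        if (n <= 0) return 0;
        path[n] = '\0';
        return strncmp(path, MOUNT_PREFIX, strlen(MOUNT_PREFIX)) == 0;
    }

    ssize_t read(int fd, void *buf, size_t count) {
        if (!real_read)
            real_read = (ssize_t (*)(int, void *, size_t))dlsym(RTLD_NEXT, "read");

        if (on_3fs(fd)) {
            /* Here the request would go to the native client library instead
               of through the FUSE mount; that call is the hypothetical part,
               the rest is standard interposition. */
        }
        return real_read(fd, buf, count);
    }

The hard part wouldn't be the shim itself but covering the whole I/O surface (pread, readv, mmap, AIO/io_uring) and meeting whatever alignment or buffer requirements the native client has, which is probably why they ask you to link the library directly instead.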
Related research paper (English, HTML): https://arxiv.org/html/2408.14158v2

arXiv:2408.14158v2 [cs.DC] 31 Aug 2024

"Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning"

Abstract:

"The rapid progress in Deep Learning (DL) and Large Language Models (LLMs) has exponentially increased demands of computational power and bandwidth. This, combined with the high costs of faster computing chips and interconnects, has significantly inflated High Performance Computing (HPC) construction costs. To address these challenges, we introduce the Fire-Flyer AI-HPC architecture, a synergistic hardware-software co-design framework and its best practices. For DL training, we deployed the Fire-Flyer 2 with 10,000 PCIe A100 GPUs, achieved performance approximating the DGX-A100 while reducing costs by half and energy consumption by 40%. We specifically engineered HFReduce to accelerate allreduce communication and implemented numerous measures to keep our Computation-Storage Integrated Network congestion-free. Through our software stack, including HaiScale, 3FS, and HAI-Platform, we achieved substantial scalability by overlapping computation and communication. Our system-oriented experience from DL training provides valuable insights to drive future advancements in AI-HPC."
A distributed file system is regarded as one of the trickiest pieces of software to write, and we are usually advised not to write a file system from scratch (even on top of FUSE), let alone a highly optimized one. While a Silicon Valley company is holding its 100th meeting to align on god-knows-what, a team of fewer than 60 has already come up with a production-grade, highly efficient parallel file system.

Have we in the Valley companies lost touch?
Man, 6.6 TB/s across 180 nodes is about 37 GB/s per node, close to 300 Gbps.

That's with 14 unnamed SSDs per node. I wonder how this would scale with higher-end SSDs, going from PCIe 4 to PCIe 5 or PCIe 6... Particularly whether one could scale down!
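Back-of-envelope, taking those numbers at face value (just a sanity check on the math, not figures from the paper):

    #include <stdio.h>

    int main(void) {
        double total_gb_s = 6600.0;   /* 6.6 TB/s aggregate read throughput */
        int nodes = 180;
        int ssds_per_node = 14;

        double per_node = total_gb_s / nodes;        /* ~36.7 GB/s */
        double per_ssd  = per_node / ssds_per_node;  /* ~2.6 GB/s  */

        printf("per node: %.1f GB/s (%.0f Gbps)\n", per_node, per_node * 8.0);
        printf("per SSD : %.1f GB/s\n", per_ssd);
        return 0;
    }

So roughly 2.6 GB/s per drive if the load is even. If these are typical PCIe 4 NVMe parts, that's below their sequential ceiling, so the per-node limit may well be the network or random-read IOPS rather than raw drive bandwidth, which is also what makes the scale-down question interesting.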
It's not clear to me where and how the current popular systems fall short. Do they talk about it anywhere?

Also, what specifically are the data access patterns for training and inference that differ from traditional use cases?
Does anyone know if there's a benefit to porting this to an orchestrator like K8s? It's maybe overkill for training, but the KVCache might be useful when running multiple replicas for inference.
I love it. AWS EFS costs too much. The open source solutions are clunky. I'm hoping DS applied their ingenuity to this one, too. Can't wait to trial it.