At $oldjob, where I took care of a busy and successful web estate now close to 25 years old, one of the ugliest and longest-standing warts was the "image store". That was a simple, flat directory on a single node, shared over NFS, which had accumulated more than 1.2 million (yes, 1_200_000) inodes/files in a single directory. No one wanted (read: dared) to properly fix the code to rid it of this once-convenient assumption and (lack of) hierarchy, so I tried to work around the ever-growing pain by tuning the filesystem, the host's kernel (a buffed dentry cache goes a long way toward making this kind of abuse more tolerable, for instance), and the NFS server and clients involved to mitigate emerging problems as much as possible. In the end, a readdirplus()/getdents()-chain invoked over NFS only took a few seconds to finish. It's pretty amazing what Linux can cope with, if you make it have to :)
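For illustration only, one commonly cited knob in that area is vm.vfs_cache_pressure, which controls how willing the kernel is to reclaim dentry/inode caches; a minimal sketch of setting it (the value 50 is an example, not a recommendation, and this needs root):

    # Sketch: lower vm.vfs_cache_pressure so the kernel holds on to
    # dentry/inode cache entries longer, which helps repeated lookups
    # in huge directories. Value and approach are illustrative only.
    from pathlib import Path

    knob = Path("/proc/sys/vm/vfs_cache_pressure")
    knob.write_text("50\n")
    print("vfs_cache_pressure =", knob.read_text().strip())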
As a consumer of Android devices, one thing that's super annoying is that all pictures you take with the camera are stored in a single massive folder.

When you connect to the device from your computer over USB (which usually runs at USB 2.0 speeds even over USB-C, except on major 2021+ models), it takes forever to enumerate the files in that folder. Once you start copying, there's a big chance it hangs, so you have to disconnect the device and go through that painful process again.

(I know, I'm old school; I don't have automatic cloud backup enabled.)
Bruce Dawson often finds issues in software caused by inefficient algorithms, and sometimes they come down to slow enumeration of large folders. Example: Windows Voice Recorder enumerates the files in a folder on startup in an interaction-blocking way.

https://randomascii.wordpress.com/2022/09/29/why-modern-software-is-slow-windows-voice-recorder/
A bit tangential, but recently my partner's Google Drive had around 3,000 files in the root folder (created mostly by Google Classroom), which meant the iPad's Files app couldn't show them all, because for some reason it caps the listing at 500 files.

So naturally the next step was to try to clean out the directory.
We tried the web interface first: deleting in chunks of 300 files consumed around 8 GB of RAM, and it was slow as hell (her laptop is a bit old).
I moved to my desktop, where selecting 500 files consumed ~10 GB of RAM; it was still slow.

I thought of using Google Colab to access the Drive as a filesystem, but no dice there either because the Google account wasn't managed by her.

In the end, we tried the iPad app: it took about 8 minutes just to select all the files, and deleting them took about an hour; I imagine the deletions were submitted in batches.

It was stupidly painful.
Perhaps this is one of the reasons I've seen Linux deployments use XFS (e.g. AWS). If you page through the filesystem documentation, you'll see that once a directory grows past a certain size, XFS switches over to using B+ trees, much like an RDBMS would.

https://www.kernel.org/pub/linux/utils/fs/xfs/docs/xfs_filesystem_structure.pdf (section 16.2, PDF page 127)
I worked at a place where the directory was split based on a character or two from the start of a hash.

They had millions of profile images but didn't want them all in one directory, so they hashed the profile ID and used the first two characters of the hash as the name of a sub-directory. So you end up with sub-directories called aa, ab, ac, ad, etc.

It's not perfect, but I suppose the original creator had seen issues in the past when directories hold too many files.
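A minimal sketch of that sharding scheme (the base directory, the choice of MD5, and the two-character prefix are illustrative assumptions, not the original implementation):

    import hashlib
    from pathlib import Path

    def shard_path(base_dir: str, profile_id: str) -> Path:
        # Hash the profile ID and use the first two hex characters of the
        # digest as the sub-directory, giving 256 shards that each stay small.
        digest = hashlib.md5(profile_id.encode("utf-8")).hexdigest()
        return Path(base_dir) / digest[:2] / f"{digest}.jpg"

    path = shard_path("/var/images", "user-12345")
    path.parent.mkdir(parents=True, exist_ok=True)  # create the shard on demand
    print(path)  # e.g. /var/images/3f/3f5e...jpg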
Developers can do a lot to fix this by simply choosing SQLite to store all the local things.

Performing backups of our production apps used to take hours (especially in cheap clouds) because of all the loose files. Today it takes about 3-5 minutes, since there are just a handful of consolidated files to worry about.
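A minimal sketch of what that can look like, assuming a single SQLite file in place of a directory of small files (table name and helper functions are hypothetical):

    import sqlite3

    conn = sqlite3.connect("blobs.db")
    conn.execute("CREATE TABLE IF NOT EXISTS blobs (key TEXT PRIMARY KEY, data BLOB)")

    def put(key: str, data: bytes) -> None:
        # One row per object instead of one file per object.
        conn.execute("INSERT OR REPLACE INTO blobs (key, data) VALUES (?, ?)", (key, data))
        conn.commit()

    def get(key: str) -> bytes | None:
        row = conn.execute("SELECT data FROM blobs WHERE key = ?", (key,)).fetchone()
        return row[0] if row else None

    put("thumbs/42.jpg", b"...jpeg bytes...")
    print(len(get("thumbs/42.jpg") or b""))

Backing up then means copying one database file (plus its WAL, if enabled) instead of walking millions of inodes.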
This is an issue that is pretty common when auto-generating files.

For instance, when generating receipt PDFs it can feel natural to store them in folders by account ID. Except there will be a bunch of accounts generating 20 or 30 receipts a day, which isn't much on the face of it. But within months it becomes a pain to list receipts across accounts, and within a year or two even individual accounts' receipts become a nightmare to list; fixing the situation then requires a few tricks to work around all the tools that assume a directory listing costs nothing.
This is just another in a long list of problems that existing file systems have because they were built on an architecture created decades ago, before there was sufficient storage to hold more than a few hundred files in total.

I have been working on a new data manager that could replace existing file systems with something much better. You can store hundreds of millions of files in a single container (I call them pods instead of volumes) and put tags on everything. Folders can hold millions of files with virtually no degradation in performance. Searches to find subsets of files based on tags or other criteria are lightning fast.

The software has been in beta for about a year and is available for free download at www.Didgets.com, yet interest has been very moderate in spite of problems like the one discussed in this thread.

Demo video: https://www.youtube.com/watch?v=dWIo6sia_hw
So we found out the hard way that having MultiViews enabled in Apache, in your otherwise static folder full of image files, is a reeeeeally bad idea if that folder is filled by automation and contains millions of files. That was a *fun* support call. :) "Why is our site giving 500s after less than 10 minutes? What are all of those workers *doing*?"
Why is it actually so difficult for filesystems to deal with such folders? I mean, a million is not such a large number, not for a computer anyhow. A table with a million rows doesn't generally cause an RDBMS to choke, so why should filesystems?
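One way to get a feel for the actual cost is to measure it; a rough, hypothetical benchmark sketch (absolute numbers depend entirely on the filesystem and hardware, this only shows the shape of the curve):

    import os
    import tempfile
    import time

    def bench(n_files: int) -> None:
        with tempfile.TemporaryDirectory() as d:
            start = time.perf_counter()
            for i in range(n_files):
                open(os.path.join(d, f"f{i:07d}"), "w").close()
            create_s = time.perf_counter() - start

            start = time.perf_counter()
            count = sum(1 for _ in os.scandir(d))  # lazy enumeration, no sorting
            list_s = time.perf_counter() - start

            print(f"{n_files:>7} files: create {create_s:6.2f}s  list {list_s:6.2f}s  ({count} entries)")

    for n in (1_000, 10_000, 100_000):
        bench(n)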
Back in the 2010s, in my first job, I saw a bug caused by too many files in one directory. I don't remember the exact details, but it was driving me crazy.

Basically, we wrote temporary files into one big directory and later printed them. And sometimes our code returned "File not found" errors, despite the fact that ls filename showed the file was there with correct permissions.

When I tried to cat filename from a shell, it caused the same error :) But if you created another file with a different name in the same directory, it worked correctly :) There was also free space on the disk, and the number of files was high, but not crazy high (a few hundred thousand, I think).

It turns out this particular filesystem (ext2 or ext3 with specific parameters, IIRC) can behave like that when there are too many similarly-named files in one directory: the filesystem keeps metadata containing hashes of the filenames, those hashes can collide, and it can only handle so many collisions before failing.

The solution was to remove the files after printing them, of course, so they wouldn't accumulate forever.
Meanwhile, on the OneDrive desktop client, the cost of some operations is proportional not to the number of files in the folder you're trying to open, but to the number of files in the whole filesystem. Your "root" folder can take a lot longer to load if /some/sub/folder has a ton of files in it.
Bombich.com is a great source of filesystem feature deep dives. He started by running a backup lab and helping people with corner cases and common failure modes. He created a GUI Mac app to guide people through the gnarly bits, and later contributed metadata patches to rsync, for which I am truly grateful.

Back in 2004, I played with file systems, metadata archives, and directories with 20,000+ files in them.

I learned that, at the time, GNU ls had a polynomial-time sort algorithm. I didn't dig into it as much as I should have, but there are sort algorithms that have already-sorted input as their worst case for runtime.
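The textbook example of that worst case is a quicksort that always picks the first element as its pivot; a toy sketch (not GNU ls's actual code) that degrades to quadratic time on already-sorted input:

    import sys
    import time

    sys.setrecursionlimit(10_000)  # recursion depth equals n on sorted input

    def naive_quicksort(items):
        # First-element pivot: on sorted input every partition is maximally
        # unbalanced, so the runtime degrades to O(n^2).
        if len(items) <= 1:
            return items
        pivot, rest = items[0], items[1:]
        left = [x for x in rest if x < pivot]
        right = [x for x in rest if x >= pivot]
        return naive_quicksort(left) + [pivot] + naive_quicksort(right)

    for n in (1_000, 2_000, 4_000):
        data = list(range(n))  # already sorted: the worst case here
        start = time.perf_counter()
        naive_quicksort(data)
        print(n, f"{time.perf_counter() - start:.3f}s")  # roughly quadruples as n doubles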
Way back in the 90s I read "News Need Not Be Slow" [1], about Usenet, and one of the issues that came up consistently was performance limits caused by the number of inodes and the filesystem. When I was tasked with setting up INN for my organization, I was able to get a DEC machine running OSF/1 with AdvFS, which at the time was a highly optimized filesystem, to more or less bypass the performance problems of UFS.

[1] http://www.collyer.net/who/geoff/newspaper.pdf
At previous job v.2, I changed an IoT system that received daily text files from remote devices from dumping millions of them into the top level of the /data partition to using a /data/YYYY/MM/unit_id/ structure.

The claim was that the original files needed to be kept for audits, even after database ingest.

Management didn't care, but I made the change because I wanted my terminal not to die if I accidentally typed "ls" in the wrong directory.
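A minimal sketch of that layout (the helper name and sample file name are made up for illustration):

    from datetime import datetime, timezone
    from pathlib import Path

    def ingest_dir(root: str, unit_id: str, received_at: datetime) -> Path:
        # Mirrors the /data/YYYY/MM/unit_id/ structure described above.
        return Path(root) / f"{received_at.year:04d}" / f"{received_at.month:02d}" / unit_id

    target = ingest_dir("/data", "unit-0042", datetime.now(timezone.utc))
    target.mkdir(parents=True, exist_ok=True)
    (target / "readings_120000.txt").write_text("temp=21.5\n")
    print(target)  # e.g. /data/2024/06/unit-0042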
> Adding a new file, for example, requires that the filesystem compare the new item name to the name of every other file in the folder to check for conflicts

Is this true? I would've assumed that filesystems have smarter ways to find a file in a folder than doing a linear search through every entry.

That doesn't take away from the post: those smarter data structures and algorithms will still get slower with more entries, just not linearly so.
> Adding a new file, for example, requires that the filesystem compare the new item name to the name of every other file in the folder to check for conflicts, so trivial tasks like that will take progressively longer as the file count increases.

Not necessarily. Any file system worth its salt uses B-trees or hash tables, where filename existence can be checked in O(log n) or O(1) time respectively.
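A toy illustration of those lookup strategies (not how any real filesystem lays out its directories): a linear scan versus a sorted index standing in for a B-tree versus a hash table.

    from bisect import bisect_left

    names = [f"file{i:07d}.jpg" for i in range(1_000_000)]

    def exists_linear(name: str) -> bool:
        return any(n == name for n in names)          # O(n): compare every entry

    sorted_names = sorted(names)
    def exists_btree_like(name: str) -> bool:
        i = bisect_left(sorted_names, name)           # O(log n): binary search
        return i < len(sorted_names) and sorted_names[i] == name

    name_set = set(names)
    def exists_hashed(name: str) -> bool:
        return name in name_set                       # O(1) average: hash lookup

    target = "file0999999.jpg"
    print(exists_linear(target), exists_btree_like(target), exists_hashed(target))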
"Some of these can safely be deleted if you find crazy-high file counts."<p>It would be nice to know which of these library folders can be cleared out.
I have worked on a filesystem with 225 million files in a single directory, hosted on ZFS. Operations like `ls` took a minute or so to execute, but creating or accessing a file in this directory was almost the same speed as in an empty directory. I later replaced this filesystem with an sqlite database, which provided a pretty great speedup in table-scan type operations.
> Adding a new file, for example, requires that the filesystem compare the new item name to the name of every other file in the folder to check for conflicts

Tell me you don't know how data structures work without telling me you don't know how data structures work.
It's like they never heard of using a tree structure for filename storage.
NTFS used to (at some point this century) take O(n) time to give the user control in the File Save dialog, but they fixed that at some point.
For an app I'm building, I need to store ~1M JSON files of 10 KB each per day, which I then upload as Parquet to S3. What is the alternative to using the filesystem here? Should I put them in a database?
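One possible approach, sketched under the assumption that records can be buffered and written out as one Parquet file per batch (pyarrow assumed installed; the batch size and file naming are made up):

    import time
    import pyarrow as pa
    import pyarrow.parquet as pq

    BATCH_SIZE = 50_000
    buffer: list[dict] = []

    def add_record(record: dict) -> None:
        # Buffer records in memory instead of writing one small file each.
        buffer.append(record)
        if len(buffer) >= BATCH_SIZE:
            flush()

    def flush() -> None:
        if not buffer:
            return
        table = pa.Table.from_pylist(buffer)  # columnar conversion
        pq.write_table(table, f"batch-{int(time.time())}.parquet", compression="zstd")
        buffer.clear()

    for i in range(3):
        add_record({"device": f"dev-{i}", "value": i * 1.5})
    flush()

A local SQLite table used as the staging buffer (as other comments suggest) would also survive crashes better than an in-memory list.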
Oh, we ran into this years ago. Using the RedGate libraries to compare SQL database structure (and data) generates temporary files in a folder.

A lot of weird shit starts happening once that folder hits about 10 million files.
Obligatory: https://www.sqlite.org/fasterthanfs.html

"SQLite reads and writes small blobs (for example, thumbnail images) 35% faster¹ than the same blobs can be read from or written to individual files on disk using fread() or fwrite().

"Furthermore, a single SQLite database holding 10-kilobyte blobs uses about 20% less disk space than storing the blobs in individual files..."