At $oldjob, where I took care of a busy and successful web estate now close to 25 years old, one of the ugliest and longest-standing warts was the "image store". That was a simple, flat directory on a single node, shared over NFS, which had accumulated more than 1.2 million (yes, 1_200_000) inodes/files in a single directory. No one wanted (read: dared) to properly fix the code to rid it of this once-convenient assumption and (lack of) hierarchy, so I tried to work around the ever-growing pain by tuning the filesystem, the host's kernel (a buffed dentry cache goes a long way toward making this kind of abuse more tolerable, for instance), and the NFS server and clients involved to mitigate emerging problems as much as possible. In the end, a readdirplus()/getdents()-chain invoked over NFS only took a few seconds to finish. It's pretty amazing what Linux can cope with, if you make it have to :)
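For illustration only, one commonly cited knob in that area is vm.vfs_cache_pressure, which controls how willing the kernel is to reclaim dentry/inode caches; a minimal sketch of setting it (the value 50 is an example, not a recommendation, and this needs root):

    # Sketch: lower vm.vfs_cache_pressure so the kernel holds on to
    # dentry/inode cache entries longer, which helps repeated lookups
    # in huge directories. Value and approach are illustrative only.
    from pathlib import Path

    knob = Path("/proc/sys/vm/vfs_cache_pressure")
    knob.write_text("50\n")
    print("vfs_cache_pressure =", knob.read_text().strip())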
As a consumer of Android devices, one thing that's super annoying is that all pictures you take with the camera are stored in a single massive folder.

When you connect to the device from your computer over USB (which usually runs at USB 2.0 speeds even over USB-C, except on major 2021+ models), it takes forever to enumerate the files in that folder. Once you start copying, there's a big chance it hangs, so you have to disconnect the device and go through that painful process again.

(I know, I'm old school; I don't have automatic cloud backup enabled.)
Bruce Dawson often finds issues in software caused by inefficient algorithms, and sometimes they come down to slow enumeration of large folders. Example: Windows Voice Recorder enumerates the files in a folder on startup in an interaction-blocking way.

https://randomascii.wordpress.com/2022/09/29/why-modern-software-is-slow-windows-voice-recorder/
A bit tangential, but recently my partner's Google Drive had around 3,000 files in the root folder (created mostly by Google Classroom), which meant the iPad's Files app couldn't show them all, because for some reason it caps the listing at 500 files.

So naturally the next step was to try to clean out the directory.
We tried the web interface first: deleting in chunks of 300 files consumed around 8 GB of RAM, and it was slow as hell (her laptop is a bit old).
I moved to my desktop, where selecting 500 files consumed ~10 GB of RAM; it was still slow.

I thought of using Google Colab to access the Drive as a filesystem, but no dice there either because the Google account wasn't managed by her.

In the end, we tried the iPad app: it took about 8 minutes just to select all the files, and deleting them took about an hour; I imagine the deletions were submitted in batches.

It was stupidly painful.
Perhaps this is one of the reasons I've seen Linux deployments use XFS (e.g. AWS). If you page through the filesystem documentation, you'll see that once a directory grows past a certain size, XFS switches over to using B+ trees, much like an RDBMS would.

https://www.kernel.org/pub/linux/utils/fs/xfs/docs/xfs_filesystem_structure.pdf (section 16.2, PDF page 127)
I worked at a place where the directory was split based on a character or two from the start of a hash.

They had millions of profile images but didn't want them all in one directory, so they hashed the profile ID and used the first two characters of the hash as the name of a sub-directory. So you end up with sub-directories called aa, ab, ac, ad, etc.

It's not perfect, but I suppose the original creator had seen issues in the past when directories hold too many files.
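A minimal sketch of that sharding scheme (the base directory, the choice of MD5, and the two-character prefix are illustrative assumptions, not the original implementation):

    import hashlib
    from pathlib import Path

    def shard_path(base_dir: str, profile_id: str) -> Path:
        # Hash the profile ID and use the first two hex characters of the
        # digest as the sub-directory, giving 256 shards that each stay small.
        digest = hashlib.md5(profile_id.encode("utf-8")).hexdigest()
        return Path(base_dir) / digest[:2] / f"{digest}.jpg"

    path = shard_path("/var/images", "user-12345")
    path.parent.mkdir(parents=True, exist_ok=True)  # create the shard on demand
    print(path)  # e.g. /var/images/3f/3f5e...jpg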
Developers can do a lot to fix this by simply choosing SQLite to store all the local things.

Performing backups of our production apps used to take hours (especially in cheap clouds) because of all the loose files. Today it takes about 3-5 minutes, since there are just a handful of consolidated files to worry about.
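A minimal sketch of what that can look like, assuming a single SQLite file in place of a directory of small files (table name and helper functions are hypothetical):

    import sqlite3

    conn = sqlite3.connect("blobs.db")
    conn.execute("CREATE TABLE IF NOT EXISTS blobs (key TEXT PRIMARY KEY, data BLOB)")

    def put(key: str, data: bytes) -> None:
        # One row per object instead of one file per object.
        conn.execute("INSERT OR REPLACE INTO blobs (key, data) VALUES (?, ?)", (key, data))
        conn.commit()

    def get(key: str) -> bytes | None:
        row = conn.execute("SELECT data FROM blobs WHERE key = ?", (key,)).fetchone()
        return row[0] if row else None

    put("thumbs/42.jpg", b"...jpeg bytes...")
    print(len(get("thumbs/42.jpg") or b""))

Backing up then means copying one database file (plus its WAL, if enabled) instead of walking millions of inodes.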
This is an issue that is pretty common when auto-generating files.

For instance, when generating receipt PDFs it can feel natural to store them in folders by account ID. Except there will be a bunch of accounts generating 20 or 30 receipts a day, which isn't much on the face of it. But within months it becomes a pain to list receipts across accounts, and within a year or two even individual accounts' receipts become a nightmare to list; fixing the situation then requires a few tricks to work around all the tools that assume a directory listing costs nothing.
This is just another in a long list of problems that existing file systems have because they were built on an architecture created decades ago, before there was sufficient storage to hold more than a few hundred files in total.

I have been working on a new data manager that could replace existing file systems with something much better. You can store hundreds of millions of files in a single container (I call them pods instead of volumes) and put tags on everything. Folders can hold millions of files with virtually no degradation in performance. Searches to find subsets of files based on tags or other criteria are lightning fast.

The software has been in beta for about a year and is available for free download at www.Didgets.com, yet interest has been very moderate in spite of problems like the one discussed in this thread.

Demo video: https://www.youtube.com/watch?v=dWIo6sia_hw
So we found out the hard way that having MultiViews enabled in Apache, in your otherwise static folder full of image files, is a reeeeeally bad idea if that folder is filled by automation and contains millions of files. That was a *fun* support call. :) "Why is our site giving 500s after less than 10 minutes? What are all of those workers *doing*?"
Why is it actually so difficult for filesystems to deal with such folders? I mean, a million is not such a large number, not for a computer anyhow. A table with a million rows doesn't generally cause an RDBMS to choke, so why should filesystems?
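One way to get a feel for the actual cost is to measure it; a rough, hypothetical benchmark sketch (absolute numbers depend entirely on the filesystem and hardware, this only shows the shape of the curve):

    import os
    import tempfile
    import time

    def bench(n_files: int) -> None:
        with tempfile.TemporaryDirectory() as d:
            start = time.perf_counter()
            for i in range(n_files):
                open(os.path.join(d, f"f{i:07d}"), "w").close()
            create_s = time.perf_counter() - start

            start = time.perf_counter()
            count = sum(1 for _ in os.scandir(d))  # lazy enumeration, no sorting
            list_s = time.perf_counter() - start

            print(f"{n_files:>7} files: create {create_s:6.2f}s  list {list_s:6.2f}s  ({count} entries)")

    for n in (1_000, 10_000, 100_000):
        bench(n)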
Back in the 2010s, in my first job, I saw a bug caused by too many files in one directory. I don't remember the exact details, but it was driving me crazy.

Basically, we wrote temporary files into one big directory and later printed them. And sometimes our code returned "File not found" errors, despite the fact that ls filename showed the file was there with correct permissions.

When I tried to cat filename from a shell, it caused the same error :) But if you created another file with a different name in the same directory, it worked correctly :) There was also free space on the disk, and the number of files was high, but not crazy high (a few hundred thousand, I think).

It turns out this particular filesystem (ext2 or ext3 with specific parameters, IIRC) can behave like that when there are too many similarly-named files in one directory: the filesystem keeps metadata containing hashes of the filenames, those hashes can collide, and it can only handle so many collisions before failing.

The solution was to remove the files after printing them, of course, so they wouldn't accumulate forever.
Meanwhile, on the OneDrive desktop client, the cost of some operations is proportional not to the number of files in the folder you're trying to open, but to the number of files in the whole filesystem. Your "root" folder can take a lot longer to load if /some/sub/folder has a ton of files in it.
Bombich.com is a great source of filesystem feature deep dives. He started by running a backup lab and helping people with corner cases and common failure modes. He created a GUI Mac app to guide people through the gnarly bits, and later contributed metadata patches to rsync, for which I am truly grateful.

Back in 2004, I played with file systems, metadata archives, and directories with 20,000+ files in them.

I learned that, at the time, GNU ls had a polynomial-time sort algorithm. I didn't dig into it as much as I should have, but there are sort algorithms that have already-sorted input as their worst case for runtime.
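The textbook example of that worst case is a quicksort that always picks the first element as its pivot; a toy sketch (not GNU ls's actual code) that degrades to quadratic time on already-sorted input:

    import sys
    import time

    sys.setrecursionlimit(10_000)  # recursion depth equals n on sorted input

    def naive_quicksort(items):
        # First-element pivot: on sorted input every partition is maximally
        # unbalanced, so the runtime degrades to O(n^2).
        if len(items) <= 1:
            return items
        pivot, rest = items[0], items[1:]
        left = [x for x in rest if x < pivot]
        right = [x for x in rest if x >= pivot]
        return naive_quicksort(left) + [pivot] + naive_quicksort(right)

    for n in (1_000, 2_000, 4_000):
        data = list(range(n))  # already sorted: the worst case here
        start = time.perf_counter()
        naive_quicksort(data)
        print(n, f"{time.perf_counter() - start:.3f}s")  # roughly quadruples as n doubles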
Way back in the 90s I read "News Need Not Be Slow" [1], about Usenet, and one of the issues that came up consistently was performance limits caused by the number of inodes and the filesystem. When I was tasked with setting up INN for my organization, I was able to get a DEC machine running OSF/1 with AdvFS, which at the time was a highly optimized filesystem, to more or less bypass the performance problems of UFS.

[1] http://www.collyer.net/who/geoff/newspaper.pdf
At previous job v.2, I changed an IoT system that received daily text files from remote devices from dumping millions of them into the top level of the /data partition to using a /data/YYYY/MM/unit_id/ structure.

The claim was that the original files needed to be kept for audits, even after database ingest.

Management didn't care, but I made the change because I wanted my terminal not to die if I accidentally typed "ls" in the wrong directory.
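A minimal sketch of that layout (the helper name and sample file name are made up for illustration):

    from datetime import datetime, timezone
    from pathlib import Path

    def ingest_dir(root: str, unit_id: str, received_at: datetime) -> Path:
        # Mirrors the /data/YYYY/MM/unit_id/ structure described above.
        return Path(root) / f"{received_at.year:04d}" / f"{received_at.month:02d}" / unit_id

    target = ingest_dir("/data", "unit-0042", datetime.now(timezone.utc))
    target.mkdir(parents=True, exist_ok=True)
    (target / "readings_120000.txt").write_text("temp=21.5\n")
    print(target)  # e.g. /data/2024/06/unit-0042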
> Adding a new file, for example, requires that the filesystem compare the new item name to the name of every other file in the folder to check for conflicts

Is this true? I would've assumed that filesystems have smarter ways to find a file in a folder than doing a linear search through every entry.

That doesn't take away from the post: those smarter data structures and algorithms will still get slower with more entries, just not linearly so.
> Adding a new file, for example, requires that the filesystem compare the new item name to the name of every other file in the folder to check for conflicts, so trivial tasks like that will take progressively longer as the file count increases.

Not necessarily. Any file system worth its salt uses B-trees or hash tables, where filename existence can be checked in O(log n) or O(1) time respectively.
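A toy illustration of those lookup strategies (not how any real filesystem lays out its directories): a linear scan versus a sorted index standing in for a B-tree versus a hash table.

    from bisect import bisect_left

    names = [f"file{i:07d}.jpg" for i in range(1_000_000)]

    def exists_linear(name: str) -> bool:
        return any(n == name for n in names)          # O(n): compare every entry

    sorted_names = sorted(names)
    def exists_btree_like(name: str) -> bool:
        i = bisect_left(sorted_names, name)           # O(log n): binary search
        return i < len(sorted_names) and sorted_names[i] == name

    name_set = set(names)
    def exists_hashed(name: str) -> bool:
        return name in name_set                       # O(1) average: hash lookup

    target = "file0999999.jpg"
    print(exists_linear(target), exists_btree_like(target), exists_hashed(target))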
"Some of these can safely be deleted if you find crazy-high file counts."<p>It would be nice to know which of these library folders can be cleared out.
I have worked on a filesystem with 225 million files in a single directory, hosted on ZFS. Operations like `ls` took a minute or so to execute, but creating or accessing a file in this directory was almost the same speed as in an empty directory. I later replaced this filesystem with an sqlite database, which provided a pretty great speedup in table-scan type operations.
> Adding a new file, for example, requires that the filesystem compare the new item name to the name of every other file in the folder to check for conflicts

Tell me you don't know how data structures work without telling me you don't know how data structures work.
It's like they never heard of using a tree structure for filename storage.
NTFS used to (at some point this century) take O(n) time to give the user control in the File Save dialog, but they fixed that at some point.
For an app I'm building, I need to store ~1M JSON files of 10 KB each per day, which I then upload as Parquet to S3. What is the alternative to using the filesystem here? Should I put them in a database?
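One possible approach, sketched under the assumption that records can be buffered and written out as one Parquet file per batch (pyarrow assumed installed; the batch size and file naming are made up):

    import time
    import pyarrow as pa
    import pyarrow.parquet as pq

    BATCH_SIZE = 50_000
    buffer: list[dict] = []

    def add_record(record: dict) -> None:
        # Buffer records in memory instead of writing one small file each.
        buffer.append(record)
        if len(buffer) >= BATCH_SIZE:
            flush()

    def flush() -> None:
        if not buffer:
            return
        table = pa.Table.from_pylist(buffer)  # columnar conversion
        pq.write_table(table, f"batch-{int(time.time())}.parquet", compression="zstd")
        buffer.clear()

    for i in range(3):
        add_record({"device": f"dev-{i}", "value": i * 1.5})
    flush()

A local SQLite table used as the staging buffer (as other comments suggest) would also survive crashes better than an in-memory list.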
Oh, we ran into this years ago. Using the RedGate libraries to compare SQL database structure (and data) generates temporary files in a folder.

A lot of weird shit starts happening once that folder hits about 10 million files.
Obligatory: https://www.sqlite.org/fasterthanfs.html

"SQLite reads and writes small blobs (for example, thumbnail images) 35% faster¹ than the same blobs can be read from or written to individual files on disk using fread() or fwrite().

"Furthermore, a single SQLite database holding 10-kilobyte blobs uses about 20% less disk space than storing the blobs in individual files..."