
A faster way to delete millions of files in a directory

135 points · by bluetooth · almost 12 years ago

18 comments

lloeki · almost 12 years ago

It's sad to see so much guesswork around here...

Here's GNU coreutils rm [0] calling its remove() function [1], itself using fts to open, traverse, and remove each entry [2], vs rsync's delete() [3] calling {{robust,do}_,}unlink() [4] [5].

Now a little profiling could certainly help.

(damn gitweb that doesn't highlight the referenced line)

[0]: http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/rm.c;h=3e187cf80d5ecf3e7743662f8b0e9ee0b956c0ac;hb=HEAD#l349
[1]: http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/remove.c;h=cdbbec5bbbd6a6fc96c078e63e9b2b918a0f322e;hb=HEAD#l538
[2]: http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/remove.c;h=cdbbec5bbbd6a6fc96c078e63e9b2b918a0f322e;hb=HEAD#l417
[3]: http://rsync.samba.org/ftp/unpacked/rsync/delete.c
[4]: http://rsync.samba.org/ftp/unpacked/rsync/util.c
[5]: http://rsync.samba.org/ftp/unpacked/rsync/syscall.c
js2 · almost 12 years ago

FWIW, a directory with millions of files is likely to be quite large (I'm referring to the directory inode itself, which contains a mapping of filenames to inodes). Depending upon the file system, reclaiming the space used by all those millions of mappings might require creating a new directory into which to move the remaining files.

BTW, having millions of files in an ext3 directory in the first place is probably a bad idea. Instead, layer the files into two or three directory levels. See here:

http://www.redhat.com/archives/ext3-users/2007-August/msg00000.html

(Git, for example, places its objects under 1 of 256 directories based on the hex representation of the first byte of the object's SHA-1.)
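To make the layering idea concrete, here is a minimal sketch (not Git's actual scheme): it picks one of 256 two-hex-digit buckets from a file name with a simple FNV-1a hash and builds the sharded path. The root path and file name in main() are made up for illustration.

    /* Sketch: spread files across 256 subdirectories instead of one huge directory. */
    #include <stdio.h>
    #include <stdint.h>

    /* FNV-1a, 32-bit: a small, well-known non-cryptographic hash. */
    static uint32_t fnv1a(const char *s) {
        uint32_t h = 2166136261u;
        for (; *s; s++) { h ^= (uint8_t)*s; h *= 16777619u; }
        return h;
    }

    /* Writes "<root>/<xx>/<name>" into out, where xx is one of 256 buckets. */
    static void sharded_path(char *out, size_t n, const char *root, const char *name) {
        snprintf(out, n, "%s/%02x/%s", root, (unsigned)(fnv1a(name) & 0xffu), name);
    }

    int main(void) {
        char path[4096];
        sharded_path(path, sizeof path, "/var/spool/app", "report-1234567.tif");
        puts(path);   /* prints something like /var/spool/app/<xx>/report-1234567.tif */
        return 0;
    }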
NelsonMinar · almost 12 years ago

Details will vary depending on the filesystem. Bad old filesystems are O(n^2) in the number of files in a directory; ext3fs is fine. Also, tools like find and rm often do more work on a file than strictly necessary. I'm curious why rsync would be better myself; on first blush that'd be the worst choice!

I've salvaged an unwieldy directory by using Python to directly call unlink(2). Details: http://www.somebits.com/weblog/tech/bad/giant-directories.html
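For illustration, a minimal C sketch of the same direct-unlink idea (the commenter used Python; this is just the equivalent loop, with deliberately terse error handling): walk the directory once and unlink every entry, without stat()ing anything first.

    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        const char *path = argc > 1 ? argv[1] : ".";
        DIR *d = opendir(path);
        if (!d) { perror("opendir"); return 1; }

        int dfd = dirfd(d);                  /* lets us use unlinkat() with bare names */
        struct dirent *e;
        while ((e = readdir(d)) != NULL) {
            if (!strcmp(e->d_name, ".") || !strcmp(e->d_name, ".."))
                continue;
            /* Deleting the entry just returned is a common pattern; strictly, POSIX
             * leaves iteration during modification unspecified. */
            if (unlinkat(dfd, e->d_name, 0) != 0)
                perror(e->d_name);           /* e.g. a subdirectory; skip and continue */
        }
        closedir(d);
        return 0;
    }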
Hawkee · almost 12 years ago

For anybody who might try to copy and paste from this article: it is actually "rsync -a --delete empty/ your_dir". The dashes are improperly encoded for copy/paste.
js2 · almost 12 years ago

No mention of the filesystem. As it's RHEL 5.4 I'm going to guess ext3, which uses indirect blocks instead of extents for large files (which a directory containing millions of files surely is). It would also be useful to confirm that dir_index is enabled.

Some useful background material:

http://computer-forensics.sans.org/blog/2008/12/24/understanding-indirect-blocks-in-unix-file-systems/

http://static.usenix.org/publications/library/proceedings/als01/full_papers/phillips/phillips_html/index.html
japaget · almost 12 years ago
There is an error in the blog posting. If you look at the original output from the rsync command, you will see that the elapsed time should be 12.42 seconds and the system time should be 10.60 seconds. Elapsed time is a third that of rm -rf and system time is 70% as much.
comex · almost 12 years ago
It would certainly be nice if there were a specialized function to do this - no need for a million context switches, and the filesystem code can probably delete things more intelligently than individually removing each file, with accompanying intermediate bookkeeping.
pstuart · almost 12 years ago

rsync is an order of magnitude faster than rm -rf. Why would that be? (OK, I'm being lazy.)
Elv13 · almost 12 years ago

Even faster:

    mkdir ../.tmp${RANDOM} && mv ./* ../.tmp[0-9]* && rm -rf ../.tmp[0-9]* &   # or the rsync trick

As long as ../ is on the same device, that should clear the directory instantaneously. That is the point, right? Of course, if you want an rm with lower IO-wait or lower CPU use, use the rsync method, but if you want something that clears a directory as fast as possible, this is fast.

Tested with:

    for I in `seq 1 1000000`; do echo ${I} > ./${I}; done; sync   # much faster than "touch"
codesink · almost 12 years ago

I stumbled over this a few months ago, and the issue was that readdir(), used by rm on the box I was using, by default allocated a small buffer (the usual 4KB), and with millions of files that turned into millions of syscalls (just to find out which files to delete).

A small program using getdents() with a large buffer (5MB or so) speeds it up a lot.

If you want to be kind to your hard drive, sorting the buffer by inode before running the unlink()s will let you access the disk semi-sequentially (fewer head jumps).
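A rough sketch of what that could look like on Linux with glibc: read entries via the getdents64 syscall into a large buffer (5MB, as the comment suggests), collect (inode, name) pairs, sort them by inode, then unlink. Buffer size and error handling are illustrative, not tuned.

    #define _GNU_SOURCE
    #include <dirent.h>          /* struct dirent64 (glibc) */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #define BUF_SIZE (5 * 1024 * 1024)

    struct entry { unsigned long long ino; char name[256]; };

    static int by_inode(const void *a, const void *b) {
        unsigned long long x = ((const struct entry *)a)->ino;
        unsigned long long y = ((const struct entry *)b)->ino;
        return (x > y) - (x < y);
    }

    int main(int argc, char **argv) {
        const char *dir = argc > 1 ? argv[1] : ".";
        int fd = open(dir, O_RDONLY | O_DIRECTORY);
        if (fd < 0) { perror("open"); return 1; }

        char *buf = malloc(BUF_SIZE);
        struct entry *list = NULL;
        size_t count = 0, cap = 0;

        /* Fill the big buffer in as few syscalls as possible. */
        for (;;) {
            long n = syscall(SYS_getdents64, fd, buf, BUF_SIZE);
            if (n <= 0) break;                       /* 0 = end of directory, <0 = error */
            for (long off = 0; off < n;) {
                struct dirent64 *d = (struct dirent64 *)(buf + off);
                off += d->d_reclen;
                if (!strcmp(d->d_name, ".") || !strcmp(d->d_name, ".."))
                    continue;
                if (count == cap) {
                    cap = cap ? cap * 2 : 1024;
                    list = realloc(list, cap * sizeof *list);
                }
                list[count].ino = d->d_ino;
                snprintf(list[count].name, sizeof list[count].name, "%s", d->d_name);
                count++;
            }
        }

        /* Unlink in inode order to keep disk access roughly sequential. */
        qsort(list, count, sizeof *list, by_inode);
        for (size_t i = 0; i < count; i++)
            unlinkat(fd, list[i].name, 0);           /* subdirectories fail with EISDIR */

        close(fd);
        return 0;
    }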
ralph · almost 12 years ago

This Perl beats rsync by quite a margin here.

    perl -e 'opendir D, "."; @f = grep {$_ ne "." && $_ ne ".."} readdir D; unlink(@f) == $#f + 1 or die'

It goes a bit quicker still if @f and the error handling are omitted.

The original article is comparing different things some of the time, e.g. find is having to stat(2) everything to test if it's a file.
miles · almost 12 years ago

More along these same lines:

How to delete million of files on busy Linux servers ("Argument list too long")

http://pc-freak.net/blog/how-to-delete-million-of-files-on-busy-linux-servers-work-out-argument-list-too-long/
akeck · almost 12 years ago

Interesting. Some time ago we had to regularly clear a directory with many files in an ext2 file system. We ended up mounting a separate small volume at that point in the VFS. When we needed to clear it, we would just make a new file system on the volume.
incision · almost 12 years ago
Brings back memories of an in-house correspondence application I once encountered - 16 million TIFFs in a single directory.<p>The lead dev responsible for the app was also fond of hard-coding IP addresses and wouldn't even entertain talk of doing anything differently.<p>I got out of there ASAP.
aangjie · almost 12 years ago

Another excellent resource is this serverfault question: http://serverfault.com/questions/183821/rm-on-a-directory-with-millions-of-files
malkia · almost 12 years ago

If you need the same folder emptied but can accept a background process doing the deletion, you could rename the folder, create an empty one with the old name, and run something to delete the old one in the background.
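A minimal sketch of that idea in C, assuming both names live on the same filesystem; the directory names and the rm -rf in the detached child are placeholder assumptions, not part of the comment.

    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void) {
        const char *dir = "spool";            /* hypothetical directory name */
        const char *old = "spool.deleting";   /* temporary name for the full tree */

        if (rename(dir, old) != 0) { perror("rename"); return 1; }   /* instant on same fs */
        if (mkdir(dir, 0755) != 0) { perror("mkdir"); return 1; }    /* fresh, empty dir */

        /* Detach a child to do the slow deletion so the caller returns at once. */
        pid_t pid = fork();
        if (pid == 0) {
            execlp("rm", "rm", "-rf", old, (char *)NULL);
            _exit(127);                        /* only reached if exec fails */
        }
        return pid < 0 ? 1 : 0;
    }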
yaldasoft · almost 12 years ago

I had to delete a few million files in bash once. 'find' didn't work. I used Perl to overcome the issues:

    perl -e 'opendir D, "."; while ($n = readdir D) { unlink $n }'
pkrumins · almost 12 years ago
The results are statistically insignificant.