
My experience with using cp to copy 432 million files (39 TB)

826 points by nazri1 over 10 years ago

34 comments

fintler over 10 years ago
I wrote a little copy program at my last job to copy files in a reasonable time frame on 5PB to 55PB filesystems.

https://github.com/hpc/dcp

We got an IEEE paper out of it:

http://conferences.computer.org/sc/2012/papers/1000a015.pdf

A few people are continuing the concept to other tools -- that should be available at http://fileutils.io/ relatively soon.

We also had another tool written on top of https://github.com/hpc/libcircle that would gather metadata on a few hundred million files in a few hours (we had to limit the speed so it wouldn't take down the filesystem). For a slimmed-down version of that tool, take a look at https://github.com/hpc/libdftw
Replies not loaded: #8306128, #8307956, #8307083, #8309664
pedrocr over 10 years ago
How about this for a better cp strategy to deal with hard links:

1. Calculate the hash of /sourcedir/some/path/to/file
2. Copy the file to /tempdir/$hash if it doesn't exist yet
3. Hard-link /destdir/some/path/to/file to /tempdir/$hash
4. Repeat until you run out of source files
5. Recursively delete /tempdir/

This should give you a faithful copy with all the hard links in constant RAM, at the cost of the CPU needed to run all the hashing. If you're smart about doing steps 1 and 2 together it shouldn't require any additional I/O (ignoring the extra file metadata).

Edit: actually this won't recreate the same hard-link structure; it will deduplicate any identical files, which may not be what you want. Replacing the hashing with looking up the inode with stat() would actually do the right thing. And that would basically be an on-disk implementation of the hash table cp is setting up in memory.
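A minimal shell sketch of the recipe above, assuming GNU coreutils and filenames without embedded newlines; SRC, DST and TMP are placeholder paths, not anything from the article:

    SRC=/sourcedir; DST=/destdir; TMP=/tempdir
    mkdir -p "$TMP" "$DST"
    (cd "$SRC" && find . -type f) | while read -r rel; do
        h=$(sha256sum "$SRC/$rel" | cut -d' ' -f1)        # step 1: hash the file contents
        [ -e "$TMP/$h" ] || cp -p "$SRC/$rel" "$TMP/$h"   # step 2: one physical copy per unique content
        mkdir -p "$DST/$(dirname "$rel")"
        ln "$TMP/$h" "$DST/$rel"                          # step 3: hard-link into the destination tree
    done
    rm -rf "$TMP"                                         # step 5: the destination links keep the data alive

As the edit notes, keying the temporary store on stat -c '%d:%i' output instead of a content hash would reproduce the original hard-link structure rather than deduplicating identical files.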
Replies not loaded: #8305994, #8305918, #8305917, #8306759
rwg over 10 years ago
> Disassembling data structures nicely can take much more time than just tearing them down brutally when the process exits.

A wonderful trend I've noticed in Free/Open Source software lately is proudly claiming that a program is "Valgrind clean." It's a decent indication that the program won't do anything silly with memory during normal use, like leak it. (There's also a notable upswing in the number of projects using static analyzers on their code and fixing legitimate problems that turn up, which is great, too!)

While you can certainly just let the OS reclaim all of your process's allocated memory at exit time, you're technically (though intentionally) leaking memory. When it becomes too hard to separate the intentional leaks from the unintentional leaks, I'd wager most programmers will just stop looking at the Valgrind reports. (I suppose you could wrap free() calls in "#ifdef DEBUG ... #endif" blocks and only run Valgrind on debug builds, but that seems ugly.)

A more elegant solution is to use an arena/region/zone allocator and place potentially large data structures (like cp's hard link/inode table) entirely in their own arenas. When the time comes to destroy one of these data structures, you can destroy its arena with a single function call instead of walking the data structure and free()ing it piece by piece.

Unfortunately, like a lot of useful plumbing, there isn't a standard API for arena allocators, so actually doing this in a cross-platform way is painful:

• Windows lets you create multiple heaps and allocate/free memory in them (HeapCreate(), HeapDestroy(), HeapAlloc(), HeapFree(), etc.).

• OS X and iOS come with a zone allocator (malloc_create_zone(), malloc_destroy_zone(), malloc_zone_malloc(), malloc_zone_free(), etc.).

• glibc doesn't have a user-facing way to create/destroy arenas (though it uses arenas internally), so you're stuck using a third-party allocator on Linux to get arena support.

• IRIX used to come with an arena allocator (acreate(), adelete(), amalloc(), afree(), etc.), so if you're still developing on an SGI Octane because you can't get enough of that sexy terminal font, you're good to go.
Replies not loaded: #8306367, #8308166, #8306396, #8307316
mililani over 10 years ago
This may be a little off topic, but I used to think RAID 5 and RAID 6 were the best RAID configs to use. They seemed to offer the best bang for the buck. However, after seeing how long it took to rebuild an array after a drive failed (over 3 days), I'm much more hesitant to use those RAID levels. I much prefer RAID 1+0, even though the overall cost is nearly double that of RAID 5. It's much faster, and there is no rebuild process if the RAID controller is smart enough: you just swap failed drives, and the RAID controller automatically utilizes the backup drive and then mirrors onto the new drive. Just much faster and much less prone to multiple drive failures killing the entire RAID.
Replies not loaded: #8306499, #8306088, #8305799, #8305853, #8305710, #8305692, #8306027
vhost- over 10 years ago
These are the types of stories I love. I just learned a boatload in 5 minutes.
Replies not loaded: #8305684, #8306933
calvins over 10 years ago
I would usually use the tar pipe mentioned already by others for this sort of thing (although I probably wouldn't do 432 million files in one shot):

    (cd $SOURCE && tar cf - .) | (mkdir -p $DEST && cd $DEST && tar xf -)

Another option, which I just learned about through reading some links from this thread, is pax (http://en.wikipedia.org/wiki/Pax_%28Unix%29), which can do it with just a single process:

    (mkdir -p $DEST && cd $SOURCE && pax -rw . $DEST)

Both will handle hard links fine, but pax may have some advantages in terms of resource usage when processing huge numbers of files and tons of hard links.
Reply not loaded: #8308623
pflanze over 10 years ago
I've written a program that attempts to deal with the given situation gracefully: instead of using a hash table, it creates a temporary file with a list of inode/device/path entries, then sorts this according to inode/device, then uses the sorted list to perform the copying/hardlinking. The idea is that sorting should work well with much lower RAM requirements than the size of the file being sorted; thanks to data locality, unlike the random accesses of the hash, it can work with big chunks, at least when done right (a bit hand-wavy, I know; this is called an "online algorithm" and I remember Knuth having written about those, though I haven't had the chance to recheck yet). The program uses the system sort command, which hopefully implements this well already.

The program stupidly calls "cp" right now for every individual file copy (not the hard linking), just to get the script done quickly; it's easy to replace that with something that saves the fork/exec overhead. Even so, it might be faster than the swapping hash table if the swap is on a spinning disk. Also read the notes in the --help text. I.e. this is a work in progress as a basis for testing the idea; it will be easy to round off the corners if there's interest.

https://github.com/pflanze/megacopy

PS. The idea of this is to make copying work well with the given situation on a single machine, unlike the approach taken by the dcp program mentioned by fintler, which seems to rely on a cluster of machines.

There may also be some more discussion about this on the mailing list: http://lists.gnu.org/archive/html/coreutils/2014-09/msg00013.html
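To make the idea concrete, here is a rough sketch (not pflanze's actual script) of how the sorted listing could be produced, assuming GNU find and sort and a placeholder SRC path:

    # list device, inode and path for every entry, then sort by device/inode
    (cd "$SRC" && find . -printf '%D %i %p\n') | sort -k1,1n -k2,2n > /tmp/inode-list
    # entries sharing a device/inode pair are now adjacent, so the first
    # occurrence can be copied and the rest hard-linked to it, without
    # holding a table of every inode in RAM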
jrochkind1 over 10 years ago
So it was all the files in one go, presumably with `cp -r`?

What about doing something with find/xargs/i-dunno to copy all the files, but break them into batches so you aren't asking cp to do its bookkeeping for so many files in one process? Would that work better? Or worse in other ways?
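One hedged sketch of such a batched copy, assuming GNU find, xargs and cp, placeholder SRC/DST paths, and an arbitrary batch size of 10,000:

    cd "$SRC"
    find . -type d -print0 | xargs -0 -I{} mkdir -p "$DST/{}"              # recreate the directory tree
    find . -type f -print0 | xargs -0 -n 10000 cp -p --parents -t "$DST"   # copy files in batches

The catch, as the rest of the thread suggests, is that each cp invocation only sees its own batch, so hard links that span batches would be copied as independent files.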
Replies not loaded: #8305833, #8305766, #8305757, #8305911
pedrocr over 10 years ago
Unix could really use a way to get all the paths that point to a given inode. These days that shouldn't really cost all that much and this issue comes up a lot in copying/sync situations. Here's the git-annex bug report about this:

https://git-annex.branchable.com/bugs/Hard_links_not_synced_in_direct_mode/
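For reference, the closest stock workaround today is an exhaustive scan of the filesystem, along these lines:

    # 123456 is a hypothetical inode number; -xdev keeps find on one filesystem
    find /srv/data -xdev -inum 123456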
Replies not loaded: #8305733, #8307427
pixelbeat over 10 years ago
I found an issue in cp that caused 350% extra memory usage for the original bug reporter; fixing it would have kept his working set at least within RAM.

http://lists.gnu.org/archive/html/coreutils/2014-09/msg00014.html
gwern over 10 years ago
> Wanting the buffers to be flushed so that I had a complete logfile, I gave cp more than a day to finish disassembling its hash table, before giving up and killing the process. ... Disassembling data structures nicely can take much more time than just tearing them down brutally when the process exits.

Does anyone know what the "tear down" part is about? If it's about erasing the hash table from memory, what takes so long? I would expect that to be very fast: you don't have to write zeros to it all, you just tell your GC or memory manager to mark it as free.
Replies not loaded: #8305679, #8305675, #8305779, #8305701, #8305660, #8305700
sitkack over 10 years ago
I appreciate that he had the foresight to install more RAM and configure more swap. I would hate to be days into a transfer and have the OOM killer strike.
angry_octet over 10 years ago
The difficulty is that you are using a filesystem hierarchy to "copy files" when you actually want to do a volume dump (block copy). Use XFS and xfsdump, or ZFS and zfs send, to achieve this.

Copy with hard-link preservation is essentially like running dedupe, except that you know ahead of time how many dupes there are. Dedupe is often very memory intensive, and even well-thought-out implementations don't support keeping bookkeeping structures on disk.
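Illustrative sketches of those volume-level alternatives; the filesystem paths, pool and snapshot names are placeholders:

    xfsdump -l 0 - /srv/data | xfsrestore - /srv/copy    # XFS: level-0 dump piped straight into a restore
    zfs snapshot pool/data@migrate
    zfs send pool/data@migrate | zfs recv backup/data    # ZFS: replicate a snapshot to another dataset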
Replies not loaded: #8305838, #8311400
minopret over 10 years ago
In light of experience, would it perhaps be helpful after all to use a block-level copy (such as Partclone, PartImage, or GNU ddrescue) and analyze later which files have the bad blocks?

I see that the choice of a file-level copy was deliberate: "I'd have copied/moved the files at block-level (eg. using dd or pvmove), but suspecting bad blocks, I went for a file-level copy because then I'd know which files contained the bad blocks."
Replies not loaded: #8308667, #8306742
IvyMike over 10 years ago
Interesting.

In Windows-land, the default copy is pretty anemic, so probably most people avoid it for serious work.

I'd probably use robocopy from the command line. And if I was being lazy, I'd use the Teracopy GUI.

I think my limit for a single copy command has been around 4TB with robocopy -- and that was a bunch of large media files, rather than smaller, more numerous files. Maybe there's a limit I haven't hit.
Replies not loaded: #8305712, #8305707, #8305740
pmontra over 10 years ago
Another lesson to be learnt is that it&#x27;s nice to have the source code for the tools we are using.
dredmorbius over 10 years ago
The email states that file-based copy operations were used in favor of dd due to suspected block errors. Two questions come to mind:

1. I've not used dd on failing media, so I'm not sure of the behavior. Will it plow through a file with block-read failures or halt?

2. There's the ddrescue utility, which *is* specifically intended for reading from unreliable storage. It seems that this could have offered another means of addressing Rasmus's problem. It can also fill in additional data on multiple runs across media, so that more complete restores might be achieved. https://www.gnu.org/software/ddrescue/ddrescue.html
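For reference, a typical GNU ddrescue session looks something like this; the device, image and map file names are placeholders:

    ddrescue -d /dev/sdb /mnt/backup/sdb.img /mnt/backup/sdb.map      # initial copy with direct disc access
    ddrescue -d -r3 /dev/sdb /mnt/backup/sdb.img /mnt/backup/sdb.map  # later passes retry bad areas up to 3 times

The map file is what lets repeated runs resume and fill in more data, which is the property the comment refers to.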
Reply not loaded: #8309754
icedchai over 10 years ago
For that many files I probably would've used rsync between local disks. *shrug*
Reply not loaded: #8305887
dspillett over 10 years ago
> The number of hard drives flashing red is not the same as the number of hard drives with bad blocks.

This is the real take-away. Monitor your drives. At the very least enable SMART, and also regularly run a read over the full underlying drive (SMART won't see and log blocks that are on the way out and so need retries for successful reads, unless you actually try to read those blocks).

That won't completely make you safe, but it'll greatly reduce the risk of other drives failing during a rebuild, by increasing the chance you get advance warning that problems are building up.
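Hedged examples using smartmontools plus a plain full-surface read; the device name is a placeholder:

    smartctl -s on /dev/sda            # make sure SMART is enabled
    smartctl -H -A /dev/sda            # health verdict and attribute table
    smartctl -t long /dev/sda          # schedule an extended self-test
    dd if=/dev/sda of=/dev/null bs=1M  # read every block so weak sectors actually get exercised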
Reply not loaded: #8308505
mturmon over 10 years ago
The later replies regarding the size of the data structures cp is using are also worth reading. This is a case where pushing the command farther can make you think harder about the computations being done.
grondilu over 10 years ago
On Unix, isn't it considered bad practice to use cp to copy a large directory tree?

IIRC, the use of tar is recommended. Something like:

    $ (cd $origin && tar cf - *) | (cd $destination && tar xvf - )
Replies not loaded: #8306696, #8307224
sauere over 10 years ago
> While rebuilding, the replacement disk failed, and in the meantime another disk had also failed.

I feel the pain. I went through the same hell a few months ago.
maaku over 10 years ago
Another lesson: routinely scrub your RAID arrays.
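For example (the md device and ZFS pool names are placeholders):

    echo check > /sys/block/md0/md/sync_action   # start a scrub on a Linux md array
    zpool scrub tank                             # start a scrub on a ZFS pool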
Reply not loaded: #8305562
0x0 over 10 years ago
I wonder how well rsync would have fared here.
Reply not loaded: #8305627
ccleve over 10 years ago
Maybe this is naive, but wouldn't it have made more sense to do a bunch of smaller cp commands? Like sweep through the directory structure and do one cp per directory? Or find some other way to limit the number of files copied per command?
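A rough sketch of the one-cp-per-directory idea, assuming GNU find and cp, placeholder SRC/DST paths, and directory names without newlines:

    (cd "$SRC" && find . -type d) | while read -r d; do
        mkdir -p "$DST/$d"
        find "$SRC/$d" -maxdepth 1 -type f -exec cp -p -t "$DST/$d" {} +
    done

As with any batching scheme, each cp run only sees part of the tree, so hard links that cross directories would end up as duplicated files.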
Reply not loaded: #8306430
Andys over 10 years ago
A problem with cp (and rsync, tar, and Linux in general) is that there is read-ahead within single files, but no read-ahead for the next file in the directory. So it doesn't make full use of the available IOPS capacity.
davidu over 10 years ago
This is not, not, not how one should be using RAID.

The math is clear that in sufficiently large disk systems, RAID 5, RAID 6, and friends are all insufficient.
Reply not loaded: #8307276
dbbolton over 10 years ago
> We use XFS

Why?
Replies not loaded: #8305791, #8307438, #8305666, #8325271, #8308659
limaoscarjuliet over 10 years ago
Rsync seems like a better tool for this. It can be run multiple times and it will just copy missing blocks.
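A hedged example of an invocation that also preserves hard links; the paths are placeholders, and --info=progress2 needs a reasonably recent rsync (3.1 or later):

    rsync -aH --info=progress2 /srv/olddisk/ /srv/newdisk/

Note that -H makes rsync keep its own in-memory table of hard-linked inodes, so with hundreds of millions of linked files it faces a scaling problem similar to the one the article describes for cp.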
nraynaud over 10 years ago
It reminds me of crash-only software.
gaius over 10 years ago
I would probably have used tar|tar for this, or rsync.
Replies not loaded: #8305545, #8305622, #8305553, #8305635, #8305574, #8307039, #8306732, #8305580
RexM over 10 years ago
Is this where a new cp fork comes about called libracp?
brokentone over 10 years ago
Feels like a similar situation to this: http://dis.4chan.org/read/prog/1109211978/21
lucb1e over 10 years ago
> 20 years experience with various Unix variants

> I browsed the net for other peoples' experience with copying many files and quickly decided that cp would do the job nicely.

After 20 years you no longer google how to copy files.

Edit: Reading on, he talks about strace and even reading cp's source code, which makes it even weirder that he had to google how to do this...

Edit 2: Comments! It took only ten downvotes before someone bothered to explain what I was doing wrong, but now there are three almost simultaneously. I guess those make a few good points. I'd still think cp ought to handle just about anything, especially given its ubiquitousness and age, but I see the point.

And to clarify: I'm not saying the author is stupid or anything. It's just *weird* to me that someone with that much experience would google something which on the surface sounds so trivial, even at 40TB.
Replies not loaded: #8305632, #8305677, #8305637, #8305786, #8306669, #8305727