
Moving all your data, 9TB edition

149 points · by CD1212 · about 10 years ago

10 comments

mattzito · about 10 years ago

It seems to me that your original idea was, in fact, the correct one - rsync probably would have been the best way to do this (and separately, a truck full of disks probably would have been the other best way).

First, rsync took too long probably because you used just one thread and didn't optimize your command-line options - most of the performance problems with rsync on large filesystem trees come from using one command to run everything, something like:

rsync -av /source/giant/tree /dest/giant/tree

The process of crawling, checksumming, and storing is not only generally slow, but incredibly inefficient on today's multicore processors. It's much better to break it up into many threads, something like:

rsync -av /source/giant/tree/subdir1 /dest/giant/tree/subdir1
rsync -av /source/giant/tree/subdir2 /dest/giant/tree/subdir2
rsync -av /source/giant/tree/subdir3 /dest/giant/tree/subdir3

That alone probably would have dramatically sped things up, BUT you still have your speed-of-light issues.

This is where Amazon Import/Export comes in - do a one-time tar/rsync of your data to an external 9TB array, ship it to Amazon, have them import it to S3, and load it onto your local Amazon machines. You now have two copies of your data - one on S3, and one on your Amazon machine.

Then you use your optimized rsync to bring it up to a relatively consistent state - i.e. it runs for 8 hours to sync up, so now you're 8 hours behind. Then you take a brief downtime and run the optimized rsync one more time, and now you have two fully consistent filesystems.

No need for DRBD and all the rest of this - just rsync and an external array. I've used this method to duplicate terabytes and terabytes of data, and tens of millions of small files. It works, and has a lot fewer moving parts than DRBD.
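A minimal sketch of the parallelized rsync mattzito describes, assuming the tree splits reasonably evenly across its top-level subdirectories; the paths and the job count are placeholders, not details from the original post:

```sh
#!/usr/bin/env bash
# Run one rsync per top-level subdirectory, a few at a time.
set -euo pipefail

SRC=/source/giant/tree   # placeholder source path
DST=/dest/giant/tree     # placeholder destination path

# List the immediate subdirectories and hand each one to its own rsync;
# -P 4 keeps four transfers running in parallel. Files sitting directly
# in $SRC (not inside a subdirectory) are not covered by this loop.
find "$SRC" -mindepth 1 -maxdepth 1 -type d -printf '%f\n' |
  xargs -P 4 -I{} rsync -a "$SRC/{}/" "$DST/{}/"
```

Running the same script again after the bulk copy gives the catch-up pass the comment describes, since rsync only re-transfers what has changed.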
peterwwillis · about 10 years ago

The two lessons to take away from this:

1. Ask someone else who has already done what you're thinking of doing. They have already made all the mistakes you might make and have figured out a way that works.

2. Assume that whatever you think will work will fail in an unexpected way with probably-catastrophic results. Test everything before you try something new.
discardorama · about 10 years ago

Stories like these remind me of Andy Tanenbaum's statement: "Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway."
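As a rough, illustrative check of that quote against the 9 TB in this story (the 100 Mbit/s WAN link is my assumption, not a figure from the article):

```sh
# 9 TB over a sustained 100 Mbit/s link, in round numbers:
bytes=$((9 * 10**12))                   # 9 TB
seconds=$((bytes * 8 / (100 * 10**6)))  # 72,000,000,000,000 bits / 100 Mbit/s = 720,000 s
echo "$((seconds / 86400)) days"        # roughly 8 days online, vs. overnight for shipped disks
```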
codingdave · about 10 years ago

I've been involved with multiple large data center migrations over the course of the last 25 years. Every single time, there is a discussion over the best way to transfer the data. Every single time, we choose the same option: copy all the data to external hard drives. A tech puts them in a carry-on bag, heads to the airport, and flies to the new data center.
cplease · about 10 years ago

Honestly, this seems to me like a case study in getting carried away by the cloud. There's a very strong case for cloud hosting when you have a clear need for elasticity and a distributed application to leverage it.

Here, you have your entire business on one non-distributed Amazon instance. Amazon does not provide excellent service, availability, flexibility, or value for this model. It is in every way inferior to what you would get from colo or managed dedicated server hosting. Hosting your whole business on a single anonymous Amazon cloud instance that you can't walk up to and touch is engineering malpractice.
dnr · about 10 years ago
LVM can only shrink devices by full extents, but LVM is just a wrapper around device-mapper (plus metadata management). By creating device-mapper devices directly with the linear target, you could have gotten sector (512 byte) granularity for offset and size.
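A minimal sketch of the direct device-mapper approach dnr is pointing at, assuming a simple linear mapping onto a single backing device; the device name, size, and mapping name are placeholders:

```sh
SRC=/dev/sdb            # placeholder backing block device
SECTORS=17578125000     # desired size in 512-byte sectors (~9 TB here)
OFFSET=0                # starting sector on $SRC

# dm-linear table format: <logical start> <length> linear <backing device> <offset>
echo "0 $SECTORS linear $SRC $OFFSET" | dmsetup create data_linear

# The mapped device appears as /dev/mapper/data_linear; a later resize means
# loading a new table with `dmsetup load` and activating it with
# `dmsetup resume`, at sector rather than LVM-extent granularity.
```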
Dylan16807 · about 10 years ago

I don't quite understand. Why did the entire truck dance have to be redone when there was a perfectly nice stale copy of the data sitting there? Couldn't the second truck dance have used rsync and been done in under a day?
watersb · about 10 years ago

I love these kinds of write-ups, because I always learn new techniques and tools.

I've done this sort of thing with rsync, and with ZFS send/receive.

And of course, mailing hard disks.
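For reference, a minimal sketch of the ZFS send/receive variant watersb mentions (not something the article itself uses); the pool, dataset, and host names are placeholders:

```sh
# Take a snapshot and ship it in full to the new machine:
zfs snapshot tank/data@base
zfs send tank/data@base | ssh newhost zfs receive backup/data

# Later, send only the blocks that changed since the base snapshot:
zfs snapshot tank/data@catchup
zfs send -i tank/data@base tank/data@catchup | ssh newhost zfs receive backup/data
```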
jjwiseman · about 10 years ago

It's meant for smaller numbers of large files and high-speed connections, but Aspera Direct-To-S3 advertises 10 GB/24 hours: http://cloud.asperasoft.com/news-events/aspera-direct-to-s3/
cmurf · about 10 years ago

I wonder how GlusterFS would perform in this case. Replicated volumes with one or more local nodes for the most likely failover requirement, with async geo-replication in case the building burned down.
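A rough sketch of the layout cmurf is wondering about, assuming two local nodes and one remote site; all host, brick, and volume names are placeholders:

```sh
# Two local nodes keep synchronous replicas for failover:
gluster volume create datavol replica 2 \
    node1:/bricks/datavol node2:/bricks/datavol
gluster volume start datavol

# Asynchronous geo-replication to an off-site volume, for the
# building-burns-down case:
gluster volume geo-replication datavol drsite::datavol-dr create push-pem
gluster volume geo-replication datavol drsite::datavol-dr start
```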