
How to speed up massive data analysis by eliminating disk seeks

56 points by petewarden over 15 years ago

12 comments

petewarden over 15 years ago
I'm *certain* I'm re-inventing the wheel with this approach, but I obviously haven't been researching in the right places, since I hadn't run across it before I cobbled this together.

I'm expecting an education from the HN community on what I'm missing!
NateLawson over 15 years ago
Yes, you are reinventing the wheel. This kind of approach has been used for decades in disk drive controllers. You sort track accesses in ascending or descending order to prevent longer seeks. It's called the "elevator algorithm".

http://en.wikipedia.org/wiki/Elevator_algorithm

This is combined with tagged queuing, where multiple requests can be accepted from the host at once. The greater your tag depth, the more insight the controller gets into future seeks.
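NateLawson's description maps to a few lines of code. Below is a minimal sketch of the elevator (SCAN) idea, assuming pending requests are plain integer track numbers; the function name and example values are made up for illustration, not taken from any controller implementation.

```python
# Minimal elevator/SCAN sketch: service queued track requests in one sweep
# direction from the current head position, instead of in arrival order.
def elevator_order(requests, head, sweep_up=True):
    above = sorted(r for r in requests if r >= head)                  # upward sweep
    below = sorted((r for r in requests if r < head), reverse=True)   # downward sweep
    return above + below if sweep_up else below + above

pending = [98, 183, 37, 122, 14, 124, 65, 67]   # hypothetical queued tracks
print(elevator_order(pending, head=53))
# -> [65, 67, 98, 122, 124, 183, 37, 14]
```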
aaronblohowiak over 15 years ago
Random access is *always* slower than linear reading. Even if you aren't going to disk, you can avoid blowing out the processor cache and having to fetch from main system memory. RAM is a cache, and L2 is a cache, etc. What you are doing is pretty normal "old-school" Unix programming.
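A quick way to see the gap aaronblohowiak describes is to read the same blocks of a scratch file in order and then in shuffled order. This is only an illustrative sketch: the file name, file size, and block size are arbitrary, and on a warm page cache the measured difference reflects the memory hierarchy more than physical disk seeks.

```python
import os, random, time

PATH, BLOCK, BLOCKS = "scratch.bin", 4096, 25_000   # ~100 MB scratch file

with open(PATH, "wb") as f:                          # create test data
    f.write(os.urandom(BLOCK * BLOCKS))

def timed_read(offsets):
    """Read one block at each offset and return elapsed seconds."""
    start = time.perf_counter()
    with open(PATH, "rb") as f:
        for off in offsets:
            f.seek(off)
            f.read(BLOCK)
    return time.perf_counter() - start

sequential = [i * BLOCK for i in range(BLOCKS)]
shuffled = random.sample(sequential, len(sequential))
print("sequential:", timed_read(sequential))
print("shuffled:  ", timed_read(shuffled))
os.remove(PATH)
```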
timtadh over 15 years ago
It's a good point that classical DBMSs aren't always the right tool; however, the methods a DBMS uses to perform queries are always good to know. The author here implemented a sort-merge join, which is one of the classic join algorithm implementations. For a good overview of the trade-offs between the various join and sort algorithms, see "Principles of Database & Knowledge-Base Systems, Vol. 2" by Jeff Ullman. The first chapter of the book is the one you want. It's dated, but therefore cheap if you get it used.

Here is the WorldCat link: http://www.worldcat.org/oclc/439156325
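For readers who haven't seen one, here is a minimal in-memory sketch of the sort-merge join timtadh is referring to. The records are invented for illustration; a disk-based variant would sort the two inputs externally (e.g. with Unix sort) and stream them rather than holding lists in memory.

```python
from operator import itemgetter

def sort_merge_join(left, right):
    """Join two iterables of (key, value) pairs on their key."""
    left = sorted(left, key=itemgetter(0))
    right = sorted(right, key=itemgetter(0))
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit the cross product of all rows that share this key,
            # then advance both sides past the key.
            i2 = i
            while i2 < len(left) and left[i2][0] == lk:
                j2 = j
                while j2 < len(right) and right[j2][0] == lk:
                    out.append((lk, left[i2][1], right[j2][1]))
                    j2 += 1
                i2 += 1
            i, j = i2, j2
    return out

users  = [("alice", "US"), ("bob", "UK")]
visits = [("alice", "/home"), ("alice", "/about"), ("bob", "/home")]
print(sort_merge_join(users, visits))
# [('alice', 'US', '/home'), ('alice', 'US', '/about'), ('bob', 'UK', '/home')]
```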
stephenjudkins over 15 years ago
Take this post with a grain of salt, since I have the zeal of a recently saved sinner, but you should try using Hive and Hadoop for this sort of thing.

We recently switched from a workflow very similar to the one you describe to using Hive with Amazon's Elastic MapReduce. Hive presents a SQL-like layer of abstraction over exactly this sort of thing. Instead of doing the sorting and merging by hand, you simply write it as a series of joins. It's like writing SQL, except the actual implementation works almost exactly like what you're doing.

Integrating simple Ruby scripts for JSON processing was also trivial.

Elastic MapReduce also had near-zero infrastructure and management overhead for us (besides the 10% Amazon charges on top of the machine instances). We use S3 for all data input and output, which is perfect for us.

Even when running on a single machine, using Hive was a big win in terms of development time, and performance of the jobs seemed only slightly slower than using Unix utilities on big text files. It's almost a bonus that we can also scale it out to dozens of machines for a huge speedup. A job that took several hours on a single machine took less than five minutes, and only a few hours of EC2 machine time. Cheap and easy!
jbeda over 15 years ago
This is basically a map-reduce. You should look at Hadoop as you start doing more complicated stuff.
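As a toy illustration of the map → shuffle → reduce shape jbeda means, here is a word-count-style pipeline on made-up input; the framework's shuffle/sort phase plays the same role the external sort plays in the article's approach.

```python
from itertools import groupby
from operator import itemgetter

records = ["a b a", "b c", "a"]                    # hypothetical input lines

mapped = [(word, 1) for line in records for word in line.split()]   # map
shuffled = sorted(mapped, key=itemgetter(0))                         # shuffle/sort
reduced = {key: sum(count for _, count in group)                     # reduce
           for key, group in groupby(shuffled, key=itemgetter(0))}

print(reduced)   # {'a': 3, 'b': 2, 'c': 1}
```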
speek over 15 years ago
You should check out this paper, "Disk is the new RAM": http://www.ccs.neu.edu/home/gene/papers/acm-viewpoint08.pdf It's really neat.
jbl over 15 years ago
You can even combine this approach with split and parallel make to do a cheap single-machine parallel sort. I use a little script that generates a Makefile that can be invoked with make -j n; it splits an input file, sorts the parts, and then merges them with sort -m. It's proved to be quite handy.
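The comment describes a generated Makefile driving split, sort, and sort -m; the sketch below is only an equivalent of that idea in Python (sort fixed-size chunks in parallel, then k-way merge the runs). File names and the chunk size are illustrative, and the sorted runs are kept in memory here rather than in temporary files as the Makefile version would do.

```python
import heapq
import itertools
from multiprocessing import Pool

CHUNK_LINES = 1_000_000   # arbitrary chunk size

def chunks(path):
    """Yield successive lists of up to CHUNK_LINES lines from the file."""
    with open(path) as f:
        while True:
            block = list(itertools.islice(f, CHUNK_LINES))
            if not block:
                return
            yield block

def parallel_sort(in_path, out_path, workers=4):
    with Pool(workers) as pool:
        runs = pool.map(sorted, chunks(in_path))   # sort chunks in parallel (like make -j)
    with open(out_path, "w") as out:
        out.writelines(heapq.merge(*runs))         # k-way merge, like `sort -m`

if __name__ == "__main__":
    parallel_sort("input.txt", "sorted.txt")       # hypothetical file names
```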
conquest over 15 years ago
Where do you store the processed data for recall? It seems you have data per fan page as well as a Google-style suggest index of those page names.
earle over 15 years ago
This is precisely how data is stored in ranges within HBase/Hypertable. Your data is also sorted between mapping and reducing.
tlack over 15 years ago
Is it really faster to write to a bunch of text files, sort them into a new, bigger text file, and then do the insert? It seems like a lot of extra steps, all involving a lot of reading and writing.
ilaksh over 15 years ago
How about not using a mechanical disk?

Solid-state seeks are something like 10 to 50 times faster.