
How to speed up massive data analysis by eliminating disk seeks

56 points by petewarden over 15 years ago

12 comments

petewarden over 15 years ago
I'm *certain* I'm re-inventing the wheel with this approach, but I obviously haven't been researching in the right places, since I hadn't run across it before I cobbled this together.

I'm expecting an education from the HN community on what I'm missing!
NateLawson over 15 years ago
Yes, you are reinventing the wheel. This kind of approach has been used for decades in disk drive controllers. You sort track accesses in ascending or descending order to prevent longer seeks. It's called the "elevator algorithm".

http://en.wikipedia.org/wiki/Elevator_algorithm

This is combined with tagged queuing, where multiple requests can be accepted from the host at once. The greater your tag depth, the more insight the controller gets into future seeks.
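NateLawson's description maps to a few lines of code. Below is a minimal sketch of the elevator (SCAN) idea, assuming pending requests are plain integer track numbers; the function name and example values are made up for illustration, not taken from any controller implementation.

```python
# Minimal elevator/SCAN sketch: service queued track requests in one sweep
# direction from the current head position, instead of in arrival order.
def elevator_order(requests, head, sweep_up=True):
    above = sorted(r for r in requests if r >= head)                  # upward sweep
    below = sorted((r for r in requests if r < head), reverse=True)   # downward sweep
    return above + below if sweep_up else below + above

pending = [98, 183, 37, 122, 14, 124, 65, 67]   # hypothetical queued tracks
print(elevator_order(pending, head=53))
# -> [65, 67, 98, 122, 124, 183, 37, 14]
```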
aaronblohowiak over 15 years ago
Random access is *always* slower than linear reading. Even if you aren't going to disk, you can avoid blowing out the processor cache and having to fetch from main system memory. RAM is a cache, and L2 is a cache, etc. What you are doing is pretty normal "old-school" Unix programming.
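A quick way to see the gap aaronblohowiak describes is to read the same blocks of a scratch file in order and then in shuffled order. This is only an illustrative sketch: the file name, file size, and block size are arbitrary, and on a warm page cache the measured difference reflects the memory hierarchy more than physical disk seeks.

```python
import os, random, time

PATH, BLOCK, BLOCKS = "scratch.bin", 4096, 25_000   # ~100 MB scratch file

with open(PATH, "wb") as f:                          # create test data
    f.write(os.urandom(BLOCK * BLOCKS))

def timed_read(offsets):
    """Read one block at each offset and return elapsed seconds."""
    start = time.perf_counter()
    with open(PATH, "rb") as f:
        for off in offsets:
            f.seek(off)
            f.read(BLOCK)
    return time.perf_counter() - start

sequential = [i * BLOCK for i in range(BLOCKS)]
shuffled = random.sample(sequential, len(sequential))
print("sequential:", timed_read(sequential))
print("shuffled:  ", timed_read(shuffled))
os.remove(PATH)
```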
timtadh over 15 years ago
It's a good point that classical DBMSs aren't always the right tool; however, the methods a DBMS uses to perform queries are always good to know. The author here implemented a sort-merge join, which is one of the classic join algorithm implementations. For a good overview of the trade-offs between the various join and sort algorithms, see "Principles of Database & Knowledge-Base Systems, Vol. 2" by Jeff Ullman. The first chapter of the book is the one you want. It's dated, but therefore cheap if you get it used.

Here is the WorldCat link: http://www.worldcat.org/oclc/439156325
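For readers who haven't seen one, here is a minimal in-memory sketch of the sort-merge join timtadh is referring to. The records are invented for illustration; a disk-based variant would sort the two inputs externally (e.g. with Unix sort) and stream them rather than holding lists in memory.

```python
from operator import itemgetter

def sort_merge_join(left, right):
    """Join two iterables of (key, value) pairs on their key."""
    left = sorted(left, key=itemgetter(0))
    right = sorted(right, key=itemgetter(0))
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit the cross product of all rows that share this key,
            # then advance both sides past the key.
            i2 = i
            while i2 < len(left) and left[i2][0] == lk:
                j2 = j
                while j2 < len(right) and right[j2][0] == lk:
                    out.append((lk, left[i2][1], right[j2][1]))
                    j2 += 1
                i2 += 1
            i, j = i2, j2
    return out

users  = [("alice", "US"), ("bob", "UK")]
visits = [("alice", "/home"), ("alice", "/about"), ("bob", "/home")]
print(sort_merge_join(users, visits))
# [('alice', 'US', '/home'), ('alice', 'US', '/about'), ('bob', 'UK', '/home')]
```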
stephenjudkins over 15 years ago
Take this post with a grain of salt, since I have the zeal of a recently saved sinner, but you should try using Hive and Hadoop for this sort of thing.

We recently switched from a workflow very similar to the one you describe to using Hive with Amazon's Elastic MapReduce. Hive presents a SQL-like layer of abstraction over exactly this sort of thing. Instead of doing the sorting and merging by hand, you simply write it as a series of joins. It's like writing SQL, except the actual implementation works almost exactly like what you're doing.

Integrating simple Ruby scripts for JSON processing was also trivial.

Elastic MapReduce also had near-zero infrastructure and management overhead for us (besides the 10% Amazon charges on top of the machine instances). We use S3 for all data input and output, which is perfect for us.

Even when running on a single machine, using Hive was a big win in terms of development time, and performance of the jobs seemed only slightly slower than using Unix utilities on big text files. It's almost a bonus that we can also scale it out to dozens of machines for a huge speedup. A job that took several hours on a single machine took less than five minutes, and only a few hours of EC2 machine time. Cheap and easy!
jbeda over 15 years ago
This is basically a map-reduce. You should look at Hadoop as you start doing more complicated stuff.
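As a toy illustration of the map → shuffle → reduce shape jbeda means, here is a word-count-style pipeline on made-up input; the framework's shuffle/sort phase plays the same role the external sort plays in the article's approach.

```python
from itertools import groupby
from operator import itemgetter

records = ["a b a", "b c", "a"]                    # hypothetical input lines

mapped = [(word, 1) for line in records for word in line.split()]   # map
shuffled = sorted(mapped, key=itemgetter(0))                         # shuffle/sort
reduced = {key: sum(count for _, count in group)                     # reduce
           for key, group in groupby(shuffled, key=itemgetter(0))}

print(reduced)   # {'a': 3, 'b': 2, 'c': 1}
```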
speek over 15 years ago
You should check out this paper, "Disk is the new RAM": http://www.ccs.neu.edu/home/gene/papers/acm-viewpoint08.pdf It's really neat.
jbl over 15 years ago
You can even combine this approach with split and parallel make to do a cheap single-machine parallel sort. I use a little script that generates a Makefile that can be invoked with make -j n; it splits an input file, sorts the parts, and then merges them with sort -m. It's proved to be quite handy.
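The comment describes a generated Makefile driving split, sort, and sort -m; the sketch below is only an equivalent of that idea in Python (sort fixed-size chunks in parallel, then k-way merge the runs). File names and the chunk size are illustrative, and the sorted runs are kept in memory here rather than in temporary files as the Makefile version would do.

```python
import heapq
import itertools
from multiprocessing import Pool

CHUNK_LINES = 1_000_000   # arbitrary chunk size

def chunks(path):
    """Yield successive lists of up to CHUNK_LINES lines from the file."""
    with open(path) as f:
        while True:
            block = list(itertools.islice(f, CHUNK_LINES))
            if not block:
                return
            yield block

def parallel_sort(in_path, out_path, workers=4):
    with Pool(workers) as pool:
        runs = pool.map(sorted, chunks(in_path))   # sort chunks in parallel (like make -j)
    with open(out_path, "w") as out:
        out.writelines(heapq.merge(*runs))         # k-way merge, like `sort -m`

if __name__ == "__main__":
    parallel_sort("input.txt", "sorted.txt")       # hypothetical file names
```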
conquest over 15 years ago
Where do you store the processed data for recall? It seems you have data per fan page as well as a Google-style suggest index of those page names.
earle over 15 years ago
This is precisely how data is stored in ranges within HBase/Hypertable. Your data is also sorted between mapping and reducing.
tlack over 15 years ago
Is it really faster to write to a bunch of text files, sort them into a new, bigger text file, and then do the insert? It seems like a lot of extra steps, all involving a lot of reading and writing.
ilaksh over 15 years ago
How about not using a mechanical disk?

Solid-state seeks are something like 10 to 50 times faster.