As part of Spark 2.0, we are introducing some neat new optimizations to make a general engine as efficient as specialized code.<p>I just tried this on the Spark master branch (i.e. the work-in-progress code for Spark 2.0). It takes about 1.5 secs to sum up 1 billion 64-bit integers using a single thread, and about 1 sec using 2 threads. This was done on my laptop (Early 2015 MacBook Pro 13, 3.1GHz Intel Core i7).<p>We haven't optimized integer sorting yet, so that's probably not going to be super fast, but the aggregation performance has been pretty good.<p><pre><code> scala> val start = System.nanoTime
start: Long = 56832659265590
scala> sqlContext.range(0, 1000L * 1000 * 1000, 1, 2).count()
res8: Long = 1000000000
scala> val end = System.nanoTime
end: Long = 56833605100948
scala> (end - start) / 1000 / 1000
res9: Long = 945
</code></pre>
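<p>The snippet above times a count over the range; a sum over the same 1 billion integers can be expressed with the same API. A minimal sketch, using the sqlContext.range call shown above (the generated column is named "id"):<p><pre><code> scala> // aggregate the range's built-in "id" column into a single sum
 scala> sqlContext.range(0, 1000L * 1000 * 1000, 1, 2).selectExpr("sum(id)").show()
</code></pre>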
Part of the time is actually spent analyzing the query plan, optimizing it, and generating bytecode for it. If we run this on 10 billion integers, the time is about 5 secs.
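<p>If you want to see what the planner is doing for a query like this, the standard DataFrame explain API prints the plans (a small sketch; the exact plan output will vary by build):<p><pre><code> scala> val df = sqlContext.range(0, 1000L * 1000 * 1000, 1, 2)
 scala> df.explain(true)  // prints parsed, analyzed, optimized, and physical plans
</code></pre>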