As far as I can see, the currently best-performing solution [0] does not account for hash collisions and therefore probably generates wrong results if enough different cities are in the dataset. Or am I missing something?

[0] https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_ebarlas.java
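Accounting for collisions means the table has to compare the stored name bytes on every lookup, not just the hash. A minimal sketch of a collision-safe probe (the names and sizes here are illustrative, not taken from the linked solution):

  // Open addressing with a full key comparison, so two different station
  // names that hash to the same slot can never be merged.
  static final int SLOTS = 1 << 14;                 // power of two for cheap masking
  static final byte[][] keys = new byte[SLOTS][];
  static final double[][] stats = new double[SLOTS][];

  static int slotFor(byte[] name, int hash) {
      int i = hash & (SLOTS - 1);
      while (keys[i] != null && !java.util.Arrays.equals(keys[i], name)) {
          i = (i + 1) & (SLOTS - 1);                // linear probe past the collision
      }
      if (keys[i] == null) {
          keys[i] = name.clone();                   // first sighting of this station
          stats[i] = new double[] { Double.MAX_VALUE, -Double.MAX_VALUE, 0, 0 }; // min, max, sum, count
      }
      return i;
  }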
I believe the whole thing can be done in 0.3 seconds with the following approach:

(Describing only the 'happy path' here; other paths can be made fast too, but will require different implementations.)

* Since temperatures are given to only one decimal place, we have a finite number of temperatures; ~400 values will cover all common cases.

* We also have a finite number of place names (~400).

* Just make a lookup table of all temps and all place names (~160,000 entries).

* Autogenerate a state machine that will map each of those 160,000 things, at any rotation within a 4-byte register, to a unique bin in a hash table. The state machine has one 32-bit state register (16 bits to output, 16 to carry over to the next cycle); every cycle, a lookup in the state transition table is done and the next 4 bytes of data are XOR'ed on top. (A rough sketch follows after this list.)

* Run through all the data, at RAM speed, incrementing counters for each state the machine ends up in (there will only be 65k). These counters fit fully in cache.

* With just 32 bits of state, with AVX-512 we can be running 512 copies of this in parallel if we like, per core! AKA, compute will not be the bottleneck.

* From the values of the counters, you can calculate the answers. 65k is a far smaller number than 1 billion, so you don't need to do this bit fast.

* For anything that doesn't map to a valid bin (higher/lower temperatures, unknown place names), just fall back to slow code (one state of the state machine can be reserved for 'escape to slow code'; use these escapes for min/max handling too, since it will only happen a few thousand times).

* I think this approach can operate at RAM speed with just one core with AVX-512, so there's no benefit in splitting across cores.
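A very rough sketch of the proposed inner loop, under the assumption that the 65,536-entry transition table has already been autogenerated offline (which is the hard part, and not shown); buildTransitionTable is a hypothetical placeholder:

  static final int[] TABLE = buildTransitionTable(); // hypothetical, built offline
  static final long[] counters = new long[1 << 16];  // one counter per output bin

  static void consume(byte[] data) {
      int state = 0;
      for (int i = 0; i + 4 <= data.length; i += 4) {
          int word = (data[i] & 0xFF)
                   | (data[i + 1] & 0xFF) << 8
                   | (data[i + 2] & 0xFF) << 16
                   | (data[i + 3] & 0xFF) << 24;
          // low 16 bits carry over through the table lookup; the next
          // 4 input bytes are XOR'ed on top, one lookup per 4 bytes
          state = TABLE[state & 0xFFFF] ^ word;
          counters[state >>> 16]++;                  // high 16 bits name the bin
      }
  }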
> The slowest and the fastest runs are discarded. The mean value of the remaining three runs is the result for that contender and will be added to the leaderboard.

I think it's better to discard the two slowest, or to simply accept the fastest run as the result. There's (in my opinion) no good reason to discard the best runs.
The rule lawyer in me wants to spend the first run spinning up a background daemon that loads everything into memory, pins it there, and maybe even prefetches everything into cache as the subsequent runs perform what is basically a linear scan (you never have to page-miss if you have an oracle!).

> write a Java program for retrieving temperature measurement values from a text file and calculating the min, mean, and max temperature per weather station

Depending on how far you want to stretch it, doesn't precomputing the result on the first run count too? Or pre-parsing the numbers into a compact format you can just slurp directly into rolling sums on subsequent runs.

Not in the slightest in the spirit of the competition, but not against the rules as far as I can tell.

Edit: if we don't like pre-computation, we can still play with fancy out-of-the-box tricks like pre-sorting the input, pre-parsing, compacting, aligning everything, etc.

Edit 2: while we're here, why not just patch the calculate_time script to return 0 seconds :) And return 9999 for competitors, for good measure.
Isn't this simply bound by the speed of the disk? Surely none of the suggested optimizations (SIMD, multi-threading) are relevant.

It would also depend on how many different stations there are and what their names are, for the hash lookup, but I seriously doubt this will be anything measurable compared to I/O.
> Q: Can I make assumptions on the names of the weather stations showing up in the data set?

> A: No, while only a fixed set of station names is used by the data set generator, any solution should work with arbitrary UTF-8 station names (for the sake of simplicity, names are guaranteed to contain no `;` character).

I'm unsure whether it's intentional or not, but this essentially means that the submission must be correct for all inputs, yet can and probably should be tuned for the particular input generated by `create_measurements.sh`. I can imagine submissions with a perfect hash function tuned for the given set of stations, for example (see the sketch below).
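For illustration, such a tuned submission might brute-force a multiplier offline until the ~400 known names map to distinct buckets. A hypothetical sketch (the seed value here is a placeholder, not an actual search result):

  // Perfect-hash idea: SEED is chosen offline so that every generated
  // station name lands in its own bucket; arbitrary inputs still need
  // a fallback comparison to stay correct.
  static final int SEED = 0x9E3779B1;   // placeholder multiplier
  static int bucket(byte[] name, int len) {
      int h = 1;
      for (int i = 0; i < len; i++) {
          h = h * SEED + name[i];
      }
      return h & 1023;                  // 1024 buckets for ~400 names
  }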
Just for fun: speed-testing awk vs. Java.

  awk -F';' '{
      station = $1
      temperature = $2
      sum[station] += temperature
      count[station]++
      # first sighting initializes min/max; check the count before comparing
      if (count[station] == 1 || temperature < min[station]) {
          min[station] = temperature
      }
      if (count[station] == 1 || temperature > max[station]) {
          max[station] = temperature
      }
  }
  END {
      PROCINFO["sorted_in"] = "@ind_str_asc"  # gawk: iterate stations alphabetically
      sep = "{"
      for (s in sum) {
          mean = sum[s] / count[s]
          printf "%s%s=%.1f/%.1f/%.1f", sep, s, min[s], mean, max[s]
          sep = ", "
      }
      print "}"
  }' measurement.txt
Ooh fun, an Advent of Code chaser!

A fair comparison between languages should include the make and build times. I haven't used Java/Maven for years, and I'm reminded why, heading into minute 2 of downloads for './mvnw clean verify'.
In the C course at the Czech Technical University, we had a very similar assignment. The submissions of all the course's students were continuously evaluated on a leaderboard, and many spent tens of hours optimizing to get the extra points for a better grade (but mostly for status points).
> Each contender will be run five times in a row. The slowest and the fastest runs are discarded. The mean value of the remaining three runs is the result for that contender and will be added to the leaderboard.

Shouldn't he take the single fastest time, assuming the file (and JDK) being in the file cache is controlled for?
Very fun challenge that nerd-sniped me right away. I had to do a C version in standard C99 with POSIX threads. It [1] clocks in at just under 4 seconds on my AMD Ryzen 4800U laptop CPU. It should run about 10-20% faster than that on the mentioned Hetzner hardware.

- Since we only deal with one decimal of floating-point precision, it uses integer math right from the get-go.

- FNV-1a hash with linear probing and a load factor well under 0.5 (see the sketch below).

- The data file is mmap'd into memory.

- Data is processed in 8 totally separate chunks (no concurrent data structures), and those aggregations are in turn aggregated when all threads have finished.

1: https://github.com/dannyvankooten/1brc
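For anyone unfamiliar, the FNV-1a part is tiny. A sketch in Java rather than the linked C (the constants are the standard 32-bit FNV parameters):

  static int fnv1a(byte[] buf, int off, int len) {
      int h = 0x811C9DC5;               // FNV-1a 32-bit offset basis
      for (int i = off; i < off + len; i++) {
          h ^= buf[i] & 0xFF;           // xor in the byte first...
          h *= 0x01000193;              // ...then multiply by the FNV prime
      }
      return h;
  }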
Comparing solutions from the Rstats world: https://twitter.com/RandVegan/status/1742721195781267650
This has been super fun. My current PR [1] is another 15% faster than my entry in the README. Sadly I haven't been able to make any progress on using SIMD to accelerate any part of it.

I think the issues with hashing could easily be covered by having the 500 city names in the test data also be randomly generated at test time. There is no way to ensure there aren't hash collisions without doing a complete comparison between the names.

[1] https://github.com/gunnarmorling/1brc/pull/56
I think the optimal strategy would be to use the "reduce" step from MapReduce: have threads that read portions of the file and add data to a "list", one for each unique name; then that set of threads can "process" the lists (a sketch of the merge follows below). I don't think we need to sort, that would be too expensive; just a linear pass would be good. I can't see how we can use SIMD, since we want max/min, which mandate a linear pass anyway.
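A sketch of that reduce step, assuming a hypothetical Stats holder of per-station min/max/sum/count; each thread produces its own map, and the maps are merged serially at the end:

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  static final class Stats {
      double min = Double.MAX_VALUE, max = -Double.MAX_VALUE, sum;
      long count;
  }

  // Merge per-thread partial aggregates; cheap because there are only
  // a few hundred stations, no matter how many rows were read.
  static Map<String, Stats> reduce(List<Map<String, Stats>> partials) {
      Map<String, Stats> out = new HashMap<>();
      for (Map<String, Stats> p : partials) {
          p.forEach((name, s) -> out.merge(name, s, (a, b) -> {
              a.min = Math.min(a.min, b.min);
              a.max = Math.max(a.max, b.max);
              a.sum += b.sum;
              a.count += b.count;
              return a;
          }));
      }
      return out;
  }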
I suppose it defeats the spirit of the game to, knowing your worst run is discarded, calculate the results on the first run by whatever slow method you want, save them somewhere useful, and just read the results and print them out on the following runs?

Or, at the very least, convert the input into a more convenient binary format for the following runs.
Interestingly, because your program runs on Linux and is run 5 times, Linux will almost certainly cache the 12 GB file in RAM on the first invocation.

This means that future invocations don't have to load the file from disk. It also makes it pretty critical that your program doesn't use more than 16 GB of RAM itself (out of the server's 32 GB), or it'll push the file out of the cache, making future invocations of your program slower.
Unfortunately I have no time to code this up, but here are the things I would try to make it fast:

- Read with O_DIRECT (but compare it with the mmap approach). I know O_DIRECT gets much hate, but this could be one of the rare cases where it helps.

- Use a simple array with a sentinel and linear search for the station names (sketched below). This is dumb, but if the number of stations is small enough, it could beat the hash. (In the back of my head I have a rule of thumb that linear search is faster for up to 1000 elements, but I'm not sure anymore where I got it from.)
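A sketch of that sentinel trick (illustrative Java; the capacity is arbitrary):

  static final String[] names = new String[1024 + 1]; // +1 slot for the sentinel
  static int size = 0;

  static int find(String key) {
      names[size] = key;                // sentinel: the scan always terminates
      int i = 0;
      while (!names[i].equals(key)) {
          i++;                          // no bounds check needed inside the loop
      }
      if (i == size) {
          size++;                       // new station; caller initializes its stats
      }
      return i;
  }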
Here's my implementation in Go, which runs in under 5 seconds. It doesn't use anything too obscure, only the built-in maps and no external libraries. It's also the fastest solution I've tested on my M3 Pro Mac. Eager to see what beats it!

https://gist.github.com/corlinp/176a97c58099bca36bcd5679e68f9708
I don't understand; it should be pretty easy. A rolling average with BigDecimal would probably be sufficient, though a scientific lib might be better for a rolling average over more than a hundred million numbers. (A sketch follows below.)

https://stackoverflow.com/questions/277309/java-floating-point-high-precision-library
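A sketch of that rolling mean with BigDecimal; a running sum and count are enough (nothing here is taken from the linked thread):

  import java.math.BigDecimal;
  import java.math.MathContext;

  // Accumulate the exact sum; a billion additions lose no precision.
  final class RunningMean {
      private BigDecimal sum = BigDecimal.ZERO;
      private long count = 0;

      void add(String reading) {        // e.g. "13.5" straight from the file
          sum = sum.add(new BigDecimal(reading));
          count++;
      }

      BigDecimal mean() {
          return sum.divide(BigDecimal.valueOf(count), MathContext.DECIMAL64);
      }
  }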
My one-liner q solution runs in 76 ms on my 10-year-old Mac:

  q)\ts exec (min;max;avg)@\:measurement by city from flip `city`measurement!("SF";";") 0: `:weather_stations.csv
  76 8720688
1 Billion Row Challenge with Apache Pinot:

https://hubertdulay.substack.com/p/1-billion-row-challenge-in-apache

852 ms on an M1 Max.
It would have been nice to accept other JVM languages like Scala, Clojure, Kotlin, etc. and compare them against Java, which I expect to perform slightly better.
First search result for averaging streaming data:

https://nestedsoftware.com/2018/03/20/calculating-a-moving-average-on-streaming-data-5a7k.22879.html

So you just walk the file, read a chunk, update the averages, and move on. The resource usage should be next to nothing, and speed should be limited by your disk I/O. (A sketch follows below.)
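That approach, spelled out as a minimal Java sketch (the file name and the Agg record are assumptions, not from the linked post):

  import java.io.BufferedReader;
  import java.io.IOException;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.util.HashMap;
  import java.util.Map;

  public final class StreamingStats {
      record Agg(double min, double max, double sum, long count) {}

      public static void main(String[] args) throws IOException {
          Map<String, Agg> stats = new HashMap<>();
          try (BufferedReader r = Files.newBufferedReader(Path.of("measurements.txt"))) {
              String line;
              while ((line = r.readLine()) != null) {
                  int sep = line.indexOf(';');            // "station;temperature"
                  String station = line.substring(0, sep);
                  double t = Double.parseDouble(line.substring(sep + 1));
                  stats.merge(station, new Agg(t, t, t, 1), (a, b) ->
                          new Agg(Math.min(a.min(), t), Math.max(a.max(), t),
                                  a.sum() + t, a.count() + 1));
              }
          }
          stats.forEach((s, a) -> System.out.printf("%s=%.1f/%.1f/%.1f%n",
                  s, a.min(), a.sum() / a.count(), a.max()));
      }
  }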
You got 32 GB rather than the expected 16 GB of RAM, but for Java I doubt even 32 GB is enough. Java was absolutely good in the old days, but nowadays there are a lot more options than that poor guy.