As far as I can see, the currently best-performing solution [0] does not account for hash collisions and therefore probably generates wrong results if enough different cities are in the dataset. Or am I missing something?

[0] https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_ebarlas.java
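Accounting for collisions means the table has to compare the stored name bytes on every lookup, not just the hash. A minimal sketch of a collision-safe probe (the names and sizes here are illustrative, not taken from the linked solution):

  // Open addressing with a full key comparison, so two different station
  // names that hash to the same slot can never be merged.
  static final int SLOTS = 1 << 14;                 // power of two for cheap masking
  static final byte[][] keys = new byte[SLOTS][];
  static final double[][] stats = new double[SLOTS][];

  static int slotFor(byte[] name, int hash) {
      int i = hash & (SLOTS - 1);
      while (keys[i] != null && !java.util.Arrays.equals(keys[i], name)) {
          i = (i + 1) & (SLOTS - 1);                // linear probe past the collision
      }
      if (keys[i] == null) {
          keys[i] = name.clone();                   // first sighting of this station
          stats[i] = new double[] { Double.MAX_VALUE, -Double.MAX_VALUE, 0, 0 }; // min, max, sum, count
      }
      return i;
  }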
I believe the whole thing can be done in 0.3 seconds with the following approach:

(Describing only the 'happy path' here; other paths can be made fast too, but will require different implementations.)

* Since temperatures are given to only one decimal place, we have a finite number of temperatures; ~400 values will cover all common cases.

* We also have a finite number of place names (~400).

* Just make a lookup table of all temps and all place names (~160,000 entries).

* Autogenerate a state machine that will map each of those 160,000 things, at any rotation within a 4-byte register, to a unique bin in a hash table. The state machine has one 32-bit state register (16 bits to output, 16 to carry over to the next cycle); every cycle, a lookup in the state transition table is done and the next 4 bytes of data are XOR'ed on top. (A rough sketch follows after this list.)

* Run through all the data, at RAM speed, incrementing counters for each state the machine ends up in (there will only be 65k). These counters fit fully in cache.

* With just 32 bits of state, with AVX-512 we can be running 512 copies of this in parallel if we like, per core! AKA, compute will not be the bottleneck.

* From the values of the counters, you can calculate the answers. 65k is a far smaller number than 1 billion, so you don't need to do this bit fast.

* For anything that doesn't map to a valid bin (higher/lower temperatures, unknown place names), just fall back to slow code (one state of the state machine can be reserved for 'escape to slow code'; use these escapes for min/max handling too, since it will only happen a few thousand times).

* I think this approach can operate at RAM speed with just one core with AVX-512, so there's no benefit in splitting across cores.
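A very rough sketch of the proposed inner loop, under the assumption that the 65,536-entry transition table has already been autogenerated offline (which is the hard part, and not shown); buildTransitionTable is a hypothetical placeholder:

  static final int[] TABLE = buildTransitionTable(); // hypothetical, built offline
  static final long[] counters = new long[1 << 16];  // one counter per output bin

  static void consume(byte[] data) {
      int state = 0;
      for (int i = 0; i + 4 <= data.length; i += 4) {
          int word = (data[i] & 0xFF)
                   | (data[i + 1] & 0xFF) << 8
                   | (data[i + 2] & 0xFF) << 16
                   | (data[i + 3] & 0xFF) << 24;
          // low 16 bits carry over through the table lookup; the next
          // 4 input bytes are XOR'ed on top, one lookup per 4 bytes
          state = TABLE[state & 0xFFFF] ^ word;
          counters[state >>> 16]++;                  // high 16 bits name the bin
      }
  }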
> The slowest and the fastest runs are discarded. The mean value of the remaining three runs is the result for that contender and will be added to the leaderboard.

I think it's better to discard the two slowest, or to simply accept the fastest run as the result. There's (in my opinion) no good reason to discard the best runs.
The rule lawyer in me wants to spend the first run spinning up a background daemon that loads everything into memory, pins it there, and maybe even prefetches everything into cache as the subsequent runs perform what is basically a linear scan (you never have to page-miss if you have an oracle!).

> write a Java program for retrieving temperature measurement values from a text file and calculating the min, mean, and max temperature per weather station

Depending on how far you want to stretch it, doesn't precomputing the result on the first run count too? Or pre-parsing the numbers into a compact format you can just slurp directly into rolling sums on subsequent runs.

Not in the slightest in the spirit of the competition, but not against the rules as far as I can tell.

Edit: if we don't like pre-computation, we can still play with fancy out-of-the-box tricks like pre-sorting the input, pre-parsing, compacting, aligning everything, etc.

Edit 2: while we're here, why not just patch the calculate_time script to return 0 seconds :) And return 9999 for competitors, for good measure.
Isn't this simply bound by the speed of the disk? Surely none of the suggested optimizations (SIMD, multi-threading) are relevant.

It would also depend on how many different stations there are and what their names are, for the hash lookup, but I seriously doubt this will be anything measurable compared to I/O.
> Q: Can I make assumptions on the names of the weather stations showing up in the data set?

> A: No, while only a fixed set of station names is used by the data set generator, any solution should work with arbitrary UTF-8 station names (for the sake of simplicity, names are guaranteed to contain no `;` character).

I'm unsure whether it's intentional or not, but this essentially means that the submission must be correct for all inputs, yet can and probably should be tuned for the particular input generated by `create_measurements.sh`. I can imagine submissions with a perfect hash function tuned for the given set of stations, for example (see the sketch below).
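For illustration, such a tuned submission might brute-force a multiplier offline until the ~400 known names map to distinct buckets. A hypothetical sketch (the seed value here is a placeholder, not an actual search result):

  // Perfect-hash idea: SEED is chosen offline so that every generated
  // station name lands in its own bucket; arbitrary inputs still need
  // a fallback comparison to stay correct.
  static final int SEED = 0x9E3779B1;   // placeholder multiplier
  static int bucket(byte[] name, int len) {
      int h = 1;
      for (int i = 0; i < len; i++) {
          h = h * SEED + name[i];
      }
      return h & 1023;                  // 1024 buckets for ~400 names
  }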
Just for fun: speed-testing awk vs. Java.

  awk -F';' '{
      station = $1
      temperature = $2
      sum[station] += temperature
      count[station]++
      # first sighting initializes min/max; check the count before comparing
      if (count[station] == 1 || temperature < min[station]) {
          min[station] = temperature
      }
      if (count[station] == 1 || temperature > max[station]) {
          max[station] = temperature
      }
  }
  END {
      PROCINFO["sorted_in"] = "@ind_str_asc"  # gawk: iterate stations alphabetically
      sep = "{"
      for (s in sum) {
          mean = sum[s] / count[s]
          printf "%s%s=%.1f/%.1f/%.1f", sep, s, min[s], mean, max[s]
          sep = ", "
      }
      print "}"
  }' measurement.txt
Ooh fun, an Advent of Code chaser!

A fair comparison between languages should include the make and build times. I haven't used Java/Maven for years, and I'm reminded why, heading into minute 2 of downloads for './mvnw clean verify'.
In the C course at the Czech Technical University, we had a very similar assignment. The submissions of all the course's students were continuously evaluated on a leaderboard, and many spent tens of hours optimizing to get the extra points for a better grade (but mostly for status points).
> Each contender will be run five times in a row. The slowest and the fastest runs are discarded. The mean value of the remaining three runs is the result for that contender and will be added to the leaderboard.

Shouldn't he take the single fastest time, assuming the file (and JDK) being in the file cache is controlled for?
Very fun challenge that nerd-sniped me right away. I had to do a C version in standard C99 with POSIX threads. It [1] clocks in at just under 4 seconds on my AMD Ryzen 4800U laptop CPU. It should run about 10-20% faster than that on the mentioned Hetzner hardware.

- Since we only deal with one decimal of floating-point precision, it uses integer math right from the get-go.

- FNV-1a hash with linear probing and a load factor well under 0.5 (see the sketch below).

- The data file is mmap'd into memory.

- Data is processed in 8 totally separate chunks (no concurrent data structures), and those aggregations are in turn aggregated when all threads have finished.

1: https://github.com/dannyvankooten/1brc
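For anyone unfamiliar, the FNV-1a part is tiny. A sketch in Java rather than the linked C (the constants are the standard 32-bit FNV parameters):

  static int fnv1a(byte[] buf, int off, int len) {
      int h = 0x811C9DC5;               // FNV-1a 32-bit offset basis
      for (int i = off; i < off + len; i++) {
          h ^= buf[i] & 0xFF;           // xor in the byte first...
          h *= 0x01000193;              // ...then multiply by the FNV prime
      }
      return h;
  }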
Comparing solutions from the Rstats world: https://twitter.com/RandVegan/status/1742721195781267650
This has been super fun. My current PR [1] is another 15% faster than my entry in the README. Sadly I haven't been able to make any progress on using SIMD to accelerate any part of it.

I think the issues with hashing could easily be covered by having the 500 city names in the test data also be randomly generated at test time. There is no way to ensure there aren't hash collisions without doing a complete comparison between the names.

[1] https://github.com/gunnarmorling/1brc/pull/56
I think the optimal strategy would be to use the "reduce" step from MapReduce: have threads that read portions of the file and add data to a "list", one for each unique name; then that set of threads can "process" the lists (a sketch of the merge follows below). I don't think we need to sort, that would be too expensive; just a linear pass would be good. I can't see how we can use SIMD, since we want max/min, which mandate a linear pass anyway.
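A sketch of that reduce step, assuming a hypothetical Stats holder of per-station min/max/sum/count; each thread produces its own map, and the maps are merged serially at the end:

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  static final class Stats {
      double min = Double.MAX_VALUE, max = -Double.MAX_VALUE, sum;
      long count;
  }

  // Merge per-thread partial aggregates; cheap because there are only
  // a few hundred stations, no matter how many rows were read.
  static Map<String, Stats> reduce(List<Map<String, Stats>> partials) {
      Map<String, Stats> out = new HashMap<>();
      for (Map<String, Stats> p : partials) {
          p.forEach((name, s) -> out.merge(name, s, (a, b) -> {
              a.min = Math.min(a.min, b.min);
              a.max = Math.max(a.max, b.max);
              a.sum += b.sum;
              a.count += b.count;
              return a;
          }));
      }
      return out;
  }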
I suppose it defeats the spirit of the game to, knowing your worst run is discarded, calculate the results on the first run by whatever slow method you want, save them somewhere useful, and just read the results and print them out on the following runs?

Or, at the very least, convert the input into a more convenient binary format for the following runs.
Interestingly, because your program runs on Linux and is run 5 times, Linux will almost certainly cache the 12 GB file in RAM on the first invocation.

This means that future invocations don't have to load the file from disk. It also makes it pretty critical that your program doesn't use more than 16 GB of RAM itself (out of the server's 32 GB), or it'll push the file out of the cache, making future invocations of your program slower.
Unfortunately I have no time to code this up, but here are the things I would try to make it fast:

- Read with O_DIRECT (but compare it with the mmap approach). I know O_DIRECT gets much hate, but this could be one of the rare cases where it helps.

- Use a simple array with a sentinel and linear search for the station names (sketched below). This is dumb, but if the number of stations is small enough, it could beat the hash. (In the back of my head I have a rule of thumb that linear search is faster for up to 1000 elements, but I'm not sure anymore where I got it from.)
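A sketch of that sentinel trick (illustrative Java; the capacity is arbitrary):

  static final String[] names = new String[1024 + 1]; // +1 slot for the sentinel
  static int size = 0;

  static int find(String key) {
      names[size] = key;                // sentinel: the scan always terminates
      int i = 0;
      while (!names[i].equals(key)) {
          i++;                          // no bounds check needed inside the loop
      }
      if (i == size) {
          size++;                       // new station; caller initializes its stats
      }
      return i;
  }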
Here's my implementation in Go, which runs in under 5 seconds. It doesn't use anything too obscure, only the built-in maps and no external libraries. It's also the fastest solution I've tested on my M3 Pro Mac. Eager to see what beats it!

https://gist.github.com/corlinp/176a97c58099bca36bcd5679e68f9708
I don't understand; it should be pretty easy. A rolling average with BigDecimal would probably be sufficient, though a scientific lib might be better for a rolling average over more than a hundred million numbers. (A sketch follows below.)

https://stackoverflow.com/questions/277309/java-floating-point-high-precision-library
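A sketch of that rolling mean with BigDecimal; a running sum and count are enough (nothing here is taken from the linked thread):

  import java.math.BigDecimal;
  import java.math.MathContext;

  // Accumulate the exact sum; a billion additions lose no precision.
  final class RunningMean {
      private BigDecimal sum = BigDecimal.ZERO;
      private long count = 0;

      void add(String reading) {        // e.g. "13.5" straight from the file
          sum = sum.add(new BigDecimal(reading));
          count++;
      }

      BigDecimal mean() {
          return sum.divide(BigDecimal.valueOf(count), MathContext.DECIMAL64);
      }
  }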
My one-liner q solution runs in 76 ms on my 10-year-old Mac:

  q)\ts exec (min;max;avg)@\:measurement by city from flip `city`measurement!("SF";";") 0: `:weather_stations.csv
  76 8720688
1 Billion Row Challenge with Apache Pinot:

https://hubertdulay.substack.com/p/1-billion-row-challenge-in-apache

852 ms on an M1 Max.
It would have been nice to accept other JVM languages like Scala, Clojure, Kotlin, etc. and compare them against Java, which I expect to perform slightly better.
First search result for averaging streaming data:

https://nestedsoftware.com/2018/03/20/calculating-a-moving-average-on-streaming-data-5a7k.22879.html

So you just walk the file, read a chunk, update the averages, and move on. The resource usage should be next to nothing, and speed should be limited by your disk I/O. (A sketch follows below.)
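That approach, spelled out as a minimal Java sketch (the file name and the Agg record are assumptions, not from the linked post):

  import java.io.BufferedReader;
  import java.io.IOException;
  import java.nio.file.Files;
  import java.nio.file.Path;
  import java.util.HashMap;
  import java.util.Map;

  public final class StreamingStats {
      record Agg(double min, double max, double sum, long count) {}

      public static void main(String[] args) throws IOException {
          Map<String, Agg> stats = new HashMap<>();
          try (BufferedReader r = Files.newBufferedReader(Path.of("measurements.txt"))) {
              String line;
              while ((line = r.readLine()) != null) {
                  int sep = line.indexOf(';');            // "station;temperature"
                  String station = line.substring(0, sep);
                  double t = Double.parseDouble(line.substring(sep + 1));
                  stats.merge(station, new Agg(t, t, t, 1), (a, b) ->
                          new Agg(Math.min(a.min(), t), Math.max(a.max(), t),
                                  a.sum() + t, a.count() + 1));
              }
          }
          stats.forEach((s, a) -> System.out.printf("%s=%.1f/%.1f/%.1f%n",
                  s, a.min(), a.sum() / a.count(), a.max()));
      }
  }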
You got 32 GB rather than the expected 16 GB of RAM, but for Java I doubt even 32 GB is enough. Java was absolutely good in the old days, but nowadays there are a lot more options than that poor guy.