
The One Billion Row Challenge

482 points by madmax108 over 1 year ago

42 comments

planb over 1 year ago
As far as I see, the currently best performing solution [0] does not account for hash collisions and therefore probably generates wrong results if enough different cities are in the dataset. Or am I missing something?

[0] https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_ebarlas.java
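The missing check planb describes can be sketched in Java: an open-addressing table has to compare the stored key bytes on every probe, not just the hash value, or two colliding stations would silently share a slot. (A minimal illustration of the general point, not ebarlas's actual code.)

```java
import java.util.Arrays;

// Tiny open-addressing table: the Arrays.equals() call is the collision
// check under discussion. Without it, two different station names that
// hash to the same slot would be merged into one aggregate.
class StationTable {
    private final byte[][] keys = new byte[1 << 14][];
    private final int[] counts = new int[1 << 14];

    int slotFor(byte[] name, int hash) {
        int i = hash & (keys.length - 1);
        while (keys[i] != null && !Arrays.equals(keys[i], name)) {
            i = (i + 1) & (keys.length - 1); // linear probing past collisions
        }
        if (keys[i] == null) keys[i] = name;
        return i;
    }

    void add(byte[] name, int hash) { counts[slotFor(name, hash)]++; }
    int count(byte[] name, int hash) { return counts[slotFor(name, hash)]; }
}
```

Even if two names are given the same hash on purpose, the byte comparison keeps their slots separate.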
londons_explore over 1 year ago
I believe the whole thing can be done in 0.3 seconds with the following approach:

(Describing only the 'happy path' here - other paths can be made fast too, but will require different implementations)

* Since temperatures are only to 0.1 decimal points, we have a finite number of temperatures. ~400 temperatures will cover all common cases.

* We also have a finite number of place names. (~400)

* Just make a lookup table of all temps and all place names. (~160,000)

* Autogenerate a state machine that will map each of the above 160,000 things, at any rotation within a 4-byte register, to a unique bin in a hash table. The state machine will have one 32-bit state register (16 bits to output, 16 to carry over to the next cycle) where every cycle a lookup in the state transition table is done, and the next 4 bytes of data XOR'ed on top.

* Run through all the data, at RAM speed, incrementing counters for each state the machine ends up in (there will only be 65k). These counters fit fully in cache.

* With just 32 bits of state, with AVX-512 we can be running 512 copies of this in parallel if we like, per core! AKA, compute will not be the bottleneck.

* From the values of the counters, you can calculate the answers. 65k is a far smaller number than 1 billion, so you don't need to do this bit fast.

* For anything that doesn't map to a valid bin (higher/lower temperatures, unknown place names), just fall back to slow code (one state of the state machine can be reserved for 'escape to slow code'. Use these escapes for min/max handling too, since it will only happen a few thousand times).

* I think this approach can operate at RAM speed with just one core with AVX-512, so no benefit in splitting across cores.
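The bucketing observation at the heart of this proposal can be checked in plain Java, no SIMD needed: assuming one-decimal readings stay within -99.9 to 99.9, there are at most 1,999 distinct values, so a per-station histogram of counts recovers min, mean and max exactly. A sketch of just that part (the state machine itself is not reproduced here):

```java
// Per-station histogram over tenth-of-a-degree buckets. Exact min/mean/max
// fall out of the counts alone, which is why counting "which bin did each
// line land in" is sufficient.
class TempHistogram {
    static final int OFFSET = 999;          // maps -99.9 (i.e. -999 tenths) to index 0
    final long[] buckets = new long[1999];  // one count per possible reading

    void add(int tenths) { buckets[tenths + OFFSET]++; }

    double[] minMeanMax() {
        int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE;
        long sum = 0, n = 0;
        for (int i = 0; i < buckets.length; i++) {
            if (buckets[i] == 0) continue;
            int tenths = i - OFFSET;
            min = Math.min(min, tenths);
            max = Math.max(max, tenths);
            sum += (long) tenths * buckets[i];
            n += buckets[i];
        }
        return new double[] { min / 10.0, sum / 10.0 / n, max / 10.0 };
    }
}
```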
JohnKemeny over 1 year ago
> The slowest and the fastest runs are discarded. The mean value of the remaining three runs is the result for that contender and will be added to the leaderboard.

I think it's better to discard the two slowest, or simply accept the fastest as the correct one. There's (in my opinion) no good reason to discard the best runs.
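For reference, the scoring rule being debated, written out as a trivial sketch:

```java
import java.util.Arrays;

// Sort the five runs, drop the fastest and slowest, average the middle three.
class Score {
    static double score(double[] runs) {
        double[] sorted = runs.clone();
        Arrays.sort(sorted);
        return (sorted[1] + sorted[2] + sorted[3]) / 3.0;
    }
}
```

JohnKemeny's alternative would simply be `sorted[0]`, or the mean of `sorted[0..2]`.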
e63f67dd-065b over 1 year ago
The rule lawyer in me wants to spend the first run spinning up a background daemon that loads everything into memory, pins it there, and maybe even prefetches everything into cache as the subsequent runs perform basically a linear scan (you never have to page-miss if you have an oracle!).

> write a Java program for retrieving temperature measurement values from a text file and calculating the min, mean, and max temperature per weather station

Depending on how far you want to stretch it, doesn't precomputing the result on the 1st run count too? Or pre-parsing the numbers into a compact format you can just slurp directly into rolling sums for subsequent runs.

Not in the slightest in the spirit of the competition, but not against the rules as far as I can tell.

Edit: if we don't like pre-computation, we can still play with fancy out-of-the-box tricks like pre-sorting the input, pre-parsing, compacting, aligning everything, etc.

Edit 2: while we're here, why not just patch the calculate_time script to return 0 seconds :) And return 9999 for competitors, for good measure.
mgaunard over 1 year ago
Isn't this simply bound by the speed of the disk? Surely none of the suggested optimizations (SIMD, multi-threading) are relevant.

It would also depend on how many different stations there are and what they are, for the hash lookup, but I seriously doubt this will be anything measurable compared to I/O.
lifthrasiir over 1 year ago
> Q: Can I make assumptions on the names of the weather stations showing up in the data set?

> A: No, while only a fixed set of station names is used by the data set generator, any solution should work with arbitrary UTF-8 station names (for the sake of simplicity, names are guaranteed to contain no `;` character).

I'm unsure if it's intentional or not, but this essentially means that the submission should be correct for all inputs, but can and probably should be tuned for the particular input regenerated by `create_measurements.sh`. I can imagine submissions with a perfect hash function tuned for the given set of stations, for example.
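The tuning lifthrasiir imagines could look roughly like this: search for hash parameters that happen to be collision-free on the generator's fixed station list, while keeping a correctness check so arbitrary UTF-8 names can still fall back to a general path. The station names and multiplier search below are illustrative assumptions, not taken from the real data set.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

// Brute-force search for a multiplicative hash that is perfect (no
// collisions) over a known, fixed station list.
class PerfectHashSearch {
    static int hash(byte[] name, int mult, int mask) {
        int h = 0;
        for (byte b : name) h = h * mult + b;
        return h & mask;
    }

    // True if 'mult' maps every station to a distinct bucket under 'mask'.
    static boolean collisionFree(String[] stations, int mult, int mask) {
        Set<Integer> seen = new HashSet<>();
        for (String s : stations) {
            if (!seen.add(hash(s.getBytes(StandardCharsets.UTF_8), mult, mask)))
                return false;
        }
        return true;
    }
}
```

A submission would bake the winning multiplier in and keep the full name comparison only as a rarely-taken fallback.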
solarized over 1 year ago
Just for fun: speed-testing awk vs Java.

```awk
awk -F';' '
{
    station = $1
    temperature = $2 + 0
    sum[station] += temperature
    count[station]++
    if (count[station] == 1 || temperature < min[station]) min[station] = temperature
    if (count[station] == 1 || temperature > max[station]) max[station] = temperature
}
END {
    for (s in sum) {
        mean = sum[s] / count[s]
        printf "%s=%.1f/%.1f/%.1f\n", s, min[s], mean, max[s]
    }
}' measurement.txt
```
worthless-trash over 1 year ago
Interesting challenge; shame it's only Java. Can't wait till people start hand-rolling their own JVM bytecode.
vessenes over 1 year ago
Ooh fun, an Advent of Code chaser!

A fair comparison between languages should include the make and build times. I haven't used Java / Maven for years, and I'm reminded why, heading into minute 2 of downloads for './mvnw clean verify'.
kebsup over 1 year ago
At the Czech Technical University C course, we had a very similar assignment. The submissions of all the course students were continuously evaluated on a leaderboard, and many spent tens of hours optimizing to get the extra points for a better grade (but mostly status points).
divbzero over 1 year ago
I suspect Java is not the fastest language for this. I’d love to see unofficial contenders tackle this challenge using different languages.
pbh101 over 1 year ago
> Each contender will be run five times in a row. The slowest and the fastest runs are discarded. The mean value of the remaining three runs is the result for that contender and will be added to the leaderboard.

Shouldn't he take the single fastest time, assuming the file (and JDK) being in file cache is controlled for?
unobatbayar over 1 year ago
This must be Decodale's engineering problem disguised as a challenge.
dvko over 1 year ago
Very fun challenge that nerd-sniped me right away. Had to do a C version in standard C99 with POSIX threads. It [1] clocks in at just under 4 seconds on my AMD Ryzen 4800U laptop CPU. It should run about 10-20% faster than that on the mentioned Hetzner hardware.

- Since we only do one decimal of floating-point precision, it uses integer math right from the get-go.

- FNV-1a hash with linear probing and a load factor well under 0.5.

- The data file is mmap'd into memory.

- Data is processed in 8 totally separate chunks (no concurrent data structures) and then those aggregations are in turn aggregated when all threads have finished.

[1] https://github.com/dannyvankooten/1brc
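The "integer math right from the get-go" trick translates directly to Java as well: a one-decimal reading like "-12.3" parses straight into tenths (-123), so all aggregation stays in integer arithmetic until the final formatting step. A sketch under the challenge's input-format assumptions (the byte offsets in the test are hypothetical):

```java
// Parses a one-decimal temperature like "-12.3" from a byte range into an
// int number of tenths (-123). No floats, no String allocation.
class TenthsParser {
    static int parseTenths(byte[] line, int from, int to) {
        int sign = 1, value = 0;
        for (int i = from; i < to; i++) {
            byte b = line[i];
            if (b == '-') sign = -1;
            else if (b != '.') value = value * 10 + (b - '0');
        }
        return sign * value;
    }
}
```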
countrymile over 1 year ago
Comparing solutions from the Rstats world: https://twitter.com/RandVegan/status/1742721195781267650
spullara over 1 year ago
This has been super fun. My current PR [1] is another 15% faster than my entry in the README. Sadly, I haven't been able to make any progress on using SIMD to accelerate any part of it.

I think the issues with hashing could be easily covered by having the 500 city names in the test data also be randomly generated at test time. There is no way to ensure there aren't hash collisions without doing a complete comparison between the names.

[1] https://github.com/gunnarmorling/1brc/pull/56
brrrrrm over 1 year ago
looking at the sample code I’m quite glad I’ve never really touched Java. What a clunky language
jimmyed over 1 year ago
I think the optimal strategy would be to use the "reduce" step in MapReduce. Have threads that read portions of the file and add data to a "list", one for each unique name. Then, this set of threads can "process" these lists. I don't think we need to sort, that'd be too expensive; a linear pass would be good. I can't see how we can do SIMD, since we want max/min, which mandates a linear pass anyway.
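The shape jimmyed describes — per-thread partial aggregates merged once at the end — can be sketched like this in Java (file slicing and parsing are elided; the names are illustrative):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Each thread owns one Map<String, Agg> with no sharing; partial maps are
// merged in a single-threaded reduce step at the end.
class Agg {
    int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE;
    long sum, count;

    void add(int tenths) {
        min = Math.min(min, tenths);
        max = Math.max(max, tenths);
        sum += tenths;
        count++;
    }

    void merge(Agg other) {
        min = Math.min(min, other.min);
        max = Math.max(max, other.max);
        sum += other.sum;
        count += other.count;
    }

    static Map<String, Agg> mergeAll(List<Map<String, Agg>> partials) {
        Map<String, Agg> result = new HashMap<>();
        for (Map<String, Agg> partial : partials)
            partial.forEach((station, agg) ->
                result.computeIfAbsent(station, k -> new Agg()).merge(agg));
        return result;
    }
}
```

Because min, max, sum and count all merge associatively, the reduce step is cheap no matter how the file is split.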
wokwokwok over 1 year ago
I suppose it defeats the spirit of the game to, knowing your worst run is discarded, calculate the results on the first run by whatever slow method you want, save them somewhere useful, and just read the results and print them out on the following runs?

Or at the very least, convert the input into a more convenient binary format for the following runs.
Ellipsis753 over 1 year ago
Interestingly, because your program runs on Linux and is run 5 times, Linux will almost certainly cache the 12 GB file to RAM on the first invocation.

This means that future invocations don't have to load the file from disk. This also makes it pretty critical that your program doesn't use more than 16 GB of RAM itself (out of the server's 32 GB), or it'll push the file out of cache, making future invocations of your program slower.
weinzierl over 1 year ago
Unfortunately I have no time to code this up, but here are things I would try to make it fast:

- Read with O_DIRECT (but compare it with the mmap approach). I know O_DIRECT gets much hate, but this could be one of the rare cases where it helps.

- Use a simple array with a sentinel and linear search for the station names. This is dumb, but if the number of stations is small enough, this could beat the hash. (In the back of my head I have a rule of thumb that linear search is faster for up to 1000 elements, but I'm not sure anymore where I got this from.)
corlinpalmer over 1 year ago
Here's my implementation in Go, which runs in under 5 seconds. It doesn't use anything too obscure, only the built-in maps and no external libraries. It's also the fastest solution I've tested on my M3 Pro Mac. Eager to see what beats it!

https://gist.github.com/corlinp/176a97c58099bca36bcd5679e68f9708
lessbergstein over 1 year ago
I don't understand; it should be pretty easy. A rolling average with BigDecimal would probably be sufficient, but a scientific lib might be better for a rolling average of more than a hundred million numbers.

https://stackoverflow.com/questions/277309/java-floating-point-high-precision-library
drusepth over 1 year ago
This would make for a really fun challenge in SQL, too.
AUnterrainer over 1 year ago
My one-liner solution runs in 76 ms on my 10-year-old Mac.

```q
q)\ts exec (min;max;avg)@\:measurement by city from flip `city`measurement!("SF";";") 0: `:weather_stations.csv
76 8720688
```
PeterCorless over 1 year ago
1 Billion Row Challenge with Apache Pinot: https://hubertdulay.substack.com/p/1-billion-row-challenge-in-apache

852 ms on an M1 Max.
jmclnx over 1 year ago
Is there a way to create this file without having Java?

This is for use on a BSD where Java may not work too well.

Thanks
brador over 1 year ago
Would offering a cash prize increase or decrease the quality of submissions?
DeathArrow over 1 year ago
This can turn into a nice benchmark if done in multiple languages.
atbpaca over 1 year ago
Would have been nice to accept other JVM languages like Scala, Clojure, Kotlin, etc., and compare against Java, which I expect to have slightly better performance.
hknmtt over 1 year ago
First search result for averaging streaming data:

https://nestedsoftware.com/2018/03/20/calculating-a-moving-average-on-streaming-data-5a7k.22879.html

So you just walk the file, read a chunk, update the averages and move on. The resource usage should be next to nothing, and speed should be limited by your disk I/O.
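The incremental-mean idea from the linked article, restated as a small Java sketch (nothing more than the update mean += (x - mean) / n, which needs no stored history):

```java
// Running mean over a stream: one double and one counter, updated per
// reading. This is why the commenter expects near-zero memory use.
class StreamingMean {
    private double mean;
    private long n;

    void add(double x) {
        n++;
        mean += (x - mean) / n; // incremental update, no stored history
    }

    double mean() { return mean; }
}
```

Note the 1BRC task also needs min and max, but those are likewise a single comparison per reading.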
pronoiac over 1 year ago
I've idly been considering how to score this with more languages, with far less memory. I'm thinking: buildpacks on a Raspberry Pi.
Xophmeister over 1 year ago
This is an I/O-bound problem. Trying to "go faster" by using multiple threads or vectorisation isn't going to achieve much.
BeefWellington over 1 year ago
I wonder why the choice to go with a shared compute for measuring this, given the variability that will introduce.
andersrs over 1 year ago
Kind of a shame that median isn't included to make it tougher.
mgoetzke over 1 year ago
Anyone up to also do this in Rust?
justinl33 over 1 year ago
but… can I use pandas?
scumola over 1 year ago
This is just a map/reduce problem. Use Hadoop. It's Java, isn't it?
wakasaka over 1 year ago
How to make someone else do your homework for free.
gigatexal over 1 year ago
Single line solve using clickhouse-local or duckdb.
daftego over 1 year ago
Any elixir devs in here find this funny?
sixlelyt321 over 1 year ago
You had 32 GB, more than the expected 16 GB of RAM, but for Java I doubt that even 32 GB is enough. Java was absolutely good in the old days, but nowadays there are lots of better options than that poor guy.