Vaguely reminds me of a bug I thought I had in a Python sorting routine (sorting a set of database entries by date/time) - all of the tests passed until I removed some debug print statements, after which it ostensibly stopped sorting and just returned the original order.

I eventually realised that the routine was fine, but my test data was being generated quickly enough that time.time() (IIRC) returned identical values for all of the dummy records (with the print statements, there was just enough of a delay for there to be a few milliseconds between each one).
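A minimal sketch of that failure mode (all names hypothetical; truncating to whole seconds stands in for a clock whose resolution is coarser than the generation loop):

    import time

    # Dummy "database entries", generated back-to-back. int() stands in for a
    # coarse clock: every record ends up with the same timestamp.
    records = [{"id": i, "created": int(time.time())} for i in range(5)]

    # With an all-equal sort key, Python's stable sort just preserves the
    # insertion order -- exactly the "it stopped sorting" symptom.
    assert sorted(records, key=lambda r: r["created"]) == records

    # A print (or any small delay) between iterations lets the clock tick,
    # the keys become distinct, and the sort appears to "work" again.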
Very nice!

I've been consistently impressed with Databricks' approachable blog. This particular post spawned a nice discussion around database design with my son, who has taken a lot of recent interest in all things technological.
Keep up the good work.
Interesting read! I have a service that was written in SAS, and we've been translating it to run in Spark, but one of the killer issues we identified was latency: without any indication of what was holding up the computation, there would be increasing pauses of a few seconds, sometimes reaching nearly a minute, during execution. This is on a single machine, and during those pauses we wouldn't see any resource utilisation. No disk writes, CPU nearly at 0.00, etc.

I keep coming back with every new Spark version to see if the problem has gone away (we wrote it at 2.0.0, so I mean every minor and patch release). I looked up what I could online about optimisation in Spark and applied that.

The business people got tired of us wasting time trying to optimise, and forced us down the path of SAP HANA and other proprietary marketing hoohah because we need a product that's real-time.

I hope the upcoming version of Spark at least helps reduce latency, perhaps through improvements in whole-stage code generation.
Those are some insane performance numbers. At first I thought memory bandwidth would be the limit, with 2-4 TB of data having to move through memory. Then I saw it was a cross join of 1M numbers x 1M numbers. One million 4-byte ints is just 4 MB, which fits comfortably in the CPU caches. And the output of the 1M x 1M cross join is thrown away; just a counter is incremented, so the 1 trillion result rows are never pushed through memory at all.

A cross join is just two nested loops, iterating one array over the other. With 40 cores, each core handles 25 billion of the 1 trillion iterations. Assuming each iteration takes 10 CPU cycles, a 6 GHz core can handle 600M iterations/second, so 25B / 600M ≈ 41 seconds to run the whole thing.

Yes, 1 second is too fast.

Awesome that they figured out it was the JVM optimizing away the side-effect-free computation.
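For what it's worth, the back-of-envelope maths checks out (same assumptions as above: 40 cores, ~10 cycles per iteration, an optimistic 6 GHz clock):

    rows = 1_000_000
    total_iterations = rows * rows              # 1e12 pairs in the cross join
    cores = 40
    cycles_per_iteration = 10                   # rough guess, as above
    clock_hz = 6_000_000_000                    # 6 GHz

    per_core = total_iterations / cores                       # 2.5e10 iterations
    iterations_per_second = clock_hz / cycles_per_iteration   # 6e8 per second
    print(per_core / iterations_per_second)                   # ~41.7 seconds

If I'm reading the setup right, the whole benchmark is roughly spark.range(10**6).crossJoin(spark.range(10**6)).count() in PySpark, for anyone who wants to try reproducing the ~1 second result themselves.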
Interesting. Most of the traditional RDBMS implementations have some kind of delay functionality. MySQL has SLEEP, Postgres has pg_sleep, MSSQL has WAITFOR, and so on. I guess Spark doesn't have one.

These are also handy to test for SQL injection issues without screwing anything up.
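Quick sketch of how that timing trick works (pure Python; the cursor and dialect names are hypothetical, but the three statements are the real built-ins):

    import time

    SLEEP_STATEMENTS = {
        "mysql":      "SELECT SLEEP(2)",
        "postgresql": "SELECT pg_sleep(2)",
        "sqlserver":  "WAITFOR DELAY '00:00:02'",
    }

    def timed_probe(cursor, dialect):
        # If a value you control can smuggle the statement into a query, the
        # ~2 s delay shows up in the response time -- evidence the injected
        # SQL ran, without modifying any data.
        start = time.monotonic()
        cursor.execute(SLEEP_STATEMENTS[dialect])
        return time.monotonic() - start

    # e.g. with a psycopg2 connection (assumed):
    #   timed_probe(conn.cursor(), "postgresql")   # ~2.0 seconds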