Debugging a failing test case caused by query running “too fast”

42 points by rxin over 8 years ago

6 comments

stordoff over 8 years ago
Vaguely reminds me of a bug I thought I had in a Python sorting routine (sorting a set of database entries by date/time) - all of the tests passed until I removed some debug print statements, after which it ostensibly stopped sorting and just returned the original order.

I eventually realised that the routine was fine, but my test data was being generated quickly enough that time.time() (IIRC) returned identical values for all of the dummy records (with the print statements, there was just enough of a delay for there to be a few milliseconds between each one).
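
For readers who haven't hit this one: a minimal Python sketch of that failure mode. The record layout is made up, and whether the timestamps actually collide depends on your platform's clock resolution (coarse clocks, classically Windows with ~15 ms ticks, reproduce it readily):

```python
import random
import time

# Hypothetical records created in a tight loop. On a coarse clock,
# time.time() returns the same value for many consecutive calls, so
# every "created" timestamp can collide.
records = [{"id": i, "created": time.time()} for i in range(1000)]
print("distinct timestamps:", len({r["created"] for r in records}))

# Python's sort is stable: when every key compares equal, the input
# order is preserved. Sorting shuffled data then returns the shuffled
# order untouched, which looks exactly like "the sort stopped working".
shuffled = random.sample(records, len(records))
by_time = sorted(shuffled, key=lambda r: r["created"])
print("sort had an effect:", by_time != shuffled)
```

Any delay between record creations (a debug print, a sleep) spaces the timestamps apart and the sort visibly works again, which is why removing the print statements "broke" it.
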
tomrod over 8 years ago
Very nice!

I've been consistently impressed with Databricks' approachable blog. This particular post spawned a nice discussion around database design with my son, who has taken a lot of recent interest in all things technological. Keep up the good work.
nevi-me over 8 years ago
Interesting read! I have a service that was in SAS, and we've been translating it to run in Spark, but one of the killer issues we identified was latency: without understanding what held up the computation, there would be increasing pauses of a few seconds, sometimes reaching nearly a minute, in execution. This is on a single machine, and at that time we wouldn't notice any resource utilisation. No disk writes, CPU nearly at 0.00, etc.

I keep coming back with every new Spark version to see if the problem has gone away (wrote it at 2.0.0, so I mean every minor and patch). I looked up what I could online about optimisation in Spark, and applied that.

The business people got tired of us wasting time trying to optimise, and forced us down the lines of SAP HANA and other proprietary marketing hoohah because we need a product that's real-time.

I hope the upcoming version of Spark at least helps reduce latency, perhaps through improvements in the whole-stage code-gen.
ww520 over 8 years ago
That is some insane performance. I thought memory bandwidth was a limitation, to move 2~4TB of data through memory. Then I saw it was a cross join of 1M numbers x 1M numbers. 1 million 4-byte ints is just 4MB, which can fit comfortably in the L1/L2/L3 caches. And the output of the 1M x 1M cross join is thrown away; just a counter is incremented. So no 1T results were pushed through memory.

A cross join is just two nested loops iterating one array over another. With 40 cores, each handles 25 billion of the 1 trillion iterations. Assuming each iteration takes 10 CPU cycles, a 6 GHz core can handle 600M iterations/second. 25B / 600M = 41 seconds to run the whole thing.

Yes, 1 second is too fast.

Awesome that they figured out it was the JVM that optimized out the no-side-effect computation.
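
Spelling that arithmetic out as a quick sanity check (the 10 cycles per iteration and the 6 GHz clock are the comment's own assumptions, not measurements):

```python
# Back-of-envelope check of the numbers in the comment above.
rows = 1_000_000                    # 1M ints on each side of the cross join
side_bytes = rows * 4               # 4-byte ints -> ~4 MB, cache-resident
iterations = rows * rows            # 1e12 row pairs; the output is discarded
cores = 40
per_core = iterations // cores      # 25 billion iterations per core

cycles_per_iter = 10                # assumed cost of one loop iteration
clock_hz = 6e9                      # assumed 6 GHz clock
iters_per_sec = clock_hz / cycles_per_iter   # 600M iterations/second

print(f"data per side: {side_bytes / 2**20:.1f} MiB")         # ~3.8 MiB
print(f"expected runtime: {per_core / iters_per_sec:.1f} s")  # ~41.7 s
```

With any realistic honest execution taking tens of seconds, a 1-second result could only mean the work was never actually done, which is how dead-code elimination by the JVM gave itself away.
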
tyingq over 8 years ago
Interesting. Most of the traditional RDBMS implementations have some kind of delay functionality. SLEEP is in MySQL, Postgres has pg_sleep, MSSQL has WAITFOR, and so on. I guess Spark doesn't have one.

These are also handy to test for SQL injection issues without screwing anything up.
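
A minimal sketch of the timing trick the comment alludes to, here against Postgres via psycopg2 (the connection string is hypothetical; any DB-API driver works the same way). A delay function lets you confirm behaviour purely through response time, without modifying any data:

```python
import time
import psycopg2  # assumed Postgres driver; any DB-API client works similarly

# Hypothetical connection details -- adjust for a real test database.
conn = psycopg2.connect("dbname=test user=test")
cur = conn.cursor()

def timed(sql: str) -> float:
    """Execute a statement and return its wall-clock duration in seconds."""
    start = time.monotonic()
    cur.execute(sql)
    cur.fetchall()
    return time.monotonic() - start

# The built-in delay functions the comment lists:
#   MySQL:    SELECT SLEEP(3)
#   Postgres: SELECT pg_sleep(3)
#   MSSQL:    WAITFOR DELAY '00:00:03'
print(timed("SELECT pg_sleep(3)"))  # ~3.0 seconds, and nothing is modified

# The injection-testing angle: if user input is concatenated into SQL
# unescaped, a payload that smuggles in one of these delays stretches the
# response by a predictable amount -- detectable without altering data.
```
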
AtlasLion over 8 years ago
Awesome seeing Ala and Bogdan already producing great stuff at Databricks.