TechEcho

Accidentally exponential behavior in Spark

75 points by drob almost 4 years ago

6 comments

d_burfoot almost 4 years ago
There are two lessons you could learn from this episode:

1. Use shallow trees and the clever workaround presented in the article.

2. Don't use Spark for tasks that require complex logic.

People should trace out the line of reasoning that leads them to use tools like Spark. It is convoluted and contingent - it goes back to work done at Google in the early 2000s, when the key to getting good price/performance was using a large number of commodity machines. Because they were cheap, these machines would break often, so you needed some really smart fault tolerance technology like Hadoop/HDFS, which was followed by Spark.

The current era is completely different. Now the key to good price/performance is to light up machines on demand and then shut them down, only paying for what you use - and perhaps using the spot market. You don't need to worry about storage - that's taken care of by the cloud provider - and you can't "bring the computation to the data" like in the old days, which removes one of the big advantages of Hadoop/HDFS. Because these jobs are doing mostly IO and networking, and because computers are just more resilient nowadays, jobs rarely fail because of hardware errors. So almost the entire rationale that led to Hadoop/HDFS/Spark is gone. But people still use Spark - and put up with "accidentally exponential behavior" - because the tech industry is so dominated by groupthink and marketing dollars.
gopalv almost 4 years ago
I've hit almost the exact same issue with Hive, with a somewhat temporary workaround (like this post): read the expression into a list [1] and rebuild a balanced binary tree out of it.

But we ended up implementing a single-level Multi-AND [2], so that this is no longer a tree for plain AND expressions and can be vectorized more neatly than the nested structure with a function call for each level (this looks more like a tail call than a recursive function).

The ORC CNF conversion has a similar massively exponential step inside, which is protected by a check for 256 items or fewer [3].

[1] - https://github.com/t3rmin4t0r/captain-hook/blob/master/src/main/java/org/notmysock/hive/hooks/AndOrRewriteHook.java#L155

[2] - https://issues.apache.org/jira/browse/HIVE-11398

[3] - https://github.com/apache/hive/blob/master/storage-api/src/java/org/apache/hadoop/hive/ql/io/sarg/SearchArgumentImpl.java#L288
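
The flatten-and-rebalance workaround described in this comment can be sketched roughly as follows. This is only an illustration: `Expr`, `Leaf`, `And`, and the `BalanceAnd` object are hypothetical names made up for the example, not classes from Hive, Spark, or the linked hook.

```scala
// Hypothetical expression type for illustration; not Hive's or Spark's classes.
sealed trait Expr
final case class Leaf(name: String) extends Expr
final case class And(left: Expr, right: Expr) extends Expr

object BalanceAnd {
  // Collect the operands of a nested AND chain in left-to-right order.
  def flatten(e: Expr): Vector[Expr] = e match {
    case And(l, r) => flatten(l) ++ flatten(r)
    case leaf      => Vector(leaf)
  }

  // Rebuild a balanced binary AND tree by splitting the operand list in half.
  def balance(operands: Vector[Expr]): Expr = {
    require(operands.nonEmpty, "need at least one operand")
    if (operands.size == 1) operands.head
    else {
      val (l, r) = operands.splitAt(operands.size / 2)
      And(balance(l), balance(r))
    }
  }

  // Number of AND nodes on the longest root-to-leaf path.
  def depth(e: Expr): Int = e match {
    case And(l, r) => 1 + math.max(depth(l), depth(r))
    case _         => 0
  }

  // Build a left-deep chain: And(And(And(c1, c2), c3), ...).
  def leftDeep(leaves: Vector[Expr]): Expr =
    leaves.tail.foldLeft(leaves.head)((acc, leaf) => And(acc, leaf))

  def main(args: Array[String]): Unit = {
    val leaves: Vector[Expr] = (1 to 64).map(i => Leaf(s"c$i")).toVector
    val chain = leftDeep(leaves)
    val balanced = balance(flatten(chain))
    println(depth(chain))    // 63
    println(depth(balanced)) // 6
  }
}
```

Because AND is associative, the rebalanced tree keeps the same operands in the same order while reducing the depth from linear to logarithmic in the number of conditions. Any cost that grows exponentially with depth shrinks accordingly. A single-level n-ary AND node, as in the Multi-AND approach mentioned above, removes the nesting altogether.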
mlyle almost 4 years ago
Why not just...

    val transformedLeftTemp = transform(tree.left)
    val transformedLeft = if (transformedLeftTemp.isDefined) {
      transformedLeftTemp
    } else None
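
To make the difference concrete, here is a minimal sketch that counts calls when a child is transformed twice (once to check it, once to use it) versus once with the result stored, as the snippet above suggests. Everything in it (`Node`, `Leaf`, `Branch`, `TransformCost`, and both transform functions) is a hypothetical stand-in, not Spark's actual code. The real code may have constraints, such as mutable state, that make this simple caching harder, and the sketch ignores them.

```scala
// Hypothetical tree and transforms, for illustration only; not Spark's code.
sealed trait Node
case object Leaf extends Node
final case class Branch(left: Node, right: Node) extends Node

object TransformCost {
  var calls = 0L // counts invocations of either transform

  // Checks each child first, then transforms it again to build the result.
  def transformTwice(n: Node): Option[Int] = {
    calls += 1
    n match {
      case Leaf => Some(1)
      case Branch(l, r) =>
        if (transformTwice(l).isDefined && transformTwice(r).isDefined)
          for (a <- transformTwice(l); b <- transformTwice(r)) yield a + b
        else None
    }
  }

  // Transforms each child once and reuses the stored result.
  def transformOnce(n: Node): Option[Int] = {
    calls += 1
    n match {
      case Leaf => Some(1)
      case Branch(l, r) =>
        val left = transformOnce(l)
        val right = transformOnce(r)
        for (a <- left; b <- right) yield a + b
    }
  }

  // Left-deep chain with `depth` nested branches.
  def chain(depth: Int): Node =
    (1 to depth).foldLeft(Leaf: Node)((acc, _) => Branch(acc, Leaf))

  // Reset the counter, run `f` on `n`, and report how many calls it made.
  def countCalls(f: Node => Option[Int], n: Node): Long = {
    calls = 0
    f(n)
    calls
  }

  def main(args: Array[String]): Unit = {
    val tree = chain(20)
    println(countCalls(transformTwice, tree)) // 4194301
    println(countCalls(transformOnce, tree))  // 41
  }
}
```

In the double-evaluation version, each level of nesting doubles the work done on its left subtree, so the call count grows as roughly 4 * 2^depth. Storing the result keeps it linear in the size of the tree.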
IvanVergiliev almost 4 years ago
Post author here. Let me know if you have any questions!
hobs almost 4 years ago
Good read - FWIW, if this is your blog, some of your links are broken and are being treated as local links, for example: https://heap.io/blog/%E2%80%9Dhttps://github.com/apache/spark/pull/24068%E2%80%9D
dreyfan almost 4 years ago
Spark is this weird ecosystem of people who take absolutely trivial concepts in SQL, bury their heads in the sand and ignore the past 50 years of RDBMS evolution, and then write extremely complicated (or broken) and expensive-to-run code. But whatever it takes to get Databricks to IPO! Afterwards the hype will die down and everyone will collectively abandon it just like MongoDB, except for the unfortunate companies with so much technical debt they can't extricate themselves from it.