There are two lessons you could learn from this episode:

1. Use shallow trees and the clever workaround presented in the article.

2. Don't use Spark for tasks that require complex logic.

People should trace out the line of reasoning that leads them to use tools like Spark. It is convoluted and contingent - it goes back to work done at Google in the early 2000s, when the key to good price/performance was using a large number of commodity machines. Because they were cheap, those machines failed often, so you needed some really smart fault-tolerance technology like Hadoop/HDFS, which was followed by Spark.

The current era is completely different. Now the key to good price/performance is to spin machines up on demand and shut them down when you're done, paying only for what you use - perhaps on the spot market. You don't need to worry about storage - that's handled by the cloud provider - and you can't "bring the computation to the data" like in the old days, which removes one of the big advantages of Hadoop/HDFS. Because jobs spend most of their time on IO and networking, and because machines are simply more reliable now, they rarely fail due to hardware errors. So almost the entire rationale that led to Hadoop/HDFS/Spark is gone. But people still use Spark - and put up with "accidentally exponential behavior" - because the tech industry is so dominated by groupthink and marketing dollars.