Here's the Scala Spark style guide: <a href="https://github.com/MrPowers/spark-style-guide" rel="nofollow">https://github.com/MrPowers/spark-style-guide</a><p>The chispa README also provides a lot of useful info on how to properly write PySpark code: <a href="https://github.com/MrPowers/chispa" rel="nofollow">https://github.com/MrPowers/chispa</a><p>Scala is easier than Python for Spark because it allows functions with multiple argument lists and isn't whitespace sensitive. Both are great & Spark is a lot of fun.<p>Some specific notes:<p>> Doing a select at the beginning of a PySpark transform, or before returning, is considered good practice<p>Manual selects ensure column pruning is performed (column pruning only helps with columnar file formats like Parquet). Spark usually does this automatically, so always selecting manually may not be practical. Explicitly pruning columns is required for pandas and Dask.<p>> Be careful with joins! If you perform a left join, and the right side has multiple matches for a key, that row will be duplicated as many times as there are matches<p>When performing joins, the first thing to think about is whether a broadcast join is possible. Joins on clusters are hard. After that, it's worth thinking about using a data store that allows for predicate and aggregation pushdown.
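To make the broadcast-join point concrete, here's a minimal sketch (the orders/countries tables and their column names are made up for illustration): prune to the needed columns up front with an explicit select, then broadcast the small dimension table so the large side doesn't have to shuffle.<p><pre><code> from pyspark.sql import SparkSession, functions as F

 spark = SparkSession.builder.appName("join-sketch").getOrCreate()

 # Hypothetical data: a large fact table and a small dimension table.
 orders = spark.createDataFrame(
     [(1, "US", 10.0), (2, "DE", 20.0), (3, "US", 5.0)],
     ["order_id", "country_code", "amount"],
 )
 countries = spark.createDataFrame(
     [("US", "United States"), ("DE", "Germany")],
     ["country_code", "country_name"],
 )

 # Explicitly prune to the columns this transform actually needs.
 orders_slim = orders.select("order_id", "country_code", "amount")

 # Mark the small side for broadcast so the join avoids shuffling the large table.
 enriched = orders_slim.join(F.broadcast(countries), on="country_code", how="left")
 enriched.show()
</code></pre>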
There is also a blog post on this: <a href="https://medium.com/palantir/a-pyspark-style-guide-for-real-world-data-scientists-1727fda397e9" rel="nofollow">https://medium.com/palantir/a-pyspark-style-guide-for-real-w...</a>
I worked quite a lot with pandas, dplyr, data.table, and PySpark for a few years, and even occasionally some Scala Spark and SparkR. But after getting a bit fed up with F.lit()-this, F.col()-that, and the umpteenth variation on SQL, nowadays I pretty much just stick with plain SQL. I believe I've found my Enlightenment.
I'm so confused.<p>These examples are all using the SQL-like features of Spark. Not a map() or flatMap() in sight.<p>So... why not just write SQL?<p><pre><code> df.createOrReplaceTempView('some_name')
new_df = spark.sql("""select ... from some_name ...""")
</code></pre>
All of this F.col(...) and .alias(...) and .withColumn(...) nonsense is a million times harder to read than proper SQL. I just don't understand what any of this is intended to accomplish.
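For comparison, here's a minimal sketch of the same aggregation written both ways (the sales table and its columns are made up); both spellings go through the same Catalyst optimizer and produce equivalent plans, so the choice is largely about readability.<p><pre><code> from pyspark.sql import SparkSession, functions as F

 spark = SparkSession.builder.appName("sql-vs-dataframe").getOrCreate()

 # Hypothetical sales data.
 df = spark.createDataFrame(
     [("US", 10.0), ("DE", 20.0), ("US", 5.0)],
     ["country", "amount"],
 )

 # DataFrame API version.
 by_country = df.groupBy("country").agg(F.sum("amount").alias("total_amount"))

 # Equivalent plain-SQL version via a temp view.
 df.createOrReplaceTempView("sales")
 by_country_sql = spark.sql(
     "SELECT country, SUM(amount) AS total_amount FROM sales GROUP BY country"
 )

 by_country.show()
 by_country_sql.show()
</code></pre>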
I think this guide mostly dates from 2017, when Palantir was rolling out Spark and Spark SQL code authoring in their Foundry data platform. It mostly targets their untrained "delta" and "echo" employees, most of whose jobs revolved around writing ETL code for customers. I have no idea why this glorified Quip doc was open-sourced.<p>Looking at the list of contributors on Github, I think I remember that the main author was actually James Thompson (UK), and not anyone on the contributor list. JTUK was called that because the company had another employee, based in the US, who had the same name. James Thompson (US) is now at Facebook and is a pretty cool designer. His astronomer.io media project from 2011 comes up on HN periodically.<p>Of the people listed on Github, Andrew Ash (now at Google) is the original evangelist for Spark on Palantir Foundry, and Francisco is the PM for SkyWise, Palantir's lucrative but ill-fated effort to save the Airbus A380.
Spark is a cancer. Sooner or later, 99.9% of the people using Spark will wake up to the fact that "hey, I've got 1TB of RAM, why do I need this?"<p>Spark and PySpark are just a PITA to the max.