TechEcho

PySpark Style Guide

62 points by fmsf over 4 years ago

9 comments

MrPowers over 4 years ago
Here's the Scala Spark style guide: https://github.com/MrPowers/spark-style-guide

The chispa README also provides a lot of useful info on how to properly write PySpark code: https://github.com/MrPowers/chispa

Scala is easier than Python for Spark because it allows functions with multiple argument lists and isn't whitespace sensitive. Both are great & Spark is a lot of fun.

Some specific notes:

> Doing a select at the beginning of a PySpark transform, or before returning, is considered good practice

Manual selects ensure column pruning is performed (column pruning only works for columnar file formats like Parquet). Spark does this automatically, and always selecting manually may not be practical. Explicitly pruning columns is required for Pandas and Dask.

> Be careful with joins! If you perform a left join, and the right side has multiple matches for a key, that row will be duplicated as many times as there are matches

When performing joins, the first thing to think about is whether a broadcast join is possible. Joins on clusters are hard. Then it's good to think about using a data store that allows for predicate pushdown aggregations.
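The left-join warning quoted above can be reproduced without a cluster. What follows is a minimal sketch, not Spark code: it assumes Python's built-in `sqlite3` as a stand-in, on the grounds that a left join behaves the same way in standard SQL. The `orders` and `contacts` tables and their contents are hypothetical, invented only for illustration.

```python
import sqlite3

# Hypothetical tables, invented for illustration only.
# SQLite stands in for Spark here; this is an assumption, not Spark code.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    CREATE TABLE contacts (customer_id INTEGER, email TEXT);
    INSERT INTO orders VALUES (1, 10), (2, 20);
    -- customer 10 has two contact rows, customer 20 has none
    INSERT INTO contacts VALUES (10, 'a@example.com'), (10, 'b@example.com');
""")

# Left join: every order is kept, but an order whose key matches
# several right-side rows appears once per match.
rows = conn.execute("""
    SELECT o.order_id, c.email
    FROM orders AS o
    LEFT JOIN contacts AS c ON o.customer_id = c.customer_id
    ORDER BY o.order_id, c.email
""").fetchall()

left_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

print(left_count, len(rows))  # 2 input rows, 3 joined rows
print(rows)
```

Order 1 comes back twice because its customer has two contact rows, while order 2 comes back once with a NULL email. Comparing the row count before and after the join, as done here, is one simple way to notice this kind of fan-out.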
speedgoose over 4 years ago
I wonder whether I should read about best practices from Palantir.
fmsf over 4 years ago
There is also a blog post on this: https://medium.com/palantir/a-pyspark-style-guide-for-real-world-data-scientists-1727fda397e9
em500 over 4 years ago
I worked quite a lot in pandas, dplyr, data.table and pyspark for a few years. And even occasionally some scala spark and sparkR. But after getting a bit fed up with F.lit()-this, F.col()-that, and the umpteenth variation on SQL, nowadays I pretty much just stick with plain SQL. I believe I've found my Enlightenment.
slotrans over 4 years ago
I'm so confused.

These examples are all using the SQL-like features of Spark. Not a map() or flatMap() in sight.

So... why not just write SQL?

```python
df.registerTempTable('some_name')
new_df = spark.sql("""select ... from some_name ...""")
```

All of this F.col(...) and .alias(...) and .withColumn(...) nonsense is a million times harder to read than proper SQL. I just don't understand what any of this is intended to accomplish.
gostsamo over 4 years ago
> The preferred option is more complicated, longer, and polluted - and correct.

This is the definition of bad design.
legerdemain over 4 years ago
I think this guide mostly dates from 2017, when Palantir was rolling out Spark and Spark SQL code authoring in their Foundry data platform. It mostly targets their untrained "delta" and "echo" employees, most of whose jobs revolved around writing ETL code for customers. I have no idea why this glorified Quip doc was open-sourced.

Looking at the list of contributors on GitHub, I think I remember that the main author was actually James Thompson (UK), and not anyone on the contributor list. JTUK was called that because the company had another employee, based in the US, who had the same name. James Thompson (US) is now at Facebook and is a pretty cool designer. His astronomer.io media project from 2011 comes up on HN periodically.

Of the people listed on GitHub, Andrew Ash (now at Google) is the original evangelist for Spark on Palantir Foundry, and Francisco is the PM for SkyWise, Palantir's lucrative, but ill-fated effort to save the Airbus A380.
nautilus12 over 4 years ago
Why not just use Scala and frameless? No style guide needed :)
xiaodai over 4 years ago
Spark is a cancer. Sooner or later, 99.9% of the people using Spark will wake up to the fact that "hey, I got 1TB of RAM, why do I need this?"

Spark and PySpark are just PITA to the max.