Here's the Scala Spark style guide: <a href="https://github.com/MrPowers/spark-style-guide" rel="nofollow">https://github.com/MrPowers/spark-style-guide</a><p>The chispa README also provides a lot of useful info on how to properly write PySpark code: <a href="https://github.com/MrPowers/chispa" rel="nofollow">https://github.com/MrPowers/chispa</a><p>Scala is easier than Python for Spark because it allows functions with multiple argument lists and isn't whitespace sensitive. Both are great & Spark is a lot of fun.<p>Some specific notes:<p>> Doing a select at the beginning of a PySpark transform, or before returning, is considered good practice<p>Manual selects ensure column pruning is performed (column pruning only helps with columnar file formats like Parquet). Spark usually does this automatically, so always selecting manually may not be practical. Explicitly pruning columns is required for pandas and Dask.<p>> Be careful with joins! If you perform a left join, and the right side has multiple matches for a key, that row will be duplicated as many times as there are matches<p>When performing joins, the first thing to think about is whether a broadcast join is possible. Joins on clusters are hard. After that, it's worth thinking about using a data store that allows for predicate and aggregation pushdown.
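To make the broadcast-join point concrete, here's a minimal sketch (the orders/countries tables and their column names are made up for illustration): prune to the needed columns up front with an explicit select, then broadcast the small dimension table so the large side doesn't have to shuffle.<p><pre><code> from pyspark.sql import SparkSession, functions as F

 spark = SparkSession.builder.appName("join-sketch").getOrCreate()

 # Hypothetical data: a large fact table and a small dimension table.
 orders = spark.createDataFrame(
     [(1, "US", 10.0), (2, "DE", 20.0), (3, "US", 5.0)],
     ["order_id", "country_code", "amount"],
 )
 countries = spark.createDataFrame(
     [("US", "United States"), ("DE", "Germany")],
     ["country_code", "country_name"],
 )

 # Explicitly prune to the columns this transform actually needs.
 orders_slim = orders.select("order_id", "country_code", "amount")

 # Mark the small side for broadcast so the join avoids shuffling the large table.
 enriched = orders_slim.join(F.broadcast(countries), on="country_code", how="left")
 enriched.show()
</code></pre>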
There is also a blog post on this: <a href="https://medium.com/palantir/a-pyspark-style-guide-for-real-world-data-scientists-1727fda397e9" rel="nofollow">https://medium.com/palantir/a-pyspark-style-guide-for-real-w...</a>
I worked quite a lot with pandas, dplyr, data.table, and PySpark for a few years, and even occasionally some Scala Spark and SparkR. But after getting a bit fed up with F.lit()-this, F.col()-that, and the umpteenth variation on SQL, nowadays I pretty much just stick with plain SQL. I believe I've found my Enlightenment.
I'm so confused.<p>These examples are all using the SQL-like features of Spark. Not a map() or flatMap() in sight.<p>So... why not just write SQL?<p><pre><code> df.createOrReplaceTempView('some_name')
new_df = spark.sql("""select ... from some_name ...""")
</code></pre>
All of this F.col(...) and .alias(...) and .withColumn(...) nonsense is a million times harder to read than proper SQL. I just don't understand what any of this is intended to accomplish.
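For comparison, here's a minimal sketch of the same aggregation written both ways (the sales table and its columns are made up); both spellings go through the same Catalyst optimizer and produce equivalent plans, so the choice is largely about readability.<p><pre><code> from pyspark.sql import SparkSession, functions as F

 spark = SparkSession.builder.appName("sql-vs-dataframe").getOrCreate()

 # Hypothetical sales data.
 df = spark.createDataFrame(
     [("US", 10.0), ("DE", 20.0), ("US", 5.0)],
     ["country", "amount"],
 )

 # DataFrame API version.
 by_country = df.groupBy("country").agg(F.sum("amount").alias("total_amount"))

 # Equivalent plain-SQL version via a temp view.
 df.createOrReplaceTempView("sales")
 by_country_sql = spark.sql(
     "SELECT country, SUM(amount) AS total_amount FROM sales GROUP BY country"
 )

 by_country.show()
 by_country_sql.show()
</code></pre>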
I think this guide mostly dates from 2017, when Palantir was rolling out Spark and Spark SQL code authoring in their Foundry data platform. It mostly targets their untrained "delta" and "echo" employees, most of whose jobs revolved around writing ETL code for customers. I have no idea why this glorified Quip doc was open-sourced.<p>Looking at the list of contributors on Github, I think I remember that the main author was actually James Thompson (UK), and not anyone on the contributor list. JTUK was called that because the company had another employee, based in the US, who had the same name. James Thompson (US) is now at Facebook and is a pretty cool designer. His astronomer.io media project from 2011 comes up on HN periodically.<p>Of the people listed on Github, Andrew Ash (now at Google) is the original evangelist for Spark on Palantir Foundry, and Francisco is the PM for SkyWise, Palantir's lucrative but ill-fated effort to save the Airbus A380.
Spark is a cancer. Sooner or later, 99.9% of the people using Spark will wake up to the fact that "hey, I've got 1TB of RAM, why do I need this?"<p>Spark and PySpark are just a PITA to the max.