5 comments

natekupp, almost 8 years ago

We're also doing this at Thumbtack. We run all of our Spark jobs in job-scoped Cloud Dataproc clusters. We wrote a custom Airflow operator which launches a cluster, schedules a job on that cluster, and shuts down the cluster upon job completion. Since Google can bring up Spark clusters in under 90 seconds and bills by the minute, this works really well for us, simplifying our infrastructure and eliminating resource contention issues.
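
The launch, run, and tear-down lifecycle described in this comment can be sketched roughly as follows. This is not Thumbtack's operator and it does not use the Airflow or Dataproc APIs: `FakeClusterClient`, `job_scoped_cluster`, and `run_job` are hypothetical names, and the client is an in-memory stand-in, assumed only to show that the cluster is deleted whether the job succeeds or fails.

```python
from contextlib import contextmanager


class FakeClusterClient:
    """In-memory stand-in for a real cluster API (hypothetical, for illustration)."""

    def __init__(self):
        self.active = set()  # names of clusters currently "running"
        self.log = []        # ordered record of (operation, cluster name)

    def create_cluster(self, name):
        self.active.add(name)
        self.log.append(("create", name))

    def submit_job(self, name, job):
        if name not in self.active:
            raise RuntimeError(f"cluster {name} is not running")
        self.log.append(("submit", name))
        return job()  # the job runs synchronously in this sketch

    def delete_cluster(self, name):
        self.active.discard(name)
        self.log.append(("delete", name))


@contextmanager
def job_scoped_cluster(client, name):
    """Create a cluster for a single job and always tear it down afterwards."""
    client.create_cluster(name)
    try:
        yield name
    finally:
        # Runs on success and on failure, so no cluster is left running.
        client.delete_cluster(name)


def run_job(client, name, job):
    """Run one job on its own short-lived cluster and return the job's result."""
    with job_scoped_cluster(client, name) as cluster:
        return client.submit_job(cluster, job)


if __name__ == "__main__":
    client = FakeClusterClient()
    print(run_job(client, "example-job-cluster", lambda: "done"))
    print(client.log)
```

A real operator would replace the fake client with calls to the cloud provider and would likely poll for job completion instead of running the job inline; the `try`/`finally` shape is the part that carries over.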

gfodor, almost 8 years ago

Hah, we were doing this with EMR 6 years ago, I guess we were a little early :)

https://www.youtube.com/watch?v=NF6zwHlbh_I

We built a coordinator that would spin up specific categories of machines for each stage (some stages were MR jobs, some were Hadoop streaming jobs) -- for example, when doing in-memory work it was useful to have fewer nodes with more RAM, etc.

matt_wulfeck, almost 8 years ago

Interesting idea. I can see it being worthwhile with per-minute billing, which is not something supported by AWS. Also, EMR charges a per-instance premium; does anyone know if Google Cloud does something similar?

I'm curious how shuffle data is handled. Does the cluster intelligently scale down and move the shuffle data, or will the entire thing keep running while waiting for a single skewed reducer to finish? Or does the entire thing run on a single instance?

rmnoon, almost 8 years ago

What about the traditional IO / data locality win of having your processing colocated with your DFS? Is GCS bandwidth that amazing?

cutler, almost 8 years ago

Can anyone tell me how, as a sole developer, it's possible to gain real-world experience with distributed Hadoop and Spark given the massive computing resources required? It just seems like a closed shop to me.

Job-Scoped Hadoop Clusters with Google Cloud
