Job-Scoped Hadoop Clusters with Google Cloud

65 points by vgt, almost 8 years ago

5 comments

natekupp, almost 8 years ago
We're also doing this at Thumbtack. We run all of our Spark jobs in job-scoped Cloud Dataproc clusters. We wrote a custom Airflow operator which launches a cluster, schedules a job on that cluster, and shuts down the cluster upon job completion. Since Google can bring up Spark clusters in < 90s and bills minutely, this works really well for us, simplifying our infrastructure and eliminating resource contention issues.
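
A minimal sketch of the launch, run, and tear-down lifecycle described above. `FakeClusterClient`, `job_scoped_cluster`, and `run_job` are hypothetical names invented for illustration; this is neither Thumbtack's Airflow operator nor the real Cloud Dataproc client. It also assumes the cluster should be deleted when the job fails, which the comment does not state.

```python
from contextlib import contextmanager


class FakeClusterClient:
    """In-memory stand-in for a cluster API (hypothetical, not a real SDK)."""

    def __init__(self):
        self.live = set()   # names of clusters that currently exist
        self.log = []       # ordered record of lifecycle calls

    def create_cluster(self, name):
        self.live.add(name)
        self.log.append(("create", name))

    def submit_job(self, name, job):
        if name not in self.live:
            raise RuntimeError(f"cluster {name!r} does not exist")
        self.log.append(("submit", name))
        return job()  # run the job synchronously for the sake of the sketch

    def delete_cluster(self, name):
        self.live.discard(name)
        self.log.append(("delete", name))


@contextmanager
def job_scoped_cluster(client, name):
    """Create a cluster for one job and always delete it afterwards."""
    client.create_cluster(name)
    try:
        yield name
    finally:
        # Assumption: tear down on failure too, so no cluster is left billing.
        client.delete_cluster(name)


def run_job(client, name, job):
    """Run `job` on a cluster that exists only for the duration of the job."""
    with job_scoped_cluster(client, name) as cluster:
        return client.submit_job(cluster, job)
```

Because each job gets its own cluster, jobs never compete for resources, and the `finally` clause keeps the billing window tied to the job's runtime.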
gfodor, almost 8 years ago
Hah, we were doing this with EMR 6 years ago, I guess we were a little early :)

https://www.youtube.com/watch?v=NF6zwHlbh_I

We built a coordinator that would spin up specific categories of machines for each stage (some stages were MR jobs, some were Hadoop streaming jobs) -- for example, when doing in-memory work it was useful to have fewer nodes with more RAM, etc.
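
A rough sketch of how per-stage machine selection like this could look. The stage kinds, the `ClusterShape` type, and every node count and RAM figure below are made-up placeholders, not the values or code the coordinator actually used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterShape:
    nodes: int
    ram_gb_per_node: int


# Hypothetical profiles: in-memory stages get fewer, larger nodes.
SHAPES = {
    "in_memory": ClusterShape(nodes=4, ram_gb_per_node=64),
    "mapreduce": ClusterShape(nodes=16, ram_gb_per_node=16),
    "streaming": ClusterShape(nodes=16, ram_gb_per_node=16),
}


def plan(stages):
    """Pair each (stage_name, stage_kind) with the cluster shape for its kind."""
    return [(name, SHAPES[kind]) for name, kind in stages]
```

A coordinator could walk the resulting plan and bring up a cluster of the given shape before each stage runs.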
matt_wulfeck, almost 8 years ago
Interesting idea. I can see it being worthwhile with per-minute billing, which is not something supported by AWS. Also, EMR charges a per-instance premium; does anyone know if Google Cloud does something similar?

I'm curious how shuffle data is handled. Does the cluster intelligently scale down and move the shuffle data, or will the entire thing keep running while waiting for a single skewed reducer to finish? Or does the entire thing run on a single instance??
rmnoon, almost 8 years ago
What about the traditional IO / data locality win of having your processing colocated with your DFS? Is GCS bandwidth that amazing?
cutler, almost 8 years ago
Can anyone tell me how, as a sole developer, it's possible to gain real-world experience with distributed Hadoop and Spark given the massive computing resources required? It just seems like a closed shop to me.