Ask HN: What are the alternatives for hosting Apache Spark?

28 points by muramira about 7 years ago
I truly love what Databricks is doing, but their pricing model is unpredictable. Are there any other hosting companies that offer a fixed price?

12 comments

SmirkingRevenge about 7 years ago
If your Spark jobs are mostly batch workloads that can tolerate moderately infrequent failures and restarts, try Google Dataproc with preemptible VMs or Amazon EMR with spot instances.

Depending on your use case, you might spend many times less than you would with regular VMs. Many instances that cost several dollars an hour on AWS can be used for a fraction of the price.

It's also fairly easy to automate the region selection and bid (on AWS, that is; not sure about gcloud).

If you need streaming, obviously this might not be the way to go.
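
As a rough illustration of automating the region selection and bid, here is a minimal Python sketch. It is not the commenter's actual setup: the `SpotQuote` type, the `choose_region_and_bid` helper, the 20% headroom, and every price below are hypothetical placeholders. In practice the quotes would come from the cloud provider's pricing API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpotQuote:
    """One region's prices for a single instance type (hypothetical data)."""
    region: str
    spot_price: float       # current spot price, USD per instance-hour
    on_demand_price: float  # regular price, USD per instance-hour


def choose_region_and_bid(quotes, headroom=1.2):
    """Pick the region with the lowest spot price and derive a bid.

    The bid is the spot price plus some headroom (an assumed 20% by
    default), capped at the on-demand price so a spot instance never
    costs more than a regular one.
    """
    if not quotes:
        raise ValueError("no quotes supplied")
    best = min(quotes, key=lambda q: q.spot_price)
    bid = min(best.spot_price * headroom, best.on_demand_price)
    return best.region, round(bid, 4)


# Placeholder numbers for illustration only, not real prices.
quotes = [
    SpotQuote("us-east-1", 0.30, 1.00),
    SpotQuote("us-west-2", 0.25, 1.00),
    SpotQuote("eu-west-1", 0.90, 1.00),
]
region, bid = choose_region_and_bid(quotes)
print(region, bid)
```

Capping the bid at the on-demand price is a design choice in this sketch: above that price there is no reason to accept the interruption risk of a spot instance.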
perlin about 7 years ago
Rewrite all of your jobs using Apache Beam. Then use whatever runner you want: Spark, Flink, Google Cloud Dataflow, etc.
sandGorgon about 7 years ago
Google Dataproc: very good, and very soon they will release Kubernetes as the manager instead of YARN.
Zaheer about 7 years ago
Check out AWS Glue: https://aws.amazon.com/glue/

Disclosure: I work on this service.
tejasmanohar about 7 years ago
All 3 major cloud providers have offerings in this space: Amazon [0], Google [1], Microsoft [2].

[0]: https://aws.amazon.com/emr/

[1]: https://cloud.google.com/dataproc/

[2]: https://azure.microsoft.com/en-us/services/databricks/
Sevii about 7 years ago
You could give AWS EMR a shot. It probably doesn't offer as much as Databricks, but it should have consistent pricing.
antoncohen about 7 years ago
Run Spark on a managed Kubernetes like GKE? There is experimental support for using Kubernetes as the cluster manager.

https://apache-spark-on-k8s.github.io/userdocs/index.html
hiyer about 7 years ago
You can try Qubole [0]. The pricing is a small percentage of what you pay to the cloud provider, so it's predictable to an extent.

[0]: https://www.qubole.com/

Disclosure: I work here.
tspann about 7 years ago
https://hortonworks.com/products/data-platforms/cloud/aws/
scarecrowx about 7 years ago
We're using Spark on EMR with Data Pipeline to do ETL and to run scheduled jobs. Data Pipeline terminates the cluster once the ETL or job completes, which helps us save a lot on cost.
shelzzzzz about 7 years ago
What part of it is unpredictable? I guess if you know how many VMs or EC2 instances you're planning on using, it's the same pricing model as Dataproc or EMR.
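
To make that concrete, a back-of-the-envelope estimate might look like the following Python sketch. It assumes a simple model in which each node-hour costs the VM rate plus a per-node-hour service fee. The `estimate_monthly_cost` helper and all the rates are hypothetical, so check each provider's pricing page for real figures.

```python
def estimate_monthly_cost(nodes, hours_per_day, vm_rate, service_fee_rate, days=30):
    """Estimate a monthly bill in USD for a cluster of fixed size.

    Assumed model: every node-hour is billed at the VM rate plus a
    managed-service fee, both in USD per node-hour.
    """
    node_hours = nodes * hours_per_day * days
    return round(node_hours * (vm_rate + service_fee_rate), 2)


# Placeholder figures: 10 nodes running 4 hours a day, at $0.50 per hour
# for the VM and $0.10 per hour for the managed service.
monthly = estimate_monthly_cost(nodes=10, hours_per_day=4,
                                vm_rate=0.50, service_fee_rate=0.10)
print(monthly)  # 1200 node-hours at $0.60 each: 720.0
```

Under this model the bill is only as predictable as the node count and running hours, which is the commenter's point.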
curiousDog about 7 years ago
Check out Azure Databricks