科技回声

8 comments

super_sloth, more than 10 years ago

Has anyone had some good experiences with Spark?

I put several weeks into moving our machine learning pipeline over to Spark, only to find I kept hitting a race condition in their scheduler.

After doing a bit of searching, it seems this is actually a known issue, https://issues.apache.org/jira/browse/SPARK-4454, and there's been a fix on their GitHub for a while: https://github.com/apache/spark/pull/3345. Yet in that time two releases have swung by, bringing a tonne of features.

I ended up having to drop Spark because I wasn't confident about putting it into production (the random OOMs and NPEs during development weren't great either). Does anyone have any positive experiences?

elliptic, more than 10 years ago

Spark the platform seems awesome. I'm somewhat less convinced by MLlib: I'm not sure there are as many use cases for distributed machine learning as people seem to think (and I would bet that a good many companies that use distributed ML don't really need it). I've seen a lot of tasks that could be handled by simpler, faster algos on large workstations (you can get 250 GB RAM from AWS for like $4.00/hr). I'd love to hear counterarguments, though!

rxin, more than 10 years ago

I'm one of the authors of the blog post as well as this new API. Feel free to ask me anything.

rjurney, more than 10 years ago

Hey... don't downvote Reynold Xin, author of the post as dupe when he says AMA.

homerowilson, more than 10 years ago

A replica of this benchmark on my laptop running R has this running in about 1/4 second. Seems like a pretty trivial benchmark?

<pre><code>library(data.table)
x = data.table(a=sample(10,10e6,replace=TRUE),num=sample(100,10e6,replace=TRUE))
t1=proc.time(); x[,sum(num),by=a]; print(proc.time()-t1)</code></pre>

<pre><code>   user  system elapsed
  0.209   0.032   0.245</code></pre>

eranation, more than 10 years ago

Naive question: how is this different from Spark SQL and things like project Zeppelin?

jyotiska, more than 10 years ago

This is great news! Where do I see the source for this?

zodvik, more than 10 years ago

Will the DataFrame API work with Spark Streaming?

Introducing DataFrames in Spark for Large Scale Data Science