Introducing DataFrames in Spark for Large Scale Data Science

138 points by rxin, more than 10 years ago
I'm the author of this blog post. We are very excited about this API and think it will be the common interchange format for data in Spark. It also has some neat features (such as code generation and predicate pushdown) that make it very useful for Big Data.

Feel free to ask me anything.

8 comments

super_sloth, more than 10 years ago
Has anyone had good experiences with Spark?

I put several weeks into moving our machine learning pipeline over to Spark, only to find I kept hitting a race condition in their scheduler.

After a bit of searching, it seems this is actually a known issue (https://issues.apache.org/jira/browse/SPARK-4454), and there's been a fix on their GitHub for a while (https://github.com/apache/spark/pull/3345), yet in that time two releases have swung by, bringing a tonne of features.

I ultimately had to drop Spark because I wasn't confident about putting it into production (the random OOMs and NPEs during development weren't great either). Does anyone have any positive experiences?
elliptic, more than 10 years ago
Spark the platform seems awesome. I'm somewhat less convinced by MLlib. I'm not sure there are as many use cases for distributed machine learning as people seem to think (and I would bet that a good many companies that use distributed ML don't really need it). I've seen a lot of tasks that could be handled by simpler, faster algos on large workstations (you can get 250 GB RAM from AWS for like $4.00/hr). I'd love to hear counterarguments, though!
rxin, more than 10 years ago
I'm one of the authors of the blog post as well as this new API. Feel free to ask me anything.
rjurney, more than 10 years ago
Hey... don't downvote Reynold Xin, author of the post, as a dupe when he says AMA.
homerowilson, more than 10 years ago
A replica of this benchmark on my laptop running R has this running in about 1/4 second. Seems like a pretty trivial benchmark?

    library(data.table)

    x = data.table(a=sample(10,10e6,replace=TRUE),num=sample(100,10e6,replace=TRUE))
    t1=proc.time(); x[,sum(num),by=a]; print(proc.time()-t1)

       user  system elapsed
      0.209   0.032   0.245
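For readers who don't use R, the aggregation being timed in the comment above (sum of `num` grouped by `a` over random rows) might look roughly like this in plain Python. This is only an illustrative sketch, not the blog post's benchmark and not Spark code: the function names `make_rows` and `sum_by_key` are made up here, the default row count is smaller than in the R snippet, and the timing is not comparable to the R numbers.

```python
import random
import time
from collections import defaultdict


def make_rows(n, seed=0):
    """Generate n (a, num) pairs with a in 1..10 and num in 1..100,
    mirroring the two sample() calls in the R snippet."""
    rng = random.Random(seed)
    return [(rng.randint(1, 10), rng.randint(1, 100)) for _ in range(n)]


def sum_by_key(rows):
    """Group rows by the first field and sum the second,
    analogous to x[, sum(num), by=a] in data.table."""
    totals = defaultdict(int)
    for a, num in rows:
        totals[a] += num
    return dict(totals)


if __name__ == "__main__":
    # Assumption: 1 million rows for a quick run; the R snippet uses 10e6.
    rows = make_rows(1_000_000)
    start = time.perf_counter()
    totals = sum_by_key(rows)
    elapsed = time.perf_counter() - start
    for key in sorted(totals):
        print(key, totals[key])
    print(f"elapsed: {elapsed:.3f}s")
```

The point of the comment still stands either way: a single-machine group-by over ten keys is a small workload, so it says little about how a distributed engine behaves on data that does not fit on one machine.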
eranation, more than 10 years ago
Naive question: how is this different from Spark SQL and things like project Zeppelin?
jyotiska, more than 10 years ago
This is great news! Where can I see the source for this?
zodvik, more than 10 years ago
Will the DataFrame API work with Spark Streaming?