
Spark 2.0 Technical Preview

259 points by rxin about 9 years ago

7 comments

graffitici about 9 years ago
The new Structured Streaming API looks pretty interesting. I have the impression that many Apache projects are trying to address the problems that arise with the lambda architecture. When implementing such a system, you have to worry about dealing with two separate systems: one for low-latency stream processing, and the other for batch-style processing of large amounts of data.

Samza and Storm mostly focus on streaming, while Spark and MapReduce traditionally deal with batch. Spark leverages its core competency of dealing with batch data and treats streams like mini-batches, effectively treating everything as batch.

And I imagine in the following snippet, the author is referring to Apache Flink, among other projects:

> One school of thought is to treat everything like a stream; that is, adopt a single programming model integrating both batch and streaming data.

My understanding is that Structured Streaming also treats everything like batch, but can recognize that the code is being applied to a stream and do some optimizations for low-latency processing. Is this what's going on?
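
A minimal sketch of what that unified model looks like in Spark 2.0's Structured Streaming, assuming a hypothetical /data/events directory of JSON records with invented userId and ts fields. The point is that only the read/write boundary distinguishes the streaming query from its batch twin:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    object StreamingSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()
        import spark.implicits._

        // Hypothetical event schema; streaming sources need it declared up front.
        val schema = new StructType()
          .add("userId", StringType)
          .add("ts", TimestampType)

        // Batch flavor: a bounded DataFrame over the data as it exists now.
        spark.read.schema(schema).json("/data/events")
          .groupBy($"userId").count().show()

        // Streaming flavor: the identical transformation over an unbounded source.
        val counts = spark.readStream.schema(schema).json("/data/events")
          .groupBy($"userId").count()

        val query = counts.writeStream
          .outputMode("complete")   // maintain the full aggregate as new data arrives
          .format("console")        // demo sink: print each trigger's result
          .start()

        query.awaitTermination()
      }
    }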
minimaxir about 9 years ago
Given that all the big data talk lately has been about GPU computing/TensorFlow, I'm glad to see that this Spark update shows in-memory computing is still viable. (Much cheaper to play with, too!)

The key feature for me is machine learning functions in R, which otherwise lacks parallelizable and scalable options (without resorting to black magic, anyway).
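
The R bindings wrap the same distributed ML machinery that Spark exposes natively; a hedged sketch of that underlying pipeline in Spark's own Scala API (the file path and column names here are invented for illustration, and this is the Scala equivalent, not the SparkR code itself):

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.sql.SparkSession

    object MlSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("ml-sketch").getOrCreate()

        // Hypothetical training table with numeric features and a 0/1 "label" column.
        val df = spark.read.parquet("/data/training")

        // MLlib expects the features packed into a single vector column.
        val features = new VectorAssembler()
          .setInputCols(Array("f1", "f2"))
          .setOutputCol("features")

        // The fit runs distributed across the cluster, in memory.
        val model = new LogisticRegression()
          .setLabelCol("label")
          .fit(features.transform(df))

        println(s"coefficients: ${model.coefficients}")
        spark.stop()
      }
    }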
harigov about 9 years ago
I like the direction Spark is heading. I am happy to see that they look at Spark the same way compiler developers look at programming languages. There are huge optimizations to be made in this space. It's insane how inefficient our current systems are when it comes to big-data processing.
buryat about 9 years ago
For me, the biggest improvement is the unified typed Dataset API [1]. The current Dataset API gave us a lot of flexibility and type safety, and the new API lets us use it like the DataFrame API instead of converting to RDD and reinventing the wheel, e.g. with aggregators [2].

[1] https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/431554386690871/4814681571895601/latest.html

[2] https://docs.cloud.databricks.com/docs/spark/1.6/examples/Dataset%20Aggregator.html
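
A small illustrative sketch of that unification, with an invented Purchase record type: the same Dataset accepts both compile-time-checked Scala lambdas and DataFrame-style column expressions, with no drop to RDD in between:

    import org.apache.spark.sql.SparkSession

    // Invented record type; typed Dataset operations on it are checked at compile time.
    case class Purchase(user: String, amount: Double)

    object DatasetSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("dataset-sketch").getOrCreate()
        import spark.implicits._

        val ds = Seq(Purchase("ada", 9.5), Purchase("bob", 3.0)).toDS()

        // Typed: a plain Scala function, no casting, no conversion to RDD.
        val big = ds.filter(_.amount > 5.0)

        // Untyped DataFrame-style aggregation on the very same Dataset,
        // so no hand-rolled aggregator is needed for simple sums.
        ds.groupBy($"user").sum("amount").show()

        big.show()
        spark.stop()
      }
    }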
doug1001 about 9 years ago
The micro-benchmarks are impressive; e.g., to join 1 billion records: Spark 1.6 ~ 61 sec, Spark 2.0 ~ 0.8 sec.

I assume results like this are due to the various optimizations under the Tungsten rubric (code generation, manual memory management), which rely on the sun.misc.Unsafe API.
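
A sketch of that style of micro-benchmark, assuming a self-join of synthetic ranges (the timing method and setup here are guesses, not the exact benchmark code):

    import org.apache.spark.sql.SparkSession

    object JoinBench {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("join-bench").getOrCreate()

        val n = 1000L * 1000 * 1000  // one billion ids

        val t0 = System.nanoTime()
        // Join two synthetic ranges on "id"; whole-stage code generation can fuse
        // the range, join, and count into a few tight generated loops.
        val rows = spark.range(n).join(spark.range(n), "id").count()
        val secs = (System.nanoTime() - t0) / 1e9

        println(f"joined $rows%d rows in $secs%.1f s")
        spark.stop()
      }
    }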
oonny about 9 years ago
Is there a video of a real, live example of how Spark helped solve a specific problem? I've tried quite a few times to wrap my head around what Spark helps you solve.
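
Not a video, but the canonical toy example of the problem class Spark targets is counting words across more text than fits on one machine; a minimal sketch (the input path is invented):

    import org.apache.spark.sql.SparkSession

    object WordCount {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("word-count").getOrCreate()
        import spark.implicits._

        // Spark splits the input across executors and merges partial counts,
        // so the same code runs on a laptop or a thousand-node cluster.
        spark.read.textFile("/data/logs")
          .flatMap(_.split("\\s+"))
          .groupByKey(identity)
          .count()
          .show()

        spark.stop()
      }
    }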
babo about 9 years ago
Is compiling from source the only way to test Spark 2.0 without using the proprietary Databricks package?