The new Structured Streaming API looks pretty interesting. I have the impression that many Apache projects are trying to address the problems that arise with the lambda architecture. When implementing such a system, you have to maintain two separate systems: one for low-latency stream processing, and another for batch processing of large amounts of data.<p>Samza and Storm mostly focus on streaming, while Spark and MapReduce traditionally deal with batch. Spark leverages its core competency of batch processing and treats a stream as a sequence of mini-batches, effectively treating everything as batch.<p>And I imagine that in the following snippet, the author is referring to Apache Flink, among other projects:<p>> One school of thought is to treat everything like a stream; that is, adopt a single programming model integrating both batch and streaming data.<p>My understanding is that Structured Streaming also treats everything like batch, but it recognizes when the code is being applied to a stream and applies some optimizations for low-latency processing. Is this what's going on?
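To make the "everything as batch" idea concrete, here's a toy sketch in plain Python (not Spark's actual internals or API, just an illustration of the micro-batch model): the exact same batch function is reused unchanged on small chunks of an unbounded stream, with results merged into a running total.

```python
from itertools import islice

def batch_word_count(records):
    """A 'batch' computation: count words in a finite collection."""
    counts = {}
    for rec in records:
        for word in rec.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

def micro_batches(stream, size):
    """Chop a (possibly unbounded) iterator into finite mini-batches."""
    it = iter(stream)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def stream_word_count(stream, size=2):
    """'Streaming' here is just repeatedly running the batch job on
    mini-batches and folding each result into a running total --
    the micro-batch idea, in miniature."""
    totals = {}
    for chunk in micro_batches(stream, size):
        for word, n in batch_word_count(chunk).items():
            totals[word] = totals.get(word, 0) + n
        yield dict(totals)  # snapshot of the running result after each mini-batch

# The streaming result converges to what one big batch run would produce:
snapshots = list(stream_word_count(iter(["a b", "a", "b c", "c"]), size=2))
```

The point of the sketch is that `batch_word_count` never knows whether its input came from a file or a stream; the engine decides how to slice and re-run it, which (as I understand it) is roughly the contract Structured Streaming offers at the DataFrame level.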