The new Structured Streaming API looks pretty interesting. I have the impression that many Apache projects are trying to address the problems that arise with the lambda architecture. When implementing such a system, you have to worry about maintaining two separate systems: one for low-latency stream processing, and another for batch-style processing of large amounts of data.

Samza and Storm mostly focus on streaming, while Spark and MapReduce traditionally deal with batch. Spark leverages its core competency of dealing with batch data and treats streams as mini-batches, effectively treating everything as batch.

And I imagine in the following snippet, the author is referring to Apache Flink, among other projects:

> One school of thought is to treat everything like a stream; that is, adopt a single programming model integrating both batch and streaming data.

My understanding is that Structured Streaming also treats everything like batch, but can recognize that the code is being applied to a stream and do some optimizations for low-latency processing. Is this what's going on?
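For what it's worth, my mental model is roughly the word-count sketch below (Scala; the socket source and localhost:9999 details are just placeholders for illustration, not anything from the announcement). The query itself is ordinary DataFrame/Dataset code; only the source and sink declarations say it's a stream, which presumably is what lets the engine plan and run it incrementally:

    import org.apache.spark.sql.SparkSession

    object StreamingSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("streaming-sketch").getOrCreate()
        import spark.implicits._

        // readStream gives an unbounded DataFrame; everything after this is the
        // same query you would write against a static table.
        val lines = spark.readStream
          .format("socket")
          .option("host", "localhost")
          .option("port", 9999)
          .load()

        val counts = lines.as[String]
          .flatMap(_.split(" "))
          .groupBy("value")
          .count()

        // The engine maintains the aggregate incrementally as new data arrives.
        val query = counts.writeStream
          .outputMode("complete")
          .format("console")
          .start()

        query.awaitTermination()
      }
    }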
Given that all the big data talk lately has been about GPU computing/TensorFlow, I'm glad to see that this Spark update shows in-memory computing is still viable. (Much cheaper to play with too!)

The key feature for me is the machine learning functions in R, which otherwise lacks parallelizable and scalable options. (Without resorting to black magic, anyways)
I like the direction Spark is heading. I am happy to see that they look at Spark the same way compiler developers look at programming languages. There are huge optimizations to be made in this space. It's insane how inefficient our current systems are when it comes to big-data processing.
For me, the biggest improvement is the unified typed Dataset API [1]. The current Dataset API gives us a lot of flexibility and type safety, and the unified API lets us use it like the DataFrame API instead of converting to RDDs and reinventing the wheel, e.g. with aggregators [2].

[1] https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/6122906529858466/431554386690871/4814681571895601/latest.html

[2] https://docs.cloud.databricks.com/docs/spark/1.6/examples/Dataset%20Aggregator.html
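Something like this snippet (Scala; the Purchase case class and purchases.json path are made up for illustration) is what I mean by mixing typed Dataset operations and DataFrame-style operations on the same object:

    import org.apache.spark.sql.SparkSession

    case class Purchase(user: String, amount: Double)

    object DatasetSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("dataset-sketch").getOrCreate()
        import spark.implicits._

        // In 2.0, DataFrame is just an alias for Dataset[Row]; .as[Purchase]
        // gives a typed view over the same data.
        val purchases = spark.read.json("purchases.json").as[Purchase]

        // Typed, compile-time-checked transformation...
        val large = purchases.filter(_.amount > 100.0)

        // ...mixed with untyped DataFrame-style aggregation on the same object.
        large.groupBy($"user").sum("amount").show()

        spark.stop()
      }
    }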
The micro-benchmarks are impressive; e.g., joining 1 billion records takes ~61 sec on Spark 1.6 vs. ~0.8 sec on Spark 2.0.

I assume results such as this are due to various optimizations under the Tungsten rubric (code generation, manual memory management), which rely on the sun.misc.Unsafe API.
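A rough way to poke at this yourself (Scala; this is my own sketch, not the benchmark code Databricks ran) is to time a similar synthetic join and look at the physical plan, where operators fused by whole-stage code generation show up with a '*' prefix in the explain() output:

    import org.apache.spark.sql.SparkSession

    object JoinBenchmarkSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("join-benchmark-sketch").getOrCreate()

        // One billion synthetic ids joined against a small table.
        val joined = spark.range(1000L * 1000 * 1000)
          .join(spark.range(1000L), "id")

        // Whole-stage-codegen'd operators are marked with '*' in the plan.
        joined.explain()

        val start = System.nanoTime()
        val n = joined.count()  // force evaluation
        val secs = (System.nanoTime() - start) / 1e9
        println(s"matched $n rows in $secs s")

        spark.stop()
      }
    }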
Is there a video of a real live example of how Spark helped to solve a specific problem? I've tried quite a few times to wrap my head around what Spark helps you solve.