Apache Heron: A realtime, distributed, fault-tolerant stream processing engine

170 points by yagizdegirmenci almost 4 years ago

21 comments

bob1029 almost 4 years ago

If you are interested in hyper-scale event processing but want to learn it from first principles, I strongly recommend following Martin Thompson's talks. Here is an example of one on cluster consensus:

> https://www.youtube.com/watch?v=GFfLCGW_5-w

and another on event log architecture:

> https://www.youtube.com/watch?v=RlwO6CJbJjQ

After digging through all of this material and playing around with LMAX Disruptor & Raft, I have been able to develop a really good understanding of how to build these sorts of systems on my own. Fun constraints like "Only one thread actually mutates anything, and it's the same one over and over" make for incredibly elegant implementation opportunities. Not having to constantly hunt down exotic thread-safe data structures means that you can focus on building actual value.

Latency is the biggest devil you will dance with in this arena, so almost everything you do will be oriented around mitigating that effect. That means latency both on the network and inside the CPU/memory/storage. It applies at every level.
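
The "only one thread actually mutates anything" constraint mentioned above can be illustrated with a minimal Python sketch. This is not the LMAX Disruptor or any code from the talks; it swaps in a plain `queue.Queue` as the hand-off point, and the names `run_single_writer`, `inbox`, and `balances` are hypothetical. Several producer threads enqueue events, but a single writer thread is the only one that ever touches the state, so the state itself can be an ordinary dict with no locks.

```python
import queue
import threading


def run_single_writer(events, num_producers=4):
    """Apply (account, delta) events to state using exactly one mutating thread."""
    inbox = queue.Queue()   # the only thread-safe structure in the sketch
    balances = {}           # plain dict: mutated by the writer thread only
    stop = object()         # sentinel telling the writer to exit

    def writer():
        # The single writer: every mutation of `balances` happens here.
        while True:
            item = inbox.get()
            if item is stop:
                break
            account, delta = item
            balances[account] = balances.get(account, 0) + delta

    def producer(chunk):
        # Producers never touch `balances`; they only enqueue events.
        for event in chunk:
            inbox.put(event)

    writer_thread = threading.Thread(target=writer)
    writer_thread.start()

    producers = [
        threading.Thread(target=producer, args=(events[i::num_producers],))
        for i in range(num_producers)
    ]
    for p in producers:
        p.start()
    for p in producers:
        p.join()

    # All producers have finished, so every event is enqueued ahead of the sentinel.
    inbox.put(stop)
    writer_thread.join()
    return balances


if __name__ == "__main__":
    sample = [("alice", 5), ("bob", 3), ("alice", -2)] * 1000
    print(run_single_writer(sample))
```

A queue is far slower than a ring buffer tuned for cache behavior, so this only shows the ownership rule, not the latency characteristics the comment is about.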

doteka almost 4 years ago

How many distributed stream processing engines is the Apache foundation planning to collect? At this point it seems like there are more projects that do this (if you squint a bit) than companies with a serious use case for that type of architecture.

ubertaco almost 4 years ago

Oh, is this that "next-gen Storm" project that Twitter built? Seems like they've finally given it to Apache, like they did with Storm after buying the company that built it.

Based on a quick Wikipedia skim, looks like the answer is "yes". That explains this bullet point:

> Heron is API compatible with Apache Storm and hence no code change is required for migration.

aynyc almost 4 years ago

Sometimes, I feel like the Apache foundation is like Thanos, collecting all the distributed engines and watching the IT world burn.

From the beginning, Heron was envisioned as a new kind of stream processing system, built to meet the most demanding of technological requirements, to handle even the most massive of workloads, and to meet the needs of organizations of all sizes and degrees of complexity. Amongst these requirements:

  The ability to process billions of events per minute
  Extremely low end-to-end latency
  Predictable behavior regardless of scale and in the face of issues like extreme traffic spikes and pipeline congestion
  Simple administration, including:
    The ability to deploy on shared infrastructure
    Powerful monitoring capabilities
    Fine-grained configurability
    Easy debuggability

I can't wait for my next startup interview. We have a requirement of 25 messages per hour, with 10KB per message, you think you can build the ingestion pipeline using Kafka and MongoDB on a 10 node M5d.24xlarge cluster?

yamrzou almost 4 years ago

I worked with a bunch of stream processing engines a few years ago (Samza, Kafka Streams, Spark Streaming, Storm and Flink), and did a comparison between them as part of my internship.

IMO, Apache Flink is the most complete project for those use cases. It is well maintained and the devs are very helpful when asked on the mailing lists.

pixelmonkey almost 4 years ago

Wonder why this is getting posted today in particular?

The quick summary here is that this was a clean-house rewrite of Apache Storm done by an internal team at Twitter. As an open source project history refresher, Apache Storm was originally built by a startup called Backtype, and the project was led by Nathan Marz, the technical founder of Backtype. Then, Backtype was acquired by Twitter, and Storm became a major component for large-scale stream processing (of tweets, tweet analytics, and other things) at Twitter.

I wrote a summary of the "interesting bits" of Apache Storm here:

https://blog.parse.ly/storm/

However, at a certain point, Nathan Marz left Twitter, and a different group of engineers tried to rethink Storm inside Twitter. There was also a lot of work going on around Apache Mesos at the time. Heron is kind of a merger of their "rethinking" of Storm while also making it possible to manage Storm-like Heron clusters using Mesos.

But, I don't think Heron really took off. Meanwhile, Storm got very, very stable in the 1.x series, and then had a clean-house rewrite from Clojure to Java in the 2.x series, mainly to improve performance even more. The last stable/major Storm release was in 2020.

Storm provides a stream processing programming API, a multi-lang wire protocol, and a cluster management approach. But certain cluster computing problems can probably be better solved at the infrastructure layer today. (For example, Storm was developed before the whole container + docker + k8s focus in cloud ops.) That said, it's still a very powerful system; on my team, we process 75K+ events per second across hundreds of vCPU cores and thousands of Python processes with sub-second latencies by combining Storm and Kafka with our open source Python project, streamparse.

https://github.com/Parsely/streamparse

The core problems Storm solves: modeling data processing as a computation graph; high-speed network communication between threads, processes, and nodes; message delivery guarantees and retry capabilities; tunable parallelism; built-in monitoring and logging; and much more.

(Also, I'd be remiss if I didn't mention -- if you're interested in stream processing and distributed computing, we are hiring Python Data Engineers to work on a stack involving Storm, Spark, Kafka, Cassandra, etc.) -- https://www.parse.ly/careers/python_data_engineer

Wonnk13 almost 4 years ago

Sometimes I poke fun at the front-end folks and all the new frameworks they're constantly chasing.

Lately it's been starting to feel the same for distributed systems. How many streaming engines are there now under Apache? Four?

JasonFruit almost 4 years ago

I read this as a realtime, distributed, fault-tolerant steam engine, which was hard to imagine.

dikei almost 4 years ago

IMHO, if you need stream processing, start with Apache Flink. It not only offers a much easier-to-use API compared to Storm and Heron, but also has a superior execution model for time-based, exactly-once, stateful stream processing.

loremipsium almost 4 years ago

hadoop, kafka, storm, spark, flink, samza, confluent

This tastes an awful lot like JavaScript framework hell

diehunde almost 4 years ago

I find it odd that so many people are complaining about having too many alternatives for streaming systems. If you are not an advanced user of these systems, of course you won't be able to choose the right one, and it won't be your job anyway. Think of it this way: we have dozens of databases out there, with new ones coming out every week, and nobody complains about that. That's because there are way more people who understand the differences and the use cases, so it makes sense.

majormajor almost 4 years ago

I would love to know more about how they do stateful stuff, since I haven't found anything that can compare to Flink there, but Flink has really poor quality of life stuff compared to some other options (e.g. Flink serialization setup pains vs Beam coders). But their docs kinda trail off here:

> Non-idempotent stateful topologies are stateful topologies that do not apply processing logic along the model of "multiply by zero" and thus cannot provide effectively-once semantics. An example of a non-idempotent

https://heron.incubator.apache.org/docs/heron-delivery-semantics#stateful-topologies

(I'm puzzled by their idempotent vs non-idempotent stateful topology description, because if something is mutating an internal state upon receiving events, it will likely be non-idempotent by design... unless they just mean "idempotent stateful" here to refer to keeping track of source/output position state and such.)

(They also do say that they can only support state storage in ZK or local FS, which feels like a likely non-starter compared to Flink for some of my use cases.)
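
One way to picture the idempotent vs non-idempotent distinction discussed above is to replay some events, as at-least-once delivery might after a failure, and see whether the resulting state changes. The Python sketch below is only an illustration under that assumption; it does not use Heron's API, and `distinct_users` and `total_clicks` are hypothetical names. Adding to a set is idempotent under replay, while a running sum is not.

```python
def distinct_users(events):
    """Idempotent state: replaying an event cannot change the resulting set."""
    seen = set()
    for user, _clicks in events:
        seen.add(user)
    return seen


def total_clicks(events):
    """Non-idempotent state: each replayed event is counted again."""
    totals = {}
    for user, clicks in events:
        totals[user] = totals.get(user, 0) + clicks
    return totals


events = [("ann", 1), ("bob", 2), ("ann", 3)]
# Simulated redelivery: the last two events arrive a second time.
replayed = events + events[1:]

print(distinct_users(events) == distinct_users(replayed))  # same state after replay
print(total_clicks(events))     # totals without replay
print(total_clicks(replayed))   # totals inflated by the replayed events
```

A stateful operator of the second kind needs something extra, such as checkpointed state tied to source positions or deduplication by event ID, before replays stop affecting the result.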

rayrrr almost 4 years ago

For once almost all commenters so far are thinking the same thing I am. Yet another stream processor from Apache. Might as well create the acronym now. YASP.

CSDude almost 4 years ago

I really wonder what people actually use stream processing for, like very concrete examples. My best examples would only go as far as filtering a stream over a time window to compute an aggregate. My job does not require anything more, it's always basic ETL, but I really need to hear specific examples where it's useful for others. Been a long-time fan of Apache Flink.
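
For readers unfamiliar with the "filter a stream over a time window to compute an aggregate" pattern described above, here is a minimal Python sketch of a tumbling-window sum. It is plain Python over an in-memory list, not Flink or Heron code; the function name `tumbling_window_sum` and the `(timestamp_seconds, value)` event shape are assumptions made for illustration.

```python
from collections import defaultdict


def tumbling_window_sum(events, window_seconds, predicate=lambda value: True):
    """Sum values per fixed-size (tumbling) window, keeping values that pass `predicate`.

    `events` is an iterable of (timestamp_seconds, value) pairs.
    Returns {window_start: sum} for windows with at least one matching event.
    """
    sums = defaultdict(int)
    for timestamp, value in events:
        if not predicate(value):
            continue  # the "filter" step
        # Each timestamp belongs to exactly one window: [start, start + window_seconds).
        window_start = (timestamp // window_seconds) * window_seconds
        sums[window_start] += value
    return dict(sums)


events = [(0, 5), (12, -1), (59, 7), (60, 2), (125, 4)]
result = tumbling_window_sum(events, 60, predicate=lambda v: v > 0)
print(result)  # {0: 12, 60: 2, 120: 4}
```

What a real engine adds on top of this is mostly operational: handling late and out-of-order events, persisting window state across failures, and spreading the keys over many machines.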

elchief almost 4 years ago

So, AWS Pelican coming in 3 months?

biggestlou almost 4 years ago

Everyone, please note that this is NOT an announcement of Heron entering the Apache Foundation. This happened several years ago, I believe in 2017 (when I was working on it).

antpls almost 4 years ago

Since no one mentioned it in the comments so far, please add Hazelcast Jet to the long list of stream processors: https://github.com/hazelcast/hazelcast-jet

karmasimida almost 4 years ago

Anyone use this in production except Twitter? Just curious.

fmakunbound almost 4 years ago

Is it another corporate dumping at Apache or does Twitter seem to continue to support it as an open source project there?

aetherspawn almost 4 years ago

Uhh, a triple-take: read this twice as Apache Heroin, and I’m not a druggy.

Is it apt? Reader exercise.

cblconfederate almost 4 years ago

For a nonprofit, Apache seems to have an unhealthy affection for solving scaling problems that only a few giant rich companies have.
