Storm sounds great, but this post probably should have waited until it was actually open-sourced. As it is, it just comes across as naked self-promotion based on a technology that, for all we know, could be vaporware.
It sounds like a neat project, but I think describing it as "real time" is misleading if you're not also providing information on latency. The majority of the provided use cases seem to indicate a high level of scalability and durability, as well as high throughput, but these are not the defining characteristics of a true real-time system.<p>It's a common misconception. A real-time system doesn't have to be fast, efficient, or fault tolerant. A real-time system must guarantee with 100% certainty that in all cases it will respond to input X within a time period of Y.<p>I would be interested to learn about the timing requirements driving the development of this system and how you've guaranteed such a response time, especially given that it's running on top of the JVM and must therefore deal with a non-deterministic garbage collection process.
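To illustrate the distinction this comment is drawing: a system can have excellent throughput and average latency yet still fail a hard real-time guarantee, because the guarantee is about the worst case. A minimal sketch with made-up latency numbers (the 10 ms deadline and the GC-pause outliers are hypothetical, not from the article):

```python
# Simulated per-request latencies (ms) for a high-throughput system:
# almost every request is fast, but an occasional pause (e.g. a JVM
# garbage collection) stalls a request for hundreds of milliseconds.
latencies = [1.0] * 998 + [250.0, 400.0]  # two pause-induced outliers

mean_latency = sum(latencies) / len(latencies)
worst_case = max(latencies)

# Throughput and average latency look excellent...
print(f"mean latency: {mean_latency:.2f} ms")  # ~1.65 ms

# ...but a hard real-time guarantee is judged on the worst case.
print(f"worst case:   {worst_case:.2f} ms")    # 400 ms

deadline_ms = 10.0  # hypothetical hard deadline
meets_hard_deadline = worst_case <= deadline_ms
print(f"meets {deadline_ms} ms hard deadline: {meets_hard_deadline}")  # False
```

The mean is well under the deadline, but two outliers out of a thousand are enough to break a hard real-time guarantee, which is exactly why a latency bound (and how it is enforced on the JVM) matters here.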
This looks interesting. Questions:<p>(1) What do you mean by a processing topology -- is this a data dependency graph?<p>(2) How does one define a topology? Is this specified at deployment time via the jar file, or can it be configured separately and on the fly?<p>(3) Must records be processed in time order, or can they be sorted and aggregated on some other key?
TW;DR!!<p>For a variety of reasons, I keep my browser windows about 900 pixels wide. Your site requires a honking 1280 to get rid of the horizontal scrollbar -- and can't be read in 900 without scrolling horizontally for every line (i.e. the menu on the left is much too wide).<p>(OT, I know, but it's a pet peeve of mine. It's been known for years how to use CSS to make pages stretch or squish, within reason, to the user's window width. 900 is not too narrow!)<p>EDITED to add: yeah, I'm willing to spend some karma points on this, if that's what happens. Wide sites are getting more common, and this is one of the worst I've seen.
How is this different from a "traditional" CEP system like Esper?<p>(I mean on the actual processing front, rather than architecturally -- sounds like Storm is a bunch of building blocks instead of a unified system.)
I work on a similar system that was previously discussed on HN: <a href="http://news.ycombinator.com/item?id=2442977" rel="nofollow">http://news.ycombinator.com/item?id=2442977</a>
" To compute reach, you need to get all the people who tweeted the URL, get all the followers of all those people, unique that set of followers, and then count the number of uniques. It's an intense computation that potentially involves thousands of database calls and tens of millions of follower records."<p>Or you could use a Graph DB to solve a Graph problem.<p>URL -> tweeted_by -> users -> followed_by -> users<p>Try that on Neo4j.
This sounds great.<p><i>This is the traditional realtime processing use case: process messages and update a variety of databases.</i><p>Question: I typically think of real-time as a need for user-facing things, i.e. handling a user's requests before he gets bored and goes away. Is Storm set up for that? Or is it mostly meant to update a database with results rather than return them to a waiting process?
I'm not sure if this is the same thing, but there's also a new company called Hadapt (the commercialization of HadoopDB). It's about adapting Hadoop for real-time analytic SQL queries by putting local SQL databases on the Hadoop nodes and then using the Hadoop plumbing. It's based on Daniel Abadi's research; he's a really smart guy.
How is this different from / better than Yahoo S4 [1], which does have code on GitHub [2]? Why did you choose to build this, or did you start before S4 became public?<p>[1] <a href="http://docs.s4.io/" rel="nofollow">http://docs.s4.io/</a>
[2] <a href="https://github.com/s4/core" rel="nofollow">https://github.com/s4/core</a>
This sounds like something that's been painfully over-engineered.<p>One of the main problems they solve is "distributed RPC", from TFA: "There are a lot of queries that are both hard to precompute and too intense to compute on the fly on a single machine."<p>That's generally a sign that you've made a mistake somewhere in your application design. Pain is a response that tells you "stop doing that".