We're looking to implement a new data pipeline architecture at work. The primary goal is speed (the data is small enough to fit entirely in memory, sharded across multiple machines if needed). The primary bottleneck is feature extraction, transformation, and iteration, which is both CPU- and read/write-intensive. Model building is not too slow, so there's no need to distribute training/testing as of yet.

I've heard good things about Spark/Shark and Storm. Does anyone have any experiences or recommendations? Maybe we don't even need a super sophisticated system and a Riak/Redis K-V store cluster would do?

Thanks in advance
Hard to offer suggestions without knowing the rough size of your data; depending on how much money you're willing to cough up, even 1 TB is in "fits in memory" territory.

Having said that, Spark is really great for running iterative algorithms and will definitely fit what you've described (quick sketch below). I suggest staying away from building this yourself on Riak/Redis (at least until you've ruled out Spark), as you'll run into lots of operational issues: handling failures, resource allocation, retries, and so on.
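To make the iterative part concrete, here's a minimal sketch using Spark's Java API. The input path and the feature extractor are made-up stand-ins, not anyone's real pipeline; the key call is cache(), which pins the dataset in cluster memory so each pass over it reads from RAM instead of re-reading the source:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class FeaturePipeline {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("feature-extraction")
                    .setMaster("local[*]"); // swap in your cluster's master URL
            JavaSparkContext sc = new JavaSparkContext(conf);

            // cache() keeps the raw records in memory across passes.
            JavaRDD<String> raw = sc.textFile("hdfs:///data/events").cache();

            // Re-running extraction while you iterate on features is now an
            // in-memory transformation, not another read from disk.
            JavaRDD<double[]> features = raw.map(FeaturePipeline::extract).cache();
            System.out.println("rows: " + features.count());

            sc.close();
        }

        // Hypothetical extractor -- stand-in for your real feature logic.
        private static double[] extract(String line) {
            String[] cols = line.split(",");
            double[] f = new double[cols.length];
            for (int i = 0; i < cols.length; i++) {
                f[i] = cols[i].trim().hashCode() % 1000;
            }
            return f;
        }
    }

Prototyping against local[*] and then pointing setMaster at a real cluster is the same code, which is part of why the iterate-fast workflow is pleasant.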
I can vouch for Storm, if only for the fact that it's pretty easy to set up (especially compared to Hadoop). Being able to leverage ZooKeeper gives you some extra coordination capabilities as well. With that said, just watch how you build your bolts/spouts: there are lots of ways to send data into the system, but in general Storm's documentation has been superb to work with.

I built a mini library for myself that auto-constructs topologies from a set of named dependencies to handle bolt/spout wiring. Aside from that, the builder interface is really nice if your data pipeline doesn't change (quick sketch below).

There's good support for testing with a local cluster as well.
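For what it's worth, here's a minimal sketch of that builder interface with a throwaway spout and bolt (Storm 1.x package names assumed; the spout/bolt logic is a placeholder, not a real pipeline):

    import java.util.Map;
    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;
    import org.apache.storm.utils.Utils;

    public class PipelineTopology {

        // Stand-in spout: emits a dummy record every 100 ms. Replace with
        // whatever actually feeds your pipeline (queue, log tailer, etc.).
        public static class EventSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;

            public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
                this.collector = collector;
            }

            public void nextTuple() {
                Utils.sleep(100);
                collector.emit(new Values("raw-record"));
            }

            public void declareOutputFields(OutputFieldsDeclarer d) {
                d.declare(new Fields("record"));
            }
        }

        // Stand-in bolt: "extracts" a feature from each record.
        public static class ExtractBolt extends BaseBasicBolt {
            public void execute(Tuple input, BasicOutputCollector out) {
                out.emit(new Values(input.getStringByField("record").length()));
            }

            public void declareOutputFields(OutputFieldsDeclarer d) {
                d.declare(new Fields("feature"));
            }
        }

        public static void main(String[] args) throws Exception {
            // Declarative wiring: name each node, set its parallelism, and
            // say which upstream node it subscribes to.
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("events", new EventSpout(), 1);
            builder.setBolt("extract", new ExtractBolt(), 4).shuffleGrouping("events");

            // In-process cluster for testing.
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("pipeline", new Config(), builder.createTopology());
            Utils.sleep(10000);
            cluster.shutdown();
        }
    }

The same topology object can be submitted to a real cluster via StormSubmitter; the LocalCluster path is what makes local testing cheap.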
You should check out http://0xdata.com/; it's built from the ground up on a custom distributed K-V store to do in-memory ML. Reasons to check it out:

1 - it's open source: https://github.com/0xdata/h2o

2 - it ingests data from HDFS, S3, and CSV

3 - I've built systems like what you're discussing twice; the ML algorithms are often easier to write than expected, while data management (moving data, sending updates, etc.), which initially seems easier, is much harder. 0xdata handles this for you.

4 - it's under active development

5 - it runs cleanly on your dev box with one or many nodes for development; deploying is as simple as uploading a jar to the cluster and putting a single file on each node naming its peers (rough sketch below)

5a - there are scripts to walk you through doing this

disclosure: I work on it as of very recently =P
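For point 5, the deploy step looks roughly like this; treat it as a sketch from memory (the cluster name, heap size, and addresses are made up), and check the project's docs for the exact flags:

    # flatfile.txt -- one peer per line (ip:port), same file on every node
    # 10.0.0.1:54321
    # 10.0.0.2:54321

    # on each node: same jar, same cluster name, same flatfile
    java -Xmx4g -jar h2o.jar -name mycluster -flatfile flatfile.txt

The nodes named in the flatfile find each other and form one in-memory cluster, which is the whole deploy story.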