I am rebuilding a data pipeline that processes billions of records. An overview of what I built is as follows:

collect data from (n*k) sources -> derive new data -> generate a unified/merged collection of (n) datasets.

The current solution is all hand-crafted code.

I know this is a 10,000-foot view of the problem, but are there any guides or books on how to better design and implement this type of solution?
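To give a little more shape to it, the structure is roughly the sketch below (function names, sources, and fields are placeholders for illustration, not the actual code):

```python
# Minimal sketch of the pipeline shape: collect -> derive -> merge.
# All names and data are placeholders, not the real implementation.
from typing import Dict, Iterable, List


def collect(source_id: str) -> Iterable[Dict]:
    """Pull raw records from one of the n*k sources."""
    # Placeholder: the real pipeline reads from files, APIs, queues, etc.
    yield {"source": source_id, "value": 1}


def derive(record: Dict) -> Dict:
    """Compute derived fields for a single record."""
    return {**record, "value_squared": record["value"] ** 2}


def merge(per_source: List[Iterable[Dict]]) -> List[Dict]:
    """Flatten the per-source streams into one unified collection."""
    return [rec for stream in per_source for rec in stream]


if __name__ == "__main__":
    sources = [f"source_{i}" for i in range(3)]
    unified = merge([(derive(r) for r in collect(s)) for s in sources])
    print(unified)
```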
That's pretty broad; here are some questions that might help you find something relevant:

1) Is it a big batch job or incremental/streaming analysis?
2) Does the data reasonably fit on one machine?
3) Does your existing analysis depend on any complex 3rd-party libraries (hard to port away from)?
4) Would you be willing to use a cloud provider's proprietary tool?
5) What level of commercial support do you need?

You can get pretty far with R or Pandas + Scipy on a fast machine; after that you start taking on the extra hassle of Spark or whatever fits your situation.

Oh, and 0) what pain is motivating the rebuild? Feel free to e-mail me even just to rubber-duck your thinking.
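As a rough idea of what the single-machine Pandas route looks like for a collect -> derive -> merge batch job (paths, columns, and the dedup key here are made-up examples, not a prescription):

```python
# Single-machine batch sketch with Pandas (paths and columns are illustrative).
import glob

import pandas as pd

# Collect: read each per-source extract (placeholder glob pattern).
frames = [pd.read_parquet(path) for path in glob.glob("extracts/source_*.parquet")]

# Derive: add computed columns per source before merging.
for df in frames:
    df["value_norm"] = (df["value"] - df["value"].mean()) / df["value"].std()

# Merge: one unified collection, deduplicated on a hypothetical record key.
unified = pd.concat(frames, ignore_index=True).drop_duplicates(subset=["record_id"])

unified.to_parquet("unified.parquet")
```

If that stops fitting in memory or in your time budget, that's usually the point where Spark (or a cloud-managed equivalent) starts paying for its extra hassle.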
I believe we need more details. Even if you can't disclose the n*k sources, can you at least estimate the volume per minute/hour/day and the analysis requirements (real-time or delayed)? Also, what kind of "processing" should we expect? There is a big difference between a simple transform and some data-science algos.