Allowing datasets of arbitrary size is going to make things tough. My first thought is to keep the data as .csv files on Amazon S3 or some other persistent storage service. Getting a database tuned is tough even when you know what data you have up front. Hadoop wouldn't be quite as bad, but it still wouldn't be trivial.<p>If you do that, I would recommend looking at WEKA's ARFF file format. It's a really clunky file format, but it captures a bunch of metadata (data types, max/min, etc.) needed by many typical machine learning algorithms. You could capture that kind of metadata as the data is being loaded, which would make later analysis easier.<p>After that, you'd have a situation where you can either stream the data out of the CSV files or chunk the files into subsets for use in map-reduce type algorithms. I'm not sure what the performance is like when you start requesting the middle of a large file from S3, though.<p>As for a stats package, if you know Python, I'd go with it. There are a few stats packages already out there that seem pretty good. But really, if you're just going to do basic stats like averages, standard deviations, moving averages over time, etc., those are pretty trivial to implement yourself. Rolling your own might be beneficial if you have very large data sets that can't fit in memory at once and a custom way of accessing data.<p>I should say I haven't used a lot of the newer whiz-bang analytics setups that have been coming out, but in general my experience has been that working around the idiosyncrasies of stats packages is usually more difficult than implementing my own methods while using their code as a reference.<p>My final advice is to not adopt an analytics framework that has to be the top level of the program. You really need to be able to control the analytics engine programmatically from your application. Stay away from systems that make you create modules or data flows inside their application, where the only way to modify them is through a GUI or a complex config file.
These systems are everywhere. They are nice as a high-powered replacement for Excel but not when you are trying to develop a software application.
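<p>To make the earlier point about basic stats concrete, here is a rough sketch of computing a mean, standard deviation, and moving average over CSV data one row at a time, so nothing has to fit in memory. It uses Welford's online algorithm for the variance, which is my choice for the example and not the only way to do it. The names `RunningStats`, `moving_average`, and `column_stats` are made up for illustration, and I'm assuming the CSV has a header row and the column you ask for is numeric.

```python
import csv
import math
from collections import deque


class RunningStats:
    """Welford's online algorithm: mean and std dev in one pass, O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def stdev(self):
        # Sample standard deviation; undefined for fewer than two values.
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else float("nan")


def moving_average(values, window):
    """Yield the mean of the last `window` values once the window is full."""
    buf = deque(maxlen=window)
    total = 0.0
    for x in values:
        if len(buf) == window:
            total -= buf[0]  # this value is evicted by the append below
        buf.append(x)
        total += x
        if len(buf) == window:
            yield total / window


def column_stats(lines, column):
    """Stream one numeric column out of CSV text, one row at a time.

    Assumes a header row and that `column` (a hypothetical name) is numeric.
    """
    stats = RunningStats()
    for row in csv.DictReader(lines):
        stats.add(float(row[column]))
    return stats
```

`lines` can be any iterable of text lines, such as an open file, so the same code should work whether you stream a whole file or hand it one chunk of a larger one. Merging stats from separate chunks for a map-reduce setup would need an extra combine step that isn't shown here.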