科技回声

My brother and I had an idea for an application that will perform statistical analysis on arbitrary data sets. We plan to work on it as a side project over the summer (or longer if need be), both to learn and potentially to build it into a business if it turns out as planned. I'm currently doing research to see what would be the best tech stack to build this on.

I have experience with LAMP, but I'm not sure it would be the best tool for what we have in mind (my brother has no programming experience). Ideally, I want:

- A server that can efficiently handle large uploads (mostly CSV/spreadsheets). Could this just be a matter of configuring Apache properly?
- A language/framework that is good/efficient for stats. (Maybe some combination of Python and R?)
- A database that can handle arbitrary data sets and preferably integrates well with the previous tools. I forced MySQL to do this in a proof of concept, but there are probably better tools for the job.
- A graphing/visualization tool. I kind of like rgraph.net for HTML5 charts, but I'm open to recommendations.
- Lastly, a lightweight/simple framework that handles all the typical features of a web app (user registration and management for now).

As I mentioned, this project is primarily for learning at this point, so I'm pretty open to any recommendations. Feel free to ask anything and let me know if I'm missing anything.

Thanks!

Allowing datasets of arbitrary size is going to make things tough. My first thought is to keep the data as .csv files on Amazon S3 or some other persistent storage network. Getting a database tuned is tough even when you know what data you have up front. Hadoop wouldn't be quite as bad, but it still wouldn't be trivial.

If you do that, I would recommend looking at WEKA's ARFF file format. It's a really clunky file format, but it captures a bunch of metadata (data types, max/min, etc.) needed by many typical machine learning algorithms. You could capture that type of metadata as the data is being loaded, which would make later analysis easier.

After that, you'd have a situation where you can either stream the data out of the csv files or chunk the files into subsets for use in map-reduce type algorithms. I'm not sure what the performance is like when you start requesting the middle of a large file from S3, though.

As for a stats package, if you know Python, I'd go with it. There are a few stats packages already out there that seem pretty good. But really, if you're just going to do basic stats like averages, standard deviation, moving averages over time, etc., those are pretty trivial to implement. Writing them yourself might even be beneficial if you have very large data sets that can't fit in memory at once and a custom way of accessing data.

I should say I haven't used a lot of the newer whiz-bang analytics setups that have been coming out, but in general my experience has been that working around the idiosyncrasies of stats packages is usually more difficult than implementing my own methods while using their code as a reference.

My final advice is to not adopt an analytics framework that has to be the top level of the program. You really need to be able to control the analytics engine programmatically from your application. Stay away from systems that make you create modules or data flows inside their application, where the only way to modify them is through a GUI or a complex config file. These systems are everywhere. They are nice as a high-powered replacement for Excel, but not when you are trying to develop a software application.
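To make the "stream the csv and implement the basic stats yourself" idea concrete, here is a minimal Python sketch, not a definitive implementation. It reads rows one at a time and keeps a running count, min, max, mean, and standard deviation per numeric column using Welford's online algorithm, so nothing beyond the current row has to sit in memory. The names `RunningStats` and `summarize_csv` and the sample data are made up for illustration, and the code assumes the first row is a header and that any cell `float()` can parse counts as numeric.

```python
import csv
import io
import math


class RunningStats:
    """Welford's online algorithm: one pass, constant memory per column."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean
        self.minimum = math.inf
        self.maximum = -math.inf

    def add(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    @property
    def stdev(self):
        # Sample standard deviation; undefined for fewer than two values.
        if self.count < 2:
            return float("nan")
        return math.sqrt(self._m2 / (self.count - 1))


def summarize_csv(lines):
    """Stream rows from any iterable of text lines (hypothetical helper).

    Returns a dict mapping column name -> RunningStats for every column
    that had at least one cell parseable as a float.
    """
    stats = {}
    for row in csv.DictReader(lines):
        for name, raw in row.items():
            try:
                value = float(raw)
            except (TypeError, ValueError):
                continue  # skip non-numeric or missing cells
            stats.setdefault(name, RunningStats()).add(value)
    return stats


# Made-up sample data; `lines` could just as well be an open file
# or a line iterator over a streamed download.
sample = io.StringIO("day,visits,label\n1,10,a\n2,20,b\n3,30,c\n")
result = summarize_csv(sample)
for column, s in result.items():
    print(column, s.count, s.minimum, s.maximum, s.mean, s.stdev)
```

The per-column min/max and count collected here are the same kind of metadata the ARFF suggestion is after, so a single pass at upload time could feed both.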

You can try MarkLogic Server. We have a free edition that will cover what you need, and Office toolkits that are really cool (so you can work directly in your Excel sheets).

Hadoop + Clojure + (Ring for web and Incanter for stats)?

Ask HN: Recommended stack for a data-heavy application?

3 comments
