Here's the scenario: if you were asked to build out three Linux machines that would be used together in a cluster to perform data mining and machine learning tasks, with the occasional MapReduce job thrown in, how would you spec the machines out? What distro would you use? Any must-have software installs?<p>With regard to the hardware, what is your preference for manufacturer? How much would you expect to pay per machine?<p>Your thoughts and suggestions are appreciated!
I hate to say it, but if these are the types of questions you have, I'm not sure you have the skills required to do what you're looking to do. A better question might be <i>"what are good resources to read to get into data mining and machine learning?"</i>
You have a requirement; now you need a specification. Until you can specify the needs accurately, it is impossible to design a solution. So now you need to ask questions about the algorithms that will be used: What hardware can they run on (CPUs, GPUs)? What size are the datasets? What sort of speed is needed? What constraints are there, such as cost? And so on. As for hardware manufacturers, you might look at Supermicro and Appro, but it really depends on your needs.
Use AWS until you have a reasonable grasp of your dataset and real requirements. Then buy whichever servers provide the best bang for your buck. That will probably mean getting 6 mid-range servers rather than three servers with the absolute fastest CPU and most memory available. Use either Red Hat (CentOS) or Debian, and you'll almost certainly be using Hadoop. Dell servers are fine, although you can sometimes save significantly by going with something like Supermicro servers from Newegg. To keep costs down, you'll want to order the bulk of your servers' memory from a third party rather than have it included in the build.
A former employer of mine in the financial sector used Scalable Informatics[1] and Dell[2] servers for that sort of thing.<p>1- <a href="http://scalableinformatics.com/" rel="nofollow">http://scalableinformatics.com/</a><p>2- <a href="http://www.dell.com/us/business/p/poweredge-cloud-servers" rel="nofollow">http://www.dell.com/us/business/p/poweredge-cloud-servers</a>