Ask HN: Linux rig for data mining and machine learning

9 points by big_data over 14 years ago

Here's the scenario: if you were asked to build out three Linux machines that would be used together in a cluster to perform data mining and machine learning tasks, with the occasional mapreduce job thrown in, how would you spec the machines out? What distro would you use? Any must-have software installs?

With regard to the hardware, what is your preference for manufacturer? How much would you expect to pay per machine?

Your thoughts and suggestions are appreciated!

4 comments

burgerbrain over 14 years ago
I hate to say it, but I'm not sure you have the skills required to actually do what you're looking to do if these are the kinds of questions you have. A better question might be "what are good resources to read to get into data mining and machine learning?"
turbojerry over 14 years ago
You have a requirement; now you need a specification. Until you can specify the needs accurately, it is impossible to design a solution. So you need to ask questions about the algorithms that will be used: what hardware can they run on, CPUs or GPUs? What size are the datasets? What sort of speed is needed? What constraints are there, such as cost? And so on. As for hardware manufacturers, you might look at Supermicro and Appro, but it really depends on your needs.
bobf over 14 years ago
Use AWS until you have a reasonable grasp of your dataset and real requirements. Then buy whatever servers provide the best bang for your buck. That will probably mean getting six mid-range servers rather than three servers with the absolute fastest CPU and most memory available. Use either Red Hat (CentOS) or Debian, and you'll almost certainly be using Hadoop. Dell servers are fine, although you can sometimes save significantly by going with something like Supermicro servers from Newegg. In terms of cost, you'll want to order the bulk of your servers' memory from a third party rather than have it included in the build.
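One way to make the "bang for your buck" comparison concrete is to total up cost and capacity for each candidate build and compare cost per core. The sketch below is only an illustration: every price, core count, and memory figure in it is a made-up placeholder, not a quote from any vendor, and cost per core is just one possible metric.

```python
from dataclasses import dataclass


@dataclass
class ClusterOption:
    """One candidate cluster build. All figures are hypothetical placeholders."""
    name: str
    servers: int
    price_per_server: float  # hypothetical USD price, not a real quote
    cores_per_server: int
    ram_gb_per_server: int

    @property
    def total_cost(self) -> float:
        return self.servers * self.price_per_server

    @property
    def total_cores(self) -> int:
        return self.servers * self.cores_per_server

    @property
    def total_ram_gb(self) -> int:
        return self.servers * self.ram_gb_per_server

    @property
    def cost_per_core(self) -> float:
        return self.total_cost / self.total_cores


def best_value(options):
    """Return the option with the lowest cost per core."""
    return min(options, key=lambda o: o.cost_per_core)


# Hypothetical numbers chosen only to illustrate the comparison.
options = [
    ClusterOption("3 high-end", 3, 8000.0, 16, 128),
    ClusterOption("6 mid-range", 6, 3000.0, 12, 48),
]

for o in options:
    print(f"{o.name}: ${o.total_cost:,.0f} total, {o.total_cores} cores, "
          f"{o.total_ram_gb} GB RAM, ${o.cost_per_core:,.0f} per core")

print("Best value by cost per core:", best_value(options).name)
```

With real quotes plugged in, the same comparison can be repeated for memory or disk, since the cheapest option per core is not necessarily the cheapest per GB of RAM.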
bayareaguy over 14 years ago
A former employer of mine in the financial sector used Scalable Informatics[1] and Dell[2] servers for that sort of thing.

[1] http://scalableinformatics.com/

[2] http://www.dell.com/us/business/p/poweredge-cloud-servers