科技回声

8 comments

law, over 13 years ago

Honestly, frameworks like Mahout and Weka have their place, and that's typically exploratory data analysis. My belief is that for large-scale, extremely intensive machine learning, your best bet is to implement algorithms tailored to the job at hand. Algorithms like logistic regression work fine if your data is linearly separable, but they're not a panacea. None of the algorithms are.

If you're interested in machine learning and artificial intelligence, I very strongly recommend "enrolling" in Tom Mitchell's machine learning class at http://www.cs.cmu.edu/~tom/10701_sp11/lectures.shtml -- the lectures are long and the mid-term and final are extremely difficult, but the material covered is an outstanding primer for these types of analyses.

After going through all of the lectures, you will look at things like Mahout and Weka as mere toys, and will be equipped to write your own implementations for whatever task you and your company are working on. It's a lot of front-loading for rewards that may at first glance seem illusory, but investing the time now will pay dividends later.
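To make the linear-separability point concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the toy AND/XOR data, the function names, and the learning rate and epoch count are not taken from the article, Mahout, or Weka. It fits a two-feature logistic regression by batch gradient descent, once on a linearly separable labelling (AND) and once on one that is not (XOR).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(points, labels, lr=0.5, epochs=5000):
    """Fit weights (w1, w2) and bias b by batch gradient descent on log loss."""
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        g1 = g2 = gb = 0.0
        for (x1, x2), y in zip(points, labels):
            err = sigmoid(w1 * x1 + w2 * x2 + b) - y  # gradient of log loss w.r.t. the logit
            g1 += err * x1
            g2 += err * x2
            gb += err
        w1 -= lr * g1
        w2 -= lr * g2
        b -= lr * gb
    return w1, w2, b

def accuracy(model, points, labels):
    """Fraction of points whose thresholded prediction matches the label."""
    w1, w2, b = model
    hits = sum(
        int((sigmoid(w1 * x1 + w2 * x2 + b) >= 0.5) == bool(y))
        for (x1, x2), y in zip(points, labels)
    )
    return hits / len(labels)

# Hypothetical toy data: the four corners of the unit square.
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [0, 0, 0, 1]  # linearly separable
xor_labels = [0, 1, 1, 0]  # not linearly separable

acc_and = accuracy(train_logistic(points, and_labels), points, and_labels)
acc_xor = accuracy(train_logistic(points, xor_labels), points, xor_labels)
print("AND accuracy:", acc_and)
print("XOR accuracy:", acc_xor)
```

No straight line can classify more than three of the four XOR points correctly, so the second fit cannot reach full accuracy no matter how long it trains. That is the sense in which a linear model is "not a panacea".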


dmk23, over 13 years ago

Mahout is a great platform, but the real challenge is defining your learning problems, preparing data sets and choosing the right algorithms.

Once you are clear as to what you actually want to accomplish, chances are you are going to need some kind of significantly modified or hybrid algorithm. Packages like Mahout could help you get started, but it is kinda funny that quite a few examples in this article do not actually demonstrate good algorithm performance, like this one:

<pre><code>Correctly Classified Instances          :      41523    61.9219%
Incorrectly Classified Instances        :      25534    38.0781%
Total Classified Instances              :      67057
=======================================================
Confusion Matrix
-------------------------------------------------------
a      b      c      d      e      f      <--Classified as
19044  0      12     1069   0      0       |  20125  a = cocoon_apache_org_dev
2066   0      1      477    0      0       |  2544   b = cocoon_apache_org_docs
16548  0      2370   704    0      0       |  19622  c = cocoon_apache_org_users
58     0      0      20109  0      0       |  20167  d = commons_apache_org_dev
147    0      1      4451   0      0       |  4599   e = commons_apache_org_user</code></pre>
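As a quick sanity check on the numbers quoted above, the headline accuracy and the per-class recall can be re-derived from the confusion matrix. This is a minimal sketch in Python: the variable names are my own, and the rows are entered so that each one sums to its printed row total.

```python
# Rows are the actual classes a..e; columns are the predicted classes a..f.
labels = [
    "cocoon_apache_org_dev",
    "cocoon_apache_org_docs",
    "cocoon_apache_org_users",
    "commons_apache_org_dev",
    "commons_apache_org_user",
]
matrix = [
    [19044, 0, 12, 1069, 0, 0],
    [2066, 0, 1, 477, 0, 0],
    [16548, 0, 2370, 704, 0, 0],
    [58, 0, 0, 20109, 0, 0],
    [147, 0, 1, 4451, 0, 0],
]

total = sum(sum(row) for row in matrix)
# Correct predictions sit on the diagonal (actual class == predicted class).
correct = sum(matrix[i][i] for i in range(len(matrix)))
accuracy = correct / total

# Per-class recall: share of each actual class that was labelled correctly.
recall = {labels[i]: matrix[i][i] / sum(matrix[i]) for i in range(len(matrix))}

print(f"accuracy: {accuracy:.4%}")
for name, value in recall.items():
    print(f"{name}: recall {value:.4f}")
```

The per-class view shows where the aggregate figure comes from: two of the listed classes (`cocoon_apache_org_docs` and `commons_apache_org_user`) are never predicted at all, and most `cocoon_apache_org_users` messages are assigned to `cocoon_apache_org_dev`.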


srowen, over 13 years ago

Hey all, I'm one of the main devs of Mahout and saw this article and commentary. I think it's basically right. I'd like to add my own perspective.

I think Mahout has one key problem, and that's its purported scope. The committers' attitude for a long while, which I didn't like myself, was to ingest as many different algorithms as possible that had anything to do with large-scale machine learning.

The result is an impressive-looking array of algorithms. It creates a certain level of expectation about coverage. If there were no clustering algorithms, you wouldn't notice the lack of algorithm X or Y. But there are a few, so people complain it's not supporting what they're looking for.

But there's also large variation in quality. Some pieces of the project are quite literally a code dump from someone 2 years ago. Now, some is quite excellent. But because there's a certain level of interest and hype and usage, finding anything a bit stale or buggy leaves a negative impression.

I do think Mahout is much, much better than nothing, at least. There is really only one game in town for "mainstream" distributed ML. If it is only a source of good ideas, and a framework to build on, then it's added a lot of value.

I also think that some corners of the project are quite excellent. The recommender portions are more mature, as they predate Mahout and have more active support. Naive Bayes, by contrast, I don't think has been touched in a while.

And I can tell you that Mahout is certainly really used by real companies to do real work! I doubt it solves everyone's problems, but it sure solves some problems better than they'd have solved them from scratch.

What I strongly agree with here is that you're never likely to find an ML system that works well out-of-the-box. It's always a matter of tuning, customizing for your domain, preparing input properly, etc. If that's true, then something like Mahout is never going to be satisfying, because any one system is going to be suboptimal as-is for any given problem.

And for the specialist, no system, including Mahout, is ever going to look as smart or sophisticated as what you know and have done. There are infinite variations, specializations, optimizations possible for any algorithm.

So I do see a lot of feedback from smart people that, hmm, I don't think this is all that great, and it's valid. For example, I wrote the recommender bits (mostly) and I think the ML implemented there is quite basic. But you see there's somehow a lot of enthusiasm for it, if only because it's managed to roughly bring together, simplify, and make practical the basic ML that people here take for granted. That's good!

mark_l_watson, over 13 years ago

Another good article by Grant Ingersoll on Mahout. I used Mahout on a customer project last year, when it was not yet a complete machine learning system layered on Hadoop. Looking at Table 1 in this article, many of the previous gaps have been filled. BTW, the book Mahout in Action is a good guide, but the new MEAP released last week does not cover some of the new features, which is OK. Also, Grant has been working on "Taming Text" for a while, but a new MEAP has not been released in a year or two. I would bet that his energies have been focused on extending and using Mahout.

mahmud, over 13 years ago

I prefer Weka, mostly because it has excellent literature and academic leanings. It is unburdened by real-world issues of performance or scalability, so it can afford to focus on accuracy.


zgoldberg, over 13 years ago

The Google Prediction API (code.google.com/apis/predict) will help you get started with machine learning without the need to write any additional code (other than API calls)!

tel, over 13 years ago

Table 1 reminds me why, even if these algorithms are available, it's a big step to being able to understand and apply them. It's clear the author doesn't have a lot of familiarity with them.


reuser, over 13 years ago

That's cool and stuff, but why do I have to write Java?


Apache Mahout: Scalable machine learning for everyone
