For those interested in machine learning, the "self improvement" technique that the author talks about falls under <i>semi-supervised learning</i>, specifically <i>self-training</i>, which is apparently "still extensively used in the natural language processing community." <a href="http://en.wikipedia.org/wiki/Semi-supervised_learning" rel="nofollow">http://en.wikipedia.org/wiki/Semi-supervised_learning</a> <a href="http://pages.cs.wisc.edu/~jerryzhu/icml07tutorial.html" rel="nofollow">http://pages.cs.wisc.edu/~jerryzhu/icml07tutorial.html</a><p>Semi-supervised learning is a good idea in this type of situation, given that unlabeled samples are far more abundant than labeled samples, but there are gotchas to watch out for. In general, SSL helps when your model of the data is correct and hurts when it is not.<p>Here's an example of what can go wrong in this particular application: let's say the word 'better' is mildly positive, but when it appears in high-confidence samples, it's usually because it appears together with the words 'business' and 'bureau', as in "I just reported Company X to the Better Business Bureau", i.e., strongly negative. This means that the new self-training samples containing the word 'better' will all be negative, which will bias the corpus until eventually 'better' is treated as a strongly negative feature.<p>Occasional random human spot-checks of the high-confidence classifications would be useful :-) Also, self-training gives diminishing returns in accuracy, while the possibility for craziness remains, so turning it off after a while might be best.<p>A survey of semi-supervised learning:
<a href="http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf" rel="nofollow">http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf</a>