Self-Improving Bayesian Sentiment Analysis for Twitter

82 points · by spxdcz · over 14 years ago

3 comments

randomwalker · over 14 years ago
For those interested in machine learning, the "self improvement" technique that the author talks about falls under semi-supervised learning, specifically self-training, which is apparently "still extensively used in the natural language processing community."

http://en.wikipedia.org/wiki/Semi-supervised_learning
http://pages.cs.wisc.edu/~jerryzhu/icml07tutorial.html

Semi-supervised learning is a good idea in this type of situation, given that unlabeled samples are far more abundant than labeled samples, but there are gotchas to watch out for. In general, SSL helps when your model of the data is correct and hurts when it is not.

Here's an example of what can go wrong in this particular application: let's say the word 'better' is mildly positive, but when it appears in high-confidence samples, it's usually because it appears together with the words 'business' and 'bureau', as in "I just reported Company X to the Better Business Bureau", i.e., strongly negative. This means that the new self-training samples containing the word 'better' will all be negative, which will bias the corpus until eventually 'better' is treated as a strongly negative feature.

Occasional random human spot-checks of the high-confidence classifications would be useful :-) Also, self-training gives diminishing returns in accuracy, whereas the possibility for craziness remains, so turning it off after a while might be best.

A survey of semi-supervised learning: http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf
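To make the self-training loop described above concrete, here is a minimal sketch using scikit-learn's MultinomialNB. The seed data, confidence threshold, and round cap are illustrative assumptions, not the article's actual pipeline:

```python
# Minimal self-training sketch. All data, names, and thresholds here are
# illustrative assumptions, not the article's pipeline.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

labeled_texts = ["love this phone", "worst service ever"]  # tiny seed corpus
labeled_y = [1, 0]                                         # 1 = positive, 0 = negative
unlabeled = [
    "great battery life",
    "reported them to the better business bureau",
]

THRESHOLD = 0.95   # only promote very confident predictions to pseudo-labels
MAX_ROUNDS = 5     # accuracy gains diminish, so stop self-training early

vec = CountVectorizer()
for _ in range(MAX_ROUNDS):
    X = vec.fit_transform(labeled_texts)     # refit vocabulary each round
    clf = MultinomialNB().fit(X, labeled_y)
    if not unlabeled:
        break
    probs = clf.predict_proba(vec.transform(unlabeled))
    confident = probs.max(axis=1) >= THRESHOLD
    if not confident.any():
        break
    # This promotion step is where the 'better' / 'Better Business Bureau'
    # bias creeps in: if every confident sample containing 'better' happens
    # to be negative, the growing corpus skews that feature negative.
    for i in np.flatnonzero(confident):
        labeled_texts.append(unlabeled[i])
        labeled_y.append(int(clf.classes_[probs[i].argmax()]))
    unlabeled = [t for i, t in enumerate(unlabeled) if not confident[i]]
```

Capping the rounds and keeping the threshold high reflects the advice above: self-training's accuracy gains diminish while the risk of feedback bias does not.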
ThomPete · over 14 years ago
Stanford also has an entire semester of lectures on iTunes University:

http://itunes.apple.com/WebObjects/MZStore.woa/wa/viewiTunesUCollection?id=384233048
herdrick · over 14 years ago
Very cool. That seemed to work better than I'd have expected.

The explanation of the 'naive' part of Naive Bayes isn't quite right, though. Throwing out the possibility of the animal being human based on the datapoint "four legs" is orthogonal to naivety. A more sophisticated Naive Bayes system could still reject classifying an animal as human based on its having four legs, and conversely a non-naive system, i.e., one that uses joint probabilities of the features, might be no more likely to do so.

I guess the most rational way to handle this would be to express and calculate the conditional probabilities with some statistical distance, like the number of standard deviations. So 100k examples of humans without a single one having the "four legs" feature would make that a very strong indicator of being non-human. And that'd work just as well with a naive algorithm as any other.

Is there a name for this?
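A toy sketch of that zero-count point, under invented classes, features, and counts: with add-alpha (Laplace) smoothing, 100k human examples with zero 'four legs' observations contribute a very large negative log-likelihood, a strong indicator rather than a hard veto, and this works identically whether or not the model is naive.

```python
# Toy illustration of the conditional-independence ('naive') assumption and
# of zero-count features. Classes, features, and counts are invented here;
# add-alpha (Laplace) smoothing turns a never-observed feature into a very
# strong, but not absolute, indicator instead of an automatic rejection.
from math import log

counts = {
    "human": {"total": 100_000, "four_legs": 0,      "speaks": 95_000},
    "dog":   {"total": 100_000, "four_legs": 99_000, "speaks": 0},
}

def log_posterior(cls, features, alpha=1.0):
    """log P(cls) + sum over f of log P(f | cls), with each feature
    conditioned only on the class -- the 'naive' step."""
    n = counts[cls]["total"]
    grand_total = sum(c["total"] for c in counts.values())
    lp = log(n / grand_total)  # class prior
    for f in features:
        lp += log((counts[cls][f] + alpha) / (n + 2 * alpha))
    return lp

for cls in counts:
    print(cls, round(log_posterior(cls, ["four_legs", "speaks"]), 2))
# 'four_legs' never observed for humans across 100k examples contributes
# log(1/100002), about -11.5: strong evidence against 'human', not a veto.
```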