For those interested in machine learning, the "self improvement" technique that the author talks about falls under <i>semi-supervised learning</i>, specifically <i>self-training</i>, which is apparently "still extensively used in the natural language processing community." <a href="http://en.wikipedia.org/wiki/Semi-supervised_learning" rel="nofollow">http://en.wikipedia.org/wiki/Semi-supervised_learning</a> <a href="http://pages.cs.wisc.edu/~jerryzhu/icml07tutorial.html" rel="nofollow">http://pages.cs.wisc.edu/~jerryzhu/icml07tutorial.html</a><p>Semi-supervised learning is a good idea in this type of situation, given that unlabeled samples are far more abundant than labeled samples, but there are gotchas to watch out for. In general, SSL helps when your model of the data is correct and hurts when it is not.<p>Here's an example of what can go wrong in this particular application: let's say the word 'better' is mildly positive, but when it appears in high-confidence samples, it's usually because it appears together with the words 'business' and 'bureau', as in "I just reported Company X to the Better Business Bureau", i.e., strongly negative. This means that the new self-training samples containing the word 'better' will all be negative, which will bias the corpus until eventually 'better' is treated as a strongly negative feature.<p>Occasional random human spot-checks of the high-confidence classifications would be useful :-) Also, self-training gives diminishing returns in accuracy, while the possibility for craziness remains, so turning it off after a while might be best.<p>A survey of semi-supervised learning:
<a href="http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf" rel="nofollow">http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf</a>