Hello everyone,

What do you do to improve your training-set labels and get a better success rate? Are you using a framework, or cleaning them up manually from time to time?

We have been working on topic modeling, and managing everything manually is quite hectic, so we are looking for a better solution.

Thanks
From an engineering standpoint, the first question is where the bottlenecks are.

For instance, your feature set might cap your accuracy. Say you are interested (or uninterested) in posts about the Go programming language on HN and you are classifying based on the title. The token "Golang" predicts the topic accurately, but "Go" on its own does not. No matter how much you train, you will hit a ceiling unless you have beyond-bag-of-words features such as the bigrams "Go Development", "Go Implementation", and so on (see the n-gram sketch at the end of this comment).

Many NLP projects fail because people decide up front to throw away critical information that they can never get back. Going beyond BoW is not trivial, however: if you vastly increase the number of features, most will be poorly sampled and you won't learn anything from them.

Past feature engineering, there are very interesting questions in active learning that are not covered well in the academic literature, largely because active-learning experiments are not reproducible in a Kaggle-style competition (a sketch of one common approach, uncertainty sampling, is also below). There is also the human factor: you can destroy people psychologically by making them split hairs that don't matter. Realistically you can get 2,000 judgements a day out of a person if that is all they do; 200 is more likely from an expert who does other things.

Click on my profile link and send me an email and I can share what I know.
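To make the n-gram point concrete, here is a minimal sketch, assuming scikit-learn; the titles and labels are made-up examples:

    # Unigrams alone can't separate "Go" the language from "go" the verb;
    # adding bigrams lets the model learn phrases like "go development".
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    titles = [
        "Go Development at Google",     # about the Go language
        "Golang concurrency patterns",  # about the Go language
        "Go to the moon on a budget",   # not about the language
        "Why I go running every day",   # not about the language
    ]
    labels = [1, 1, 0, 0]

    # ngram_range=(1, 2) keeps unigrams and adds bigrams as features.
    clf = make_pipeline(
        CountVectorizer(ngram_range=(1, 2)),
        LogisticRegression(),
    )
    clf.fit(titles, labels)
    print(clf.predict(["Go implementation of Raft"]))

The catch mentioned above applies: bigrams blow up the feature count, and most of them will be rare. CountVectorizer's min_df parameter is one way to prune n-grams that occur too few times to learn from.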
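On the active-learning side, here is a minimal uncertainty-sampling sketch, again assuming scikit-learn; the pool of unlabeled titles and the 200-per-day batch size are illustrative, and sending the batch to an annotator is left as a placeholder:

    # Rank unlabeled titles by how unsure the current model is, and spend
    # the limited human-judgement budget on the most uncertain ones.
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    labeled = ["Golang concurrency patterns", "Go to the moon on a budget"]
    y = [1, 0]
    pool = [
        "Go implementation of Raft",
        "Let go of legacy code",
        "Go 1.x release notes",
    ]

    vec = CountVectorizer(ngram_range=(1, 2))
    X = vec.fit_transform(labeled + pool)
    X_labeled, X_pool = X[:len(labeled)], X[len(labeled):]

    clf = LogisticRegression().fit(X_labeled, y)

    # Margin from 0.5: small means the model can't decide.
    margin = np.abs(clf.predict_proba(X_pool)[:, 1] - 0.5)
    batch = np.argsort(margin)[:200]  # ~200 judgements/day from a busy expert

    for i in batch:
        print(pool[i])  # queue these for the human annotator first

In practice you retrain after each labeled batch and resample the pool; the point is to make the annotator's limited daily judgements count instead of labeling at random.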