TechEcho

2 comments

visarga, almost 8 years ago

I did something similar for another language. I first crawled millions of articles and built a word2vec model on the plain text. To compute the embedding of a topic, I summed the vectors of its main keywords; 3-4 well-chosen words are enough. The embedding of an article was obtained by summing (or averaging) the vectors of its words. I skipped stop words (also tried tf-idf) to reduce the noise. The final step was to compute the similarity score of an article with respect to a topic. This is extremely easy and fast: a dot product between the two vectors. Scores over 0.3 (or 0.5) indicate similarity. The main advantage of this method is that it only requires a topic vector, not a whole dataset of training examples. But if you do have such a dataset, you can average the most central keywords per topic to get topic vectors.

If you have hundreds of classes and a training dataset with about 500+ examples per class, you can also try fastText, Vowpal Wabbit or even Naive Bayes. If you want to use neural nets, there are some 1D CNNs floating around on GitHub, but they don't work all that well compared to simpler classifiers or a simple dot product between vectors. Having hundreds of classes usually makes classifiers sluggish, and accuracy is not so great compared to the binary case (spam/not spam). For example, I wouldn't use a classifier to predict the best subreddit for an article, because there are too many subreddits, but with vectors it's still OK.
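The topic-vector scoring described in the first paragraph of this comment can be sketched roughly as follows. This is only an illustration under stated assumptions, not the commenter's actual code: the three-dimensional `WORD_VECTORS` table, the `STOP_WORDS` set, and the `embed` and `similarity` helpers are all hypothetical stand-ins for a real word2vec model trained on crawled text. The sketch also assumes the vectors are length-normalized before the dot product, so that thresholds such as 0.3 or 0.5 are comparable across articles of different lengths.

```python
import math

# Toy 3-dimensional word vectors. These values are made up for illustration;
# in practice they would come from a word2vec model trained on crawled text.
WORD_VECTORS = {
    "python": [0.9, 0.1, 0.0],
    "compiler": [0.8, 0.2, 0.1],
    "code": [0.85, 0.15, 0.05],
    "football": [0.0, 0.9, 0.2],
    "goal": [0.1, 0.8, 0.3],
    "match": [0.05, 0.85, 0.25],
}

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "in", "of", "and", "is"}


def embed(words):
    """Average the vectors of known, non-stop words and L2-normalize the result.

    Returns None when no word has a vector.
    """
    vecs = [WORD_VECTORS[w] for w in words
            if w not in STOP_WORDS and w in WORD_VECTORS]
    if not vecs:
        return None
    dim = len(vecs[0])
    mean = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    if norm == 0:
        return None
    return [x / norm for x in mean]


def similarity(a, b):
    """Dot product; equals cosine similarity because inputs are normalized."""
    return sum(x * y for x, y in zip(a, b))


# A topic is described by 3-4 well-chosen keywords.
topic_programming = embed(["python", "compiler", "code"])
topic_sports = embed(["football", "goal", "match"])

# An article is embedded from its words, skipping stop words and unknown words.
article = embed("the compiler is written in python code".split())

score_programming = similarity(article, topic_programming)
score_sports = similarity(article, topic_sports)
```

With these toy vectors, the article scores far above the 0.3 threshold for the programming topic and below it for the sports topic. Adding a new topic only requires embedding a few more keywords; nothing has to be retrained.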

visarga, almost 8 years ago

Take a look here as well: https://hackernoon.com/the-unreasonable-ineffectiveness-of-deep-learning-in-nlu-e4b4ce3a0da0

It is exactly about news classification with deep learning.

Ask HN: Deep learning algorithms to aggregate technology topics from the web
