Ask HN: Deep learning algorithms to aggregate technology topics from the web

9 points by larryfreeman almost 8 years ago
A friend and I were talking about putting together a project that crowdsources cool technology topics and leverages the latest algorithms.

I'm assuming that we should use something like PyTorch and possibly leverage the great work done by Yann LeCun.

What deep learning algorithms are recommended for this project? If you can point to a technical paper or a library with sample code, that would be awesome.

2 comments

visarga almost 8 years ago
I did something similar for another language. I first crawled millions of articles and trained word2vec on the plain text. To compute the embedding of a topic, I summed the vectors of its main keywords - 3-4 well-chosen words are enough. The embedding of an article was obtained by summing (or averaging) the vectors of its words. I skipped the stop words (and also tried tf-idf weighting) to reduce the noise. The final step was to compute the similarity score between an article and a topic. This is extremely easy and fast - a dot product between the two vectors. Scores over 0.3 (or 0.5) indicate similarity. The main advantage of this method is that it only requires a topic vector, not a whole dataset of training examples. But if you do have such a dataset, you can average the most central keywords per topic to get the topic vectors.

If you have hundreds of classes and a training dataset with 500+ examples per class, you can also try fastText, Vowpal Wabbit or even Naive Bayes. If you want to use neural nets, there are some 1D CNNs floating around on GitHub, but they don't work all that well compared to simpler classifiers or a simple dot product between vectors. Hundreds of classes usually make classifiers sluggish, and accuracy is not so great compared to the binary case (spam/not spam). I wouldn't try a classifier to predict the best subreddit for an article, for example, because there are too many subreddits, but with vectors it's still OK.
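A rough sketch of the word-vector scoring described in the first paragraph of this comment, not the commenter's actual code. The tiny 4-dimensional `WORD_VECTORS` table, the stop-word list, and the function names `embed` and `topic_score` are all made up for illustration; real vectors would come from a word2vec model trained on the crawled text. The sketch also assumes the vectors are L2-normalized before the dot product (so the score is a cosine similarity), which is what makes thresholds like 0.3 or 0.5 meaningful; the comment does not say this explicitly.

```python
import numpy as np

DIM = 4  # toy dimensionality; real word2vec vectors are typically 100-300 dims

# Hypothetical word vectors, invented for this example only.
WORD_VECTORS = {
    "neural":   np.array([0.9, 0.1, 0.0, 0.1]),
    "network":  np.array([0.8, 0.2, 0.1, 0.0]),
    "training": np.array([0.7, 0.1, 0.2, 0.1]),
    "gpu":      np.array([0.6, 0.3, 0.0, 0.2]),
    "recipe":   np.array([0.0, 0.1, 0.9, 0.2]),
    "oven":     np.array([0.1, 0.0, 0.8, 0.3]),
    "bake":     np.array([0.0, 0.2, 0.9, 0.1]),
}
STOP_WORDS = {"the", "a", "on", "in", "of", "and", "for"}


def embed(words):
    """Average the vectors of known, non-stop words, then L2-normalize."""
    vecs = [WORD_VECTORS[w] for w in words
            if w not in STOP_WORDS and w in WORD_VECTORS]
    if not vecs:
        return np.zeros(DIM)  # nothing recognizable: score will be 0
    mean = np.mean(vecs, axis=0)
    norm = np.linalg.norm(mean)
    return mean / norm if norm > 0 else mean


def topic_score(article_text, topic_keywords):
    """Dot product of two unit vectors, i.e. cosine similarity."""
    article_vec = embed(article_text.lower().split())
    topic_vec = embed(topic_keywords)
    return float(np.dot(article_vec, topic_vec))


# A topic is defined by a handful of keywords, no labeled training set needed.
deep_learning_topic = ["neural", "network"]

on_topic = topic_score("Training a neural network on the GPU", deep_learning_topic)
off_topic = topic_score("Bake the recipe in the oven", deep_learning_topic)
print(f"on-topic score:  {on_topic:.2f}")
print(f"off-topic score: {off_topic:.2f}")
```

With real embeddings the threshold would need tuning on a few hand-checked articles; the 0.3 and 0.5 values in the comment are the commenter's own rules of thumb.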
visarga almost 8 years ago
Take a look here as well: https://hackernoon.com/the-unreasonable-ineffectiveness-of-deep-learning-in-nlu-e4b4ce3a0da0

It's exactly about news classification with DL.