Ask HN: Deep learning algorithms to aggregate technology topics from the web

9 points by larryfreeman, almost 8 years ago
A friend and I were talking about putting together a project that crowdsources cool technology topics that leverage the latest algorithms.

I'm assuming that we should use something like PyTorch and possibly leverage the great work done by Yann LeCun.

What deep learning algorithms are recommended for this project? If you can specify a technical paper or library with sample code, that would be awesome.

2 comments

visarga, almost 8 years ago
I did something similar for another language. I crawled millions of articles first and built word2vec on the plain text. To compute the embedding of a topic, I summed the vectors of its main keywords; 3-4 well-chosen words are enough. The embedding of an article was obtained by summing (or averaging) the vectors of its words. I skipped the stop words (also tried tf-idf) to reduce the noise. The final step was to compute the similarity score of an article relative to a topic. This is extremely easy and fast: a dot product between the vectors. Scores over 0.3 (or 0.5) indicate similarity. The main advantage of this method is that it only requires a topic vector, not a whole dataset of training examples. But if you have such a dataset, then you can average the most central keywords per topic and get topic vectors.

If you have hundreds of classes and a training dataset with about 500+ examples per class, you can also try fastText, Vowpal Wabbit or even Naive Bayes. If you want to use neural nets, there are some 1D CNNs floating around on GitHub, but they don't work all that well compared to simpler classifiers or a simple dot product between vectors. Hundreds of classes usually make classifiers sluggish, and accuracy is not so great compared to the binary case (spam/not spam). I wouldn't try to do that to predict the best subreddit for an article, for example, because there are too many subreddits, but with vectors it's still OK.
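
For anyone who wants to try the topic-vector approach described above, here is a minimal sketch in plain Python. It is not the commenter's code. The tiny `EMBEDDINGS` table, the stop-word list, and the helper names (`normalize`, `embed`, `similarity`) are all made up for illustration; a real setup would load word2vec vectors trained on the crawled corpus. It also assumes the summed vectors are L2-normalized before the dot product, which is what makes a fixed threshold like 0.3 or 0.5 meaningful.

```python
import math

# Hypothetical toy 3-dimensional "word vectors". The numbers are invented;
# in practice these would come from a word2vec model trained on your corpus.
EMBEDDINGS = {
    "neural":   [0.9, 0.1, 0.0],
    "network":  [0.8, 0.2, 0.1],
    "training": [0.7, 0.3, 0.0],
    "gpu":      [0.6, 0.1, 0.3],
    "football": [-0.1, 0.9, 0.1],
    "goal":     [0.0, 0.8, 0.2],
    "match":    [-0.1, 0.9, 0.0],
}
STOP_WORDS = {"the", "a", "of", "on", "in", "and", "is"}
THRESHOLD = 0.5  # the comment suggests 0.3 or 0.5


def normalize(vec):
    """Scale a vector to unit length (assumption: dot product == cosine)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def embed(words):
    """Sum the vectors of known, non-stop words, then normalize."""
    dim = len(next(iter(EMBEDDINGS.values())))
    total = [0.0] * dim
    for word in words:
        word = word.lower()
        if word in STOP_WORDS or word not in EMBEDDINGS:
            continue  # skip stop words and out-of-vocabulary words
        total = [t + x for t, x in zip(total, EMBEDDINGS[word])]
    return normalize(total)


def similarity(a, b):
    """Dot product of two (already normalized) vectors."""
    return sum(x * y for x, y in zip(a, b))


# A topic is defined by a handful of well-chosen keywords.
topic_dl = embed(["neural", "network", "training"])

article_ml = embed("Training a neural network on the GPU".split())
article_sport = embed("The goal of the football match".split())

score_ml = similarity(topic_dl, article_ml)
score_sport = similarity(topic_dl, article_sport)

print("ML article matches topic:", score_ml > THRESHOLD)
print("Sports article matches topic:", score_sport > THRESHOLD)
```

Swapping the sum for an average gives the same scores here, since the vectors are normalized afterwards. A tf-idf weight per word could be multiplied in inside `embed` if plain stop-word filtering turns out to be too noisy.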
visarga, almost 8 years ago
Take a look here as well: https://hackernoon.com/the-unreasonable-ineffectiveness-of-deep-learning-in-nlu-e4b4ce3a0da0

It's exactly about news classification with DL.