
High-reproducibility and high-accuracy method for automated topic classification

142 points | by jestinjoy1 | over 10 years ago

8 comments

matt4077 | over 10 years ago
I had some sort of violent dopamine release just reading the headline.

I'm working on a project to make (EU-) law more accessible. So if anybody here knows good methods to visualise/summarise long legal texts (30-300 pages) you could do something for humanity by posting a reply.

(Word clouds just don't cut it in these cases.)
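[Editor's note: one of the simplest baselines for this kind of request is frequency-based extractive summarization, i.e. keep the sentences whose words are most common in the document. The sketch below is a toy illustration with a made-up four-sentence `doc`; the naive regex sentence splitter and scoring are assumptions, not a tool for 300-page legal texts.]

```python
import re
from collections import Counter

def summarize(text, n_sentences=2):
    """Naive extractive summary: score each sentence by the average
    document-wide frequency of its words, keep the top-n in original order."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'[a-z]+', text.lower()))
    def score(s):
        toks = re.findall(r'[a-z]+', s.lower())
        return sum(freq[t] for t in toks) / (len(toks) or 1)
    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Preserve the original sentence order in the summary.
    return [s for s in sentences if s in top]

doc = ("The regulation defines data processing obligations. "
       "Controllers must document processing activities. "
       "Cats are nice. "
       "Processing of personal data requires a lawful basis.")
print(summarize(doc, 2))
```

The off-topic sentence scores lowest because its words occur nowhere else, so it drops out of the summary; graph-based rankers like TextRank follow the same extractive idea with a better scoring function.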
Comment #8977744 not loaded
Comment #8978911 not loaded
Comment #8979540 not loaded
Comment #8977263 not loaded
Comment #8977194 not loaded
Comment #8977426 not loaded
Comment #8977537 not loaded
abeppu | over 10 years ago
I find it interesting that this appears to be written by a group of physicists rather than NLP or ML researchers, and I think you can kind of see that in the way they approach the problem. I think a bunch of the work done after LDA among ML and NLP people tended towards (a) using Hierarchical Dirichlet Process models as a platform from which to explore Bayesian nonparametrics more generally (b) better inference algorithms for topic models and (c) somewhat richer models (i.e. author topic models, syntax aware topic models, etc).

And it's not like the people in this field haven't been aware of network-oriented methods. But rather than using community-detection as a mechanism for topic discovery, instead people either focused on networks among topics to see how topics are related, networks among authors such that social network information informed topic discovery, or networks among documents where link/reference information was explicitly part of the model.

These authors seem to get solid results in part by having totally different values/aesthetics. Unlike the Bayesian nonparametrics people, they clearly don't care about picking arbitrary, inflexible parameters (e.g. the 5% threshold), nor do they want their model to have a clear, generative form, nor are they particularly concerned about having a new algorithmic insight (since they throw their hard work to InfoMap, and discuss none of its details), nor do they attempt to advance the expressiveness of their topic model (they proceed with the most basic bag-of-words model available). But it does seem like they get good results on the basic task with a very pragmatic, pipeline approach.
Comment #8978102 not loaded
jetsnguns | over 10 years ago
It was interesting to see a take on the problem from the researchers outside of NLP or ML fields, but the authors only considered classic LDA and PLSA for comparison. I am not currently involved in topic modeling, but I know there exist techniques and modifications to classic models that improve topic discovery (like tf-idf weighting). Can you suggest any modern methods from NLP and ML communities that address the same issues and can rival the authors' findings?
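[Editor's note: for readers unfamiliar with the tf-idf weighting mentioned above, here is a minimal stdlib-only sketch. The three toy documents are made up; real pipelines would add smoothing and normalization variants.]

```python
import math
from collections import Counter

docs = [
    "topic models find topics in text".split(),
    "graph clustering finds communities in networks".split(),
    "topic discovery with graphs and text".split(),
]

def tfidf(docs):
    """Plain tf-idf: term frequency times inverse document frequency.
    Words that appear in most documents get down-weighted."""
    n = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for d in docs for w in set(d))
    weights = []
    for d in docs:
        tf = Counter(d)
        weights.append({w: (c / len(d)) * math.log(n / df[w])
                        for w, c in tf.items()})
    return weights

w = tfidf(docs)
```

Here the near-stopword "in" (present in two of three documents) ends up with a lower weight than the discriminative "models" (present in one), which is exactly the effect that helps topic models focus on content words.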
helderts | over 10 years ago
Modeling a word co-occurrence graph and then pruning "weak" edges (or achieving similar pruning by using community detection to find clusters) works kind of like a "feature selection" based on something that resembles a bare mutual information or tf*idf.

I'm not entirely familiar with LDA, but from what I was able to understand from their intro, it feels like their LDA application could have used some feature selection.
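[Editor's note: the co-occurrence-graph-plus-pruning idea above can be sketched in a few lines of stdlib Python. The four toy documents and the edge threshold of 2 are illustrative assumptions, and connected components stand in here for a real community-detection algorithm like the paper's InfoMap.]

```python
from collections import Counter
from itertools import combinations

docs = [
    ["gene", "dna", "protein"],
    ["dna", "protein", "cell"],
    ["stock", "market", "price"],
    ["market", "price", "trade"],
]

# Count how often each word pair co-occurs within a document.
edges = Counter()
for d in docs:
    for a, b in combinations(sorted(set(d)), 2):
        edges[(a, b)] += 1

# Prune "weak" edges: keep only pairs seen in at least 2 documents.
strong = {e for e, c in edges.items() if c >= 2}

def components(edges):
    """Connected components of the pruned graph, a crude stand-in
    for proper community detection."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, comps = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                comp.add(v)
                stack.extend(adj[v])
        comps.append(comp)
    return comps

topics = components(strong)
```

On this toy corpus the pruning discards cross-domain noise and leaves two clusters, {dna, protein} and {market, price}, which is the "feature selection" effect described above.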
avyfain | over 10 years ago
You can see the source code of a previous iteration of the algorithm here: https://bitbucket.org/andrealanci/topicmapping/src
b0b0b0b | over 10 years ago
I'm confused by the discussion of multi-lingual corpora. Is it common in topic modeling to consider documents drawn from disjoint vocabularies, or is it just a kind of thought experiment?
Comment #8977618 not loaded
b6 | over 10 years ago
I haven't dug into the details of the paper yet, but I want to commend the authors for 1.) making it possible to actually download the PDF and 2.) giving some indication, within the actual document, when the paper was published. I'm being a little bit snarky, but I'm very sincere in thanking them.
Comment #8978261 not loaded
Comment #8977565 not loaded
Comment #8978736 not loaded
Comment #8977293 not loaded
curiously | over 10 years ago
is there an open source implementation I can use?

What about that sentiment analysis NLP tool that someone posted on HN last year? That was also very good.