What is TF-IDF?

63 points by helium, more than 9 years ago

7 comments

gibrown, more than 9 years ago
Lucene is moving away from TF-IDF to BM25 as the default. It is a pretty similar idea, but tends to perform better with short content.

https://issues.apache.org/jira/browse/LUCENE-6789

https://en.wikipedia.org/wiki/Okapi_BM25

In the very limited test cases where I've compared them it hasn't mattered much, but others' results are pretty compelling.

https://www.elastic.co/blog/found-bm-vs-lucene-default-similarity
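
For readers who want to see how BM25 differs from plain TF-IDF, here is a minimal sketch of the Okapi BM25 scoring formula as given on the Wikipedia page linked above. The function name `bm25_scores`, the toy documents, and the parameter values `k1=1.2` and `b=0.75` are assumptions chosen for illustration, and the IDF uses the non-negative `log(... + 1)` variant. It is not Lucene's implementation.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query terms with Okapi BM25.

    k1 controls term-frequency saturation, b controls length normalization.
    Both defaults are illustrative assumptions.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Non-negative IDF variant: rarer terms get larger weights.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Documents longer than average are penalized, shorter ones boosted.
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

# Hypothetical toy corpus: a short and a long document both mention "lucene" once.
docs = [
    "lucene search".split(),
    "lucene is a search library written in java for full text search".split(),
    "a recipe for tomato soup".split(),
]
scores = bm25_scores(["lucene"], docs)
```

In this toy corpus the short document outranks the long one even though both contain the query term exactly once, because the length normalization term is smaller for the short document.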
rohwer, more than 9 years ago
Translate IDF to "how uncommon is this word in the corpus?"

TF-IDF is acronym soup, but mathematically simple: IDF is a scalar applied to a term's frequency. And in the comparison (cosine similarity), the numerator is the document overlap score (the dot product of the two weight vectors) and the denominator is the product of the two documents' vector lengths, each being the square root of the sum of its squared weights. For more, Stanford's natural language processing course is the bee's knees: https://class.coursera.org/nlp/lecture/preview
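
The weighting and comparison described in this comment can be sketched in a few lines. This is a minimal illustration, assuming raw term counts for TF, `log(N / df)` for IDF, and cosine similarity for the comparison. The function names and the toy corpus are hypothetical, and real systems usually add smoothing and normalization on top.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # IDF answers "how uncommon is this word in the corpus?"
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    # TF-IDF: the IDF scalar applied to each term's raw frequency.
    return [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in docs
    ]

def cosine(a, b):
    """Overlap score divided by the product of the two vector lengths."""
    overlap = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return overlap / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy corpus, tokenized by whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "stock markets fell sharply".split(),
]
vecs = tfidf_vectors(docs)
similar = cosine(vecs[0], vecs[1])    # documents sharing "the", "sat", "on"
unrelated = cosine(vecs[0], vecs[2])  # documents with no shared terms
```

Note that with this plain IDF a term appearing in every document gets weight zero, which is one reason practical implementations smooth the formula.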
nathell, more than 9 years ago
TF-IDF solves an important problem and it's good to know about.

However, in some applications, such as Latent Semantic Analysis (LSA) and its generalizations, there are practical alternatives such as log-entropy [1] that I've found to work better in practice.

[1]: http://link.springer.com/article/10.3758%2FBF03203370#page-1
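
As a rough illustration of log-entropy weighting, the sketch below uses one commonly cited formulation: a local weight of `log(1 + tf)` multiplied by a global weight of `1 + sum(p * log p) / log(n)`, where `p` is the share of a term's total occurrences that falls in each document and `n` is the number of documents. Whether this matches the exact variant in the linked paper is an assumption, and the function name and toy corpus are hypothetical.

```python
import math
from collections import Counter

def log_entropy_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Requires at least two documents, since the global weight divides by log(n).
    """
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    # Global frequency: total occurrences of each term across the corpus.
    global_freq = Counter()
    for c in counts:
        global_freq.update(c)
    global_weight = {}
    for term, total in global_freq.items():
        # Sum of p * log(p) over the documents that contain the term.
        entropy = sum(
            (c[term] / total) * math.log(c[term] / total)
            for c in counts
            if term in c
        )
        # 1.0 for a term confined to one document, 0.0 for a term spread evenly.
        global_weight[term] = 1 + entropy / math.log(n_docs)
    return [
        {term: math.log(1 + tf) * global_weight[term] for term, tf in c.items()}
        for c in counts
    ]

# Hypothetical toy corpus: "the" is spread evenly, the animal names are unique.
docs = [["the", "cat"], ["the", "dog"], ["the", "fish"]]
vecs = log_entropy_vectors(docs)
```

Like IDF, the global weight pushes evenly distributed terms toward zero, but it looks at how occurrences are distributed across documents and not only at how many documents contain the term.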
rhema, more than 9 years ago
Here's an interesting demo I made where you can type or paste in words to get a sense of their IDF (http://tpoem.com/test/dict/test_dictionary.html).
meeper16, more than 9 years ago
It's also used in AI-based document summarization systems that are worth millions, e.g.:

Yahoo Paid $30 Million in Cash for 18 Months of Young Summly http://allthingsd.com/20130325/yahoo-paid-30-million-in-cash-for-18-months-of-young-summly-entrepreneurs-time/

Google Buys Wavii For North Of $30 Million http://techcrunch.com/2013/04/23/google-buys-wavii-for-north-of-30-million/
wyldfire, more than 9 years ago
Is this similar to the concept used by Amazon's "statistically improbable phrases" (word-based instead of n-gram based)?

EDIT: according to SO, yes: http://stackoverflow.com/a/2009546/489590
languagehacker, more than 9 years ago
Nice job explaining a fundamental IR algorithm.