Ask HN: Tf-idf search index vs. brute-force Doc2vec and query comparison

5 points by pacavaca, over 7 years ago

Suppose on one hand we have a classic search index with a roughly tf-idf based scoring algorithm (a well-configured Elasticsearch, let's say). On the other hand, we have a list of document vectors, one generated for every document using some sort of modern, semantics-aware Doc2vec algorithm (à la word2vec). Now suppose we ignore the speed concern completely and, in the second case, just iterate over every document, calculate its distance to a query vector, and pick the N closest as the search result. Is it commonly accepted that these results will definitely be more relevant to the query than those obtained from a regular search engine? Or will the improvement be marginal, or nonexistent? Can anyone point me to an existing experiment with some real numbers available?

Also, am I right that the second approach should match "health insurance" to "employee benefits" and "SF taxi" to "California transportation" more or less out of the box, assuming the "Doc2vec" model is well trained and produces rather large vectors for every document (or let's even assume we only work with document titles and hence rely on "Sentence2vec")?

I would be really grateful if someone could shed some light on this area of information retrieval for me. Thanks!
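For concreteness, the brute-force approach described in the question could look roughly like the sketch below. It is only an illustration under stated assumptions: the three-dimensional vectors are hand-made toy values rather than the output of any real Doc2vec model, cosine similarity is assumed as the distance measure (the question does not name one), and the names `cosine_similarity` and `brute_force_search` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_search(query_vec, doc_vecs, n):
    """Score every document against the query and return the ids of the n most similar."""
    scored = [
        (cosine_similarity(query_vec, vec), doc_id)
        for doc_id, vec in doc_vecs.items()
    ]
    # Highest similarity first; no index structure, just a full scan and a sort.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:n]]


# Toy vectors (assumption): a real system would get these from a trained model.
doc_vecs = {
    "employee benefits": [0.9, 0.1, 0.0],
    "California transportation": [0.0, 0.2, 0.9],
    "cooking recipes": [0.1, 0.9, 0.1],
}
query_vec = [0.8, 0.2, 0.1]  # stand-in for the vector of "health insurance"

top = brute_force_search(query_vec, doc_vecs, 2)
print(top)
```

Whether "health insurance" really lands near "employee benefits" depends entirely on the model that produces the vectors; the scan itself only ranks whatever geometry it is given.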

3 comments

PaulHoule, over 7 years ago
With some kind of "doc2vec" you can get improved results for "more like this" queries, where the user supplies a document and the system finds more.

This leads to "relevance feedback" that really works.

I worked with a "doc2vec" system for patent search; it did a great job in the scenario where somebody writes a paragraph describing an invention. In the case of a short query we fell back on something closer to tf*idf.

Click on my HN profile link and I can tell you more.
lovefromatx, over 7 years ago
word2vec is useful when you want to build a machine-learning-based system. It lets you get away with a really small matrix [number of documents, ~25-1000], which is what makes ML feasible. Another advantage is that it preserves context: the vectors for "car" and "vehicle" are closely aligned.

The problem with implementing a vector-based search engine is that your recall is going to be really high: you will potentially get a lot of marginally related results for your query.

My recommendation would be to implement a tf-idf based system. You could also enhance your queries by enriching them with synonyms. You could find synonyms by using something like LDA: get a topic model and add the words from the relevant topic to the query.
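A minimal sketch of the tf-idf-plus-synonyms idea might look like this. Everything here is an assumption made for illustration: the scoring is a bare term-frequency times smoothed-idf sum rather than what any particular search engine computes, the `synonyms` table is written by hand instead of being derived from an LDA topic model, and the function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (deliberately naive)."""
    return text.lower().split()


def build_index(docs):
    """Return per-document term counts and an idf value for every term."""
    doc_terms = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for counts in doc_terms.values():
        doc_freq.update(counts.keys())
    n_docs = len(docs)
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    return doc_terms, idf


def expand_query(terms, synonyms):
    """Append the synonyms of each query term, skipping duplicates."""
    expanded = list(terms)
    for term in terms:
        for syn in synonyms.get(term, []):
            if syn not in expanded:
                expanded.append(syn)
    return expanded


def search(query, doc_terms, idf, synonyms, n):
    """Rank documents by summed tf * idf over the expanded query terms."""
    terms = expand_query(tokenize(query), synonyms)
    scores = {}
    for doc_id, counts in doc_terms.items():
        score = sum(counts[t] * idf.get(t, 0.0) for t in terms)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)[:n]


docs = {
    "d1": "vehicle maintenance guide",
    "d2": "car rental prices",
    "d3": "health insurance plans",
}
doc_terms, idf = build_index(docs)

# Hand-written synonym table (assumption); the comment suggests mining it from a topic model.
synonyms = {"car": ["vehicle"]}

plain = search("car", doc_terms, idf, {}, 5)
expanded = search("car", doc_terms, idf, synonyms, 5)
print(plain, expanded)
```

Without expansion the query "car" only reaches the document that literally contains it; with the synonym table it also reaches the "vehicle" document, while the unrelated document is still excluded.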
pacavaca, over 7 years ago
Thank you!