Ask HN: Tf-idf search index vs. brute-force Doc2vec and query comparison

5 points by pacavaca over 7 years ago
Suppose that on one hand we have a classic search index with a roughly tf-idf-based scoring algorithm (a well-configured Elasticsearch, let's say). On the other hand, we have a list of document vectors, one per document, generated by some sort of modern, semantic-aware Doc2vec algorithm (à la word2vec). Now, if we completely ignore speed and, in the second case, just iterate over every document, calculate its distance to a query vector, and pick the N closest as the search result: is it commonly accepted that these results will definitely be more relevant to the query than those obtained from a regular search engine? Or will the improvement be just marginal, or nonexistent? Can anyone point me to an existing experiment with some real numbers available?

Also, am I right that the second approach should match "health insurance" to "employee benefits" and "SF taxi" to "California transportation" more or less out of the box, assuming that the Doc2vec model is well trained and produces rather large vectors for every document (or let's even assume we only work with document titles and hence rely on "Sentence2vec")?

I would be really grateful if someone could shed some light on this area of information retrieval for me. Thanks!
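For concreteness, the brute-force approach described in the question could look roughly like the sketch below. It assumes the document and query vectors already exist (the toy three-dimensional vectors and the names `cosine_similarity` and `brute_force_search` are made up for illustration, not output from a real Doc2vec model) and uses cosine similarity as the distance measure, which is one common choice among several.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_search(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    n: int,
) -> List[Tuple[str, float]]:
    """Score every document against the query and return the n most similar."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Hypothetical toy vectors standing in for real Doc2vec output.
toy_doc_vecs = {
    "health insurance": [0.9, 0.1, 0.0],
    "employee benefits": [0.8, 0.3, 0.1],
    "SF taxi": [0.0, 0.2, 0.9],
}
top = brute_force_search([1.0, 0.2, 0.0], toy_doc_vecs, n=2)
top_ids = [doc_id for doc_id, _ in top]
```

Whether the ranking this produces is actually more relevant than a tf-idf ranking depends entirely on the quality of the vectors; the sketch only shows the mechanics of the linear scan.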

3 comments

PaulHoule over 7 years ago
With some kind of "doc2vec" you can get improved results for "more like this" queries, where the user supplies a document and the system finds more like it.

This leads to "relevance feedback" that really works.

I worked with a "doc2vec" system for patent search; it did a great job in the scenario where somebody writes a paragraph describing an invention. For short queries we fell back on something closer to tf*idf.

Click on my HN profile link and I can tell you more.
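The fallback described above (vectors for paragraph-length queries, something closer to tf*idf for short ones) might be wired up along these lines. This is only a sketch: the comment does not say how the patent system decided between the two, so the token-count rule, the threshold of 20, and the names `make_hybrid_search`, `vector_search` and `keyword_search` are all assumptions.

```python
from typing import Callable, List

# A searcher takes a query string and returns a ranked list of document ids.
Searcher = Callable[[str], List[str]]


def make_hybrid_search(
    vector_search: Searcher,
    keyword_search: Searcher,
    min_tokens_for_vectors: int = 20,  # assumed cutoff, tune on real queries
) -> Searcher:
    """Route long queries to the vector searcher and short ones to keywords."""

    def search(query: str) -> List[str]:
        if len(query.split()) >= min_tokens_for_vectors:
            return vector_search(query)
        return keyword_search(query)

    return search


# Stub searchers so the routing can be exercised without a real index.
hybrid = make_hybrid_search(
    vector_search=lambda q: ["vector-hit"],
    keyword_search=lambda q: ["keyword-hit"],
    min_tokens_for_vectors=5,
)
short_result = hybrid("SF taxi")
long_result = hybrid("a device that attaches to a bicycle frame and charges a phone")
```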
lovefromatx over 7 years ago
word2vec is useful when one wants to build a machine-learning-based system. It allows you to get away with a really small matrix [number of documents, ~25-1000]. This really makes ML feasible. Another advantage is preserving context: the vectors for "car" and "vehicle" are closely aligned.

The problem with implementing a vector-based search engine is that your recall is going to be really high, at the expense of precision: you will potentially get a lot of marginally related results for your query.

My recommendation would be to implement a tf-idf-based system. You could also enhance your queries by enriching them with synonyms. You could find synonyms by using something like LDA: build a topic model and add the words from the relevant topic to the query.
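A minimal sketch of that recommendation, tf-idf scoring plus query expansion, is below. The scoring is a simplified tf × idf sum, not what Elasticsearch or any particular engine computes, and the `related_terms` table is a hand-written stand-in for words that a topic model such as LDA might group together; no LDA is actually run here. All function names and sample documents are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def build_idf(docs: Dict[str, str]) -> Dict[str, float]:
    """Smoothed inverse document frequency for every term in the corpus."""
    doc_freq: Counter = Counter()
    for text in docs.values():
        doc_freq.update(set(tokenize(text)))
    n_docs = len(docs)
    return {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}


def expand_query(query: str, related_terms: Dict[str, List[str]]) -> List[str]:
    """Append related terms (e.g. from a topic model) to the query terms."""
    terms = tokenize(query)
    expanded = list(terms)
    for term in terms:
        for related in related_terms.get(term, []):
            if related not in expanded:
                expanded.append(related)
    return expanded


def tfidf_search(
    query_terms: List[str], docs: Dict[str, str], idf: Dict[str, float]
) -> List[str]:
    """Rank documents by the summed tf * idf of the query terms they contain."""
    scores: Dict[str, float] = {}
    for doc_id, text in docs.items():
        tf = Counter(tokenize(text))
        score = sum(tf[t] * idf.get(t, 0.0) for t in query_terms)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


docs = {
    "d1": "cheap car rental",
    "d2": "vehicle hire in town",
    "d3": "health insurance plans",
}
related_terms = {"car": ["vehicle"]}  # hypothetical topic-model output
idf = build_idf(docs)

plain_hits = tfidf_search(tokenize("car"), docs, idf)
expanded_hits = tfidf_search(expand_query("car", related_terms), docs, idf)
```

With the plain query only the document containing "car" matches; after expansion the "vehicle" document is retrieved as well, which is the effect the synonym enrichment is meant to have.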
pacavaca over 7 years ago
Thank you!