Suppose that on one hand we have a classic search index with a tf-idf-based scoring algorithm (a well-configured Elasticsearch, say), and on the other hand we have a vector generated for every document using some modern, semantics-aware Doc2vec algorithm (à la word2vec).
Now, if we completely set aside the speed concern and, in the second case, simply iterate over every document, compute its distance to the query vector, and pick the N closest as the search result: is it a given that these results will be more relevant to the query than those from a regular search engine? Or will the improvement be only marginal, or no improvement at all?
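The brute-force step described above (ignoring speed, rank every document by distance to the query vector) can be sketched in a few lines of NumPy. The embeddings here are hypothetical toy vectors standing in for whatever a trained Doc2vec model would produce:

```python
import numpy as np

def top_n_closest(doc_vectors, query_vector, n=3):
    """Brute-force nearest-neighbor search by cosine similarity.

    doc_vectors:  (num_docs, dim) array of document embeddings
    query_vector: (dim,) embedding of the query
    Returns indices of the n most similar documents, best first.
    """
    # Normalize rows so a dot product equals cosine similarity.
    docs = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    q = query_vector / np.linalg.norm(query_vector)
    sims = docs @ q                    # one similarity score per document
    return np.argsort(-sims)[:n]      # highest similarity first

# Toy example with made-up 4-dimensional embeddings.
docs = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
])
query = np.array([1.0, 0.05, 0.0, 0.0])
print(top_n_closest(docs, query, n=2))  # → [0 1]
```

This is O(num_docs · dim) per query, which is exactly why real systems use approximate nearest-neighbor indexes; but for the relevance comparison the question asks about, the brute-force version gives the exact answer.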
Can anyone point me to an existing experiment with some real numbers available?

Also, am I right that the second approach should match "health insurance" to "employee benefits" and "SF taxi" to "California transportation" more or less out of the box, assuming the Doc2vec model is well trained and produces reasonably large vectors for every document (or let's even assume we only work with document titles and hence rely on "Sentence2vec")?

I would be really grateful if someone could shed some light on this area of information retrieval for me. Thanks!
With some kind of "doc2vec" you can get improved results for "more like this" queries, where the user supplies a document and the system finds more like it.

This leads to "relevance feedback" that really works.

I worked on a "doc2vec" system for patent search; it did a great job in the scenario where somebody writes a paragraph describing an invention. For short queries we fell back on something closer to tf*idf.

Click on my HN profile link and I can tell you more.
word2vec is useful when you want to build a machine-learning-based system. It lets you get away with a really small matrix [number of documents, ~25-1000 dimensions], which makes ML feasible. Another advantage is that it preserves context: the vectors for "car" and "vehicle" are closely aligned.

The problem with implementing a vector-based search engine is that your recall is going to be really high: you will potentially get a lot of marginally related results for your query.

My recommendation would be to implement a tf-idf-based system. You could enhance your queries by enriching them with synonyms as well. You could find synonyms using something like LDA: get a topic model and add the words from that topic to the query.
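A minimal sketch of that query-enrichment idea, with a hand-written synonym map standing in for words drawn from an LDA topic model (the map and function name are illustrative, not from any particular library):

```python
# Hypothetical synonym table; in the suggestion above these words
# would come from the LDA topic(s) matching the query terms.
SYNONYMS = {
    "car": ["vehicle", "automobile"],
    "insurance": ["coverage"],
}

def enrich_query(query):
    """Append related words to each query term before handing the
    expanded query to an ordinary tf-idf engine."""
    terms = query.lower().split()
    expanded = list(terms)
    for t in terms:
        expanded.extend(SYNONYMS.get(t, []))
    return " ".join(expanded)

print(enrich_query("car insurance"))
# → car insurance vehicle automobile coverage
```

The tf-idf engine itself stays unchanged; only the query string grows, which is what makes this approach cheap to bolt onto an existing Elasticsearch setup.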