I'd still reject it (speaking as someone who developed products based on word vectors, document vectors, dimensional reduction, etc. before y'all thought it was cool...)

I quit a job because they were insisting on using Word2Vec in an application where it would have doomed the project to failure. The basic problem is that in a real-life application many of the most important words are *not in the dictionary*, and if you throw out words that are not in the dictionary you *choose* to fail.

Let a junk paper like that through and the real danger is that you will get thousands of other junk papers following it up.

For instance, take a look at the illustrations on this page

https://nlp.stanford.edu/projects/glove/

particularly under "2. Linear Substructures". They make it look like a miracle that they project down from a 50-dimensional space to 2 and get a nice pattern of cities and zip codes, for instance. The thing is, you could take a random set of 20 points in a 50-d space and, assuming no degeneracies, map them to any 20 points you want in 2-d space with an appropriately chosen projection matrix (there's a sketch of this at the end). Show me a graph like that with 200 points and I might be impressed. (I'd say those graphs on that server damage the Stanford brand for me about as much as SBF and Marc Tessier-Lavigne.)

(It's a constant theme in the dimensional reduction literature that people forget random matrices often work pretty well and fail to consider how much gain they are actually getting over a random matrix...)

BERT, FastText and the like were revolutionary for a few reasons, but I saw the use of subword tokens as absolutely critical because... for once, you could capture a medical note and not *erase the patient's name!*

The various conventions of computer science literature prevented explorations that would have put Word2Vec in its place. For instance, it's an obvious idea that you should be able to make a classifier that, given a word vector, can predict "is this a color word?" or "is this a verb?", but if you actually try it, it fails in a particularly maddening way (rough sketch below). With a tiny training/eval set (say 10 words) you might convince yourself it is working, but the more data you train on, the more you realize the words are scattered mostly randomly, and even though those "linear structures" exist in a statistical sense, they aren't well defined and aren't particularly useful. It's the kind of thing that is so weird, inconclusive and fuzzy that I'm not aware of anyone writing a paper about it... because you're not going to draw any conclusions out of it except that you found a Jupiter-sized hairball.

For all the excitement people had over Word2Vec, you didn't see an explosion of interest in vector search engines because... Word2Vec sucked: applying it to documents didn't improve the search engine very much. Some of it is that adding sensitivity to synonyms can hurt performance, because many possible synonyms turn out to be red herrings. BERT, on the other hand, is context sensitive and can, to some extent, tell the difference between "my pet jaguar" and "the jaguar dealership in your town", and that really does help find the relevant documents and hide the irrelevant ones.
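
To make the "not in the dictionary" point concrete, here's a rough sketch (the clinical sentence, the stand-in vocabulary, and the choice of bert-base-uncased are purely illustrative, not my original setup): a fixed-vocabulary lookup silently drops anything it has never seen, while a subword tokenizer breaks unseen words into pieces instead of erasing them.

```python
# Sketch: fixed-vocabulary lookup vs. subword tokenization on an out-of-vocabulary-heavy note.
# (Sentence, stand-in vocabulary, and model name are illustrative assumptions.)
from transformers import AutoTokenizer

note = "Pt. Kowalczyk prescribed 10mg rivaroxaban for afib"

# Word2Vec-style: anything outside the fixed vocabulary simply disappears.
vocab = {"prescribed", "for", "10mg"}          # stand-in for a pretrained vocabulary
kept = [w for w in note.lower().split() if w in vocab]
print(kept)                                    # the patient's name and the drug are gone

# Subword tokenization: unseen words come out as WordPiece fragments, nothing is erased.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.tokenize(note))
```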
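
On the "any 20 points can be made to look like anything" claim, here's a minimal numpy sketch (the circle layout is an arbitrary target I made up): given 20 random, linearly independent points in 50 dimensions, you can solve for a 50-to-2 linear projection that lands them exactly on whatever 2-d pattern you like.

```python
import numpy as np

rng = np.random.default_rng(0)

# 20 random points in 50-d space (almost surely linearly independent).
X = rng.standard_normal((20, 50))

# Any 2-d pattern we want them to land on, e.g. evenly spaced points on a circle.
angles = np.linspace(0, 2 * np.pi, 20, endpoint=False)
Y = np.column_stack([np.cos(angles), np.sin(angles)])

# Solve for a 50x2 projection matrix W with X @ W == Y.
# With 20 independent points and 50 dimensions the system is exactly solvable.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.allclose(X @ W, Y))  # True: the "structure" was put there by the projection
```

With 200 points the system is overdetermined and you can no longer force an arbitrary layout, which is why that's the version of the plot I'd want to see.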
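
And on the "how much do you gain over a random matrix" point, a toy baseline comparison (the dataset, reducer, and classifier choices are just for illustration): reduce to the same number of dimensions with a learned method and with a random Gaussian projection, then score both on the downstream task before giving the learned method any credit.

```python
# Sketch: always compare a learned reduction against a random projection baseline.
# (Dataset and model choices here are illustrative assumptions.)
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.random_projection import GaussianRandomProjection
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)   # 64-d inputs

for name, reducer in [("PCA", PCA(n_components=16)),
                      ("random", GaussianRandomProjection(n_components=16, random_state=0))]:
    Z = reducer.fit_transform(X)
    acc = cross_val_score(KNeighborsClassifier(), Z, y, cv=5).mean()
    print(name, round(acc, 3))        # the gap between these is the real gain
```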
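
The color-word probe I'm describing is roughly this (the pretrained GloVe model from gensim's downloader and the tiny word lists are illustrative stand-ins, not my original setup): fit a linear classifier on word vectors with binary labels and cross-validate. On a toy list it can look fine; the maddening part only shows up as the word list grows.

```python
# Sketch of a "is this a color word?" probe over static word vectors.
# (Model name and word lists are illustrative assumptions.)
import gensim.downloader as api
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

vecs = api.load("glove-wiki-gigaword-50")   # small pretrained GloVe vectors

colors = ["red", "green", "blue", "yellow", "purple", "orange",
          "pink", "brown", "black", "white", "gray", "violet"]
non_colors = ["table", "run", "idea", "seven", "happy", "quickly",
              "river", "music", "doctor", "window", "cold", "paper"]

words = [w for w in colors + non_colors if w in vecs]   # skip anything out of vocabulary
X = np.stack([vecs[w] for w in words])
y = np.array([1 if w in colors else 0 for w in words])

probe = LogisticRegression(max_iter=1000)
print(cross_val_score(probe, X, y, cv=4).mean())
# Can look convincing on a list this small; the claim above is that it degrades
# as you add more words, because the "structure" isn't well defined.
```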
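
Finally, the jaguar example, sketched with HuggingFace transformers and bert-base-uncased (model choice and sentences are illustrative): pull the contextual vector for the same surface word out of two different sentences and compare. A static Word2Vec lookup would hand you the same vector both times; the claim is that BERT's vectors separate the animal sense from the dealership sense.

```python
# Sketch: the same word gets different contextual vectors in different sentences.
# (Model and example sentences are illustrative assumptions.)
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def word_vector(sentence, word):
    """Contextual embedding of the first WordPiece of `word` in `sentence`."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]          # (seq_len, 768)
    pieces = tok.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    first_piece = tok.tokenize(word)[0]                     # word may split into WordPieces
    return hidden[pieces.index(first_piece)]

animal = word_vector("my pet jaguar sleeps in the sun all day", "jaguar")
car    = word_vector("the jaguar dealership in your town is having a sale", "jaguar")
cat    = word_vector("the leopard slept in the tree all afternoon", "leopard")

cos = torch.nn.functional.cosine_similarity
# If the context sensitivity claim holds, the animal-context "jaguar" should sit
# closer to "leopard" than to the dealership-context "jaguar".
print(cos(animal, cat, dim=0).item(), cos(animal, car, dim=0).item())
```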