As pre-built word vectors go, ConceptNet Numberbatch [1], introduced less flippantly as the ConceptNet Vector Ensemble [2], already outperforms this on all the measures evaluated in its paper: Rare Words, MEN-3000, and WordSim-353.

This fact is hard to publicize because somehow the luminaries of the field decided they didn't care about these evaluations anymore, back when Rare Words performance was around 0.4. I have had reviewers dismiss improving Rare Words from 0.4 to 0.6, and bringing MEN-3000 up to a high estimate of inter-annotator agreement, as "incremental improvements".

It is possible to do much, much better than Google News skip-grams ("word2vec"), and one thing that helps get there is lexical knowledge of the kind that's in ConceptNet.

[1] https://blog.conceptnet.io/2016/05/25/conceptnet-numberbatch-a-new-name-for-the-best-word-embeddings-you-can-download/

[2] https://blog.luminoso.com/2016/04/06/an-introduction-to-the-conceptnet-vector-ensemble/
It feels weird how word embedding models have come to refer both to the underlying model and to the implementation. word2vec is the implementation of two models, Mikolov's continuous bag-of-words and skip-gram models, while LexVec implements a version of the PPMI-weighted count matrix, as referenced in the README file. But the papers also discuss implementation details of LexVec that have no bearing on the final accuracy. I feel like we should make more of an effort to keep the models and reference implementations separate.
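To make the distinction concrete, the model half is small enough to sketch directly. Here's a minimal PPMI weighting in plain numpy, assuming you already have a word-by-context co-occurrence count matrix; the window sampling and the SGD factorization that LexVec layers on top are exactly the kind of implementation details I mean:

    import numpy as np

    def ppmi(counts):
        # counts[i, j] = how often context word j occurs near target word i
        total = counts.sum()
        p_word = counts.sum(axis=1, keepdims=True) / total   # P(w)
        p_ctx = counts.sum(axis=0, keepdims=True) / total    # P(c)
        p_joint = counts / total                             # P(w, c)
        with np.errstate(divide="ignore", invalid="ignore"):
            pmi = np.log(p_joint / (p_word * p_ctx))
        pmi[~np.isfinite(pmi)] = 0.0   # zero counts contribute nothing
        return np.maximum(pmi, 0.0)    # clip negatives: "positive" PMI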
If anyone else is wondering what the heck "word embedding" means, it's a natural language processing technique that represents each word as a dense vector of real numbers.

Here's a nice blog post about it: http://sebastianruder.com/word-embeddings-1/

It lets you do arithmetic like this: king - man + woman = queen

Neat-o.
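You can try that analogy yourself with gensim and any pre-trained vectors in word2vec format (the Google News file below is just one example):

    from gensim.models import KeyedVectors

    # Path is an example; any word2vec-format vector file works.
    vectors = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)

    # king - man + woman ~= ?
    print(vectors.most_similar(positive=["king", "woman"],
                               negative=["man"], topn=1))
    # expected to print something like [('queen', 0.7...)]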
Slightly off-topic, but I thought this would be a good place to ask.

Are there any word embedding tools which take a Lucene/Solr/ES index as input and output a synonyms file which can be used to improve search recall?
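If nothing like that exists, I imagine a rough version could be scripted: dump the indexed terms, train (or load) vectors over the same corpus, and emit each term's nearest neighbors in Solr's synonyms.txt format. A sketch with gensim, where the file paths, topn=5, and the 0.7 similarity cutoff are all placeholders to tune:

    from gensim.models import KeyedVectors

    # Hypothetical inputs: vectors trained on the indexed corpus, plus a
    # plain-text dump of the indexed terms (one per line).
    vectors = KeyedVectors.load("my_corpus_vectors.kv")
    terms = {line.strip() for line in open("indexed_terms.txt")}

    # Solr synonyms.txt format: comma-separated equivalent terms, one group per line.
    with open("synonyms.txt", "w") as out:
        for term in sorted(terms):
            if term not in vectors:
                continue
            neighbors = [w for w, score in vectors.most_similar(term, topn=5)
                         if score > 0.7 and w in terms]  # cutoff is a guess
            if neighbors:
                out.write(",".join([term] + neighbors) + "\n")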
Has anyone done any work on handling words that have overloaded meanings? Something like 'lead' has two really distinct uses. It's really multiple words that happen to be spelt the same.
Reminds me of Chord [1], word2vec written in Chapel.

[1] https://github.com/briangu/chord
Well done, that's probably the *least* relevant use of "written in go" in an HN headline I've seen. And there's some stiff competition for that title.
From the viewpoint of commercial applications I find this profoundly depressing.

When the state of the art for accuracy is 0.6 on some task, you are always going to be a bridesmaid and never a bride. But hey, you can get bragging rights because you did well on Kaggle.