LexVec, a word embedding model written in Go that outperforms word2vec

131 points by atrudeau, almost 9 years ago

9 comments

rspeer, almost 9 years ago
As pre-built word vectors go, ConceptNet Numberbatch [1], introduced less flippantly as the ConceptNet Vector Ensemble [2], already outperforms this on all the measures evaluated in its paper: Rare Words, MEN-3000, and WordSim-353.

This fact is hard to publicize because somehow the luminaries of the field decided that they didn't care about these evaluations anymore, back when RW performance was around 0.4. I have had reviewers dismiss it as "incremental improvements" to improve Rare Words from 0.4 to 0.6 and to improve MEN-3000 to be as good as a high estimate of inter-annotator agreement.

It is possible to do much, much better than Google News skip-grams ("word2vec"), and one thing that helps get there is lexical knowledge of the kind that's in ConceptNet.

[1] https://blog.conceptnet.io/2016/05/25/conceptnet-numberbatch-a-new-name-for-the-best-word-embeddings-you-can-download/

[2] https://blog.luminoso.com/2016/04/06/an-introduction-to-the-conceptnet-vector-ensemble/
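For readers unfamiliar with these benchmarks: word-similarity sets such as Rare Words, MEN-3000, and WordSim-353 are scored by the Spearman rank correlation between the model's cosine similarities and human similarity ratings. Here is a minimal sketch of that scoring, assuming toy vectors and made-up ratings; the function names are hypothetical and not taken from any tool mentioned in this thread.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def evaluate(vectors, rated_pairs):
    """Correlate model similarity with human ratings.

    Pairs containing an out-of-vocabulary word are skipped here;
    that is an assumption of this sketch, not a benchmark rule.
    """
    model_scores, human_scores = [], []
    for w1, w2, rating in rated_pairs:
        if w1 in vectors and w2 in vectors:
            model_scores.append(cosine(vectors[w1], vectors[w2]))
            human_scores.append(rating)
    return spearman(model_scores, human_scores)

# Toy data, invented for illustration only.
vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.2],
    "truck": [0.3, 0.7, 0.4],
    "banana": [0.0, 0.2, 0.9],
}
rated_pairs = [
    ("cat", "dog", 9.0),
    ("car", "truck", 8.5),
    ("cat", "car", 2.0),
    ("dog", "banana", 1.0),
]
score = evaluate(vectors, rated_pairs)
print(f"Spearman correlation: {score:.3f}")
```

A score of 0.4 versus 0.6 on this scale means the model's ranking of word pairs agrees noticeably more often with the human ranking, which is the gap being discussed above.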
herrkanin, almost 9 years ago
It feels weird how "word embedding model" has come to refer to both the underlying model and the implementation. word2vec is the implementation of two models, the continuous bag-of-words and the skip-gram models by Mikolov, while LexVec implements a version of the PPMI-weighted count matrix as referenced in the README file. But the papers also discuss implementation details of LexVec that have no bearing on the final accuracy. I feel like we should make more effort to keep the models and reference implementations separate.
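To make the model-versus-implementation distinction concrete, the PPMI matrix itself can be written down in a few lines, independent of any tool. PPMI(w, c) is max(0, log(P(w, c) / (P(w) P(c)))), estimated from co-occurrence counts. What follows is a minimal sketch, assuming a symmetric context window and a toy corpus; the function names are hypothetical, and it shows only the matrix, not LexVec's factorization or its actual code.

```python
import math
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count (word, context) pairs within a symmetric window."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts

def ppmi(counts):
    """Positive pointwise mutual information for each observed pair."""
    total = sum(counts.values())
    word_totals = Counter()
    context_totals = Counter()
    for (w, c), n in counts.items():
        word_totals[w] += n
        context_totals[c] += n
    scores = {}
    for (w, c), n in counts.items():
        # P(w,c) / (P(w) P(c)) simplifies to n * total / (n_w * n_c).
        pmi = math.log((n * total) / (word_totals[w] * context_totals[c]))
        scores[(w, c)] = max(0.0, pmi)  # clip negative PMI to zero
    return scores

# Toy corpus, invented for illustration only.
counts = cooccurrence_counts("the cat sat on the mat".split(), window=2)
scores = ppmi(counts)
for pair, value in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
    print(pair, round(value, 3))
```

Unobserved pairs are simply absent from the dictionary, which corresponds to a PPMI of zero in the sparse matrix.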
loudmax, almost 9 years ago
If anyone else is wondering what the heck "word embedding" means, it's a natural language processing technique.

Here's a nice blog post about it: http://sebastianruder.com/word-embeddings-1/

It can process something like this: king - man + woman = queen

Neat-o.
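The analogy above is plain vector arithmetic followed by a nearest-neighbor lookup. Here is a minimal sketch, assuming hand-made three-dimensional toy vectors (real embeddings have hundreds of dimensions and are learned from text); the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(vectors, a, b, c):
    """Word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Toy vectors, invented for illustration.
# Dimensions are roughly: royalty, male, female.
vectors = {
    "king": [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "prince": [0.8, 0.9, 0.1],
}
print(analogy(vectors, "king", "man", "woman"))
```

The query words are excluded from the candidates because the nearest vector to the result is often one of the inputs themselves.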
mooneater, almost 9 years ago
Are there IP considerations? Word2vec is patented.
rpedela, almost 9 years ago
Slightly off-topic, but I thought this would be a good place to ask.

Are there any word embedding tools which take a Lucene/Solr/ES index as input and output a synonyms file which can be used to improve search recall?
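One way such a tool could work is sketched below, under loud assumptions: the vectors are assumed to have been trained already on text exported from the index (reading the index is not shown), the threshold and function names are hypothetical, and the output uses the explicit-mapping line format (`term => term, alt1, alt2`) accepted by Solr-style synonym files.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def synonym_lines(vectors, threshold=0.8, max_synonyms=3):
    """Build one mapping line per word that has close enough neighbors."""
    lines = []
    for word in sorted(vectors):
        scored = [
            (cosine(vectors[word], vectors[other]), other)
            for other in vectors
            if other != word
        ]
        # Keep the highest-scoring neighbors above the threshold.
        neighbours = [
            other for score, other in sorted(scored, reverse=True)
            if score >= threshold
        ][:max_synonyms]
        if neighbours:
            lines.append(f"{word} => {', '.join([word] + neighbours)}")
    return lines

# Toy vectors, invented for illustration only.
vectors = {
    "laptop": [0.9, 0.1],
    "notebook": [0.85, 0.2],
    "banana": [0.1, 0.9],
}
for line in synonym_lines(vectors):
    print(line)
```

Nearest neighbors in embedding space are related words rather than guaranteed synonyms (antonyms often sit close together), so a generated file like this would likely need manual review before it is used for query expansion.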
IshKebab, almost 9 years ago
Has anyone done any work on handling words that have overloaded meanings? Something like 'lead' has two really distinct uses. It's really multiple words that happen to be spelt the same.
ianbertolacci, almost 9 years ago
Reminds me of Chord [1], word2vec written in Chapel.

[1] https://github.com/briangu/chord
ris, almost 9 years ago
Well done, that's probably the *least* relevant use of "written in go" in a HN headline I've seen. And there's some stiff competition for that title.
PaulHoule, almost 9 years ago
From the viewpoint of commercial applications I find this profoundly depressing.

When the state of the art for accuracy is 0.6 on some task, you are always going to be a bridesmaid and never a bride, but hey, you can get bragging rights 'cause you did well on Kaggle.