A Word Is Worth a Thousand Vectors

88 points by legel, over 9 years ago

4 comments

ggchappell, over 9 years ago
This is interesting stuff. I recall that at one time Google seemed to be heading in somewhat similar directions with Google Sets (now sadly gone -- I miss it).

I know that the author is looking squarely at use cases along the lines of a recommendation engine that would replace a human expert. But personally, I think it might be more interesting to examine things the algorithm can do that humans would find difficult or unintuitive. Sure, king - man + woman = queen is a very significant achievement; it's also obvious, to a human. Now, what can this algorithm come up with that is worthwhile, but that I would not find so obvious?

A couple of little comments:

> The algorithm eventually sees so many examples that it can infer the gender of a single word, ....

Do we really want to say that? Perhaps we should say that the algorithm is eventually able to make inferences that people would make based on knowledge of the gender of words -- which is not quite the same thing. (And again, I ask: what useful inferences can the algorithm make that humans would *not* make so quickly?)

> Despite the impressive results that come with word vectorization, no NLP technique is perfect. Take care that your system is robust to results that a computer deems relevant but an expert human wouldn't.

It should be noted that that "no NLP technique is perfect" idea applies to the NLP techniques used by human brains.
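For anyone who has not seen the "king - man + woman = queen" arithmetic spelled out, here is a minimal sketch of the idea. The three-dimensional vectors and the `analogy` helper are invented for illustration; real word vectors are learned from a corpus and have far more dimensions, so this shows only the mechanics, not the article's actual model.

```python
import math

# Toy vectors made up for illustration only (an assumption, not trained data).
# The dimensions loosely stand for (royalty, maleness, femaleness).
VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors=VECTORS):
    """Hypothetical helper: word closest to a - b + c, excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman")
print(result)
```

Excluding the three input words from the candidates is a common convention in analogy lookups; without it, the nearest vector to the target is often one of the inputs.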
jerf, over 9 years ago
It occurs to me the most-likely singularity won't be when humanity is wiped off the face of the Earth, Skynet-style, or forcibly absorbed by encroaching grey goo, but when you realize that you and every other human on the planet went bankrupt in the same week because suddenly we are all getting targeted with absolutely, utterly, completely irresistible targeted advertisement beyond our human ability to resist, and we've spent every dollar we have or can get access to... and still the ads are coming. Thus ends Humanity, in a Tantalus-ian hell of infinitely targeted ads we will no longer have the wherewithal to respond to with purchases, spending our remaining days in unbounded consumerist ennui....
Radim, over 9 years ago
Very nicely written -- as usual for Chris :)

One minor nitpick: near the end, Chris recommends LSH for similarity retrieval. This may be a bad idea. That implementation seems to perform very poorly: [benchmarks](https://github.com/erikbern/ann-benchmarks/pull/5#issuecomment-111750051)

As is often the case, simpler algorithms have fewer moving parts, and due to cache localities can even perform better than theoretically-big-O superior ones (see "bruteforce" in that same benchmark graph -- that's a simple linear database scan! Observe how it's faster than most fancy approximate algos).

Note that these benchmarks are run specifically on real world vectors (100 dimensional GloVe word vectors trained over 2 billion tweets), so they're highly relevant here.
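To make "a simple linear database scan" concrete, here is a rough sketch of exact brute-force retrieval by cosine similarity. It is not the benchmarked implementation: the toy vectors and the `brute_force_neighbors` name are assumptions for illustration, and pure Python like this would be far slower than the vectorized code in the benchmark.

```python
import heapq
import math

# Toy vectors made up for illustration (not GloVe, not from the benchmark).
TOY = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_neighbors(query, vectors, k=2):
    """Hypothetical helper: exact top-k via one pass over every stored vector."""
    scored = ((cosine(query, vec), word) for word, vec in vectors.items())
    # nlargest keeps only the k best (score, word) pairs while scanning.
    return [word for _, word in heapq.nlargest(k, scored)]

neighbors = brute_force_neighbors(TOY["king"], TOY, k=2)
print(neighbors)
```

The scan touches every vector once, so the cost grows linearly with the vocabulary, but there is no index to build or tune and the results are exact.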
hyperbovine, over 9 years ago
A thousand-dimensional vector, no?