
A Word Is Worth a Thousand Vectors

88 points by legel over 9 years ago

4 comments

ggchappell over 9 years ago
This is interesting stuff. I recall that at one time Google seemed to be heading in somewhat similar directions with Google Sets (now sadly gone -- I miss it).

I know that the author is looking squarely at use cases along the lines of a recommendation engine that would replace a human expert. But personally, I think it might be more interesting to examine things the algorithm can do that humans would find difficult or unintuitive. Sure, king - man + woman = queen is a very significant achievement; it's also obvious, to a human. Now, what can this algorithm come up with that is worthwhile, but that I would not find so obvious?

A couple of little comments:

> The algorithm eventually sees so many examples that it can infer the gender of a single word, ....

Do we really want to say that? Perhaps we should say that the algorithm is eventually able to make inferences that people would make based on knowledge of the gender of words -- which is not quite the same thing. (And again, I ask: what useful inferences can the algorithm make that humans would *not* make so quickly?)

> Despite the impressive results that come with word vectorization, no NLP technique is perfect. Take care that your system is robust to results that a computer deems relevant but an expert human wouldn't.

It should be noted that that "no NLP technique is perfect" idea applies to the NLP techniques used by human brains.
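For readers who want to try the analogy above, here is a minimal sketch using gensim's pretrained GloVe vectors (the particular model name and topn value are just illustrative choices; any pretrained embedding set works the same way):

    # Reproduce "king - man + woman ≈ queen" with pretrained word vectors.
    import gensim.downloader as api

    # Downloads ~128 MB on first run, then loads from the local cache.
    vectors = api.load("glove-wiki-gigaword-100")

    # most_similar adds the 'positive' vectors, subtracts the 'negative'
    # ones, and returns the nearest words by cosine similarity
    # (the input words themselves are excluded from the results).
    print(vectors.most_similar(positive=["king", "woman"],
                               negative=["man"], topn=3))
    # The top hit should be ('queen', <similarity score>).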
jerf over 9 years ago
It occurs to me the most-likely singularity won't be when humanity is wiped off the face of the Earth, Skynet-style, or forcibly absorbed by encroaching grey goo, but when you realize that you and every other human on the planet went bankrupt in the same week because suddenly we are all getting targeted with absolutely, utterly, completely irresistible targeted advertisement beyond our human ability to resist, and we've spent every dollar we have or can get access to... and still the ads are coming. Thus ends Humanity, in a Tantalus-ian hell of infinitely targeted ads to which we will no longer have the wherewithal to respond with purchases, spending our remaining days in unbounded consumerist ennui....
Radim over 9 years ago
Very nicely written -- as usual for Chris :)

One minor nitpick: near the end, Chris recommends LSH for similarity retrieval. This may be a bad idea. That implementation seems to perform very poorly: https://github.com/erikbern/ann-benchmarks/pull/5#issuecomment-111750051

As is often the case, simpler algorithms have fewer moving parts, and due to cache locality can even perform better than theoretically (big-O) superior ones (see "bruteforce" in that same benchmark graph -- that's a simple linear database scan! Observe how it's faster than most fancy approximate algos).

Note that these benchmarks are run specifically on real-world vectors (100-dimensional GloVe word vectors trained over 2 billion tweets), so they're highly relevant here.
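To make the "bruteforce" point concrete: an exact scan over dense vectors is a single matrix-vector product that streams through contiguous memory, which is exactly the cache-friendly behavior described above. A minimal numpy sketch (the vocabulary size and random vectors are stand-ins for the 100-dimensional GloVe data in the benchmark):

    # Exact (brute-force) k-nearest-neighbor search by linear scan.
    import numpy as np

    def nearest(vectors, query, k=10):
        # vectors: (n, dim) array, rows pre-normalized to unit length
        # query:   (dim,) unit vector
        sims = vectors @ query               # cosine similarity, one BLAS pass
        top = np.argpartition(-sims, k)[:k]  # O(n) selection of the k best
        return top[np.argsort(-sims[top])]   # sort only those k

    # Stand-in data: 400k random unit vectors in 100 dimensions.
    rng = np.random.default_rng(0)
    V = rng.normal(size=(400_000, 100)).astype(np.float32)
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    print(nearest(V, V[42], k=5))  # index 42 comes back first (similarity 1.0)

At this scale the whole scan is a few tens of millions of multiply-adds, so it finishes in milliseconds on a laptop -- which is why approximate indexes with poor memory locality struggle to beat it.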
hyperbovine over 9 years ago
A thousand-dimensional vector, no?