TechEcho

4 comments

ggchappell, over 9 years ago

This is interesting stuff. I recall that at one time Google seemed to be heading in somewhat similar directions with Google Sets (now sadly gone -- I miss it).

I know that the author is looking squarely at use cases along the lines of a recommendation engine that would replace a human expert. But personally, I think it might be more interesting to examine things the algorithm can do that humans would find difficult or unintuitive. Sure, king - man + woman = queen is a very significant achievement; it's also obvious, to a human. Now, what can this algorithm come up with that is worthwhile, but that I would not find so obvious?

A couple of little comments:

> The algorithm eventually sees so many examples that it can infer the gender of a single word, ....

Do we really want to say that? Perhaps we should say that the algorithm is eventually able to make inferences that people would make based on knowledge of the gender of words -- which is not quite the same thing. (And again, I ask: what useful inferences can the algorithm make that humans would not make so quickly?)

> Despite the impressive results that come with word vectorization, no NLP technique is perfect. Take care that your system is robust to results that a computer deems relevant but an expert human wouldn't.

It should be noted that that "no NLP technique is perfect" idea applies to the NLP techniques used by human brains.
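For readers who have not seen the king - man + woman = queen arithmetic spelled out, here is a minimal sketch of the idea. The three-dimensional vectors are hand-picked toy values (an assumption for illustration, not trained embeddings), and `analogy` is a hypothetical helper name, not a function from any library.

```python
import math

# Hypothetical toy embeddings, hand-picked for illustration (NOT trained).
# The three axes loosely stand for: royalty, maleness, femaleness.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman", vectors)
print(result)
```

With these toy values the nearest remaining word is "queen"; with real embeddings the same arithmetic runs over hundreds of dimensions and a full vocabulary.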

jerf, over 9 years ago

It occurs to me the most-likely singularity won't be when humanity is wiped off the face of the Earth, Skynet-style, or forcibly absorbed by encroaching grey goo, but when you realize that you and every other human on the planet went bankrupt in the same week because suddenly we are all getting targeted with absolutely, utterly, completely irresistible targeted advertisement beyond our human ability to resist, and we've spent every dollar we have or can get access to... and still the ads are coming. Thus ends Humanity, in a Tantalus-ian hell of infinitely targeted ads we will no longer have the wherewithal to respond with purchases, spending our remaining days in unbounded consumerist ennui....

Radim, over 9 years ago

Very nicely written -- as usual for Chris :)

One minor nitpick: near the end, Chris recommends LSH for similarity retrieval. This may be a bad idea. That implementation seems to perform very poorly: [benchmarks](https://github.com/erikbern/ann-benchmarks/pull/5#issuecomment-111750051)

As is often the case, simpler algorithms have fewer moving parts, and due to cache localities can even perform better than theoretically-big-O superior ones (see "bruteforce" in that same benchmark graph -- that's a simple linear database scan! Observe how it's faster than most fancy approximate algos).

Note that these benchmarks are run specifically on real world vectors (100 dimensional GloVe word vectors trained over 2 billion tweets), so they're highly relevant here.
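To make the "simple linear database scan" concrete, here is a rough sketch of brute-force similarity retrieval: score the query against every stored vector and keep the top k. The index contents and the name `brute_force_top_k` are made up for illustration; this is not the code used in the linked benchmark, which would typically rely on vectorized array operations rather than pure Python loops.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_top_k(query, index, k=2):
    """Linear scan: compare the query with every vector, return the k best words."""
    scored = [(cosine(query, vec), word) for word, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [word for _, word in scored[:k]]

# Hypothetical 2-d toy index; real word vectors would have ~100+ dimensions.
index = {
    "cat":    [1.0, 0.0],
    "kitten": [0.9, 0.1],
    "car":    [0.0, 1.0],
    "truck":  [0.1, 0.9],
}

neighbors = brute_force_top_k([1.0, 0.05], index, k=2)
print(neighbors)
```

The scan is O(n) per query with no index structure to build or tune, which is the "fewer moving parts" trade-off described above.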

hyperbovine, over 9 years ago

A thousand-dimensional vector, no?

A Word Is Worth a Thousand Vectors
