Show HN: A search engine that lets you 'add' words as vectors

148 points by jakek, over 11 years ago

36 comments

gjm11, over 11 years ago

Hmm. It's a lovely idea but I find the results uninspiring. (Which isn't surprising; it's a very difficult problem. But I was hoping to be amazed.) Here are the examples I tried (all of them, no cherrypicking):

daughter + male - female -> { The Eldest (book), Songwriter, Granddaughter }
(Hopeless; should have had "son" in there)

pc - microsoft + apple -> { Olynssis The Silver Color (Japanese book), Burger Time (arcade game), Phantasy Star (series of games) }
(Hopeless; should have had "Mac" in there)

violin - string + woodwind -> { clarinet, oboe, flute }
(OK)

mccartney - beatles + stones -> { Rolling Stone (magazine), carvedilol (pharmaceutical), stone (geological term) }
(Poor; should have had Jagger or Richards or something of the kind in the top few results)

sofa - large + small -> { relaxing, asleep, cupboard }
(Poor; I'd have hoped for "armchair" or something of the kind)

nemo1618, over 11 years ago

Lisp + JVM = Clojure

I'm sold. This is really cool! (Though it's worth noting that a Google search with the same terms returns the exact same result...)

leephillips, over 11 years ago

I was really excited by this writeup, so I tried it. Four test queries returned nothing that seemed useful or even relevant:

fluid dynamics + electromagnetism: expected magnetohydrodynamics, got Maxwell’s equations and classical mechanics (not useful);
verse + 5 - rhyme: expected blank iambic pentameter, Shakespeare, etc.; got nonsense;
writer + American + Russian + Great - Nobel Prize: expected Nabokov, got Meirkhaim Gavrielov + 1 nonsense result;
plant + illegal - addictive: expected cannabis, chronic, etc.; got “Plants” (thanks) and “Nuclear Weapon” (?!?) and some Hungarian village.

EDIT: I thought maybe I wasn't being sufficiently imaginative, so I tried "Nixon + Clinton - JFK" and got nothing that looked interesting. Then I noticed that the "Nixon" part of my query was "disambiguated" to something like "non_film", and the word "Nixon" was just stripped out. I think this thing is just broken.

doctoboggan, over 11 years ago

Hey juxtaposicion, fascinating work. I have many questions so I am just going to shoot them rapid fire.

What is the dimensionality of each word vector and what does a word's position in this space "mean"? What is this dimensionality determined by? Have you tried any dimensionality reduction algorithms like PCA or Isomap? It would be interesting to find the word vectors that contain the most variation across all of Wikipedia. Have you tried any nearest neighbor search methods other than a simple dot product, such as locality sensitive hashing?

I guess most of those questions are about the word2vec algorithm, but you are probably in a good place to answer them after working with it. Anyway, cool work, and I am glad you did it in Python so I can really dig in and understand it.

juxtaposicion, over 11 years ago

Harvard - Boston + Silicon

http://www.thisplusthat.me/search/Harvard%20-%20Boston%20%2B%20Silicon

SandB0x, over 11 years ago

I saw what you wrote about your dot product speed issue. Did you try using NumPy's einsum function?

http://docs.scipy.org/doc/numpy/reference/generated/numpy.einsum.html

It's really fast for this kind of stuff. Happy to give details about how to use it if you need.

emehrkay, over 11 years ago

Interesting. I've played around with words as vectors (with values) and the cosine similarity algorithm (http://en.wikipedia.org/wiki/Cosine_similarity). This is very cool stuff. I wonder how they're doing it in real time; it is heavy number crunching.
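
For anyone who hasn't met cosine similarity before, here is a minimal pure-Python sketch of the standard formula (dot product divided by the product of the vector lengths). The function name and the toy vectors are made up for illustration; nothing here is taken from the site's code, which presumably does the same arithmetic in bulk with NumPy.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # The angle to a zero vector is undefined, so refuse rather than guess.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration only.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

The cost is one dot product per candidate word, which is why a large vocabulary makes this heavy without vectorized math.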

donretag, over 11 years ago

Interesting concept, but how will it work with more dynamic content? You can train the model on a fairly static corpus such as Wikipedia, but what if your content changes more frequently?

Since MapReduce is used, perhaps the model is already being trained on small batches, making incremental updates possible.

logn, over 11 years ago

daft punk - repetitive + lyrics == La Roux

nice work!

axblount, over 11 years ago

I guess we all just need a little more LeAnn Rimes.

http://www.thisplusthat.me/search/the%20world%20-%20violence%20%2B%20love

est, over 11 years ago

Sounds like this paper from Google:

http://www.technologyreview.com/view/519581

For example, the operation ‘king’ – ‘man’ + ‘woman’ results in a vector that is similar to ‘queen’.
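
To make that arithmetic concrete, here is a small sketch with hand-picked 3-dimensional vectors. Everything in it is an assumption for illustration: the `VECTORS` table, the `analogy` helper, and the choice to rank by cosine similarity while skipping the query words themselves. Real word2vec vectors are learned from text and have far more dimensions.

```python
import math

# Made-up embeddings; the axes loosely mean (royalty, maleness, femaleness).
VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def analogy(plus, minus, vectors=VECTORS, top_n=1):
    """Add the `plus` words, subtract the `minus` words, rank what is left."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in plus:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in minus:
        target = [t - v for t, v in zip(target, vectors[word])]
    # Skip the query words, otherwise "king" tends to win its own analogy.
    exclude = set(plus) | set(minus)
    candidates = [w for w in vectors if w not in exclude]
    candidates.sort(key=lambda w: cosine(vectors[w], target), reverse=True)
    return candidates[:top_n]


print(analogy(plus=["king", "woman"], minus=["man"]))  # ['queen']
```

With a table this small the answer is rigged, of course; the disappointing results reported elsewhere in this thread are what the same ranking looks like over a full Wikipedia vocabulary.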

jeorgun, over 11 years ago

Is it just me, or do almost half of my searches return `Dvbc' for no apparent reason?

http://www.thisplusthat.me/search/Saturn%20-%20Rings%20%2B%20Spot
http://www.thisplusthat.me/search/Chrome%20%2B%20open%20source
http://www.thisplusthat.me/search/Unix%20%2B%20Open%20Source

toolslive, over 11 years ago

Does this relate to Latent Semantic Indexing? http://en.wikipedia.org/wiki/Latent_semantic_indexing

CurtMonash, over 11 years ago

Sounds like another go-around at 1990s (& early 2000s) concept search -- Excite, Northern Light, etc.

And it sounds really close to what I was trying at Elucidate.

dhammack, over 11 years ago

Hey, nice work! Can you explain the "comma delimited list" functionality any more? It seems (awesomely) similar to a hack I did a while back with Word2Vec which would pick out the word which didn't belong in a list.

My hack: https://github.com/dhammack/Word2VecExample
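
One common way to pick the word that doesn't belong is to average the (length-normalized) vectors and return the word least similar to that average. The sketch below assumes that approach and uses invented 2-dimensional vectors; it is not necessarily what the linked repository or the site's comma-delimited mode does.

```python
import math

# Invented 2-d vectors: three instruments pointing one way, one piece of
# furniture pointing another. Real embeddings would come from a trained model.
VECTORS = {
    "clarinet": [0.90, 0.10],
    "oboe":     [0.80, 0.20],
    "flute":    [0.85, 0.15],
    "sofa":     [0.10, 0.90],
}


def unit(vector):
    """Scale a non-zero vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def odd_one_out(words, vectors=VECTORS):
    """Return the word whose direction is furthest from the group average."""
    units = {word: unit(vectors[word]) for word in words}
    dim = len(next(iter(units.values())))
    centroid = [sum(u[i] for u in units.values()) / len(units) for i in range(dim)]
    # For unit vectors, ranking by dot product with the centroid is the same
    # as ranking by cosine similarity to it.
    return min(words, key=lambda w: sum(a * b for a, b in zip(units[w], centroid)))


print(odd_one_out(["clarinet", "oboe", "flute", "sofa"]))  # sofa
```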

Danieru, over 11 years ago

Fun bug: handheld - sony + nintendo = {Wii, Wii, Snes}

I was hoping for the DS or Gameboy but expecting at least something handheld.

grishma, over 11 years ago

Interesting. Currently it generates garbage for a lot of queries, but some stuff is kinda fun.

Forrest Gump - comedy + romance gives pulp fiction (!), as good as it gets (match) and polar express (?)
Avatar - action + comedy gives The Office (haha!)

yetanotherphd, over 11 years ago

I know people like to keep things positive, but this is completely useless. Apart from a few cherry-picked examples, subtracting words makes no sense most of the time, and there is no clear advantage for their method when it comes to adding words.

jboynyc, over 11 years ago

This is neat, and I found a few queries that added interesting results. However, I tried

<pre><code>Slavoj Žižek - Jacques Lacan - Hegel</code></pre>

which yielded an internal server error, probably due to the diacritics not being encoded properly.
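
For what it's worth, the search links posted in this thread percent-encode the query in the URL path (spaces as %20, "+" as %2B). Here is a small sketch of building such a URL for a query with diacritics using Python's standard `urllib.parse.quote`, which encodes non-ASCII characters as UTF-8 bytes by default. `build_search_url` is a hypothetical helper, and what the server does when it decodes the path is unknown, so this only shows the client side.

```python
from urllib.parse import quote, unquote

# Base path as it appears in the search links posted in this thread.
BASE_URL = "http://www.thisplusthat.me/search/"


def build_search_url(query):
    """Percent-encode a query (UTF-8) and append it to the search path."""
    # safe="" makes quote() escape "/" as well, so the whole query stays
    # inside a single path segment.
    return BASE_URL + quote(query, safe="")


url = build_search_url("Slavoj Žižek - Jacques Lacan - Hegel")
print(url)

# Decoding the path segment gives back the original text unchanged.
print(unquote(url[len(BASE_URL):]))
```

If the crash does come from the diacritics, the likely culprit is the decoding step on the server (for example, treating the bytes as ASCII) rather than the URL itself.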

cocoflunchy, over 11 years ago

Bug report: using some non-ascii characters crashes the server (for example é or É).

zhemao, over 11 years ago

Albert Einstein - Smart = Niels Bohr, Werner Heisenberg, Wolfgang Pauli

Ouch, that's cold

breck, over 11 years ago

Neat stuff juxtaposicion.

Seems like this is how Numenta's AI works: http://www.youtube.com/watch?v=iNMbsvK8Q8Y

akennberg, over 11 years ago

Stanford - American + Canadian = University of Toronto

I think it should be Waterloo.

whistlerbrk, over 11 years ago

Works for me:

http://www.thisplusthat.me/search/Dick%20Cheney%20-%20evil%20%2B%20good

somberi, over 11 years ago

Fantastic work, and relevant to something we are working on in this space. Thanks.

On a lighter note I tried "sarah palin + sexy" and I got John McCain, Hillary Clinton and Mitt Romney.

pit, over 11 years ago

Also interesting to try something like:

sleep - sleep

http://www.thisplusthat.me/search/sleep%20-%20sleep

bocanaut, over 11 years ago

http://www.thisplusthat.me/search/Germany%20-%20Fun

Germany - Fun = USA :)

corobo, over 11 years ago

Hey this is pretty cool!

superman - male + female:

<pre><code>- Lex Luthor (hmm..)
- Superman's pal Jimmy Olsen (haha, what?)
- Wonder Woman (That'll do it!)</code></pre>

ppymou, over 11 years ago

Great writeup. Curious, are there clear advantages that the vector representation has over graph models (FB graph search, Google Knowledge graph)?

SergeyHack, over 11 years ago

The default example "justin bieber - man + women" was ok, but I have found a better one - "justin bieber - women + man "

Lucy_karpova, over 11 years ago

What are the use cases for this fancy feature? I'm thinking of e-advisor for fun, but what are the real life serious use cases?

iLoch, over 11 years ago

ThisPlusThat.me - fast + slow...

Just kidding! :)

You could also say...

ThisPlusThat.me - another rant + something cool

Thanks for posting this, very interesting work!

iamchmod, over 11 years ago

I thought this one was good "Stanford - Red + Smart" = Berkeley

elwell, over 11 years ago

Server apparently wasn't ready for HN frontpage load

epaga, over 11 years ago

Pretty impressive for my first try.

iPad - cool -> Windows Phone

dlsym, over 11 years ago

reddit - dumb

Expected: HN, Got: Digg