Hmm. It's a lovely idea but I find the results uninspiring. (Which isn't surprising; it's a very difficult problem. But I was hoping to be amazed.) Here are the examples I tried (all of them, no cherrypicking):<p>daughter + male - female -> { The Eldest (book), Songwriter, Granddaughter }<p>(Hopeless; should have had "son" in there)<p>pc - microsoft + apple -> { Olynssis The Silver Color (Japanese book), Burger Time (arcade game), Phantasy Star (series of games) }<p>(Hopeless; should have had "Mac" in there)<p>violin - string + woodwind -> { clarinet, oboe, flute }<p>(OK)<p>mccartney - beatles + stones -> { Rolling Stone (magazine), carvedilol (pharmaceutical), stone (geological term) }<p>(Poor; should have had Jagger or Richards or something of the kind in the top few results)<p>sofa - large + small -> { relaxing, asleep, cupboard }<p>(Poor; I'd have hoped for "armchair" or something of the kind)
Lisp + JVM = Clojure<p>I'm sold. This is really cool! (Though it's worth noting that a Google search with the same terms returns the exact same result...)
I was really excited by this writeup, so I tried it. Four test queries returned nothing that seemed useful or even relevant:<p>fluid dynamics + electromagnetism : expected magnetohydrodynamics, got Maxwell’s equations and classical mechanics (not useful);<p>verse + 5 - rhyme : expected blank iambic pentameter, Shakespeare, etc.; got nonsense;<p>writer + American + Russian + Great - Nobel Prize : expected Nabokov, got Meirkhaim Gavrielov + 1 nonsense result;<p>plant + illegal - addictive : expected cannabis, chronic, etc.; got “Plants” (thanks) and “Nuclear Weapon” (?!?) and some Hungarian village.<p>EDIT: I thought maybe I wasn't being sufficiently imaginative, so I tried "Nixon + Clinton - JFK" and got nothing that looked interesting. Then I noticed that the "Nixon" part of my query was "disambiguated" to something like "non_film", and the word "Nixon" was just stripped out. I think this thing is just broken.
Hey juxtaposicion, fascinating work. I have many questions, so I'm just going to shoot them rapid fire.<p>What is the dimensionality of each word vector, and what does a word's position in this space "mean"? What determines this dimensionality? Have you tried any dimensionality reduction algorithms like PCA or Isomap? It would be interesting to find the word vectors that contain the most variation across all of Wikipedia. Have you tried any nearest neighbor search methods other than a simple dot product, such as locality-sensitive hashing?<p>I guess most of those questions are about the word2vec algorithm, but you're probably in a good place to answer them after working with it. Anyway, cool work, and I'm glad you did it in Python so I can really dig in and understand it.
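For what it's worth, the PCA part of that question is only a few lines of NumPy via SVD. This is just a sketch on made-up toy vectors, not the actual model (real word2vec vectors are typically a few hundred dimensions, fixed at training time):

```python
import numpy as np

# Toy stand-ins for word vectors: 100 "words" in a 5-dimensional space.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 5))

def pca(X, k):
    # Center the data, then project onto the top-k right singular
    # vectors, i.e. the directions of greatest variance.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

reduced = pca(vectors, 2)
print(reduced.shape)  # (100, 2)
```

Projecting to 2 components like this would let you scatter-plot the vocabulary and eyeball which words carry the most variation.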
I saw what you wrote about your dot product speed issue. Did you try using NumPy's einsum function? <a href="http://docs.scipy.org/doc/numpy/reference/generated/numpy.einsum.html" rel="nofollow">http://docs.scipy.org/doc/numpy/reference/generated/numpy.ei...</a><p>It's really fast for this kind of stuff. Happy to give details about how to use it if you need.
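To illustrate (with made-up corpus sizes, not the actual data): scoring a query against every vector is one einsum call instead of a Python loop, and the same spelling extends to batched queries.

```python
import numpy as np

rng = np.random.default_rng(1)
corpus = rng.normal(size=(10000, 300))  # one 300-d vector per word
query = rng.normal(size=300)

# All 10,000 dot products in a single vectorized call.
scores = np.einsum('ij,j->i', corpus, query)

# Equivalent to corpus @ query here, but einsum generalizes, e.g. to
# many queries at once: np.einsum('ij,kj->ik', corpus, queries)
assert np.allclose(scores, corpus @ query)
```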
Interesting. I've played around with words as vectors (with values) and the cosine similarity algorithm (<a href="http://en.wikipedia.org/wiki/Cosine_similarity" rel="nofollow">http://en.wikipedia.org/wiki/Cosine_similarity</a>). This is very cool stuff. I wonder how they're doing it in real time; it's heavy number crunching.
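For anyone following along, cosine similarity is just the dot product scaled by the vector lengths. One guess about the real-time part: if you normalize every vector to unit length ahead of time, cosine similarity collapses to a plain dot product, which is cheap. A minimal sketch:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 0.0])))  # 1.0
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0
```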
Interesting concept, but how will it work with more dynamic content? You can train the model on a fairly static corpus such as Wikipedia, but what if your content changes more frequently?<p>Since MapReduce is used, perhaps the model is already being trained on small batches, making incremental updates possible.
I guess we all just need a little more LeAnn Rimes. <a href="http://www.thisplusthat.me/search/the%20world%20-%20violence%20%2B%20love" rel="nofollow">http://www.thisplusthat.me/search/the%20world%20-%20violence...</a>
Sounds like this paper from Google<p><a href="http://www.technologyreview.com/view/519581" rel="nofollow">http://www.technologyreview.com/view/519581</a><p>For example, the operation ‘king’ – ‘man’ + ‘woman’ results in a vector that is similar to ‘queen’.
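The arithmetic in that paper is easy to play with. Here the embeddings are hand-made for illustration (in a trained model they'd come from word2vec); the query vector is 'king' - 'man' + 'woman', and we return the nearest remaining word by cosine similarity:

```python
import numpy as np

# Hand-made stand-ins for trained embeddings (assumption, not real data).
emb = {
    'king':  np.array([0.9, 0.8, 0.1]),
    'man':   np.array([0.1, 0.9, 0.0]),
    'woman': np.array([0.1, 0.1, 0.9]),
    'queen': np.array([0.9, 0.1, 0.9]),
    'apple': np.array([0.5, 0.5, 0.5]),
}

target = emb['king'] - emb['man'] + emb['woman']

def nearest(vec, exclude):
    # Highest cosine similarity among words not in the query itself.
    return max((w for w in emb if w not in exclude),
               key=lambda w: np.dot(emb[w], vec)
                             / (np.linalg.norm(emb[w]) * np.linalg.norm(vec)))

print(nearest(target, exclude={'king', 'man', 'woman'}))  # queen
```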
Is it just me, or do almost half of my searches return `Dvbc' for no apparent reason?<p><a href="http://www.thisplusthat.me/search/Saturn%20-%20Rings%20%2B%20Spot" rel="nofollow">http://www.thisplusthat.me/search/Saturn%20-%20Rings%20%2B%2...</a><p><a href="http://www.thisplusthat.me/search/Chrome%20%2B%20open%20source" rel="nofollow">http://www.thisplusthat.me/search/Chrome%20%2B%20open%20sour...</a><p><a href="http://www.thisplusthat.me/search/Unix%20%2B%20Open%20Source" rel="nofollow">http://www.thisplusthat.me/search/Unix%20%2B%20Open%20Source</a>
Does this relate to Latent Semantic Indexing?
<a href="http://en.wikipedia.org/wiki/Latent_semantic_indexing" rel="nofollow">http://en.wikipedia.org/wiki/Latent_semantic_indexing</a>
Sounds like another go-around at 1990s (& early 2000s) concept search -- Excite, Northern Light, etc.<p>And it sounds really close to what I was trying at Elucidate.
Hey, nice work! Can you explain the "comma delimited list" functionality any more? It seems (awesomely) similar to a hack I did a while back with Word2Vec which would pick out the word which didn't belong in a list.<p>My hack: <a href="https://github.com/dhammack/Word2VecExample" rel="nofollow">https://github.com/dhammack/Word2VecExample</a>
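The odd-one-out trick can be sketched in a few lines: score each word against the centroid of the list and return the least similar one. Vectors below are hand-made for illustration, not from a trained model:

```python
import numpy as np

# Toy embeddings (assumption): three felines/canines and one fruit.
emb = {
    'cat':    np.array([0.9, 0.1]),
    'dog':    np.array([0.8, 0.2]),
    'tiger':  np.array([0.95, 0.05]),
    'banana': np.array([0.1, 0.9]),
}

def doesnt_match(words):
    vecs = np.array([emb[w] / np.linalg.norm(emb[w]) for w in words])
    mean = vecs.mean(axis=0)
    # The outlier has the lowest similarity to the list's centroid.
    sims = vecs @ mean
    return words[int(np.argmin(sims))]

print(doesnt_match(['cat', 'dog', 'banana', 'tiger']))  # banana
```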
Interesting.
Currently it generates garbage for a lot of queries, but some of the results are kinda fun.
Forrest Gump - comedy + romance gives Pulp Fiction (!), As Good as It Gets (match), and The Polar Express (?)
Avatar - action + comedy gives The Office (haha!)
I know people like to keep things positive, but this is completely useless. Apart from a few cherry picked examples, subtracting words makes no sense most of the time, and there is no clear advantage for their method when it comes to adding words.
This is neat, and I found a few queries that added interesting results. However, I tried<p><pre><code> Slavoj Žižek - Jacques Lacan - Hegel
</code></pre>
which yielded an internal server error, probably due to the diacritics not being encoded properly.
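If the client percent-encodes the query as UTF-8 before sending it, characters like Ž survive the round trip. (Whether that's actually where the server falls over is a guess; this only shows the client-side encoding.)

```python
from urllib.parse import quote

query = "Slavoj Žižek - Jacques Lacan - Hegel"
# Ž (U+017D) becomes the UTF-8 byte pair %C5%BD; spaces become %20.
print(quote(query))
```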
Neat stuff juxtaposicion.<p>Seems like this is how Numenta's AI works: <a href="http://www.youtube.com/watch?v=iNMbsvK8Q8Y" rel="nofollow">http://www.youtube.com/watch?v=iNMbsvK8Q8Y</a>
Works for me:<p><a href="http://www.thisplusthat.me/search/Dick%20Cheney%20-%20evil%20%2B%20good" rel="nofollow">http://www.thisplusthat.me/search/Dick%20Cheney%20-%20evil%2...</a>
Fantastic work, and it's relevant to something we are working on in this space. Thanks.<p>On a lighter note, I tried "sarah palin + sexy" and got John McCain, Hillary Clinton and Mitt Romney.
Also interesting to try something like:<p>sleep - sleep<p><a href="http://www.thisplusthat.me/search/sleep%20-%20sleep" rel="nofollow">http://www.thisplusthat.me/search/sleep%20-%20sleep</a>
Hey this is pretty cool!<p>superman - male + female:<p><pre><code> - Lex Luthor (hmm..)
- Superman's pal Jimmy Olsen (haha, what?)
- Wonder Woman (That'll do it!)</code></pre>
ThisPlusThat.me - fast + slow...<p>Just kidding! :)<p>You could also say...<p>ThisPlusThat.me - another rant + something cool<p>Thanks for posting this, very interesting work!