Hmm. It's a lovely idea but I find the results uninspiring. (Which isn't surprising; it's a very difficult problem. But I was hoping to be amazed.) Here are the examples I tried (all of them, no cherrypicking):<p>daughter + male - female -> { The Eldest (book), Songwriter, Granddaughter }<p>(Hopeless; should have had "son" in there)<p>pc - microsoft + apple -> { Olynssis The Silver Color (Japanese book), Burger Time (arcade game), Phantasy Star (series of games) }<p>(Hopeless; should have had "Mac" in there)<p>violin - string + woodwind -> { clarinet, oboe, flute }<p>(OK)<p>mccartney - beatles + stones -> { Rolling Stone (magazine), carvedilol (pharmaceutical), stone (geological term) }<p>(Poor; should have had Jagger or Richards or something of the kind in the top few results)<p>sofa - large + small -> { relaxing, asleep, cupboard }<p>(Poor; I'd have hoped for "armchair" or something of the kind)
Lisp + JVM = Clojure<p>I'm sold. This is really cool! (Though it's worth noting that a Google search with the same terms returns the exact same result...)
I was really excited by this writeup, so I tried it. Four test queries returned nothing that seemed useful or even relevant:<p>fluid dynamics + electromagnetism : expected magnetohydrodynamics, got Maxwell’s equations and classical mechanics (not useful);<p>verse + 5 - rhyme : expected blank iambic pentameter, Shakespeare, etc.; got nonsense;<p>writer + American + Russian + Great - Nobel Prize : expected Nabokov, got Meirkhaim Gavrielov + 1 nonsense result;<p>plant + illegal - addictive : expected cannabis, chronic, etc.; got “Plants” (thanks) and “Nuclear Weapon” (?!?) and some Hungarian village.<p>EDIT: I thought maybe I wasn't being sufficiently imaginative, so I tried "Nixon + Clinton - JFK" and got nothing that looked interesting. Then I noticed that the "Nixon" part of my query was "disambiguated" to something like "non_film", and the word "Nixon" was just stripped out. I think this thing is just broken.
Hey juxtaposicion, fascinating work. I have many questions, so I'm just going to shoot them rapid fire.<p>What is the dimensionality of each word vector, and what does a word's position in this space "mean"? What determines this dimensionality? Have you tried any dimensionality reduction algorithms like PCA or Isomap? It would be interesting to find the word vectors that contain the most variation across all of Wikipedia. Have you tried any nearest neighbor search methods other than a simple dot product, such as locality-sensitive hashing?<p>I guess most of those questions are about the word2vec algorithm, but you're probably in a good place to answer them after working with it. Anyway, cool work, and I'm glad you did it in Python so I can really dig in and understand it.
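For what it's worth, the PCA part of that question is only a few lines of NumPy via SVD. This is just a sketch on made-up toy vectors, not the actual model (real word2vec vectors are typically a few hundred dimensions, fixed at training time):

```python
import numpy as np

# Toy stand-ins for word vectors: 100 "words" in a 5-dimensional space.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 5))

def pca(X, k):
    # Center the data, then project onto the top-k right singular
    # vectors, i.e. the directions of greatest variance.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

reduced = pca(vectors, 2)
print(reduced.shape)  # (100, 2)
```

Projecting to 2 components like this would let you scatter-plot the vocabulary and eyeball which words carry the most variation.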
I saw what you wrote about your dot product speed issue. Did you try using NumPy's einsum function? <a href="http://docs.scipy.org/doc/numpy/reference/generated/numpy.einsum.html" rel="nofollow">http://docs.scipy.org/doc/numpy/reference/generated/numpy.ei...</a><p>It's really fast for this kind of stuff. Happy to give details about how to use it if you need.
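To illustrate (with made-up corpus sizes, not the actual data): scoring a query against every vector is one einsum call instead of a Python loop, and the same spelling extends to batched queries.

```python
import numpy as np

rng = np.random.default_rng(1)
corpus = rng.normal(size=(10000, 300))  # one 300-d vector per word
query = rng.normal(size=300)

# All 10,000 dot products in a single vectorized call.
scores = np.einsum('ij,j->i', corpus, query)

# Equivalent to corpus @ query here, but einsum generalizes, e.g. to
# many queries at once: np.einsum('ij,kj->ik', corpus, queries)
assert np.allclose(scores, corpus @ query)
```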
Interesting. I've played around with words as vectors (with values) and the cosine similarity algorithm (<a href="http://en.wikipedia.org/wiki/Cosine_similarity" rel="nofollow">http://en.wikipedia.org/wiki/Cosine_similarity</a>). This is very cool stuff. I wonder how they're doing it in real time; it's heavy number crunching.
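For anyone following along, cosine similarity is just the dot product scaled by the vector lengths. One guess about the real-time part: if you normalize every vector to unit length ahead of time, cosine similarity collapses to a plain dot product, which is cheap. A minimal sketch:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 0.0])))  # 1.0
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0
```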
Interesting concept, but how will it work with more dynamic content? You can train the model on a fairly static corpus such as Wikipedia, but what if your content changes more frequently?<p>Since MapReduce is used, perhaps the model is already being trained on small batches, making incremental updates possible.
I guess we all just need a little more LeAnn Rimes. <a href="http://www.thisplusthat.me/search/the%20world%20-%20violence%20%2B%20love" rel="nofollow">http://www.thisplusthat.me/search/the%20world%20-%20violence...</a>
Sounds like this paper from Google<p><a href="http://www.technologyreview.com/view/519581" rel="nofollow">http://www.technologyreview.com/view/519581</a><p>For example, the operation ‘king’ – ‘man’ + ‘woman’ results in a vector that is similar to ‘queen’.
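The arithmetic in that paper is easy to play with. Here the embeddings are hand-made for illustration (in a trained model they'd come from word2vec); the query vector is 'king' - 'man' + 'woman', and we return the nearest remaining word by cosine similarity:

```python
import numpy as np

# Hand-made stand-ins for trained embeddings (assumption, not real data).
emb = {
    'king':  np.array([0.9, 0.8, 0.1]),
    'man':   np.array([0.1, 0.9, 0.0]),
    'woman': np.array([0.1, 0.1, 0.9]),
    'queen': np.array([0.9, 0.1, 0.9]),
    'apple': np.array([0.5, 0.5, 0.5]),
}

target = emb['king'] - emb['man'] + emb['woman']

def nearest(vec, exclude):
    # Highest cosine similarity among words not in the query itself.
    return max((w for w in emb if w not in exclude),
               key=lambda w: np.dot(emb[w], vec)
                             / (np.linalg.norm(emb[w]) * np.linalg.norm(vec)))

print(nearest(target, exclude={'king', 'man', 'woman'}))  # queen
```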
Is it just me, or do almost half of my searches return `Dvbc' for no apparent reason?<p><a href="http://www.thisplusthat.me/search/Saturn%20-%20Rings%20%2B%20Spot" rel="nofollow">http://www.thisplusthat.me/search/Saturn%20-%20Rings%20%2B%2...</a><p><a href="http://www.thisplusthat.me/search/Chrome%20%2B%20open%20source" rel="nofollow">http://www.thisplusthat.me/search/Chrome%20%2B%20open%20sour...</a><p><a href="http://www.thisplusthat.me/search/Unix%20%2B%20Open%20Source" rel="nofollow">http://www.thisplusthat.me/search/Unix%20%2B%20Open%20Source</a>
Does this relate to Latent Semantic Indexing?
<a href="http://en.wikipedia.org/wiki/Latent_semantic_indexing" rel="nofollow">http://en.wikipedia.org/wiki/Latent_semantic_indexing</a>
Sounds like another go-around at 1990s (& early 2000s) concept search -- Excite, Northern Light, etc.<p>And it sounds really close to what I was trying at Elucidate.
Hey, nice work! Can you explain the "comma delimited list" functionality any more? It seems (awesomely) similar to a hack I did a while back with Word2Vec which would pick out the word which didn't belong in a list.<p>My hack: <a href="https://github.com/dhammack/Word2VecExample" rel="nofollow">https://github.com/dhammack/Word2VecExample</a>
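The odd-one-out trick can be sketched in a few lines: score each word against the centroid of the list and return the least similar one. Vectors below are hand-made for illustration, not from a trained model:

```python
import numpy as np

# Toy embeddings (assumption): three felines/canines and one fruit.
emb = {
    'cat':    np.array([0.9, 0.1]),
    'dog':    np.array([0.8, 0.2]),
    'tiger':  np.array([0.95, 0.05]),
    'banana': np.array([0.1, 0.9]),
}

def doesnt_match(words):
    vecs = np.array([emb[w] / np.linalg.norm(emb[w]) for w in words])
    mean = vecs.mean(axis=0)
    # The outlier has the lowest similarity to the list's centroid.
    sims = vecs @ mean
    return words[int(np.argmin(sims))]

print(doesnt_match(['cat', 'dog', 'banana', 'tiger']))  # banana
```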
Interesting.
Currently it generates garbage for a lot of queries, but some of the results are kinda fun.
Forrest Gump - comedy + romance gives Pulp Fiction (!), As Good as It Gets (match), and The Polar Express (?)
Avatar - action + comedy gives The Office (haha!)
I know people like to keep things positive, but this is completely useless. Apart from a few cherry picked examples, subtracting words makes no sense most of the time, and there is no clear advantage for their method when it comes to adding words.
This is neat, and I found a few queries that added interesting results. However, I tried<p><pre><code> Slavoj Žižek - Jacques Lacan - Hegel
</code></pre>
which yielded an internal server error, probably due to the diacritics not being encoded properly.
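If the client percent-encodes the query as UTF-8 before sending it, characters like Ž survive the round trip. (Whether that's actually where the server falls over is a guess; this only shows the client-side encoding.)

```python
from urllib.parse import quote

query = "Slavoj Žižek - Jacques Lacan - Hegel"
# Ž (U+017D) becomes the UTF-8 byte pair %C5%BD; spaces become %20.
print(quote(query))
```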
Neat stuff juxtaposicion.<p>Seems like this is how Numenta's AI works: <a href="http://www.youtube.com/watch?v=iNMbsvK8Q8Y" rel="nofollow">http://www.youtube.com/watch?v=iNMbsvK8Q8Y</a>
Works for me:<p><a href="http://www.thisplusthat.me/search/Dick%20Cheney%20-%20evil%20%2B%20good" rel="nofollow">http://www.thisplusthat.me/search/Dick%20Cheney%20-%20evil%2...</a>
Fantastic work, and it's relevant to something we are working on in this space. Thanks.<p>On a lighter note, I tried "sarah palin + sexy" and got John McCain, Hillary Clinton and Mitt Romney.
Also interesting to try something like:<p>sleep - sleep<p><a href="http://www.thisplusthat.me/search/sleep%20-%20sleep" rel="nofollow">http://www.thisplusthat.me/search/sleep%20-%20sleep</a>
Hey this is pretty cool!<p>superman - male + female:<p><pre><code> - Lex Luthor (hmm..)
- Superman's pal Jimmy Olsen (haha, what?)
- Wonder Woman (That'll do it!)</code></pre>
ThisPlusThat.me - fast + slow...<p>Just kidding! :)<p>You could also say...<p>ThisPlusThat.me - another rant + something cool<p>Thanks for posting this, very interesting work!