
Ask HN: Algorithm to identify the same events based on semantic similarity

16 points, by btw0, over 16 years ago
I am working on a project that needs to identify the same events described in short phrases by different people. I think the workflow should be something like:

1. Extract the meaningful words (nouns, verbs) from the phrases.
2. Measure the semantic distances between these words.
3. Cluster.

I am asking for your advice. Thank you.
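The three steps above can be sketched with plain bag-of-words machinery. This is a minimal illustration, not a real solution: the stopword list is a crude stand-in for POS tagging, and the similarity threshold is an arbitrary illustrative value.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "in", "on", "at", "to", "of", "and"}

def tokens(phrase):
    # Step 1: keep the meaningful words (crude stand-in for POS filtering)
    return [w for w in re.findall(r"[a-z']+", phrase.lower()) if w not in STOPWORDS]

def cosine(a, b):
    # Step 2: similarity between bag-of-words count vectors
    va, vb = Counter(a), Counter(b)
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(phrases, threshold=0.5):
    # Step 3: greedy single-pass clustering against each cluster's first member
    clusters = []
    for p in phrases:
        for c in clusters:
            if cosine(tokens(p), tokens(c[0])) >= threshold:
                c.append(p)
                break
        else:
            clusters.append([p])
    return clusters
```

Running `cluster` on two phrasings of the same event plus one unrelated phrase groups the first two together and leaves the third in its own cluster.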

6 comments

aristus, over 16 years ago
The current consensus is to use statistics without trying to write a program that "understands" human language.

A common approach is to break your items down into a vector of "features" (words, phrases, tokens, etc.) and apply standard IR techniques like kNN, singular value decomposition (SVD), tf-idf, etc.

One interesting thing about your problem is that people use different words and phrasing for the same thing, which means you need a way to equate different tokens with each other. SVD is pretty good because it finds degrees of similarity based on context: 'Pikachu' becomes strongly associated with 'Pokemon' and 'Squirtle', and also with misspellings like 'Picachoo'. SVD is tricky and expensive if you implement it wrong. The good news is that it recently came off-patent, so there are a lot of new libraries popping up.

When I worked on a keyword search engine prototype for Brazil, I found that a mix of Tanimoto and Ferber similarity scores worked *very* well for finding similarity between short phrases, and I didn't even need to know how to read Portuguese.

PressFlip is using support vector machines for a similar task of clustering news items. Vowpal Wabbit is a stellar library used by Yahoo; I think they recently open-sourced it. allvoices.com is tackling the same area as well. (Disclaimer: I am a dev at allvoices, though I don't work on that part. :)
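The SVD idea above can be shown on a toy corpus (my own illustrative example, not aristus's actual setup): build a term-document count matrix, truncate the SVD to a few latent dimensions, and compare documents there. The two Pokemon-themed documents land close together even though one contains the misspelling 'picachoo' that the other lacks.

```python
import numpy as np

# toy corpus: two overlapping "Pokemon" docs and one unrelated doc
docs = [
    "pikachu pokemon squirtle",
    "pikachu pokemon picachoo",   # 'picachoo': misspelling, shared context
    "stock market interest rates",
]
vocab = sorted({w for d in docs for w in d.split()})

# term-document count matrix (rows = words, columns = documents)
A = np.array([[d.split().count(w) for d in docs] for w in vocab], dtype=float)

# truncated SVD: keep only the top-k latent dimensions
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T   # each row: a document in latent space

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

In the reduced space, `cos(doc_vecs[0], doc_vecs[1])` is near 1 while either Pokemon document is orthogonal to the finance one.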
Comment #302127 not loaded
Comment #302217 not loaded
haidut, over 16 years ago
OK, there is a pretty "simple" way to do it, but I can't divulge all the details because we are using it for my semantic news startup. However, I can give you a hint on what you need to do. You need to find a way to reduce different words to their "base" semantics and then compute a similarity measure of sorts between the different event descriptions.

Traditionally, people have broken the text down into word vectors and then calculated a Pearson, Tanimoto, or similar similarity measure between the vectors. The problem with this approach is not the similarity measure but the fact that (as aristus points out) different people use different words to describe the same thing. So computing the Tanimoto score of two sentences like "Bush visits Iraq to encourage soldiers" and "US President arrives in Baghdad to raise troops' morale" will produce a very low similarity score using just the literal words in both sentences, even though in reality the two are almost identical in meaning.

Now suppose there were a way to reduce the semantics of both sentences down to their "base". For sentence one you would have something like "US President present in Middle Eastern country to increase spirit of soldiers", and for sentence two, "US President present in capital of Middle Eastern country to increase spirit of soldiers". If you then take the Tanimoto measure (or any other decent similarity measure) between the semantic "base" of the two sentences, you get a very high score, which is what is needed.

Like I said, I can't divulge the method for reducing text to its semantic "base", but there is definitely at least one way, which is what we are using. If you are interested, send me an email at haidut@gmail.com, and when I launch my startup within the next couple of weeks I can hopefully divulge more.
Sorry for my vagueness, but my lawyers will make mincemeat out of me if I say more. :-) Here is something I can divulge: you can use Pointwise Mutual Information (PMI) to measure the semantic relatedness of words across the two event descriptions, and if the overall cross-word similarity is above a certain threshold, you can consider the two event descriptions similar. To learn how to calculate PMI and why it is useful, read this paper: http://cogprints.org/1796/0/ECML2001.pdf

Overall, what you are trying to do is simple in concept but not easy to implement. For semantic similarity to be computed automatically by a machine, you would need a mapping of EVERY word in EVERY language to its semantic "base" so you could build a table and compute this efficiently. No such mapping currently exists, and one reason is the constantly changing nature of the English language itself: new words appear all the time, and even if you could track them all, some have no semantic "base". Using the example above, the fact that Bush is currently President won't be true in a couple of months, so that huge word-to-meaning mapping table would have to account for that. Mapping the English language to a semantic base is a (very fast) moving target.

That's all I have for now. Good luck in your endeavor.
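The PMI suggestion above can be sketched in a few lines. The toy corpus and the document-level co-occurrence window here are my own illustrative choices, not the setup from the linked paper: PMI compares how often two words co-occur against how often they would co-occur by chance.

```python
import math
from collections import Counter
from itertools import combinations

# toy background "corpus": each document is a set of words
corpus = [
    {"president", "visits", "troops"},
    {"president", "bush", "troops"},
    {"bush", "visits", "iraq"},
    {"stock", "market", "rates"},
]
N = len(corpus)
word_count = Counter(w for doc in corpus for w in doc)
pair_count = Counter(frozenset(p) for doc in corpus
                     for p in combinations(sorted(doc), 2))

def pmi(x, y):
    # PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), with document-level
    # co-occurrence as the joint probability estimate
    joint = pair_count[frozenset((x, y))] / N
    if joint == 0:
        return float("-inf")
    return math.log2(joint / ((word_count[x] / N) * (word_count[y] / N)))
```

Words that tend to appear in the same documents ("president", "troops") score above zero; words that never co-occur score negative infinity.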
sbt, over 16 years ago
What you are trying to do is very hard. Statistical NLP is the way to go for unbounded domains; you might be able to do better semantic analysis for smaller, closed domains.

There is a relatively new, great book on NLP out now that I suggest you take a look at. The chapters on semantics in particular are very useful, but they should also give you an idea of how incredibly difficult what you're trying to do is.

Book: http://www.amazon.com/Language-Processing-Prentice-Artificial-Intelligence/dp/0131873210/ref=sr_1_1?ie=UTF8&s=books&qid=1221233467&sr=1-1
joshu, over 16 years ago
I recommend an IR and/or NLP book.

Check into Jaccard distance: http://en.wikipedia.org/wiki/Jaccard_index

You might consider looking at WordNet for generalizing words.
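Both suggestions can be combined in a small sketch: Jaccard similarity over token sets, with a tiny hand-written synonym table standing in for WordNet-style generalization. The table entries are illustrative placeholders, not actual WordNet data.

```python
# stand-in for WordNet: map words to a canonical "base" form
SYNONYMS = {"bush": "president", "troops": "soldiers", "baghdad": "iraq"}

def normalize(phrase):
    # lowercase, split, and generalize each word via the synonym table
    return {SYNONYMS.get(w, w) for w in phrase.lower().split()}

def jaccard(a, b):
    # Jaccard index: |A ∩ B| / |A ∪ B| over normalized token sets
    sa, sb = normalize(a), normalize(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0
```

After normalization, "Bush visits troops" and "President visits soldiers" score a perfect 1.0, while unrelated phrases score 0.0; without the generalization step, the first pair would share only the word "visits".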
MaysonL, over 16 years ago
There was a great Google tech talk posted here a while ago that might be applicable: "The Next Generation of Neural Networks" by Geoffrey Hinton: http://www.youtube.com/watch?v=AyzOUbkUf3M
snprbob86, over 16 years ago
Will these people be mentioning the same URLs, by chance? :-)