Hi...<p>I'm trying to figure out what ways there are to compare two separate articles and determine whether they are the same...<p>I'm currently researching semantic analysis, but figured I'd turn here as well...<p>Thoughts/comments?<p>Thanks<p>bd
Hi Bd,
Ironically, just yesterday I uploaded a tech demo of something I call kindling, which attempts to correlate articles against news feeds from social websites.<p>I read a book called Programming Collective Intelligence by Toby Segaran. It's basically machine learning for dummies, very example heavy, all in Python.<p>He talks about clustering to group like things together in an unsupervised way. The way this works is to build a vector of word counts from each article and compare the vectors using the Pearson correlation (the more correlated two vectors are, the smaller the distance between the articles). The vector of words is known as a feature set. Early on you create this vector in a naive way (i.e. eliminate words that don't show up often enough and words that show up too often). At the end of the book he talks about feature detection (which I assume is building this vector in a smarter way).<p>The book really helped me. Pearson correlation is pretty easy to grasp and implement as well.<p>Good luck.
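For what it's worth, here is a rough sketch of the word-vector plus Pearson idea described above. It is not the book's code; the function names (`word_counts`, `pearson`, `article_similarity`) and the simple regex tokenizer are my own assumptions, and it skips the step of dropping very rare and very common words.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count how often each word appears.

    Assumption: a word is a run of ASCII letters or apostrophes.
    """
    return Counter(re.findall(r"[a-z']+", text.lower()))


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Returns 0.0 when either sequence has no variance (correlation undefined).
    """
    n = len(xs)
    if n == 0:
        return 0.0
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom else 0.0


def article_similarity(a, b):
    """Build word-count vectors over the shared vocabulary and correlate them."""
    ca, cb = word_counts(a), word_counts(b)
    vocab = sorted(set(ca) | set(cb))  # same word order for both vectors
    return pearson([ca[w] for w in vocab], [cb[w] for w in vocab])
```

Scores near 1 suggest the two articles use words in very similar proportions; if you need a distance for clustering, one common choice is `1 - article_similarity(a, b)`. On real articles you would probably want to filter the vocabulary first, as mentioned above.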
There's a great Google tech talk on this subject:<p><a href="http://www.youtube.com/watch?v=AyzOUbkUf3M" rel="nofollow">http://www.youtube.com/watch?v=AyzOUbkUf3M</a>