Latent semantic mapping is a technique that takes a large collection of text documents, maps them to term-frequency vectors (vector-space semantics), and performs dimensionality reduction into a smaller semantic space. This lets you measure how similar in meaning different documents are, which is useful for tasks like classification, clustering, and search.<p>Wikipedia: Latent Semantic Mapping
<a href="http://en.wikipedia.org/wiki/Latent_semantic_mapping" rel="nofollow">http://en.wikipedia.org/wiki/Latent_semantic_mapping</a><p>WWDC 2011 talk, now available: "Latent semantic mapping: exposing the meaning behind words and documents"
<a href="https://developer.apple.com/videos/wwdc/2011/" rel="nofollow">https://developer.apple.com/videos/wwdc/2011/</a>
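The pipeline described above (term counts, then SVD-based dimensionality reduction, then similarity in the latent space) can be sketched in a few lines. This is a generic latent-semantic-analysis toy in Python/NumPy, not Apple's LSM implementation; the data is made up for illustration.

```python
import numpy as np

# Toy term-document matrix: rows = documents, columns = term counts.
# Illustrative data only; a real pipeline builds this from a corpus.
docs = np.array([
    [2, 1, 0, 0],   # document about cats
    [1, 2, 0, 0],   # another cat document
    [0, 0, 1, 2],   # document about cars
], dtype=float)

# SVD, then keep only the top-k singular directions -- this truncation
# is the "smaller semantic space".
U, s, Vt = np.linalg.svd(docs, full_matrices=False)
k = 2
reduced = U[:, :k] * s[:k]          # documents embedded in k dimensions

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The two cat documents come out far more similar to each other
# than either is to the car document.
print(cosine(reduced[0], reduced[1]))  # ~1.0
print(cosine(reduced[0], reduced[2]))  # ~0.0
```

In a real system you would also weight the counts (e.g. tf-idf) before the SVD, but the shape of the computation is the same.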
When I was at Apple, we used this to build the Parental Controls web content filter (which I worked on), among other things. It works surprisingly well.
I just can't see Microsoft shipping something like this to every user. This sort of quiet progress is why I like Apple: sure, they highlight the glossy stuff, but beneath the surface there's so much blood-and-guts progress.
So is anything like this available on other platforms? Because it's way faster than <a href="http://classifier.rubyforge.org/" rel="nofollow">http://classifier.rubyforge.org/</a> , even with rb-gsl installed. I'd love it for generating related posts on my Jekyll blog.
I've been playing with some clustering stuff in my free time for the past few months.<p>What I've found is that the problem gets a lot more tractable if you know how many clusters there are.<p>K-means requires this information up front, but as far as I can tell agglomerative techniques don't need it. I wonder why this tool's agglomerative clustering method requires the number of clusters as an argument.
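For what it's worth, in agglomerative clustering the cluster count is usually just a convenient stopping rule: merge the two closest clusters until that many remain. The same loop could instead stop when the closest pair exceeds a distance threshold, which is why the count isn't strictly required. A minimal single-linkage sketch in Python/NumPy (not the tool's actual implementation, toy data made up for illustration):

```python
import numpy as np

def agglomerate(points, n_clusters):
    """Naive single-linkage agglomerative clustering.

    Repeatedly merges the two closest clusters; n_clusters is only
    the stopping criterion. Swapping the while-condition for a
    distance threshold would remove the need to know the count.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])   # merge b into a
        del clusters[b]
    return clusters

pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(agglomerate(pts, 2))  # two tight pairs: [[0, 1], [2, 3]]
```

This is O(n^3) and only for illustration; real implementations maintain a priority queue of pairwise distances.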