This is a good layman's introduction to modern search techniques, but to someone not in the SEO field it feels like a very strange inversion of priorities. To me, like most people, the surprise is how effective techniques like LDA[1] can be in characterizing a document, but the 'surprise' in the article is that LDA correlates with Google search order better than a more simplistic model.<p>To a technologically savvy but naive outsider, this might seem obvious: shouldn't pages that rank highly in Google have strong topic-based correlation to pages that the user wants to see? But from the SEO perspective, I guess the conclusion would be that your page is more likely to be ranked highly if it includes all the trappings of other highly ranked pages, with, you know, like synonyms and stuff. At a certain point, one has to start thinking: wouldn't it be simpler to make a page that people actually want to find?<p>Are there good examples of actually useful pages that Google doesn't do a good job of ranking? Lately I occasionally find myself getting frustrated with Google ignoring my rarer search terms, but generally I find the good pages are at the top, if they exist at all.<p>[1] LDA is Latent Dirichlet Allocation, which is very similar to Latent Semantic Analysis, which in turn is very similar to Principal Component Analysis and Singular Value Decomposition. So it's possible you've already heard of the concept, but coming at it from another angle in another field.
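To make the footnote concrete, here's a minimal sketch of what characterizing documents with LDA looks like in practice. This assumes scikit-learn (my choice of library, not anything from the article), and the toy corpus and topic count are made up for illustration:

    # Minimal LDA sketch: infer a topic mixture for each toy document.
    # scikit-learn, the corpus, and n_components are my assumptions.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "google search ranking pagerank links",
        "dirichlet topic model inference sampling",
        "seo keywords synonyms ranking traffic",
    ]

    # LDA works on raw term counts, not tf-idf weights.
    counts = CountVectorizer().fit_transform(docs)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topics = lda.fit_transform(counts)  # rows: documents, cols: topic weights

    print(doc_topics)  # each row is (roughly) a probability distribution over topics

The point is that every document gets a low-dimensional topic fingerprint, which is what the article is correlating against search rankings.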
This is news? Seriously????<p>They have found a correlation between a set of words related to the topic you are searching for and how highly a search engine ranks that page?<p>Well, duh! Did anyone really think search engines did a keyword search and then applied PageRank/HITS (<a href="http://en.wikipedia.org/wiki/HITS_algorithm" rel="nofollow">http://en.wikipedia.org/wiki/HITS_algorithm</a>) or whatever? That would give dreadful results.<p>If you really want to understand this, I recommend <i>Building a Vector Space Search Engine in Perl</i>: <a href="http://perl.about.com/b/2007/05/24/building-a-vector-space-search-engine-in-perl.htm" rel="nofollow">http://perl.about.com/b/2007/05/24/building-a-vector-space-s...</a><p>I built the vector space classifier in <a href="http://classifier4j.sf.net" rel="nofollow">http://classifier4j.sf.net</a> based almost entirely on that article, even though I don't know Perl. It's very readable and gives you a great understanding.
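For anyone who doesn't want to read Perl, here's a rough Python sketch of the same idea as I understand it (the names and toy data are mine, not the article's): documents become term-count vectors, and ranking is just cosine similarity between the query vector and each document vector.

    # Toy vector space search engine: term-count vectors + cosine similarity.
    # A sketch of the article's idea, not a translation of its actual code.
    import math
    from collections import Counter

    def vectorize(text):
        return Counter(text.lower().split())

    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a if t in b)
        norm = math.sqrt(sum(v * v for v in a.values())) * \
               math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    docs = {
        "d1": "the quick brown fox jumps over the lazy dog",
        "d2": "building a vector space search engine in perl",
    }
    index = {name: vectorize(text) for name, text in docs.items()}

    def search(query):
        q = vectorize(query)
        return sorted(index, key=lambda name: cosine(q, index[name]), reverse=True)

    print(search("vector space search"))  # ranks d2 above d1

A real engine would add tf-idf weighting and an inverted index, but that's the core of the vector space model the article builds on.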
I sometimes use LDA (via Hadoop and Mahout), and it is not an inexpensive calculation for large document sets. I wonder what the costs are of running this at large scale.