I was in the process of reading this when I thought to check who this person is. Of course, by that time the site had failed, so I haven't read the whole thing yet.<p>But it seems to me that the author is falling into a trap that many an unwary data "scientist" falls into by not understanding the discipline of Statistics.<p>When one has the entire population data (i.e. a census), rather than a sample, there is no point in carrying out statistical tests.<p>If I know <i>ALL</i> the words spoken by someone, then I know which words they say the most simply by counting, without resorting to any tests.<p>No concept of "statistical significance" is applicable because there is no sample. We can calculate the population value of any parameter we can think of, because we have the entire population (in this specific instance, <i>ALL</i> the words spoken by all the characters).<p>FYI, all budding data "scientists" ...
> Reducing the sparsity brought that down to about 3,100 unique words [from 30,600 unique words]<p>What does that mean? Does he remove words that are only said once or twice?<p>Can anyone point me to a text explaining the difference between identifying characteristic words using <i>Log Likelihood</i> and using <i>tfidf</i>?
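Not a text, but for what it's worth, here is a minimal sketch of how the two measures can disagree. Everything in it is an assumption made for illustration: the three speakers and their words are made up, the function names are mine, the log-likelihood is the common two-term G2 form (one speaker against everyone else), and the tf-idf is the plain unsmoothed variant. It is not necessarily what the article's author did.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Two-term G2 score for one word.
    a = count in the target speaker's words, b = count in everyone else's,
    c = total words for the target speaker, d = total words for everyone else."""
    e1 = c * (a + b) / (c + d)  # expected count for the target speaker
    e2 = d * (a + b) / (c + d)  # expected count for everyone else
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2

def tf_idf(word, doc, all_docs):
    """Plain tf-idf (no smoothing), treating each speaker as one document."""
    df = sum(1 for other in all_docs if other[word] > 0)
    if df == 0:
        return 0.0
    tf = doc[word] / sum(doc.values())
    return tf * math.log(len(all_docs) / df)

# Made-up toy corpus: one bag of words per hypothetical speaker.
speech = {
    "A": Counter("dude dude that is sweet dude".split()),
    "B": Counter("dude that is not sweet at all".split()),
    "C": Counter("dude that is seriously not okay".split()),
}

def scores(word, speaker):
    """Return (log-likelihood, tf-idf) of a word for one speaker."""
    target = speech[speaker]
    rest = Counter()
    for name, counts in speech.items():
        if name != speaker:
            rest.update(counts)
    ll = log_likelihood(target[word], rest[word],
                        sum(target.values()), sum(rest.values()))
    return ll, tf_idf(word, target, list(speech.values()))

print(scores("dude", "A"))
print(scores("sweet", "A"))
```

In this toy data every speaker says "dude" at least once, so its tf-idf is zero for everyone, while the log-likelihood for speaker A is still positive because A says it at a much higher rate than the others. That is the main practical difference as I understand it: tf-idf discounts words by how many speakers use them at all, whereas log-likelihood compares how often each speaker uses them.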
I've found an image, which I'm guessing is taken from the site: <a href="http://imgur.com/IEudyni" rel="nofollow">http://imgur.com/IEudyni</a>. Worth looking at if the site's still down.
Pretty interesting. This Large Scale Study of Myspace (<a href="http://www.cc.gatech.edu/projects/doi/Papers/Caverlee_ICWSM_2008.pdf" rel="nofollow">http://www.cc.gatech.edu/projects/doi/Papers/Caverlee_ICWSM_...</a>) paper shows a similar method for finding characteristic terms, using Mutual Information.
I wonder how the results would change if they were based not on words but on lines (not lines of text, but a character's lines in conversation).<p>It's also funny that Stan talks more than Kyle, given the show now has a recurring joke that makes fun of Kyle's long educational dialogues.
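To make the words-versus-lines point concrete, here is a small sketch. The transcript is made up and the speakers "A" and "B" are placeholders, so this says nothing about the actual show; it only shows that the two ways of counting can rank speakers differently.

```python
from collections import Counter

# Hypothetical transcript as (speaker, line) pairs, invented for illustration.
script = [
    ("A", "Hey."),
    ("B", "Well, I think there is an important lesson in all of this for everyone."),
    ("A", "Sure."),
    ("A", "Whatever."),
]

# Count how many lines of dialogue each speaker has.
lines_per_speaker = Counter(speaker for speaker, _ in script)

# Count how many whitespace-separated words each speaker says.
words_per_speaker = Counter()
for speaker, line in script:
    words_per_speaker[speaker] += len(line.split())

print(lines_per_speaker.most_common())
print(words_per_speaker.most_common())
```

Here "A" has the most lines while "B" has the most words, so a character who gives a few long speeches could look quiet by one measure and talkative by the other.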