TechEcho

10 comments

nanis over 9 years ago

I was in the process of reading this when I thought to check who this person is. Of course, by that time the site had failed, so I haven't read the whole thing yet.

But it seems to me that the author is falling into a trap that many an unwary data "scientist" falls into by not understanding the discipline of Statistics.

When one has the entire population data (i.e. a census) rather than a sample, there is no point in carrying out statistical tests. If I know ALL the words spoken by someone, then I know which words they say the most, without resorting to any tests, simply by counting.

No concept of "statistical significance" is applicable because there is no sample. We can calculate the population value of any parameter we can think of because we have the entire population (in this specific instance, ALL the words spoken by all the characters).

FYI, all budding data "scientists" ...
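To make the "simply by counting" point concrete, here is a minimal sketch in Python. The speaker names and lines are made-up toy data, and `word_counts_by_speaker` is a hypothetical helper, not code from the article. With the full transcript in hand, the most-said words per speaker fall out of a plain tally.

```python
from collections import Counter

# Hypothetical toy "census": every line spoken, as (speaker, text) pairs.
lines = [
    ("Cartman", "screw you guys"),
    ("Kyle", "dude that is not cool"),
    ("Cartman", "you guys are lame"),
]

def word_counts_by_speaker(lines):
    """Return {speaker: Counter of words} by plain counting, no tests involved."""
    counts = {}
    for speaker, text in lines:
        # Lowercase and split on whitespace; real text would need better tokenizing.
        counts.setdefault(speaker, Counter()).update(text.lower().split())
    return counts

counts = word_counts_by_speaker(lines)
top = counts["Cartman"].most_common(2)  # the two words Cartman says most
print(top)
```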

seankross over 9 years ago

Here's the accompanying GitHub repo: https://github.com/walkerkq/textmining_southpark

wodenokoto over 9 years ago

> Reducing the sparsity brought that down to about 3,100 unique words [from 30,600 unique words]

What does that mean? Does he remove words that are only said once or twice?

Can anyone point me to a text explaining the difference between identifying characteristic words using log likelihood and using tf-idf?
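One way to see the difference between the two scores is a rough sketch like the one below. It assumes the common two-corpus log-likelihood (G2) formulation and the textbook tf-idf definition, which may not match what the article computes. The function names and the counts are hypothetical. A word that every speaker uses gets a tf-idf of zero, while log likelihood can still flag it when one speaker uses it at a much higher rate than the rest.

```python
import math

def log_likelihood(a, b, c, d):
    """G2 for a word seen a times in c target words and b times in d other words."""
    # Expected counts if the word were used at the same rate in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    for observed, expected in ((a, e1), (b, e2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2

def tf_idf(term_count, n_docs, docs_with_term):
    """Raw term count weighted by log inverse document frequency."""
    return term_count * math.log(n_docs / docs_with_term)

# Hypothetical counts: one speaker says a word 50 times in 10,000 words,
# everyone else says it 50 times in 90,000 words, and all 4 speakers use it.
g2 = log_likelihood(50, 50, 10_000, 90_000)
score = tf_idf(50, n_docs=4, docs_with_term=4)
print(round(g2, 1), score)
```

Here each speaker's full dialogue is treated as one "document" for tf-idf, which is also an assumption of the sketch.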

cadab over 9 years ago

I've found an image, which I'm guessing is taken from the site: http://imgur.com/IEudyni. Worth looking at if the site's still down.

LoSboccacc over 9 years ago

I would have loved to see log characterization for the Canadian characters, even if they aren't part of the main cast.

dropdatabase over 9 years ago

This is amazing, I wonder what results you'd get from The Simpsons

rhema over 9 years ago

Pretty interesting. This Large Scale Study of Myspace paper (http://www.cc.gatech.edu/projects/doi/Papers/Caverlee_ICWSM_2008.pdf) shows a similar method for finding characteristic terms, using Mutual Information.
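For a feel of how a mutual-information score picks out characteristic terms, here is a small sketch using pointwise mutual information (PMI). This is an assumption about the general idea and not necessarily the exact formulation used in the paper. The group names, counts, and the `pmi_scores` helper are all hypothetical.

```python
import math
from collections import Counter

def pmi_scores(counts_by_group):
    """Return {group: {word: PMI}} from {group: Counter of word counts}."""
    total = sum(sum(c.values()) for c in counts_by_group.values())
    word_totals = Counter()
    for c in counts_by_group.values():
        word_totals.update(c)
    scores = {}
    for group, c in counts_by_group.items():
        group_total = sum(c.values())
        # PMI = log( p(word, group) / (p(word) * p(group)) )
        scores[group] = {
            w: math.log((n / total) / ((word_totals[w] / total) * (group_total / total)))
            for w, n in c.items()
        }
    return scores

# Hypothetical word counts for two groups of speakers.
counts = {
    "A": Counter({"dude": 8, "the": 10}),
    "B": Counter({"the": 10, "sweet": 2}),
}
scores = pmi_scores(counts)
print(scores["A"])
```

A word used only by one group ("dude") scores above zero for that group, while a word shared evenly across groups ("the") scores below zero for the larger group, so sorting by PMI surfaces the group-specific terms.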

peg_leg over 9 years ago

This should be nominated for an Ig Nobel.

agentgt over 9 years ago

I wonder how the results would change if they were based not on words but on lines (not string lines, but actor lines in conversation).

It's also funny how Stan talks more than Kyle, given that the show now has a recurring joke that makes fun of Kyle's long educational dialogues.

gulbrandr over 9 years ago

    Error establishing a database connection

Does anyone have a cached version, please?

Text Mining South Park