Text Mining South Park

202 points by eamonncarey over 9 years ago

10 comments

nanis over 9 years ago
I was in the process of reading this when I thought to check who this person is. Of course, by that time the site had failed, so I haven't read the whole thing yet.

But it seems to me that the author is falling into a trap that many an unwary data "scientist" falls into by not understanding the discipline of statistics.

When one has the entire population data (i.e. a census), rather than a sample, there is no point in carrying out statistical tests.

If I know ALL the words spoken by someone, then I know which words they say the most, without resorting to any tests, simply by counting.

No concept of "statistical significance" is applicable because there is no sample. We can calculate the population value of any parameter we can think of because we have the entire population (in this specific instance, ALL the words spoken by all the characters).

FYI, all budding data "scientists" ...
seankross over 9 years ago
Here's the accompanying GitHub repo: https://github.com/walkerkq/textmining_southpark
wodenokoto over 9 years ago
> Reducing the sparsity brought that down to about 3,100 unique words [from 30,600 unique words]

What does that mean? Does he remove words that are only said once or twice?

Can anyone point me to a text explaining the difference between identifying characteristic words using log likelihood and using tf-idf?
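A rough sketch of what the question is about, under stated assumptions. "Reducing the sparsity" often means dropping terms that appear in too few documents (for example via something like `removeSparseTerms` in R's tm package; whether the article did exactly this is an assumption). The toy corpus, the `min_speakers` threshold, and all function names below are hypothetical, and the log likelihood shown is one common two-term form of the G² statistic, not necessarily the article's exact calculation.

```python
import math
from collections import Counter

# Hypothetical toy corpus (speaker -> lines); not data from the article.
corpus = {
    "A": ["hello there friend", "hello again friend"],
    "B": ["goodbye friend", "goodbye now"],
    "C": ["hello friend", "rare word"],
}

# Word counts per speaker; each speaker is treated as one "document".
counts = {s: Counter(w for line in lines for w in line.split())
          for s, lines in corpus.items()}


def drop_sparse_terms(counts, min_speakers):
    """Keep only terms used by at least `min_speakers` speakers."""
    doc_freq = Counter(t for c in counts.values() for t in c)
    keep = {t for t, df in doc_freq.items() if df >= min_speakers}
    return {s: Counter({t: n for t, n in c.items() if t in keep})
            for s, c in counts.items()}


def tf_idf(counts, speaker, term):
    """Term frequency within the speaker times log inverse document frequency."""
    tf = counts[speaker][term] / sum(counts[speaker].values())
    df = sum(1 for c in counts.values() if c[term] > 0)
    return tf * math.log(len(counts) / df) if df else 0.0


def log_likelihood(counts, speaker, term):
    """G^2: observed vs. expected counts for the speaker and for everyone else."""
    a = counts[speaker][term]
    b = sum(c[term] for s, c in counts.items() if s != speaker)
    n1 = sum(counts[speaker].values())
    n2 = sum(sum(c.values()) for s, c in counts.items() if s != speaker)
    e1 = n1 * (a + b) / (n1 + n2)  # expected count for the speaker
    e2 = n2 * (a + b) / (n1 + n2)  # expected count for everyone else
    part = lambda obs, exp: obs * math.log(obs / exp) if obs > 0 else 0.0
    return 2 * (part(a, e1) + part(b, e2))
```

One practical difference this makes visible: with this idf, a word that every speaker uses always scores a tf-idf of zero, whereas log likelihood compares observed and expected counts and can still flag a word that one speaker uses disproportionately often.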
cadab over 9 years ago
I've found an image, which I'm guessing is taken from the site: http://imgur.com/IEudyni. Worth looking at if the site's still down.
LoSboccacc over 9 years ago
I would have loved to see log characterization for the Canadian characters, even if they aren't part of the main cast.
dropdatabase over 9 years ago
This is amazing. I wonder what results you'd get from The Simpsons.
rhema over 9 years ago
Pretty interesting. This Large Scale Study of Myspace paper (http://www.cc.gatech.edu/projects/doi/Papers/Caverlee_ICWSM_2008.pdf) shows a similar method for finding characteristic terms, using mutual information.
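For readers unfamiliar with the idea, here is a minimal sketch of scoring a term for a speaker with pointwise mutual information (PMI). This is an illustration only: the counts and the `pmi` function are hypothetical, and the linked paper's exact formulation may differ.

```python
import math
from collections import Counter

# Hypothetical word counts per speaker; not data from the article or the paper.
counts = {
    "A": Counter({"hello": 3, "friend": 1}),
    "B": Counter({"goodbye": 2, "friend": 2}),
}


def pmi(counts, speaker, term):
    """log2 of P(term, speaker) / (P(term) * P(speaker)), estimated from counts."""
    total = sum(sum(c.values()) for c in counts.values())
    joint = counts[speaker][term] / total
    p_term = sum(c[term] for c in counts.values()) / total
    p_speaker = sum(counts[speaker].values()) / total
    if joint == 0:
        return float("-inf")  # the speaker never uses the term
    return math.log2(joint / (p_term * p_speaker))
```

A positive score means the speaker uses the term more often than their overall share of words would predict; a negative score means less often.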
peg_leg over 9 years ago
This should be nominated for an Ig Nobel.
agentgt over 9 years ago
I wonder how the results would change if they were based not on words but on lines (not string lines but actor lines in conversation).

It's also funny how Stan talks more than Kyle, given the show now has a recurring joke that makes fun of Kyle's long educational dialogues.
gulbrandr over 9 years ago
    Error establishing a database connection

Does someone have a cached version, please?