Text Mining South Park

202 points by eamonncarey, over 9 years ago

10 comments

nanis, over 9 years ago
I was in the process of reading this when I thought to check who this person is. Of course, by that time the site had failed, so I haven't read the whole thing yet.

But it seems to me that the author is falling into a trap that many an unwary data "scientist" falls into by not understanding the discipline of Statistics.

When one has the entire population data (i.e. a census), rather than a sample, there is no point in carrying out statistical tests.

If I know ALL the words spoken by someone, then I know which words they say the most without resorting to any tests, simply by counting.

No concept of "statistical significance" is applicable because there is no sample. We can calculate the population value of any parameter we can think of because we have the entire population (in this specific instance, ALL the words spoken by all the characters).

FYI, all budding data "scientists" ...
seankross, over 9 years ago
Here's the accompanying GitHub repo: https://github.com/walkerkq/textmining_southpark
wodenokoto, over 9 years ago
> Reducing the sparsity brought that down to about 3,100 unique words [from 30,600 unique words]

What does that mean? Does he remove words that are only said once or twice?

Can anyone point me to a text explaining the difference between Identifying Characteristic Words using log likelihood and using tf-idf?
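For readers with the same question, here is a rough sketch of the three ideas side by side. It assumes that "reducing the sparsity" means dropping terms that occur in very few documents, which is a guess about the article and not something the thread confirms. The log-likelihood function is the two-cell G2 statistic commonly used for corpus comparison, which may differ from whatever the article computed. The toy corpus and every function name are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not data from the article): one token list per speaker.
docs = {
    "A": "dude dude sweet dude we go home".split(),
    "B": "we go home now we go".split(),
    "C": "respect authority we go now".split(),
}

def drop_sparse_terms(docs, min_docs=2):
    """Keep only terms that occur in at least `min_docs` documents (assumed meaning of 'sparsity')."""
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    return {name: [t for t in tokens if doc_freq[t] >= min_docs]
            for name, tokens in docs.items()}

def tf_idf(docs, speaker, term):
    """Relative frequency for one speaker, scaled by log inverse document frequency."""
    tokens = docs[speaker]
    tf = tokens.count(term) / len(tokens)
    df = sum(1 for other in docs.values() if term in other)
    return tf * math.log(len(docs) / df)

def log_likelihood(docs, speaker, term):
    """Two-cell G2: observed vs expected counts, this speaker against all others combined."""
    a = docs[speaker].count(term)                                   # count for this speaker
    b = sum(t.count(term) for n, t in docs.items() if n != speaker) # count for everyone else
    c = len(docs[speaker])                                          # speaker's total tokens
    d = sum(len(t) for n, t in docs.items() if n != speaker)        # everyone else's total tokens
    expected_a = c * (a + b) / (c + d)
    expected_b = d * (a + b) / (c + d)
    g2 = 0.0
    for observed, expected in ((a, expected_a), (b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2

print(drop_sparse_terms(docs)["A"])
print(tf_idf(docs, "A", "dude"), log_likelihood(docs, "A", "dude"))
print(tf_idf(docs, "A", "we"), log_likelihood(docs, "A", "we"))
```

One practical difference this shows: with this tf-idf definition, any word used by every speaker scores exactly zero no matter how unevenly it is used, while the log-likelihood score compares observed and expected counts and can still be nonzero for such a word.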
cadab, over 9 years ago
I've found an image, which I'm guessing is taken from the site: http://imgur.com/IEudyni. Worth looking at if the site's still down.
LoSboccacc, over 9 years ago
I would have loved to see log characterization for the Canadian characters, even if they aren't part of the main cast.
dropdatabase, over 9 years ago
This is amazing. I wonder what results you'd get from The Simpsons.
rhema, over 9 years ago
Pretty interesting. This Large Scale Study of Myspace paper (http://www.cc.gatech.edu/projects/doi/Papers/Caverlee_ICWSM_2008.pdf) shows a similar method for finding characteristic terms, using mutual information.
peg_leg, over 9 years ago
This should be nominated for an Ig Nobel.
agentgt, over 9 years ago
I wonder how the results would change if it were based not on words but on lines (not string lines, but actor lines in conversation).

It's also funny how Stan talks more than Kyle, given the show now has a recurring joke that makes fun of Kyle's long educational dialogues.
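To make the words-versus-lines distinction concrete, here is a minimal sketch. The script data is invented for illustration and is not taken from the show or from the article's dataset. The `talk_totals` helper is likewise hypothetical.

```python
from collections import Counter

# Hypothetical (speaker, text) pairs; each pair is one spoken line.
script = [
    ("Stan", "Dude."),
    ("Kyle", "You see, I learned something today about how all of this works."),
    ("Stan", "Oh my God."),
    ("Stan", "Come on."),
]

def talk_totals(script):
    """Return (lines per speaker, words per speaker) as two Counters."""
    lines = Counter()
    words = Counter()
    for speaker, text in script:
        lines[speaker] += 1                  # one entry = one spoken line
        words[speaker] += len(text.split())  # whitespace-separated word count
    return lines, words

lines, words = talk_totals(script)
print("by lines:", lines.most_common())
print("by words:", words.most_common())
```

In this made-up data the ranking flips depending on the unit: one speaker has more lines while the other has more words, which is the kind of shift the comment is asking about.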
gulbrandr, over 9 years ago
    Error establishing a database connection

Does anyone have a cached version, please?