The Easiest Data Analysis Mistake to Make

44 points by lejohnq, over 11 years ago

8 comments

lpage, over 11 years ago
This is a very simple and very pathological example that's easily ferreted out with a few more summary statistics (median, min, max), but it's a good illustration of the blind application of statistics. Short of visualization, non-parametric statistics really help with such things. Correlation is a fragile, linear measure, and things that are obviously correlated by inspection can easily appear mathematically uncorrelated (points on a unit circle, for example). Likewise, the mean of any skewed distribution tells you very little, but that's the statistic that's always cited. Quantiles, medians, and non-parametric measures of correlation such as rank correlation are simple and often overlooked. They do a good job screening for pathological data sets like Anscombe's quartet and real-world ones.

It's also worth mentioning "dumbbell" data sets. Two clusters of data, each of which has an independent, meaningful correlation in it, can easily leverage a linear regression into a meaningless line passing through the two clusters. That's a pretty common issue with high-dimensional data (obviously you can see it in a 2D scatter plot), and it's not easily caught short of looking at regression diagnostic statistics.
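A minimal sketch of the three cases lpage describes, using only the Python standard library. The helper names (`pearson`, `ranks`, `spearman`) and all of the data are made up for illustration and are not from the article; the ranking helper assumes there are no ties.

```python
import math


def pearson(xs, ys):
    """Pearson (linear) correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks 1..n of the values. Assumes no ties, to keep the sketch short."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# 1. Points on a unit circle: an obvious relationship, yet no linear correlation.
n = 360
angles = [2 * math.pi * k / n for k in range(n)]
circle_x = [math.cos(t) for t in angles]
circle_y = [math.sin(t) for t in angles]
circle_r = pearson(circle_x, circle_y)

# 2. A monotone but nonlinear relationship: the linear measure understates it,
#    the rank-based measure does not.
xs = list(range(10))
ys = [math.exp(x) for x in xs]
linear_r = pearson(xs, ys)
rank_r = spearman(xs, ys)

# 3. A "dumbbell": each cluster slopes downward, but the pooled data slope upward.
left_x, left_y = [0, 1, 2, 3, 4], [4, 3, 2, 1, 0]
right_x, right_y = [10, 11, 12, 13, 14], [14, 13, 12, 11, 10]
left_r = pearson(left_x, left_y)
right_r = pearson(right_x, right_y)
pooled_r = pearson(left_x + right_x, left_y + right_y)

print(f"unit circle, Pearson:  {circle_r:.3f}")
print(f"y = exp(x), Pearson:   {linear_r:.3f}")
print(f"y = exp(x), Spearman:  {rank_r:.3f}")
print(f"dumbbell, per cluster: {left_r:.3f}, {right_r:.3f}")
print(f"dumbbell, pooled:      {pooled_r:.3f}")
```

In the dumbbell case the pooled coefficient is strongly positive even though both clusters are perfectly negatively correlated, which is the "meaningless line through the two clusters" effect. For real data with ties, a library implementation of rank correlation would be a safer choice than this helper.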
brian_peiris, over 11 years ago
I think, typically, if you've gone to the trouble of calculating variance and correlation, you would have also calculated the median and mode of these datasets. The differences would have been obvious with those basic analyses.
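As a rough illustration of that point, here is a sketch with two small made-up datasets (not the ones from the article) that share the same mean and population variance but differ in median, mode, and range. The `summary` helper is a hypothetical name introduced here.

```python
import statistics

# Hypothetical data, chosen so that mean and population variance coincide.
a = [1, 4, 5, 5, 7, 8]
b = [4, 4, 4, 4, 4, 10]


def summary(data):
    """Collect a handful of basic summary statistics for one dataset."""
    return {
        "mean": statistics.mean(data),
        "pvariance": statistics.pvariance(data),
        "median": statistics.median(data),
        "mode": statistics.mode(data),
        "min": min(data),
        "max": max(data),
    }


summary_a = summary(a)
summary_b = summary(b)

# Print the statistics side by side; the first two rows match, the rest do not.
for key in summary_a:
    print(f"{key:>9}: a={summary_a[key]!s:<5} b={summary_b[key]}")
```

Mean and variance alone cannot tell these two apart, while the median, mode, minimum, and maximum all differ.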
carlosgg, over 11 years ago
http://en.wikipedia.org/wiki/Anscombe's_quartet
mrcactu5, over 11 years ago
Statwing looks great!

Right now I do my data analysis in numpy, but this looks good for my Excel-based colleagues.

What library is doing the statistics?
timruffles, over 11 years ago
So a concept you find in a beginner stats textbook is now 'news'? Definition of blogspam, surely...
davidmanescu, over 11 years ago
My favourite part of this (aside from the message) is that it links back to this exact discussion page.
ejain, over 11 years ago
An equally common mistake is to visualize without analyzing :-)
medagan, over 11 years ago
Who doesn't look at the data range, min, max, mode???