Why you should be wary of relying on a single histogram of a data set

81 points, posted by aton about 12 years ago

6 comments

jfim, about 12 years ago
As mentioned, one should really be using a kernel density plot instead of a histogram, except when there are already classes in the data.

In R, one can simply do:

    library("ggplot2")
    library("datasets")
    ggplot(faithful, aes(x=eruptions)) + geom_density() + geom_rug()

which gives a chart like this (http://jean-francois.im/temp/eruptions-kde.png). Contrast with:

    ggplot(faithful, aes(x=eruptions)) + geom_histogram(binwidth=1)

which gives a chart like this (http://jean-francois.im/temp/eruptions-histogram.png).

Edit: Other plots mentioned in this discussion:

    ggplot(faithful, aes(x = eruptions)) + stat_ecdf(geom = "step")

Cumulative distribution, as suggested by leot (http://jean-francois.im/temp/eruptions-ecdf.png)

    qqnorm(faithful$eruptions)

Q-Q plot, as suggested by christopheraden (http://jean-francois.im/temp/eruptions-qq.png)
leot, about 12 years ago
Yes, probability density estimation might be fun, but the simplest thing to do when comparing distributions, if you're worried about binning issues, is to plot their empirical cumulative distribution functions.
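A minimal base R sketch of that suggestion, for readers who want to try it. The samples `a` and `b` are hypothetical stand-ins (not the article's data): `b` is simply `a` shifted by 0.25, so the two ECDFs should have the same shape with a constant horizontal offset, and no bin width or bin origin has to be chosen.

```r
# Hypothetical data: 'a' is a made-up sample, 'b' is the same sample shifted by 0.25.
set.seed(1)
a <- round(runif(40, min = 2, max = 6), 2)
b <- a + 0.25

Fa <- ecdf(a)  # step function: fraction of 'a' that is <= t
Fb <- ecdf(b)

# Overlay the two empirical CDFs; each jumps by 1/n at every observed value.
plot(Fa, verticals = TRUE, do.points = FALSE, col = "red",
     xlim = range(c(a, b)), main = "Empirical CDFs", xlab = "value")
lines(Fb, verticals = TRUE, do.points = FALSE, col = "blue")

# A pure shift means the curves agree once the offset is accounted for:
# the height of Fa at each a[i] matches the height of Fb at the shifted point b[i].
same_shape <- isTRUE(all.equal(Fa(a), Fb(b)))
print(same_shape)
```

Unlike a histogram, the plot uses every observation as-is, so there is nothing to tune before comparing the two samples.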
dude_abides, about 12 years ago
This is what you should be doing:

    plot(density(Annie), col="red")
    lines(density(Brian), col="blue")
    lines(density(Chris), col="green")
    lines(density(Zoe), col="cyan")

This is the plot you get: http://i.imgur.com/sY2awX7.png
tantalor, about 12 years ago
Reminds me of http://en.wikipedia.org/wiki/Simpsons_paradox
christopheraden, about 12 years ago
Interesting paradox. I haven't seen that many statisticians using just a histogram when determining whether a certain distribution fits data reasonably. Kernel Density Estimators are a much better choice (for continuous data, like the data in the post), but they are also affected by your choice of bandwidth. When it comes down to it, like going to the doctor, sometimes the best choice is to get a second (or third!) opinion. For what it's worth, drawing a QQ Plot (something I've seen in every statistical consultation I've ever done) reveals the dependent structure of the data immediately and obviously in the form of a perfect linear relationship between any two variables.
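Both points can be tried out in base R. This is only a sketch under stated assumptions: `x` and `y` are hypothetical samples (not the article's data), with `y` equal to `x` plus a constant, and the `adjust` values are arbitrary choices meant to exaggerate the bandwidth effect.

```r
# Hypothetical data standing in for the article's samples:
# 'y' is just 'x' shifted by a constant.
set.seed(42)
x <- round(runif(40, min = 2, max = 6), 2)
y <- x + 0.5

# Same data, three bandwidths: 'adjust' multiplies the default bandwidth.
d_narrow  <- density(x, adjust = 0.25)
d_default <- density(x)
d_wide    <- density(x, adjust = 4)

plot(d_narrow, col = "red", main = "KDE of x at three bandwidths")
lines(d_default, col = "black")
lines(d_wide, col = "blue")

# Two-sample Q-Q plot: quantiles of x against quantiles of y.
qq <- qqplot(x, y, main = "Q-Q plot of x vs. y")
abline(a = 0.5, b = 1, lty = 2)  # reference line y = x + 0.5

# For a pure shift the quantile pairs are collinear,
# so the correlation should be 1 up to floating-point error.
qq_cor <- cor(qq$x, qq$y)
print(qq_cor)
```

The narrow bandwidth tends to produce a spiky curve and the wide one a single smooth bump, which is the sense in which a KDE still depends on a tuning choice. The Q-Q plot has no such parameter.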
radarsat1, about 12 years ago
Is this basically just an effect of quantization aliasing?
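Whatever name one gives it, the underlying effect is easy to reproduce. The sketch below uses a small hypothetical data set (chosen by hand, not taken from the article) whose points straddle the bin edges. Shifting every value by 0.5 while keeping the same unit-wide bins turns a flat histogram into a two-peaked one.

```r
# Hypothetical data: two tight groups straddling the integers 1 and 3.
x <- c(0.8, 0.9, 1.1, 1.2, 2.8, 2.9, 3.1, 3.2)
shifted <- x + 0.5          # identical shape, moved right by half a bin

breaks <- 0:4               # unit-wide bins with edges at the integers

h_orig    <- hist(x,       breaks = breaks, plot = FALSE)
h_shifted <- hist(shifted, breaks = breaks, plot = FALSE)

print(h_orig$counts)     # 2 2 2 2  (looks uniform)
print(h_shifted$counts)  # 0 4 0 4  (looks bimodal)
```

The bin edges act like a fixed sampling grid: where the data fall relative to that grid decides which picture you get, even though the two samples differ only by a constant.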