I liked this but as other commenters have pointed out it's purely <i>binomial</i> at this stage. The great thing about the central limit theorem is that it is <i>more general</i> than just the limiting Binomial case.<p>So, there's this thing called the cumulant-generating function. It's pretty much defined for any random variable X. If you want to get technical it is the logarithm of the Fourier transform of a probability density function f(x). You're on HN so you probably know the first two jargon words, "logarithm" and "Fourier transform". A "probability density" just means that f(x) dx is the probability for X to be in the interval (x, x + dx). The Fourier transform puts us into a "frequency space" indexed by some variable k, so we can write the CGF as some function c(k), or in other words:<p><pre><code> c(k) = ln[ ∫ dx f(x) exp(i k x) ]
</code></pre>
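If you'd rather poke at this numerically than stare at the integral, here is a minimal sketch (numpy assumed; the standard-normal sanity check and the sample size are my choices, not part of the derivation) that estimates c(k) by averaging exp(i k x) over samples and taking the log:<p><pre><code> # Minimal sketch (numpy assumed): estimate c(k) = ln E[exp(i k X)] by Monte Carlo.
import numpy as np

rng = np.random.default_rng(0)

def empirical_cgf(samples, k):
    # Average exp(i k x) over the samples, then take the complex log.
    return np.log(np.mean(np.exp(1j * k * samples)))

# Sanity check on a standard normal, whose CGF is exactly -k**2 / 2.
x = rng.standard_normal(200_000)
for k in (0.5, 1.0, 2.0):
    print(k, empirical_cgf(x, k), -k**2 / 2)
</code></pre>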
So for the "step left/right" variable which is -1 with probability 1/2 and +1 with probability 1/2, c(k) = ln[cos(k)]. (It can get a little messy when you ask what happens when cos(k) crosses 0 etc, but this function is infinitely-often differentiable on a disc centered at 0 which is all that we need.)<p>It also turns out that since the sum of all the probability ∫ dx f(x) = 1, you can just evaluate this for k=0 as c(0) = ln[ 1 ] = 0. The cumulants are derivatives evaluated at k=0:<p><pre><code> c'(0) = i E[X] = i μ (where μ is the "expectation")
c''(0) = - (E[X²] - E[X]²) = -σ² (where σ² is the "variance")
c'''(0) = -i E[(X - μ)³] = -i σ³ γ (where γ is the "skewness")
</code></pre>
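If you don't trust the signs and the factors of i, a small symbolic check reproduces the pattern [i μ, −σ², −i σ³ γ]. This is a sketch with sympy assumed; the Exponential(1) example, whose characteristic function is 1/(1 − i k), is my addition for a case with nonzero skew.<p><pre><code> # Sketch (sympy assumed): differentiate two CGFs at k = 0 and read off cumulants.
import sympy as sp

k = sp.symbols('k', real=True)
cgfs = {
    "step +/-1":      sp.log(sp.cos(k)),      # μ = 0, σ² = 1, γ = 0
    "Exponential(1)": -sp.log(1 - sp.I * k),  # μ = 1, σ² = 1, third cumulant = 2
}
for name, c in cgfs.items():
    derivs = [sp.diff(c, k, order).subs(k, 0) for order in (1, 2, 3)]
    print(name, derivs)   # expect [i*μ, -σ², -i*σ³*γ]
</code></pre>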
It is not very hard to prove that if I have a sum of two <i>independent</i> random variables X + Y = Z, then their CGFs add like cz(k) = cx(k) + cy(k). This is the Fundamental Awesomeness of Cumulants: cumulants are pseudo-linear; they're linear in <i>independent</i> random variables.<p>It is also not hard to prove that if I have a scalar multiplication U = X / n, then cu(k) = cx(k / n). Combining these, the mean M = (X1 + X2 + ... + Xn) / n of n independent, identically distributed random variables looks like:<p><pre><code> cm(k) = n c(k / n)
</code></pre>
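You can sanity-check this scaling rule by simulation. Here is a sketch (numpy assumed; n = 10 and the trial count are arbitrary choices of mine): the empirical CGF of the sample mean of ±1 steps should track n ln[cos(k/n)].<p><pre><code> # Sketch (numpy assumed): compare the empirical CGF of the mean of n steps
# against the prediction cm(k) = n * c(k/n) with c(k) = ln(cos(k)).
import numpy as np

rng = np.random.default_rng(1)
n, trials = 10, 200_000
means = rng.choice([-1.0, 1.0], size=(trials, n)).mean(axis=1)

for k in (0.5, 1.0, 2.0):
    empirical = np.log(np.mean(np.exp(1j * k * means)))
    predicted = n * np.log(np.cos(k / n))
    print(k, empirical, predicted)
</code></pre>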
Now if you know calculus, you know the rest. Taylor-expand around k = 0 to find:<p><pre><code> cm(k) = i μ k − (σ²/n) k²/2 − i (σ³ γ / n²) k³/6 + ...
</code></pre>
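If you'd rather let a CAS do the expansion, here is a sketch (sympy assumed) using the Exponential(1) CGF −ln(1 − i k), where μ = σ² = 1 and σ³ γ = 2, so the series should come out as i k − k²/(2n) − i k³/(3n²) + ..., with the powers of 1/n accumulating term by term:<p><pre><code> # Sketch (sympy assumed): expand n*c(k/n) for c(k) = -ln(1 - i k), the
# Exponential(1) CGF, and watch each order of k pick up another 1/n.
import sympy as sp

k = sp.symbols('k', real=True)
n = sp.symbols('n', positive=True)
c = -sp.log(1 - sp.I * k)
cm = n * c.subs(k, k / n)
print(sp.series(cm, k, 0, 4))   # expect I*k - k**2/(2*n) - I*k**3/(3*n**2) + O(k**4)
</code></pre>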
We see a <i>geometric attenuation</i> in this series: each new term picks up another factor of 1/n. If we drop every term from c'' on, we're left with cm(k) ≈ i μ k, which inverts to a point mass at M = μ. That's boring, so we keep one more term, to find:<p><pre><code> cm(k) ~= i E[X] k − (Var[X] / n) k² / 2
</code></pre>
This is extra-good if f(x) is symmetric about the mean (so that the skewness vanishes), but it is also pretty good even if the distribution is skewed.<p>We can now invert all of the steps to return to a probability density: you exponentiate, so you get exp(i μ k) exp(−k² σ² / 2n), then you inverse-Fourier-transform, which turns the phase factor exp(i μ k) into a shift x → x − μ and which turns the Gaussian into another Gaussian, so you get a Gaussian centered at x = μ.<p>In other words you get approximately a normal distribution with mean E[X] and variance Var[X] / n, with the error given by convolutions of higher-order terms, and the first correction disappearing when the distribution is symmetric about μ (or otherwise non-skew).
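To see the punchline numerically, here is one last sketch (numpy assumed; Exponential(1) with n = 50 and the trial count are my choices): the means of a skewed distribution already sit close to a normal with mean μ and variance σ²/n.<p><pre><code> # Sketch (numpy assumed): means of n = 50 Exponential(1) draws (μ = σ² = 1,
# skewed) compared against the predicted normal with mean 1 and variance 1/n.
import numpy as np

rng = np.random.default_rng(2)
n, trials = 50, 200_000
means = rng.exponential(scale=1.0, size=(trials, n)).mean(axis=1)

print("mean of the means:", means.mean(), "(predicted 1)")
print("variance of means:", means.var(), "(predicted", 1 / n, ")")
# A normal puts about 68.3% of its mass within one standard deviation of the mean.
print("within 1 sigma:   ", np.mean(np.abs(means - 1.0) < np.sqrt(1 / n)))
</code></pre>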