> Statisticians are quick to reach for the Central Limit Theorem, but I think there’s a deeper, more intuitive, more powerful reason.<p>> The Normal Distribution is your best guess if you only know the mean and the variance of your data.<p>This is putting the cart before the horse, for sure. The reason you only know the mean and the variance of your data is that you chose to summarize your data that way. And the reason you chose to summarize your data that way is <i>in order to get the normal distribution</i> as the maximum entropy distribution.<p>The normal distribution appears in a lot of places because it is the limiting case of many other distributions; that is the central limit theorem. It is very easy to work with because you can add or subtract a bunch of normal distributions and the result is just another normal distribution. You can add or subtract a bunch of <i>other</i> distributions and the result will often be closer to normal. You can also do a lot of work with the normal distribution using linear algebra techniques.<p>So, you choose to measure mean and variance in order to make the math easier. This does not always give the best outcome. For example, if you need more robust statistics, you might go for the median and the mean absolute deviation rather than the mean and variance. Then when you choose the maximum entropy distribution from those constraints, you end up with the Laplace distribution, which is much less convenient to work with mathematically than the normal distribution.
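To make the robustness point concrete, here is a small numpy sketch (made-up data plus one artificial outlier): the mean/standard deviation pair behind the max-entropy normal fit gets dragged around by a single bad measurement, while the median/average-absolute-deviation pair behind the max-entropy Laplace fit is far less affected.

<pre><code>import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=2.0, size=100)
data_with_outlier = np.append(data, 1000.0)  # one wild measurement

for label, x in [("clean", data), ("with outlier", data_with_outlier)]:
    # Summaries behind the max-entropy normal fit
    mean, std = x.mean(), x.std()
    # Summaries behind the max-entropy Laplace fit:
    # location = median, scale = mean absolute deviation from the median
    med = np.median(x)
    avg_dev = np.mean(np.abs(x - med))
    print(f"{label:12s} mean={mean:8.2f} std={std:8.2f} "
          f"median={med:6.2f} avg|dev|={avg_dev:6.2f}")
</code></pre>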
One thing I'd add to this is that this kind of thinking makes your coordinate system really matter.<p>Consider measuring some cubes of uncertain size. You could describe them by their edge length or by their volume. Learning one tells you the other; they're equivalent data. However, a maximum entropy distribution on one isn't a maximum entropy distribution on the other.<p>Pragmatically, there's always something you can do (e.g. a Jeffreys prior), but philosophically, this has always made me uneasy with max-entropy justifications that don't also come with a justification of the choice of coordinate system.
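A quick numpy sketch of the cube example (the [0, 1] range and the sample size are arbitrary choices of mine): a maximum-entropy (uniform) distribution over edge length and one over volume describe the same cubes, yet they disagree about something as simple as P(volume < 0.5).

<pre><code>import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Uniform (max entropy on a bounded interval) over edge length in [0, 1]...
edges = rng.uniform(0.0, 1.0, n)
volumes_from_edges = edges ** 3

# ...versus uniform over volume in [0, 1] directly.
volumes_direct = rng.uniform(0.0, 1.0, n)

# Same cubes being described, different answers:
print(np.mean(volumes_from_edges < 0.5))  # ~0.79 (= 0.5 ** (1/3))
print(np.mean(volumes_direct < 0.5))      # ~0.50
</code></pre>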
Thought experiment: suppose your friend drives 80 miles to visit you. They tell you the trip took between 2 and 4 hours. You have no further information. How confident are you that the trip took less than 3 hours?<p>Now they tell you they maintained a constant speed throughout the trip, a speed somewhere between 20 and 40mph. How confident are you that your friend was driving faster than 30mph?<p>The principle of maximum entropy, applied to each formulation, gives you different answers. Under the second, P(speed > 30mph) = 0.5, which puts the median trip time at 2hr40min (80/30 hours), not 3hrs. What gives? Which is the <i>real</i> way we should formulate travel times?<p>See: <a href="https://en.wikipedia.org/wiki/Bertrand_paradox_(probability)" rel="nofollow">https://en.wikipedia.org/wiki/Bertrand_paradox_(probability)</a>
Credit for this example: Michael Titelbaum
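A Monte Carlo sketch of the two formulations above (numpy, arbitrary sample size): maximum entropy over trip time and maximum entropy over speed give different answers to "how likely was the trip to take under 3 hours?".

<pre><code>import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
distance = 80.0  # miles

# Formulation 1: max entropy over trip time -> uniform on [2, 4] hours
time_uniform = rng.uniform(2.0, 4.0, n)

# Formulation 2: max entropy over speed -> uniform on [20, 40] mph
speed_uniform = rng.uniform(20.0, 40.0, n)
time_from_speed = distance / speed_uniform

print(np.mean(time_uniform < 3.0))     # 0.5 by construction
print(np.mean(time_from_speed < 3.0))  # ~0.667 (speed > 80/3 ≈ 26.7 mph)
</code></pre>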
You can derive these distributions with a lot less algebra by characterizing them with invariances, rather than maximum entropy under constraints.<p><a href="https://stevefrank.org/reprints-pdf/16Entropy.pdf" rel="nofollow">https://stevefrank.org/reprints-pdf/16Entropy.pdf</a>
With this method, you can derive all of statistical mechanics from information theory, with constraints originating from thermodynamics. Observing thermodynamic quantities, which are high-level observations on a system of particles (i.e. related to means and the like, not to individual particles), imposes constraints of the same kind as the ones listed in this article. This approach was pioneered by Jaynes (1957), "Information theory and statistical mechanics, I": <a href="https://www.semanticscholar.org/paper/Information-Theory-and-Statistical-Mechanics-Jaynes/08b67692bc037eada8d3d7ce76cc70994e7c8116" rel="nofollow">https://www.semanticscholar.org/paper/Information-Theory-and...</a>
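As a minimal sketch of that construction (numpy/scipy, with made-up energy levels and a made-up target mean energy): maximizing entropy subject only to a mean-energy constraint yields the Boltzmann/Gibbs form, and the Lagrange multiplier that enforces the constraint plays the role of inverse temperature.

<pre><code>import numpy as np
from scipy.optimize import brentq

# Hypothetical discrete energy levels and a target mean energy
# (the thermodynamic-style constraint).
energies = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
target_mean_energy = 1.2

def mean_energy(beta):
    # Max-entropy distribution under a mean-energy constraint:
    # p_i proportional to exp(-beta * E_i)  (Boltzmann/Gibbs form)
    w = np.exp(-beta * energies)
    p = w / w.sum()
    return p @ energies

# Solve for the Lagrange multiplier beta (the inverse temperature)
beta = brentq(lambda b: mean_energy(b) - target_mean_energy, -10.0, 10.0)
p = np.exp(-beta * energies)
p /= p.sum()
print("beta =", beta)
print("p    =", p)
print("mean energy check:", p @ energies)
</code></pre>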
> The Normal Distribution is your best guess if you only know the mean and the variance of your data.<p>That's awful advice for some domains. If your process dynamics are badly behaved (statistically), such as power laws and the like, the "mean" and "variance" you're calculating from samples are probably rubbish.<p>Choosing a starting distribution is really a statement about how you're exposing yourself to risk; there is no such thing as a "best guess".
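To illustrate with a heavy tail (numpy sketch; the tail index 1.5 is my choice, picked so the mean exists but the variance does not): the sample variance never settles down, no matter how much data you collect.

<pre><code>import numpy as np

rng = np.random.default_rng(0)
alpha = 1.5  # Pareto tail index: mean is finite, variance is infinite

for n in [10**3, 10**4, 10**5, 10**6]:
    x = rng.pareto(alpha, size=n) + 1.0  # classic Pareto with minimum 1
    print(f"n={n:>8}  sample mean={x.mean():8.2f}  sample var={x.var():12.2f}")
</code></pre>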
"And if we weigh this by the probability of that particular event happening, we get info ∝ p ⋅ log2(1/p)"<p>I fail to see the motivation of this step, and I think that's preventing me to see the argument as "intuitive". Could somebody explain?<p>The two steps back (info ∝ 1/p) it still makes sense to me: the more rare the event is, the bigger the resulting number is, so in the case the event happens, the more "surprised" we are, and more information is gained. However, what do we achieve by weighing the bitcount of the information with the probability?
I love the article!<p>My only advice is to end with a list of maximum entropy distributions to showcase the many applications of this theory. I often refer to such tables when I have varying constraints and want the best choice for representing the spread of the data.<p>See the table in <a href="https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution" rel="nofollow">https://en.wikipedia.org/wiki/Maximum_entropy_probability_di...</a>
This approach can mislead people because by design it assumes that the support is infinite and that the variance is finite, which is why it ends up with a thin-tailed distribution in the first place.<p>Also, as klodolph said, arbitrarily restricting your knowledge to the mean and the variance as summary statistics is what leads to the Gaussian distribution. Moreover, in practice, arbitrarily restricting your knowledge is a violation of probability as a model of intuition, as Jaynes showed.