In the notation of this page, the entropy H(P) is best thought of as:

"The mean number of bits to encode a member of P, assuming an optimal code."

And the KL divergence KL(P,Q) is probably best thought of as:

"The mean number of WASTED bits if you encode members of P assuming that they had come from Q."
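To make that concrete, a tiny sketch in Python (P and Q here are made up for illustration): the KL divergence drops out as cross entropy minus entropy, i.e. the wasted bits per symbol.

    # "Wasted bits" reading of KL, with toy distributions P and Q.
    import math

    P = [0.5, 0.25, 0.25]   # true distribution
    Q = [0.25, 0.25, 0.5]   # assumed (wrong) coding distribution

    H_P  = -sum(p * math.log2(p) for p in P)              # bits with an optimal code for P
    H_PQ = -sum(p * math.log2(q) for p, q in zip(P, Q))   # bits if coded as if drawn from Q
    KL   = H_PQ - H_P                                     # wasted bits per symbol

    print(H_P, H_PQ, KL)   # 1.5, 1.75, 0.25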
I and clearly many other people have run into what one could only call “KL variance”, but it doesn’t seem to have an established name.

https://mathoverflow.net/questions/210469/kullback-leibler-variance-does-that-divergence-have-a-name
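If I'm reading the linked question right, the quantity is the variance (under P) of the same log-ratio whose mean is the KL divergence. A quick sketch under that assumption, with toy distributions:

    # Assuming "KL variance" means Var_P[log2(p/q)], i.e. the variance of
    # the log-ratio whose expectation is KL(P,Q).
    import math

    P = [0.5, 0.25, 0.25]
    Q = [0.25, 0.25, 0.5]

    log_ratio = [math.log2(p / q) for p, q in zip(P, Q)]
    kl_mean = sum(p * r for p, r in zip(P, log_ratio))                 # = KL(P,Q)
    kl_var  = sum(p * (r - kl_mean) ** 2 for p, r in zip(P, log_ratio))

    print(kl_mean, kl_var)   # 0.25, 0.6875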
I often wondered about an alternative but related metric called "organization".

Entropy, in some sense, would seem to measure "complexity", but it's more accurately described as "surprise", I think.

It's useful but limited (for example, you can measure the "entropy" present in a string -- of keystrokes, or text -- and estimate how likely it is to be "coherent" or "intelligent", but this is fuzzy: "too much" entropy and you are at "randomness", too little and you are at "banality"). It seems like a more precise (but still 0..1 bounded) metric should be possible to measure "order" or "organization". Entropy fails at this: 0 entropy does not equal "total order", just "total boringness" (heh :))

I considered something related to some archetypal canonical compression scheme (like LZ), but didn't flesh it out. Considering it again now: what about the "self-similarity" of the dictionary, combined with the diversity of the dictionary?

It's more of a "two-axis" metric, but surely we can find a way to corral it into 0..1.

Very self-similar, and rather diverse? Highly organized.

Low self-similarity, and highly diverse? High entropy / highly disorganized.

Low self-similarity, and low diversity? Low entropy / high banality. I.e., simplicity heh :)

High self-similarity, low diversity? Organized, but "less organized" than something with more diversity.

I don't think this is quite there yet, but it has an intuitive pull (see the rough sketch below).

Any takers? :)
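Rough sketch of the two-axis idea, with zlib standing in for the "archetypal canonical compression scheme", distinct-symbol fraction as a crude proxy for dictionary diversity, and a plain product as the (arbitrary) way to squeeze both axes into 0..1:

    # Two-axis "organization" sketch: compressibility ~ self-similarity,
    # distinct-symbol fraction ~ diversity, product ~ organization in 0..1.
    import os
    import zlib

    def organization(s: bytes) -> float:
        if len(s) < 2:
            return 0.0
        # Axis 1: self-similarity ~ how compressible the string is.
        ratio = len(zlib.compress(s, 9)) / len(s)
        self_similarity = max(0.0, min(1.0, 1.0 - ratio))
        # Axis 2: diversity ~ fraction of distinct byte values used.
        diversity = len(set(s)) / min(len(s), 256)
        return self_similarity * diversity

    print(organization(b"abcabcabc" * 40))   # very self-similar, a bit of diversity
    print(organization(b"a" * 360))          # very self-similar, minimal diversity
    print(organization(os.urandom(360)))     # diverse, but barely compressible

On toy inputs this at least orders things the way the comment suggests: repeated-but-varied text scores highest, a single repeated byte lower, and random bytes near zero. Whether the product is the right way to combine the axes is an open question.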
After the phrase "Manipulating the logarithms, we can also get ...", the formula is incorrect, since the p_j have disappeared:

\[D_{KL}(P,Q)=-\sum_{j=1}^{n}\log_2 \frac{q_j}{p_j}=\sum_{j=1}^{n}\log_2 \frac{p_j}{q_j}\]

It should keep the p_j weights:

\[D_{KL}(P,Q)=-\sum_{j=1}^{n}p_j\log_2 \frac{q_j}{p_j}=\sum_{j=1}^{n}p_j\log_2 \frac{p_j}{q_j}\]

The post is just basic definitions and simple examples for cross entropy and KL divergence.

There is a section at the end about the relation between cross entropy and maximum likelihood estimation that is not so easy to follow, but it seems to imply that, as the sample length tends to infinity, the average negative log-likelihood of an estimator applied to a sample from a distribution converges to the cross entropy, and its excess over the true entropy to the KL divergence.
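A small simulation of that limit, assuming the claim is the usual statement of the cross-entropy/MLE link (the per-sample negative log-likelihood of samples from P scored under Q tends to H(P,Q), so its excess over H(P) tends to KL(P,Q)):

    # Mean negative log2-likelihood of samples drawn from P, scored under Q,
    # approaches the cross entropy H(P,Q); subtracting H(P) leaves KL(P,Q).
    import math
    import random

    P = [0.5, 0.25, 0.25]
    Q = [0.25, 0.25, 0.5]

    N = 200_000
    sample = random.choices(range(len(P)), weights=P, k=N)
    avg_nll = -sum(math.log2(Q[j]) for j in sample) / N

    H_P  = -sum(p * math.log2(p) for p in P)
    H_PQ = -sum(p * math.log2(q) for p, q in zip(P, Q))

    print(avg_nll, H_PQ)    # sample average approaches H(P,Q) = 1.75
    print(avg_nll - H_P)    # excess over H(P) approaches KL(P,Q) = 0.25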
Cross entropy from first principles: https://youtu.be/KHVR587oW8I