Here's how I describe KL Divergence, building up from simple to complex concepts.<p>surprisal: how surprised I am when I learn the value of X<p><pre><code> Surprisal(x) = -log p(X=x)
</code></pre>
entropy: how surprised I expect to be<p><pre><code> H(p) = 𝔼_X -log p(X)
= ∑_x p(X=x) * -log p(X=x)
</code></pre>
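A minimal Python sketch of these first two definitions, using log base 2 so everything is in bits; the biased coin and its numbers are made up for illustration:<p><pre><code> import math

def surprisal(prob):
    # surprisal in bits of an outcome that has probability `prob`
    return -math.log2(prob)

def entropy(p):
    # expected surprisal: sum over x of p(x) * -log p(x)
    return sum(px * surprisal(px) for px in p.values() if px)

p = {"heads": 0.9, "tails": 0.1}  # my beliefs about a biased coin
print(surprisal(p["tails"]))      # ~3.32 bits: rare outcomes are very surprising
print(entropy(p))                 # ~0.47 bits: on average I expect little surprise
</code></pre>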
cross-entropy: how surprised I expect Bob to be (if Bob's beliefs are q instead of p)<p><pre><code> H(p,q) = 𝔼_X -log q(X)
= ∑_x p(X=x) * -log q(X=x)
</code></pre>
KL divergence: how much *more* surprised I expect Bob to be than me<p><pre><code> Dkl(p || q) = H(p,q) - H(p,p)
= ∑_x p(X=x) * log p(X=x)/q(X=x)
</code></pre>
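Continuing the toy sketch above (same p, surprisal, and entropy; Bob's q is invented for illustration):<p><pre><code> def cross_entropy(p, q):
    # how surprised I expect Bob (beliefs q) to be, when outcomes really follow p
    return sum(px * -math.log2(q[x]) for x, px in p.items() if px)

def kl_divergence(p, q):
    # Bob's extra expected surprise over mine, in bits
    return cross_entropy(p, q) - entropy(p)

q = {"heads": 0.5, "tails": 0.5}  # Bob thinks the coin is fair
print(cross_entropy(p, q))        # 1.0 bit
print(kl_divergence(p, q))        # ~0.53 bits more expected surprise than mine
</code></pre>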
information gain: how much less surprised I expect Bob to be if he knew that Y=y<p><pre><code> IG(q|Y=y) = Dkl(q(X|Y=y) || q(X))
</code></pre>
mutual information: how much information I expect to gain about X from learning the value of Y<p><pre><code> I(X;Y) = 𝔼_Y IG(q|Y=y)
= 𝔼_Y Dkl(q(X|Y=y) || q(X))</code></pre>
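And a toy sketch of this last step, reusing kl_divergence from above and treating a single made-up joint distribution as both my beliefs and Bob's:<p><pre><code> joint = {("rain", "clouds"): 0.30, ("rain", "clear"): 0.05,
          ("dry", "clouds"): 0.20, ("dry", "clear"): 0.45}

p_x, p_y = {}, {}
for (x, y), pxy in joint.items():
    p_x[x] = p_x.get(x, 0.0) + pxy   # marginal p(X=x)
    p_y[y] = p_y.get(y, 0.0) + pxy   # marginal p(Y=y)

mutual_information = 0.0
for y, py in p_y.items():
    p_x_given_y = {x: joint[(x, y)] / py for x in p_x}          # p(X | Y=y)
    mutual_information += py * kl_divergence(p_x_given_y, p_x)  # expected information gain
print(mutual_information)  # ~0.21 bits I expect to learn about X from seeing Y
</code></pre>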
I found <a href="https://www.lesswrong.com/posts/no5jDTut5Byjqb4j5/six-and-a-half-intuitions-for-kl-divergence" rel="nofollow noreferrer">https://www.lesswrong.com/posts/no5jDTut5Byjqb4j5/six-and-a-...</a> very helpful for getting intuition for what the K-L divergence is and why it's useful. The six intuitions:<p><pre><code> 1. Expected surprise
2. Hypothesis testing
3. MLEs
4. Suboptimal coding
5a. Gambling games -- beating the house
5b. Gambling games -- gaming the lottery
6. Bregman divergence</code></pre>
Here is the simplest way of explaining the KL divergence:<p>The KL divergence yields a concrete value that tells you how many bits of space on disk you will waste, on average, if you use the encoding table built for one ZIP file's data to encode another ZIP file's data. It's not just theoretical; this is exactly the kind of task it's used for.<p>The closer the two files are to each other in content, the fewer wasted bits. So we can use this to measure how similar two sets of information are, in a manner of speaking.<p>These 'wasted bits' are also known as relative entropy, since entropy is basically a measure of how disordered something can be. The more disordered, the more possibilities we have to choose from, and thus the more information possible.<p>Entropy does not guarantee that the information is usable. It only tells you how much of this quantity we can get, much like pipes serving water: yes, they will likely serve water, but you can accidentally have sludge come through instead. Still, their capacity is the same.<p>One thing to note is that with our ZIP files, if you use the encoding table from one to encode the other, you will end up with a different relative-entropy (i.e. 'wasted bits') number than if you did it the other way around. This is because the KL divergence is not what's called symmetric. That is, it can have a different meaning depending on which direction it goes.<p>Can you pull out a piece of paper, make yourself an example problem, and tease out an intuition as to why?
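If you'd rather poke at it in code first, here's a minimal sketch of the coding picture, with idealized symbol codes rather than literal ZIP tables (the distributions are made up):<p><pre><code> import math

p = {"a": 0.50, "b": 0.25, "c": 0.25}  # what this file's data actually looks like
q = {"a": 0.25, "b": 0.25, "c": 0.50}  # what the other file's table was built for

optimal_bits = sum(px * -math.log2(px) for px in p.values())         # H(p) = 1.5
mismatched_bits = sum(px * -math.log2(q[x]) for x, px in p.items())  # H(p, q) = 1.75
print(mismatched_bits - optimal_bits)  # Dkl(p || q) = 0.25 wasted bits per symbol, on average
</code></pre>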
We use KL-divergence to calculate how surprising time-series anomalies are and to rank them for aviation safety, e.g., give me a ranked list of the most surprising increases in a safety metric. It's quite handy!
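For flavor, a hypothetical sketch of what that kind of ranking could look like; the buckets, counts, and window names below are invented, not the commenter's actual pipeline:<p><pre><code> import numpy as np

def kl_bits(p, q, eps=1e-12):
    # KL divergence in bits between two histograms, after smoothing and normalizing
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log2(p / q)))

baseline = [80, 15, 4, 1]  # historical counts per severity bucket
windows = {"week 12": [78, 16, 5, 1],
           "week 13": [60, 25, 10, 5],
           "week 14": [81, 14, 4, 1]}

ranked = sorted(windows, key=lambda w: kl_bits(windows[w], baseline), reverse=True)
print(ranked)  # most surprising windows first ('week 13' tops this toy example)
</code></pre>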
I have used KL-divergence in authorship verification: <a href="https://github.com/capjamesg/pysurprisal/blob/main/pysurprisal/core.py#L5">https://github.com/capjamesg/pysurprisal/blob/main/pysurpris...</a><p>My theory was: calculate the surprisal of words used in a language (in my case, from an NYT corpus), then calculate the KL-divergence between a given piece of prose and the surprisal profiles of different authors. The author whose profile had the lowest KL-divergence from the prose was assumed to be the author. I think it has been used in stylometry a bit.
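A rough sketch of that idea in generic Python (not pysurprisal's actual API; the corpora, texts, and smoothing below are placeholders):<p><pre><code> import math
from collections import Counter

def word_dist(text):
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def kl_bits(p, q, eps=1e-9):
    # smooth q so words unseen in an author's corpus don't give infinite divergence
    return sum(px * math.log2(px / q.get(w, eps)) for w, px in p.items())

authors = {"melville": word_dist("the sea the ship the whale the harpoon"),
           "austen": word_dist("the ball the letter the estate the visit")}
prose = word_dist("the whale surfaced near the ship")

print(min(authors, key=lambda a: kl_bits(prose, authors[a])))  # closest author wins
</code></pre>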
K-L divergence is something that keeps coming up in my research, but I still don't understand what it is.<p>Could someone give me a simple explanation of what it is?<p>And also, what practical use cases does it have?
KL divergence has also been used to generalize the second law of thermodynamics for systems far from equilibrium:<p><a href="https://arxiv.org/abs/1508.02421" rel="nofollow noreferrer">https://arxiv.org/abs/1508.02421</a><p>And to explain the relationship between the rate of evolution and evolutionary fitness:<p><a href="https://math.ucr.edu/home/baez/bio_asu/bio_asu_web.pdf" rel="nofollow noreferrer">https://math.ucr.edu/home/baez/bio_asu/bio_asu_web.pdf</a><p>The connection between all of these manifestations of KL divergence is that a system far from equilibrium contains more information (in the Shannon sense) than a system in equilibrium. That "excess information" is what drives fitness within some environment.
IIRC (and in my experience), KL divergence doesn't account for double counting. I wrote a paper where I ended up having to use a custom metric instead:
<a href="https://digitalcommons.usu.edu/cgi/viewcontent.cgi?article=4438&context=smallsat" rel="nofollow noreferrer">https://digitalcommons.usu.edu/cgi/viewcontent.cgi?article=4...</a>
I learnt about KL Divergence recently and it was pretty cool to know that cross-entropy loss originated from KL Divergence.
But could someone give me the cases where it is preferred to use mean-squared error loss vs cross-entropy loss? Are there any merits or demerits to using either?
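For what it's worth, here is a small numerical check of that cross-entropy/KL connection (the label and model output are made up): cross-entropy equals the target's entropy plus the KL divergence, so with the target held fixed, minimizing cross-entropy loss minimizes the KL divergence to it.<p><pre><code> import math

def H(p, q=None):
    # entropy if q is None, otherwise cross-entropy H(p, q), in bits
    q = p if q is None else q
    return sum(px * -math.log2(qx) for px, qx in zip(p, q) if px)

target = [1.0, 0.0, 0.0]  # one-hot label
model = [0.7, 0.2, 0.1]   # softmax output

cross_entropy = H(target, model)
kl = cross_entropy - H(target)  # H(target) = 0 for a one-hot label
print(cross_entropy, kl)        # both ~0.515 bits: identical in this case
</code></pre>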
Btw, KL divergence isn't symmetric, so D(P,Q) != D(Q,P). If you need a symmetric version, you can use the Jensen–Shannon divergence, which compares both P and Q to their mixture M = (P + Q)/2 and averages the two divergences: JSD = (D(P,M) + D(Q,M)) / 2. Or, if you only care about relative distances, you can just use the sum D(P,Q) + D(Q,P), sometimes called the symmetrized KL or Jeffreys divergence.
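A quick numerical illustration with toy distributions:<p><pre><code> import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log2(p / q)))  # assumes all entries are positive

P = np.array([0.9, 0.1])
Q = np.array([0.5, 0.5])
M = (P + Q) / 2  # the mixture both are compared against

print(kl(P, Q), kl(Q, P))               # ~0.531 vs ~0.737: not symmetric
print(0.5 * kl(P, M) + 0.5 * kl(Q, M))  # Jensen–Shannon divergence, symmetric in P and Q
print(kl(P, Q) + kl(Q, P))              # symmetrized KL (Jeffreys divergence)
</code></pre>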