Here is the simplest way of explaining the KL divergence:

The KL divergence gives you a concrete number: how many extra bits, on average, you waste if you take the encoding table built for the data in one ZIP file and use it to compress the data in another ZIP file. It's not just theoretical; this is exactly the kind of task it's used for.

The closer the two files are in content, the fewer bits are wasted. So, we can use this to measure how similar two sets of information are, in a manner of speaking.

These 'wasted bits' are also known as relative entropy, since entropy is basically a measure of how disordered something can be. The more disordered it is, the more possibilities there are to choose from, and thus the more information it can carry.

Entropy does not guarantee that the information is usable. It only tells you how much of that quantity you can get, much like pipes carrying water: they will usually carry water, but sludge can come through instead, and their capacity is the same either way.

One thing to note about our ZIP files: if you use the encoding table from one to encode the other, you will end up with a different relative entropy (i.e. 'wasted bits') number than if you did it the other way around. This is because the KL divergence is not what's called symmetric. That is, it can mean something different depending on which direction it goes.

Can you pull out a piece of paper, make yourself an example problem, and tease out an intuition as to why?
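
If you want to check your paper answer against the arithmetic, here is a minimal Python sketch. The symbol probabilities P and Q are made up (think of them as the per-symbol frequency tables you'd estimate from two files); the point is just that KL(P || Q) equals the cross-entropy (bits per symbol when P's data is encoded with Q's table) minus P's own entropy, and that swapping the arguments gives a different number.

import math

# Hypothetical per-symbol probabilities estimated from two files.
p = {"a": 0.8, "b": 0.1, "c": 0.1}
q = {"a": 0.4, "b": 0.3, "c": 0.3}

def entropy(dist):
    # Average bits per symbol with a code matched to dist.
    return -sum(prob * math.log2(prob) for prob in dist.values())

def cross_entropy(p, q):
    # Average bits per symbol when data drawn from p is encoded
    # with a code built for q (requires q > 0 wherever p > 0).
    return -sum(p[s] * math.log2(q[s]) for s in p)

def kl_divergence(p, q):
    # The 'wasted bits' per symbol: cross-entropy minus entropy.
    return cross_entropy(p, q) - entropy(p)

print(f"KL(P || Q) = {kl_divergence(p, q):.3f} bits/symbol")  # ~0.483
print(f"KL(Q || P) = {kl_divergence(q, p):.3f} bits/symbol")  # ~0.551

The two directions come out different (~0.483 vs ~0.551 bits per symbol here), which is the asymmetry described above: it matters whose table you borrowed and whose data you're compressing.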