Here is the simplest way of explaining the KL divergence:

The KL divergence gives you a concrete number: how many extra bits, on average, you waste if you take the encoding table built for the data in one ZIP file and use it to compress the data in another ZIP file. It's not just theoretical; this is exactly the kind of task it's used for.

The closer the two files are in content, the fewer bits are wasted. So, we can use this to measure how similar two sets of information are, in a manner of speaking.

These 'wasted bits' are also known as relative entropy, since entropy is basically a measure of how disordered something can be. The more disordered it is, the more possibilities there are to choose from, and thus the more information it can carry.

Entropy does not guarantee that the information is usable. It only tells you how much of that quantity you can get, much like pipes carrying water: they will usually carry water, but sludge can come through instead, and their capacity is the same either way.

One thing to note about our ZIP files: if you use the encoding table from one to encode the other, you will end up with a different relative entropy (i.e. 'wasted bits') number than if you did it the other way around. This is because the KL divergence is not what's called symmetric. That is, it can mean something different depending on which direction it goes.

Can you pull out a piece of paper, make yourself an example problem, and tease out an intuition as to why?
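
If you want to check your paper answer against the arithmetic, here is a minimal Python sketch. The symbol probabilities P and Q are made up (think of them as the per-symbol frequency tables you'd estimate from two files); the point is just that KL(P || Q) equals the cross-entropy (bits per symbol when P's data is encoded with Q's table) minus P's own entropy, and that swapping the arguments gives a different number.

import math

# Hypothetical per-symbol probabilities estimated from two files.
p = {"a": 0.8, "b": 0.1, "c": 0.1}
q = {"a": 0.4, "b": 0.3, "c": 0.3}

def entropy(dist):
    # Average bits per symbol with a code matched to dist.
    return -sum(prob * math.log2(prob) for prob in dist.values())

def cross_entropy(p, q):
    # Average bits per symbol when data drawn from p is encoded
    # with a code built for q (requires q > 0 wherever p > 0).
    return -sum(p[s] * math.log2(q[s]) for s in p)

def kl_divergence(p, q):
    # The 'wasted bits' per symbol: cross-entropy minus entropy.
    return cross_entropy(p, q) - entropy(p)

print(f"KL(P || Q) = {kl_divergence(p, q):.3f} bits/symbol")  # ~0.483
print(f"KL(Q || P) = {kl_divergence(q, p):.3f} bits/symbol")  # ~0.551

The two directions come out different (~0.483 vs ~0.551 bits per symbol here), which is the asymmetry described above: it matters whose table you borrowed and whose data you're compressing.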