KL Divergence measures how different one probability distribution is from another (strictly speaking it is not a distance, since it is not symmetric, but it works as a measure of dissimilarity). Understanding it helps in understanding Cross-Entropy and deep learning model architectures such as the VAE (Variational Autoencoder).
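For discrete distributions P and Q over the same outcomes, it is defined as D_KL(P || Q) = Σ_x P(x) · log(P(x) / Q(x)), where P is usually the "true" distribution and Q the one being compared against it.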
For example, let's say there is a coin, coinA, which has a 50% chance of landing HEADS and a 50% chance of landing TAILS. Alongside it there are two other coins: coinB (55% chance HEADS, 45% chance TAILS) and coinC (90% chance HEADS, 10% chance TAILS).
Which of the two is closer to coinA?
Intuitively, you can tell that coinB is closer. But what happens when it is NOT obvious? In that case it would be better to quantify the similarity, and that is exactly what KL Divergence does.
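As a quick sanity check, here is a minimal Python sketch of the discrete KL Divergence applied to the three coins. The names (kl_divergence, coin_a, and so on) are just illustrative, and the natural log is used, so the result is in nats.

```python
import math

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)), skipping terms where P(x) = 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

coin_a = [0.5, 0.5]    # fair coin: [P(HEADS), P(TAILS)]
coin_b = [0.55, 0.45]
coin_c = [0.9, 0.1]

print(kl_divergence(coin_a, coin_b))  # ~0.0050 nats: coinB is close to coinA
print(kl_divergence(coin_a, coin_c))  # ~0.5108 nats: coinC is much farther away
```

The number for coinB is roughly a hundred times smaller than the one for coinC, which matches the intuition that coinB behaves almost like the fair coin while coinC does not.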