To try out Shiny, I created an interactive visualization for Kullback-Leibler divergence (or KL Divergence). Right now, it only supports two univariate Gaussians, which should be sufficient to build some intuition.

If you like it, let me know! If it turns out to be popular, I might add more features, or create similar visualizations for other concepts!

## What is KL Divergence? What am I seeing?

Consider an unknown probability distribution \(p(x)\) that we are trying to approximate with a probability distribution \(q(x)\). Then:

\[\text{KL}(p||q) = - \int p(x) \ln \frac{q(x)}{p(x)} dx\]

can informally be interpreted as the amount of information lost when using \(q\) to *approximate* \(p\). As you might imagine, this has several applications in Machine Learning. A recurring pattern is to fit the parameters of a model by minimizing an approximation of \(\text{KL}(p||q)\) (i.e., making \(q\) “as similar” to \(p\) as possible). This blog post elaborates in a fun and informative way. If you have never heard of KL divergence before, Bishop provides a more formal (but still easy to understand) introduction in Section 1.6 of Pattern Recognition and Machine Learning.
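For two univariate Gaussians the integral above has a well-known closed form, which is presumably what a tool like this evaluates under the hood. A minimal sketch in Python (the function name and parameterization by variance are my own choices, not taken from the app):

```python
import numpy as np

def kl_gaussians(mu1, var1, mu2, var2):
    """Closed-form KL( N(mu1, var1) || N(mu2, var2) ) for univariate Gaussians.

    Derived by plugging Gaussian densities into the KL integral:
    0.5 * ( ln(var2/var1) + (var1 + (mu1 - mu2)^2) / var2 - 1 )
    """
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

# Identical distributions lose no information:
print(kl_gaussians(0.0, 1.0, 0.0, 1.0))  # 0.0
# Shifting the mean of q by 1 (both unit variance):
print(kl_gaussians(0.0, 1.0, 1.0, 1.0))  # 0.5
```

Playing with the parameters here mirrors dragging the sliders in the visualization.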

## Suggested exercises with the interactive plot

Using the visualization tool, you can reason about the following questions:

- Is \(\text{KL}(p||q) = \text{KL}(q||p)\)? Always? Never?
- When is \(\text{KL}(p||q) = 0\)?
- Let \(r(x) = \mathcal{N}(0, 1)\) and \(s(x) = \mathcal{N}(0, 2)\). Which is larger: \(\text{KL}(r||s)\) or \(\text{KL}(s||r)\)? Why?
- Is \(\text{KL}(p||q)\) ever negative? When, or why not?
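If you want to double-check your intuition on the \(r\) versus \(s\) question away from the plot, a quick grid approximation of the integral works. This sketch assumes \(\mathcal{N}(0, 2)\) follows Bishop's \(\mathcal{N}(\mu, \sigma^2)\) convention, i.e., variance 2:

```python
import numpy as np

def gauss_pdf(x, mu, var):
    """Density of N(mu, var) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Riemann-sum approximation of KL(p||q) = integral of p * ln(p/q) dx
# on a grid wide enough that the truncated tails are negligible.
x = np.linspace(-15, 15, 200001)
dx = x[1] - x[0]
r = gauss_pdf(x, 0.0, 1.0)  # r = N(0, 1)
s = gauss_pdf(x, 0.0, 2.0)  # s = N(0, 2), taking 2 as the variance

kl_rs = np.sum(r * np.log(r / s)) * dx
kl_sr = np.sum(s * np.log(s / r)) * dx
print(f"KL(r||s) ~ {kl_rs:.4f}, KL(s||r) ~ {kl_sr:.4f}")
```

Both values come out positive and unequal, which already answers two of the questions above; comparing their sizes (and thinking about where each distribution places mass the other does not) answers the third.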