# Entropy¶

Entropy is a function that maps a discrete probability distribution to a real number.

A discrete probability distribution is a set of probabilities $$p_1, \dots, p_K$$ where $$p_i$$ is the probability of observing event $$i$$.

We define the entropy of the distribution as $$H(p_1, \dots, p_K) = - \sum_{i = 1}^K p_i \log_2 (p_i)$$.
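As a sketch, the definition translates directly into a few lines of Python (the function name `entropy` and the convention $$0 \cdot \log_2(0) = 0$$ for zero-probability events are my choices, not from the text above):

```python
import math

def entropy(probs):
    """Shannon entropy (base 2) of a discrete distribution.

    Terms with p_i = 0 contribute nothing, by the
    convention 0 * log2(0) = 0.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0
print(entropy([1.0]))       # 0: a single certain event carries no surprise
```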

In the case of $$K = 2$$, the plot of entropy against $$p_1$$ looks like an inverted U, with a minimum of 0 at $$p_1 = 0$$ and $$p_1 = 1$$, and a maximum of 1 at $$p_1 = 0.5$$.

In general, entropy attains its maximum value of $$\log_2(K)$$ when we set all $$p_i = \frac{1}{K}$$.
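A quick numerical check of this claim, using a small helper for the entropy formula above (the example distributions are mine, chosen only for illustration):

```python
import math

def entropy(probs):
    # Shannon entropy in bits; 0 * log2(0) is taken to be 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

K = 4
uniform = [1 / K] * K
print(entropy(uniform), math.log2(K))  # both are 2.0

# Any non-uniform distribution over K events has lower entropy:
skewed = [0.7, 0.1, 0.1, 0.1]
print(entropy(skewed) < math.log2(K))  # True
```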

The closer entropy is to 0, the closer the distribution is to putting all the probability mass on a single event. The closer entropy is to $$\log_2(K)$$, the closer the distribution is to a uniform distribution over events, or a “flat prior”.

Suppose we have just two events, A and B (i.e. $$K = 2$$), and we sample a sequence of $$N$$ events, where at each step we see A with probability $$p$$ and B with probability $$1 - p$$.

There are $$2^N$$ possible sequences we could observe, but most of them are vanishingly rare. For very large $$N$$, the event A will occur about $$N \cdot p$$ times in a sequence. The number of sequences in which A occurs exactly $$N \cdot p$$ times is the number of ways to choose $$N \cdot p$$ positions out of the $$N$$ total positions, which, by Stirling’s formula, is about $$2^{N \cdot H(p, 1 - p)}$$. In other words, entropy determines the size of the set of “typical” sequences for a probability distribution. If entropy is 0, there is just 1 typical sequence, because the same event happens over and over again. If entropy is 1 (the maximum for $$K = 2$$), the number of typical sequences equals the total number of possible sequences.
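The approximation above can be checked directly: compare $$\log_2 \binom{N}{N p}$$ with $$N \cdot H(p, 1 - p)$$ for a large $$N$$. The particular values $$N = 10000$$ and $$p = 0.3$$ below are arbitrary choices for illustration:

```python
import math

def binary_entropy(p):
    # H(p, 1 - p) in bits, with H(0) = H(1) = 0.
    if p in (0, 1):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

N = 10_000
p = 0.3
k = round(N * p)  # A occurs about N * p times in a typical sequence

# log2 of the number of sequences with exactly k occurrences of A,
# versus the N * H(p, 1 - p) bits predicted by the entropy estimate.
exact = math.log2(math.comb(N, k))
approx = N * binary_entropy(p)
print(exact / approx)  # close to 1 for large N
```

The ratio approaches 1 as $$N$$ grows, since Stirling’s formula drops only lower-order terms.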