Maximum entropy probability distribution


In statistics and information theory, a maximum entropy probability distribution is a probability distribution whose entropy is larger than that of all other members of a specified class. If nothing is known about a distribution except that it belongs to a certain class, then the maximum entropy distribution for that class is often assumed "by default". The reason is twofold: maximizing entropy, in a sense, means minimizing the amount of prior information built into the distribution; furthermore, many physical systems tend to move towards maximal entropy configurations over time.

Definition of entropy

If X is a discrete random variable with distribution given by

P(X = xk) = pk    for k = 1, 2, ...

then the entropy of X is defined as

H(X) = Σ pk log(1/pk),    the sum running over all k.

If X is a continuous random variable with probability density p(x), then the entropy of X is defined as

H(X) = ∫ p(x) log(1/p(x)) dx,    the integral taken over the real line,

where p(x) log(1/p(x)) is understood to be zero whenever p(x) = 0.

The base of the logarithm used is not important as long as the same one is used consistently: change of base merely results in a rescaling of the entropy. Information theoreticians may prefer to use base 2 in order to express the entropy in bits; mathematicians and physicists will often prefer the natural logarithm, resulting in a unit of nits or nepers for the entropy.
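
As a small illustration of these definitions and of the change of base (an added sketch in Python, not part of the original article; the distribution used here is arbitrary):

    import math

    def entropy(probs, base=math.e):
        # Sum of p * log(1/p); terms with p = 0 are omitted, as noted above.
        return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

    probs = [0.5, 0.25, 0.125, 0.125]        # an arbitrary discrete distribution
    h_nats = entropy(probs)                  # natural logarithm: nats
    h_bits = entropy(probs, base=2)          # base 2: bits
    print(h_bits, h_nats / math.log(2))      # both print 1.75: changing base only rescales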

Examples of maximum entropy distributions

The most important maximum entropy distribution is the normal distribution N(μ,σ²). It has maximum entropy among all distributions on the real line with specified mean μ and standard deviation σ. Therefore, if all you know about a distribution is its mean and standard deviation, it is often reasonable to assume that the distribution is normal.
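
For reference (a check not included in the article above), the entropy of N(μ,σ²), computed with natural logarithms, is

H = (1/2) log(2πeσ²) ≈ 1.419 + log σ,

whereas, for example, a Laplace distribution with the same standard deviation σ has entropy 1 + log(σ√2) ≈ 1.347 + log σ, which is indeed smaller.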

The uniform distribution on the interval [a,b] is the maximum entropy distribution among all continuous distributions which are supported in the interval [a, b] (which means that the probability density is 0 outside of the interval). The uniform distribution on the finite set {x1,...,xn} (which assigns a probability of 1/n to each of these values) is the maximum entropy distribution among all discrete distributions supported on this set.
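
For reference (values not worked out above), these maximum entropies are easy to compute directly: the uniform distribution on [a,b] has entropy log(b − a), and the uniform distribution on {x1,...,xn} has entropy log n.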

The exponential distribution with mean 1/λ is the maximum entropy distribution among all continuous distributions supported in [0,∞) that have a mean of 1/λ.
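
For reference, the entropy of this exponential distribution, with density λe^(−λx) on [0,∞), is 1 − log λ = 1 + log(1/λ) in natural logarithms (a value not stated in the article above).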

Among all the discrete distributions supported on the set {x1,...,xn} with mean μ, the maximum entropy distribution has the following shape:

P(X = xk) = C·r^(xk)    for k = 1, ..., n,

where the positive constants C and r can be determined by the requirements that the sum of all the probabilities must be 1 and the expected value must be μ.

As an example, consider the following scenario: a large number N of dice is thrown, and you are told that the sum of all the shown numbers is S. Based on this information alone, what would be a reasonable assumption for the number of dice showing 1, 2, ..., 6? This is an instance of the situation considered above, with {x1,...,x6} = {1,...,6} and μ = S/N.
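
A small numerical sketch of this dice scenario (added Python, not part of the original article; the function name and the bisection bracket are illustrative choices), assuming 1 < S/N < 6:

    import math

    def maxent_dice(mu):
        # Maximum entropy distribution on the faces {1,...,6} with mean mu.
        # The distribution has the shape p(k) = C * r**k; we solve for r by
        # bisection, since the mean is an increasing function of r.
        faces = range(1, 7)

        def mean_for(r):
            weights = [r ** k for k in faces]
            return sum(k * w for k, w in zip(faces, weights)) / sum(weights)

        lo, hi = 1e-9, 1e9
        for _ in range(200):                 # geometric bisection over a wide bracket
            mid = math.sqrt(lo * hi)
            if mean_for(mid) < mu:
                lo = mid
            else:
                hi = mid
        r = math.sqrt(lo * hi)
        weights = [r ** k for k in faces]
        total = sum(weights)                 # C = 1 / total
        return [w / total for w in weights]

    # Example: if the dice average S/N = 4.5, the resulting probabilities
    # increase geometrically with the face value.
    print(maxent_dice(4.5))

With μ = S/N = 3.5 the sketch returns (essentially) the uniform distribution, i.e. r = 1, matching the uniform example above.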

Finally, among all the discrete distributions supported on the infinite set {x1,x2,...} with mean μ, the maximum entropy distribution has the shape:

P(X = xk) = C·r^(xk)    for k = 1, 2, ...,

where again the constants C and r can be determined by the requirements that the sum of all the probabilities must be 1 and the expected value must be μ.
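
For instance (a concrete case not worked out above), if the xk are the positive integers 1, 2, 3, ..., then P(X = k) = C·r^k with C = (1 − r)/r, that is, P(X = k) = (1 − r)·r^(k−1): the geometric distribution. Matching the mean gives μ = 1/(1 − r), so r = 1 − 1/μ, which requires μ > 1.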

A theorem by Boltzmann

All the above continuous examples are consequences of the following theorem by Boltzmann, whose proof is a standard application of the calculus of variations:

Suppose S is a (possibly infinite) interval of the real numbers R and we are given n smooth functions f1,...,fn and n numbers a1,...,an. We consider the class of all continuous random variables X which are supported on S and which satisfy

E(fj(X)) = aj    for j = 1, ..., n.

The maximum entropy distribution for this class (if it exists) has a probability density of the following shape:

p(x) = C·exp(λ1 f1(x) + ... + λn fn(x))    for all x in S,

where the constants C and λj have to be determined so that the integral of p(x) over S is 1 and the above conditions for the expected values are satisfied.

Conversely, if constants C and λj like this can be found, then p(x) is indeed the density of the maximum entropy distribution for our class.
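
For instance (a worked illustration, not spelled out in the article above), taking S = R, f1(x) = x and f2(x) = x², with a1 = μ and a2 = μ² + σ², the theorem gives a density of the shape

p(x) = C·exp(λ1 x + λ2 x²),

and completing the square in the exponent (with λ2 < 0, as integrability requires) shows that this is a normal density; matching the two constraints then forces it to be N(μ,σ²), the first example above.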

Note that not all classes of distributions contain a maximum entropy distribution. It is possible that a class contains distributions of arbitrarily large entropy (e.g. the class of all distributions on R with mean 0), or that the entropies are bounded above but there is no distribution which attains the maximal entropy (e.g. the class of all distributions X on R with E(X) = E(X2) = E(X3) = 0).

There is also a version of Boltzmann's theorem for discrete distributions whose proof only involves ordinary calculus.