Maximum entropy probability distribution


In statistics and information theory, a maximum entropy probability distribution is a probability distribution whose entropy is larger than that of all other members of a specified class. If nothing is known about a distribution except that it belongs to a certain class, then the maximum entropy distribution for that class is often assumed "by default". The reason is twofold: maximizing entropy, in a sense, means minimizing the amount of prior information built into the distribution; furthermore, many physical systems tend to move towards maximal entropy configurations over time.

Definition of entropy

If X is a discrete random variable with distribution given by

P(X = xk) = pk    for k = 1, 2, ...

then the entropy of X is defined as

H(X) = Σ pk log(1/pk),    the sum running over all k.

If X is a continuous random variable with probability density p(x), then the entropy of X is defined as

H(X) = ∫ p(x) log(1/p(x)) dx,    the integral taken over the real line,

where p(x) log(1/p(x)) is understood to be zero whenever p(x) = 0.

The base of the logarithm used is not important as long as the same one is used consistently: change of base merely results in a rescaling of the entropy. Information theoreticians may prefer to use base 2 in order to express the entropy in bits; mathematicians and physicists will often prefer the natural logarithm, resulting in a unit of nits or nepers for the entropy.
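
As a small illustration of these definitions and of the change of base (an added sketch in Python, not part of the original article; the distribution used here is arbitrary):

    import math

    def entropy(probs, base=math.e):
        # Sum of p * log(1/p); terms with p = 0 are omitted, as noted above.
        return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

    probs = [0.5, 0.25, 0.125, 0.125]        # an arbitrary discrete distribution
    h_nats = entropy(probs)                  # natural logarithm: nats
    h_bits = entropy(probs, base=2)          # base 2: bits
    print(h_bits, h_nats / math.log(2))      # both print 1.75: changing base only rescales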

Examples of maximum entropy distributions

The most important maximum entropy distribution is the normal distribution N(μ,σ²). It has maximum entropy among all distributions on the real line with specified mean μ and standard deviation σ. Therefore, if all you know about a distribution is its mean and standard deviation, it is often reasonable to assume that the distribution is normal.
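
For reference (a check not included in the article above), the entropy of N(μ,σ²), computed with natural logarithms, is

H = (1/2) log(2πeσ²) ≈ 1.419 + log σ,

whereas, for example, a Laplace distribution with the same standard deviation σ has entropy 1 + log(σ√2) ≈ 1.347 + log σ, which is indeed smaller.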

The uniform distribution on the interval [a,b] is the maximum entropy distribution among all continuous distributions which are supported in the interval [a, b] (which means that the probability density is 0 outside of the interval). The uniform distribution on the finite set {x1,...,xn} (which assigns a probability of 1/n to each of these values) is the maximum entropy distribution among all discrete distributions supported on this set.
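
For reference (values not worked out above), these maximum entropies are easy to compute directly: the uniform distribution on [a,b] has entropy log(b − a), and the uniform distribution on {x1,...,xn} has entropy log n.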

The exponential distribution with mean 1/λ is the maximum entropy distribution among all continuous distributions supported in [0,∞) that have a mean of 1/λ.
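
For reference, the entropy of this exponential distribution, with density λe^(−λx) on [0,∞), is 1 − log λ = 1 + log(1/λ) in natural logarithms (a value not stated in the article above).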

Among all the discrete distributions supported on the set {x1,...,xn} with mean μ, the maximum entropy distribution has the following shape:

P(X = xk) = C·r^(xk)    for k = 1, ..., n,

where the positive constants C and r can be determined by the requirements that the sum of all the probabilities must be 1 and the expected value must be μ.

As an example, consider the following scenario: a large number N of dice is thrown, and you are told that the sum of all the shown numbers is S. Based on this information alone, what would be a reasonable assumption for the number of dice showing 1, 2, ..., 6? This is an instance of the situation considered above, with {x1,...,x6} = {1,...,6} and μ = S/N.
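
A small numerical sketch of this dice scenario (added Python, not part of the original article; the function name and the bisection bracket are illustrative choices), assuming 1 < S/N < 6:

    import math

    def maxent_dice(mu):
        # Maximum entropy distribution on the faces {1,...,6} with mean mu.
        # The distribution has the shape p(k) = C * r**k; we solve for r by
        # bisection, since the mean is an increasing function of r.
        faces = range(1, 7)

        def mean_for(r):
            weights = [r ** k for k in faces]
            return sum(k * w for k, w in zip(faces, weights)) / sum(weights)

        lo, hi = 1e-9, 1e9
        for _ in range(200):                 # geometric bisection over a wide bracket
            mid = math.sqrt(lo * hi)
            if mean_for(mid) < mu:
                lo = mid
            else:
                hi = mid
        r = math.sqrt(lo * hi)
        weights = [r ** k for k in faces]
        total = sum(weights)                 # C = 1 / total
        return [w / total for w in weights]

    # Example: if the dice average S/N = 4.5, the resulting probabilities
    # increase geometrically with the face value.
    print(maxent_dice(4.5))

With μ = S/N = 3.5 the sketch returns (essentially) the uniform distribution, i.e. r = 1, matching the uniform example above.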

Finally, among all the discrete distributions supported on the infinite set {x1,x2,...} with mean μ, the maximum entropy distribution has the shape:

P(X = xk) = C·r^(xk)    for k = 1, 2, ...,

where again the constants C and r can be determined by the requirements that the sum of all the probabilities must be 1 and the expected value must be μ.
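
For instance (a concrete case not worked out above), if the xk are the positive integers 1, 2, 3, ..., then P(X = k) = C·r^k with C = (1 − r)/r, that is, P(X = k) = (1 − r)·r^(k−1): the geometric distribution. Matching the mean gives μ = 1/(1 − r), so r = 1 − 1/μ, which requires μ > 1.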

A theorem by Boltzmann

All the above continuous examples are consequences of the following theorem by Boltzmann, whose proof is a standard application of the calculus of variations:

Suppose S is a (possibly infinite) interval of the real numbers R and we are given n smooth functions f1,...,fn and n numbers a1,...,an. We consider the class of all continuous random variables X which are supported on S and which satisfy

E(fj(X)) = aj    for j = 1, ..., n.

The maximum entropy distribution for this class (if it exists) has a probability density of the following shape:

p(x) = C·exp(λ1 f1(x) + ... + λn fn(x))    for all x in S,

where the constants C and λj have to be determined so that the integral of p(x) over S is 1 and the above conditions for the expected values are satisfied.

Conversely, if constants C and λj like this can be found, then p(x) is indeed the density of the maximum entropy distribution for our class.
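
For instance (a worked illustration, not spelled out in the article above), taking S = R, f1(x) = x and f2(x) = x², with a1 = μ and a2 = μ² + σ², the theorem gives a density of the shape

p(x) = C·exp(λ1 x + λ2 x²),

and completing the square in the exponent (with λ2 < 0, as integrability requires) shows that this is a normal density; matching the two constraints then forces it to be N(μ,σ²), the first example above.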

Note that not all classes of distributions contain a maximum entropy distribution. It is possible that a class contains distributions of arbitrarily large entropy (e.g. the class of all distributions on R with mean 0), or that the entropies are bounded above but there is no distribution which attains the maximal entropy (e.g. the class of all distributions X on R with E(X) = E(X2) = E(X3) = 0).

There is also a version of Boltzmann's theorem for discrete distributions whose proof only involves ordinary calculus.