Shannon's noiseless coding theorem

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Reetep (talk | contribs) at 14:44, 13 June 2005. The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Shannon's noiseless coding theorem places an upper and a lower bound on the minimal possible expected length of codewords as a function of the entropy of the input words (which is viewed as a random variable) and of the size of the target alphabet.

Shannon's statement

Let $X$ be a random variable taking values in some finite alphabet $\Sigma_1$ and let $f$ be a decipherable code from $\Sigma_1^*$ to $\Sigma_2^*$, where $|\Sigma_2| = a$. Let $S$ denote the resulting wordlength of $f(X)$.

If $f$ is optimal in the sense that it has the minimal expected wordlength for $X$, then

$$\frac{H(X)}{\log_2 a} \le \mathbb{E} S < \frac{H(X)}{\log_2 a} + 1.$$
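The two bounds can be checked numerically. The sketch below uses a hypothetical three-symbol source with probabilities chosen so that an optimal binary code (here the wordlengths 1, 2, 2 of a Huffman code, stated as an assumption rather than computed) achieves the lower bound exactly:

```python
import math

# Hypothetical example: a 3-symbol source, binary target alphabet (a = 2).
p = [0.5, 0.25, 0.25]
a = 2

# Entropy of X in bits.
H = -sum(pi * math.log2(pi) for pi in p)

# Assumed optimal (Huffman) wordlengths for this distribution.
lengths = [1, 2, 2]
ES = sum(pi * si for pi, si in zip(p, lengths))

# The theorem's bounds: H(X)/log2(a) <= E S < H(X)/log2(a) + 1.
lower = H / math.log2(a)
upper = lower + 1
print(H, ES, lower <= ES < upper)  # 1.5 1.5 True
```

Because each $p_i$ here is a power of $1/2$, the expected wordlength meets the lower bound with equality; for general distributions it falls strictly between the two bounds.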

Proof

Let $s_i$ denote the wordlength of each possible $x_i$ ($1 \le i \le n$). Define $q_i = a^{-s_i}/C$, where $C$ is chosen so that $\sum_{i=1}^n q_i = 1$.

Then

$$\begin{align}
H(X) &= -\sum_{i=1}^n p_i \log_2 p_i \\
&\le -\sum_{i=1}^n p_i \log_2 q_i = \sum_{i=1}^n p_i s_i \log_2 a + \log_2 C \\
&\le \sum_{i=1}^n p_i s_i \log_2 a = \mathbb{E} S \, \log_2 a
\end{align}$$

where the second line follows from Gibbs' inequality and the third line follows from Kraft's inequality: $C = \sum_{i=1}^n a^{-s_i} \le 1$, so $\log_2 C \le 0$. Hence $\mathbb{E} S \ge H(X)/\log_2 a$.
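The lower-bound argument can be sanity-checked numerically: for any wordlengths satisfying Kraft's inequality, the entropy cannot exceed the expected wordlength times $\log_2 a$. The distribution and lengths below are arbitrary assumptions for illustration:

```python
import math

# Hypothetical decipherable binary code: lengths satisfying Kraft's
# inequality, sum a^{-s_i} <= 1.
a = 2
p = [0.6, 0.3, 0.1]
s = [1, 2, 3]  # Kraft sum: 1/2 + 1/4 + 1/8 = 0.875 <= 1

C = sum(a ** -si for si in s)
assert C <= 1  # Kraft's inequality holds

H = -sum(pi * math.log2(pi) for pi in p)
ES = sum(pi * si for pi, si in zip(p, s))

# The chain of inequalities gives H(X) <= E[S] * log2(a).
print(H <= ES * math.log2(a))  # True
```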

For the second inequality we may set

$$s_i = \lceil -\log_a p_i \rceil$$

so that

$$-\log_a p_i \le s_i < -\log_a p_i + 1$$

and so

$$a^{-s_i} \le p_i$$

and

$$\sum_{i=1}^n a^{-s_i} \le \sum_{i=1}^n p_i = 1$$

and so by Kraft's inequality there exists a prefix-free code having those wordlengths. Thus the minimal $S$ satisfies

$$\mathbb{E} S = \sum_{i=1}^n p_i s_i < \sum_{i=1}^n p_i \left(-\log_a p_i + 1\right) = \frac{H(X)}{\log_2 a} + 1.$$
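The construction in this half of the proof (a Shannon code) is easy to carry out directly. The sketch below, using an arbitrary example distribution, rounds up each $-\log_a p_i$, confirms Kraft's inequality, and checks the resulting expected length against the upper bound:

```python
import math

# Sketch of the proof's construction: s_i = ceil(-log_a p_i).
a = 2
p = [0.4, 0.3, 0.2, 0.1]  # arbitrary example distribution

s = [math.ceil(-math.log(pi, a)) for pi in p]

# Kraft's inequality: sum a^{-s_i} <= sum p_i = 1, so a prefix-free
# code with these wordlengths exists.
kraft = sum(a ** -si for si in s)
assert kraft <= 1

H = -sum(pi * math.log2(pi) for pi in p)
ES = sum(pi * si for pi, si in zip(p, s))

# The expected wordlength is strictly below H(X)/log2(a) + 1.
print(ES < H / math.log2(a) + 1)  # True
```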