
Minimum mean square error


In statistics and signal processing, a minimum mean square error (MMSE) estimator is an estimator that minimizes the mean square error (MSE), a common measure of estimator quality.

The term MMSE specifically refers to estimation in a Bayesian setting with a quadratic cost function. Unlike the non-Bayesian approach, where the parameters of interest are assumed to be deterministic but unknown constants, the Bayesian estimator seeks to estimate a parameter that is itself a random variable. The Bayesian approach, based directly on Bayes' theorem, provides a framework for handling such problems by allowing prior knowledge to be incorporated into the estimator. Furthermore, Bayesian estimation provides yet another alternative to the minimum-variance unbiased estimator (MVUE). This is useful when the MVUE cannot be found.

In the alternative frequentist setting there does not exist a single estimator having minimal MSE. A somewhat similar concept can be obtained within the frequentist point of view if one requires unbiasedness, since an estimator may exist that minimizes the variance (and hence the MSE) among unbiased estimators. Such an estimator is then called the MVUE.

Definition

Let $x$ be an unknown random vector variable, and let $y$ be a known random vector variable (the measurement or observation). An estimator $\hat{x}(y)$ of $x$ is any function of the measurement $y$. The estimation error vector is given by $e = \hat{x} - x$ and its mean squared error (MSE) is given by the trace of the error covariance matrix

$\mathrm{MSE} = \operatorname{tr}\left\{ E\left\{ (\hat{x} - x)(\hat{x} - x)^{T} \right\} \right\} ,$

where the expectation is taken over both $x$ and $y$. When $x$ is a scalar variable, the MSE expression simplifies to $E\left\{ (\hat{x} - x)^{2} \right\}$. The MMSE estimator is then defined as the estimator achieving minimal MSE.
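
As a quick illustration, the MSE of any given estimator can be approximated by Monte Carlo simulation. The short sketch below is a hypothetical example, not part of the standard treatment; the scalar model and the estimator used are assumptions chosen only to exercise the definition.

```python
# Hypothetical sketch: Monte Carlo approximation of the MSE of an estimator.
# The model and the estimator below are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

x = rng.normal(0.0, 1.0, size=n)        # unknown scalar x ~ N(0, 1)
y = x + rng.normal(0.0, 0.5, size=n)    # noisy measurement of x

x_hat = 0.8 * y                         # a candidate estimator of x from y
mse = np.mean((x_hat - x) ** 2)         # scalar case: MSE = E[(x_hat - x)^2]
print(mse)
```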

Properties

  • Under some weak regularity assumptions,[1] the MMSE estimator is uniquely defined, and is given by $\hat{x}_{\mathrm{MMSE}}(y) = E\{x \mid y\}$. In other words, the MMSE estimator is the conditional expectation of $x$ given the known observed value of the measurements.
  • If $x$ and $y$ are jointly Gaussian, then the MMSE estimator is linear, i.e., it has the form $Wy + b$ for a matrix $W$ and a constant $b$. As a consequence, to find the MMSE estimator, it is sufficient to find the linear MMSE estimator. Such a situation occurs in the example presented in the next section, and is illustrated in the sketch below.
  • The orthogonality principle: the estimation error of the MMSE estimator is orthogonal to every function of the measurements, i.e., $E\left\{ (\hat{x}_{\mathrm{MMSE}} - x)\, g(y)^{T} \right\} = 0$ for all functions $g(y)$ of the measurements.
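
The jointly Gaussian case can be checked numerically. In the hypothetical sketch below (all model parameters are assumptions chosen for illustration), the conditional mean $E\{x \mid y\}$, which is linear in $y$, attains a smaller empirical MSE than a plausible alternative estimator.

```python
# Hypothetical check: for jointly Gaussian x and y, the conditional mean
# E{x|y} is linear in y and minimizes the MSE. Parameters are assumptions.
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

x = rng.normal(0.0, 2.0, size=n)        # prior: x ~ N(0, 4)
y = x + rng.normal(0.0, 1.0, size=n)    # measurement: y = x + z, z ~ N(0, 1)

x_mmse = (4.0 / (4.0 + 1.0)) * y        # E{x|y} = C_XY C_Y^{-1} y = 0.8 y
x_naive = y                             # using the raw measurement instead

print(np.mean((x_mmse - x) ** 2))       # ~ 0.8  (= 4 - 16/5)
print(np.mean((x_naive - x) ** 2))      # ~ 1.0  (strictly larger)
```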

Linear MMSE estimator

In many cases, it is not possible to determine a closed form for the MMSE estimator. Also, such estimators are often computationally expensive to implement, since they may require multidimensional integration. In these cases, one possibility is to abandon the full optimality requirement and seek the technique minimizing the MSE within a particular class, such as the class of linear estimators. The linear MMSE estimator is the estimator achieving minimum MSE among all estimators of the form $Wy + b$, where the measurement $y$ is a random vector, $W$ is a matrix and $b$ is a vector. Such a linear estimator depends only on the first two moments of the probability density function. These estimators are sometimes referred to as Wiener filters.

Let us have a linear MMSE estimator given as $\hat{x} = Wy + b$. For the estimator to be unbiased, the mean error should be zero. This means,

$E\{\hat{x}\} = E\{x\} \quad \Longrightarrow \quad b = \bar{x} - W\bar{y} .$

Plugging the expression for $b$ in above, we get

$\hat{x} = W(y - \bar{y}) + \bar{x} ,$

where $\bar{x} = E\{x\}$ and $\bar{y} = E\{y\}$. Thus we can re-write the estimator as

$\hat{x} - \bar{x} = W(y - \bar{y})$

and the expression for estimation error becomes

$\hat{x} - x = W(y - \bar{y}) - (x - \bar{x}) .$

From the orthogonality principle, we can have $E\{(\hat{x} - x)(y - \bar{y})^{T}\} = 0$. Here the left hand side term is

$E\{(\hat{x} - x)(y - \bar{y})^{T}\} = E\left\{ \left( W(y - \bar{y}) - (x - \bar{x}) \right)(y - \bar{y})^{T} \right\} = W C_{Y} - C_{XY} .$

When equated to zero, we obtain the desired expression for $W$ as

$W = C_{XY} C_{Y}^{-1} .$

Here $C_{XY}$ is the cross-covariance matrix between $X$ and $Y$, and $C_{Y}$ is the covariance matrix of $Y$. Since $C_{XY} = C_{YX}^{T}$, the expression can also be re-written in terms of $C_{YX}$ as

$C_{Y} W^{T} = C_{YX} .$
Standard methods such as Gaussian elimination can be used to solve this matrix equation for $W$. Since $C_{Y}$ is a symmetric positive definite matrix, the system can be solved twice as fast with the Cholesky decomposition. Levinson recursion is a fast method when $C_{Y}$ is also a Toeplitz matrix. This can happen when $y$ is a wide sense stationary process.
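
For a concrete sense of this solution step, the sketch below (the covariances are made-up assumptions, with $C_{Y}$ chosen symmetric positive definite) solves $C_{Y} W^{T} = C_{YX}$ with a Cholesky factorization using SciPy.

```python
# Hypothetical sketch: solving C_Y W^T = C_YX for the linear MMSE matrix W.
# The covariances below are made-up; C_Y must be symmetric positive definite.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

C_Y = np.array([[2.0, 1.0, 0.5],
                [1.0, 2.0, 1.0],
                [0.5, 1.0, 2.0]])
C_YX = np.array([[1.0],
                 [0.5],
                 [0.2]])

# Cholesky factorization exploits symmetry and positive definiteness.
W = cho_solve(cho_factor(C_Y), C_YX).T
print(W)                     # row vector W = C_XY C_Y^{-1}

# If C_Y were also Toeplitz (e.g. y wide-sense stationary), the system could
# instead be solved with scipy.linalg.solve_toeplitz (Levinson recursion).
```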

The covariance of the MMSE estimation error will then be given by

$\begin{aligned} C_{e} &= E\{(\hat{x} - x)(\hat{x} - x)^{T}\} \\ &= E\left\{ (\hat{x} - x)\left( W(y - \bar{y}) - (x - \bar{x}) \right)^{T} \right\} \\ &= E\{(\hat{x} - x)(y - \bar{y})^{T}\} W^{T} - E\{(\hat{x} - x)(x - \bar{x})^{T}\} . \end{aligned}$

The first term in the third line is zero due to the orthogonality principle. Since $W = C_{XY} C_{Y}^{-1}$, we can re-write $C_{e}$ in terms of covariance matrices as

$C_{e} = C_{X} - C_{XY} C_{Y}^{-1} C_{YX} .$

Thus the minimum mean square error achievable by such a linear estimator is

$\mathrm{LMMSE} = \operatorname{tr}\{C_{e}\} .$
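
Continuing with the same made-up covariances as in the sketch above, the error covariance and the minimum achievable MSE can be computed directly; the prior covariance $C_{X}$ below is an additional assumption.

```python
# Hypothetical continuation: error covariance C_e = C_X - C_XY C_Y^{-1} C_YX
# and the minimum MSE tr(C_e), using made-up covariances and an assumed C_X.
import numpy as np

C_X  = np.array([[1.5]])
C_Y  = np.array([[2.0, 1.0, 0.5],
                 [1.0, 2.0, 1.0],
                 [0.5, 1.0, 2.0]])
C_YX = np.array([[1.0],
                 [0.5],
                 [0.2]])

W   = np.linalg.solve(C_Y, C_YX).T    # W = C_XY C_Y^{-1}
C_e = C_X - W @ C_YX                  # = C_X - C_XY C_Y^{-1} C_YX
print(C_e, np.trace(C_e))             # minimum achievable (linear) MSE
```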

Linear process

Furthermore, let us have an underlying linear process $y = Ax + z$, where $A$ is a known matrix and $z$ is a random noise vector with mean $E\{z\} = 0$ and cross-covariance $C_{XZ} = 0$. The required covariance matrices will be

$C_{Y} = A C_{X} A^{T} + C_{Z}$

and

$C_{XY} = C_{X} A^{T} .$

Thus the expression for the linear MMSE estimator further modifies to

$\hat{x} = C_{X} A^{T} \left( A C_{X} A^{T} + C_{Z} \right)^{-1} (y - \bar{y}) + \bar{x} ,$

which, by the matrix inversion lemma, can also be written as

$\hat{x} = \left( A^{T} C_{Z}^{-1} A + C_{X}^{-1} \right)^{-1} A^{T} C_{Z}^{-1} (y - \bar{y}) + \bar{x} .$

When $C_{X}^{-1} = 0$, corresponding to no prior information about $x$, and $\bar{x} = 0$, the expression for $\hat{x}$ is the same as that of the weighted least squares estimate with $C_{Z}^{-1}$ as the weight matrix.
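
The equivalence of the two forms above can be verified numerically. Everything in the sketch below (the matrix $A$, the covariances, the measurement) is an arbitrary assumption used only to exercise the formulas.

```python
# Hypothetical check that the two expressions for the linear MMSE estimator
# of x in y = A x + z coincide. All quantities below are arbitrary test data.
import numpy as np

rng  = np.random.default_rng(2)
A    = rng.normal(size=(5, 2))              # known model matrix (assumed)
C_X  = np.diag([2.0, 0.5])                  # prior covariance of x (assumed)
C_Z  = 0.1 * np.eye(5)                      # noise covariance (assumed)
xbar = np.array([1.0, -1.0])
ybar = A @ xbar                             # E{y} = A xbar, since E{z} = 0
y    = ybar + rng.normal(size=5)            # an arbitrary measurement

form1 = C_X @ A.T @ np.linalg.solve(A @ C_X @ A.T + C_Z, y - ybar) + xbar
form2 = np.linalg.solve(A.T @ np.linalg.inv(C_Z) @ A + np.linalg.inv(C_X),
                        A.T @ np.linalg.inv(C_Z) @ (y - ybar)) + xbar
print(np.allclose(form1, form2))            # True: the two forms agree
```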

Examples

Example 1

We shall take a linear prediction problem as an example. Let a linear combination of observed scalar random variables $x_{1}$, $x_{2}$ and $x_{3}$ be used to estimate another future scalar random variable $x_{4}$ such that $\hat{x}_{4} = \sum_{i=1}^{3} w_{i} x_{i}$. If the random variables $x = [x_{1}, x_{2}, x_{3}, x_{4}]^{T}$ are real Gaussian random variables with zero mean and covariance matrix given by

$\operatorname{cov}(X) = E[xx^{T}] = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 2 & 5 & 8 & 9 \\ 3 & 8 & 6 & 10 \\ 4 & 9 & 10 & 15 \end{bmatrix} ,$

then our task is to find the coefficients $w_{i}$ such that it will yield an optimal linear estimate $\hat{x}_{4}$.

In terms of the terminology developed in the previous section, for this problem we have the observation vector $y = [x_{1}, x_{2}, x_{3}]^{T}$, the estimator matrix $W = [w_{1}, w_{2}, w_{3}]$ as a row vector, and the estimated variable $x_{4}$ as a scalar quantity. The autocorrelation matrix $C_{Y}$ is defined as

$C_{Y} = \begin{bmatrix} E[x_{1}x_{1}] & E[x_{2}x_{1}] & E[x_{3}x_{1}] \\ E[x_{1}x_{2}] & E[x_{2}x_{2}] & E[x_{3}x_{2}] \\ E[x_{1}x_{3}] & E[x_{2}x_{3}] & E[x_{3}x_{3}] \end{bmatrix} = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 5 & 8 \\ 3 & 8 & 6 \end{bmatrix} .$

The cross correlation matrix $C_{YX}$ is defined as

$C_{YX} = \begin{bmatrix} E[x_{4}x_{1}] \\ E[x_{4}x_{2}] \\ E[x_{4}x_{3}] \end{bmatrix} = \begin{bmatrix} 4 \\ 9 \\ 10 \end{bmatrix} .$

We now solve the equation $C_{Y} W^{T} = C_{YX}$ by inverting $C_{Y}$ and pre-multiplying to get

$C_{Y}^{-1} C_{YX} = \begin{bmatrix} 4.85 & -1.71 & -0.142 \\ -1.71 & 0.428 & 0.2857 \\ -0.142 & 0.2857 & -0.1429 \end{bmatrix} \begin{bmatrix} 4 \\ 9 \\ 10 \end{bmatrix} = \begin{bmatrix} 2.57 \\ -0.142 \\ 0.5714 \end{bmatrix} = W^{T} .$

So we have $w_{1} = 2.57$, $w_{2} = -0.142$ and $w_{3} = 0.5714$ as the optimal coefficients for $\hat{x}_{4}$. Computing the minimum mean square error then gives $\left\Vert e \right\Vert_{\min}^{2} = E[x_{4}x_{4}] - W C_{YX} = 15 - W C_{YX} = 0.2857$.[2] Note that it is not necessary to obtain an explicit matrix inverse of $C_{Y}$ to compute the value of $W$. The matrix equation can be solved by well known methods such as Gauss elimination method. A shorter, non-numerical example can be found in orthogonality principle.
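
The numbers in this example are easy to reproduce; the sketch below simply repeats the computation with NumPy, using no assumptions beyond the covariances given above.

```python
# Reproducing Example 1 numerically: solve C_Y W^T = C_YX and evaluate the
# minimum mean square error E[x4 x4] - W C_YX.
import numpy as np

C_Y  = np.array([[1.0, 2.0, 3.0],
                 [2.0, 5.0, 8.0],
                 [3.0, 8.0, 6.0]])
C_YX = np.array([4.0, 9.0, 10.0])

W = np.linalg.solve(C_Y, C_YX)      # [ 2.5714, -0.1429,  0.5714]
mmse = 15.0 - W @ C_YX              # E[x4 x4] - W C_YX = 0.2857
print(W, mmse)
```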

Example 2

Consider a vector $y$ formed by taking $N$ observations of a random scalar parameter $x$ disturbed by white Gaussian noise. We can describe the process by a linear equation $y = 1x + z$, where $1 = [1, 1, \ldots, 1]^{T}$. Depending on context it will be clear if $1$ represents a scalar or a vector. Let the a priori distribution of $x$ be uniform over an interval $[-x_{0}, x_{0}]$, and thus $x$ will have variance of $\sigma_{X}^{2} = x_{0}^{2}/3$. Let the noise vector $z$ be normally distributed as $N(0, \sigma^{2}I)$, where $I$ is an identity matrix. Also $x$ and $z$ are independent and $C_{XZ} = 0$. It is easy to see that

$E\{y\} = 0 , \qquad C_{Y} = E\{yy^{T}\} = \sigma_{X}^{2} 11^{T} + \sigma^{2} I , \qquad C_{XY} = E\{xy^{T}\} = \sigma_{X}^{2} 1^{T} .$

Thus, the linear MMSE estimator is given by

$\hat{x} = C_{XY} C_{Y}^{-1} y = \sigma_{X}^{2} 1^{T} \left( \sigma_{X}^{2} 11^{T} + \sigma^{2} I \right)^{-1} y = \frac{\sigma_{X}^{2}}{\sigma^{2}}\, 1^{T} \left( I - \frac{\frac{\sigma_{X}^{2}}{\sigma^{2}} 11^{T}}{1 + \frac{\sigma_{X}^{2}}{\sigma^{2}} 1^{T}1} \right) y .$

The last step is due to a special case of the matrix binomial inverse theorem (also known as the Woodbury matrix identity). The matrix thus obtained in the last step, $I - \frac{\frac{\sigma_{X}^{2}}{\sigma^{2}} 11^{T}}{1 + \frac{\sigma_{X}^{2}}{\sigma^{2}} 1^{T}1}$, will have $1 - \frac{\sigma_{X}^{2}/\sigma^{2}}{1 + N\sigma_{X}^{2}/\sigma^{2}}$ as diagonal terms and $-\frac{\sigma_{X}^{2}/\sigma^{2}}{1 + N\sigma_{X}^{2}/\sigma^{2}}$ as off-diagonal terms. Taking the product with respect to $1^{T}$, we get the required estimator

$\hat{x} = \frac{\sigma_{X}^{2}}{\sigma_{X}^{2} + \sigma^{2}/N}\, \bar{y} ,$

where for $y = [y_{1}, y_{2}, \ldots, y_{N}]^{T}$ we have

$\bar{y} = \frac{1^{T} y}{N} = \frac{1}{N} \sum_{i=1}^{N} y_{i} .$

For very large $N$, we see that the MMSE estimator of a scalar unknown random variable with uniform a priori distribution can be simply approximated by the arithmetic average of all the observed data

$\hat{x} = \frac{1}{N} \sum_{i=1}^{N} y_{i} .$

However, the estimator is suboptimal since it is constrained to be linear.
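
A small simulation (all parameter values below are illustrative assumptions) shows the behaviour of this estimator: it shrinks the sample average of the observations toward the prior mean of zero, and the shrinkage disappears as $N$ grows.

```python
# Hypothetical simulation of Example 2: a scalar x with a uniform prior on
# [-x0, x0] observed N times in white Gaussian noise. Parameters are assumed.
import numpy as np

rng = np.random.default_rng(3)
N, x0, sigma = 10, 1.0, 0.5
sigma_X2 = x0**2 / 3.0                     # variance of the uniform prior

x = rng.uniform(-x0, x0)                   # the unknown random parameter
y = x + rng.normal(0.0, sigma, size=N)     # observations y = 1*x + z

ybar = y.mean()
x_lmmse = sigma_X2 / (sigma_X2 + sigma**2 / N) * ybar   # linear MMSE estimate
print(x, x_lmmse, ybar)                    # estimate shrinks ybar toward 0
```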

Notes

  1. ^ Lehmann and Casella, Corollary 4.1.2.
  2. ^ Moon and Stirling.

Further reading

  • Johnson, D. (22 November 2004). Minimum Mean Squared Error Estimators. Connexions
  • Prediction and Improved Estimation in Linear Models, by J. Bibby, H. Toutenburg (Wiley, 1977). This book looks almost exclusively at minimum mean-square error estimation and inference.
  • Jaynes, E. T. Probability Theory: The Logic of Science. Cambridge University Press, 2003.
  • Lehmann, E. L.; Casella, G. (1998). Theory of Point Estimation (2nd ed.). Springer. Ch. 4. ISBN 0-387-98502-6.
  • Kay, S. M. (1993). Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice Hall. pp. 344–350. ISBN 0-13-042268-1.
  • Moon, T.K. and W.C. Stirling. Mathematical Methods and Algorithms for Signal Processing. Prentice Hall. 2000.