Least-squares estimation of linear regression coefficients

In parametric statistics, the least-squares estimator is often used to estimate the coefficients of a linear regression. It is defined as the value of the coefficients that minimizes the sum of the squared residuals. In this article, after setting the mathematical context of linear regression, we will motivate the use of the least-squares estimator ${\widehat {\theta }}_{LS}$ and derive its expression (as seen for example in the article regression analysis):

{\widehat {\theta }}_{LS}=(\mathbf {X} ^{t}\mathbf {X} )^{-1}\mathbf {X} ^{t}{\vec {Y}}

We conclude by giving some qualities of this estimator and a geometrical interpretation.

Assumptions

Let $p\in \mathbb {N} ^{+}$ , and let Y be a random variable taking values in $\mathbb {R}$ , which we call the observation.

We next define the function η, linear in $\theta$ :

\eta (X;\theta )=\sum _{j=1}^{p}\theta _{j}X_{j},

where

for $j\in \{1,...,p\}$ , $X_{j}$ is a random variable taking values in $\mathbb {R}$ , called a factor, and
$\theta _{j}$ is a scalar; we write $\theta ^{t}=(\theta _{1},\cdots ,\theta _{p})$ , where $\theta ^{t}$ denotes the transpose of the vector $\theta$ .

Let $X^{t}=(X_{1},\cdots ,X_{p})$ . We can write $\eta (X;\theta )=X^{t}\theta$ . Define the error to be:

\varepsilon (\theta )=Y-X^{t}\theta

We suppose that there exists a true parameter ${\overline {\theta }}\in \mathbb {R} ^{p}$ such that $\mathbb {E} [\varepsilon ({\overline {\theta }})|X]=0$ . This means that, given the random variables $(X_{1},\cdots ,X_{p})$ , the best prediction we can make of Y is $\mathbb {E} [Y|X]=\eta (X;{\overline {\theta }})=X^{t}{\overline {\theta }}$ . Henceforth, $\varepsilon$ will denote $\varepsilon ({\overline {\theta }})$ and η will represent $\eta (X;{\overline {\theta }})$ .

Least-squares estimator

The idea behind the least-squares estimator is to see linear regression as an orthogonal projection. Let F be the L2-space of all random variables whose square has a finite Lebesgue integral. Let G be the linear subspace of $F$ generated by $X_{1},\cdots ,X_{p}$ (supposing that $Y\in F$ and $(X_{1},\cdots ,X_{p})\in F^{p}$ ). We show in this paragraph that the function $\eta$ is an orthogonal projection of Y on G and we will construct the least-squares estimator.

Seeing linear regression as an orthogonal projection

We have $\mathbb {E} (Y|X)=\eta$ , and $Y\mapsto \mathbb {E} (Y|X)$ is a projection, which means that $\eta$ is a projection of Y on G. Moreover, this projection is an orthogonal one.

To see this, we can build a scalar product in F: for any pair of random variables $X,Y\in F$ , we define $\langle X,Y\rangle _{2}:=\mathbb {E} [XY]$ . It is indeed a scalar product because it is bilinear, symmetric and positive, and if $\|X\|_{2}^{2}=0$ , then $X=0$ almost everywhere (where $\|X\|_{2}^{2}:=\langle X,X\rangle _{2}$ is the square of the norm corresponding to this scalar product).

For all $1\leq j\leq p$ ,

$\langle X_{j},\varepsilon \rangle _{2}$	$=\langle X_{j},Y-X^{t}{\overline {\theta }}\rangle _{2}$
	$=\langle X_{j},Y\rangle _{2}-\langle X_{j},\mathbb {E} [Y|X]\rangle _{2}$
	$=\mathbb {E} [X_{j}Y]-\mathbb {E} [X_{j}\mathbb {E} [Y|X]]$
	$=\mathbb {E} [X_{j}Y]-\mathbb {E} [\mathbb {E} [X_{j}Y|X]]$
	$=\mathbb {E} [X_{j}Y]-\mathbb {E} [X_{j}Y]$
$\langle X_{j},\varepsilon \rangle _{2}$	$=0$

Therefore, $\varepsilon$ is orthogonal to any $X_{j}$ and hence to the whole of the subspace G, which means that $\eta$ is a projection of Y on G, orthogonal with respect to the scalar product we have just defined. We have therefore shown:

\eta (X;{\overline {\theta }})={\underset {f\in G}{\operatorname {arg\,min} }}\,\|Y-f\|_{2}^{2}.
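The orthogonality of $\varepsilon$ to each factor can be illustrated by simulation. The following is a minimal sketch, assuming Gaussian factors and Gaussian errors and a hypothetical true parameter `theta_bar`; none of these choices come from the text above, and the expectation $\mathbb {E} [X_{j}\varepsilon ]$ is only approximated by an average over many draws.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily

n_draws, p = 200_000, 3
theta_bar = np.array([1.0, -2.0, 0.5])  # hypothetical true parameter

X = rng.normal(size=(n_draws, p))  # draws of the factors (X_1, ..., X_p)
eps = rng.normal(size=n_draws)     # errors drawn independently of X, so E[eps | X] = 0
Y = X @ theta_bar + eps            # observations Y = X^t theta_bar + eps

# Empirical version of <X_j, eps>_2 = E[X_j * eps], one value per factor.
inner_products = (X * eps[:, None]).mean(axis=0)
print(inner_products)
```

Each printed value should be close to zero, up to Monte Carlo error of order $1/{\sqrt {n}}$ in the number of draws.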

Estimating the coefficients

If, for each $j\in \{1,\cdots ,p\}$ , we have a sample $(X_{j}^{1},\cdots ,X_{j}^{n})$ of size $n>p$ of $X_{j}$ , along with a vector ${\vec {Y}}$ of n observations of Y, we can build an estimate of the coefficients of this orthogonal projection. To do this, we can use an estimate of the scalar product defined earlier.

For any pair of samples ${\vec {U}},{\vec {V}}\in F^{n}$ of size n of random variables U and V, we define $\langle {\vec {U}},{\vec {V}}\rangle :={\vec {U}}^{t}{\vec {V}}$ , where ${\vec {U}}^{t}$ is the transpose of vector ${\vec {U}}$ , and $\|\cdot \|:={\sqrt {\langle \cdot ,\cdot \rangle }}$ . Note that the scalar product $\langle \cdot ,\cdot \rangle$ is defined in $F^{n}$ and no longer in F.

Let us define the design matrix (or random design), an $n\times p$ random matrix: $\mathbf {X} =\left[{\begin{matrix}X_{1}^{1}&\cdots &X_{p}^{1}\\\vdots &&\vdots \\X_{1}^{n}&\cdots &X_{p}^{n}\end{matrix}}\right]$

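We can now adapt the minimization of the sum of the squared residuals to the sample: the least-squares estimator ${\widehat {\theta }}_{LS}$ will be the value, if it exists, of $\theta$ which minimizes $\|\mathbf {X} \theta -{\vec {Y}}\|^{2}$ . At a minimizer the gradient $2\mathbf {X} ^{t}(\mathbf {X} \theta -{\vec {Y}})$ vanishes, so the residual vector ${\vec {\varepsilon }}({\widehat {\theta }}_{LS})={\vec {Y}}-\mathbf {X} {\widehat {\theta }}_{LS}$ is orthogonal to every column of $\mathbf {X}$ : $\langle \mathbf {X} ,{\vec {\varepsilon }}({\widehat {\theta }}_{LS})\rangle =\mathbf {X} ^{t}({\vec {Y}}-\mathbf {X} {\widehat {\theta }}_{LS})=0$ .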

This yields the normal equations $\mathbf {X} ^{t}\mathbf {X} {\widehat {\theta }}_{LS}=\mathbf {X} ^{t}{\vec {Y}}$ . If $\mathbf {X}$ has full column rank p, then the $p\times p$ matrix $\mathbf {X} ^{t}\mathbf {X}$ is invertible. In that case we can compute the least-squares estimator explicitly by inverting $\mathbf {X} ^{t}\mathbf {X}$ :

{\widehat {\theta }}_{LS}=(\mathbf {X} ^{t}\mathbf {X} )^{-1}\mathbf {X} ^{t}{\vec {Y}}
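The closed-form expression can be checked numerically. The sketch below uses NumPy on simulated data; the helper name `least_squares`, the sample sizes and the hypothetical true parameter `theta_bar` are illustrative assumptions, not part of the derivation. It solves the system $\mathbf {X} ^{t}\mathbf {X} \theta =\mathbf {X} ^{t}{\vec {Y}}$ directly rather than forming the inverse, which gives the same result when $\mathbf {X}$ has full column rank.

```python
import numpy as np

def least_squares(X, y):
    """Return theta solving X^t X theta = X^t y (X assumed of full column rank)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Solving the linear system avoids computing (X^t X)^{-1} explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(1)            # fixed seed, chosen arbitrarily
n, p = 50, 3
X = rng.normal(size=(n, p))               # simulated design matrix
theta_bar = np.array([2.0, -1.0, 0.5])    # hypothetical true parameter
y = X @ theta_bar + 0.1 * rng.normal(size=n)

theta_hat = least_squares(X, y)

# Reference solution from NumPy's built-in least-squares solver.
theta_ref, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat, theta_ref)
```

In practice a solver such as `np.linalg.lstsq` is usually preferable to the explicit formula when $\mathbf {X} ^{t}\mathbf {X}$ is badly conditioned.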

Qualities and geometrical interpretation

Qualities of this estimator

Not only is the least-squares estimator easy to compute, but under the Gauss-Markov assumptions, the Gauss-Markov theorem states that the least-squares estimator is the best linear unbiased estimator (BLUE) of ${\overline {\theta }}$ .

The vector of errors ${\vec {\varepsilon }}={\vec {Y}}-\mathbf {X} {\overline {\theta }}$ is said to fulfil the Gauss-Markov assumptions if:

$\mathbb {E} {\vec {\varepsilon }}={\vec {0}}$
$\mathbb {V} {\vec {\varepsilon }}=\sigma ^{2}\mathbf {I} _{n}$ (uncorrelated but not necessarily independent; homoscedastic but not necessarily identically distributed)

where $\sigma ^{2}<+\infty$ and $\mathbf {I} _{n}$ is the $n\times n$ identity matrix.
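The unbiasedness part of this statement can be illustrated by simulation. The following sketch holds a simulated design matrix fixed and draws uniform (hence non-Gaussian) errors with mean zero and variance $\sigma ^{2}$ ; the design, the error distribution, the number of replications and `theta_bar` are all assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)    # fixed seed, chosen arbitrarily
n, p, n_rep = 40, 2, 5000
sigma = 0.5
theta_bar = np.array([1.0, 3.0])  # hypothetical true parameter
X = rng.normal(size=(n, p))       # design held fixed across replications

# Uniform errors on [-a, a] have mean 0 and variance a^2 / 3,
# so a = sigma * sqrt(3) gives variance sigma^2.
a = sigma * np.sqrt(3.0)
eps = rng.uniform(-a, a, size=(n_rep, n))
Y = X @ theta_bar + eps           # one row of n observations per replication

# Solve the normal equations for all replications at once.
theta_hats = np.linalg.solve(X.T @ X, X.T @ Y.T).T  # shape (n_rep, p)

mean_estimate = theta_hats.mean(axis=0)
print(mean_estimate)
```

The average of the estimates over the replications should be close to `theta_bar`, even though the errors are not normally distributed.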

This decisive advantage has led to a sometimes abusive use of least-squares. The optimality of least-squares depends on the fulfilment of the Gauss-Markov hypotheses, and applying this method in a situation where these conditions are not met can lead to inaccurate results. For example, in the study of time-series, it is often unrealistic to assume that the errors are uncorrelated.

Geometrical interpretation

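The situation described by the linear regression problem can be geometrically seen as follows: the vector of fitted values $\mathbf {X} {\widehat {\theta }}_{LS}$ is the orthogonal projection of ${\vec {Y}}$ onto the subspace of $\mathbb {R} ^{n}$ spanned by the columns of $\mathbf {X}$ , and the residual vector ${\vec {Y}}-\mathbf {X} {\widehat {\theta }}_{LS}$ is orthogonal to that subspace.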
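One way to make this picture concrete is to compute the projection explicitly. In the sketch below, `H` is the matrix $\mathbf {X} (\mathbf {X} ^{t}\mathbf {X} )^{-1}\mathbf {X} ^{t}$ (often called the hat matrix, a name not used elsewhere in this article); the data are simulated and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)  # fixed seed, chosen arbitrarily
n, p = 6, 2
X = rng.normal(size=(n, p))     # simulated design matrix, full column rank
y = rng.normal(size=n)          # simulated observations

# Orthogonal projection matrix onto the column space of X.
H = X @ np.linalg.solve(X.T @ X, X.T)

fitted = H @ y                  # equals X @ theta_hat
residual = y - fitted           # component of y orthogonal to the columns of X

print(X.T @ residual)           # numerically close to the zero vector
```

Because the fitted vector and the residual are orthogonal, the squared norms satisfy $\|{\vec {Y}}\|^{2}=\|\mathbf {X} {\widehat {\theta }}_{LS}\|^{2}+\|{\vec {Y}}-\mathbf {X} {\widehat {\theta }}_{LS}\|^{2}$ .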

The least-squares estimator is also an M-estimator of $\rho$ -type for $\rho (r):={\frac {r^{2}}{2}}$ .
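To illustrate this, the sketch below minimizes $\sum _{i}\rho (r_{i})$ over $\theta$ by plain gradient descent, using the derivative $\psi (r)=\rho '(r)=r$ , and compares the result with a direct least-squares solve. The function names, the step size and the iteration count are assumptions made for the example; gradient descent is used only for illustration and is not how the estimator is normally computed.

```python
import numpy as np

def rho(r):
    """Loss rho(r) = r^2 / 2."""
    return 0.5 * r**2

def psi(r):
    """Derivative of rho: psi(r) = r."""
    return r

def m_estimate(X, y, n_iter=2000):
    """Minimize sum_i rho(y_i - x_i^t theta) by gradient descent."""
    theta = np.zeros(X.shape[1])
    # Step 1 / lambda_max(X^t X); the spectral norm of X is its largest singular value.
    step = 1.0 / np.linalg.norm(X, ord=2) ** 2
    for _ in range(n_iter):
        r = y - X @ theta
        # The gradient of the objective is -X^t psi(r), so we step along +X^t psi(r).
        theta = theta + step * (X.T @ psi(r))
    return theta

rng = np.random.default_rng(4)  # fixed seed, chosen arbitrarily
X = rng.normal(size=(100, 3))   # simulated design matrix
y = X @ np.array([1.0, 0.0, -1.0]) + rng.normal(size=100)

theta_m = m_estimate(X, y)
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_m, theta_ls)
```

Setting the gradient to zero gives $\mathbf {X} ^{t}\psi ({\vec {Y}}-\mathbf {X} \theta )=0$ , which for $\psi (r)=r$ is exactly the system $\mathbf {X} ^{t}\mathbf {X} \theta =\mathbf {X} ^{t}{\vec {Y}}$ .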