Consider a collection of $m$ linear regression problems for $n$ observations, related through a set of common predictor variables collected in an $n \times k$ design matrix $X$, and jointly normal errors:

$$y_{c} = X\beta_{c} + \epsilon_{c},$$

where the subscript $c$ denotes a column vector of $n$ observations for each measurement $c$ ($c = 1,\ldots,m$).
The noise terms are jointly normal across the $m$ measurements within each observation. That is, each row vector $\epsilon_{i}^{T}$ of the error matrix represents an $m$-length vector of correlated errors, one for each of the dependent variables:

$$\epsilon_{i}^{T} \sim N_{m}(0,\Sigma_{\epsilon}),$$

where the noise $\epsilon_{i}^{T}$ is i.i.d. and normally distributed for all rows $i$ ($i = 1,\ldots,n$). Each observation row then satisfies

$$y_{i}^{T} = x_{i}^{T}B + \epsilon_{i}^{T},$$
where $B$ is a $k \times m$ matrix

$$B = [\beta_{1},\cdots,\beta_{c},\cdots,\beta_{m}].$$
We can write the entire regression problem in matrix form as:

$$Y = XB + E,$$

where $Y$ and $E$ are $n \times m$ matrices.
The classical, frequentist linear least squares solution is to simply estimate the matrix of regression coefficients $\hat{B}$ using the Moore-Penrose pseudoinverse:

$$\hat{B} = (X^{T}X)^{-1}X^{T}Y.$$
To obtain the Bayesian solution, we need to specify the conditional likelihood and then find the appropriate conjugate prior. As with the univariate case of Bayesian linear regression, we will find that we can specify a natural conditional conjugate prior (which is scale dependent).
Let us write our conditional likelihood as

$$\rho(E\mid\Sigma_{\epsilon}) \propto |\Sigma_{\epsilon}|^{-n/2}\exp\!\left(-\tfrac{1}{2}\operatorname{tr}\!\left(E^{T}E\,\Sigma_{\epsilon}^{-1}\right)\right).$$

Writing the error $E$ in terms of $Y$, $X$, and $B$ yields

$$\rho(Y\mid X,B,\Sigma_{\epsilon}) \propto |\Sigma_{\epsilon}|^{-n/2}\exp\!\left(-\tfrac{1}{2}\operatorname{tr}\!\left((Y-XB)^{T}(Y-XB)\,\Sigma_{\epsilon}^{-1}\right)\right).$$
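As a quick numerical check (synthetic data, NumPy assumed), the trace in the exponent equals the sum of per-row quadratic forms $\sum_{i}\epsilon_{i}^{T}\Sigma_{\epsilon}^{-1}\epsilon_{i}$, which is exactly the statement that the rows are i.i.d. $N_{m}(0,\Sigma_{\epsilon})$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 4
M = rng.normal(size=(m, m))
Sigma_eps = M @ M.T + m * np.eye(m)          # a positive-definite error covariance
E = rng.multivariate_normal(np.zeros(m), Sigma_eps, size=n)
Sigma_inv = np.linalg.inv(Sigma_eps)

trace_form = np.trace(E.T @ E @ Sigma_inv)   # trace term in the likelihood exponent
row_sum = sum(e @ Sigma_inv @ e for e in E)  # sum of per-row quadratic forms

print(np.isclose(trace_form, row_sum))       # True

# Log-likelihood up to an additive constant
loglik = -0.5 * n * np.log(np.linalg.det(Sigma_eps)) - 0.5 * trace_form
print(loglik)
```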
We seek a natural conjugate prior, that is, a joint density $\rho(B,\Sigma_{\epsilon})$ of the same functional form as the likelihood. Since the likelihood is quadratic in $B$, we re-write the likelihood so that it is normal in $(B-\hat{B})$, the deviation from the classical sample estimate.
Using the same technique as with Bayesian linear regression, we decompose the exponential term using a matrix form of the sum-of-squares technique. Here, however, we will also need to use the Kronecker product and vectorization transformations.
First, let us apply sum-of-squares to obtain a new expression for the likelihood:

$$\rho(Y\mid X,B,\Sigma_{\epsilon}) \propto |\Sigma_{\epsilon}|^{-(n-k)/2}\exp\!\left(-\operatorname{tr}\!\left(\tfrac{1}{2}S^{T}S\,\Sigma_{\epsilon}^{-1}\right)\right)\,|\Sigma_{\epsilon}|^{-k/2}\exp\!\left(-\tfrac{1}{2}\operatorname{tr}\!\left((B-\hat{B})^{T}X^{T}X(B-\hat{B})\,\Sigma_{\epsilon}^{-1}\right)\right),$$

where $S = Y - X\hat{B}$.
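The decomposition can be verified numerically: because $X^{T}S = 0$ for the least squares residual $S$, the cross terms vanish and the two sides agree for any $B$ and any positive-definite $\Sigma_{\epsilon}$ (synthetic values below):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, m = 50, 3, 4
X = rng.normal(size=(n, k))
Y = rng.normal(size=(n, m))
B = rng.normal(size=(k, m))                  # an arbitrary coefficient matrix
M = rng.normal(size=(m, m))
Sigma_inv = np.linalg.inv(M @ M.T + m * np.eye(m))

B_hat = np.linalg.solve(X.T @ X, X.T @ Y)    # least squares estimate
S = Y - X @ B_hat                            # residual matrix, X^T S = 0

lhs = np.trace((Y - X @ B).T @ (Y - X @ B) @ Sigma_inv)
rhs = (np.trace(S.T @ S @ Sigma_inv)
       + np.trace((B - B_hat).T @ X.T @ X @ (B - B_hat) @ Sigma_inv))

print(np.isclose(lhs, rhs))                  # True
```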
We would like to develop a conditional form for the priors:

$$\rho(B,\Sigma_{\epsilon}) = \rho(\Sigma_{\epsilon})\,\rho(B\mid\Sigma_{\epsilon}),$$

where $\rho(\Sigma_{\epsilon})$ is an inverse-Wishart distribution and $\rho(B\mid\Sigma_{\epsilon})$ is some form of Normal distribution in the matrix $B$. This is accomplished using the vectorization transformation, which converts the likelihood from a function of the matrices $B,\hat{B}$ to a function of the vectors $\beta = \operatorname{vec}(B),\ \hat{\beta} = \operatorname{vec}(\hat{B})$:

$$\operatorname{tr}\!\left((B-\hat{B})^{T}X^{T}X(B-\hat{B})\,\Sigma_{\epsilon}^{-1}\right) = \operatorname{vec}(B-\hat{B})^{T}\operatorname{vec}\!\left(X^{T}X(B-\hat{B})\,\Sigma_{\epsilon}^{-1}\right) = (\beta-\hat{\beta})^{T}\left(\Sigma_{\epsilon}^{-1}\otimes X^{T}X\right)(\beta-\hat{\beta}).$$
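The vectorization step can likewise be checked numerically with `np.kron` and column-stacking (`flatten(order="F")` implements $\operatorname{vec}$); the matrices below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, m = 50, 3, 4
X = rng.normal(size=(n, k))
B = rng.normal(size=(k, m))
B_hat = rng.normal(size=(k, m))
M = rng.normal(size=(m, m))
Sigma_inv = np.linalg.inv(M @ M.T + m * np.eye(m))

D = B - B_hat
beta_diff = D.flatten(order="F")             # vec(B - B_hat): stack columns

lhs = np.trace(D.T @ X.T @ X @ D @ Sigma_inv)
rhs = beta_diff @ np.kron(Sigma_inv, X.T @ X) @ beta_diff

print(np.isclose(lhs, rhs))                  # True
```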
With the prior now specified, we can express the posterior distribution as

$$\rho(B,\Sigma_{\epsilon}\mid Y,X) \propto \rho(Y\mid X,B,\Sigma_{\epsilon})\,\rho(B\mid\Sigma_{\epsilon})\,\rho(\Sigma_{\epsilon}).$$
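One way to make this proportionality concrete is a sketch that evaluates an unnormalized log-posterior as log-likelihood plus log-prior. The specific prior forms (matrix-normal on $B$ given $\Sigma_{\epsilon}$ and inverse-Wishart on $\Sigma_{\epsilon}$) and the hyperparameters `B0`, `Lambda0`, `V0`, `nu0` are illustrative assumptions consistent with the conditional structure above, not values given in the text:

```python
import numpy as np

rng = np.random.default_rng(7)
n, k, m = 50, 3, 4
X = rng.normal(size=(n, k))
Y = rng.normal(size=(n, m))

# Hypothetical hyperparameters, for illustration only
B0 = np.zeros((k, m))                        # prior mean for B
Lambda0 = np.eye(k)                          # prior row precision for B
V0 = np.eye(m)                               # inverse-Wishart scale matrix
nu0 = m + 2                                  # inverse-Wishart degrees of freedom

def log_posterior(B, Sigma):
    """Unnormalized log posterior: log-likelihood plus log-prior."""
    Sigma_inv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)
    R = Y - X @ B
    loglik = -0.5 * n * logdet - 0.5 * np.trace(R.T @ R @ Sigma_inv)
    D = B - B0
    logprior_B = -0.5 * k * logdet - 0.5 * np.trace(D.T @ Lambda0 @ D @ Sigma_inv)
    logprior_Sigma = -0.5 * (nu0 + m + 1) * logdet - 0.5 * np.trace(V0 @ Sigma_inv)
    return loglik + logprior_B + logprior_Sigma

print(log_posterior(np.zeros((k, m)), np.eye(m)))
```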
With some re-arrangement, we can re-write the posterior so that the posterior mean $\tilde{B}$ is a weighted average of the least squares estimator $\hat{B}$ and the prior mean $\bar{B}$:

$$\tilde{B} = (X^{T}X + U^{T}U)^{-1}\left(X^{T}X\hat{B} + U^{T}U\bar{B}\right),$$

where $U$ comes from the Cholesky decomposition of the prior precision matrix $A$ (which is a positive-definite matrix by design):

$$A = U^{T}U.$$
This is the key result of the Empirical Bayes approach; it allows us to estimate the coefficients $B$ for our original linear regression problem by combining the least squares estimate $\hat{B}$ from a single set of measurements with the empirical prior estimate $\bar{B}$ obtained from a large collection of similar measurements. (Notice that the weighted average also depends on the empirical estimate of the prior covariance matrix $A^{-1}$.)
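A minimal numerical sketch of this weighted average follows; the prior mean `B_bar`, the prior precision `A`, and the data are all hypothetical stand-ins rather than quantities from the text:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, m = 50, 3, 4
X = rng.normal(size=(n, k))
Y = rng.normal(size=(n, m))

B_hat = np.linalg.solve(X.T @ X, X.T @ Y)    # least squares estimate
B_bar = rng.normal(size=(k, m))              # hypothetical empirical prior mean

M = rng.normal(size=(k, k))
A = M @ M.T + k * np.eye(k)                  # hypothetical prior precision (positive definite)
U = np.linalg.cholesky(A).T                  # Cholesky factor, A = U^T U

# Posterior mean: precision-weighted average of the data estimate and the prior mean
B_tilde = np.linalg.solve(X.T @ X + U.T @ U, X.T @ X @ B_hat + U.T @ U @ B_bar)
print(B_tilde.shape)                         # (k, m)
```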
To justify this, collect the quadratic terms in the exponential and express them as a quadratic form in $\beta - \tilde{\beta}$:

$$(v - W\beta)^{T}(v - W\beta) = ns^{2} + (\beta - \tilde{\beta})^{T}W^{T}W(\beta - \tilde{\beta}),$$

where

$$ns^{2} = (v - W\tilde{\beta})^{T}(v - W\tilde{\beta}),\qquad v = [y,\,U\bar{B}],\qquad W = [X,\,U].$$
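The augmented-regression reading of $v$ and $W$ can be checked numerically. The sketch below stacks the full response matrix column-wise (with a hypothetical `B_bar` and prior precision), runs ordinary least squares on the augmented system, and confirms that it reproduces the precision-weighted posterior mean:

```python
import numpy as np

rng = np.random.default_rng(5)
n, k, m = 50, 3, 4
X = rng.normal(size=(n, k))
Y = rng.normal(size=(n, m))
B_bar = rng.normal(size=(k, m))              # hypothetical prior mean

M = rng.normal(size=(k, k))
A = M @ M.T + k * np.eye(k)                  # hypothetical prior precision
U = np.linalg.cholesky(A).T                  # A = U^T U

# Augmented system: W = [X; U], V = [Y; U B_bar]
W = np.vstack([X, U])
V = np.vstack([Y, U @ B_bar])

B_tilde_aug = np.linalg.solve(W.T @ W, W.T @ V)          # OLS on the augmented data

B_hat = np.linalg.solve(X.T @ X, X.T @ Y)
B_tilde = np.linalg.solve(X.T @ X + A, X.T @ X @ B_hat + A @ B_bar)

print(np.allclose(B_tilde_aug, B_tilde))                 # True

# Per-column analogue of the n s^2 term: residual sum of squares of the augmented fit
resid = V - W @ B_tilde_aug
print(np.diag(resid.T @ resid))
```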
The posterior can now be expressed as a Normal distribution in $\beta$ times an inverse-gamma distribution in the noise variance.

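A minimal sampling sketch, assuming the standard conjugate form $\sigma^{2}\mid v,W \sim \mathrm{Inv\text{-}Gamma}(n/2,\;ns^{2}/2)$ and $\beta\mid\sigma^{2},v,W \sim N(\tilde{\beta},\,\sigma^{2}(W^{T}W)^{-1})$; these hyperparameter choices, and the synthetic $W$ and $v$ below, are illustrative assumptions rather than quantities fixed by the text:

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 50, 3

# Synthetic augmented system standing in for W = [X, U] and v = [y, U B_bar]
W = rng.normal(size=(n + k, k))
v = rng.normal(size=(n + k,))

beta_tilde = np.linalg.solve(W.T @ W, W.T @ v)           # posterior mean
ns2 = (v - W @ beta_tilde) @ (v - W @ beta_tilde)        # residual sum of squares

def sample_posterior(num_draws=1000):
    """Draw (beta, sigma^2) from the assumed Normal-inverse-gamma posterior."""
    WtW_inv = np.linalg.inv(W.T @ W)
    # sigma^2 ~ Inv-Gamma(n/2, ns2/2): draw Gamma(shape=n/2, rate=ns2/2) and invert
    sigma2 = 1.0 / rng.gamma(shape=n / 2.0, scale=2.0 / ns2, size=num_draws)
    betas = np.array([rng.multivariate_normal(beta_tilde, s2 * WtW_inv)
                      for s2 in sigma2])
    return betas, sigma2

betas, sigma2 = sample_posterior()
print(betas.mean(axis=0), sigma2.mean())
```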
A similar analysis can be performed for the general case of multivariate regression, and part of this provides for Bayesian estimation of covariance matrices.