In statistics, Bayesian multivariate linear regression is a Bayesian approach to multivariate linear regression, i.e. linear regression where the predicted outcome is a vector of correlated random variables rather than a single scalar random variable. A more general treatment of this approach can be found in the article on the MMSE estimator.
Details
Consider a regression problem where the dependent variable to be predicted is not a single real-valued scalar but an m-length vector of correlated real numbers. As in the standard regression setup, there are n observations, where each observation i consists of k − 1 explanatory variables, grouped into a vector {\displaystyle \mathbf {x} _{i}} of length k (where a dummy variable with a value of 1 has been added to allow for an intercept coefficient). This can be viewed as a set of m related regression problems for each observation i:
{\displaystyle y_{i,1}=\mathbf {x} _{i}^{\rm {T}}{\boldsymbol {\beta }}_{1}+\epsilon _{i,1}}
{\displaystyle \cdots }
{\displaystyle y_{i,m}=\mathbf {x} _{i}^{\rm {T}}{\boldsymbol {\beta }}_{m}+\epsilon _{i,m}}
where the errors {\displaystyle \{\epsilon _{i,1},\ldots ,\epsilon _{i,m}\}} are all correlated. Equivalently, it can be viewed as a single regression problem where the outcome is a row vector {\displaystyle \mathbf {y} _{i}^{\rm {T}}} and the regression coefficient vectors are stacked next to each other, as follows:
{\displaystyle \mathbf {y} _{i}^{\rm {T}}=\mathbf {x} _{i}^{\rm {T}}\mathbf {B} +{\boldsymbol {\epsilon }}_{i}^{\rm {T}}.}
The coefficient matrix B is a {\displaystyle k\times m} matrix where the coefficient vectors {\displaystyle {\boldsymbol {\beta }}_{1},\ldots ,{\boldsymbol {\beta }}_{m}} for each regression problem are stacked horizontally:
{\displaystyle \mathbf {B} ={\begin{bmatrix}{\boldsymbol {\beta }}_{1}&\cdots &{\boldsymbol {\beta }}_{m}\end{bmatrix}}={\begin{bmatrix}\beta _{1,1}&\cdots &\beta _{1,m}\\\vdots &\ddots &\vdots \\\beta _{k,1}&\cdots &\beta _{k,m}\end{bmatrix}}.}
The noise vector {\displaystyle {\boldsymbol {\epsilon }}_{i}} for each observation i is jointly normal, so that the outcomes for a given observation are correlated:
{\displaystyle {\boldsymbol {\epsilon }}_{i}\sim N(0,{\boldsymbol {\Sigma }}_{\epsilon }).}
We can write the entire regression problem in matrix form as:
{\displaystyle \mathbf {Y} =\mathbf {X} \mathbf {B} +\mathbf {E} ,}
where Y and E are {\displaystyle n\times m} matrices. The design matrix X is an {\displaystyle n\times k} matrix with the observations stacked vertically, as in the standard linear regression setup:
{\displaystyle \mathbf {X} ={\begin{bmatrix}\mathbf {x} _{1}^{\rm {T}}\\\mathbf {x} _{2}^{\rm {T}}\\\vdots \\\mathbf {x} _{n}^{\rm {T}}\end{bmatrix}}={\begin{bmatrix}x_{1,1}&\cdots &x_{1,k}\\x_{2,1}&\cdots &x_{2,k}\\\vdots &\ddots &\vdots \\x_{n,1}&\cdots &x_{n,k}\end{bmatrix}}.}
The classical, frequentist linear least squares solution is simply to estimate the matrix of regression coefficients {\displaystyle {\hat {\mathbf {B} }}} using the Moore–Penrose pseudoinverse:
{\displaystyle {\hat {\mathbf {B} }}=(\mathbf {X} ^{\rm {T}}\mathbf {X} )^{-1}\mathbf {X} ^{\rm {T}}\mathbf {Y} .}
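As a concrete, minimal illustration (with hypothetical dimensions and NumPy assumed as the only dependency), the following sketch simulates data from the model above and computes this least-squares estimate:

import numpy as np

rng = np.random.default_rng(0)
n, k, m = 200, 3, 2                          # hypothetical sample size and dimensions

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
B_true = rng.normal(size=(k, m))             # true k x m coefficient matrix
Sigma_eps = np.array([[1.0, 0.6],            # correlated noise across the m outcomes
                      [0.6, 2.0]])
E = rng.multivariate_normal(np.zeros(m), Sigma_eps, size=n)
Y = X @ B_true + E                           # n x m response matrix

# Classical estimate: B_hat = (X^T X)^{-1} X^T Y
B_hat = np.linalg.solve(X.T @ X, X.T @ Y)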
To obtain the Bayesian solution, we need to specify the conditional likelihood and then find the appropriate conjugate prior. As in the univariate case of Bayesian linear regression, we will find that we can specify a natural conditional conjugate prior (which is scale dependent).
Let us write our conditional likelihood as[1]
{\displaystyle \rho (\mathbf {E} |{\boldsymbol {\Sigma }}_{\epsilon })\propto |{\boldsymbol {\Sigma }}_{\epsilon }|^{-n/2}\exp(-{\frac {1}{2}}{\rm {tr}}(\mathbf {E} ^{\rm {T}}\mathbf {E} {\boldsymbol {\Sigma }}_{\epsilon }^{-1})),}
writing the error {\displaystyle \mathbf {E} } in terms of {\displaystyle \mathbf {Y} ,\mathbf {X} ,} and {\displaystyle \mathbf {B} } yields
{\displaystyle \rho (\mathbf {Y} |\mathbf {X} ,\mathbf {B} ,{\boldsymbol {\Sigma }}_{\epsilon })\propto |{\boldsymbol {\Sigma }}_{\epsilon }|^{-n/2}\exp(-{\frac {1}{2}}{\rm {tr}}((\mathbf {Y} -\mathbf {X} \mathbf {B} )^{\rm {T}}(\mathbf {Y} -\mathbf {X} \mathbf {B} ){\boldsymbol {\Sigma }}_{\epsilon }^{-1})).}
We seek a natural conjugate prior, that is, a joint density {\displaystyle \rho (\mathbf {B} ,{\boldsymbol {\Sigma }}_{\epsilon })} of the same functional form as the likelihood. Since the likelihood is quadratic in {\displaystyle \mathbf {B} }, we re-write the likelihood so it is normal in {\displaystyle (\mathbf {B} -{\hat {\mathbf {B} }})} (the deviation from the classical sample estimate).
Using the same technique as with Bayesian linear regression, we decompose the exponential term using a matrix form of the sum-of-squares technique. Here, however, we will also need to use matrix differential calculus (the Kronecker product and vectorization transformations).
First, let us apply the sum-of-squares technique to obtain a new expression for the likelihood:
{\displaystyle \rho (\mathbf {Y} |\mathbf {X} ,\mathbf {B} ,{\boldsymbol {\Sigma }}_{\epsilon })\propto |{\boldsymbol {\Sigma }}_{\epsilon }|^{-(n-k)/2}\exp(-{\frac {1}{2}}{\rm {tr}}(\mathbf {S} ^{\rm {T}}\mathbf {S} {\boldsymbol {\Sigma }}_{\epsilon }^{-1}))|{\boldsymbol {\Sigma }}_{\epsilon }|^{-k/2}\exp(-{\frac {1}{2}}{\rm {tr}}((\mathbf {B} -{\hat {\mathbf {B} }})^{\rm {T}}\mathbf {X} ^{\rm {T}}\mathbf {X} (\mathbf {B} -{\hat {\mathbf {B} }}){\boldsymbol {\Sigma }}_{\epsilon }^{-1})),}
where {\displaystyle \mathbf {S} =\mathbf {Y} -\mathbf {X} {\hat {\mathbf {B} }}}.
We would like to develop a conditional form for the priors:
{\displaystyle \rho (\mathbf {B} ,{\boldsymbol {\Sigma }}_{\epsilon })=\rho ({\boldsymbol {\Sigma }}_{\epsilon })\rho (\mathbf {B} |{\boldsymbol {\Sigma }}_{\epsilon }),}
where {\displaystyle \rho ({\boldsymbol {\Sigma }}_{\epsilon })} is an inverse-Wishart distribution and {\displaystyle \rho (\mathbf {B} |{\boldsymbol {\Sigma }}_{\epsilon })} is some form of normal distribution in the matrix {\displaystyle \mathbf {B} }. This is accomplished using the vectorization transformation, which converts the likelihood from a function of the matrices {\displaystyle \mathbf {B} ,{\hat {\mathbf {B} }}} to a function of the vectors {\displaystyle {\boldsymbol {\beta }}={\rm {vec}}(\mathbf {B} ),{\hat {\boldsymbol {\beta }}}={\rm {vec}}({\hat {\mathbf {B} }})}.
Write
{\displaystyle {\rm {tr}}((\mathbf {B} -{\hat {\mathbf {B} }})^{\rm {T}}\mathbf {X} ^{\rm {T}}\mathbf {X} (\mathbf {B} -{\hat {\mathbf {B} }}){\boldsymbol {\Sigma }}_{\epsilon }^{-1})={\rm {vec}}(\mathbf {B} -{\hat {\mathbf {B} }})^{\rm {T}}{\rm {vec}}(\mathbf {X} ^{\rm {T}}\mathbf {X} (\mathbf {B} -{\hat {\mathbf {B} }}){\boldsymbol {\Sigma }}_{\epsilon }^{-1}).}
Let
{\displaystyle {\rm {vec}}(\mathbf {X} ^{\rm {T}}\mathbf {X} (\mathbf {B} -{\hat {\mathbf {B} }}){\boldsymbol {\Sigma }}_{\epsilon }^{-1})=({\boldsymbol {\Sigma }}_{\epsilon }^{-1}\otimes \mathbf {X} ^{\rm {T}}\mathbf {X} ){\rm {vec}}(\mathbf {B} -{\hat {\mathbf {B} }}),}
where {\displaystyle \mathbf {A} \otimes \mathbf {B} } denotes the Kronecker product of matrices A and B, a generalization of the outer product which multiplies an {\displaystyle m\times n} matrix by a {\displaystyle p\times q} matrix to generate an {\displaystyle mp\times nq} matrix, consisting of every combination of products of elements from the two matrices.
Then
{\displaystyle {\rm {vec}}(\mathbf {B} -{\hat {\mathbf {B} }})^{\rm {T}}({\boldsymbol {\Sigma }}_{\epsilon }^{-1}\otimes \mathbf {X} ^{\rm {T}}\mathbf {X} ){\rm {vec}}(\mathbf {B} -{\hat {\mathbf {B} }})=({\boldsymbol {\beta }}-{\hat {\boldsymbol {\beta }}})^{\rm {T}}({\boldsymbol {\Sigma }}_{\epsilon }^{-1}\otimes \mathbf {X} ^{\rm {T}}\mathbf {X} )({\boldsymbol {\beta }}-{\hat {\boldsymbol {\beta }}}),}
which will lead to a likelihood which is normal in {\displaystyle ({\boldsymbol {\beta }}-{\hat {\boldsymbol {\beta }}})}.
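The trace/vectorization identity above is easy to check numerically. The following sketch (NumPy, with arbitrary made-up matrices standing in for the quantities in the derivation) verifies it:

import numpy as np

rng = np.random.default_rng(1)
k, m = 3, 2
D = rng.normal(size=(k, m))                  # stands in for B - B_hat
XtX = rng.normal(size=(k, k))
XtX = XtX @ XtX.T                            # plays the role of X^T X (positive semi-definite)
Sig = rng.normal(size=(m, m))
Sig = Sig @ Sig.T + np.eye(m)                # plays the role of Sigma_epsilon (positive definite)
Sig_inv = np.linalg.inv(Sig)

# Left side: tr((B - B_hat)^T X^T X (B - B_hat) Sigma^{-1})
lhs = np.trace(D.T @ XtX @ D @ Sig_inv)

# Right side: vec(B - B_hat)^T (Sigma^{-1} kron X^T X) vec(B - B_hat)
vecD = D.flatten(order="F")                  # column-stacking vectorization
rhs = vecD @ np.kron(Sig_inv, XtX) @ vecD

assert np.isclose(lhs, rhs)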
With the likelihood in a more tractable form, we can now find a natural (conditional) conjugate prior.
Conjugate prior distribution
The natural conjugate prior using the vectorized variable {\displaystyle {\boldsymbol {\beta }}} is of the form:[1]
{\displaystyle \rho ({\boldsymbol {\beta }},{\boldsymbol {\Sigma }}_{\epsilon })=\rho ({\boldsymbol {\Sigma }}_{\epsilon })\rho ({\boldsymbol {\beta }}|{\boldsymbol {\Sigma }}_{\epsilon }),}
where {\displaystyle \rho ({\boldsymbol {\Sigma }}_{\epsilon })\sim {\mathcal {W}}^{-1}(\mathbf {V_{0}} ,{\boldsymbol {\nu }}_{0})} and
{\displaystyle \rho ({\boldsymbol {\beta }}|{\boldsymbol {\Sigma }}_{\epsilon })\sim N({\boldsymbol {\beta }}_{0},{\boldsymbol {\Sigma }}_{\epsilon }\otimes {\boldsymbol {\Lambda }}_{0}^{-1}),}
where {\displaystyle {\boldsymbol {\Lambda }}_{0}} is a prior precision matrix (this makes the prior consistent with the posterior matrix normal distribution given below).
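A draw from this prior can be sketched with SciPy's inverse-Wishart and matrix-normal distributions; the hyperparameter values below are arbitrary placeholders, not recommendations:

import numpy as np
from scipy.stats import invwishart, matrix_normal

k, m = 3, 2                                  # hypothetical dimensions, as in the earlier sketch
nu0 = m + 2                                  # prior degrees of freedom
V0 = np.eye(m)                               # prior scale matrix for Sigma_epsilon
B0 = np.zeros((k, m))                        # prior mean of B (so beta_0 = vec(B0))
Lambda0 = 0.1 * np.eye(k)                    # prior precision matrix

# Sigma_eps ~ inverse-Wishart(V0, nu0); then B | Sigma_eps ~ MN(B0, Lambda0^{-1}, Sigma_eps)
Sigma_eps = invwishart.rvs(df=nu0, scale=V0, random_state=0)
B_prior_draw = matrix_normal.rvs(mean=B0, rowcov=np.linalg.inv(Lambda0),
                                 colcov=Sigma_eps, random_state=0)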
Posterior distribution
Using the above prior and likelihood, the posterior distribution can be expressed as:[1]
{\displaystyle \rho ({\boldsymbol {\beta }},{\boldsymbol {\Sigma }}_{\epsilon }|\mathbf {Y} ,\mathbf {X} )\propto |{\boldsymbol {\Sigma }}_{\epsilon }|^{-({\boldsymbol {\nu }}_{0}+m+1)/2}\exp {(-{\frac {1}{2}}{\rm {tr}}(\mathbf {V_{0}} {\boldsymbol {\Sigma }}_{\epsilon }^{-1}))}}
{\displaystyle \times |{\boldsymbol {\Sigma }}_{\epsilon }|^{-k/2}\exp {(-{\frac {1}{2}}{\rm {tr}}((\mathbf {B} -\mathbf {B_{0}} )^{\rm {T}}{\boldsymbol {\Lambda }}_{0}(\mathbf {B} -\mathbf {B_{0}} ){\boldsymbol {\Sigma }}_{\epsilon }^{-1}))}}
{\displaystyle \times |{\boldsymbol {\Sigma }}_{\epsilon }|^{-n/2}\exp {(-{\frac {1}{2}}{\rm {tr}}((\mathbf {Y} -\mathbf {XB} )^{\rm {T}}(\mathbf {Y} -\mathbf {XB} ){\boldsymbol {\Sigma }}_{\epsilon }^{-1}))},}
where {\displaystyle {\rm {vec}}(\mathbf {B_{0}} )={\boldsymbol {\beta }}_{0}}.
The terms involving {\displaystyle \mathbf {B} } can be grouped (with {\displaystyle {\boldsymbol {\Lambda }}_{0}=\mathbf {U} ^{\rm {T}}\mathbf {U} }, for example via a Cholesky factorization of {\displaystyle {\boldsymbol {\Lambda }}_{0}}) using:
{\displaystyle (\mathbf {B} -\mathbf {B_{0}} )^{\rm {T}}{\boldsymbol {\Lambda }}_{0}(\mathbf {B} -\mathbf {B_{0}} )+(\mathbf {Y} -\mathbf {XB} )^{\rm {T}}(\mathbf {Y} -\mathbf {XB} )}
{\displaystyle =\left({\begin{bmatrix}\mathbf {Y} \\\mathbf {UB_{0}} \end{bmatrix}}-{\begin{bmatrix}\mathbf {X} \\\mathbf {U} \end{bmatrix}}\mathbf {B} \right)^{\rm {T}}\left({\begin{bmatrix}\mathbf {Y} \\\mathbf {UB_{0}} \end{bmatrix}}-{\begin{bmatrix}\mathbf {X} \\\mathbf {U} \end{bmatrix}}\mathbf {B} \right)}
{\displaystyle =\left({\begin{bmatrix}\mathbf {Y} \\\mathbf {UB_{0}} \end{bmatrix}}-{\begin{bmatrix}\mathbf {X} \\\mathbf {U} \end{bmatrix}}\mathbf {B_{n}} \right)^{\rm {T}}\left({\begin{bmatrix}\mathbf {Y} \\\mathbf {UB_{0}} \end{bmatrix}}-{\begin{bmatrix}\mathbf {X} \\\mathbf {U} \end{bmatrix}}\mathbf {B_{n}} \right)+(\mathbf {B} -\mathbf {B_{n}} )^{\rm {T}}(\mathbf {X} ^{\rm {T}}\mathbf {X} +{\boldsymbol {\Lambda }}_{0})(\mathbf {B} -\mathbf {B_{n}} )}
{\displaystyle =(\mathbf {Y} -\mathbf {XB_{n}} )^{\rm {T}}(\mathbf {Y} -\mathbf {XB_{n}} )+(\mathbf {B_{0}} -\mathbf {B_{n}} )^{\rm {T}}{\boldsymbol {\Lambda }}_{0}(\mathbf {B_{0}} -\mathbf {B_{n}} )+(\mathbf {B} -\mathbf {B_{n}} )^{\rm {T}}(\mathbf {X} ^{\rm {T}}\mathbf {X} +{\boldsymbol {\Lambda }}_{0})(\mathbf {B} -\mathbf {B_{n}} ),}
with
{\displaystyle \mathbf {B_{n}} =(\mathbf {X} ^{\rm {T}}\mathbf {X} +{\boldsymbol {\Lambda }}_{0})^{-1}(\mathbf {X} ^{\rm {T}}\mathbf {X} {\hat {\mathbf {B} }}+{\boldsymbol {\Lambda }}_{0}\mathbf {B_{0}} )=(\mathbf {X} ^{\rm {T}}\mathbf {X} +{\boldsymbol {\Lambda }}_{0})^{-1}(\mathbf {X} ^{\rm {T}}\mathbf {Y} +{\boldsymbol {\Lambda }}_{0}\mathbf {B_{0}} ).}
This now allows us to write the posterior in a more useful form:
{\displaystyle \rho ({\boldsymbol {\beta }},{\boldsymbol {\Sigma }}_{\epsilon }|\mathbf {Y} ,\mathbf {X} )\propto |{\boldsymbol {\Sigma }}_{\epsilon }|^{-({\boldsymbol {\nu }}_{0}+m+n+1)/2}\exp {(-{\frac {1}{2}}{\rm {tr}}((\mathbf {V_{0}} +(\mathbf {Y} -\mathbf {XB_{n}} )^{\rm {T}}(\mathbf {Y} -\mathbf {XB_{n}} )+(\mathbf {B_{n}} -\mathbf {B_{0}} )^{\rm {T}}{\boldsymbol {\Lambda }}_{0}(\mathbf {B_{n}} -\mathbf {B_{0}} )){\boldsymbol {\Sigma }}_{\epsilon }^{-1}))}}
{\displaystyle \times |{\boldsymbol {\Sigma }}_{\epsilon }|^{-k/2}\exp {(-{\frac {1}{2}}{\rm {tr}}((\mathbf {B} -\mathbf {B_{n}} )^{\rm {T}}(\mathbf {X} ^{\rm {T}}\mathbf {X} +{\boldsymbol {\Lambda }}_{0})(\mathbf {B} -\mathbf {B_{n}} ){\boldsymbol {\Sigma }}_{\epsilon }^{-1}))}.}
This takes the form of an inverse-Wishart distribution times a matrix normal distribution:
{\displaystyle \rho ({\boldsymbol {\Sigma }}_{\epsilon }|\mathbf {Y} ,\mathbf {X} )\sim {\mathcal {W}}^{-1}(\mathbf {V_{n}} ,{\boldsymbol {\nu }}_{n})}
and
{\displaystyle \rho ({\boldsymbol {\beta }}|\mathbf {Y} ,\mathbf {X} ,{\boldsymbol {\Sigma }}_{\epsilon })\sim {\mathcal {MN}}_{k,m}(\mathbf {B_{n}} ,{\boldsymbol {\Lambda }}_{n}^{-1},{\boldsymbol {\Sigma }}_{\epsilon }).}
The parameters of this posterior are given by:
{\displaystyle \mathbf {V_{n}} =\mathbf {V_{0}} +(\mathbf {Y} -\mathbf {XB_{n}} )^{\rm {T}}(\mathbf {Y} -\mathbf {XB_{n}} )+(\mathbf {B_{n}} -\mathbf {B_{0}} )^{\rm {T}}{\boldsymbol {\Lambda }}_{0}(\mathbf {B_{n}} -\mathbf {B_{0}} )}
{\displaystyle {\boldsymbol {\nu }}_{n}={\boldsymbol {\nu }}_{0}+n}
{\displaystyle \mathbf {B_{n}} =(\mathbf {X} ^{\rm {T}}\mathbf {X} +{\boldsymbol {\Lambda }}_{0})^{-1}(\mathbf {X} ^{\rm {T}}\mathbf {Y} +{\boldsymbol {\Lambda }}_{0}\mathbf {B_{0}} )}
{\displaystyle {\boldsymbol {\Lambda }}_{n}=\mathbf {X} ^{\rm {T}}\mathbf {X} +{\boldsymbol {\Lambda }}_{0}}
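A minimal sketch of these hyperparameter updates, continuing the hypothetical X, Y, n and prior settings from the earlier snippets (NumPy/SciPy assumed), follows. Because the posterior factorizes exactly, a joint posterior draw is an inverse-Wishart draw for the error covariance followed by a matrix-normal draw for B:

import numpy as np
from scipy.stats import invwishart, matrix_normal

# Posterior hyperparameters (X, Y, n from the data sketch; B0, Lambda0, V0, nu0 from the prior sketch)
Lambda_n = X.T @ X + Lambda0
B_n = np.linalg.solve(Lambda_n, X.T @ Y + Lambda0 @ B0)
resid = Y - X @ B_n
V_n = V0 + resid.T @ resid + (B_n - B0).T @ Lambda0 @ (B_n - B0)
nu_n = nu0 + n

# Exact posterior draws: Sigma_eps | Y, X ~ IW(V_n, nu_n), then
# B | Sigma_eps, Y, X ~ MN(B_n, Lambda_n^{-1}, Sigma_eps)
Sigma_post = invwishart.rvs(df=nu_n, scale=V_n, random_state=1)
B_post = matrix_normal.rvs(mean=B_n, rowcov=np.linalg.inv(Lambda_n),
                           colcov=Sigma_post, random_state=1)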
References
1. Peter E. Rossi, Greg M. Allenby, Rob McCulloch. Bayesian Statistics and Marketing. John Wiley & Sons, 2012, p. 32.