Gaussian process approximations

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Marcin.jurek (talk | contribs) at 17:59, 19 June 2020. The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

In statistics and machine learning, Gaussian process approximation is a computational method that accelerates inference tasks in the context of a Gaussian process model, most commonly likelihood evaluation and prediction. Like approximations of other models, Gaussian process approximations can often be expressed as additional assumptions imposed on the model which do not correspond to any actual feature, but which retain its key properties while simplifying calculations. Many of these approximation methods can be expressed in purely linear algebraic or functional analytic terms as matrix or function approximations. Others are purely algorithmic and cannot easily be rephrased as a modification of a statistical model.

Basic ideas

In statistical modeling, it is often convenient to assume that $X$, the phenomenon under investigation, is a Gaussian process indexed by $i \in I$ which has mean function $\mu$ and covariance function $K$. One can also assume that the data $y = (y_1, \ldots, y_n)$ are values of a particular realization of this process for indices $i_1, \ldots, i_n \in I$.

Consequently, the joint distribution of the data can be expressed as

$$ y \sim \mathcal{N}(\mu_y, K_y), $$

where $K_y = \big(K(i_j, i_k)\big)_{j,k=1}^{n}$ and $\mu_y = \big(\mu(i_j)\big)_{j=1}^{n}$, i.e. respectively a matrix with the covariance function values and a vector with the mean function values at the corresponding (pairs of) indices. The negative log-likelihood of the data then takes the form

$$ -\log L(y) = \tfrac{1}{2} \log\det(2\pi K_y) + \tfrac{1}{2} (y - \mu_y)^\top K_y^{-1} (y - \mu_y). $$

Similarly, the best predictor of $y^*$, the values of $X$ for prediction indices $i_1^*, \ldots, i_{n^*}^*$, given the data $y$, has the form

$$ \hat{y}^* = \mu_{y^*} + K_{y^* y} K_y^{-1} (y - \mu_y), $$

where $K_{y^* y}$ denotes the matrix of covariances between the prediction and observation indices.
In the context of Gaussian models, especially in geostatistics, prediction using the best predictor, i.e. the mean conditional on the data, is also known as kriging.

The most computationally expensive component of the best predictor formula is inverting the covariance matrix $K_y$, which has cubic complexity $\mathcal{O}(n^3)$. Similarly, evaluating the likelihood involves calculating both $K_y^{-1}$ and the determinant $\det K_y$, which has the same cubic complexity.

Gaussian process approximations can often be expressed in terms of assumptions on $X$ under which $K_y^{-1}$ and $\det K_y$ can be calculated with much lower complexity. Since these assumptions are generally not believed to reflect reality, the likelihood and the best predictor obtained in this way are not exact, but they are meant to be close to their original values.

Model-based methods

This class of approximations is expressed through a set of assumptions which are imposed on the original process and which, typically, imply some special structure of the covariance matrix. Although most of these methods were developed independently, they can typically be expressed as special cases of the sparse general Vecchia approximation.

Low-rank methods

While this approach encompasses many methods, the common assumption underlying them all is that $X$, the Gaussian process of interest, is effectively low-rank. More precisely, it is assumed that there exists a set of indices $I_0$ with $|I_0| = k$ such that, for every other set of indices $I$,

$$ X_I \mid X_{I_0} \sim \mathcal{N}\big(\mu_I + L K_{I_0}^{-1} (X_{I_0} - \mu_{I_0}),\ D\big), $$

where $L = K_{I, I_0}$ is an $|I| \times k$ matrix of covariances and $D$ is a diagonal matrix. For the data this implies the covariance structure $K_y \approx L K_{I_0}^{-1} L^\top + D$. Depending on the method and application, various ways of selecting $I_0$ have been proposed.

Typically, $k$ is selected to be much smaller than $n$, which means that the computational cost of inverting $K_y$ is manageable ($\mathcal{O}(nk^2 + k^3)$ instead of $\mathcal{O}(n^3)$).
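Under a "low-rank plus diagonal" structure the Woodbury matrix identity reduces the dominant cost of the solve from $\mathcal{O}(n^3)$ to $\mathcal{O}(nk^2)$. A minimal sketch (the knot locations, covariance function, and diagonal correction are illustrative choices, not prescribed by any particular method):

```python
import numpy as np

# Illustrative low-rank ("knot") approximation with a squared-exponential
# covariance; all specific choices here are hypothetical.
def cov(s, t, ell=0.1):
    return np.exp(-((s[:, None] - t[None, :]) ** 2) / (2 * ell ** 2))

rng = np.random.default_rng(1)
n, k = 500, 20
x = np.sort(rng.uniform(0.0, 1.0, n))
u = np.linspace(0.0, 1.0, k)          # knot indices I_0

K_uu = cov(u, u) + 1e-8 * np.eye(k)   # k-by-k covariance at the knots
K_xu = cov(x, u)                      # the n-by-k matrix L from the text
P = K_xu @ np.linalg.solve(K_uu, K_xu.T)   # low-rank part L K_uu^{-1} L^T
D = 1.0 - np.diag(P) + 1e-4           # diagonal correction plus a small nugget

# Woodbury identity: (diag(D) + L K_uu^{-1} L^T)^{-1} y in O(n k^2).
y = rng.standard_normal(n)
Dinv_y = y / D
Dinv_Kxu = K_xu / D[:, None]
M = K_uu + K_xu.T @ Dinv_Kxu          # k-by-k "capacitance" matrix
fast = Dinv_y - Dinv_Kxu @ np.linalg.solve(M, K_xu.T @ Dinv_y)

slow = np.linalg.solve(P + np.diag(D), y)   # same system, dense O(n^3) solve
```

Only $k \times k$ systems are factorized, so the per-iteration cost is linear in $n$; a matrix determinant lemma gives the determinant term of the likelihood at the same cost.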

Other sparse methods

Hierarchical methods

The general principle of hierarchical approximations consists of a repeated application of some other method, such that each consecutive application refines the quality of the approximation. Even though they can be expressed as a set of statistical assumptions, they are often described in terms of a hierarchical matrix approximation (HODLR) or a basis function expansion (LatticeKrig, MRA, wavelets). The hierarchical matrix approach can often be represented as a repeated application of a low-rank approximation to successively smaller subsets of the index set $I$. The basis function expansion relies on using functions with compact support. These features can then be exploited by an algorithm which steps through consecutive layers of the approximation. In the most favourable settings some of these methods can achieve quasi-linear ($\mathcal{O}(n \log n)$) complexity.
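The hierarchical matrix idea can be illustrated on a small example. The following is an HODLR-style sketch, not an implementation of any specific published method: off-diagonal blocks are compressed by a truncated SVD and diagonal blocks are split recursively. (With an exponential covariance in one dimension the off-diagonal blocks happen to be exactly rank one, so the truncation is lossless here; a real implementation would also build the low-rank factors without ever forming the full blocks.)

```python
import numpy as np

# HODLR-style matrix-vector product: off-diagonal blocks compressed with
# a truncated SVD, diagonal blocks split recursively until small enough
# to handle densely.  Illustrative only: forming the blocks and their SVDs
# as done here does not yet save any work.
def hodlr_matvec(K, v, rank=4, leaf=32):
    n = K.shape[0]
    if n <= leaf:
        return K @ v
    m = n // 2
    def lowrank_apply(M, w):     # apply a rank-truncated off-diagonal block
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        return U[:, :rank] @ (s[:rank] * (Vt[:rank] @ w))
    top = hodlr_matvec(K[:m, :m], v[:m], rank, leaf) + lowrank_apply(K[:m, m:], v[m:])
    bot = lowrank_apply(K[m:, :m], v[:m]) + hodlr_matvec(K[m:, m:], v[m:], rank, leaf)
    return np.concatenate([top, bot])

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0.0, 1.0, 256))
K = np.exp(-np.abs(x[:, None] - x[None, :]) / 0.2)   # exponential covariance
v = rng.standard_normal(256)
approx = hodlr_matvec(K, v)
```

The same recursive block structure supports fast factorization and determinant evaluation, which is what makes likelihood computation quasi-linear in favourable settings.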

Unified framework

Probabilistic graphical models provide a convenient framework for comparing model-based approximations. In this context, the value of the process at index $i$ can be represented by a vertex in a directed graph, and edges correspond to the terms in the factorization of the joint density of $X$. In general, when no independence relations are assumed, the joint probability distribution can be represented by an arbitrary ordering of the vertices with edges from every vertex to each of its predecessors. Using a particular approximation can then be expressed as a certain way of ordering the vertices and adding or removing specific edges.

Vecchia approximation

An early attempt at using this framework for approximating Gaussian processes is called the Vecchia approximation. For a given ordering of the vertices, it is assumed that each vertex is independent of all other vertices given a small number $m$ of preceding vertices. For example, for $m = 2$ this implies that the joint density can be written as

$$ p(x_1, \ldots, x_n) = p(x_1)\, p(x_2 \mid x_1) \prod_{i=3}^{n} p(x_i \mid x_{i-1}, x_{i-2}), $$

while the graphical representation will look as follows.

Graphical representation of the Vecchia approximation. It shows that in the factorization of the joint density the orange vertex is dependent on the light orange vertices preceding it.
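A minimal sketch of the resulting approximate likelihood, assuming a zero-mean process in one dimension with an exponential covariance and choosing each conditioning set as the nearest preceding indices (all specific choices here are illustrative):

```python
import numpy as np

# Vecchia-style approximate negative log-likelihood: each conditional
# p(y_i | y_1, ..., y_{i-1}) is replaced by conditioning on (at most)
# the m nearest preceding indices.  Zero mean, exponential covariance.
def cov(s, t, ell=0.2):
    return np.exp(-np.abs(s[:, None] - t[None, :]) / ell)

def vecchia_nll(x, y, m):
    nll = 0.0
    for i in range(len(x)):
        prev = np.argsort(np.abs(x[:i] - x[i]))[:m]   # nearest preceding indices
        mu, c = 0.0, cov(x[[i]], x[[i]])[0, 0] + 1e-8
        if len(prev) > 0:
            Kpp = cov(x[prev], x[prev]) + 1e-8 * np.eye(len(prev))
            Kip = cov(x[[i]], x[prev])
            w = np.linalg.solve(Kpp, Kip.T)
            mu = w[:, 0] @ y[prev]                    # conditional mean
            c -= (Kip @ w)[0, 0]                      # conditional variance
        nll += 0.5 * (np.log(2 * np.pi * c) + (y[i] - mu) ** 2 / c)
    return nll

rng = np.random.default_rng(3)
n = 100
x = np.sort(rng.uniform(0.0, 1.0, n))
K = cov(x, x) + 1e-8 * np.eye(n)
y = np.linalg.cholesky(K) @ rng.standard_normal(n)

exact_nll = 0.5 * (n * np.log(2 * np.pi)
                   + 2 * np.sum(np.log(np.diag(np.linalg.cholesky(K))))
                   + y @ np.linalg.solve(K, y))
approx_nll = vecchia_nll(x, y, m=5)
```

With $m = n$ the full conditioning sets are recovered and the approximation coincides with the exact likelihood; each term involves only an $m \times m$ system, so the total cost is $\mathcal{O}(n m^3)$ rather than $\mathcal{O}(n^3)$.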

Methods without a statistical model

References

  • arXiv:1807.01065 [stat.ML].
  • Heaton, Matthew J.; Datta, Abhirup; Finley, Andrew O.; Furrer, Reinhard; Guinness, Joseph; Guhaniyogi, Rajarshi; Gerber, Florian; Gramacy, Robert B.; Hammerling, Dorit; Katzfuss, Matthias; Lindgren, Finn; Nychka, Douglas W.; Sun, Furong; Zammit-Mangion, Andrew (2018). "A Case Study Competition Among Methods for Analyzing Large Spatial Data". Journal of Agricultural, Biological and Environmental Statistics. 24 (3): 398–425. doi:10.1007/s13253-018-00348-w. ISSN 1085-7117.