Bayesian interpretation of kernel regularization

Bayesian interpretation of kernel regularization examines how kernel methods in machine learning can be understood through the lens of Bayesian statistics, a framework that uses probability to model uncertainty. Kernel methods are founded on the concept of similarity between inputs within a structured space. While techniques like support vector machines (SVMs) and their regularization (a technique to make a model more generalizable and transferable) were not originally formulated using Bayesian principles, analyzing them from a Bayesian perspective provides valuable insights.

In the Bayesian framework, kernel methods serve as a fundamental component of Gaussian processes, where the kernel function operates as a covariance function that defines relationships between inputs. Traditionally, these methods have been applied to supervised learning problems where inputs are represented as vectors and outputs as scalars. Recent developments have extended kernel methods to handle multiple outputs, as seen in multi-task learning.^[1]

The mathematical framework for kernel methods typically involves reproducing kernel Hilbert spaces (RKHS). Every symmetric positive semidefinite kernel (one whose kernel matrices have no negative eigenvalues) defines such a space; a kernel that is not positive semidefinite does not define an inner product and falls outside the RKHS setting considered here. A mathematical equivalence between regularization approaches and Bayesian methods can be established, particularly in cases where the reproducing kernel Hilbert space is finite-dimensional. This equivalence shows that both perspectives lead to essentially the same estimators, revealing the underlying connection between these seemingly different approaches.

The supervised learning problem

The classical supervised learning problem requires estimating the output for some new input point $\mathbf {x} '$ by learning a scalar-valued estimator ${\hat {f}}(\mathbf {x} ')$ on the basis of a training set $S$ consisting of $n$ input-output pairs, $S=(\mathbf {X} ,\mathbf {Y} )=\{(\mathbf {x} _{1},y_{1}),\ldots ,(\mathbf {x} _{n},y_{n})\}$ .^[2] Given a symmetric, positive-definite bivariate function $k(\cdot ,\cdot )$ called a kernel, one of the most popular estimators in machine learning is given by

{\hat {f}}(\mathbf {x} ')=\mathbf {k} ^{\top }(\mathbf {K} +\lambda n\mathbf {I} )^{-1}\mathbf {Y} ,\qquad (1)

where $\mathbf {K} \equiv k(\mathbf {X} ,\mathbf {X} )$ is the kernel matrix with entries $\mathbf {K} _{ij}=k(\mathbf {x} _{i},\mathbf {x} _{j})$ , $\mathbf {k} =[k(\mathbf {x} _{1},\mathbf {x} '),\ldots ,k(\mathbf {x} _{n},\mathbf {x} ')]^{\top }$ , and $\mathbf {Y} =[y_{1},\ldots ,y_{n}]^{\top }$ . We will see how this estimator can be derived both from a regularization and a Bayesian perspective.
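As an illustration, equation (1) can be evaluated directly with NumPy. The following is a minimal sketch, not a reference implementation: the Gaussian (RBF) kernel, its length scale, the synthetic sine data, and the helper names `rbf_kernel` and `kernel_ridge_predict` are all assumptions made for the example and do not come from the text.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B.

    The choice of kernel is an assumption; any symmetric positive-definite
    kernel could be substituted.
    """
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-0.5 * np.maximum(sq_dists, 0.0) / lengthscale**2)

def kernel_ridge_predict(X, Y, X_new, lam, kernel=rbf_kernel):
    """Evaluate equation (1): k^T (K + lam * n * I)^{-1} Y for each new point."""
    n = X.shape[0]
    K = kernel(X, X)                      # kernel matrix, shape (n, n)
    k = kernel(X, X_new)                  # one column per new point, shape (n, m)
    c = np.linalg.solve(K + lam * n * np.eye(n), Y)
    return k.T @ c

# Synthetic one-dimensional data (hypothetical): noisy samples of sin(x).
rng = np.random.default_rng(0)
X = np.linspace(-3.0, 3.0, 30)[:, None]
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)

X_new = np.array([[0.0], [1.5]])
pred = kernel_ridge_predict(X, Y, X_new, lam=1e-3)
```

Solving the linear system with `np.linalg.solve` avoids forming the matrix inverse explicitly, which is usually the more stable choice.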

A regularization perspective

The main assumption in the regularization perspective is that the set of functions ${\mathcal {F}}$ under consideration belongs to a reproducing kernel Hilbert space ${\mathcal {H}}_{k}$ .^[2]^[3]^[4]^[5]

Reproducing kernel Hilbert space

A reproducing kernel Hilbert space (RKHS) ${\mathcal {H}}_{k}$ is a Hilbert space of functions defined by a symmetric, positive-definite function $k:{\mathcal {X}}\times {\mathcal {X}}\rightarrow \mathbb {R}$ called the reproducing kernel such that the function $k(\mathbf {x} ,\cdot )$ belongs to ${\mathcal {H}}_{k}$ for all $\mathbf {x} \in {\mathcal {X}}$ .^[6]^[7]^[8] There are three main properties that make an RKHS appealing:

1. The reproducing property, after which the RKHS is named,

f(\mathbf {x} )=\langle f,k(\mathbf {x} ,\cdot )\rangle _{k},\quad \forall \ f\in {\mathcal {H}}_{k},

where $\langle \cdot ,\cdot \rangle _{k}$ is the inner product in ${\mathcal {H}}_{k}$ .

2. Functions in an RKHS are in the closure of the linear combination of the kernel at given points,

f(\mathbf {x} )=\sum _{i}k(\mathbf {x} _{i},\mathbf {x} )c_{i}.

This allows the construction in a unified framework of both linear and generalized linear models.

3. The squared norm in an RKHS can be written as

\|f\|_{k}^{2}=\sum _{i,j}k(\mathbf {x} _{i},\mathbf {x} _{j})c_{i}c_{j}

and could be viewed as measuring the complexity of the function.
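Properties 2 and 3 can be checked numerically in the simplest case. The sketch below assumes the linear kernel $k(\mathbf {x} ,\mathbf {x} ')=\mathbf {x} ^{\top }\mathbf {x} '$ (an assumption chosen for illustration), for which a kernel expansion with coefficients $c_{i}$ is the linear function $\langle \mathbf {w} ,\mathbf {x} \rangle$ with $\mathbf {w} =\sum _{i}c_{i}\mathbf {x} _{i}$ , and the squared RKHS norm equals $\|\mathbf {w} \|^{2}$ . The centers and coefficients are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))    # five centers x_i in R^3 (hypothetical data)
c = rng.standard_normal(5)         # expansion coefficients c_i

K = X @ X.T                        # linear-kernel matrix, K_ij = <x_i, x_j>
w = X.T @ c                        # equivalent weight vector w = sum_i c_i x_i

# Property 3: squared norm as a quadratic form in the coefficients.
norm_sq_kernel = c @ K @ c
norm_sq_weights = w @ w

# Property 2: the kernel expansion evaluated at a new point x.
x = rng.standard_normal(3)
f_from_kernel = c @ (X @ x)        # sum_i c_i k(x_i, x)
f_from_weights = w @ x             # <w, x>
```

Both pairs of quantities agree up to floating-point error, which is the finite-dimensional picture used later when connecting regularization with the Bayesian view.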

The regularized functional

The estimator is derived as the minimizer of the regularized functional

{\frac {1}{n}}\sum _{i=1}^{n}(f(\mathbf {x} _{i})-y_{i})^{2}+\lambda \|f\|_{k}^{2},\qquad (2)

where $f\in {\mathcal {H}}_{k}$ and $\|\cdot \|_{k}$ is the norm in ${\mathcal {H}}_{k}$ . The first term in this functional, which measures the average of the squares of the errors between the $f(\mathbf {x} _{i})$ and the $y_{i}$ , is called the empirical risk and represents the cost we pay by predicting $f(\mathbf {x} _{i})$ for the true value $y_{i}$ . The second term in the functional is the squared norm in an RKHS multiplied by a weight $\lambda$ ; it stabilizes the problem^[3]^[5] and introduces a trade-off between goodness of fit and complexity of the estimator.^[2] The weight $\lambda$ , called the regularization parameter, determines the degree to which instability and complexity of the estimator are penalized (the larger $\lambda$ , the higher the penalty).
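The trade-off controlled by $\lambda$ can be made concrete by computing both terms of the functional for several values of $\lambda$ . This is a sketch under stated assumptions: the RBF kernel, the synthetic data, and the grid of $\lambda$ values are illustrative choices, and the coefficients are obtained from the closed form in equation (1).

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Gaussian (RBF) kernel matrix (an assumed kernel choice)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

rng = np.random.default_rng(0)
n = 20
X = np.linspace(-3.0, 3.0, n)[:, None]
Y = np.sin(X[:, 0]) + 0.2 * rng.standard_normal(n)   # hypothetical noisy targets
K = rbf_kernel(X, X)

results = []
for lam in (1e-3, 1e-1, 10.0):
    # Coefficients of the minimizer, as in equation (1).
    c = np.linalg.solve(K + lam * n * np.eye(n), Y)
    fitted = K @ c                                   # f(x_i) at the training points
    empirical_risk = np.mean((fitted - Y) ** 2)      # first term of the functional
    rkhs_norm_sq = c @ K @ c                         # squared RKHS norm of f
    results.append((lam, empirical_risk, rkhs_norm_sq))

for lam, risk, norm_sq in results:
    print(f"lambda={lam:g}  empirical risk={risk:.4f}  squared norm={norm_sq:.4f}")
```

As $\lambda$ grows, the empirical risk increases while the squared norm shrinks: the estimator fits the data less closely but becomes smoother.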

Derivation of the estimator

The explicit form of the estimator in equation (1) is derived in two steps. First, the representer theorem^[9]^[10]^[11] states that the minimizer of the functional (2) can always be written as a linear combination of the kernels centered at the training-set points,

{\hat {f}}(\mathbf {x} ')=\sum _{i=1}^{n}c_{i}k(\mathbf {x} _{i},\mathbf {x} ')=\mathbf {k} ^{\top }\mathbf {c} ,\qquad (3)

for some $\mathbf {c} \in \mathbb {R} ^{n}$ . The explicit form of the coefficients $\mathbf {c} =[c_{1},\ldots ,c_{n}]^{\top }$ can be found by substituting for $f(\cdot )$ in the functional (2). For a function of the form in equation (3), we have that

{\begin{aligned}\|f\|_{k}^{2}&=\langle f,f\rangle _{k},\\&=\left\langle \sum _{i=1}^{n}c_{i}k(\mathbf {x} _{i},\cdot ),\sum _{j=1}^{n}c_{j}k(\mathbf {x} _{j},\cdot )\right\rangle _{k},\\&=\sum _{i=1}^{n}\sum _{j=1}^{n}c_{i}c_{j}\langle k(\mathbf {x} _{i},\cdot ),k(\mathbf {x} _{j},\cdot )\rangle _{k},\\&=\sum _{i=1}^{n}\sum _{j=1}^{n}c_{i}c_{j}k(\mathbf {x} _{i},\mathbf {x} _{j}),\\&=\mathbf {c} ^{\top }\mathbf {K} \mathbf {c} .\end{aligned}}

We can rewrite the functional (2) as

{\frac {1}{n}}\|\mathbf {Y} -\mathbf {K} \mathbf {c} \|^{2}+\lambda \mathbf {c} ^{\top }\mathbf {K} \mathbf {c} .

This functional is convex in $\mathbf {c}$ and therefore we can find its minimum by setting the gradient with respect to $\mathbf {c}$ to zero (a common factor of 2 is dropped),

{\begin{aligned}-{\frac {1}{n}}\mathbf {K} (\mathbf {Y} -\mathbf {K} \mathbf {c} )+\lambda \mathbf {K} \mathbf {c} &=0,\\(\mathbf {K} +\lambda n\mathbf {I} )\mathbf {c} &=\mathbf {Y} ,\\\mathbf {c} &=(\mathbf {K} +\lambda n\mathbf {I} )^{-1}\mathbf {Y} .\end{aligned}}

Substituting this expression for the coefficients in equation (3), we obtain the estimator stated previously in equation (1),

{\hat {f}}(\mathbf {x} ')=\mathbf {k} ^{\top }(\mathbf {K} +\lambda n\mathbf {I} )^{-1}\mathbf {Y} .
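The derivation can be sanity-checked numerically: at the closed-form coefficients the gradient of the finite-dimensional functional should vanish, and perturbing the coefficients should not lower the objective. The sketch below assumes an RBF kernel and random data purely for illustration; `objective` and `gradient` are hypothetical helper names, and the gradient keeps the factor of 2 that the derivation drops.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Gaussian (RBF) kernel matrix (an assumed kernel choice)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def objective(c, K, Y, lam):
    """(1/n) ||Y - K c||^2 + lam * c^T K c."""
    r = Y - K @ c
    return r @ r / len(Y) + lam * c @ K @ c

def gradient(c, K, Y, lam):
    """Gradient of the objective with respect to c (factor of 2 included)."""
    return -2.0 / len(Y) * K @ (Y - K @ c) + 2.0 * lam * K @ c

rng = np.random.default_rng(0)
n, lam = 15, 0.1
X = rng.uniform(-2.0, 2.0, size=(n, 1))      # hypothetical inputs
Y = np.cos(X[:, 0]) + 0.1 * rng.standard_normal(n)
K = rbf_kernel(X, X)

# Closed-form coefficients from the derivation.
c_hat = np.linalg.solve(K + lam * n * np.eye(n), Y)

grad_norm = np.linalg.norm(gradient(c_hat, K, Y, lam))
base = objective(c_hat, K, Y, lam)
# Objective change under random perturbations of the coefficients.
increases = [
    objective(c_hat + 0.1 * rng.standard_normal(n), K, Y, lam) - base
    for _ in range(5)
]
```

Here `grad_norm` is zero up to rounding error and every entry of `increases` is non-negative, as expected for the minimizer of a convex functional.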

A Bayesian perspective

The notion of a kernel plays a crucial role in Bayesian probability as the covariance function of a stochastic process called the Gaussian process.

A review of Bayesian probability

As part of the Bayesian framework, the Gaussian process specifies the prior distribution that describes the prior beliefs about the properties of the function being modeled. These beliefs are updated after taking into account observational data by means of a likelihood function that relates the prior beliefs to the observations. Taken together, the prior and likelihood lead to an updated distribution called the posterior distribution that is customarily used for predicting test cases.

The Gaussian process

A Gaussian process (GP) is a stochastic process in which any finite number of random variables that are sampled follow a joint Normal distribution.^[12] The mean vector and covariance matrix of the Gaussian distribution completely specify the GP. GPs are usually used as a prior distribution for functions, and as such the mean vector and covariance matrix can be viewed as functions, where the covariance function is also called the kernel of the GP. Let a function $f$ follow a Gaussian process with mean function $m$ and kernel function $k$ ,

f\sim {\mathcal {GP}}(m,k).

In terms of the underlying Gaussian distribution, we have that for any finite set $\mathbf {X} =\{\mathbf {x} _{i}\}_{i=1}^{n}$ if we let $f(\mathbf {X} )=[f(\mathbf {x} _{1}),\ldots ,f(\mathbf {x} _{n})]^{\top }$ then

f(\mathbf {X} )\sim {\mathcal {N}}(\mathbf {m} ,\mathbf {K} ),

where $\mathbf {m} =m(\mathbf {X} )=[m(\mathbf {x} _{1}),\ldots ,m(\mathbf {x} _{n})]^{\top }$ is the mean vector and $\mathbf {K} =k(\mathbf {X} ,\mathbf {X} )$ is the covariance matrix of the multivariate Gaussian distribution.
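The finite-dimensional statement $f(\mathbf {X} )\sim {\mathcal {N}}(\mathbf {m} ,\mathbf {K} )$ gives a direct recipe for drawing sample functions from a GP prior on a grid of inputs. The following sketch assumes a zero mean function and an RBF kernel; the small diagonal "jitter" added before the Cholesky factorization is a common numerical safeguard and an assumption of this example, not part of the definition.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Gaussian (RBF) covariance function (an assumed kernel choice)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

rng = np.random.default_rng(0)
n_points, n_samples = 25, 2000

X = np.linspace(-3.0, 3.0, n_points)[:, None]   # finite set of inputs
m = np.zeros(n_points)                           # mean vector m(X), assumed zero
K = rbf_kernel(X, X)                             # covariance matrix k(X, X)

# Cholesky factor of K; jitter keeps the factorization numerically stable.
jitter = 1e-6
L = np.linalg.cholesky(K + jitter * np.eye(n_points))

# Each column is one draw of f(X) ~ N(m, K): m + L z with z ~ N(0, I).
samples = m[:, None] + L @ rng.standard_normal((n_points, n_samples))

# The empirical covariance of the draws should approximate K.
empirical_cov = np.cov(samples)
```

Plotting a few columns of `samples` against `X` shows smooth random functions whose typical wiggliness is set by the length scale of the kernel.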

Derivation of the estimator

In a regression context, the likelihood function is usually assumed to be a Gaussian distribution and the observations to be independent and identically distributed (iid),

p(y|f,\mathbf {x} ,\sigma ^{2})={\mathcal {N}}(f(\mathbf {x} ),\sigma ^{2}).

This assumption corresponds to the observations being corrupted with zero-mean Gaussian noise with variance $\sigma ^{2}$ . The iid assumption makes it possible to factorize the likelihood function over the data points given the set of inputs $\mathbf {X}$ and the variance of the noise $\sigma ^{2}$ , and thus the posterior distribution can be computed analytically. For a test input vector $\mathbf {x} '$ , given the training data $S=\{\mathbf {X} ,\mathbf {Y} \}$ , the posterior distribution is given by

p(f(\mathbf {x} ')|S,\mathbf {x} ',{\boldsymbol {\phi }})={\mathcal {N}}(m(\mathbf {x} '),\sigma ^{2}(\mathbf {x} ')),

where ${\boldsymbol {\phi }}$ denotes the set of parameters which include the variance of the noise $\sigma ^{2}$ and any parameters from the covariance function $k$ and where

{\begin{aligned}m(\mathbf {x} ')&=\mathbf {k} ^{\top }(\mathbf {K} +\sigma ^{2}\mathbf {I} )^{-1}\mathbf {Y} ,\\\sigma ^{2}(\mathbf {x} ')&=k(\mathbf {x} ',\mathbf {x} ')-\mathbf {k} ^{\top }(\mathbf {K} +\sigma ^{2}\mathbf {I} )^{-1}\mathbf {k} .\end{aligned}}
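The posterior mean and variance above translate almost line by line into code. The sketch below assumes a zero-mean prior, an RBF kernel, and synthetic data; `gp_posterior` is a hypothetical helper name. It also compares the posterior mean with the regularized estimator of equation (1) under the choice $\lambda =\sigma ^{2}/n$ .

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """Gaussian (RBF) covariance function (an assumed kernel choice)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def gp_posterior(X, Y, X_new, noise_var, kernel=rbf_kernel):
    """Posterior mean m(x') and variance sigma^2(x') at each row of X_new."""
    K = kernel(X, X)
    k = kernel(X, X_new)                           # shape (n, m)
    A = K + noise_var * np.eye(len(Y))
    mean = k.T @ np.linalg.solve(A, Y)
    # Diagonal of k(x', x') - k^T A^{-1} k, computed column by column.
    prior_var = np.diag(kernel(X_new, X_new))
    var = prior_var - np.sum(k * np.linalg.solve(A, k), axis=0)
    return mean, var

rng = np.random.default_rng(0)
n, noise_var = 30, 0.01
X = np.linspace(-3.0, 3.0, n)[:, None]
Y = np.sin(X[:, 0]) + np.sqrt(noise_var) * rng.standard_normal(n)
X_new = np.array([[0.0], [1.5], [10.0]])           # last point is far from the data

mean, var = gp_posterior(X, Y, X_new, noise_var)

# Regularized estimator of equation (1) with lam = noise_var / n.
lam = noise_var / n
c = np.linalg.solve(rbf_kernel(X, X) + lam * n * np.eye(n), Y)
krr_pred = rbf_kernel(X, X_new).T @ c
```

The arrays `mean` and `krr_pred` coincide, while `var` is small near the training inputs and returns to the prior variance $k(\mathbf {x} ',\mathbf {x} ')$ far from them. The variance has no counterpart in the regularization perspective.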

The connection between regularization and Bayes

A connection between regularization theory and Bayesian theory can only be achieved in the case of finite dimensional RKHS. Under this assumption, regularization theory and Bayesian theory are connected through Gaussian process prediction.^[3]^[12]^[13]

In the finite dimensional case, every RKHS can be described in terms of a feature map $\Phi :{\mathcal {X}}\rightarrow \mathbb {R} ^{p}$ such that^[2]

k(\mathbf {x} ,\mathbf {x} ')=\sum _{i=1}^{p}\Phi ^{i}(\mathbf {x} )\Phi ^{i}(\mathbf {x} ').

Functions in the RKHS with kernel $k$ can then be written as

f_{\mathbf {w} }(\mathbf {x} )=\sum _{i=1}^{p}\mathbf {w} ^{i}\Phi ^{i}(\mathbf {x} )=\langle \mathbf {w} ,\Phi (\mathbf {x} )\rangle ,

and we also have that

\|f_{\mathbf {w} }\|_{k}=\|\mathbf {w} \|.

We can now build a Gaussian process by assuming $\mathbf {w} =[w^{1},\ldots ,w^{p}]^{\top }$ to be distributed according to a multivariate Gaussian distribution with zero mean and identity covariance matrix,

\mathbf {w} \sim {\mathcal {N}}(0,\mathbf {I} )\propto \exp \left(-{\frac {1}{2}}\|\mathbf {w} \|^{2}\right).

If we assume a Gaussian likelihood we have

P(\mathbf {Y} |\mathbf {X} ,f)={\mathcal {N}}(f(\mathbf {X} ),\sigma ^{2}\mathbf {I} )\propto \exp \left(-{\frac {1}{2\sigma ^{2}}}\|f_{\mathbf {w} }(\mathbf {X} )-\mathbf {Y} \|^{2}\right),

where $f_{\mathbf {w} }(\mathbf {X} )=(\langle \mathbf {w} ,\Phi (\mathbf {x} _{1})\rangle ,\ldots ,\langle \mathbf {w} ,\Phi (\mathbf {x} _{n})\rangle )$ . The resulting posterior distribution is then given by

P(f|\mathbf {X} ,\mathbf {Y} )\propto \exp \left(-{\frac {1}{2\sigma ^{2}}}\|f_{\mathbf {w} }(\mathbf {X} )-\mathbf {Y} \|^{2}-{\frac {1}{2}}\|\mathbf {w} \|^{2}\right).

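We can see that a maximum a posteriori (MAP) estimate is equivalent to the minimization problem defining Tikhonov regularization, where in the Bayesian case the regularization parameter is related to the noise variance: comparing the posterior mean $m(\mathbf {x} ')$ with equation (1), the two estimators coincide when $\sigma ^{2}=\lambda n$ .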
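The finite-dimensional equivalence can be verified with an explicit feature map. The sketch below assumes the hypothetical map $\Phi (x)=[1,x,x^{2}]$ for scalar inputs (so $p=3$ ) and synthetic data. It computes the MAP weight vector, which solves $(\Phi ^{\top }\Phi +\sigma ^{2}\mathbf {I} )\mathbf {w} =\Phi ^{\top }\mathbf {Y}$ , and compares its predictions with the kernel formula using $k(x,x')=\langle \Phi (x),\Phi (x')\rangle$ .

```python
import numpy as np

def features(x):
    """Hypothetical feature map Phi(x) = [1, x, x^2] for scalar inputs (p = 3)."""
    x = np.asarray(x, dtype=float)
    return np.stack([np.ones_like(x), x, x**2], axis=-1)

rng = np.random.default_rng(0)
n, noise_var = 12, 0.05
x_train = np.linspace(-1.0, 1.0, n)
Y = 0.5 - x_train + 2.0 * x_train**2 + 0.1 * rng.standard_normal(n)
x_test = np.array([-0.5, 0.25, 0.9])

Phi = features(x_train)            # design matrix, shape (n, p)
Phi_test = features(x_test)        # shape (m, p)
p = Phi.shape[1]

# Weight-space view: the MAP estimate minimizes
# ||Phi w - Y||^2 / noise_var + ||w||^2, a ridge-regression problem.
w_map = np.linalg.solve(Phi.T @ Phi + noise_var * np.eye(p), Phi.T @ Y)
pred_weights = Phi_test @ w_map

# Function-space view: the kernel induced by the feature map.
K = Phi @ Phi.T                    # K_ij = <Phi(x_i), Phi(x_j)>
k = Phi @ Phi_test.T               # shape (n, m)
pred_kernel = k.T @ np.linalg.solve(K + noise_var * np.eye(n), Y)
```

The two prediction vectors agree up to rounding error. The weight-space system is $p\times p$ and the kernel system is $n\times n$ , so in practice one would solve whichever is smaller.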

From a philosophical perspective, the loss function in a regularization setting plays a different role than the likelihood function in the Bayesian setting. Whereas the loss function measures the error that is incurred when predicting $f(\mathbf {x} )$ in place of $y$ , the likelihood function measures how likely the observations are under the model that was assumed to be true in the generative process. From a mathematical perspective, however, the formulations of the regularization and Bayesian frameworks give the loss function and the likelihood function the same role: promoting the inference of functions $f$ that approximate the labels $y$ as closely as possible.

References

[1] Álvarez, Mauricio A.; Rosasco, Lorenzo; Lawrence, Neil D. (June 2011). "Kernels for Vector-Valued Functions: A Review". arXiv:1106.6251 [stat.ML].
[2] Vapnik, Vladimir (1998). Statistical learning theory. Wiley. ISBN 9780471030034.
[3] Wahba, Grace (1990). Spline models for observational data. SIAM. Bibcode:1990smod.conf.....W.
[4] Schölkopf, Bernhard; Smola, Alexander J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press. ISBN 9780262194754.
[5] Girosi, F.; Poggio, T. (1990). "Networks and the best approximation property". Biological Cybernetics. 63 (3). Springer: 169–176. doi:10.1007/bf00195855. hdl:1721.1/6017. S2CID 18824241.
[6] Aronszajn, N (May 1950). "Theory of Reproducing Kernels". Transactions of the American Mathematical Society. 68 (3): 337–404. doi:10.2307/1990404. JSTOR 1990404.
[7] Schwartz, Laurent (1964). "Sous-espaces hilbertiens d'espaces vectoriels topologiques et noyaux associés (noyaux reproduisants)". Journal d'Analyse Mathématique. 13 (1). Springer: 115–256. doi:10.1007/bf02786620. S2CID 117202393.
[8] Cucker, Felipe; Smale, Steve (October 5, 2001). "On the mathematical foundations of learning". Bulletin of the American Mathematical Society. 39 (1): 1–49. doi:10.1090/s0273-0979-01-00923-5.
[9] Kimeldorf, George S.; Wahba, Grace (1970). "A correspondence between Bayesian estimation on stochastic processes and smoothing by splines". The Annals of Mathematical Statistics. 41 (2): 495–502. doi:10.1214/aoms/1177697089.
[10] Schölkopf, Bernhard; Herbrich, Ralf; Smola, Alex J. (2001). "A Generalized Representer Theorem". Computational Learning Theory. Lecture Notes in Computer Science. Vol. 2111/2001. pp. 416–426. doi:10.1007/3-540-44581-1_27. ISBN 978-3-540-42343-0.
[11] De Vito, Ernesto; Rosasco, Lorenzo; Caponnetto, Andrea; Piana, Michele; Verri, Alessandro (October 2004). "Some Properties of Regularized Kernel Methods". Journal of Machine Learning Research. 5: 1363–1390.
[12] Rasmussen, Carl Edward; Williams, Christopher K. I. (2006). Gaussian Processes for Machine Learning. The MIT Press. ISBN 0-262-18253-X.
[13] Huang, Yunfei; et al. (2019). "Traction force microscopy with optimized regularization and automated Bayesian parameter selection for comparing cells". Scientific Reports. 9 (1): 537. arXiv:1810.05848. Bibcode:2019NatSR...9..539H. doi:10.1038/s41598-018-36896-x. PMC 6345967. PMID 30679578.
