Random feature
Random features (RF) are a technique used in machine learning to approximate kernel methods, introduced by Ali Rahimi and Benjamin Recht in their 2007 paper "Random Features for Large-Scale Kernel Machines".[1]
RF constructs a Monte Carlo approximation to kernel functions using randomly sampled feature maps. It is used for datasets that are too large for traditional kernel methods such as support vector machines, kernel ridge regression, and Gaussian processes.
Mathematics
Given a feature map $\phi : \mathcal{X} \to V$, where $V$ is a Hilbert space, the kernel trick replaces inner products in the high-dimensional feature space by a kernel function
$$k(x, y) = \langle \phi(x), \phi(y) \rangle_V,$$
which is positive-definite.
Kernel methods perform linear operations in the high-dimensional feature space by manipulating the kernel matrix instead:
$$K_{ij} = k(x_i, x_j), \qquad i, j \in \{1, \dots, n\},$$
where $n$ is the number of data points.
The problem with kernel methods is that the kernel matrix $K$ has size $n \times n$. This becomes computationally infeasible when $n$ reaches the order of a million.
The random kernel method replaces the kernel function $k$ by an inner product in a low-dimensional feature space $\mathbb{R}^D$:
$$k(x, y) \approx \langle z(x), z(y) \rangle,$$
where $z : \mathcal{X} \to \mathbb{R}^D$ is a randomly sampled feature map.
This converts kernel linear regression into linear regression in feature space, kernel SVM into feature SVM, etc. These methods no longer involve matrices of size $n \times n$, but only random feature matrices of size $n \times D$.
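For instance, kernel ridge regression becomes ordinary ridge regression on the random features. A minimal NumPy sketch, using the random Fourier feature map for the RBF kernel described in the next section; the data, the dimensions $n$, $D$, the bandwidth $\sigma$, and the regularizer are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n one-dimensional points with a noisy sinusoidal target.
n, d, D = 500, 1, 100
X = rng.uniform(-3, 3, size=(n, d))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=n)

# Random Fourier feature map z : R^d -> R^{2D} for the RBF kernel;
# the columns of W are the frequencies omega_i ~ N(0, sigma^{-2} I).
sigma = 1.0
W = rng.normal(scale=1.0 / sigma, size=(d, D))
Z = np.concatenate([np.cos(X @ W), np.sin(X @ W)], axis=1) / np.sqrt(D)

# Ordinary ridge regression on the n x 2D feature matrix Z, instead of
# solving the n x n kernel system of kernel ridge regression.
lam = 1e-3
w = np.linalg.solve(Z.T @ Z + lam * np.eye(2 * D), Z.T @ y)
y_hat = Z @ w
```

Training now costs $O(nD^2)$ instead of $O(n^3)$, and prediction only needs the $2D$-dimensional weight vector $w$.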
Examples
Radial basis function kernel
The radial basis function (RBF) kernel on $\mathbb{R}^d$ is defined as
$$k(x, y) = \exp\left(-\frac{\|x - y\|^2}{2\sigma^2}\right).$$
It can be approximated by a random Fourier feature map $z : \mathbb{R}^d \to \mathbb{R}^{2D}$:
$$z(x) = \frac{1}{\sqrt{D}} \left[\cos\langle \omega_1, x\rangle, \sin\langle \omega_1, x\rangle, \ldots, \cos\langle \omega_D, x\rangle, \sin\langle \omega_D, x\rangle\right]^T,$$
where $\omega_1, \ldots, \omega_D$ are independent samples from the normal distribution $N(0, \sigma^{-2} I)$.
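A short NumPy check of this approximation; the dimension $d$, the number of features $D$, and $\sigma$ are arbitrary values chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, sigma = 5, 5000, 1.0
x = rng.normal(size=d)
y = rng.normal(size=d)

# Exact RBF kernel value.
k_exact = np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma**2))

# Random Fourier features: the rows of W are omega_i ~ N(0, sigma^{-2} I).
W = rng.normal(scale=1.0 / sigma, size=(D, d))
z = lambda v: np.concatenate([np.cos(W @ v), np.sin(W @ v)]) / np.sqrt(D)

# Monte Carlo estimate <z(x), z(y)> of k(x, y).
k_approx = z(x) @ z(y)
```

Note that $\langle z(x), z(y)\rangle = \frac{1}{D}\sum_i \cos\langle \omega_i, x - y\rangle$ by the angle-difference identity, which is why the estimate is unbiased.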
Theorem (unbiased estimation): $\mathbb{E}[\langle z(x), z(y)\rangle] = k(x, y)$.
Proof: It suffices to prove the case of $\sigma = 1$. Use the trigonometric identity $\cos(a - b) = \cos a \cos b + \sin a \sin b$, the spherical symmetry of the Gaussian distribution, then evaluate the integral
$$\mathbb{E}_{\omega \sim N(0, I)}\left[\cos\langle \omega, x - y\rangle\right] = \int \cos\langle \omega, x - y\rangle \, \frac{e^{-\|\omega\|^2 / 2}}{(2\pi)^{d/2}} \, d\omega = e^{-\|x - y\|^2 / 2}.$$
Theorem (convergence): As the number of random features $D$ increases, $\langle z(x), z(y)\rangle$ converges to $k(x, y)$ uniformly over compact sets with high probability.[1]
Theorem (variance bound): $\operatorname{Var}[\langle z(x), z(y)\rangle] = O(D^{-1})$. (Appendix A.2[2]).
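The $O(D^{-1})$ rate is easy to observe empirically. In the sketch below (all sizes are illustrative), the variance of $\langle z(x), z(y)\rangle$ over independent draws of the feature map shrinks roughly tenfold when $D$ grows tenfold:

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 3, 1.0
x = rng.normal(size=d)
y = rng.normal(size=d)
diff = x - y

def estimates(D, trials=2000):
    # Each trial draws a fresh feature map omega_1, ..., omega_D and
    # evaluates <z(x), z(y)> = (1/D) * sum_i cos(<omega_i, x - y>).
    vals = np.empty(trials)
    for t in range(trials):
        W = rng.normal(scale=1.0 / sigma, size=(D, d))
        vals[t] = np.mean(np.cos(W @ diff))
    return vals

# Empirical variances of the kernel estimate for D = 10 and D = 100.
v10, v100 = estimates(10).var(), estimates(100).var()
```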
Orthogonal random features
Orthogonal random features[3] use a random orthogonal matrix, with rows rescaled to chi-distributed lengths, in place of the unstructured Gaussian frequency matrix; this reduces the approximation error of the kernel estimate.
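A minimal sketch of one orthogonal block of this construction, assuming the number of frequencies equals the input dimension $d$ (stacking independent blocks handles $D > d$):

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 4, 1.0

# Orthogonalize a Gaussian matrix, then rescale each row by an independent
# chi-distributed length, so that each row keeps the marginal norm
# distribution of a Gaussian vector.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
S = np.linalg.norm(rng.normal(size=(d, d)), axis=1)  # chi(d)-distributed
W = (S[:, None] * Q) / sigma  # rows are exactly orthogonal frequencies

# W replaces the unstructured Gaussian frequency matrix in the random
# Fourier feature map z.
```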
References
- ^ Rahimi, Ali; Recht, Benjamin (2007). "Random Features for Large-Scale Kernel Machines". Advances in Neural Information Processing Systems. 20.
- ^ Peng, Hao; Pappas, Nikolaos; Yogatama, Dani; Schwartz, Roy; Smith, Noah A.; Kong, Lingpeng (2021-03-19). "Random Feature Attention". arXiv:2103.02143 [cs.CL].
- ^ Yu, Felix Xinnan X; Suresh, Ananda Theertha; Choromanski, Krzysztof M; Holtmann-Rice, Daniel N; Kumar, Sanjiv (2016). "Orthogonal Random Features". Advances in Neural Information Processing Systems. 29. Curran Associates, Inc.