Random feature
Random Fourier features (RFF) are a technique used in machine learning to approximate kernel methods, introduced by Ali Rahimi and Ben Recht in their 2007 paper "Random Features for Large-Scale Kernel Machines".[1]
RFF is a Monte Carlo approximation to the feature map associated with shift-invariant kernels. The method maps the input data into a higher-dimensional space using randomly sampled sinusoidal functions. It is used for datasets that are too large for traditional kernel methods such as support vector machines, kernel ridge regression, and Gaussian processes.
Mathematics
Because support vector machines and other models employing the kernel trick do not scale well to large numbers of training samples or large numbers of features in the input space, several approximations to the RBF kernel (and similar kernels) have been introduced.[2] Typically, these take the form of a function z that maps a single vector to a vector of higher dimensionality, approximating the kernel:

\[ \langle z(x), z(y) \rangle \approx \langle \varphi(x), \varphi(y) \rangle = K(x, y), \]

where \( \varphi \) is the implicit mapping embedded in the RBF kernel.
One way to construct such a z is to randomly sample from the Fourier transform of the kernel:[3]

\[ z(x) = \frac{1}{\sqrt{D}} \big[ \cos\langle \omega_1, x \rangle, \sin\langle \omega_1, x \rangle, \ldots, \cos\langle \omega_D, x \rangle, \sin\langle \omega_D, x \rangle \big]^{\mathsf{T}}, \]

where \( \omega_1, \ldots, \omega_D \) are independent samples from the standard normal distribution \( N(0, I) \).
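The construction above can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the paper; the function name and the `sigma` bandwidth parameter are my own additions, with each \( \omega_i \sim N(0, \sigma^{-2} I) \) so that the inner products approximate the kernel \( K(x, y) = e^{-\|x - y\|^2 / (2\sigma^2)} \):

```python
import numpy as np

def random_fourier_features(X, D, sigma=1.0, rng=None):
    """Map the rows of X (shape (n, d)) to 2*D random Fourier features whose
    inner products approximate the RBF kernel exp(-||x - y||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    # omega_i ~ N(0, sigma^{-2} I): samples from the Fourier transform of the kernel
    W = rng.normal(scale=1.0 / sigma, size=(d, D))
    P = X @ W  # P[i, j] = <omega_j, x_i>
    # z(x) = (1/sqrt(D)) [cos<omega_1,x>, ..., cos<omega_D,x>, sin<omega_1,x>, ...]
    return np.hstack([np.cos(P), np.sin(P)]) / np.sqrt(D)

# The random-feature Gram matrix Z Z^T approximates the exact RBF Gram matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Z = random_fourier_features(X, D=5000, rng=1)
approx = Z @ Z.T
exact = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2)
```

The sketch groups all cosines before all sines rather than interleaving them as in the formula; the inner product is unchanged by this reordering.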
Theorem (unbiased estimation): \( \mathbb{E}[\langle z(x), z(y) \rangle] = e^{-\|x - y\|^2 / 2} = K(x, y) \).
Proof: It suffices to prove the case of \( D = 1 \). Use the trigonometric identity \( \cos(a - b) = \cos a \cos b + \sin a \sin b \), the spherical symmetry of the Gaussian distribution, then evaluate the integral

\[ \mathbb{E}[\cos\langle \omega, x - y \rangle] = \int \frac{e^{-\|\omega\|^2/2}}{(2\pi)^{d/2}} \cos\langle \omega, x - y \rangle \, d\omega = e^{-\|x - y\|^2/2}. \]
Theorem (convergence): As the number of random features D increases, the approximation converges to the true kernel uniformly on compact sets with high probability. Concretely, Claim 1 of [1] states that for any compact subset \( \mathcal{M} \subset \mathbb{R}^d \),

\[ \Pr\Big[ \sup_{x, y \in \mathcal{M}} \big| \langle z(x), z(y) \rangle - K(x, y) \big| \geq \varepsilon \Big] \leq 2^8 \left( \frac{\sigma_p \, \mathrm{diam}(\mathcal{M})}{\varepsilon} \right)^2 e^{-D \varepsilon^2 / (4(d+2))}, \]

where \( \sigma_p^2 = \mathbb{E}[\omega^{\mathsf{T}} \omega] \) is the second moment of the sampling distribution.
Theorem (variance bound): \( \operatorname{Var}[\langle z(x), z(y) \rangle] = \frac{1}{2D} \big( 1 - e^{-\|x - y\|^2} \big)^2 \). (Appendix A.2 of [4].)
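Both the unbiasedness and the variance formula are easy to check numerically. The Monte Carlo sketch below is my own (not from either cited paper); it takes \( \sigma = 1 \), so the target kernel is \( e^{-\|x - y\|^2/2} \), and uses the identity \( \langle z(x), z(y) \rangle = \frac{1}{D} \sum_j \cos\langle \omega_j, x - y \rangle \) to draw many independent copies of the estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, trials = 3, 8, 200_000
x, y = rng.normal(size=d), rng.normal(size=d)
delta = x - y

# <z(x), z(y)> = (1/D) sum_j cos<omega_j, x - y>, with omega_j ~ N(0, I)
W = rng.normal(size=(trials, d, D))         # `trials` independent feature sets
proj = np.einsum('j,tjk->tk', delta, W)     # <omega_j, x - y> for each trial
estimates = np.cos(proj).mean(axis=1)       # one kernel estimate per trial

mean_theory = np.exp(-delta @ delta / 2)                   # unbiased estimation
var_theory = (1 - np.exp(-delta @ delta)) ** 2 / (2 * D)   # variance formula
```

The sample mean and sample variance of `estimates` match the two theorems to within Monte Carlo error.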
Variations
Orthogonal random features[5] replaces the i.i.d. random Gaussian projection matrix with a random orthogonal matrix whose rows are rescaled to have chi-distributed norms, so that each row is marginally Gaussian while distinct rows are exactly orthogonal; this reduces the variance of the kernel estimate.
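A minimal sketch of the block construction of Yu et al. (2016), under my reading of the paper: draw a square Gaussian matrix, orthogonalize it with a QR decomposition, and rescale each row by a chi-distributed norm so its marginal distribution matches a Gaussian row. The function name and block handling are my own:

```python
import numpy as np

def orthogonal_random_matrix(d, D, rng=None):
    """Return a (D, d) projection matrix built from orthogonal blocks,
    a drop-in replacement for the i.i.d. Gaussian matrix in RFF."""
    rng = np.random.default_rng(rng)
    blocks = []
    for _ in range(-(-D // d)):  # ceil(D / d) square blocks
        # QR of a square Gaussian matrix gives an orthogonal Q
        Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
        # Rescale each row by a chi-distributed norm so its marginal
        # distribution matches that of a d-dimensional Gaussian row.
        S = np.sqrt(rng.chisquare(d, size=d))
        blocks.append(S[:, None] * Q)
    return np.vstack(blocks)[:D]

W = orthogonal_random_matrix(d=4, D=4, rng=0)
gram = W @ W.T  # off-diagonal entries vanish: rows within a block are orthogonal
```

Within each block the rows are exactly orthogonal, which is the source of the variance reduction relative to i.i.d. Gaussian rows.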
References
- ^ Rahimi, Ali; Recht, Benjamin (2007). "Random Features for Large-Scale Kernel Machines". Advances in Neural Information Processing Systems. 20.
- ^ Andreas Müller (2012). Kernel Approximations for Efficient SVMs (and other feature extraction methods).
- ^ Rahimi, Ali; Recht, Benjamin (2007). "Random Features for Large-Scale Kernel Machines". Advances in Neural Information Processing Systems. 20. Curran Associates, Inc.
- ^ Peng, Hao; Pappas, Nikolaos; Yogatama, Dani; Schwartz, Roy; Smith, Noah A.; Kong, Lingpeng (2021-03-19). "Random Feature Attention". arXiv:2103.02143 [cs.CL].
- ^ Yu, Felix Xinnan X; Suresh, Ananda Theertha; Choromanski, Krzysztof M; Holtmann-Rice, Daniel N; Kumar, Sanjiv (2016). "Orthogonal Random Features". Advances in Neural Information Processing Systems. 29. Curran Associates, Inc.