Random feature

Random features (RF) are a technique used in machine learning to approximate kernel methods, introduced by Ali Rahimi and Ben Recht in their 2007 paper "Random Features for Large-Scale Kernel Machines".^[1]

RF forms a Monte Carlo approximation of a kernel function using randomly sampled feature maps. It is used for datasets that are too large for traditional kernel methods such as support vector machines, kernel ridge regression, and Gaussian processes.

Mathematics

Kernel method

Given a feature map ${\textstyle \phi :\mathbb {R} ^{d}\to V}$ , where ${\textstyle V}$ is a Hilbert space (more specifically, a reproducing kernel Hilbert space), the kernel trick replaces inner products in feature space $\langle \phi (x_{i}),\phi (x_{j})\rangle _{V}$ by evaluations $k(x_{i},x_{j})$ of a kernel function $k:\mathbb {R} ^{d}\times \mathbb {R} ^{d}\to \mathbb {R}$ . Kernel methods perform linear operations in the high-dimensional feature space by manipulating the kernel matrix instead: $K_{X}:=[k(x_{i},x_{j})]_{i,j\in 1:N}$ where ${\textstyle N}$ is the number of data points.
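As an illustration of the kernel matrix, the following minimal sketch builds $K_{X}$ for a small data set. The choice of the RBF kernel, the function name `rbf_kernel_matrix`, and the sample data are assumptions made for the example, not part of the definition above.

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    """Return the N x N matrix K with K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_norms = np.sum(X**2, axis=1)
    # Squared Euclidean distances via ||a||^2 + ||b||^2 - 2 <a, b>.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives caused by round-off
    return np.exp(-sq_dists / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # N = 5 points in R^3 (illustrative data)
K = rbf_kernel_matrix(X)
print(K.shape)
```

The matrix is symmetric with ones on the diagonal, and its storage grows quadratically with the number of rows of `X`.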

Random kernel method

The problem with kernel methods is that the kernel matrix ${\textstyle K_{X}}$ has size ${\textstyle N\times N}$ . This becomes computationally infeasible when ${\textstyle N}$ reaches the order of a million. The random kernel method replaces the kernel function ${\textstyle k}$ by an inner product in low-dimensional feature space ${\textstyle \mathbb {R} ^{D}}$ : $k(x,y)\approx \langle z(x),z(y)\rangle$ where ${\textstyle z}$ is a randomly sampled feature map ${\textstyle z:\mathbb {R} ^{d}\to \mathbb {R} ^{D}}$ .

This converts kernel linear regression into linear regression in feature space, kernel SVM into feature SVM, etc. These methods no longer involve matrices of size ${\textstyle O(N^{2})}$ , but only random feature matrices of size ${\textstyle O(DN)}$ .
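A minimal sketch of ridge regression carried out in random feature space is given below. It assumes random Fourier features for the RBF kernel as the feature map $z$; the regularization strength `lam`, the synthetic data, and the function name are hypothetical choices for illustration only.

```python
import numpy as np

def random_fourier_features(X, W):
    """Map the rows of X (N x d) to 2D random features using frequency rows W (D x d)."""
    proj = X @ W.T
    # Cosines and sines are concatenated rather than interleaved; the inner
    # product <z(x), z(y)> is the same either way.
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(W.shape[0])

rng = np.random.default_rng(0)
N, d, D, sigma, lam = 500, 2, 100, 1.0, 1e-2  # illustrative sizes and parameters

X = rng.normal(size=(N, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)  # synthetic regression target

W = rng.normal(scale=1.0 / sigma, size=(D, d))  # rows w_i ~ N(0, sigma^-2 I)
Z = random_fourier_features(X, W)               # N x 2D feature matrix, never N x N

# Ridge regression in feature space: a (2D x 2D) linear system, not an (N x N) one.
A = Z.T @ Z + lam * np.eye(Z.shape[1])
beta = np.linalg.solve(A, Z.T @ y)

train_mse = np.mean((Z @ beta - y) ** 2)
print(Z.shape, train_mse)
```

Forming `Z.T @ Z` here costs on the order of $ND^{2}$ operations, so the cost is linear in $N$ for fixed $D$.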

Examples

Radial basis function kernel

The radial basis function (RBF) kernel on two samples $x_{i},x_{j}\in \mathbb {R} ^{d}$ is defined as^[2]

$k(x_{i},x_{j})=\exp \left(-{\frac {\|x_{i}-x_{j}\|^{2}}{2\sigma ^{2}}}\right)$

where $\|x_{i}-x_{j}\|^{2}$ is the squared Euclidean distance and $\sigma$ is a free parameter defining the shape of the kernel. It can be approximated by a random Fourier feature map $z:\mathbb {R} ^{d}\to \mathbb {R} ^{2D}$ : $z(x):={\frac {1}{\sqrt {D}}}[\cos \langle w_{1},x\rangle ,\sin \langle w_{1},x\rangle ,\ldots ,\cos \langle w_{D},x\rangle ,\sin \langle w_{D},x\rangle ]^{T}$ where $w_{1},...,w_{D}$ are IID samples from the multidimensional normal distribution $N(0,\sigma ^{-2}I)$ .
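The feature map above can be sketched directly in code. The following example follows the interleaved cosine and sine layout of $z(x)$ and compares $\langle z(x),z(y)\rangle$ with the exact RBF kernel for one pair of points. The function names and the values of `d`, `D`, and `sigma` are assumptions chosen for the illustration.

```python
import numpy as np

def sample_frequencies(d, D, sigma, rng):
    """Draw w_1..w_D i.i.d. from N(0, sigma^-2 I) as the rows of a (D, d) array."""
    return rng.normal(scale=1.0 / sigma, size=(D, d))

def z(x, W):
    """Random Fourier feature map R^d -> R^(2D): [cos, sin, cos, sin, ...] / sqrt(D)."""
    proj = W @ x                      # <w_i, x> for i = 1..D
    feats = np.empty(2 * W.shape[0])
    feats[0::2] = np.cos(proj)
    feats[1::2] = np.sin(proj)
    return feats / np.sqrt(W.shape[0])

def rbf(x, y, sigma):
    """Exact RBF kernel value."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma**2))

rng = np.random.default_rng(0)
d, D, sigma = 3, 5000, 1.5
W = sample_frequencies(d, D, sigma, rng)
x = rng.normal(size=d)
y = rng.normal(size=d)

approx = z(x, W) @ z(y, W)
exact = rbf(x, y, sigma)
print(approx, exact)
```

Because $\cos ^{2}+\sin ^{2}=1$ , this map always gives $\langle z(x),z(x)\rangle =1$ , matching $k(x,x)=1$ exactly.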

Theorem

(Unbiased estimation) $\operatorname {E} [\langle z(x),z(y)\rangle ]=e^{-\|x-y\|^{2}/(2\sigma ^{2})}.$
(Variance bound) $\operatorname {Var} [\langle z(x),z(y)\rangle ]=O(D^{-1}).$
(Convergence) As $D\to \infty$ , the approximation converges in probability to the true kernel.
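The three statements can be checked empirically. The sketch below is an illustrative Monte Carlo experiment, not part of the theorem: it redraws the frequencies many times for several values of $D$ and records the sample mean and sample variance of the estimate. The test points, the number of trials, and the function name are assumptions.

```python
import numpy as np

def rff_estimate(x, y, D, sigma, rng):
    """One draw of <z(x), z(y)> = (1/D) * sum_i cos(<w_i, x - y>)."""
    W = rng.normal(scale=1.0 / sigma, size=(D, x.size))
    return np.mean(np.cos(W @ (x - y)))

rng = np.random.default_rng(0)
sigma = 1.0
x = np.array([0.3, -0.2])
y = np.array([-0.5, 0.4])
exact = np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma**2))

trials = 2000
results = {}
for D in (10, 100, 1000):
    draws = np.array([rff_estimate(x, y, D, sigma, rng) for _ in range(trials)])
    results[D] = (draws.mean(), draws.var())
    # Bias should be near zero; D * variance should stay roughly constant.
    print(D, draws.mean() - exact, D * draws.var())
```

If the variance scales as $D^{-1}$ , the product of $D$ and the sample variance should be about the same in every row of the output.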

Proof

(Unbiased estimation) By the trigonometric identity $\cos(a-b)=\cos(a)\cos(b)+\sin(a)\sin(b)$ , $\langle z(x),z(y)\rangle ={\frac {1}{D}}\sum _{i=1}^{D}\cos \langle w_{i},x-y\rangle$ . Since $w_{1},...,w_{D}$ are identically distributed, linearity of expectation reduces the claim to the case $D=1$ . By the spherical symmetry of the normal distribution, $\langle w_{1},x-y\rangle$ is a one-dimensional normal variable with mean zero and standard deviation $\|x-y\|/\sigma$ , so the result follows from the integral $\int _{-\infty }^{\infty }{\frac {\cos(ct)e^{-t^{2}/2}}{\sqrt {2\pi }}}dt=e^{-c^{2}/2}$ with $c=\|x-y\|/\sigma$ .

(Variance bound) Since $w_{1},...,w_{D}$ are IID, it suffices to prove that the variance of $\cos \langle w_{1},x-y\rangle$ is finite, which is true since it is bounded within $[-1,+1]$ .

(Convergence) By Chebyshev's inequality.
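The variance bound and the convergence statement can be made explicit. The following worked computation is a sketch that uses only the definitions above; the shorthand $s=\|x-y\|/\sigma$ , the variable $\theta =\langle w_{1},x-y\rangle \sim N(0,s^{2})$ , and the tolerance $\varepsilon >0$ are introduced here for illustration. It uses $\cos ^{2}\theta =(1+\cos 2\theta )/2$ and the fact that $\operatorname {E} [\cos(a\theta )]=e^{-a^{2}s^{2}/2}$ .

```latex
\begin{aligned}
\operatorname{E}[\cos^{2}\theta] &= \frac{1 + e^{-2s^{2}}}{2}, \\
\operatorname{Var}[\cos\theta] &= \frac{1 + e^{-2s^{2}}}{2} - e^{-s^{2}}
  = \frac{\left(1 - e^{-s^{2}}\right)^{2}}{2}, \\
\operatorname{Var}[\langle z(x), z(y)\rangle] &= \frac{\left(1 - e^{-s^{2}}\right)^{2}}{2D}
  \le \frac{1}{2D}, \\
\Pr\left(\left|\langle z(x), z(y)\rangle - k(x,y)\right| \ge \varepsilon\right)
  &\le \frac{1}{2D\varepsilon^{2}}.
\end{aligned}
```

The last line is Chebyshev's inequality applied to the unbiased estimate; its right-hand side tends to zero as $D\to \infty$ for every fixed $\varepsilon$ .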

Random binning features

A random binning features map partitions the input space using randomly shifted grids at randomly chosen resolutions and assigns to an input point a bit string that encodes the bins in which it falls. The grids are constructed so that the probability that two points $x_{i},x_{j}\in \mathbb {R} ^{d}$ are assigned to the same bin is proportional to $k(x_{i},x_{j})$ . The inner product between a pair of transformed points is proportional to the number of times the two points are binned together, and is therefore an unbiased estimate of $k(x_{i},x_{j})$ . Since this mapping is not smooth and uses the proximity between input points, random binning features work well for approximating kernels that depend only on the $L_{1}$ distance between data points.
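The construction can be sketched for one concrete case. The example below assumes the Laplacian kernel $\exp(-\|x-y\|_{1}/\sigma )$ , for which the grid pitch along each axis can be drawn from a Gamma distribution with shape 2 and scale $\sigma$ ; this kernel choice, the function names, and the number of grids `P` are assumptions made for illustration. With a uniformly random shift, two coordinates at distance $\Delta$ share a cell of pitch $\delta$ with probability $\max(0,1-\Delta /\delta )$ , and averaging over the pitch distribution recovers the kernel.

```python
import numpy as np

def sample_grids(d, P, sigma, rng):
    """Draw P random grids: one pitch and one shift per grid and per axis."""
    # Assumption: Laplacian kernel, whose pitch distribution is Gamma(2, sigma).
    pitches = rng.gamma(shape=2.0, scale=sigma, size=(P, d))
    shifts = rng.uniform(0.0, pitches)  # shift drawn uniformly in [0, pitch)
    return pitches, shifts

def bin_ids(x, pitches, shifts):
    """Integer bin coordinates of x in each of the P grids, shape (P, d)."""
    return np.floor((x - shifts) / pitches).astype(np.int64)

def binning_estimate(x, y, pitches, shifts):
    """Fraction of grids in which x and y fall into the same bin."""
    same = np.all(bin_ids(x, pitches, shifts) == bin_ids(y, pitches, shifts), axis=1)
    return same.mean()

rng = np.random.default_rng(0)
sigma, P = 1.0, 20000
x = np.array([0.2, -0.1])
y = np.array([0.5, 0.3])
pitches, shifts = sample_grids(x.size, P, sigma, rng)

approx = binning_estimate(x, y, pitches, shifts)
exact = np.exp(-np.sum(np.abs(x - y)) / sigma)
print(approx, exact)
```

The fraction computed by `binning_estimate` equals the inner product of the two points' one-hot bin indicators across all grids, scaled by $1/P$ .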

Orthogonal random features

Orthogonal random features^[3] replace the random Gaussian matrix of frequencies used in random Fourier features with a suitably scaled random orthogonal matrix.
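One way to sketch this idea for the RBF kernel is shown below. It assumes the frequency matrix is built from blocks of the form $\sigma ^{-1}SQ$ , where $Q$ is a random orthogonal matrix obtained from the QR decomposition of a Gaussian matrix and $S$ is a diagonal matrix of chi-distributed row norms; the function names, the block stacking for $D>d$ , and the parameter values are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def orthogonal_frequencies(d, D, sigma, rng):
    """Stack ceil(D / d) blocks of (1 / sigma) * S @ Q and keep the first D rows."""
    blocks = []
    for _ in range(-(-D // d)):  # ceiling division
        G = rng.normal(size=(d, d))
        Q, R = np.linalg.qr(G)
        Q = Q * np.sign(np.diag(R))  # sign fix so Q is uniformly distributed
        # Row norms follow the chi distribution with d degrees of freedom,
        # like the row norms of a d x d standard Gaussian matrix.
        S = np.sqrt(rng.chisquare(df=d, size=d))
        blocks.append(S[:, None] * Q / sigma)
    return np.vstack(blocks)[:D]

def feature_inner(x, y, W):
    """<z(x), z(y)> for the Fourier feature map built from the rows of W."""
    return np.mean(np.cos(W @ (x - y)))

rng = np.random.default_rng(0)
d, D, sigma = 4, 2000, 1.0
W = orthogonal_frequencies(d, D, sigma, rng)
x = rng.normal(size=d)
y = rng.normal(size=d)

approx = feature_inner(x, y, W)
exact = np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma**2))
print(approx, exact)
```

Within each block the rows of `W` are mutually orthogonal, while each row on its own has the same distribution as a Gaussian frequency vector.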

References

[1] Rahimi, Ali; Recht, Benjamin (2007). "Random Features for Large-Scale Kernel Machines". Advances in Neural Information Processing Systems. 20.
[2] Vert, Jean-Philippe; Tsuda, Koji; Schölkopf, Bernhard (2004). "A primer on kernel methods". Kernel Methods in Computational Biology.
[3] Yu, Felix Xinnan X; Suresh, Ananda Theertha; Choromanski, Krzysztof M; Holtmann-Rice, Daniel N; Kumar, Sanjiv (2016). "Orthogonal Random Features". Advances in Neural Information Processing Systems. 29. Curran Associates, Inc.
