Kernel perceptron

From Wikipedia, the free encyclopedia

In machine learning, the kernel perceptron is a variant of the popular perceptron learning algorithm that can learn kernel machines, i.e. non-linear classifiers that employ a kernel function to compute the similarity of unseen samples to training samples. The history of the algorithm can be traced to the 1960s.[1]

Preliminaries

The perceptron algorithm

The perceptron algorithm is an old but popular online learning algorithm that operates by a principle called "error-driven learning": it iteratively improves a model by running it on training samples, then updating the model whenever it finds it has made an incorrect classification with respect to a supervised signal. The model learned by the standard perceptron algorithm is a linear binary classifier: a vector of weights w (and optionally an intercept term b, omitted here for simplicity) that is used to classify a sample vector x as class "one" or class "minus one" according to

ŷ = sgn(wᵀx)

where a zero is arbitrarily mapped to one or minus one.[2]

In pseudocode, the perceptron algorithm is given by:

Initialize w to an all-zero vector of length p, the number of predictors (features).
For each training example xᵢ with ground truth label yᵢ ∈ {-1, 1}:
    Let ŷ = sgn(wᵀxᵢ).
    If ŷ ≠ yᵢ, update w := w + yᵢxᵢ.
Repeat this procedure until convergence, or for a set number of iterations ("epochs").
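
For concreteness, here is a minimal sketch of this procedure in Python with NumPy; the arrays X and y are assumed to be a feature matrix and a vector of ±1 labels, and are not taken from the article:

import numpy as np

def perceptron_train(X, y, epochs=10):
    """Train a linear perceptron. X: (n, p) feature matrix, y: (n,) labels in {-1, +1}."""
    n, p = X.shape
    w = np.zeros(p)                               # all-zero weight vector of length p
    for _ in range(epochs):                       # fixed number of passes ("epochs")
        mistakes = 0
        for i in range(n):
            y_hat = 1 if w @ X[i] >= 0 else -1    # sgn(w^T x_i), mapping zero to +1
            if y_hat != y[i]:
                w = w + y[i] * X[i]               # error-driven update: w := w + y_i x_i
                mistakes += 1
        if mistakes == 0:                         # converged: a full pass with no errors
            break
    return w

def perceptron_predict(w, x):
    return 1 if w @ x >= 0 else -1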

Kernel machines

By contrast with the linear models learned by the perceptron, a kernel machine[3] is a classifier that stores a subset of its training examples xᵢ, associates with each a weight αᵢ, and makes decisions for new samples x' by evaluating

ŷ = sgn(Σᵢ αᵢ yᵢ K(xᵢ, x'))

Here, K is some kernel function. Formally, a kernel function is a positive-semidefinite kernel (see Mercer's condition); intuitively, it can be thought of as a similarity function between samples. Each function x' ↦ K(xᵢ, x') serves as a basis function in the classification.
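
As an illustration, here is a sketch in Python of such a decision rule, with a Gaussian (RBF) kernel standing in for K; the stored examples X_s, labels y_s and weights alpha are assumed inputs, and the kernel choice is only an example, not something the article prescribes:

import numpy as np

def rbf_kernel(x, x_prime, gamma=1.0):
    """Gaussian (RBF) kernel, one example of a Mercer kernel used as a similarity."""
    return np.exp(-gamma * np.sum((x - x_prime) ** 2))

def kernel_machine_predict(X_s, y_s, alpha, x_new, kernel=rbf_kernel):
    """Evaluate sgn(sum_i alpha_i y_i K(x_i, x')) over the stored examples."""
    score = sum(a * yi * kernel(xi, x_new) for a, yi, xi in zip(alpha, y_s, X_s))
    return 1 if score >= 0 else -1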

Algorithm

To derive a kernelized version of the perceptron algorithm, we must first formulate it in dual form, starting from the observation that the weight vector w can be expressed as a linear combination of the n training samples:

w = Σᵢ αᵢ yᵢ xᵢ

where αᵢ is the number of times xᵢ was misclassified, forcing an update w := w + yᵢxᵢ. The dual algorithm thus loops through the samples as before, making predictions, but instead of explicitly updating the per-feature weight vector w, it updates the per-sample "mistake counter" vector α. We can also rewrite the prediction formula as

ŷ = sgn(wᵀx) = sgn(Σᵢ αᵢ yᵢ (xᵢᵀx))
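
As a quick numerical check of this identity, using purely illustrative data and mistake counts (none of the numbers below come from the article):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))               # five illustrative samples with three features
y = np.array([1, -1, 1, 1, -1])           # labels in {-1, +1}
alpha = np.array([2, 0, 1, 0, 3])         # hypothetical per-sample mistake counts

w = (alpha * y) @ X                       # w = sum_i alpha_i y_i x_i
x_new = rng.normal(size=3)

primal = w @ x_new                        # w^T x
dual = np.sum(alpha * y * (X @ x_new))    # sum_i alpha_i y_i (x_i^T x)
assert np.isclose(primal, dual)           # both forms give the same decision score
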
Finally, we can replace the dot product (linear kernel) in the dual version of the perceptron algorithm by a different kernel function. Doing this yields the kernel perceptron algorithm:[4]

Initialize α to an all-zeros vector of length n, the number of training samples.
For some fixed number of iterations, or until the model makes no mistakes on the training set:
    For each training example xᵢ, yᵢ:
        Let ŷ = sgn(Σⱼ αⱼ yⱼ K(xⱼ, xᵢ)).
        If ŷ ≠ yᵢ, perform an update:
            αᵢ := αᵢ + 1
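
A minimal sketch of this training loop in Python with NumPy; the kernel argument is any function of two sample vectors (for example the RBF kernel sketched earlier), and the data layout is an assumption rather than part of the algorithm:

import numpy as np

def kernel_perceptron_train(X, y, kernel, epochs=10):
    """X: (n, p) training samples, y: (n,) labels in {-1, +1}, kernel: K(x, x')."""
    n = X.shape[0]
    alpha = np.zeros(n)                                   # per-sample mistake counters
    # Precompute the Gram matrix K[i, j] = K(x_i, x_j) once.
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            score = np.sum(alpha * y * K[:, i])           # sum_j alpha_j y_j K(x_j, x_i)
            y_hat = 1 if score >= 0 else -1
            if y_hat != y[i]:
                alpha[i] += 1                             # alpha_i := alpha_i + 1
                mistakes += 1
        if mistakes == 0:                                 # no mistakes on the training set
            break
    return alpha

def kernel_perceptron_predict(X, y, alpha, kernel, x_new):
    score = sum(a * yi * kernel(xi, x_new) for a, yi, xi in zip(alpha, y, X))
    return 1 if score >= 0 else -1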

Forgetron

A problem with the kernel perceptron, as presented above, is that it does not learn sparse kernel machines. Initially, all the αᵢ are zero, so that evaluating the decision function to get ŷ requires no kernel evaluations at all, but each update increments a single αᵢ, making the evaluation increasingly costly. Moreover, when the kernel perceptron is used in an online setting, the number of non-zero αᵢ and thus the evaluation cost grow linearly in the number of examples presented to the algorithm.

The forgetron variant of the kernel perceptron was suggested to deal with this problem. It maintains an active set of examples with non-zero αᵢ, removing ("forgetting") examples from the active set when it exceeds a pre-determined budget and "shrinking" (lowering the weight of) old examples as new ones are promoted to non-zero αᵢ.[5]
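
By way of illustration only, the following Python sketch keeps a fixed-size active set by evicting the oldest stored example once the budget is exceeded; the actual forgetron of Dekel et al. also shrinks the weights of older examples before removal, a step omitted here for brevity:

from collections import deque

def budgeted_kernel_perceptron(stream, kernel, budget=100):
    """Online learning over a stream of (x, y) pairs with y in {-1, +1}.
    Keeps at most `budget` examples with non-zero weight (simplified; no shrinking)."""
    active = deque()                                  # (alpha, y, x) triples, oldest first
    for x, y in stream:
        score = sum(a * yi * kernel(xi, x) for a, yi, xi in active)
        y_hat = 1 if score >= 0 else -1
        if y_hat != y:
            active.append((1.0, y, x))                # promote the misclassified example
            if len(active) > budget:
                active.popleft()                      # "forget" the oldest stored example
    return active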

See also

Notes and references

  1. ^ Aizerman, M. A.; Emmanuel M. Braverman; L. I. Rozoner (1964). "Theoretical foundations of the potential function method in pattern recognition learning". Automation and Remote Control. 25: 821–837. Cited in Guyon, Isabelle; B. Boser; Vladimir Vapnik (1993). Automatic capacity tuning of very large VC-dimension classifiers. Advances in neural information processing systems.
  2. ^ Using the hat to denote an estimated value.
  3. ^ Schölkopf, Bernhard; and Smola, Alexander J.; Learning with Kernels, MIT Press, Cambridge, MA, 2002. ISBN 0-262-19475-9
  4. ^ John Shawe-Taylor; Nello Cristianini (2004). Kernel Methods for Pattern Analysis. Cambridge University Press. pp. 241–242.
  5. ^ Dekel, Ofer; Shai Shalev-Shwartz; Yoram Singer (2008). "The forgetron: A kernel-based perceptron on a budget" (PDF). SIAM Journal on Computing. 37 (5): 1342–1372.