Kernel perceptron

In machine learning, the kernel perceptron is a variant of the popular perceptron learning algorithm that can learn kernel machines, i.e. non-linear classifiers that employ a kernel function to compute the similarity of unseen samples to training samples.

Preliminaries

The perceptron algorithm

The perceptron algorithm is an old but popular online learning algorithm that operates by a principle called "error-driven learning": it iteratively improves a model by running it on training samples, then updating the model whenever it finds it has made an incorrect classification with respect to a supervised signal. The model learned by the standard perceptron algorithm is a linear binary classifier: a vector of weights w (and optionally an intercept term b, omitted here for simplicity) that is used to classify a sample vector x as class "one" or class "minus one" according to

ŷ = sgn(wᵀx)

where a zero is arbitrarily mapped to one or minus one.[1]
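
For example, with purely illustrative weights w = (2, −1) and no intercept, the sample x = (1, 1) gives wᵀx = 2·1 + (−1)·1 = 1 > 0, so it is classified as ŷ = 1 (class "one").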

In pseudocode, the perceptron algorithm is given by:

Initialize w to an all-zero vector of length p, the number of predictors (features).
For each training example xᵢ with ground truth label yᵢ ∈ {−1, 1}:
    Let ŷ = sgn(wᵀxᵢ).
    If ŷ ≠ yᵢ, update w := w + yᵢxᵢ.
Repeat this procedure until convergence, or for a set amount of iterations ("epochs").
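
A minimal runnable sketch of this loop in Python with NumPy (the function name and defaults are illustrative, not taken from the sources):

    import numpy as np

    def perceptron_train(X, y, epochs=10):
        """Train a linear perceptron without an intercept.
        X: (n, p) array of samples; y: array of labels in {-1, +1}."""
        n, p = X.shape
        w = np.zeros(p)                          # all-zero initial weights
        for _ in range(epochs):
            mistakes = 0
            for x_i, y_i in zip(X, y):
                y_hat = 1 if w @ x_i >= 0 else -1  # sgn, with sgn(0) mapped to +1
                if y_hat != y_i:
                    w += y_i * x_i               # error-driven update
                    mistakes += 1
            if mistakes == 0:                    # converged: a full pass with no mistakes
                break
        return w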

Kernel machines

By contrast with the linear models learned by the perceptron, a kernel machine[2] is a classifier that stores a subset of its training examples xᵢ, associates with each a weight αᵢ, and makes decisions for new samples x' by evaluating

ŷ = sgn(Σᵢ αᵢ yᵢ K(xᵢ, x'))

Here, K is some kernel function. Formally, a kernel function is a positive-semidefinite kernel (see Mercer's condition); intuitively, it can be thought of as a similarity function between samples. Each function x' ↦ K(xᵢ, x') serves as a basis function in the classification.
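
As an illustrative sketch of this decision function in Python, using the Gaussian (RBF) kernel as one common positive-semidefinite choice (the sources do not prescribe a particular kernel):

    import numpy as np

    def rbf_kernel(a, b, gamma=1.0):
        """Gaussian (RBF) kernel: K(a, b) = exp(-gamma * ||a - b||^2)."""
        return np.exp(-gamma * np.sum((a - b) ** 2))

    def kernel_machine_predict(x_new, X_stored, y_stored, alpha, kernel=rbf_kernel):
        """Evaluate ŷ = sgn(Σᵢ αᵢ yᵢ K(xᵢ, x'))."""
        s = sum(a_i * y_i * kernel(x_i, x_new)
                for a_i, y_i, x_i in zip(alpha, y_stored, X_stored))
        return 1 if s >= 0 else -1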

Algorithm

The perceptron algorithm can be expressed in dual form, in which it can be seen to learn a linear combination of its n training samples:

w = Σᵢ αᵢ yᵢ xᵢ

where αᵢ is the number of times the i'th sample was misclassified. The required changes to the algorithm are that the per-feature weight vector w is replaced with a per-example weight vector α, initially all zero; the update rule w := w + yᵢxᵢ is replaced by αᵢ := αᵢ + 1; and predictions ŷ for a sample x are made by evaluating

ŷ = sgn(Σᵢ αᵢ yᵢ (xᵢ · x))
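
A quick numerical check of the primal–dual equivalence, with toy values chosen purely for illustration:

    import numpy as np

    # Toy data: three training samples, labels, and mistake counts α.
    X = np.array([[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]])
    y = np.array([1.0, -1.0, 1.0])
    alpha = np.array([2.0, 1.0, 0.0])
    x_new = np.array([1.0, 1.0])

    w = (alpha * y) @ X                     # primal weights: w = Σᵢ αᵢ yᵢ xᵢ
    primal = w @ x_new                      # wᵀx
    dual = np.sum(alpha * y * (X @ x_new))  # Σᵢ αᵢ yᵢ (xᵢ · x)
    assert np.isclose(primal, dual)         # equal by linearity of the dot product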

By replacing the dot product in the dual version of the perceptron algorithm with an arbitrary kernel function (the dot product itself being the linear kernel), we can make it learn kernel machines, giving the final algorithm:[3]

Initialize α to an all-zeros vector of length n, the number of training samples.
For some fixed number of iterations, or until the model makes no mistakes on the training set:
    For each training example xᵢ, yᵢ:
        Let ŷ = sgn(Σⱼ αⱼ yⱼ K(xⱼ, xᵢ)).
        If ŷ ≠ yᵢ, perform an update:
            αᵢ := αᵢ + 1
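
A minimal sketch of this algorithm in Python with NumPy (function name and the Gram-matrix precomputation are illustrative choices, not mandated by the sources):

    import numpy as np

    def kernel_perceptron_train(X, y, kernel, epochs=10):
        """Kernel perceptron: learn per-example mistake counts α.
        X: (n, p) array of samples; y: array of labels in {-1, +1};
        kernel: a function K(x, x')."""
        n = len(X)
        alpha = np.zeros(n)
        # Precompute the Gram matrix K[i, j] = kernel(X[i], X[j]).
        K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
        for _ in range(epochs):
            mistakes = 0
            for i in range(n):
                # ŷ = sgn(Σⱼ αⱼ yⱼ K(xⱼ, xᵢ))
                y_hat = 1 if np.sum(alpha * y * K[:, i]) >= 0 else -1
                if y_hat != y[i]:
                    alpha[i] += 1        # count the mistake on example i
                    mistakes += 1
            if mistakes == 0:            # no mistakes in a full pass
                break
        return alpha

Prediction for a new sample x' then evaluates sgn(Σᵢ αᵢ yᵢ K(xᵢ, x')), so only the training examples with nonzero αᵢ need to be stored.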

Notes and references

  1. Using the hat to denote an estimated value.
  2. Schölkopf, Bernhard; Smola, Alexander J. (2002). Learning with Kernels. Cambridge, MA: MIT Press. ISBN 0-262-19475-9.
  3. Shawe-Taylor, John; Cristianini, Nello (2004). Kernel Methods for Pattern Analysis. Cambridge University Press. pp. 241–242.