Kernel perceptron
In machine learning, the kernel perceptron is a variant of the popular perceptron learning algorithm that can learn kernel machines, i.e. non-linear classifiers that employ a kernel function to compute the similarity of unseen samples to training samples.
Preliminaries
The perceptron algorithm
The perceptron algorithm is an old but popular online learning algorithm that operates by a principle called "error-driven learning": it iteratively improves a model by running it on training samples, then updating the model whenever it finds it has made an incorrect classification with respect to a supervised signal. The model learned by the standard perceptron algorithm is a linear binary classifier: a vector of weights w (and optionally an intercept term b) that is used to classify a sample vector x as class "one" or class "minus one" according to

ŷ = sgn(wᵀ x)

where a zero is arbitrarily mapped to one or minus one.
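This decision rule can be sketched in a few lines; the weights, sample, and intercept below are illustrative values, not taken from the article:

```python
import numpy as np

# Hypothetical weights, intercept, and sample (illustrative only)
w = np.array([0.5, -1.0, 0.25])
b = 0.1
x = np.array([1.0, 2.0, 4.0])

# Classify by the sign of the score; a zero score is mapped to +1 here by convention
score = w @ x + b
y_hat = 1 if score >= 0 else -1
```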
In pseudocode, the perceptron algorithm is given by:
- Initialize w to an all-zero vector of length p, the number of predictors (features).
- For each training example x with ground truth label y ∈ {-1, 1}:
- Let ŷ = sgn(wᵀ x).[1]
- If ŷ ≠ y, update w := w + y x.
- Repeat this procedure until convergence, or for a set amount of iterations ("epochs").
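The pseudocode above can be sketched as follows; the function name and toy data are our own, and the mistake-free pass is used as the convergence check:

```python
import numpy as np

def perceptron_train(X, y, epochs=10):
    """Train a linear perceptron. X: (n, p) samples, y: labels in {-1, +1}.
    A minimal sketch of the pseudocode above (no intercept term)."""
    n, p = X.shape
    w = np.zeros(p)                            # initialize w to an all-zero vector
    for _ in range(epochs):                    # set number of iterations ("epochs")
        mistakes = 0
        for x_i, y_i in zip(X, y):
            y_hat = 1 if w @ x_i >= 0 else -1  # sgn(w^T x), zero mapped to +1
            if y_hat != y_i:                   # error-driven update
                w += y_i * x_i
                mistakes += 1
        if mistakes == 0:                      # converged: a full pass with no errors
            break
    return w

# Toy linearly separable data (illustrative)
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron_train(X, y)
```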
Kernel machines
By contrast with the linear models learned by the perceptron, a kernel machine[2] is a classifier that stores a subset of its training examples xᵢ, associates with each a weight αᵢ, and makes decisions for new samples x' by evaluating

ŷ = sgn(Σᵢ αᵢ yᵢ K(xᵢ, x'))

Here, K is some kernel function. Formally, a kernel function is a symmetric, positive-semidefinite function (see Mercer's condition); intuitively, it can be thought of as a similarity function between samples. Each function x' ↦ K(xᵢ, x') serves as a basis function in the classification.
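As a sketch, the decision function can be evaluated directly from stored examples; the Gaussian (RBF) kernel below is one common choice of K, chosen here for illustration, and the function names are our own:

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel: one common positive-semidefinite kernel.
    Larger values mean the samples are more similar."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_machine_predict(x_new, stored_X, stored_y, alpha, kernel=rbf_kernel):
    """Evaluate sgn(sum_i alpha_i y_i K(x_i, x')) over the stored examples."""
    score = sum(a_i * y_i * kernel(x_i, x_new)
                for a_i, y_i, x_i in zip(alpha, stored_y, stored_X))
    return 1 if score >= 0 else -1
```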
Algorithm
The kernel perceptron algorithm is a perceptron variant that learns kernel machines. In pseudocode, it can be formulated as follows:[3]
- Initialize α to an all-zeros vector of length n, the number of training samples.
- For some fixed number of iterations, or until the model makes no mistakes on the training set:
- For each training example xᵢ, yᵢ:
- Let ŷ = sgn(Σⱼ αⱼ yⱼ K(xⱼ, xᵢ)).
- If ŷ ≠ yᵢ, perform an update:
- αᵢ := αᵢ + 1
When the linear kernel K(x, x') = x ⋅ x' is used, this algorithm is a dual formulation of the ordinary perceptron algorithm.
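The dual algorithm above can be sketched as follows; the function names and toy data are our own. With the linear kernel, the implicit weight vector w = Σᵢ αᵢ yᵢ xᵢ recovers what the ordinary perceptron would learn:

```python
import numpy as np

def kernel_perceptron_train(X, y, kernel, epochs=10):
    """Kernel perceptron in dual form. X: (n, p) samples, y: labels in {-1, +1}.
    A minimal sketch of the pseudocode above."""
    n = X.shape[0]
    # Precompute the Gram matrix K[i, j] = K(x_i, x_j)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    alpha = np.zeros(n)                       # one mistake counter per training sample
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            score = np.sum(alpha * y * K[:, i])   # sum_j alpha_j y_j K(x_j, x_i)
            y_hat = 1 if score >= 0 else -1       # sgn, zero mapped to +1
            if y_hat != y[i]:
                alpha[i] += 1                     # alpha_i := alpha_i + 1
                mistakes += 1
        if mistakes == 0:                         # no mistakes on the training set
            break
    return alpha

# With the linear kernel, this is the dual of the ordinary perceptron
linear = lambda a, b: a @ b
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
alpha = kernel_perceptron_train(X, y, linear)
w_equiv = (alpha * y) @ X    # implicit primal weights: w = sum_i alpha_i y_i x_i
```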