Polynomial kernel
In machine learning, the polynomial kernel is a kernel function commonly used with support vector machines (SVMs) and other kernelized models that represents the similarity of vectors (training samples) in a feature space over polynomials of the original variables.
Definition
For degree-$d$ polynomials, the polynomial kernel is defined as[1]

$$K(x, y) = (x^\mathsf{T} y + c)^d$$
where $x$ and $y$ are vectors in the input space, i.e. vectors of features computed from training or test samples, and $c \geq 0$ is a constant trading off the influence of higher-order versus lower-order terms in the polynomial. (A further generalized polykernel divides $x^\mathsf{T} y$ by a user-specified scalar parameter $a$.[2]) For the case of the quadratic kernel ($d = 2$, with $x, y \in \mathbb{R}^n$), we have:

$$K(x, y) = \left(\sum_{i=1}^n x_i y_i + c\right)^2 = \sum_{i=1}^n \left(x_i^2\right)\left(y_i^2\right) + \sum_{i=2}^n \sum_{j=1}^{i-1} \left(\sqrt{2}\, x_i x_j\right)\left(\sqrt{2}\, y_i y_j\right) + \sum_{i=1}^n \left(\sqrt{2c}\, x_i\right)\left(\sqrt{2c}\, y_i\right) + c^2$$
From this we see that $K(x, y) = \langle \varphi(x), \varphi(y) \rangle$ is the inner product in a feature space induced by the mapping

$$\varphi(x) = \left\langle x_n^2, \ldots, x_1^2, \sqrt{2}\, x_n x_{n-1}, \ldots, \sqrt{2}\, x_n x_1, \sqrt{2}\, x_{n-1} x_{n-2}, \ldots, \sqrt{2}\, x_2 x_1, \sqrt{2c}\, x_n, \ldots, \sqrt{2c}\, x_1, c \right\rangle$$
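As a quick sanity check on this identity, the short Python sketch below (an illustration, not taken from the cited sources; the example vectors and helper names are made up) evaluates the quadratic kernel directly and via the explicit feature map $\varphi$, and confirms the two agree:

```python
import numpy as np

def poly_kernel(x, y, c=1.0, d=2):
    """Polynomial kernel K(x, y) = (x.T y + c)^d."""
    return (x @ y + c) ** d

def quadratic_feature_map(x, c=1.0):
    """Explicit feature map phi for d = 2: squared features,
    sqrt(2)-scaled pairwise products, sqrt(2c)-scaled originals, and c."""
    n = len(x)
    squares = [x[i] ** 2 for i in range(n)]
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(n) for j in range(i)]
    linear = [np.sqrt(2 * c) * x[i] for i in range(n)]
    return np.array(squares + cross + linear + [c])

x = np.array([1.0, 2.0, 3.0])   # hypothetical input vectors
y = np.array([0.5, -1.0, 2.0])

direct = poly_kernel(x, y, c=1.0, d=2)
via_map = quadratic_feature_map(x) @ quadratic_feature_map(y)
print(direct, via_map)          # both print 30.25
```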
When the input features are binary-valued (booleans) and $c = 0$, the mapped features correspond to logical conjunctions of input features.[3] (The $\sqrt{2}$ factors in the expansion are immaterial for this.[4])
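To make the conjunction reading concrete, here is a tiny illustration with made-up boolean vectors (not from the cited papers): with $c = 0$ and $d = 2$, each coordinate of $\varphi(x)$ is, up to a constant factor, a product $x_i x_j$, which for boolean inputs is the conjunction $x_i \wedge x_j$.

```python
import numpy as np

# Hypothetical boolean feature vectors (e.g. word-presence indicators).
x = np.array([1, 0, 1, 1])
y = np.array([1, 1, 1, 0])

# With c = 0 and d = 2, each mapped feature is (up to a constant factor)
# a product x_i * x_j, i.e. the conjunction x_i AND x_j for boolean inputs.
pairs = [(i, j) for i in range(len(x)) for j in range(i + 1)]
shared = [(i, j) for i, j in pairs if x[i] & x[j] & y[i] & y[j]]
print(shared)          # conjunctions active in both vectors: [(0, 0), (2, 0), (2, 2)]
print((x @ y) ** 2)    # kernel value 4: a weighted count of those shared conjunctions
```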
Practical use
Although the RBF kernel is more popular in SVM classification than the polynomial kernel, the latter is quite popular in natural language processing (NLP).[3][4] The most common degree is d=2, since larger degrees tend to overfit on NLP problems.
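As a minimal sketch of this typical setup, assuming scikit-learn and a made-up stand-in dataset: scikit-learn's 'poly' kernel computes $(\gamma \langle x, x' \rangle + \text{coef0})^{\text{degree}}$, so degree=2 with coef0=1 and gamma=1 corresponds to the common quadratic kernel with $c = 1$.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data standing in for an NLP feature matrix (hypothetical).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Quadratic polynomial kernel: (gamma * <x, x'> + coef0)^degree with degree = 2.
clf = SVC(kernel="poly", degree=2, coef0=1.0, gamma=1.0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```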
Various ways of computing the polynomial kernel (both exact and approximate) have been devised as alternatives to the usual non-linear SVM training algorithms, including:
- full expansion of the kernel prior to training/testing with a linear SVM,[4] i.e. full computation of the mapping $\varphi$ (see the sketch after this list)
- basket mining (using a variant of the apriori algorithm) for the most commonly occurring feature conjunctions in a training set to produce an approximate expansion[5]
- inverted indexing of support vectors[5][3]
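As a hedged illustration of the first approach (full expansion followed by a linear SVM), the sketch below uses scikit-learn's PolynomialFeatures to materialize the degree-2 mapping explicitly and then trains LinearSVC on the expanded features; this mirrors the idea described in [4], not the exact implementation or data from that paper.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC

# Hypothetical stand-in data; in NLP this would be a (sparse) feature matrix.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Materialize the degree-2 feature map explicitly, then train a linear SVM
# on the expanded features instead of using a kernelized solver.
model = make_pipeline(PolynomialFeatures(degree=2), LinearSVC(max_iter=10000))
model.fit(X, y)
print(model.score(X, y))
```

Note that PolynomialFeatures omits the $\sqrt{2}$ scaling factors from the expansion above; this changes only the relative weighting of the features, not the set of products the linear model can use.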
One problem with the polynomial kernel is that it may suffer from numerical instability: when $x^\mathsf{T} y + c < 1$, $K(x, y) = (x^\mathsf{T} y + c)^d$ tends to zero as $d$ is increased, whereas when $x^\mathsf{T} y + c > 1$, $K(x, y)$ tends to infinity.[2]
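A quick numerical illustration of this behaviour (the base values below are made up, not taken from [2]):

```python
# The same base raised to increasing degrees either collapses toward zero or
# blows up, depending on whether x.T y + c is below or above 1.
for base in (0.8, 1.3):          # stand-ins for x.T y + c
    for d in (2, 10, 50):
        print(base, d, base ** d)
# 0.8 ** 50 is about 1.4e-5, while 1.3 ** 50 is about 5.0e5.
```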
References
- [1] http://www.cs.tufts.edu/~roni/Teaching/CLT/LN/lecture18.pdf
- [2] Chih-Jen Lin (2012). Machine learning software: design and practical use. Talk at Machine Learning Summer School, Kyoto.
- [3] Yoav Goldberg and Michael Elhadad (2008). splitSVM: Fast, Space-Efficient, non-Heuristic, Polynomial Kernel Computation for NLP Applications. Proc. ACL-08: HLT.
- [4] Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard and Chih-Jen Lin (2010). Training and testing low-degree polynomial data mappings via linear SVM. J. Machine Learning Research 11:1471–1490.
- [5] T. Kudo and Y. Matsumoto (2003). Fast methods for kernel-based text analysis. Proc. ACL 2003.