String kernel

In machine learning and data mining, a string kernel is a mathematical tool for large-scale data analysis in which sequence data are to be clustered or classified, most notably in text and gene analysis.

Motivation

Since several well-proven data clustering, classification and information retrieval methods (for example, support vector machines) are designed to work on vectors (that is, the data are elements of a vector space), a string kernel makes it possible to extend these methods to sequence data.

The string kernel method contrasts with earlier approaches to text classification, in which feature vectors only indicated the presence or absence of a word. It is one example of a whole class of kernels adapted to particular data structures.

Definition

A kernel on a domain $D$ is a function $K:D\times D\rightarrow \mathbb {R}$ satisfying some conditions (being symmetric in the arguments, continuous and positive semidefinite in a certain sense).

Mercer's theorem asserts that $K$ can then be expressed as $K(x,y)=\varphi (x)\cdot \varphi (y)$ with $\varphi$ mapping the arguments into an inner product space.
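The relation $K(x,y)=\varphi (x)\cdot \varphi (y)$ can be illustrated with a deliberately simple toy kernel. The sketch below assumes a feature map that merely counts characters; this map and the names `phi`, `kernel`, `gram` and `quadratic_form` are hypothetical and chosen only for illustration, and they are not the string subsequence kernel defined later. Because the kernel is an inner product of feature vectors, its Gram matrix is symmetric and every quadratic form $z^{T}Gz$ is non-negative.

```python
from collections import Counter


def phi(x):
    """Toy feature map (an assumption for illustration): character counts of x."""
    return Counter(x)


def kernel(x, y):
    """K(x, y) = phi(x) . phi(y), the inner product of the two count vectors."""
    fx, fy = phi(x), phi(y)
    # A Counter returns 0 for characters that do not occur in y.
    return float(sum(count * fy[ch] for ch, count in fx.items()))


def gram(items):
    """Gram matrix G[i][j] = K(items[i], items[j])."""
    return [[kernel(a, b) for b in items] for a in items]


def quadratic_form(G, z):
    """Compute z^T G z; this is non-negative whenever G is positive semidefinite."""
    m = len(z)
    return sum(z[i] * G[i][j] * z[j] for i in range(m) for j in range(m))
```

For instance, `kernel("cat", "car")` counts the shared characters `c` and `a` once each and returns `2.0`.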

The string subsequence kernel on strings over an alphabet $\Sigma$ can now be defined. The feature space $\mathbb {R} ^{\Sigma ^{n}}$ has one coordinate for each string $u\in \Sigma ^{n}$ of length $n$, and the mapping $\varphi =(\varphi _{u})_{u\in \Sigma ^{n}}$ is defined coordinate-wise as follows:

 $\varphi _{u}:\left\{{\begin{array}{l}\Sigma ^{*}\rightarrow \mathbb {R} \\s\mapsto \sum _{\mathbf {i} :u=s_{\mathbf {i} }}\lambda ^{l(\mathbf {i} )}\end{array}}\right.$

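The $\mathbf {i}$ are multi-indices $\mathbf {i} =(i_{1},\dots ,i_{|u|})$ with $i_{1}<\dots <i_{|u|}$, $s_{\mathbf {i} }$ is the subsequence of $s$ that they select, and $l(\mathbf {i} )=i_{|u|}-i_{1}+1$ is the length of the stretch of $s$ that the occurrence spans. Subsequences can therefore occur in a non-contiguous manner, but gaps are penalized. The parameter $\lambda$ is chosen between $0$ and $1$: for values close to $0$, occurrences with gaps contribute negligibly compared with contiguous ones (in the limit, gaps are effectively not allowed), whereas for $\lambda =1$ even widely spread "occurrences" are weighted the same as appearances as a contiguous substring.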
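The coordinate-wise definition can be evaluated directly by enumerating all multi-indices. The following is a minimal brute-force sketch, assuming that $l(\mathbf {i} )$ is the span from the first to the last selected position (as in [1]); the function names `feature_map` and `subsequence_kernel_naive` are hypothetical. It is meant only to make the definition concrete, since the number of multi-indices grows combinatorially with the string length.

```python
from collections import defaultdict
from itertools import combinations


def feature_map(s, n, lam):
    """Return {u: phi_u(s)} for every length-n subsequence u occurring in s."""
    if n < 1:
        raise ValueError("n must be at least 1")
    features = defaultdict(float)
    # combinations() yields strictly increasing index tuples: the multi-indices i.
    for idx in combinations(range(len(s)), n):
        u = "".join(s[i] for i in idx)
        span = idx[-1] - idx[0] + 1  # l(i): stretch of s covered by the occurrence
        features[u] += lam ** span
    return dict(features)


def subsequence_kernel_naive(s, t, n, lam):
    """K_n(s, t) as the inner product of the two explicit feature vectors."""
    fs = feature_map(s, n, lam)
    ft = feature_map(t, n, lam)
    # Only subsequences present in both strings contribute to the sum.
    return sum((v * ft[u] for u, v in fs.items() if u in ft), 0.0)
```

For the strings `cat` and `car` with $n=2$, the only shared subsequence is `ca`, which is contiguous in both strings, so the kernel value is $\lambda ^{2}\cdot \lambda ^{2}=\lambda ^{4}$. The non-contiguous subsequence `ct` of `cat` has the feature value $\lambda ^{3}$.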

For several relevant algorithms, the data enter the algorithm only through inner products of feature vectors, hence the name kernel methods. A desirable consequence is that the transformation $\varphi (x)$ never needs to be calculated explicitly; only the inner product is computed via the kernel, which may be much quicker, especially when approximated [1].
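As an illustration of computing the inner product without forming $\varphi$ explicitly, the sketch below evaluates the subsequence kernel with a dynamic-programming recursion of the kind described in [1]. It should be read as an unoptimized sketch under stated assumptions, not as a reference implementation: the function name `subsequence_kernel` and the table `kp` are hypothetical, `kp[i][a][b]` stores an auxiliary quantity for the prefixes `s[:a]` and `t[:b]`, and the running time as written is roughly proportional to $n\,|s|\,|t|^{2}$.

```python
def subsequence_kernel(s, t, n, lam):
    """Subsequence kernel K_n(s, t) by dynamic programming, without explicit features."""
    if n < 1:
        raise ValueError("n must be at least 1")
    ls, lt = len(s), len(t)
    # kp[i][a][b] is the auxiliary kernel K'_i(s[:a], t[:b]); K'_0 is 1 everywhere,
    # and K'_i is 0 whenever one of the prefixes is shorter than i.
    kp = [[[1.0 if i == 0 else 0.0 for _ in range(lt + 1)]
           for _ in range(ls + 1)] for i in range(n)]
    for i in range(1, n):
        for a in range(i, ls + 1):
            x = s[a - 1]  # last character of the prefix s[:a]
            for b in range(i, lt + 1):
                acc = lam * kp[i][a - 1][b]
                for j in range(1, b + 1):
                    if t[j - 1] == x:
                        # Match x with t_j; pay lam for every position up to the end of t[:b].
                        acc += kp[i - 1][a - 1][j - 1] * lam ** (b - j + 2)
                kp[i][a][b] = acc
    k = 0.0
    for a in range(1, ls + 1):
        x = s[a - 1]
        for j in range(1, lt + 1):
            if t[j - 1] == x:
                # The final matched pair of characters contributes lam ** 2.
                k += kp[n - 1][a - 1][j - 1] * lam ** 2
    return k
```

With $n=2$ and $\lambda =0.5$, the call `subsequence_kernel("cat", "car", 2, 0.5)` gives $\lambda ^{4}=0.0625$, which agrees with evaluating the definition by hand.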

References

[1] Lodhi, Huma; Saunders, Craig; Shawe-Taylor, John; Cristianini, Nello; Watkins, Chris (2002), "Text classification using string kernels", The Journal of Machine Learning Research, MIT Press: 444
