Structured support vector machine

The structured support vector machine is a machine learning algorithm that generalizes the Support Vector Machine (SVM) classifier. Whereas the SVM classifier supports binary classification, multiclass classification and regression, the structured SVM allows training of a classifier for general structured output labels.

As an example, a sample instance might be a natural language sentence, and the output label an annotated parse tree. Training a classifier consists of showing it pairs of samples and their correct output labels. After training, the structured SVM model allows one to predict the corresponding output label for new sample instances; that is, given a natural language sentence, the classifier can produce the most likely parse tree.

For a set of $\ell$ training instances $({\boldsymbol {x}}_{n},y_{n})\in {\mathcal {X}}\times {\mathcal {Y}}$, $n=1,\dots ,\ell$, from a sample space ${\mathcal {X}}$ and label space ${\mathcal {Y}}$, the structured SVM minimizes the following regularized risk function:

{\underset {\boldsymbol {w}}{\min }}\quad \|{\boldsymbol {w}}\|^{2}+C\sum _{n=1}^{\ell }{\underset {y\in {\mathcal {Y}}}{\max }}\left(\Delta (y_{n},y)+{\boldsymbol {w}}'\Psi ({\boldsymbol {x}}_{n},y)-{\boldsymbol {w}}'\Psi ({\boldsymbol {x}}_{n},y_{n})\right)

The function is convex in ${\boldsymbol {w}}$ because, for each fixed $y$, the argument of the maximum is affine in ${\boldsymbol {w}}$, and the maximum of a set of affine functions is convex. The function $\Delta (y_{n},y)$ measures a distance in label space and is an arbitrary function (not necessarily a metric) satisfying $\Delta (y,z)\geq 0$, $\Delta (y,z)=\Delta (z,y)$ and $\Delta (y,y)=0$ for all $y,z\in {\mathcal {Y}}$.
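The regularized risk can be evaluated directly when the label space is small enough to enumerate. The following is a minimal sketch under assumptions that are not part of the text above: the labels are the integers `0, ..., num_labels - 1`, `joint_feature` is a hypothetical multiclass choice of $\Psi$ that copies the sample into the block belonging to the label, and `zero_one_loss` is one possible choice of $\Delta$. For genuinely structured labels such as parse trees, the inner maximum cannot be enumerated like this and needs a problem-specific search.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def joint_feature(x, y, num_labels):
    """Hypothetical Psi(x, y): place x in the block of label y, zeros elsewhere."""
    d = len(x)
    psi = [0.0] * (num_labels * d)
    psi[y * d:(y + 1) * d] = x
    return psi


def zero_one_loss(y_true, y):
    """One possible Delta: 0 for the correct label, 1 otherwise."""
    return 0.0 if y_true == y else 1.0


def regularized_risk(w, samples, labels, num_labels, C):
    """Evaluate ||w||^2 + C * sum_n max_y (Delta + w'Psi(x_n, y) - w'Psi(x_n, y_n))."""
    risk = dot(w, w)
    for x, y_true in zip(samples, labels):
        score_true = dot(w, joint_feature(x, y_true, num_labels))
        # Enumerate every candidate label; feasible only for small label spaces.
        worst = max(
            zero_one_loss(y_true, y)
            + dot(w, joint_feature(x, y, num_labels))
            - score_true
            for y in range(num_labels)
        )
        risk += C * worst
    return risk
```

With `w` set to zero, every wrong label contributes a loss of 1 and no score difference, so each sample adds exactly `C` to the risk.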

Because the regularized risk function above is non-differentiable, it is often reformulated as a quadratic program by introducing one slack variable $\xi _{n}$ for each sample, each representing the value of the maximum. The standard structured SVM primal formulation is given as follows.

{\begin{array}{cl}{\underset {{\boldsymbol {w}},{\boldsymbol {\xi }}}{\min }}&\|{\boldsymbol {w}}\|^{2}+C\sum _{n=1}^{\ell }\xi _{n}\\{\textrm {s.t.}}&{\boldsymbol {w}}'\Psi ({\boldsymbol {x}}_{n},y_{n})-{\boldsymbol {w}}'\Psi ({\boldsymbol {x}}_{n},y)+\xi _{n}\geq \Delta (y_{n},y),\qquad n=1,\dots ,\ell ,\quad \forall y\in {\mathcal {Y}},\\&\xi _{n}\geq 0,\qquad n=1,\dots ,\ell .\end{array}}
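For a fixed ${\boldsymbol {w}}$, the smallest slack satisfying all constraints of sample $n$ is the largest value of $\Delta (y_{n},y)+{\boldsymbol {w}}'\Psi ({\boldsymbol {x}}_{n},y)-{\boldsymbol {w}}'\Psi ({\boldsymbol {x}}_{n},y_{n})$ over $y$, which is how each slack variable comes to represent the maximum in the risk function. Since the constraint for $y=y_{n}$ already reads $\xi _{n}\geq \Delta (y_{n},y_{n})=0$, this value is never negative. The sketch below illustrates this; it assumes, purely for illustration, that the scores ${\boldsymbol {w}}'\Psi ({\boldsymbol {x}}_{n},y)$ and losses $\Delta (y_{n},y)$ have been precomputed into nested lists indexed by sample and label, and the function names are hypothetical.

```python
def smallest_feasible_slacks(scores, losses, true_labels):
    """Return, per sample, the smallest xi_n satisfying every constraint.

    scores[n][y] is assumed to hold w'Psi(x_n, y) and losses[n][y] to hold
    Delta(y_n, y); both are illustrative precomputed inputs.
    """
    slacks = []
    for score_row, loss_row, y_true in zip(scores, losses, true_labels):
        violations = [
            loss + score - score_row[y_true]
            for score, loss in zip(score_row, loss_row)
        ]
        # The y = y_n entry is 0 whenever Delta(y_n, y_n) = 0; the explicit
        # max with 0 mirrors the separate constraint xi_n >= 0.
        slacks.append(max(0.0, max(violations)))
    return slacks


def is_feasible(scores, losses, true_labels, slacks):
    """Check w'Psi(x_n, y_n) - w'Psi(x_n, y) + xi_n >= Delta(y_n, y) and xi_n >= 0."""
    for score_row, loss_row, y_true, xi in zip(scores, losses, true_labels, slacks):
        if xi < 0:
            return False
        for score, loss in zip(score_row, loss_row):
            if score_row[y_true] - score + xi < loss:
                return False
    return True
```

Any slack smaller than the value returned by `smallest_feasible_slacks` violates the constraint of the label that attains the maximum, so minimizing over the slacks recovers the original risk function.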