Structured support vector machine


The structured support vector machine is a machine learning algorithm that generalizes the Support Vector Machine (SVM) classifier. Whereas the SVM classifier supports binary classification, multiclass classification and regression, the structured SVM allows training of a classifier for general structured output labels.

As an example, a sample instance might be a natural language sentence, and the output label an annotated parse tree. Training the classifier consists of showing it pairs of correct sample instances and output labels. After training, the structured SVM model allows one to predict the corresponding output label for new sample instances; that is, given a natural language sentence, the classifier can produce the most likely parse tree.
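In terms of the weight vector $w$ and feature function $\Psi$ introduced below, prediction for a new sample $x$ amounts to maximizing the learned compatibility score over the label space:

$$f(x) = \underset{y \in \mathcal{Y}}{\operatorname{argmax}} \; \langle w, \Psi(x, y) \rangle$$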

For a set of $n$ training instances $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, $i = 1, \dots, n$, from a sample space $\mathcal{X}$ and label space $\mathcal{Y}$, the structured SVM minimizes the following regularized risk function:

$$\min_{w} \; \|w\|^2 + C \sum_{i=1}^{n} \max_{y \in \mathcal{Y}} \left( \Delta(y_i, y) + \langle w, \Psi(x_i, y) \rangle - \langle w, \Psi(x_i, y_i) \rangle \right)$$

The function is convex in $w$ because the maximum of a set of affine functions is convex. The function $\Delta : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ measures a distance in label space and is an arbitrary function (not necessarily a metric) satisfying $\Delta(y, z) \geq 0$ and $\Delta(y, y) = 0$ for all $y, z \in \mathcal{Y}$. The function $\Psi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$ is a feature function, extracting a feature vector from a given sample and label. The design of this function depends very much on the application.
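As an illustration, the following sketch evaluates this regularized risk for a toy problem in which the label space is small enough to enumerate directly. The names psi and delta are hypothetical stand-ins for the application-specific feature function and label distance; a real application would replace the brute-force maximization with a problem-specific procedure.

    import numpy as np

    def structured_risk(w, samples, labels, label_space, psi, delta, C=1.0):
        # ||w||^2 plus C times the sum of per-sample maxima.
        risk = np.dot(w, w)
        for x_i, y_i in zip(samples, labels):
            score_true = np.dot(w, psi(x_i, y_i))
            # The maximum is always >= 0, because y = y_i contributes
            # delta(y_i, y_i) + score_true - score_true = 0.
            risk += C * max(delta(y_i, y) + np.dot(w, psi(x_i, y)) - score_true
                            for y in label_space)
        return risk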

Because the regularized risk function above is non-differentiable, it is often reformulated as a quadratic program by introducing one slack variable $\xi_i$ for each sample, each representing the value of the maximum. The standard structured SVM primal formulation is given as follows:

$$\begin{align} \underset{w, \xi}{\min} \quad & \|w\|^2 + C \sum_{i=1}^{n} \xi_i \\ \text{s.t.} \quad & \langle w, \Psi(x_i, y_i) \rangle - \langle w, \Psi(x_i, y) \rangle \geq \Delta(y_i, y) - \xi_i, \qquad i = 1, \dots, n, \;\; \forall y \in \mathcal{Y} \end{align}$$
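For label spaces small enough to enumerate, this primal can be handed to an off-the-shelf solver. The following sketch is one way to do so, assuming the convex-optimization package cvxpy and the same hypothetical psi and delta helpers as above; it instantiates every constraint explicitly and is therefore only practical for tiny label spaces.

    import cvxpy as cp

    def solve_primal(samples, labels, label_space, psi, delta, C=1.0):
        d = len(psi(samples[0], labels[0]))
        n = len(samples)
        w = cp.Variable(d)
        xi = cp.Variable(n, nonneg=True)   # one slack variable per sample
        constraints = [
            w @ (psi(x_i, y_i) - psi(x_i, y)) >= delta(y_i, y) - xi[i]
            for i, (x_i, y_i) in enumerate(zip(samples, labels))
            for y in label_space
        ]
        objective = cp.Minimize(cp.sum_squares(w) + C * cp.sum(xi))
        cp.Problem(objective, constraints).solve()
        return w.value, xi.value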

Inference problem

The above quadratic program involves a very large, possibly infinite number of linear inequality constraints. In general, the number of inequalities is too large to be optimized over explicitly. Instead the problem is solved by delayed constraint generation, in which only a finite and small subset of the constraints is used. Optimizing over a subset of the constraints enlarges the feasible set and yields a solution that provides a lower bound on the objective. To test whether the solution violates constraints of the complete set of inequalities, a separation problem needs to be solved. Because the inequalities decompose over the samples, the following problem needs to be solved for each sample:

$$y_i^* = \underset{y \in \mathcal{Y}}{\operatorname{argmax}} \left( \Delta(y_i, y) + \langle w, \Psi(x_i, y) \rangle - \langle w, \Psi(x_i, y_i) \rangle - \xi_i \right)$$

The right hand side objective to be maximized is composed of the constant $-\langle w, \Psi(x_i, y_i) \rangle - \xi_i$ and a term that depends on the variable optimized over, namely $\Delta(y_i, y) + \langle w, \Psi(x_i, y) \rangle$. If the achieved right hand side objective is smaller than or equal to zero, no violated constraints exist for this sample. If it is strictly larger than zero, the most violated constraint with respect to this sample has been identified. The quadratic program is enlarged by this constraint and re-solved. The process continues until no violated inequalities can be identified.
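Putting these pieces together yields a cutting-plane training loop. The sketch below is a minimal illustration under the same assumptions as before (a label space small enough to enumerate in the separation step, the hypothetical psi and delta helpers, and cvxpy for the restricted quadratic program), not a production implementation.

    import cvxpy as cp
    import numpy as np

    def cutting_plane_train(samples, labels, label_space, psi, delta,
                            C=1.0, tol=1e-4, max_iter=100):
        n = len(samples)
        d = len(psi(samples[0], labels[0]))
        # Working set of constraints per sample (labels assumed hashable).
        working = [set() for _ in range(n)]
        w, xi = np.zeros(d), np.zeros(n)
        for _ in range(max_iter):
            found_violation = False
            for i, (x_i, y_i) in enumerate(zip(samples, labels)):
                # Separation problem: most violated constraint for sample i,
                # here solved by brute-force enumeration of the label space.
                y_star = max(label_space,
                             key=lambda y: delta(y_i, y) + np.dot(w, psi(x_i, y)))
                violation = (delta(y_i, y_star) + np.dot(w, psi(x_i, y_star))
                             - np.dot(w, psi(x_i, y_i)) - xi[i])
                if violation > tol:       # strictly positive: constraint violated
                    working[i].add(y_star)
                    found_violation = True
            if not found_violation:
                break                     # no violated inequalities remain
            # Re-solve the quadratic program over the working set only.
            w_var = cp.Variable(d)
            xi_var = cp.Variable(n, nonneg=True)
            cons = [w_var @ (psi(samples[i], labels[i]) - psi(samples[i], y))
                    >= delta(labels[i], y) - xi_var[i]
                    for i in range(n) for y in working[i]]
            cp.Problem(cp.Minimize(cp.sum_squares(w_var) + C * cp.sum(xi_var)),
                       cons).solve()
            w, xi = w_var.value, xi_var.value
        return w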