Regularization perspectives on support vector machines

Introduction

Regularization perspectives on support vector machines (SVMs) interpret SVMs as a special case of Tikhonov regularization, specifically Tikhonov regularization with the hinge loss as the loss function. This provides a theoretical framework for analyzing SVM algorithms and comparing them with other algorithms that share the same goal: to generalize without overfitting.

SVMs were first proposed in 1995 by Corinna Cortes and Vladimir Vapnik, who framed them geometrically as a method for finding hyperplanes that separate multidimensional data into two categories.^[1] This traditional geometric interpretation provides useful intuition about how SVMs work, but it is difficult to relate to other machine learning techniques for avoiding overfitting, such as regularization, early stopping, sparsity and Bayesian inference. It has since been accepted that SVMs are also a special case of Tikhonov regularization.^[2]^[3]^[4] Different forms of Tikhonov regularization use different loss functions, and SVMs correspond to the hinge loss. This view links SVMs to a broader class of algorithms and to the theory describing them. It has enabled detailed comparisons between SVMs and other forms of Tikhonov regularization, and has given theoretical grounding for why the hinge loss is a beneficial choice.^[5]

Theoretical background

In the statistical learning theory framework, an algorithm is a strategy for choosing a function $f:\mathbf {X} \to \mathbf {Y}$ given a training set $S=\{(x_{1},y_{1}),\ldots ,(x_{n},y_{n})\}$ of inputs and their labels (the labels are usually $\pm 1$ ). Regularization strategies avoid overfitting by choosing a function that fits the data, but is not too complex. Specifically:

$f={\text{arg}}\min _{f\in {\mathcal {H}}}\left\{{\frac {1}{n}}\sum _{i=1}^{n}V(y_{i},f(x_{i}))+\lambda ||f||_{\mathcal {H}}^{2}\right\}$ ,

where ${\mathcal {H}}$ is a hypothesis space^[6] of functions, $V:\mathbf {Y} \times \mathbf {Y} \to \mathbb {R}$ is the loss function, $||\cdot ||_{\mathcal {H}}$ is a norm on the hypothesis space of functions, and $\lambda \in \mathbb {R}$ is the regularization parameter^[7] .

When ${\mathcal {H}}$ is a reproducing kernel Hilbert space, there exists a kernel function $K:\mathbf {X} \times \mathbf {X} \to \mathbb {R}$ . Evaluated on the training inputs, it gives an $n\times n$ symmetric positive semi-definite matrix $\mathbf {K}$ with entries $\mathbf {K} _{ij}=K(x_{i},x_{j})$ . By the representer theorem^[8], the minimizer can be written as $f(x)=\sum _{j=1}^{n}c_{j}K(x,x_{j})$ , so that $f(x_{i})=\sum _{j=1}^{n}c_{j}\mathbf {K} _{ij}$ , and $||f||_{\mathcal {H}}^{2}=\langle f,f\rangle _{\mathcal {H}}=\sum _{i=1}^{n}\sum _{j=1}^{n}c_{i}c_{j}K(x_{i},x_{j})=c^{T}\mathbf {K} c$ .
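The two identities from the representer theorem can be illustrated numerically. The following is a minimal sketch, not part of the cited derivations: the Gaussian kernel, its bandwidth `sigma`, the random inputs and the coefficient vector `c` are all assumptions chosen only for illustration.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """Return the n x n matrix K with K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives caused by round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))  # five illustrative inputs in R^2 (assumed data)
c = rng.normal(size=5)       # arbitrary expansion coefficients (assumed)

K = gaussian_kernel_matrix(X)
f_train = K @ c              # f(x_i) = sum_j c_j K_ij, for every training point at once
norm_sq = c @ K @ c          # ||f||_H^2 = c^T K c
```

Working with `K @ c` and `c @ K @ c` means the function $f$ never has to be represented explicitly; only the $n$ coefficients and the kernel matrix are needed.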

Special properties of the hinge loss

The simplest and most intuitive loss function for categorization is the misclassification loss, or 0-1 loss, which is 0 if $f(x_{i})=y_{i}$ and 1 if $f(x_{i})\neq y_{i}$ , i.e. the Heaviside step function on $-y_{i}f(x_{i})$ . However, this loss function is not convex, which makes the regularization problem very difficult to minimize computationally. Therefore, we look for convex substitutes for the 0-1 loss. The hinge loss, $V(y_{i},f(x_{i}))=(1-y_{i}f(x_{i}))_{+}$ where $(s)_{+}=\max(s,0)$ , provides such a convex relaxation. In fact, the hinge loss is the tightest convex upper bound to the 0-1 misclassification loss function^[9], and with infinite data returns the Bayes optimal solution:^[10] ^[11]

$f_{b}(x)=\left\{{\begin{matrix}1&p(1|x)>p(-1|x)\\-1&p(1|x)<p(-1|x)\end{matrix}}\right.$
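The claim that the hinge loss is a convex upper bound on the 0-1 loss can be checked on a grid of margins $m=y_{i}f(x_{i})$ . This is a small sketch under one stated assumption: a margin of exactly 0 is counted as a misclassification, a convention the text above does not fix.

```python
import numpy as np

def zero_one_loss(margin):
    """0-1 loss as a function of the margin m = y * f(x): 1 when m <= 0, else 0.
    Treating m == 0 as an error is an assumed convention."""
    return (margin <= 0).astype(float)

def hinge_loss(margin):
    """Hinge loss (1 - m)_+ = max(1 - m, 0)."""
    return np.maximum(1.0 - margin, 0.0)

margins = np.linspace(-2.0, 2.0, 9)  # -2.0, -1.5, ..., 2.0
hinge = hinge_loss(margins)
zero_one = zero_one_loss(margins)

# The hinge loss is never below the 0-1 loss on this grid.
upper_bound_holds = bool(np.all(hinge >= zero_one))
```

Unlike the 0-1 loss, the hinge loss keeps penalizing correctly classified points whose margin is below 1, which is what later produces the margin conditions on the support vectors.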

Derivation^[12]

With the hinge loss, $V(y_{i},f(x_{i}))=(1-y_{i}f(x_{i}))_{+}$ where $(s)_{+}=\max(s,0)$ , the regularization problem becomes:

$f={\text{arg}}\min _{f\in {\mathcal {H}}}\left\{{\frac {1}{n}}\sum _{i=1}^{n}(1-y_{i}f(x_{i}))_{+}+\lambda ||f||_{\mathcal {H}}^{2}\right\}$ .

In most of the SVM literature, this is written equivalently $\left({\text{take }}C={\frac {1}{2\lambda n}}\right)$ as:

$f={\text{arg}}\min _{f\in {\mathcal {H}}}\left\{C\sum _{i=1}^{n}(1-y_{i}f(x_{i}))_{+}+{\frac {1}{2}}||f||_{\mathcal {H}}^{2}\right\}$ .
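The two formulations are equivalent because, with $C={\frac {1}{2\lambda n}}$ , the second objective is the first multiplied by the positive constant ${\frac {1}{2\lambda }}$ , so both have the same minimizer. A minimal numerical sketch of this rescaling follows; the linear kernel, the toy data and the function names are assumptions made for illustration.

```python
import numpy as np

def hinge(margins):
    """Hinge loss (1 - m)_+ applied elementwise."""
    return np.maximum(1.0 - margins, 0.0)

def objective_lambda(c, K, y, lam):
    """(1/n) * sum_i hinge(y_i f(x_i)) + lam * c^T K c, with f(x_i) = (K c)_i."""
    return hinge(y * (K @ c)).mean() + lam * (c @ K @ c)

def objective_C(c, K, y, C):
    """C * sum_i hinge(y_i f(x_i)) + (1/2) * c^T K c."""
    return C * hinge(y * (K @ c)).sum() + 0.5 * (c @ K @ c)

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))                       # assumed toy inputs
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])   # assumed labels
K = X @ X.T                                       # linear kernel (assumption)

lam = 0.1
C = 1.0 / (2.0 * lam * len(y))

c = rng.normal(size=6)
lhs = objective_C(c, K, y, C)
rhs = objective_lambda(c, K, y, lam) / (2.0 * lam)  # same value up to round-off
```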

This problem is non-differentiable because of the "kink" in the loss function. However, we can rewrite it using slack variables $\xi _{i}$ :

$f={\text{arg}}\min _{f\in {\mathcal {H}}}\left\{C\sum _{i=1}^{n}\xi _{i}+{\frac {1}{2}}||f||_{\mathcal {H}}^{2}\right\}$ subject to: ${\begin{aligned}\xi _{i}\geq 1-y_{i}f(x_{i}):\ \ \ &i=1,\ldots ,n\\\xi _{i}\geq 0:\ \ \ &i=1,\ldots ,n\end{aligned}}$
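The slack formulation reproduces the hinge loss because, for a fixed $f$ , the objective is minimized by making each $\xi _{i}$ as small as the two constraints allow, which gives $\xi _{i}=\max(1-y_{i}f(x_{i}),0)$ . A short sketch with made-up values of $f(x_{i})$ (the numbers and the helper name `minimal_slack` are illustrative assumptions):

```python
import numpy as np

def minimal_slack(y, f_values):
    """Smallest xi with xi_i >= 1 - y_i f(x_i) and xi_i >= 0, elementwise."""
    return np.maximum(1.0 - y * f_values, 0.0)

y = np.array([1.0, 1.0, -1.0, -1.0])
f_values = np.array([2.0, 0.5, -0.25, 1.0])  # illustrative values of f(x_i)

xi = minimal_slack(y, f_values)
# Margins y_i f(x_i) are 2, 0.5, 0.25, -1, so xi is 0, 0.5, 0.75, 2.
```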

Next we apply the representer theorem to get:

$\min _{c\in \mathbb {R} ^{n},\,\xi \in \mathbb {R} ^{n}}\left\{C\sum _{i=1}^{n}\xi _{i}+{\frac {1}{2}}c^{T}\mathbf {K} c\right\}$ subject to: ${\begin{aligned}\xi _{i}\geq 1-y_{i}\sum _{j=1}^{n}c_{j}K(x_{i},x_{j}):\ \ \ &i=1,\ldots ,n\\\xi _{i}\geq 0:\ \ \ &i=1,\ldots ,n\end{aligned}}$

This is a constrained optimization problem, which we will solve using the Lagrangian to derive the dual problem. The Lagrangian is:

$L(c,\xi ,\alpha ,\zeta )=C\sum _{i=1}^{n}\xi _{i}+{\frac {1}{2}}c^{T}\mathbf {K} c-\sum _{i=1}^{n}\alpha _{i}\left(y_{i}\left\{\sum _{j=1}^{n}c_{j}K(x_{i},x_{j})\right\}-1+\xi _{i}\right)-\sum _{i=1}^{n}\zeta _{i}\xi _{i}$

The dual problem is:

${\text{arg}}\max _{\alpha ,\zeta \geq 0}\inf _{c,\xi }L(c,\xi ,\alpha ,\zeta )$

Minimizing $L$ with respect to $c_{i}$ : ${\frac {\partial L}{\partial c_{i}}}=0\Rightarrow c_{i}=\alpha _{i}y_{i}$

Minimizing $L$ with respect to $\xi _{i}$ : ${\frac {\partial L}{\partial \xi _{i}}}=0\Rightarrow C-\alpha _{i}-\zeta _{i}=0\Rightarrow 0\leq \alpha _{i}\leq C$

Then, plugging $\zeta _{i}=C-\alpha _{i}$ into the Lagrangian, the terms containing $\xi _{i}$ cancel, and we can write the dual problem as:

${\text{arg}}\max _{0\leq \alpha _{i}\leq C}\inf _{c}L(c,\alpha )$ , where $L(c,\alpha )={\frac {1}{2}}c^{T}\mathbf {K} c+\sum _{i=1}^{n}\alpha _{i}\left(1-y_{i}\sum _{j=1}^{n}K(x_{i},x_{j})c_{j}\right)$

Then, plugging in $c_{i}=\alpha _{i}y_{i}$ , we get:

${\text{arg}}\max _{\alpha \in \mathbb {R} ^{n}}L(\alpha )={\text{arg}}\max _{\alpha \in \mathbb {R} ^{n}}\left\{\sum _{i=1}^{n}\alpha _{i}-{\frac {1}{2}}\sum _{i,j=1}^{n}\alpha _{i}y_{i}K(x_{i},x_{j})\alpha _{j}y_{j}\right\}={\text{arg}}\max _{\alpha \in \mathbb {R} ^{n}}\left\{\sum _{i=1}^{n}\alpha _{i}-{\frac {1}{2}}\alpha ^{T}\,{\text{diag}}(\mathbf {Y} )\,\mathbf {K} \,{\text{diag}}(\mathbf {Y} )\,\alpha \right\}$

subject to $0\leq \alpha _{i}\leq C,\ \ \ i=1,\ldots ,n$ .

Note that this dual problem is easier to solve than the original problem because it is box constrained (the $\alpha _{i}$ are bounded). Also notice that the slack variables have disappeared in the dual problem.
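Because the only constraints are the box bounds $0\leq \alpha _{i}\leq C$ , the dual can be attacked with very simple methods. The sketch below uses projected gradient ascent: take a gradient step, then clip each $\alpha _{i}$ back into $[0,C]$ . This is one possible solver chosen for illustration, not the method used in the cited sources or in production SVM packages. The toy data, the linear kernel, the step-size rule and the function names are all assumptions.

```python
import numpy as np

def dual_objective(alpha, K, y):
    """sum_i alpha_i - (1/2) alpha^T diag(y) K diag(y) alpha."""
    Q = (y[:, None] * K) * y[None, :]
    return alpha.sum() - 0.5 * alpha @ Q @ alpha

def solve_svm_dual(K, y, C, n_iter=2000):
    """Projected gradient ascent on the box-constrained dual (illustrative solver)."""
    Q = (y[:, None] * K) * y[None, :]
    # Step size 1/L, with L the largest eigenvalue of Q (guarded against zero).
    step = 1.0 / max(np.linalg.eigvalsh(Q).max(), 1e-12)
    alpha = np.zeros(len(y))
    for _ in range(n_iter):
        grad = 1.0 - Q @ alpha                            # gradient of the dual objective
        alpha = np.clip(alpha + step * grad, 0.0, C)      # projection onto the box
    return alpha

# Assumed toy data: two linearly separable clusters, linear kernel, no bias term.
X = np.array([[2.0, 2.0], [2.0, 3.0], [3.0, 2.0],
              [-2.0, -2.0], [-2.0, -3.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
K = X @ X.T
C = 1.0

alpha = solve_svm_dual(K, y, C)
f_train = K @ (alpha * y)   # f(x_i) = sum_j y_j alpha_j K(x_i, x_j)
```

In this example only the two points closest to the other class end up with non-zero $\alpha _{i}$ , and those points satisfy $y_{i}f(x_{i})=1$ up to numerical tolerance.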

Consequences and interpretations^[13]

The Karush-Kuhn-Tucker conditions dictate that all optimal solutions must satisfy the following conditions for $i=1,\ldots ,n$ :

$\sum _{j=1}^{n}c_{j}K(x_{i},x_{j})-\sum _{j=1}^{n}y_{j}\alpha _{j}K(x_{i},x_{j})=0$

$C-\alpha _{i}-\zeta _{i}=0$

$y_{i}\left(\sum _{j=1}^{n}y_{j}\alpha _{j}K(x_{i},x_{j})\right)-1+\xi _{i}\geq 0$

$\alpha _{i}\left[y_{i}\left(\sum _{j=1}^{n}y_{j}\alpha _{j}K(x_{i},x_{j})\right)-1+\xi _{i}\right]=0$

$\zeta _{i}\xi _{i}=0$

$\xi _{i},\alpha _{i},\zeta _{i}\geq 0$

From these constraints, and recalling that $f(x)=\sum _{i=1}^{n}y_{i}\alpha _{i}K(x,x_{i})$ , we can derive conditions relating the $\alpha _{i}$ to $y_{i}f(x_{i})$ :

${\begin{aligned}y_{i}f(x_{i})>1&\Rightarrow (1-y_{i}f(x_{i}))<0\\&\Rightarrow \xi _{i}\neq (1-y_{i}f(x_{i}))\\&\Rightarrow \alpha _{i}=0\end{aligned}}$

${\begin{aligned}y_{i}f(x_{i})<1&\Rightarrow (1-y_{i}f(x_{i}))>0\\&\Rightarrow \xi _{i}>0\\&\Rightarrow \zeta _{i}=0\\&\Rightarrow \alpha _{i}=C\end{aligned}}$

${\begin{aligned}\alpha _{i}=C&\Rightarrow \xi _{i}=1-y_{i}f(x_{i})\\&\Rightarrow y_{i}f(x_{i})\leq 1\end{aligned}}$

${\begin{aligned}\alpha _{i}=0&\Rightarrow C=\zeta _{i}\\&\Rightarrow \xi _{i}=0\\&\Rightarrow y_{i}f(x_{i})\geq 1\end{aligned}}$

${\begin{aligned}0<\alpha _{i}<C&\Rightarrow \zeta _{i}\neq 0\\&\Rightarrow \xi _{i}=0\\&\Rightarrow y_{i}f(x_{i})=1\end{aligned}}$

Note that the solution is relatively sparse, because $\alpha _{i}=0$ whenever $y_{i}f(x_{i})>1$ . In SVM, the input points with non-zero coefficients are called support vectors. Given the above constraints, every support vector satisfies $y_{i}f(x_{i})\leq 1$ , and every input point with $y_{i}f(x_{i})<1$ is a support vector with $\alpha _{i}=C$ .
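As an illustration of these conditions, consider a small worked example that is not taken from the cited sources: two one-dimensional points $x_{1}=1,\ y_{1}=1$ and $x_{2}=-1,\ y_{2}=-1$ with the linear kernel $K(x,x')=xx'$ (all of these choices are assumptions made for the example). Then $y_{i}y_{j}K(x_{i},x_{j})=1$ for all $i,j$ , and the dual objective reduces to

$L(\alpha )=(\alpha _{1}+\alpha _{2})-{\frac {1}{2}}(\alpha _{1}+\alpha _{2})^{2}$ , with $f(x)=\sum _{i=1}^{2}y_{i}\alpha _{i}K(x,x_{i})=(\alpha _{1}+\alpha _{2})x$ .

If $C=1$ , the maximum is attained whenever $\alpha _{1}+\alpha _{2}=1$ , for instance at $\alpha _{1}=\alpha _{2}={\frac {1}{2}}$ . Then $f(x)=x$ and $y_{i}f(x_{i})=1$ for both points, in agreement with the case $0<\alpha _{i}<C$ .

If $C={\frac {1}{4}}$ , the unconstrained maximizer is outside the box, so $\alpha _{1}=\alpha _{2}=C={\frac {1}{4}}$ . Then $f(x)={\frac {x}{2}}$ , $y_{i}f(x_{i})={\frac {1}{2}}<1$ and $\xi _{i}={\frac {1}{2}}>0$ , in agreement with the case $\alpha _{i}=C$ .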

Notes and References

[1] Cortes, Corinna; Vapnik, Vladimir (1995). "Support-Vector Networks". Machine Learning. 20: 273–297. doi:10.1007/BF00994018.
[2] Rosasco, Lorenzo. "Regularized Least-Squares and Support Vector Machines" (PDF).
[3] Rifkin, Ryan (2002). Everything Old is New Again: A Fresh Look at Historical Approaches in Machine Learning (PDF). MIT (PhD thesis).
[4] Lee, Yoonkyung (2012). "Multicategory Support Vector Machines". Journal of the American Statistical Association. 99 (465): 67–81. doi:10.1198/016214504000000098.
[5] Rosasco, Lorenzo (2004). "Are Loss Functions All the Same". Neural Computation. 16 (5): 1063–1076. doi:10.1162/089976604773135104.
[6] This hypothesis space of functions is a Hilbert space of all the functions that the algorithm is allowed to pick.
[7] For insight on choosing the parameter, see, e.g., Wahba, Grace (1990). "When is the optimal regularization parameter insensitive to the choice of the loss function". Communications in Statistics - Theory and Methods. 19 (5): 1685–1700. doi:10.1080/03610929008830285.
[8] See Schölkopf, Bernhard (2001). "A Generalized Representer Theorem". Computational Learning Theory: Lecture Notes in Computer Science. 2111: 416–426. doi:10.1007/3-540-44581-1_27.
[9] Lee, Yoonkyung (2012). "Multicategory Support Vector Machines". Journal of the American Statistical Association. 99 (465): 67–81. doi:10.1198/016214504000000098.
[10] Lin, Yi (2002). "Support Vector Machines and the Bayes Rule in Classification" (PDF). Data Mining and Knowledge Discovery. 6 (3): 259–275. doi:10.1023/A:1015469627679.
[11] Rosasco, Lorenzo (2004). "Are Loss Functions All the Same". Neural Computation. 16 (5): 1063–1076. doi:10.1162/089976604773135104.
[12] For a detailed derivation, see Rifkin, Ryan (2002). Everything Old is New Again: A Fresh Look at Historical Approaches in Machine Learning (PDF). MIT (PhD thesis).
[13] For more detail, see Rosasco, Lorenzo. "Regularized Least Squares and Support Vector Machines" (PDF).

Evgeniou, Theodoros (2000). "Regularization Networks and Support Vector Machines" (PDF). Advances in Computational Mathematics. 13 (1): 1–50. doi:10.1023/A:1018946025316.

Joachims, Thorsten. "SVMlight".

Vapnik, Vladimir (1999). The Nature of Statistical Learning Theory. New York: Springer-Verlag. ISBN 0-387-98780-0.
