Support vector machines (SVM), like regularized least squares, are a special case of Tikhonov regularization. In the case of SVM, the loss function is the hinge loss. [1] [2] [3] [4]
In the supervised learning framework, an algorithm is a strategy for choosing a function $f : X \to Y$ given a training set $S = \{(x_1, y_1), \dots, (x_n, y_n)\}$ of inputs $x_i$ and their labels $y_i$ (the labels are usually $\pm 1$). Regularization strategies avoid overfitting by choosing a function that fits the data, but is not too complex. Specifically:

$f = \operatorname{argmin}_{f \in \mathcal{H}} \left\{ \frac{1}{n} \sum_{i=1}^n V(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}}^2 \right\}$,

where $\mathcal{H}$ is a hypothesis space [5] of functions, $V : Y \times Y \to \mathbb{R}$ is the loss function, $\|\cdot\|_{\mathcal{H}}$ is a norm on the hypothesis space of functions, and $\lambda > 0$ is the regularization parameter [6].
When $\mathcal{H}$ is a reproducing kernel Hilbert space, there exists a kernel function $K : X \times X \to \mathbb{R}$ that can be written as an $n \times n$ symmetric positive-definite matrix $\mathbf{K}$, with entries $\mathbf{K}_{ij} = K(x_i, x_j)$. By the representer theorem [7],

$f(x) = \sum_{j=1}^n c_j K(x, x_j)$, and $\|f\|_{\mathcal{H}}^2 = \sum_{i=1}^n \sum_{j=1}^n c_i c_j K(x_i, x_j) = c^T \mathbf{K} c$.
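As a concrete illustration, the sketch below (using NumPy and a Gaussian/RBF kernel, chosen here purely as an example) builds the Gram matrix $\mathbf{K}$ and evaluates a representer-form function and its RKHS norm; the data and coefficients are arbitrary.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    """Gaussian (RBF) kernel: K(x, x') = exp(-gamma * ||x - x'||^2)."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))      # five training inputs x_i in R^2
K = rbf_kernel(X, X)             # n x n symmetric Gram matrix K_ij = K(x_i, x_j)

# Representer theorem: the minimizer has the form f(x) = sum_j c_j K(x, x_j),
# so on the training set f(x_i) = (K c)_i and ||f||_H^2 = c^T K c.
c = rng.normal(size=5)           # arbitrary illustrative coefficients
f_train = K @ c
norm_sq = c @ K @ c
```

Positive-definiteness of the kernel guarantees the norm $c^T \mathbf{K} c$ is nonnegative for any choice of coefficients.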
The simplest and most intuitive loss function for categorization is the misclassification loss, or 0-1 loss, which is 0 if $f(x_i) = y_i$ and 1 if $f(x_i) \neq y_i$, i.e. the Heaviside step function on $-y_i f(x_i)$. However, this loss function is not convex, which makes the regularization problem very difficult to minimize computationally. Therefore, we look for convex substitutes for the 0-1 loss. The hinge loss, $V(y_i, f(x_i)) = (1 - y_i f(x_i))_+$, where $(s)_+ = \max(s, 0)$, provides such a convex relaxation. In fact, the hinge loss is the tightest convex upper bound to the 0-1 misclassification loss function [8], and with infinite data returns the Bayes optimal solution:

$f_b(x) = \begin{cases} 1 & \text{if } p(y = 1 \mid x) > p(y = -1 \mid x), \\ -1 & \text{otherwise.} \end{cases}$ [9]
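The relationship between the two losses can be checked numerically; the short sketch below (NumPy, with illustrative margin values) implements both and verifies that the hinge loss upper-bounds the 0-1 loss.

```python
import numpy as np

def zero_one_loss(y, fx):
    """0-1 misclassification loss: the Heaviside step function on -y*f(x)."""
    return (y * fx <= 0).astype(float)

def hinge_loss(y, fx):
    """Hinge loss (1 - y*f(x))_+, a convex upper bound on the 0-1 loss."""
    return np.maximum(0.0, 1.0 - y * fx)

margins = np.linspace(-2, 2, 9)   # a range of values of y * f(x)
h = hinge_loss(1.0, margins)
z = zero_one_loss(1.0, margins)
assert np.all(h >= z)             # hinge upper-bounds the 0-1 loss everywhere
```

Note the hinge loss vanishes only once the margin $y f(x)$ exceeds 1, which is what will later produce sparsity in the solution.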
With the hinge loss, where $V(y_i, f(x_i)) = (1 - y_i f(x_i))_+$, the regularization problem becomes:

$f = \operatorname{argmin}_{f \in \mathcal{H}} \left\{ \frac{1}{n} \sum_{i=1}^n (1 - y_i f(x_i))_+ + \lambda \|f\|_{\mathcal{H}}^2 \right\}$.

In most of the SVM literature, this is written equivalently as:

$f = \operatorname{argmin}_{f \in \mathcal{H}} \left\{ C \sum_{i=1}^n (1 - y_i f(x_i))_+ + \frac{1}{2} \|f\|_{\mathcal{H}}^2 \right\}$, with $C = \frac{1}{2 \lambda n}$.
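The equivalence is just a rescaling: dividing the first objective by $2\lambda$ yields the second with $C = 1/(2\lambda n)$, so the two formulations share the same minimizer. A quick numerical check with arbitrary toy numbers:

```python
import numpy as np

# The two objectives differ by the constant factor 2*lam, so argmins coincide:
#   (1/n) sum hinge + lam * ||f||^2   vs   C * sum hinge + (1/2) * ||f||^2
n, lam = 4, 0.25
C = 1 / (2 * lam * n)

y = np.array([1.0, -1.0, 1.0, -1.0])
f = np.array([0.3, -1.2, 0.8, 0.4])        # arbitrary predictions f(x_i)
norm_sq = 1.7                               # arbitrary value of ||f||_H^2

hinge = np.maximum(0.0, 1.0 - y * f).sum()
obj1 = hinge / n + lam * norm_sq
obj2 = C * hinge + 0.5 * norm_sq
assert np.isclose(obj1, 2 * lam * obj2)     # objectives agree up to the factor 2*lam
```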
This problem is non-differentiable because of the "kink" in the loss function. However, we can rewrite it using slack variables $\xi_i$:

$\min_{f \in \mathcal{H},\, \xi \in \mathbb{R}^n} \frac{1}{n} \sum_{i=1}^n \xi_i + \lambda \|f\|_{\mathcal{H}}^2$

subject to:

$y_i f(x_i) \geq 1 - \xi_i$ and $\xi_i \geq 0$, for $i = 1, \dots, n$.
Next we apply the representer theorem, writing $f(x) = \sum_j c_j K(x, x_j)$, to get:

$\min_{c,\, \xi \in \mathbb{R}^n} \frac{1}{n} \sum_{i=1}^n \xi_i + \lambda c^T \mathbf{K} c$

subject to:

$y_i \sum_{j=1}^n c_j K(x_i, x_j) \geq 1 - \xi_i$ and $\xi_i \geq 0$, for $i = 1, \dots, n$.
This is a constrained optimization problem, which we will solve using the Lagrangian to derive the dual problem. Introducing multipliers $\alpha_i \geq 0$ for the margin constraints and $\zeta_i \geq 0$ for the constraints $\xi_i \geq 0$, the Lagrangian is:

$L(c, \xi, \alpha, \zeta) = \frac{1}{n} \sum_{i=1}^n \xi_i + \lambda c^T \mathbf{K} c - \sum_{i=1}^n \alpha_i \left( y_i \sum_{j=1}^n c_j K(x_i, x_j) - 1 + \xi_i \right) - \sum_{i=1}^n \zeta_i \xi_i$.

The dual problem is:

$\max_{\alpha \geq 0,\, \zeta \geq 0} \; \min_{c, \xi} L(c, \xi, \alpha, \zeta)$.

Minimizing with respect to $c$:

$\frac{\partial L}{\partial c} = 2 \lambda \mathbf{K} c - \mathbf{K} Y \alpha = 0 \implies c = \frac{Y \alpha}{2 \lambda}$, i.e. $c_i = \frac{y_i \alpha_i}{2 \lambda}$,

where $Y = \operatorname{diag}(y_1, \dots, y_n)$. Minimizing with respect to $\xi_i$:

$\frac{\partial L}{\partial \xi_i} = \frac{1}{n} - \alpha_i - \zeta_i = 0 \implies \zeta_i = \frac{1}{n} - \alpha_i \geq 0 \implies 0 \leq \alpha_i \leq \frac{1}{n}$.

Then, plugging $c = Y \alpha / (2 \lambda)$ into the Lagrangian, we can write the dual problem as:

$\max_{\alpha} \; \sum_{i=1}^n \alpha_i - \frac{1}{4 \lambda} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j K(x_i, x_j)$

subject to:

$0 \leq \alpha_i \leq \frac{1}{n}$, for $i = 1, \dots, n$.

Note that this dual problem is easier to solve than the original problem because it is box constrained (the $\alpha_i$ are bounded). Also notice that the slack variables $\xi_i$ have disappeared in the dual problem.
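As a minimal sketch of how such a box-constrained dual can be solved, the code below runs projected gradient ascent on $\max_\alpha \sum_i \alpha_i - \frac{1}{4\lambda} \alpha^T Q \alpha$ with $Q_{ij} = y_i y_j K(x_i, x_j)$ and $0 \leq \alpha_i \leq 1/n$, on synthetic two-blob data. Production SVM solvers use specialized methods such as SMO; the RBF kernel, $\lambda$, step size, and iteration count here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam = 40, 0.1
# Toy data: two Gaussian blobs labelled -1 and +1
X = np.vstack([rng.normal(-1.5, 1.0, (n // 2, 2)),
               rng.normal(+1.5, 1.0, (n // 2, 2))])
y = np.hstack([-np.ones(n // 2), np.ones(n // 2)])

K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # RBF Gram matrix
Q = np.outer(y, y) * K                                       # Q_ij = y_i y_j K(x_i, x_j)

# Projected gradient ascent on the dual:
#   max_a  sum_i a_i - (1/(4*lam)) a^T Q a   subject to  0 <= a_i <= 1/n
alpha = np.zeros(n)
step = 2 * lam / np.linalg.norm(Q, 2)   # inverse Lipschitz constant of the gradient
for _ in range(3000):
    grad = 1.0 - Q @ alpha / (2 * lam)
    alpha = np.clip(alpha + step * grad, 0.0, 1.0 / n)  # project onto the box

c = y * alpha / (2 * lam)               # recover primal coefficients c_i = y_i a_i / (2*lam)
train_acc = np.mean(np.sign(K @ c) == y)
```

The projection step is just coordinate-wise clipping, which is exactly what makes the box constraints convenient.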
The Karush-Kuhn-Tucker conditions dictate that all optimal solutions must satisfy the following conditions for $i = 1, \dots, n$:

$\alpha_i \left( y_i f(x_i) - 1 + \xi_i \right) = 0$ and $\zeta_i \xi_i = \left( \frac{1}{n} - \alpha_i \right) \xi_i = 0$.

From these above constraints, and recalling that $c_i = y_i \alpha_i / (2 \lambda)$, we can derive conditions relating the $\alpha_i$ to $y_i f(x_i)$ [11]:

$\alpha_i = 0 \implies y_i f(x_i) \geq 1$, $\quad 0 < \alpha_i < \frac{1}{n} \implies y_i f(x_i) = 1$, $\quad \alpha_i = \frac{1}{n} \implies y_i f(x_i) \leq 1$.

Note that the solution is relatively sparse, because $c_i = 0$ whenever $\alpha_i = 0$, which happens whenever $y_i f(x_i) > 1$. In SVM, the input points with non-zero coefficients $c_i$ are called support vectors. Given the above constraints, the support vectors are precisely the input points $x_i$ where $y_i f(x_i) \leq 1$.
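The sparsity pattern can be read off directly from a dual solution. The sketch below (a simple projected-gradient solve of the box-constrained dual on illustrative 1-D data; an assumption for demonstration, not a production solver) flags the support vectors $\alpha_i > 0$ and checks that the remaining points satisfy $y_i f(x_i) \geq 1$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam = 30, 0.05
X = np.sort(rng.uniform(-3, 3, n))
y = np.where(X > 0, 1.0, -1.0)               # 1-D threshold labels

K = np.exp(-(X[:, None] - X[None, :]) ** 2)  # RBF Gram matrix
Q = np.outer(y, y) * K

# Projected gradient ascent on the box-constrained dual (0 <= a_i <= 1/n)
alpha = np.zeros(n)
step = 2 * lam / np.linalg.norm(Q, 2)
for _ in range(5000):
    alpha = np.clip(alpha + step * (1 - Q @ alpha / (2 * lam)), 0.0, 1.0 / n)

c = y * alpha / (2 * lam)
margins = y * (K @ c)                        # y_i f(x_i) at the training points

sv = alpha > 1e-8                            # support vectors: nonzero coefficients
# By the KKT conditions, non-support vectors lie on or outside the margin
print(f"{sv.sum()} of {n} training points are support vectors")
```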