Support vector machines (SVM), like regularized least squares, are a special case of Tikhonov regularization. In the case of SVM, the loss function is the hinge loss. [1] [2] [3] [4]
In the supervised learning framework, an algorithm is a strategy for choosing a function $f : X \to Y$ given a training set $S = \{(x_1, y_1), \dots, (x_n, y_n)\}$ of inputs $x_i$ and their labels $y_i$ (the labels are usually $\pm 1$). Regularization strategies avoid overfitting by choosing a function that fits the data, but is not too complex. Specifically:

$f = \operatorname{argmin}_{f \in \mathcal{H}} \left\{ \frac{1}{n} \sum_{i=1}^n V(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}}^2 \right\}$,

where $\mathcal{H}$ is a hypothesis space [5] of functions, $V : Y \times Y \to \mathbb{R}$ is the loss function, $\|\cdot\|_{\mathcal{H}}$ is a norm on the hypothesis space of functions, and $\lambda > 0$ is the regularization parameter [6].
When $\mathcal{H}$ is a reproducing kernel Hilbert space, there exists a kernel function $K : X \times X \to \mathbb{R}$ that can be written as an $n \times n$ symmetric positive-definite matrix $\mathbf{K}$, with entries $\mathbf{K}_{ij} = K(x_i, x_j)$. By the representer theorem [7],

$f(x) = \sum_{j=1}^n c_j K(x, x_j)$, and $\|f\|_{\mathcal{H}}^2 = \sum_{i=1}^n \sum_{j=1}^n c_i c_j K(x_i, x_j) = c^T \mathbf{K} c$.
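As a concrete illustration, the sketch below (using NumPy and a Gaussian/RBF kernel, chosen here purely as an example) builds the Gram matrix $\mathbf{K}$ and evaluates a representer-form function and its RKHS norm; the data and coefficients are arbitrary.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    """Gaussian (RBF) kernel: K(x, x') = exp(-gamma * ||x - x'||^2)."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))      # five training inputs x_i in R^2
K = rbf_kernel(X, X)             # n x n symmetric Gram matrix K_ij = K(x_i, x_j)

# Representer theorem: the minimizer has the form f(x) = sum_j c_j K(x, x_j),
# so on the training set f(x_i) = (K c)_i and ||f||_H^2 = c^T K c.
c = rng.normal(size=5)           # arbitrary illustrative coefficients
f_train = K @ c
norm_sq = c @ K @ c
```

Positive-definiteness of the kernel guarantees the norm $c^T \mathbf{K} c$ is nonnegative for any choice of coefficients.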
The simplest and most intuitive loss function for categorization is the misclassification loss, or 0-1 loss, which is 0 if $f(x_i) = y_i$ and 1 if $f(x_i) \neq y_i$, i.e. the Heaviside step function on $-y_i f(x_i)$. However, this loss function is not convex, which makes the regularization problem very difficult to minimize computationally. Therefore, we look for convex substitutes for the 0-1 loss. The hinge loss, $V(y_i, f(x_i)) = (1 - y_i f(x_i))_+$, where $(s)_+ = \max(s, 0)$, provides such a convex relaxation. In fact, the hinge loss is the tightest convex upper bound to the 0-1 misclassification loss function [8], and with infinite data returns the Bayes optimal solution:

$f_b(x) = \begin{cases} 1 & \text{if } p(y = 1 \mid x) > p(y = -1 \mid x), \\ -1 & \text{otherwise.} \end{cases}$ [9]
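The relationship between the two losses can be checked numerically; the short sketch below (NumPy, with illustrative margin values) implements both and verifies that the hinge loss upper-bounds the 0-1 loss.

```python
import numpy as np

def zero_one_loss(y, fx):
    """0-1 misclassification loss: the Heaviside step function on -y*f(x)."""
    return (y * fx <= 0).astype(float)

def hinge_loss(y, fx):
    """Hinge loss (1 - y*f(x))_+, a convex upper bound on the 0-1 loss."""
    return np.maximum(0.0, 1.0 - y * fx)

margins = np.linspace(-2, 2, 9)   # a range of values of y * f(x)
h = hinge_loss(1.0, margins)
z = zero_one_loss(1.0, margins)
assert np.all(h >= z)             # hinge upper-bounds the 0-1 loss everywhere
```

Note the hinge loss vanishes only once the margin $y f(x)$ exceeds 1, which is what will later produce sparsity in the solution.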
With the hinge loss, where $V(y_i, f(x_i)) = (1 - y_i f(x_i))_+$, the regularization problem becomes:

$f = \operatorname{argmin}_{f \in \mathcal{H}} \left\{ \frac{1}{n} \sum_{i=1}^n (1 - y_i f(x_i))_+ + \lambda \|f\|_{\mathcal{H}}^2 \right\}$.

In most of the SVM literature, this is written equivalently as:

$f = \operatorname{argmin}_{f \in \mathcal{H}} \left\{ C \sum_{i=1}^n (1 - y_i f(x_i))_+ + \frac{1}{2} \|f\|_{\mathcal{H}}^2 \right\}$, with $C = \frac{1}{2 \lambda n}$.
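The equivalence is just a rescaling: dividing the first objective by $2\lambda$ yields the second with $C = 1/(2\lambda n)$, so the two formulations share the same minimizer. A quick numerical check with arbitrary toy numbers:

```python
import numpy as np

# The two objectives differ by the constant factor 2*lam, so argmins coincide:
#   (1/n) sum hinge + lam * ||f||^2   vs   C * sum hinge + (1/2) * ||f||^2
n, lam = 4, 0.25
C = 1 / (2 * lam * n)

y = np.array([1.0, -1.0, 1.0, -1.0])
f = np.array([0.3, -1.2, 0.8, 0.4])        # arbitrary predictions f(x_i)
norm_sq = 1.7                               # arbitrary value of ||f||_H^2

hinge = np.maximum(0.0, 1.0 - y * f).sum()
obj1 = hinge / n + lam * norm_sq
obj2 = C * hinge + 0.5 * norm_sq
assert np.isclose(obj1, 2 * lam * obj2)     # objectives agree up to the factor 2*lam
```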
This problem is non-differentiable because of the "kink" in the loss function. However, we can rewrite it using slack variables $\xi_i$:

$\min_{f \in \mathcal{H},\, \xi \in \mathbb{R}^n} \frac{1}{n} \sum_{i=1}^n \xi_i + \lambda \|f\|_{\mathcal{H}}^2$

subject to:

$y_i f(x_i) \geq 1 - \xi_i$ and $\xi_i \geq 0$, for $i = 1, \dots, n$.
Next we apply the representer theorem, writing $f(x) = \sum_j c_j K(x, x_j)$, to get:

$\min_{c,\, \xi \in \mathbb{R}^n} \frac{1}{n} \sum_{i=1}^n \xi_i + \lambda c^T \mathbf{K} c$

subject to:

$y_i \sum_{j=1}^n c_j K(x_i, x_j) \geq 1 - \xi_i$ and $\xi_i \geq 0$, for $i = 1, \dots, n$.
This is a constrained optimization problem, which we will solve using the Lagrangian to derive the dual problem. Introducing multipliers $\alpha_i \geq 0$ for the margin constraints and $\zeta_i \geq 0$ for the constraints $\xi_i \geq 0$, the Lagrangian is:

$L(c, \xi, \alpha, \zeta) = \frac{1}{n} \sum_{i=1}^n \xi_i + \lambda c^T \mathbf{K} c - \sum_{i=1}^n \alpha_i \left( y_i \sum_{j=1}^n c_j K(x_i, x_j) - 1 + \xi_i \right) - \sum_{i=1}^n \zeta_i \xi_i$.

The dual problem is:

$\max_{\alpha \geq 0,\, \zeta \geq 0} \; \min_{c, \xi} L(c, \xi, \alpha, \zeta)$.

Minimizing with respect to $c$:

$\frac{\partial L}{\partial c} = 2 \lambda \mathbf{K} c - \mathbf{K} Y \alpha = 0 \implies c = \frac{Y \alpha}{2 \lambda}$, i.e. $c_i = \frac{y_i \alpha_i}{2 \lambda}$,

where $Y = \operatorname{diag}(y_1, \dots, y_n)$. Minimizing with respect to $\xi_i$:

$\frac{\partial L}{\partial \xi_i} = \frac{1}{n} - \alpha_i - \zeta_i = 0 \implies \zeta_i = \frac{1}{n} - \alpha_i \geq 0 \implies 0 \leq \alpha_i \leq \frac{1}{n}$.

Then, plugging $c = Y \alpha / (2 \lambda)$ into the Lagrangian, we can write the dual problem as:

$\max_{\alpha} \; \sum_{i=1}^n \alpha_i - \frac{1}{4 \lambda} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j K(x_i, x_j)$

subject to:

$0 \leq \alpha_i \leq \frac{1}{n}$, for $i = 1, \dots, n$.

Note that this dual problem is easier to solve than the original problem because it is box constrained (the $\alpha_i$ are bounded). Also notice that the slack variables $\xi_i$ have disappeared in the dual problem.
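As a minimal sketch of how such a box-constrained dual can be solved, the code below runs projected gradient ascent on $\max_\alpha \sum_i \alpha_i - \frac{1}{4\lambda} \alpha^T Q \alpha$ with $Q_{ij} = y_i y_j K(x_i, x_j)$ and $0 \leq \alpha_i \leq 1/n$, on synthetic two-blob data. Production SVM solvers use specialized methods such as SMO; the RBF kernel, $\lambda$, step size, and iteration count here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam = 40, 0.1
# Toy data: two Gaussian blobs labelled -1 and +1
X = np.vstack([rng.normal(-1.5, 1.0, (n // 2, 2)),
               rng.normal(+1.5, 1.0, (n // 2, 2))])
y = np.hstack([-np.ones(n // 2), np.ones(n // 2)])

K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # RBF Gram matrix
Q = np.outer(y, y) * K                                       # Q_ij = y_i y_j K(x_i, x_j)

# Projected gradient ascent on the dual:
#   max_a  sum_i a_i - (1/(4*lam)) a^T Q a   subject to  0 <= a_i <= 1/n
alpha = np.zeros(n)
step = 2 * lam / np.linalg.norm(Q, 2)   # inverse Lipschitz constant of the gradient
for _ in range(3000):
    grad = 1.0 - Q @ alpha / (2 * lam)
    alpha = np.clip(alpha + step * grad, 0.0, 1.0 / n)  # project onto the box

c = y * alpha / (2 * lam)               # recover primal coefficients c_i = y_i a_i / (2*lam)
train_acc = np.mean(np.sign(K @ c) == y)
```

The projection step is just coordinate-wise clipping, which is exactly what makes the box constraints convenient.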
The Karush-Kuhn-Tucker conditions dictate that all optimal solutions must satisfy the following conditions for $i = 1, \dots, n$:

$\alpha_i \left( y_i f(x_i) - 1 + \xi_i \right) = 0$ and $\zeta_i \xi_i = \left( \frac{1}{n} - \alpha_i \right) \xi_i = 0$.

From these above constraints, and recalling that $c_i = y_i \alpha_i / (2 \lambda)$, we can derive conditions relating the $\alpha_i$ to $y_i f(x_i)$ [11]:

$\alpha_i = 0 \implies y_i f(x_i) \geq 1$, $\quad 0 < \alpha_i < \frac{1}{n} \implies y_i f(x_i) = 1$, $\quad \alpha_i = \frac{1}{n} \implies y_i f(x_i) \leq 1$.

Note that the solution is relatively sparse, because $c_i = 0$ whenever $\alpha_i = 0$, which happens whenever $y_i f(x_i) > 1$. In SVM, the input points with non-zero coefficients $c_i$ are called support vectors. Given the above constraints, the support vectors are precisely the input points $x_i$ where $y_i f(x_i) \leq 1$.
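The sparsity pattern can be read off directly from a dual solution. The sketch below (a simple projected-gradient solve of the box-constrained dual on illustrative 1-D data; an assumption for demonstration, not a production solver) flags the support vectors $\alpha_i > 0$ and checks that the remaining points satisfy $y_i f(x_i) \geq 1$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, lam = 30, 0.05
X = np.sort(rng.uniform(-3, 3, n))
y = np.where(X > 0, 1.0, -1.0)               # 1-D threshold labels

K = np.exp(-(X[:, None] - X[None, :]) ** 2)  # RBF Gram matrix
Q = np.outer(y, y) * K

# Projected gradient ascent on the box-constrained dual (0 <= a_i <= 1/n)
alpha = np.zeros(n)
step = 2 * lam / np.linalg.norm(Q, 2)
for _ in range(5000):
    alpha = np.clip(alpha + step * (1 - Q @ alpha / (2 * lam)), 0.0, 1.0 / n)

c = y * alpha / (2 * lam)
margins = y * (K @ c)                        # y_i f(x_i) at the training points

sv = alpha > 1e-8                            # support vectors: nonzero coefficients
# By the KKT conditions, non-support vectors lie on or outside the margin
print(f"{sv.sum()} of {n} training points are support vectors")
```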