Part of course:

Support Vector Machine (SVM)

- Problem Formulation
- Soft margin:
- Training an SVM
- Multiclass SVM
- Kernel SVM
- Applications


Support Vector Machine (SVM)

In supervised learning, a Support Vector Machine is a linear classifier that aims to find a **separating hyperplane** between positive and negative instances. When the data is linearly separable, many hyperplanes can perform well on the binary classification task. The SVM algorithm, however, seeks the hyperplane with the **largest margin**: the largest distance to the nearest sample points. Hence it is also known as the **maximum-margin classifier**. The closest data points are called **support vectors**.

The following figure illustrates this. Here the dataset lies in a two-dimensional space, so the hyperplane is a one-dimensional line. Although many lines (in light blue) separate all instances correctly, there is only one optimal hyperplane (red line) that maximizes the distance to the closest points (in yellow).

New *unseen* samples are categorized based on the side of the hyperplane on which they fall, as explained in the next section.

In SVM, the decision function for a data point *x* corresponds to the equation of a hyperplane:

f_w(x) = w^T x + w_0

The sign of its output, *ŷ = sign(f_w(x))* ∈ {−1, +1}, decides the predicted label.
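The decision rule above can be sketched in a few lines of plain Python. The weights, bias, and inputs below are made-up values for illustration, not learned parameters.

```python
# Minimal sketch of the SVM decision function f_w(x) = w^T x + w_0
# and the sign-based prediction rule.

def decision_function(w, w0, x):
    """Signed score of x under the hyperplane (w, w0)."""
    return sum(wj * xj for wj, xj in zip(w, x)) + w0

def predict(w, w0, x):
    """Label in {-1, +1} from the sign of the score."""
    return 1 if decision_function(w, w0, x) >= 0 else -1

w, w0 = [2.0, -1.0], 0.5           # hypothetical parameters
print(predict(w, w0, [1.0, 1.0]))  # score = 2 - 1 + 0.5 = 1.5 -> +1
print(predict(w, w0, [0.0, 2.0]))  # score = -2 + 0.5 = -1.5 -> -1
```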

Although linearly separable datasets are convenient, in practice they are rare. Learning a hyperplane while allowing some instances to fall on the wrong side is known as the **soft-margin problem**. Let us recall the general form of an empirical risk minimization problem for the optimization criterion:

\min_w L(w) = \min_w \sum_{i=1}^{n_{train}} l(\hat{y}_i(x_i), y_i) + \lambda \Omega(w)

Besides a **loss function** *l* over training instances {(*x_i*, *y_i*)}, an extra term is needed to distinguish among models that fit the data equally well and prefer the more likely ones. This is the role of the **regularizer** Ω(*w*), weighted by *λ*.

The soft-margin SVM uses the __hinge loss__:

l(\hat{y}_i(x_i), y_i) = \max(0, 1-y_i [w^T x_i+w_0])

and the squared __L2-norm__ (squared Euclidean norm) as regularizer:

\Omega(w)= ||w||_2^2 = w_1^2 + w_2^2 + ... + w_d^2
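Putting the two terms together, the soft-margin objective can be computed directly. The following is a small sketch in plain Python; the toy dataset, parameters, and λ are made-up values.

```python
# Soft-margin objective: summed hinge loss over the training set
# plus lambda times the squared L2-norm of w.

def hinge_loss(w, w0, x, y):
    score = sum(wj * xj for wj, xj in zip(w, x)) + w0
    return max(0.0, 1.0 - y * score)

def objective(w, w0, data, lam):
    loss = sum(hinge_loss(w, w0, x, y) for x, y in data)
    reg = sum(wj * wj for wj in w)   # ||w||_2^2
    return loss + lam * reg

data = [([1.0, 1.0], +1), ([0.0, 2.0], -1)]  # toy training set
w, w0 = [2.0, -1.0], 0.5
# Both points have margin 1.5 >= 1, so the hinge losses are zero and
# only the regularizer contributes: 0.1 * (4 + 1) = 0.5
print(objective(w, w0, data, lam=0.1))
```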

Let us imagine an optimal scenario where the empirical loss (the sum of the training losses) is zero. Then, for each training instance *(x_i, y_i)*, its individual loss must be zero as well:

\begin{aligned}
&\max(0, 1-y_i [w^T x_i+w_0]) = 0 \\
&\Leftrightarrow y_i [w^T x_i+w_0] \geqslant 1 \\
&\Leftrightarrow y_i \dfrac{w^T x_i + w_0}{||w||_2} \geqslant \dfrac{1}{||w||_2} \\
&\Leftrightarrow proj_w(x_i) \begin{cases}
\geqslant \dfrac{1}{||w||_2} \ if\ y_i=+1 \\
\leqslant \dfrac{-1}{||w||_2} \ if\ y_i=-1
\end{cases}
\end{aligned}

Recall that the function *proj_w(x_i)* calculates the signed distance of point *x_i* to the hyperplane. Hence, when the empirical loss is zero, every training point lies at a distance of at least 1/||*w*||₂ from the hyperplane, on its correct side.

Furthermore, minimizing the regularizer ||w||_{2}^{2} is equivalent to __maximizing the margin__:

\min_w ||w||_2^2 \;
\Leftrightarrow \; \max_w \dfrac{1}{||w||_2}

And hence the desired **max-margin** property.

Finally, by evaluating the decision function of the optimal hyperplane and taking its sign, new instances *x* are labeled:

x \mapsto sign(w^T x + w_0)

__Hard margin__:

However, when the data is indeed linearly separable, the optimization criterion reduces to maximizing the margin subject to the constraint that no sample falls inside it. That is:

\begin{aligned}
&\max_{w} \dfrac{1}{||w||_2},\ subject\ to\ y_i (w^T x_i + w_0) \geqslant 1 \\
&\Leftrightarrow\min_{w} ||w||_2,\ subject\ to\ 1 - y_i (w^T x_i + w_0) \leqslant 0
\end{aligned}

This can be solved using quadratic programming. In fact, given a sufficiently small value for *λ*, the soft-margin formulation yields the hard-margin classifier when the data is linearly separable.

Above, we saw the decision boundary an SVM aims to find (the max-margin classifier), and the cost function that achieves this objective. To train an SVM, we could apply gradient descent (more precisely, subgradient descent, since the hinge loss is not differentiable everywhere) to the loss above.

In practice, however, that is not what is typically used. The standard method for training an SVM is mathematically involved, requiring quadratic programming (and the dual formulation of the optimization problem). We will not discuss this method in this tutorial. Note that kernel SVMs (described below) are also trained via this dual formulation.
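For intuition, the gradient-descent route mentioned above can still be sketched. The following is a minimal subgradient-descent sketch of the soft-margin objective in plain Python (not the quadratic-programming solvers used in practice); the toy dataset, learning rate, and λ are made-up values.

```python
# Subgradient descent on: sum of hinge losses + lam * ||w||_2^2.
import random

def train_linear_svm(data, lam=0.01, lr=0.1, epochs=200, seed=0):
    """data: list of (x, y) pairs with y in {-1, +1}."""
    rng = random.Random(seed)
    d = len(data[0][0])
    w, w0 = [0.0] * d, 0.0
    for _ in range(epochs):
        for x, y in rng.sample(data, len(data)):   # shuffled pass
            score = sum(wj * xj for wj, xj in zip(w, x)) + w0
            if y * score < 1:   # inside the margin or misclassified
                w = [wj + lr * (y * xj - 2 * lam * wj)
                     for wj, xj in zip(w, x)]
                w0 += lr * y
            else:               # zero hinge loss: only regularizer acts
                w = [wj - lr * 2 * lam * wj for wj in w]
    return w, w0

data = [([2.0, 2.0], +1), ([2.0, 3.0], +1),
        ([-1.0, -1.0], -1), ([-2.0, 0.0], -1)]     # toy separable set
w, w0 = train_linear_svm(data)
preds = [1 if sum(wj * xj for wj, xj in zip(w, x)) + w0 >= 0 else -1
         for x, _ in data]
print(preds)  # should match the labels [+1, +1, -1, -1]
```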

So far, only the binary classification model has been described. SVMs can be extended to multi-class datasets in two common ways:

- **One-versus-all:** train one classifier per label to distinguish that single class from the rest. Following the *winner-takes-all* strategy, new instances are assigned the label of the highest-scoring classifier.
- **One-versus-one:** train one classifier per pair of classes. New samples are predicted following the *max-wins voting* strategy: each classifier assigns the instance to one of its two classes, and that class's vote is increased by one; the instance finally receives the label with the most votes.
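The max-wins voting step of the one-versus-one scheme can be sketched as follows. The binary classifiers here are hypothetical stand-ins (simple threshold rules, not trained SVMs), just to show the voting mechanics.

```python
# Sketch of max-wins voting for one-versus-one multiclass SVM.
from itertools import combinations
from collections import Counter

classes = ["a", "b", "c"]

def make_pair_clf(c1, c2):
    # Hypothetical stand-in for a trained binary SVM: picks c1 when
    # the first feature is larger than the second, else c2.
    return lambda x: c1 if x[0] > x[1] else c2

pair_clfs = {(c1, c2): make_pair_clf(c1, c2)
             for c1, c2 in combinations(classes, 2)}

def predict_one_vs_one(x):
    # Each pairwise classifier casts one vote; the most-voted class wins.
    votes = Counter(clf(x) for clf in pair_clfs.values())
    return votes.most_common(1)[0][0]

print(predict_one_vs_one([2.0, 1.0]))  # every pair involving "a" votes "a"
```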

In addition, SVMs can efficiently perform non-linear classification by using kernel methods, where **input instances are mapped into a higher-dimensional feature space**. There, separation should be easier, allowing a linear classifier to be applied. The *kernel trick* states that, instead of explicitly computing the coordinates of each instance in the feature space, inner products between pairs of samples suffice to work implicitly in that high-dimensional space.

A suitable kernel function

k(x_i,x_j) = \phi(x_i) \cdot \phi(x_j)

has to be selected. Typical ones are:

- Polynomial:

(x_i \cdot x_j)^d

- Gaussian:

e^{-\gamma ||x_i - x_j ||^2}, \gamma > 0

- Hyperbolic tangent:

\tanh(\kappa x_i \cdot x_j + c), \kappa>0, c<0
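The three kernels above are straightforward to compute. This sketch uses plain Python; the parameter values (d, γ, κ, c) are illustrative.

```python
# Sketches of the polynomial, Gaussian, and hyperbolic tangent kernels.
import math

def dot(x, z):
    return sum(xi * zi for xi, zi in zip(x, z))

def polynomial_kernel(x, z, d=2):
    return dot(x, z) ** d

def gaussian_kernel(x, z, gamma=0.5):
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

def tanh_kernel(x, z, kappa=1.0, c=-1.0):
    return math.tanh(kappa * dot(x, z) + c)

x, z = [1.0, 2.0], [2.0, 0.0]
print(polynomial_kernel(x, z))  # (1*2 + 2*0)^2 = 4
print(gaussian_kernel(x, x))    # 1.0: identical points have kernel value 1
```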

The most widely used variant is the **SVM with Gaussian kernel**. In fact, it has consistently been shown to be among the best machine learning models for non-media (i.e., structured) datasets.


About the contributors:

- Marta Enesco, Data Scientist, Graduate Research Assistant at University of Potsdam (77%)
- Keshav Dhandhania, MSc in Deep Learning @ MIT (2014) (23%)


Copyright 2016-18, Compose Labs Inc. All rights reserved.