Support Vector Machine (SVM) Theory

June 20, 2018

In this tutorial, we’ll describe the mathematics behind Support Vector Machines. This tutorial is going to be much more math heavy as compared to other tutorials. If you’re mostly interested in applying SVMs to solve problems, then our first tutorial on SVMs is sufficient. However, if you would like to understand the mathematical basis of Support Vector Machines, then you’ll find this tutorial interesting.

In this tutorial, we will focus on the hard-margin SVM and soft-margin SVM. However, we will not be considering kernels or the hyper-parameter $\gamma$ (gamma).

Problem Formulation

In SVM, the decision function for predicting data points x corresponds to the equation of an hyperplane:

$$ f_w(x) = w^T x + w_0 $$

The sign of its output $\hat{y} = sign(f_w(x)) \in \{-1, +1\}$ will decide the classification into the corresponding label.

Each of the blue lines (separating hyperplanes) classify all data correctly, but the red line is the max-margin classifier. Image source: ResearchGate

Loss function

The general form of the loss function used for an SVM is similar to that for other machine learning problems. It is given as follows:

$$ L(w) = \sum_{i=1} max(0, 1-y_i [w^T x_i+w_0]) + \lambda ||w||_2^2 $$

Here, the first term measures the error due to misclassification (or data points being closer to the classification boundary than the margin), and is know as hinge-loss. “Hinge” describes the fact that the error is 0 if the data point is classified correctly (and is not too close to the decision boundary), and after that it keeps increasing. The second term is the regularization term, and $\lambda$ (lambda) is the regularization coefficient. Note that sometimes you may see C being used for regularization coefficient. In that case, $\lambda = \frac{1}{C}$.

Minimizing Loss is equivalent to Maximizing-Margin

Now, we’ll show how minimizing the above loss function is equivalent to finding the max-margin classifier.

First, we’ll derive the formula for the margin from the hinge-loss. If a data point is on the margin of the classifier, the hinge-loss is exactly zero. (That is, if it’s further away, hinge loss remains zero, but if its closer, the hinge loss is positive).

Hence, on the margin, we have,

$$ \begin{aligned} & \space max(0, 1-y_i [w^T x_i+w_0]) = 0 \\ &\Rightarrow y_i [w^T x_i+w_0] = 1 \end{aligned} $$

Note that $y_i$ is either +1 or -1, and the distance of $x_i$ from the line $w^Tx_i + w_0 = 0$ is given by

$$ d_w(x_i) = \Bigg| \frac{w^T x_i + w_0}{||w||_2}\Bigg| $$

Hence, we get,

$$ d_w(x_i) = \Bigg|\frac{w^T x_i + w_0}{||w||_2}\Bigg| = \Bigg|\frac{1}{y_i||w||_2}\Bigg| = \frac{1}{||w||_2} $$

That is, the margin is given by $1/||w||_2$.

Now, since the data is linearly separable (and none of the data points are within the margin), we know that the hinge-loss is zero. That is,

$$ \sum_{i=1} max(0, 1-y_i [w^T x_i+w_0]) = 0 $$

Hence, the loss function reduces to

$$ L(w) = \lambda ||w||_2^2 $$

Minimizing the loss function $L(w) = \lambda ||w||_2^2$ is the same as maximizing $1/||w||_2$, which is the margin.

Hence, we can see that minimizing the above loss function is equivalent to finding the max-margin classifier in the case where no points are allowed to be misclassified (or be within the margins).

Training an SVM

Given the above loss function, it is possible to train an SVM using the gradient descent optimization (assuming we’re not interested in supporting SVM kernels).

However, that is not the standard method used for training an SVM. Instead, the method used for training an SVM involves a field called linear programming (and dual formulation of a linear programming problem). If you’d like to read more about the dual formulation of the SVM, I recommend the Wikipedia article on it.