Support Vector Machine (SVM)

September 30, 2017

A Support Vector Machine (SVM) is a classification algorithm, typically used for binary classification. An SVM with gaussian kernel has been consistently shown to be one of the best machine learning models, achieving the highest accuracy in a large variety of datasets, especially datasets which do not involve images or audio.

Understanding the various hyper-parameters of an SVM are central to achieving high accuracy. In this tutorial, we’ll learn what an SVM is, and what is the purpose of its various hyper-parameters.

Max-margin classifier

Given a binary classification dataset, an SVM aims to find a separating hyperplane between positive and negative instances. For example, in the figure below, each of the blue lines is a plane which separates the positive and negative data points. All blue points lie on one side, and all red points lie on the other rise. New unseen samples are categorized based on the side of the hyperplane in which they fall.

Each of the blue lines (separating hyperplanes) classify all data correctly, but the red line is the max-margin classifier. Image source: ResearchGate

Although there are many possible hyperplanes that separate all data points correctly, the SVM algorithm seeks the hyperplane with the largest margin. That is, it maximizes the distance between the hyperplane and the nearest data point. Hence, it is known as a maximum-margin classifier. The closest data points are called support vectors.

In the illustration above, the red line is the maximum-margin classifier, and the data points marked in the yellow are the corresponding support vectors.

Intuition behind maximum-margin

The main reason for preferring the max-margin classifier over other hyperplanes is that intuitively, it is expected to perform better on test data / unseen data.

Consider the following example, if new data points were added to this dataset, we would expect classifier A to perform better than classifier B, since classifier B is very close to one of the green data points.

What if data is overlapping? Soft-margin

Although linearly separable datasets are nice, in practice those are difficult to find. For overlapping datasets, the SVM algorithm includes a regularization parameter (usually denoted as C), which controls how important it is to avoid misclassifying data points.

Large values of C imply that the SVM classifier will classify more points correctly, even if that means a small margin. Whereas small values of C imply that more data points might be misclassified, but the resulting classifier will have a larger margin with the rest of the data points.

Classifier A misclassifies 2 points and has a small margin. Classifier B misclassifies 3 points, but has a larger margin

In general, if the dataset has more outliers, we want to allow for more data points to be misclassified. In practice, we try many different values for the hyper-parameter C, and choose the one that performs best on the validation dataset.

What if optimal classifier is not linear? Kernel trick

One limitation of an SVM (even with soft-margin) is that the classification boundary is linear. However, in real-life datasets, the best classification boundary might not be a straight line. For example, consider the dataset below. Clearly, the best decision boundary in this case is a circle.

2D data where best decision boundary is a circle (non-linear). Image source: eric-kim.net

This problem can be solved by introducing new dimensions, or mapping the data to a higher dimension. For example, let’s define a third dimension z, where $z = x^2 + y^2$. Now, when we plot the data, we get the following plot:

2D data mapped to 3D

Now, it is possible to separate the data with a linear hyperplane.

Data projected to 3D and the linear hyperplane that separates it.

Finally, we can also look at what the decision boundary looks like if the project it back to our original 2D dataset.

Hyperplane projected back on to the 2D dataset

As we can see, the learnt decision boundary is a circle. In summary, projecting the data to a higher dimension allows us to learn classifiers which are non-linear with respect to the original data.

Kernel trick in practice

There are two more important things to know about SVM kernels:

SVM kernels provide a way to perform the transformation on the data implicitly. That is, we need not go through every data point and calculate the values for the new dimensions. For a restricted set of transformations, the SVM optimizer is able to implicitly work in such a high-dimensional space by looking at the dot products between pair of samples. This makes things a lot more computationally efficient when the number of higher dimensions is much much larger.
We have figured out a set of kernels which are extremely powerful and work very well on a large number of datasets. In particular, the polynomial kernel and gaussian kernel (also known as radial basis function) are the most popular ones.

Adjusting the number of support vectors: Gamma

Finally, SVM provides us with a third powerful hyper-parameter (apart from regularization parameter C and choice of kernel). This hyper-parameter, typically denoted by $\gamma$ (gamma), controls how far the influence of each data point reaches.

That is, high values of $\gamma$ imply a small influence and hence a small number of support vectors. Whereas low values of $\gamma$ imply a large influence and hence a large number of support vectors.

Classifier obtained with a large gamma value implies a small number of support vectors

Classifier obtained with a small gamma value implies a large number of support vectors

Multiclass SVM

So far, only the binary classification model was described. SVM can be extended to be used on a multi-class dataset. There are two popular options:

One-versus-rest: train one SVM classifier per class to distinguish one single class from the rest. For predictions, we follow the winner-takes-all strategy. That is, new instances will be categorized based on the highest scoring output.
One-versus-one: train one SVM classifier for each pair of classes. New samples are predicted following the max-wins voting strategy. That is, each classifier assigns the instance to one of the two classes, and the final prediction is given by the class with the most votes.

Putting things together with code

You can see the documentation for SVMs in the scikit-learn machine learning library here: sklearn.svm.SVC — scikit-learn documentation. The following is the list of arguments the function can take:

class sklearn.svm.SVC(C=1.0, kernel=’rbf’, degree=3, gamma=’auto’, 
    coef0=0.0, shrinking=True, probability=False, tol=0.001, 
    cache_size=200, class_weight=None, verbose=False, max_iter=-1, 
    decision_function_shape=’ovr’, random_state=None)

Out of the above arguments, the following are the most important ones (most have already been discussed in the tutorial):

Model hyper-parameters

C = regularization parameter = controls how important it is to avoid misclassifying data points
kernel = ‘rbf’ (radial basis function, i.e. Gaussian kernel) and ‘poly’ (polynomial kernel) are most popular
degree = only applicable for polynomial kernel.
gamma = controls the range of influence of each data point

Stopping criterion for training

tol = tolerance for stopping
max_iter = hard limit for number of iterations

Multiclass SVM:

decision_function_shape = ‘ovo’ (one-versus-one) or ‘ovr’ (one-versus-rest)

Sample interaction

Here’s a sample interaction for training the SVM

>>> # source: scikit-learn
>>> import numpy as np
>>> X = np.array([[-1, -1], [-2, -1], [1, 1], [2, 1]])
>>> y = np.array([1, 1, 2, 2])
>>> from sklearn.svm import SVC
>>> clf = SVC()
>>> clf.fit(X, y) 
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
>>> print(clf.predict([[-0.8, -1]]))
[1]

Applications

SVM with Gaussian kernel has been consistently shown to be one of the best machine learning models for a large variety of datasets (specially non-media datasets, where the input isn’t an image or an audio).

Theory

If you’d like to read about the theoretical background behind SVMs, check out our tutorial on Support Vector Machine (SVM) Theory, which is part of Machine Learning Algorithms Course.