
Support Vector Machines

February 26, 2019

Support Vector Machines enjoy really high popularity amongst machine learning practitioners, to the point of being ubiquitous: they are everywhere!

A Support Vector Machine, or SVM, is a supervised learning algorithm which can be thought of as an extension of the Perceptron model.

The big idea here is to maximize the decision margin, which is defined as the distance between the separating hyperplane (the decision boundary) and the training samples closest to the hyperplane on either side; these closest samples are the so-called support vectors.

When there are multiple hyperplanes that could act as the decision boundary (see figure), the SVM algorithm says that we should choose the one with the maximum margin. SVMs are sometimes referred to as Maximum-Margin classifiers because of this characteristic.

The rationale behind having decision boundaries with large margins is that they tend to have a lower generalization error (they are better at predicting outcome values for previously unseen data), whereas models with small margins are prone to overfitting.

The Mathematics of SVMs

The decision boundary is a hyperplane, which can be expressed as $w_{0} + w^{T}x = 0$, where $w$ is the weight vector, $w_{0}$ the bias, and $x$ the feature vector (in two dimensions this is simply a line, in three a plane, and so on).
The hyperplanes parallel to it, which we call the positive and negative hyperplanes, can then be expressed as:

  • $w_{0} + w^{T}x_{pos} = 1$, or simply $w_{0} + w^{T}x_{pos} - 1 = 0$          (1)

    And

  • $w_{0} + w^{T}x_{neg} = -1$, or simply $w_{0} + w^{T}x_{neg} + 1 = 0$          (2)

If we subtract equations (1) and (2) from each other, the bias $w_{0}$ cancels and we get:

$\implies w^{T}(x_{pos} - x_{neg}) = 2$

We can normalize this equation by the length of the vector $w$, which is defined as:

$\parallel w \parallel = \sqrt{\sum^{m}_{j=1} w^2_j}$

where $m$ is the number of features. Doing so, we arrive at the following equation:

$\dfrac{w^{T}(x_{pos} - x_{neg})}{\parallel w \parallel} = \dfrac{2}{\parallel w \parallel}$

The left side of the preceding equation can be interpreted as the distance between the positive and negative hyperplane, which is the so-called margin that we want to maximize.
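
To make that concrete with purely hypothetical numbers: if the learned weight vector were $w = (3, 4)^{T}$, then $\parallel w \parallel = \sqrt{3^2 + 4^2} = 5$ and the margin would be $\dfrac{2}{\parallel w \parallel} = 0.4$; a smaller $\parallel w \parallel$ would mean a wider margin.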

Now, the objective function of the SVM becomes the maximization of this margin, that is, maximizing $\dfrac{2}{\parallel w \parallel}$ under the constraint that the samples are classified correctly, which can be expressed as:

$w_{0} + w^{T}x^{(i)} \geq 1 \ \ \text{if} \ \ y^{(i)} = 1$

$w_{0} + w^{T}x^{(i)} \leq -1 \ \ \text{if} \ \ y^{(i)} = -1$

for $i = 1 \dots N$, where $N$ is the number of samples in our dataset.

Those two equations are just a formal way of saying that all positive samples should fall ahead of the positive hyperplane and all negative samples should fall behind the negative hyperplane.
This can be written more compactly as $y^{(i)}(w_{0} + w^{T}x^{(i)}) \geq 1 \ \ \forall_{i}$.
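
As a tiny sanity check of that compact form, here is a minimal NumPy sketch; the weights, bias, and samples below are made-up illustration values, not the result of any training:

import numpy as np

# hypothetical weights, bias and samples, purely for illustration
w = np.array([0.5, -1.0])
w0 = 0.2
X = np.array([[2.0, -1.5], [-3.0, 2.0]])    # one sample per row
y = np.array([1, -1])                        # class labels

margins = y * (X @ w + w0)                   # y_i (w0 + w^T x_i) for every sample
print(margins, np.all(margins >= 1))         # the constraint requires every value to be >= 1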

In practice, though, it is easier to minimize the reciprocal term $\dfrac{{\parallel w \parallel}^2}{2}$, which can be solved by quadratic programming; the details of that are out of the scope of this post.
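
Putting the pieces of this section together, the (hard-margin) optimization problem handed to such a solver can be written compactly as:

$\min_{w,\ w_{0}} \ \dfrac{1}{2} {\parallel w \parallel}^2 \quad \text{subject to} \quad y^{(i)}(w_{0} + w^{T}x^{(i)}) \geq 1, \ \ i = 1 \dots N$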

But what about non-linearity?

We already know from the post on the Perceptron Algorithm that it never converges on data which is not linearly separable. Do SVMs, with their linear decision boundaries, also suffer from indefinite non-convergence?

They would, if not for something called a slack variable. The motivation for introducing the slack variable $\xi$ was to relax the linear constraints for nonlinearly separable data, allowing convergence in the presence of misclassifications under appropriate cost penalties.

What that means in plain English is that you allow some misclassifications, provided you accept their penalty cost. If the penalty is low and allowing a misclassification gives you a better margin, you allow it. If the penalty is high, you keep the sample correctly classified, even at the cost of a reduced margin (see figure).

The positive-valued slack variable is simply added to the linear constraints:

$w_{0} + w^{T}x^{(i)} \geq 1 - \xi^{(i)} \ \ \text{if} \ \ y^{(i)} = 1$
and
$w_{0} + w^{T}x^{(i)} \leq -1 + \xi^{(i)} \ \ \text{if} \ \ y^{(i)} = -1$

for $i = 1 \dots N$, where $N$ is the number of samples in our dataset.

So, the new objective to be minimized (subject to constraints) becomes:

$\dfrac{1}{2} {\parallel w \parallel}^2 + C \Big( \sum_{i} \xi^{(i)} \Big)$

Via the variable $C$, we can control the penalty for misclassification.

  • Large values of $C$ correspond to a larger error penalty.
  • Smaller values of $C$ indicate that we are less strict about misclassification errors (see the sketch after this list).
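
To make the objective above concrete: at the optimum, each slack $\xi^{(i)}$ equals the hinge loss $\max(0,\ 1 - y^{(i)}(w_{0} + w^{T}x^{(i)}))$, a standard reformulation that this post does not cover. Under that assumption, here is a rough NumPy subgradient-descent sketch; the toy data, learning rate, and iteration count are made-up illustration values, not what a real solver would use:

import numpy as np

# toy two-class data, purely for illustration
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

C, lr, epochs = 1.0, 0.01, 200
w, w0 = np.zeros(2), 0.0

for _ in range(epochs):
    margins = y * (X @ w + w0)                 # y_i (w0 + w^T x_i) for every sample
    viol = margins < 1                         # samples inside the margin or misclassified
    # subgradient of  1/2 ||w||^2 + C * sum_i max(0, 1 - y_i (w0 + w^T x_i))
    grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
    grad_w0 = -C * y[viol].sum()
    w -= lr * grad_w
    w0 -= lr * grad_w0

print('w =', w, 'w0 =', w0, 'margin width =', 2 / np.linalg.norm(w))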

Note: the variable $C$ has become something of a convention; you will see it in many places, often just hanging around without any context, for example in the decision tree learning algorithm, so it is in general a good idea to be familiar with it.

We can use the variable $C$ to control the width of the margin, thus tuning the bias-variance trade-off as seen in the figure below.
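
As a rough illustration of that trade-off, here is a small, hedged sketch using scikit-learn's SVC (formally introduced in the next section); the synthetic blobs and the two values of C are arbitrary choices for demonstration only. For a linear kernel, the fitted weight vector is available as coef_, so we can read the margin width off as $\dfrac{2}{\parallel w \parallel}$:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# synthetic two-class data, purely for illustration
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.5, random_state=1)

for C in (0.01, 100.0):
    svm = SVC(kernel='linear', C=C, random_state=1).fit(X, y)
    w = svm.coef_[0]
    print('C = %g  ->  margin width 2/||w|| = %.3f' % (C, 2 / np.linalg.norm(w)))

You should typically see the margin shrink as C grows, since a large penalty pushes the model to classify every training sample correctly rather than keep a wide margin.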

The code

This is how you import an SVC classifier from the scikit-learn Python module and feed it the training data for it to learn (fit) from.

from sklearn.svm import SVC
svm = SVC(kernel='linear', C=1.0, random_state=1)  # linear kernel; C is the misclassification penalty discussed above
svm.fit(X_train, y_train)                          # learn (fit) the separating hyperplane from the training data

The parameter C here is the same $C$ we saw earlier.
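
As a quick follow-up, if you also have a held-out test split (X_test and y_test are assumed here; this post does not define them), you can sanity-check the fitted model like so:

from sklearn.metrics import accuracy_score

y_pred = svm.predict(X_test)                              # predict labels for unseen samples
print('Accuracy: %.3f' % accuracy_score(y_test, y_pred))  # fraction of correct predictions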

Summary

Whew, that was a handful. Let's review what you learned in this post:

  • You learned what Support Vector Machines are and what they do.
  • You learned about the questions SVMs answer.
  • You learned about the fundamental Mathematics behind SVMs.
  • You learned about how SVMs handle non-linearity.
  • You learned about how to control the cost penalty for misclassifications in case of nonlinearly separable data.

Conclusion

We learned that support vector machines are a really popular choice among ML practitioners, and for good reason. SVM is a really handy algorithm, and understanding it well will serve you well, simply because it can be found everywhere, from real-world classifiers to the examples section of your favorite machine learning library. Apart from ease of use, a major reason for the SVM's popularity is the Kernel Trick, or Kernel Method, which will be the topic of a future post. Stay tuned and keep learning.



Written by Arjun Kathuria

You can find more about him here