Support Vector Machines
February 26, 2019
Support Vector Machines enjoy a really high popularity among machine learning practitioners, to the point of being ubiquitous: they are everywhere!
A Support Vector Machine, or SVM, is a supervised learning algorithm which can be thought of as an extension of the Perceptron model.
The big idea here is to maximize the decision margin, which is defined as the distance between the separating hyperplane (the decision boundary) and the training samples closest to the hyperplane on either side, the so-called support vectors.
When there are multiple hyperplanes that could act as a decision boundary (see figure), the SVM algorithm says that we should choose the one with the maximum margin. SVMs are sometimes referred to as Maximum Margin classifiers because of this characteristic.
The rationale behind having decision boundaries with large margins is that they tend to have a lower generalization error (they are better at predicting outcomes for previously unseen data), whereas models with small margins are prone to overfitting.
The Mathematics of SVMs
Suppose the decision boundary is a hyperplane, which can be expressed as $w_{0} + w^{T}x = 0$, where $w$ is the weight vector, $w_{0}$ the bias and $x$ the feature vector.
Then the planes parallel to this hyperplane, which we call the positive and negative hyperplanes, can be expressed as:

$w_{0}+w^{T}x_{pos} = 1$ or simply $\ \ w_{0} + w^{T}x_{pos} - 1 = 0 \ \ \ \ \ \ \ \ \ (1)$
And

$w_{0}+w^{T}x_{neg} = -1$ or simply $\ \ w_{0} + w^{T}x_{neg} + 1 = 0 \ \ \ \ \ \ (2)$
If we subtract equation $(2)$ from equation $(1)$, we get:
$\implies w^{T}(x_{pos} - x_{neg}) = 2$
If we normalize this equation by the length of the vector $w$, defined as:
$\parallel w \parallel = \sqrt{\sum^{m}_{j=1} w^2_j}$
where $m$ is the number of features,
we arrive at the following equation:
$\dfrac {w^T(x_{pos} - x_{neg})} {\parallel w \parallel} = \dfrac {2}{\parallel w \parallel}$
The left side of the preceding equation can be interpreted as the distance between the positive and negative hyperplanes, which is the so-called margin that we want to maximize.
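As a quick sanity check, the margin width $\dfrac{2}{\parallel w \parallel}$ can be computed directly from a weight vector. Here is a minimal sketch using NumPy, with a made-up $w$ chosen purely for illustration:

```python
import numpy as np

w = np.array([3.0, 4.0])       # hypothetical weight vector
norm_w = np.linalg.norm(w)     # ||w|| = sqrt(3^2 + 4^2) = 5
margin = 2.0 / norm_w          # width between positive and negative hyperplanes
print(margin)                  # 0.4
```

A larger $\parallel w \parallel$ means a narrower margin, which is why minimizing the weight norm (as we do below) is equivalent to maximizing the margin.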
Now, the objective function of the SVM becomes the maximization of this margin by maximizing $\dfrac {2}{\parallel w \parallel}$ under the constraint that the samples are classified correctly, which can be expressed as:
$w_{0} + w^Tx^{(i)} \geq 1 \ \ if\ \ y^{(i)} = 1$
$w_{0} + w^Tx^{(i)} \leq -1 \ \ if\ \ y^{(i)} = -1$
for $i = 1 \dots N$, where $N$ is the number of samples in our dataset.
Those two equations are just a formal way of saying that all positive samples should fall ahead of the positive hyperplane and all the negative samples should fall behind the negative hyperplane.
which can also be written more compactly as $y^{(i)}(w_0+w^Tx^{(i)}) \geq 1 \ \ \forall_i$.
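The compact form of the constraint is easy to check numerically. The following sketch (the function name and toy data are my own, invented for illustration) verifies $y^{(i)}(w_0 + w^Tx^{(i)}) \geq 1$ for every sample at once:

```python
import numpy as np

def constraints_satisfied(w0, w, X, y):
    # The margin constraint y_i * (w0 + w . x_i) >= 1 must hold for every sample
    return bool(np.all(y * (w0 + X @ w) >= 1.0))

# Toy data: one positive sample, one negative sample
X = np.array([[2.0, 0.0], [-2.0, 1.0]])
y = np.array([1, -1])

print(constraints_satisfied(0.0, np.array([1.0, 0.0]), X, y))   # True: both margins are 2
print(constraints_satisfied(0.0, np.array([0.25, 0.0]), X, y))  # False: margins are only 0.5
```

Note how multiplying by the label $y^{(i)}$ folds the two separate inequalities into a single one.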
In practice though, it is easier to minimize the reciprocal term $\dfrac {{\parallel w \parallel}^2} {2}$, which can be solved by quadratic programming, whose details are out of the scope of this post.
But what about nonlinearity?
We already know from the post on the Perceptron Algorithm that it never converges on data which is not linearly separable. Do SVMs, with their linear decision boundaries, also suffer from indefinite non-convergence?
They would, if not for something called a slack variable. The motivation for introducing the slack variable $\xi$ is to relax the linear constraints for non-linearly separable data, allowing convergence in the presence of misclassifications under appropriate cost penalties.
What that means in English is that you allow some misclassifications, provided you agree to pay their penalty cost. If the penalty is low and misclassifying a sample gives you a better margin, then you misclassify it. If the penalty is high, you do not misclassify it, at the cost of a reduced margin. (see figure)
The positive-valued slack variable is simply added to the linear constraints:
$w_0 + w^Tx^{(i)} \geq 1 - \xi^{(i)} \ \ if\ \ y^{(i)} = 1$
and
$w_0 + w^Tx^{(i)} \leq -1 + \xi^{(i)} \ \ if\ \ y^{(i)} = -1$
for $i$ = 1 … N, where N is the number of samples in our dataset.
So, the new objective to be minimized (subject to constraints) becomes:
$\dfrac 1 2 {\parallel w \parallel}^2 + C\Big(\sum_i \xi^{(i)}\Big)$
Via the variable $C$, we can control the penalty for misclassification:
- Large values of $C$ correspond to a larger error penalty.
- Smaller values of $C$ indicate that we are less strict about misclassification errors.
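To make the new objective concrete, here is a minimal sketch that evaluates it for a candidate $(w_0, w)$. It uses the fact that, for fixed weights, the optimal slack is $\xi^{(i)} = \max\big(0,\ 1 - y^{(i)}(w_0 + w^Tx^{(i)})\big)$; the function name and toy data are made up for illustration:

```python
import numpy as np

def soft_margin_objective(w0, w, X, y, C):
    # Slack per sample: how far each one violates the margin constraint
    xi = np.maximum(0.0, 1.0 - y * (w0 + X @ w))
    # (1/2)||w||^2 plus the C-weighted sum of slacks
    return 0.5 * np.dot(w, w) + C * np.sum(xi)

X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]])
y = np.array([1, -1, 1])
w = np.array([1.0, 0.0])

# The third sample sits inside the margin, so it incurs a slack of 0.5
print(soft_margin_objective(0.0, w, X, y, C=1.0))   # 1.0  (0.5 + 1 * 0.5)
print(soft_margin_objective(0.0, w, X, y, C=10.0))  # 5.5  (0.5 + 10 * 0.5)
```

Notice how raising $C$ makes the same margin violation much more expensive, pushing the optimizer toward classifying that sample correctly even at the cost of a narrower margin.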
Note: the variable $C$ has become something of a convention; you will see it in many places, often without any context (for example, in the Decision Tree learning algorithm), so it is in general a good idea to be familiar with it.
We can use the variable $C$ to control the width of the margin, thus tuning the bias-variance tradeoff, as seen in the figure below.
The code
This is how you import the SVC classifier from the scikit-learn Python module and feed it the training data for it to learn (fit) from:
```python
from sklearn.svm import SVC

svm = SVC(kernel='linear', C=1.0, random_state=1)
svm.fit(X_train, y_train)
```
The parameter `C` is the same $C$ we saw earlier.
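To see this end to end, here is a small sketch on a made-up, linearly separable toy dataset (the data points are invented purely for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated clusters: class -1 near the origin, class +1 near (3, 3)
X_train = np.array([[0., 0.], [0., 1.], [1., 0.],
                    [3., 3.], [3., 4.], [4., 3.]])
y_train = np.array([-1, -1, -1, 1, 1, 1])

svm = SVC(kernel='linear', C=1.0, random_state=1)
svm.fit(X_train, y_train)

print(svm.predict([[0.5, 0.5], [3.5, 3.5]]))  # [-1  1]
print(len(svm.support_vectors_))              # how many support vectors were found
```

After fitting, `svm.support_vectors_` holds the samples that lie on (or inside) the margin; with a smaller `C` you would typically see the margin widen and more samples become support vectors.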
Summary
Whew, that was a handful! Let's review what you learned in this post:
- You learned what Support Vector Machines are and what they do.
- You learned about the questions SVMs answer.
- You learned about the fundamental mathematics behind SVMs.
- You learned how SVMs handle nonlinearity.
- You learned how to control the cost penalty for misclassifications in the case of non-linearly separable data.
Conclusion
We learned that support vector machines are a really popular choice among ML practitioners, and for good reason. It is a really handy algorithm, and understanding it well will serve you well, simply because it can be found everywhere, from production classifiers to the examples section of your favorite machine learning library. Apart from ease of use, a major reason for the SVM's popularity is the Kernel Trick, or the Kernel Method, which will be the topic of a future post. Stay tuned and keep learning!