Revisit Support Vector Machine

This post reviews the key concepts behind the support vector machine (SVM). It is largely based on a series of lectures [1].

Suppose we have a dataset \mathcal{D} of N points, where each point x_i \in \mathbb{R}^d, \; i=1,\cdots,N, is associated with a label y_i \in \{+1, -1\}.

The most basic form of SVM is to come up with a plane w^T x + b = 0 such that data points can be perfectly linearly separated according to their labels. For a linearly separable dataset, infinitely many such planes exist. However, the “best” one, intuitively, should have the largest margin from the closest points to it. Here, “margin” means the distance from a point to the plane. In the example below, the best separating plane is the rightmost one.

[Figure: several candidate separating planes for the same linearly separable dataset; the rightmost one has the largest margin.]

For a data point x_n, the way to calculate its margin (i.e., its distance to the plane w^T x + b=0) is to pick any point x on the plane and then project x_n - x onto w:

distance=\frac{1}{\|w\|}\left|w^T(x_n - x)\right|

Since any point x on the plane satisfies that w^T x + b=0, we can simplify the distance formula:

distance=\frac{1}{\|w\|}|w^T x_n + b|

(From an intuitive perspective, the vector w is perpendicular to the plane w^T x+b=0. Therefore, the distance from x_n to the plane is the length of the projection of x_n - x onto the direction of w.)
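As a quick sanity check of this formula, here is a minimal Python sketch; the plane and point below are made-up numbers.

import numpy as np

def distance_to_plane(x_n, w, b):
    # distance = |w^T x_n + b| / ||w||
    return abs(w @ x_n + b) / np.linalg.norm(w)

# Hypothetical example: the plane 2*x1 + 1*x2 + 3 = 0 and the point (1, 1).
print(distance_to_plane(np.array([1.0, 1.0]), np.array([2.0, 1.0]), 3.0))
# |2 + 1 + 3| / sqrt(5) ≈ 2.683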

This is the margin for any data point x_n. However, remember that the ultimate goal is to maximize the margin of the data point that is closest to the separating plane. One can write the objective function as:

\max_{w,b} \; \min_{n=1,2,\cdots,N}\frac{1}{\|w\|}|w^T x_n + b|

A problem arises here. Even for the same plane, there can be infinitely many pairs (w,b) corresponding to it, all with different scales. For example, the plane w_1^T x + b_1=0 with w_1=(2,1), b_1=3 is the same plane as w_2^T x + b_2 = 0 with w_2=(4,2), b_2=6. Therefore, we need to fix a scale in advance.

We adopt a normalization standard: normalize w and b such that the closest data point to the plane, x_n, satisfies |w^T x_n + b|=1. The distance from the closest data point to the plane then becomes \frac{1}{\|w\|}. With this normalization, the objective function becomes simpler:

\max_{w,b}\; \frac{1}{\|w\|} \quad \text{subject to} \quad \min\limits_{n=1,2,\cdots,N} |w^T x_n + b| = 1

Note that for any correctly classified data point x_n, |w^T x_n + b| = y_n (w^T x_n + b). The constraint \min\limits_{n=1,2,\cdots,N} |w^T x_n+b|=1 can therefore be re-written as y_n(w^T x_n + b) \geq 1, \; \forall n=1,2, \cdots, N (the next paragraph explains why the two are equivalent at the optimum). Moreover, instead of maximizing \frac{1}{\|w\|}, we can equivalently minimize \frac{1}{2}w^T w. The objective function now becomes:

\min_{w,b}\; \frac{1}{2}w^T w \quad \text{subject to} \quad y_n(w^T x_n + b) \geq 1, \;\; \forall n=1,2,\cdots,N

We are sure that at the optimum at least one constraint holds with equality (i.e., for some data point x_n, y_n(w^T x_n + b)=1). That is the data point closest to the plane. If all the constraints held strictly (i.e., y_n(w^T x_n + b)>1, \; \forall n=1,2,\cdots,N), \frac{1}{2}w^T w would not be minimized: we could always scale down w a little bit such that \frac{1}{2} w^T w decreases, until some constraint hits the equality.

The objective function is a constrained optimization problem with affine inequality constraints. An overview of using Lagrange/KKT for optimization can be found in my previous post [2]. Basically, the way to optimize this objective function is to use Lagrange multipliers and the KKT conditions.

From [2], we know that the KKT conditions are usually only necessary conditions in a constrained optimization problem. However, when the optimization problem is convex (in our case, the objective \frac{1}{2}w^T w) and Slater’s condition holds (in our case, we only have affine constraints y_n(w^Tx_n+b) \geq 1), the KKT conditions become necessary and sufficient for finding the optimal solution w^*. w^* will be the saddle point of the Lagrangian. Regarding why we can apply Lagrange multipliers + KKT conditions to optimize the objective function, please also refer to other materials, since this is the part I am less confident about (Section 3.2 in [4], the answer in [5], the slides in [6], Section 5.2.3 in [7]).

The optimization process is as follows. We first write down the Lagrangian and take its derivatives w.r.t. w and b. Setting the derivatives to 0, we can find the relationship between \alpha (the Lagrange multipliers), y_n, and x_n (the data).

\mathcal{L}(w, b, \alpha) = \frac{1}{2}w^T w - \sum\limits_{n=1}^N \alpha_n \big( y_n(w^T x_n + b) - 1 \big), \quad \alpha_n \geq 0

\frac{\partial \mathcal{L}}{\partial w} = w - \sum\limits_{n=1}^N \alpha_n y_n x_n = 0 \;\;\Rightarrow\;\; w = \sum\limits_{n=1}^N \alpha_n y_n x_n

\frac{\partial \mathcal{L}}{\partial b} = -\sum\limits_{n=1}^N \alpha_n y_n = 0 \;\;\Rightarrow\;\; \sum\limits_{n=1}^N \alpha_n y_n = 0
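Substituting w = \sum\limits_{n=1}^N \alpha_n y_n x_n and \sum\limits_{n=1}^N \alpha_n y_n = 0 back into the Lagrangian eliminates w and b and leaves a problem in \alpha only (the same dual objective \mathcal{L}(\alpha) quoted in the kernel section below):

\mathcal{L}(\alpha) = \sum\limits_{n=1}^N \alpha_n - \frac{1}{2}\sum\limits_{n=1}^N \sum\limits_{m=1}^N y_n y_m \alpha_n \alpha_m x_n^T x_m, \quad \text{maximize w.r.t. } \alpha \;\text{ subject to }\; \alpha_n \geq 0, \;\; \sum\limits_{n=1}^N \alpha_n y_n = 0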

We can eventually formulate this as a quadratic programming problem (w.r.t. \alpha) and hand it to a quadratic programming package to solve. The solver returns the vector \alpha, which we then use to obtain w:

w = \sum\limits_{n=1}^N \alpha_n y_n x_n

Then, the parameter b can be obtained by picking a support vector x_n (one with \alpha_n > 0) and solving y_n (w^T x_n + b)-1 = 0.
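Below is a minimal sketch of this procedure in Python, using cvxopt as one possible quadratic programming package and a tiny made-up linearly separable dataset; the dual is negated so that it fits cvxopt's standard minimization form. This is only an illustration, not a production implementation.

import numpy as np
from cvxopt import matrix, solvers

solvers.options['show_progress'] = False

# Toy linearly separable data: X is (N, d), y holds labels +1/-1.
X = np.array([[2.0, 2.0], [2.5, 3.0], [0.0, 0.5], [-1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = X.shape[0]

# Dual as a standard QP: minimize (1/2) a^T P a - 1^T a
# subject to a >= 0 and y^T a = 0, where P[n, m] = y_n y_m x_n^T x_m.
P = matrix(np.outer(y, y) * (X @ X.T))
q = matrix(-np.ones(N))
G = matrix(-np.eye(N))        # -a <= 0, i.e., a >= 0
h = matrix(np.zeros(N))
A = matrix(y.reshape(1, -1))  # y^T a = 0
b_eq = matrix(0.0)

alpha = np.array(solvers.qp(P, q, G, h, A, b_eq)['x']).flatten()

# Recover w = sum_n alpha_n y_n x_n, then b from one support vector (alpha_n > 0).
w = (alpha * y) @ X
sv = np.argmax(alpha)
b = y[sv] - w @ X[sv]
print("w =", w, "b =", b)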

At this point, a new problem arises: so far we have only been dealing with perfectly linearly separable data. How can we apply SVM to non-linearly separable problems?

The first solution is the kernel trick. The objective function we send to quadratic programming is \mathcal{L}(\alpha)=\sum\limits_{n=1}^N \alpha_n - \frac{1}{2}\sum\limits_{n=1}^N \sum\limits_{m=1}^N y_n y_m \alpha_n \alpha_m x_n^T x_m, which only requires that we know the inner product between any x_n and x_m. Suppose the raw data points x_1, x_2, \cdots, x_N are low-dimensional and not linearly separable. However, if they are mapped to a higher-dimensional space as z_1, z_2, \cdots, z_N, they can become separable. In that higher-dimensional space, we can apply the linearly separable SVM to the data points z_1, z_2, \cdots, z_N to find a hyperplane and support vectors. This only requires that we know the inner product of z_n and z_m.

\mathcal{L}(\alpha)=\sum\limits_{n=1}^N \alpha_n - \frac{1}{2}\sum\limits_{n=1}^N \sum\limits_{m=1}^N y_n y_m \alpha_n \alpha_m z_n^T z_m

We don’t even need to know the exact formula of the mapping function x_n \rightarrow z_n: even a black-box mapping function is fine, as long as we can calculate the inner product of z_n and z_m for any x_n and x_m. This “inner product” function is called a kernel:

K(x_n, x_m) = z_n^T z_m

Kernel functions can take many forms, for example the polynomial kernel K(x, x')=(ax^Tx'+b)^Q and the Radial Basis Function (RBF) kernel K(x, x')=\exp(-\gamma\|x-x'\|^2). There is a mathematical property called “Mercer’s condition” that can be used to verify whether a function is a valid kernel, but this is beyond the scope of this post.
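For concreteness, here is a small Python sketch of these two kernels; a, b, Q, and gamma are hyperparameters chosen by the user.

import numpy as np

def polynomial_kernel(x, x_prime, a=1.0, b=1.0, Q=2):
    # K(x, x') = (a x^T x' + b)^Q
    return (a * np.dot(x, x_prime) + b) ** Q

def rbf_kernel(x, x_prime, gamma=1.0):
    # K(x, x') = exp(-gamma ||x - x'||^2)
    return np.exp(-gamma * np.linalg.norm(x - x_prime) ** 2)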

After we use a kernel, the optimization of the SVM stays the same, except that the coefficient matrix of the quadratic program now becomes:

\begin{pmatrix} y_1 y_1 K(x_1,x_1) & y_1 y_2 K(x_1,x_2) & \cdots & y_1 y_N K(x_1,x_N) \\ y_2 y_1 K(x_2,x_1) & y_2 y_2 K(x_2,x_2) & \cdots & y_2 y_N K(x_2,x_N) \\ \vdots & \vdots & \ddots & \vdots \\ y_N y_1 K(x_N,x_1) & y_N y_2 K(x_N,x_2) & \cdots & y_N y_N K(x_N,x_N) \end{pmatrix}

And to finally obtain w and b, we need to utilize the kernel function too:

w = \sum\limits_{n=1}^N \alpha_n y_n z_n, \qquad b = y_s - \sum\limits_{n=1}^N \alpha_n y_n K(x_n, x_s) \;\text{ for any support vector } x_s
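In practice, when the mapping to z is implicit (e.g., for the RBF kernel), w is never formed explicitly; predictions are made directly through the kernel. A minimal sketch, assuming alpha, X, y, b, and a kernel function K (such as rbf_kernel above) are already available from training:

import numpy as np

def predict(x, alpha, X, y, b, K):
    # sign( sum_n alpha_n y_n K(x_n, x) + b )
    score = sum(a_n * y_n * K(x_n, x) for a_n, y_n, x_n in zip(alpha, y, X)) + b
    return np.sign(score)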

The second solution for non-linearly separable problems is the soft margin. The intuition is to allow some data points to violate the margin constraint, paying a penalty \xi_n for each violation, with the overall trade-off controlled by a constant C.

\min_{w,b,\xi}\; \frac{1}{2}w^T w + C\sum\limits_{n=1}^N \xi_n \quad \text{subject to} \quad y_n(w^T x_n + b) \geq 1 - \xi_n, \;\; \xi_n \geq 0, \;\; \forall n=1,2,\cdots,N

Comparing the two solutions: the kernel trick is often used when the data are highly non-linearly separable, while the soft-margin SVM is often used when the data are only slightly non-separable (most points are still linearly separable). See the sketch after the figure below.

[Figure: a slightly non-separable dataset (handled by the soft margin) vs. a highly non-separable dataset (handled by the kernel trick).]
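As a hedged illustration of this comparison, here is a short scikit-learn sketch on a made-up, highly non-linear dataset (concentric circles): a soft-margin linear SVM cannot separate it well, while an RBF-kernel SVM can. The specific parameter values are arbitrary.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Highly non-linearly separable toy data.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_soft_margin = SVC(kernel="linear", C=1.0).fit(X, y)  # soft margin only
rbf_svm = SVC(kernel="rbf", gamma=2.0, C=1.0).fit(X, y)     # kernel trick

print("linear training accuracy:", linear_soft_margin.score(X, y))
print("rbf training accuracy:   ", rbf_svm.score(X, y))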

About practical issues of applying SVM, please refer to [3].

Reference

[1] lectures from Caltech online ML course

https://www.youtube.com/watch?v=eHsErlPJWUU

and

https://www.youtube.com/watch?v=XUj5JbQihlU

[2] https://czxttkl.com/?p=1449

[3] https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf

[4] http://www.cmap.polytechnique.fr/~mallat/papiers/svmtutorial.pdf

[5] https://stats.stackexchange.com/questions/239912/lagrange-multipler-kkt-condition

[6] https://www.cs.cmu.edu/~epxing/Class/10701-08s/recitation/svm.pdf

[7] convex optimization textbook: https://www.dropbox.com/s/fk992yb6qja3xhv/convex%20optimization.pdf?dl=0

Some Chinese references:

[a] http://www.cnblogs.com/jerrylead/archive/2011/03/13/1982639.html

[b] http://blog.csdn.net/v_july_v/article/details/7624837
