This post reviews the key knowledge about support vector machines (SVM). It is largely based on a series of lectures [1].
Suppose we have data $\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots, (\mathbf{x}_N, y_N)\}$, where each $\mathbf{x}_n$ is a $d$-dimensional data point associated with a label $y_n \in \{-1, +1\}$.
The most basic form of SVM comes up with a plane $\mathbf{w}^T\mathbf{x} + b = 0$ such that the data points are perfectly linearly separated according to their labels. For a linearly separable dataset, infinitely many such planes exist. However, the “best” one, intuitively, should have the largest margin to the points closest to it. Here, “margin” means the distance from a point to the plane. In the example below, the best separating plane is supposed to be the rightmost one.
For a data point $\mathbf{x}_n$, the way to calculate its margin (i.e., its distance to the plane $\mathbf{w}^T\mathbf{x} + b = 0$) is to pick any point $\mathbf{x}$ on the plane and then project $\mathbf{x}_n - \mathbf{x}$ onto the normal direction $\mathbf{w}$:

$$\text{dist}(\mathbf{x}_n) = \left| \frac{\mathbf{w}^T}{\|\mathbf{w}\|} (\mathbf{x}_n - \mathbf{x}) \right| = \frac{1}{\|\mathbf{w}\|} \left| \mathbf{w}^T \mathbf{x}_n - \mathbf{w}^T \mathbf{x} \right|$$
Since any point $\mathbf{x}$ on the plane satisfies $\mathbf{w}^T\mathbf{x} + b = 0$ (i.e., $\mathbf{w}^T\mathbf{x} = -b$), we can simplify the distance formula:

$$\text{dist}(\mathbf{x}_n) = \frac{1}{\|\mathbf{w}\|} \left| \mathbf{w}^T \mathbf{x}_n + b \right|$$
(From an intuitive perspective, the vector $\mathbf{w}$ is perpendicular to the plane $\mathbf{w}^T\mathbf{x} + b = 0$. Therefore, finding the distance from $\mathbf{x}_n$ to the hyperplane amounts to projecting $\mathbf{x}_n - \mathbf{x}$ onto the direction $\frac{\mathbf{w}}{\|\mathbf{w}\|}$.)
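As a quick sanity check, here is a tiny NumPy sketch of this distance formula (the names `w`, `b`, and `x_n` are illustrative, not from [1]):

```python
import numpy as np

def distance_to_hyperplane(x_n, w, b):
    """Distance from point x_n to the hyperplane w^T x + b = 0."""
    return abs(w @ x_n + b) / np.linalg.norm(w)

# Example: the plane x1 + x2 - 1 = 0 and the point (1, 1)
w = np.array([1.0, 1.0])
b = -1.0
x_n = np.array([1.0, 1.0])
print(distance_to_hyperplane(x_n, w, b))  # ~0.7071, i.e., 1/sqrt(2)
```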
This is the margin for any data point $\mathbf{x}_n$. However, remember that the ultimate goal is to maximize the margin of the closest data point, i.e., the data point closest to the separating plane. One can write the objective function like this:

$$\max_{\mathbf{w}, b} \; \min_{n=1,\dots,N} \frac{1}{\|\mathbf{w}\|} \left| \mathbf{w}^T \mathbf{x}_n + b \right|$$
A problem arises here. Even for the same plane, there are infinitely many pairs $(\mathbf{w}, b)$ corresponding to it, all with different scales. For example, the plane with $(\mathbf{w}, b)$ is the same plane as the plane with $(2\mathbf{w}, 2b)$. Therefore, we need to pre-define a scale.
We adopt a normalization convention: scale $\mathbf{w}$ and $b$ such that the data point closest to the plane, $\mathbf{x}_{n^*}$, satisfies $\left| \mathbf{w}^T \mathbf{x}_{n^*} + b \right| = 1$. Therefore, the distance from the closest data point to the plane becomes $\frac{1}{\|\mathbf{w}\|}$. After this, the objective function becomes easier:

$$\max_{\mathbf{w}, b} \frac{1}{\|\mathbf{w}\|} \quad \text{s.t.} \quad \min_{n} \left| \mathbf{w}^T \mathbf{x}_n + b \right| = 1$$
Note that for any correctly classified data point $\mathbf{x}_n$, $y_n(\mathbf{w}^T\mathbf{x}_n + b) = \left| \mathbf{w}^T\mathbf{x}_n + b \right|$. The constraint can therefore be re-written as $y_n(\mathbf{w}^T\mathbf{x}_n + b) \geq 1$ for all $n$. Moreover, instead of maximizing $\frac{1}{\|\mathbf{w}\|}$, we can equivalently minimize $\frac{1}{2}\mathbf{w}^T\mathbf{w}$. Therefore, the objective function now becomes:

$$\min_{\mathbf{w}, b} \frac{1}{2}\mathbf{w}^T\mathbf{w} \quad \text{s.t.} \quad y_n(\mathbf{w}^T\mathbf{x}_n + b) \geq 1, \quad n = 1, \dots, N$$
We are sure that among all the constraints, at least one must hold with equality (i.e., for some data point $\mathbf{x}_n$, $y_n(\mathbf{w}^T\mathbf{x}_n + b) = 1$). That is the data point closest to the plane. If all the constraints held with strict inequality (i.e., $y_n(\mathbf{w}^T\mathbf{x}_n + b) > 1$ for every $n$), $\frac{1}{2}\mathbf{w}^T\mathbf{w}$ would not be minimized: we could always scale $\mathbf{w}$ and $b$ down a little bit so that $\frac{1}{2}\mathbf{w}^T\mathbf{w}$ decreases, until some constraint hits equality.
The objective function is a constrained optimization problem with affine inequality constraints. An overview of using Lagrange multipliers and the KKT conditions for optimization can be found in my previous post [2]. Basically, the way to optimize this objective function is to use Lagrange multipliers and the KKT conditions.
From [2], we know that the KKT conditions are usually only necessary conditions in a constrained optimization problem. However, when the optimization problem is convex (in our case, the objective $\frac{1}{2}\mathbf{w}^T\mathbf{w}$ is convex) and Slater’s condition holds (in our case, we only have affine constraints $y_n(\mathbf{w}^T\mathbf{x}_n + b) \geq 1$), the KKT conditions become necessary and sufficient for finding the optimal solution $(\mathbf{w}^*, b^*)$. $(\mathbf{w}^*, b^*, \boldsymbol{\alpha}^*)$ will be a saddle point of the Lagrange-multiplier-augmented loss function $\mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha})$. Regarding why we can apply Lagrange multipliers + KKT conditions to optimize the objective function, please also refer to other materials, since this is the part I am less confident about (Section 3.2 in [4], the answer in [5], the slides in [6], Section 5.2.3 in [7]).
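For reference, here is what the KKT conditions look like when specialized to our problem (a standard statement; the multipliers $\alpha_n$ are introduced formally in the next step):

$$\begin{aligned}
&\text{stationarity:} && \mathbf{w} = \sum_{n} \alpha_n y_n \mathbf{x}_n, \qquad \sum_{n} \alpha_n y_n = 0 \\
&\text{primal feasibility:} && y_n(\mathbf{w}^T\mathbf{x}_n + b) \geq 1 \\
&\text{dual feasibility:} && \alpha_n \geq 0 \\
&\text{complementary slackness:} && \alpha_n \left( y_n(\mathbf{w}^T\mathbf{x}_n + b) - 1 \right) = 0
\end{aligned}$$

Complementary slackness is what makes support vectors special: $\alpha_n > 0$ only for points that lie exactly on the margin.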
The optimization process is as follows. We first write the Lagrangian

$$\mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\mathbf{w}^T\mathbf{w} - \sum_{n=1}^{N} \alpha_n \left( y_n(\mathbf{w}^T\mathbf{x}_n + b) - 1 \right)$$

and take its derivatives w.r.t. $\mathbf{w}$ and $b$. Setting the derivatives to 0, we find the relationship between $\boldsymbol{\alpha}$ (the Lagrange multipliers), $(\mathbf{w}, b)$, and $(\mathbf{x}_n, y_n)$ (the data):

$$\nabla_{\mathbf{w}} \mathcal{L} = \mathbf{w} - \sum_{n=1}^{N} \alpha_n y_n \mathbf{x}_n = 0, \qquad \frac{\partial \mathcal{L}}{\partial b} = -\sum_{n=1}^{N} \alpha_n y_n = 0$$
Substituting these back into the Lagrangian, we can eventually formulate the problem as a quadratic programming problem (w.r.t. $\boldsymbol{\alpha}$):

$$\max_{\boldsymbol{\alpha}} \sum_{n=1}^{N} \alpha_n - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} \alpha_n \alpha_m y_n y_m \mathbf{x}_n^T \mathbf{x}_m \quad \text{s.t.} \quad \alpha_n \geq 0, \quad \sum_{n=1}^{N} \alpha_n y_n = 0$$

and then hand it to a quadratic programming package to solve. We expect to receive the vector $\boldsymbol{\alpha}$ as the result, which we then use to obtain $\mathbf{w} = \sum_{n=1}^{N} \alpha_n y_n \mathbf{x}_n$.
Then, the parameter $b$ can be obtained by picking any support vector $\mathbf{x}_s$ (a data point with $\alpha_s > 0$) and solving $y_s(\mathbf{w}^T\mathbf{x}_s + b) = 1$ for $b$.
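To make the whole pipeline concrete, below is a minimal sketch of solving the hard-margin dual with an off-the-shelf QP solver. It assumes the cvxopt package and linearly separable inputs; the function and variable names are illustrative, not from [1]:

```python
import numpy as np
from cvxopt import matrix, solvers

def fit_hard_margin_svm(X, y):
    """X: (N, d) array; y: (N,) array of +/-1 labels, assumed linearly separable."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    N = X.shape[0]
    # cvxopt solves: min 1/2 a^T P a + q^T a  s.t.  G a <= h,  A a = b
    # SVM dual:      min 1/2 a^T Q a - 1^T a  s.t.  -a <= 0,   y^T a = 0
    Yx = y[:, None] * X
    P = matrix(Yx @ Yx.T)                  # Q[n, m] = y_n y_m x_n^T x_m
    q = matrix(-np.ones(N))
    G = matrix(-np.eye(N))
    h = matrix(np.zeros(N))
    A = matrix(y.reshape(1, -1))
    b_eq = matrix(0.0)
    alphas = np.array(solvers.qp(P, q, G, h, A, b_eq)['x']).ravel()

    w = (alphas * y) @ X                   # w = sum_n alpha_n y_n x_n
    s = int(np.argmax(alphas))             # index of one support vector (largest alpha)
    b = y[s] - w @ X[s]                    # from y_s (w^T x_s + b) = 1
    return w, b, alphas
```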
Up to this point, a new problem arises: we have only been dealing with perfectly linearly separable problems. How can we apply SVM to non-linear problems?
The first solution is the kernel trick. The objective function we send to quadratic programming is $\sum_{n} \alpha_n - \frac{1}{2} \sum_{n} \sum_{m} \alpha_n \alpha_m y_n y_m \mathbf{x}_n^T \mathbf{x}_m$, which only requires knowing the inner product between any $\mathbf{x}_n$ and $\mathbf{x}_m$. Suppose the raw data points $\mathbf{x}_n$ are low-dimensional and not linearly separable. However, if they are mapped to a higher-dimensional space as $\mathbf{z}_n = \Phi(\mathbf{x}_n)$, they can become separable. In that high-dimensional space, we can apply the linearly separable version of SVM to the points $\mathbf{z}_n$ to find a hyperplane and support vectors. That only requires knowing the inner product of $\mathbf{z}_n$ and $\mathbf{z}_m$.
We don’t even need to know the exact formula of the mapping function $\Phi$ from $\mathbf{x}$ to $\mathbf{z}$: even a black-box mapping function is OK, as long as we can calculate the inner product of any $\mathbf{z}_n$ and $\mathbf{z}_m$ corresponding to $\mathbf{x}_n$ and $\mathbf{x}_m$. This “inner product” function is called a kernel:

$$K(\mathbf{x}_n, \mathbf{x}_m) = \Phi(\mathbf{x}_n)^T \Phi(\mathbf{x}_m)$$
Kernel functions can take many forms. For example, the polynomial kernel $K(\mathbf{x}_n, \mathbf{x}_m) = (1 + \mathbf{x}_n^T \mathbf{x}_m)^Q$ and the Radial Basis Function (RBF) kernel $K(\mathbf{x}_n, \mathbf{x}_m) = \exp(-\gamma \|\mathbf{x}_n - \mathbf{x}_m\|^2)$. There is a mathematical property called “Mercer’s condition” that can be used to verify whether a function is a valid kernel function, but this is out of the scope of this post.
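A small NumPy sketch of these two kernels (the degree `Q` and the width parameter `gamma` are hyperparameters chosen by the user):

```python
import numpy as np

def polynomial_kernel(x_n, x_m, Q=2):
    """K(x_n, x_m) = (1 + x_n^T x_m)^Q"""
    return (1.0 + x_n @ x_m) ** Q

def rbf_kernel(x_n, x_m, gamma=1.0):
    """K(x_n, x_m) = exp(-gamma * ||x_n - x_m||^2)"""
    return np.exp(-gamma * np.sum((x_n - x_m) ** 2))
```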
After we use a kernel, the optimization of SVM stays the same except that the coefficient matrix of the quadratic program now becomes:

$$Q_{nm} = y_n y_m K(\mathbf{x}_n, \mathbf{x}_m)$$
And to finally obtain $b$ and the prediction $g(\mathbf{x})$, we need to utilize the kernel function too:

$$b = y_s - \sum_{n} \alpha_n y_n K(\mathbf{x}_n, \mathbf{x}_s), \qquad g(\mathbf{x}) = \text{sign}\left( \sum_{n} \alpha_n y_n K(\mathbf{x}_n, \mathbf{x}) + b \right)$$

where $\mathbf{x}_s$ is any support vector.
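Putting these two formulas into code, a minimal sketch of the kernelized version might look like this (it assumes `alphas` comes from the kernelized QP and reuses the `rbf_kernel` sketch above; all names are illustrative):

```python
import numpy as np

def compute_b(X, y, alphas, kernel):
    """b = y_s - sum_n alpha_n y_n K(x_n, x_s), for any support vector x_s (alpha_s > 0)."""
    s = int(np.argmax(alphas))
    return y[s] - sum(a * y_n * kernel(x_n, X[s]) for a, y_n, x_n in zip(alphas, y, X))

def predict(x_new, X, y, alphas, b, kernel):
    """g(x) = sign( sum_n alpha_n y_n K(x_n, x) + b )"""
    score = sum(a * y_n * kernel(x_n, x_new) for a, y_n, x_n in zip(alphas, y, X)) + b
    return np.sign(score)
```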
The second solution for dealing with non-linearly separable problems is the soft margin. The intuition is to allow some data points to violate the margin constraint, while penalizing the total amount of violation.
Comparing the two solutions: the kernel trick is often used when the data are highly non-linearly separable, while the soft-margin SVM is often used when the data are only slightly non-linearly separable (most data points are still linearly separable).
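In practice the two are often combined: libraries such as scikit-learn expose both a soft-margin penalty C and a kernel choice in a single estimator. A minimal usage sketch (the toy ring-shaped data here is made up purely for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Toy data: two ring-shaped classes that are not linearly separable.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(np.linalg.norm(X, axis=1) > 1.0, 1, -1)

# The RBF kernel handles the non-linearity; C controls how soft the margin is
# (smaller C allows more margin violations).
clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)
print(clf.score(X, y))  # training accuracy
```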
For practical issues in applying SVM, please refer to [3].
References
[1] Lectures from the Caltech online ML course: https://www.youtube.com/watch?v=eHsErlPJWUU and https://www.youtube.com/watch?v=XUj5JbQihlU
[2] https://czxttkl.com/?p=1449
[3] https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
[4] http://www.cmap.polytechnique.fr/~mallat/papiers/svmtutorial.pdf
[5] https://stats.stackexchange.com/questions/239912/lagrange-multipler-kkt-condition
[6] https://www.cs.cmu.edu/~epxing/Class/10701-08s/recitation/svm.pdf
[7] convex optimization textbook: https://www.dropbox.com/s/fk992yb6qja3xhv/convex%20optimization.pdf?dl=0
Some Chinese references:
[a] http://www.cnblogs.com/jerrylead/archive/2011/03/13/1982639.html