Maximum-margin

Primal problem

The two margin hyperplanes are:

$$\mathbf{w} \cdot \mathbf{x} + b = 1$$

and

$$\mathbf{w} \cdot \mathbf{x} + b = -1$$

So the distance between the two hyperplanes is:

$$\frac{2}{\|\mathbf{w}\|}$$

To find the hyperplane with the maximum distance (i.e. the maximum margin), we need to solve:

$$\min_{\mathbf{w}, b} \; \frac{1}{2}\|\mathbf{w}\|^2$$

subject to the following constraints:

$$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1, \quad i = 1, \dots, N$$

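As a quick sanity check (not part of the original notes), here is a minimal scikit-learn sketch that approximates the hard-margin problem by using a very large C and reads off $\mathbf{w}$, $b$, and the margin; the toy data and the choice of scikit-learn are my own assumptions.

```python
# Hedged sketch: approximate a hard-margin linear SVM with a very large C,
# then read off w, b, and the margin width 2 / ||w||.
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy set (hypothetical data).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6)   # huge C ~ hard margin
clf.fit(X, y)

w = clf.coef_[0]                    # the learned w
b = clf.intercept_[0]               # the learned b
margin = 2.0 / np.linalg.norm(w)    # distance between the two margin hyperplanes

print("w =", w, "b =", b, "margin =", margin)
# All training points should satisfy y_i (w . x_i + b) >= 1 (up to solver tolerance).
print(y * (X @ w + b))
```
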
Dual problem

The Lagrangian of this problem is:

$$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right], \quad \alpha_i \ge 0$$

The primal problem is equivalent to the following:

$$\min_{\mathbf{w}, b} \; \max_{\boldsymbol{\alpha} \ge 0} \; L(\mathbf{w}, b, \boldsymbol{\alpha})$$

By Lagrangian duality we may instead consider $\max_{\boldsymbol{\alpha} \ge 0} \min_{\mathbf{w}, b} L$. To solve the inner minimization, we set the gradients w.r.t. $\mathbf{w}$ and $b$ equal to zero:

$$\frac{\partial L}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i$$

and

$$\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0$$

We see that $\mathbf{w}$ is a linear combination of the training points $\mathbf{x}_i$; because $\boldsymbol{\alpha}$ is sparse, only the points with $\alpha_i > 0$ (the support vectors) matter.

When we substitute $\mathbf{w}$ back into the Lagrangian, the problem is converted to:

$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j$$

subject to:

$$\alpha_i \ge 0, \quad i = 1, \dots, N$$

and

$$\sum_{i=1}^{N} \alpha_i y_i = 0$$

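To make the dual concrete, here is a small sketch (not from the original notes) that solves the hard-margin dual directly with SciPy's generic SLSQP solver on a hypothetical toy dataset and then recovers $\mathbf{w}$ and $b$ from the $\alpha_i$; real SVM implementations use specialized QP/SMO solvers instead.

```python
# Hedged sketch: solve the hard-margin dual as a small constrained optimization.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

# Q_ij = y_i y_j (x_i . x_j)
Q = (y[:, None] * y[None, :]) * (X @ X.T)

def neg_dual(alpha):
    # Minimize the negative of the dual objective.
    return -(alpha.sum() - 0.5 * alpha @ Q @ alpha)

constraints = [{"type": "eq", "fun": lambda a: a @ y}]   # sum_i alpha_i y_i = 0
bounds = [(0.0, None)] * N                               # alpha_i >= 0

res = minimize(neg_dual, np.zeros(N), method="SLSQP",
               bounds=bounds, constraints=constraints)
alpha = res.x

w = (alpha * y) @ X               # w = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                 # support vectors have alpha_i > 0
b = np.mean(y[sv] - X[sv] @ w)    # from y_i (w . x_i + b) = 1 on support vectors

print("alpha =", np.round(alpha, 4))
print("w =", w, "b =", b)
```
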
Complexity of the optimization:

For the dual problem we need to optimize $N$ parameters (one $\alpha_i$ per training example), so the training cost grows with the number of training examples.

Soft-margin

However, sometimes there is no hyperplane that can split the data. We need to tolerate some examples, so the constraints become:

$$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$$

If $\xi_i = 0$, the example is correctly classified and outside the margin. If $\xi_i > 1$, the example is misclassified. So $\sum_i \xi_i$ gives (an upper bound on) how many examples are misclassified.

Primal problem:

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \xi_i$$

subject to:

$$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, N$$

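A side note, not part of the original derivation: for fixed $\mathbf{w}$ and $b$ the optimal slack is $\xi_i = \max(0, 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b))$, so the soft-margin primal is equivalent to hinge-loss minimization with $L_2$ regularization:

$$\min_{\mathbf{w}, b} \; \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{N} \max\left(0,\; 1 - y_i(\mathbf{w} \cdot \mathbf{x}_i + b)\right)$$
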
Dual problem:

To solve the inner minimization, we set the gradients w.r.t. $\mathbf{w}$, $b$, and $\boldsymbol{\xi}$ equal to zero. This again gives $\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$ and $\sum_i \alpha_i y_i = 0$, plus the extra condition $C - \alpha_i - \mu_i = 0$ (where $\mu_i \ge 0$ is the multiplier for $\xi_i \ge 0$), which implies $0 \le \alpha_i \le C$.

When we substitute $\mathbf{w}$ back into the Lagrangian, the problem is converted to:

$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j$$

subject to:

$$0 \le \alpha_i \le C, \quad i = 1, \dots, N$$

and

$$\sum_{i=1}^{N} \alpha_i y_i = 0$$

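A quick empirical check of the box constraint (my own sketch, assuming scikit-learn; in sklearn, `dual_coef_` stores $y_i \alpha_i$ for the support vectors, so its absolute value is $\alpha_i$):

```python
# Hedged sketch: verify 0 <= alpha_i <= C and sum_i alpha_i y_i = 0 on a fitted SVC.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=3.0, random_state=0)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)

alpha = np.abs(clf.dual_coef_[0])          # alpha_i for the support vectors
print("0 <= alpha_i <= C:", np.all((alpha >= 0) & (alpha <= C + 1e-8)))
print("sum_i alpha_i y_i ~ 0:", np.isclose(clf.dual_coef_.sum(), 0.0))
```
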
The only difference from the hard-margin dual is the upper bound $C$ in the constraints on $\alpha_i$. It is remarkable that such a tiny change is enough to handle non-separable data; largely because of this, its authors (Cortes and Vapnik) received a major award.

Some important parameters

C

C controls the extent to which we tolerate misclassified examples: a small C allows more margin violations, while a large C penalizes them heavily.
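
A minimal sketch (assuming scikit-learn and illustrative synthetic data, neither of which is in the original notes) of how C changes this tolerance: smaller C allows more margin violations and typically more support vectors, larger C allows fewer.

```python
# Hedged sketch: effect of C on the number of support vectors and training accuracy.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=4.0, random_state=1)

for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:<6}  support vectors={clf.n_support_.sum():3d}  "
          f"train acc={clf.score(X, y):.3f}")
```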

Does the kernel hurt performance?

Choosing an inappropriate kernel will definitely hurt performance.
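
For example (a sketch under my own choice of scikit-learn and a synthetic dataset), a linear kernel on concentric circles fails while an RBF kernel works:

```python
# Hedged sketch: a mismatched kernel hurts accuracy on non-linearly-separable data.
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.3, noise=0.1, random_state=0)

for kernel in ["linear", "rbf"]:
    scores = cross_val_score(SVC(kernel=kernel, C=1.0), X, y, cv=5)
    print(f"{kernel:6s} kernel: mean CV accuracy = {scores.mean():.3f}")
```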

One-vs-One or One-vs-Rest

In practice, there is no big difference between the two multi-class strategies.
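
A small sketch (my own, assuming scikit-learn, whose `SVC` uses one-vs-one internally) comparing the two strategies on the Iris data:

```python
# Hedged sketch: one-vs-one vs one-vs-rest multi-class SVMs.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

for name, clf in [("one-vs-one", OneVsOneClassifier(SVC(kernel="linear"))),
                  ("one-vs-rest", OneVsRestClassifier(SVC(kernel="linear")))]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```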

Overfitting

The max margin helps avoid overfitting: even when the accuracy on the training data is very high, the results on the test data are typically only slightly worse.

Feature or Model

Features are the most important thing: the more information they carry, the more likely we are to classify correctly.

Feature Selection before SVM ?

No, SVM’s max-margin regularization will handle the irrelevant features. [Still don’t know why: see here and here.]

How can SVM handle a large number of features?

The SVM is an approximate implementation of a bound on the generalization error, that depends on the margin (essentially the distance from the decision boundary to the nearest pattern from each class), but is independent of the dimensionality of the feature space (which is why using the kernel trick to map the data into a very high dimensional space isn’t such a bad idea as it might seem). So in principle SVMs should be highly resistant to over-fitting, but in practice this depends on the careful choice of C and the kernel parameters. Sadly, over-fitting can also occur quite easily when tuning the hyper-parameters as well.
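
A hedged sketch of that "careful choice of C and the kernel parameters": grid search over C and gamma with an outer cross-validation loop (nested CV), so the reported score is not itself overfit to the hyper-parameter grid. The dataset and grid values are my own illustrative choices.

```python
# Hedged sketch: nested cross-validation when tuning C and gamma.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2, "scale"]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)

# Inner CV picks (C, gamma); outer CV estimates the generalization of the whole procedure.
nested_scores = cross_val_score(search, X, y, cv=5)
print("nested CV accuracy:", nested_scores.mean())
```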

The probability that random features could distinguish the two classes?

Given 10 data points, 5 positive and 5 negative, where each data point has $d$ random features: what is the probability that there exists a hyperplane $\mathbf{w}$ that can distinguish these data points? What is the relation between this probability and the feature dimension $d$?

Suppose the probability that a single given dimension can split the data on its own is $p_0$. Then, if there are $d$ dimensions, the probability that at least one dimension can split the data is $1 - (1 - p_0)^d$, which becomes very high as $d$ grows. That is why more features make it easier for an SVM to overfit. But truly informative features would probably have a larger margin, so the SVM would still tend to prefer them.
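
A small Monte Carlo sketch (my own; the numbers are simulated, not from the original notes) that estimates $p_0$ for a single random feature and then evaluates $1 - (1 - p_0)^d$:

```python
# Hedged sketch: probability that at least one of d random features
# perfectly separates 5 positive from 5 negative points.
import numpy as np

rng = np.random.default_rng(0)
n_pos, n_neg, trials = 5, 5, 20_000

def one_feature_separates(pos, neg):
    # A single feature separates the classes iff the two value ranges do not overlap.
    return pos.min() > neg.max() or neg.min() > pos.max()

# Estimate p0 for a single random feature; analytically 2 * 5! * 5! / 10! ~ 0.0079.
p0 = np.mean([one_feature_separates(rng.random(n_pos), rng.random(n_neg))
              for _ in range(trials)])
print("estimated p0 =", p0)

for d in [1, 10, 100, 1000]:
    # Probability that at least one of d independent features separates the data.
    print(f"d={d:5d}  P(at least one separating feature) ~ {1 - (1 - p0) ** d:.3f}")
```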

So, some conclusion:

  1. Additional uninformative features make it possible for the SVM to overfit; the more such features there are, the higher the probability of overfitting.
  2. Truly informative features tend to have a much larger margin than uninformative ones, so the SVM still tends to ignore the uninformative features. But this all depends on C: if C is large, which pushes the training error towards zero, the SVM may have to use the uninformative features.
  3. Take-home message: additional features enlarge the space in which the SVM could overfit, but the SVM still tends not to go into that space; the tendency is controlled by C.

Some remaining questions:

  1. What are the theoretical bounds for SVM?
  2. What is the probability that there exists a hyperplane that can split a dataset $W$? (Stackoverflow link)

## Reference

  1. Wikipedia Support Vector Machine

  2. Lecture 3: SVM dual, kernels and regression

  3. A Tutorial on Support Vector Machines for Pattern Recognition