Part 1 discussed the mathematical formulation of the separating hyperplane and the optimization problem. This part focuses on non-linear decision boundaries with SVM, concentrating on concepts rather than applications.

Let us assume a p-dimensional dataset with two classes whose true decision boundary is non-linear, so the data is not linearly separable. The simplest way to handle such data with a linear model is to transform the predictors into higher-order terms, thereby enlarging the feature space.

To understand the above statement, let us assume the equation of a circle:

Equation (1)
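For concreteness, take a circle of radius r centered at the origin:

x_1^2 + x_2^2 = r^2

Points with x_1^2 + x_2^2 < r^2 lie inside the circle; the rest lie outside it.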

Now, data with such a circular decision boundary looks as depicted in Image (1). Read the plot as follows:
1. The coordinates highlighted in red are support vectors.
2. The coordinates highlighted in green and blue belong to two separate classes.

Image (1)

The data is clearly not linearly separable. However, if we transform the feature space with higher-order (quadratic) terms, it takes the following form.

Image (2)

Isn’t it exciting to see how transforming the feature space to a higher order converts a non-linear problem into a linear one?
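To make this concrete, here is a minimal Python sketch (assuming scikit-learn and a synthetic dataset, neither of which appears in the original article): a linear classifier cannot separate points inside a circle from points outside it in the original coordinates, but it can once the quadratic terms are added as extra features.

import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(400, 2))               # original features (x1, x2)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)   # circular decision boundary

# Enlarge the feature space with quadratic terms: (x1, x2, x1^2, x2^2)
X_quad = np.hstack([X, X ** 2])

linear_original = LinearSVC(C=1.0, max_iter=20000).fit(X, y)
linear_enlarged = LinearSVC(C=1.0, max_iter=20000).fit(X_quad, y)

print("accuracy in original space:", linear_original.score(X, y))       # roughly the majority-class rate
print("accuracy in enlarged space:", linear_enlarged.score(X_quad, y))  # close to 1.0

In the enlarged space the true boundary x_1^2 + x_2^2 = r^2 is a linear function of the features, which is why the second fit succeeds.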

Support Vector Machine

In the previous section, we discussed how enlarging the feature space converts a non-linear problem into a linear one. Another way of enlarging the feature space is through a kernel function. The support vector machine is an extension of the support vector classifier that results from enlarging the feature space in this way.

The linear support vector classifier can be represented as

Equation (2)
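In the usual notation this takes the form

f(x) = \beta_0 + \sum_{i=1}^{n} \alpha_i \langle x, x_i \rangle

with one parameter \alpha_i per training observation and an intercept \beta_0.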

Equation (2) can be thought of as the optimized (solved) form of Equation (4) discussed in Part 1.

<x, xi> represents the inner product between a new observation x and the i-th training observation xi; estimating the parameters requires the inner products between all pairs of training observations.

As discussed earlier, the solution is determined only by the support vectors: it can be shown that αi is non-zero only for the support vectors. Therefore, Equation (2) simplifies as follows:

Equation (3)
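In the usual notation,

f(x) = \beta_0 + \sum_{i \in S} \alpha_i \langle x, x_i \rangle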

where S is the set of support points.

Now, the inner product <x, xi> can be generalized to a kernel function K(x, xi), that is, some function that quantifies the similarity of two observations.
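With this replacement the classifier becomes

f(x) = \beta_0 + \sum_{i \in S} \alpha_i K(x, x_i)

and a support vector classifier combined with a non-linear kernel of this kind is what we call a support vector machine.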

Examples of Kernel Functions

Linear Kernel: It essentially quantifies the similarity of a pair of observations using their Pearson (standard) correlation.

Equation (4): Linear Kernel
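In standard notation,

K(x_i, x_{i'}) = \sum_{j=1}^{p} x_{ij} x_{i'j}

which is simply the inner product of the two observations, so this kernel gives back the linear support vector classifier.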

Polynomial Kernel: It amounts to fitting a support vector classifier in a higher-dimensional space, much like the circle example discussed earlier.

Equation (5): Polynomial Kernel
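In standard notation, for a positive integer degree d,

K(x_i, x_{i'}) = \left( 1 + \sum_{j=1}^{p} x_{ij} x_{i'j} \right)^d

so fitting with this kernel is equivalent to fitting a support vector classifier in a space of polynomial terms of degree d.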

Radial Kernel: Looking at Equation (6) below, you can see that it takes a very small value when two observations are far from one another. This means that training observations far away from a test observation have little to no impact on its prediction, so this kernel has very local behaviour.

Equation (6): Radial Basis Kernel
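In standard notation, with a positive constant \gamma,

K(x_i, x_{i'}) = \exp\left( -\gamma \sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \right)

so the further apart two observations are in Euclidean distance, the closer the kernel value is to zero.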

Explicitly computing coordinates in an enlarged feature space can be very computationally intensive; kernel functions reduce this cost because they only require evaluating K(xi, xi') for pairs of observations, never the enlarged features themselves.
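As a quick sketch of how these kernels are used in practice (assuming scikit-learn, which is not referenced in this article), the SVC estimator lets us switch among the linear, polynomial and radial kernels without ever constructing the enlarged feature space:

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-class data with a circular (non-linear) decision boundary
X, y = make_circles(n_samples=400, factor=0.3, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel, params in [("linear", {}),
                       ("poly", {"degree": 2}),   # polynomial kernel with d = 2
                       ("rbf", {"gamma": 1.0})]:  # radial basis kernel
    clf = SVC(kernel=kernel, C=1.0, **params).fit(X_train, y_train)
    print(kernel,
          "test accuracy:", round(clf.score(X_test, y_test), 3),
          "support vectors:", int(clf.n_support_.sum()))

On this data, the polynomial and radial kernels can capture the circular boundary, whereas the linear kernel cannot.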

Cheers!

Happy Learning!


Anant Kumar

Machine Learning & Deep Learning Practitioner | Learning is Continuous | Github : https://github.com/anant-kumar-0308