An Intuitive Guide to understand Support Vector Machine — Part 1
TL;DR
The topic is covered in 3-part series:
- An Intuitive Guide to understand Support Vector Machine — Part 1 — Concepts behind SVM.
- An Intuitive Guide to understand Support Vector Machine — Part 2 — Support Vector Machines.
- An Intuitive Guide to understand Support Vector Machine — Part 3 — Multiclass Classification with SVM using Python Code.
What is a Hyperplane?
A hyperplane is a flat affine subspace (an affine subspace need not pass through the origin). In a p-dimensional space, a hyperplane has dimension p-1; hence, a hyperplane is a line in 2-D space and a plane in 3-D space. In a p-dimensional setting:
β0 + β1X1 + β2X2 + β3X3 + … + βpXp = 0 … (1)
β0 + β1X1 + β2X2 + β3X3 + … + βpXp < 0 … (2)
β0 + β1X1 + β2X2 + β3X3 + … + βpXp > 0 … (3)
Equation (1) represents the hyperplane, which divides the p-dimensional space into two halves. Equations (2) and (3) identify on which side of the hyperplane a given point lies.
Classification using a Separating Hyperplane
Let us review equations (2) and (3). Observe that the hyperplane can be used to classify two classes: evaluating the left-hand side at a test observation Xk tells us, via its sign, on which side of the hyperplane the observation lies, and, via its magnitude, how far the observation is from the hyperplane. The larger the magnitude, the more confident we can be in the classification.
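To make this concrete, here is a minimal sketch in Python; the coefficients and the test point are made up purely for illustration:

```python
import numpy as np

# Hypothetical coefficients of a 2-D hyperplane: beta0 + beta1*X1 + beta2*X2 = 0.
# These values are made up purely for illustration.
beta0 = -1.0
beta = np.array([2.0, 3.0])

def classify(x):
    """Evaluate f(x) = beta0 + beta . x: the sign gives the side of the
    hyperplane (equations (2) and (3)), and the magnitude, scaled by
    ||beta||, gives the perpendicular distance from the hyperplane."""
    f = beta0 + beta @ x
    side = np.sign(f)
    distance = abs(f) / np.linalg.norm(beta)
    return side, distance

x_test = np.array([1.5, 0.5])   # hypothetical test observation Xk
side, dist = classify(x_test)
print(f"side: {side:+.0f}, distance from hyperplane: {dist:.3f}")
```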
Have you wondered how we identify the coefficients of this hyperplane? And how do we decide which separating hyperplane is best suited for the given data, when there are infinitely many such hyperplanes?
Let us answer these questions!
Maximal Margin Classifier
The maximal margin classifier is the separating hyperplane that is farthest from the training observations. It is identified by calculating the perpendicular distance from each training observation to a candidate hyperplane. The smallest such distance (let us call it min_dist) is called the margin, and the maximal margin classifier picks the hyperplane whose margin is the largest. This might seem confusing, but it becomes quite simple once we get to the concepts of support vectors and margins.
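To make the definition concrete, the sketch below computes the margin of two candidate hyperplanes on a toy dataset (both the data and the candidate coefficients are assumptions chosen for illustration); the maximal margin classifier is the candidate for which this value is largest:

```python
import numpy as np

# Toy training data: two linearly separable classes in 2-D (made up for illustration).
X = np.array([[1.0, 1.0], [2.0, 0.5], [3.0, 3.5], [4.0, 3.0]])
y = np.array([-1, -1, 1, 1])

def margin(beta0, beta, X):
    """Smallest perpendicular distance from the training points to the
    hyperplane beta0 + beta . x = 0 (the 'min_dist' in the text)."""
    distances = np.abs(beta0 + X @ beta) / np.linalg.norm(beta)
    return distances.min()

# Two candidate separating hyperplanes; the maximal margin classifier
# prefers whichever one has the larger margin.
print("candidate A margin:", margin(-4.0, np.array([0.0, 2.0]), X))
print("candidate B margin:", margin(-5.0, np.array([1.0, 1.0]), X))
```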
Support Vectors
Support vectors are the data points that lie closest to the separating hyperplane. The decision function of the separating hyperplane is specified by a small subset of the training samples: the support vectors. Tweaking or modifying any other record in the training sample won't impact the model unless that observation moves inside the margin or its class label (target) changes. Simply put, SVM models depend entirely on the support vectors, and changing these vectors will change the formulation of the separating hyperplane.
Training an SVM is computationally intensive: roughly O(n²), and depending upon the kernel (to be discussed later) it can even reach a complexity of O(n³). The cost increases substantially as the dimensionality and the number of training examples increase.
Let us consider a simple example to understand SVM intuitively: imagine a dataset with two well-separated classes of points in a 2-D plane. In linear regression, tree-based models, or even neural networks, all points contribute to the final formulation of the model. SVMs, however, depend only on the points that are difficult to classify, i.e., the points that lie close to the separating hyperplane. These points are called support vectors, and they can be found using the Lagrange multiplier method.
With the above method, you'd end up with the support vectors that define the separating hyperplane. There is, however, another way to solve the SVM optimization problem, using the Karush-Kuhn-Tucker (KKT) conditions. I've found the reference svmrecita.ppt (mit.edu) easy to follow for the mathematical concepts behind SVM.
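In practice we rarely solve the Lagrangian by hand; a library such as scikit-learn does it for us. Here is a minimal sketch that fits a linear SVM on toy data (the dataset is made up for illustration) and inspects the resulting support vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D dataset with two classes (made up for illustration).
X = np.array([[1.0, 1.0], [2.0, 0.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 3.5], [4.5, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear")
clf.fit(X, y)

# Only these points determine the separating hyperplane; moving any other
# training point (as long as it stays outside the margin) leaves the model unchanged.
print("Support vectors:\n", clf.support_vectors_)
print("Coefficients (beta):", clf.coef_, "Intercept (beta0):", clf.intercept_)
```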
Optimization problem
The hyperplane is the solution to the following maximization problem:

maximize M over β0, β1, …, βp, ϵ1, …, ϵn … (4)

The following constraints are further applied to Equation (4):

β1² + β2² + … + βp² = 1
yi (β0 + β1Xi1 + β2Xi2 + … + βpXip) ≥ M(1 − ϵi), for every observation i = 1, …, n
ϵi ≥ 0, and ϵ1 + ϵ2 + … + ϵn ≤ C
where,
C is a non-negative tuning parameter
M is the width of the margin
ϵi is the slack variable for the i-th observation (discussed in detail below)
yi ∈ {−1, +1} is the class label of the i-th training observation
The slack variable tells us where an observation is located relative to the hyperplane and the margin. It has the following characteristics (illustrated in the sketch after this list):
1. ϵ = 0: the observation is on the correct side of the margin.
2. 0 < ϵ < 1: the observation is on the wrong side of the margin but on the correct side of the hyperplane.
3. ϵ > 1: the observation is on the wrong side of the hyperplane.
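In the commonly used normalized form of the problem, where the margin boundaries sit at f(x) = ±1, the slack of each observation can be read directly from its decision value. The sketch below uses made-up decision values and labels to illustrate the three cases:

```python
import numpy as np

# Hypothetical decision values f(x_i) = beta0 + beta . x_i and labels y_i,
# assuming the margin boundaries are normalized to f(x) = +/-1.
f_values = np.array([2.3, 0.4, -0.7, 1.1])
labels = np.array([1, 1, 1, -1])

# Slack: eps = 0 on the correct side of the margin, 0 < eps < 1 on the wrong
# side of the margin but the correct side of the hyperplane, eps > 1 on the
# wrong side of the hyperplane.
slack = np.maximum(0.0, 1.0 - labels * f_values)
for f, y_i, e in zip(f_values, labels, slack):
    print(f"f(x) = {f:+.1f}, y = {int(y_i):+d}, slack = {e:.1f}")
```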
C is the tuning parameter and can be thought of as a budget for how much the margin can be violated by the n observations. For example, if C is 1, then at most one observation can be on the wrong side of the separating hyperplane. Thinking intuitively, a small C means the model is less tolerant of violations in classification, which results in a classifier with low bias but high variance. Think about what would happen if C were large!
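If you want to experiment with this trade-off in code, note that scikit-learn's C parameter penalizes margin violations and therefore behaves roughly as the inverse of the budget C described above: a large scikit-learn C means a stricter margin. A quick sketch on toy data (made up for illustration):

```python
import numpy as np
from sklearn.svm import SVC

# Toy, slightly overlapping 2-D data (made up for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(2.0, 1.0, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

# In scikit-learn, a larger C means a stricter margin (fewer violations
# tolerated), which typically gives fewer support vectors, lower bias,
# and higher variance.
for c in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=c).fit(X, y)
    print(f"C = {c:>6}: {clf.n_support_.sum()} support vectors")
```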
Cheers!
Happy Learning!