
Support Vector Machines (SVM) Explained - Python Sklearn

Hi everyone! In this article we will discover the power of support vector machines. We will cover both classification and regression, along with the mathematics behind the algorithm, going in depth on the parts that are most likely to come up in practical use. At the end of the article we will code a project using Python and scikit-learn.

Flow of Article:

  1. What is Support Vector Machine? 
  2. Mathematics behind it !
  3. Classification and Regression with SVM
  4. Strength and Weakness
  5. Python project with plots!
  6. Interview Questions

You may also want to explore KNN, Random Forest, Logistic Regression, Best 10 Regression Model Coded, Linear Regression, Transfer Learning using Regression, or Automated EDA.

 

What is a Support Vector Machine (SVM)?

A Support Vector Machine (SVM) is a powerful supervised machine learning algorithm that is used for classification and regression tasks. The primary objective of SVM is to find the optimal hyperplane that best separates data points of different classes in a high-dimensional space. The term “support vector” refers to the data points that lie closest to the decision boundary (hyperplane) and play a crucial role in determining its position.
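
As a minimal sketch (on a tiny made-up toy dataset, not part of the project later in this article), scikit-learn exposes these support vectors directly on a fitted model:

import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [1, 1], [1, 0], [0, 1], [2, 2], [3, 3]])
y = np.array([0, 0, 0, 0, 1, 1])

clf = SVC(kernel='linear').fit(X, y)
print(clf.support_vectors_)  # the training points closest to the decision boundary
print(clf.n_support_)        # number of support vectors per class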

Mathematics Behind Support Vector Machine:

Now, let’s delve into the mathematical underpinnings of SVM.

1. Hyperplane:

In a binary classification scenario, the SVM seeks to find a hyperplane that maximally separates two classes. The hyperplane is a decision boundary that divides the feature space into two regions corresponding to the two classes. Mathematically, a hyperplane in an n-dimensional space is represented by the equation w⋅x−b=0, where w is the weight vector, x is the input feature vector, and b is the bias. In two dimensions this hyperplane is simply a straight line separating the two classes.
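
As a small illustration (a sketch on an assumed blob dataset, not the article’s project), the hyperplane parameters can be read off a fitted linear SVC in scikit-learn. Note that scikit-learn parameterizes the boundary as w⋅x + b = 0 with b = intercept_, so only the sign convention of the bias differs from the equation above:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two linearly separable blobs
X, y = make_blobs(n_samples=50, centers=2, random_state=0)

clf = SVC(kernel='linear', C=1.0).fit(X, y)
w = clf.coef_[0]        # weight vector w
b = clf.intercept_[0]   # bias term (scikit-learn's sign convention)
print("w =", w, "b =", b)

# Manually computed decision values match clf.decision_function
print(np.allclose(X[:3] @ w + b, clf.decision_function(X[:3])))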

2. Margin:

SVM not only finds a hyperplane but also aims to maximize the margin, which is the distance between the hyperplane and the nearest data points (support vectors) of each class. The larger the margin, the better the generalization to new, unseen data.
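
As a quick sketch (assuming a well-separated toy dataset), the margin width of a linear SVM can be computed directly from the learned weight vector, since the distance between the two margin boundaries is 2/‖w‖:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Well-separated toy data; a very large C approximates a hard margin
X, y = make_blobs(n_samples=60, centers=2, cluster_std=0.6, random_state=0)
clf = SVC(kernel='linear', C=1e6).fit(X, y)

margin_width = 2.0 / np.linalg.norm(clf.coef_[0])  # distance between the margin boundaries
print(f"Margin width: {margin_width:.3f}")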

3. Soft Margin Objective Function:

In real-world scenarios we do not want to model the training data perfectly; instead, we regularize the model so that it learns patterns that generalize well to new data. This is what the soft margin SVM is for, as explained further below.

The soft margin objective function is: minimize (1/2)‖w‖² + C Σᵢ ξᵢ over w, b, and ξ, subject to yᵢ(w⋅xᵢ − b) ≥ 1 − ξᵢ and ξᵢ ≥ 0 for i = 1, …, N. Here C is the regularization parameter controlling the trade-off between maximizing the margin and allowing for misclassifications, ξᵢ are slack variables representing the degree of misclassification, and N is the number of data points.
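
To make the slack variables concrete, here is a small sketch (using an assumed synthetic dataset) that computes ξᵢ = max(0, 1 − yᵢ f(xᵢ)) for a fitted linear soft-margin SVM, with the labels mapped to ±1 to match the formulation above:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=2, n_informative=2,
                           n_redundant=0, class_sep=0.8, random_state=42)
y_pm = np.where(y == 1, 1, -1)  # map labels to -1/+1

clf = SVC(kernel='linear', C=1.0).fit(X, y_pm)
f = clf.decision_function(X)
xi = np.maximum(0, 1 - y_pm * f)  # slack: zero for points correctly outside the margin
print("Points with non-zero slack:", int(np.sum(xi > 1e-6)))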

4. Kernel Trick:

Many datasets are not linearly separable in their original feature space. The kernel trick lets SVM implicitly map the inputs into a higher-dimensional space, where a separating hyperplane can be found, without ever computing that mapping explicitly. Common kernels include linear, polynomial, radial basis function (RBF), and sigmoid.
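
In scikit-learn the kernel trick is simply the kernel argument of SVC. A short sketch on an assumed concentric-circles dataset shows a linear kernel struggling while an RBF kernel separates the classes:

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original 2D space
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for kernel in ('linear', 'rbf'):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, 'accuracy:', round(clf.score(X_test, y_test), 2))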

How Classification and Regression Work in Support Vector Machines:

 

Classification:

Decision Function: For classification, the SVM uses a decision function that assigns a new data point to one of the two classes based on which side of the hyperplane it falls. The decision function is f(x) = w⋅x − b, and the sign of f(x) determines the class.
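
As a brief sketch (on an assumed two-blob dataset with 0/1 labels, where scikit-learn maps a positive score to class 1):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel='linear').fit(X, y)

scores = clf.decision_function(X[:5])   # signed values f(x)
print((scores > 0).astype(int))         # side of the hyperplane
print(clf.predict(X[:5]))               # matches the predicted class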

Soft Margin SVM: In real-world scenarios, data may not be perfectly separable. SVM accommodates this through a concept called soft margin. The soft margin allows for some misclassification to handle noisy data or overlapping classes. The trade-off between a larger margin and allowing misclassifications is controlled by a regularization parameter (C).
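
A quick way to see the effect of C (a sketch on an assumed synthetic dataset): a smaller C gives a softer margin, which typically leaves more points inside the margin and therefore more support vectors:

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, class_sep=1.0, random_state=0)

for C in (0.01, 1, 100):
    clf = SVC(kernel='rbf', C=C).fit(X, y)
    print(f"C={C}: {clf.n_support_.sum()} support vectors")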

Regression:

Support Vector Regression (SVR): SVM can also be used for regression tasks. In SVR, the goal is to fit as many data points as possible within a specified margin while minimizing the error. The margin in regression is an epsilon-tube around the predicted values.

Loss Function: The loss function in SVR penalizes deviations from the target variable, and the optimization objective involves finding a hyperplane that fits the data within the specified margin.
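
A minimal SVR sketch on assumed noisy sine data, where epsilon sets the width of the tube within which errors are ignored:

import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)

svr = SVR(kernel='rbf', C=10.0, epsilon=0.1).fit(X, y)
print("Support vectors used:", len(svr.support_))
print("Prediction at x=2.5:", svr.predict([[2.5]]))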

 

Strengths:

  1. Effective in High-Dimensional Spaces: SVM performs well in high-dimensional spaces, making it suitable for problems with a large number of features, such as image classification or text categorization.

  2. Robust to Overfitting: SVM is less prone to overfitting, especially in high-dimensional spaces, due to the margin maximization objective. The margin helps generalize the model to unseen data.

  3. Versatility through Kernels: The use of kernel functions allows SVM to handle non-linear decision boundaries effectively. Common kernels include linear, polynomial, radial basis function (RBF), and sigmoid, providing flexibility in modeling complex relationships.

  4. Global Optimality: SVM aims to find the global minimum of the optimization problem, ensuring that the solution is optimal and not sensitive to initialization.

  5. Effective for Small and Medium-Sized Datasets: SVM can perform well with small to medium-sized datasets, where it efficiently finds a clear margin between classes.

  6. Handles Imbalanced Data: SVM can handle imbalanced datasets by adjusting class weights, ensuring that it doesn’t overly favor the majority class.

Weaknesses:

  1. Computational Intensity: Training an SVM can be computationally intensive, especially as the size of the dataset grows. Training time typically scales between quadratically and cubically with the number of samples, making it less efficient for very large datasets.

  2. Memory Requirements: SVMs can have high memory requirements, particularly when dealing with large datasets or using complex kernels. This can limit their applicability in memory-constrained environments.

  3. Sensitivity to Noise: SVMs can be sensitive to noise in the dataset, especially when using a small-margin classifier. Noisy data or outliers can significantly impact the position and orientation of the decision boundary.

  4. Choice of Kernel: The choice of the kernel and its parameters can significantly affect the performance of SVM. It requires careful tuning, and the best choice may depend on the specific characteristics of the data.

  5. Interpretability: SVMs, especially when using non-linear kernels, might be less interpretable compared to simpler models like decision trees or logistic regression. Understanding the impact of individual features on the decision boundary can be challenging.

  6. Limited to Binary Classification: Traditional SVMs are designed for binary classification. While there are extensions for multi-class problems, such as one-vs-one or one-vs-rest schemes (a scikit-learn sketch follows this list), they may not be as straightforward as other algorithms like decision trees or random forests.
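
In practice, scikit-learn’s SVC handles multi-class problems automatically with a one-vs-one scheme; a short sketch on the Iris dataset (three classes) illustrates this:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

clf = SVC(kernel='rbf').fit(X_train, y_train)  # one-vs-one is applied internally
print("Classes:", clf.classes_)
print("Test accuracy:", round(clf.score(X_test, y_test), 2))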

Python Code Implementation

Below is a simple example of training a Support Vector Machine (SVM) using Python and scikit-learn on a synthetic dataset. The code demonstrates both hard margin and soft margin scenarios. For simplicity, a two-dimensional dataset is used.

# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Create a synthetic dataset
X, y = datasets.make_classification(n_samples=300, n_features=2, n_classes=2, n_informative=2, n_redundant=0, random_state=42)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Function to visualize the decision boundary
def plot_decision_boundary(X, y, model, title):
    h = 0.02  # step size in the mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm)
    plt.title(title)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.show()

# Train SVM with hard margin
svm_hard_margin = SVC(C=1e8)
svm_hard_margin.fit(X_train, y_train)

# Predictions
y_pred_hard_margin = svm_hard_margin.predict(X_test)

# Visualize decision boundary for hard margin
plot_decision_boundary(X_train, y_train, svm_hard_margin, 'SVM with Hard Margin')

# Evaluate accuracy for hard margin
accuracy_hard_margin = accuracy_score(y_test, y_pred_hard_margin)
print(f'Accuracy with Hard Margin: {accuracy_hard_margin:.2f}')

# Train SVM with soft margin
svm_soft_margin = SVC(C=0.1)
svm_soft_margin.fit(X_train, y_train)

# Predictions
y_pred_soft_margin = svm_soft_margin.predict(X_test)

# Visualize decision boundary for soft margin
plot_decision_boundary(X_train, y_train, svm_soft_margin, 'SVM with Soft Margin')

# Evaluate accuracy for soft margin
accuracy_soft_margin = accuracy_score(y_test, y_pred_soft_margin)
print(f'Accuracy with Soft Margin: {accuracy_soft_margin:.2f}')

This code creates a synthetic dataset, splits it into training and testing sets, and then trains two SVM models: one with a hard margin (large C) and one with a soft margin (C=0.1). It visualizes the decision boundaries and evaluates the accuracy of both models. 

As we can see, in real-world scenarios we typically want a smaller C, which lets the SVM model the data more realistically and avoid overfitting.

Make sure to install scikit-learn (pip install scikit-learn) if you haven’t already. Additionally, note that the choice of the dataset and parameters is for demonstration purposes, and in a real-world scenario, you would adapt the code to your specific data and requirements.

 

Top 10 commonly asked questions and their answers related to Support Vector Machines (SVM) in interviews:

1. What is a Support Vector Machine (SVM)? 

Support Vector Machine (SVM) is a supervised machine learning algorithm used for classification and regression tasks. It finds an optimal hyperplane that maximally separates data points of different classes in a high-dimensional space.

2. How does SVM handle non-linear data?

SVM handles non-linear data using the “kernel trick.” It transforms the input features into a higher-dimensional space, allowing the algorithm to find a hyperplane in that space, even if the original data is not linearly separable.

3. What is a hyperplane in SVM?

A hyperplane in SVM is a decision boundary that separates data points of different classes in the feature space. In a two-dimensional space, a hyperplane is a line; in three dimensions, it’s a plane, and so on.

4. Explain the concept of a margin in SVM.

The margin in SVM is the distance between the hyperplane and the nearest data points (support vectors) of each class. SVM aims to maximize this margin, as a larger margin generally leads to better generalization to unseen data.

5. What is the kernel trick in SVM, and why is it used?

The kernel trick in SVM involves using a function (kernel) to transform input features into a higher-dimensional space. It allows SVM to handle non-linear relationships by finding a hyperplane in the transformed space. Common kernels include linear, polynomial, radial basis function (RBF), and sigmoid.

6. Explain the difference between hard margin and soft margin SVM.

Hard margin SVM aims to find a hyperplane with the maximum margin, not allowing any misclassifications. Soft margin SVM allows for some misclassifications to handle noisy data or overlapping classes. The trade-off between a larger margin and allowing misclassifications is controlled by a regularization parameter (C).

7. What is the role of support vectors in SVM?

Support vectors are the data points that lie closest to the decision boundary (hyperplane) and influence its position and orientation. They play a crucial role in determining the optimal hyperplane and defining the margin.

8. How does SVM handle imbalanced datasets?

SVM can handle imbalanced datasets by adjusting class weights. The class_weight parameter is used to give different weights to different classes, ensuring that the model does not overly favor the majority class.
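
A short sketch on an assumed imbalanced synthetic dataset (roughly a 90/10 split), using class_weight='balanced' so the penalty C is re-weighted inversely to class frequencies:

from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

clf = SVC(class_weight='balanced').fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))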

9. What are the advantages of SVM over other classification algorithms?

Advantages of SVM include its effectiveness in high-dimensional spaces, robustness to overfitting, versatility through the use of kernels, and the ability to handle non-linear relationships.

10. What are some limitations of SVM?

Limitations of SVM include its computational intensity, sensitivity to noise, the need for careful tuning of hyperparameters, and difficulty in interpreting the model, especially when using non-linear kernels. Additionally, SVM is traditionally designed for binary classification, and extensions are needed for multi-class problems.