Understanding K Nearest Neighbour (KNN) - Python Sklearn
Welcome, data enthusiasts! Today, we’re diving into the fascinating realm of the K Nearest Neighbors (KNN) algorithm – a powerful tool in the data scientist’s toolkit. In this article, we’ll unravel the mysteries behind KNN, exploring its essence, the mathematics governing its operations, and understanding the strengths and weaknesses that make it a unique player in the field of machine learning. At the end of the article we will code a project using Python and scikit-learn.
Flow of Article:
- What is the K Nearest Neighbours algorithm?
- Mathematics behind it!
- Types of distances used in KNN
- Classification and Regression with KNN
- Strengths and Weaknesses
- Python project
- Interview Questions
You may also want to explore Random Forest, Logistic Regression, Best 10 Regression Model Coded, Linear Regression, Transfer Learning using Regression, or Automated EDA.
What is K Nearest Neighbours?
KNN, short for K Nearest Neighbours, is a simple yet potent algorithm used for both classification and regression tasks. At its core, KNN makes predictions based on the majority class or average of the K-nearest data points. Picture this: your data points are scattered across a feature space, and KNN draws insights by examining the characteristics of its closest neighbours.

Mathematics behind it!
Now, let’s delve into the mathematical underpinnings of KNN. From the way the algorithm works, we know that the nearest data points are chosen to determine the class or value of a new data point. The mathematics lies in identifying these nearest points: we need a distance metric to decide which data points count as ‘nearby’, and the choice of that metric is crucial in KNN.
Distance Metrics with Use Cases:
1. Euclidean Distance
- Scenario: Imagine you have data points representing the locations of houses in a city with two features: square footage and number of bedrooms.
- Explanation: Euclidean distance would be suitable here as it measures the straight-line distance between two houses in the 2D space of square footage and bedrooms.
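In formula terms, for two points x and y with n features, the Euclidean distance is
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}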

2. Manhattan Distance (L1 Norm):
- Scenario: Consider a delivery truck navigating a city grid to deliver packages. The truck can only move along streets, not diagonally.
- Explanation: Manhattan distance is appropriate in this scenario because it measures the distance traveled along the streets of the city, aligning with the grid-like movement of the truck.
Manhattan distance, also known as L1 norm, is the sum of the absolute differences between the corresponding coordinates of two points. It represents the distance traveled along the grid-like streets of a city (hence, “Manhattan”).
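As a formula,
d(x, y) = \sum_{i=1}^{n} |x_i - y_i|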

3. Minkowski Distance:
- Scenario: You are analyzing a dataset with multiple features, including temperature, humidity, and wind speed, to predict the likelihood of rain.
- Explanation: Minkowski distance with p=2 (Euclidean distance) might be suitable here, as it considers the overall spatial relationship between the data points in the multi-dimensional feature space.
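Minkowski distance generalizes the two metrics above. For an order parameter p,
d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}
With p = 1 it reduces to Manhattan distance, and with p = 2 it reduces to Euclidean distance.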

4. Chebyshev Distance (Maximum Norm):
- Scenario: In a board game, you want to measure the maximum number of moves a piece can make to reach any position on the board.
- Explanation: Chebyshev distance is appropriate in this case as it calculates the maximum difference along any dimension, similar to measuring the maximum moves in different directions on the game board.
Chebyshev distance, also known as Maximum norm, calculates the maximum absolute difference between the corresponding coordinates of two points. It represents the longest distance along any dimension.
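As a formula,
d(x, y) = \max_{i} |x_i - y_i|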

5. Hamming Distance (for binary data):
- Scenario: You are comparing DNA sequences represented in binary, where each bit represents a nucleotide.
- Explanation: Hamming distance is suitable for comparing binary strings such as DNA sequences, where it counts the positions where the nucleotides differ.
Hamming distance is used for binary data and counts the positions at which the bits (symbols) of two equal-length binary strings differ.
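As a formula, for two binary strings of equal length n,
d(x, y) = \sum_{i=1}^{n} [x_i \neq y_i]
where the bracket equals 1 when the symbols at position i differ and 0 otherwise (some libraries, including scipy, report this count divided by n, i.e. the fraction of mismatching positions).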

6. Cosine Similarity (for Vectors):
- Scenario: You have text documents represented as term frequency vectors, and you want to measure their similarity.
- Explanation: Cosine similarity is appropriate for text data represented as vectors because it considers the angle between the vectors, making it robust to the overall magnitude of the vectors and focusing on the directionality of the information.
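As a formula, the cosine similarity of two vectors x and y is
\text{similarity}(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}
and the corresponding cosine distance used in a KNN setting is 1 minus this value.
To make the metrics above concrete, here is a minimal sketch that computes each of them on two toy vectors using scipy (a dependency of scikit-learn); the numbers are purely illustrative:
# Compute the distance metrics discussed above on two toy vectors
import numpy as np
from scipy.spatial import distance
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])
print("Euclidean:", distance.euclidean(a, b))           # straight-line distance
print("Manhattan:", distance.cityblock(a, b))           # sum of absolute differences
print("Minkowski (p=3):", distance.minkowski(a, b, p=3))
print("Chebyshev:", distance.chebyshev(a, b))           # largest per-coordinate difference
# Hamming works on equal-length bit/symbol sequences; scipy returns the fraction of mismatches
s1 = np.array([1, 0, 1, 1])
s2 = np.array([1, 1, 0, 1])
print("Hamming (fraction):", distance.hamming(s1, s2))
# scipy's cosine() returns the cosine distance, i.e. 1 - cosine similarity
print("Cosine similarity:", 1 - distance.cosine(a, b))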

Choosing the Value of K:
The selection of K influences the model’s performance. A small K may lead to a noisy model, while a large K might result in an oversimplified model. Cross-validation is often employed to find the optimal K.
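A minimal sketch of that cross-validation step with scikit-learn might look like the following (the Iris data and the 1-30 search range are just illustrative choices):
# Illustrative sketch: choosing K with 5-fold cross-validation
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
X, y = load_iris(return_X_y=True)
param_grid = {"n_neighbors": list(range(1, 31))}  # candidate values of K
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X, y)
print("Best K:", grid.best_params_["n_neighbors"])
print("Best cross-validated accuracy:", grid.best_score_)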
KNN for Classification:
K Nearest Neighbors (KNN) in classification works like a friendly neighbor survey. Imagine each data point as a house, and KNN looks at the types of houses (classes) in its neighborhood. When a new house arrives, it checks its K nearest neighbors and adopts the most popular class in the neighborhood. For instance, if the majority of the three closest houses are ‘Iris-setosa,’ the new house is classified as ‘Iris-setosa.’ KNN simplifies decision-making by following the majority vote of its closest friends.
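As a tiny illustration of that majority vote (the neighbour labels here are made up for the example):
# Toy illustration of KNN's majority vote among K = 3 neighbours
from collections import Counter
nearest_neighbour_labels = ["Iris-setosa", "Iris-setosa", "Iris-versicolor"]
predicted_class = Counter(nearest_neighbour_labels).most_common(1)[0][0]
print(predicted_class)  # Iris-setosa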
KNN for Regression:
In regression, KNN takes a similar approach but with a different goal – predicting numerical values. Picture the task of estimating the price of a house. KNN identifies the K most similar houses and calculates the average price of these neighbors. This average becomes the predicted price for the new house. KNN’s versatility shines as it adapts to various prediction tasks, providing straightforward solutions based on the collective wisdom of nearby data points.
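A minimal regression sketch with scikit-learn, using a synthetic dataset generated purely for illustration:
# Illustrative sketch: KNN for regression on synthetic data
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
# Generate a toy regression dataset (parameter values are arbitrary)
X, y = make_regression(n_samples=500, n_features=4, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Each prediction is the average target value of the 5 nearest neighbours
knn_regressor = KNeighborsRegressor(n_neighbors=5)
knn_regressor.fit(X_train, y_train)
y_pred = knn_regressor.predict(X_test)
print("Mean absolute error:", mean_absolute_error(y_test, y_pred))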
Strengths:
- Simplicity: KNN is easy to understand and implement.
- No Training Period: KNN is instance-based, requiring no explicit training phase.
- Non-parametric: KNN doesn’t make assumptions about the underlying data distribution.
Weaknesses:
- Computational Intensity: KNN can be computationally expensive, especially with large datasets.
- Sensitivity to Noise: KNN is sensitive to outliers and noisy data.
- Optimal K Selection: Choosing the right K value is crucial, and there’s no one-size-fits-all solution.
Python Code Implementation
Below is Python code using scikit-learn to train and evaluate a K Nearest Neighbors (KNN) classifier. This example uses the famous Iris dataset for a classification task. Make sure you have scikit-learn installed (pip install scikit-learn) before running the code.
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_iris
# Load the Iris dataset
iris = load_iris()
X = iris.data # Features
y = iris.target # Target variable
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize the KNN classifier
knn_classifier = KNeighborsClassifier(n_neighbors=3)  # You can adjust the value of 'n_neighbors'
# Train the KNN classifier on the training data
knn_classifier.fit(X_train, y_train)
# Make predictions on the test data
y_pred = knn_classifier.predict(X_test)
# Evaluate the performance of the KNN classifier
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)
# Print the results
print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:\n", classification_rep)
10 technical questions commonly asked about KNN in interviews:
What is K Nearest Neighbors (KNN)?
- Answer: KNN is a supervised machine learning algorithm used for classification and regression tasks. It classifies a new data point by considering the majority class or averaging the values of its K-nearest neighbors.
How does KNN determine the “nearest” neighbors?
- Answer: KNN uses a distance metric, such as Euclidean or Manhattan distance, to measure the proximity between data points. The K-nearest neighbors are those with the smallest distances to the target point.
What factors should be considered when choosing the value of K in KNN?
- Answer: The choice of K impacts the model’s performance. A small K may lead to a noisy model, while a large K might oversimplify. Cross-validation is often used to find the optimal K for a given dataset.
Explain the curse of dimensionality and its relevance to KNN.
- Answer: The curse of dimensionality refers to the challenges that arise when working with high-dimensional data. In KNN, as the number of dimensions increases, the distance between data points becomes less meaningful, affecting the algorithm’s performance.
What are the different distance metrics used in KNN, and when would you choose one over the other?
- Answer: Common distance metrics include Euclidean, Manhattan, Minkowski, Chebyshev, Hamming, and Cosine Similarity. The choice depends on the nature of the data; for example, Euclidean is suitable for continuous numerical data, while Hamming is used for binary data.
Explain the concept of weighted averaging in KNN.
- Answer: Weighted averaging in KNN assigns different importance to each neighbor based on their distance. Closer neighbors may have higher weights, influencing the prediction more than farther ones. This helps capture the local structure of the data.
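In scikit-learn this corresponds to the weights parameter of KNeighborsClassifier and KNeighborsRegressor; a brief sketch comparing the two settings on Iris (purely illustrative):
# Uniform vote vs. distance-weighted vote
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
X, y = load_iris(return_X_y=True)
uniform_knn = KNeighborsClassifier(n_neighbors=5, weights="uniform")    # every neighbour counts equally
weighted_knn = KNeighborsClassifier(n_neighbors=5, weights="distance")  # closer neighbours count more
print("Uniform weights:", cross_val_score(uniform_knn, X, y, cv=5).mean())
print("Distance weights:", cross_val_score(weighted_knn, X, y, cv=5).mean())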
What are the strengths of the KNN algorithm?
- Answer: KNN is simple, intuitive, and effective for non-linear relationships. It requires no training period, making it suitable for dynamic datasets. It’s also non-parametric, meaning it makes no assumptions about the underlying data distribution.
Discuss the weaknesses of the KNN algorithm.
- Answer: KNN can be computationally intensive, especially with large datasets. It is sensitive to outliers and noisy data, and the optimal choice of K may vary depending on the dataset and problem, requiring careful tuning.
How does KNN handle imbalanced datasets?
- Answer: KNN is sensitive to imbalanced datasets, as the majority class may dominate predictions. Techniques like resampling, adjusting class weights, or using specialized distance metrics can help mitigate this issue.
Can KNN be used for regression tasks, and how does it differ from its classification counterpart?
- Answer: Yes, KNN can be used for regression. In regression, KNN predicts a continuous value by averaging or weighted averaging the target variable values of its K-nearest neighbors. The main difference lies in the nature of the predicted output.