Random Forest - Classification and Regression - Explained using Python Sklearn
Today, let’s embark on a journey to explore one of the most powerful and versatile algorithms – the Random Forest. This algorithm, known for its prowess in both classification and regression tasks, is a true gem in the vast landscape of data science.
Note: Kindly read the article about Decision Trees before reading this article, for background understanding.
In this blog, we will delve into the fundamentals of Random Forest, explore its foundations, discuss its strengths and weaknesses, and finally showcase its implementation in Python with the scikit-learn library. We will also cover some of the most important interview questions about this algorithm.
Flow of Article:
- What is Random Forest Algorithm?
- How does ensemble prediction work?
- What are Bagging and Boosting?
- Classification and Regression
- Strengths and Weaknesses
- Python project
- Real-life Uses
- Interview Questions
You may also want to explore Logistic Regression, Best 10 Regression Model Coded, Linear Regression, Transfer Learning using Regression, or Automated EDA.
What is a Random Forest?
Random Forest is an ensemble learning method in machine learning that leverages the collective strength of multiple decision trees to enhance predictive accuracy and generalization performance.
- At the core of the Random Forest algorithm is the decision tree, a hierarchical model that makes sequential decisions based on input features to arrive at a final prediction.
- As shown in the figure below, several decision trees each produce their own prediction.
- Random Forest takes this concept further by constructing an ensemble of diverse decision trees, each trained on a random subset of the training data and considering a random subset of features at each split.
- As the figure below shows, the dataset is split into N subsets, and each decision tree is trained on one of these random subsets.
- This inherent randomness injects variability into the individual trees, mitigating the risk of overfitting and improving the model’s robustness.
- The final prediction in a Random Forest is typically determined through a voting mechanism for classification tasks or an averaging process for regression tasks.
This ensemble approach not only imparts resilience to noisy data but also provides a natural means of assessing feature importance within the context of the entire model. The elegance of Random Forest lies in its ability to harness the wisdom of crowds, transforming a collection of relatively simple models into a potent, versatile predictive tool.
How are the predictions made?
Suppose we have N decision trees, each trained on a different subset of the data or using a different subset of features. The predictions of these trees are then aggregated to form the final prediction. In the case of classification, this aggregation is typically done through a majority voting mechanism, while for regression, it’s often an average.
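To make this concrete, here is a minimal sketch of both aggregation rules, using NumPy and hypothetical per-tree predictions for a single input:

import numpy as np

# Hypothetical predictions from N = 5 trees for one input
class_preds = np.array([1, 0, 1, 1, 2])           # classification: predicted class labels
reg_preds = np.array([3.2, 2.9, 3.5, 3.1, 3.0])   # regression: predicted numeric values

# Classification: majority vote -> the class with the most votes
final_class = np.argmax(np.bincount(class_preds))   # -> 1

# Regression: simple average of the individual predictions
final_value = reg_preds.mean()                       # -> 3.14

print(final_class, final_value)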
What are Bagging and Boosting?
Bagging (Bootstrap Aggregating): Bagging is an ensemble learning technique that aims to improve the stability and accuracy of machine learning models by training multiple instances of the same base model on different subsets of the training data. The subsets, known as bootstrap samples, are generated by randomly sampling with replacement from the original dataset. Each base model is trained independently on a distinct bootstrap sample, and their predictions are combined through averaging (in regression) or voting (in classification) to form the final ensemble prediction. Random forest is a bagging model.
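To make the bootstrap idea concrete, here is a small sketch (reusing the iris dataset that appears later in this article) that draws samples with replacement, trains one decision tree per sample, and combines them by majority vote. scikit-learn's BaggingClassifier and RandomForestClassifier automate essentially this procedure.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)

trees = []
for _ in range(10):
    # Bootstrap sample: draw len(X) row indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Aggregate by majority vote across the 10 trees
all_preds = np.array([t.predict(X) for t in trees])   # shape: (10, n_samples)
ensemble_pred = np.array(
    [np.bincount(all_preds[:, i]).argmax() for i in range(all_preds.shape[1])]
)
print(ensemble_pred[:5])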
Boosting: Boosting is another ensemble learning technique, but unlike bagging, it focuses on improving model performance by sequentially training weak learners, where each subsequent learner corrects the errors of its predecessor. In boosting, each instance in the training set is assigned a weight, and misclassified instances receive higher weights. The algorithm aims to emphasize the importance of misclassified instances, allowing subsequent models to focus on these challenging cases.
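For comparison, the sketch below (again assuming the iris data) trains a bagging-style ensemble and a boosting ensemble side by side with scikit-learn's built-in estimators:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Bagging-style ensemble: trees trained independently on bootstrap samples
bagging_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Boosting: weak learners trained sequentially, re-weighting misclassified samples
boosting_model = AdaBoostClassifier(n_estimators=100, random_state=42)

for name, model in [("Random Forest (bagging)", bagging_model), ("AdaBoost (boosting)", boosting_model)]:
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.2f}")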
Difference between Regressor and Classifier
In classification tasks, where the goal is to assign input data points to predefined classes, the Random Forest algorithm excels in aggregating the outputs of its constituent decision trees. Each decision tree in the forest independently predicts the class of a given input, and the final classification is determined through a majority vote. The class that receives the most votes across all the trees becomes the predicted class for the input.
In regression tasks, the objective is to predict a continuous numerical value rather than a discrete class. In this context, each decision tree in the Random Forest independently predicts a numerical value for a given input. The final prediction for a specific input is often the average of these individual predictions. This averaging process smoothens out the predictions, resulting in a more stable and reliable estimate of the target variable.
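In scikit-learn the two variants are exposed as RandomForestClassifier and RandomForestRegressor; a minimal sketch using two built-in datasets (iris for classification, diabetes for regression):

from sklearn.datasets import load_iris, load_diabetes
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: majority vote over the trees' predicted classes
X_c, y_c = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_c, y_c)
print(clf.predict(X_c[:3]))          # discrete class labels

# Regression: average of the trees' predicted values
X_r, y_r = load_diabetes(return_X_y=True)
reg = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_r, y_r)
print(reg.predict(X_r[:3]))          # continuous numerical values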
Strengths of Random Forest:
High Predictive Accuracy: Random Forests are known for delivering high predictive accuracy in both classification and regression tasks, thanks to the aggregation of diverse decision trees.
Robust to Overfitting: The ensemble nature of Random Forest mitigates overfitting, as the individual decision trees are trained on random subsets of the data, promoting model generalization.
Versatility: Random Forests can handle a variety of data types, including categorical and numerical features, making them versatile for different types of datasets.
Implicit Feature Importance: The algorithm provides a natural way to assess the importance of features in the dataset, aiding in feature selection and interpretation.
Resilience to Noisy Data: The randomness introduced during training allows Random Forests to handle noisy data and outliers more effectively compared to individual decision trees.
Weaknesses of Random Forest:
Complexity: The ensemble of decision trees can make Random Forest models complex and sometimes difficult to interpret. Training multiple decision trees and aggregating their predictions can be computationally expensive, especially for large datasets and complex models.
Memory Usage: Random Forests can be memory-intensive, particularly as the number of decision trees in the forest increases, which may pose challenges for deployment in resource-constrained environments.
Bias Towards Dominant Classes: In imbalanced datasets, where one class significantly outnumbers the others, Random Forests may exhibit a bias towards the dominant class, impacting the model’s sensitivity to minority classes.
Not Ideal for Linear Relationships: If the underlying relationships in the data are predominantly linear, simpler models like linear regression may outperform Random Forests, which are more adept at capturing non-linear patterns.
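To illustrate the last point, the sketch below compares plain linear regression with a random forest on synthetic, purely linear data (an assumption made only for illustration); on such data the simpler model will typically score at least as well:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(-5, 5, size=(500, 1))
y = 3.0 * X[:, 0] + 1.0 + rng.normal(scale=0.5, size=500)   # purely linear signal plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lin = LinearRegression().fit(X_train, y_train)
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)

print(f"Linear regression R^2: {lin.score(X_test, y_test):.3f}")
print(f"Random forest R^2:     {rf.score(X_test, y_test):.3f}")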
Python Code Implementation
Let’s implement it using the scikit-learn library with a simple example:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import plot_tree
from sklearn.datasets import load_iris
# Load the iris dataset (replace with your own dataset)
data = load_iris()
X = data.data
y = data.target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train a Random Forest classifier
rf_model = RandomForestClassifier(n_estimators=10, random_state=42)
rf_model.fit(X_train, y_train)
# Evaluate the model on the testing set
accuracy = rf_model.score(X_test, y_test)
print(f"Accuracy on the testing set: {accuracy:.2f}")
# Plot one of the decision trees in the forest
plt.figure(figsize=(15, 10))
plot_tree(rf_model.estimators_[0], feature_names=data.feature_names, class_names=data.target_names, filled=True)
plt.show()
The plot produced above shows a single decision tree from the forest (each tree is known as an estimator in Random Forest); the plot_tree function lets you visualise any individual estimator in this way.
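Because implicit feature importance was listed as one of Random Forest's strengths, here is a short follow-up sketch that reuses the rf_model and data objects from the code above to print the importance score of each feature:

# Feature importances reflect the impurity decrease each feature produces
# across all trees in the forest (the values sum to 1)
for name, score in sorted(
    zip(data.feature_names, rf_model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
):
    print(f"{name}: {score:.3f}")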
Applications of Random Forest
Random Forest, with its versatility and robustness, finds applications across various domains in the real world. Here are key areas where Random Forest has proven to be particularly impactful:
Financial Sector: Credit Scoring and Fraud Detection. Random Forest is widely used for credit scoring: by analyzing an individual’s financial history, it can assess creditworthiness more accurately than a single model. In fraud detection, the algorithm excels at identifying unusual patterns and anomalies by learning from historical data.
Healthcare: Disease Prediction and Drug Discovery. Random Forest plays a crucial role in predicting diseases based on patient data. In drug discovery, Random Forest models contribute to predicting the efficacy and potential side effects of new drugs, accelerating the drug development process.
Marketing: Customer Segmentation and Targeted Marketing. Random Forest can segment customers based on attributes such as purchasing behavior, demographics, and preferences, and this segmentation aids targeted marketing campaigns.
Ecology and Environmental Science: Species Classification and Remote Sensing. It helps researchers and conservationists identify and monitor different species in diverse ecosystems. Remote sensing data, such as satellite imagery, can be analysed using Random Forest for land cover classification, deforestation monitoring, and environmental change detection.
10 technical questions commonly asked about Random Forests in interviews:
How does a Random Forest differ from a single decision tree?
- Answer: A Random Forest is an ensemble of decision trees. Unlike a single decision tree, it trains on random subsets of the data and features, mitigating overfitting and improving generalization.
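A quick way to see the difference in code (a sketch using the iris dataset and default settings; exact scores will vary):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=42)
forest = RandomForestClassifier(n_estimators=100, random_state=42)

print("Single tree CV accuracy:  ", cross_val_score(single_tree, X, y, cv=5).mean())
print("Random forest CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())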
What is the purpose of using random subsets of features in each tree of a Random Forest?
- Answer: Random subsets of features ensure diversity among individual trees, preventing them from relying too heavily on specific features and enhancing the overall robustness of the model.
How does a Random Forest handle overfitting, and why is it effective in this regard?
- Answer: Random Forest mitigates overfitting by aggregating predictions from multiple trees, each trained on different subsets of the data. The ensemble approach helps generalize well to unseen data.
Explain the concept of bagging in the context of Random Forests.
- Answer: Bagging (Bootstrap Aggregating) involves training each decision tree on a bootstrap sample (randomly sampled with replacement) from the training data. This diversifies the training process.
What criteria are commonly used for splitting nodes in decision trees within a Random Forest?
- Answer: Common splitting criteria include Gini impurity for classification tasks and mean squared error for regression tasks. These metrics quantify the homogeneity of subsets created by a split.
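In scikit-learn these criteria are exposed through the criterion parameter, for example:

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Classification: impurity-based criteria such as "gini" (default) or "entropy"
clf = RandomForestClassifier(criterion="gini", random_state=42)

# Regression: error-based criteria such as "squared_error" (default) or "absolute_error"
reg = RandomForestRegressor(criterion="squared_error", random_state=42)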
How does Random Forest handle missing values in the dataset?
- Answer: The classical Random Forest algorithm can impute missing values during training, typically using the mean or median of the available values (with proximity-based refinements in some implementations). In practice, many libraries expect missing values to be imputed before fitting.
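In scikit-learn (where native support for missing values depends on the version), a common pattern is to pair the forest with an imputer in a pipeline; a small sketch with hypothetical toy data:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Toy data with missing entries (hypothetical)
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, np.nan]])
y = np.array([0, 0, 1, 1])

# Impute missing values with the column median, then fit the forest
model = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=50, random_state=42),
)
model.fit(X, y)
print(model.predict(X))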
Explain the concept of out-of-bag (OOB) error in Random Forests.
- Answer: The out-of-bag error is an estimate of a model’s performance on unseen data. It is calculated using the data points that were not included in the bootstrap sample used to train each tree.
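scikit-learn exposes this estimate directly via the oob_score parameter; a minimal sketch on the iris data:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# oob_score=True evaluates each tree on the samples left out of its bootstrap sample
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
rf.fit(X, y)
print(f"Out-of-bag accuracy estimate: {rf.oob_score_:.3f}")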
What is the role of hyperparameters in tuning a Random Forest model?
- Answer: Hyperparameters, such as the number of trees, the depth of trees, and the size of feature subsets, influence the behavior and performance of a Random Forest. Tuning them is essential for optimal model performance.
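A typical tuning sketch with GridSearchCV (the grid values below are purely illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

param_grid = {
    "n_estimators": [50, 100, 200],    # number of trees
    "max_depth": [None, 5, 10],        # depth of each tree
    "max_features": ["sqrt", 0.5],     # size of the feature subset at each split
}

search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)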
How does Random Forest determine feature importance, and why is it valuable?
- Answer: Feature importance is assessed by measuring the decrease in impurity (e.g., Gini impurity) caused by each feature across all trees. It helps identify the most influential features in making predictions.
Can Random Forest be used for regression tasks, and if so, how is the prediction calculated?
- Answer: Yes, Random Forest can be used for regression. The prediction is typically the average of the predictions from individual trees, providing a continuous output.