Logistic Regression: Explained with Python - Sklearn
Whether you’re predicting spam emails or diagnosing diseases, logistic regression provides a reliable framework for modeling binary and multiclass outcomes. In this blog, we will delve into the fundamentals, explore the mathematical foundations, discuss its assumptions, evaluate performance metrics, and finally, showcase its implementation using Python with the scikit-learn library.
Flow of Article:
- What is Logistic Regression?
- Mathematical Explanation
- Effect of Coefficient (Graphical)
- Assumptions of this technique
- Advantages and Disadvantages
- Python project
- Real-life Scenario
- Interview Questions
You may also want to explore Linear Regression, Transfer Learning using Regression, or Validation Techniques.
What is Logistic Regression?
Despite its name, logistic regression is a classification algorithm, not a regression one. It’s commonly used when the dependent variable is binary, meaning it has only two possible outcomes—0 or 1, true or false, spam or not spam, etc. The model estimates the probability of the binary outcome by transforming the linear combination of input features using the logistic function (sigmoid function).
Maths Behind It
The logistic regression model can be expressed as:
P(Y=1) = 1 / (1 + e^-(b0 + b1X1 + b2X2 + … + bkXk))
The logistic function ensures that the output lies between 0 and 1, mapping the linear combination to a probability.
The logistic function transforms the linear combination b0+b1X1+b2X2+…+bkXk into a value between 0 and 1. This transformed value can be interpreted as the probability of the event Y=1 given the values of the independent variables.
The coefficients (b0,b1,…,bk) are estimated during the training process to best fit the observed data.
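To make this concrete, here is a minimal sketch of passing a linear combination through the sigmoid; the coefficient and feature values are hypothetical, chosen only for illustration:
import numpy as np
# Hypothetical coefficients b0, b1, b2 and a single observation (x1, x2)
b0, b1, b2 = -1.0, 0.8, 1.5
x1, x2 = 2.0, 1.0
# Linear combination: z = b0 + b1*x1 + b2*x2
z = b0 + b1 * x1 + b2 * x2
# Sigmoid maps z onto (0, 1), giving the estimated probability of Y = 1
p = 1 / (1 + np.exp(-z))
print(f'P(Y=1) = {p:.3f}')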
The Effect of Coefficients
- The values of the coefficients in logistic regression directly influence the shape and steepness of the S-shaped curve, known as the logistic curve or sigmoid curve.
- This curve represents the probability of the event occurring as a function of the input variables. Specifically, the logistic function transforms the linear combination of the input variables and their corresponding coefficients into a probability between 0 and 1.
- For a positive coefficient, the logistic curve shifts upward, indicating an increase in the probability of the event as the corresponding independent variable increases. On the other hand, a negative coefficient causes the curve to shift downward, suggesting a decrease in the probability of the event as the variable increases.
- The steepness of the curve is influenced by the magnitude of the coefficient; larger coefficients result in a steeper curve, indicating a more rapid change in probability with respect to changes in the independent variable.
In short, the larger the magnitude of a coefficient, the steeper the sigmoid curve becomes and the sharper the resulting decision boundary, as the short numerical sketch below illustrates.
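A minimal sketch with arbitrary coefficient values (b1 = 0.5 versus b1 = 5.0, intercept fixed at 0 for simplicity), so the difference in steepness is visible directly:
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
x = np.linspace(-3, 3, 7)          # a few points around the boundary at x = 0
for b1 in (0.5, 5.0):              # small vs. large coefficient magnitude
    probs = sigmoid(b1 * x)
    print(f'b1 = {b1}: {np.round(probs, 2)}')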
Assumptions of Logistic Regression
- Linearity of Log-Odds: The relationship between the independent variables and the log-odds of the dependent variable is linear.
- Independence of Errors: The observations are independent of each other.
- Multicollinearity Absence: The independent variables are not highly correlated (a quick check is sketched after this list).
- Outliers Absence: The model is sensitive to outliers; hence, their absence is preferred.
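As a practical aside, multicollinearity is the assumption most easily checked in code. A minimal sketch using a hypothetical feature table (the column values are made up):
import pandas as pd
# Hypothetical numeric features for a handful of customers
df = pd.DataFrame({
    'income': [30, 42, 55, 60, 75],
    'debt':   [5, 8, 12, 11, 20],
    'age':    [25, 32, 47, 51, 62],
})
# Pairwise correlations; values near +1 or -1 hint at multicollinearity
print(df.corr().round(2))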
Metrics Used
1. Accuracy
Accuracy is a common metric used for classification models and is calculated as the ratio of correctly predicted instances to the total instances.
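As a quick illustration of the formula, using made-up label vectors:
import numpy as np
y_true = np.array([0, 1, 1, 0, 1])   # made-up ground-truth labels
y_pred = np.array([0, 1, 0, 0, 1])   # made-up model predictions
# Accuracy = correctly predicted instances / total instances
accuracy = (y_true == y_pred).mean()
print(f'Accuracy: {accuracy:.2f}')   # 4 correct out of 5 -> 0.80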
2. Area Under the ROC Curve (AUC-ROC)
The AUC-ROC curve evaluates the model’s ability to distinguish between classes. It plots the true positive rate against the false positive rate, providing a comprehensive view of the model’s performance.
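For illustration, a short sketch (with made-up labels and predicted scores) of how scikit-learn exposes the curve and its area:
from sklearn import metrics
y_true = [0, 0, 1, 1]               # made-up labels
y_score = [0.1, 0.4, 0.35, 0.8]     # made-up predicted probabilities for class 1
# True-positive and false-positive rates at each probability threshold
fpr, tpr, thresholds = metrics.roc_curve(y_true, y_score)
print('FPR:', fpr)
print('TPR:', tpr)
print('AUC-ROC:', metrics.roc_auc_score(y_true, y_score))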
Advantages
- Simple and Efficient: Logistic regression is computationally efficient and straightforward to implement.
- Interpretability: Coefficients indicate the strength and direction of relationships.
- Probabilistic Predictions: Outputs provide probabilities, aiding decision-making.
Disadvantages
- Assumption Violation Sensitivity: Performance degrades if assumptions are violated.
- Limited to Linear Relationships: It assumes a linear relationship between features and log-odds.
- Not Suitable for Complex Relationships: In scenarios with intricate relationships, other models may perform better.
Python Code Implementation
Let’s implement it using the scikit-learn library with a simple example:
# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import pandas as pd
# Sample dataset (replace with your dataset)
data = {'Feature1': [1, 2, 3, 4, 5],
        'Feature2': [0, 1, 1, 0, 1],
        'Target': [0, 0, 1, 1, 1]}
df = pd.DataFrame(data)
# Split data into features and target variable
X = df[['Feature1', 'Feature2']]
y = df['Target']
# Split the dataset into training and testing sets
# (stratify keeps both classes in the tiny test split so AUC-ROC can be computed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)
# Create a logistic regression model
model = LogisticRegression()
# Train the model on the training set
model.fit(X_train, y_train)
# Make predictions on the testing set
y_pred = model.predict(X_test)
# Calculate accuracy
accuracy = metrics.accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
# Calculate AUC-ROC
auc_roc = metrics.roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f'AUC-ROC: {auc_roc}')
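As a follow-up to the snippet above (reusing the fitted model), the learned coefficients can be inspected directly; exponentiating them gives odds ratios, which is the usual way to interpret them:
import numpy as np
# Assumes `model` has been fitted as in the code above
print('Intercept:', model.intercept_)
print('Coefficients:', model.coef_)
# Odds ratios: multiplicative change in the odds per one-unit increase in a feature
print('Odds ratios:', np.exp(model.coef_))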
Real-Life Applications
Logistic regression finds application in various real-world scenarios due to its simplicity and effectiveness in handling binary and multiclass classification problems. Here are some notable examples:
Credit Scoring:
- Scenario: Determining whether a customer is likely to default on a loan or not.
- Use: Logistic regression can analyze historical data, including credit scores, income, and debt, to predict the probability of loan default.
Medical Diagnosis:
- Scenario: Identifying whether a patient has a particular disease based on medical test results.
- Use: It can analyze patient data and test results to estimate the probability of disease presence, aiding in early diagnosis.
Spam Email Detection:
- Scenario: Classifying emails as spam or non-spam.
- Use: It can analyze email content, sender information, and other features to predict the likelihood of an email being spam.
Customer Churn Prediction:
- Scenario: Predicting whether a customer is likely to stop using a service or product.
- Use: It can analyze customer behavior, satisfaction scores, and usage patterns to predict the likelihood of churn.
Fraud Detection:
- Scenario: Identifying fraudulent transactions in financial systems.
- Use: Logistic regression can analyze transaction data, user behavior, and other features to estimate the probability of a transaction being fraudulent.
Most Asked Interview Questions
Question: What is logistic regression, and when is it used?
- Answer: Logistic regression is a statistical method used for modeling the probability of a binary outcome. It’s suitable when the dependent variable is categorical with two levels.
Question: How does logistic regression differ from linear regression?
- Answer: Linear regression models a continuous outcome, while logistic regression models the probability of a binary outcome using the logistic function to constrain values between 0 and 1.
Question: What is the logistic function, and why is it important in logistic regression?
- Answer: The logistic function is P(Y=1) = 1 / (1 + e^-(b0 + b1X1 + … + bkXk)). It transforms the linear combination of predictors and coefficients into probabilities between 0 and 1.
Question: How do you interpret the coefficients in logistic regression?
- Answer: A positive coefficient indicates an increase in the log-odds of the event, while a negative coefficient suggests a decrease. The magnitude signifies the strength of the influence.
Question: What is the odds ratio, and how is it calculated in logistic regression?
- Answer: The odds ratio represents the change in odds for a one-unit increase in the predictor variable. It is calculated as e^bi, where bi is the coefficient for the variable.
Question: What is a decision boundary in logistic regression?
- Answer: The decision boundary is the threshold at which the predicted probability equals 0.5. Above the boundary, the event is predicted to occur; below, it is predicted not to occur.
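For instance, a small sketch that applies the default 0.5 threshold by hand (reusing the fitted model and X_test from the article's code above):
# Assumes `model` and `X_test` from the code above
probs = model.predict_proba(X_test)[:, 1]   # probability of the positive class
# Applying the default 0.5 threshold by hand mirrors model.predict(X_test)
manual_pred = (probs >= 0.5).astype(int)
print(manual_pred)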
Question: How is model performance assessed in logistic regression?
- Answer: Common metrics include accuracy, precision, recall, F1 score, and the area under the receiver operating characteristic (ROC) curve.
Question: What is multicollinearity, and why is it a concern in logistic regression?
- Answer: Multicollinearity occurs when predictor variables are highly correlated. It can lead to unstable coefficient estimates and challenges in interpreting individual variable contributions.
Question: Can logistic regression handle categorical variables, and if so, how?
- Answer: Yes, logistic regression can handle categorical variables by encoding them appropriately, often using techniques like one-hot encoding or dummy coding.
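For example, a minimal sketch of one-hot encoding with pandas (the column and category names are made up):
import pandas as pd
# Made-up categorical feature
df = pd.DataFrame({'city': ['Delhi', 'Mumbai', 'Delhi', 'Pune']})
# One-hot (dummy) encoding turns the category into binary indicator columns
encoded = pd.get_dummies(df, columns=['city'])
print(encoded)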