I N F O A R Y A N

Logistic Regression: Explained with Python - Sklearn

Whether you’re predicting spam emails or diagnosing diseases, logistic regression provides a reliable framework for modeling binary and multiclass outcomes. In this blog, we will delve into the fundamentals, explore the mathematical foundations, discuss its assumptions, evaluate performance metrics, and finally, showcase its implementation using Python with the scikit-learn library.

Flow of Article:

  1. What is Logistic Regression? 
  2. Mathematical Explanation
  3. Effect of Coefficient (Graphical)
  4. Assumptions of this technique
  5. Advantages and Disadvantages
  6. Python project
  7. Real-life Scenario
  8. Interview Questions

 

You may also want to explore Linear Regression, Transfer Learning using Regression, or Validation Techniques.

 

What is Logistic Regression?

Despite its name, logistic regression is a classification algorithm, not a regression one. It’s commonly used when the dependent variable is binary, meaning it has only two possible outcomes—0 or 1, true or false, spam or not spam, etc. The model estimates the probability of the binary outcome by transforming the linear combination of input features using the logistic function (sigmoid function). Research level information can be seen here.

 

Maths Behind It

The equation can be expressed as follows:

The logistic function ensures that the output lies between 0 and 1, mapping the linear combination to a probability. Further maths can be seen at this page.

Logistic Regression - Equation

The logistic function transforms the linear combination b0+b1X1+b2X2+…+bkXk into a value between 0 and 1. This transformed value can be interpreted as the probability of the event Y=1 given the values of the independent variables.

The coefficients (b0,b1,…,bk) are estimated during the training process to best fit the observed data.

 

The effect of Coefficients :

  • The values of the coefficients in logistic regression directly influence the shape and steepness of the S-shaped curve, known as the logistic curve or sigmoid curve.
  • This curve represents the probability of the event occurring as a function of the input variables. Specifically, the logistic function transforms the linear combination of the input variables and their corresponding coefficients into a probability between 0 and 1.
  • For a positive coefficient, the logistic curve shifts upward, indicating an increase in the probability of the event as the corresponding independent variable increases. On the other hand, a negative coefficient causes the curve to shift downward, suggesting a decrease in the probability of the event as the variable increases.
  • The steepness of the curve is influenced by the magnitude of the coefficient; larger coefficients result in a steeper curve, indicating a more rapid change in probability with respect to changes in the independent variable.

The following image represents the above point clearly that when the magnitude of coefficient in the equation is large, it grows steeper and represents sharper decision boundary.

 

Assumptions of Logistic Regression

  1. Linearity of Log-Odds: The relationship between the independent variables and the log-odds of the dependent variable is linear.
  2. Independence of Errors: The observations are independent of each other.
  3. Multicollinearity Absence: The independent variables are not highly correlated.
  4. Outliers Absence: The model is sensitive to outliers; hence, their absence is preferred.

 

Metrics Used

1.Accuracy

Accuracy is a common metric used for classification models and is calculated as the ratio of correctly predicted instances to the total instances.

2. Area Under the ROC Curve (AUC-ROC)

The AUC-ROC curve evaluates the model’s ability to distinguish between classes. It plots the true positive rate against the false positive rate, providing a comprehensive view of the model’s performance.

 

Advantages 

  1. Simple and Efficient: Logistic regression is computationally efficient and straightforward to implement.
  2. Interpretability: Coefficients indicate the strength and direction of relationships.
  3. Probabilistic Predictions: Outputs provide probabilities, aiding decision-making.

Disadvantages

  1. Assumption Violation Sensitivity: Performance degrades if assumptions are violated.
  2. Limited to Linear Relationships: It assumes a linear relationship between features and log-odds.
  3. Not Suitable for Complex Relationships: In scenarios with intricate relationships, other models may perform better.

 

Python Code Implementation

Let’s implement it using the scikit-learn library with a simple example:

# Import necessary libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import pandas as pd

# Sample dataset (replace with your dataset)
data = {‘Feature1’: [1, 2, 3, 4, 5],
‘Feature2’: [0, 1, 1, 0, 1],
‘Target’: [0, 0, 1, 1, 1]}

df = pd.DataFrame(data)

# Split data into features and target variable
X = df[[‘Feature1’, ‘Feature2’]]
y = df[‘Target’]

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Create a logistic regression model
model = LogisticRegression()

# Train the model on the training set
model.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = metrics.accuracy_score(y_test, y_pred)
print(f’Accuracy: {accuracy}’)

# Calculate AUC-ROC
auc_roc = metrics.roc_auc_score(y_test, model.predict_proba(X_test)[:,1])
print(f’AUC-ROC: {auc_roc}’)

 

Real Life Applications :

Logistic regression finds application in various real-world scenarios due to its simplicity and effectiveness in handling binary and multiclass classification problems. Here are some notable examples:

  • Credit Scoring:

    • Scenario: Determining whether a customer is likely to default on a loan or not.
    • Use: Logistic regression can analyze historical data, including credit scores, income, and debt, to predict the probability of loan default.
  • Medical Diagnosis:

    • Scenario: Identifying whether a patient has a particular disease based on medical test results.
    • Use: It can analyze patient data and test results to estimate the probability of disease presence, aiding in early diagnosis.
  • Spam Email Detection:

    • Scenario: Classifying emails as spam or non-spam.
    • Use: It can analyze email content, sender information, and other features to predict the likelihood of an email being spam.
  •  
  • Customer Churn Prediction:

    • Scenario: Predicting whether a customer is likely to stop using a service or product.
    • Use: It can analyze customer behavior, satisfaction scores, and usage patterns to predict the likelihood of churn.
  • Fraud Detection:

    • Scenario: Identifying fraudulent transactions in financial systems.
    • Use: Logistic regression can analyze transaction data, user behavior, and other features to estimate the probability of a transaction being fraudulent.
  • Disease Diagnosis in Healthcare:

    • Scenario: Predicting the likelihood of a patient having a specific medical condition.
    • Use: Logistic regression can utilize patient data, medical history, and diagnostic test results to estimate the probability of disease presence.

 

Most Asked Interview Questions !

  •