Logistic regression

The document provides an overview of logistic regression, detailing its types: binomial, multinomial, and ordinal, along with their applications. It includes practical examples of predicting customer purchases and student exam outcomes using Python, along with code for implementation and model evaluation. Additionally, it discusses the differences between logistic and linear regression, and offers insights on model training and performance metrics.

Uploaded by

msanthoshm379

Logistic regression:

Types of Logistic Regression


Based on the number and ordering of the outcome categories, Logistic Regression
can be classified into three types:
1. Binomial: The dependent variable has only two possible outcomes, such as
0 or 1, Pass or Fail, etc.
2. Multinomial: The dependent variable has 3 or more unordered categories,
such as "cat", "dog", or "sheep".
3. Ordinal: The dependent variable has 3 or more ordered categories, such as
"Low", "Medium", or "High".

| Feature | Multinomial Logistic Regression | Ordinal Logistic Regression |
|---|---|---|
| Type of Outcome | Unordered categories | Ordered categories |
| Example | Product choice (Laptop, Mobile, Tablet) | Customer rating (Low, Medium, High) |
| Mathematical Approach | One-vs-All or Softmax Function | Cumulative Logit Model |
| Probability Interpretation | Each category gets its own probability; no ordering is assumed | Probabilities follow a cumulative order |
| Implementation | Scikit-learn (LogisticRegression; multi_class='multinomial' in older versions) | Statsmodels (OrderedModel) |
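The difference between the two probability interpretations can be sketched numerically. The block below is a minimal illustration, not the internals of Scikit-learn or Statsmodels; the scores, the thresholds, and the helper names (softmax, cumulative_logit_probs) are assumptions made up for the example.

```python
import numpy as np

def softmax(scores):
    """Turn one score per category into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cumulative_logit_probs(linear_score, thresholds):
    """Ordered-category probabilities from P(y <= k) = sigmoid(threshold_k - score)."""
    cumulative = sigmoid(np.asarray(thresholds) - linear_score)
    cumulative = np.append(cumulative, 1.0)   # P(y <= last category) = 1
    return np.diff(cumulative, prepend=0.0)   # P(y = k) = P(y <= k) - P(y <= k-1)

# Hypothetical scores for Laptop, Mobile, Tablet (unordered categories)
multinomial_probs = softmax(np.array([1.2, 0.3, -0.5]))

# Hypothetical score and two increasing thresholds for Low < Medium < High
ordinal_probs = cumulative_logit_probs(linear_score=0.4, thresholds=[-1.0, 1.0])

print("Multinomial:", multinomial_probs.round(3), "sum =", multinomial_probs.sum())
print("Ordinal:    ", ordinal_probs.round(3), "sum =", ordinal_probs.sum())
```

In the multinomial case every category has its own score. In the ordinal case a single score is compared against ordered thresholds, which is what ties the category probabilities to a cumulative order.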

Predicting Whether a Customer Will Buy a Laptop or Not


A laptop retailer wants to predict whether a potential customer will purchase a
laptop based on their profile. The retailer has collected data on past customers,
including their age, annual income, and whether they purchased a laptop
(Yes/No).
Dataset (Example)

| Customer ID | Age | Annual Income (₹) | Purchased (Yes=1, No=0) |
|---|---|---|---|
| 1 | 22 | 2,50,000 | 0 |
| 2 | 35 | 6,00,000 | 1 |
| 3 | 28 | 3,50,000 | 0 |
| 4 | 42 | 9,00,000 | 1 |
| 5 | 30 | 4,50,000 | 1 |

Objective
We aim to build a Logistic Regression model that predicts whether a customer
will buy a laptop (Purchased = 1) or not (Purchased = 0) based on their age
and annual income.

Solution Approach
1. Data Preprocessing
   - Convert categorical data (if any) into numerical form.
   - Normalize numerical values if required.
2. Train-Test Split
   - Split the dataset into a training set (80%) and a test set (20%).
3. Train the Logistic Regression Model
   - Fit the model using Scikit-learn (Python).
   - The best coefficients are found by Maximum Likelihood Estimation (MLE),
     which an iterative optimizer such as Gradient Descent carries out.
4. Make Predictions & Evaluate the Model
   - Use metrics like Accuracy, Precision, Recall, and F1-score to
     evaluate performance.
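The metrics named in step 4 can be worked out by hand from the four confusion-matrix counts. The sketch below uses made-up labels for eight hypothetical customers (not the dataset in this document) and checks the arithmetic against Scikit-learn's metric functions.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical true labels and predictions for eight customers (1 = purchased)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Count the four confusion-matrix cells
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

accuracy = (tp + tn) / len(y_true)                  # share of all predictions that are right
precision = tp / (tp + fp)                          # of predicted buyers, how many bought
recall = tp / (tp + fn)                             # of actual buyers, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)  # 0.75 0.75 0.75 0.75
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```

Here tp = 3, tn = 3, fp = 1 and fn = 1, so all four metrics come out to 0.75.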
Python Code for Implementation
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Sample dataset
data = {
    'Age': [22, 35, 28, 42, 30],
    'Annual_Income': [250000, 600000, 350000, 900000, 450000],
    'Purchased': [0, 1, 0, 1, 1]
}
df = pd.DataFrame(data)

# Features and target
X = df[['Age', 'Annual_Income']]
y = df['Purchased']

# Split the dataset (with five rows, the 20% test set is a single sample)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
```
Expected Outcome
• The model learns from historical customer data and predicts whether a
new customer is likely to buy a laptop.
• Accuracy, precision, and recall scores help assess the model's
effectiveness. With only five rows the test set is a single customer, so the
scores here are illustrative rather than reliable.
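Scoring a new customer could look like the sketch below. It is one possible approach rather than the code above: it assumes the model is fitted on all five rows, adds a StandardScaler so that age and income are on comparable scales, and invents a new customer (age 33, income ₹5,00,000) purely for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same five customers as in the dataset above
df = pd.DataFrame({
    'Age': [22, 35, 28, 42, 30],
    'Annual_Income': [250000, 600000, 350000, 900000, 450000],
    'Purchased': [0, 1, 0, 1, 1],
})
X = df[['Age', 'Annual_Income']]
y = df['Purchased']

# Scale the features, then fit logistic regression on all five rows
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X, y)

# Hypothetical new customer (not part of the original data)
new_customer = pd.DataFrame({'Age': [33], 'Annual_Income': [500000]})

# predict_proba returns [P(no purchase), P(purchase)] for each row
purchase_probability = pipeline.predict_proba(new_customer)[0, 1]
predicted_label = int(pipeline.predict(new_customer)[0])

print(f"Probability of purchase: {purchase_probability:.2f}")
print("Predicted label:", predicted_label)
```

predict() applies a 0.5 cut-off to the probability, so predict_proba() is the more informative call when the retailer wants to rank customers instead of just labelling them.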
Predicting Whether a Student Passes or Fails
A university wants to predict whether a student will pass or fail an exam based
on their study hours and previous scores.
Dataset

| Student | Study Hours | Previous Score (%) | Passed (1=Yes, 0=No) |
|---|---|---|---|
| 1 | 2 | 50 | 0 |
| 2 | 5 | 65 | 0 |
| 3 | 7 | 80 | 1 |
| 4 | 9 | 85 | 1 |
| 5 | 6 | 70 | 1 |

Solution Using Logistic Regression in Python


```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Creating the dataset
data = {
    'Study_Hours': [2, 5, 7, 9, 6],
    'Previous_Score': [50, 65, 80, 85, 70],
    'Passed': [0, 0, 1, 1, 1]
}
df = pd.DataFrame(data)

# Defining features and target
X = df[['Study_Hours', 'Previous_Score']]
y = df['Passed']

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Training logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Making predictions
y_pred = model.predict(X_test)

# Evaluating the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
```
Expected Outcome
• The model predicts whether a student will pass or fail based on study
hours and previous scores.
• Accuracy and the classification report provide insight into model
performance.
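With five students, a single 80/20 split evaluates the model on one student only. One alternative, shown here as a sketch and not as part of the original solution, is leave-one-out cross-validation: each student is held out once and predicted by a model trained on the other four. The StandardScaler step is an added assumption to put both features on the same scale.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same five students as in the dataset above
df = pd.DataFrame({
    'Study_Hours': [2, 5, 7, 9, 6],
    'Previous_Score': [50, 65, 80, 85, 70],
    'Passed': [0, 0, 1, 1, 1],
})
X = df[['Study_Hours', 'Previous_Score']]
y = df['Passed']

# Scaling happens inside the pipeline, so it is re-fitted on each training fold
model = make_pipeline(StandardScaler(), LogisticRegression())

# One fold per student: train on four, test on the one left out
scores = cross_val_score(model, X, y, cv=LeaveOneOut(), scoring='accuracy')

print("Per-student result (1 = correct, 0 = wrong):", scores)
print("Mean accuracy:", scores.mean())
```

Each score is either 0 or 1 because every fold tests a single student; the mean over the five folds is the accuracy estimate.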

Predicting Laptop Purchase in an E-Commerce Store


An online retailer wants to predict whether a customer will purchase a laptop
based on their age, income, and time spent on the website.
Dataset

| Customer | Age | Income (₹) | Time Spent (mins) | Purchased (1=Yes, 0=No) |
|---|---|---|---|---|
| 1 | 22 | 3,00,000 | 5 | 0 |
| 2 | 30 | 5,50,000 | 12 | 1 |
| 3 | 40 | 8,00,000 | 20 | 1 |
| 4 | 26 | 4,00,000 | 7 | 0 |
| 5 | 35 | 7,00,000 | 15 | 1 |

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Creating the dataset
data = {
    'Age': [22, 30, 40, 26, 35],
    'Income': [300000, 550000, 800000, 400000, 700000],
    'Time_Spent': [5, 12, 20, 7, 15],
    'Purchased': [0, 1, 1, 0, 1]
}
df = pd.DataFrame(data)

# Defining features and target
X = df[['Age', 'Income', 'Time_Spent']]
y = df['Purchased']

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Training logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Making predictions
y_pred = model.predict(X_test)

# Evaluating the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
```
Expected Outcome
• The model predicts whether a customer will purchase a laptop based on
their profile.
• Higher accuracy on unseen customers suggests the model has learned buying
patterns; with a dataset this small, a single test sample cannot confirm it.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the breast cancer dataset
X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=23)

# A high max_iter gives the solver room to converge on unscaled features
clf = LogisticRegression(max_iter=10000, random_state=0)
clf.fit(X_train, y_train)

acc = accuracy_score(y_test, clf.predict(X_test)) * 100
print(f"Logistic Regression model accuracy: {acc:.2f}%")
```
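How to Check if max_iter is Sufficient?
After training, check for convergence warnings:
```python
import warnings
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Always show convergence warnings, even if one was already displayed
warnings.simplefilter("always", ConvergenceWarning)

# X_train and y_train come from the train-test split above
clf = LogisticRegression(max_iter=50)
clf.fit(X_train, y_train)
```
• If you see a ConvergenceWarning (for example, "lbfgs failed to converge"),
increase max_iter or scale the features.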
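The same check can be done in code instead of by reading the console. The sketch below is one way to do it, assuming the breast cancer dataset used earlier; fit_and_check is a hypothetical helper written for this example, while n_iter_ is the fitted attribute that reports how many iterations the solver actually ran.

```python
import warnings
from sklearn.datasets import load_breast_cancer
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

def fit_and_check(model):
    """Fit the model and return True if a ConvergenceWarning was raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model.fit(X, y)
    return any(issubclass(w.category, ConvergenceWarning) for w in caught)

# Unscaled features with a low iteration limit
raw_model = LogisticRegression(max_iter=50)
raw_warned = fit_and_check(raw_model)
raw_iterations = int(raw_model.n_iter_[0])

# Scaled features with a generous iteration limit
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_warned = fit_and_check(scaled_model)
scaled_iterations = int(scaled_model[-1].n_iter_[0])

print("Unscaled: warned =", raw_warned, "| iterations used =", raw_iterations)
print("Scaled:   warned =", scaled_warned, "| iterations used =", scaled_iterations)
```

If the iterations used equal max_iter, the solver stopped because it ran out of budget. Scaling the features often helps more than raising the limit.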

Recommended Values for max_iter
These are rules of thumb rather than fixed limits; Scikit-learn's default is
max_iter=100.
• Small datasets (< 1,000 samples): max_iter=100
• Medium datasets (1k - 100k samples): max_iter=1000
• Large datasets (> 100k samples): max_iter=5000+

| Linear Regression | Logistic Regression |
|---|---|
| Used to predict a continuous dependent variable from a given set of independent variables. | Used to predict a categorical dependent variable from a given set of independent variables. |
| Used for solving regression problems. | Used for solving classification problems. |
| Predicts the value of a continuous variable. | Predicts the category of a categorical variable (through class probabilities). |
| Finds the best-fit line. | Finds the S-curve (sigmoid). |
| The least squares method is used to estimate the coefficients. | The maximum likelihood method is used to estimate the coefficients. |
| The output must be a continuous value, such as price, age, etc. | The output must be a categorical value, such as 0 or 1, Yes or No, etc. |
| Requires a linear relationship between the dependent and independent variables. | Does not require a linear relationship between the dependent and independent variables (the relationship is linear in the log-odds instead). |
| There may be collinearity between the independent variables, though it makes coefficients harder to interpret. | There should be little to no collinearity between the independent variables. |
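The difference in output type can be seen by fitting both models to the same pass/fail data. This is a small sketch using the study-hours column from the student example; the three new study-hour values (0, 6 and 15) are hypothetical inputs chosen to include extremes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Study hours and pass/fail labels from the student example
study_hours = np.array([[2], [5], [7], [9], [6]])
passed = np.array([0, 0, 1, 1, 1])

linear = LinearRegression().fit(study_hours, passed)
logistic = LogisticRegression().fit(study_hours, passed)

# Hypothetical new students, including extreme values
new_hours = np.array([[0], [6], [15]])

linear_output = linear.predict(new_hours)                  # unbounded continuous values
logistic_output = logistic.predict_proba(new_hours)[:, 1]  # probabilities of passing

print("Linear regression output:  ", linear_output.round(2))
print("Logistic regression output:", logistic_output.round(2))
```

For 0 and 15 study hours the straight line returns values below 0 and above 1, which cannot be read as probabilities. The S-curve keeps every output strictly between 0 and 1.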

Multinomial Logistic Regression in Machine Learning (Python Example)


Multinomial Logistic Regression is used when the target variable has more than
two categories with no natural order.

Problem Statement:
A company wants to predict which type of device (Laptop, Tablet, or Mobile)
a customer will buy based on their age and income.

Step-by-Step Python Implementation


```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report

# Sample dataset
data = {
    'Age': [22, 30, 40, 26, 35, 50, 28, 45, 33, 38],
    'Income': [300000, 550000, 800000, 400000, 700000, 900000, 350000,
               600000, 480000, 750000],
    'Device': ['Laptop', 'Mobile', 'Laptop', 'Tablet', 'Mobile', 'Laptop',
               'Tablet', 'Mobile', 'Tablet', 'Laptop']
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Encode categorical target variable (Device: Laptop=0, Mobile=1, Tablet=2)
label_encoder = LabelEncoder()
df['Device'] = label_encoder.fit_transform(df['Device'])

# Define features (X) and target (y)
X = df[['Age', 'Income']]
y = df['Device']

# Split the dataset into train and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train Multinomial Logistic Regression model.
# With three classes, lbfgs fits a multinomial (softmax) model by default.
# Older scikit-learn versions spelled this out with multi_class='multinomial',
# an argument that recent releases deprecate.
model = LogisticRegression(solver='lbfgs', max_iter=1000)
model.fit(X_train, y_train)

# Predict on test data
y_pred = model.predict(X_test)

# Evaluate the model. labels= lists every class so that target_names still
# lines up when the two-sample test set does not contain all three devices.
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")
print("Classification Report:\n", classification_report(
    y_test, y_pred,
    labels=np.arange(len(label_encoder.classes_)),
    target_names=label_encoder.classes_,
    zero_division=0))
```
Explanation of Code
1. Import Necessary Libraries
```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report
```
• Pandas & NumPy → Handle data efficiently.
• Scikit-learn → Provides ML tools like logistic regression, train-test split,
and evaluation metrics.
2. Creating a Sample Dataset
```python
data = {
    'Age': [22, 30, 40, 26, 35, 50, 28, 45, 33, 38],
    'Income': [300000, 550000, 800000, 400000, 700000, 900000, 350000,
               600000, 480000, 750000],
    'Device': ['Laptop', 'Mobile', 'Laptop', 'Tablet', 'Mobile', 'Laptop',
               'Tablet', 'Mobile', 'Tablet', 'Laptop']
}
df = pd.DataFrame(data)
```
• We have two features: Age and Income.
• The target variable Device has three categories: Laptop, Mobile,
Tablet.
3. Encoding the Target Variable
```python
label_encoder = LabelEncoder()
df['Device'] = label_encoder.fit_transform(df['Device'])
```
• Converts categorical values (Laptop, Mobile, Tablet) into numerical
labels. LabelEncoder assigns them in alphabetical order: Laptop → 0,
Mobile → 1, Tablet → 2.
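The mapping can be inspected, and numeric predictions turned back into device names, using the encoder's classes_ attribute and inverse_transform method. A short sketch with a hypothetical five-item device list:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical device labels, just to demonstrate the encoder
devices = ['Laptop', 'Mobile', 'Laptop', 'Tablet', 'Mobile']

label_encoder = LabelEncoder()
encoded = label_encoder.fit_transform(devices)

# classes_ holds the original labels in sorted order; the position is the code
mapping = {label: code for code, label in enumerate(label_encoder.classes_)}

# inverse_transform converts numeric codes (e.g. model predictions) back to names
decoded = list(label_encoder.inverse_transform([0, 2, 1]))

print(list(encoded))  # [0, 1, 0, 2, 1]
print(mapping)        # {'Laptop': 0, 'Mobile': 1, 'Tablet': 2}
print(decoded)        # ['Laptop', 'Tablet', 'Mobile']
```

The numeric codes carry no order for the model in this setting; they are only identifiers for the three classes.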
4. Splitting the Data
```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```
• 80% of the data (8 of the 10 rows) is used for training.
• 20% of the data (2 rows) is used for testing.
5. Training the Multinomial Logistic Regression Model
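```python
model = LogisticRegression(solver='lbfgs', max_iter=1000)
model.fit(X_train, y_train)
```
• multi_class='multinomial' → In older Scikit-learn versions this argument
selected multinomial logistic regression explicitly. Recent releases deprecate
it, and lbfgs fits a multinomial model by default when there are three or more
classes.
• solver='lbfgs' → Suitable for multinomial classification.
• max_iter=1000 → Raises the iteration limit (the default is 100) so the
solver has room to converge.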
6. Making Predictions & Evaluating the Model
```python
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")
# labels= lists every class so target_names lines up even when the small
# test set does not contain all three devices
print("Classification Report:\n", classification_report(
    y_test, y_pred,
    labels=list(range(len(label_encoder.classes_))),
    target_names=label_encoder.classes_,
    zero_division=0))
```
• Predictions are made on the test data.
• Accuracy Score shows how well the model performs.
• Classification Report gives precision, recall, and F1-score.

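Expected Output
With 10 rows and test_size=0.2, the test set holds only 2 samples, so the
accuracy can only be 0.00, 0.50, or 1.00, and the support column sums to 2.
The exact figures depend on the split and the Scikit-learn version. The output
has this layout:
• A "Model Accuracy" line with the accuracy to two decimals.
• A classification report with one row per class (Laptop, Mobile, Tablet)
giving precision, recall, f1-score, and support, followed by accuracy,
macro avg, and weighted avg rows.
Reading the results:
• Accuracy is the fraction of test cases the model predicted correctly.
• High precision and recall indicate a good classification model.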

Key Takeaways
✔ Used Multinomial Logistic Regression for multiple classes (Laptop,
Mobile, Tablet).
✔ Encoded categorical labels using LabelEncoder().
✔ Split dataset into training (80%) and testing (20%).
✔ Trained multinomial logistic regression with the lbfgs solver (multi_class='multinomial' in older Scikit-learn versions).
✔ Evaluated model using accuracy and classification report.

Understanding solver='lbfgs' in Multinomial Logistic Regression
The solver parameter in Scikit-learn's LogisticRegression specifies the
optimization algorithm used to minimize the cost function and find the best
model parameters.
1. What is LBFGS?
✅ LBFGS (Limited-memory Broyden–Fletcher–Goldfarb–Shanno
algorithm) is an advanced quasi-Newton optimization algorithm.
✅ It is used for solving logistic regression, neural networks, and other
machine learning problems efficiently.
✅ It is particularly well-suited for multinomial classification, where we
have multiple categories to predict.

2. Why Does Logistic Regression Need a Solver?


Unlike linear regression, Logistic Regression does not have a closed-form
solution for its coefficients.
Instead, it iteratively optimizes the parameters using an optimization
algorithm (solver).
• The model learns by minimizing the negative log-likelihood (log loss),
which measures how far the predicted probabilities are from the actual labels.
• The solver adjusts the model parameters to minimize this loss.
• Different solvers (like lbfgs, saga, newton-cg) use different optimization
techniques.
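What a solver does can be sketched with plain gradient descent on the log loss. This is an illustration of the idea only, not how Scikit-learn's solvers are implemented; the five data points, the learning rate of 0.5, and the 200 iterations are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_log_likelihood(weights, X, y):
    """Average log loss for binary labels y in {0, 1}."""
    p = sigmoid(X @ weights)
    eps = 1e-12  # guards against log(0)
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def gradient(weights, X, y):
    """Gradient of the average log loss with respect to the weights."""
    return X.T @ (sigmoid(X @ weights) - y) / len(y)

# Hypothetical data: a column of ones (intercept) plus one feature
X = np.array([[1.0, -1.5], [1.0, -0.3], [1.0, 0.2], [1.0, 0.4], [1.0, 1.2]])
y = np.array([0, 0, 1, 0, 1])

weights = np.zeros(2)
initial_loss = negative_log_likelihood(weights, X, y)

# Repeatedly step against the gradient to reduce the loss
for _ in range(200):
    weights = weights - 0.5 * gradient(weights, X, y)

final_loss = negative_log_likelihood(weights, X, y)
print(f"Loss before: {initial_loss:.4f}  after: {final_loss:.4f}")
print("Learned weights:", weights.round(3))
```

With all weights at zero every predicted probability is 0.5, so the starting loss is ln 2 (about 0.693). Each solver in Scikit-learn is a more refined way of driving this same loss down.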

3. How Does LBFGS Work?


LBFGS is an improvement over traditional gradient descent and works as
follows:
1. Computes the gradient of the cost function.
2. Approximates the inverse Hessian matrix (second-order curvature
information) from recent gradients and parameter updates, which helps in
finding the best direction to move.
3. Uses a line search to decide how far to step in that direction.
4. Uses limited memory to store only recent updates, making it efficient for
problems with many parameters.
📌 Why is LBFGS efficient?
• Unlike full BFGS or Newton's method, LBFGS does not store a full Hessian
matrix (which is computationally expensive).
• Instead, it stores only a few previous updates (hence "Limited-memory").
• This keeps it fast and memory-efficient even when the number of features is
large.
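The steps above can be tried directly with SciPy's L-BFGS-B implementation. The sketch below is an assumption-laden illustration, not Scikit-learn's internal code: it reuses five hypothetical data points, adds a small L2 penalty (strength 0.1, applied to the intercept as well for simplicity), and sets maxcor=10, which is the number of past updates the optimizer keeps in memory.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_gradient(weights, X, y, l2_strength):
    """L2-regularized average log loss and its gradient."""
    p = sigmoid(X @ weights)
    eps = 1e-12  # guards against log(0)
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    loss += 0.5 * l2_strength * weights @ weights
    grad = X.T @ (p - y) / len(y) + l2_strength * weights
    return loss, grad

# Hypothetical data: a column of ones (intercept) plus one feature
X = np.array([[1.0, -1.5], [1.0, -0.3], [1.0, 0.2], [1.0, 0.4], [1.0, 1.2]])
y = np.array([0, 0, 1, 0, 1])

result = minimize(
    loss_and_gradient,
    x0=np.zeros(2),
    args=(X, y, 0.1),
    method="L-BFGS-B",
    jac=True,                 # the function returns (loss, gradient) together
    options={"maxcor": 10},   # number of stored past updates ("limited memory")
)

print("Converged:", result.success)
print("Iterations:", result.nit)
print("Weights:", result.x.round(3))
print("Final loss:", round(float(result.fun), 4))
```

Because the curvature is estimated from past gradients, the optimizer typically needs far fewer iterations than plain gradient descent on the same problem.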

4. Why is LBFGS Used for Multinomial Logistic Regression?


✔ Handles multiple classes efficiently (softmax function).
✔ Converges faster than basic gradient descent.
✔ Performs well on small and medium-sized datasets (faster than saga for
moderate data).
✔ Works well with L2 Regularization (default in Scikit-learn).

5. Comparing Solvers in Logistic Regression

| Solver | Suitable for | Strengths | Weaknesses |
|---|---|---|---|
| lbfgs (default) | Small/medium datasets, multinomial classification | Efficient, memory-friendly, supports L2 regularization | Can struggle with very large datasets |
| saga | Large-scale datasets, L1/L2 regularization | Works well with sparse data, supports L1 penalty | Can be slower than lbfgs for small datasets |
| newton-cg | Multinomial problems | Second-order optimization, better for high-dimensional data | Uses more memory than lbfgs |
| liblinear | Binary classification, small datasets | Fast for low-dimensional data, supports L1/L2 | Does not support multinomial regression |

📌 For multinomial logistic regression (multi_class='multinomial' in older
Scikit-learn versions), lbfgs is a sound default choice unless you have very
large data.

6. Example: Using Different Solvers


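```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X_train and y_train come from the multinomial example's train-test split.
# Recent scikit-learn releases deprecate multi_class; both solvers below fit
# a multinomial model by default when there are three or more classes.

# Using lbfgs (default, best for multinomial classification)
clf1 = LogisticRegression(solver='lbfgs', max_iter=1000)
clf1.fit(X_train, y_train)

# Using saga (good for large datasets); it converges faster on scaled features
clf2 = make_pipeline(StandardScaler(),
                     LogisticRegression(solver='saga', max_iter=1000))
clf2.fit(X_train, y_train)
```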
7. When to Use LBFGS?
✅ If dataset size is small or moderate
✅ If using multinomial logistic regression
✅ If computational efficiency is important
✅ If dataset is not too sparse
❌ When NOT to use LBFGS?
• If dataset is extremely large → Use saga instead.
• If you need L1 regularization → In Scikit-learn, lbfgs supports only L2
(or no penalty); use saga or liblinear instead.
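The L1 case can be sketched as follows, assuming the breast cancer dataset used earlier. The value C=0.1 and the scaling step are choices made for this example; the point is that saga accepts penalty='l1' and drives some coefficients to exactly zero, while lbfgs rejects the same setting.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# saga accepts an L1 penalty; scaling the features helps it converge
l1_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty='l1', solver='saga', C=0.1, max_iter=5000),
)
l1_model.fit(X, y)

# L1 regularization pushes some coefficients to exactly zero
coefficients = l1_model[-1].coef_.ravel()
n_zero = int(np.sum(coefficients == 0))
print(f"{n_zero} of {coefficients.size} coefficients are exactly zero")

# lbfgs does not support the L1 penalty and raises an error when fitting
try:
    LogisticRegression(penalty='l1', solver='lbfgs').fit(X, y)
    lbfgs_rejected_l1 = False
except ValueError as error:
    lbfgs_rejected_l1 = True
    print("lbfgs with L1 failed:", error)
```

The zeroed coefficients correspond to features the model ignores, which is why L1 is often used for feature selection.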
Sigmoid function
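The sigmoid (logistic) function is the S-curve that logistic regression uses to turn a linear score z into a probability: sigmoid(z) = 1 / (1 + e^(-z)). A minimal sketch, with the input scores chosen arbitrarily for illustration:

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary example scores, from strongly negative to strongly positive
scores = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probabilities = sigmoid(scores)

for score, probability in zip(scores, probabilities):
    print(f"z = {score:+.1f}  ->  probability = {probability:.3f}")
```

A score of 0 maps to exactly 0.5, large negative scores approach 0, and large positive scores approach 1. A prediction of class 1 at the usual 0.5 cut-off therefore corresponds to a positive score.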
