Logistic Regression in Machine Learning
Logistic Regression in Machine Learning
Logistic Regression is a supervised machine learning algorithm mainly used for classification
tasks where the goal is to predict the probability that an instance belongs to a given class. It
is a kind of statistical algorithm, which analyzes the relationship between a set of
independent variables and the dependent binary variables.
Logistic regression predicts the output of a categorical dependent variable. Therefore the
outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or
False, etc. but instead of giving the exact value as 0 and 1, it gives the probabilistic values
which lie between 0 and 1.
Logistic regression is commonly used in various fields to predict the likelihood of an event
occurring. Here are some of the applications of logistic regression:
1. Binary Outcome: Logistic regression is used when the dependent variable is binary
(0/1, Yes/No, True/False) in nature.
2. Logistic Function: It uses a logistic function to transform this binary outcome into a
continuous variable that can range between 0 and 1.
3. Dependent Variable: This continuous variable is used as the dependent variable in the
regression analysis.
4. Independent Variables: The independent variables in the analysis are the factors that
you think may influence the outcome.
5. Estimating Effects: The regression analysis estimates the effect of each independent
variable on the probability of the outcome.
6. Predicting Probability: These estimates can then be used to predict the probability of
the outcome given a set of values for the independent variables.
7. Sigmoid Function: The sigmoid function maps any real-valued number into another
value between 0 and 1. In machine learning, we use sigmoid to map predictions to
probabilities.
8. Decision Boundary: Logistic Regression algorithm also uses a concept of threshold
value. By default, the threshold value is 0.5 which means if the probability of an
observation is greater than 0.5 then the observation is classified as Class-1(Y=1), else it
is classified as Class-0(Y=0).
Remember, the goal of logistic regression is to find the best fitting model to describe the
relationship between the dichotomous characteristic of interest and a set of independent
variables.
In Logistic Regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1). The curve from the logistic function
indicates the likelihood of something such as whether the cells are cancerous or not, a
mouse is obese or not based on its weight, etc.
The Sigmoid Function is used to map the predicted values to probabilities. It ensures that the
value of the logistic regression must be between 0 and 1, which cannot go beyond this limit.
This makes it a powerful tool for decision-making in classification tasks.
In [1]: import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
In [5]: Le = LabelEncoder()
Out[6]: 0 1
1 1
2 0
3 0
4 1
5 1
6 0
7 0
Name: Gender, dtype: int64
0 1 19 19000
1 1 35 20000
2 0 26 43000
3 0 27 57000
4 1 19 76000
395 0 46 41000
396 1 51 23000
397 0 50 20000
398 1 36 33000
399 0 49 36000
In [8]: y = suv_df['Purchased']
y
Out[8]: 0 0
1 0
2 0
3 0
4 0
..
395 1
396 1
397 1
398 0
399 1
Name: Purchased, Length: 400, dtype: int64
Out[11]: 280
Out[12]: 120
1. Explanatory Variables: These are the independent variables or predictors in the model.
2. Response Variable: This is the dependent variable that we are trying to predict.
3. Logistic Function: This is the formula used to represent how the independent and
dependent variables relate to one another.
4. penalty: This specifies the norm used in the penalization.
5. dual: This is a boolean parameter that decides whether to solve the primal or dual
optimization problem.
6. tol: This is the tolerance for stopping criteria.
7. C: This is the inverse of regularization strength.
8. fit_intercept: This specifies if a constant (a.k.a. bias or intercept) should be added to
the decision function.
9. intercept_scaling: This is useful only when the solver ‘liblinear’ is used and
self.fit_intercept is set to True.
10. class_weight: This is an optional parameter that represents weights associated with
classes.
11. random_state: This parameter is used for shuffling the data.
12. solver: This parameter specifies the algorithm to use in the optimization problem.
13. max_iter: This is the maximum number of iterations for the solver to converge.
14. multi_class: This parameter specifies if the algorithm should use a one-vs-rest or
multinomial approach.
15. verbose: This is used for the verbosity of the output.
16. warm_start: This is a boolean parameter that decides whether to reuse the solution of
the previous call to fit as initialization.
17. n_jobs: This is the number of CPU cores used during the cross-validation loop.
18. l1_ratio: This parameter is used only when penalty is 'elasticnet'.
Out[15]: LogisticRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or
trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page
with nbviewer.org.
In [16]: y_test.head(8)
Out[16]: 208 1
227 1
268 1
9 0
327 0
119 0
134 0
349 0
Name: Purchased, dtype: int64
In [17]: y_pred = log.predict(x_test)
print(y_pred)
[0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 1 0 0]
There are several metrics that can be used to evaluate the performance of a Logistic
Regression model:
1. Accuracy: This is the ratio of the total number of correct predictions to the total number
of input samples.
2. Precision: This is the ratio of the total number of correct positive predictions to the total
number of positive predictions.
3. Recall (Sensitivity): This is the ratio of the total number of correct positive predictions
to the total number of actual positives.
4. F1-Score: This is the Harmonic Mean of Precision and Recall.
5. AUC-ROC (Area Under the Receiver Operating Characteristics Curve): This is a
performance measurement for classification problem at various thresholds settings.
6. Confusion Matrix: A table used to describe the performance of a classification model.
7. Log-Loss: It's a performance metric for evaluating the predictions of probabilities of
membership to a particular class.
8. AIC (Akaike Information Criteria): It's a method used to compare different models and
find out the best fitting model from the given different models.
9. Null Deviance and Residual Deviance: These are measures of goodness of fit for a
logistic regression model.
10. Mean Absolute Error (MAE): This is the mean of the absolute value of the errors.
11. Root Mean Square Error (RMSE): This is the square root of the average of squared
differences between prediction and actual observation.
12. Coefficient of Determination or R2: This metric provides an indication of the goodness
of fit of a set of predictions to the actual values.
13. Adjusted R2: This is a modified version of R-squared that has been adjusted for the
number of predictors in the model.
Remember, the choice of metric depends on your specific problem and the business context.
It's always a good idea to understand the assumptions and implications of each metric
before choosing one.
Out[18]: 0.65
In [19]: from sklearn import metrics
# Accuracy
accuracy = metrics.accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
# Precision
precision = metrics.precision_score(y_test, y_pred, zero_division=0) # The z
print(f'Precision: {precision}') # 0, means that out of all the positive pre
# Recall
recall = metrics.recall_score(y_test, y_pred)
print(f'Recall: {recall}') # 0, means that out of all the positive predictio
# F1 Score
f1_score = metrics.f1_score(y_test, y_pred)
print(f'F1 Score: {f1_score}') # F1 score is the harmonic mean of precision
# AUC-ROC
auc_roc = metrics.roc_auc_score(y_test, y_pred)
print(f'AUC-ROC: {auc_roc}')
# Confusion Matrix
confusion_matrix = metrics.confusion_matrix(y_test, y_pred)
print(f'Confusion Matrix: \n{confusion_matrix}')
print()
# Mean Absolute Error (MAE)
mae = metrics.mean_absolute_error(y_test, y_pred)
print(f'MAE: {mae}') # 0.4167, means that on average, your model’s predictio
# Root Mean Square Error (RMSE)
rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
print(f'RMSE: {rmse}') # 0.6455, means that on average, the square root of t
# R2 Score
r2 = metrics.r2_score(y_test, y_pred) # R2 score of 1 indicates that the mod
print(f'R2 Score: {r2}') # A negative R2 score indicates that your model is
# Adjusted R2 Score
n = len(y_test) # Number of observations
k = 1 # Number of predictors
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(f'Adjusted R2 Score: {adjusted_r2}') # A negative R2 score indicates t
Accuracy: 0.65
Precision: 0.3333333333333333
Recall: 0.07692307692307693
F1 Score: 0.125
AUC-ROC: 0.5014245014245013
Confusion Matrix:
[[75 6]
[36 3]]
MAE: 0.35
RMSE: 0.5916079783099616
R2 Score: -0.5954415954415957
Adjusted R2 Score: -0.6089622869283888
In [20]: # Printing AUC-ROC Curve
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred) # '_' This will prevent the
# In Python, the underscore _ is often used as a “throwaway” variable to ind
auc = metrics.roc_auc_score(y_test, y_pred)
plt.plot(fpr, # Increasing false positive rates such that element i is the f
tpr, # Increasing true positive rates such that element i is the tr
label="AUC = "+str(auc))
plt.legend(loc=4)
plt.show()
Predict Purchased
Out[21]: LabelEncoder()
In a Jupyter environment, please rerun this cell to show the HTML representation or
trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page
with nbviewer.org.
Out[22]: LabelEncoder()
In a Jupyter environment, please rerun this cell to show the HTML representation or
trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page
with nbviewer.org.
In [23]: age = int(input("Enter the Age: "))
gender = input("Enter the Gender: ")
gender_var = log_gender.transform([gender])[0] # The [0] at the end is an in
salary = int(input("Enter the Salary: "))
label_map = {0: 'No', 1: 'Yes'}
predict_value = log.predict([[age, gender_var, salary]])
# Map the numerical prediction back to a string label
label_value = label_map[predict_value[0]]
print(label_value) # 0: 'Yes', 1: 'No'
/usr/local/lib/python3.10/dist-packages/sklearn/base.py:439: UserWarning:
X does not have valid feature names, but LogisticRegression was fitted wit
h feature names
warnings.warn(