ML Assignment
Machine Learning
Submitted by
SURAJ ROY
SUVAM CHATTERJEE
ARGHYADIP DHARA
Year: 2024
Logistic regression is used for binary classification, where the sigmoid function takes the independent variables as input and produces a probability value between 0 and 1.
For example, suppose we have two classes, Class 0 and Class 1. If the value of the logistic function for an input is greater than 0.5 (the threshold value), the input belongs to Class 1; otherwise it belongs to Class 0. It is referred to as regression because it is an extension of linear regression, but it is mainly used for classification problems.
Key Points:
- The outcome can be Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, logistic regression gives probabilistic values that lie between 0 and 1.
- It maps any real value to a value within the range 0 to 1. Since the output of logistic regression must stay between 0 and 1 and cannot go beyond this limit, it forms an "S"-shaped curve.
- The S-shaped curve is called the sigmoid function or the logistic function.
- In logistic regression, we use the concept of a threshold value, which separates the predictions into 0 and 1: values above the threshold tend toward 1, and values below the threshold tend toward 0.
On the basis of the categories, logistic regression can be classified into three types:
1. Binomial: In binomial logistic regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
2. Multinomial: In multinomial logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".
3. Ordinal: In ordinal logistic regression, there can be 3 or more possible ordered types of the dependent variable, such as "low", "medium", or "high".
The logistic regression model transforms the continuous-valued output of the linear regression function into a categorical output using a sigmoid function, which maps any real-valued combination of the independent input variables to a value between 0 and 1. This function is known as the logistic function.
Let the independent input features be X, with weight vector w and bias b. The linear combination of the inputs is:
z = w⋅X + b
Everything discussed up to this point is simply linear regression.
Sigmoid Function
Now we apply the sigmoid function, taking z as its input, to obtain a probability between 0 and 1, i.e., the predicted y.
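Explicitly, the sigmoid function is:
σ(z) = 1 / (1 + e^(−z))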
[Figure: the S-shaped curve of the sigmoid function]
As the figure shows, the sigmoid function converts the continuous variable z into a probability, i.e., a value between 0 and 1.
The odds are the ratio of the probability of something occurring to the probability of it not occurring. This differs from probability, which is the ratio of something occurring to everything that could possibly occur. So the odds will be:
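odds = p / (1 − p)
Taking the logarithm gives the log-odds, which logistic regression models with the linear function above:
log(p / (1 − p)) = z = w⋅X + b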
2. Data Preprocessing
Handling Missing Values: Check for missing values and handle them
through imputation or removal.
Normalization/Standardization: Scale features to ensure that they
are on a similar scale. This helps in optimizing the logistic regression
algorithm.
Encoding Categorical Variables: If any categorical features exist,
encode them into numerical values using techniques like one-hot
encoding.
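A minimal sketch of these preprocessing steps with pandas and scikit-learn (the DataFrame and column names here are hypothetical placeholders, not taken from the report):

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical feature table with a missing value and a categorical column
df = pd.DataFrame({'size': [1.0, None, 3.0], 'color': ['red', 'blue', 'red']})

# Encode the categorical column as one-hot indicator columns
df = pd.get_dummies(df, columns=['color'])

# Impute missing values with the column mean
X_imputed = SimpleImputer(strategy='mean').fit_transform(df)

# Standardize features to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X_imputed)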
3. Exploratory Data Analysis (EDA)
Visualizations: Use plots (histograms, scatter plots, box plots) to
understand the distribution of features and identify relationships between
them.
Correlation Analysis: Assess how features correlate with the target
variable (benign vs. malignant).
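A brief sketch of such EDA, assuming data is an all-numeric feature DataFrame like the cleaned one used later:

import matplotlib.pyplot as plt
import seaborn as sns

# Histograms of every feature to inspect distributions
data.hist(figsize=(12, 8))
plt.show()

# Heatmap of pairwise correlations, including with the target column
sns.heatmap(data.corr(), annot=True, cmap='coolwarm')
plt.show()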
4. Splitting the Data
a. Train-Test Split: Divide the dataset into training and
testing sets (commonly 70% train and 30% test).
b. This helps evaluate the model’s performance on unseen
data.
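For example, with scikit-learn (X and y denote the feature matrix and target vector; the random seed is an arbitrary reproducibility choice):

from sklearn.model_selection import train_test_split

# 70% train / 30% test, as described above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)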
5. Building the Logistic Regression Model
a. Model Definition: Logistic regression predicts the probability of a
binary outcome. The logistic function (sigmoid function) is used to
transform the linear combination of features into a probability.
b. Model Equation: The logistic regression model can be defined as:
P(y = 1 | X) = σ(w⋅X + b) = 1 / (1 + e^(−(w⋅X + b)))
c. Training the Model: Use the training dataset to fit the logistic
regression model, optimizing the coefficients using methods like
Maximum Likelihood Estimation.
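A minimal fitting sketch (scikit-learn's LogisticRegression maximizes a penalized likelihood internally):

from sklearn.linear_model import LogisticRegression

# Fit the classifier on the training split
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)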
6. Model Evaluation
a. Prediction: Use the test set to make predictions on the probabilities of
benign or malignant outcomes.
b. Thresholding: Convert probabilities to binary outcomes based on a
threshold (commonly 0.5).
c. Performance Metrics:
Accuracy: Overall correct predictions.
Precision: True positives / (True positives + False positives).
Recall (Sensitivity): True positives / (True positives + False
negatives).
F1 Score: Harmonic mean of precision and recall.
Confusion Matrix: A matrix to visualize true positives, true
negatives, false positives, and false negatives.
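These metrics can be computed with scikit-learn; in the sketch below, pos_label=4 is an assumption matching the malignant class label in the dataset used later:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_pred = model.predict(X_test)

print('Accuracy :', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred, pos_label=4))  # malignant = 4 (assumption)
print('Recall   :', recall_score(y_test, y_pred, pos_label=4))
print('F1 score :', f1_score(y_test, y_pred, pos_label=4))
print(confusion_matrix(y_test, y_pred))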
7. Model Validation
a. Cross-Validation: Use techniques like k-fold cross-validation to
ensure that the model generalizes well to unseen data.
b. ROC Curve and AUC: Plot the Receiver Operating Characteristic
curve and calculate the Area Under the Curve to assess the trade-off
between true positive rate and false positive rate.
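A short sketch of both checks (the 5-fold setting and the use of predicted probabilities for AUC are common defaults, not taken from the report):

from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

# k-fold cross-validation accuracy (k = 5 here)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print('CV accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))

# AUC from the probability assigned to the positive class
y_prob = model.predict_proba(X_test)[:, 1]
print('AUC:', roc_auc_score(y_test, y_prob))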
Conclusion:
In more complex cases where the relationship between features and the target
is non-linear, more advanced algorithms like support vector machines (SVM)
or neural networks might outperform logistic regression. However, for many
breast cancer datasets, logistic regression can be an effective and interpretable
method for early detection.
import pandas as pd
import numpy as np
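The step that loads the dataset is not visible in the printout; a minimal sketch with a hypothetical file name (the 14 displayed rows suggest data.head(14) or similar):

# Hypothetical file name; the actual path used in the notebook is not shown
data = pd.read_csv('breast_cancer_wisconsin.csv')
data.head(14)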
The first rows of the dataset (columns: Sample code number, Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, Mitoses, Class):
0 1000025 5 1 1 1 2 1 3 1 1 2
1 1002945 5 4 4 5 7 10 3 2 1 2
2 1015425 3 1 1 1 2 2 3 1 1 2
3 1016277 6 8 8 1 3 4 3 7 1 2
4 1017023 4 1 1 3 2 1 3 1 1 2
5 1017122 8 10 10 8 7 10 9 7 1 4
6 1018099 1 1 1 1 2 10 3 1 1 2
7 1018561 2 1 2 1 2 1 3 1 1 2
8 1033078 2 1 1 1 2 1 1 1 5 2
9 1033078 4 2 1 1 2 1 2 1 1 2
10 1035283 1 1 1 1 1 1 3 1 1 2
11 1036172 2 1 1 1 2 1 2 1 1 2
12 1041801 5 3 3 3 2 3 4 4 1 4
13 1043999 1 1 1 1 2 3 3 1 1 2
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Clump Thickness 699 non-null int64
1 Uniformity of Cell Size 699 non-null int64
2 Uniformity of Cell Shape 699 non-null int64
3 Marginal Adhesion 699 non-null int64
4 Single Epithelial Cell Size 699 non-null int64
5 Bare Nuclei 699 non-null object
6 Bland Chromatin 699 non-null int64
7 Normal Nucleoli 699 non-null int64
8 Mitoses 699 non-null int64
9 Class 699 non-null int64
dtypes: int64(9), object(1)
memory usage: 54.7+ KB
data.describe()
data.describe() summary (Bare Nuclei is excluded while it is still a non-numeric column):

Feature                          count      mean        std       min    25%    50%    75%
Clump Thickness                  699.0      4.417740   2.815741  1.0    2.0    4.0    6.0
Uniformity of Cell Size          699.0      3.134478   3.051459  1.0    1.0    1.0    5.0
Uniformity of Cell Shape         699.0      3.207439   2.971913  1.0    1.0    1.0    5.0
Marginal Adhesion                699.0      2.806867   2.855379  1.0    1.0    1.0    4.0
Single Epithelial Cell Size      699.0      3.216023   2.214300  1.0    2.0    2.0    4.0
Bland Chromatin                  699.0      3.437768   2.438364  1.0    2.0    3.0    5.0
Normal Nucleoli                  699.0      2.866953   3.053634  1.0    1.0    1.0    4.0
Mitoses                          699.0      1.589413   1.715078  1.0    1.0    1.0    1.0
Class                            699.0      2.689557   0.951273  2.0    2.0    2.0    4.0
print(data['Bare Nuclei'].unique())
['1' '10' '2' '4' '3' '9' '7' '?' '5' '8' '6']
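Between this check and the info() view below, the '?' placeholders in Bare Nuclei are evidently replaced and the column converted to a numeric type; a minimal sketch of that step:

# Replace the '?' placeholders with NaN and cast the column to float,
# which matches the 683 non-null float64 entries shown below
data['Bare Nuclei'] = data['Bare Nuclei'].replace('?', np.nan).astype(float)
data.info()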
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Clump Thickness 699 non-null int64
1 Uniformity of Cell Size 699 non-null int64
2 Uniformity of Cell Shape 699 non-null int64
3 Marginal Adhesion 699 non-null int64
4 Single Epithelial Cell Size 699 non-null int64
5 Bare Nuclei 683 non-null float64
6 Bland Chromatin 699 non-null int64
7 Normal Nucleoli 699 non-null int64
8 Mitoses 699 non-null int64
9 Class 699 non-null int64
dtypes: float64(1), int64(9)
memory usage: 54.7 KB
y = data['Class']
X = data.drop('Class', axis=1)
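The scikit-learn imports and the train/test split are not shown in the printout; a minimal sketch (the 70/30 split, the random seed, and the dropping of rows with missing Bare Nuclei values are assumptions):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Assumption: rows with missing Bare Nuclei values are removed,
# since LogisticRegression cannot handle NaN inputs
X = X.dropna()
y = y[X.index]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)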
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
LogisticRegression()
y_pred_lr = lr_model.predict(X_test)
score = accuracy_score(y_test,y_pred_lr)
print('Accuracy score: {}'.format(score))
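svc_model is used below without being defined in the printout; a plausible definition (probability=True is an assumption, made so that predict_proba is available for the ROC curves later):

from sklearn.svm import SVC

svc_model = SVC(probability=True)  # probability=True assumed for the ROC analysis below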
svc_model.fit(X_train, y_train)
y_pred_svm = svc_model.predict(X_test)
score = accuracy_score(y_test,y_pred_svm)
print('Accuracy score: {}'.format(score))
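The ROC inputs (fpr_lr, tpr_lr, roc_auc_lr and their SVM counterparts) are computed before the plots; a minimal sketch, assuming predicted probabilities are used and the dataset's malignant label 4 is the positive class:

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

# Probability assigned to the positive (malignant, Class = 4) class
y_prob_lr = lr_model.predict_proba(X_test)[:, 1]
y_prob_svm = svc_model.predict_proba(X_test)[:, 1]

fpr_lr, tpr_lr, _ = roc_curve(y_test, y_prob_lr, pos_label=4)
roc_auc_lr = auc(fpr_lr, tpr_lr)

fpr_svm, tpr_svm, _ = roc_curve(y_test, y_prob_svm, pos_label=4)
roc_auc_svm = auc(fpr_svm, tpr_svm)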
plt.figure()
plt.plot(fpr_lr, tpr_lr, label='Logistic Regression (AUC = %0.2f)' % roc_auc_lr)
plt.plot([0, 1], [0, 1], 'k--') # Random guessing line
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Logistic Regression')
plt.legend(loc="lower right")
plt.show()
plt.figure()
plt.plot(fpr_svm, tpr_svm, label='SVM (AUC = %0.2f)' % roc_auc_svm)
plt.plot([0, 1], [0, 1], 'k--') # random guessing line
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - SVM')
plt.legend(loc="lower right")
plt.show()
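feature_importance and the seaborn import are not shown in the printout; a minimal sketch that builds the table from the fitted model's coefficients:

import seaborn as sns

# Pair each feature name with its learned coefficient and sort for plotting
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': lr_model.coef_[0]
}).sort_values('Coefficient')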
plt.figure(figsize=(10, 6))
sns.barplot(x='Coefficient', y='Feature', data=feature_importance)
plt.title('Feature Importance - Logistic Regression')
plt.show()
import matplotlib.pyplot as plt
import numpy as np

def sigmoid(x):
    # Logistic function: maps any real number to a value in (0, 1)
    return 1 / (1 + np.exp(-x))

plt.figure(figsize=(10, 6))
x_curve = np.linspace(-10, 10, 200)  # input range for the curve
y_curve = sigmoid(x_curve)
plt.plot(x_curve, y_curve, label='sigmoid(x)')
plt.xlabel('X')
plt.ylabel('Sigmoid(X)')
plt.legend()
plt.grid(True)
plt.show()