
Govt. College of Engineering and Ceramic Technology


THEORY ASSIGNMENT

Machine Learning

Submitted by

SURAJ ROY
SUVAM CHATTERJEE
ARGHYADIP DHARA

ROLL NOS.: GCECTB-R21-2040, GCECTB-R21-2041, GCECTB-R20-2007

BRANCH: INFORMATION TECHNOLOGY

Year: 2024

GOVT. COLLEGE OF ENGINEERING & CERAMIC TECHNOLOGY

73, Abinash Ch. Banerjee Lane, Kolkata - 10


Introduction to Logistic Regression

Logistic regression is a supervised machine learning algorithm used for classification tasks, where the goal is to predict the probability that an instance belongs to a given class. It is a statistical method that analyses the relationship between one or more independent variables and a categorical outcome. This assignment explores the fundamentals of logistic regression, its types, and an implementation.

What is Logistic Regression?

Logistic regression is used for binary classification, where we use the sigmoid function, which takes the independent variables as input and produces a probability value between 0 and 1.

For example, suppose we have two classes, Class 0 and Class 1. If the value of the logistic function for an input is greater than 0.5 (the threshold value), the instance belongs to Class 1; otherwise, it belongs to Class 0. It is referred to as regression because it is an extension of linear regression, but it is mainly used for classification problems.

Key Points:

 Logistic regression predicts the output of a categorical dependent variable. Therefore, the outcome must be a categorical or discrete value.

 The outcome can be Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, the model gives probabilistic values that lie between 0 and 1.

 In logistic regression, instead of fitting a straight regression line, we fit an “S”-shaped logistic function, which predicts two maximum values (0 or 1).

Logistic Function – Sigmoid Function

 The sigmoid function is a mathematical function used to map predicted values to probabilities.

 It maps any real value into a value within the range 0 to 1. The output of logistic regression must lie between 0 and 1 and cannot go beyond this limit, so the function forms an “S”-shaped curve.

 The S-shaped curve is called the sigmoid function or the logistic function.

 In logistic regression, we use the concept of a threshold value, which separates the two classes: predicted probabilities above the threshold are mapped to 1, and those below it are mapped to 0 (see the sketch below).
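
As a quick illustration (a minimal sketch, not part of the assignment's own code), the sigmoid and the 0.5-threshold rule above can be written in a few lines of NumPy; the variable names here are illustrative only.

import numpy as np

def sigmoid(z):
    """Map any real value z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Example scores from a linear model, z = w . x + b
z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
probs = sigmoid(z)                   # values strictly between 0 and 1
labels = (probs > 0.5).astype(int)   # above the 0.5 threshold -> Class 1

print(probs)   # [0.047 0.378 0.5 0.622 0.953] (rounded)
print(labels)  # [0 0 0 1 1]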

Types of Logistic Regression

On the basis of the categories, logistic regression can be classified into three types (a short code illustration follows the list):

1. Binomial: In binomial logistic regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.

2. Multinomial: In multinomial logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as “cat”, “dog”, or “sheep”.

3. Ordinal: In ordinal logistic regression, there can be 3 or more possible ordered types of the dependent variable, such as “low”, “medium”, or “high”.
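
To make the three types concrete, the sketch below (illustrative; it uses scikit-learn and synthetic data rather than anything from this assignment) fits a multinomial model; with two classes the same estimator performs binomial logistic regression, and ordinal regression needs a dedicated package.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data with three unordered classes (a "cat"/"dog"/"sheep" analogue)
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)

# Multinomial logistic regression; with only two classes the same call
# performs ordinary binomial logistic regression.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba(X[:2]))  # one probability per class, each row sums to 1

# Ordinal logistic regression is not built into scikit-learn; packages such
# as mord handle ordered targets like "low" < "medium" < "high".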

Assumptions of Logistic Regression

We will explore the assumptions of logistic regression, as understanding them is important to ensure that we are applying the model appropriately. The assumptions include:

1. Independent observations: Each observation is independent of the others, meaning there is no correlation between observations.

2. Binary dependent variable: The model assumes that the dependent variable is binary or dichotomous, meaning it can take only two values. For more than two categories, the softmax function is used instead.

3. Linear relationship between independent variables and log odds: The relationship between the independent variables and the log odds of the dependent variable should be linear.

4. No outliers: There should be no extreme outliers in the dataset.

5. Large sample size: The sample size should be sufficiently large. (A quick screen for some of these assumptions is sketched below.)
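
Two of these assumptions can be screened mechanically. The sketch below is illustrative only: synthetic data stands in for the real feature table, and the column names are hypothetical.

import numpy as np
import pandas as pd

# Illustrative data; in practice X and y come from the real dataset
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=['f1', 'f2', 'f3'])
y = pd.Series(rng.integers(0, 2, size=100))

# Binary dependent variable: exactly two distinct values expected
assert y.nunique() == 2, "target is not binary; use softmax/multinomial instead"

# No outliers: flag values more than 3 standard deviations from the mean
z_scores = (X - X.mean()) / X.std()
print("Suspected outliers per feature:")
print((z_scores.abs() > 3).sum())

# Large sample size: a common rule of thumb is ~10 or more events per predictor
print("Observations per predictor:", len(X) / X.shape[1])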

The logistic regression model transforms the continuous output of the linear regression function into a categorical output using a sigmoid function, which maps any real-valued combination of the independent variables into a value between 0 and 1. This function is known as the logistic function.

Let the independent input features be \(X = [x_1, x_2, x_3, \dots, x_m]\), and let the dependent variable be \(Y\), taking only binary values, i.e. 0 or 1.

Then apply the multi-linear function to the input variables \(X\):

\[ z = \sum_{i=1}^{m} w_i x_i + b \]

Here \(x_i\) is the \(i\)-th input feature, \(w = [w_1, w_2, w_3, \dots, w_m]\) is the vector of weights or coefficients, and \(b\) is the bias term, also known as the intercept. This can be represented more simply as the dot product of the weights and the inputs, plus the bias:

\[ z = w \cdot X + b \]

Everything discussed up to this point is simply linear regression.

Sigmoid Function

Now we use the sigmoid function, which takes \(z\) as input and returns the predicted probability \(\hat{y}\) between 0 and 1:

\[ \sigma(z) = \frac{1}{1 + e^{-z}} \]

[Figure: the S-shaped sigmoid (logistic) curve]

As the figure shows, the sigmoid function converts continuous input values into probabilities between 0 and 1.

 \(\sigma(z)\) tends towards 1 as \(z \to \infty\)

 \(\sigma(z)\) tends towards 0 as \(z \to -\infty\)

 \(\sigma(z)\) is always bounded between 0 and 1

The probability of belonging to a class can then be measured as:

\[ P(y=1) = \sigma(z), \qquad P(y=0) = 1 - \sigma(z) \]

Logistic Regression Equation

The odds are the ratio of the probability of something occurring to the probability of it not occurring. Odds differ from probability, which is the ratio of something occurring to everything that could possibly occur. So the odds are:

\[ \frac{p(x)}{1 - p(x)} = e^{w \cdot X + b} \]

For example, a probability of 0.8 corresponds to odds of 0.8/0.2 = 4.

Applying the natural log to the odds, the log odds are:

\[ \log\!\left(\frac{p(x)}{1 - p(x)}\right) = w \cdot X + b \]

Then the final logistic regression equation is:

\[ p(X; b, w) = \frac{e^{w \cdot X + b}}{1 + e^{w \cdot X + b}} = \frac{1}{1 + e^{-(w \cdot X + b)}} \]

Likelihood Function for Logistic Regression

The predicted probabilities are:

 for \(y = 1\): \(p(X; b, w) = p(x)\)

 for \(y = 0\): \(1 - p(X; b, w) = 1 - p(x)\)

so the likelihood of the observed data is:

\[ L(b, w) = \prod_{i=1}^{n} p(x_i)^{y_i} \left(1 - p(x_i)\right)^{1 - y_i} \]

Taking natural logs on both sides gives the log-likelihood:

\[ \ell(b, w) = \sum_{i=1}^{n} \Big[ y_i \log p(x_i) + (1 - y_i) \log\big(1 - p(x_i)\big) \Big] \]

Gradient of the log-likelihood function

To find the maximum likelihood estimates, we differentiate with respect to \(w_j\):

\[ \frac{\partial \ell}{\partial w_j} = \sum_{i=1}^{n} \big( y_i - p(x_i) \big)\, x_{ij} \]
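Putting the derivation together, here is a minimal NumPy sketch (an illustration, not the assignment's code) that maximizes the log-likelihood by gradient ascent using the gradient derived above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Maximum-likelihood logistic regression via gradient ascent."""
    n, m = X.shape
    w, b = np.zeros(m), 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)           # p(x_i) for every observation
        w += lr * (X.T @ (y - p)) / n    # d l / d w = sum_i (y_i - p(x_i)) x_i
        b += lr * np.sum(y - p) / n      # d l / d b = sum_i (y_i - p(x_i))
    return w, b

# Tiny synthetic check with one informative feature
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(float)
print(fit_logistic(X, y))  # the weight on the informative feature comes out clearly positive
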

Problem Statement: Breast Cancer Detection
Breast cancer is one of the most common and potentially life-threatening
cancers among women. Early detection plays a critical role in treatment success
and survival rates. Therefore, developing accurate methods for detecting breast
cancer based on clinical data is of great importance.
In this problem, we aim to build a classification model that predicts whether a
tumor is malignant (cancerous) or benign (non-cancerous) based on features
extracted from breast cancer biopsy samples. The dataset typically contains
features such as the size, texture, and shape of cell nuclei and other tumor
properties.
The goal is to predict a binary outcome:
 Malignant (1): Tumor is cancerous.

 Benign (0): Tumor is not cancerous.

Why Use Logistic Regression for Breast Cancer Detection?
Logistic Regression is a popular classification algorithm that is ideal for binary
classification problems, like determining whether a tumor is malignant or
benign. It works by estimating the probability that a given input belongs to one
of the two classes.
Key Reasons for Using Logistic Regression:
1. Binary Classification: Logistic regression is designed for binary outcomes,
such as classifying a tumor as either benign (0) or malignant (1).
2. Probabilistic Interpretation: Logistic regression outputs probabilities,
providing a score between 0 and 1. This score can be interpreted as the
probability that the tumor is malignant, which is valuable in medical
diagnostics.
3. Linear Boundaries: Logistic regression assumes a linear relationship
between the input features (e.g., tumor size, texture, and shape) and the
log-odds of the outcome. For datasets where this relationship is
reasonable, logistic regression performs well.
4. Feature Weights: Logistic regression assigns weights to each feature,
which can be interpreted as the impact of each feature on the likelihood
of malignancy. This interpretability is useful in medical applications for
understanding how different characteristics of a tumor contribute to the
diagnosis (see the sketch below).
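
As a sketch of point 4 (illustrative only: synthetic data and hypothetical feature names stand in for the real biopsy features), exponentiating a fitted coefficient gives the multiplicative change in the odds of malignancy per one-unit increase in that feature.

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in data; the column names below are hypothetical
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
cols = ['radius', 'texture', 'perimeter', 'smoothness']

model = LogisticRegression(max_iter=1000).fit(X, y)

# exp(coefficient) = odds ratio per one-unit feature increase,
# holding the other features fixed
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=cols)
print(odds_ratios.sort_values(ascending=False))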

How Logistic Regression Works for Breast Cancer Detection
1. Logistic Function (Sigmoid Function): Logistic regression uses the
sigmoid function to convert the linear combination of input features into
a probability between 0 and 1.

2. Decision Boundary: Once the logistic regression model predicts a
probability, a decision threshold is applied to classify the outcome. Typically,
the threshold is set at 0.5 (a short sketch of this thresholding appears after
this list), meaning:
 If the predicted probability \(\hat{P}(y=1 \mid X) \geq 0.5\), the tumor is classified as malignant.
 If \(\hat{P}(y=1 \mid X) < 0.5\), the tumor is classified as benign.
3. Cost Function: Logistic regression uses a cost function based on log-
likelihood. This is optimized using gradient descent to find the best-fitting
parameters (weights).
4. Training the Model: The model is trained using a dataset containing labeled
examples (features of tumors and their known classifications as malignant or
benign). During training, the model learns the parameters (weights) that best
separate malignant from benign cases.
5. Prediction: After training, the logistic regression model can predict whether
a new, unseen tumor is malignant or benign by calculating the probability using
the sigmoid function and applying the decision boundary.
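
A minimal sketch of steps 1, 2 and 5 (illustrative; synthetic data stands in for the real tumor features):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Sigmoid probability for the positive class, then a 0.5 decision threshold
proba = model.predict_proba(X[:5])[:, 1]   # P(y = 1 | X)
predicted = (proba >= 0.5).astype(int)     # 1 = malignant, 0 = benign

print(proba)
print(predicted)  # agrees with model.predict(X[:5])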

Example of Features in Breast Cancer Dataset

A typical dataset for breast cancer detection, such as the Wisconsin Breast
Cancer Dataset, includes the following features:
 Radius (mean of distances from center to points on the perimeter)
 Texture (standard deviation of gray-scale values)
 Perimeter
 Area
 Smoothness (local variation in radius lengths)
 Compactness (perimeter² / area - 1.0)
 Concavity (severity of concave portions of the contour)
 Concave points (number of concave portions of the contour)
 Symmetry
 Fractal dimension (coastline approximation - "roughness" of cell
borders)
Each of these features provides important information about the
characteristics of the tumor cells and helps the logistic regression model
determine whether a tumor is cancerous or not.

Step-by-Step Approach Using Logistic Regression for Breast Cancer Detection
1. Data Collection
a. Data Sources: Common datasets include the Breast Cancer Wisconsin
(Diagnostic) dataset, which contains features extracted from digitized
images of fine needle aspirate (FNA) of breast masses.
b. Features: Typical features include:
 Radius
 Texture
 Perimeter
 Area
 Smoothness
 Compactness
 Concavity
 Symmetry
 Fractal dimension

2. Data Preprocessing
 Handling Missing Values: Check for missing values and handle them
through imputation or removal.
 Normalization/Standardization: Scale features to ensure that they
are on a similar scale. This helps in optimizing the logistic regression
algorithm.
 Encoding Categorical Variables: If any categorical features exist,
encode them into numerical values using techniques like one-hot
encoding.
3. Exploratory Data Analysis (EDA)
 Visualizations: Use plots (histograms, scatter plots, box plots) to
understand the distribution of features and identify relationships between
them.
 Correlation Analysis: Assess how features correlate with the target
variable (benign vs. malignant).
4. Splitting the Data
a. Train-Test Split: Divide the dataset into training and testing sets
(commonly 70% train and 30% test).
b. This helps evaluate the model’s performance on unseen data.
5. Building the Logistic Regression Model
a. Model Definition: Logistic regression predicts the probability of a
binary outcome. The logistic function (sigmoid function) is used to
transform the linear combination of features into a probability.
b. Model Equation: The logistic regression model can be defined as:
\[ P(y=1 \mid X) = \frac{1}{1 + e^{-(w \cdot X + b)}} \]
c. Training the Model: Use the training dataset to fit the logistic
regression model, optimizing the coefficients using methods like
Maximum Likelihood Estimation.

6. Model Evaluation
a. Prediction: Use the test set to make predictions on the probabilities of
benign or malignant outcomes.
b. Thresholding: Convert probabilities to binary outcomes based on a
threshold (commonly 0.5).
c. Performance Metrics:
 Accuracy: Overall correct predictions.
 Precision: True positives / (True positives + False positives).
 Recall (Sensitivity): True positives / (True positives + False
negatives).
 F1 Score: Harmonic mean of precision and recall.
 Confusion Matrix: A matrix to visualize true positives, true
negatives, false positives, and false negatives.

7. Model Validation
a. Cross-Validation: Use techniques like k-fold cross-validation to
ensure that the model generalizes well to unseen data (a sketch follows
this list).
b. ROC Curve and AUC: Plot the Receiver Operating Characteristic
curve and calculate the Area Under the Curve to assess the trade-off
between the true positive rate and the false positive rate.
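
The validation step can be sketched as follows (illustrative; it uses scikit-learn's bundled breast cancer data rather than the CSV used later, and a pipeline so that scaling is refit inside each fold):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaling inside the pipeline keeps each CV fold leakage-free:
# the scaler is fit only on that fold's training split
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auc_scores = cross_val_score(pipe, X, y, cv=cv, scoring='roc_auc')
print('ROC AUC per fold:', auc_scores.round(3))
print('Mean AUC:', auc_scores.mean().round(3))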

Advantages of Logistic Regression for Breast Cancer Detection
 Simplicity: Logistic regression is simple to implement and interpret.
 Efficiency: It is computationally efficient and works well with small to
medium-sized datasets.
 Probabilistic Output: The probabilistic nature of the output helps in
quantifying the uncertainty in predictions, which is useful in medical
applications.
 Interpretability: The weights assigned to features can provide insights
into which tumor characteristics are important for classifying tumors as
malignant or benign.

Conclusion
In more complex cases where the relationship between features and the target
is non-linear, more advanced algorithms like support vector machines (SVM)
or neural networks might outperform logistic regression. However, for many
breast cancer datasets, logistic regression can be an effective and interpretable
method for early detection.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.linear_model import LogisticRegression

# Load the breast cancer dataset
data = pd.read_csv('/content/breast_cancer_bd (1).csv')

data.head(15)

(Column abbreviations: Code = Sample code number, Clump = Clump Thickness, UnifSize = Uniformity of Cell Size, UnifShape = Uniformity of Cell Shape, MargAdh = Marginal Adhesion, EpithSize = Single Epithelial Cell Size, BareNuc = Bare Nuclei, BlandChr = Bland Chromatin, NormNuc = Normal Nucleoli.)

        Code  Clump  UnifSize  UnifShape  MargAdh  EpithSize  BareNuc  BlandChr  NormNuc  Mitoses  Class
0    1000025      5         1          1        1          2        1         3        1        1      2
1    1002945      5         4          4        5          7       10         3        2        1      2
2    1015425      3         1          1        1          2        2         3        1        1      2
3    1016277      6         8          8        1          3        4         3        7        1      2
4    1017023      4         1          1        3          2        1         3        1        1      2
5    1017122      8        10         10        8          7       10         9        7        1      4
6    1018099      1         1          1        1          2       10         3        1        1      2
7    1018561      2         1          2        1          2        1         3        1        1      2
8    1033078      2         1          1        1          2        1         1        1        5      2
9    1033078      4         2          1        1          2        1         2        1        1      2
10   1035283      1         1          1        1          1        1         3        1        1      2
11   1036172      2         1          1        1          2        1         2        1        1      2
12   1041801      5         3          3        3          2        3         4        4        1      4
13   1043999      1         1          1        1          2        3         3        1        1      2

# Drop the ID column; it carries no predictive information
data = data.drop('Sample code number', axis=1)

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Clump Thickness 699 non-null int64
1 Uniformity of Cell Size 699 non-null int64
2 Uniformity of Cell Shape 699 non-null int64
3 Marginal Adhesion 699 non-null int64
4 Single Epithelial Cell Size 699 non-null int64
5 Bare Nuclei 699 non-null object
6 Bland Chromatin 699 non-null int64
7 Normal Nucleoli 699 non-null int64
8 Mitoses 699 non-null int64
9 Class 699 non-null int64
dtypes: int64(9), object(1)
memory usage: 54.7+ KB

data.describe()


(Note: 'Bare Nuclei' is excluded from describe() here because it is still stored as text at this point.)

       Clump Thickness  Unif. Cell Size  Unif. Cell Shape  Marginal Adhesion  Epith. Cell Size  Bland Chromatin  Normal Nucleoli     Mitoses       Class
count       699.000000       699.000000        699.000000         699.000000        699.000000       699.000000       699.000000  699.000000  699.000000
mean          4.417740         3.134478          3.207439           2.806867          3.216023         3.437768         2.866953    1.589413    2.689557
std           2.815741         3.051459          2.971913           2.855379          2.214300         2.438364         3.053634    1.715078    0.951273
min           1.000000         1.000000          1.000000           1.000000          1.000000         1.000000         1.000000    1.000000    2.000000
25%           2.000000         1.000000          1.000000           1.000000          2.000000         2.000000         1.000000    1.000000    2.000000
50%           4.000000         1.000000          1.000000           1.000000          2.000000         3.000000         1.000000    1.000000    2.000000
75%           6.000000         5.000000          5.000000           4.000000          4.000000         5.000000         4.000000    1.000000    4.000000

print(data['Bare Nuclei'].unique())

['1' '10' '2' '4' '3' '9' '7' '?' '5' '8' '6']

# '?' marks missing values; replace with NaN and convert the column to numeric
data = data.replace('?', np.nan)
data['Bare Nuclei'] = pd.to_numeric(data['Bare Nuclei'])


data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 699 entries, 0 to 698
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Clump Thickness 699 non-null int64
1 Uniformity of Cell Size 699 non-null int64
2 Uniformity of Cell Shape 699 non-null int64
3 Marginal Adhesion 699 non-null int64
4 Single Epithelial Cell Size 699 non-null int64
5 Bare Nuclei 683 non-null float64
6 Bland Chromatin 699 non-null int64
7 Normal Nucleoli 699 non-null int64
8 Mitoses 699 non-null int64
9 Class 699 non-null int64
dtypes: float64(1), int64(9)
memory usage: 54.7 KB

# Impute the 16 missing 'Bare Nuclei' values with the column median
data['Bare Nuclei'] = data['Bare Nuclei'].fillna(data['Bare Nuclei'].median())

y = data['Class']
X = data.drop('Class', axis=1)

# 70/30 train-test split; without a fixed random_state, the split (and the
# accuracy scores below) will vary from run to run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

lr_model = LogisticRegression()

lr_model.fit(X_train, y_train)

LogisticRegression()

y_pred_lr = lr_model.predict(X_test)
score = accuracy_score(y_test,y_pred_lr)
print('Accuracy score: {}'.format(score))

Accuracy score: 0.9809523809523809

# gamma has no effect with a linear kernel; only C matters here
svc_model = SVC(C=0.1, kernel='linear', gamma=1)

svc_model.fit(X_train, y_train)

SVC(C=0.1, gamma=1, kernel='linear')

y_pred_svm = svc_model.predict(X_test)
score = accuracy_score(y_test,y_pred_svm)
print('Accuracy score: {}'.format(score))


Accuracy score: 0.9809523809523809


import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import roc_curve, auc

# Confusion Matrix for Logistic Regression


cm_lr = confusion_matrix(y_test, y_pred_lr)
disp_lr = ConfusionMatrixDisplay(confusion_matrix=cm_lr, display_labels=lr_model.classes_)
disp_lr.plot()
plt.title('Confusion Matrix - Logistic Regression')
plt.show()

# Confusion Matrix for SVM


cm_svm = confusion_matrix(y_test, y_pred_svm)
disp_svm = ConfusionMatrixDisplay(confusion_matrix=cm_svm, display_labels=svc_model.classes_)
disp_svm.plot()
plt.title('Confusion Matrix - SVM')
plt.show()

# Classification Report for Logistic Regression


print("Classification Report - Logistic Regression\n", classification_report(y_test, y_pred_lr))

# Classification Report for SVM


print("Classification Report - SVM\n", classification_report(y_test, y_pred_svm))

# ROC Curve and AUC for Logistic Regression


y_pred_proba_lr = lr_model.predict_proba(X_test)[:, 1]
# Specify pos_label to indicate that '4' is the positive class
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_pred_proba_lr, pos_label=4)
roc_auc_lr = auc(fpr_lr, tpr_lr)


plt.figure()
plt.plot(fpr_lr, tpr_lr, label='Logistic Regression (AUC = %0.2f)' % roc_auc_lr)
plt.plot([0, 1], [0, 1], 'k--') # Random guessing line
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Logistic Regression')
plt.legend(loc="lower right")
plt.show()

# ROC Curve and AUC for SVM


y_pred_decision_svm = svc_model.decision_function(X_test)
# Specify pos_label to indicate that '4' is the positive class
fpr_svm, tpr_svm, _ = roc_curve(y_test, y_pred_decision_svm, pos_label=4)
roc_auc_svm = auc(fpr_svm, tpr_svm)

plt.figure()
plt.plot(fpr_svm, tpr_svm, label='SVM (AUC = %0.2f)' % roc_auc_svm)
plt.plot([0, 1], [0, 1], 'k--') # random guessing line
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - SVM')
plt.legend(loc="lower right")
plt.show()

# Feature Importance (for Logistic Regression) - Coefficients


feature_importance = pd.DataFrame({'Feature': X.columns, 'Coefficient': lr_model.coef_[0]})
feature_importance = feature_importance.sort_values(by='Coefficient', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x='Coefficient', y='Feature', data=feature_importance)
plt.title('Feature Importance - Logistic Regression')
plt.show()

# Pairplot (for a subset of features for better visualization)


sns.pairplot(data[['Clump Thickness', 'Uniformity of Cell Size', 'Uniformity of Cell Shape', 'Class']], hue='Class')
plt.show()

https://colab.research.google.com/drive/1ptYtObP7alBpVj94HC9kHOgZTB02NmVt#scrollTo=e6ZU0cK19lum&printMode=true 3/7
10/24/24, 9:00 PM logistic_regression.ipynb - Colab

Classification Report - Logistic Regression

              precision    recall  f1-score   support

           2       1.00      0.97      0.99       142
           4       0.94      1.00      0.97        68

    accuracy                           0.98       210
   macro avg       0.97      0.99      0.98       210
weighted avg       0.98      0.98      0.98       210

Classification Report - SVM

              precision    recall  f1-score   support

           2       1.00      0.97      0.99       142
           4       0.94      1.00      0.97        68

    accuracy                           0.98       210
   macro avg       0.97      0.99      0.98       210
weighted avg       0.98      0.98      0.98       210

import matplotlib.pyplot as plt
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Generate some sample data (replace with your actual data)
X_train = np.linspace(-6, 6, 100)
y_train = sigmoid(X_train) + np.random.normal(0, 0.1, 100)  # add some noise
X_test = np.linspace(-6, 6, 50)
y_test = sigmoid(X_test) + np.random.normal(0, 0.1, 50)

# Plot the sigmoid curve and data points
plt.figure(figsize=(10, 6))

# Sigmoid curve
x_curve = np.linspace(-6, 6, 200)
y_curve = sigmoid(x_curve)
plt.plot(x_curve, y_curve, color='blue', label='Sigmoid Curve')

# Training data points
plt.scatter(X_train, y_train, color='red', label='Training Data', marker='o')

# Testing data points
plt.scatter(X_test, y_test, color='green', label='Testing Data', marker='x')

plt.xlabel('X')
plt.ylabel('Sigmoid(X)')
plt.title('Sigmoid Function with Training and Testing Data')
plt.legend()
plt.grid(True)
plt.show()
