ADS - Phase 3

The document provides a comprehensive overview of data preprocessing and loading techniques in machine learning, specifically focusing on logistic regression and decision tree classifiers. It includes code examples for generating sample data, splitting datasets, training models, making predictions, and evaluating model performance using metrics like accuracy and F1 score. The conclusion emphasizes the importance of data preprocessing in enhancing model accuracy and interpretability.


DATA PREPROCESSING AND LOADING

1. LOGISTIC REGRESSION
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Generate some sample data
np.random.seed(0)
data = {
    'Exam1': np.random.rand(100) * 100,
    'Exam2': np.random.rand(100) * 100,
    'Admitted': np.random.randint(2, size=100)
}
df = pd.DataFrame(data)
print(df)

# Split the data into features (X) and target (y)
X = df[['Exam1', 'Exam2']]
y = df['Admitted']
print(X)
print(y)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Create a logistic regression model
model = LogisticRegression()

# Fit the model to the training data
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)
print("------------------")
print(y_pred)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')

# Display classification report and confusion matrix
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# Plot the decision boundary
x_min, x_max = X['Exam1'].min() - 10, X['Exam1'].max() + 10
y_min, y_max = X['Exam2'].min() - 10, X['Exam2'].max() + 10
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                     np.linspace(y_min, y_max, 100))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, cmap=plt.cm.RdBu, alpha=0.8)
plt.scatter(X['Exam1'], X['Exam2'], c=y, cmap=plt.cm.RdBu)
plt.xlabel('Exam 1 Score')
plt.ylabel('Exam 2 Score')
plt.title('Logistic Regression Decision Boundary')
plt.show()
Output:
2. CONFUSION MATRIX

#scikit-learn

from sklearn.datasets import make_classification

value1, y = make_classification(
    n_features=6,
    n_classes=2,
    n_samples=800,
    n_informative=2,
    random_state=66,
    n_clusters_per_class=1,
)

## This code imports the make_classification function from the sklearn.datasets module.

##• The make_classification function generates a random dataset for classification tasks.

##• The function takes several arguments: n_features: the number of features (or independent variables) in the dataset.

##• In this case, there are 6 features.

##• n_classes: the number of classes (or target variables) in the dataset.

##• In this case, there are 2 classes.

##• n_samples: the number of samples (or observations) in the dataset.

##• In this case, there are 800 samples.

##• n_informative: the number of informative features in the dataset.

##• These are the features that actually influence the target variable.

##• In this case, there are 2 informative features.

##• random_state: a seed value for the random number generator.

##• This ensures that the dataset is reproducible.

##• n_clusters_per_class: the number of clusters per class.

##• This determines the degree of separation between the classes.

##• In this case, there is only 1 cluster per class.

##• The function returns two arrays: the feature array (assigned to value1 here): an array of shape (n_samples, n_features) containing the features of the dataset.

##• y: an array of shape (n_samples,) containing the target variable of the dataset.

import matplotlib.pyplot as plt

plt.scatter(value1[:, 0], value1[:, 1], c=y, marker="*")
plt.show()

## This code imports the matplotlib.pyplot module and creates a scatter plot using the scatter() function.

##• The value1 and y variables are the arrays generated by make_classification above.

##• The scatter() function takes three arguments: value1[:, 0] and value1[:, 1] are the first and second columns of the value1 array, respectively, and c=y assigns a color to each point based on the corresponding value in the y array.

##• The marker argument specifies the shape of the marker used for each point, in this case, an asterisk.

##• The resulting plot will have the values in the first column of value1 on the x-axis, the values in the second column of value1 on the y-axis, and each point will be colored based on the corresponding value in y.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    value1, y, test_size=0.33, random_state=125
)

## This code imports the train_test_split function from the sklearn.model_selection module.

##• This function is used to split the dataset into training and testing sets.

##• The train_test_split function takes four arguments: the feature array (value1 here), y, test_size, and random_state.

##• value1 and y are the input features and target variable, respectively.

##• test_size is the proportion of the dataset that should be allocated to the testing set.

##• In this case, it is set to 0.33, which means that 33% of the data will be used for testing.

##• random_state sets the seed for the random number generator, which ensures that the same random split is generated each time the code is run.

##• The function returns four variables: X_train, X_test, y_train, and y_test.

##• X_train and y_train are the training set, while X_test and y_test are the testing set.

##• These variables can be used to train and evaluate a machine learning model.

from sklearn.naive_bayes import GaussianNB

# Build a Gaussian Classifier
model = GaussianNB()

# Model training
model.fit(X_train, y_train)

# Predict output for a single test sample
predicted = model.predict([X_test[6]])

print("Actual Value:", y_test[6])
print("Predicted Value:", predicted[0])

## This code uses the scikit-learn library to build a Gaussian Naive Bayes classifier.

##• First, the code imports the GaussianNB class from the sklearn.naive_bayes module.

##• Next, a new instance of the GaussianNB class is created and assigned to the variable 'model'.

##• The model is then trained using the fit() method, which takes in the training data X_train and the corresponding target values y_train.

##• After the model is trained, it is used to predict the output for a single test data point, the 7th element in the X_test array.

##• The predicted value is stored in the 'predicted' variable.

##• Finally, the actual value for the test data point is printed using y_test[6], and the predicted value is printed using predicted[0].

#---------------

from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    ConfusionMatrixDisplay,
    f1_score,
)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average="weighted")

print("Accuracy:", accuracy)
print("F1 Score:", f1)

## This code imports several functions from the sklearn.metrics module, including accuracy_score, confusion_matrix, ConfusionMatrixDisplay, and f1_score.

##• These functions are used to evaluate the performance of a machine learning model.

##• The code then uses the model.predict method to generate predictions for the test data (X_test).

##• These predictions are compared to the actual labels (y_test) using the accuracy_score and f1_score functions.

##• The accuracy_score function calculates the accuracy of the model's predictions, i.e. the fraction of test samples classified correctly,

## while the f1_score function calculates the F1 score, the harmonic mean of precision and recall (here averaged across classes, weighted by class support, because average="weighted").

##• Finally, the code prints out the accuracy and F1 score of the model's predictions.
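
For a single class, the F1 score reduces to the harmonic mean of precision and recall. A small check, assuming the y_test and y_pred arrays from the code above, computes it by hand for the positive class and compares it with scikit-learn's result:

from sklearn.metrics import precision_score, recall_score, f1_score

# Precision and recall for the positive class (label 1)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

# F1 is the harmonic mean: 2 * P * R / (P + R)
f1_manual = 2 * precision * recall / (precision + recall)
print(f1_manual, f1_score(y_test, y_pred))  # the two values should agree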

#-----------------------------

##### Expected output

#### Accuracy: 0.8484848484848485

#### F1 Score: 0.8491119695890328

#### This snippet is not code but the output produced by running the code above.

####• It shows the accuracy and F1 score of the trained model.

####• The accuracy is 0.8484848484848485, which means that the model correctly predicted the outcome of about 84.8% of the test cases.

####• The F1 score is 0.8491119695890328, a measure of performance that takes into account both precision and recall.

####• A higher F1 score indicates better performance of the model.

#------------------------------------------------

labels = [0, 1]  # the dataset generated above has two classes

cm = confusion_matrix(y_test, y_pred, labels=labels)
print(cm)

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot()
plt.show()
######## This code uses the scikit-learn library to create a confusion matrix and display it using ConfusionMatrixDisplay.

########• First, a list of labels is created with the values 0 and 1, the two classes in this dataset.

########• Then, the confusion_matrix function is called with the test labels (y_test) and predicted labels (y_pred) as inputs, along with the labels list.

########• This creates a confusion matrix with the specified labels.

########• Next, a ConfusionMatrixDisplay object is created with the confusion matrix as input, along with the labels list.

########• Finally, the plot method is called on the display object to show the confusion matrix graphically.
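
As a cross-check on the metrics above, the accuracy can be read directly off the confusion matrix: the diagonal entries count the correctly classified samples, so the diagonal sum divided by the total sample count equals the accuracy. A small sketch, assuming the cm array from the code above:

import numpy as np

# Diagonal = correct predictions; trace over total count = accuracy
acc_from_cm = np.trace(cm) / np.sum(cm)
print("Accuracy from confusion matrix:", acc_from_cm)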
# Run this program on your local Python
# interpreter, provided you have installed
# the required libraries.

# Importing the required packages
import numpy as np
import pandas as pd
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

# Function importing Dataset
def importdata():
    balance_data = pd.read_csv(
        'https://archive.ics.uci.edu/ml/machine-learning-' +
        'databases/balance-scale/balance-scale.data',
        sep=',', header=None)

    # Printing the dataset shape
    print("Dataset Length: ", len(balance_data))
    print("Dataset Shape: ", balance_data.shape)

    # Printing the dataset observations
    print("Dataset: ", balance_data.head())
    return balance_data

# Function to split the dataset
def splitdataset(balance_data):
    # Separating the target variable
    X = balance_data.values[:, 1:5]
    Y = balance_data.values[:, 0]

    # Splitting the dataset into train and test
    X_train, X_test, y_train, y_test = train_test_split(
        X, Y, test_size=0.3, random_state=100)

    return X, Y, X_train, X_test, y_train, y_test

# Function to perform training with gini index.
def train_using_gini(X_train, X_test, y_train):
    # Creating the classifier object
    clf_gini = DecisionTreeClassifier(criterion="gini",
                                      random_state=100, max_depth=3,
                                      min_samples_leaf=5)

    # Performing training
    clf_gini.fit(X_train, y_train)
    return clf_gini

# Function to perform training with entropy.
def train_using_entropy(X_train, X_test, y_train):
    # Decision tree with entropy
    clf_entropy = DecisionTreeClassifier(
        criterion="entropy", random_state=100,
        max_depth=3, min_samples_leaf=5)

    # Performing training
    clf_entropy.fit(X_train, y_train)
    return clf_entropy

# Function to make predictions
def prediction(X_test, clf_object):
    # Prediction on the test set
    y_pred = clf_object.predict(X_test)
    print("Predicted values:")
    print(y_pred)
    return y_pred

# Function to calculate accuracy
def cal_accuracy(y_test, y_pred):
    print("Confusion Matrix: ",
          confusion_matrix(y_test, y_pred))
    print("Accuracy : ",
          accuracy_score(y_test, y_pred) * 100)
    print("Report : ",
          classification_report(y_test, y_pred))

# Driver code
def main():
    # Building Phase
    data = importdata()
    X, Y, X_train, X_test, y_train, y_test = splitdataset(data)
    clf_gini = train_using_gini(X_train, X_test, y_train)
    clf_entropy = train_using_entropy(X_train, X_test, y_train)

    # Operational Phase
    print("Results Using Gini Index:")
    # Prediction using gini
    y_pred_gini = prediction(X_test, clf_gini)
    cal_accuracy(y_test, y_pred_gini)

    print("Results Using Entropy:")
    # Prediction using entropy
    y_pred_entropy = prediction(X_test, clf_entropy)
    cal_accuracy(y_test, y_pred_entropy)

# Calling main function
if __name__ == "__main__":
    main()
3. CREDIT CARD FRAUD CSV IMPORT
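The CSV-import code for this section is not reproduced here; the sketch below shows what such an import typically looks like with Pandas. The file name creditcard.csv and the 'Class' label column follow the widely used Kaggle credit card fraud dataset and are assumptions, not part of the original document.

import pandas as pd

# Load the credit card transactions CSV
# (the file path is an assumption; point it at your copy of the dataset)
df = pd.read_csv('creditcard.csv')

# Inspect the shape and the first few observations
print("Dataset Length: ", len(df))
print("Dataset Shape: ", df.shape)
print(df.head())

# In the Kaggle dataset, fraudulent transactions have Class == 1;
# checking the class balance shows how imbalanced fraud data usually is
print(df['Class'].value_counts())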
Conclusion

In conclusion, preprocessing data before applying it to a machine learning algorithm is a crucial step in the ML workflow. It helps to improve accuracy, reduce the time and resources required to train the model, prevent overfitting, and improve the interpretability of the model.

The code examples above load and prepare data using the popular Python library Pandas, but there are many other libraries available for preprocessing data, such as NumPy and Scikit-learn, that can be used depending on the specific needs of your project; a brief Scikit-learn sketch follows.
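
As an illustration, here is a minimal preprocessing sketch using Scikit-learn, assuming a feature matrix X and labels y such as those produced in the examples above:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split first, then scale, so the test set does not leak
# information into the fitted scaler
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse training statistics

Fitting the scaler on the training split alone is the standard way to avoid data leakage when evaluating a model.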
