Assignment 1 - LP1

Experiment No. 1
AIM:
Data Preparation: Download the Heart dataset from the following link:
https://www.kaggle.com/zhaoyingzhu/heartcsv

Perform the following operations on the given dataset.

a) Find the shape of the data
b) Find the missing values
c) Find the data type of each column
d) Find out the zeros
e) Find the mean age of the patients
f) Extract only Age, Sex, ChestPain, RestBP, Chol. Randomly divide the dataset into training (75%) and testing (25%) sets.

Through a diagnostic test, I predicted 100 reports as COVID positive, but only 45 of those were actually positive. In total, 50 people in my sample were actually COVID positive, and the sample contains 500 people.
Create a confusion matrix based on the above data and find:
I. Accuracy
II. Precision
III. Recall
IV. F1 score

Theory:
Data Preparation: Data preparation is the process of transforming raw data into a form that data scientists and analysts can run through machine learning algorithms to uncover insights or make predictions. All projects share the same general steps:

Step 1: Define the problem.
Step 2: Prepare the data.
Step 3: Evaluate models.
Step 4: Finalize the model.

We are concerned with the data preparation step (Step 2), and there are common or standard tasks that you may use or explore during the data preparation step of a machine learning project.

Data Preparation Tasks


1. Data Cleaning: There are many reasons data may have incorrect values, such as being mistyped,
corrupted, duplicated, and so on. Domain expertise may allow obviously erroneous observations to be
identified as they are different from what is expected.
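For instance, the basic cleaning checks asked for in the aim (shape, missing values, data types, zeros, mean age) can be performed directly with pandas. A minimal sketch, assuming the dataset has been loaded into a DataFrame df and that the age column is named 'age':

import pandas as pd

df = pd.read_csv('heart.csv')     # load the dataset (file name assumed)

print(df.shape)                   # (number of rows, number of columns)
print(df.dtypes)                  # data type of each column
print(df.isnull().sum())          # missing values per column
print((df == 0).sum())            # count of zero entries per column
print(df['age'].mean())           # mean age of patients (column name assumed)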

2. Feature Selection: Feature selection refers to techniques for selecting a subset of input features that
are most relevant to the target variable that is being predicted. Feature selection techniques are generally
grouped into those that use the target variable (supervised) and those that do not (unsupervised).
Additionally, the supervised techniques can be further divided into models that automatically select
features as part of fitting the model (intrinsic), those that explicitly choose features that result in the best
performing model (wrapper) and those that score each input feature and allow a subset to be selected
(filter).
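As an illustration of the filter approach described above, the sketch below scores each input feature against the target with scikit-learn's SelectKBest and keeps the five highest-scoring ones. It assumes a numeric feature matrix X and target vector y (as defined later in the Code section).

from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=5)  # keep the 5 highest-scoring features
X_selected = selector.fit_transform(X, y)

print(selector.scores_)        # score of each input feature (higher = more relevant)
print(selector.get_support())  # boolean mask of the features that were kept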
3. Data Transforms: Data transforms are used to change the type or distribution of data variables.
• Numeric Data Type: Number values.
  • Integer: Integers with no fractional part.
  • Real: Floating-point values.
• Categorical Data Type: Label values.
  • Ordinal: Labels with a rank ordering.
  • Nominal: Labels with no rank ordering.
  • Boolean: Values True and False.
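Whether a variable is nominal, ordinal, or numeric determines how it should be encoded before modelling. A minimal sketch on a small made-up frame (column names and categories are illustrative only):

import pandas as pd

data = pd.DataFrame({
    'chest_pain': ['typical', 'atypical', 'non-anginal', 'typical'],  # nominal labels
    'severity':   ['low', 'medium', 'high', 'medium'],                # ordinal labels
    'chol':       [233.0, 250.5, 204.0, 286.0],                       # real values
})

# Nominal labels have no rank ordering, so one-hot encode them.
data = pd.get_dummies(data, columns=['chest_pain'])

# Ordinal labels are mapped to integers that preserve the ordering.
data['severity'] = data['severity'].map({'low': 0, 'medium': 1, 'high': 2})

print(data.dtypes)
print(data.head())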

4. Feature Engineering: Feature engineering refers to the process of creating new input variables from
the available data. Engineering new features is highly specific to your data and data types. As such, it
often requires the collaboration of a subject matter expert to help identify new features that could be
constructed from the data.
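For example, one new input variable that could be constructed from this dataset is an age band derived from the raw age column. A minimal sketch (the bin edges and labels are purely illustrative, not clinically validated):

import pandas as pd

# Bucket the continuous 'age' column into coarse bands (illustrative bin edges).
df['age_band'] = pd.cut(df['age'],
                        bins=[0, 40, 55, 70, 120],
                        labels=['young', 'middle', 'senior', 'elderly'])

print(df[['age', 'age_band']].head())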

5. Dimensionality Reduction: The number of input features for a dataset may be considered the dimensionality of the data. A large number of input features can make modeling harder (the curse of dimensionality), which motivates feature selection; an alternative to feature selection is to create a projection of the data into a lower-dimensional space that still preserves the most important properties of the original data. The most common approach to dimensionality reduction is to use a matrix factorization technique, as sketched after the list below:

• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
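As a sketch of the PCA option listed above, the snippet below projects the scaled feature matrix onto two principal components (X as defined later in the Code section; scaling first because PCA is sensitive to feature scale):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)  # standardize features before PCA

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                # (n_samples, 2)
print(pca.explained_variance_ratio_)  # share of variance retained by each component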

Confusion Matrix
A Confusion matrix is an N x N matrix used for evaluating the performance of a classification model,
where N is the number of target classes. The matrix compares the actual target values with those predicted
by the machine learning model.
The terms associated with a confusion matrix are explained as follows:
• True Positives (TP): both the actual class and the predicted class of the data point are 1.
• True Negatives (TN): both the actual class and the predicted class of the data point are 0.
• False Positives (FP): the actual class of the data point is 0 and the predicted class is 1.
• False Negatives (FN): the actual class of the data point is 1 and the predicted class is 0.
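Applying these definitions to the COVID-19 diagnosis question in the aim (500 samples, 100 predicted positive, 45 of those actually positive, 50 actual positives in total) gives the confusion matrix and the four metrics directly. A short sketch of the arithmetic:

TP = 45                  # predicted positive and actually positive
FP = 100 - TP            # predicted positive but actually negative -> 55
FN = 50 - TP             # actual positives the test missed         -> 5
TN = 500 - TP - FP - FN  # everything else                          -> 395

accuracy  = (TP + TN) / (TP + TN + FP + FN)                 # 440 / 500 = 0.88
precision = TP / (TP + FP)                                  # 45 / 100  = 0.45
recall    = TP / (TP + FN)                                  # 45 / 50   = 0.90
f1        = 2 * precision * recall / (precision + recall)   # 0.60

print('Confusion matrix [[TN, FP], [FN, TP]]:', [[TN, FP], [FN, TP]])
print(f'Accuracy: {accuracy:.2f}  Precision: {precision:.2f}  Recall: {recall:.2f}  F1: {f1:.2f}')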

Code:

# Importing required libraries
import pandas as pd              # data manipulation and analysis
import numpy as np               # array manipulation
import matplotlib.pyplot as plt  # graph plotting library
import seaborn as sns            # data visualization and statistical graphics
#%matplotlib inline

# Loading the data
# A DataFrame is a two-dimensional data structure,
# i.e., data aligned in a tabular fashion in rows and columns.
df = pd.read_csv('heart.csv')  # read the csv file and store it in the DataFrame df

print(df.head(3))  # print the first 3 rows; printing df would show the complete data
print()            # blank line for spacing

# Features of the dataset
print('Below are the features of the dataset:')
df.info()  # details of rows & columns (count, dtypes, null values & memory usage)

# Dimensions of the dataset
print()
print('Below are the dimensions of the dataset:')
# shape gives the count of rows and columns
print('Number of rows in the dataset: ', df.shape[0])
print('Number of columns in the dataset: ', df.shape[1])

# Checking for null values in the dataset
print()
print('Checking for null values in the dataset:')
print(df.isnull().sum())  # number of fields with no value present
# There are no null values in the dataset

print(df.describe())
# The statistics reported by describe() are:
# 1. count: the number of non-empty rows in a feature.
# 2. mean: the mean value of that feature.
# 3. std: the standard deviation of that feature.
# 4. min: the minimum value of that feature.
# 5. 25%, 50%, and 75%: the percentiles/quartiles of each feature.
# 6. max: the maximum value of that feature.

# Checking features of various attributes
# 1. Sex
male = len(df[df['sex'] == 1])    # number of rows where sex == 1
female = len(df[df['sex'] == 0])  # number of rows where sex == 0

plt.figure(figsize=(8, 6))  # 8 by 6 inch figure

# Data to plot
labels = 'Male', 'Female'
sizes = [male, female]
colors = ['skyblue', 'yellowgreen']
explode = (0, 0)  # do not separate any slice

# Plot the pie chart; autopct formats the percentage shown on each slice
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=90)
plt.axis('equal')  # equal x & y axes so the pie is drawn as a circle
plt.show()

# 2. Chest Pain Type
plt.figure(figsize=(8, 6))

# Data to plot
labels = 'Chest Pain Type:0', 'Chest Pain Type:1', 'Chest Pain Type:2', 'Chest Pain Type:3'
sizes = [len(df[df['cp'] == 0]), len(df[df['cp'] == 1]),
         len(df[df['cp'] == 2]), len(df[df['cp'] == 3])]
colors = ['skyblue', 'yellowgreen', 'orange', 'gold']
explode = (0, 0, 0, 0)  # do not separate any slice

# Plot specifications
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=180)
plt.axis('equal')
plt.show()

# 3. fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
plt.figure(figsize=(8, 6))

# Data to plot
labels = 'fasting blood sugar < 120 mg/dl', 'fasting blood sugar > 120 mg/dl'
sizes = [len(df[df['fbs'] == 0]), len(df[df['fbs'] == 1])]
colors = ['skyblue', 'yellowgreen']
explode = (0.1, 0)  # pull out the first slice

# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=180)
plt.axis('equal')
plt.show()
# 4. exang: exercise induced angina (1 = yes; 0 = no)
plt.figure(figsize=(8, 6))

# Data to plot
labels = 'No', 'Yes'
sizes = [len(df[df['exang'] == 0]), len(df[df['exang'] == 1])]
colors = ['skyblue', 'yellowgreen']
explode = (0.1, 0)  # pull out the first slice

# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=90)
plt.axis('equal')
plt.show()

# Exploratory Data Analysis
sns.set_style('whitegrid')  # set a white grid background

# 1. Heatmap
plt.figure(figsize=(14, 8))
# A heatmap is a graphical representation of data that uses colour-coding
# to represent different values.
# corr(): pairwise correlation of all columns in the DataFrame
# annot: write the value in each cell
# cmap: colour map
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', linewidths=.1)
plt.show()
# Plotting the distribution of various attributes
# 1. thalach: maximum heart rate achieved
sns.distplot(df['thalach'], kde=False, bins=30, color='violet')
plt.show()

# 2. chol: serum cholesterol in mg/dl
sns.distplot(df['chol'], kde=False, bins=30, color='red')
plt.show()

# 3. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
sns.distplot(df['trestbps'], kde=False, bins=30, color='blue')
plt.show()

# 4. Number of people who have heart disease according to age
plt.figure(figsize=(15, 6))
sns.countplot(x='age', data=df, hue='target', palette='GnBu')
plt.show()

# 5. Scatterplot of thalach vs. chol
plt.figure(figsize=(8, 6))
sns.scatterplot(x='chol', y='thalach', data=df, hue='target')
plt.show()

# 6. Scatterplot of thalach vs. trestbps
plt.figure(figsize=(8, 6))
sns.scatterplot(x='trestbps', y='thalach', data=df, hue='target')
plt.show()

# Making predictions
# Splitting the dataset into training and test sets
X = df.drop('target', axis=1)
y = df['target']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=42)

# Preprocessing - scaling the features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_train = pd.DataFrame(X_train_scaled)
X_test_scaled = scaler.transform(X_test)
X_test = pd.DataFrame(X_test_scaled)

# 1. k-Nearest Neighbors algorithm
# Implementing GridSearchCV to select the best parameters for the k-NN algorithm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

knn = KNeighborsClassifier()
params = {'n_neighbors': list(range(1, 20)),
          'p': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
          'leaf_size': list(range(1, 20)),
          'weights': ['uniform', 'distance']}
model = GridSearchCV(knn, params, cv=3, n_jobs=-1)
model.fit(X_train, y_train)
print(model.best_params_)  # prints the best parameter values found


# Making predictions
predict = model.predict(X_test)

# Checking accuracy
from sklearn.metrics import accuracy_score, confusion_matrix
print()
print('Accuracy Score: ', accuracy_score(y_test, predict))
print('Using k-NN we get an accuracy score of: ',
      round(accuracy_score(y_test, predict), 5) * 100, '%')
print()

# Confusion Matrix
class_names = [0, 1]
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
cnf_matrix = confusion_matrix(y_test, predict)
print('Below is the confusion matrix')
print(cnf_matrix)

# Create a heat map of the confusion matrix
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap='YlGnBu', fmt='g')
ax.xaxis.set_label_position('top')
plt.tight_layout()
plt.title('Confusion matrix for k-Nearest Neighbors Model', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()

# Classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, predict))

# Receiver Operating Characteristic (ROC) curve
from sklearn.metrics import roc_auc_score, roc_curve

# Get predicted probabilities for the positive class from the model
y_probabilities = model.predict_proba(X_test)[:, 1]

# Compute false and true positive rates at each threshold
false_positive_rate_knn, true_positive_rate_knn, threshold_knn = roc_curve(y_test, y_probabilities)

# Plot the ROC curve
plt.figure(figsize=(10, 6))
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate_knn, true_positive_rate_knn)
plt.plot([0, 1], ls='--')        # diagonal reference line
plt.plot([0, 0], [1, 0], c='.5')
plt.plot([1, 1], c='.5')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

# Calculate the area under the curve
print(roc_auc_score(y_test, y_probabilities))
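Task (f) of the aim asks for only the Age, Sex, ChestPain, RestBP and Chol columns with a 75/25 split, whereas the code above uses all columns and a 70/30 split. A minimal sketch of the requested variant (column names as spelled in the aim; they may be cased differently in other versions of the CSV):

from sklearn.model_selection import train_test_split

subset = df[['Age', 'Sex', 'ChestPain', 'RestBP', 'Chol']]  # columns named in the aim (spelling assumed)

# Randomly divide the subset into 75% training and 25% testing rows.
train_set, test_set = train_test_split(subset, test_size=0.25, random_state=42)
print('Training rows:', train_set.shape[0])
print('Testing rows:', test_set.shape[0])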

Output:
Conclusion: Thus we have studied different data preparation techniques.
