Assignment 1 - LP1

Experiment No. 1
AIM:
Data Preparation: Download the Heart dataset from the following link:
https://www.kaggle.com/zhaoyingzhu/heartcsv

Perform the following operations on the given dataset.

a) Find the shape of the data
b) Find the missing values
c) Find the data type of each column
d) Find out the zeros
e) Find the mean age of the patients
f) Extract only Age, Sex, ChestPain, RestBP, Chol. Randomly divide the dataset into training (75%) and testing (25%) sets.

Through a diagnostic test, I predicted 100 reports as COVID positive, but only 45 of those were actually positive. In total, 50 people in my sample were actually COVID positive, and the sample contains 500 people.
Create a confusion matrix based on the above data and find:
I. Accuracy
II. Precision
III. Recall
IV. F1 score

Theory:
Data Preparation: Data preparation is the process of transforming raw data into a form that data scientists and analysts can run through machine learning algorithms to uncover insights or make predictions. All projects share the same general steps:

Step 1: Define the problem.
Step 2: Prepare the data.
Step 3: Evaluate models.
Step 4: Finalize the model.

We are concerned with the data preparation step (Step 2), and there are common or standard tasks that you may use or explore during the data preparation step of a machine learning project.

Data Preparation Tasks


1. Data Cleaning: There are many reasons data may have incorrect values, such as being mistyped,
corrupted, duplicated, and so on. Domain expertise may allow obviously erroneous observations to be
identified as they are different from what is expected.
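For instance, the basic cleaning checks asked for in the aim (shape, missing values, data types, zeros, mean age) can be performed directly with pandas. A minimal sketch, assuming the dataset has been loaded into a DataFrame df and that the age column is named 'age':

import pandas as pd

df = pd.read_csv('heart.csv')     # load the dataset (file name assumed)

print(df.shape)                   # (number of rows, number of columns)
print(df.dtypes)                  # data type of each column
print(df.isnull().sum())          # missing values per column
print((df == 0).sum())            # count of zero entries per column
print(df['age'].mean())           # mean age of patients (column name assumed)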

2. Feature Selection: Feature selection refers to techniques for selecting a subset of input features that
are most relevant to the target variable that is being predicted. Feature selection techniques are generally
grouped into those that use the target variable (supervised) and those that do not (unsupervised).
Additionally, the supervised techniques can be further divided into models that automatically select
features as part of fitting the model (intrinsic), those that explicitly choose features that result in the best
performing model (wrapper) and those that score each input feature and allow a subset to be selected
(filter).
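As an illustration of the filter approach described above, the sketch below scores each input feature against the target with scikit-learn's SelectKBest and keeps the five highest-scoring ones. It assumes a numeric feature matrix X and target vector y (as defined later in the Code section).

from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=5)  # keep the 5 highest-scoring features
X_selected = selector.fit_transform(X, y)

print(selector.scores_)        # score of each input feature (higher = more relevant)
print(selector.get_support())  # boolean mask of the features that were kept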
3. Data Transforms: Data transforms are used to change the type or distribution of data variables.
• Numeric Data Type: Number values.
  • Integer: Integers with no fractional part.
  • Real: Floating-point values.
• Categorical Data Type: Label values.
  • Ordinal: Labels with a rank ordering.
  • Nominal: Labels with no rank ordering.
  • Boolean: Values True and False.
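Whether a variable is nominal, ordinal, or numeric determines how it should be encoded before modelling. A minimal sketch on a small made-up frame (column names and categories are illustrative only):

import pandas as pd

data = pd.DataFrame({
    'chest_pain': ['typical', 'atypical', 'non-anginal', 'typical'],  # nominal labels
    'severity':   ['low', 'medium', 'high', 'medium'],                # ordinal labels
    'chol':       [233.0, 250.5, 204.0, 286.0],                       # real values
})

# Nominal labels have no rank ordering, so one-hot encode them.
data = pd.get_dummies(data, columns=['chest_pain'])

# Ordinal labels are mapped to integers that preserve the ordering.
data['severity'] = data['severity'].map({'low': 0, 'medium': 1, 'high': 2})

print(data.dtypes)
print(data.head())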

4. Feature Engineering: Feature engineering refers to the process of creating new input variables from
the available data. Engineering new features is highly specific to your data and data types. As such, it
often requires the collaboration of a subject matter expert to help identify new features that could be
constructed from the data.
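For example, one new input variable that could be constructed from this dataset is an age band derived from the raw age column. A minimal sketch (the bin edges and labels are purely illustrative, not clinically validated):

import pandas as pd

# Bucket the continuous 'age' column into coarse bands (illustrative bin edges).
df['age_band'] = pd.cut(df['age'],
                        bins=[0, 40, 55, 70, 120],
                        labels=['young', 'middle', 'senior', 'elderly'])

print(df[['age', 'age_band']].head())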

5. Dimensionality Reduction: The number of input features for a dataset may be considered the dimensionality of the data. A large number of input features can make modeling harder (the curse of dimensionality), which motivates feature selection; an alternative to feature selection is to create a projection of the data into a lower-dimensional space that still preserves the most important properties of the original data. The most common approach to dimensionality reduction is to use a matrix factorization technique, as sketched after the list below:

• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
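As a sketch of the PCA option listed above, the snippet below projects the scaled feature matrix onto two principal components (X as defined later in the Code section; scaling first because PCA is sensitive to feature scale):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)  # standardize features before PCA

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                # (n_samples, 2)
print(pca.explained_variance_ratio_)  # share of variance retained by each component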

Confusion Matrix
A Confusion matrix is an N x N matrix used for evaluating the performance of a classification model,
where N is the number of target classes. The matrix compares the actual target values with those predicted
by the machine learning model.
The terms associated with a confusion matrix are explained as follows:
• True Positives (TP): both the actual class and the predicted class of the data point are 1.
• True Negatives (TN): both the actual class and the predicted class of the data point are 0.
• False Positives (FP): the actual class of the data point is 0 and the predicted class is 1.
• False Negatives (FN): the actual class of the data point is 1 and the predicted class is 0.
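Applying these definitions to the COVID-19 diagnosis question in the aim (500 samples, 100 predicted positive, 45 of those actually positive, 50 actual positives in total) gives the confusion matrix and the four metrics directly. A short sketch of the arithmetic:

TP = 45                  # predicted positive and actually positive
FP = 100 - TP            # predicted positive but actually negative -> 55
FN = 50 - TP             # actual positives the test missed         -> 5
TN = 500 - TP - FP - FN  # everything else                          -> 395

accuracy  = (TP + TN) / (TP + TN + FP + FN)                 # 440 / 500 = 0.88
precision = TP / (TP + FP)                                  # 45 / 100  = 0.45
recall    = TP / (TP + FN)                                  # 45 / 50   = 0.90
f1        = 2 * precision * recall / (precision + recall)   # 0.60

print('Confusion matrix [[TN, FP], [FN, TP]]:', [[TN, FP], [FN, TP]])
print(f'Accuracy: {accuracy:.2f}  Precision: {precision:.2f}  Recall: {recall:.2f}  F1: {f1:.2f}')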

Code:

# Importing required libraries
import pandas as pd              # data manipulation and analysis
import numpy as np               # array manipulation
import matplotlib.pyplot as plt  # graph plotting library
import seaborn as sns            # data visualization and statistical graphics
#%matplotlib inline

# Loading the data
# A DataFrame is a two-dimensional data structure,
# i.e., data aligned in a tabular fashion in rows and columns.
df = pd.read_csv('heart.csv')  # read the csv file and store it in the DataFrame df

print(df.head(3))  # print the first 3 rows; printing df would show the complete data
print()            # blank line for spacing

# Features of the dataset
print('Below are the features of the dataset:')
df.info()  # details of rows & columns (count, dtypes, null values & memory usage)

# Dimensions of the dataset
print()
print('Below are the dimensions of the dataset:')
# shape gives the count of rows and columns
print('Number of rows in the dataset: ', df.shape[0])
print('Number of columns in the dataset: ', df.shape[1])

# Checking for null values in the dataset
print()
print('Checking for null values in the dataset:')
print(df.isnull().sum())  # number of fields with no value present
# There are no null values in the dataset

print(df.describe())
# The statistics reported by describe() are:
# 1. count: the number of non-empty rows in a feature.
# 2. mean: the mean value of that feature.
# 3. std: the standard deviation of that feature.
# 4. min: the minimum value of that feature.
# 5. 25%, 50%, and 75%: the percentiles/quartiles of each feature.
# 6. max: the maximum value of that feature.

# Checking features of various attributes
# 1. Sex
male = len(df[df['sex'] == 1])    # number of rows where sex == 1
female = len(df[df['sex'] == 0])  # number of rows where sex == 0

plt.figure(figsize=(8, 6))  # 8 by 6 inch figure

# Data to plot
labels = 'Male', 'Female'
sizes = [male, female]
colors = ['skyblue', 'yellowgreen']
explode = (0, 0)  # do not separate any slice

# Plot the pie chart; autopct formats the percentage shown on each slice
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=90)
plt.axis('equal')  # equal x & y axes so the pie is drawn as a circle
plt.show()

# 2. Chest Pain Type
plt.figure(figsize=(8, 6))

# Data to plot
labels = 'Chest Pain Type:0', 'Chest Pain Type:1', 'Chest Pain Type:2', 'Chest Pain Type:3'
sizes = [len(df[df['cp'] == 0]), len(df[df['cp'] == 1]),
         len(df[df['cp'] == 2]), len(df[df['cp'] == 3])]
colors = ['skyblue', 'yellowgreen', 'orange', 'gold']
explode = (0, 0, 0, 0)  # do not separate any slice

# Plot specifications
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=180)
plt.axis('equal')
plt.show()

# 3. fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
plt.figure(figsize=(8, 6))

# Data to plot
labels = 'fasting blood sugar < 120 mg/dl', 'fasting blood sugar > 120 mg/dl'
sizes = [len(df[df['fbs'] == 0]), len(df[df['fbs'] == 1])]
colors = ['skyblue', 'yellowgreen']
explode = (0.1, 0)  # pull out the first slice

# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=180)
plt.axis('equal')
plt.show()
# 4. exang: exercise induced angina (1 = yes; 0 = no)
plt.figure(figsize=(8, 6))

# Data to plot
labels = 'No', 'Yes'
sizes = [len(df[df['exang'] == 0]), len(df[df['exang'] == 1])]
colors = ['skyblue', 'yellowgreen']
explode = (0.1, 0)  # pull out the first slice

# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=90)
plt.axis('equal')
plt.show()

# Exploratory Data Analysis
sns.set_style('whitegrid')  # set a white grid background

# 1. Heatmap
plt.figure(figsize=(14, 8))
# A heatmap is a graphical representation of data that uses colour-coding
# to represent different values.
# corr(): pairwise correlation of all columns in the DataFrame
# annot: write the value in each cell
# cmap: colour map
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', linewidths=.1)
plt.show()
# Plotting the distribution of various attributes
# 1. thalach: maximum heart rate achieved
sns.distplot(df['thalach'], kde=False, bins=30, color='violet')
plt.show()

# 2. chol: serum cholesterol in mg/dl
sns.distplot(df['chol'], kde=False, bins=30, color='red')
plt.show()

# 3. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
sns.distplot(df['trestbps'], kde=False, bins=30, color='blue')
plt.show()

# 4. Number of people who have heart disease according to age
plt.figure(figsize=(15, 6))
sns.countplot(x='age', data=df, hue='target', palette='GnBu')
plt.show()

# 5. Scatterplot of thalach vs. chol
plt.figure(figsize=(8, 6))
sns.scatterplot(x='chol', y='thalach', data=df, hue='target')
plt.show()

# 6. Scatterplot of thalach vs. trestbps
plt.figure(figsize=(8, 6))
sns.scatterplot(x='trestbps', y='thalach', data=df, hue='target')
plt.show()

# Making predictions
# Splitting the dataset into training and test sets
X = df.drop('target', axis=1)
y = df['target']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=42)

# Preprocessing - scaling the features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_train = pd.DataFrame(X_train_scaled)
X_test_scaled = scaler.transform(X_test)
X_test = pd.DataFrame(X_test_scaled)

# 1. k-Nearest Neighbors algorithm
# Implementing GridSearchCV to select the best parameters for the k-NN algorithm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

knn = KNeighborsClassifier()
params = {'n_neighbors': list(range(1, 20)),
          'p': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
          'leaf_size': list(range(1, 20)),
          'weights': ['uniform', 'distance']}
model = GridSearchCV(knn, params, cv=3, n_jobs=-1)
model.fit(X_train, y_train)
print(model.best_params_)  # prints the best parameter values found


# Making predictions
predict = model.predict(X_test)

# Checking accuracy
from sklearn.metrics import accuracy_score, confusion_matrix
print()
print('Accuracy Score: ', accuracy_score(y_test, predict))
print('Using k-NN we get an accuracy score of: ',
      round(accuracy_score(y_test, predict), 5) * 100, '%')
print()

# Confusion Matrix
class_names = [0, 1]
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
cnf_matrix = confusion_matrix(y_test, predict)
print('Below is the confusion matrix')
print(cnf_matrix)

# Create a heat map of the confusion matrix
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap='YlGnBu', fmt='g')
ax.xaxis.set_label_position('top')
plt.tight_layout()
plt.title('Confusion matrix for k-Nearest Neighbors Model', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()

# Classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, predict))

# Receiver Operating Characteristic (ROC) curve
from sklearn.metrics import roc_auc_score, roc_curve

# Get predicted probabilities for the positive class from the model
y_probabilities = model.predict_proba(X_test)[:, 1]

# Compute false and true positive rates at each threshold
false_positive_rate_knn, true_positive_rate_knn, threshold_knn = roc_curve(y_test, y_probabilities)

# Plot the ROC curve
plt.figure(figsize=(10, 6))
plt.title('Receiver Operating Characteristic')
plt.plot(false_positive_rate_knn, true_positive_rate_knn)
plt.plot([0, 1], ls='--')        # diagonal reference line
plt.plot([0, 0], [1, 0], c='.5')
plt.plot([1, 1], c='.5')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

# Calculate the area under the curve
print(roc_auc_score(y_test, y_probabilities))
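Task (f) of the aim asks for only the Age, Sex, ChestPain, RestBP and Chol columns with a 75/25 split, whereas the code above uses all columns and a 70/30 split. A minimal sketch of the requested variant (column names as spelled in the aim; they may be cased differently in other versions of the CSV):

from sklearn.model_selection import train_test_split

subset = df[['Age', 'Sex', 'ChestPain', 'RestBP', 'Chol']]  # columns named in the aim (spelling assumed)

# Randomly divide the subset into 75% training and 25% testing rows.
train_set, test_set = train_test_split(subset, test_size=0.25, random_state=42)
print('Training rows:', train_set.shape[0])
print('Testing rows:', test_set.shape[0])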

Output:
Conclusion: Thus we have studied different data preparation techniques.
