Assignment 1 - LP1
AIM:
Data Preparation: Download the heart dataset from the following link.
https://www.kaggle.com/zhaoyingzhu/heartcsv
Through a diagnostic test, I predicted 100 reports as COVID positive, but only 45 of those were actually
positive. In total, 50 people in my sample were actually COVID positive. I have 500 samples in total.
Create a confusion matrix based on the above data and find:
I. Accuracy
II. Precision
III. Recall
IV. F1 score
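As a worked sketch of the task above, the four cells of the confusion matrix can be derived from the stated counts, and the metrics then follow from those cells. The variable names are illustrative, and the formulas are the standard definitions of accuracy, precision, recall and F1 score, which the assignment text itself does not spell out.

```python
# Counts stated in the problem
total_samples = 500
predicted_positive = 100
true_positive = 45      # predicted positive and actually positive
actual_positive = 50

# Derive the remaining cells of the 2 x 2 confusion matrix
false_positive = predicted_positive - true_positive    # predicted positive, actually negative
false_negative = actual_positive - true_positive       # predicted negative, actually positive
true_negative = total_samples - true_positive - false_positive - false_negative

# Standard metric definitions
accuracy = (true_positive + true_negative) / total_samples
precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)
f1_score = 2 * precision * recall / (precision + recall)

print(f"TP={true_positive} FP={false_positive} FN={false_negative} TN={true_negative}")
print(f"Accuracy : {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall   : {recall:.2f}")
print(f"F1 score : {f1_score:.2f}")
```

With these counts the cells come out as TP = 45, FP = 55, FN = 5 and TN = 395, giving an accuracy of 0.88, a precision of 0.45, a recall of 0.90 and an F1 score of 0.60.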
Theory:
Data Preparation: Data preparation is the process of transforming raw data into a form that data
scientists and analysts can run through machine learning algorithms to uncover insights or make
predictions. Most machine learning projects follow the same general steps:
Step 1: Define Problem.
Step 2: Prepare Data.
Step 3: Evaluate Models.
Step 4: Finalize Model.
We are concerned with the data preparation step (Step 2). There are common, standard tasks that you
may use or explore during this step of a machine learning project.
2. Feature Selection: Feature selection refers to techniques for selecting a subset of input features that
are most relevant to the target variable that is being predicted. Feature selection techniques are generally
grouped into those that use the target variable (supervised) and those that do not (unsupervised).
Additionally, the supervised techniques can be further divided into models that automatically select
features as part of fitting the model (intrinsic), those that explicitly choose features that result in the best
performing model (wrapper) and those that score each input feature and allow a subset to be selected
(filter).
3. Data Transforms: Data transforms are used to change the type or distribution of data variables.
The appropriate transform depends on the data type of the variable:
Numeric Data Type: Number values.
    Integer: Whole numbers with no fractional part.
    Real: Floating point values.
Categorical Data Type: Label values.
    Ordinal: Labels with a rank ordering.
    Nominal: Labels with no rank ordering.
Boolean: Values True and False.
4. Feature Engineering: Feature engineering refers to the process of creating new input variables from
the available data. Engineering new features is highly specific to your data and data types. As such, it
often requires the collaboration of a subject matter expert to help identify new features that could be
constructed from the data.
5. Dimensionality Reduction: The number of input features for a dataset may be considered the
dimensionality of the data. A large number of dimensions can make modelling harder, which motivates
feature selection. An alternative to feature selection is to create a projection of the data into a
lower-dimensional space that still preserves the most important properties of the original data. The
most common approach to dimensionality reduction is to use a matrix factorization technique.
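To make two of the tasks above concrete, the following is a minimal sketch of filter-based feature selection and of projection into a lower-dimensional space. It assumes scikit-learn is installed and uses a synthetic dataset rather than the heart dataset. The choices of SelectKBest with the ANOVA F-score, PCA as the projection method, and the values of k and n_components are illustrative assumptions, not part of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration: 200 samples, 10 input features
X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           n_redundant=2, random_state=0)

# Filter feature selection: score each feature against the target, keep the 4 best
selector = SelectKBest(score_func=f_classif, k=4)
X_selected = selector.fit_transform(X, y)
print("Selected feature indices:", selector.get_support(indices=True))
print("Shape after selection:", X_selected.shape)

# Dimensionality reduction: project the standardized data onto 3 principal components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=3)
X_projected = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_.sum()
print("Shape after projection:", X_projected.shape)
print("Fraction of variance kept:", round(explained, 3))
```

Feature selection keeps a subset of the original columns, so the result stays interpretable. The projection instead builds new columns that are combinations of all the inputs.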
Confusion Matrix
A confusion matrix is an N x N matrix used for evaluating the performance of a classification model,
where N is the number of target classes. The matrix compares the actual target values with those predicted
by the machine learning model.
For a binary classifier, the terms associated with the confusion matrix are as follows:
True Positives (TP): The actual class and the predicted class of the data point are both 1.
True Negatives (TN): The actual class and the predicted class of the data point are both 0.
False Positives (FP): The actual class of the data point is 0 but the predicted class is 1.
False Negatives (FN): The actual class of the data point is 1 but the predicted class is 0.
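The four terms can be counted directly from a list of actual labels and a list of predicted labels. The sketch below uses only the Python standard library. The helper name confusion_counts and the label lists are hypothetical and serve only as an illustration.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Return (TP, TN, FP, FN) for binary labels, where 1 is the positive class."""
    pairs = Counter(zip(actual, predicted))   # counts each (actual, predicted) pair
    tp = pairs[(1, 1)]
    tn = pairs[(0, 0)]
    fp = pairs[(0, 1)]
    fn = pairs[(1, 0)]
    return tp, tn, fp, fn

# Hypothetical labels, for illustration only
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(actual, predicted)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")   # TP=3 TN=3 FP=1 FN=1
```

Arranged as a 2 x 2 matrix with actual classes as rows and predicted classes as columns, these counts are [[TN, FP], [FN, TP]], the layout that scikit-learn's confusion_matrix returns for labels 0 and 1.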
Code:
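import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline
# Assumption: the downloaded file is saved locally as 'heart.csv' and has the
# columns used below (age, sex, cp, trestbps, chol, fbs, thalach, exang, target)
df = pd.read_csv('heart.csv')
print('Information about the dataset:')
df.info() #Details of Rows & Columns (Count, Datatypes, Null Values & Memory usage)
print()
print('Number of rows:', df.shape[0])
print()
print(df.describe())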
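#1.sex: assumed encoding (1 = male; 0 = female)
male = len(df[df['sex'] == 1])
female = len(df[df['sex'] == 0])
plt.figure(figsize=(8,6)) #8 by 6 inch
# Data to plot
labels = 'Male','Female'
sizes = [male,female]
# Plot; autopct prints the percentage on each slice
plt.pie(sizes, labels=labels, autopct='%1.1f%%')
plt.axis('equal')
plt.show()
#2.cp: chest pain type
plt.figure(figsize=(8,6))
# Data to plot
labels = 'Chest Pain Type:0','Chest Pain Type:1','Chest Pain Type:2','Chest Pain Type:3'
sizes = [len(df[df['cp'] == 0]),
         len(df[df['cp'] == 1]),
         len(df[df['cp'] == 2]),
         len(df[df['cp'] == 3])]
# Plot specifications
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=180)
plt.axis('equal')
plt.show()
#3.fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
plt.figure(figsize=(8,6))
# Data to plot
labels = 'fasting blood sugar < 120 mg/dl','fasting blood sugar > 120 mg/dl'
sizes = [len(df[df['fbs'] == 0]), len(df[df['fbs'] == 1])]
# Plot
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=180)
plt.axis('equal')
plt.show()
#4.exang: exercise induced angina (1 = yes; 0 = no)
plt.figure(figsize=(8,6))
# Data to plot
labels = 'No','Yes'
sizes = [len(df[df['exang'] == 0]), len(df[df['exang'] == 1])]
# Plot
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)
plt.axis('equal')
plt.show()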
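#Correlation heatmap of the numeric columns
plt.figure(figsize=(14,8)) #14 by 8 inch
#cmap: Colourmap ('coolwarm' is an assumed choice)
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()
#Plotting the distribution of various attributes
#1.thalach: maximum heart rate achieved
# histplot replaces the deprecated distplot(..., kde=False)
sns.histplot(df['thalach'], bins=30, color='violet')
plt.show()
#2.chol: serum cholesterol
sns.histplot(df['chol'], bins=30, color='red')
plt.show()
#3.trestbps: resting blood pressure
sns.histplot(df['trestbps'], bins=30, color='blue')
plt.show()
# Assumed plot: number of records per age, split by target
plt.figure(figsize=(15,6))
sns.countplot(x='age', data=df, hue='target')
plt.show()
plt.figure(figsize=(8,6))
sns.scatterplot(x='chol',y='thalach',data=df,hue='target')
plt.show()
plt.figure(figsize=(8,6))
sns.scatterplot(x='trestbps',y='thalach',data=df,hue='target')
plt.show()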
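#Making Predictions
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
y = df['target']
X = df.drop('target', axis=1)
# Assumption: default 75/25 split with a fixed seed for repeatability
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
#Preprocessing - scaling the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on the training data only
X_train = pd.DataFrame(X_train_scaled)
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics
X_test = pd.DataFrame(X_test_scaled)
knn = KNeighborsClassifier()
params = {'n_neighbors': list(range(1,20)),
          'p': [1,2,3,4,5,6,7,8,9,10],
          'leaf_size': list(range(1,20)),
          'weights': ['uniform', 'distance']}
# Assumption: model is a cross-validated grid search over params
model = GridSearchCV(knn, params, n_jobs=-1)
model.fit(X_train, y_train)
predict = model.predict(X_test)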
#Checking accuracy
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve
print()
print('Accuracy Score:', round(accuracy_score(y_test, predict) * 100, 3), '%')
print()
#Confusion Matrix
class_names = [0,1]
fig, ax = plt.subplots()
cnf_matrix = confusion_matrix(y_test, predict)
print('Below is the confusion matrix')
# Rows are actual classes, columns are predicted classes
sns.heatmap(pd.DataFrame(cnf_matrix, index=class_names, columns=class_names),
            annot=True, fmt='g', ax=ax)
ax.xaxis.set_label_position('top')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.tight_layout()
plt.show()
#Classification report
print(classification_report(y_test, predict))
#ROC curve
y_probabilities = model.predict_proba(X_test)[:,1]
false_positive_rate_knn, true_positive_rate_knn, threshold_knn = roc_curve(y_test, y_probabilities)
plt.figure(figsize=(10,6))
plt.plot(false_positive_rate_knn, true_positive_rate_knn)
plt.plot([0,1], ls='--')         # diagonal: a classifier no better than chance
plt.plot([0,0], [1,0], c='.5')   # left border (FPR = 0)
plt.plot([1,1], c='.5')          # top border (TPR = 1)
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()
Output:
Conclusion: Thus we have studied different data preparation techniques.