KNN - Jupyter Notebook
In [1]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
In [2]: data = pd.read_csv('C:/Users/kriti/OneDrive/Desktop/machine Learning/experiments/diabetes.csv')
data
Out[2]:
     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
1              1       85             66             29        0  26.6                     0.351   31        0
3              1       89             66             23       94  28.1                     0.167   21        0
..           ...      ...            ...            ...      ...   ...                       ...  ...      ...

[768 rows x 9 columns]
In [3]: print(len(data))
768
In [4]: data.columns
Out[4]: Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
               'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
              dtype='object')
Out[7]: nan
In [8]: np.nansum(v)
Out[8]: 8.0
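The cells that define v (and the inputs behind Out[7]) are not captured in this printout. The outputs are consistent with a small array containing a NaN, where a plain np.sum propagates the NaN (Out[7]) while np.nansum ignores it (Out[8]). A minimal sketch, with hypothetical values chosen only so that the non-NaN entries sum to 8:

v = np.array([1.0, 3.0, np.nan, 4.0])   # hypothetical example; the original v is not shown
print(np.sum(v))       # nan  - an ordinary sum propagates the missing value
print(np.nansum(v))    # 8.0  - nansum treats NaN as zero and sums the rest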
In [9]: print(data['SkinThickness'])
0 35.0
1 29.0
2 29.0
3 23.0
4 35.0
...
763 48.0
764 27.0
765 23.0
766 29.0
767 31.0
Name: SkinThickness, Length: 768, dtype: float64
Out[10]:
     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age
..           ...      ...            ...            ...      ...   ...                       ...  ...

[768 rows x 8 columns]
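The input cell behind Out[10] is not captured; a plausible reconstruction, selecting the first eight columns of the DataFrame as the feature matrix x (the exact expression used is an assumption):

x = data.iloc[:, 0:8]   # all rows, columns 0-7: the eight predictor variables
x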
-sklearn Library: Scikit-learn is one of the most widely used libraries for machine learning in Python. The sklearn library contains many efficient tools for machine learning and statistical modelling, including classification, regression, clustering and dimensionality reduction. It offers:
• Supervised learning algorithms
• Cross-validation
• Unsupervised learning algorithms
• Various toy datasets (e.g. the IRIS dataset and the Boston house prices dataset; see the short snippet below)
• Feature extraction, e.g. extracting features from images
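As a quick illustration of the toy datasets mentioned above (this snippet is not part of the original notebook), a dataset such as Iris can be loaded in a couple of lines:

from sklearn.datasets import load_iris

iris = load_iris()          # a Bunch object with .data, .target, .feature_names
print(iris.data.shape)      # (150, 4) - 150 samples, 4 features
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']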
In [11]: y = data.iloc[:, 8]
y
Out[11]: 0 1
1 0
2 1
3 0
4 1
..
763 0
764 0
765 0
766 1
767 0
Name: Outcome, Length: 768, dtype: int64
-model_selection: a method for setting a blueprint to analyse your data and then using it to measure new data. Selecting a proper model allows you to generate accurate results when making predictions.
-train_test_split: a function in sklearn.model_selection for splitting data arrays into two subsets: one for training and one for testing. With this function, you don't need to divide the dataset manually.
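The cell that actually performs the split is not captured in this printout. A typical call is sketched below; test_size and random_state are assumed values, though a test set of 154 rows (implied by the confusion matrices and prediction array later on) is consistent with test_size=0.2 on 768 rows.

from sklearn.model_selection import train_test_split

# split features and labels into training and test subsets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
print(x_train.shape, x_test.shape)   # e.g. (614, 8) (154, 8)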
In [28]: print(x)
print(y)
     DiabetesPedigreeFunction  Age
0                       0.627   50
1                       0.351   31
2                       0.672   32
3                       0.167   21
4                       2.288   33
..                        ...  ...
763                     0.171   63
764                     0.340   27
765                     0.245   30
766                     0.349   47
767                     0.315   23

[768 rows x 8 columns]

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64
1. Data standardization is the process of rescaling the attributes so that they have mean 0 and variance 1.
2. The ultimate goal of standardization is to bring all the features to a common scale without distorting the differences in the range of the values.
3. In sklearn.preprocessing.StandardScaler(), centering and scaling happen independently on each feature.
4. The idea behind StandardScaler is that it will transform your data such that its distribution has a mean of 0 and a standard deviation of 1.
5. The Python sklearn library offers the StandardScaler() class to standardize the data values into a standard format with mean 0 and standard deviation 1 (a small numeric illustration follows below).
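As a small illustration (not part of the original notebook) of what StandardScaler computes, standardizing a column is equivalent to subtracting its mean and dividing by its standard deviation:

import numpy as np
from sklearn.preprocessing import StandardScaler

col = np.array([[2.0], [4.0], [6.0], [8.0]])     # a single feature column
manual = (col - col.mean()) / col.std()          # z = (x - mean) / std
scaled = StandardScaler().fit_transform(col)     # StandardScaler gives the same result
print(np.allclose(manual, scaled))               # True
print(scaled.mean(), scaled.std())               # ~0.0 and 1.0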
1. Feature scaling is done by giving every input column a mean of 0 and a standard deviation of 1.
2. The fit_transform() function is used on x_train to learn the parameters (mean and standard deviation) so that standardization can happen.
3. The transform() function is then used on x_test so that the parameters (mean and standard deviation) learned from the train set are applied to the test set.
In [30]: from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
1. object = StandardScaler()
2. object.fit_transform(data)
3. According to the above syntax, we first create an object of the StandardScaler() class and then call fit_transform() on it.
4. fit_transform() is used on the training data so that we can scale the training data and also learn the scaling parameters (mean and standard deviation) of that data. The model thus learns the mean and standard deviation of the features of the training set; these learned parameters are then used to scale the test data as well.
5. Using the transform() method, the same mean and standard deviation calculated from the training data are used to transform the test data. Thus, the parameters learned from the training data are applied to the test data as well (illustrated below).
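To make point 5 concrete, here is a small sketch (not from the original notebook) showing that the scaler's parameters come from the training data only and are then reused on the test data:

import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[4.0], [5.0]])

sc = StandardScaler()
sc.fit_transform(train)        # learns mean_ and scale_ from the training data
print(sc.mean_, sc.scale_)     # [2.] [0.8164...] - mean and std of the train column
print(sc.transform(test))      # test values standardized with the TRAIN mean and std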
In [31]: x_train
First, import the KNeighborsClassifier module and create a KNN classifier object by passing the number of neighbors as an argument to the KNeighborsClassifier() function.
Why k is odd: Let's think for a while: the k in the KNN algorithm represents the number of closest neighbors that you are comparing, right? So, no matter whether you have 2 or n classes, if you choose an even k there is a risk of a tie when deciding which class a new instance should be assigned to. This is why k is usually odd: no ties.
Fit the model on the train set using fit() and perform prediction on the test set using predict().
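The cells that create, fit and evaluate the classifier are not captured in this printout; a minimal reconstruction is sketched below (the value of n_neighbors is an assumption). The array in Out[19] below is the resulting y_pred, and the 2x2 matrix that follows it is the corresponding confusion matrix.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

knn = KNeighborsClassifier(n_neighbors=5)   # n_neighbors is an assumed (odd) value
knn.fit(x_train, y_train)                   # fit on the standardized training set
y_pred = knn.predict(x_test)                # predict labels for the test set
print(confusion_matrix(y_test, y_pred))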
Out[19]: array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0,
1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0,
1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0,
0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0],
dtype=int64)
[[82 21]
[23 28]]
In [21]: # Find accuracy
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))
0.7142857142857143
0.5599999999999999
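The second number above (0.5599...) is not produced by the accuracy cell shown. It matches the F1 score of the positive class implied by the confusion matrix [[82 21], [23 28]] (2*28 / (2*28 + 21 + 23) = 0.56), so the uncaptured cell was presumably something along the lines of:

from sklearn.metrics import f1_score
print(f1_score(y_test, y_pred))   # F1 score of the positive class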
Elbow method: helps to find the optimal value of k. The elbow method helps data scientists select the optimal number of neighbors for KNN. As k increases, the error usually goes down, then stabilizes, and then rises again. Pick the optimal k at the beginning of the stable zone; this is called the elbow method.
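The loop that fills error_rate is not captured in this printout; a standard construction (a sketch, assuming the test-set misclassification rate is used as the error) is:

from sklearn.neighbors import KNeighborsClassifier

error_rate = []
# try k = 1 .. 39 and record the misclassification rate on the test set for each k
for k in range(1, 40):
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(x_train, y_train)
    pred_k = knn_k.predict(x_test)
    error_rate.append(np.mean(pred_k != y_test))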
In [24]: error_rate
Out[24]: [0.3246753246753247,
0.2922077922077922,
0.2792207792207792,
0.2792207792207792,
0.2922077922077922,
0.2727272727272727,
0.2727272727272727,
0.2662337662337662,
0.2792207792207792,
0.2662337662337662,
0.2857142857142857,
0.2597402597402597,
0.2792207792207792,
0.2792207792207792,
0.2597402597402597,
0.2727272727272727,
0.2532467532467532,
0.2597402597402597,
0.24675324675324675,
0.24025974025974026,
0.2532467532467532,
0.2532467532467532,
0.2532467532467532,
0.2532467532467532,
0.2532467532467532,
0.24675324675324675,
0.24675324675324675,
0.2532467532467532,
0.2597402597402597,
0.2727272727272727,
0.2792207792207792,
0.2727272727272727,
0.2662337662337662,
0.2597402597402597,
0.2597402597402597,
0.2532467532467532,
0.2792207792207792,
0.2597402597402597,
0.2662337662337662]
In [25]: plt.figure(figsize=(10,6))
plt.plot(range(1,40), error_rate, color="blue", linestyle="dashed", marker="o",
         markerfacecolor="red", markersize=10)
plt.title("Error Rate vs. K Value")
plt.xlabel("K")
plt.ylabel("Error Rate")
1. Use the confusion_matrix method from sklearn.metrics to compute the confusion matrix (a sketch follows below).
2. classification_report: gives a text report showing the main classification metrics.
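The cell that produced the matrix and report below is not captured; presumably the classifier was refit with a k chosen from the elbow plot and evaluated roughly as follows (the value of n_neighbors here is an assumption):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report

knn_best = KNeighborsClassifier(n_neighbors=19)   # assumed k, read off the elbow plot
knn_best.fit(x_train, y_train)
y_pred_best = knn_best.predict(x_test)
print(confusion_matrix(y_test, y_pred_best))
print(classification_report(y_test, y_pred_best))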
[[90 13]
[24 27]]
precision recall f1-score support