ML Lab2 PGM
Apply the k-Nearest Neighbors technique to identify whether users purchased the item or not.
Working:
For a given data point, the algorithm finds the distances between that point and every other point in the dataset, takes the K closest points, and assigns the category that occurs most frequently among them (a majority vote). Usually, Euclidean distance is taken as the distance measure. Thus the resulting "model" is simply the labeled data placed in a space. The algorithm is popularly used in applications such as genetics and forecasting, and tends to work best when the features used are informative.
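The voting procedure described above can be sketched from scratch. This is a minimal illustration with made-up points and illustrative names, not the lab's code:

```python
# Minimal from-scratch sketch of the voting procedure described above, using
# Euclidean distance; the points and names are illustrative, not the lab code.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    dists = np.linalg.norm(X_train - x, axis=1)      # distance to every point
    nearest = np.argsort(dists)[:k]                  # indices of the k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]  # majority vote

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [8.2, 7.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9])))   # -> 0
```

The query point lies next to the two class-0 points, so the majority vote among its 3 nearest neighbours returns 0.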
KNN also tends to reduce overfitting, but on the other hand the value of K must be chosen well. So how do we choose K? A common heuristic is to use the square root of the number of samples in the dataset as the value for K. An optimal value still has to be found, since a low K may lead to overfitting, while a high K increases the cost of the distance computations and can over-smooth the decision boundary. Plotting the error rate against K can help; the elbow method is another option. You can prefer the square-root heuristic or follow the elbow method instead.
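A sketch of the error-plot approach, using synthetic data from sklearn's make_classification (an assumption, since the lab's real dataset is only introduced later):

```python
# Sketch of the error-plot approach to choosing K, on synthetic data from
# make_classification (an assumption; the lab's real dataset appears later).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Record the test error for each candidate K
errors = []
for k in range(1, 21):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    errors.append(1 - model.score(X_te, y_te))

best_k = int(np.argmin(errors)) + 1        # K with the lowest test error
print(best_k, int(np.sqrt(len(X_tr))))     # compare with the sqrt heuristic
```

Plotting `errors` against K with matplotlib would show the elbow; here we simply pick the K with the smallest error and compare it with the square-root heuristic (sqrt(320) ≈ 17).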
Example:
Consider an example problem to build a clear intuition for K-Nearest Neighbors classification. We use the Social Network Ads dataset, which contains details of users on a social networking site; the task is to predict whether a user buys a product after clicking an ad on the site, based on their gender, age, and salary.
Importing essential libraries:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sklearn
Importing of the dataset and slicing it into independent and dependent variables:
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [1, 2, 3]].values   # Gender, Age, EstimatedSalary
y = dataset.iloc[:, -1].values          # Purchased (0/1)
Since the dataset contains a character variable (Gender), it needs to be encoded using LabelEncoder.
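The encoding step can be sketched as below. A small stand-in array replaces the CSV slice (the real X comes from the iloc slicing above); LabelEncoder maps the sorted class names to integers:

```python
# Sketch of encoding the Gender column; the array stands in for the real X.
import numpy as np
from sklearn.preprocessing import LabelEncoder

X = np.array([['Male', 19, 19000],
              ['Female', 26, 43000],
              ['Female', 27, 57000]], dtype=object)

le = LabelEncoder()
X[:, 0] = le.fit_transform(X[:, 0])   # 'Female' -> 0, 'Male' -> 1
X = X.astype(float)                   # now every column is numeric
print(X[:, 0])                        # [1. 0. 0.]
```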
Split the dataset into a train and a test set. With a test size of 0.20, the training sample contains 320 rows and the test sample contains 80 rows.
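The split might look like the following; the stand-in arrays mimic the dataset's 400 rows, and the random_state is an arbitrary choice for reproducibility:

```python
# Illustrative split; stand-in arrays mimic the 400-row dataset.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((400, 3))        # stands in for Gender/Age/Salary
y = rng.integers(0, 2, 400)     # stands in for Purchased

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)
print(len(X_train), len(X_test))   # 320 80
```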
Next, feature scaling is applied to the training and test sets of independent variables, so that features measured on large scales (such as salary) do not dominate the distance computation.
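A sketch of the scaling step with StandardScaler, one common choice for this lab; the small arrays stand in for the real split:

```python
# Sketch of feature scaling; the small arrays stand in for the real split.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[19., 19000.], [26., 43000.], [27., 57000.], [35., 76000.]])
X_test = np.array([[30., 87000.]])

sc = StandardScaler()
X_train = sc.fit_transform(X_train)   # fit the mean/std on the training set only
X_test = sc.transform(X_test)         # apply the same scaling to the test set
print(X_train.mean(axis=0))           # each scaled column now has mean ~0
```

Fitting the scaler on the training set only, then reusing it on the test set, prevents information from the test set leaking into training.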
Build and train the K Nearest Neighbor model with the training set.
Three different parameters are used when creating the model. n_neighbors is set to 5, which means the 5 nearest neighbouring points are used to classify a given point. The distance metric used is Minkowski, whose equation is

    d(x, y) = ( Σ_i |x_i − y_i|^p )^(1/p)

In this example we choose p = 2, which reduces the Minkowski distance to the Euclidean distance. Once the model is created and trained, we predict the output for the test set.
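The model-building step can be sketched as below; the tiny stand-in training set forces n_neighbors=3 here, whereas the lab uses 5 on the full data:

```python
# Sketch of the model-building step; tiny stand-in data forces n_neighbors=3,
# whereas the lab uses n_neighbors=5 on the full dataset.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[0.1, 0.2], [0.0, 0.1], [0.9, 1.0], [1.0, 0.8]])
y_train = np.array([0, 0, 1, 1])

# metric='minkowski' with p=2 is exactly the Euclidean distance
classifier = KNeighborsClassifier(n_neighbors=3, metric='minkowski', p=2)
classifier.fit(X_train, y_train)
print(classifier.predict(np.array([[0.05, 0.15]])))   # nearest points are class 0
```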
y_pred = classifier.predict(X_test)
y_test
>>
array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1,
       1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1], dtype=int64)
y_pred
>>
array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1], dtype=int64)
Evaluate the model using the confusion matrix and the accuracy score, comparing the predicted values with the actual test values.
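The evaluation can be sketched as follows, with short stand-in label arrays in place of the real y_test and y_pred shown above:

```python
# Sketch of the evaluation step; short stand-in label arrays replace the
# real y_test / y_pred shown above.
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

y_true = np.array([0, 0, 1, 1, 0, 1])
y_hat = np.array([0, 1, 1, 1, 0, 0])

cm = confusion_matrix(y_true, y_hat)   # rows = actual class, columns = predicted
ac = accuracy_score(y_true, y_hat)     # fraction of correct predictions
print(cm)
print(ac)
```

The accuracy equals the sum of the diagonal of the confusion matrix divided by the total number of samples.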
Confusion matrix :
cm
>>
[[64  4]
 [ 3 29]]
ac
>>
0.93