K Nearest Neighbor : Step by Step Tutorial
In this tutorial, we will cover how the K-nearest neighbor (KNN) algorithm works in R. It is one of the most widely used algorithms for classification problems.
KNN is a non-parametric supervised learning technique in which we try to assign a data point to a given category with the help of a training set. In simple words, it captures information from all training cases and classifies new cases based on similarity.
Predictions are made for a new instance (x) by searching through the entire
training set for the K most similar cases (neighbors) and summarizing the
output variable for those K cases. In classification this is the mode (or most
common) class value.
Suppose we have the height, weight and T-shirt size of some customers, and we need to predict the T-shirt size of a new customer given only their height and weight. The data, including height, weight and T-shirt size information, is shown below -
Height (cm)   Weight (kg)   T-shirt Size
158           58            M
158           59            M
158           63            M
160           59            M
160           60            M
163           60            M
163           61            M
160           64            L
163           64            L
165           61            L
165           62            L
165           65            L
168           62            L
168           63            L
168           66            L
170           63            L
170           64            L
170           68            L
Distance Functions

There are many distance functions, but Euclidean distance is the most commonly used measure. It is mainly used when the data are continuous. Manhattan distance is also very common for continuous variables.

The idea behind using a distance measure is to compute the distance (similarity) between the new sample and the training cases, and then find the k closest customers to the new customer in terms of height and weight.
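For reference, the standard definitions of these two measures for points x and y with n features are:

$$d_{\text{Euclidean}}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
\qquad
d_{\text{Manhattan}}(x, y) = \sum_{i=1}^{n} \lvert x_i - y_i \rvert$$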
A new customer named 'Monica' has height 161cm and weight 61kg.
The Euclidean distance between the first observation (158, 58) and the new observation (Monica) is as follows -
=SQRT((161-158)^2+(61-58)^2) = 4.24
Similarly, we calculate the distance of the new case from all the training cases and rank the training cases by distance. The smallest distance value is ranked 1 and treated as the nearest neighbor.
Let k be 5. The algorithm then searches for the 5 customers closest to Monica, i.e. most similar to Monica in terms of attributes, and sees what categories those 5 customers were in. If 4 of them have a 'Medium' T-shirt size and 1 has a 'Large' T-shirt size, then your best guess for Monica is 'Medium'. The full calculation is sketched below -
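Here is a minimal sketch of that calculation in R; train_df reproduces the training table above, and the variable names are illustrative.

# Training data from the table above
train_df <- data.frame(
  height = c(158, 158, 158, 160, 160, 163, 163, 160, 163,
             165, 165, 165, 168, 168, 168, 170, 170, 170),
  weight = c(58, 59, 63, 59, 60, 60, 61, 64, 64,
             61, 62, 65, 62, 63, 66, 63, 64, 68),
  size   = c(rep("M", 7), rep("L", 11))
)

# New customer (Monica)
new_height <- 161
new_weight <- 61

# Euclidean distance of every training case from the new case
train_df$distance <- sqrt((train_df$height - new_height)^2 +
                          (train_df$weight - new_weight)^2)

# Rank by distance: rank 1 is the nearest neighbor
train_df$rank <- rank(train_df$distance, ties.method = "first")

# Take the 5 nearest neighbours and their modal (most common) size
neighbours <- train_df[train_df$rank <= 5, ]
names(which.max(table(neighbours$size)))   # "M" -> Medium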
If we plot the data, the binary dependent variable (T-shirt size) appears in two colors: 'Medium T-shirt size' in blue and 'Large T-shirt size' in orange, with the new customer shown as a yellow circle. Four blue data points and one orange data point lie closest to the yellow circle, so the prediction for the new case is Medium T-shirt size.
Assumptions of KNN

1. Standardization

When the independent variables in the training data are measured in different units, it is important to standardize them before calculating distance. For example, if one variable is height in cm and the other is weight in kg, height will have a larger influence on the distance calculation. To make the variables comparable, we standardize them, which can be done by either of the following methods:
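Two common methods are z-score standardization and min-max scaling:

$$X_{\text{z-score}} = \frac{X - \bar{X}}{s_X}
\qquad
X_{\text{min-max}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$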
After standardization, the 5th closest neighbor changes, because height dominated the distance calculation before standardization. Hence, it is important to standardize predictors before running the K-nearest neighbor algorithm.
2. Outliers

A low k value is sensitive to outliers, while a higher k value is more resilient to them, as it considers more voters when deciding the prediction.
Why is KNN non-parametric?

Non-parametric means making no assumptions about the underlying data distribution. Non-parametric methods do not have a fixed number of parameters in the model. Similarly, in KNN the number of model parameters effectively grows with the training data set: you can imagine each training case as a "parameter" in the model.
KNN vs. K-means

Many people confuse these two techniques, K-means and K-nearest neighbor. The key differences are:

1. KNN is a supervised learning algorithm used for classification (and regression), whereas K-means is an unsupervised algorithm used for clustering.
2. In KNN, k is the number of nearest neighbors used to classify a new case; in K-means, k is the number of clusters to be identified in unlabeled data.
Can KNN be used for regression?

Yes, K-nearest neighbor can be used for regression. In other words, the K-nearest neighbor algorithm can be applied when the dependent variable is continuous. In this case, the predicted value is the average of the values of its k nearest neighbors.
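As a quick illustration, here is a minimal sketch of KNN regression using knnreg() from the caret package, predicting weight from height on the T-shirt data (train_df from the earlier sketch); the choice of k = 5 is arbitrary.

library(caret)

# Fit a 5-nearest-neighbour regression of weight on height
reg_fit <- knnreg(weight ~ height, data = train_df, k = 5)

# Predicted weight for Monica's height: the average weight
# of her 5 nearest neighbours in terms of height
predict(reg_fit, data.frame(height = 161))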
Pros
1. Easy to understand
2. No assumptions about data
3. Can be applied to both classification and regression
4. Works easily on multi-class problems
Cons
For any given problem, a small value of k will lead to a large variance in
predictions. Alternatively, setting k to a large value may lead to a large model
bias.
How to handle categorical variables in KNN?

Create dummy variables out of a categorical variable and include them instead of the original categorical variable. Unlike in regression, create k dummies instead of (k-1). For example, a categorical variable named "Department" with 5 unique levels / categories yields 5 dummy variables, each of which is 1 for its own department and 0 otherwise. See the sketch below.
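A minimal sketch using caret's dummyVars(); the "Department" data frame is a made-up illustration.

library(caret)

df <- data.frame(Department = c("HR", "IT", "Sales", "Finance", "Ops"))

# fullRank = FALSE keeps all k dummies (not k-1, unlike regression)
dummies <- dummyVars(~ Department, data = df, fullRank = FALSE)
predict(dummies, newdata = df)   # one 0/1 column per department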
How to find the best K value?

Cross-validation is a smart way to find the optimal K value. It estimates the validation error rate by holding out a subset of the training set from the model-building process.

Cross-validation (say, 10-fold) involves randomly dividing the training set into 10 groups, or folds, of approximately equal size. 90% of the data is used to train the model and the remaining 10% to validate it. The misclassification rate is then computed on the 10% validation data. This procedure repeats 10 times, with a different group of observations treated as the validation set each time, yielding 10 estimates of the validation error which are then averaged. This procedure is sketched below.
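A minimal sketch of manual 10-fold cross-validation using knn() from the class package; X (a matrix of standardized predictors) and y (a factor of class labels) are placeholders for your own data.

library(class)

cv_error <- function(X, y, k, folds = 10) {
  set.seed(1)
  # Randomly assign each observation to one of the folds
  fold_id <- sample(rep(1:folds, length.out = nrow(X)))
  errs <- sapply(1:folds, function(f) {
    pred <- knn(train = X[fold_id != f, ],
                test  = X[fold_id == f, ],
                cl    = y[fold_id != f],
                k     = k)
    mean(pred != y[fold_id == f])   # misclassification rate on the held-out fold
  })
  mean(errs)                        # average of the 10 error estimates
}

# Evaluate a grid of candidate k values and pick the one with the
# lowest cross-validated error, e.g.:
# ks <- seq(1, 25, by = 2)
# errors <- sapply(ks, function(k) cv_error(X, y, k))
# ks[which.min(errors)]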
K Nearest Neighbor in R

We are going to use historical data of past win/loss statistics and the corresponding speeches. The dataset comprises 1524 observations on 14 variables. The dependent variable is win/loss, where 1 indicates a win and 0 indicates a loss; the remaining 13 variables serve as independent variables.
Read Data

# Read data (header = TRUE, the default, reads the first row as column names)
data1 = read.csv("US Presidential Data.csv", header = TRUE)
View(data1)

We read the CSV file with the read.csv command. The first argument is the name of the file. The second argument, header = TRUE (or T), indicates that the first row of the CSV file contains the column headings, while header = FALSE (or F) indicates that the data start from the first line and there are no headings.
# Load libraries
library(caret)
library(e1071)

Here we use the caret package to run KNN. Since the dependent variable is numeric, we need to transform it to a factor using as.factor().

To partition the data into training and validation sets, we use the createDataPartition() function from caret.
First we set the seed to 101 so that the same results can be reproduced. In createDataPartition(), the first argument is the dependent variable; p denotes the share of data we want in the training set (here 70%, with the rest going to the validation set); and list = F means the indices are returned as a vector.
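A minimal sketch of this step, assuming the dependent variable is named Win.Loss (as in the model formula later on):

set.seed(101)

# Convert the numeric 0/1 dependent variable to a factor
data1$Win.Loss <- as.factor(data1$Win.Loss)

# 70% of rows go to the training set, the rest to validation
index <- createDataPartition(data1$Win.Loss, p = 0.7, list = F)
train <- data1[index, ]
validation <- data1[-index, ]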
# Explore data
dim(train)
dim(validation)
names(train)
head(train)
head(validation)
The dimensions of the training and validation sets are checked via dim(); head() displays the first 6 rows of the training dataset.
By default, the levels of the dependent variable in this dataset are "0" and "1". Later, when we make predictions, these levels will be used as variable names, so we need to convert them into valid R variable names.
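A minimal sketch, assuming make.names() is used; it turns "0"/"1" into the syntactically valid level names "X0"/"X1" that appear in the model output below.

# Make factor levels valid R variable names: "0" -> "X0", "1" -> "X1"
levels(train$Win.Loss) <- make.names(levels(train$Win.Loss))
levels(validation$Win.Loss) <- make.names(levels(validation$Win.Loss))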
Here we use the repeated cross-validation method via trainControl(). 'number' denotes the number of folds and 'repeats' the number of times the r-fold cross-validation is repeated. In this case, 3 separate 10-fold cross-validations are used.

set.seed(1234)
numbers <- 10
repeats <- 3

x = trainControl(method = "repeatedcv",
                 number = numbers,
                 repeats = repeats,
                 classProbs = TRUE,
                 summaryFunction = twoClassSummary)
Using the train() function we run our KNN. Win.Loss is the dependent variable, and the full stop after the tilde means that all the remaining variables are used as independent variables. In 'data =' we pass our training set; 'method =' denotes which technique we want to deploy; and setting preProcess to center and scale tells train() to standardize our independent variables. trControl takes the 'x' we obtained via trainControl(), and tuneLength is an integer that controls how many candidate values of k are evaluated during tuning.
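A hedged sketch of that call (the tuneLength value here is an assumption; metric = "ROC" matches the twoClassSummary set in trainControl):

model1 <- train(Win.Loss ~ ., data = train,
                method = "knn",
                preProcess = c("center", "scale"),
                trControl = x,
                metric = "ROC",
                tuneLength = 20)   # tuneLength = 20 is an illustrative choice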
# Summary of model
model1
plot(model1)
k-Nearest Neighbors
1068 samples
13 predictor
2 classes: 'X0', 'X1'
ROC was used to select the optimal model using the largest value.
The final value used for the model was k = 11.
Finally, to make predictions on our validation set, we use the predict() function; the first argument is the fitted model and the second argument is the new data on which we want predictions.
# Validation
valid_pred <- predict(model1, validation, type = "prob")

# Build a ROCR prediction object from the predicted probabilities;
# column 2 of valid_pred is assumed to hold the probability of class "X1"
library(ROCR)
pred_val <- prediction(valid_pred[, 2], validation$Win.Loss)

# Plot the ROC curve
perf_val <- performance(pred_val, "tpr", "fpr")
plot(perf_val, col = "green", lwd = 1.5)

# Calculate the KS statistic (maximum separation between TPR and FPR)
ks <- max(attr(perf_val, "y.values")[[1]] - attr(perf_val, "x.values")[[1]])
ks
Special thanks to Ekta Aggarwal for her contribution to this article, of which she is a co-author. She is a data science enthusiast, currently in the final year of her postgraduate degree in Statistics at Delhi University.
About Author:

Deepanshu founded ListenData with a simple objective - make analytics easy to understand and follow. He has over 7 years of experience in data science and predictive modeling. During his tenure, he has worked with global clients in various domains such as banking, telecom, HR and health insurance.
While I love having friends who agree, I only learn from those who don't.