
Big-data Clinical Trial Column


Introduction to machine learning: k-nearest neighbors


Zhongheng Zhang

Department of Critical Care Medicine, Jinhua Municipal Central Hospital, Jinhua Hospital of Zhejiang University, Jinhua 321000, China
Correspondence to: Zhongheng Zhang, MMed. 351#, Mingyue Road, Jinhua 321000, China. Email: [email protected].

Author’s introduction: Zhongheng Zhang, MMed. Department of Critical Care Medicine, Jinhua Municipal Central Hospital, Jinhua Hospital of Zhejiang University. Dr. Zhongheng Zhang is a fellow physician at Jinhua Municipal Central Hospital. He graduated from the School of Medicine, Zhejiang University, in 2009 with a Master's degree. He has published more than 35 academic papers (Science Citation Index) that have been cited more than 200 times. He serves as a reviewer for 10 journals, including Journal of Cardiovascular Medicine, Hemodialysis International, Journal of Translational Medicine, Critical Care, International Journal of Clinical Practice, and Journal of Critical Care. His major research interests include hemodynamic monitoring in sepsis and septic shock, delirium, and outcome studies in critically ill patients. He is experienced in data management and statistical analysis using R and Stata, big-data exploration, systematic review and meta-analysis.


Abstract: Machine learning techniques have been widely used in many scientific fields, but their use in the medical literature is limited, partly because of technical difficulties. k-nearest neighbors (kNN) is a simple method of machine learning. This article introduces some basic ideas underlying the kNN algorithm and then focuses on how to perform kNN modeling with R. The dataset should be prepared before running the knn() function in R. After the outcome is predicted with the kNN algorithm, the diagnostic performance of the model should be checked. Average accuracy is the most widely used statistic to reflect the performance of the kNN algorithm. Factors such as the k value, the distance calculation and the choice of appropriate predictors all have a significant impact on model performance.

Keywords: Machine learning; R; k-nearest neighbors (kNN); class; average accuracy; kappa

Submitted Jan 25, 2016. Accepted for publication Feb 18, 2016.
doi: 10.21037/atm.2016.03.37
View this article at: http://dx.doi.org/10.21037/atm.2016.03.37


Introduction to k-nearest neighbor (kNN)

The kNN classifier assigns an unlabeled observation to the class of the most similar labeled examples. Characteristics of the observations are collected for both the training and the test datasets. For example, fruits, vegetables and grains can be distinguished by their crunchiness and sweetness (Figure 1). For the purpose of displaying them on a two-dimensional plot, only two characteristics are employed; in reality there can be any number of predictors, and the example can be extended to incorporate any number of characteristics. In general, fruits are sweeter than vegetables, and grains are neither crunchy nor sweet. Our task is to determine which category the sweet potato belongs to. In this example we choose the four nearest kinds of food: apple, green bean, lettuce, and corn. Because the vegetable class wins the most votes, the sweet potato is assigned to the class of vegetable. You can see that the key concept of kNN is easy to understand.

Figure 1 Illustration of how the k-nearest neighbors algorithm works (sweetness plotted against crunchiness for fruit, vegetable and grain).

There are two important concepts in the above example. One is the method used to calculate the distance between the sweet potato and the other kinds of food. By default, the knn() function employs the Euclidean distance, which can be calculated with the following equation (1,2):

D(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2}   [1]

where p and q are the subjects to be compared, each with n characteristics. There are also other methods to calculate distance, such as the Manhattan distance (3,4).

Another concept is the parameter k, which decides how many neighbors will be chosen for the kNN algorithm. The appropriate choice of k has a significant impact on the diagnostic performance of the kNN algorithm. A large k reduces the impact of variance caused by random error, but runs the risk of ignoring small but important patterns. The key to choosing an appropriate k value is to strike a balance between overfitting and underfitting (5). Some authors suggest setting k equal to the square root of the number of observations in the training dataset (6).
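To make these two ideas concrete, here is a minimal R sketch, assuming two made-up observations p and q with two characteristics each; the object names are illustrative only.

> p <- c(sweetness=8, crunchiness=5)   # a hypothetical unlabeled item
> q <- c(sweetness=3, crunchiness=9)   # a hypothetical labeled neighbor
> sqrt(sum((p - q)^2))                 # Euclidean distance, as in equation [1]
> sum(abs(p - q))                      # Manhattan distance
> round(sqrt(150))                     # rule-of-thumb k for a training set of 150 observations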
Working example

For illustration of how kNN works, I created a dataset that has no actual meaning.

> set.seed(seed=888)
> df1 <- data.frame(x1=runif(200,0,100), x2=runif(200,0,100))
> df1 <- transform(df1, y=1+ifelse(100 - x1 - x2 + rnorm(200,sd=10) < 0, 0, ifelse(100 - 2*x2 + rnorm(200,sd=10) < 0, 1, 2)))
> df1$y <- as.factor(df1$y)
> df1$tag <- c(rep("train",150), rep("test",50))

The first line sets a seed to make the output reproducible. The second line creates a data frame named df1 that contains two variables, x1 and x2. I then add another categorical variable, y, which has three categories. However, y is created as a numeric variable, so I convert it into a factor using the as.factor() function. Finally, a tag variable is added to split the dataset into a training set and a test set.
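Before plotting, a quick cross-tabulation gives a minimal check of how the three classes are spread over the training and test subsets; the exact counts depend on the seed.

> table(df1$y, df1$tag)   # class counts within the training and test subsets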
Next we can examine the dataset by graphical presentation.

> library(ggplot2)
> qplot(x1, x2, data=df1, colour=y, shape=tag)

As you can see in Figure 2, the different categories are denoted by red, green and blue colors. The whole dataset is split in a 150:50 ratio into training and test datasets. Dots represent test data and triangles represent training data.

Figure 2 Visual presentation of the simulated working example (x2 plotted against x1). Classes 1, 2 and 3 are denoted by red, green and blue colors, respectively. Dots represent test data and triangles represent training data.

Performing kNN algorithm with R

The R package class contains a very useful function for the purpose of the kNN machine learning algorithm (7). Firstly, one needs to install and load the class package into the working space.

> install.packages("class")
> library(class)

Then we divide the original dataset into training and test datasets. Note that the training and test data frames contain only the predictor variables; the response variable is stored in separate vectors.

> train <- df1[1:150,1:2]
> train.label <- df1[1:150,3]
> test <- df1[151:200,1:2]
> test.label <- df1[151:200,3]

Up to now, the datasets are well prepared for kNN model building. Because kNN is a non-parametric algorithm, we will not obtain parameters for the model. The knn() function returns a factor vector containing the classifications of the test set. In the following code, I arbitrarily choose a k value of 6. The results are stored in the vector pred.

> pred <- knn(train=train, test=test, cl=train.label, k=6)

The results can be viewed by using the CrossTable() function in the gmodels package.

> install.packages("gmodels")
> library(gmodels)
> CrossTable(x = test.label, y = pred, prop.chisq=FALSE)

   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

Total Observations in Table: 50

             | pred
  test.label |     1 |     2 |     3 | Row Total |
-------------|-------|-------|-------|-----------|
           1 |    29 |     0 |     0 |        29 |
             | 1.000 | 0.000 | 0.000 |     0.580 |
             | 0.935 | 0.000 | 0.000 |           |
             | 0.580 | 0.000 | 0.000 |           |
-------------|-------|-------|-------|-----------|
           2 |     2 |     6 |     2 |        10 |
             | 0.200 | 0.600 | 0.200 |     0.200 |
             | 0.065 | 0.857 | 0.167 |           |
             | 0.040 | 0.120 | 0.040 |           |
-------------|-------|-------|-------|-----------|
           3 |     0 |     1 |    10 |        11 |
             | 0.000 | 0.091 | 0.909 |     0.220 |
             | 0.000 | 0.143 | 0.833 |           |
             | 0.000 | 0.020 | 0.200 |           |
-------------|-------|-------|-------|-----------|
Column Total |    31 |     7 |    12 |        50 |
             | 0.620 | 0.140 | 0.240 |           |
-------------|-------|-------|-------|-----------|

Diagnostic performance of the model

The kNN algorithm assigns a category to observations in the test dataset by comparing them to the observations in the training dataset. Because we know the actual category of the observations in the test dataset, the performance of the kNN model can be evaluated. One of the most commonly used parameters is the average accuracy, which is defined by the following equation (8):

Average\ Accuracy = \frac{1}{l} \sum_{i=1}^{l} \frac{TP_i + TN_i}{TP_i + FN_i + FP_i + TN_i}   [2]

where TP is the true positive, TN is the true negative, FP is the false positive and FN is the false negative. The subscript i indicates the category, and l refers to the total number of categories.

> table <- CrossTable(x = test.label, y = pred, prop.chisq=TRUE)
> tp1 <- table$t[1,1]
> tp2 <- table$t[2,2]
> tp3 <- table$t[3,3]
> tn1 <- table$t[2,2]+table$t[2,3]+table$t[3,2]+table$t[3,3]
> tn2 <- table$t[1,1]+table$t[1,3]+table$t[3,1]+table$t[3,3]
> tn3 <- table$t[1,1]+table$t[1,2]+table$t[2,1]+table$t[2,2]
> fn1 <- table$t[1,2]+table$t[1,3]
> fn2 <- table$t[2,1]+table$t[2,3]
> fn3 <- table$t[3,1]+table$t[3,2]
> fp1 <- table$t[2,1]+table$t[3,1]
> fp2 <- table$t[1,2]+table$t[3,2]
> fp3 <- table$t[1,3]+table$t[2,3]
> accuracy <- (((tp1+tn1)/(tp1+fn1+fp1+tn1))+((tp2+tn2)/(tp2+fn2+fp2+tn2))+((tp3+tn3)/(tp3+fn3+fp3+tn3)))/3
> accuracy
[1] 0.9333333

The CrossTable() function returns the result of the cross tabulation of the predicted and observed classifications. The number in each cell can be used for the calculation of the four basic parameters: true positive (TP), true negative (TN), false negative (FN) and false positive (FP). The process is repeated for each category. Finally, the average accuracy is 0.93.
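The same average accuracy can be obtained with less typing by looping over the classes of the confusion matrix; a minimal sketch, assuming tab holds the 3×3 cross-tabulation (a separate name is used here to avoid masking the table() function):

> tab <- table(test.label, pred)
> acc <- sapply(1:3, function(i) {
   tp <- tab[i,i]           # true positives of class i
   fn <- sum(tab[i,-i])     # false negatives of class i
   fp <- sum(tab[-i,i])     # false positives of class i
   tn <- sum(tab[-i,-i])    # true negatives of class i
   (tp + tn)/(tp + fn + fp + tn)
 })
> mean(acc)                 # average accuracy across the three classes, as in equation [2]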
Sensitivity and specificity

Sensitivity is a measure of the proportion of positive observations that are correctly identified as positive. Specificity is a measure of the proportion of negative observations that are correctly identified as negative. They are commonly used to measure the diagnostic performance of a test (9). In the evaluation of a prediction model, they can be used to reflect the performance of the model. Imagine a perfectly fitted model that can predict outcomes with 100% accuracy: both sensitivity and specificity are 100%. In a multiclass situation, as in our example, sensitivity and specificity are calculated separately for each class. The equations are as follows:


Sen_i = \frac{TP_i}{TP_i + FN_i}   [3]

Sp_i = \frac{TN_i}{TN_i + FP_i}   [4]

where TP is the true positive, TN is the true negative, FP is the false positive and FN is the false negative. The subscript i indicates the category.

> sen1 <- tp1/(tp1+fn1)
> sp1 <- tn1/(tn1+fp1)
> sen1
[1] 1
> sp1
[1] 0.9047619
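The same two formulas can be applied to all three classes at once; a minimal sketch, again assuming tab holds the 3×3 cross-tabulation of observed and predicted classes:

> tab <- table(test.label, pred)
> diag(tab) / rowSums(tab)                                   # sensitivity of classes 1, 2 and 3, equation [3]
> sapply(1:3, function(i) sum(tab[-i,-i]) / sum(tab[-i,]))   # specificity of classes 1, 2 and 3, equation [4]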
Multiclass area under the curve (AUC)

A receiver operating characteristic (ROC) curve measures the performance of a classifier in correctly identifying positives and negatives. The AUC ranges between 0.5 and 1; an AUC of 0.5 indicates a random classifier that has no value. The multiclass AUC is well described by Hand and coworkers (10). The multiclass.roc() function in the pROC package is able to do the task.

> install.packages("pROC")
> library(pROC)
> multiclass.roc(response=test.label, predictor=as.ordered(pred))

Call:
multiclass.roc.default(response = test.label, predictor = as.ordered(pred))

Data: as.ordered(pred) with 3 levels of test.label: 1, 2, 3.
Multi-class area under the curve: 0.9212

As you can see from the output of the command, the multi-class AUC is 0.9212.

Kappa statistic

The kappa statistic is a measurement of the agreement for categorical items (11). Its typical use is in the assessment of inter-rater agreement. Here kappa can be used to assess the performance of the kNN algorithm. Kappa can be formally expressed by the following equation:

\kappa = \frac{P(A) - P(E)}{1 - P(E)}   [5]

where P(A) is the relative observed agreement among raters, and P(E) is the proportion of agreement expected between the classifier and the ground truth by chance. In our example, the tabulation of predicted and observed classes is as follows:

> table <- table(test.label, pred)
> table
          pred
test.label  1  2  3
         1 29  0  0
         2  2  6  2
         3  0  1 10

The relative observed agreement can be calculated as

P(A) = (29 + 6 + 10) / 50 = 0.9   [6]

The kNN algorithm predicts classes 1, 2 and 3 for 31, 7 and 12 of the test observations, so the probabilities that kNN says 1, 2 and 3 are 0.62, 0.14 and 0.24, respectively. Similarly, the probabilities that 1, 2 and 3 are observed are 0.58, 0.20 and 0.22, respectively. Then, the probabilities that both the classifier and the ground truth say 1, 2 and 3 by chance are 0.62×0.58=0.3596, 0.14×0.2=0.028 and 0.24×0.22=0.0528. The overall probability of random agreement is:

P(E) = 0.3596 + 0.028 + 0.0528 = 0.4404   [7]

and the kappa statistic is:

\kappa = \frac{P(A) - P(E)}{1 - P(E)} = \frac{0.9 - 0.4404}{1 - 0.4404} \approx 0.82   [8]
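The same arithmetic can be carried out programmatically from the confusion matrix; a minimal sketch reusing the table object stored above:

> p_a <- sum(diag(table)) / sum(table)                          # observed agreement, equation [6]
> p_e <- sum(rowSums(table) * colSums(table)) / sum(table)^2    # chance agreement, equation [7]
> (p_a - p_e) / (1 - p_e)                                       # kappa, approximately 0.82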
Fortunately, the calculation can be performed by the cohen.kappa() function in the psych package. I present the calculation process here for readers to better understand the concept of kappa.

> install.packages("psych")


> library(psych)
> cohen.kappa(x=cbind(test.label,pred))
Call: cohen.kappa1(x = x, w = w, n.obs = n.obs, alpha = alpha)

Cohen Kappa and Weighted Kappa correlation coefficients and confidence boundaries
                 lower estimate upper
unweighted kappa  0.68     0.82  0.96
weighted kappa    0.87     0.93  0.99

Number of subjects = 50

Tuning k for kNN

The parameter k is important in the kNN algorithm. In this last section I tune the k value and examine the change in the diagnostic accuracy of the kNN model. A custom-made R function is helpful for simplifying the calculation process. Here I write a function named "accuracyCal" to calculate a series of average accuracies. There is only one argument for the function: the maximum value of k you would like to examine. A for loop within the function calculates the accuracy repeatedly for k from 1 to N. When you run the function, the results may not be exactly the same each time. That is because the knn() function breaks ties at random. To explain: if we have four nearest neighbors and two are classified as A and two are classified as B, then either A or B is randomly chosen as the predicted result.

> accuracyCal <- function(N) {
   accuracy <- 1
   for (x in 1:N) {
     pred <- knn(train=train, test=test, cl=train.label, k=x)
     table <- table(test.label, pred)
     tp1 <- table[1,1]
     tp2 <- table[2,2]
     tp3 <- table[3,3]
     tn1 <- table[2,2]+table[2,3]+table[3,2]+table[3,3]
     tn2 <- table[1,1]+table[1,3]+table[3,1]+table[3,3]
     tn3 <- table[1,1]+table[1,2]+table[2,1]+table[2,2]
     fn1 <- table[1,2]+table[1,3]
     fn2 <- table[2,1]+table[2,3]
     fn3 <- table[3,1]+table[3,2]
     fp1 <- table[2,1]+table[3,1]
     fp2 <- table[1,2]+table[3,2]
     fp3 <- table[1,3]+table[2,3]
     accuracy <- c(accuracy, (((tp1+tn1)/(tp1+fn1+fp1+tn1))+((tp2+tn2)/(tp2+fn2+fp2+tn2))+((tp3+tn3)/(tp3+fn3+fp3+tn3)))/3)
   }
   return(accuracy[-1])
 }
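Once accuracyCal() has been defined, a minimal usage sketch shows how to pick the k with the highest average accuracy; because ties are broken at random, the winning k can differ slightly between runs.

> acc <- accuracyCal(150)   # average accuracy for k = 1 to 150
> which.max(acc)            # the k value with the highest average accuracy
> max(acc)                  # the average accuracy achieved at that k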
The following code creates a visual display of the results. An inset plot is created to better visualize how the accuracy changes within the k range between 5 and 20. The subplot() function contained in the TeachingDemos package is helpful for drawing such an inset. It is worthwhile to adjust the graph parameters to give the figure a better appearance (Figure 3). The figure shows that the average accuracy is highest at k=15. At a large k value (150, for example), all observations in the training dataset are included, and all observations in the test dataset are assigned to the class with the largest number of subjects in the training dataset. This is of course not the result we want.

Figure 3 Graphical presentation of average accuracy with different k values. The inset zooms in at the k range between 0 and 30.

> install.packages("TeachingDemos")


> library(TeachingDemos)
> qplot(seq(1:150), accuracyCal(150), xlab="k values", ylab="Average accuracy", geom = c("point", "smooth"))
> subplot(
   plot(seq(1:30), accuracyCal(30), col=2, xlab='', ylab='', cex.axis=0.8),
   x=grconvertX(c(0,0.75), from='npc'),
   y=grconvertY(c(0,0.45), from='npc'),
   type='fig', pars=list( mar=c(0,0,1.5,1.5)+0.1) )

Summary

The article introduces some basic ideas underlying the kNN algorithm. The dataset should be prepared before running the knn() function in R. After prediction of the outcome with the kNN algorithm, the diagnostic performance of the model should be checked. Average accuracy is the most widely used statistic to reflect the performance of the kNN algorithm. Factors such as the k value, the distance calculation and the choice of appropriate predictors all have a significant impact on the model performance.

Acknowledgements

None.

Footnote

Conflicts of Interest: The author has no conflicts of interest to declare.

References

1. Short RD, Fukunaga K. The optimal distance measure for nearest neighbor classification. IEEE Transactions on Information Theory 1981;27:622-7.
2. Weinberger KQ, Saul LK. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research 2009;10:207-44.
3. Cost S, Salzberg S. A weighted nearest neighbor algorithm for learning with symbolic features. Machine Learning 1993;10:57-78.
4. Breiman L. Random forests. Machine Learning 2001;45:5-32.
5. Zhang Z. Too much covariates in a multivariable model may cause the problem of overfitting. J Thorac Dis 2014;6:E196-7.
6. Lantz B. Machine learning with R. 2nd ed. Birmingham: Packt Publishing; 2015:1.
7. Venables WN, Ripley BD. Modern applied statistics with S-PLUS. 3rd ed. New York: Springer; 2001.
8. Hernandez-Torruco J, Canul-Reich J, Frausto-Solis J, et al. Towards a predictive model for Guillain-Barré syndrome. Conf Proc IEEE Eng Med Biol Soc 2015;2015:7234-7.
9. Linden A. Measuring diagnostic and predictive accuracy in disease management: an introduction to receiver operating characteristic (ROC) analysis. J Eval Clin Pract 2006;12:132-9.
10. Hand DJ, Till RJ. A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning 2001;45:171-86.
11. Thompson JR. Estimating equations for kappa statistics. Stat Med 2001;20:2895-906.

Cite this article as: Zhang Z. Introduction to machine learning: k-nearest neighbors. Ann Transl Med 2016;4(11):218. doi: 10.21037/atm.2016.03.37

