Introduction to machine learning: k-nearest neighbors

Zhongheng Zhang
Department of Critical Care Medicine, Jinhua Municipal Central Hospital, Jinhua Hospital of Zhejiang University, Jinhua 321000, China
Correspondence to: Zhongheng Zhang, MMed. 351#, Mingyue Road, Jinhua 321000, China. Email: [email protected].
Author’s introduction: Zhongheng Zhang, MMed. Department of Critical Care Medicine, Jinhua Municipal Central Hospital, Jinhua Hospital of Zhejiang University. Dr. Zhongheng Zhang is a fellow physician at Jinhua Municipal Central Hospital. He graduated from the School of Medicine, Zhejiang University in 2009 with a Master's degree. He has published more than 35 academic papers (science citation indexed) that have been cited more than 200 times. He has served as a reviewer for 10 journals, including Journal of Cardiovascular Medicine, Hemodialysis International, Journal of Translational Medicine, Critical Care, International Journal of Clinical Practice, and Journal of Critical Care. His major research interests include hemodynamic monitoring in sepsis and septic shock, delirium, and outcome studies for critically ill patients. He is experienced in data management and statistical analysis using R and STATA, big data exploration, and systematic review and meta-analysis.
Abstract: Machine learning techniques have been widely used in many scientific fields, but their use in the medical literature is limited, partly because of technical difficulties. k-nearest neighbors (kNN) is a simple method of machine learning. This article introduces some basic ideas underlying the kNN algorithm and then focuses on how to perform kNN modeling with R. The dataset should be prepared before running the knn() function in R. After prediction of the outcome with the kNN algorithm, the diagnostic performance of the model should be checked. Average accuracy is the most widely used statistic to reflect the performance of the kNN algorithm. Factors such as the k value, the distance calculation and the choice of appropriate predictors all have a significant impact on model performance.
Keywords: Machine learning; R; k-nearest neighbors (kNN); class; average accuracy; kappa
Submitted Jan 25, 2016. Accepted for publication Feb 18, 2016.
doi: 10.21037/atm.2016.03.37
View this article at: http://dx.doi.org/10.21037/atm.2016.03.37
and grain can be distinguished by their crunchiness and sweetness (Figure 1). For the purpose of displaying them on a two-dimensional plot, only two characteristics are employed. In reality, there can be any number of predictors, and the example can be extended to incorporate any number of characteristics. In general, fruits are sweeter than vegetables. Grains are neither crunchy nor sweet. Our task is to determine which category the sweet potato belongs to. In this example we choose the four nearest kinds of food: apple, green bean, lettuce, and corn. Because

Figure 1 Illustration of how the k-nearest neighbors algorithm works. [Axes: sweetness and crunchiness; groups shown: fruit, vegetable, grain.]
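The idea can be made concrete with a few lines of R. The sweetness and crunchiness scores below are invented for illustration (the figure gives no numeric values); the sketch finds the k=4 closest foods by Euclidean distance and classifies the new item by majority vote:

> foods<-data.frame(
    sweetness=c(9,8,2,3,1,6),
    crunchiness=c(7,5,8,9,2,4),
    class=c("fruit","fruit","vegetable","vegetable","grain","grain"),
    row.names=c("apple","pear","green bean","lettuce","rice","corn"))
> new_item<-c(sweetness=6, crunchiness=6)   # the sweet potato; scores assumed
> d<-sqrt((foods$sweetness-new_item["sweetness"])^2 +
          (foods$crunchiness-new_item["crunchiness"])^2)   # Euclidean distances
> nearest<-foods$class[order(d)[1:4]]       # classes of the 4 nearest neighbors
> names(which.max(table(nearest)))          # majority vote gives the prediction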
Figure 2 Visual presentation of the simulated working example. Classes 1, 2 and 3 are denoted by red, green and blue, respectively. Dots represent test data and triangles represent training data. [Axes: predictors x1 and x2, each on a 0-100 scale.]
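The code that generates the simulated data frame df1 is not shown in this excerpt. As a minimal sketch, a data frame of comparable shape (200 rows; predictors x1 and x2 on a 0-100 scale; a three-level class label y) could be produced as follows; the cluster centers and spreads are assumptions for illustration, not the article's values:

> set.seed(1)
> y<-rep(1:3, length.out=200)                 # three classes, interleaved
> centers<-rbind(c(25,30), c(50,75), c(80,40))   # assumed cluster centers
> x1<-rnorm(200, mean=centers[y,1], sd=10)
> x2<-rnorm(200, mean=centers[y,2], sd=10)
> df1<-data.frame(x1=x1, x2=x2, y=factor(y))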
needs to install and load the class package to the working space.

> install.packages("class")
> library(class)

Then we divide the original dataset into the training and test datasets. Note that the training and test data frames contain only the predictor variables; the response variable is stored in separate vectors.

> train<-df1[1:150,1:2]
> train.label<-df1[1:150,3]
> test<-df1[151:200,1:2]
> test.label<-df1[151:200,3]

Up to now, the datasets are well prepared for kNN model building. Because kNN is a non-parametric algorithm, we will not obtain parameters for the model. The knn() function returns a factor vector containing the predicted classifications of the test set. In the following code, I arbitrarily choose a k value of 6. The results are stored in the vector pred.

> pred<-knn(train=train,test=test,cl=train.label,k=6)
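Because knn() computes Euclidean distances on the raw predictor columns, variables measured on very different scales can dominate the distance calculation. The lines below are not part of the article's example; they are a minimal sketch of standardizing the predictors with the training set's center and spread before calling knn() (here x1 and x2 are already on a common 0-100 scale, so this step is optional):

> train_s<-scale(train)
> test_s<-scale(test,
               center=attr(train_s,"scaled:center"),
               scale=attr(train_s,"scaled:scale"))   # reuse training statistics
> pred_s<-knn(train=train_s, test=test_s, cl=train.label, k=6)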
separately for each class. The equations are as follows:

Sen_i = TP_i / (TP_i + FN_i)  [3]

Sp_i = TN_i / (TN_i + FP_i)  [4]

where TP is the true positive, TN is the true negative, FP is the false positive and FN is the false negative. The subscript i indicates the category.

> sen1<-tp1/(tp1+fn1)
> sp1<-tn1/(tn1+fp1)
> sen1
[1] 1
> sp1
[1] 0.9047619

Multiclass area under the curve (AUC)

A receiver operating characteristic (ROC) curve measures the performance of a classifier in correctly identifying positives and negatives. The AUC ranges between 0.5 and 1. An AUC of 0.5 indicates a random classifier that has no value. Multiclass AUC is well described by Hand and coworkers (10). The multiclass.roc() function in the pROC package is able to do the task.

> install.packages("pROC")
> library(pROC)
> multiclass.roc(response=test.label, predictor=as.ordered(pred))

Call:
multiclass.roc.default(response = test.label, predictor = as.ordered(pred))

Data: as.ordered(pred) with 3 levels of test.label: 1, 2, 3.
Multi-class area under the curve: 0.9212

As you can see from the output of the command, the multi-class AUC is 0.9212.

Kappa statistic

The kappa statistic is a measurement of agreement for categorical items (11). Its typical use is in the assessment of inter-rater agreement. Here kappa can be used to assess the performance of the kNN algorithm. Kappa can be formally expressed by the following equation:

kappa = [P(A) − P(E)] / [1 − P(E)]  [5]

where P(A) is the relative observed agreement among raters, and P(E) is the proportion of agreement expected between the classifier and the ground truth by chance. In our example the tabulation of predicted and observed classes is as follows:

> table<-table(test.label,pred)
> table
          pred
test.label  1  2  3
         1 29  0  0
         2  2  6  2
         3  0  1 10

The relative observed agreement can be calculated as

P(A) = (29 + 6 + 10) / 50 = 0.9  [6]

The kNN algorithm predicts 1, 2 and 3 a total of 31, 7 and 12 times, respectively. Thus, the probabilities that kNN predicts 1, 2 and 3 are 0.62, 0.14 and 0.24, respectively. Similarly, the probabilities that 1, 2 and 3 are observed are 0.58, 0.2 and 0.22, respectively. Then, the probabilities that the classifier and the ground truth both say 1, 2 and 3 by chance are 0.62×0.58=0.3596, 0.14×0.2=0.028 and 0.24×0.22=0.0528, respectively. The overall probability of random agreement is:

P(E) = 0.3596 + 0.028 + 0.0528 = 0.4404  [7]

and the kappa statistic is:

kappa = [P(A) − P(E)] / [1 − P(E)] = (0.9 − 0.4404) / (1 − 0.4404) ≈ 0.82  [8]

Fortunately, the calculation can be performed by the cohen.kappa() function in the psych package. I present the calculation process here for readers to better understand the concept of kappa.

> install.packages("psych")
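The excerpt cuts off the psych example at the page break. For completeness, the hand calculation above can be verified directly from the confusion matrix, and cohen.kappa() accepts a square matrix of counts; the lines below are a sketch along those lines rather than the article's own listing:

> p_a<-sum(diag(table))/sum(table)                      # observed agreement, 0.9
> p_e<-sum(rowSums(table)*colSums(table))/sum(table)^2  # chance agreement, 0.4404
> (p_a-p_e)/(1-p_e)                                     # kappa, approximately 0.82
> library(psych)
> cohen.kappa(as.matrix(table))                         # unweighted kappa, same value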
Average accuracy

[Figure: average accuracy of the kNN model plotted against k values from 1 to 150.]

> accuracyCal<-function(N) {
    accuracy<-1
    for (x in 1:N) {
      pred<-knn(train=train,test=test,cl=train.label,k=x)
      table<-table(test.label,pred)
      tp1<-table[1,1]
      tp2<-table[2,2]
      tp3<-table[3,3]
      tn1<-table[2,2]+table[2,3]+table[3,2]+table[3,3]
      tn2<-table[1,1]+table[1,3]+table[3,1]+table[3,3]
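The listing is cut off at the page break. A self-contained sketch of how the function might be completed is given below, assuming "average accuracy" means the per-class accuracies (TP_i + TN_i)/n averaged over the three classes; this completion is a reconstruction for illustration, not the article's own code:

accuracyCal<-function(N) {
  accuracy<-numeric(N)
  for (x in 1:N) {
    pred<-knn(train=train, test=test, cl=train.label, k=x)
    tab<-table(test.label, pred)
    n<-sum(tab)                       # total number of test cases (50 here)
    # per-class accuracy (TP_i + TN_i)/n, then averaged over the 3 classes
    acc<-sapply(1:3, function(i) (tab[i,i] + sum(tab[-i,-i]))/n)
    accuracy[x]<-mean(acc)
  }
  accuracy
}

# average accuracy against k, as in the figure above
plot(1:150, accuracyCal(150), type="l",
     xlab="k values", ylab="Average accuracy")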