
Data Imputation With KNN

The document discusses using K-nearest neighbors (KNN) imputation to fill in missing data values. It explains that KNN imputation works by calculating the Euclidean distance between points to identify the K nearest neighbors, then imputing the missing value as the mean of the known values for those neighbors. The document provides an example using a dataset with missing values, calculating distances to identify the 2 nearest neighbors for each missing value, then imputing the mean of those neighbors.

Uploaded by Hsu Let Yee Hnin
Copyright
© All Rights Reserved

Data Imputation with KNN

• K-nearest neighbors (KNN) imputation assigns a missing value based on how similar the case is to the other points in the training set.
• The missing value is imputed with the mean of its nearest neighbors' values.
• $E(a, b) = \sqrt{\sum_{i \in D} (x_{ai} - x_{bi})^2}$

• $E(a, b)$ is the distance between the two cases a and b.

• $x_{ai}$ and $x_{bi}$ are the values of attribute i in cases a and b, respectively.

• D is the set of attributes with non-missing values in both cases.
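The distance definition above can be sketched in a few lines of Python. This is only an illustration: the function name `shared_distance`, the use of `float("nan")` for missing values, and the choice to return infinity when two rows share no attributes are assumptions, not part of the original handout.

```python
import math

NAN = float("nan")

def shared_distance(a, b):
    """Euclidean distance over the attributes that are non-missing in both rows."""
    # D: attribute pairs where neither value is NaN
    pairs = [(x, y) for x, y in zip(a, b)
             if not math.isnan(x) and not math.isnan(y)]
    if not pairs:
        return math.inf  # assumption: nothing to compare means "infinitely far"
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs))

# Two rows with (TotalDayMinutes, TotalDayCalls, TotalDayCharge)
r1 = (100.0, 30.0, NAN)
r3 = (NAN, 56.0, 80.0)

# Only TotalDayCalls is present in both rows, so D has one attribute.
print(shared_distance(r3, r1))  # 26.0
```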

No  TotalDayMinutes  TotalDayCalls  TotalDayCharge
1   100.0            30.0           NaN
2   90.0             45.0           40.0
3   NaN              56.0           80.0
4   95.0             NaN            98.0

In this example calculation, k is set to 2.

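TotalDayMinutes

Row 3 is missing TotalDayMinutes, so its distance to each other row is computed over the attributes that both rows have:

$E(r_3, r_1) = \sqrt{(56 - 30)^2} = 26$

$E(r_3, r_2) = \sqrt{(56 - 45)^2 + (80 - 40)^2} \approx 41.48$

$E(r_3, r_4) = \sqrt{(80 - 98)^2} = 18$

Sort the distances in ascending order and select the two smallest, which belong to rows 4 and 1.
Their TotalDayMinutes values are 95 and 100.
The mean of these values, 97.5, is the imputed value.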

No  TotalDayMinutes  TotalDayCalls  TotalDayCharge
1   100.0            30.0           NaN
2   90.0             45.0           40.0
3   97.5             56.0           80.0
4   95.0             NaN            98.0
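The selection step for a single missing cell can be sketched as follows. The helper names `shared_distance` and `impute_cell` are hypothetical, missing values are assumed to be stored as `float("nan")`, and only rows that actually have a value in the target column are treated as candidate neighbors.

```python
import math

NAN = float("nan")

def shared_distance(a, b):
    """Euclidean distance over the attributes that are non-missing in both rows."""
    pairs = [(x, y) for x, y in zip(a, b)
             if not math.isnan(x) and not math.isnan(y)]
    if not pairs:
        return math.inf  # assumption: no shared attributes means "infinitely far"
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs))

def impute_cell(rows, row_idx, col_idx, k=2):
    """Mean of column col_idx over the k rows nearest to rows[row_idx]."""
    target = rows[row_idx]
    candidates = [
        (shared_distance(target, other), other[col_idx])
        for j, other in enumerate(rows)
        if j != row_idx and not math.isnan(other[col_idx])
    ]
    candidates.sort(key=lambda pair: pair[0])  # ascending distance
    nearest = [value for _, value in candidates[:k]]
    return sum(nearest) / len(nearest) if nearest else NAN

# Columns: TotalDayMinutes, TotalDayCalls, TotalDayCharge
rows = [
    [100.0, 30.0, NAN],
    [90.0, 45.0, 40.0],
    [NAN, 56.0, 80.0],
    [95.0, NAN, 98.0],
]

# Row 3 (index 2), TotalDayMinutes (index 0): neighbors are rows 4 and 1.
print(impute_cell(rows, 2, 0))  # 97.5
```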

TotalDayCalls

$E(r_4, r_1) = \sqrt{(95 - 100)^2} = 5$

$E(r_4, r_2) = \sqrt{(95 - 90)^2 + (98 - 40)^2} \approx 58.22$

$E(r_4, r_3) = \sqrt{(95 - 97.5)^2 + (98 - 80)^2} \approx 18.17$

The two nearest neighbors are rows 1 and 3, whose TotalDayCalls values are 30 and 56.

The imputed value is their mean, 43.
No  TotalDayMinutes  TotalDayCalls  TotalDayCharge
1   100.0            30.0           NaN
2   90.0             45.0           40.0
3   97.5             56.0           80.0
4   95.0             43.0           98.0

TotalDayCharge

$E(r_1, r_2) = \sqrt{(100 - 90)^2 + (30 - 45)^2} \approx 18.03$

$E(r_1, r_3) = \sqrt{(100 - 97.5)^2 + (30 - 56)^2} \approx 26.12$

$E(r_1, r_4) = \sqrt{(100 - 95)^2 + (30 - 43)^2} \approx 13.93$

The two nearest neighbors are rows 4 and 2, whose TotalDayCharge values are 98 and 40.
The imputed value (mean of neighbors) is 69.
No  TotalDayMinutes  TotalDayCalls  TotalDayCharge
1   100.0            30.0           69.0
2   90.0             45.0           40.0
3   97.5             56.0           80.0
4   95.0             43.0           98.0
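The whole walkthrough can be reproduced with a short Python sketch. It assumes the same order as the worked example: columns are filled left to right, and each imputed value is written back before the next column is processed, so later distances use earlier imputations. The function names are hypothetical, and library imputers may order and weight the computation differently.

```python
import math

NAN = float("nan")

def shared_distance(a, b):
    """Euclidean distance over the attributes that are non-missing in both rows."""
    pairs = [(x, y) for x, y in zip(a, b)
             if not math.isnan(x) and not math.isnan(y)]
    if not pairs:
        return math.inf  # assumption: no shared attributes means "infinitely far"
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs))

def impute_sequentially(rows, k=2):
    """Fill NaN cells column by column, reusing values imputed earlier."""
    filled = [list(row) for row in rows]  # work on a copy
    for col in range(len(filled[0])):
        for i, row in enumerate(filled):
            if not math.isnan(row[col]):
                continue
            # Candidate neighbors: other rows that have a value in this column
            candidates = [
                (shared_distance(row, other), other[col])
                for j, other in enumerate(filled)
                if j != i and not math.isnan(other[col])
            ]
            candidates.sort(key=lambda pair: pair[0])  # ascending distance
            nearest = [value for _, value in candidates[:k]]
            if nearest:
                row[col] = sum(nearest) / len(nearest)  # write back immediately
    return filled

# Columns: TotalDayMinutes, TotalDayCalls, TotalDayCharge
rows = [
    [100.0, 30.0, NAN],
    [90.0, 45.0, 40.0],
    [NAN, 56.0, 80.0],
    [95.0, NAN, 98.0],
]

for row in impute_sequentially(rows, k=2):
    print(row)
```

With k = 2 this yields 97.5, 43.0 and 69.0 for the three missing cells, matching the final table.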
