INSY446 - 5 - Classification Part 2
k-Nearest Neighbor Algorithm
Scenario
§ We are interested in classifying the type of
drug a patient should be prescribed
§ The training set consists of patients with Na/K
ratio, age, and drug attributes
§ Our task is to classify the type of drug a new
patient should be prescribed
k-Nearest Neighbor Algorithm
§ This scatter plot of Na/K against Age shows
the records in the training set that patients 1, 2,
and 3 are most similar to
§ A “drug” overlay is shown where Light points
= drug X, Medium points = drug Y, and Dark
points = drug Z
[Scatter plot of Na/K ratio vs. Age, with Patients 1, 2, and 3 marked]
k-Nearest Neighbor Algorithm
§ Which drug should Patient 1, who is 40 years old and has a Na/K ratio of 29, be prescribed?
§ Since Patient 1’s profile places them in the
scatter plot near patients prescribed drug X,
we classify Patient 1 as drug X
§ All points near Patient 1 are prescribed drug X,
making this a straightforward classification
k-Nearest Neighbor Algorithm
§ How about Patient 2?
§ We classify a new patient who is 17 years old with a Na/K ratio of 12.5. A close-up shows the neighborhood of training points in close proximity to Patient 2
k-Nearest Neighbor Algorithm
§ Suppose we let k = 1 for our k-Nearest
Neighbor algorithm
§ This means we classify Patient 2 according to whichever single point in the training set it is closest to
§ In this case, Patient 2 is closest to the Dark point, and therefore we classify them as drug Z
k-Nearest Neighbor Algorithm
§ Suppose we let k = 2 and reclassify Patient 2
using k-Nearest Neighbor
§ Now, Patient 2 is closest to a Dark point and a Medium point
§ How does the algorithm decide which drug to
prescribe?
§ A simple voting scheme does not help here: the two nearest points have different target values, so the vote is a tie
k-Nearest Neighbor Algorithm
§ However, with k = 3, voting determines that two of the three closest points to Patient 2 are Medium
§ Therefore, Patient 2 is classified as drug Y
§ Note that the classification of Patient 2 differed
based on the value chosen for k
k-Nearest Neighbor Algorithm
§ How about Patient 3?
§ Patient 3 is 47 years old and has a Na/K ratio of 13.5. A close-up shows Patient 3 in the center, with the three closest training data points
k-Nearest Neighbor Algorithm
§ With k = 1, Patient 3 is closest to the Dark
point, based on a distance measure
§ Therefore, Patient 3 is classified as drug Z
§ Using k = 2 or k = 3, voting does not help, since each of the three nearest training points has a different target value
Example 1
K-NN Basics
import numpy
import pandas
from sklearn.neighbors import KNeighborsClassifier

# Three labeled records; predictors are Min-Max normalized (MMN)
data = numpy.array([['Dark', 0.0467, 0.2471],
                    ['Medium', 0.0533, 0.1912],
                    ['Medium', 0.0917, 0.2794]])
column_names = ['Class', 'Age (MMN)', 'Na/K (MMN)']
row_names = ['A', 'B', 'C']
df = pandas.DataFrame(data, columns=column_names, index=row_names)
X = df.iloc[:,1:3]
y = df['Class']

# 1-NN: a new observation takes the class of its single closest neighbor
knn = KNeighborsClassifier(n_neighbors=1)
model1 = knn.fit(X, y)
new_obs = [[0.05, 0.25]]
model1.predict(new_obs)
k-Nearest Neighbor Algorithm
§ Considerations when using k-Nearest
Neighbor
– How many neighbors should be used? k = ?
– How is the distance between points measured?
– How is the information from two or more neighbors
combined when making a classification decision?
– Should all points be weighted equally, or should
some points have more influence?
Distance Function
§ How is similarity defined between an unclassified record
and its neighbors?
§ A distance metric is a real-valued function d used to measure the distance between two points, with the following properties for all points x, y, and z:
1. d(x, y) ≥ 0, and d(x, y) = 0 if and only if x = y
2. d(x, y) = d(y, x)
3. d(x, z) ≤ d(x, y) + d(y, z)
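A minimal Python sketch of the Euclidean distance, the metric most commonly used with k-NN (the helper name is illustrative):

import numpy

# Euclidean distance: square root of the sum of squared coordinate differences
def euclidean(x, y):
    x, y = numpy.asarray(x, dtype=float), numpy.asarray(y, dtype=float)
    return numpy.sqrt(numpy.sum((x - y) ** 2))

a, b = [0.05, 0.25], [0.0467, 0.2471]
print(euclidean(a, b))                      # property 1: non-negative
print(euclidean(a, b) == euclidean(b, a))   # property 2: symmetric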
Distance Function
§ When measuring distance, one or more attributes can
have very large values, relative to the other attributes
§ For example, income may range from 30,000 to 100,000, whereas years_of_service takes on values 0 to 10
§ In this case, the values of income will overwhelm the
contribution of years_of_service
§ To avoid this situation we standardize the data
– Continuous data values should be standardized using Min-Max
Normalization or Z-Score Standardization
Min-Max Normalization = (X − min(X)) / (max(X) − min(X))
Z-Score Standardization = (X − mean(X)) / standard deviation(X)
§ Example:
– Which patient is more similar to a 50-year-old male: a 20-year-
old male or a 50-year-old female?
Distance Function
§ Let Patient A = 50-year-old male, Patient B =
20-year-old male, and Patient C = 50-year-old
female
§ Suppose that the Age variable has a range =
50, minimum = 10, mean = 45, and standard
deviation = 15
§ The table contains original, Min-Max
Normalized, and Z-Score Standardized values
for Age
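– Computed from the statistics above, the values work out to:
– Patient A: Age = 50, Min-Max = (50 − 10)/50 = 0.80, Z-score = (50 − 45)/15 ≈ 0.33
– Patient B: Age = 20, Min-Max = (20 − 10)/50 = 0.20, Z-score = (20 − 45)/15 ≈ −1.67
– Patient C: Age = 50, Min-Max = (50 − 10)/50 = 0.80, Z-score = (50 − 45)/15 ≈ 0.33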
Distance Function
§ Assume we do not standardize Age, and calculate the distance between Patient A and Patient B, and between Patient A and Patient C
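Assuming gender is coded as a 0/1 indicator (an illustrative assumption) and using Euclidean distance on the raw values:
d(A, B) = √((50 − 20)² + 0²) = 30
d(A, C) = √((50 − 50)² + 1²) = 1
§ Without standardization, Age dominates, and the 50-year-old female appears far more similar to Patient A than the 20-year-old male does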
Distance Function
§ Age is now normalized using Min-Max Normalization
– Again, we calculate the distance between Patient A and Patient B, and between Patient A and Patient C
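With the Min-Max normalized ages (and the same 0/1 gender coding):
d(A, B) = √((0.80 − 0.20)² + 0²) = 0.60
d(A, C) = √((0.80 − 0.80)² + 1²) = 1
§ After normalization, the 20-year-old male is the more similar patient, since Age no longer overwhelms the gender attribute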
Alternative Distance Function
Example 2
Data Standardization and Distance Function
import pandas
mower_df = pandas.read_csv("RidingMowers.csv")
X = mower_df.iloc[:,0:2]
y = mower_df['Ownership']
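A sketch of how the standardization and distance steps might continue from the snippet above (the scaler names are scikit-learn's; the record indices are illustrative):

from sklearn.preprocessing import MinMaxScaler, StandardScaler
import numpy

# Min-Max normalization rescales each predictor to [0, 1]
X_mmn = MinMaxScaler().fit_transform(X)

# Z-score standardization gives each predictor mean 0 and SD 1
X_z = StandardScaler().fit_transform(X)

# Euclidean distance between the first two records, raw vs. standardized
print(numpy.sqrt(numpy.sum((X.values[0] - X.values[1]) ** 2)))
print(numpy.sqrt(numpy.sum((X_z[0] - X_z[1]) ** 2)))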
Example 3
Alternative Standardization Approach
import pandas
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

mower_df = pandas.read_csv("RidingMowers.csv")
X = mower_df.iloc[:,0:2]
y = mower_df['Ownership']

# The slide shows only the final two lines; StandardScaler and k = 3 are assumptions
standardizer = StandardScaler().fit(X)
model3 = KNeighborsClassifier(n_neighbors=3).fit(standardizer.transform(X), y)

new_obs = [[60, 20]]   # hypothetical (Income, Lot_Size) values
new_obs_std = standardizer.transform(new_obs)
model3.predict(new_obs_std)
Combination Function
§ The distance function determines the similarity of a new, unclassified record to those in the training set
§ How should the k most similar records be combined to provide a classification?
Combination Function
§ Simple Unweighted Voting
– This is the simplest combination function
– Decide on the value for k to determine the number
of similar records that “vote”
– Compare each unclassified record to its k nearest (most similar) neighbors according to the Euclidean distance function
– Each of the k similar records casts one vote, and the majority class wins (see the sketch below)
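A minimal sketch of the voting mechanics (the labels are illustrative):

from collections import Counter

# Labels of the k = 3 nearest training records
neighbor_labels = ['Y', 'Z', 'Y']

# Simple unweighted voting: the majority class wins
winner, votes = Counter(neighbor_labels).most_common(1)[0]
confidence = votes / len(neighbor_labels)   # 2/3 here
print(winner, confidence)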
Combination Function
§ Recall that we classified a new patient, 17 years old with a Na/K ratio of 12.5, using k = 3
§ Simple unweighted voting determined that two of the three closest points to Patient 2 are Medium
§ Therefore, Patient 2 is classified as drug Y with a confidence of 2/3 = 66.67%
§ We also classified a new patient who is 47 years old and has a Na/K ratio of 13.5, using k = 3
§ However, simple unweighted voting did not
help and resulted in a tie
§ Perhaps weighted voting should be
considered?
Combination Function
§ The analyst may choose to apply weighted voting, where
closer neighbors have a larger voice in the classification
decision than do more distant neighbors
§ In weighted voting, the influence of a particular record is
inversely proportional to the distance of the record from
the new record to be classified
§ For example, the distances between Patient 2 and records A, B, and C are as follows (both predictors are standardized): d(new, A) = 0.004393, d(new, B) = 0.058893, and d(new, C) = 0.051022, where record A is drug Z and records B and C are drug Y
Combination Function
§ The votes of these records are then weighted according to the inverse of their distances:
votes(Drug Z) = 1 / d(new, A) = 1 / 0.004393 ≈ 227.6255
votes(Drug Y) = 1 / d(new, B) + 1 / d(new, C) = 1 / 0.058893 + 1 / 0.051022 ≈ 36.5795
§ Since Drug Z receives far more weighted votes, Patient 2 is classified as drug Z, reversing the unweighted k = 3 result
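In scikit-learn, this weighting corresponds to the weights='distance' option of KNeighborsClassifier; a minimal sketch using the records from Example 1:

from sklearn.neighbors import KNeighborsClassifier

# Standardized (Age, Na/K) predictors and drug labels for records A, B, C
X_train = [[0.0467, 0.2471], [0.0533, 0.1912], [0.0917, 0.2794]]
y_train = ['Z', 'Y', 'Y']

# weights='distance' weights each neighbor's vote by 1/distance,
# so very close neighbors dominate the decision
knn_weighted = KNeighborsClassifier(n_neighbors=3, weights='distance')
knn_weighted.fit(X_train, y_train)
print(knn_weighted.predict([[0.05, 0.25]]))   # 'Z', matching the weighted vote above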
Standardizing Train/Test dataset
§ When we standardize the training and test datasets, the common approach is to combine them, standardize the combined data, and then separate them again
§ Alternatively, we can standardize the training and test datasets separately, but this has to be done carefully if the test dataset is small
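A sketch of the combine-then-split approach described above, using the riding-mower data from the earlier examples:

import pandas
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

mower_df = pandas.read_csv("RidingMowers.csv")
X = mower_df.iloc[:,0:2]
y = mower_df['Ownership']

# Standardize the combined data first, then separate into train and test
X_std = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_std, y, test_size=0.3, random_state=1)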
Choosing k
§ What value of k is optimal?
§ There is not necessarily an obvious solution
§ Smaller k
– Choosing a small value for k may lead the algorithm to
overfit the data
– Noise or outliers may unduly affect classification
§ Larger k
– Larger values of k will tend to smooth out idiosyncratic or obscure data values in the training set
– If k becomes too large, locally interesting patterns will be overlooked
Choosing k
§ Choosing the appropriate value for k requires
balancing these considerations
§ A general rule of thumb is to pick k = √n, where n is the number of training records
§ Using cross-validation may help determine the
value for k, by choosing a value that minimizes
the classification error
Example 4
Standardization Examples
import pandas
mower_df = pandas.read_csv("RidingMowers.csv")
X = mower_df.iloc[:,0:2]
y = mower_df['Ownership']
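A sketch of the standardization examples, continuing from the snippet above; the pandas lines show the formulas applied directly:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# scikit-learn scalers
X_mmn = MinMaxScaler().fit_transform(X)
X_z = StandardScaler().fit_transform(X)

# Equivalent formulas applied directly with pandas
X_mmn_manual = (X - X.min()) / (X.max() - X.min())
X_z_manual = (X - X.mean()) / X.std(ddof=0)   # StandardScaler uses the population SD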
Example 5
Choosing optimal k
import pandas
mower_df = pandas.read_csv("RidingMowers.csv")
X = mower_df.iloc[:,0:2]
y = mower_df['Ownership']
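A sketch of selecting k with cross-validation, continuing from the snippet above (the k range and fold count are illustrative):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(X)

# Keep the k with the best cross-validated accuracy
scores = {}
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X_std, y, cv=5).mean()
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])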
Example 6
Use UniversalBank.csv
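A sketch of one way this example might proceed; the column names ('ID', 'ZIP Code', 'Personal Loan') are assumptions about the file's schema:

import pandas
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

bank_df = pandas.read_csv("UniversalBank.csv")

# Assumed schema: drop identifier columns, predict 'Personal Loan'
y = bank_df['Personal Loan']
X = bank_df.drop(columns=['ID', 'ZIP Code', 'Personal Loan'])

# Standardize, split, fit, and evaluate
X_std = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_std, y, test_size=0.3, random_state=1)
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(knn.score(X_test, y_test))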
Exercise #1
Use ClassifyRisk dataset
Exercise #2
§ Use the same dataset and model in #1
§ Split the data into a test (30%) and training
(70%) dataset
§ Using k-NN, find the optimal value of k for this dataset (i.e., the value of k that gives you the most accurate results)