
INSY 446 – Winter 2023
Data Mining for Business Analytics

Session 5 – K-Nearest Neighbor

February 6, 2023
Dongliang Sheng
k-Nearest Neighbor Algorithm
§ The k-Nearest Neighbor algorithm is an example of instance-based learning, in which the training set records are first stored
§ Next, the classification of a new, unclassified record is performed by comparing it to the records in the training set it is most similar to
§ It is called a "lazy learner" algorithm because it builds no explicit model from the training data; all computation is deferred until a new record must be classified

2
k-Nearest Neighbor Algorithm
Scenario
§ We are interested in classifying the type of
drug a patient should be prescribed
§ The training set consists of patients with Na/K
ratio, age, and drug attributes
§ Our task is to classify the type of drug a new
patient should be prescribed

3
k-Nearest Neighbor Algorithm
§ This scatter plot of Na/K against Age shows
the records in the training set that patients 1, 2,
and 3 are most similar to
§ A “drug” overlay is shown where Light points
= drug X, Medium points = drug Y, and Dark
points = drug Z

4
[Figure: scatter plot of Na/K against Age, with Patients 1, 2, and 3 highlighted]
k-Nearest Neighbor Algorithm
§ Which drug should Patient 1, who is 40 years old and has a Na/K ratio of 29, be prescribed?
§ Since Patient 1's profile places them in the scatter plot near patients prescribed drug X, we classify Patient 1 as drug X
§ All points near Patient 1 are prescribed drug X, making this a straightforward classification

[Figure: close-up of the scatter plot around Patient 1]

5
k-Nearest Neighbor Algorithm
§ How about Patient 2?
§ We classify a new patient who is 17 years old with a Na/K ratio of 12.5. A close-up shows the neighborhood of training points in close proximity to Patient 2

[Figure: close-up of the scatter plot around Patient 2]

6
k-Nearest Neighbor Algorithm
§ Suppose we let k = 1 for our k-Nearest
Neighbor algorithm
§ This means we classify Patient 2 according to whichever single point in the training set it is closest to
§ In this case, Patient 2 is closest to a Dark point, and therefore we classify them as drug Z

7
k-Nearest Neighbor Algorithm
§ Suppose we let k = 2 and reclassify Patient 2
using k-Nearest Neighbor
§ Now, Patient 2 is closest to one Dark point and one Medium point
§ How does the algorithm decide which drug to prescribe?
§ A simple voting scheme does not help, since the two nearest neighbors split the vote one-to-one

8
k-Nearest Neighbor Algorithm
§ However, with k = 3, voting determines that two of the three closest points to Patient 2 are Medium
§ Therefore, Patient 2 is classified as drug Y
§ Note that the classification of Patient 2 differed
based on the value chosen for k

9
k-Nearest Neighbor Algorithm
§ How about Patient 3?
§ Patient 3 is 47 years old and has a Na/K ratio of 13.5. A close-up shows Patient 3 in the center, with the closest 3 training data points

[Figure: close-up of the scatter plot around Patient 3, showing its three nearest training points]

10
k-Nearest Neighbor Algorithm
§ With k = 1, Patient 3 is closest to the Dark
point, based on a distance measure
§ Therefore, Patient 3 is classified as drug Z
§ Using k = 2 or k = 3, voting does not help, since each of the three nearest training points has a different target value

11
Example 1
K-NN Basics

from sklearn.neighbors import KNeighborsClassifier
import numpy
import pandas

# Create a dataset of 3 patients and define row and column names
data = numpy.array([['Dark',0.0467,0.2471],['Medium',0.0533,0.1912],['Medium',0.0917,0.2794]])
column_names = ['Class', 'Age (MMN)', 'Na/K (MMN)']
row_names = ['A', 'B', 'C']
df = pandas.DataFrame(data, columns=column_names, index=row_names)

# The numpy array stores everything as strings, so convert the predictors to numeric
X = df.iloc[:,1:3].astype(float)
y = df['Class']

# Fit a 1-nearest-neighbor classifier
knn = KNeighborsClassifier(n_neighbors=1)
model1 = knn.fit(X, y)

new_obs = [[0.05, 0.25]]

model1.predict(new_obs)   # record A is the single nearest neighbor, so this predicts 'Dark'

12
k-Nearest Neighbor Algorithm
§ Considerations when using k-Nearest
Neighbor
– How many neighbors should be used? k = ?
– How is the distance between points measured?
– How is the information from two or more neighbors
combined when making a classification decision?
– Should all points be weighted equally, or should
some points have more influence?

13
Distance Function
§ How is similarity defined between an unclassified record
and its neighbors?
§ A distance metric is a real-valued function d used to measure the distance between records, with the following properties for any records x, y, and z:
1. d(x, y) ≥ 0, and d(x, y) = 0 if and only if x = y
2. d(x, y) = d(y, x)
3. d(x, z) ≤ d(x, y) + d(y, z)

§ Property 1: Distance is always non-negative


§ Property 2: Commutative, distance from “A to B” is
distance from “B to A”
§ Property 3: Triangle inequality holds, distance from “A
to C” must be less than or equal to distance from “A to
B to C”
14
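As a quick numerical illustration (a sketch, not part of the original slides), the three properties can be checked for the Euclidean distance:

import math

def euclidean(a, b):
    # Euclidean distance between two equal-length tuples
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

x, y, z = (20, 12), (30, 8), (47, 13.5)

assert euclidean(x, y) >= 0                                   # property 1: non-negativity
assert euclidean(x, x) == 0                                   # ... and zero only for identical points
assert euclidean(x, y) == euclidean(y, x)                     # property 2: symmetry
assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z)   # property 3: triangle inequality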
Distance Function
§ The Euclidean Distance function is commonly used to measure distance:

d_Euclidean(x, y) = √( Σᵢ (xᵢ − yᵢ)² )

where x = (x₁, x₂, ..., xₘ) and y = (y₁, y₂, ..., yₘ) represent the m attribute values
Example
§ Suppose Patient A is 20 years old and has a Na/K ratio of 12, and Patient B is 30 years old and has a Na/K ratio of 8
§ What is the Euclidean distance between these instances?

d_Euclidean(A, B) = √((20 − 30)² + (12 − 8)²) = √116 ≈ 10.77

15
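The same calculation can be reproduced directly in Python (a minimal sketch; the math module is all that is needed):

import math

patient_a = (20, 12)   # (Age, Na/K ratio)
patient_b = (30, 8)

# Euclidean distance between the two patients
d = math.sqrt((patient_a[0] - patient_b[0])**2 + (patient_a[1] - patient_b[1])**2)
print(d)   # 10.770329...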
Distance Function
§ When measuring distance, one or more attributes can have very large values relative to the other attributes
§ For example, income may range from 30,000 to 100,000, whereas years_of_service takes on values from 0 to 10
§ In this case, the values of income will overwhelm the contribution of years_of_service
§ To avoid this situation, we standardize the data
– Continuous data values should be standardized using Min-Max Normalization or Z-Score Standardization

Min-Max Normalization: X* = (X − min(X)) / (max(X) − min(X))

Z-Score Standardization: X* = (X − mean(X)) / standard_deviation(X)

§ Example:
– Which patient is more similar to a 50-year-old male: a 20-year-old male or a 50-year-old female?

16
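To make the two formulas concrete, here is a small sketch that applies them to the Age values used on the next slides (the statistics minimum = 10, range = 50 hence maximum = 60, mean = 45, and standard deviation = 15 are taken from the slide that follows):

ages = {'A': 50, 'B': 20, 'C': 50}
age_min, age_max = 10, 60      # minimum = 10, range = 50
age_mean, age_sd = 45, 15

for patient, age in ages.items():
    mmn = (age - age_min) / (age_max - age_min)   # Min-Max Normalization
    z = (age - age_mean) / age_sd                 # Z-Score Standardization
    print(patient, round(mmn, 2), round(z, 2))
# A 0.8 0.33
# B 0.2 -1.67
# C 0.8 0.33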
Distance Function
§ Let Patient A = 50-year-old male, Patient B =
20-year-old male, and Patient C = 50-year-old
female
§ Suppose that the Age variable has a range =
50, minimum = 10, mean = 45, and standard
deviation = 15
§ The table below contains the original, Min-Max Normalized, and Z-Score Standardized values for Age

Patient   Age   Age (MMN)   Age (Z-score)
A         50    0.8          0.33
B         20    0.2         −1.67
C         50    0.8          0.33

17
Distance Function
§ Assume we do not standardize Age, and calculate the distance between Patient A and Patient B, and between Patient A and Patient C (gender is coded so that two patients of the same gender differ by 0, and of different genders by 1):

d(A, B) = √((50 − 20)² + 0²) = 30

d(A, C) = √((50 − 50)² + 1²) = 1

§ We determine, although perhaps incorrectly, that Patient C is nearest to Patient A
§ Is Patient B really 30 times more distant from Patient A than Patient C is?
§ Perhaps neglecting to normalize the values of Age is creating this discrepancy

18
Distance Function
§ Age Normalized using Min-Max
– Age is normalized using Min-Max Normalization
– Again, we calculate the distance between Patient A and Patient B, and between Patient A and Patient C

d_MMN(A, B) = √((0.8 − 0.2)² + 0²) = 0.6

d_MMN(A, C) = √((0.8 − 0.8)² + 1²) = 1.0

§ Age Standardized using Z-Score
– This time, Age is standardized using Z-Score Standardization

d_Zscore(A, B) = √((0.33 − (−1.67))² + 0²) = 2.0

d_Zscore(A, C) = √((0.33 − 0.33)² + 1²) = 1.0
19
Distance Function
§ The use of different normalization techniques
resulted in Patient A being nearest to different
patients in the training set
§ This underscores the importance of
understanding which technique is being used
§ In Python
– Use StandardScaler for z-score standardization
– Use MinMaxScaler for Min-Max normalization

20
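For reference, a minimal MinMaxScaler sketch (StandardScaler is demonstrated in Examples 2 and 3). Note that the scaler uses the minimum and maximum observed in the data passed to it, so its output differs from the earlier hand calculation, which assumed a wider population range of 10 to 60:

from sklearn.preprocessing import MinMaxScaler
import numpy

ages = numpy.array([[50.0], [20.0], [50.0]])

scaler = MinMaxScaler()
ages_mmn = scaler.fit_transform(ages)   # rescales each column to [0, 1] using its observed min and max
print(ages_mmn.ravel())                 # [1. 0. 1.]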
Alternative Distance Function

§ Manhattan Distance Function
– d_Manhattan = |x₁ − x₂| + |y₁ − y₂| (the sum of absolute coordinate differences)
§ Minkowski Distance Function
– A generalization of both the Euclidean distance
and the Manhattan distance
§ In Python
– Use parameter p to specify the distance function

21
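A small sketch comparing the three distance functions on Patients A and B from the earlier example (pure NumPy, to make the formulas explicit):

import numpy

a = numpy.array([20, 12])   # Patient A: (Age, Na/K ratio)
b = numpy.array([30, 8])    # Patient B

manhattan = numpy.sum(numpy.abs(a - b))           # |20-30| + |12-8| = 14
euclidean = numpy.sqrt(numpy.sum((a - b) ** 2))   # sqrt(100 + 16) ≈ 10.77

p = 3   # any p >= 1; p = 1 gives Manhattan, p = 2 gives Euclidean
minkowski = numpy.sum(numpy.abs(a - b) ** p) ** (1 / p)

print(manhattan, euclidean, minkowski)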
Example 2
Data Standardization and Distance Function

import pandas

mower_df = pandas.read_csv("RidingMowers.csv")

X = mower_df.iloc[:,0:2]
y = mower_df['Ownership']

new_obs = pandas.DataFrame([[24,60,20]], columns=["Index","Income","Lot_Size"])
new_obs.set_index("Index", inplace=True)

# Combine the training predictors and the new observation so that both are
# standardized on the same scale
combined_obs = pandas.concat([X, new_obs])

# Use StandardScaler to standardize the predictors (z-score standardization);
# if Min-Max normalization is preferred, use MinMaxScaler here instead
from sklearn.preprocessing import StandardScaler

standardizer = StandardScaler()
combined_obs_std = standardizer.fit_transform(combined_obs)

# Separate the 24 training records from the new observation
X_std = combined_obs_std[:24,:]
new_obs_std = combined_obs_std[24:,:]

from sklearn.neighbors import KNeighborsClassifier

# Parameter p selects the distance function:
#   p = 1: Manhattan distance
#   p = 2: Euclidean distance (default)
#   other values: Minkowski distance
knn2 = KNeighborsClassifier(n_neighbors=3, p=2)
model2 = knn2.fit(X_std, y)

model2.predict(new_obs_std)

22
Example 3
Alternative Standardization Approach

import pandas

mower_df = pandas.read_csv("RidingMowers.csv")

X = mower_df.iloc[:,0:2]
y = mower_df['Ownership']

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training predictors only
standardizer = StandardScaler()
X_std = standardizer.fit_transform(X)

from sklearn.neighbors import KNeighborsClassifier

knn3 = KNeighborsClassifier(n_neighbors=3)
model3 = knn3.fit(X_std, y)

new_obs = pandas.DataFrame([[60,20]], columns=["Income","Lot_Size"])

# Apply the same fitted scaler to the new observation, rather than
# re-fitting on the combined data as in Example 2
new_obs_std = standardizer.transform(new_obs)

model3.predict(new_obs_std)

23
Combination Function
§ The Distance function determines the similarity
of a new unclassified record to those in the
training set
§ How should the k most similar records be combined to provide a classification?

24
Combination Function
§ Simple Unweighted Voting
– This is the simplest combination function
– Decide on the value for k to determine the number
of similar records that “vote”
– Compare each unclassified record to its k nearest
(most similar) neighbors according to the Euclidean
Distance function
– Each of the k similar records casts a vote

25
Combination Function
§ Recall that we classified a new 17-year-old patient with a Na/K ratio of 12.5, using k = 3
§ Simple unweighted voting determined that two of the three closest points to Patient 2 are Medium
§ Therefore, Patient 2 is classified as drug Y with a confidence of 2/3 = 66.67%
§ We also classified a new 47-year-old patient with a Na/K ratio of 13.5, using k = 3
§ However, simple unweighted voting did not help and resulted in a tie
§ Perhaps weighted voting should be considered?
26
Combination Function
§ The analyst may choose to apply weighted voting, where
closer neighbors have a larger voice in the classification
decision than do more distant neighbors
§ In weighted voting, the influence of a particular record is
inversely proportional to the distance of the record from
the new record to be classified
§ For example, the distances between Patient 2 and records A, B, and C are as follows (both predictors are standardized):

d(new, A) = √((0.05 − 0.0467)² + (0.25 − 0.2471)²) = 0.004393

d(new, B) = √((0.05 − 0.0533)² + (0.25 − 0.1912)²) = 0.058893

d(new, C) = √((0.05 − 0.0917)² + (0.25 − 0.2794)²) = 0.051022

27
Combination Function
§ The votes of these records are then weighted according to the inverse of their distances:

votes(Drug Z) = 1 / d(new, A) = 1 / 0.004393 ≈ 227.6255

votes(Drug Y) = 1 / d(new, B) + 1 / d(new, C) = 1 / 0.058893 + 1 / 0.051022 ≈ 36.5795

§ By 227.6255 votes to 36.5795, the weighted voting procedure chooses the Dark class (drug Z) as the classification for the new 17-year-old patient with a sodium/potassium ratio of 12.5
§ The probability of drug Z being prescribed is 227.6255 / (227.6255 + 36.5795) ≈ 0.8615

28
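In scikit-learn, this distance-weighted voting corresponds to the weights='distance' option of KNeighborsClassifier. A minimal sketch, reusing the three standardized records from Example 1, reproduces the classification above:

from sklearn.neighbors import KNeighborsClassifier

# The three standardized training records from Example 1
X = [[0.0467, 0.2471],   # A: Dark (drug Z)
     [0.0533, 0.1912],   # B: Medium (drug Y)
     [0.0917, 0.2794]]   # C: Medium (drug Y)
y = ['Dark', 'Medium', 'Medium']

# weights='distance' weights each neighbor's vote by the inverse of its distance
knn_w = KNeighborsClassifier(n_neighbors=3, weights='distance')
model_w = knn_w.fit(X, y)

new_obs = [[0.05, 0.25]]
print(model_w.predict(new_obs))         # ['Dark'] -> drug Z
print(model_w.predict_proba(new_obs))   # approximately [[0.8615, 0.1385]]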
Quantifying Attribute Relevance
§ Not all attributes may be relevant to classification
§ Some algorithms such as Decision Trees only
include attributes that contribute to improving
classification accuracy
§ In contrast, k-Nearest Neighbor’s default behavior
is to calculate distances using all attributes
§ A relevant record may be proximate for important
variables, while at the same time very distant for
other, unimportant variables
§ Taken together, the relevant record may now be
moderately far away from the new record, such that
it does not participate in the classification decision

29
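scikit-learn's k-NN has no built-in mechanism for attribute relevance, but one simple workaround (an illustrative sketch, not from the lecture) is to multiply each standardized column by a relevance weight before fitting, so that less relevant attributes contribute less to the distance:

import numpy

# Toy standardized predictors: rows are records, columns are attributes
X_std = numpy.array([[ 0.33, -1.20],
                     [-1.67,  0.50],
                     [ 0.33,  0.70]])

# Hypothetical relevance weights: the second attribute is judged less relevant
relevance = numpy.array([1.0, 0.3])

# Scaling a column by w scales its contribution to the Euclidean distance by w
X_weighted = X_std * relevance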
Standardizing Train/Test dataset
§ When we standardize the training and test datasets, the common approach is to combine them, standardize the combined data, and then separate them again
§ Alternatively, we can standardize the training and test datasets separately, but this has to be done carefully if the test dataset is small

30
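A minimal sketch of the separate approach, fitting the scaler on the training data only and reusing it on the test data (file and column names follow the earlier examples):

import pandas
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

mower_df = pandas.read_csv("RidingMowers.csv")
X = mower_df.iloc[:,0:2]
y = mower_df['Ownership']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=5)

standardizer = StandardScaler()
X_train_std = standardizer.fit_transform(X_train)   # learn mean and sd from the training data only
X_test_std = standardizer.transform(X_test)         # apply the training mean and sd to the test data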
Choosing k
§ What value of k is optimal?
§ There is not necessarily an obvious solution
§ Smaller k
– Choosing a small value for k may lead the algorithm to
overfit the data
– Noise or outliers may unduly affect classification
§ Larger k
– Larger values will tend to smooth out idiosyncratic or
obscure data values in the training set
– If the values become too large, locally interesting values
will be overlooked

31
Choosing k
§ Choosing the appropriate value for k requires
balancing these considerations
§ A general rule of thumb for picking k is k = √n, where n is the number of training records
§ Using cross-validation may help determine the
value for k, by choosing a value that minimizes
the classification error

32
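A sketch of the cross-validation approach using GridSearchCV (the grid of k values and the 5-fold setting are assumptions):

import pandas
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

mower_df = pandas.read_csv("RidingMowers.csv")
X = mower_df.iloc[:,0:2]
y = mower_df['Ownership']

standardizer = StandardScaler()
X_std = standardizer.fit_transform(X)

# 5-fold cross-validation over k = 1..10, scored by classification accuracy
grid = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': range(1, 11)}, cv=5, scoring='accuracy')
grid.fit(X_std, y)

print(grid.best_params_)   # the k with the highest cross-validated accuracy
print(grid.best_score_)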
Example 4
Standardization Examples

import pandas

mower_df = pandas.read_csv("RidingMowers.csv")

X = mower_df.iloc[:,0:2]
y = mower_df['Ownership']

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=5)

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Try k = 1..10 and report the test-set accuracy for each
for i in range(1, 11):
    knn4 = KNeighborsClassifier(n_neighbors=i)
    model4 = knn4.fit(X_train, y_train)
    y_test_pred = model4.predict(X_test)
    print(accuracy_score(y_test, y_test_pred))

33
Example 5
Choosing optimal k

import pandas

mower_df = pandas.read_csv("RidingMowers.csv")

X = mower_df.iloc[:,0:2]
y = mower_df['Ownership']

from sklearn.preprocessing import StandardScaler

standardizer = StandardScaler()
X_std = standardizer.fit_transform(X)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_std, y, test_size=0.33, random_state=5)

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Loop body assumed to mirror Example 4, now applied to standardized predictors
for i in range(1, 11):
    knn5 = KNeighborsClassifier(n_neighbors=i)
    model5 = knn5.fit(X_train, y_train)
    y_test_pred = model5.predict(X_test)
    print(accuracy_score(y_test, y_test_pred))

34
Example 6
Use UniversalBank.csv

import pandas
from sklearn.neighbors import KNeighborsClassifier

# Load and construct the data
df = pandas.read_csv("UniversalBank.csv")
X = df.iloc[:,1:12]
y = df["Personal Loan"]

from sklearn.preprocessing import StandardScaler

standardizer = StandardScaler()
X_std = standardizer.fit_transform(X)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_std, y, test_size=0.33, random_state=5)

from sklearn.metrics import accuracy_score

# Loop body assumed to mirror Examples 4 and 5
for i in range(1, 11):
    knn6 = KNeighborsClassifier(n_neighbors=i)
    model6 = knn6.fit(X_train, y_train)
    y_test_pred = model6.predict(X_test)
    print(accuracy_score(y_test, y_test_pred))

35
Exercise #1
Use ClassifyRisk dataset

Develop a K-NN model to classify Record #1 (i.e., the first row in the dataset) using k = 2, Euclidean distance, target variable = risk, and predictors = age, marital status, and income.

36
Exercise #2
§ Use the same dataset and model in #1
§ Split the data into a test (30%) and training
(70%) dataset
§ Using K-NN, find the optimal value of k for this
dataset (i.e., what is the k value that gives you
the most accurate results).

37
