
Lecture #12: kNN Classification and Missing Data

Data Science 1
CS 109A, STAT 121A, AC 209A, E-109A

Pavlos Protopapas, Kevin Rader, Margo Levine, Rahul Dave
Lecture Outline

ROC Curves

k-NN Revisited

Dealing with Missing Data

Types of Missingness

Imputation Methods

2
ROC Curves

3
ROC Curves

The ROC curve illustrates the trade-off between the two types of error (or correct classification) across all possible classification thresholds.

The vertical axis displays the true positive rate and the horizontal axis displays the false positive rate.

What is the shape of an ideal ROC curve?

See next slide for an example.

4
ROC Curve Example

5
ROC Curve for measuring classifier performance

The overall performance of a classifier, calculated over all possible thresholds, is given by the area under the ROC curve (‘AUC’). Let T be the threshold false positive rate and let TPR(T) be the corresponding true positive rate at T. Then the AUC is the integral

AUC = ∫₀¹ TPR(T) dT

What is the worst case scenario for AUC? What is the best case? What is the AUC if we just flip a coin independently to perform classification?

The AUC can then be used to compare various approaches to classification: logistic regression, LDA (to come), kNN, etc.

6
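As an illustration of how the ROC curve and AUC are computed in practice, here is a minimal sketch using scikit-learn's roc_curve and roc_auc_score; the toy data and variable names are made up for the example.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve, roc_auc_score

    # Hypothetical toy data standing in for a real train/test split.
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 2))
    y_train = (X_train[:, 0] + rng.normal(size=200) > 0).astype(int)
    X_test = rng.normal(size=(100, 2))
    y_test = (X_test[:, 0] + rng.normal(size=100) > 0).astype(int)

    model = LogisticRegression().fit(X_train, y_train)
    p_hat = model.predict_proba(X_test)[:, 1]        # predicted P(Y = 1)

    fpr, tpr, thresholds = roc_curve(y_test, p_hat)  # one (FPR, TPR) point per threshold
    print(roc_auc_score(y_test, p_hat))              # area under the ROC curve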
k-NN Revisited

7
k-Nearest Neighbors

We’ve already seen the k-NN method for predicting a quantitative response (it was the very first method we introduced). How was k-NN implemented in the regression setting (quantitative response)?

The approach was simple: to predict an observation’s response, use the other available observations that are most similar to it.

For a specified value of k, each observation’s outcome is predicted to be the average of the k closest observations, as measured by some distance on the predictor(s).

With one predictor, the method was easily implemented.

8
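For concreteness, a minimal sketch of k-NN regression with scikit-learn's KNeighborsRegressor on hypothetical one-predictor data (the data and variable names below are made up for illustration):

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    # Hypothetical one-predictor training data.
    rng = np.random.default_rng(1)
    x_train = np.sort(rng.uniform(-6, 6, size=100)).reshape(-1, 1)
    y_train = np.sin(x_train).ravel() + rng.normal(scale=0.3, size=100)

    # Each new point is predicted as the average response of its k nearest neighbors.
    knn = KNeighborsRegressor(n_neighbors=10).fit(x_train, y_train)
    x_new = np.linspace(-6, 6, 200).reshape(-1, 1)
    y_hat = knn.predict(x_new)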
Review: Choice of k

How well the predictions perform is related to the choice of k.

What will the predictions look like if k is very small? What if it is very large?

More specifically, what will the predictions be for new observations if k = n?

A picture is worth a thousand words...

9
Choice of k Matters

[Figure: k-NN regression fits to simulated training data (x_train on the horizontal axis, y_train on the vertical axis) for several values of k, including k = 2, 10, 100, 500, and 1000.]
10
k-NN for Classification

How can we modify the k-NN approach for classification?

The approach here is the same as for k-NN regression: use the other available observations that are most similar to the observation we are trying to predict (classify into a group) based on the predictors at hand.

How do we classify which category a specific observation should be in based on its nearest neighbors?

The category that shows up the most among the nearest neighbors.

11
k-NN for Classification formal definition

The k-NN classifier first identifies the k points in the training data that are closest to x0, represented by N0. It then estimates the conditional probability for class j as the fraction of points in N0 whose response values equal j:

P(Y = j | X = x0) = (1/k) Σ_{i ∈ N0} I(y_i = j)

Then the k-NN classifier applies Bayes rule and classifies the test observation, x0, to the class with the largest probability.

12
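As a small illustration, scikit-learn's KNeighborsClassifier exposes exactly these estimated class fractions via predict_proba; the data and values below are hypothetical.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    # Hypothetical two-class training data.
    rng = np.random.default_rng(2)
    X_train = rng.normal(size=(150, 2))
    y_train = (X_train[:, 0] + X_train[:, 1] + rng.normal(size=150) > 0).astype(int)

    knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

    x0 = np.array([[0.3, -0.1]])
    print(knn.predict_proba(x0))  # fraction of the 5 nearest neighbors in each class
    print(knn.predict(x0))        # class with the largest estimated probability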
k-NN for Classification (cont.)

There are some issues that may arise:

▶ How can we handle a tie?
  With a coin flip!

▶ What could be a major problem with always classifying to the most common group amongst the neighbors?
  If one category is much more common than the others then all the predictions may be the same!

▶ How can we handle this?
  Rather than classifying with the most likely group, use a biased coin flip to decide which group to classify to!

13
k-NN with Multiple Predictors

How could we extend k-NN (both regression and classification) when there are multiple predictors?

We would need to define a measure of distance between observations in order to determine which are most similar to the observation we are trying to predict.

Euclidean distance is a good option. To measure the distance of a new observation, x0, from each observation, x_i, in the data set:

D²(x_i, x0) = Σ_{j=1..P} (x_{i,j} − x_{0,j})²

14
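A minimal numpy sketch of this distance calculation and of picking the k nearest neighbors (all names and data below are hypothetical):

    import numpy as np

    # Hypothetical data: 100 observations with P = 3 predictors, plus one new observation x0.
    rng = np.random.default_rng(3)
    X = rng.normal(size=(100, 3))
    x0 = rng.normal(size=3)

    # Squared Euclidean distance from x0 to every row of X: sum over j of (x_ij - x_0j)^2.
    d2 = ((X - x0) ** 2).sum(axis=1)

    k = 5
    nearest = np.argsort(d2)[:k]  # indices of the k closest observations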
k-NN with Multiple Predictors

But what must we be careful about when measuring distance?

1. Differences in variability in our predictors!
2. Having a mixture of quantitative and categorical predictors.

So what should be good practice? To determine the closest neighbors when P > 1, you should first standardize the predictors! And you can even standardize the binaries if you want to include them.

How else could we determine closeness in this multi-dimensional setting?

15
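One common way to build this standardization into a k-NN fit is scikit-learn's StandardScaler inside a Pipeline; a small sketch on hypothetical data where the two predictors sit on very different scales:

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsClassifier

    # Hypothetical data where the second predictor has a much larger scale.
    rng = np.random.default_rng(4)
    X = np.column_stack([rng.normal(scale=1.0, size=200),
                         rng.normal(scale=1000.0, size=200)])
    y = (X[:, 0] > 0).astype(int)

    # Standardizing first keeps the large-scale predictor from dominating the distance.
    knn_pipeline = Pipeline([
        ("scale", StandardScaler()),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ])
    knn_pipeline.fit(X, y)
    print(knn_pipeline.predict(X[:5]))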
Dealing with Missing Data

16
What is missing data?

Oftentimes when data are collected, there are some missing values apparent in the dataset. This leads to a few questions to consider:

1. How does this show up in pandas?
2. How do pandas and sklearn handle these NaNs?
3. How does this affect our modeling?

17
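A quick sketch of what missingness looks like in pandas, using a hypothetical data frame (the column names and values are made up):

    import numpy as np
    import pandas as pd

    # Hypothetical small data frame with missing entries stored as NaN.
    df = pd.DataFrame({"age": [34, np.nan, 52, 41],
                       "income": [72000, 58000, np.nan, np.nan]})

    print(df.isnull().sum())  # count of NaNs per column
    print(df.dropna())        # rows containing any NaN removed
    # Note: many sklearn estimators will raise an error if the input contains NaNs.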
Naively handling missingness

What is the simplest way to handle missing data?


1. Impute the mean (if quantitative) or most common class (if categorical) for all missing values.
2. Simply drop the observations (rows) that contain any missing values.

What are some consequences of handling missingness in this fashion?

18
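A sketch of this naive imputation with pandas, on a hypothetical data frame with one quantitative and one categorical column:

    import numpy as np
    import pandas as pd

    # Hypothetical data frame with missing entries in both columns.
    df = pd.DataFrame({"income": [72000, np.nan, 58000, np.nan, 61000],
                       "os": ["Mac", "PC", np.nan, "PC", "PC"]})

    # Naive imputation: the mean for the quantitative column,
    # the most common class for the categorical column.
    df["income"] = df["income"].fillna(df["income"].mean())
    df["os"] = df["os"].fillna(df["os"].mode()[0])
    print(df)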
Types of Missingness

19
Sources of Missingness

Missing data can arise in various ways:

▶ A survey was conducted and values were just randomly missed when being entered into the computer.
▶ A respondent chooses not to respond to a question like ‘Have you ever done cocaine?’.
▶ You decide to start collecting a new variable (like Mac vs. PC) partway through the data collection of a study.
▶ You want to measure the speed of meteors, and some observations are just ‘too quick’ to be measured properly.

The source of the missing values can lead to the major types of missingness:

20
Types of Missingness

There are 3 major types of missingness to be concerned about:

1. Missing Completely at Random (MCAR) - the probability of missingness in a variable is the same for all units. Like randomly poking holes in a data set.
2. Missing at Random (MAR) - the probability of missingness in a variable depends only on available information (in other predictors).
3. Missing Not at Random (MNAR) - the probability of missingness depends on information that has not been recorded, and this information also predicts the missing values.

What are examples of each of these 3 types?

21
Missing completely at random (MCAR)

Missing Completely at Random is the best case scenario, and the easiest to handle:

▶ Examples: a coin is flipped to determine whether an entry is removed, or values were just randomly missed when being entered into the computer.
▶ Effect if you ignore: there is no effect on inferences (estimates of beta).
▶ How to handle: lots of options, but it is best to impute (more on imputation methods later).

22
Missing at random (MAR)

Missing at random is still a case that can be handled:

▶ Example(s): men and women respond to the question ‘have you ever felt harassed at work?’ at different rates (and may be harassed at different rates).
▶ Effect if you ignore: inferences are biased (estimates of beta) and predictions are usually worsened.
▶ How to handle: use the information in the other predictors to build a model and ‘impute’ a value for the missing entry.

Key: we can fix any biases by modeling and imputing the missing values based on what is observed!

23
Missing Not at Random (MNAR)

Missing Not at Random is the worst case scenario, and essentially impossible to handle fully:

▶ Example(s): patients drop out of a study because they experience some really bad side effect that was not measured. Or cheaters are less likely to respond when asked if they have ever cheated.
▶ Effect if you ignore: inferences are biased (estimates of beta) and predictions are usually worsened.
▶ How to handle: you can ’improve’ things by dealing with it like it is MAR, but you may never completely fix the bias.

24
What type of missingness is present?

Can you ever tell based on your data what type of missingness is actually present?

Since we asked the question, the answer must be no. It generally cannot be determined whether data really are missing at random, or whether the missingness depends on unobserved predictors or on the missing data themselves. The problem is that these potential ‘lurking variables’ are unobserved (by definition) and so can never be completely ruled out.

In practice, a model with as many predictors as possible is used so that the ‘missing at random’ assumption is reasonable.

25
Imputation Methods

26
Handling missing data

When encountering missing data, the approach to handling it depends on:

1. whether the missing values are in the response or in the predictors. Generally speaking, it is much easier to handle missingness in predictors.
2. whether the variable is quantitative or categorical.
3. how much missingness is present in the variable. If there is too much missingness, you may be doing more damage than good.

Generally speaking, it is a good idea to attempt to impute (or ‘fill in’) entries for missing values in a variable (assuming your method of imputation is a good one).

27
Imputation methods

There are several different approaches to imputing missing values:

1. Plug in the mean (quantitative) or most common class (categorical) for all missing values in a variable.
2. Create a new variable that is an indicator of missingness, and include it in any model to predict the response (also plug in zero or the mean in the actual variable).
3. Hot deck imputation: for each missing entry, randomly select an observed entry in the variable and plug it in.
4. Model the imputation: plug in predicted values (ŷ) from a model based on the other observed predictors.
5. Model the imputation with uncertainty: plug in predicted values plus randomness (ŷ + ε) from a model based on the other observed predictors.

What are the advantages and disadvantages of each approach?

28
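As a rough sketch of approaches 2 and 3 (a missingness indicator and hot deck imputation) in pandas, on a hypothetical data frame with missing incomes:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(5)

    # Hypothetical data frame with missing incomes.
    df = pd.DataFrame({"income": [72000, np.nan, 58000, np.nan, 61000, 55000]})

    # Approach 2: an indicator of missingness that can be included as a predictor.
    df["income_missing"] = df["income"].isnull().astype(int)

    # Approach 3 (hot deck): plug a randomly chosen observed entry into each missing one.
    observed = df.loc[df["income"].notnull(), "income"].to_numpy()
    n_missing = df["income"].isnull().sum()
    df.loc[df["income"].isnull(), "income"] = rng.choice(observed, size=n_missing)
    print(df)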
Schematic: imputation through modeling

How do we use models to fill in missing data?

29
Schematic: imputation through modeling

How do we use models to fill in missing data? Using kNN for k = 2?

31
Schematic: imputation through modeling

How do we use models to fill in missing data? Using linear regression?

Where m and b are computed from the observations (rows) that do not have missingness (we should call them b = β0 and m = β1).

33
Imputation through modeling with uncertainty

The schematic in the last few slides ignores the issue of imputing with uncertainty. What happens if you ignore this and just use the ‘best’ model to impute values based solely on ŷ?

The distribution of the imputed values will be too narrow and will not represent real data (see next slide for an illustration). The goal is to impute values that include the uncertainty of the model.

How can this be done in practice in kNN? In linear regression? In logistic regression?

34
Imputation through modeling with uncertainty: an illustration

35
Imputation through modeling with uncertainty: linear regression

Recall the probabilistic model in linear regression:

Y = β0 + β1 X1 + ... + βp Xp + ε,   where ε ∼ N(0, σ²).

How can we take advantage of this model to impute with uncertainty?

It’s a 3 step process:
1. Fit a model to predict the predictor variable with missingness from all the other predictors.
2. Predict the missing values from the model in the previous part.
3. Add in a measure of uncertainty to this prediction by randomly sampling from a N(0, σ̂²) distribution, where σ̂² is the mean squared error (MSE) from the model.

36
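A rough sketch of these three steps, assuming a hypothetical data frame where the column x1 has missing entries and x2, x3 are fully observed (all names and data are made up):

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(6)

    # Hypothetical data: x1 has missing entries; x2 and x3 are fully observed.
    n = 200
    df = pd.DataFrame({"x2": rng.normal(size=n), "x3": rng.normal(size=n)})
    df["x1"] = 1.5 * df["x2"] - 0.5 * df["x3"] + rng.normal(scale=0.8, size=n)
    df.loc[rng.choice(n, size=40, replace=False), "x1"] = np.nan

    obs = df["x1"].notnull()

    # Step 1: fit a model for x1 from the other predictors, using the complete rows.
    model = LinearRegression().fit(df.loc[obs, ["x2", "x3"]], df.loc[obs, "x1"])

    # Step 2: predict the missing x1 values from that model.
    pred = model.predict(df.loc[~obs, ["x2", "x3"]])

    # Step 3: add noise from N(0, sigma_hat^2), where sigma_hat^2 is the MSE on the observed rows.
    resid = df.loc[obs, "x1"] - model.predict(df.loc[obs, ["x2", "x3"]])
    sigma_hat = np.sqrt(np.mean(resid ** 2))
    df.loc[~obs, "x1"] = pred + rng.normal(scale=sigma_hat, size=len(pred))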
Imputation through modeling with uncertainty: k-NN regression

How can we use k-NN regression to impute values that mimic the error in our observations?

Two ways:
1. Use k = 1.
2. Use any other k, but randomly select from the nearest neighbors in N0. This can be done with equal probability or with some weighting (inverse to the distance measure used).

37
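A small sketch of the second option, using scikit-learn's NearestNeighbors to find the k closest complete cases and then drawing one of them at random (the data and variable names are hypothetical):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(7)

    # Hypothetical complete cases (X_obs, v_obs) for the variable v being imputed,
    # and rows X_mis where v is missing.
    X_obs = rng.normal(size=(100, 2))
    v_obs = X_obs[:, 0] + rng.normal(scale=0.5, size=100)
    X_mis = rng.normal(size=(10, 2))

    k = 5
    nn = NearestNeighbors(n_neighbors=k).fit(X_obs)
    _, idx = nn.kneighbors(X_mis)  # indices of the k nearest complete cases for each row

    # Draw one of the k neighbors at random (equal probability) instead of averaging,
    # so the imputed values keep some of the spread of the real data.
    v_imputed = np.array([v_obs[rng.choice(row)] for row in idx])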
Imputation through modeling with uncertainty: classifiers

For classifiers, this imputation with uncertainty/randomness is a little easier. How can it be implemented?

If a classification model (logistic, kNN, etc.) is used to predict the variable with missingness from the observed predictors, then all you need to do is flip a ‘biased coin’ (or roll a multi-sided die) whose probability for each class equals the predicted probability from the model.

Warning: do not just classify blindly using the predict command in sklearn!

38
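A minimal sketch of this ‘biased coin’ idea: predict class probabilities for the rows with missingness and sample each imputed class from those probabilities, rather than calling predict. The data and names below are hypothetical.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(8)

    # Hypothetical data: a binary variable z is missing for some rows; x is fully observed.
    X_obs = rng.normal(size=(200, 2))
    z_obs = (X_obs[:, 0] + rng.normal(size=200) > 0).astype(int)
    X_mis = rng.normal(size=(20, 2))

    clf = LogisticRegression().fit(X_obs, z_obs)
    probs = clf.predict_proba(X_mis)  # one row of class probabilities per missing entry

    # Sample each imputed class from its predicted probabilities ('biased coin'),
    # instead of always taking the most likely class via clf.predict().
    z_imputed = np.array([rng.choice(clf.classes_, p=p) for p in probs])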
Imputation across multiple variables

If only one variable has missing entries, life is easy. But what if all the predictor variables have a little bit of missingness (with some observations having multiple entries missing)? How can we handle that?

It’s an iterative process. Impute X1 based on X2, ..., Xp. Then impute X2 based on X1 and X3, ..., Xp. And continue down the line.

Any issues? Yes: not all of the missing values may be imputed with just one ’run’ through the data set, so you will have to repeat these ’runs’ until you have a completely filled in data set.

39
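scikit-learn ships an (experimental) round-robin implementation of this idea, IterativeImputer; a minimal sketch on a hypothetical array with a little missingness in every column:

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401, activates IterativeImputer
    from sklearn.impute import IterativeImputer

    rng = np.random.default_rng(9)

    # Hypothetical data with roughly 10% missingness scattered across all three columns.
    X = rng.normal(size=(100, 3))
    X[rng.random(X.shape) < 0.1] = np.nan

    # Each variable is imputed in turn from the others, cycling through several passes.
    # sample_posterior=True adds randomness to the imputations instead of plain predictions.
    imputer = IterativeImputer(sample_posterior=True, random_state=0)
    X_filled = imputer.fit_transform(X)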
Multiple imputation: beyond this class

What is an issue with treating your now ‘complete’ data set (a mixture of actually observed values and imputed values) as if it were simply all observed values?

Any inferences or predictions carried out will be tuned, and potentially overfit, to the random entries imputed for the missing values. How can we prevent this phenomenon?

By performing multiple imputation: rerun the imputation algorithm many times, refit the model on the response many times (once for each imputed data set), and then ’average’ the predictions or estimates of the β coefficients to perform inferences (also incorporating the uncertainty involved).

Note: this is beyond what we would expect in this class, but it is generally a good thing to be aware of.

40
