
Lecture 10

The document discusses the issue of missing values in datasets, outlining their types: Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR). It presents strategies for handling missing values, including deletion and various imputation methods such as using unique values, group centroids, or k nearest neighbors. The document emphasizes the importance of understanding the nature of missing data to choose the appropriate handling technique.

Uploaded by

Husam hr

Imputation of missing values

What are missing values?

In practice, some values of some observations are missing
Example 1 : a poll. Some questions may be left unanswered
Example 2 : video surveillance. A sensor may malfunction, so that
values are missing from time to time

What can be done?

A simple solution : delete the observations that contain
missing values.
This may degrade the performance of the model, and it may also
introduce a bias! Missing data may be informative in itself : in a
poll, there may be a reason why some information is missing
It is therefore important to understand why some values are
missing
How to handle missing values?
Possible types of missing values

Missing completely at random (MCAR)

The probability that a value is missing is exactly the same for
each observation
Example : a sensor malfunctions and fails to collect a
given percentage of the data

Missing at random (MAR)

The probability that a value is missing depends on some observed variables.
Example : in a poll, a participant is more likely not to
answer Question 2 if they did not answer Question 1

Missing not at random (MNAR)

The probability that a value is missing depends on unobserved variables
For example, it may depend on the gender of the individual, which is
not recorded

It is usually not possible to deduce from the data alone whether
values are missing at random or not (MAR or MNAR)
One has to know more about the way the data have been collected
One has to include as many variables as possible in the study
In the MNAR case, we have no model explaining the missing values.
To circumvent this problem, one can only collect additional data

Data : table of observations described by d variables, stored in a
matrix $X$
Matrix of missing values $M = (m_{i,j})$ with $m_{i,j} = 1$ if $x_{i,j}$ is
missing, $m_{i,j} = 0$ otherwise
We denote by $X_0$ the fully observed data and by $X_m$ the data with some
missing values
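
The definition of M can be illustrated with a small sketch. It assumes, for illustration only, that X is stored as a list of rows and that a missing value is encoded as None; the variable names are hypothetical and not taken from the lecture notebook.

```python
# Toy data matrix X: 4 observations, d = 3 variables.
# Assumption: None marks a missing value.
X = [
    [1.0, 2.0, None],
    [4.0, None, 6.0],
    [7.0, 8.0, 9.0],
    [None, 11.0, 12.0],
]

# M has the same shape as X: M[i][j] = 1 if X[i][j] is missing, 0 otherwise.
M = [[1 if value is None else 0 for value in row] for row in X]

# Rows without any missing entry are the fully observed observations.
complete_rows = [row for row, mask in zip(X, M) if sum(mask) == 0]

# Fraction of missing values for each variable (column of M).
missing_rate = [sum(column) / len(M) for column in zip(*M)]

print(M)
print(complete_rows)
print(missing_rate)
```

Looking at the missing rate per variable is a cheap first check before choosing between deletion and imputation.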

One can characterize each type of missing values in the following way
MCAR : $p(M \mid X) = p(M)$
MAR : $p(M \mid X) = p(M \mid X_0)$
MNAR : no information on $p(M \mid X)$ (it may depend on unobserved values)
Solution 1 : delete missing values

What are the potential consequences, depending on the type of missing
values we have?
MCAR
When ignoring missing values, we keep a random sample : no
bias
The quality of the resulting model is reduced if there are many
missing values
MAR
When ignoring missing values, we create a bias
It is then preferable to impute the missing values
MNAR
When ignoring missing values, we also create a bias
Imputing missing values is more difficult since we have no
information about the generative process
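
The contrast between MCAR and MAR under deletion can be checked on simulated data. The sketch below is only an illustration under stated assumptions: the two Gaussian variables, the missing probabilities (0.3, 0.8, 0.1) and the helper name mean_after_deletion are all made up for this example.

```python
import random
from statistics import mean

random.seed(0)
n = 20000

# x1 is always observed; x2 is correlated with x1 and may be missing.
x1 = [random.gauss(0.0, 1.0) for _ in range(n)]
x2 = [a + random.gauss(0.0, 1.0) for a in x1]

# MCAR: every x2 value is missing with the same probability.
mcar_missing = [random.random() < 0.3 for _ in range(n)]

# MAR: the probability that x2 is missing depends on the observed x1.
mar_missing = [random.random() < (0.8 if a > 0 else 0.1) for a in x1]


def mean_after_deletion(values, missing):
    """Mean of the values that remain once missing entries are deleted."""
    return mean(v for v, m in zip(values, missing) if not m)


true_mean = mean(x2)
mcar_mean = mean_after_deletion(x2, mcar_missing)
mar_mean = mean_after_deletion(x2, mar_missing)

print(f"all data       : {true_mean:.3f}")
print(f"MCAR, deletion : {mcar_mean:.3f}")
print(f"MAR, deletion  : {mar_mean:.3f}")
```

Under MCAR the mean after deletion stays close to the mean of the full data, while under this MAR mechanism large values of x2 are deleted more often, so the remaining mean is pulled downwards.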
Solution 2 : imputation of missing values

Completing the observations that have missing values can improve the
precision of the model or reduce the bias
We have to compare the two strategies : imputation of missing
values and deletion of missing values
Several imputation strategies :
By a unique value (mean, median, etc.)
By the centroid of the group
Using k nearest neighbors
Imputation by a unique value

The value used could be
The most frequent value, if each variable takes only a few distinct values
The mean (not robust to outliers)
The median (robust to outliers)
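
A minimal sketch of these three choices, using only the Python standard library. The helper impute_constant and the toy variables are hypothetical names introduced for illustration, and missing values are assumed to be encoded as None.

```python
from statistics import mean, median, mode


def impute_constant(column, statistic):
    """Replace every None in `column` by `statistic` of the observed values."""
    observed = [v for v in column if v is not None]
    fill = statistic(observed)
    return [fill if v is None else v for v in column]


# A numeric variable with one outlier (100.0) and one missing value.
income = [1.0, 2.0, None, 3.0, 100.0]
by_mean = impute_constant(income, mean)      # fill value pulled up by the outlier
by_median = impute_constant(income, median)  # fill value unaffected by the outlier

# A categorical variable with few distinct values: use the most frequent one.
colour = ["red", "blue", None, "red", "green"]
by_mode = impute_constant(colour, mode)

print(by_mean)
print(by_median)
print(by_mode)
```

Here the mean of the observed values is 26.5 and the median is 2.5, which shows why the mean is not robust when a variable contains outliers.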
Imputation using the centroid of the group

Underlying assumption : there is a structure in the data, with groups
How can we impute missing values with this in mind?

Step 1 : run a clustering algorithm to create groups

Step 2 : for each observation with missing values, compute its
distance to the center of each group, using only its observed variables

Step 3 : for each observation with missing values, assign to each
missing variable the corresponding value of the nearest centroid
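
The three steps can be sketched as follows. The slides only ask for "a clustering algorithm", so the use of k-means here is an assumption, as are the Euclidean distance restricted to the observed variables, the fixed initial centroids, the encoding of missing values as None, and all function names.

```python
import math


def distance(a, b, dims):
    """Euclidean distance between a and b restricted to the coordinates in dims."""
    return math.sqrt(sum((a[j] - b[j]) ** 2 for j in dims))


def kmeans(points, init_centroids, n_iter=20):
    """Plain k-means (Lloyd's algorithm) on fully observed points."""
    centroids = [list(c) for c in init_centroids]
    dims = range(len(points[0]))
    for _ in range(n_iter):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: distance(p, centroids[c], dims))
            groups[nearest].append(p)
        for c, members in enumerate(groups):
            if members:  # keep the previous centroid if a group is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def impute_by_centroid(X, init_centroids):
    complete = [row for row in X if None not in row]
    centroids = kmeans(complete, init_centroids)  # Step 1: create groups
    imputed = []
    for row in X:
        observed = [j for j, v in enumerate(row) if v is not None]
        # Step 2: distance to each centroid, on the observed variables only.
        nearest = min(centroids, key=lambda c: distance(row, c, observed))
        # Step 3: copy the centroid's value into each missing variable.
        imputed.append([nearest[j] if v is None else v
                        for j, v in enumerate(row)])
    return imputed


X = [
    [1.0, 1.0], [1.0, 3.0], [3.0, 1.0], [3.0, 3.0],       # group around (2, 2)
    [9.0, 9.0], [9.0, 11.0], [11.0, 9.0], [11.0, 11.0],   # group around (10, 10)
    [2.5, None],
    [None, 9.5],
]
X_imputed = impute_by_centroid(X, init_centroids=[[1.0, 1.0], [11.0, 11.0]])
print(X_imputed[8], X_imputed[9])
```

In this toy example the clustering is fitted on the fully observed rows only, which is one possible design choice; the two incomplete rows then receive the missing coordinate of the centroid of their nearest group.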
Imputation using k nearest neighbors

Underlying assumption : for a given observation with missing
values, the nearest fully observed observations are the most informative
How shall we proceed?

Step 1 : for each observation with missing values, find its k nearest
neighbors among the fully observed data

Step 2 : for each observation with missing values, assign to each
missing variable the mean of the corresponding values of its k nearest
neighbors
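
The two steps can be sketched as follows. This is a minimal illustration, not the notebook's implementation: the function name knn_impute, the Euclidean distance computed on the observed variables only, and the encoding of missing values as None are assumptions.

```python
import math


def knn_impute(X, k):
    """Impute each missing value by the mean over the k nearest complete rows."""
    complete = [row for row in X if None not in row]
    imputed = []
    for row in X:
        if None not in row:
            imputed.append(list(row))
            continue
        observed = [j for j, v in enumerate(row) if v is not None]

        def dist(other):
            # Distance uses only the variables observed in `row`.
            return math.sqrt(sum((row[j] - other[j]) ** 2 for j in observed))

        # Step 1: k nearest neighbors among the fully observed data.
        neighbours = sorted(complete, key=dist)[:k]
        # Step 2: mean of the neighbors' values for each missing variable.
        imputed.append([
            sum(nb[j] for nb in neighbours) / len(neighbours) if v is None else v
            for j, v in enumerate(row)
        ])
    return imputed


X = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [8.0, 80.0],
    [2.4, None],
]
X_imputed = knn_impute(X, k=2)
print(X_imputed[4])
```

With k = 2, the incomplete row is closest to the rows starting with 2.0 and 3.0, so its missing value becomes the mean of 20.0 and 30.0. The choice of k is a tuning parameter that should be compared across values, in the same way as imputation is compared with deletion.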
Recap!

See the notebook
