Week 5 Lecture - Data Wrangling

Uploaded by

Sujal Shrestha
Copyright
© All Rights Reserved

Data Wrangling

Dr. Mariam Adedoyin-Olowe

Birmingham City University

CMP4294 Week 5
Outline
• Dealing with missing data
• Missing data mechanisms
  • Missing At Random (MAR)
  • Missing Completely At Random (MCAR)
  • Missing Not At Random (MNAR)
• Methods for handling missing values (MVs)
  • Deletion methods
    • Complete case analysis
    • Available case analysis
  • Imputation methods
    • Mean imputation
    • Regression imputation
    • Dummy variable imputation
    • Other imputation methods
What is Data Wrangling?

• It is the process of transforming and structuring data
from its raw form into a desired format, with the
purpose of improving data quality and making the data more
useful for analytics or machine learning.

• It is sometimes called data munging.


Why missing data?
Reasons for missing data:
• Equipment errors
• Absence of survey participants
• Unavailability of GPS signals in rural areas
• Change of circumstances, such as death, graduation, etc.
• Filter questions, for example when a set of questions in a
survey is only asked of participants who indicate they are married
Why should we pay attention?

Ignoring or inappropriately handling missing data may
lead to:

• Biased estimates
• Incorrect standard errors
• Incorrect inferences/results
Missing data mechanisms

• These mechanisms describe the relationship between the
missingness and the values of the variables, both the
variable that has missing values and the other variables.

• Deciding on a method for handling missing values
requires understanding both the reasons for the missing
values and the nature of the data for the
missing observations.
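
The three mechanisms covered on the following slides can also be stated more formally. The notation here is not from the slides and is an assumption for illustration: R is an indicator of which values are missing, Y_obs is the observed part of the data and Y_mis is the missing part.

```latex
\begin{align*}
\text{MCAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R) \\
\text{MAR:}\quad  & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R \mid Y_{\text{obs}}) \\
\text{MNAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) \text{ still depends on } Y_{\text{mis}} \text{ after conditioning on } Y_{\text{obs}}
\end{align*}
```

Read this way, MCAR is the case of MAR in which the missingness does not depend on the observed data either.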
Missing At Random (MAR)
• MAR occurs when the missingness is related to another,
observed variable, but it is not related to the value of
the variable that has missing data.

• Missingness in Y is MAR if it depends on X but not on
the value of Y itself, once X is taken into account.

• MCAR is a special case of MAR. That is, if the data are
MCAR, they are also MAR.

• MAR mechanisms are classified as ignorable.


Missing At Random

Examples:
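• If men are more likely to tell you their weight than
women, weight is MAR (provided that, within each sex,
reporting does not depend on the weight itself).

• If males are less likely to fill in a depression survey,
but this has nothing to do with their level of
depression after accounting for maleness, the depression
scores are MAR.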
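
The weight example above can be checked with a small simulation. This is only a sketch under invented assumptions: the population size, the weight distributions and the reporting probabilities (0.9 for men, 0.4 for women) are made up for illustration and do not come from the lecture.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: sex is always observed, weight may be missing.
people = []
for _ in range(10_000):
    sex = rng.choice(["M", "F"])
    weight = rng.gauss(80, 10) if sex == "M" else rng.gauss(65, 10)
    # MAR: the chance of reporting depends only on sex, never on weight.
    p_report = 0.9 if sex == "M" else 0.4
    reported = rng.random() < p_report
    people.append((sex, weight, reported))

true_mean = mean(w for _, w, _ in people)
observed_mean = mean(w for _, w, r in people if r)

true_f = mean(w for s, w, _ in people if s == "F")
observed_f = mean(w for s, w, r in people if s == "F" and r)

print(f"all people: true {true_mean:.1f}, observed {observed_mean:.1f}")
print(f"women only: true {true_f:.1f}, observed {observed_f:.1f}")
```

Under these assumptions the overall observed mean is pulled towards the men's weights, while the mean within one sex stays close to the truth. That is the practical sense in which MAR can be handled: the bias disappears once the variable that drives the missingness is taken into account.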
MCAR (Missing Completely At Random)
• Under MCAR, the missingness is unrelated to any variable,
observed or unobserved. MCAR is a special case of MAR:
if the data are MCAR, they are also MAR.

• An example of MCAR is a weighing scale that ran out of
batteries. Some of the data will be missing, for a reason
unrelated to the people being weighed.

• Another example is when we take a random sample of a
population, where each member has the same chance of being
included in the sample. The (unobserved) data of members of the
population who were not included in the sample are MCAR.

• Like MAR, MCAR mechanisms are classified as ignorable.
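
A minimal sketch of the weighing-scale example, assuming an invented weight distribution and an invented 30% chance that any single reading is lost:

```python
import random
from statistics import mean

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical true weights for 10,000 people.
weights = [rng.gauss(72, 12) for _ in range(10_000)]

# MCAR: every reading has the same 30% chance of being lost,
# regardless of the person's weight or any other characteristic.
observed = [w for w in weights if rng.random() >= 0.3]

print(f"true mean     {mean(weights):.1f}")
print(f"observed mean {mean(observed):.1f}")
print(f"kept {len(observed)} of {len(weights)} readings")
```

Because the lost readings are a random subset, the observed mean stays close to the true mean. What is lost is sample size, not representativeness.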


Missing Not At Random (MNAR)

• Missingness depends on the values of the variable that
has missing data: whether Y is missing depends on the
value of Y itself.

• Also known as non-ignorable.

For example:
• The probability of someone reporting their income depends on what
their income is.
• The probability of reporting psychiatric treatment depends on
whether or not they have received it.
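
The income example can be sketched the same way. The income distribution, the threshold of 30 and the reporting probabilities are all invented for illustration; the only point is that the chance of answering depends on the income itself.

```python
import random
from statistics import mean

rng = random.Random(2)  # fixed seed so the run is repeatable

# Hypothetical incomes (in thousands), drawn from a right-skewed distribution.
incomes = [rng.lognormvariate(3.4, 0.5) for _ in range(10_000)]

def reports_income(income):
    """MNAR rule (assumed): higher earners are less likely to answer."""
    p = 0.9 if income < 30 else 0.3
    return rng.random() < p

reported = [x for x in incomes if reports_income(x)]

print(f"true mean income     {mean(incomes):.1f}")
print(f"reported mean income {mean(reported):.1f}")
```

Here the reported mean is clearly lower than the true mean, and nothing in the observed data reveals by how much, because the variable that drives the missingness is the one that is missing.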
Why are MV mechanisms important?

• Methods that are often used for handling missing data
assume that the data are missing completely at random
(MCAR), or at least missing at random (MAR). They are
not valid for data that are MNAR.
Lessons from missing data mechanisms

• The collection protocol


• Different methods for validation
• What about ignoring missing values?
• Software default?
Methods for handling MVs
Complete Case Analysis

• A direct approach to dealing with missing values is to exclude them
• This can introduce bias
• It can also reduce statistical power
Complete Case Analysis

• Any observation that has a missing value for any
variable is automatically discarded, and only
complete observations are analysed.
• Simple and easy
• Direct comparability among variables
• Inefficient
• Can lead to bias, unless the data are MCAR
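
Complete case analysis can be sketched in a few lines of plain Python. The records and variable names below are hypothetical, and None is used as the marker for a missing value.

```python
# Hypothetical survey records; None marks a missing value.
records = [
    {"age": 25, "income": 30.0, "weight": 70.0},
    {"age": 32, "income": None, "weight": 82.0},
    {"age": 41, "income": 45.0, "weight": None},
    {"age": 29, "income": 38.0, "weight": 64.0},
    {"age": None, "income": 52.0, "weight": 90.0},
]

def complete_cases(rows):
    """Keep only the rows in which every variable is observed."""
    return [row for row in rows if all(v is not None for v in row.values())]

complete = complete_cases(records)
print(f"{len(complete)} of {len(records)} records kept")
```

Each incomplete record here is missing only one value, yet the whole record is dropped, which is why the method is described as inefficient. In pandas the equivalent operation is DataFrame.dropna().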
Available Case Analysis
• Rather than deleting the entire record (row) because of missingness
in any variable, available case analysis discards only the instance
(cell) with the missing value.

• Available-case analysis also arises when a researcher simply excludes a
variable or set of variables from the analysis because of their missing-data
rates (sometimes called “complete-variables analyses”).

• The statistics for each predictor are calculated on all data except
the missing values.
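
A minimal sketch of available case analysis, using hypothetical records with None as the missing-value marker. The helper names available_mean and pairwise_n are invented for this example.

```python
from statistics import mean

# Hypothetical survey records; None marks a missing value.
records = [
    {"age": 25, "income": 30.0, "weight": 70.0},
    {"age": 32, "income": None, "weight": 82.0},
    {"age": 41, "income": 45.0, "weight": None},
    {"age": 29, "income": 38.0, "weight": 64.0},
    {"age": None, "income": 52.0, "weight": 90.0},
]

def available_mean(rows, var):
    """Mean of one variable over every row where it is observed."""
    return mean(row[var] for row in rows if row[var] is not None)

def pairwise_n(rows, a, b):
    """Number of rows where both variables are observed."""
    return sum(1 for row in rows if row[a] is not None and row[b] is not None)

for var in ("age", "income", "weight"):
    print(var, available_mean(records, var))
print("age & income pairs:", pairwise_n(records, "age", "income"))
```

Each mean uses four of the five records and the age and income pair uses three, whereas only two records are complete on all variables. The price is that different statistics are computed on different subsets of the data.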
Methods for handling MVs
Imputation Methods

• Another approach to handling missing values is to fill in
or “impute” them rather than deleting them. There are
many methods for choosing the values to fill in.

• The advantage of this approach over the deletion
approach is that it preserves the sample size. In the
following slides we discuss common imputation
methods.
Simple Imputation Methods

• Mean / Mode imputation

• Regression imputation

• Missing data indicator

• Hot-deck: a method for handling missing data in which each
missing value is replaced with an observed response from a
"similar" unit.

• Last observation carried forward


...
Mean imputation
Replace the missing value with the mean of the variable’s
observed values
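
A minimal sketch of mean imputation on a hypothetical list of weights, with None marking the missing values:

```python
from statistics import mean, stdev

# Hypothetical weights; None marks a missing value.
weights = [70.0, None, 82.0, 64.0, None, 90.0]

observed = [w for w in weights if w is not None]
fill = mean(observed)  # mean of the observed values only

# Every missing value receives the same fill value.
imputed = [fill if w is None else w for w in weights]

print("imputed:", imputed)
print(f"stdev of observed values: {stdev(observed):.2f}")
print(f"stdev after imputation:   {stdev(imputed):.2f}")
```

The mean is unchanged by construction, but the standard deviation shrinks because every filled-in value sits exactly at the centre. This is one way that simple imputation can lead to standard errors that are too small.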
Regression imputation

• Replace the missing value with the expected value from a
regression. The regression models the variable that has
missing values using the other (independent) variables.

• In statistics, linear regression is an approach for modelling
the relationship between a variable y and one or more other
variables denoted X = (x_1, x_2, ..., x_n).

• Linear regression consists of finding the best-fitting straight
line through the points.

This is referred to as a regression line.
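
A minimal sketch of regression imputation with a single predictor, fitted by ordinary least squares. The data are hypothetical, and the choice of height as the predictor of weight is an assumption made only for this example.

```python
from statistics import mean

# Hypothetical data: height (cm) is always observed, weight (kg) is
# sometimes missing (None).
heights = [150.0, 160.0, 170.0, 180.0, 190.0, 165.0]
weights = [52.0, 58.0, 71.0, 79.0, None, None]

# Fit weight = intercept + slope * height on the complete pairs only.
pairs = [(h, w) for h, w in zip(heights, weights) if w is not None]
h_bar = mean(h for h, _ in pairs)
w_bar = mean(w for _, w in pairs)
slope = sum((h - h_bar) * (w - w_bar) for h, w in pairs) / sum(
    (h - h_bar) ** 2 for h, _ in pairs
)
intercept = w_bar - slope * h_bar

# Replace each missing weight with the value predicted from height.
imputed = [
    intercept + slope * h if w is None else w
    for h, w in zip(heights, weights)
]

print(f"slope {slope:.2f}, intercept {intercept:.1f}")
print([round(w, 1) for w in imputed])
```

Unlike mean imputation, the filled-in values differ from person to person. They still lie exactly on the regression line, however, so the natural scatter around the line is not reproduced.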


Dummy variable imputation

• A simple and common method is to add an extra (dummy)
variable that represents the missingness in the variable.

• The dummy variable has a value of 1 if the data are
observed and 0 if they are missing.

• Do simple imputation and include the indicator of
missingness as an additional predictor in regression
models.
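
The two steps above, building the indicator and then doing a simple imputation, can be sketched as follows. The data and the names income_observed and income_filled are hypothetical, and mean imputation is assumed as the simple imputation step.

```python
from statistics import mean

# Hypothetical incomes; None marks a missing value.
incomes = [30.0, None, 45.0, 38.0, None, 52.0]

observed = [x for x in incomes if x is not None]
fill = mean(observed)

# Indicator coded as on the slide: 1 = observed, 0 = missing.
income_observed = [0 if x is None else 1 for x in incomes]

# Simple (mean) imputation for the variable itself.
income_filled = [fill if x is None else x for x in incomes]

for flag, value in zip(income_observed, income_filled):
    print(flag, value)
```

Both columns would then be passed to the regression model as predictors, so the model can estimate a separate effect for the records whose value was filled in.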
Other imputation methods

Last observation carried forward:
If someone drops out of a study, the last value observed for
them is “carried forward” (copied) to later time points.

Hot-deck:
For an individual with missing data, find individuals
with the same observed values on other variables, and
randomly pick one of their values as the one to use for
imputation.
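
Both methods can be sketched in plain Python. The function names locf and hot_deck, the data and the matching variables are hypothetical, and the hot-deck version assumes that "similar" means exactly equal values on the matching variables.

```python
import random

def locf(series):
    """Carry the last observed value forward over later missing values."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)  # stays None until a first value is observed
    return filled

def hot_deck(rows, target, match_on, rng):
    """Fill a missing `target` from a random donor with equal `match_on` values."""
    result = []
    for row in rows:
        if row[target] is None:
            donors = [
                d[target] for d in rows
                if d[target] is not None
                and all(d[k] == row[k] for k in match_on)
            ]
            if donors:  # leave the gap if no matching donor exists
                row = {**row, target: rng.choice(donors)}
        result.append(row)
    return result

# Hypothetical repeated measurements for a participant who dropped out.
print(locf([7.1, 6.8, None, None]))

people = [
    {"sex": "F", "smoker": False, "weight": 62.0},
    {"sex": "F", "smoker": False, "weight": 66.0},
    {"sex": "M", "smoker": True, "weight": 85.0},
    {"sex": "F", "smoker": False, "weight": None},
]
filled = hot_deck(people, "weight", ["sex", "smoker"], random.Random(0))
print(filled[-1])
```

The last person receives the weight of one of the two matching donors, chosen at random. The sketch returns new records rather than changing the input, so the original data with its gaps remains available.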
Imputation methods

“The idea of imputation is both seductive and
dangerous. It is seductive because it can lull the user
into the satisfying state of believing that the data are
complete after all, and it is dangerous because it
lumps together situations where [its application is
legitimate and where it creates serious biases]”
Evaluation of MV methods
There is no perfect method for handling MVs. Three
criteria for evaluating any MV method are:

• Minimise bias: it is well known that missing
data can introduce bias into parameter estimates, so a
good method should make that bias as small as
possible.

• Maximise the use of available information: we want to
avoid discarding any data, and we want to use the
available data to produce parameter estimates that
are efficient.

• Yield good estimates of uncertainty, including accurate
estimates of standard errors.
In the next class…

• We will begin to learn about some data mining
techniques used for data analytics
