Assignment 1

This document contains an assignment submission for a data science course. It includes details of the student, course, lecturer and date of submission. The assignment asks the student to define and discuss missing data, the different types of missing data with examples, complete case analysis, imputation methods with examples, and why missing data needs to be handled.

Uploaded by

ng boon jane

ASSIGNMENT 1

COURSE CODE: BWB20403

NAME OF COURSE: DATA SCIENCE

FACULTY: FAST

NAME OF STUDENT: NG BOON JANE

MATRIC NUMBER OF STUDENT: AW170184

NAME OF LECTURER: DR. KHUNESWARI A/P GOPAL PILLAY

DATE OF SUBMISSION: 02 OCT 2018

MARKS
1. Define the meaning of missing data.
In statistics, missing data occur when no value is stored for a variable in
an observation. In R, missing data are represented by the symbol NA (not
available). NA is not a string or a numeric value, but an indicator of
missingness. Undefined numeric results (e.g., dividing zero by zero) are
represented by the symbol NaN (not a number); dividing a nonzero number by
zero gives Inf or -Inf rather than NaN. Unlike SAS, R uses the same symbol
for character and numeric data. SPSS distinguishes two types of missing data:
system-missing values are completely absent from the data, whereas
user-missing values are present in the data but must be excluded
from calculations.
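
The distinction between a missing-value marker and an undefined numeric result can be sketched outside R as well. The snippet below is only an analogy in Python, not R's behaviour: it assumes that None plays the role of NA and float NaN the role of R's NaN, and the earnings column and the is_missing helper are hypothetical names made up for illustration.

```python
import math

# Hypothetical survey column. None marks a value that was never recorded
# (playing the role of R's NA); inf - inf is an undefined numeric result (NaN).
earnings = [52000.0, None, 61000.0, float("inf") - float("inf"), 48000.0]

def is_missing(value):
    """Treat both None and NaN as missing for analysis purposes."""
    return value is None or (isinstance(value, float) and math.isnan(value))

# Count the missing entries and keep only the usable ones.
n_missing = sum(is_missing(v) for v in earnings)
observed = [v for v in earnings if not is_missing(v)]

print(n_missing, observed)
```

Note that NaN never compares equal to itself, which is why the check uses math.isnan rather than an equality test.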

2. Discuss the type of missing data with related examples.


Missing data are classified according to assumptions about the reasons for
the missingness. In general, there are three types of missing data, one for
each mechanism of missingness.
• Missing data are missing completely at random (MCAR) when the
probability of missing data on a variable is unrelated to any other
measured variable and is unrelated to the values of the variable
itself. In other words, the missingness on the variable is completely
unsystematic. MCAR is the only missing data mechanism that can be
tested against the observed data. For example, suppose each survey
respondent decides whether to answer the “earnings” question by
rolling a die and refuses to answer if a “6” shows up. If data are
missing completely at random, then throwing out cases with missing
data does not bias your inferences.
• Missing data are missing at random (MAR) when the probability of
missing data on a variable is related to some other measured variable in
the model, but not to the value of the variable with missing values
itself. For example, take political opinion polls. Many people refuse to
answer. If you assume that the reasons people refuse to answer are
entirely based on demographics, and if you have those demographics
for each person, then the data are MAR. It is known that some of the
reasons why people refuse to answer are related to demographics
(for instance, people with low and with high incomes are less likely to
answer than those in the middle), but there is no way to know whether
that is the full explanation.
• Data are missing not at random (MNAR) when the probability of
missingness on a variable is related to the unobserved values of that
variable itself, even after controlling for other variables. A familiar
example from medical studies is that if a particular treatment causes
discomfort, a patient is more likely to drop out of the study. This
missingness is not at random (unless “discomfort” is measured and
observed for all patients). If missingness is not at random, it must be
explicitly modeled, or else you must accept some bias in your inferences.
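
The three mechanisms above can be illustrated with a small simulation. This is a minimal sketch under made-up assumptions: the population, the age and earnings variables, and the missingness probabilities (0.6 and 0.1) are all hypothetical and chosen only to make the effect visible.

```python
import random
import statistics

random.seed(0)

# Hypothetical population: age is always observed, earnings rise with age.
n = 20000
ages = [random.uniform(20, 60) for _ in range(n)]
earnings = [1000 * a + random.gauss(0, 5000) for a in ages]

def observed_mean(values, is_missing):
    """Mean of the values whose missingness flag is False."""
    return statistics.fmean(v for v, m in zip(values, is_missing) if not m)

# MCAR: a die roll, unrelated to age or earnings.
mcar = [random.randint(1, 6) == 6 for _ in range(n)]
# MAR: missingness depends only on the observed variable (age).
mar = [random.random() < (0.6 if a > 40 else 0.1) for a in ages]
# MNAR: missingness depends on the unobserved earnings value itself.
mnar = [random.random() < (0.6 if e > 40000 else 0.1) for e in earnings]

true_mean = statistics.fmean(earnings)
mean_mcar = observed_mean(earnings, mcar)
mean_mar = observed_mean(earnings, mar)
mean_mnar = observed_mean(earnings, mnar)

# Under MAR, conditioning on the observed variable removes the bias:
# compare older respondents only, with and without the missing cases.
older_all = statistics.fmean(e for e, a in zip(earnings, ages) if a > 40)
older_obs = statistics.fmean(
    e for e, a, m in zip(earnings, ages, mar) if a > 40 and not m
)

print(round(true_mean), round(mean_mcar), round(mean_mar), round(mean_mnar))
```

In this sketch the mean of the observed earnings stays close to the true mean under MCAR but is pulled downward under both MAR and MNAR. The difference is that under MAR the bias disappears once the comparison is made within an age group, whereas under MNAR no observed variable explains it.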
3. Discuss complete case analysis.
Complete case analysis, also called listwise deletion, is the most common
method for handling missing data, probably because of its simplicity. It
confines attention to cases where all variables are present. If data are
missing completely at random, meaning that the chance of data being missing
is unrelated to any of the variables involved in the analysis, a complete case
analysis is unbiased. This is because the subset of complete cases represents
a random (albeit smaller than intended) sample from the population. In
general, if the complete cases are systematically different from the sample as
a whole (that is, different from the incomplete cases), i.e. the data are not
missing completely at random, analysing only the complete cases will lead to
biased estimates.
Complete case analysis has some advantages. The first is that the method is
very easy to perform. The second is that univariate statistics can be
compared, because all statistics are based on the same subjects once the
incomplete cases (all observations with at least one missing value) have
been deleted from the analysis. The third is that if the assumption of
MCAR holds, the parameter estimates will be unbiased. Complete case
analysis also has drawbacks. The main disadvantage is that incomplete cases
are not considered in the analysis, so even their observed values are
discarded. This inevitably leads to a loss of information.
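
Listwise deletion as described above can be sketched in a few lines. The records and the complete_cases helper below are hypothetical and exist only for illustration; None is assumed to mark a missing value.

```python
# Hypothetical survey records; None marks a missing value.
records = [
    {"age": 34, "income": 52000, "score": 7},
    {"age": 45, "income": None, "score": 5},
    {"age": 29, "income": 48000, "score": None},
    {"age": 51, "income": 61000, "score": 8},
    {"age": None, "income": 39000, "score": 6},
]

def complete_cases(rows):
    """Listwise deletion: keep only rows with no missing value in any variable."""
    return [row for row in rows if all(v is not None for v in row.values())]

complete = complete_cases(records)

# Share of cases discarded, including their observed values.
fraction_lost = 1 - len(complete) / len(records)

print(len(complete), fraction_lost)
```

Each of the three deleted rows has only one missing value, yet its two observed values are thrown away as well, which is the loss of information noted above.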

4. Explain the meaning of imputation with related examples.


Imputation is the process of replacing missing values with plausible
substitute values so that the dataset can be analysed as if it were complete.
There are two types of imputation: single imputation (SI) and multiple
imputation (MI). SI fills in each missing value once. Imputations can be either
draws or means from a predictive distribution of the missing values. There are
many ways to fill in values for missing observations. One simple way is to
compute the mean of the observed cases of the variable of interest and impute
this unconditional mean for each missing value. MI is a simulation-based
method for handling missing data. It generates multiple completed datasets,
performs the statistical analysis on each of them, and pools (averages) the
results. Basically, multiple imputation takes a simple imputation and adds a
random value to it to restore the randomness lost in the imputation process.
The averaging must therefore come after the analysis: averaging the imputed
datasets before doing any statistical analysis on them removes most of that
restored randomness and gives a result close to simple imputation plus a
small random error.
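
The contrast between the two approaches can be sketched as follows. This is a deliberately simplified illustration, not a full multiple imputation procedure: the data values, the choice of 20 imputations, and the stochastic_fill helper are assumptions, the random draws come from a normal distribution fitted to the observed values, and the pooling step only averages the point estimates.

```python
import random
import statistics

random.seed(1)

# Hypothetical variable with missing entries marked as None.
data = [12.0, 15.0, None, 11.0, 18.0, None, 14.0, 16.0, None, 13.0]
observed = [x for x in data if x is not None]
obs_mean = statistics.fmean(observed)
obs_sd = statistics.stdev(observed)

# Single imputation: fill every gap with the unconditional mean.
mean_imputed = [obs_mean if x is None else x for x in data]

def stochastic_fill(values):
    """One completed dataset: a random draw around the observed mean per gap."""
    return [random.gauss(obs_mean, obs_sd) if x is None else x for x in values]

# Multiple imputation (simplified): analyse each dataset, then pool the estimates.
m = 20
imputed_sets = [stochastic_fill(data) for _ in range(m)]
estimates = [statistics.fmean(d) for d in imputed_sets]
pooled_mean = statistics.fmean(estimates)

# Mean imputation shrinks the spread; the random draws restore some of it.
sd_mean_imputed = statistics.stdev(mean_imputed)
avg_sd_multiple = statistics.fmean(statistics.stdev(d) for d in imputed_sets)

print(round(obs_sd, 2), round(sd_mean_imputed, 2), round(avg_sd_multiple, 2))
```

Mean imputation leaves the mean unchanged but understates the variability, because every imputed value sits exactly at the centre. Analysing each completed dataset separately and pooling afterwards keeps the added randomness in the analysis.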
5. Why do you need to handle missing data?
Missing data need to be handled because they can cause serious problems.
First, most statistical procedures automatically eliminate cases with missing
data. This means that in the end, you may not have enough data to perform the
analysis. For example, you could not run a factor analysis on just a few cases.
Second, the analysis might run, but real effects may fail to reach statistical
significance because of the small amount of input data. Third, your results
may be misleading if the cases you analyse are not a random sample of all
cases. In that situation missing data introduce bias: whenever segments of
your target population do not respond, they become underrepresented in your
data, and you end up not analysing what you intended to measure.
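
The first problem, loss of cases, grows quickly with the number of variables. The sketch below assumes a hypothetical scenario in which each variable is independently missing for 5% of cases; both the rate and the complete_fraction helper are illustrative assumptions.

```python
# Assumed scenario: each of k variables is independently missing for 5% of cases.
p_missing = 0.05

def complete_fraction(k, p=p_missing):
    """Expected share of cases with all k variables observed."""
    return (1 - p) ** k

# Show how the share of usable cases shrinks as variables are added.
for k in (1, 5, 10, 20):
    print(f"{k:2d} variables: {complete_fraction(k):.1%} of cases remain complete")
```

Under these assumptions only about 60% of cases remain complete with ten variables, even though each variable on its own is 95% observed.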
