In the last lecture we discussed data quality issues. We will now discuss some common techniques for addressing those quality issues. After this video, you will be able to define what imputation means, illustrate three ways to handle missing values, and describe the role of domain knowledge in addressing data quality issues.

As we discussed in the last lecture, real-world data is messy. Some data quality issues that you can find in your data are missing values, duplicate data, invalid data, noise and outliers. You will need to clean your data if you want to perform any meaningful analysis on that data.

Recall that missing data occurs when you don't have a value for certain variables in some samples. A simple way to handle missing data is to simply drop any samples with missing values or NAs. All machine learning tools provide a mechanism or command for filtering out rows with any missing values. The advantage of this approach is that it is very simple. The caveat is that you are removing data when you filter out examples. If the number of samples dropped is large, then you end up losing a lot of your data.
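
To make this concrete, a minimal sketch of that filtering step in pandas might look like the following (the DataFrame and its columns are hypothetical, not from the lecture):

    import numpy as np
    import pandas as pd

    # Hypothetical samples; NaN marks a missing value (an NA).
    df = pd.DataFrame({
        "years_employed": [3.0, np.nan, 10.0],
        "income": [52000.0, 48000.0, np.nan],
    })

    # Drop any sample (row) that has at least one missing value.
    df_complete = df.dropna()

Here the two rows containing a NaN are removed, which is exactly the data loss the caveat above refers to.
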
An alternative to dropping samples with missing data is to impute the missing values. Imputing means to replace the missing values with some reasonable values. The advantage of this approach is that you're making use of all your data. Of course, imputing is more complicated than simply dropping samples.

There are several ways to impute missing values. One strategy is to replace the missing values with the mean or median value of the variable. For example, a missing value for years of employment can be replaced by the mean or median value for years of employment for all current employees. Another approach is to use the most frequent value in place of the missing value. For example, the most frequently recorded age of customers associated with the specific item can be used if that value is missing. Alternatively, a sensible value can be derived as a replacement for a missing value. For example, a missing value for income can be set to zero for customers less than 18 years old, or it can be replaced with an average value based on occupation and location. Note that this approach requires knowledge about the application and the variable with missing values in order to make reasonable choices about what values would be sensible to replace the missing values.
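
As a rough sketch of these three strategies in pandas (the column names, the sample values, and the under-18 rule are assumptions for illustration only):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "years_employed": [3.0, np.nan, 10.0, 7.0],
        "age": [25.0, 16.0, np.nan, 34.0],
        "income": [52000.0, np.nan, 61000.0, np.nan],
    })

    # 1. Replace missing values with the mean (or median) of the variable.
    df["years_employed"] = df["years_employed"].fillna(df["years_employed"].mean())

    # 2. Replace missing values with the most frequent value (the mode).
    df["age"] = df["age"].fillna(df["age"].mode()[0])

    # 3. Derive a sensible value: missing income becomes zero for customers
    #    under 18; remaining gaps fall back to the overall median income
    #    (a stand-in for an average by occupation and location).
    under_18 = df["age"] < 18
    df.loc[under_18, "income"] = df.loc[under_18, "income"].fillna(0)
    df["income"] = df["income"].fillna(df["income"].median())
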
In the case of duplicate data, one approach is to delete the older record. Another approach is to merge duplicate records. This often requires a way to determine how to resolve conflicting values. For example, in the case of multiple addresses for the same customer, some logic for determining similarities between addresses might be necessary. For example, "St." is the same as "Street."
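
A small sketch of both ideas in pandas, assuming hypothetical customer records and a very simple St./Street normalization rule:

    import pandas as pd

    df = pd.DataFrame({
        "customer_id": [101, 101, 102],
        "address": ["12 Main St.", "12 Main Street", "7 Oak Ave"],
        "last_updated": ["2015-01-10", "2016-03-02", "2016-05-20"],
    })

    # Normalize abbreviations so that "St." and "Street" compare as equal.
    df["address_norm"] = (df["address"]
                          .str.lower()
                          .str.replace("st.", "street", regex=False))

    # Delete the older record: keep only the most recent row per customer.
    df["last_updated"] = pd.to_datetime(df["last_updated"])
    deduped = (df.sort_values("last_updated")
                 .drop_duplicates(subset="customer_id", keep="last"))
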
To address invalid data, consulting another data source may be necessary. For example, an invalid zip code can be corrected by looking up the correct zip code based on city and state. A best estimate for a reasonable value can also be used as a replacement. For example, for a missing age value for an employee, a reasonable value can be estimated based on the employee's length of employment.
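
One way that zip code correction could be sketched in pandas (both tables and the validity rule are made up for illustration):

    import pandas as pd

    customers = pd.DataFrame({
        "city": ["San Diego", "Seattle"],
        "state": ["CA", "WA"],
        "zip": ["00000", "98101"],     # the first zip code is invalid
    })

    # Reference table from another, trusted data source.
    zip_lookup = pd.DataFrame({
        "city": ["San Diego", "Seattle"],
        "state": ["CA", "WA"],
        "zip_expected": ["92101", "98101"],
    })

    # Look up the expected zip by city and state, and overwrite any zip
    # that does not match it (a simplifying notion of "invalid").
    fixed = customers.merge(zip_lookup, on=["city", "state"])
    invalid = fixed["zip"] != fixed["zip_expected"]
    fixed.loc[invalid, "zip"] = fixed.loc[invalid, "zip_expected"]
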
Noise that distorts the data values can be addressed by filtering out the source of the noise. For example, filtering out the frequency of a constant background noise will remove that noise component from a recording. This filtering must be done with care, however, as it can also remove some components of the true data in the process.
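
For the audio example, frequency filtering along these lines could be sketched with SciPy; the sampling rate, the 60 Hz hum, and the signal itself are all assumptions made up for this illustration:

    import numpy as np
    from scipy.signal import filtfilt, iirnotch

    fs = 1000.0          # sampling rate in Hz (assumed)
    hum_freq = 60.0      # constant background hum to remove (assumed)

    # A made-up recording: a 5 Hz signal of interest plus a 60 Hz hum.
    t = np.arange(0, 1.0, 1 / fs)
    recording = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * hum_freq * t)

    # Notch filter centered on the hum frequency, applied with zero phase.
    b, a = iirnotch(hum_freq, Q=30.0, fs=fs)
    cleaned = filtfilt(b, a, recording)

    # The caution above applies: anything in the true signal near 60 Hz
    # would be attenuated as well.
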
Outliers can be detected through the use of summary statistics and plots of the data. Outliers can significantly skew the distribution of your data and thus the results of your analysis. In cases where outliers are not the focus of your analysis, you will want to remove these outlier samples from your data set. For example, when a thermostat malfunctions and causes values to fluctuate wildly, or to be much higher or lower than normal, these samples should be filtered out.
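
A minimal sketch of detecting and filtering those thermostat-style outliers with summary statistics (the readings and the 1.5 x IQR rule are illustrative choices, not the only option):

    import pandas as pd

    # Hypothetical temperature readings with two malfunction spikes.
    readings = pd.Series([21.0, 21.5, 20.8, 22.1, 85.0, 21.3, -40.0])

    # Summary statistics: quartiles and the interquartile range (IQR).
    q1, q3 = readings.quantile(0.25), readings.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    # Flag values far outside the bulk of the data as outliers...
    outliers = (readings < lower) | (readings > upper)

    # ...and, since they are not the focus of the analysis, filter them out.
    cleaned = readings[~outliers]
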
In some applications, however, outliers are exactly what you're looking for. So when you detect outliers, you don't want to throw them out. Instead, you want to examine them more closely. A classic example of this is in fraud detection, where outliers represent potential fraudulent use, and those samples should be analyzed closely.
In order to address data quality issues effectively, knowledge about the application is crucial. Things such as how the data was collected, the user population, the intended use of the application, etc., are important. This domain knowledge is essential to making informed decisions on how to best impute missing values, how to handle duplicate records and invalid data, and what to do about noise and outliers in your data.
