Chapter 2 - Data Cleansing 2
Chapter 2 - Data Cleansing 2
Missing Data
Blakely Tires
Identification of Erroneous Outliers and other Erroneous Values
Variable Representation
© 2021 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Data Cleansing
Missing Data:
• Data sets commonly include observations with missing values for one or
more variables.
• In some cases missing data naturally occur; these are called legitimately
missing data.
• Generally, no remedial action is taken for legitimately missing data.
• In other cases missing data occur for different reasons; these are called
illegitimately missing data.
• The primary options for addressing such missing data are:
1. To discard observations (rows) with any missing values.
2. To discard any variable (column) with missing values.
3. To fill in missing entries with estimated values.
4. To apply a data-mining algorithm that can handle missing values.
© 2021 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Data Cleansing
Missing Data (cont.):
• Missing completely at random (MCAR): The tendency for an observation to
be missing the value for some variable is entirely random; whether data are
missing does not depend on either the value of the missing data or the value of
any other variable in the data.
• Missing at random (MAR): The tendency for an observation to be missing a
value for some variable is related to the value of some other variable(s) in the
data.
• Missing not at random (MNAR): The tendency for the value of a variable to be
missing is related to the value that is missing.
• Imputation: The systematic replacement of missing values with values that
seem reasonable.
© 2021 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Data Cleansing
Blakely Tires:
• A U.S. producer of automobile tires wants to learn about the conditions
of its tires on automobiles in Texas.
• The data obtained includes the position of the tire on the automobile,
age of the tire, mileage on the tire, and depth of the remaining tread on
the tire.
• Begin assessing the quality of these data by determining which (if any)
observations have missing values (see Figure 2.30).
© 2021 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Data Cleansing
Figure 2.30: Portion of Excel Spreadsheet Showing Number of Missing
Values for Variables in TreadWear Data
© 2021 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Data Cleansing
Blakely Tires (cont.):
• Sort all of Blakely’s data on Miles from smallest to largest value to
determine which observation is missing its value of this variable.
Figure 2.31: Portion of Excel Spreadsheet Showing TreadWear Data
Sorted on Miles from Lowest to Highest Value
© 2021 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Data Cleansing
Figure 2.32: Portion of
Excel Spreadsheet
Showing TreadWear Data
Sorted from Lowest to
Highest by ID Number
© 2021 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Data Cleansing
Identification of Erroneous Outliers and other Erroneous Values:
• Examining the variables in the data set by use of summary statistics, frequency
distributions, bar charts and histograms, z-scores, scatter plots, correlation
coefficients, and other tools can uncover data-quality issues and outliers.
• Many software ignore missing values when calculating various summary
statistics.
• If missing values in a data set are indicated with a unique value (such as
9999999), these values may be used by software when calculating various
summary statistics.
• Both cases can result in misleading values for summary statistics.
• Many analysts prefer to deal with missing data issues prior to using summary
statistics to attempt to identify erroneous outliers and other erroneous values
in the data.
© 2021 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Data Cleansing
Figure 2.33: Portion of Excel Spreadsheet Showing the Mean and
Standard Deviation for Each Variable in the TreadWear Data
© 2021 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Data Cleansing
Figure 2.34: Portion of Excel Spreadsheet Showing the TreadWear Data
Sorted on Life of Tires (Months) from Lowest to Highest Value
© 2021 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Data Cleansing
Figure 2.35: Scatter Diagram
of Tread Depth and Miles for
the TreadWear Data
© 2021 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Data Cleansing
Variable Representation:
• In many data-mining applications, it may be prohibitive to analyze the
data because of the number of variables recorded.
• Dimension reduction is the process of removing variables from the
analysis without losing crucial information.
• A critical part of data mining is determining how to represent the
measurements of the variables and which variables to consider.
• Often data sets contain variables that, considered separately, are not
particularly insightful but that, when appropriately combined, result in a
new variable that reveals an important relationship.
© 2021 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
End of Chapter 2
© 2021 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.