Chapter 2 - Data Cleansing 2

The document discusses data cleansing techniques, focusing on handling missing data, identifying erroneous values, and variable representation. It outlines methods for addressing missing data, including discarding observations or using imputation, and categorizes missing data into MCAR, MAR, and MNAR. Additionally, it highlights the importance of examining data quality through statistical tools and the need for dimension reduction in data mining applications.

Data Cleansing

Missing Data
Blakely Tires
Identification of Erroneous Outliers and other Erroneous Values
Variable Representation

© 2021 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Missing Data:
• Data sets commonly include observations with missing values for one or
more variables.
• In some cases missing data naturally occur; these are called legitimately
missing data.
• Generally, no remedial action is taken for legitimately missing data.
• In other cases missing data occur for different reasons; these are called
illegitimately missing data.
• The primary options for addressing such missing data are:
1. To discard observations (rows) with any missing values.
2. To discard any variable (column) with missing values.
3. To fill in missing entries with estimated values.
4. To apply a data-mining algorithm that can handle missing values.
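
The first two options above can be sketched in a few lines. This is a minimal illustration only, assuming a small data set held as a list of dictionaries in which None marks a missing entry; the records, field names, and helper functions are hypothetical and are not taken from the Blakely data.

```python
# Hypothetical records; None marks a missing entry.
rows = [
    {"id": 1, "miles": 20500.0, "tread_depth": 7.1},
    {"id": 2, "miles": None, "tread_depth": 6.4},
    {"id": 3, "miles": 31000.0, "tread_depth": 5.2},
]

def drop_incomplete_rows(records):
    """Option 1: keep only observations (rows) with no missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def drop_incomplete_columns(records):
    """Option 2: keep only variables (columns) that are never missing."""
    complete = [k for k in records[0] if all(r[k] is not None for r in records)]
    return [{k: r[k] for k in complete} for r in records]

complete_rows = drop_incomplete_rows(rows)     # observation 2 is discarded
complete_cols = drop_incomplete_columns(rows)  # the "miles" variable is discarded
```

Option 1 gives up one observation here, while option 2 gives up an entire variable; which loss is smaller depends on how the missing values are spread across rows and columns.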

Missing Data (cont.):
• Missing completely at random (MCAR): The tendency for an observation to
be missing the value for some variable is entirely random; whether data are
missing does not depend on either the value of the missing data or the value of
any other variable in the data.
• Missing at random (MAR): The tendency for an observation to be missing a
value for some variable is related to the value of some other variable(s) in the
data.
• Missing not at random (MNAR): The tendency for the value of a variable to be
missing is related to the value that is missing.
• Imputation: The systematic replacement of missing values with values that
seem reasonable.
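
As one possible illustration of imputation, the sketch below replaces each missing value with the median of the observed values. The choice of the median is an assumption made for this example (the mean or a model-based estimate are other possibilities), and the values and the function name impute_median are hypothetical.

```python
from statistics import median

def impute_median(values):
    """Replace each None with the median of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

miles = [20500.0, None, 31000.0, 27500.0]  # hypothetical values
imputed = impute_median(miles)
```

A simple fill of this kind is easiest to justify when the data are MCAR; when the data are MNAR, the observed values may not be representative of the missing ones, so the imputed values can be biased.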

Blakely Tires:
• A U.S. producer of automobile tires wants to learn about the conditions
of its tires on automobiles in Texas.
• The data obtained include the position of the tire on the automobile,
age of the tire, mileage on the tire, and depth of the remaining tread on
the tire.
• Begin assessing the quality of these data by determining which (if any)
observations have missing values (see Figure 2.30).
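
Outside a spreadsheet, the same count of missing values per variable might be produced as follows. This is only a sketch: the four records are invented, the field names are assumed stand-ins for the TreadWear variables, and None is assumed to mark a missing entry.

```python
def count_missing(records):
    """Return {variable: number of missing (None) entries} across all records."""
    return {k: sum(1 for r in records if r[k] is None) for k in records[0]}

# Hypothetical records, not the actual TreadWear data.
tires = [
    {"id": 1, "position": "LF", "life_months": 24.0, "miles": 20500.0, "tread_depth": 7.1},
    {"id": 2, "position": "RF", "life_months": None, "miles": 31000.0, "tread_depth": 5.2},
    {"id": 3, "position": "LR", "life_months": 18.0, "miles": None, "tread_depth": 6.4},
    {"id": 4, "position": "RR", "life_months": None, "miles": 27500.0, "tread_depth": 5.9},
]

missing_counts = count_missing(tires)
```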

Figure 2.30: Portion of Excel Spreadsheet Showing Number of Missing
Values for Variables in TreadWear Data

Blakely Tires (cont.):
• Sort all of Blakely’s data on Miles from smallest to largest value to
determine which observation is missing its value for this variable.
Figure 2.31: Portion of Excel Spreadsheet Showing TreadWear Data
Sorted on Miles from Lowest to Highest Value
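
The sorting step can be imitated in code. The sketch below assumes hypothetical records with None for a missing Miles value; it sorts ascending with missing values placed last, and also looks up the affected ID directly, which avoids the sort altogether.

```python
# Hypothetical records; None marks the missing Miles value.
tires = [
    {"id": 101, "miles": 31000.0},
    {"id": 102, "miles": None},
    {"id": 103, "miles": 20500.0},
]

# Sort ascending on miles; the first key element pushes missing values to the end.
by_miles = sorted(tires, key=lambda r: (r["miles"] is None, r["miles"] or 0.0))

# Direct lookup of the observations that lack a Miles value.
missing_ids = [r["id"] for r in tires if r["miles"] is None]
```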

Figure 2.32: Portion of Excel Spreadsheet Showing TreadWear Data
Sorted from Lowest to Highest by ID Number

Identification of Erroneous Outliers and other Erroneous Values:
• Examining the variables in the data set by use of summary statistics, frequency
distributions, bar charts and histograms, z-scores, scatter plots, correlation
coefficients, and other tools can uncover data-quality issues and outliers.
• Many software packages ignore missing values when calculating various summary
statistics.
• If missing values in a data set are indicated with a unique value (such as
9999999), these values may be used by software when calculating various
summary statistics.
• Both cases can result in misleading values for summary statistics.
• Many analysts prefer to deal with missing data issues prior to using summary
statistics to attempt to identify erroneous outliers and other erroneous values
in the data.
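
A small numerical sketch shows how a sentinel code can distort a summary statistic, and how z-scores might be computed once the sentinel has been removed. The Miles values are hypothetical; 9999999 is the sentinel mentioned above; the helper z_scores is an illustrative name and uses the sample standard deviation.

```python
from statistics import mean, stdev

SENTINEL = 9999999  # code assumed to flag a missing value

miles = [20500, 31000, SENTINEL, 27500]  # hypothetical values

# Treating the sentinel as real data inflates the mean.
naive_mean = mean(miles)

# Removing the sentinel first gives the mean of the observed values only.
clean = [m for m in miles if m != SENTINEL]
clean_mean = mean(clean)

def z_scores(values):
    """Standardize values: (value - mean) / sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

z = z_scores(clean)
```

Values with unusually large absolute z-scores are candidates for closer inspection; whether they are actually erroneous still has to be judged from context.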

Figure 2.33: Portion of Excel Spreadsheet Showing the Mean and
Standard Deviation for Each Variable in the TreadWear Data

Figure 2.34: Portion of Excel Spreadsheet Showing the TreadWear Data
Sorted on Life of Tires (Months) from Lowest to Highest Value

Figure 2.35: Scatter Diagram of Tread Depth and Miles for the TreadWear Data

Variable Representation:
• In many data-mining applications, analyzing the data may be impractical
because of the large number of variables recorded.
• Dimension reduction is the process of removing variables from the
analysis without losing crucial information.
• A critical part of data mining is determining how to represent the
measurements of the variables and which variables to consider.
• Often data sets contain variables that, considered separately, are not
particularly insightful but that, when appropriately combined, result in a
new variable that reveals an important relationship.
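
Combining variables can be as simple as forming a ratio. The sketch below derives a miles-per-month variable from mileage and tire age. This particular combination is chosen purely for illustration and is not suggested by the source; the records and field names are hypothetical.

```python
# Hypothetical records, not the actual TreadWear data.
tires = [
    {"id": 1, "life_months": 24.0, "miles": 24000.0},
    {"id": 2, "life_months": 12.0, "miles": 18000.0},
]

def add_miles_per_month(records):
    """Return copies of the records with a derived miles_per_month variable."""
    return [{**r, "miles_per_month": r["miles"] / r["life_months"]} for r in records]

derived = add_miles_per_month(tires)
```

The derived variable expresses usage intensity, which neither mileage nor age shows on its own.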

End of Chapter 2
