Chapter 2 - Data Cleansing 2

The document discusses data cleansing techniques, focusing on handling missing data, identifying erroneous values, and variable representation. It outlines methods for addressing missing data, including discarding observations or using imputation, and categorizes missing data into MCAR, MAR, and MNAR. Additionally, it highlights the importance of examining data quality through statistical tools and the need for dimension reduction in data mining applications.

Data Cleansing

Missing Data
Blakely Tires
Identification of Erroneous Outliers and other Erroneous Values
Variable Representation

© 2021 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Missing Data:
• Data sets commonly include observations with missing values for one or
more variables.
• In some cases missing data naturally occur; these are called legitimately
missing data.
• Generally, no remedial action is taken for legitimately missing data.
• In other cases missing data occur for different reasons; these are called
illegitimately missing data.
• The primary options for addressing such missing data are:
1. To discard observations (rows) with any missing values.
2. To discard any variable (column) with missing values.
3. To fill in missing entries with estimated values.
4. To apply a data-mining algorithm that can handle missing values.
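The first three options can be sketched in a few lines of plain Python. This is a minimal illustration only: the records, the column names, and the choice of the column mean as the estimate are assumptions made for the example, not part of the Blakely Tires data.

```python
# Hypothetical tire records; None marks a missing value.
rows = [
    {"id": 1, "miles": 20000.0, "tread_depth": 7.5},
    {"id": 2, "miles": None, "tread_depth": 6.0},
    {"id": 3, "miles": 35000.0, "tread_depth": None},
    {"id": 4, "miles": 50000.0, "tread_depth": 3.5},
]
columns = ["id", "miles", "tread_depth"]

# Option 1: discard observations (rows) with any missing value.
complete_rows = [r for r in rows if all(r[c] is not None for c in columns)]

# Option 2: discard variables (columns) with any missing value.
complete_columns = [c for c in columns if all(r[c] is not None for r in rows)]
without_incomplete_columns = [{c: r[c] for c in complete_columns} for r in rows]

# Option 3: fill in missing entries with an estimate.
# The column mean is used here as one assumed, simple choice of estimate.
def column_mean(name):
    observed = [r[name] for r in rows if r[name] is not None]
    return sum(observed) / len(observed)

filled = [
    {c: (r[c] if r[c] is not None else column_mean(c)) for c in columns}
    for r in rows
]

print(len(complete_rows), complete_columns, filled[1]["miles"])
```

Note how costly options 1 and 2 are on this small sample: discarding rows loses half the observations, and discarding columns leaves only the identifier.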

Missing Data (cont.):
• Missing completely at random (MCAR): The tendency for an observation to
be missing the value for some variable is entirely random; whether data are
missing does not depend on either the value of the missing data or the value of
any other variable in the data.
• Missing at random (MAR): The tendency for an observation to be missing a
value for some variable is related to the value of some other variable(s) in the
data.
• Missing not at random (MNAR): The tendency for the value of a variable to be
missing is related to the value that is missing.
• Imputation: The systematic replacement of missing values with values that
seem reasonable.
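One way to see the difference between the three mechanisms is to simulate them. The sketch below is illustrative only: the records, the thresholds (48 months, 50,000 miles), the probabilities, and the helper name knock_out are all hypothetical.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical complete data: tire age in months and mileage.
tires = [
    {"age": 12, "miles": 15000.0},
    {"age": 24, "miles": 52000.0},
    {"age": 36, "miles": 30000.0},
    {"age": 48, "miles": 61000.0},
    {"age": 60, "miles": 28000.0},
    {"age": 72, "miles": 70000.0},
]

def knock_out(rows, is_at_risk, p):
    """Return copies of rows, blanking 'miles' with probability p when at risk."""
    return [
        dict(r, miles=None) if is_at_risk(r) and random.random() < p else dict(r)
        for r in rows
    ]

# MCAR: every observation is equally likely to lose its value.
mcar_data = knock_out(tires, lambda r: True, 0.3)

# MAR: missingness depends on another variable (age), not on miles itself.
mar_data = knock_out(tires, lambda r: r["age"] > 48, 0.6)

# MNAR: missingness depends on the value that goes missing (high mileage).
mnar_data = knock_out(tires, lambda r: r["miles"] > 50000, 0.6)

for name, data in [("MCAR", mcar_data), ("MAR", mar_data), ("MNAR", mnar_data)]:
    print(name, [r["miles"] for r in data])
```

In practice the mechanism is not observed directly. MNAR in particular cannot be confirmed from the data alone, because the values that would reveal it are the ones that are missing.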

Blakely Tires:
• A U.S. producer of automobile tires wants to learn about the conditions
of its tires on automobiles in Texas.
• The data obtained include the position of the tire on the automobile,
age of the tire, mileage on the tire, and depth of the remaining tread on
the tire.
• Begin assessing the quality of these data by determining which (if any)
observations have missing values (see Figure 2.30).
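Outside of Excel, the same count of missing values per variable can be sketched as follows. The rows below are a hypothetical stand-in for the TreadWear data, and the column names are assumed from the variable descriptions above.

```python
# Hypothetical sample standing in for the TreadWear data; None marks a missing cell.
tread_wear = [
    {"ID Number": 1, "Position on Automobile": "LF",
     "Life of Tire (Months)": 24.0, "Tread Depth": 6.5, "Miles": 21000.0},
    {"ID Number": 2, "Position on Automobile": "RR",
     "Life of Tire (Months)": None, "Tread Depth": 5.0, "Miles": 30000.0},
    {"ID Number": 3, "Position on Automobile": None,
     "Life of Tire (Months)": 36.0, "Tread Depth": 4.0, "Miles": None},
    {"ID Number": 4, "Position on Automobile": "LR",
     "Life of Tire (Months)": 48.0, "Tread Depth": 3.0, "Miles": 52000.0},
]

# Number of missing values for each variable (column).
missing_counts = {
    var: sum(1 for row in tread_wear if row[var] is None)
    for var in tread_wear[0]
}

# ID numbers of the observations (rows) that have at least one missing value.
rows_with_missing = [
    row["ID Number"] for row in tread_wear
    if any(value is None for value in row.values())
]

print(missing_counts)
print(rows_with_missing)
```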

Figure 2.30: Portion of Excel Spreadsheet Showing Number of Missing
Values for Variables in TreadWear Data

Blakely Tires (cont.):
• Sort all of Blakely’s data on Miles from smallest to largest value to
determine which observation is missing its value of this variable.
Figure 2.31: Portion of Excel Spreadsheet Showing TreadWear Data
Sorted on Miles from Lowest to Highest Value
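The same sort can be sketched in plain Python. The records below are hypothetical. Missing values are placed last here so that they are easy to spot at the bottom of the sorted data.

```python
# Hypothetical records; None marks a missing Miles value.
records = [
    {"ID": 101, "Miles": 32000.0},
    {"ID": 102, "Miles": None},
    {"ID": 103, "Miles": 8000.0},
    {"ID": 104, "Miles": 54000.0},
]

# Sort ascending on Miles with missing values last. None cannot be compared
# with a float, so the key sorts on an "is missing" flag first.
by_miles = sorted(records, key=lambda r: (r["Miles"] is None, r["Miles"] or 0.0))

# The observations missing Miles now sit at the end of the sorted list.
missing_ids = [r["ID"] for r in by_miles if r["Miles"] is None]

print([r["ID"] for r in by_miles])
print(missing_ids)
```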

Figure 2.32: Portion of Excel Spreadsheet Showing TreadWear Data Sorted from
Lowest to Highest by ID Number

Identification of Erroneous Outliers and other Erroneous Values:
• Examining the variables in the data set by use of summary statistics, frequency
distributions, bar charts and histograms, z-scores, scatter plots, correlation
coefficients, and other tools can uncover data-quality issues and outliers.
• Many software packages ignore missing values when calculating various summary
statistics.
• If missing values in a data set are indicated with a unique value (such as
9999999), these values may be used by software when calculating various
summary statistics.
• Both cases can result in misleading values for summary statistics.
• Many analysts prefer to deal with missing data issues prior to using summary
statistics to attempt to identify erroneous outliers and other erroneous values
in the data.
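A small sketch of both points, using only Python's statistics module. All values are hypothetical, and the cutoff of 3 standard deviations is a common rule of thumb that is assumed here rather than taken from the slides.

```python
from statistics import mean, stdev

# Part 1: a sentinel code for "missing" distorts summary statistics.
SENTINEL = 9999999
raw_miles = [21000, 35000, SENTINEL, 48000, 30000]

naive_mean = mean(raw_miles)  # the sentinel is treated as a real mileage
cleaned_miles = [m for m in raw_miles if m != SENTINEL]
clean_mean = mean(cleaned_miles)  # the sentinel is excluded first

# Part 2: z-scores flag values far from the mean as candidate errors.
# Hypothetical tire lives in months; 601 stands in for a data-entry error.
life_months = [24, 30, 36, 28, 32, 26, 34, 30, 22, 38, 30, 601]

def z_scores(values):
    """Return (value - mean) / sample standard deviation for each value."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

flagged = [v for v, z in zip(life_months, z_scores(life_months)) if abs(z) > 3]

print(naive_mean, clean_mean)
print(flagged)
```

With n observations, the largest possible z-score computed this way is (n - 1) / sqrt(n), so a cutoff of 3 cannot flag anything in samples of 10 or fewer. A flagged value is a candidate for review, not proof of an error.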

Figure 2.33: Portion of Excel Spreadsheet Showing the Mean and
Standard Deviation for Each Variable in the TreadWear Data

Figure 2.34: Portion of Excel Spreadsheet Showing the TreadWear Data
Sorted on Life of Tires (Months) from Lowest to Highest Value

Figure 2.35: Scatter Diagram of Tread Depth and Miles for the TreadWear Data

Variable Representation:
• In many data-mining applications, it may be prohibitive to analyze the
data because of the number of variables recorded.
• Dimension reduction is the process of removing variables from the
analysis without losing crucial information.
• A critical part of data mining is determining how to represent the
measurements of the variables and which variables to consider.
• Often data sets contain variables that, considered separately, are not
particularly insightful but that, when appropriately combined, result in a
new variable that reveals an important relationship.
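As a hedged illustration of combining variables, the sketch below derives average miles driven per month from mileage and tire age. The records and the derived variable are hypothetical choices made for this example. The slides do not prescribe a particular combination.

```python
# Hypothetical records: tire life in months and miles on the tire.
records = [
    {"ID": 1, "Life": 24.0, "Miles": 24000.0},
    {"ID": 2, "Life": 30.0, "Miles": 45000.0},
    {"ID": 3, "Life": 36.0, "Miles": 18000.0},
]

# Combine two variables into one new variable: average miles per month.
# Viewed separately, Life and Miles do not show how intensively a tire is used.
for r in records:
    r["MilesPerMonth"] = r["Miles"] / r["Life"]

# A reduced representation that keeps only the derived variable.
reduced = [{"ID": r["ID"], "MilesPerMonth": r["MilesPerMonth"]} for r in records]

print(reduced)
```

Whether the reduced form loses crucial information depends on the analysis. Here the original scale of age and mileage is dropped, which may or may not matter.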
