Class5 DataPreprocessing DataCleaning 23aug2021

1. Data cleaning involves handling missing values, noisy data, and outliers. Common techniques to handle missing values include ignoring tuples with many missing attributes, filling in values manually, or using mean/median/mode imputation of attributes or groups.
2. Noise in data makes values incorrect and can be smoothed out using binning or regression methods. Binning involves grouping similar values, while regression finds patterns to predict values.
3. Outliers are identified and addressed during data cleaning to correct inconsistencies in data. The goal is to identify issues and fill in or correct values to produce clean, consistent data for analysis.

23-08-2021

Data Preprocessing
Data Cleaning: Handling Missing
Values, Noisy Data and Outliers

Data Cleaning (Data Cleansing)

• Real-world data tend to be incomplete, noisy and
inconsistent
• Data cleaning routines attempt to identify missing
values, fill in missing values, smooth out noise while
identifying outliers, and correct inconsistencies in the
data

• One of the biggest data cleaning tasks is handling
missing values


Data Cleaning: Missing Values

• Many tuples (records) have no recorded value for
several attributes
• Identifying missing values:
– When the Pandas library for Python is used, it represents
missing values as “NaN” [1]
– When reading a file, it automatically treats a blank
attribute value, “NaN”/“nan”, “NA”, “n/a” and
“NULL”/“null” as NaN

• Important note: If a numeric attribute has the
value 0 (zero), then it is not a missing value
– If 0 is not a correct value, then it is simply noise

[1] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
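The detection rules above can be checked with a small sketch. The CSV text and the column names `id` and `temp` are made up for illustration; only the default behaviour of `pandas.read_csv` is relied on.

```python
import io
import pandas as pd

# Hypothetical CSV content: several spellings of "missing" plus a genuine 0.
raw = "id,temp\n1,25.4\n2,\n3,NA\n4,n/a\n5,NULL\n6,nan\n7,0\n"

# read_csv converts the default missing-value markers to NaN.
df = pd.read_csv(io.StringIO(raw))

# Boolean mask of missing entries in the temp column.
missing = df["temp"].isna()
print(df)
print("missing count:", int(missing.sum()))
```

The blank, "NA", "n/a", "NULL" and "nan" entries are all flagged as missing, while the 0 in the last row is kept as an ordinary numeric value.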

Methods to Handle Missing Values

• Ignore the tuples:
– This method is effective only when the tuples contain
several attributes (> 50% of attributes) with missing
values
– This method is also used when the target variable (class
label) is missing
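A minimal sketch of both rules for ignoring tuples, assuming a small made-up table. Only StationID, Temperature, Humidity and Rain appear in the slides; the values below are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: StationID is the target attribute.
df = pd.DataFrame({
    "StationID": ["t10", None, "t12", "t13"],
    "Temperature": [25.4, 26.1, np.nan, 24.3],
    "Humidity": [82.0, 80.5, np.nan, 79.1],
    "Rain": [6.7, 0.0, np.nan, 2.2],
})
feature_cols = ["Temperature", "Humidity", "Rain"]

# Fraction of missing feature attributes per tuple.
frac_missing = df[feature_cols].isna().mean(axis=1)

# Rule 1: drop tuples with more than 50% of attributes missing.
kept = df[frac_missing <= 0.5]

# Rule 2: drop tuples whose target attribute is missing.
kept = kept.dropna(subset=["StationID"])
print(kept)
```

Here the second tuple is dropped because its target is missing and the third because all of its feature attributes are missing.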

Target attribute (StationID) with missing value

Methods to Handle Missing Values

• Fill in the missing values (imputing values) manually:
– Time consuming
– Not feasible given a large data set with many missing
values

• Use a global constant to fill in missing values (imputing
a global constant):
– Replace all missing attribute values by the same constant
– Imputed value may not be correct


Methods to Handle Missing Values

• Use attribute mean/median/mode to fill in the missing
value (mean/median/mode imputation):
– Applicable to numeric data
– The centre of the data (the statistic used for filling) won’t change
– However, it does not preserve the relationship with
other variables
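As a rough illustration of mean and median imputation, the sketch below fills a made-up temperature column and confirms that the statistic used for filling is unchanged afterwards. The values are assumptions, not data from the slides.

```python
import numpy as np
import pandas as pd

# Hypothetical attribute with two missing values.
temperature = pd.Series([25.0, 26.0, np.nan, 24.0, np.nan, 21.0])

# Mean imputation: the mean of the known values is 24.0.
filled_mean = temperature.fillna(temperature.mean())

# Median imputation: the median of the known values is 24.5.
filled_median = temperature.fillna(temperature.median())

print(filled_mean.mean(), filled_median.median())
```

Every missing entry receives the same value regardless of the other attributes in its tuple, which is why the relationship with other variables is not preserved.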

Methods to Handle Missing Values

• Filling with local mean/median/mode:
– Use attribute mean/median/mode of all samples
belonging to a group (class) to fill in the missing value
• Applicable to numeric data
• Centre of the data of a group won’t change
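A minimal sketch of filling with the local (group) mean, assuming StationID plays the role of the group and using invented readings.

```python
import numpy as np
import pandas as pd

# Hypothetical readings from two stations (groups).
df = pd.DataFrame({
    "StationID": ["t10", "t10", "t10", "t11", "t11", "t11"],
    "Temperature": [20.0, np.nan, 22.0, 30.0, 32.0, np.nan],
})

# Mean of each group, broadcast back to the rows of that group.
group_mean = df.groupby("StationID")["Temperature"].transform("mean")

# Fill each missing value with the mean of its own group.
filled = df["Temperature"].fillna(group_mean)
print(filled.tolist())
```

Replacing `"mean"` with `"median"` in `transform` gives the group-median variant.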


Methods to Handle Missing Values

• Use the values from the previous/next record (within
a group) to fill in the missing value (padding)
– Useful only when the domain understanding is good

• If the data is categorical or text, one can replace the
missing values by the most frequent observation
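The two ideas above can be sketched as follows. The column names and values are hypothetical; padding uses the previous record within a group and falls back to the next record when a group starts with a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a numeric attribute and a categorical attribute.
df = pd.DataFrame({
    "StationID": ["t10", "t10", "t10", "t11", "t11", "t11"],
    "Temperature": [20.0, np.nan, 22.0, np.nan, 32.0, np.nan],
    "Weather": ["sunny", None, "rainy", "sunny", "sunny", None],
})

# Padding: previous record within the group, then next record for leading gaps.
forward = df.groupby("StationID")["Temperature"].ffill()
padded = forward.groupby(df["StationID"]).bfill()

# Categorical attribute: fill with the most frequent observation (mode).
most_frequent = df["Weather"].mode()[0]
weather_filled = df["Weather"].fillna(most_frequent)

print(padded.tolist())
print(weather_filled.tolist())
```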

Methods to Handle Missing Values

• Use the most probable value to fill in the missing value:
– Use an interpolation technique to predict the missing value
• Linear interpolation fits a straight line between the two
adjacent known points and reads the missing value off
that line
• Interpolation happens column-wise
• Popular strategy
• It does not preserve the relationship with other variables
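A small sketch of column-wise linear interpolation with pandas, using an invented series in which the records are assumed to be equally spaced.

```python
import numpy as np
import pandas as pd

# Hypothetical attribute with gaps between known values.
temperature = pd.Series([20.0, np.nan, 24.0, np.nan, np.nan, 30.0])

# Each gap is filled from the straight line joining its known neighbours.
interpolated = temperature.interpolate(method="linear")
print(interpolated.tolist())
```

The gap between 20 and 24 is filled with 22, and the two-record gap between 24 and 30 is filled with 26 and 28.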


Methods to Handle Missing Values

• Use the most probable value to fill in the missing value:
– Use regression techniques to predict the missing value
(regression imputation)
• Let $y_1, y_2, \dots, y_d$ be a set of d attributes
• Regression (multivariate): The nth value is predicted as
$x_n = f(y_{n1}, y_{n2}, \dots, y_{nd})$

[Diagram: the d attribute values y are the input to f(.), which outputs x]

• Linear regression (multivariate): $x_n = w_1 y_{n1} + w_2 y_{n2} + \dots + w_d y_{nd}$

Methods to Handle Missing Values

Temperature = f(Humidity, Rain)

$\text{Temperature} = w_{T1}\,\text{Humidity} + w_{T2}\,\text{Rain}$

Humidity = f(Temperature, Rain)

$\text{Humidity} = w_{H1}\,\text{Temperature} + w_{H2}\,\text{Rain}$


• Popular strategy
• It uses the most information from the present data to
predict the missing values
• It preserves the relationship with other variables
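A minimal sketch of regression imputation for the Temperature = f(Humidity, Rain) case, assuming a linear model without an intercept term, as in the linear regression formula of these slides. The data are invented so that Temperature = 0.2 x Humidity + 1.5 x Rain holds exactly, and the weights are estimated with NumPy least squares rather than any particular library the course may have used.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the last Temperature value is missing.
df = pd.DataFrame({
    "Humidity": [80.0, 70.0, 90.0, 60.0, 75.0],
    "Rain": [2.0, 0.0, 4.0, 1.0, 3.0],
    "Temperature": [19.0, 14.0, 24.0, 13.5, np.nan],
})
predictors = ["Humidity", "Rain"]
known = df["Temperature"].notna()

# Estimate the weights w from the tuples where Temperature is recorded.
Y = df.loc[known, predictors].to_numpy()
x = df.loc[known, "Temperature"].to_numpy()
w, *_ = np.linalg.lstsq(Y, x, rcond=None)

# Predict the missing values from the other attributes of the same tuple.
filled = df.copy()
filled.loc[~known, "Temperature"] = df.loc[~known, predictors].to_numpy() @ w
print(w, filled["Temperature"].tolist())
```

Because the prediction uses the other attributes of the same tuple, the filled value stays consistent with Humidity and Rain, unlike mean imputation or interpolation.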

Data Cleaning: Handling the Noisy Data

• Noise is a random error or variance in a measured variable
• Consider the case where many of the entries in a numeric
attribute are 0 (zero)
• Example 1:
    Date      Temperature   ---
    Sept 1    25.47         ---
    Sept 2    26.19         ---
    Sept 3    0             ---
    Sept 4    24.30         ---
    Sept 5    24.07         ---
    Sept 6    21.21         ---
    Sept 7    0             ---
    Sept 8    21.79         ---
    Sept 9    25.09         ---
    Sept 10   0             ---
• Example 2: Pima-Indians-Diabetes
    BMI     Age   ---
    33.6    50    ---
    26.6    31    ---
    23.3    32    ---
    0       21    ---
    43.1    33    ---
    25.6    30    ---
    0       26    ---
    35.3    29    ---
    30.5    53    ---
    0       54    ---
• Replace the 0s (zeros) based on domain knowledge
• Replace the 0s (zeros) by regression-based methods
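One way to handle such zeros is sketched below with the temperature column of Example 1. It assumes domain knowledge says a reading of 0 is implausible here, marks the zeros as missing, and then uses median imputation purely as a stand-in for whichever domain-based or regression-based replacement is appropriate.

```python
import numpy as np
import pandas as pd

# Temperature column from Example 1.
temperature = pd.Series(
    [25.47, 26.19, 0, 24.30, 24.07, 21.21, 0, 21.79, 25.09, 0]
)

# Assumption: 0 is noise for this attribute, so treat it as missing.
as_missing = temperature.replace(0, np.nan)

# Stand-in replacement: median of the remaining readings.
cleaned = as_missing.fillna(as_missing.median())
print(cleaned.tolist())
```

Once the zeros are marked as NaN, any of the missing-value methods can be applied to them.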


Data Cleaning: Smoothing the Noisy Data

• Noise is a random error or variance in a measured variable
• Due to noise, many tuples (records) have incorrect values for
several attributes
• Most real-world data contain noise
• Smooth out the data to remove the effect of noise
• Data smoothing allows important patterns to stand out
• The idea is to sharpen the patterns (values) in the data and
highlight the trends the data is pointing to

• Methods for data smoothing:
– Binning
– Regression (function approximation)

Binning Methods for Data Smoothing

• Binning methods smooth a sorted data value of a noisy
attribute by consulting its neighbourhood, i.e., the
values around it
• They perform local smoothing, as they consult only the
neighbourhood of values
• The sorted values are partitioned into (almost) equal-frequency
bins


Binning Methods for Data Smoothing

• Different approaches for smoothing by bin:
1. Smoothing by bin means:
– Each value in a bin is replaced by the mean value of the
bin
2. Smoothing by bin medians:
– Each value in a bin is replaced by the median value of
the bin
3. Smoothing by bin boundaries:
– The minimum and maximum values in a given bin are
identified as bin boundaries
– Each bin value is then replaced by the closest boundary
value
• The larger the bin width, the greater the effect of the
smoothing

Illustration of Binning Methods for Data Smoothing
• Example:
• Noisy data for price (in Rs) : 8, 15, 34, 24, 4, 21, 28, 21, 25
• Sorted data for price (in Rs) : 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into bins:    Smoothing by bin means:    Smoothing by bin boundaries:

Bin1: 4, 8, 15          Bin1: 9, 9, 9              Bin1: 4, 4, 15
Bin2: 21, 21, 24        Bin2: 22, 22, 22           Bin2: 21, 21, 24
Bin3: 25, 28, 34        Bin3: 29, 29, 29           Bin3: 25, 25, 34

• Smoothed data in the original order:
– Smoothing by bin means : 9, 9, 29, 22, 9, 22, 29, 22, 29
– Smoothing by bin boundaries : 4, 15, 34, 24, 4, 21, 25, 21, 25

[Plot: noisy data, smoothing by bin means, smoothing by bin boundaries]
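The price example can be reproduced with a short sketch. It assumes three equal-frequency bins and, for smoothing by bin boundaries, sends a value that is equally close to both boundaries to the lower one (the slides do not specify a tie rule).

```python
import numpy as np

prices = np.array([8, 15, 34, 24, 4, 21, 28, 21, 25], dtype=float)
n_bins = 3

# Sort, then split the sorted positions into (almost) equal-frequency bins.
order = np.argsort(prices, kind="stable")
bins = np.array_split(order, n_bins)

by_means = np.empty_like(prices)
by_bounds = np.empty_like(prices)
for idx in bins:
    values = prices[idx]
    lo, hi = values.min(), values.max()
    # Smoothing by bin means: every value becomes the bin mean.
    by_means[idx] = values.mean()
    # Smoothing by bin boundaries: every value becomes the closest boundary.
    by_bounds[idx] = np.where(values - lo <= hi - values, lo, hi)

print(by_means.tolist())
print(by_bounds.tolist())
```

Because the results are written back through the original positions, both outputs are in the order of the noisy data rather than the sorted order.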


Outlier Detection and Replacing with Central Tendency
• Compute the first quartile (Q1) and third quartile (Q3) for
an attribute
• Compute the interquartile range (IQR) as IQR = Q3 - Q1
for that attribute
• Compute
– Lower Bound = Q1 - (1.5 x IQR)
– Upper Bound = Q3 + (1.5 x IQR)
• Detect an attribute value as an outlier if
– it is less than the Lower Bound OR
– it is larger than the Upper Bound
• Replace these outlier values with the mean/median/mode
of the attribute
• Important note: If the outliers are due to noise, then
it is better to replace them
– Domain knowledge is very important
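The IQR steps can be sketched as below on a made-up attribute with one extreme value. Quartiles are computed with the default (linear) method of pandas `quantile`, and the median is chosen as the replacement value; both are assumptions for illustration.

```python
import pandas as pd

# Hypothetical attribute with one extreme value.
values = pd.Series([21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 100.0])

# Quartiles and interquartile range.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Bounds for outlier detection.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Flag values outside the bounds and replace them with the median.
is_outlier = (values < lower_bound) | (values > upper_bound)
cleaned = values.mask(is_outlier, values.median())
print(lower_bound, upper_bound, cleaned.tolist())
```

Here Q1 = 23 and Q3 = 27, so the bounds are 17 and 33, and only the value 100 is flagged and replaced by the median 25.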

Summary of Data Cleaning

• It is often said that about 80% of a data analyst’s time is
spent cleaning the data
• Data cleaning routines attempt to identify missing
values, fill in missing values, smooth out noise while
identifying outliers
• One of the biggest data cleaning tasks is handling
missing values
• Among the different methods for filling in missing
values, the popular ones are
– Filling by central tendency (mean/median/mode)
– Filling by interpolation
– Filling by regression
• When the data is full of noise, smooth out the data
to remove the effect of noise (binning and regression)
• Outliers can be detected using quartiles and IQR
– Detected outliers can be replaced by
mean/median/mode
