Class5 DataPreprocessing DataCleaning 23aug2021

1. Data cleaning involves handling missing values, noisy data, and outliers. Common techniques to handle missing values include ignoring tuples with many missing attributes, filling in values manually, or using mean/median/mode imputation of attributes or groups. 2. Noise in data makes values incorrect and can be smoothed out using binning or regression methods. Binning involves grouping similar values, while regression finds patterns to predict values. 3. Outliers are identified and addressed during data cleaning to correct inconsistencies in data. The goal is to identify issues and fill in or correct values to produce clean, consistent data for analysis.
23-08-2021

Data Preprocessing
Data Cleaning: Handling Missing
Values, Noisy Data and Outliers

Data Cleaning (Data Cleansing)


• Real-world data tend to be incomplete, noisy and
inconsistent
• Data cleaning routines attempt to identify and fill in
missing values, smooth out noise while identifying
outliers, and correct inconsistencies in the data
• One of the biggest data cleaning tasks is handling
missing values


Data Cleaning: Missing Values


• Many tuples (records) have no recorded value for
several attributes
• Identifying missing values:
– When the Pandas library for Python is used, missing
values are represented as “NaN” [1]
– By default, read_csv treats a blank field and the
strings “NaN”, “nan”, “NA”, “n/a”, “NULL” and “null”
in an attribute value as NaN [1]

• Important note: If a numeric attribute has the
value 0 (zero), then it is not a missing value
– If 0 is not a correct value, then it is simply noise

[1] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
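
The detection of missing values described above can be checked with a small sketch. The CSV text, the column names and the values are made up for illustration; only the default behaviour of read_csv is relied on.

```python
import io

import pandas as pd

# Hypothetical CSV content (illustrative station readings).
csv_text = """StationID,Temperature,Humidity
S1,25.4,80
S2,,NA
S3,n/a,null
S4,0,75
"""

df = pd.read_csv(io.StringIO(csv_text))

# isna() marks every cell that pandas parsed as NaN.
missing_per_column = df.isna().sum()
print(missing_per_column)

# The 0 in the last row is an ordinary value, not a missing one.
zero_is_missing = bool(pd.isna(df.loc[3, "Temperature"]))
print(zero_is_missing)
```

Counting the NaN cells per attribute is usually the first step before choosing one of the handling methods.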

Methods to Handle Missing Values

• Ignore the tuples:
– This method is effective only when a tuple has missing
values for several attributes (> 50% of attributes)
– This method is also used when the target variable (class
label) is missing

[Figure: example tuples with missing values for more than 50% of the attributes]
[Figure: example tuples with a missing value for the target attribute (StationID)]
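
A minimal sketch of ignoring tuples, assuming a Pandas DataFrame. The data and the attribute names other than StationID are hypothetical; the 50% threshold is the one stated above.

```python
import numpy as np
import pandas as pd

# Hypothetical records; StationID plays the role of the target attribute.
df = pd.DataFrame({
    "StationID":   ["S1", "S2", None, "S4"],
    "Temperature": [25.4, np.nan, 24.1, 23.0],
    "Humidity":    [80.0, np.nan, 78.0, np.nan],
    "Rain":        [0.0, np.nan, 1.2, 0.4],
})

features = ["Temperature", "Humidity", "Rain"]

# Fraction of the feature attributes that are missing in each tuple.
missing_fraction = df[features].isna().mean(axis=1)

# Keep the tuples with at most 50% of the attributes missing ...
kept = df[missing_fraction <= 0.5]
# ... and ignore the tuples whose target attribute is missing.
kept = kept.dropna(subset=["StationID"])
print(kept)
```

Here the second tuple is dropped because all of its attributes are missing, and the third because its target is missing.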

Methods to Handle Missing Values


• Fill in the missing values (imputing values) manually:
– Time consuming
– Not feasible for a large data set with many missing
values

• Use a global constant to fill in the missing values
(imputing a global constant):
– Replace all missing values of an attribute by the same
constant
– The imputed value may not be correct
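
Imputing a global constant could look like the following sketch. The constants -999.0 and "Unknown" are arbitrary choices made for illustration, and the attribute names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one numeric and one text attribute.
df = pd.DataFrame({
    "Temperature": [25.4, np.nan, 24.1, np.nan],
    "City":        ["Mandi", None, "Shimla", "Mandi"],
})

# One constant per attribute; every missing value receives the same constant.
filled = df.fillna({"Temperature": -999.0, "City": "Unknown"})
print(filled)
```

A constant such as -999.0 is easy to recognise later, but it shifts the mean of the attribute, which is one reason the imputed value may not be correct.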


Methods to Handle Missing Values

• Use the attribute mean/median/mode to fill in the missing
value (mean/median/mode imputation):
– Applicable to numeric data (the mode can also be used
for categorical data)
– The centre of the data (the statistic used for filling)
won’t change
– However, it does not preserve the relationship with
other variables
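
Mean/median/mode imputation can be sketched as follows. The data and attribute names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one numeric and one categorical attribute.
df = pd.DataFrame({
    "Temperature": [24.0, 26.0, np.nan, 22.0],
    "Sky":         ["clear", None, "clear", "cloudy"],
})

# Numeric attribute: fill with the mean or the median of the known values.
mean_filled = df["Temperature"].fillna(df["Temperature"].mean())
median_filled = df["Temperature"].fillna(df["Temperature"].median())

# Categorical attribute: fill with the most frequent value (the mode).
mode_filled = df["Sky"].fillna(df["Sky"].mode().iloc[0])

# The mean of the attribute is the same before and after mean imputation.
print(df["Temperature"].mean(), mean_filled.mean())
```

Every missing entry receives the same value regardless of what the other attributes of that tuple say, which is why the relationship with other variables is not preserved.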

Methods to Handle Missing Values

• Filling with local mean/median/mode:
– Use the attribute mean/median/mode of all samples
belonging to the same group (class) to fill in the missing value
• Applicable to numeric data
• The centre of the data of a group won’t change
• However, it does not preserve the relationship with other
variables
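
Filling with the local (group) mean can be sketched as below, assuming StationID is the grouping attribute. The values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical readings from two stations (groups).
df = pd.DataFrame({
    "StationID":   ["S1", "S1", "S1", "S2", "S2", "S2"],
    "Temperature": [20.0, np.nan, 22.0, 30.0, 32.0, np.nan],
})

# Mean of the known values within each group, aligned with the original rows.
group_mean = df.groupby("StationID")["Temperature"].transform("mean")

# Each missing value is filled with the mean of its own group.
df["Temperature_filled"] = df["Temperature"].fillna(group_mean)
print(df)
```

Replacing "mean" by "median" in transform gives the local median instead.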


Methods to Handle Missing Values


• Use the value from the previous/next record (within
a group) to fill in the missing value (padding)
– Useful only when the domain understanding is good

• If the data is categorical or text, one can replace the
missing values by the most frequent observation
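
Padding within a group can be sketched with a forward fill (previous record) and a backward fill (next record). The grouping by StationID and the values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical readings, ordered in time within each station.
df = pd.DataFrame({
    "StationID":   ["S1", "S1", "S1", "S2", "S2"],
    "Temperature": [20.0, np.nan, 22.0, np.nan, 31.0],
})

grouped = df.groupby("StationID")["Temperature"]

# Previous record within the same group; a gap at the start of a group stays NaN.
forward = grouped.ffill()
# Next record within the same group.
backward = grouped.bfill()
print(forward.tolist())
print(backward.tolist())
```

Filling within a group prevents a value from one station leaking into the next one, which is where the domain understanding comes in.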

Methods to Handle Missing Values


• Use the most probable value to fill in the missing value:
– Use an interpolation technique to predict the missing value
• Linear interpolation places the missing value on the
straight line joining the two adjacent known points
• Interpolation happens column-wise
• Popular strategy
• It does not preserve the relationship with other variables
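
Linear interpolation along a column can be sketched as follows. The values are made up, and the records are assumed to be equally spaced.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two consecutive missing values.
s = pd.Series([20.0, np.nan, np.nan, 26.0, 24.0])

# Each missing value is placed on the straight line between the
# nearest known values before and after it.
interp = s.interpolate(method="linear")
print(interp.tolist())
```

The two gaps lie one third and two thirds of the way from 20 to 26, so they are filled with 22 and 24.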


Methods to Handle Missing Values

• Use the most probable value to fill in the missing value:
– Use regression techniques to predict the missing value
(regression imputation)
• Let $y_1, y_2, \dots, y_d$ be a set of $d$ attributes
• Regression (multivariate): The $n$th value is predicted as
$x_n = f(y_{n1}, y_{n2}, \dots, y_{nd})$

[Figure: block diagram, d-dimensional input y → f(.) → output x]

• Linear regression (multivariate): $x_n = w_1 y_{n1} + w_2 y_{n2} + \dots + w_d y_{nd}$

• Example:
$\text{Temperature} = f(\text{Humidity}, \text{Rain})$
$\text{Temperature} = w_{T1}\,\text{Humidity} + w_{T2}\,\text{Rain}$
$\text{Humidity} = f(\text{Temperature}, \text{Rain})$
$\text{Humidity} = w_{H1}\,\text{Temperature} + w_{H2}\,\text{Rain}$

• Popular strategy
• It uses the most information from the present data to
predict the missing values
• It preserves the relationship with other variables
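
Regression imputation with the linear model above can be sketched with a least-squares fit. The data are synthetic and were constructed so that Temperature = 0.3 x Humidity + 2.0 x Rain holds exactly; as in the formula above, no intercept term is used.

```python
import numpy as np

# Synthetic records (illustrative only); the last Temperature is missing.
humidity = np.array([60.0, 70.0, 80.0, 90.0, 75.0])
rain = np.array([0.0, 1.0, 2.0, 0.5, 1.5])
temperature = np.array([18.0, 23.0, 28.0, 28.0, np.nan])

# Each row of Y holds the predictor attributes (y_n1, y_n2) of one record.
Y = np.column_stack([humidity, rain])
known = ~np.isnan(temperature)

# Estimate the weights w from the records where Temperature is known.
w, *_ = np.linalg.lstsq(Y[known], temperature[known], rcond=None)

# Predict the missing values as x_n = w_1 * y_n1 + w_2 * y_n2.
imputed = temperature.copy()
imputed[~known] = Y[~known] @ w
print(w, imputed)
```

Because the prediction uses the Humidity and Rain values of the same record, the imputed value follows the relationship between the attributes.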

Data Cleaning: Handling the Noisy Data


• Noise is a random error or variance in a measured variable
• Consider the case where many of the entries in a numeric
attribute are 0 (zero)

• Example 1:
Date      Temperature   ---
Sept 1    25.47         ---
Sept 2    26.19         ---
Sept 3    0             ---
Sept 4    24.30         ---
Sept 5    24.07         ---
Sept 6    21.21         ---
Sept 7    0             ---
Sept 8    21.79         ---
Sept 9    25.09         ---
Sept 10   0             ---
---       ---

• Example 2: Pima-Indians-Diabetes
BMI    Age   ---
33.6   50    ---
26.6   31    ---
23.3   32    ---
0      21    ---
43.1   33    ---
25.6   30    ---
0      26    ---
35.3   29    ---
30.5   53    ---
0      54    ---
---    ---

• Replace the 0s (zeros) based on domain knowledge
• Replace the 0s (zeros) by regression-based methods
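
A sketch of replacing the noisy zeros in the BMI column of Example 2. The domain knowledge assumed here is that a BMI of 0 is not physically possible; filling with the median is only one possible choice, and a regression-based fill could be used instead.

```python
import numpy as np
import pandas as pd

# BMI values from Example 2.
bmi = pd.Series([33.6, 26.6, 23.3, 0, 43.1, 25.6, 0, 35.3, 30.5, 0])

# Treat the impossible zeros as unknown values ...
bmi_nan = bmi.replace(0, np.nan)
# ... and fill them with the median of the valid entries.
bmi_clean = bmi_nan.fillna(bmi_nan.median())
print(bmi_clean.tolist())
```

Converting the zeros to NaN first keeps them from pulling the median towards zero.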


Data Cleaning: Smoothing the Noisy Data


• Noise is a random error or variance in a measured variable
• Due to noise, many tuples (records) have incorrect values for
several attributes
• Real-world data usually contain noise
• Smooth out the data to remove the effect of noise
• Data smoothing allows important patterns to stand out
• The idea is to sharpen the patterns (values) in the data and
highlight the trends the data is pointing to

• Methods for data smoothing:
– Binning
– Regression (function approximation)

Binning Methods for Data Smoothing


• Binning methods smooth a sorted data value of a noisy
attribute by consulting its neighbourhood, i.e., the
values around it
• They perform local smoothing, as they consult only the
neighbourhood of values
• The sorted values are partitioned into (almost)
equal-frequency bins


Binning Methods for Data Smoothing


• Different approaches for smoothing by bin:
1. Smoothing by bin means:
– Each value in a bin is replaced by the mean value of the
bin
2. Smoothing by bin medians:
– Each value in a bin is replaced by the median value of
the bin
3. Smoothing by bin boundaries:
– The minimum and maximum values in a given bin are
identified as bin boundaries
– Each bin value is then replaced by the closest boundary
value
• The larger the bin width, the greater the effect of the
smoothing

Illustration of Binning Methods for Data Smoothing

• Example:
• Noisy data for price (in Rs)  : 8, 15, 34, 24, 4, 21, 28, 21, 25
• Sorted data for price (in Rs) : 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into bins:   Smoothing by bin means:   Smoothing by bin boundaries:
Bin1: 4, 8, 15         Bin1: 9, 9, 9             Bin1: 4, 4, 15
Bin2: 21, 21, 24       Bin2: 22, 22, 22          Bin2: 21, 21, 24
Bin3: 25, 28, 34       Bin3: 29, 29, 29          Bin3: 25, 25, 34

• Smoothed values in the original order of the noisy data:
• Smoothing by bin means      : 9, 9, 29, 22, 9, 22, 29, 22, 29
• Smoothing by bin boundaries : 4, 15, 34, 24, 4, 21, 25, 21, 25

[Figure: plot of the noisy data, smoothing by bin means and smoothing by bin boundaries]
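
The price example can be reproduced with a short sketch in plain Python. The function names are hypothetical, the number of values is assumed to be at least the number of bins, and a value equally close to both boundaries is assigned to the lower one (an assumption, since the slides do not cover ties).

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (almost) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional value.
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


def smooth_by_means(bins):
    """Replace every value in a bin by the mean of that bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value in a bin by the closest bin boundary (min or max)."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [8, 15, 34, 24, 4, 21, 28, 21, 25]
bins = equal_frequency_bins(prices, 3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
print(bins)
print(by_means)
print(by_boundaries)
```

Using fewer bins makes each bin wider, so more values collapse onto the same mean or boundary and the smoothing is stronger.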


Outlier Detection and Replacing with Central Tendency

• Compute the first quartile (Q1) and third quartile (Q3) for
an attribute
• Compute the interquartile range (IQR) as IQR = Q3 – Q1
for that attribute
• Compute
– Lower Bound = Q1 – (1.5 x IQR)
– Upper Bound = Q3 + (1.5 x IQR)
• Detect an attribute value as an outlier if
– it is less than the Lower Bound OR
– it is larger than the Upper Bound
• Replace these outlier values with the mean/median/mode
of the attribute
• Important note: If the outliers are due to noise, then
it is better to replace them
– Domain knowledge is very important
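
The IQR rule can be sketched as follows. The values are made up, the quartiles use the default linear interpolation of Pandas, and the median is chosen as the replacement because it is less affected by the outlier than the mean.

```python
import pandas as pd

# Hypothetical attribute values with one suspicious reading.
s = pd.Series([21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 95.0])

q1 = s.quantile(0.25)
q3 = s.quantile(0.75)
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# A value is an outlier if it falls outside [lower_bound, upper_bound].
is_outlier = (s < lower_bound) | (s > upper_bound)

# Replace the detected outliers with the median of the attribute.
cleaned = s.mask(is_outlier, s.median())
print(lower_bound, upper_bound)
print(cleaned.tolist())
```

Note that the lower bound can be negative for attributes with small values; it is used as computed, without taking an absolute value.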

Summary of Data Cleaning


• Data cleaning is often said to take up about 80% of a data
analyst’s time
• Data cleaning routines attempt to identify and fill in
missing values, and smooth out noise while identifying
outliers
• One of the biggest data cleaning tasks is handling
missing values
• Popular methods for filling in missing values:
– Filling by central tendency (mean/median/mode)
– Filling by interpolation
– Filling by regression
• When the data is full of noise, smooth out the data
to remove the effect of noise (binning and regression)
• Outliers can be detected using quartiles and the IQR
– Detected outliers can be replaced by the
mean/median/mode
