
CS06504

Data Mining
Lecture # 7
Data Preprocessing
(Ch # 3)
Data Preprocessing
 Why preprocess the data?
 Data cleaning
 Data integration
 Data reduction
 Data transformation and discretization
 Summary
Why Data Preprocessing?
 Data in the real world is dirty
 incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
 noisy: containing errors or outliers
 inconsistent: containing discrepancies in codes or names
 Data quality is a major concern in Data Mining and Knowledge Discovery tasks.
 Why: Almost all Data Mining algorithms induce knowledge strictly from data.
 No quality data, no quality mining results!
 Quality decisions must be based on quality data
 No quality data, inefficient mining process!
 Complete, noise-free, and consistent data means faster algorithms
 The quality of the extracted knowledge depends heavily on the quality of the data
Effect of Noisy Data on Results Accuracy

Training data:

age      income   student   buys_computer
<=30     high     yes       yes
<=30     high     no        yes
>40      medium   yes       no
>40      medium   no        no
>40      low      yes       yes
31…40    ?        no        yes
31…40    medium   yes       yes

Data mining discovers only those rules whose support (frequency) is >= 2:
• If ‘age <= 30’ and income = ‘high’ then buys_computer = ‘yes’
• If ‘age > 40’ and income = ‘medium’ then buys_computer = ‘no’

Testing data (actual data):

age      income   student   buys_computer
<=30     high     no        ?
>40      medium   yes       ?
31…40    medium   yes       ?

Due to the missing income value in the training dataset, the candidate rule for age ‘31…40’ falls below the support threshold, so the third test record cannot be classified; prediction accuracy therefore decreases to 66.7%.
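To make the support computation concrete, here is a minimal Python sketch (the tuple encoding is hypothetical) that counts the support of every candidate ‘(age, income) → buys_computer’ rule; the tuple with the missing income cannot contribute, so the 31…40 rule never reaches the threshold:

    from collections import Counter

    # Training tuples: (age, income, buys_computer); None marks the missing value
    train = [("<=30", "high", "yes"), ("<=30", "high", "yes"),
             (">40", "medium", "no"), (">40", "medium", "no"),
             (">40", "low", "yes"), ("31…40", None, "yes"),
             ("31…40", "medium", "yes")]

    # Support of each (age, income) -> buys_computer rule, skipping missing values
    support = Counter(t for t in train if t[1] is not None)
    rules = {rule: n for rule, n in support.items() if n >= 2}
    print(rules)  # only the age<=30/high and age>40/medium rules survive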
Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data reduction
 Obtains a reduced representation in volume but produces the same or similar analytical results
 Data transformation
 Normalization and aggregation
 Data discretization
 Part of data reduction, but of particular importance, especially for numerical data
[Figure: forms of data preprocessing]
Data Preprocessing
 Why preprocess the data?
 Data cleaning
 Data integration
 Data reduction
 Data transformation and discretization
 Summary
Data Cleaning
 Data cleaning tasks
 Fill in missing values
 Smooth out noisy data
 Correct inconsistent data
Missing Data
 Data is not always available
 E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
 Missing data may be due to
 equipment malfunction
 data inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data not considered important at the time of entry
 changes to the data not registered (no history kept)
 Missing data may need to be inferred.
Methods of Treating Missing Data
 Ignoring and discarding data: there are two main ways to discard data with missing values.
 Discard all records which have missing data (also called discard case analysis). Usually done when the class label is missing.
 Discard only those attributes which have a high level of missing data.
 Fill in the missing value manually: tedious + infeasible?
 Use a global constant to fill in the missing value: e.g., “unknown”, a new class.
 Imputation using mean, median, or mode: one of the most frequently used methods (a statistical technique).
 Use the attribute mean to fill in the missing value
 Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
 Replace missing values of numeric (continuous) attributes using the mean/median (the median is robust against noise).
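A minimal sketch of mean/median/mode imputation in Python with pandas (the column names and values here are hypothetical):

    import pandas as pd

    df = pd.DataFrame({
        "income": [50000.0, None, 62000.0, 58000.0, None],  # numeric, continuous
        "segment": ["A", "B", None, "B", "B"],              # categorical
    })

    # Numeric attribute: use the median (robust against noise/outliers)
    df["income"] = df["income"].fillna(df["income"].median())

    # Categorical attribute: use the mode (most frequent value)
    df["segment"] = df["segment"].fillna(df["segment"].mode()[0])

    print(df)

The smarter, class-conditional variant fills each missing value with the mean of the samples sharing the same class label, e.g. df.groupby("label")["income"].transform(lambda s: s.fillna(s.mean())) for a hypothetical "label" column.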
Methods of Treating Missing Data
 Replace missing values using a prediction/classification model:
 Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or a decision tree
 Advantage: it considers the relationships between the known attribute values and the missing values, so the imputation accuracy is very high.
 Disadvantage: if no correlation exists between the attributes with missing values and the known attributes, the imputation cannot be performed.
 Alternative approach: use a hybrid combination of a prediction/classification model and the mean/median/mode.
• First try to impute the missing value using the prediction/classification model, and then fall back to the mean/median/mode.
 We will study more about this topic in Association Rule Mining.
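A minimal sketch of this model-based imputation, assuming a decision tree as the predictor and an already-numeric dataset (all values are hypothetical):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Toy data: column 0 has a missing value (np.nan); columns 1-2 are known.
    X = np.array([
        [20.0,   1.0, 0.0],
        [np.nan, 1.0, 1.0],
        [20.0,   1.0, 1.0],
        [20.0,   0.0, 0.0],
        [30.0,   1.0, 0.0],
        [10.0,   0.0, 1.0],
    ])

    missing = np.isnan(X[:, 0])
    # Learn "known attributes -> attribute with missing values" from complete records
    model = DecisionTreeClassifier().fit(X[~missing][:, 1:], X[~missing][:, 0])
    # Fill in the most probable value for the incomplete records
    X[missing, 0] = model.predict(X[missing][:, 1:])
    print(X)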
Methods of Treating Missing Data
 K-Nearest Neighbor (k-NN) approach (best approach):
 k-NN imputes the missing attribute values on the basis of the K nearest neighbors. Neighbors are determined on the basis of a distance measure.
 Once the K neighbors are determined, the missing value is imputed by taking the mean/median or mode of the neighbors’ known values for that attribute.
[Figure: the record with the missing value shown among the other dataset records, with its nearest neighbors highlighted]
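A minimal sketch using scikit-learn’s KNNImputer (k = 2 is an arbitrary choice here; the data is the same hypothetical array as above):

    import numpy as np
    from sklearn.impute import KNNImputer

    X = np.array([
        [20.0,   1.0, 0.0],
        [np.nan, 1.0, 1.0],
        [20.0,   1.0, 1.0],
        [20.0,   0.0, 0.0],
        [30.0,   1.0, 0.0],
        [10.0,   0.0, 1.0],
    ])

    # Neighbors are found with a nan-aware Euclidean distance on the known
    # attributes; each missing entry becomes the mean of its k neighbors' values.
    imputer = KNNImputer(n_neighbors=2)
    print(imputer.fit_transform(X))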
Imputation of Missing Data (Basic)
 Imputation denotes a procedure that replaces the missing values in a dataset with plausible values, i.e. values derived from the relationships among correlated attributes of the dataset.
Attribute 1   Attribute 2   Attribute 3   Attribute 4
20            cool          high          false
?             cool          high          true
20            cool          high          true
20            mild          low           false
30            cool          normal        false
10            mild          high          true

If we consider only {attribute#2}, the value “cool” appears in 3 records with a known Attribute 1:
Probability of imputing value (20) = 66.7%
Probability of imputing value (30) = 33.3%
Imputation of Missing Data (Basic)

Attribute 1   Attribute 2   Attribute 3   Attribute 4
20            cool          high          false
?             cool          high          true
20            cool          high          true
20            mild          low           false
30            cool          normal        false
10            mild          high          true

For {attribute#4}, the value “true” appears in 2 records with a known Attribute 1:
Probability of imputing value (20) = 50%
Probability of imputing value (10) = 50%

For {attribute#2, attribute#3}, the value pair {“cool”, “high”} appears in only 2 records with a known Attribute 1:
Probability of imputing value (20) = 100%
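A minimal sketch of this frequency-based idea in pandas, reproducing the {attribute#2} case above:

    import pandas as pd

    df = pd.DataFrame({
        "a1": [20, None, 20, 20, 30, 10],
        "a2": ["cool", "cool", "cool", "mild", "cool", "mild"],
        "a3": ["high", "high", "high", "low", "normal", "high"],
        "a4": [False, True, True, False, False, True],
    })

    # Records that match the incomplete record on attribute#2 ("cool")
    # and have a known Attribute 1
    matches = df[(df["a2"] == "cool") & df["a1"].notna()]

    # Relative frequency of each candidate = probability of imputing it
    print(matches["a1"].value_counts(normalize=True))  # 20 -> 0.667, 30 -> 0.333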
Noisy Data
 Noise: random error or variance in a measured variable
 Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitations
 inconsistency in naming conventions
 Other data problems which require data cleaning
 duplicate records
 incomplete data
Removing Noise
 Data smoothing (rounding, averaging within a window)
 Smoothing by binning:
• first sort the data and partition it into (equi-depth) bins
• then smooth by bin means, bin medians, bin boundaries, etc.
 Smoothing by regression:
• smooth by fitting the data to a regression function
 Clustering/merging:
• detect and remove outliers
Smoothing by Binning Method
 Equal-width (distance) partitioning:
 Divides the range into N intervals of equal size: a uniform grid
 If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N, where N is the number of bins.
 The most straightforward method
 But outliers may dominate the presentation
 Skewed data is not handled well.
 Equal-depth (frequency) partitioning:
 Divides the range into N intervals, each containing approximately the same number of samples
 Good data scaling
 Managing categorical attributes can be tricky.
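A minimal sketch contrasting the two partitioning schemes in pandas, using the price data from the next slide:

    import pandas as pd

    prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

    # Equal-width: 3 intervals of equal size, W = (34 - 4) / 3 = 10
    print(pd.cut(prices, bins=3))

    # Equal-depth: 3 intervals, each with (approximately) the same number of samples
    print(pd.qcut(prices, q=3))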
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-width) bins of width W = (34 − 4)/3 = 10:
- Bin 1: 4, 8, 9
- Bin 2: 15, 21, 21, 24
- Bin 3: 25, 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 7, 7, 7
- Bin 2: 20, 20, 20, 20
- Bin 3: 28, 28, 28, 28, 28
* Smoothing by bin boundaries (each value moves to the closer of its bin’s min/max):
- Bin 1: 4, 9, 9
- Bin 2: 15, 24, 24, 24
- Bin 3: 25, 25, 25, 25, 34
Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins of depth 4:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
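A minimal plain-Python sketch that reproduces the equi-depth example above:

    # Equi-depth binning, then smoothing by bin means and by bin boundaries
    prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
    depth = 4
    bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

    # Smoothing by bin means: every value becomes its bin's (rounded) mean
    by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

    # Smoothing by bin boundaries: every value moves to the closer of min/max
    by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                 for b in bins]

    print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
    print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]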
Regression Method for Smoothing the Data
 Regression is a technique that conforms data values to a function. Linear regression involves finding the “best” line to fit two attributes (or variables) so that one attribute can be used to predict the other.
[Figure: a point (X1, Y1) projected onto the fitted line y = x + 1, giving the smoothed value Y1′]
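A minimal sketch of linear-regression smoothing with numpy (the x/y values are hypothetical):

    import numpy as np

    # Two correlated numeric attributes (toy data with some noise)
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.2, 2.8, 4.3, 4.9, 6.1])

    # Fit the "best" least-squares line y = a*x + b
    a, b = np.polyfit(x, y, deg=1)

    # Smooth y by replacing each value with the line's prediction
    y_smooth = a * x + b
    print(round(a, 2), round(b, 2), y_smooth)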
Detecting Outliers (Clustering)
 Outliers may be detected by clustering, where similar values are organized into groups or “clusters”.
 Values which fall outside of the set of clusters may be considered outliers.
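A minimal sketch using k-means from scikit-learn; the “3× the mean distance” cutoff is an assumed threshold, not part of the slide:

    import numpy as np
    from sklearn.cluster import KMeans

    # Two tight clusters of 2-D points plus one far-away value
    X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],
                  [12.0, 0.0]])

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    # Distance of each point to its assigned cluster centre
    dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

    # Flag points that fall far outside their cluster as outliers
    print(X[dist > 3 * dist.mean()])  # the point [12, 0]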
