03 Data Preparation

Data pre-processing is essential for improving the quality of datasets used in data mining, as it addresses issues like inconsistencies, missing values, and noise. Techniques such as data cleaning, smoothing, normalization, and data reduction are employed to enhance data quality and ensure effective model building. The success of data mining projects heavily relies on the quality of the prepared data, making careful data preparation a prerequisite.


DATA PRE-PROCESSING

▪ In real-world applications, data can be inconsistent,
incomplete and/or noisy.
▪ Errors can happen, and the prediction rate will then be lower.
▪ Pre-processing produces better models, faster, because good
data is a prerequisite for producing effective models of any type.
▪ Analyzing data that has not been carefully screened for
such problems can produce highly misleading results.
▪ Thus, the success of data mining projects heavily depends
on the quality of the prepared data.
▪ Data preparation is about constructing a dataset from one
or more data sources to be used for exploration and modeling.
Start with an initial dataset to get familiar with the data, to
discover first insights into it, and to gain a good
understanding of any possible data quality issues.
Data cleaning attempts to:
▪ Fill in missing values
▪ Smooth out noisy data
▪ Correct inconsistencies
Missing values can be handled in several ways:
▪ Ignore the tuple with missing values;
▪ Fill in the missing values manually;
▪ Use a global constant to fill in missing values (NULL, unknown,
etc.);
▪ Use the attribute value mean to fill in missing values of that
attribute;
▪ Use the attribute mean for all samples belonging to the same
class to fill in the missing values;
▪ Infer the most probable value to fill in the missing value.
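For illustration, a minimal pandas sketch of these strategies; the DataFrame, the price attribute and the class label below are invented for the example:

```python
import pandas as pd
import numpy as np

# Illustrative data with missing values (NaN)
df = pd.DataFrame({
    "price": [8, np.nan, 15, 16, np.nan, 24],
    "class": ["A", "A", "B", "B", "A", "B"],
})

# Ignore (drop) tuples with missing values
dropped = df.dropna(subset=["price"])

# Fill with a global constant (a sentinel such as -1 or "unknown")
constant_filled = df["price"].fillna(-1)

# Fill with the attribute mean
mean_filled = df["price"].fillna(df["price"].mean())

# Fill with the mean of all samples belonging to the same class
class_mean_filled = df["price"].fillna(df.groupby("class")["price"].transform("mean"))
```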
▪ The purpose of data smoothing is to eliminate noise.
▪ This can be done by:
✓ Binning
✓ Clustering
✓ Regression
▪ Binning smooths the data by consulting the value’s
neighborhood.
▪ It aims to remove the noise from the data set by:
[1] smoothing the data into equal-frequency bins;
[2] smoothing by bin means;
[3] smoothing by bin boundaries.
Unsorted Data for price in dollars:
8, 16, 9, 15, 21, 24, 30, 26, 27, 30, 34, 21
STEP 1: Sort the data:
8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34
Smooth the data by equal-frequency bins:
Bin 1: 8, 9, 15, 16
Bin 2: 21, 21, 24, 26
Bin 3: 27, 30, 30, 34
STEP 1: Sorted data: 8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34
Smooth the data by bin means:
For Bin 1: (8 + 9 + 15 + 16)/4 = 12
Bin 1: 12, 12, 12, 12
For Bin 2: (21 + 21 + 24 + 26)/4 = 23
Bin 2: 23, 23, 23, 23
For Bin 3: (27 + 30 + 30 + 34)/4 = 30.25
Bin 3: 30.25, 30.25, 30.25, 30.25
STEP 1: Sorted data: 8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34
Smooth the data by bin boundaries:
▪ Pick the MIN and MAX value of each bin
▪ Put the MIN on the left side and the MAX on the right side
▪ Each middle value moves to whichever boundary (MIN or MAX)
is closer to it
Bin 1: 8, 8, 16, 16
Bin 2: 21, 21, 26, 26
Bin 3: 27, 27, 27, 34
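A small Python sketch that reproduces this worked example (equal-frequency bins, then smoothing by bin means and by bin boundaries); the tie-breaking rule for the boundary method is an assumption:

```python
prices = [8, 16, 9, 15, 21, 24, 30, 26, 27, 30, 34, 21]

# STEP 1: sort the data, then split into equal-frequency (equi-depth) bins
data = sorted(prices)
bin_size = len(data) // 3
bins = [data[i:i + bin_size] for i in range(0, len(data), bin_size)]

# Smoothing by bin means: every value is replaced by its bin's mean
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value moves to the closer of the bin's min or max
by_boundaries = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(bins)           # [[8, 9, 15, 16], [21, 21, 24, 26], [27, 30, 30, 34]]
print(by_means)       # [[12.0, 12.0, 12.0, 12.0], [23.0, ...], [30.25, ...]]
print(by_boundaries)  # [[8, 8, 16, 16], [21, 21, 26, 26], [27, 27, 27, 34]]
```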
▪ Clustering: data is organized into groups of “similar” values.
▪ Rare values that fall outside these groups are considered
outliers and are discarded.
▪ Regression: data regression consists of fitting the data to a
function.
▪ A linear regression, for instance, finds the line that best fits
two variables so that one variable can predict the other.
▪ More variables can be involved in a multiple linear regression.
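A minimal sketch of regression-based smoothing using NumPy’s polyfit; the sample values are invented, and the fitted line replaces the noisy observations:

```python
import numpy as np

# Two related variables (illustrative values): x is used to predict y
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# Fit a line y = slope * x + intercept (simple linear regression)
slope, intercept = np.polyfit(x, y, deg=1)

# Replace the noisy observations with the fitted (smoothed) values
y_smoothed = slope * x + intercept
print(slope, intercept)   # roughly 2.0 and 0.0
print(y_smoothed)
```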
Data analysis may require combining data from multiple
sources into a coherent data store.
There are many challenges:
▪ Schema integration (e.g., the same attribute may appear as
CID » C_number » Cust-id » cust#)
▪ Semantic heterogeneity
▪ Data value conflicts (different representations or scales,
etc.)
There are many challenges:
▪ Redundant records
▪ Redundant attributes
(an attribute is redundant if it can be derived from other
attributes)
▪ Correlation analysis: P(A∧B)/(P(A)P(B))
= 1: independent, > 1: positive correlation, < 1: negative
correlation.
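A small sketch of this correlation measure (often called lift), computed from co-occurrence counts that are invented for the example:

```python
# Correlation (lift) analysis for two binary attributes A and B
n_total = 1000      # total number of records (assumed)
n_a = 300           # records where A holds
n_b = 400           # records where B holds
n_ab = 180          # records where both A and B hold

p_a, p_b, p_ab = n_a / n_total, n_b / n_total, n_ab / n_total
lift = p_ab / (p_a * p_b)

# lift == 1: independent, > 1: positive correlation, < 1: negative correlation
print(lift)   # 0.18 / (0.3 * 0.4) = 1.5  -> positively correlated
```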
▪ Data is sometimes in a form not appropriate for
mining.
▪ Either the algorithm at hand cannot handle it, the form of the
data is not regular, or the data itself is not specific enough.
Data transformation techniques include:
▪ Normalization
(to compare carrots with carrots)
▪ Smoothing
▪ Aggregation
(summary operations applied to the data)
▪ Generalization
(low-level data is replaced with higher-level data using a
concept hierarchy)
Min-max normalization: linear transformation from v to v’
▪ v’ = (v - min)/(max - min) × (new_max - new_min) + new_min
▪ Example:
transform $30,000 in [10000..45000] into [0..1] →
(30000 - 10000)/(45000 - 10000) × (1 - 0) + 0 ≈ 0.571
Z-score normalization:
▪ normalizes v into v’ based on the attribute’s mean and standard deviation
▪ v’ = (v - mean)/standard_deviation
Normalization by decimal scaling:
▪ moves the decimal point of v by j positions, where j is the minimum
number of positions needed to make the absolute maximum
value fall in [0..1]: v’ = v/10^j
▪ Example:
if v ranges between –56 and 9976, then j = 4 →
v’ ranges between –0.0056 and 0.9976
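The three normalization methods as a short Python sketch (function names are illustrative), reproducing the two examples above:

```python
def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    """Min-max normalization: linear transformation of v into [new_min, new_max]."""
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    """Z-score normalization based on the attribute's mean and standard deviation."""
    return (v - mean) / std

def decimal_scaling(v, abs_max):
    """Decimal scaling: divide by 10^j, with j the smallest integer making |v'| <= 1."""
    j = 0
    while abs_max / (10 ** j) > 1:
        j += 1
    return v / (10 ** j)

print(min_max(30000, 10000, 45000))   # 0.571...
print(decimal_scaling(-56, 9976))     # -0.0056
print(decimal_scaling(9976, 9976))    # 0.9976
```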
▪ The data is often too large.
▪ Reducing the data can improve performance.
▪ Data reduction consists of reducing the representation of
the data set while producing the same (or almost the same)
results.
Data reduction includes:
▪ Data cube aggregation
▪ Dimension reduction
▪ Data compression
▪ Discretization
▪ Numerosity reduction
Data cube aggregation:
▪ Reduce the data to the concept level needed in the analysis.
▪ Queries regarding aggregated information should be answered
using a data cube when possible.
Dimension reduction:
▪ Feature selection (i.e., attribute subset selection)
▪ Use heuristics: select the locally ‘best’ (or most pertinent)
attribute at each step
▪ Decision tree induction can serve as such a heuristic (see the
sketch below)
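A minimal sketch of heuristic attribute subset selection via decision tree induction, assuming scikit-learn is available; the dataset is synthetic and the importance threshold is an illustrative choice:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                  # five candidate attributes
y = (X[:, 0] + 2 * X[:, 2] > 0).astype(int)    # class depends only on attributes 0 and 2

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Attributes the tree never splits on get importance 0 and are candidates to drop
importances = tree.feature_importances_
selected = [i for i, imp in enumerate(importances) if imp > 0]
print(importances)
print(selected)   # typically includes 0 and 2
```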
Data compression reduces the size of the data:
▪ it saves storage space;
▪ it saves communication time.
Data compression is beneficial if data mining algorithms can
manipulate the compressed data directly, without
uncompressing it.
Parametric methods:
▪ Regression (a model or function estimating the distribution
instead of the data)
Non-parametric methods:
▪ Histograms
▪ Clustering
▪ Sampling
Histograms are a popular data reduction technique:
▪ Divide the data into buckets and store a representation of the
buckets (sum, count, etc.)
▪ Equi-width
▪ Equi-depth
▪ V-Optimal
▪ MaxDiff
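A short NumPy sketch of equi-width and equi-depth (quantile-based) bucketing on the price data from the earlier example; V-Optimal and MaxDiff are not shown:

```python
import numpy as np

prices = np.array([8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34])

# Equi-width: bucket ranges have equal width
width_edges = np.linspace(prices.min(), prices.max(), num=4)   # 3 buckets
width_counts, _ = np.histogram(prices, bins=width_edges)

# Equi-depth: each bucket holds (roughly) the same number of values
depth_edges = np.quantile(prices, [0, 1/3, 2/3, 1])
depth_counts, _ = np.histogram(prices, bins=depth_edges)

# Store only the bucket edges and summary statistics (count, sum) instead of the raw data
print(width_edges, width_counts)
print(depth_edges, depth_counts)
```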
▪ Partition the data into clusters based on “closeness” in space.
▪ Retain representatives of the clusters (centroids) and the
outliers.
▪ Effectiveness depends upon the distribution of the data.
▪ Hierarchical clustering is possible (multi-resolution).
Sampling allows a large data set to be represented by a much
smaller random sample (sub-set) of the data:
▪ Simple random sample without replacement (SRSWOR)
▪ Simple random sample with replacement (SRSWR)
▪ Cluster sample (SRSWOR or SRSWR from clusters)
▪ Stratified sample
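A minimal pandas sketch of SRSWOR, SRSWR and a stratified sample; the stratum column is invented, and GroupBy.sample assumes pandas 1.1 or newer:

```python
import pandas as pd

df = pd.DataFrame({
    "price": [8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34],
    "stratum": ["low"] * 6 + ["high"] * 6,
})

# Simple random sample without replacement (SRSWOR)
srswor = df.sample(n=4, replace=False, random_state=0)

# Simple random sample with replacement (SRSWR)
srswr = df.sample(n=4, replace=True, random_state=0)

# Stratified sample: draw the same fraction from every stratum
stratified = df.groupby("stratum", group_keys=False).sample(frac=0.5, random_state=0)
```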
▪ Discretization is used to reduce the number of values for a
given continuous attribute, by dividing the range of the
attribute into intervals.
▪ Discretization can reduce the data set, and can also be used
to generate concept hierarchies automatically.
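As a sketch, equal-width and equal-frequency discretization of the earlier price values with pandas cut and qcut:

```python
import pandas as pd

prices = pd.Series([8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34])

# Equal-width discretization: split the attribute's range into 3 intervals
equal_width = pd.cut(prices, bins=3)

# Equal-frequency discretization: each interval holds roughly the same number of values
equal_freq = pd.qcut(prices, q=3)

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```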
