BIS 541 Ch03 20-21 S
Data Mining
2020/2021 Spring
Chapter 3
Data Preprocessing
1
Chapter 3: Data Preprocessing
2
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
6
Forms of data preprocessing
Chapter 3: Data Preprocessing
8
Data Cleaning
Data in the Real World Is Dirty: Lots of potentially incorrect data,
e.g., faulty instruments, human or computer error, transmission errors
incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
e.g., Occupation=“ ” (missing data)
noisy: containing noise, errors, or outliers
e.g., Salary=“−10” (an error)
inconsistent: containing discrepancies in codes or names, e.g.,
Age=“42”, Birthday=“03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
discrepancy between duplicate records
Intentional (e.g., disguised missing data)
Jan. 1 as everyone’s birthday?
9
Incomplete (Missing) Data
Possible reasons include:
technology limitation
incomplete data
inconsistent data
12
How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency or
equal-width) bins
then one can smooth by bin means, bin medians, or bin boundaries
Clustering
detect and remove outliers
13
Binning
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26,
28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
15
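A small Python sketch of the partitioning and smoothing steps shown above; the helper functions are illustrative, not from any particular library:

```python
# Sketch: equal-frequency binning with smoothing by bin means / boundaries.
# Assumes the data is already sorted; function names are illustrative only.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

def equal_frequency_bins(values, n_bins):
    """Split sorted values into n_bins bins of (nearly) equal size."""
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closest bin boundary (min or max of the bin)."""
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

bins = equal_frequency_bins(prices, 3)
print(bins)                       # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))      # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins)) # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```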
Inconsistent Data
Inconsistent data may be due to
faulty data collection instruments
18
Data Integration
Data integration:
Combines data from multiple sources into a coherent store
Entity identification problem:
Identify real world entities from multiple data sources
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Object matching: e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts:
For the same real world entity, attribute values from different sources are
different
Possible reasons: different representations, different scales, e.g., metric vs.
British units
19
Handling Redundancy in Data Integration
21
Data Reduction Strategies
Data reduction: Obtain a reduced representation of the data set that
is much smaller in volume but yet produces the same (or almost the
same) analytical results
Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time to
run on the complete data set.
Data reduction strategies
Dimensionality reduction, e.g., remove unimportant attributes
Numerosity reduction, e.g., histograms, sampling
Data compression
22
Data Reduction Situations
Eliminate redundant attributes
Correlation coefficients show that
Watch and Magazine promotions are associated
Eliminate one (see the sketch below)
Combine variables to reduce the number of independent
variables
Principal component analysis
24
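A rough sketch of how such a redundancy check can be done: encode the yes/no promotion attributes as 1/0 and compute a Pearson correlation coefficient. The data values below are made up for illustration:

```python
# Sketch: flag a redundant attribute via the Pearson correlation coefficient.
# The 1/0 encoding of yes/no and the sample columns are illustrative assumptions.
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical yes/no promotion columns encoded as 1/0
magazine = [1, 1, 0, 1, 1, 0, 1, 0, 1, 1]
watch    = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]

print(round(pearson(magazine, watch), 2))   # about 0.80 for this toy data
# If |r| is close to 1, the two attributes carry largely the same
# information and one of them can be eliminated.
```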
Curse of Dimensionality
When dimensionality increases, data becomes
increasingly sparse
Density and distance between points, which are
critical to clustering and outlier analysis, become
less meaningful
The possible combinations of subspaces will
grow exponentially
Advantages of Dimensionality Reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce
noise
Reduce time and space required in data mining
Reduce # of patterns in the evaluation phase,
easier to understand the results
Allow easier visualization
Dimensionality Reduction Techniques
Feature (attribute) subset selection:
Find a subset of the original variables (or
features, attributes)
Feature (attribute) creation (generation):
Create new attributes (features) that can
capture the important information in a data set
more effectively than the original ones
Principal component analysis (PCA):
Transform the data in the high-dimensional
space to a space of fewer dimensions
Feature (Attribute) Subset Selection
Redundant attributes
Duplicate much or all of the information contained in
one or more other attributes
E.g., purchase price of a product and the amount of
sales tax paid
Irrelevant attributes
Contain no information that is useful for the data
mining task at hand
E.g., students' ID is often irrelevant to the task of
predicting students' GPA
28
Heuristic Feature Selection Methods
There are 2^d possible feature subsets of d features
Several heuristic feature selection methods:
Best single features under the feature
independence assumption:
Choose by significance tests.
Best step-wise feature selection:
The best single feature is picked first
then the next best feature conditioned on the first, and so on
Income Range | Magazine Promotion | Watch Promotion | Gender | Age | Life Insurance Promotion | Credit Card Insurance
40-50 K | Yes | No | Male | 45 | No | No
30-40 K | Yes | Yes | Female | 40 | No | Yes
40-50 K | No | No | Male | 42 | No | No
30-40 K | Yes | Yes | Male | 43 | Yes | Yes
50-60 K | Yes | No | Female | 38 | No | Yes
20-30 K | No | No | Female | 55 | No | No
30-40 K | Yes | No | Male | 35 | Yes | Yes
20-30 K | No | Yes | Male | 27 | No | No
30-40 K | Yes | No | Male | 43 | No | No
30-40 K | Yes | Yes | Female | 41 | No | Yes
40-50 K | No | Yes | Female | 43 | No | Yes
20-30 K | No | Yes | Male | 29 | No | Yes
50-60 K | Yes | Yes | Female | 39 | No | Yes
40-50 K | No | Yes | Male | 55 | No | No
20-30 K | No | No | Female | 19 | Yes | Yes
Example of Decision Tree Induction
[Figure: induced decision tree with splits on attributes such as A1 and A6; attributes that do not appear in the tree are eliminated from the reduced attribute set]
Example: Inflation Targeting Model
Suppose there are 45 candidate macro variables: which ones are the
best predictors of the inflation rate (three months ahead)?
Aim: develop a simple model to predict inflation by using
only a couple of those 45 macro variables
Best-stepwise feature selection:
Models with one independent variable
45 models
Select the best single variable – say X3
Keep X3 in the model
Find the second best variable
try 44 models
– second best variable – say X43
Continue until a stopping criterion is satisfied
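A minimal sketch of this forward step-wise procedure, using synthetic data and an ordinary least-squares R^2 as the scoring criterion; the actual inflation data and the 45 macro variables are not reproduced here:

```python
# Sketch of best step-wise (forward) feature selection as described above.
# Synthetic data: in the inflation example the columns would be the 45
# candidate macro variables and y the inflation rate three months ahead.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))            # 8 candidate predictors (stand-in for 45)
y = 2.0 * X[:, 3] - 1.0 * X[:, 5] + rng.normal(scale=0.1, size=200)

def fit_score(cols):
    """R^2 of an ordinary least-squares fit using the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - resid.var() / y.var()

selected, remaining = [], list(range(X.shape[1]))
best_score = -np.inf
while remaining:
    # Try adding each remaining variable and keep the best one
    scores = {c: fit_score(selected + [c]) for c in remaining}
    c_best = max(scores, key=scores.get)
    if scores[c_best] - best_score < 1e-3:   # stopping criterion
        break
    best_score = scores[c_best]
    selected.append(c_best)
    remaining.remove(c_best)

print("selected variables:", selected, "R^2 =", round(best_score, 3))
```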
Example: Inflation Targeting Model (cont.)
Best feature elimination:
Develop 45 models, each omitting one of the variables; eliminate the
variable whose removal degrades the model least, then repeat
Attribute construction (domain-specific):
combining existing features into new ones
34
Example: Attribute/Feature Construction
Automatic attribute generation
using product or AND operations (see the sketch below)
[Figure: example data plotted on axes x1 and x2]
38
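A tiny sketch of the idea: on XOR-like toy data, neither x1 nor x2 alone separates the two classes, but a constructed product attribute does. The data is made up for illustration:

```python
# Sketch: constructing a new attribute as the product of two existing ones.
# Neither x1 nor x2 alone separates classes A and B in this toy data,
# but the constructed attribute x1 * x2 does (its sign matches the class).
points = [(+1, +1, "A"), (-1, -1, "A"), (+1, -1, "B"), (-1, +1, "B")]
for x1, x2, cls in points:
    print(x1, x2, "->", x1 * x2, cls)   # product is +1 for class A, -1 for class B
```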
Principal Component Analysis (cont.)
Given N data objects in n dimensions, find k < n
orthogonal vectors that can best be used to represent the data
The original data set is reduced to one consisting of N data
vectors on k principal components (reduced dimensions)
Each data vector (object) is a linear combination of the k
principal component vectors
Works for numeric data only
Used when the number of dimensions is large
39
Dimensionality Reduction from 2 to 1
40
Change of coordinates
[Figure: a point P with coordinates in the original axes X1, X2 and new coordinates (y1, y2) along rotated axes Y1, Y2, expressed in unit vectors a1, a2; removing a2 reduces the dimensionality from 2 to 1]
41
Principal Component Analysis (Steps)
Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors
(principal components) that can be best used to represent data
Normalize input data: Each attribute falls within the same range
Compute k orthonormal (unit) vectors, i.e., principal components
Each input data (vector) is a linear combination of the k principal
component vectors
The principal components are sorted in order of decreasing
“significance” or strength
Since the components are sorted, the size of the data can be
reduced by eliminating the weak components, i.e., those with low
variance (i.e., using the strongest principal components, it is
possible to reconstruct a good approximation of the original data)
Works for numeric data only
42
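A compact numpy sketch of these steps; the function name pca_reduce and the sample points are ours, not from the slides:

```python
# Sketch of the PCA steps above: normalize, compute orthonormal component
# vectors, sort them by variance, and keep the k strongest.
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto the k strongest principal components."""
    # 1. Normalize input data (z-scores) so attributes share the same range
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Principal components = eigenvectors of the covariance matrix
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
    # 3. Sort components by decreasing "significance" (variance explained)
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:k]]
    # 4. Each reduced data vector is a linear combination of the k components
    return Z @ components

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])
print(pca_reduce(X, k=1))   # 2-D points reduced to 1 dimension
```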
Positively and Negatively Correlated Data
43
A Nonlinear Relationship Among Variables
44
Not Correlated Data
If the variables are not correlated, the dimensionality
cannot be reduced from 2 to 1
45
How to perform PCA
46
Is Data Appropriate for PCA
1 – Correlation matrix
presents correlation coefficients among pairs of variables
(bivariate)
High correlations among variables are likely to form
components
higher values indicate that the data is appropriate for
PCA
47
Communality
the proportion of each variable's variance accounted for by
the analysis (the extracted components)
48
How It Works
Use mean-corrected or normalized (z-score) values:
if the original variables are in different units, some would otherwise
dominate the total variability
total variability =
Var(X1) + Var(X2) + ... + Var(Xn) = n (# of variables)
total variability is not: Var(X1 + X2 + ... + Xn)
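A quick numeric check of this claim on made-up data: after z-score normalization each variable has variance 1, so the total variability sums to n:

```python
# Check: after z-score normalization the total variability
# Var(Z1) + ... + Var(Zn) equals n, the number of variables.
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([rng.normal(10, 2, 500),      # variables in different units
                     rng.normal(0, 50, 500),
                     rng.uniform(0, 1, 500)])
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(Z.var(axis=0).sum())   # ~3.0, i.e., n = number of variables
```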
How It Works (cont.)
Note that the total variance is the same as n (the total # of
original variables)
First principal component:
is along the direction with the highest variability
it captures as much variability as possible
...
Number of components
components with an eigenvalue over 1.0 (> 1.0) are taken
(other criteria are also possible)
there is no correlation among the extracted factors/components
Naming of components
name factors or components by examining the variables that
load highly on each
53
Data Reduction 2: Numerosity Reduction
Reduce data volume by choosing alternative, smaller
forms of data representation
Parametric methods (e.g., regression)
Assume the data fits some model, estimate the model parameters,
and store only the parameters
Non-parametric methods: do not assume models, e.g., histograms, sampling
54
Histogram Analysis
Divide data into buckets and store the average (sum) for each bucket
Partitioning rules:
Equal-width: equal bucket range
Equal-frequency (or equal-depth)
[Figure: histogram with buckets of width 10,000 covering values from 10,000 to 100,000]
55
Histogram Analysis: Methods
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
Max Difference
Specify number of buckets: k
Sort the data in descending order
Make a bucket boundary between adjacent values x_i and x_{i+1}
if their difference is one of the k-1 largest adjacent differences,
i.e., x_{i+1} - x_i is at least the (k-1)-th largest adjacent
difference among x_1, ..., x_N
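A short sketch of this MaxDiff rule, reusing the price list from the binning example; the function name is ours:

```python
# Sketch of MaxDiff partitioning: place the k-1 bucket boundaries at the
# k-1 largest gaps between adjacent sorted values.
def maxdiff_buckets(values, k):
    xs = sorted(values)
    # gaps between adjacent values, ranked from largest to smallest
    gaps = sorted(range(len(xs) - 1), key=lambda i: xs[i + 1] - xs[i], reverse=True)
    cut_points = sorted(gaps[:k - 1])          # indices after which we cut
    buckets, start = [], 0
    for cut in cut_points:
        buckets.append(xs[start:cut + 1])
        start = cut + 1
    buckets.append(xs[start:])
    return buckets

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(maxdiff_buckets(prices, k=3))
# [[4, 8, 9], [15], [21, 21, 24, 25, 26, 28, 29, 34]]
```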
V-optimal Design
Sort the values
Assign an equal number of values to each bucket
Compute variance
repeat
move boundary values between adjacent buckets
compute the new variance
until no reduction in variance
variance = (n1*Var1 + n2*Var2 + ... + nk*Vark) / N
where N = n1 + n2 + ... + nk and
Var_i = sum_{j=1..n_i} (x_j - mean_i)^2 / n_i
Types of Sampling
Sampling without replacement
Once an object is selected, it is removed from the
population
Sampling with replacement
A selected object is not removed from the population
Stratified sampling:
Partition the data set, and draw samples from each partition
Cluster sampling:
Draw samples of whole clusters (groups of objects) rather than individual objects
61
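A standard-library sketch of these sampling schemes; the even/odd strata are a toy stand-in for real strata such as customer segments:

```python
# Sketch of the sampling schemes above using only the standard library.
import random
from collections import defaultdict

data = list(range(100))                    # stand-in for N data objects
random.seed(0)

srswor = random.sample(data, 10)           # without replacement: no repeats
srswr = random.choices(data, k=10)         # with replacement: repeats possible

# Stratified sampling: partition the data, then draw from each stratum
strata = defaultdict(list)
for x in data:
    strata["even" if x % 2 == 0 else "odd"].append(x)
stratified = [x for group in strata.values() for x in random.sample(group, 5)]

print(srswor, srswr, stratified, sep="\n")
```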
Sampling Methods
Simple random sample without replacement (SRSWOR) of size n
Simple random sample with replacement (SRSWR) of size n
[Figure: SRSWOR and SRSWR samples drawn from the raw data]
64
Sampling: Cluster or Stratified Sampling
65
Chapter 3: Data Preprocessing
66
Data Transformation
A function that maps the entire set of values of a given attribute to a
new set of replacement values s.t. each old value can be identified
with one of the new values
Methods
Smoothing: Remove noise from data
Attribute/feature construction
New attributes constructed from the given ones
Aggregation: Summarization, data cube construction
Normalization: Scaled to fall within a smaller, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Discretization: Concept hierarchy climbing
67
Normalization
min-max normalization
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
When a future value falls outside [min_A, max_A], the normalized score
falls outside the new range and the algorithm must handle this out-of-range case
z-score normalization
v' = (v - mean_A) / stand_dev_A
70
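A small sketch of the two formulas above; the income figures (min 12,000, max 98,000, mean 54,000, standard deviation 16,000) are hypothetical:

```python
# Sketch of min-max and z-score normalization (function names are ours).
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

# Hypothetical income attribute: min 12,000, max 98,000, mean 54,000, std 16,000
print(min_max(73600, 12000, 98000))   # ~0.716, mapped into [0, 1]
print(z_score(73600, 54000, 16000))   # ~1.225 standard deviations above the mean
```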
Normalization (cont.)
Decimal scaling
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
Logarithmic transformations:
Y' = log Y
Used in some distance based methods
Clustering
For ratio scaled variables
E.g., weight, TL/dollar
Distance between two persons is related to
percentage changes rather than actual differences
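A similar sketch for decimal scaling and the logarithmic transformation; the sample values are made up:

```python
# Sketch of decimal scaling and the logarithmic transformation above.
import math

def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer making max(|v'|) < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(decimal_scaling([-991, 23, 450]))   # j = 3 -> [-0.991, 0.023, 0.45]

# Log transform for a ratio-scaled variable (e.g., weight): differences in
# log(Y) correspond to percentage changes rather than absolute differences.
weights = [50, 55, 100, 110]
print([round(math.log(w), 3) for w in weights])
# log(55) - log(50) == log(110) - log(100): a 10% change gives the same distance
```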
Linear Transformations
74
Data Discretization Methods
Typical methods: All the methods can be applied recursively
Binning
Top-down split, unsupervised
Histogram analysis
Top-down split, unsupervised
Clustering analysis (unsupervised, top-down split or
bottom-up merge)
Decision-tree analysis (supervised, top-down split)
Correlation (e.g., χ²) analysis (unsupervised, bottom-up
merge)
75
Discretization by Binning
each bin is represented by its mean/median or by a symbolic value
numerical to numerical
numerical to categorical: discretization
categorical to categorical: aggregation
categorical to numerical
Numerical to Numerical
min-max normalization
between 0 and 1 or -1 and 1
z-score normalization
Discretization – binning
represented by bin mean, bin median, …
Numerical to categorical
Discretization – binning
bins are represented by symbolic values
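A brief sketch of this numerical-to-categorical discretization, mapping ages to symbolic bin labels; the boundaries and labels are chosen only for illustration:

```python
# Sketch: numerical-to-categorical discretization where each bin gets a
# symbolic label instead of a numeric representative (labels are ours).
def discretize(value, boundaries, labels):
    """Map a numeric value to the label of the bin it falls into."""
    for upper, label in zip(boundaries, labels):
        if value <= upper:
            return label
    return labels[-1]

ages = [19, 27, 35, 42, 55]
labels = ["young", "middle", "senior"]
print([discretize(a, boundaries=[30, 50], labels=labels) for a in ages])
# ['young', 'young', 'middle', 'middle', 'senior']
```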
Symbolic Variables
Gender | Education | Marital Status | Buy
male | high school | single | yes
female | no school | married | no
female | graduate | divorced | yes
male | high school | single | yes
male | undergraduate | married | no
female | primary school | divorced | no
Numerical Variables
Gender | Education | Is_Single | Is_Married | Is_Divorced | Buy
0 | 0.50 | 1 | 0 | 0 | 1
1 | 0.00 | 0 | 1 | 0 | 0
1 | 1.00 | 0 | 0 | 1 | 1
0 | 0.50 | 1 | 0 | 0 | 1
0 | 0.75 | 0 | 1 | 0 | 0
1 | 0.25 | 0 | 0 | 1 | 0
Example: Thermometer Encoding for an Ordinal
Variable
Education | I1 | I2 | I3 | I4
high school | 0 | 0 | 1 | 1
no school | 0 | 0 | 0 | 0
graduate | 1 | 1 | 1 | 1
high school | 0 | 0 | 1 | 1
undergraduate | 0 | 1 | 1 | 1
primary school | 0 | 0 | 0 | 1
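A sketch reproducing the thermometer encoding in the table above; the assumed ordering of education levels is no school < primary school < high school < undergraduate < graduate:

```python
# Sketch of thermometer encoding: an ordinal level r (out of m indicator
# columns) is encoded so that the top r indicators I1..I4 switched on from
# the right match the table above. The level order is an assumption.
levels = ["no school", "primary school", "high school", "undergraduate", "graduate"]

def thermometer(value):
    r = levels.index(value)                # ordinal rank 0 .. 4
    m = len(levels) - 1                    # 4 indicator columns I1..I4
    return [1 if i >= m - r else 0 for i in range(m)]

for edu in ["high school", "no school", "graduate", "undergraduate", "primary school"]:
    print(edu, thermometer(edu))
# high school -> [0, 0, 1, 1], no school -> [0, 0, 0, 0], graduate -> [1, 1, 1, 1], ...
```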
Chapter 3: Data Preprocessing
85
Summary
Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
Data cleaning: e.g. missing/noisy values, outliers
Data integration from multiple sources:
Entity identification problem
Remove redundancies
Detect inconsistencies
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
86