02 - Data Preprocessing
— Chapter 2 —
Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber, All rights reserved
Chapter 2: Data Preprocessing
(Figure: forms of data preprocessing — data cleaning removes noise and corrects inconsistencies in the data; data integration merges data from multiple sources; data transformation applies normalization; data reduction reduces the data size by eliminating redundant features or by clustering.)
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
noisy: containing errors or outliers
inconsistent: containing discrepancies in codes or names
Data cleaning
Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Example: naming inconsistencies.
Data transformation
Normalization and aggregation
Major Tasks in Data Preprocessing
Data reduction
Obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results
Data discretization
Part of data reduction but with particular importance, especially for
numerical data
Motivation
For many data preprocessing tasks, users would
like to learn about data characteristics regarding
both central tendency and dispersion of the data.
Measures of central tendency include mean,
median, mode, and midrange, while measures of
data dispersion include quartiles, interquartile
range (IQR), and variance.
In particular, it is necessary to introduce the notions of
distributive measure, algebraic measure, and holistic
measure. Knowing what kind of measure we are dealing
with can help us choose an efficient implementation for
it.
2.2 - Descriptive Data Summarization
2.2.1 - Measuring the Central Tendency
A distributive measure is a measure (i.e., function) that can be
computed for a given data set by partitioning the data into smaller
subsets, computing the measure for each subset, and then merging the
results in order to arrive at the measure’s value for the original
(entire) data set. Both sum() and count() are distributive measures
because they can be computed in this manner. Other examples include
max() and min().
An algebraic measure is a measure that can be computed by applying
an algebraic function to one or more distributive measures. Hence,
average (or mean()) is an algebraic measure because it can be
computed by sum()/count().
A holistic measure is a measure that must be computed on the entire
data set as a whole. It cannot be computed by partitioning the given
data into subsets and merging the values obtained for the measure in
each subset. The median is an example of a holistic measure.
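As a quick illustration, here is a minimal Python sketch (the partition layout and values are hypothetical) showing that the mean, an algebraic measure, can be assembled from the distributive measures sum() and count() computed per partition, while the median needs the whole data set:

```python
# Sketch: computing an algebraic measure (mean) from distributive measures
# (sum and count) over partitions, without holding the whole data set at once.
# The partition layout here is hypothetical.

partitions = [
    [30, 36, 47, 50],      # subset 1
    [52, 52, 56],          # subset 2
    [60, 63, 70, 70, 110], # subset 3
]

# sum() and count() are distributive: compute per subset, then merge.
total_sum = sum(sum(p) for p in partitions)
total_count = sum(len(p) for p in partitions)

# mean() is algebraic: a function of the two distributive measures.
mean = total_sum / total_count
print(mean)

# The median is holistic: it needs the entire (sorted) data set at once.
import statistics
median = statistics.median(v for p in partitions for v in p)
print(median)
```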
Trimmed Mean:
To offset the effect caused by a small number of extreme values, we can instead
use the trimmed mean, which is the mean obtained after chopping off values at
the high and low extremes. For example, we can sort the values observed for
salary and remove the top and bottom 2% before computing the mean. We should
avoid trimming too large a portion (such as 20%) at both ends as this can result in
the loss of valuable information.
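A minimal Python sketch of a trimmed mean (the salary values and the helper name trimmed_mean are illustrative, not from the text):

```python
# Sketch of a trimmed mean: drop a fixed proportion of values at each extreme
# before averaging. The salary data below is made up for illustration.

def trimmed_mean(values, proportion=0.02):
    """Mean after chopping off `proportion` of values at each extreme."""
    data = sorted(values)
    k = int(len(data) * proportion)          # number of values to drop per end
    trimmed = data[k:len(data) - k] if k else data
    return sum(trimmed) / len(trimmed)

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # hypothetical
print(trimmed_mean(salaries, proportion=0.10))  # drops 1 value at each end
print(sum(salaries) / len(salaries))            # plain mean, pulled up by 110
```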
Median: to find the median of a data set of N values, sort the values in increasing order. If N is odd, the median is the middle value of the ordered set; otherwise (i.e., if N is even), the median is the average of the middle two values.
Mode
Value that occurs most frequently in the data
Midrange
The midrange can also be used to assess the central tendency of a data set.
It is the average of the largest and smallest values in the set. This
algebraic measure is easy to compute using the SQL aggregate functions,
max() and min().
Worked example (variance and standard deviation): for each data value, compute its squared deviation by subtracting the mean and squaring the result. Here the sum of these squared deviations is 10,692.87; dividing by 6 gives the variance, 1782.15, and taking the square root gives the standard deviation, about 42.2.
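A short Python sketch of the same procedure, using made-up data values rather than the ones behind the figures above:

```python
import math

data = [20, 35, 45, 55, 60, 75, 80]  # hypothetical sample of n = 7 values

n = len(data)
mean = sum(data) / n

# Squared deviation of each value from the mean.
squared_devs = [(x - mean) ** 2 for x in data]

sample_variance = sum(squared_devs) / (n - 1)   # divide by n - 1
std_dev = math.sqrt(sample_variance)

print(mean, sample_variance, std_dev)
```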
2.2 - Descriptive Data Summarization
2.2.3- Graphic Displays of Basic Descriptive Data Summaries
Aside from the bar charts, pie charts, and line graphs used in most statistical or
graphical data presentation software packages, there are other popular types of
graphs for the display of data summaries and distributions. These include
histograms, quantile plots, q-q plots, scatter plots, and loess curves. Such
graphs are very helpful for the visual inspection of your data.
Plotting histograms, or frequency histograms, is a graphical method for
summarizing the distribution of a given attribute.
A scatter plot is one of the most effective graphical methods for determining if
there appears to be a relationship, pattern, or trend between two numerical
attributes. To construct a scatter plot, each pair of values is treated as a pair of
coordinates in an algebraic sense and plotted as points in the plane.
A loess curve is another important exploratory graphic aid that adds a smooth
curve to a scatter plot in order to provide better perception of the pattern of
dependence. The word loess is short for “local regression.”
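Assuming matplotlib is available, a minimal sketch of two of these displays, a frequency histogram and a scatter plot, on invented attributes unit_price and items_sold:

```python
import matplotlib.pyplot as plt
import random

random.seed(0)

# Hypothetical paired attributes, e.g. unit_price vs. items_sold.
unit_price = [random.uniform(40, 120) for _ in range(200)]
items_sold = [500 - 3 * p + random.gauss(0, 30) for p in unit_price]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: distribution of a single attribute.
ax1.hist(unit_price, bins=10, edgecolor="black")
ax1.set_xlabel("unit_price")
ax1.set_ylabel("frequency")

# Scatter plot: relationship between two numerical attributes.
ax2.scatter(unit_price, items_sold, s=10)
ax2.set_xlabel("unit_price")
ax2.set_ylabel("items_sold")

plt.tight_layout()
plt.show()
```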
Suppose that the data for analysis includes the attribute age. The age values
for the data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22,
22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) What is the mean of the data? What is the median?
(b) What is the mode of the data? Comment on the data’s modality (i.e.,
bimodal, trimodal, etc.).
(c) What is the midrange of the data?
(d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of
the data?
(e) Give the five-number summary of the data.
(f) Show a boxplot of the data.
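A Python sketch that computes most of these quantities with the standard library (quartile conventions vary, so Q1 and Q3 are only rough, as the question allows):

```python
import statistics

age = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25,
       30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

mean = statistics.mean(age)                        # (a) mean
median = statistics.median(age)                    # (a) median
modes = statistics.multimode(age)                  # (b) all most frequent values
midrange = (min(age) + max(age)) / 2               # (c) midrange
q1, _, q3 = statistics.quantiles(age, n=4)         # (d) rough quartiles
five_number = (min(age), q1, median, q3, max(age)) # (e) five-number summary

print(mean, median, modes, midrange, five_number)
```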
Variance and standard deviation. The sample variance of $n$ observations $x_1, \dots, x_n$ with mean $\bar{x}$ is

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\Bigg[\sum_{i=1}^{n}x_i^2 - \frac{1}{n}\Big(\sum_{i=1}^{n}x_i\Big)^2\Bigg]$$

and the (population) variance of $N$ values with mean $\mu$ is

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N}x_i^2 - \mu^2$$

The standard deviation is the square root of the variance.
“Data cleaning is the number one problem in data warehousing”—DCI survey
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
Dirty data can arise from many sources, including technology limitations, incomplete data, and inconsistent data.
Binning: binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the values around it. The sorted values are distributed into a number of “buckets,” or bins. Because binning methods consult the neighborhood of values, they perform local smoothing. Once the values are partitioned into bins, one can smooth by bin means, smooth by bin medians, or smooth by bin boundaries; in smoothing by bin means, each value in a bin is replaced by the mean value of the bin (see the sketch below).
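A minimal Python sketch of equal-depth binning with smoothing by bin means, on a small sorted price list:

```python
# Sketch of equal-frequency (equal-depth) binning with smoothing by bin means.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # already sorted
n_bins = 3
depth = len(prices) // n_bins                  # values per bin

bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: replace every value by the mean of its bin.
smoothed = []
for b in bins:
    mean = sum(b) / len(b)
    smoothed.extend([mean] * len(b))

print(bins)      # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```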
Clustering: detect and remove outliers by organizing similar values into groups, or “clusters”; values that fall outside of the clusters may be considered outliers.
Attribute construction: new attributes can be constructed and added from the given set of attributes to help the mining process.
Data Transformation: Normalization
Min-max normalization to $[\text{new\_min}_A, \text{new\_max}_A]$:

$$v' = \frac{v - \text{min}_A}{\text{max}_A - \text{min}_A}\,(\text{new\_max}_A - \text{new\_min}_A) + \text{new\_min}_A$$

Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716.

Z-score normalization ($\mu_A$: mean, $\sigma_A$: standard deviation of attribute $A$):

$$v' = \frac{v - \mu_A}{\sigma_A}$$

Ex. Let μ = 54,000 and σ = 16,000. Then $73,600 is mapped to (73,600 − 54,000) / 16,000 = 1.225.

Normalization by decimal scaling:

$$v' = \frac{v}{10^j}, \quad \text{where } j \text{ is the smallest integer such that } \max(|v'|) < 1$$
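A small Python sketch of the three methods, applied to the income example above (helper names are illustrative):

```python
import math

def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    return (v - mean_a) / std_a

def decimal_scaling(v, max_abs):
    # smallest j with max(|v'|) < 1, assuming max_abs >= 1
    j = math.floor(math.log10(max_abs)) + 1
    return v / (10 ** j)

print(min_max(73_600, 12_000, 98_000))   # ~0.716
print(z_score(73_600, 54_000, 16_000))   # 1.225
print(decimal_scaling(73_600, 98_000))   # 0.736 with j = 5
```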
Data Transformation: Normalization
Note that normalization can change the original data quite a bit, especially
the latter two methods shown above. It is also necessary to save the
normalization parameters (such as the mean and standard deviation if using
z-score normalization) so that future data can be normalized in a uniform
manner.
In attribute construction, new attributes are constructed from the given
attributes and added in order to help improve the accuracy and
understanding of structure in high-dimensional data. For example, we may
wish to add the attribute area based on the attributes height and width. By
combining attributes, attribute construction can discover missing
information about the relationships between data attributes that can be
useful for knowledge discovery.
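A minimal pandas sketch of this example (column names and values are invented):

```python
# Sketch of attribute construction: deriving `area` from `height` and `width`.
import pandas as pd

df = pd.DataFrame({
    "height": [2.0, 3.5, 4.0],
    "width":  [1.5, 2.0, 2.5],
})

# New attribute constructed from the given ones.
df["area"] = df["height"] * df["width"]
print(df)
```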
Dimensionality Reduction
In dimensionality reduction, data encoding or
transformations are applied so as to obtain a reduced or
“compressed” representation of the original data. If the
original data can be reconstructed from the compressed
data without any loss of information, the data reduction
is called lossless. If, instead, we can reconstruct only an
approximation of the original data, then the data
reduction is called lossy.
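One common way to obtain such a lossy, compressed representation is to keep only the top-k components of a singular value decomposition; a minimal numpy sketch (the data, and the choice k = 3, are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # hypothetical data: 100 tuples, 10 attributes

# Center the data and take its SVD.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 3                                  # keep only the top-k components (lossy)
X_reduced = U[:, :k] * s[:k]           # compressed representation: 100 x 3
X_approx = X_reduced @ Vt[:k] + X.mean(axis=0)   # reconstruct an approximation

print(np.linalg.norm(X - X_approx))    # reconstruction error > 0: lossy
```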
Numerosity Reduction
Reduce data volume by choosing alternative, smaller
forms of data representation
Parametric methods
Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
Non-parametric methods
Do not assume models; major families include histograms, clustering, and sampling
Regression and Log-Linear Models
Linear regression:
Linear regression analyzes the relationship between two variables, X and Y. For each subject (or
experimental unit), you know both X and Y and you want to find the best straight line through the
data. In some situations, the slope and/or intercept have a scientific meaning. In other cases, you
use the linear regression line as a standard curve to find new values of X from Y, or Y from X.
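A minimal numpy sketch of fitting and then reusing such a line (data and noise level are invented):

```python
import numpy as np

# Hypothetical (X, Y) pairs roughly following Y = 2X + 1 plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=50)
Y = 2.0 * X + 1.0 + rng.normal(scale=0.5, size=50)

# Least-squares fit of the best straight line Y ~ slope * X + intercept.
slope, intercept = np.polyfit(X, Y, deg=1)
print(slope, intercept)

# Use the fitted line as a "standard curve": predict Y for a new X,
# or invert it to estimate X from an observed Y.
y_new = slope * 4.2 + intercept
x_from_y = (9.7 - intercept) / slope
print(y_new, x_from_y)
```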
Stratified sampling
Approximate the percentage of each class (or subpopulation of interest) in the overall database
Used in conjunction with skewed data (a sketch follows the heading below)
Data Reduction Method (4): Sampling
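A minimal Python sketch of stratified sampling over a skewed, hypothetical class distribution, so that class percentages in the sample approximate those in the full data set:

```python
import random
from collections import defaultdict

random.seed(0)

# Hypothetical tuples of (customer_id, class_label); the class distribution
# is skewed: many more "low" than "high" customers.
data = [(i, "high" if i % 10 == 0 else "low") for i in range(1, 1001)]

def stratified_sample(tuples, fraction):
    """Sample each class (stratum) separately so class percentages
    in the sample approximate those in the full data set."""
    strata = defaultdict(list)
    for t in tuples:
        strata[t[1]].append(t)
    sample = []
    for label, members in strata.items():
        k = max(1, round(len(members) * fraction))
        sample.extend(random.sample(members, k))
    return sample

s = stratified_sample(data, fraction=0.05)
print(len(s), sum(1 for t in s if t[1] == "high"))  # ~50 tuples, ~5 "high"
```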