
Data Mining:

Concepts and Techniques

— Chapter 2 —

Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber, All rights reserved
Chapter 2: Data Preprocessing

 Why preprocess the data?


 Descriptive data summarization
 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy generation
 Summary

Introduction
“How can the data be preprocessed in order to help improve the quality of
the data and, consequently, of the mining results? How can the data be
preprocessed so as to improve the efficiency and ease of the mining
process?”

Data preprocessing techniques
 Data cleaning: remove noise and correct inconsistencies in the data
 Data integration: merges data from multiple sources
 Data reduction: reduces the data size by eliminating redundant features, or by clustering
 Data transformations: e.g., normalization
Why Data Preprocessing?
 Data in the real world is dirty
 incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
 e.g., occupation=“ ”
 noisy: containing errors or outliers
 e.g., Salary=“-10”
 inconsistent: containing discrepancies in codes or names
 e.g., Age=“42”, Birthday=“03/07/1997”
 e.g., was rating “1,2,3”, now rating “A, B, C”
 e.g., discrepancy between duplicate records
Why Is Data Dirty?
 Incomplete data may come from
 “Not applicable” data value when collected
 Different considerations between the time when the data was
collected and when it is analyzed.
 Human/hardware/software problems
 Noisy data (incorrect values) may come from
 Faulty data collection instruments
 Human or computer error at data entry
 Errors in data transmission
 Inconsistent data may come from
 Different data sources
 Functional dependency violation (e.g., modify some linked
data)
 Duplicate records also need data cleaning
Why Is Data Preprocessing
Important?

 No quality data, no quality mining results!


 Quality decisions must be based on quality data

e.g., duplicate or missing data may cause incorrect or
even misleading statistics.
 Data warehouse needs consistent integration of
quality data
 Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse

Major Tasks in Data
Preprocessing

Data cleaning
 Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies

Data integration
 Integration of multiple databases, data cubes, or files
 Example: naming inconsistencies.

Data transformation
 Normalization and aggregation
Major Tasks in Data
Preprocessing

Data reduction
 Obtains reduced representation in volume but produces the same or similar
analytical results

Data discretization
 Part of data reduction but with particular importance, especially for
numerical data

Forms of Data Preprocessing

Chapter 2: Data Preprocessing

 Why preprocess the data?


 Descriptive data summarization
 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy generation
 Summary

2.2- Descriptive data
summarization

 Motivation
 For many data preprocessing tasks, users would
like to learn about data characteristics regarding
both central tendency and dispersion of the data.
Measures of central tendency include mean,
median, mode, and midrange, while measures of
data dispersion include quartiles, interquartile
range (IQR), and variance.
 In particular, it is necessary to introduce the notions of
distributive measure, algebraic measure, and holistic
measure. Knowing what kind of measure we are dealing
with can help us choose an efficient implementation for
it.
2.2- Descriptive data
summarization
2.2.1- Measuring the Central
Tendency
 A distributive measure is a measure (i.e., function) that can be
computed for a given data set by partitioning the data into smaller
subsets, computing the measure for each subset, and then merging the
results in order to arrive at the measure’s value for the original
(entire) data set. Both sum() and count() are distributive measures
because they can be computed in this manner. Other examples include
max() and min().
 An algebraic measure is a measure that can be computed by applying
an algebraic function to one or more distributive measures. Hence,
average (or mean()) is an algebraic measure because it can be
computed by sum()/count().
 A holistic measure is a measure that must be computed on the entire
data set as a whole. It cannot be computed by partitioning the given
data into subsets and merging the values obtained for the measure in
each subset. The median is an example of a holistic measure.
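As a rough illustration (not from the original slides), the following Python sketch computes an algebraic measure from distributive measures gathered per partition, and contrasts it with a holistic measure; the partition contents are made up.

```python
# Sketch: an algebraic measure (mean) built from distributive measures
# (sum and count) computed per partition and then merged.
partitions = [[3, 7, 9], [1, 4], [10, 2, 8, 6]]      # hypothetical data split

partials = [(sum(p), len(p)) for p in partitions]    # distributive: sum(), count()
total_sum = sum(s for s, _ in partials)
total_count = sum(c for _, c in partials)
mean = total_sum / total_count                       # algebraic: sum() / count()

# A holistic measure such as the median cannot be merged this way;
# it needs the entire (sorted) data set at once.
values = sorted(v for p in partitions for v in p)
n = len(values)
median = values[n // 2] if n % 2 else (values[n // 2 - 1] + values[n // 2]) / 2

print(mean, median)
```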

2.2- Descriptive data
summarization
2.2.1- Measuring the Central
Tendency
 Mean or Average (algebraic measure): x̄ = (1/n) Σ_i x_i

 Weighted arithmetic mean or Weighted Average: x̄ = (Σ_i w_i x_i) / (Σ_i w_i)

 The weights w_i reflect the significance, importance, or occurrence frequency attached to their respective values.

2.2- Descriptive data
summarization
2.2.1- Measuring the Central
Tendency
 Although the mean is the single most useful quantity for describing a data set,
it is not always the best way of measuring the center of the data. A major
problem with the mean is its sensitivity to extreme (e.g., outlier) values.
Example: a few very large salaries can pull the mean salary well above the typical value.

 Trimmed Mean:
 To offset the effect caused by a small number of extreme values, we can instead
use the trimmed mean, which is the mean obtained after chopping off values at
the high and low extremes. For example, we can sort the values observed for
salary and remove the top and bottom 2% before computing the mean. We should
avoid trimming too large a portion (such as 20%) at both ends as this can result in
the loss of valuable information.
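A small Python sketch of the trimmed mean is given below; the salary values are invented, and a 15% trim is used only because the toy data set is tiny (the slide's 2% would drop nothing here).

```python
# Sketch: trimmed mean - drop a fraction of values at each extreme, then average.
def trimmed_mean(values, trim_fraction=0.02):
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)        # values to drop at each end
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)

salaries = [30, 31, 29, 35, 33, 32, 500]          # one extreme value
print(sum(salaries) / len(salaries))              # plain mean, pulled up by 500
print(trimmed_mean(salaries, trim_fraction=0.15)) # drops 29 and 500 -> 32.2
```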

2.2- Descriptive data
summarization
2.2.1- Measuring the Central
Tendency
 Median: A holistic measure
 Suppose that a given data set of N distinct values is sorted in numerical

order. If N is odd, then the median is the middle value of the ordered set;
otherwise (i.e., if N is even), the median is the average of the middle two
values.
 Mode
 Value that occurs most frequently in the data

 Midrange
 The midrange can also be used to assess the central tendency of a data set.

It is the average of the largest and smallest values in the set. This
algebraic measure is easy to compute using the SQL aggregate functions,
max() and min().
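The median, mode, and midrange can be read off directly with a few lines of Python; a minimal sketch with made-up ages follows.

```python
# Sketch: median, mode(s), and midrange of a small numeric attribute.
from collections import Counter
from statistics import median

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 25, 25, 25, 30, 33, 35]

mid = median(ages)                       # holistic: needs all the values
counts = Counter(ages)
highest = max(counts.values())
modes = sorted(v for v, c in counts.items() if c == highest)  # may be multimodal
midrange = (min(ages) + max(ages)) / 2   # algebraic: from max() and min()

print(mid, modes, midrange)              # 21 [25] 24.0
```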

2.2- Descriptive data
summarization
2.2.2- Measuring the Dispersion of Data
 The degree to which numerical data tend to spread
is called the dispersion, or variance of the data.
 The most common measures of data dispersion are
range, the five-number summary (based on
quartiles), the interquartile range, and the
standard deviation. Boxplots can be plotted
based on the five-number summary and are a
useful tool for identifying outliers.

2.2- Descriptive data
summarization
2.2.2- Measuring the Dispersion of Data
Range, Quartiles, Outliers, and Boxplots
 The range of the set is the difference between the largest (max())
and smallest (min()) values. For the remainder of this section,
let’s assume that the data are sorted in increasing numerical
order.
 The most commonly used percentiles other than the median are quartiles. The
first quartile, denoted by Q1, is the 25th percentile; the third quartile, denoted
by Q3, is the 75th percentile. The quartiles, including the median, give some
indication of the center, spread, and shape of a distribution. The distance
between the first and third quartiles is a simple measure of spread that gives the
range covered by the middle half of the data. This distance is called the
interquartile range (IQR) and is defined as IQR = Q3 − Q1.

2.2- Descriptive data
summarization
2.2.2- Measuring the Dispersion of Data
Range, Quartiles, Outliers, and Boxplots
 Because Q1, the median, and Q3 together contain no information about the endpoints
(e.g., tails) of the data, a fuller summary of the shape of a distribution can be obtained
by providing the lowest and highest data values as well. This is known as the five-
number summary. The five-number summary of a distribution consists of the median,
the quartiles Q1 and Q3, and the smallest and largest individual observations,
written in the order:
Minimum, Q1, Median, Q3, Maximum

2.2- Descriptive data
summarization
2.2.2- Measuring the Dispersion of Data
Variance and Standard Deviation
 The variance of N observations, x_1, x_2, …, x_N, is
   σ² = (1/N) Σ_i (x_i − x̄)², where x̄ is the mean value of the observations
 The standard deviation, σ, of the observations is the square root of the variance, σ².
 The variance and standard deviation are algebraic measures because they can be
computed from distributive measures. That is, N (which is count() in SQL),
Σ x_i (which is the sum() of x_i), and Σ x_i² (which is the sum() of x_i²) can be
computed in any partition and then merged to feed into the algebraic equation. Thus the
computation of the variance and standard deviation is scalable in large databases.
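A minimal sketch of this partition-and-merge computation is shown below; the partition contents are invented.

```python
# Sketch: variance from distributive measures (count, sum, sum of squares)
# merged across partitions, so no partition needs the full data set.
partitions = [[4.0, 8.0, 9.0], [15.0, 21.0], [21.0, 24.0, 25.0]]  # hypothetical split

n  = sum(len(p) for p in partitions)                  # count()
s1 = sum(sum(p) for p in partitions)                  # sum of x_i
s2 = sum(sum(x * x for x in p) for p in partitions)   # sum of x_i^2

mean = s1 / n
variance = s2 / n - mean ** 2                         # (1/N) * sum(x_i^2) - mean^2
print(mean, variance, variance ** 0.5)
```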
2.2- Descriptive data
summarization
2.2.2- Measuring the Dispersion of Data
Variance and Standard Deviation
The standard deviation is a measure of how spread out your data
are. Computation of the standard deviation is a bit tedious. The
steps are:
 Compute the mean for the data set.

 Compute the deviation by subtracting the mean from each value.

 Square each individual deviation.

 Add up the squared deviations.

 Divide by one less than the sample size.

 Take the square root.

2.2- Descriptive data
summarization
2.2.2- Measuring the Dispersion of Data
Example
 Let's examine a standard deviation computation for the following data. The seven
values in this data set are 73, 58, 67, 93, 33, 18, and 147. The mean for this data set is
approximately 69.9. For each data value, compute the squared deviation by subtracting
the mean and then squaring the result:

(73 − 69.9)² = (3.1)² = 9.61
(58 − 69.9)² = (−11.9)² = 141.61
(67 − 69.9)² = (−2.9)² = 8.41
(93 − 69.9)² = (23.1)² = 533.61
(33 − 69.9)² = (−36.9)² = 1361.61
(18 − 69.9)² = (−51.9)² = 2693.61
(147 − 69.9)² = (77.1)² = 5944.41

The sum of these squared deviations is 10,692.87. Divide by 6 (one less than the sample
size) to get 1782.15 (the sample variance). Take the square root of this value to get the
standard deviation, 42.2.
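The same computation in Python, as a quick check on the worked example (small differences from the figures above come from rounding the mean to 69.9):

```python
# Sketch: reproduce the standard-deviation example with the sample (n - 1) formula.
values = [73, 58, 67, 93, 33, 18, 147]

mean = sum(values) / len(values)                          # ~69.86
squared_devs = [(v - mean) ** 2 for v in values]
sample_variance = sum(squared_devs) / (len(values) - 1)   # divide by n - 1
std_dev = sample_variance ** 0.5

print(round(mean, 1), round(sample_variance, 2), round(std_dev, 1))
# 69.9 1782.14 42.2
```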
2.2- Descriptive data
summarization
2.2.3- Graphic Displays of Basic Descriptive Data Summaries

 Aside from the bar charts, pie charts, and line graphs used in most statistical or
graphical data presentation software packages, there are other popular types of
graphs for the display of data summaries and distributions. These include
histograms, quantile plots, q-q plots, scatter plots, and loess curves. Such
graphs are very helpful for the visual inspection of your data.
 Plotting histograms, or frequency histograms, is a graphical method for
summarizing the distribution of a given attribute.

2.2- Descriptive data
summarization
2.2.3- Graphic Displays of Basic Descriptive Data Summaries

 A quantile plot is a simple and effective way to have a first look


at a univariate data distribution. First, it displays all of the data
for the given attribute (allowing the user to assess both the
overall behavior and unusual occurrences). Second, it plots
quantile information.

2.2- Descriptive data
summarization
2.2.3- Graphic Displays of Basic Descriptive Data Summaries

 A quantile-quantile plot, or q-q plot, graphs the quantiles of one univariate


distribution against the corresponding quantiles of another. It is a powerful
visualization tool in that it allows the user to view whether there is a shift in
going from one distribution to another.

2.2- Descriptive data
summarization
2.2.3- Graphic Displays of Basic Descriptive Data Summaries

 A scatter plot is one of the most effective graphical methods for determining if
there appears to be a relationship, pattern, or trend between two numerical
attributes. To construct a scatter plot, each pair of values is treated as a pair of
coordinates in an algebraic sense and plotted as points in the plane.

2.2- Descriptive data
summarization
2.2.3- Graphic Displays of Basic Descriptive Data Summaries

2.2- Descriptive data
summarization
2.2.3- Graphic Displays of Basic Descriptive Data Summaries

 A loess curve is another important exploratory graphic aid that adds a smooth
curve to a scatter plot in order to provide better perception of the pattern of
dependence. The word loess is short for “local regression.”

Example: Exercise 4

 Suppose that the data for analysis includes the attribute age. The age values
for the data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22,
22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
 (a) What is the mean of the data? What is the median?
 (b) What is the mode of the data? Comment on the data’s modality (i.e.,
bimodal, trimodal, etc.).
 (c) What is the midrange of the data?
 (d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of
the data?
 (e) Give the five-number summary of the data.
 (f) Show a boxplot of the data.

Example: Exercise 4

 (a) What is the mean of the data? What is the median?


The (arithmetic) mean of the data is: x̄ = 809/27 ≈ 30. The median (middle value of the
ordered set, as the number of values in the set is odd) of the data is: 25.
 (b) What is the mode of the data? Comment on the data’s modality (i.e., bimodal,
trimodal, etc.).
This data set has two values that occur with the same highest frequency and is, therefore,
bimodal. The modes (values occurring with the greatest frequency) of the data are 25
and 35.
 (c) What is the midrange of the data?
The midrange (average of the largest and smallest values in the data set) of the data is:
(70+13)/2 = 41.5
 (d) Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the
data?
The first quartile (corresponding to the 25th percentile) of the data is: 20. The third
quartile (corresponding to the 75th percentile) of the data is: 35.

Example: Exercise 4

 (e) Give the five-number summary of the data.


The five number summary of a distribution consists of the
minimum value, first quartile, median value, third quartile, and
maximum value. It provides a good summary of the shape of the
distribution and for this data is: 13, 20, 25, 35, 70.
 (f) Show a boxplot of the data.

Draw the figure.
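For reference, a short Python sketch that reproduces these answers; note that quartile conventions differ, and the simple lower-half/upper-half rule below matches the rough values used in this exercise.

```python
# Sketch: summary statistics for the Exercise 4 age data.
from statistics import median

ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

mean = sum(ages) / len(ages)             # 809 / 27, roughly 30
med = median(ages)                       # 25
# Rough quartiles: medians of the lower and upper halves (odd n, middle value excluded).
lower, upper = ages[:len(ages) // 2], ages[len(ages) // 2 + 1:]
q1, q3 = median(lower), median(upper)    # 20 and 35
five_number = (min(ages), q1, med, q3, max(ages))

print(round(mean, 1), five_number)       # 30.0 (13, 20, 25, 35, 70)
```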

Measuring the Dispersion of Data
 Quartiles, outliers and boxplots
 Quartiles: Q1 (25th percentile), Q3 (75th percentile)
 Inter-quartile range: IQR = Q3 – Q1
 Five number summary: min, Q1, M, Q3, max
 Boxplot: ends of the box are the quartiles, median is marked,
whiskers, and plot outlier individually
 Outlier: usually, a value higher/lower than 1.5 x IQR
 Variance and standard deviation (sample: s, population: σ)
 Variance: (algebraic, scalable computation)
   s² = (1/(n−1)) Σ_i (x_i − x̄)²  =  (1/(n−1)) [ Σ_i x_i² − (1/n)(Σ_i x_i)² ]

   σ² = (1/N) Σ_i (x_i − µ)²  =  (1/N) Σ_i x_i² − µ²

 Standard deviation s (or σ) is the square root of variance s² (or σ²)


Boxplot Analysis

 Five-number summary of a distribution:


Minimum, Q1, M, Q3, Maximum
 Boxplot
 Data is represented with a box
 The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
 The median is marked by a line within the box
 Whiskers: two lines outside the box extend to
Minimum and Maximum

Visualization of Data Dispersion: Boxplot
Analysis

Histogram Analysis

 Graph displays of basic statistical class descriptions


 Frequency histograms

 A univariate graphical method


 Consists of a set of rectangles that reflect the counts or
frequencies of the classes present in the given data

Quantile Plot
 Displays all of the data (allowing the user to assess
both the overall behavior and unusual occurrences)
 Plots quantile information
 For data x_i sorted in increasing order, f_i indicates that approximately 100·f_i% of the data are below or equal to the value x_i

Quantile-Quantile (Q-Q) Plot
 Graphs the quantiles of one univariate distribution
against the corresponding quantiles of another
 Allows the user to view whether there is a shift in
going from one distribution to another

Scatter plot
 Provides a first look at bivariate data to see
clusters of points, outliers, etc
 Each pair of values is treated as a pair of
coordinates and plotted as points in the plane

Loess Curve
 Adds a smooth curve to a scatter plot in order to provide
better perception of the pattern of dependence
 Loess curve is fitted by setting two parameters: a
smoothing parameter, and the degree of the polynomials
that are fitted by the regression

Positively and Negatively Correlated
Data

Not Correlated Data

Graphic Displays of Basic Statistical
Descriptions

 Histogram: (shown before)


 Boxplot: (covered before)
 Quantile plot: each value x_i is paired with f_i indicating that approximately 100·f_i% of data are ≤ x_i
 Quantile-quantile (q-q) plot: graphs the quantiles of
one univariate distribution against the corresponding
quantiles of another
 Scatter plot: each pair of values is a pair of coordinates
and plotted as points in the plane
 Loess (local regression) curve: add a smooth curve to
a scatter plot to provide better perception of the
pattern of dependence

Chapter 2: Data Preprocessing

 Why preprocess the data?


 Descriptive data summarization
 Data cleaning or data cleansing
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy generation
 Summary

Data Cleaning
 Importance
 “Data cleaning is one of the three biggest problems in

data warehousing”—Ralph Kimball


 “Data cleaning is the number one problem in data

warehousing”—DCI survey
 Data cleaning tasks
 Fill in missing values
 Identify outliers and smooth out noisy data
 Correct inconsistent data
 Resolve redundancy caused by data integration

Data Cleaning
2.3.1- Missing Data
 Data is not always available
 E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
 Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the
time of entry
 not register history or changes of the data
 Missing data may need to be inferred.

Data Cleaning
2.3.1- Missing Data
How to Handle Missing Data?
1. Ignore the tuple: usually done when the class label is missing (assuming the task is
classification); not effective when the percentage of missing values per attribute
varies considerably.
2. Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with:
3. A global constant: e.g., “unknown”, a new class?! (not foolproof)
4. The attribute mean
5. The attribute mean for all samples belonging to the same class: smarter
6. The most probable value: inference-based such as Bayesian formula or decision tree.
For example, using the other customer attributes in your data set, you may construct a
decision tree to predict the missing values for income.
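A hedged Python sketch of options 4 and 5 follows; the records, attribute names, and class labels are hypothetical.

```python
# Sketch: fill missing income values with the attribute mean (option 4)
# or with the mean of the same class (option 5). Data are hypothetical.
records = [
    {"income": 40, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 90, "risk": "high"},
    {"income": None, "risk": "high"},
    {"income": 80, "risk": "high"},
]

known = [r["income"] for r in records if r["income"] is not None]
overall_mean = sum(known) / len(known)                     # option 4

class_means = {}                                           # option 5
for label in {r["risk"] for r in records}:
    vals = [r["income"] for r in records
            if r["risk"] == label and r["income"] is not None]
    class_means[label] = sum(vals) / len(vals)

for r in records:
    if r["income"] is None:
        r["income"] = class_means[r["risk"]]               # or overall_mean
print(overall_mean, class_means)
```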
Data Cleaning
2.3.1- Missing Data
 It is important to note that, in some cases, a missing value may not
imply an error in the data! For example, when applying for a credit
card, candidates may be asked to supply their driver’s license
number. Candidates who do not have a driver’s license may naturally
leave this field blank.
 Forms should allow respondents to specify values such as “not
applicable”. Software routines may also be used to uncover other
null values, such as “don’t know”, “?”, or “none”.
 Ideally, each attribute should have one or more rules regarding the
null condition. The rules may specify whether or not nulls are
allowed, and/or how such values should be handled or transformed.
 Fields may also be intentionally left blank if they are to be provided
in a later step of the business process. Hence, although we can try
our best to clean the data after it is seized, good design of databases
and of data entry procedures should help minimize the number of
missing values or errors in the first place.

Data Cleaning
2.3.2- Noisy Data
 “What is noise?” Noise: random error or variance
in a measured variable
 Incorrect attribute values may due to
 faulty data collection instruments

 data entry problems

 data transmission problems

 technology limitation

 inconsistency in naming convention

 Other data problems which requires data cleaning


 duplicate records

 incomplete data

 inconsistent data

Data Cleaning
2.3.2- Noisy Data
Data smoothing techniques:
1. Binning:
 Binning methods smooth a sorted data value by consulting its

“neighborhood,” that is, the values around it. The sorted values
are distributed into a number of “buckets,” or bins. Because
binning methods consult the neighborhood of values, they
perform local smoothing.
 Smoothing by bin means

 Smoothing by bin medians

 Smoothing by bin boundaries, the minimum and maximum

values in a given bin are identified as the bin boundaries.


Binning is also used as a discretization technique and is further
discussed in Section 2.6.
Data Cleaning
2.3.2- Noisy Data

Binning Methods for Data
Smoothing
 Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25,
26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
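A short Python sketch reproducing the three bins and the two smoothing variants above (rounding bin means to whole dollars, as the slide does):

```python
# Sketch: equal-frequency binning with smoothing by bin means and by bin boundaries.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]    # already sorted
depth = 4
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

def snap_to_boundaries(b):
    lo, hi = b[0], b[-1]                  # bin boundaries: min and max of the bin
    return [lo if v - lo <= hi - v else hi for v in b]

by_boundaries = [snap_to_boundaries(b) for b in bins]

print(bins)           # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_means)       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_boundaries)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```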
Data Cleaning
2.3.2- Noisy Data
Data smoothing techniques:
2. Regression:
Data can be smoothed by fitting the data to a function, such as with regression. Linear
regression involves finding the “best” line to fit two attributes (or variables), so that
one attribute can be used to predict the other. Multiple linear regression is an
extension of linear regression, where more than two attributes are involved and the
data are fit to a multidimensional surface. Regression is further described in Section
2.5.4, as well as in Chapter 6.
3- Clustering:
Outliers may be detected by clustering, where similar values are organized into groups,
or “clusters.” Intuitively, values that fall outside of the set of clusters may be
considered outliers (Figure 2.12).

Data Cleaning
2.3.2- Noisy Data

Cluster Analysis

Data Cleaning
2.3.2- Noisy Data
 Many methods for data smoothing are also methods for data
reduction involving discretization. For example, the binning
techniques described above reduce the number of distinct values
per attribute. This acts as a form of data reduction for logic-
based data mining methods, such as decision tree induction,
which repeatedly make value comparisons on sorted data.
Concept hierarchies are a form of data discretization that can
also be used for data smoothing. A concept hierarchy for price,
for example, may map real price values into inexpensive,
moderately priced, and expensive, thereby reducing the number
of data values to be handled by the mining process. Data
discretization is discussed in Section 2.6. Some methods of
classification, such as neural networks, have built-in data
smoothing mechanisms. Classification is the topic of Chapter 6.

How to Handle Noisy Data?
 Binning
 first sort data and partition into (equal-frequency)

bins
 then one can smooth by bin means, smooth by bin

median, smooth by bin boundaries, etc.


 Regression
 smooth by fitting the data into regression functions

 Clustering
 detect and remove outliers

 Combined computer and human inspection


 detect suspicious values and check by human (e.g.,

deal with possible outliers)

2.3.3- Data Cleaning as a Process

What about data cleaning as a process?


How exactly does one proceed in tackling this task?
Are there any tools out there to help?

2.3.3- Data Cleaning as a Process
Data discrepancy detection:
How can we proceed with discrepancy detection?
 Use metadata (e.g., domain, range, dependency, distribution). For example, what are the
domain and data type of each attribute? What are the acceptable values for each attribute?
What is the range of the length of values? Do all values fall within the expected range? Are
there any known dependencies between attributes?
 Check field overloading
 Check uniqueness rule: says that each value of the given attribute must be different from all
other values for that attribute.
 Check consecutive rule: says that there can be no missing values between the lowest
and highest values for the attribute, and that all values must also be unique (e.g., as in check
numbers)
 Check null rule: specifies the use of blanks, question marks, special characters, or other strings that may indicate the null condition, and how such values should be handled
 Use commercial tools
 Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to
detect errors and make corrections
 Data auditing: by analyzing data to discover rules and relationship to detect
violators (e.g., correlation and clustering to find outliers)

2.3.3- Data Cleaning as a Process
Data discrepancy detection:
How can we proceed with discrepancy detection?
 Data migration and integration

 Data migration tools: allow transformations to be specified

 ETL (Extraction/Transformation/Loading) tools: allow users to specify

transformations through a graphical user interface

2.3.3- Data Cleaning as a Process
Data discrepancy detection:
Integration of the two processes:
 The two-step process of discrepancy detection and data transformation (to
correct discrepancies) iterates. This process, however, is error-prone and time-
consuming. Some transformations may introduce more discrepancies. Some
nested discrepancies may only be detected after others have been fixed. For
example, a typo such as “20004” in a year field may only surface once all date
values have been converted to a uniform format. Transformations are often done
as a batch process while the user waits without feedback. Only after the
transformation is complete can the user go back and check that no new
anomalies have been created by mistake. Typically, numerous iterations are
required before the user is satisfied. Any tuples that cannot be automatically
handled by a given transformation are typically written to a file without any
explanation regarding the reasoning behind their failure. As a result, the entire
data cleaning process also suffers from a lack of interactivity.

2.3.3- Data Cleaning as a Process
Data discrepancy detection:
Integration of the two processes:
Another approach to increased interactivity in data cleaning is the
development of declarative languages for the specification of data
transformation operators. Such work focuses on defining powerful
extensions to SQL and algorithms that enable users to express data
cleaning specifications efficiently.

2.3.3- Data Cleaning as a Process
 Data discrepancy detection
 Use metadata (e.g., domain, range, dependency, distribution)

 Check field overloading

 Check uniqueness rule, consecutive rule and null rule

 Use commercial tools

 Data scrubbing: use simple domain knowledge (e.g., postal code,

spell-check) to detect errors and make corrections


 Data auditing: by analyzing data to discover rules and relationship to

detect violators (e.g., correlation and clustering to find outliers)


 Data migration and integration
 Data migration tools: allow transformations to be specified

 ETL (Extraction/Transformation/Loading) tools: allow users to specify

transformations through a graphical user interface


 Integration of the two processes
 Iterative and interactive (e.g., Potter’s Wheel)

Chapter 2: Data Preprocessing

 Why preprocess the data?


 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy
generation
 Summary

Data Integration
 Data integration:
 Combines data from multiple sources into a coherent store

 Schema integration: e.g., A.cust-id ≡ B.cust-#


 Integrate metadata from different sources

 Entity identification problem:


 Identify real world entities from multiple data sources, e.g.,

Bill Clinton = William Clinton


 Detecting and resolving data value conflicts
 For the same real world entity, attribute values from different

sources are different


 Possible reasons: different representations, different scales,

e.g., metric vs. British units

Data Transformation
In data transformation, the data are transformed or consolidated into
forms appropriate for mining. Data transformation can involve the
following:
 Smoothing: remove noise from the data. Such techniques include binning and clustering.


 Aggregation: summarization. For example, the daily sales data may be aggregated so as to compute monthly and annual total amounts.


 Generalization: concept hierarchy. For example, categorical
attributes, like street, can be generalized to higher-level concepts, like
city or country. Similarly, values for numerical attributes, like age,
may be mapped to higher-level concepts, like youth, middle-aged, and
senior.
 Normalization: where the attribute data are scaled so as to fall within

a small specified range, such as -1.0 to 1.0 or 0.0 to 1.0.


 Attribute/feature construction: where new attributes are constructed

and added from the given set of attributes to help the mining process.
Data Transformation:
Normalization
 Min-max normalization: to [new_min_A, new_max_A]

   v′ = ((v − min_A) / (max_A − min_A)) · (new_max_A − new_min_A) + new_min_A

 Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
   ((73,600 − 12,000) / (98,000 − 12,000)) · (1.0 − 0) + 0 = 0.716

 Z-score normalization (µ: mean, σ: standard deviation):

   v′ = (v − µ_A) / σ_A

 Ex. Let µ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225

 Normalization by decimal scaling:

   v′ = v / 10^j, where j is the smallest integer such that max(|v′|) < 1
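A small Python sketch of the three normalization methods, reproducing the two worked numbers above; the decimal-scaling input values are invented.

```python
# Sketch: min-max, z-score, and decimal-scaling normalization.
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    return (v - mean) / std

def decimal_scaling(values):
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:   # smallest j with max(|v'|) < 1
        j += 1
    return [v / 10 ** j for v in values]

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
print(z_score(73_600, 54_000, 16_000))             # 1.225
print(decimal_scaling([-975, 120, 86]))            # [-0.975, 0.12, 0.086]
```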
Data Transformation:
Normalization

 Note that normalization can change the original data quite a bit, especially
the latter two methods shown above. It is also necessary to save the
normalization parameters (such as the mean and standard deviation if using
z-score normalization) so that future data can be normalized in a uniform
manner.
 In attribute construction, new attributes are constructed from the given
attributes and added in order to help improve the accuracy and
understanding of structure in high-dimensional data. For example, we may
wish to add the attribute area based on the attributes height and width. By
combining attributes, attribute construction can discover missing
information about the relationships between data attributes that can be
useful for knowledge discovery.
Chapter 2: Data Preprocessing

 Why preprocess the data?


 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy
generation
 Summary
Data Reduction Strategies

 Why data reduction?


 A database/data warehouse may store terabytes of data

 Complex data analysis/mining may take a very long time

to run on the complete data set


 Data reduction
 Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results

Data Reduction Strategies

Data reduction strategies:


 1. Data cube aggregation, where aggregation operations are applied to the data in the
construction of a data cube.
 2. Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or
dimensions may be detected and removed.
 3. Dimensionality reduction, where encoding mechanisms are used to reduce the data set
size.
 4. Numerosity reduction, where the data are replaced or estimated by alternative, smaller
data representations such as parametric models (which need store only the model parameters
instead of the actual data) or nonparametric methods such as clustering, sampling, and the use
of histograms.
 5. Discretization and concept hierarchy generation, where raw data values for attributes are
replaced by ranges or higher conceptual levels. Data discretization is a form of numerosity
reduction that is very useful for the automatic generation of concept hierarchies. Discretization
and concept hierarchy generation are powerful tools for data mining, in that they allow the
mining of data at multiple levels of abstraction. We therefore defer the discussion of
discretization and concept hierarchy generation to Section 2.6, which is devoted entirely to
this topic.

Data Cube Aggregation

The resulting data set is smaller in


volume, without loss of
information necessary for the
analysis task.

Data cubes store


multidimensional aggregated
information. For example, Figure
2.14 shows a data cube for
multidimensional analysis of sales
data with respect to annual sales
per item type for each
AllElectronics branch.
Attribute Subset Selection
 Attribute subset selection:
 Attribute subset selection reduces the data set

size by removing irrelevant or redundant


attributes (or dimensions). The goal of attribute
subset selection is to find a minimum set of
attributes
 “How can we find a ‘good’ subset of the original
attributes?” For n attributes, there are 2^n possible subsets.
An exhaustive search for the optimal subset of attributes
can be prohibitively expensive, especially as n and the
number of data classes increase. Therefore, heuristic
methods that explore a reduced search space are
commonly used for attribute subset selection.

Attribute Subset Selection
Basic heuristic methods of attribute subset selection include the following
techniques, some of which are illustrated in Figure 2.15.
 1. Stepwise forward selection: The procedure starts with an empty set of
attributes as the reduced set. The best of the original attributes is determined
and added to the reduced set. At each subsequent iteration or step, the best of
the remaining original attributes is added to the set.
 2. Stepwise backward elimination: The procedure starts with the full set of
attributes. At each step, it removes the worst attribute remaining in the set.
 3. Combination of forward selection and backward elimination: The
stepwise forward selection and backward elimination methods can be
combined so that, at each step, the procedure selects the best attribute and
removes the worst from among the remaining attributes.
 4. Decision tree induction: Decision tree algorithms, such as ID3, C4.5, and
CART, were originally intended for classification. Decision tree induction
constructs a flowchart like structure where each internal (nonleaf) node
denotes a test on an attribute, each branch corresponds to an outcome of the
test, and each external (leaf) node denotes a class prediction. At each node, the
algorithm chooses the “best” attribute to partition the data into individual
classes.
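As a rough sketch of technique 1 (stepwise forward selection), the Python below greedily grows a reduced attribute set; the scoring function is a hypothetical stand-in for whatever relevance test (e.g., a statistical significance test or information gain) is actually used.

```python
# Sketch: stepwise forward selection with a pluggable scoring function.
def forward_selection(attributes, score, k):
    """Greedily pick the k attributes whose addition improves the score most."""
    selected, remaining = [], list(attributes)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy score: hypothetical per-attribute "usefulness" weights, summed.
weights = {"A1": 0.9, "A2": 0.1, "A3": 0.5, "A4": 0.05}
toy_score = lambda subset: sum(weights[a] for a in subset)

print(forward_selection(["A1", "A2", "A3", "A4"], toy_score, k=2))  # ['A1', 'A3']
```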
Attribute Subset Selection

Dimensionality Reduction

Dimensionality Reduction
 In dimensionality reduction, data encoding or
transformations are applied so as to obtain a reduced or
“compressed” representation of the original data. If the
original data can be reconstructed from the compressed
data without any loss of information, the data reduction
is called lossless. If, instead, we can reconstruct only an
approximation of the original data, then the data
reduction is called lossy.

Data Compression

(Figure: with lossless compression, the original data can be fully reconstructed from the compressed data; with lossy compression, only an approximation of the original data can be reconstructed.)
Numerosity Reduction
 Reduce data volume by choosing alternative, smaller
forms of data representation
 Parametric methods
 Assume the data fits some model, estimate model

parameters, store only the parameters, and


discard the data (except possible outliers)
 Example: Log-linear models—obtain value at a

point in m-D space as the product on appropriate


marginal subspaces
 Non-parametric methods
 Do not assume models

 Major families: histograms, clustering,

sampling
Regression and Log-Linear
Models
 Linear regression:
Linear regression analyzes the relationship between two variables, X and Y. For each subject (or
experimental unit), you know both X and Y and you want to find the best straight line through the
data. In some situations, the slope and/or intercept have a scientific meaning. In other cases, you
use the linear regression line as a standard curve to find new values of X from Y, or Y from X.

 Data are modeled to fit a straight line


 Often uses the least-square method to fit the line
 Multiple regression: allows a response variable Y to be modeled as
a linear function of multidimensional feature vector
 Log-linear model: approximates discrete multidimensional
probability distributions
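A minimal numpy sketch of a least-squares straight-line fit; the x and y values are invented.

```python
# Sketch: least-squares fit of a straight line y = slope * x + intercept.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])        # roughly y = 2x

slope, intercept = np.polyfit(x, y, deg=1)     # degree-1 polynomial = a line
predicted = slope * 6.0 + intercept            # use the fitted line for a new x

print(round(slope, 2), round(intercept, 2), round(predicted, 2))
```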

Data Reduction Method (2):
Histograms
 Histograms use binning to approximate data distributions and are a
popular form of data reduction. Histograms were introduced in Section
2.2.3. A histogram for an attribute, A, partitions the data distribution of A
into disjoint subsets, or buckets. If each bucket represents only a single
attribute-value/frequency pair, the buckets are called singleton buckets.
Often, buckets instead represent continuous ranges for the given attribute.

Data Reduction Method (2):
Histograms
 Example 2.5 Histograms. The following data are a list of prices of commonly sold items
at AllElectronics (rounded to the nearest dollar). The numbers have been sorted: 1, 1, 5,
5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18,
18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30,
30.
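A short Python sketch building singleton buckets and equal-width buckets for this price list; the bucket width of $10 is assumed only for illustration.

```python
# Sketch: singleton buckets (value/frequency pairs) and equal-width buckets
# for the price list above.
from collections import Counter

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15,
          15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20,
          20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

singleton_buckets = Counter(prices)                 # one bucket per distinct value

width = 10                                          # assumed width: 1-10, 11-20, 21-30
equal_width = Counter((p - 1) // width for p in prices)
for b in sorted(equal_width):
    print(f"${b * width + 1}-${(b + 1) * width}: {equal_width[b]} values")
```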

Data Reduction Method (2):
Histograms

Data Reduction Method (3):
Clustering
 Clustering techniques consider data tuples as objects. They partition the objects
into groups or clusters, so that objects within a cluster are “similar” to one
another and “dissimilar” to objects in other clusters. Similarity is commonly
defined in terms of how “close” the objects are in space, based on a distance
function. The “quality” of a cluster may be represented by its diameter, the
maximum distance between any two objects in the cluster. Centroid distance is an
alternative measure of cluster quality and is defined as the average distance of each
cluster object from the cluster centroid (denoting the “average object,” or average
point in space for the cluster). Figure 2.12 of Section 2.3.2 shows a 2-D plot of
customer data with respect to customer locations in a city, where the centroid of
each cluster is shown with a “+”. Three data clusters are visible.
 In data reduction, the cluster representations of the data are used to replace
the actual data. The effectiveness of this technique depends on the nature of the
data. It is much more effective for data that can be organized into distinct clusters
than for smeared data.

Data Reduction Method (3):
Clustering
 An index tree can store aggregate and detail data at varying levels of
resolution or abstraction. It provides a hierarchy of clusterings of the
data set, where each cluster has a label that holds for the data
contained in the cluster. If we consider each child of a parent node as a
bucket, then an index tree can be considered as a hierarchical histogram.

Data Reduction Method (4):
Sampling
 Sampling: obtaining a small sample s to represent
the whole data set N
 Allow a mining algorithm to run in complexity that
is potentially sub-linear to the size of the data
 Choose a representative subset of the data
 Simple random sampling may have very poor

performance in the presence of skew


 Develop adaptive sampling methods
 Stratified sampling:


Approximate the percentage of each class (or
subpopulation of interest) in the overall
database

Used in conjunction with skewed data
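A hedged Python sketch of simple random sampling versus stratified sampling; the tuples and class labels are hypothetical.

```python
# Sketch: simple random sampling without replacement vs. stratified sampling
# that keeps each class's proportion roughly intact.
import random

data = [("t%d" % i, "rare" if i % 10 == 0 else "common") for i in range(100)]

srs = random.sample(data, k=10)          # SRSWOR: may under-represent "rare"

def stratified_sample(rows, fraction):
    by_class = {}
    for row in rows:
        by_class.setdefault(row[1], []).append(row)
    sample = []
    for group in by_class.values():
        k = max(1, round(len(group) * fraction))   # at least one tuple per stratum
        sample.extend(random.sample(group, k))
    return sample

strat = stratified_sample(data, fraction=0.1)      # ~1 rare + ~9 common tuples
print(len(srs), len(strat))
```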
Data Reduction Method (4):
Sampling

Sampling: Cluster or Stratified Sampling

(Figure: raw data and a cluster/stratified sample drawn from it.)

Chapter 2: Data Preprocessing

 Why preprocess the data?


 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy
generation
 Summary
Discretization
 Three types of attributes:
 Nominal — values from an unordered set, e.g., color, profession
 Ordinal — values from an ordered set, e.g., military or academic rank
 Continuous — numeric values, e.g., integer or real numbers
 Discretization:
 Divide the range of a continuous attribute into intervals
 Some classification algorithms only accept categorical attributes.
 Reduce data size by discretization
 Prepare for further analysis

Discretization and Concept
Hierarchy
 Discretization
 Reduce the number of values for a given continuous
attribute by dividing the range of the attribute into intervals
 Interval labels can then be used to replace actual data
values
 Discretization can be performed recursively on an attribute
 Concept hierarchy formation
 Recursively reduce the data by collecting and replacing low
level concepts (such as numeric values for age) by higher
level concepts (such as young, middle-aged, or senior)
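A tiny Python sketch of this idea; the cut points 30 and 60 are illustrative, not taken from the slides.

```python
# Sketch: replace numeric age values with higher-level concept labels.
def age_concept(age):
    if age < 30:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"

ages = [13, 22, 35, 45, 52, 70]
print([age_concept(a) for a in ages])
# ['young', 'young', 'middle-aged', 'middle-aged', 'middle-aged', 'senior']
```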

Discretization and Concept Hierarchy
Generation for Numeric Data
 Typical methods: All the methods can be applied recursively
 Binning (covered above)
 Top-down split, unsupervised,
 Histogram analysis (covered above)
 Top-down split, unsupervised
 Clustering analysis (covered above)
 Either top-down split or bottom-up merge, unsupervised
 Entropy-based discretization: supervised, top-down split
 Interval merging by χ² analysis: unsupervised, bottom-up merge
 Segmentation by natural partitioning: top-down split, unsupervised

Automatic Concept Hierarchy
Generation
 Some hierarchies can be automatically generated
based on the analysis of the number of distinct values
per attribute in the data set
 The attribute with the most distinct values is placed

at the lowest level of the hierarchy


 Exceptions, e.g., weekday, month, quarter, year

country 15 distinct values

province_or_ state 365 distinct values

city 3567 distinct values

street 674,339 distinct values


Chapter 2: Data Preprocessing

 Why preprocess the data?


 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy
generation
 Summary
