Lecture 3 - Data Preprocessing

The document outlines the data preprocessing steps essential for knowledge discovery, including data cleaning, integration, transformation, discretization, and reduction. It emphasizes the importance of data quality, addressing issues such as missing data, noise, and redundancy, and discusses methods for handling these challenges. Additionally, it presents tasks related to a moviegoer database, such as classification, estimation, clustering, and affinity grouping.

Outline

• Data Preprocessing
– Data Quality
– Major Tasks
• Data Cleaning
• Data Integration
• Data Transformation
• Data Discretization
• Data Reduction
Knowledge Discovery Process

(Figure: the knowledge discovery process, shown as a pipeline in which machine learning is one of the steps.)
The Moviegoer Example

(Figures: sample records from the moviegoer database used in the tasks below.)
Moviegoer Database - Tasks

• Classification
– Determine gender based on age, source and movies seen
– Determine source based on gender, age and movies seen
• Estimation
– For estimation, you need a continuous variable, e.g. age
– Estimate age as a function of source, gender and past movies
• Clustering
– Find groupings of movies that are often seen by the same people
– Find groupings of people who tend to see the same movies
Moviegoer Database - Tasks

• Affinity grouping
– Association rules: which movies go together?
– Need to create “transactions” for each moviegoer containing the movies seen by that person
– May result in association rules of the form {movie A} → {movie B}, as in the sketch below
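To make the “transactions” idea concrete, here is a minimal sketch (not part of the original slides) that builds one basket per moviegoer and mines rules with the mlxtend library; the viewing log, movie titles and the support/confidence thresholds are hypothetical.

```python
# Sketch: from a per-person viewing log to association rules (hypothetical data).
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical viewing log: one row per (person, movie) pair.
views = pd.DataFrame({
    "person": ["ann", "ann", "bob", "bob", "bob", "carol"],
    "movie":  ["Alien", "Blade Runner", "Alien", "Blade Runner", "Dune", "Dune"],
})

# One "transaction" per moviegoer: a boolean basket of the movies they saw.
baskets = pd.crosstab(views["person"], views["movie"]).astype(bool)

# Frequent itemsets, then rules of the form {movie A} -> {movie B}.
itemsets = apriori(baskets, min_support=0.3, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```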

Data Quality: Why Preprocess the Data?


• Measures for data quality: A multidimensional view
– Accuracy: correct or wrong, accurate or not
– Completeness: not recorded, unavailable, …
– Consistency: some modified but some not, dangling, …
– Timeliness: timely update?
– Believability: how much can the data be trusted to be correct?
– Interpretability: how easily the data can be understood?

Can the decision be trusted?


There is a better chance of discovering useful knowledge when the data is clean.

Major Tasks in Data Preprocessing


• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve
inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data transformation
– Normalization
• Data Discretization
– Part of data reduction but with particular importance, especially for numerical data
• Data Reduction
– Obtains a reduced representation of the dataset that is smaller in volume but produces the same or similar analytical results
Outline

• Data Preprocessing
– Data Quality
– Major Tasks
• Data Cleaning
• Data Integration
• Data Transformation
• Data Discretization
• Data Reduction
Data Cleaning
• Data in the Real World Is Dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer errors, transmission errors
– incomplete: lacking attribute values, lacking certain attributes of interest, or containing
only aggregate data
• e.g., Occupation=“ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
– Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data

• Data is not always available


– E.g., many tuples have no recorded value for several attributes, such as
customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time of entry
– history or changes of the data not registered
• Missing data may need to be inferred
How to Handle Missing Data?

• Ignore the record: usually done when class label is missing (when doing
classification)—not effective when the % of missing values per attribute varies
considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill in it automatically with
– a global constant : e.g., “unknown”, a new class?!
– the attribute mean
– the attribute mean for all samples belonging to the same class: smarter

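As a rough illustration of the automatic fill-in strategies listed above, here is a minimal pandas sketch on a hypothetical customer table; the column names and values are placeholders, not data from the lecture.

```python
# Sketch: three ways to fill a missing numeric attribute (hypothetical data).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [52_000, np.nan, 61_000, np.nan, 48_000, 75_000],
    "class":  ["budget", "budget", "premium", "premium", "budget", "premium"],
})

filled_constant = df["income"].fillna(-1)                   # a global constant / "unknown"
filled_mean = df["income"].fillna(df["income"].mean())      # the attribute mean
filled_class_mean = df["income"].fillna(                    # mean within the same class (smarter)
    df.groupby("class")["income"].transform("mean"))
```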
Imputation of Missing Data (Basic)

• Imputation is a term that denotes a procedure that replaces the missing values in a dataset by some plausible values
– i.e. by considering relationships among correlated values among the attributes of the dataset.

Attribute 1   Attribute 2   Attribute 3   Attribute 4
20            cool          high          false
?             cool          high          true
20            cool          high          true
20            mild          low           false
30            cool          normal        false
10            mild          high          true

• If we consider only {attribute#2}, the value “cool” appears in 4 records (the record with the missing value plus three complete records with values 20, 20 and 30):
– Probability of imputing value (20) = 66.6%
– Probability of imputing value (30) = 33.3%
Imputation of Missing Data (Basic)
Using the same dataset:

• For {attribute#4}, the value “true” appears in 3 records (the record with the missing value plus two complete records with values 20 and 10):
– Probability of imputing value (20) = 50%
– Probability of imputing value (10) = 50%

• For {attribute#2, attribute#3}, the combination {“cool”, “high”} appears, besides the record with the missing value, in only 2 records (both with value 20):
– Probability of imputing value (20) = 100%

Methods of Treating Missing Data


• K-Nearest Neighbor (k-NN) approach
– k-NN imputes the missing attribute values on the basis of the K nearest neighbors; the neighbors are determined on the basis of a distance measure.
– Once the K neighbors are determined, the missing value is imputed by taking the mean, median or mode of the known values of that attribute among the neighbors.

(Figure: the record with the missing value and its nearest neighbors among the other records of the dataset.)
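A minimal sketch of the k-NN idea using scikit-learn’s KNNImputer (the numeric values below are hypothetical; the slides illustrate the idea with a figure):

```python
# Sketch: k-NN imputation of a missing numeric value (hypothetical data).
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [20.0, 1.0, 0.0],
    [np.nan, 1.0, 1.0],   # the record with a missing value
    [20.0, 1.0, 1.0],
    [30.0, 0.0, 0.0],
    [10.0, 0.0, 1.0],
])

# Neighbours are found with a NaN-aware Euclidean distance; the missing entry
# is replaced by the mean of that attribute over the 2 nearest neighbours.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```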
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data

How to Handle Noisy Data?


• Binning
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Regression
– smooth by fitting the data into regression functions
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and check by human (e.g., deal with
possible outliers)

Noisy Data (Binning Methods)


Sorted data for price (in dollars), bin depth = 3:
4, 8, 15, 21, 21, 24, 25, 28, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 15
- Bin 2: 21, 21, 24
- Bin 3: 25, 28, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9
- Bin 2: 22, 22, 22
- Bin 3: 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 15
- Bin 2: 21, 21, 24
- Bin 3: 25, 25, 34

Noisy Data (Binning Methods)


Sorted data for price (in dollars), bin depth = 4:
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
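The smoothing above can be reproduced with a short numpy sketch (assuming the data length is an exact multiple of the bin depth, as in the 9-value example):

```python
# Sketch: equal-frequency binning, smoothing by means and by boundaries.
import numpy as np

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])       # already sorted
depth = 3
bins = prices.reshape(-1, depth)                             # equal-frequency bins

by_means = np.repeat(bins.mean(axis=1), depth)               # 9,9,9, 22,22,22, 29,29,29

# Smoothing by boundaries: each value becomes the closer of its bin's min or max.
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_boundaries = np.where(bins - lo <= hi - bins, lo, hi)     # 4,4,15, 21,21,24, 25,25,34
```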

Noisy Data (Clustering)


• Outliers may be detected by clustering, where similar values are
organized into groups or “clusters”.

• Values which fall outside of the set of clusters may be considered outliers (see the sketch below).
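A minimal sketch of this idea with scikit-learn’s KMeans on made-up data; the number of clusters and the distance cut-off are arbitrary choices for illustration:

```python
# Sketch: flag points that lie far from their nearest cluster centre (hypothetical data).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(8, 1, (50, 2)),
               [[25.0, 25.0]]])                         # one obvious outlier

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist_to_nearest_centre = kmeans.transform(X).min(axis=1)

threshold = np.percentile(dist_to_nearest_centre, 99)   # crude cut-off for this sketch
outliers = np.flatnonzero(dist_to_nearest_centre > threshold)
```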
Outline

• Data Preprocessing
– Data Quality
– Major Tasks
• Data Cleaning
• Data Integration
• Data Transformation
• Data Discretization
• Data Reduction

Data Integration
• Data integration:
– Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
– Integrate metadata from different sources
• Entity identification problem:
– Identify real world entities from multiple data sources, e.g., Bill Clinton =
William Clinton
• Detecting and resolving data value conflicts
– For the same real world entity, attribute values from different sources are
different
– Possible reasons: different representations, different scales, e.g., metric vs.
British units

Handling Redundancy in Data Integration


• Redundant data occur often when integrating multiple databases
– Object identification: The same attribute or object may have
different names in different databases
– Derivable data: One attribute may be a “derived” attribute in
another table, e.g., annual revenue
• Redundant attributes may be able to be detected by correlation
analysis and covariance analysis
• Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve mining
speed and quality

Correlation Analysis (Nominal Data)

• Χ² (chi-square) test:

$$\chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$$

• The larger the Χ2 value, the more likely the variables are related
• The cells that contribute the most to the Χ2 value are those whose actual count
is very different from the expected count

Chi-Square Calculation: An Example
• Null Hypothesis: A & B are independent (not correlated)
• Alternate Hypothesis: A & B are dependent (correlated)

                               Gender
                        Male         Female       Sum (row)
Preferred   fiction       250 (90)     200 (360)      450
Reading     non-fiction    50 (210)   1000 (840)     1050
            Sum (col.)    300         1200           1500

• Χ² (chi-square) calculation (numbers in parentheses are expected counts calculated based on the data distribution in the two categories):

$$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$$
Chi-Square Calculation: An Example
Degrees of freedom = (r−1)(c−1) = (2−1)(2−1) = 1, where r = number of distinct values of variable A and c = number of distinct values of variable B.

The computed Χ² is compared against the critical value for the chosen level of significance (for 1 degree of freedom at the 0.001 level this is 10.828). Since 507.93 far exceeds it, the null hypothesis is rejected: Gender and Preferred Reading are strongly correlated in this group.
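The same test can be reproduced with scipy (a sketch; correction=False gives the plain Pearson chi-square statistic used above):

```python
# Sketch: chi-square test of independence for the gender / preferred-reading table.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[250, 200],       # fiction:     male, female
                     [50, 1000]])      # non-fiction: male, female

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, dof)        # ~507.93 with 1 degree of freedom
print(expected)         # [[90, 360], [210, 840]]
```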
Correlation Analysis (Numeric Data)

• Correlation coefficient (also called Pearson’s product-moment coefficient):

$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$

where n is the number of records, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum (a_i b_i)$ is the sum of the AB cross-products.
• If rA,B > 0, A and B are positively correlated (A’s values increase as B’s do); the higher the value, the stronger the correlation.
• rA,B = 0: independent; rA,B < 0: negatively correlated

Covariance (Numeric Data)
• Expected values of A and B: $E(A) = \bar{A}$ and $E(B) = \bar{B}$, the respective means.

• Covariance of A and B:

$$Cov(A,B) = E\big[(A-\bar{A})(B-\bar{B})\big] = \frac{1}{n}\sum_{i=1}^{n}(a_i-\bar{A})(b_i-\bar{B}) = E(A \cdot B) - \bar{A}\,\bar{B}$$

• Positive covariance: A and B are positively correlated, e.g., the stock prices of both companies rise together.

Visually Evaluating Correlation

(Figure: scatter plots showing data with correlation coefficients ranging from –1 to 1.)

Pearson Correlation
Pearson Correlation (Shortcut)

(Figures: worked examples of computing Pearson’s r, first directly from the definition and then with the computational-shortcut form of the formula above; a numpy version is sketched below.)
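A minimal numpy sketch of computing r, both via numpy’s built-in routine and directly from the formula above (the two attribute vectors are hypothetical):

```python
# Sketch: Pearson's product-moment correlation coefficient (hypothetical data).
import numpy as np

a = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
b = np.array([1.0, 3.0, 7.0, 9.0, 12.0])

r = np.corrcoef(a, b)[0, 1]

# Same value from the definition (sample standard deviations, n - 1 in the denominator).
r_manual = ((a - a.mean()) * (b - b.mean())).sum() / (
    (len(a) - 1) * a.std(ddof=1) * b.std(ddof=1))
```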


Outline

• Data Preprocessing
– Data Quality
– Major Tasks
• Data Cleaning
• Data Integration
• Data Transformation
• Data Discretization
• Data Reduction
Data Transformation
• A function that maps the entire set of values of a given attribute to a new set of replacement
values s.t. each old value can be identified with one of the new values
• Methods
– Smoothing: Remove noise from data
– Attribute/feature selection
• A subset of the attributes is selected for further processing
– Normalization: Scaled to fall within a smaller, specified range
• min-max normalization
• z-score normalization
• normalization by decimal scaling
– Discretization: supervised
Normalization

• Works on numeric attributes


• Attribute normalization – one range of values to another
• Usual ranges: -1 to +1, or 0 to 1.
• Issues: this might introduce distortions or biases into the data
• So you need to understand the properties and potential weaknesses
of the method before using it

Normalization: min-max

• Min-max normalization: to [new_minA, new_maxA]


$$v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$$

– Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to

$$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$$

• Positive: min-max normalization preserves all relationships of the data values exactly and doesn’t introduce any potential biases
• Negative: if a future input case falls outside the original data range, an “out of bounds” error will occur

Normalization (Dealing with out-of-range values)
• Ignore that the range has been exceeded
– ….but does this affect the quality of the model?
• Ignore the out-of-range instances
– Reducing the number of instances reduces the confidence that the sample represents the population
– Introduces bias
• Clip the out-of-range values
– E.g. if the value > 1, assign 1 to it. If value < 0, assign 0 to it.
– Information content on the limits is distorted by projecting multiple values
to a single value.

Normalization (z-score)

• Normalization of attribute A (to A′) based on the mean and the standard deviation of the attribute
– The mean and standard deviation depend on the data

$$v' = \frac{v - \mu_A}{\sigma_A}$$

– Ex. Let μ = 54,000 and σ = 16,000. Then

$$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$$

• When to use z-score if min-max normalization is available?

Normalization: decimal scaling

• Moves the decimal point of the values of A by j positions, where j is the smallest integer such that the maximum absolute value of the scaled data is less than 1

$$v' = \frac{v}{10^{\,j}}$$

• E.g. if v ranges between −98 and 9,738, then j = 4 and v′ ranges between −0.0098 and 0.9738

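The three methods can be summarised in a short numpy sketch that reproduces the income example above (the decimal-scaling values are the −98 to 9,738 range from the slide):

```python
# Sketch: min-max, z-score and decimal-scaling normalization.
import numpy as np

v = 73_600.0
min_a, max_a = 12_000.0, 98_000.0
new_min, new_max = 0.0, 1.0

min_max = (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min   # ~0.716
z_score = (v - 54_000.0) / 16_000.0                                       # 1.225

values = np.array([-98.0, 9_738.0])
j = int(np.floor(np.log10(np.abs(values).max()))) + 1                     # j = 4 here
decimal_scaled = values / 10 ** j                                         # [-0.0098, 0.9738]
```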
Outline

• Data Preprocessing
– Data Quality
– Major Tasks
• Data Cleaning
• Data Integration
• Data Transformation
• Data Discretization
• Data Reduction

Data Discretization
• The task of attribute (feature)-discretization techniques is to
discretize the values of continuous features into a small number
of intervals, where each interval is mapped to a discrete symbol.
• Advantages:
– Simplified data description, and easy-to-understand data and final data-mining results
– Only a small number of interesting rules is mined
– Processing time of the end results is decreased
– Accuracy of the end results is improved
Entropy Based Discretization
• Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the information (weighted class entropy) after partitioning is

$$E(S,T) = \frac{|S_1|}{|S|}\,Ent(S_1) + \frac{|S_2|}{|S|}\,Ent(S_2)$$

• where S1 and S2 correspond to the samples in S satisfying the conditions A < T and A ≥ T, respectively, and

$$Ent(S_1) = -\sum_i p_i \log_2 p_i$$

• where pi is the probability of class i in S1, determined by dividing the number of samples of class i in S1 by the total number of samples in S1 (and likewise for S2).

• The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.

• The process is applied recursively to the partitions obtained until some stopping criterion is met.
Entropy Based Discretization: Example
ID 1 2 3 4 5 6 7 8 9
Age 21 22 24 25 27 27 27 35 41
Grade F F P F P P P P P
• Let Grade be the class attribute. Use entropy-based
discretization to divide the range of ages into different
discrete intervals.

• There are 6 possible boundaries. They are 21.5, 23, 24.5,


26, 31, and 38.

• Let us consider the boundary at T = 21.5.


Let S1 = {21}
Let S2 = {22, 24, 25, 27, 27, 27, 35, 41}
Entropy Based Discretization: Example
ID 1 2 3 4 5 6 7 8 9
Age 21 22 24 25 27 27 27 35 41
Grade F F P F P P P P P

• The number of elements in S1 and S2 are:


|S1| = 1
|S2| = 8
• The entropy of S1 is

$$Ent(S_1) = -P(\text{Grade}{=}F)\log_2 P(\text{Grade}{=}F) - P(\text{Grade}{=}P)\log_2 P(\text{Grade}{=}P) = -(1/1)\log_2(1/1) - (0/1)\log_2(0/1) = 0$$

• The entropy of S2 is

$$Ent(S_2) = -(2/8)\log_2(2/8) - (6/8)\log_2(6/8) = 0.5 + 0.311 = 0.811$$

Entropy Based Discretization: Example

• Hence, the entropy after partitioning at T = 21.5 is

$$E(S,T) = \frac{|S_1|}{|S|}\,Ent(S_1) + \frac{|S_2|}{|S|}\,Ent(S_2) = \frac{1}{9}\,Ent(S_1) + \frac{8}{9}\,Ent(S_2) = (1/9)(0) + (8/9)(0.811) = 0.721$$

Entropy Based Discretization: Example


• The entropies after partitioning are computed for all the boundaries:
– T = 21.5: E(S, 21.5)
– T = 23: E(S, 23)
– …
– T = 38: E(S, 38)

• Select the boundary with the smallest entropy; suppose the best is T = 23.

• Now recursively apply entropy-based discretization to both partitions.

ID 1 2 3 4 5 6 7 8 9
Age 21 22 24 25 27 27 27 35 41
Grade F F P F P P P P P
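The whole selection step can be written out in a few lines of Python; the sketch below evaluates E(S, T) for every candidate boundary of the Age/Grade data and keeps the one with the smallest entropy:

```python
# Sketch: one step of entropy-based discretization on the Age/Grade example.
from collections import Counter
from math import log2

ages   = [21, 22, 24, 25, 27, 27, 27, 35, 41]
grades = ["F", "F", "P", "F", "P", "P", "P", "P", "P"]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values()) if n else 0.0

def split_info(t):
    s1 = [g for a, g in zip(ages, grades) if a < t]
    s2 = [g for a, g in zip(ages, grades) if a >= t]
    n = len(grades)
    return len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)

# Candidate boundaries: midpoints between distinct adjacent ages (21.5, 23, ..., 38).
distinct = sorted(set(ages))
boundaries = [(x + y) / 2 for x, y in zip(distinct, distinct[1:])]

best = min(boundaries, key=split_info)     # boundary minimizing E(S, T)
print(best, split_info(best))
```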
Outline

• Data Preprocessing
– Data Quality
– Major Tasks
• Data Cleaning
• Data Integration
• Data Transformation
• Data Discretization
• Data Reduction
Data Reduction

• Data is often too large and reducing data can improve performance
• Data reduction consists of reducing the representation of the dataset
while producing the same or almost the same results
• Data reduction includes:
– Aggregation, dimensionality reduction, discretization, numerosity reduction

Dimensionality reduction

• Feature selection (or attribute subset selection)


– Select only the necessary attributes
– The goal is to find a minimum set of attributes such that the
resulting probability distribution of data classes is as close as
possible to the original distribution using the attributes
• Example Technique:
– Decision-tree induction
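A minimal sketch of attribute subset selection by decision-tree induction using scikit-learn; the data is synthetic (only two of the five attributes actually determine the class), and attributes the induced tree never uses are dropped:

```python
# Sketch: keep only the attributes the induced decision tree actually uses.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                  # 5 candidate attributes
y = (X[:, 0] + X[:, 2] > 0).astype(int)        # class depends only on attributes 0 and 2

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
selected = np.flatnonzero(tree.feature_importances_ > 0)
print(selected)                                 # typically [0, 2]
```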


Summary
• Data quality: accuracy, completeness, consistency, timeliness, believability,
interpretability
• Data cleaning: e.g. missing/noisy values, outliers
• Data integration from multiple sources:
– Remove redundancies
– Detect inconsistencies
• Data transformation
– Dimensionality Reduction
– Normalization
– Discretization

References
• D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Comm. of
ACM, 42:73-78, 1999
• A. Bruce, D. Donoho, and H.-Y. Gao. Wavelet analysis. IEEE Spectrum, Oct 1996
• T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley, 2003
• J. Devore and R. Peck. Statistics: The Exploration and Analysis of Data. Duxbury Press, 1997.
• H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita. Declarative data cleaning:
Language, model, and algorithms. VLDB'01
• M. Hua and J. Pei. Cleaning disguised missing data: A heuristic approach. KDD'07
• H. V. Jagadish, et al., Special Issue on Data Reduction Techniques. Bulletin of the Technical
Committee on Data Engineering, 20(4), Dec. 1997
• H. Liu and H. Motoda (eds.). Feature Extraction, Construction, and Selection: A Data Mining
Perspective. Kluwer Academic, 1998
• J. E. Olson. Data Quality: The Accuracy Dimension. Morgan Kaufmann, 2003
• D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999
• V. Raman and J. Hellerstein. Potters Wheel: An Interactive Framework for Data Cleaning and
Transformation, VLDB’2001
• T. Redman. Data Quality: The Field Guide. Digital Press (Elsevier), 2001
• R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans.
Knowledge and Data Engineering, 7:623-640, 1995
