Week 4 - 5 - Data Preprocessing

This document discusses data preprocessing techniques for data mining. It covers four major tasks in data preprocessing: data cleaning, data integration, data transformation, and data reduction. Data cleaning involves tasks like filling in missing values, identifying outliers, and resolving inconsistencies. Data integration combines data from multiple sources. Data transformation techniques include normalization, aggregation, generalization, and discretization. Data reduction obtains a reduced representation of the data volume.

CIS 517: Data Mining and Warehousing

Week 4-5: Data Preprocessing
(Chapter 3)
Why Data Preprocessing?
• Data in the real world is dirty
– incomplete:
• lacking attribute values
• lacking certain attributes of interest
• containing only aggregate data
– noisy:
• containing errors or outliers
– inconsistent:
• containing discrepancies in codes or names
• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– A data warehouse needs consistent integration of quality data


Forms of data preprocessing
• 1. Data cleaning
• 2. Data integration
• 3. Data transformation
• 4. Data reduction


Major Tasks in Data Preprocessing
• I. Data cleaning
– fill in missing values
– smooth noisy data
– identify or remove outliers
– resolve inconsistencies

• II. Data integration
– Integration of multiple databases, data cubes, or files
• III. Data transformation
– Normalization and aggregation



Major Tasks in Data Preprocessing (cont)

• IV. Data reduction
– Obtains a representation that is reduced in volume but produces the same or similar analytical results


I. Data Cleaning: Data Preprocessing

• Data cleaning tasks


– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data



How to Handle Missing Data?
• Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Use a global constant to fill in the missing value: e.g., “unknown”, a new class?!
• Use the attribute mean to fill in the missing value
• Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
• Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or decision tree
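
A minimal sketch of these strategies with pandas (the toy column names income and class are illustrative assumptions, not from the slides):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [12000, np.nan, 73600, 98000],
    "class":  ["low", "low", "high", "high"],
})

# 1. Ignore the tuple: drop rows with a missing value
dropped = df.dropna(subset=["income"])

# 2. Fill with a global constant (here "unknown" encoded as -1)
const_filled = df["income"].fillna(-1)

# 3. Fill with the attribute mean
mean_filled = df["income"].fillna(df["income"].mean())

# 4. Fill with the attribute mean per class (smarter)
class_mean_filled = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)
```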
Noisy Data

• Other data problems that require data cleaning:
1. duplicate records
2. incomplete data
3. inconsistent data


How to Handle Noisy Data?
• 1.1 Binning method:
– first sort the data and partition it into (equi-depth) bins; then one can
• smooth by bin means
• smooth by bin medians
• smooth by bin boundaries
• 1.2 Clustering
– detect and remove outliers
• 1.3 Regression
– smooth by fitting the data to regression functions


1.1 Simple Discretization Methods: Binning
• Equal-width (distance) partitioning:
– divides the range into N intervals of equal size: uniform grid
– if A and B are the lowest and highest values of the attribute, the width of the intervals will be: W = (B − A)/N


1.1 Simple Discretization Methods: Binning
• Equal-depth (frequency) partitioning:
– divides the range into N intervals, each containing approximately the same number of samples


Binning Methods for Data Smoothing
• Sorted data for price (in dollars):
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Partition into (equi-depth) bins: (size = 4)
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9 (4 + 8 + 9 + 15)/4 = 9
- Bin 2: 23, 23, 23, 23 (21+ 21+ 24 + 25)/4 = 23
- Bin 3: 29, 29, 29, 29 (26 + 28 + 29 + 34)/4 = 29
* Smoothing by bin boundaries: (select min or max value)
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
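
A minimal sketch reproducing this example in Python (plain lists, no extra libraries):

```python
# Equi-depth binning of the 12 price values, bin size 4
prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
size = 4
bins = [prices[i:i + size] for i in range(0, len(prices), size)]

# Smoothing by bin means: every value becomes its bin's mean
means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: each value snaps to the nearer of min/max
bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```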
Binning Methods for Data Smoothing
• Sorted data for price (in dollars):
4, 8, 9, 12, 15, 21, 21, 21, 24, 25, 26, 26, 28, 29, 34
Partition into (equi-depth) bins: (size = 5)
* Smoothing by bin median:
- Bin 1: 9, 9, 9, 9, 9
- Bin 2: 21, 21, 21, 21, 21
- Bin 3: 28, 28, 28, 28, 28



1.2 Cluster Analysis

(Figure: data points grouped into clusters; values falling outside every cluster are detected as outliers.)


1.2 Cluster Analysis
• Partition data set into clusters, and one can
store cluster representation only
• Can be very effective if data is clustered but not
if data is “smeared”
• Can have hierarchical clustering and be stored
in multi-dimensional index tree structures



1.3 Regression
(Figure: a scatter plot with fitted regression line y = x + 1; each observed point (X1, Y1) is smoothed to its fitted value Y1' on the line.)


II. Data Integration: Data Preprocessing
• Data integration:
– combines data from multiple sources into a coherent store
• Schema integration
– integrate metadata from different sources
– Entity identification problem: identify real-world entities from multiple data sources,
– e.g., A.cust-id ≡ B.cust-#
• Detecting and resolving data value conflicts
– for the same real-world entity, attribute values from different sources differ
– possible reasons: different representations, different scales, e.g., metric vs. British units
Handling Redundant Data in Data Integration
• Redundant data occur often when integrating multiple databases
– The same attribute may have different names in different databases
– One attribute may be a “derived” attribute in another table, e.g., annual revenue
• Redundant data may be detected by correlation analysis
• Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality




III. Data Transformation: Data Preprocessing

1. Smoothing: remove noise from data
2. Aggregation: summarization, data cube construction
3. Generalization: concept hierarchy
4. Normalization: scale values to fall within a small, specified range
– min-max normalization
– z-score normalization
– normalization by decimal scaling
5. Attribute/feature construction
– New attributes constructed from the given ones
6. Discretization
Data Transformation: Normalization
• Min-max normalization: to [new_min_A, new_max_A]
v' = ((v − min_A) / (max_A − min_A)) · (new_max_A − new_min_A) + new_min_A
– Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
((73,600 − 12,000) / (98,000 − 12,000)) · (1.0 − 0) + 0 = 0.716
• Z-score normalization (μ: mean, σ: standard deviation):
v' = (v − μ_A) / σ_A
– Ex. Let μ = 54,000, σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225
• Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
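
A minimal sketch of the three normalizations with NumPy, reusing the income figures above (the digit-count shortcut for j is an illustrative assumption):

```python
import numpy as np

v = np.array([12000.0, 54000.0, 73600.0, 98000.0])

# Min-max normalization to [0.0, 1.0]
new_min, new_max = 0.0, 1.0
minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score normalization, using the slide's mu = 54,000 and sigma = 16,000
zscore = (v - 54000) / 16000

# Decimal scaling: divide by 10^j, smallest j with max(|v'|) < 1
j = len(str(int(np.abs(v).max())))  # 98000 has 5 digits -> j = 5
decimal = v / 10 ** j

print(minmax[2])   # 0.716...
print(zscore[2])   # 1.225
```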
Discretization
• Discretization
– reduce the number of values for a given continuous
attribute by dividing the range of the attribute into
intervals. Interval labels can then be used to replace
actual data values.
– reduce the data by collecting and replacing low level
concepts (such as numeric values for the attribute
age) by higher level concepts (such as young,
middle-aged, or senior).
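
A minimal sketch of such a mapping with pandas (the cut points 20 and 50 are illustrative assumptions, not values from the slides):

```python
import pandas as pd

ages = pd.Series([0, 4, 12, 16, 16, 18, 24, 26, 28, 45, 67])

# Replace numeric age values with higher-level concept labels
labels = pd.cut(ages, bins=[-1, 20, 50, 120],
                labels=["young", "middle-aged", "senior"])
print(labels.value_counts())
```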



Discretization
• Three types of attributes
– Nominal—values from an unordered set, e.g., color, profession
– Ordinal—values from an ordered set, e.g., military or academic
rank
– Numeric—e.g., integer or real numbers
• Discretization: Divide the range of a continuous attribute into intervals
– Interval labels can then be used to replace actual data values
– Reduce data size by discretization



Data Discretization Methods
• Typical methods:
– Binning
– Histogram analysis
– Clustering analysis
– Decision-tree analysis

Binning
• Attribute values (for one attribute e.g., age):
– 0, 4, 12, 16, 16, 18, 24, 26, 28
• Equal-width binning – for bin width of e.g., 10:
– Bin 1: 0, 4 [−∞,10) bin
– Bin 2: 12, 16, 16, 18 [10,20) bin
– Bin 3: 24, 26, 28 [20,+∞) bin
• Equal-frequency binning – for bin density of e.g., 3:
– Bin 1: 0, 4, 12
– Bin 2: 16, 16, 18
– Bin 3: 24, 26, 28
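
A minimal sketch of both partitionings with NumPy (assuming the values are already sorted, as above):

```python
import numpy as np

ages = np.array([0, 4, 12, 16, 16, 18, 24, 26, 28])

# Equal-width binning, width 10: bin index = floor(age / 10)
width_bins = ages // 10              # -> [0 0 1 1 1 1 2 2 2]

# Equal-frequency binning, 3 values per bin (data already sorted)
freq_bins = np.array_split(ages, 3)  # -> [0 4 12], [16 16 18], [24 26 28]
```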



Concept Hierarchy Generation
• Concept hierarchy organizes concepts (i.e., attribute values)
hierarchically and is usually associated with each dimension in a data
warehouse
• Concept hierarchies facilitate drilling and rolling in data warehouses to
view data in multiple granularity
• Concept hierarchy formation: Recursively reduce the data by collecting
and replacing low level concepts (such as numeric values for age) by
higher level concepts (such as youth, adult, or senior)
• Concept hierarchies can be explicitly specified by domain experts
and/or data warehouse designers
• Concept hierarchy can be automatically formed for both numeric and
nominal data—For numeric data, use discretization methods shown
Concept Hierarchy Generation
for Nominal Data
• Specification of a partial/total ordering of attributes
explicitly at the schema level by users or experts
– street < city < state < country
• Specification of a hierarchy for a set of values by explicit
data grouping
– {Urbana, Champaign, Chicago} < Illinois
• Specification of only a partial set of attributes
– E.g., only street < city, not others
• Automatic generation of hierarchies (or attribute levels) by
the analysis of the number of distinct values
– E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
• Some hierarchies can be automatically generated based on the
analysis of the number of distinct values per attribute in the data
set
– The attribute with the most distinct values is placed at the
lowest level of the hierarchy
– Exceptions, e.g., weekday, month, quarter, year

country: 15 distinct values
province_or_state: 365 distinct values
city: 3,567 distinct values
street: 674,339 distinct values
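
A minimal sketch of this ordering rule with pandas (the helper name and toy table are illustrative assumptions):

```python
import pandas as pd

def hierarchy_order(df: pd.DataFrame) -> list:
    """Attribute with the most distinct values goes to the lowest level."""
    return df.nunique().sort_values(ascending=False).index.tolist()

# Toy data: 'street' has the most distinct values, 'country' the fewest,
# so the result is ['street', 'city', 'country'] (lowest level first).
toy = pd.DataFrame({
    "country": ["CA", "CA", "US", "US"],
    "city":    ["Toronto", "Toronto", "Chicago", "Urbana"],
    "street":  ["Main", "King", "State", "Green"],
})
print(hierarchy_order(toy))
```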


IV. Data Reduction: Data Preprocessing

• A warehouse may store terabytes of data:
– Complex data analysis/mining may take a very long time to run on the complete data set
• Data reduction
– Obtains a reduced representation of the data set that is much smaller in volume but produces the same (or almost the same) analytical results


IV. Data Reduction: Data Preprocessing

• Data reduction strategies
4.1 Dimensionality reduction
4.2 Data compression
4.3 Numerosity reduction


4.1 Dimensionality Reduction
• Dimensionality reduction is the process of
reducing the number of random variables or
attributes under consideration.
• Attribute subset selection is a method of dimensionality reduction in which irrelevant, weakly relevant, or redundant attributes or dimensions are detected and removed.
– Select a minimum set of features such that the probability
distribution of different classes given the values for those features is
as close as possible to the original distribution given the values of
all features



4.1 Dimensionality Reduction

• Heuristic methods (due to the exponential number of choices):
1. step-wise forward selection
2. step-wise backward elimination
3. combining forward selection and backward elimination
4. decision-tree induction
A sketch of step-wise forward selection follows.
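
A minimal sketch of greedy step-wise forward selection, assuming a caller-supplied score function (e.g., cross-validated accuracy on a feature subset); all names here are hypothetical:

```python
def forward_selection(all_features, score, k):
    """Greedily grow a feature set of size k, one best feature at a time."""
    selected = []
    while len(selected) < k:
        remaining = [f for f in all_features if f not in selected]
        # pick the feature whose addition improves the score the most
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
    return selected
```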





4.2 Data Compression
• Data compression is applied so as to obtain a reduced or “compressed” representation of the original data.
• If the original data can be reconstructed from the compressed data without any information loss, the data reduction is called lossless.
• If, instead, we can reconstruct only an approximation of the original data, then the data reduction is called lossy.
• Dimensionality and numerosity reduction may also be considered forms of data compression.
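
A minimal sketch of the lossless case using Python's zlib: the original bytes are reconstructed exactly from the compressed representation:

```python
import zlib

original = b"aggregate data, aggregate data, aggregate data"
compressed = zlib.compress(original)

# Lossless: decompression recovers the original with no information loss
assert zlib.decompress(compressed) == original
print(len(original), "->", len(compressed))
```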



Data Compression

(Figure: lossless compression maps the original data to compressed data and reconstructs it exactly; lossy compression reconstructs only an approximation of the original data.)


4.3 Numerosity Reduction
• Reduce data volume by choosing alternative, smaller forms of data representation
• Parametric methods (e.g., regression)
– Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
– Ex.: log-linear models, which obtain the value at a point in m-D space as the product of values on appropriate marginal subspaces
• Non-parametric methods
– Major families: histograms, clustering, sampling, …
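
A minimal sketch of the parametric idea with NumPy: fit a simple linear regression and keep only its two parameters instead of all the points (the data here is an illustrative toy):

```python
import numpy as np

x = np.arange(100)
y = 3.0 * x + 7.0 + np.random.normal(0, 1, size=100)  # toy data

# Store just two numbers (slope, intercept) instead of 100 (x, y) pairs
slope, intercept = np.polyfit(x, y, deg=1)
y_approx = slope * x + intercept  # reconstruct values on demand
```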



Non-parametric methods 1. Histograms

(Figures: example histograms used as a reduced representation of the data.)


Non-parametric methods 2. Clustering
• Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only.
• Can be very effective if data is clustered but not if data is
“smeared”.
• Can have hierarchical clustering and be stored in multi-
dimensional index tree structures.
• There are many choices of clustering definitions and
clustering algorithms.



Non-parametric methods 2. Clustering

(Figure: data points grouped into three clusters C1, C2, and C3; only the cluster representations need to be stored.)


Non-parametric methods 3. Sampling

• Sampling can be used as a data reduction technique

because it allows a large data set to be represented by a

much smaller random data sample (or subset).



(Figures: worked examples of drawing samples of 10 and of 12 records from a data set.)
Sampling

(Figure: from the raw data, SRSWOR draws a simple random sample without replacement; SRSWR draws a simple random sample with replacement.)
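
A minimal sketch of both schemes with NumPy (data and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=42)
data = np.arange(100)  # "raw data"

srswor = rng.choice(data, size=10, replace=False)  # without replacement
srswr = rng.choice(data, size=10, replace=True)    # with replacement
```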
Data Cube Aggregation
• The lowest level of a data cube
– the aggregated data for an individual entity of interest
– e.g., a customer in a phone calling data warehouse.

• Multiple levels of aggregation in data cubes


– Further reduce the size of data to deal with
• Reference appropriate levels
– Use the smallest representation that is sufficient to solve the task
• Queries regarding aggregated information should be answered using the data cube, when possible
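
A minimal sketch of this kind of roll-up with pandas (the quarterly-to-annual sales table is an illustrative assumption):

```python
import pandas as pd

sales = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "amount":  [224, 408, 350, 586, 300, 420, 380, 600],
})

# Aggregate to a higher level: one row per year instead of four
annual = sales.groupby("year", as_index=False)["amount"].sum()
print(annual)
```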
Data Cube Aggregation

(Figures: examples of aggregating data to higher levels of a data cube.)


Correlation Analysis (Nominal Data)
• χ² (chi-square) test:
χ² = Σ (Observed − Expected)² / Expected
• The larger the χ² value, the more likely the variables are related
• The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
• Correlation does not imply causality
– # of hospitals and # of car thefts in a city are correlated
– Both are causally linked to a third variable: population


Chi-Square Calculation: An Example

                           Play chess   Not play chess   Sum (row)
Like science fiction        250 (90)       200 (360)        450
Not like science fiction     50 (210)     1000 (840)       1050
Sum (col.)                  300           1200             1500

• χ² (chi-square) calculation (numbers in parentheses are expected counts, calculated based on the data distribution in the two categories):
χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93
• It shows that like_science_fiction and play_chess are correlated in the group
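
A minimal sketch checking this with SciPy (correction=False disables the Yates continuity correction so the statistic matches the hand calculation):

```python
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[250, 200],
                  [50, 1000]])
chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(expected)  # [[ 90. 360.] [210. 840.]]
print(chi2)      # 507.93...
```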
Correlation Analysis (Numeric Data)
• Correlation coefficient (also called Pearson’s product-moment coefficient):
r_A,B = Σ (a_i − Ā)(b_i − B̄) / ((n − 1) · σ_A · σ_B) = (Σ a_i·b_i − n·Ā·B̄) / ((n − 1) · σ_A · σ_B)
where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ a_i·b_i is the sum of the AB cross-products.
• If r_A,B > 0, A and B are positively correlated (A’s values increase as B’s do). The higher the value, the stronger the correlation.
• r_A,B = 0: independent; r_A,B < 0: negatively correlated
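
A minimal sketch computing r_A,B with NumPy on toy data (the same stock values used in the covariance example later):

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])

n = len(a)
r = ((a - a.mean()) * (b - b.mean())).sum() / (
    (n - 1) * a.std(ddof=1) * b.std(ddof=1)
)
print(r, np.corrcoef(a, b)[0, 1])  # both ~0.94
```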



Visually Evaluating Correlation

(Figure: scatter plots showing correlations ranging from −1 to 1.)


Correlation (viewed as a linear relationship)
• Correlation measures the linear relationship between objects
• To compute correlation, we standardize the data objects A and B, and then take their dot product:

a'_k = (a_k − mean(A)) / std(A)
b'_k = (b_k − mean(B)) / std(B)
correlation(A, B) = A' · B'
Covariance (Numeric Data)
• Covariance is similar to correlation:
Cov(A, B) = E[(A − Ā)(B − B̄)] = Σ (a_i − Ā)(b_i − B̄) / n
Correlation coefficient: r_A,B = Cov(A, B) / (σ_A · σ_B)
where n is the number of tuples, Ā and B̄ are the respective means or expected values of A and B, and σ_A and σ_B are the respective standard deviations of A and B.
• Positive covariance: If Cov(A,B) > 0, then A and B both tend to be larger than their expected values.
• Negative covariance: If Cov(A,B) < 0, then if A is larger than its expected value, B is likely to be smaller than its expected value.
• Independence: if A and B are independent, Cov(A,B) = 0, but the converse is not true:
– Some pairs of random variables may have a covariance of 0 but not be independent. Only under some additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence.


Co-Variance: An Example

• The computation can be simplified as: Cov(A, B) = E(A·B) − Ā·B̄
• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry trends, will their prices rise or fall together?
– E(A) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4
– E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6
– Cov(A,B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 4
• Thus, A and B rise together since Cov(A, B) > 0.
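
A minimal sketch verifying this with NumPy:

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])     # stock A
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])  # stock B

# Simplified form: Cov(A, B) = E(A.B) - mean(A) * mean(B)
cov = (a * b).mean() - a.mean() * b.mean()
print(cov)  # 4.0
# np.cov(a, b, bias=True)[0, 1] gives the same population covariance
```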


Summary
• Data quality: accuracy, completeness, consistency,
timeliness, believability, interpretability
• Data cleaning: e.g., missing/noisy values, outliers
• Data integration from multiple sources:
– Entity identification problem; Remove redundancies;
Detect inconsistencies
• Data reduction
– Dimensionality reduction; Numerosity reduction; Data
compression
• Data transformation and data discretization
– Normalization; Concept hierarchy generation
