
Data Mining and Business Intelligence

Data Pre-processing: Integration, Reduction, Transformation

By
Dr. Nora Shoaip
Lecture 3

Damanhour University
Faculty of Computers & Information Sciences
Department of Information Systems

2024 - 2025
Data Integration
• Entity Identification Problem
• Redundancy and correlation analysis
• Tuple duplication
Data Integration

Merging data from multiple data stores.

Careful integration helps reduce and avoid redundancies and inconsistencies in the resulting data set.

Challenges:
• Semantic heterogeneity → the entity identification problem
• Structure of the data → functional dependencies and referential constraints
• Redundancy
Data Integration
Entity Identification Problem

• Schema integration and object matching: how can equivalent real-world entities from multiple data sources be matched up? (e.g., customer_id in one source versus cust_number in another)

• Metadata → each attribute's name, meaning, data type, and range of permitted values, plus null rules for handling blank, zero, or null values

• Such metadata can help avoid errors in schema integration and data transformation
Data Integration
Redundancy and Correlation Analysis

An attribute may be redundant if it can be derived from another attribute or set of attributes. Some redundancies can be detected by correlation analysis.

For nominal data, the χ² (chi-square) test of independence is used:

χ² = Σ Σ (oij − eij)² / eij

where oij is the observed frequency of the joint event (Ai, Bj) and eij = count(A = ai) × count(B = bj) / n is the expected frequency. The larger the χ² value, the more likely the two attributes are related.
Data Integration
Redundancy and Correlation Analysis

Example: a survey of 1,500 people recording gender and preferred reading. The observed frequencies form a 2 × 2 contingency table:

                        gender
Preferred reading     male   female   Total
  Fiction              250      200     450
  Non-fiction           50     1000    1050
  Total                300     1200    1500
Data Integration
Redundancy and Correlation Analysis

With expected frequencies shown in parentheses, computed as eij = (row total × column total) / n (e.g., e11 = 300 × 450 / 1500 = 90):

                        gender
Preferred reading     male        female       Total
  Fiction              250 (90)    200 (360)     450
  Non-fiction           50 (210)  1000 (840)    1050
  Total                300        1200          1500
Data Integration
Redundancy and Correlation Analysis

For a 2 × 2 table, the degrees of freedom are (2 − 1)(2 − 1) = 1. Using the observed and expected frequencies:

χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840
   = 284.44 + 121.90 + 71.11 + 30.48 = 507.93

Since 507.93 far exceeds the χ² critical value for 1 degree of freedom (10.828 at the 0.001 significance level), we reject the hypothesis of independence: gender and preferred reading are strongly correlated.
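As an illustration (not part of the original slides), a minimal Python sketch of the same χ² computation, using the table values from the example above:

# Chi-square test of independence for the gender / preferred-reading example.
# Expected counts: e_ij = (row total * column total) / grand total.

observed = [[250, 200],    # Fiction:     male, female
            [50, 1000]]    # Non-fiction: male, female

row_totals = [sum(row) for row in observed]        # [450, 1050]
col_totals = [sum(col) for col in zip(*observed)]  # [300, 1200]
n = sum(row_totals)                                # 1500

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n   # expected frequency
        chi2 += (o - e) ** 2 / e

print(chi2)   # ~507.93, far above 10.828 (0.001 critical value, 1 df)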
Data Integration
Redundancy and Correlation Analysis

For numeric data, correlation between attributes A and B can be evaluated with the correlation coefficient (Pearson's product-moment coefficient):

rA,B = Σ (ai − Ā)(bi − B̄) / (n σA σB)

where n is the number of tuples, Ā and B̄ are the means, and σA and σB are the standard deviations of A and B. We have −1 ≤ rA,B ≤ +1: a value greater than 0 means A and B are positively correlated (one of them may be removed as redundant), 0 means there is no linear correlation, and a value below 0 means they are negatively correlated.

Covariance is closely related:

Cov(A, B) = E((A − Ā)(B − B̄)) = Σ (ai · bi) / n − Ā · B̄

with rA,B = Cov(A, B) / (σA σB). A positive covariance means the attributes tend to be above their means together; a negative covariance means one tends to be above its mean when the other is below.
Data Integration
Redundancy and Correlation Analysis

Example: stock prices for AllElectronics and HighTech observed at five time points:

Time point   AllElectronics   HighTech
T1                 6              20
T2                 5              10
T3                 4              14
T4                 3               5
T5                 2               5

Ā = 20/5 = 4 and B̄ = 54/5 = 10.80, so
Cov(A, B) = (6×20 + 5×10 + 4×14 + 3×5 + 2×5)/5 − 4 × 10.80 = 50.2 − 43.2 = 7.

The positive covariance suggests the two stocks tend to rise together.
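A minimal Python sketch (illustrative, not from the slides) of this covariance and the related correlation coefficient, using the stock prices above:

# Covariance and Pearson correlation for the AllElectronics / HighTech example.
import math

A = [6, 5, 4, 3, 2]       # AllElectronics prices at T1..T5
B = [20, 10, 14, 5, 5]    # HighTech prices at T1..T5
n = len(A)

mean_a = sum(A) / n       # 4.0
mean_b = sum(B) / n       # 10.8
cov = sum(a * b for a, b in zip(A, B)) / n - mean_a * mean_b   # 7.0

std_a = math.sqrt(sum((a - mean_a) ** 2 for a in A) / n)
std_b = math.sqrt(sum((b - mean_b) ** 2 for b in B) / n)
r = cov / (std_a * std_b)   # correlation coefficient in [-1, +1]

print(cov, r)   # positive: the two stocks tend to rise together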
Data Integration
More Issues

Tuple duplication
The use of denormalized tables (often done to improve performance by avoiding joins) is another source of data redundancy, e.g., a purchaser's name and address repeated alongside each of their purchases.

Data value conflicts
For the same real-world entity, attribute values from different sources may differ, e.g., the grading systems of two institutes → A, B, … versus 90%, 80%, …
Data Reduction
• Wavelet transforms
• PCA
• Attribute subset selection
• Regression
• Histograms
• Clustering
• Sampling
Data Reduction
Strategies

• Dimensionality reduction → reduce the number of attributes
  ◦ Wavelet transforms, PCA, attribute subset selection
• Numerosity reduction → replace the original data volume by a smaller data representation
  ◦ Parametric → a model is used to estimate the data, so only the model parameters are stored
      Regression
  ◦ Nonparametric → store reduced representations of the data
      Histograms, clustering, sampling
• Compression → transformations applied to obtain a "compressed" representation of the original data
  ◦ Lossless, lossy
Data Reduction
Attribute Subset Selection

• Goal: find a minimum set of attributes such that the resulting probability distribution of the data is as close as possible to the original distribution obtained using all attributes
• An exhaustive search over all attribute subsets can be prohibitively expensive
• Heuristic (greedy) search methods:
  ◦ Stepwise forward selection: start with an empty set of attributes as the reduced set; the best attribute is determined and added to it, and at each subsequent iteration the best of the remaining attributes is added (see the sketch below)
  ◦ Stepwise backward elimination: start with the full set of attributes and, at each step, remove the worst attribute remaining in the set
  ◦ Combination of forward selection and backward elimination
  ◦ Decision tree induction
• Attribute construction → e.g., an area attribute derived from height and width attributes
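A minimal sketch of stepwise forward selection (not from the slides): score is an assumed, caller-supplied function that rates a candidate attribute subset, e.g. by the cross-validated accuracy of a model trained on it.

# Stepwise forward selection: greedily grow the reduced attribute set.
# `score(subset)` is a hypothetical evaluator -- higher is better.

def forward_selection(attributes, score, k):
    selected = []                    # start with an empty reduced set
    remaining = list(attributes)
    while remaining and len(selected) < k:
        # add the attribute whose inclusion yields the best-scoring subset
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
    return selected

# e.g. forward_selection(["A1", "A2", "A3", "A4"], score, k=2)

Backward elimination mirrors this: start from the full set and repeatedly drop the attribute whose removal hurts the score least.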
Data Reduction
Attribute Subset Selection

[Figure: greedy attribute subset selection — forward selection, backward elimination, and decision tree induction.]
Data Reduction - Numerosity Reduction
Regression

• The data are modeled to fit a straight line
• A random variable y (the response variable) can be modeled as a linear function of another random variable x (the predictor variable)
• Regression line equation → y = wx + b
• w and b are regression coefficients → they specify the slope of the line and its y-intercept
• They are solved for by the method of least squares, which minimizes the error between the actual data points and the estimate of the line (the best-fitting line):

w = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²,   b = ȳ − w x̄
Data Reduction
Regression

Example: fit a regression line to the following five points.

  X      Y
1.00   1.00
2.00   2.00
3.00   1.30
4.00   3.75
5.00   2.25
Data Reduction
Regression

With x̄ = 3.00 and ȳ = 10.30/5 = 2.06:

w = [(−2)(−1.06) + (−1)(−0.06) + (0)(−0.76) + (1)(1.69) + (2)(0.19)] / (4 + 1 + 0 + 1 + 4)
  = 4.25 / 10 = 0.425
b = 2.06 − 0.425 × 3 = 0.785

so the best-fitting line is y = 0.425x + 0.785, and each point can be represented approximately by its x value plus the stored model parameters (w, b).
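The same fit can be reproduced with a few lines of Python (a sketch mirroring the least-squares formulas above, not part of the slides):

# Least-squares fit of y = w*x + b to the five (X, Y) points above.
X = [1.00, 2.00, 3.00, 4.00, 5.00]
Y = [1.00, 2.00, 1.30, 3.75, 2.25]
n = len(X)

mean_x = sum(X) / n   # 3.00
mean_y = sum(Y) / n   # 2.06

# w = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2);  b = y_bar - w*x_bar
num = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
den = sum((x - mean_x) ** 2 for x in X)
w = num / den             # 0.425
b = mean_y - w * mean_x   # 0.785

print(f"y = {w:.3f}x + {b:.3f}")   # y = 0.425x + 0.785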
Data Reduction
Histograms

• A histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, referred to as buckets or bins
• A bucket holding a single attribute-value/frequency pair → a singleton bucket
• Often, buckets instead represent continuous ranges for the given attribute:
  ◦ Equal-width: the width of each bucket range is uniform (e.g., a width of $10 per bucket)
  ◦ Equal-frequency (or equal-depth): the frequency of each bucket is roughly constant (i.e., each bucket contains roughly the same number of contiguous data samples)
Data Reduction
Histograms

Example: the following data are a list of AllElectronics prices for commonly sold items (rounded to the nearest dollar), sorted:

1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30
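A minimal sketch (not from the slides) of equal-width bucketing for this price list, using a width of $10 as the previous slide suggests (buckets 1–10, 11–20, 21–30):

# Equal-width histogram of the AllElectronics prices, bucket width $10.
from collections import Counter

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
          15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20,
          20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25,
          28, 28, 30, 30, 30]

width = 10
counts = Counter((p - 1) // width for p in prices)   # bucket index per price
for idx in sorted(counts):
    lo, hi = idx * width + 1, (idx + 1) * width
    print(f"{lo:2d}-{hi:2d}: {counts[idx]}")
# 1-10: 13,  11-20: 25,  21-30: 14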
Data Reduction
Sampling

• A large data set is represented by a smaller random data sample
• Simple random sample without replacement (SRSWOR) of size s → draw s of the N tuples (s < N)
  ◦ all tuples are equally likely to be sampled
• Simple random sample with replacement (SRSWR) of size s → similar to SRSWOR, but each time a tuple is drawn it is recorded and then placed back, so it may be drawn again
• Cluster sample → if the tuples are grouped into M "clusters," an SRS of s clusters can be obtained
• Stratified sample → if the tuples are divided into strata, a stratified sample is generated by obtaining an SRS at each stratum
  ◦ e.g., a stratum is created for each customer age group
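A minimal sketch of these sampling schemes using only Python's standard library; the data set and strata below are hypothetical stand-ins:

# Sketches of SRSWOR, SRSWR, cluster, and stratified sampling.
import random

data = list(range(1, 101))   # N = 100 "tuples"
s = 10

# SRSWOR: each tuple can be drawn at most once
srswor = random.sample(data, s)

# SRSWR: drawn tuples are placed back, so repeats are possible
srswr = [random.choice(data) for _ in range(s)]

# Cluster sample: group tuples into M clusters, then take an SRS of clusters
clusters = [data[i:i + 20] for i in range(0, 100, 20)]   # M = 5 clusters
cluster_sample = random.sample(clusters, 2)

# Stratified sample: an SRS within each stratum (e.g., customer age group)
strata = {"youth": data[:30], "adult": data[30:80], "senior": data[80:]}
stratified = {name: random.sample(group, 3) for name, group in strata.items()}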
Data Reduction
Sampling

[Figure: illustrations of SRSWOR, SRSWR, cluster, and stratified sampling.]
Transformation and Discretization
Transformation Strategies

• Smoothing → binning, regression
• Attribute construction
• Aggregation
• Normalization → attribute values are scaled so as to fall within a smaller, common range, such as 0.0 to 1.0
• Discretization → raw values of a numeric attribute (e.g., age) are replaced by interval labels (e.g., 0–10, 11–20) or conceptual labels (e.g., youth, adult, senior)
• Concept hierarchy generation → e.g., street generalized to higher-level concepts (city or country)
Transformation and Discretization
Transformation by Normalization

• Helps avoid dependence on the choice of measurement units
• Gives all attributes equal weight
• Methods:
  ◦ min-max normalization
  ◦ z-score normalization
Transformation and Discretization
Transformation by Normalization

Min-max normalization performs a linear transformation of the original data. A value v of attribute A is mapped from the range [minA, maxA] to a new range [new_minA, new_maxA]:

v' = (v − minA) / (maxA − minA) × (new_maxA − new_minA) + new_minA

In z-score (zero-mean) normalization, the values are normalized based on the mean Ā and standard deviation σA of A:

v' = (v − Ā) / σA

This is useful when the actual minimum and maximum of A are unknown, or when outliers dominate min-max normalization.
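A minimal sketch of both normalization methods; the income values below are hypothetical, for illustration only:

# Min-max and z-score normalization for one numeric attribute.
import math

values = [54000, 32000, 73600, 98000, 42000]   # hypothetical incomes

# Min-max to [0.0, 1.0]: v' = (v - min) / (max - min)
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]

# Z-score: v' = (v - mean) / standard deviation
mean = sum(values) / len(values)
std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
z = [(v - mean) / std for v in values]

print(min_max)
print(z)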
Transformation and Discretization
Concept Hierarchy

• A concept hierarchy organizes concepts (i.e., attribute values) hierarchically
• Concept hierarchies facilitate drilling and rolling up to view data at multiple granularities
• Concept hierarchy formation: recursively reduce the data by collecting and replacing low-level concepts (e.g., age values) with higher-level concepts (e.g., age groups: youth, adult, senior)
• Concept hierarchies can be explicitly specified by domain experts
• Concept hierarchies can also be automatically formed, for both numeric and nominal data → discretization
Transformation and Discretization
Concept Hierarchy

For nominal data:
• Specification of a partial ordering of attributes explicitly at the schema level by users or experts
  ◦ street, city, province_or_state, country → street < city < province_or_state < country
• Specification of a set of attributes without their partial ordering → the order is automatically generated by the system
  ◦ e.g., for Location, country contains far fewer distinct values than street, so a concept hierarchy can be automatically generated by sorting the attributes by their number of distinct values in the given attribute set (fewest distinct values → highest level); a sketch follows below
  ◦ This heuristic does not work for all concepts: for Time, year has more distinct values (e.g., 20) than month (12) or day_of_week (7), yet year belongs at the top of the hierarchy
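A minimal sketch of the distinct-value heuristic; the records below are hypothetical:

# Order location attributes by number of distinct values:
# fewer distinct values -> higher (more general) level of the hierarchy.

records = [
    {"street": "Main St",  "city": "Damanhour", "country": "Egypt"},
    {"street": "Nile Rd",  "city": "Cairo",     "country": "Egypt"},
    {"street": "Park Ave", "city": "Damanhour", "country": "Egypt"},
]

attrs = ["street", "city", "country"]
distinct = {a: len({r[a] for r in records}) for a in attrs}   # street:3, city:2, country:1
hierarchy = sorted(attrs, key=lambda a: distinct[a])          # most general first

print(" < ".join(reversed(hierarchy)))   # street < city < country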
Summary

Cleaning:                      binning, regression, outlier analysis
Integration:                   correlation analysis
Reduction:                     regression, histograms, clustering, attribute construction, wavelet transforms, PCA, attribute subset selection, sampling
Transformation/Discretization: binning, regression, correlation analysis, histogram analysis, clustering, attribute construction, aggregation, normalization, concept hierarchy