
Data Preparation

Outline

◘ Data Integration
◘ Data Selection and Reduction
◘ Data Preprocessing and Data Cleaning
– Filling in Missing Values
– Removing Noisy Data
– Identification of Outliers
– Correcting Inconsistent Data
◘ Data Transformation Techniques
– Normalization
– Discretization

No quality data → no quality mining results!


Data Preparation

Data Integration → Data Selection & Reduction → Data Preprocessing → Data Transformation
Data Integration

◘ Data integration
– Integration of multiple databases, data cubes, or files
– Obtain data from various sources

[Diagram: data from a Database, a Data Cube, and a File flows into Data Integration]
Data Preparation

Data Integration → Data Selection & Reduction → Data Preprocessing → Data Transformation
Data Selection & Reduction

◘ Data Selection
– Selecting a target data set
– Removing duplicates

◘ Data Reduction
– Obtains a reduced representation of the data set
(smaller in volume, yet producing the same, or almost the same, results)

◘ Why data reduction?


– A database/data warehouse may store terabytes of data
– Complex data analysis/mining may take a very long time to run
Data Reduction Strategies

1- Data Aggregation — e.g., sum, average
2- Dimensionality Reduction — e.g., remove unimportant attributes
3- Data Compression — e.g., encoding mechanisms
4- Sampling — e.g., represent the data set with a small sample
5- Clustering — e.g., store cluster representations of the data
6- Concept hierarchy generation — e.g., street < city < state < country
1- Data Aggregation

◘ Data aggregation is any process in which information is expressed in a summary form (summarization).
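As an illustration, aggregation maps naturally onto a group-and-summarize operation. A minimal pandas sketch (the column names and values are hypothetical, not from the slides):

```python
import pandas as pd

# Hypothetical daily sales records
sales = pd.DataFrame({
    "city":   ["Izmir", "Izmir", "Istanbul", "Istanbul"],
    "amount": [45, 25, 55, 5],
})

# Aggregate (summarize) per city: total and average amount
summary = sales.groupby("city")["amount"].agg(["sum", "mean"])
print(summary)
```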
2- Dimensionality Reduction

◘ Attribute subset selection


◘ Remove unimportant attributes
◘ Remove redundant and/or correlated attributes
◘ Combine attributes (sum, multiply, difference)

◘ Example (Decision Tree Induction) :


Initial attribute set: {A1, A2, A3, A4, A5, A6}

[Decision tree: root node A4?, child nodes A1? and A6?, with leaves Class 1 and Class 2]

Reduced attribute set: {A1, A4, A6}
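A hedged sketch of tree-based attribute subset selection in the spirit of this example: train a decision tree on synthetic data, then keep only the attributes the tree actually uses. The data and the importance threshold are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))            # attributes A1..A6
y = (X[:, 0] + X[:, 3] - X[:, 5] > 0)    # class depends on A1, A4, A6 only

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Keep attributes with nonzero importance in the fitted tree
used = [f"A{i+1}" for i, imp in enumerate(tree.feature_importances_) if imp > 0]
print("Reduced attribute set:", used)    # typically a subset such as ['A1', 'A4', 'A6']
```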


Data Reduction Examples

Original (detailed) sales table:

SaleID  Product     Date      Amount  Location     Description
1       Tomato      1.1.2008  20      Buca         .....
1       Cheese      1.1.2008  10      Buca         .....
1       Cola        1.1.2008  15      Buca         .....
2       Pasta       3.1.2008  25      Mecidiyeköy  .....
2       Tea         3.1.2008  30      Mecidiyeköy  .....
3       Hair care   5.1.2008  5       Kadıköy      .....
4       Cigarettes  8.1.2008  15      Bornova      .....
4       Beer        8.1.2008  10      Bornova      .....

Reduced table (one row per sale, Description column dropped, districts generalized to cities):

SaleID  Products              Date      TotalAmount  Location
1       Tomato, Cheese, Cola  1.1.2008  45           İzmir
2       Pasta, Tea            3.1.2008  55           İstanbul
3       Hair care             5.1.2008  5            İstanbul
4       Cigarettes, Beer      8.1.2008  25           İzmir

Horizontal data reduction removes rows (e.g., filtering out sales with Amount <= 5 TL);
vertical data reduction removes columns (e.g., dropping the Description attribute).
3- Data Compression

◘ Data compression is the process of reproducing information in a more compact form.
– Number compression
– String compression
– Image/Audio/Video compression

◘ Lossless vs. Lossy Compression


– Original data: 25.888888888
– Lossless compression: 25.[9]8 (the original data can be reconstructed exactly)
– Lossy compression: 26 (only an approximation of the original can be reconstructed)
4- Sampling
◘ Sampling: obtaining a small sample s to represent the whole data set N.

[Figure: raw data on the left; a cluster/stratified sample of it on the right]
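A minimal pandas sketch of simple random versus stratified sampling; the cluster column and group sizes are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "cluster": ["C1"] * 60 + ["C2"] * 30 + ["C3"] * 10,
    "value":   range(100),
})

# Simple random sample of 10 records
simple = df.sample(n=10, random_state=1)

# Stratified sample: 10% from each cluster, preserving the proportions
stratified = df.groupby("cluster", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=1))
print(stratified["cluster"].value_counts())
```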


5- Clustering

◘ Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter only)

[Figure: data points grouped into clusters C1–C4]

ID    Gender  Age  MaritalStatus  Score  Cluster
1021  F       41   NeverM         55     C1
1022  M       27   Married        35     C1
1023  M       20   NeverM         480    C2
1024  F       34   Married        950    C3
1025  M       74   Married        500    C2
1026  M       32   Married        500    C2
1027  M       18   NeverM         890    C3
1028  M       54   Married        68     C1
…     …       …    …              …      …
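A sketch of cluster-based reduction with scikit-learn's KMeans, assuming we store only each cluster's centroid and size; the (age, score) pairs echo the table above:

```python
import numpy as np
from sklearn.cluster import KMeans

# (age, score) records, similar in spirit to the table above
X = np.array([[41, 55], [27, 35], [20, 480], [34, 950],
              [74, 500], [32, 500], [18, 890], [54, 68]])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Keep only the cluster representation: centroid and member count
for label in range(3):
    members = X[km.labels_ == label]
    print(f"C{label + 1}: centroid={members.mean(axis=0)}, size={len(members)}")
```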
6- Concept Hierarchy Generation

◘ Replace low-level concepts (such as numeric values for age) by higher-level concepts (such as young, middle-aged, or senior)
◘ Specification of a hierarchy for a set of values by explicit data grouping
– {Urbana, Champaign, Chicago} < Illinois
◘ The attribute with the most distinct values is placed at the lowest level of
the hierarchy
– street < city < state < country

country 15 distinct values

state 365 distinct values

city 3567 distinct values

street 674,339 distinct values
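A minimal sketch of climbing a concept hierarchy by explicit data grouping, using the slide's Illinois example:

```python
# Explicit grouping: each city generalizes to its state
city_to_state = {"Urbana": "Illinois", "Champaign": "Illinois", "Chicago": "Illinois"}

cities = ["Chicago", "Urbana", "Champaign"]
states = [city_to_state[c] for c in cities]   # generalize city -> state
print(states)                                  # ['Illinois', 'Illinois', 'Illinois']
```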


Data Preparation

Data Integration → Data Selection & Reduction → Data Preprocessing → Data Transformation
Why Data Preprocessing?

Data in the real world is dirty.

◘ Incomplete: lacking or missing attribute values


– e.g., occupation=“”

◘ Noisy: containing errors or outliers


– e.g., Salary=“-10”

◘ Inconsistent: containing discrepancies in codes or names


– e.g., Age=“42”, Birthday=“03/07/1976”
– e.g., was rating “1, 2, 3”, now rating “A, B, C”
– e.g., discrepancy between duplicate records
– e.g., different names for the same meaning (annual vs. yearly)
Example Errors

NAME M.Ulku Metin Ü.


SURNAME SANER SANRE
BIRTH DATE 10/04/1965 04.10.1965
CITY G.ANTEB GAZİANTEP
ADDRESS Atatürk Cd. Kemaliye Sok. No.25 Atatrk Cad. Kemaliye Mah. 25/3

TITLE Gen. Müdr. Genel Müdür


WORKING PLACE G.Antep D.S.İ. Devlet Su İşleri A.O
......... ......... .........
Major Tasks in Data Preprocessing

1. Fill in missing values


2. Remove noisy data
3. Identify and remove outliers
4. Resolve inconsistencies
1- Filling in Missing Values

For example: 10% of the Salary values are missing

Solutions:
1- Ignore the tuple
– Usually done when class label is missing (in classification)
– Not effective when the percentage of missing values per attribute varies
considerably

2- Fill in the missing value manually


– Tedious and infeasible

3- Fill in it automatically with


– A global constant, e.g., “unknown” (beware: this effectively creates a new class)
– The attribute mean
– The conditioned mean: the attribute mean over all samples belonging to the same class
– The most probable value, using a Bayesian formula or a decision tree (see the sketch below)
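A hedged pandas sketch of the automatic strategies (global constant, attribute mean, conditioned mean); the data frame is a toy assumption:

```python
import pandas as pd

df = pd.DataFrame({
    "cluster": ["C1", "C1", "C2", "C2", "C1"],
    "salary":  [1200, 1000, 500, 2000, None],
})

# Global constant
df["salary_const"] = df["salary"].fillna(900)

# Attribute mean
df["salary_mean"] = df["salary"].fillna(df["salary"].mean())

# Conditioned mean: mean salary within the same cluster
df["salary_cond"] = df["salary"].fillna(
    df.groupby("cluster")["salary"].transform("mean"))
print(df)
```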
Example - Filling in Missing Values

◘ Global constant: 900 (user defined)
◘ Most repeated value: 1000
◘ Mean: 1071
◘ Conditioned mean: 1100 (mean salary of cluster C1)
◘ Most probable value: 1200 (customer 1028 is most similar to customer 1021)
ID    Gender  Age  MaritalStatus  Education    Region    Salary  Cluster
1021  F       41   Married        Masters      Izmir     1200    C1
1022  M       27   Married        Bach.        Ankara    1000    C1
1023  M       20   NeverM         High School  Izmir     1000    C2
1024  F       34   Married        Bach.        İstanbul  1000    C3
1025  M       74   Married        Middle       Ankara    500     C2
1026  M       32   Married        PhD          İstanbul  2000    C2
1027  M       18   NeverM         High School  Ankara    800     C3
1028  F       43   Married        Masters      Izmir     ?       C1
2- Removing Noisy Data

Solutions:

A. Binning
– First sort data and partition into (width or depth) bins
– Then one can
• (a) Equal depth and smoothing by bin boundaries
• (b) Equal depth and smoothing by bin means
• (c) Equal width and smoothing by bin boundaries
• (d) Equal width and smoothing by bin means

B. Regression
– Smooth by fitting the data into regression functions
A. Binning

◘ Equal-width partitioning
– Divides the range into N intervals of equal size
– If A and B are the lowest and highest values of the attribute, the width of the
intervals will be W = (B − A)/N.
◘ Equal-depth partitioning
– Divides the range into N intervals, each containing approximately same
number of samples

Equal width B1 B1 B2 B2 B2 B2 B2 B2 B2 B3 B3 B3
Price in € 4 6 14 16 18 19 21 22 23 25 27 33
Equal depth B1 B1 B1 B1 B2 B2 B2 B2 B3 B3 B3 B3

Equal-Width Partitioning Equal-Depth Partitioning


(33 − 4) / 3 ≈ 10
Bin1 (4-13) : 4 6 Bin1 : 4 6 14 16
Bin2 (14-23) : 14 16 18 19 21 22 23 Bin2 : 18 19 21 22
Bin3 (24-33) : 25 27 33 Bin3: 23 25 27 33
A. Binning

◘ Replace all values in a BIN by ONE value (smoothing values)

Price in €                            4   6  14  16  18  19  21  22  23  25  27  33
Equal depth                          B1  B1  B1  B1  B2  B2  B2  B2  B3  B3  B3  B3
Smoothing by bin means (depth)       10  10  10  10  20  20  20  20  27  27  27  27
Smoothing by bin boundaries (depth)   4   4  16  16  18  18  22  22  23  23  23  33
Smoothing by bin means (width)        5   5  19  19  19  19  19  19  19  28  28  28
Smoothing by bin boundaries (width)   4   6  14  14  14  23  23  23  23  25  25  33
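A small sketch reproducing the equal-depth rows of this table: each bin's values are replaced by the bin mean, or snapped to the nearest bin boundary:

```python
def smooth(bins, how="means"):
    """Replace each value in a bin by the bin mean or the nearest boundary."""
    out = []
    for b in bins:
        if how == "means":
            m = round(sum(b) / len(b))
            out.append([m] * len(b))
        else:  # boundaries: snap each value to the closer of min/max
            lo, hi = b[0], b[-1]
            out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

bins = [[4, 6, 14, 16], [18, 19, 21, 22], [23, 25, 27, 33]]  # equal-depth bins
print(smooth(bins, "means"))       # [[10,10,10,10],[20,20,20,20],[27,27,27,27]]
print(smooth(bins, "boundaries"))  # [[4,4,16,16],[18,18,22,22],[23,23,23,33]]
```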
A. Binning

❑ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into Equal-Depth bins: * Partition into Equal-Width bins:


- Bin 1: - Bin 1:
- Bin 2: - Bin 2:
- Bin 3: - Bin 3:
* Smoothing by bin means: * Smoothing by bin means:
- Bin 1: - Bin 1:
- Bin 2: - Bin 2:
- Bin 3: - Bin 3:
* Smoothing by bin boundaries: * Smoothing by bin boundaries:
- Bin 1: - Bin 1:
- Bin 2: - Bin 2:
- Bin 3: - Bin 3:
Binning Example

❑ Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

* Partition into Equal-Depth bins: * Partition into Equal-Width bins:


- Bin 1: 4, 8, 9, 15 - Bin 1: 4, 8, 9
- Bin 2: 21, 21, 24, 25 - Bin 2: 15, 21, 21, 24
- Bin 3: 26, 28, 29, 34 - Bin 3: 25, 26, 28, 29, 34
* Smoothing by bin means: * Smoothing by bin means:
- Bin 1: 9, 9, 9, 9 - Bin 1: 7, 7, 7
- Bin 2: 23, 23, 23, 23 - Bin 2: 20, 20, 20, 20
- Bin 3: 29, 29, 29, 29 - Bin 3: 28, 28, 28, 28, 28
* Smoothing by bin boundaries: * Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15 - Bin 1: 4, 9, 9
- Bin 2: 21, 21, 25, 25 - Bin 2: 15, 24, 24, 24
- Bin 3: 26, 26, 26, 34 - Bin 3: 25, 25, 25, 25, 34
Binning Example
For example: 3, 8, 10, 11, 15, 19, 23, 29, 35
Equal width: (35 − 3) / 3 ≈ 10 → intervals [3 − 13], [14 − 24], [25 − 35]

Equal-Depth Equal-Width
Bin 1: Bin 1:
Bin 2: Bin 2:
Bin 3: Bin 3:

Means Means
Bin 1: Bin 1:
Bin 2: Bin 2:
Bin 3: Bin 3:

Boundaries
Bin 1: Bin 1:
Bin 2: Bin 2:
Bin 3: Bin 3:
Binning Example
For example: 3, 8, 10, 11, 15, 19, 23, 29, 35
Equal width: (35 − 3) / 3 ≈ 10 → intervals [3 − 13], [14 − 24], [25 − 35]

Equal-Depth Equal-Width
Bin 1: 3, 8, 10 Bin 1: 3, 8, 10, 11
Bin 2: 11, 15, 19 Bin 2: 15, 19, 23
Bin 3: 23, 29, 35 Bin 3: 29, 35

Means Means
Bin 1: 7, 7, 7 Bin 1: 8, 8, 8, 8
Bin 2: 15, 15, 15 Bin 2: 19, 19, 19
Bin 3: 29, 29, 29 Bin 3: 32, 32

Boundaries
Bin 1: 3, 10, 10 Bin 1: 3, 11, 11, 11
Bin 2: 11, 11, 19 Bin 2: 15, 15, 23
Bin 3: 23, 23, 35 Bin 3: 29, 35
B. Regression

[Figure: a noisy point (X1, Y1) is smoothed to (X1, Y1′) on the fitted regression line y = x + 1]
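A minimal NumPy sketch of regression-based smoothing: fit a line to noisy data and replace each value with its fitted value. The synthetic data assumes the slide's line y = x + 1:

```python
import numpy as np

# Noisy data roughly following y = x + 1
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = x + 1 + rng.normal(scale=0.5, size=x.size)

# Fit a line, then replace each y by its smoothed value on that line
slope, intercept = np.polyfit(x, y, deg=1)
y_smoothed = slope * x + intercept
print(f"fitted: y = {slope:.2f}x + {intercept:.2f}")
```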
3- Removing Outliers

◘ Outlier: Data points inconsistent with the majority of data


◘ Removal methods
– Clustering
– Curve-fitting

[Figure: clustered data; the points falling outside all clusters are outliers]
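A hedged sketch of clustering-based outlier removal: cluster the data, then flag points unusually far from their centroid. The synthetic data and the 3-sigma threshold are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),   # cluster 1
               rng.normal(5, 0.5, (50, 2)),   # cluster 2
               [[10.0, -5.0]]])               # an obvious outlier

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# Flag points unusually far from their centroid (threshold is a judgment call)
threshold = dists.mean() + 3 * dists.std()
print("outliers:", X[dists > threshold])
```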
4- Resolve inconsistencies

◘ Data discrepancy detection


– Use metadata (e.g., domain, range, dependency, distribution)
– Check field overloading
– Check uniqueness rule, consecutive rule and null rule

◘ Data Type Conversion may be necessary


– Different representations, different scales, e.g., metric vs. British units

◘ For example: inconsistency in naming convention


Data Preparation

Data Integration → Data Selection & Reduction → Data Preprocessing → Data Transformation
Data Transformation

◘ It is the process of changing the form or structure of existing attributes.
– Convert data into a common format
– Transform data into a new format
◘ It involves converting data into a single common format acceptable to the data mining methodology.
Data Transformation Example

Data Warehouse

appl A - m,f
appl B - 1,0
appl C - x,y
appl D - male, female

appl A - pipeline - cm
appl B - pipeline - in
appl C - pipeline - feet
appl D - pipeline - yds

appl A - balance
appl B - bal
appl C - currbal
appl D - balcurr
Encoding Errors

◘ Education Field
– C: college
– U: university
– H: high school
– D: doctorate
– M: master
– S : secondary school
– P: primary school
– I : illiterate

but X, Q, Y, T values may be seen in the data


Data Transformation

◘ Normalization: values are scaled to fall within a small, specified range


– Min-max normalization
– Z-score normalization

◘ Discretization
– Fixed k-Interval Discretization
– Cluster-Based Discretization
– Entropy-Based Discretization
Normalization
◘ Min-max normalization to [new_min_A, new_max_A]:

    v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A

◘ Z-score normalization (μ_A: mean of A, σ_A: standard deviation of A):

    v' = (v − μ_A) / σ_A


Data Transformation Example

Price in €         4     6    14    16    18    19    21    22    23    24    27    34
Min-max [0,1]      0   .06   .33    .4   .46    .5   .56    .6   .63   .66   .76     1
Z-score         -1.8  -1.6  -0.6  -0.3  -0.1     0   0.2   0.4   0.5   0.6     1   1.8
Decimal scaling  .04   .06   .14   .16   .18   .19   .21   .22   .23   .24   .27   .34

(min-max uses v' = (v − min_A)/(max_A − min_A); z-score uses v' = (v − μ_A)/σ_A with μ_A = 19 and
σ_A ≈ 8; the last row is normalization by decimal scaling, v' = v/100)
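A minimal NumPy sketch reproducing the three rows above; the z-score row uses the population standard deviation, and decimal scaling for the last row is an assumption labeled in the comments:

```python
import numpy as np

v = np.array([4, 6, 14, 16, 18, 19, 21, 22, 23, 24, 27, 34], dtype=float)

# Min-max normalization to [0, 1]
minmax = (v - v.min()) / (v.max() - v.min())

# Z-score normalization (np.std defaults to the population standard deviation)
zscore = (v - v.mean()) / v.std()

# Decimal scaling (assumed): divide by 10^j so that max(|v'|) < 1 (here j = 2)
decimal = v / 10 ** int(np.ceil(np.log10(np.abs(v).max())))

print(minmax.round(2), zscore.round(1), decimal, sep="\n")
```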
Discretization

◘ Discretization:
– Divide the range of a continuous attribute into intervals.

◘ Discretization Methods
– Fixed k-Interval Discretization
– Cluster-Based Discretization
– Entropy-Based Discretization
Original:                          Discretized:

    Age  Buys Computer                 Age    Buys Computer
1   10   No                        1   Young  No
2   14   No                        2   Young  No
3   20   Yes                       3   Adult  Yes
4   22   Yes                       4   Adult  Yes
5   44   Yes                       5   Adult  Yes
6   48   No                        6   Adult  No
7   52   Yes                       7   Adult  Yes
8   70   No                        8   Old    No
9   76   No                        9   Old    No
Fixed k-Interval Discretization

◘ v_min is the minimum observed value
◘ v_max is the maximum observed value
◘ Intervals have width w = (v_max − v_min) / k
◘ The cut points are v_min + w, v_min + 2w, ..., v_min + (k − 1)w

◘ Replace continuous values in the attribute with discrete ranges or labels


Fixed k-Interval Discretization

◘ Use Fixed 4-Interval Discretization method to discretize the following dataset.

w = (82 − 10) / 4 = 72 / 4 = 18  →  intervals [10 − 28], (28 − 46], (46 − 64], (64 − 82]

Customer ID   Age        Customer ID   Age (discretized)
1             10         1             [10 − 28]
2             14         2             [10 − 28]
3             20         3             [10 − 28]
4             22         4             [10 − 28]
5             44         5             (28 − 46]
6             48         6             (46 − 64]
7             52         7             (46 − 64]
8             70         8             (64 − 82]
9             76         9             (64 − 82]
10            82         10            (64 − 82]
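A minimal pandas sketch of fixed k-interval discretization for this example; pd.cut with k equal-width bins matches the hand computation above:

```python
import pandas as pd

ages = pd.Series([10, 14, 20, 22, 44, 48, 52, 70, 76, 82])

# Fixed 4-interval discretization: w = (82 - 10) / 4 = 18
k = 4
labeled = pd.cut(ages, bins=k)              # interval labels, e.g. (28.0, 46.0]
coded = pd.cut(ages, bins=k, labels=False)  # integer codes 0..3
print(pd.DataFrame({"age": ages, "interval": labeled, "code": coded}))
```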
