5 Data Preprocessing III Edited Notes
◼ Data Quality
◼ Data Cleaning
◼ Data Integration
◼ Data Reduction
◼ Possible Solutions
◼ Designing an algorithm with an arbitrary combination of data types, i.e., one
that can handle a variety of data types simultaneously, processing both numerical
and categorical data. >> Time-consuming and sometimes impractical.
Ex: Converting categorical data (like gender) to numerical format using encoding (e.g.,
"Male" to 1, "Female" to 0).
Ex: Discretizing continuous numerical data into categories if needed.
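The two conversions mentioned above can be sketched in plain Python. The mappings and the age cut-points below are illustrative assumptions, not fixed rules:

```python
def encode_gender(value):
    # Map a categorical value to a number ("Male" -> 1, "Female" -> 0).
    return {"Male": 1, "Female": 0}[value]

def discretize_age(age):
    # Discretize a continuous value into labelled categories
    # (cut-points at 18 and 65 are illustrative assumptions).
    if age < 18:
        return "minor"
    elif age < 65:
        return "adult"
    return "senior"

print(encode_gender("Male"))   # 1
print(discretize_age(40))      # adult
```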
◼ Methods
b. z-score normalization
3. Aggregation
◼ Data aggregation is the process where raw data is gathered and
summarized to perform statistical analysis
For example, finding the average age of customers buying a particular product
can help identify the target age group for that product. Instead of dealing with
each individual customer, the average age of the customers is calculated.
◼ Time aggregation - It provides data points for a single resource over a
defined time period. Example: The website receives 60 visits in one hour. After
aggregating, the data will show total visits per day like 1,500 visits on Monday,
1,800 on Tuesday, etc.
◼ Spatial aggregation - It provides data points for a group of resources over
a defined time period. Example: Suppose there are multiple weather stations in
different cities within a region. Spatial aggregation can combine the readings to
provide an average temperature for the entire region for a given day.
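The time-aggregation example above can be sketched in plain Python, rolling hourly visit counts up to daily totals (the sample numbers are invented; spatial aggregation would group by location the same way):

```python
from collections import defaultdict

# (day, hour, visits) tuples - hourly data points for one website.
hourly_visits = [
    ("Monday", 9, 60), ("Monday", 10, 75),
    ("Tuesday", 9, 90), ("Tuesday", 10, 80),
]

daily_totals = defaultdict(int)
for day, _hour, visits in hourly_visits:
    daily_totals[day] += visits   # aggregate over the time dimension

print(dict(daily_totals))  # {'Monday': 135, 'Tuesday': 170}
```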
4. Normalization
◼ To give all attributes an equal weight, the data should be normalized or
standardized.
◼ This helps to prevent attributes with initially large ranges from outweighing
attributes with initially smaller ranges.
◼ The measurement unit used can affect the data analysis, e.g., changing
measurement units from meters to inches for height, or from kilograms to
pounds for weight.
◼ Expressing an attribute in smaller units will lead to a larger range for that
attribute, and thus tend to give such an attribute greater effect or “weight.”
◼ Normalization is particularly useful for classification algorithms involving
neural networks or distance measurements such as nearest-neighbour
classification (e.g., k-nearest neighbors, k-NN) and clustering.
◼ Min-max normalization maps a value v of attribute A onto a new range:
v' = (v − min_A) / (max_A − min_A) × (new_max − new_min) + new_min
◼ Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then
$73,600 is mapped to:
(73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716
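A minimal sketch of min-max normalization reproducing the income example above:

```python
def min_max_normalize(v, old_min, old_max, new_min=0.0, new_max=1.0):
    # Map v from [old_min, old_max] onto [new_min, new_max].
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

# The $73,600 income from the example, mapped into [0.0, 1.0].
print(round(min_max_normalize(73600, 12000, 98000), 3))  # 0.716
```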
◼ z-score normalization maps a value v of attribute A to:
v' = (v − Ᾱ) / σ_A
where Ᾱ and σ_A are the mean and standard deviation, respectively, of attribute
A.
◼ Ex. Let Ᾱ = 54,000 and σ_A = 16,000. Then
(73,600 − 54,000) / 16,000 = 1.225
If you need normalized data with a controlled or fixed range after Z-score
normalization, you could apply an additional step:
1. Apply Z-score normalization first: normalize the data to mean 0 and
standard deviation 1.
2. Apply min-max scaling on the Z-scores: use min-max normalization to
rescale the Z-score values to a specific range, like [0, 1] or [−1, 1].
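The two-step recipe above can be sketched with the standard library (the sample values are invented; population standard deviation is assumed):

```python
from statistics import mean, pstdev

def z_scores(values):
    # Step 1: z-score normalization to mean 0, standard deviation 1.
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def rescale(values, new_min=0.0, new_max=1.0):
    # Step 2: min-max scaling of the z-scores onto a fixed range.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

data = [30000, 54000, 73600, 98000]
z = z_scores(data)
bounded = rescale(z)               # z-scores squeezed into [0, 1]
print(min(bounded), max(bounded))  # 0.0 1.0
```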
Data Normalization Methods
◼ Normalization by decimal scaling
◼ Normalizes by moving the decimal point of values of attribute A.
◼ The number of decimal points moved depends on the maximum
absolute value of A.
v' = v / 10^j
where j is the smallest integer such that max(|v'|) < 1.
◼ Ex. Suppose that the recorded values of A range from -986 to 917. The maximum
absolute value of A is 986. To normalize by decimal scaling, we therefore divide each
value by 1000 (i.e., j = 3) so that -986 normalizes to -0.986 and 917 normalizes to
0.917.
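The decimal-scaling example above can be sketched as follows: divide by the smallest power of 10 that brings the maximum absolute value below 1 (j = 3 for the range −986 to 917):

```python
def decimal_scale(values):
    # Find the smallest j with max(|v|) / 10^j < 1, then divide all values.
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

scaled, j = decimal_scale([-986, 917])
print(j, scaled)  # 3 [-0.986, 0.917]
```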
• Binning: top-down split, unsupervised
• Histogram analysis: top-down split, unsupervised
• Clustering analysis: unsupervised, top-down split or bottom-up merge
Binning as a discretization technique
◼ Binning can also be used as a discretization technique.
◼ For data that is not uniformly distributed, equal depth bins work
reasonably well.
For example, consider the age attribute. One could create ranges [0, 10], [11,
20], [21, 30], and so on. The symbolic value for any record in the range [11, 20]
is “2” and the symbolic value for a record in the range [21, 30] is “3”. Because
these are symbolic values, no ordering is assumed between the values “2” and
“3”. Furthermore, variations within a range are not distinguishable after
discretization. Thus, the discretization process does lose some information for
the mining process. However, for some applications, this loss of information is
not too debilitating.

One challenge with discretization is that the data may be nonuniformly
distributed across the different intervals. For example, for the case of the
salary attribute, a large subset of the population may be grouped in the
[40,000, 80,000] range, but very few will be grouped in the
[1,040,000, 1,080,000] range. Note that both ranges have the same size. Thus,
the use of ranges of equal size may not be very helpful in discriminating
between different data segments. On the other hand, many attributes, such as
age, are not as nonuniformly distributed, and therefore ranges of equal size
may work reasonably well.
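The contrast between equal-width and equal-depth bins described above can be sketched as follows. The salary values are invented, with one outlier to show the skew (a real implementation might use pandas.cut and pandas.qcut):

```python
salaries = [42000, 45000, 51000, 58000, 63000, 70000, 79000, 1060000]

def equal_width_bin(values, k):
    # Split the overall range into k intervals of equal size.
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_depth_bin(values, k):
    # Put (roughly) the same number of points into each bin.
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * k // len(values)
    return labels

print(equal_width_bin(salaries, 4))  # almost everyone lands in bin 0
print(equal_depth_bin(salaries, 4))  # two values per bin
```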
Histograms
• Low Income: 10 - 30
• Lower-Middle Income: 30 - 50
• Upper-Middle Income: 50 - 70
• High Income: 70 - 90
• Very High Income: 90 - 130
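A small sketch mapping income values into the histogram buckets listed above (assuming the figures are in thousands of dollars and upper bounds are exclusive):

```python
def income_group(income):
    # Bucket boundaries taken from the list above (in $1000s, assumed).
    bounds = [(10, 30, "Low"), (30, 50, "Lower-Middle"),
              (50, 70, "Upper-Middle"), (70, 90, "High"),
              (90, 130, "Very High")]
    for lo, hi, label in bounds:
        if lo <= income < hi:
            return label
    return None  # outside all buckets

print(income_group(65))  # Upper-Middle
```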
Categorical to Numeric Data
◼ Direct Encoding
◼ By giving each distinct value a number.
◼ Would cause the model to misinterpret these values
◼ Ex. encoding male by 1 and female by 2 may be interpreted by the model as if
female is more important than male.
Because binary data is a special form of both numeric and categorical data, it is
possible to convert the categorical attributes to binary form and then use
numeric algorithms on the binarized data.
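The binarization described above can be sketched as one-hot encoding, which avoids the false ordering that direct encoding introduces (category names are illustrative):

```python
def one_hot(values):
    # One binary column per distinct category; exactly one 1 per row.
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

# Columns are sorted categories: ["female", "male"].
print(one_hot(["male", "female", "male"]))
# [[0, 1], [1, 0], [0, 1]]
```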