Chapter 3: Normalization
• Normalization
3/7/2023 1
• Data Transformation and Data Discretization
• This section presents methods of data transformation.
• In this preprocessing step, the data are transformed or consolidated so
that the resulting mining process may be more efficient, and the
patterns found may be easier to understand.
• Data Transformation Strategies Overview
In data transformation, the data are transformed or consolidated into
forms appropriate for mining.
• Strategies for data transformation include the following:
• Data Transformation by Normalization
• Data Normalization
• Motivation: The measurement unit used can affect the data analysis.
• For example, changing measurement units from meters to inches for height, or from
kilograms to pounds for weight, may lead to very different results.
• To help avoid dependence on the choice of measurement units, the data should be normalized or
standardized.
• This involves transforming the data to fall within a smaller or common range such as [-1, 1] or
[0.0, 1.0].
• Range Normalization
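Let X be an attribute and let x1, x2, ..., xn be a random sample drawn from X. In range
normalization each value is shifted by the sample minimum and scaled by the sample range r̂ of X:
\[ x_i' = \frac{x_i - \min_i x_i}{\hat{r}} = \frac{x_i - \min_i x_i}{\max_i x_i - \min_i x_i} \]
• After transformation the new attribute takes on values in the range [0, 1].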
• Example: Let X take the values 12, 14, 18, 23. Transform these data by range normalization.
Solution:
Step 1: Find max X = 23 and min X = 12, so the range is max X - min X = 23 - 12 = 11.
Step 2: x1' = (12 - 12)/11 = 0, x2' = (14 - 12)/11 = 0.1818,
x3' = (18 - 12)/11 = 0.5455, x4' = (23 - 12)/11 = 1
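The two steps of the range normalization example can be checked with a few lines of Python. This is only a minimal sketch: the function name `range_normalize` is made up for illustration, and the guard against a zero range is an added assumption that the slides do not discuss.

```python
def range_normalize(values):
    """Scale each value to [0, 1]: subtract the minimum, divide by the sample range."""
    lo, hi = min(values), max(values)
    sample_range = hi - lo  # step 1: max - min
    if sample_range == 0:
        # Assumption: constant attributes are rejected rather than mapped to 0.
        raise ValueError("all values are equal; the sample range is zero")
    # Step 2: apply (x - min) / range to every value.
    return [(x - lo) / sample_range for x in values]


normalized = range_normalize([12, 14, 18, 23])
print([round(v, 4) for v in normalized])  # [0.0, 0.1818, 0.5455, 1.0]
```

The smallest value always maps to 0 and the largest to 1, which is why the result stays inside [0, 1].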
• Example 2: Consider the table below.
• Solution
• Second method: scaling by the maximum value
• Each value is divided by the largest value of the attribute:
\[ x_i' = \frac{x_i}{\max_i x_i} \]
• Example: Let X take the values 12, 14, 18, 23. Normalize these data.
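For the second method, where each value is divided by the maximum, the same four values can be worked through as follows. This is a sketch only: `max_normalize` is a hypothetical name, and it assumes the values are positive so that the results fall in (0, 1].

```python
def max_normalize(values):
    """Divide every value by the largest value of the attribute."""
    largest = max(values)
    if largest == 0:
        # Assumption: a zero maximum is rejected to avoid division by zero.
        raise ValueError("the maximum value is zero")
    return [x / largest for x in values]


normalized = max_normalize([12, 14, 18, 23])
print([round(v, 4) for v in normalized])  # [0.5217, 0.6087, 0.7826, 1.0]
```

Unlike range normalization, the smallest value does not map to 0 here, because the minimum is not subtracted.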
• Example 3: Consider the table below.
• There are many methods for data normalization.
1- Min-max normalization performs a linear transformation on the original
data. Suppose that min_A and max_A are the minimum and maximum values
of an attribute, A. Min-max normalization maps a value, v_i, of A to v_i' in the
range [new_min_A, new_max_A] by computing
\[ v_i' = \frac{v_i - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A \]
• Example: Min-max normalization. Suppose that the minimum and
maximum values for the attribute income are $12,000 and $98,000,
respectively. We would like to map income to the range [0.0, 1.0]. By min-max
normalization, a value of $73,600 for income is transformed to
\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]
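The income example can be reproduced with a short Python function. It is a sketch under stated assumptions: the name `min_max_normalize` and its parameter names are illustrative, and the target range defaults to [0.0, 1.0] as in the example.

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Linearly map v from [min_a, max_a] to [new_min, new_max]."""
    if max_a == min_a:
        # Assumption: a constant attribute is rejected to avoid division by zero.
        raise ValueError("min_a and max_a must differ")
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


# Income of $73,600 with min $12,000 and max $98,000, mapped to [0.0, 1.0].
print(round(min_max_normalize(73_600, 12_000, 98_000), 3))  # 0.716
```

Passing `new_min=-1.0, new_max=1.0` would map the same attribute to [-1, 1] instead.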
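• In z-score normalization (or zero-mean normalization), the values for an
attribute, A, are normalized based on the mean (i.e., average) and standard
deviation of A. A value, v_i, of A is normalized to v_i' by computing
\[ v_i' = \frac{v_i - \bar{A}}{\sigma_A} \]
where \bar{A} and \sigma_A are the mean and standard deviation of A, respectively.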
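Z-score normalization subtracts the mean of A from each value and divides by the standard deviation of A. The sketch below applies this to the values 12, 14, 18, 23 used in the earlier examples. The function name `z_score_normalize` is hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption; the sample standard deviation is also common.

```python
from statistics import mean, pstdev


def z_score_normalize(values):
    """Return (v - mean) / std for every value, using the population std."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    return [(v - mu) / sigma for v in values]


z = z_score_normalize([12, 14, 18, 23])
print([round(v, 3) for v in z])
```

After the transformation the values have mean 0 and standard deviation 1, so values below the original mean become negative.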
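• Normalization by decimal scaling moves the decimal point of the values of A. A value, v_i, of A is
normalized to v_i' by computing
\[ v_i' = \frac{v_i}{10^j} \]
where j is the smallest integer such that max(|v_i'|) < 1.
• Example 3.6 Decimal scaling. Suppose that the recorded values of A range
from -986 to 917. The maximum absolute value of A is 986. To normalize by
decimal scaling, we therefore divide each value by 1000 (i.e., j = 3) so that -986
normalizes to -0.986 and 917 normalizes to 0.917.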
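The decimal scaling example can be sketched in Python as below. The function name `decimal_scale` is made up for illustration. It picks j as the smallest integer for which the largest absolute value divided by 10^j falls below 1, which matches j = 3 for a maximum absolute value of 986.

```python
def decimal_scale(values):
    """Divide all values by 10**j, with j the smallest integer giving max |v'| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]


j, scaled = decimal_scale([-986, 917])
print(j, scaled)  # 3 [-0.986, 0.917]
```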
• Note that normalization can change the original data quite a bit, especially
when using z-score normalization or decimal scaling. It is also necessary to
save the normalization parameters (e.g., the mean and standard deviation
if using z-score normalization) so that future data can be normalized in a
uniform manner.
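One possible way to keep the normalization parameters is to compute them once, store them, and reuse them for values that arrive later. The sketch below does this for z-score normalization. The helper names `fit_z_score` and `apply_z_score`, the training values, the future value, and the choice of JSON as storage format are all assumptions made for illustration.

```python
import json
from statistics import mean, pstdev


def fit_z_score(values):
    """Compute the parameters that must be saved: mean and standard deviation."""
    return {"mean": mean(values), "std": pstdev(values)}


def apply_z_score(value, params):
    """Normalize one value with previously saved parameters."""
    return (value - params["mean"]) / params["std"]


training_values = [12, 14, 18, 23]   # hypothetical data seen at fitting time
params = fit_z_score(training_values)

saved = json.dumps(params)           # e.g. written to a file in practice
restored = json.loads(saved)

future_value = 20                    # hypothetical value arriving later
print(apply_z_score(future_value, restored))
```

The future value is normalized with the stored mean and standard deviation, not with statistics recomputed from the new data, so old and new values stay on the same scale.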
• Data Reduction
• The basic idea of data reduction is to shrink the data representation, trading accuracy for
speed, in response to the need to obtain quick approximate answers to queries on very large
databases. Some of the data reduction techniques are as follows:
1- Histograms
2- Clustering
3- Sampling
4- Construction of Index Trees
5- Singular Value Decomposition
6- Wavelets
7- Regression
8- Log-linear models
• Histograms
• Histograms use binning to approximate data distributions and are a popular
form of data reduction
• A histogram for an attribute, A, partitions the data distribution of A into
disjoint subsets, referred to as buckets or bins.
• If each bucket represents only a single attribute–value/frequency pair, the
buckets are called singleton buckets.
• Often, buckets instead represent continuous ranges for the given attribute.
• Example 3.3 Histograms. The following data are a list of AllElectronics prices for
commonly sold items (rounded to the nearest dollar). The numbers have been
sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15,
15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25,
25, 25, 25, 25, 28, 28, 30, 30, 30.
• Reduce the data above using a histogram.
Solution
• Step 1: Record the data using singleton buckets (bins).
• Step 2: Fill in the frequency table.
Price ($)   1   5   8  10  12  14  15  18  20  21  25  28  30
Frequency   2   5   2   4   1   3   6   8   7   4   5   2   3
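The singleton-bucket frequency table can be rebuilt from the raw price list with `collections.Counter` from the Python standard library. This is a minimal sketch; the variable names are illustrative.

```python
from collections import Counter

# Sorted AllElectronics prices from the example (52 values).
prices = [
    1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14,
    15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18,
    20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
    25, 25, 25, 25, 25, 28, 28, 30, 30, 30,
]

# Each distinct price becomes one singleton bucket with its frequency.
singleton_buckets = sorted(Counter(prices).items())

for price, frequency in singleton_buckets:
    print(price, frequency)
```

The 52 raw values are reduced to 13 value/frequency pairs without losing any information about the distribution.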
• Draw the figure
[Figure: chart of the 13 singleton buckets, plotting the series Data and Frequency]
• An equal-width histogram for price, where values are aggregated so that each
bucket has a uniform width of $10.
[Figure: equal-width histogram; horizontal axis: price buckets 1-10, 11-20, 21-30; vertical axis: Frequency]
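The equal-width aggregation can be sketched in Python as follows. The function name `equal_width_histogram` and its parameters are hypothetical. The sketch assumes integer prices and buckets that start at 1 with width 10, giving the ranges 1-10, 11-20, and 21-30. The price list is rebuilt from the singleton-bucket frequencies of the example.

```python
def equal_width_histogram(values, width, start=1):
    """Count integer values per bucket [low, low + width - 1], starting at `start`."""
    counts = {}
    for v in values:
        index = (v - start) // width       # which bucket the value falls into
        low = start + index * width
        bucket = (low, low + width - 1)
        counts[bucket] = counts.get(bucket, 0) + 1
    return dict(sorted(counts.items()))


# Price -> frequency pairs from the singleton-bucket table.
frequencies = {1: 2, 5: 5, 8: 2, 10: 4, 12: 1, 14: 3, 15: 6,
               18: 8, 20: 7, 21: 4, 25: 5, 28: 2, 30: 3}
prices = [p for p, f in frequencies.items() for _ in range(f)]

print(equal_width_histogram(prices, width=10))
# {(1, 10): 13, (11, 20): 25, (21, 30): 14}
```

Thirteen singleton buckets collapse into three, which reduces the data further at the cost of detail inside each range.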
• Data Compression - The basic idea of this theory is to
compress the given data by encoding it in terms of the
following:
1- Decision Trees
2- Clusters
3- Association Rules
4- Bits
• Pattern Discovery - The basic idea of this theory is to
discover patterns occurring in a database. The following
areas contribute to this theory:
1- Machine Learning
2- Neural Networks
3- Association Mining
4- Sequential Pattern Matching
5- Clustering