
Chapter 3-1

Techniques for Major Data Preprocessing

• Normalization
• Data Transformation and Data Discretization
• This section presents methods of data transformation.
• In this preprocessing step, the data are transformed or consolidated so
that the resulting mining process may be more efficient, and the
patterns found may be easier to understand.

• Data Transformation Strategies Overview
• In data transformation, the data are transformed or consolidated into forms appropriate for mining.
• Strategies for data transformation include normalization, discretization, and data reduction, which are discussed in the following sections.
• Data Transformation by Normalization

Data Normalization
• Motivation: The measurement unit used can affect the data analysis.
• For example, changing measurement units from meters to inches for height, or from kilograms to pounds for weight, may lead to very different results.
• To help avoid dependence on the choice of measurement units, the data should be normalized or standardized.

• This involves transforming the data to fall within a smaller or common range such as [-1, 1] or [0.0, 1.0].

• Normalizing the data attempts to give all attributes an equal weight.

• Normalization is particularly useful for classification algorithms involving neural networks or distance measurements such as nearest-neighbor classification and clustering.

- If using the neural network backpropagation algorithm for classification mining, normalizing the input values for each attribute in the training tuples will help speed up the learning phase.
• Range Normalization
Let X be an attribute and let x1, x2, ..., xn be a random sample drawn from X. In range normalization each value is scaled by the sample range r̂ of X:

   x'i = (xi - min X) / r̂ ,   where r̂ = max X - min X

• After transformation the new attribute takes on values in the range [0, 1].
• Example: Let X take the values 12, 14, 18, 23. Transform this data into normalized form.
Solution:
Step 1: Find max X = 23, min X = 12, so r̂ = max X - min X = 23 - 12 = 11.
Step 2: x'1 = (12 - 12)/11 = 0, x'2 = (14 - 12)/11 = 0.1818, x'3 = (18 - 12)/11 = 0.5454, x'4 = (23 - 12)/11 = 1

Normalized X = 0, 0.1818, 0.5454, 1

Note: This rule can also be applied in Excel or MATLAB.
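A minimal Python sketch of the range normalization rule above (the function name range_normalize is illustrative, not from the slides):

```python
def range_normalize(values):
    """Scale each value by the sample range: x' = (x - min X) / (max X - min X)."""
    lo, hi = min(values), max(values)
    r = hi - lo  # sample range r̂
    return [(x - lo) / r for x in values]

# Reproduces the worked example: 12, 14, 18, 23 -> 0, 0.1818..., 0.5454..., 1.0
print(range_normalize([12, 14, 18, 23]))
```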

• Example 2: Consider the table below.

• Transform the Income data into normalized values using the rule above.

• Solution
• Second method: maximum normalization
• x'i = xi / max X
• Example: Let X take the values 12, 14, 18, 23. Transform this data into normalized form using this rule.

• Solution: max X = 23

• x'1 = 12/23 = 0.5217, x'2 = 14/23 = 0.6086, x'3 = 18/23 = 0.7826, x'4 = 23/23 = 1
• Normalized X = 0.5217, 0.6086, 0.7826, 1
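A minimal sketch of this divide-by-maximum rule (the function name max_normalize is illustrative, not from the slides):

```python
def max_normalize(values):
    """Scale each value by the maximum: x' = x / max(X)."""
    hi = max(values)
    return [x / hi for x in values]

# Reproduces the example: 12, 14, 18, 23 -> 0.5217..., 0.6086..., 0.7826..., 1.0
print(max_normalize([12, 14, 18, 23]))
```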

• Example 3: Consider the table below.

• Transform the Income data into normalized values using the rule above.
• There are many methods for data normalization.
1- Min-max normalization performs a linear transformation on the original data. Suppose that minA and maxA are the minimum and maximum values of an attribute, A. Min-max normalization maps a value, vi, of A to v'i in the range [new_minA, new_maxA] by computing

   v'i = ((vi - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA

• Min-max normalization preserves the relationships among the original data values. It will encounter an "out-of-bounds" error if a future input case for normalization falls outside of the original data range for A.
• Example: Min-max normalization. Suppose that the minimum and maximum values for the attribute income are $12,000 and $98,000, respectively. We would like to map income to the range [0.0, 1.0]. By min-max normalization, a value of $73,600 for income is transformed to

   ((73,600 - 12,000) / (98,000 - 12,000)) * (1.0 - 0) + 0 = 0.716
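A minimal Python sketch of min-max normalization to a new range (the function name min_max_normalize is illustrative, not from the slides):

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v linearly from [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

# Reproduces the income example: $73,600 with min $12,000 and max $98,000 -> about 0.716
print(round(min_max_normalize(73_600, 12_000, 98_000), 3))
```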

• In z-score normalization (or zero-mean normalization), the values for an attribute, A, are normalized based on the mean (i.e., average) and standard deviation of A. A value, vi, of A is normalized to v'i by computing

   v'i = (vi - Ā) / σA ,   where Ā and σA are the mean and standard deviation of A

• In normalization by decimal scaling, the decimal point of the values of A is moved:

   v'i = vi / 10^j ,   where j is the smallest integer such that max(|v'i|) < 1

Example 3.6: Decimal scaling. Suppose that the recorded values of A range from -986 to 917. The maximum absolute value of A is 986. To normalize by decimal scaling, we therefore divide each value by 1,000 (i.e., j = 3) so that -986 normalizes to -0.986 and 917 normalizes to 0.917.
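A minimal Python sketch of both rules (the names z_score_normalize and decimal_scale are illustrative, not from the slides):

```python
import statistics

def z_score_normalize(values):
    """z-score: v' = (v - mean of A) / standard deviation of A."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation; sample stdev is also common
    return [(v - mean) / std for v in values]

def decimal_scale(values):
    """Decimal scaling: v' = v / 10^j, with j the smallest integer so that max|v'| < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

# Reproduces Example 3.6: -986 and 917 scale to -0.986 and 0.917 (j = 3).
print(decimal_scale([-986, 917]))
```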

• Note that normalization can change the original data quite a bit, especially when using z-score normalization or decimal scaling. It is also necessary to save the normalization parameters (e.g., the mean and standard deviation if using z-score normalization) so that future data can be normalized in a uniform manner.
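To make the point about saving parameters concrete, here is a minimal sketch in which the mean and standard deviation learned from the original data are stored and reused on future values (the names fit_z_score and apply_z_score are assumed for illustration):

```python
import statistics

def fit_z_score(values):
    """Learn and save the normalization parameters (mean and standard deviation)."""
    return {"mean": statistics.mean(values), "std": statistics.pstdev(values)}

def apply_z_score(value, params):
    """Normalize a future value with the saved parameters, so all data is treated uniformly."""
    return (value - params["mean"]) / params["std"]

params = fit_z_score([12, 14, 18, 23])   # parameters learned from the original sample
print(apply_z_score(20, params))         # a future value normalized with the same mean and std
```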

• Data Reduction

• The basic idea of data reduction is to obtain a reduced representation of the data, trading accuracy for speed in response to the need for quick approximate answers to queries on very large databases. Some of the data reduction techniques are as follows:

1- Histograms
2- Clustering
3- Sampling
4- Construction of Index Trees
5- Singular Value Decomposition
6- Wavelets
7- Regression
8- Log-linear models
• Histograms
• Histograms use binning to approximate data distributions and are a popular form of data reduction.
• A histogram for an attribute, A, partitions the data distribution of A into disjoint subsets, referred to as buckets or bins.
• If each bucket represents only a single attribute-value/frequency pair, the buckets are called singleton buckets.
• Often, buckets instead represent continuous ranges for the given attribute.
• Example 3.3: Histograms. The following data are a list of AllElectronics prices for commonly sold items (rounded to the nearest dollar). The numbers have been sorted: 1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
• Reduce the data above via a histogram.
Solution
• Step 1: Represent the data using singleton bins/buckets.
• Step 2: Fill in the table.

Data       1  5  8  10  12  14  15  18  20  21  25  28  30
Frequency  2  5  2   4   1   3   6   8   7   4   5   2   3
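A minimal Python sketch that builds these singleton buckets by counting each distinct price (the prices list is taken from the example above):

```python
from collections import Counter

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15,
          15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
          25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

# Each singleton bucket is a (value, frequency) pair.
singleton_buckets = Counter(prices)
for value in sorted(singleton_buckets):
    print(value, singleton_buckets[value])   # e.g. 1 2, 5 5, 8 2, ...
```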

• Draw the figure.
[Figure: bar chart of the singleton-bucket histogram, with the 13 distinct price values on the x-axis and their frequencies on the y-axis.]
• An equal-width histogram for price, where values are aggregated so that each bucket has a uniform width of $10.

[Figure: equal-width histogram of price with buckets 1-10, 11-20, and 21-30 on the x-axis and frequency on the y-axis.]
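A minimal sketch of how the same prices collapse into equal-width buckets of width $10, matching the figure's 1-10, 11-20, and 21-30 ranges (the helper name equal_width_histogram is assumed):

```python
from collections import Counter

def equal_width_histogram(values, width=10):
    """Group values into buckets of uniform width, e.g. 1-10, 11-20, 21-30 for width 10."""
    buckets = Counter((v - 1) // width for v in values)
    return {f"{b * width + 1}-{(b + 1) * width}": count for b, count in sorted(buckets.items())}

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15,
          15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
          25, 25, 25, 25, 25, 28, 28, 30, 30, 30]
print(equal_width_histogram(prices))   # {'1-10': 13, '11-20': 25, '21-30': 14}
```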

• Data Compression - The basic idea of data compression is to compress the given data by encoding it in terms of the following:
1- Decision Trees
2- Clusters
3- Association Rules
4- Bits
• Pattern Discovery - The basic idea of pattern discovery is to discover patterns occurring in a database. The following areas contribute to it:

1- Machine Learning
2- Neural Networks
3- Association Mining
4- Sequential Pattern Matching
5- Clustering
• END
