Data Normalization and Standardization

Peshawa Jammal Muhammad Ali


Department of Software Engineering, Koya University, Kurdistan Region, Iraq.
[email protected]

Please send me your comments by email so that I can improve the document.


Abstract
This paper aims to clarify how and why data are normalized or standardized. These two
processes are used in the data preprocessing stage, in which the data are prepared for later
processing by a data mining or machine learning technique such as a support vector machine
or a neural network. Both methods rescale the data set. They are helpful in some cases and
necessary in others, and most data mining and machine learning tools, such as Weka and
Matlab, include them. This paper defines these two data preprocessing techniques and
presents their use.

Normalization
Normalization is the process of rescaling the data to a specific range, such as between 0 and 1
or between -1 and +1. Normalization is required when there are big differences in the ranges
of different features. This scaling method is useful when the data set does not contain
outliers, because a single extreme value sets min or max and squeezes the remaining values
into a narrow part of the target range. The theoretical background of normalization can be
easily understood from Figure (1). If the data are to be rescaled to the range [0, 1], then:

From Trigonometry:
\[
\frac{\text{valueAfterNormalization} - 0}{1 - 0} = \frac{\text{valueBeforeNormalization} - \min}{\max - \min}
\]

\[
\frac{\text{valueAfterNormalization}}{1} = \frac{\text{valueBeforeNormalization} - \min}{\max - \min}
\]

\[
\text{valueAfterNormalization} = \frac{\text{valueBeforeNormalization} - \min}{\max - \min}
\]

or

\[
x' = \frac{x - \min}{\max - \min}
\]
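
The formula for the range [0, 1] can be sketched in a few lines of Python. This is only an illustration: the function name normalize_01 and the sample data are assumptions made for this example and do not come from any particular tool.

```python
def normalize_01(values):
    """Rescale numbers to [0, 1] using x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal, so max - min is zero and the formula is undefined.
        raise ValueError("max equals min; cannot normalize a constant feature")
    return [(x - lo) / (hi - lo) for x in values]


# Hypothetical sample data, chosen only for illustration.
data = [10, 20, 30, 50]
print(normalize_01(data))
```

The smallest value is mapped to 0 and the largest to 1; every other value falls in between in proportion to its distance from min.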

Denormalization
This process should be carried out if normalization has been applied and the original values
are needed again. For example, to denormalize data from the range [0, 1], the following
equation can be used:

\[
x = x' (\max - \min) + \min
\]

where x' is the normalized value and x is the denormalized value; min and max are the same
values used previously in the normalization process.
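
A small round trip shows why min and max must be kept after normalization. The sketch below is illustrative only; the function names and the sample data are assumptions made for this example.

```python
def normalize_01(values, lo, hi):
    """Map values from [lo, hi] to [0, 1]."""
    return [(x - lo) / (hi - lo) for x in values]


def denormalize_01(scaled, lo, hi):
    """Invert normalize_01: x = x' * (max - min) + min."""
    return [xp * (hi - lo) + lo for xp in scaled]


# Hypothetical sample data, chosen only for illustration.
data = [10.0, 20.0, 30.0, 50.0]
lo, hi = min(data), max(data)  # these two numbers must be stored

scaled = normalize_01(data, lo, hi)
restored = denormalize_01(scaled, lo, hi)
print(scaled)
print(restored)
```

If lo and hi are discarded after scaling, the original values cannot be recovered from the scaled ones.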

To normalize the data to the range [-1, +1], see Figure (2):

\[
\frac{\text{valueAfterNormalization} - (-1)}{1 - (-1)} = \frac{\text{valueBeforeNormalization} - \min}{\max - \min}
\]

\[
\frac{\text{valueAfterNormalization} + 1}{2} = \frac{\text{valueBeforeNormalization} - \min}{\max - \min}
\]

\[
\text{valueAfterNormalization} = 2 \left( \frac{\text{valueBeforeNormalization} - \min}{\max - \min} \right) - 1
\]

or

\[
x' = 2 \left( \frac{x - \min}{\max - \min} \right) - 1
\]

Denormalization from range -1, +1

\[
x = \left( \frac{x' + 1}{2} \right) (\max - \min) + \min
\]
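
The same round trip for the range [-1, +1] can be sketched as follows. The function names and the sample data are assumptions made for this illustration.

```python
def normalize_pm1(values, lo, hi):
    """Map values from [lo, hi] to [-1, +1]: x' = 2 * (x - min) / (max - min) - 1."""
    return [2 * (x - lo) / (hi - lo) - 1 for x in values]


def denormalize_pm1(scaled, lo, hi):
    """Invert normalize_pm1: x = ((x' + 1) / 2) * (max - min) + min."""
    return [(xp + 1) / 2 * (hi - lo) + lo for xp in scaled]


# Hypothetical sample data, chosen only for illustration.
data = [10.0, 20.0, 30.0, 50.0]
lo, hi = min(data), max(data)

scaled = normalize_pm1(data, lo, hi)
restored = denormalize_pm1(scaled, lo, hi)
print(scaled)
print(restored)
```

Here min is mapped to -1, max to +1, and the midpoint of the range to 0.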


In WEKA, for the range [-1, +1], the formula is rearranged as follows:

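\[
x' = 2 \left( \frac{x - \min}{\max - \min} \right) - 1
\]

\[
x' = \frac{x - \min}{\frac{\max - \min}{2}} - 1 = \frac{x - \min - \frac{\max - \min}{2}}{\frac{\max - \min}{2}}
\]

\[
x' = \frac{x - \min - \frac{\max}{2} + \frac{\min}{2}}{\frac{\max - \min}{2}} = \frac{x - \frac{\max}{2} - \frac{\min}{2}}{\frac{\max - \min}{2}}
\]

\[
x' = \frac{x - \left( \frac{\max}{2} + \frac{\min}{2} \right)}{\frac{\max - \min}{2}}
\]

\[
x' = \frac{x - \frac{\max + \min}{2}}{\frac{\max - \min}{2}}
\]
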
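
The rearranged form subtracts the midpoint of the range and divides by half of its width. A quick numerical check that it agrees with the original form is sketched below. The function names are assumptions made for this illustration and are not WEKA's API.

```python
import math


def scale_original(x, lo, hi):
    """x' = 2 * (x - min) / (max - min) - 1"""
    return 2 * (x - lo) / (hi - lo) - 1


def scale_midpoint(x, lo, hi):
    """x' = (x - (max + min) / 2) / ((max - min) / 2)"""
    return (x - (hi + lo) / 2) / ((hi - lo) / 2)


# Hypothetical range and test points, chosen only for illustration.
lo, hi = -20.0, 120.0
for x in [-20.0, -6.0, 0.0, 40.0, 70.0, 120.0]:
    a = scale_original(x, lo, hi)
    b = scale_midpoint(x, lo, hi)
    # The two forms should agree up to floating-point rounding.
    assert math.isclose(a, b, abs_tol=1e-12)
    print(x, a, b)
```
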
Z-score standardization
Standardization transforms a data set so that it has mean = 0 and standard deviation = 1.
This scaling method is useful when the data follow a normal distribution (Gaussian
distribution); if the data do not follow a normal distribution, standardization can cause
problems.

Example: -20, -6, 0, 40, 70,120

\[
\text{Mean} = \frac{-20 - 6 + 0 + 40 + 70 + 120}{6} = 34
\]

\[
sd = \sqrt{\frac{(-20-34)^2 + (-6-34)^2 + (0-34)^2 + (40-34)^2 + (70-34)^2 + (120-34)^2}{6}}
\]

\[
sd = 48.98979
\]

z-score standardization

\[
x' = \frac{x - \text{mean}}{sd} = \frac{-20 - 34}{48.98979} = -1.10227
\]

The other values are transformed in the same way, so the data set becomes:

Original value    Standardized value
-20               -1.10227
-6                -0.8165
0                 -0.69402
40                0.122474
70                0.734847
120               1.755468

Now, if you calculate the average and sd of these new values you will see that the mean
is zero and sd=1.
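
This check can be reproduced with Python's standard statistics module. The sketch below uses the population standard deviation (dividing by the number of values, 6), as in the calculation above; the sample standard deviation (dividing by 5) would give different numbers. The function name z_scores is an assumption made for this illustration.

```python
import statistics


def z_scores(values):
    """Standardize values with the population mean and standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population sd: divides by N, not N - 1
    return [(x - mean) / sd for x in values]


data = [-20, -6, 0, 40, 70, 120]
z = z_scores(data)

print([round(v, 5) for v in z])
print(statistics.mean(z))    # approximately 0, up to rounding error
print(statistics.pstdev(z))  # approximately 1, up to rounding error
```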

Important note:

However, normalization and standardization (N/S) are not appropriate where the raw
measurement itself is desirable, or where the N/S is irreversible, because much of the
information in the raw measurement is then lost. This point is from a note made by
Kevin Hankins ([email protected]).
