DMDW 5

Normalization is a process used to standardize data values that are measured on different scales. It is often necessary prior to performing data analysis to avoid attributes with larger ranges dominating over others. There are several normalization methods including min-max normalization, z-score normalization, and decimal scaling. Dimensionality reduction techniques can also be applied to reduce the number of random variables under consideration by obtaining a set of principal variables.


Dr. Amiya Ranjan Panda


 Normalization is generally required when multiple attributes have values on different scales; otherwise this may lead to poor data models while performing data mining operations.
 Without it, the effectiveness of an equally important attribute (on a lower scale) may be diluted because another attribute has values on a larger scale.
 Heterogeneous data with different units usually needs to be normalized. If the data share the same unit and the same order of magnitude, normalization may not be necessary.
 Unless normalized during pre-processing, variables with disparate ranges or varying precision acquire different driving values (influence) in the analysis.
 Normalization is normally done when a distance computation is involved in the algorithm.
 Methods of Data Normalization:
◦ Decimal Scaling
◦ Min-Max Normalization
◦ z-Score Normalization (zero-mean Normalization)

 There are several approaches to normalization which can be used in deep learning models:
 Batch Normalization
 Layer Normalization
 Group Normalization
 Instance Normalization
 Weight Normalization
◦ Decimal Scaling Method For Normalization
- It normalizes by moving the decimal point of the values of the data.
- To normalize the data by this technique, we divide each value of the data by the maximum absolute value of the data.
- The data value vi is normalized to v'i by using the formula
    v'i = vi / 10^j
- where j is the smallest integer such that max(|v'i|) < 1.

In this technique, the computation is scaled in terms of decimals; that is, each value is scaled (divided) by a power of ten, 10^j.

Example:
- Let the input data be: -15, 121, 201, 421, 561, 601, 850
- To normalize the above data:
- Step 1: Maximum absolute value in the given data: 850
- Step 2: Divide the given data by 1000 (i.e. j = 3)
- Result: The normalized data is: -0.015, 0.121, 0.201, 0.421, 0.561, 0.601, 0.85
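
A minimal sketch of decimal scaling in Python (added here for illustration; the function name and driver code are our own, not from the slides):

# Decimal scaling: divide every value by 10^j, where j is the smallest
# integer such that max(|v'|) < 1.
def decimal_scaling(values):
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values]

data = [-15, 121, 201, 421, 561, 601, 850]
print(decimal_scaling(data))
# [-0.015, 0.121, 0.201, 0.421, 0.561, 0.601, 0.85]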
◦ Min-Max Normalization
- In this technique of data normalization, a linear transformation is performed on the original data.
- The minimum and maximum values of the data are fetched and each value is replaced according to the following formula:
    v' = ((v - min(A)) / (max(A) - min(A))) * (new_max(A) - new_min(A)) + new_min(A)
- Where A is the attribute,
- min(A), max(A) are the minimum and maximum values of A respectively,
- v is the old value of each entry in the data,
- v' is the new value of each entry in the data,
- new_min(A), new_max(A) are the minimum and maximum values of the target range (i.e. the boundary values of the required range) respectively.

Example: if we normalize the Marks column into the range 0 to 1, we get the following:

  Roll No   Marks   Normalized Marks
  1         10      0
  2         15      0.1
  3         50      0.8
  4         60      1
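
A minimal sketch of min-max normalization reproducing the Marks example above (illustrative Python, not from the slides):

# Min-max normalization: linearly rescale values into [new_min, new_max].
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    old_min, old_max = min(values), max(values)
    return [
        (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
        for v in values
    ]

marks = [10, 15, 50, 60]
print(min_max_normalize(marks))   # [0.0, 0.1, 0.8, 1.0]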
◦ z-Score Normalization (zero-mean Normalization)
- In this technique, values are normalized based on the mean and standard deviation of the data A.
- It is also called the Standard Deviation method.
- The data can be normalized using the z-score, computed with the formula below:
    v' = (v - μ) / σ
- where μ is the mean and σ is the standard deviation of A,
- v is the old value of each entry in the data,
- v' is the z-score-normalized value of each entry in the data.
Example: the mean is 33.75 and the (sample) standard deviation is approximately 24.96:

  Roll No   Marks   z-score
  1         10      -0.951587303
  2         15      -0.751253134
  3         50       0.651086049
  4         60       1.051754387
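
A minimal sketch of z-score normalization reproducing the table above (illustrative Python; the slides' figures correspond to the sample standard deviation, i.e. the n - 1 denominator):

import statistics

# z-score: subtract the mean and divide by the standard deviation.
def z_score_normalize(values):
    mean = statistics.mean(values)
    std = statistics.stdev(values)   # sample standard deviation (n - 1)
    return [(v - mean) / std for v in values]

marks = [10, 15, 50, 60]
print([round(z, 4) for z in z_score_normalize(marks)])
# [-0.9516, -0.7513, 0.6511, 1.0518]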
z-Score Normalization (zero-mean Normalization)
The normal distribution is a probability function that describes how the
values of a variable are distributed.
No matter what μ and σ are,
the area between μ-σ and μ+σ is about 68%;
the area between μ-2σ and μ+2σ is about 95%; and
the area between μ-3σ and μ+3σ is about 99.7%.
Almost all values fall within 3 standard deviations.
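
As a quick numerical check of this 68-95-99.7 rule (the integrals below state it formally), here is a short sketch; the use of scipy.stats is an assumption of this write-up, not part of the slides:

from scipy.stats import norm

# The coverage within k standard deviations is the same for every mu and sigma,
# so checking the standard normal distribution is enough.
for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} standard deviation(s): {coverage:.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973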

∫ from μ-σ to μ+σ of (1/(σ√(2π))) e^(-(x-μ)²/(2σ²)) dx ≈ 0.68

∫ from μ-2σ to μ+2σ of (1/(σ√(2π))) e^(-(x-μ)²/(2σ²)) dx ≈ 0.95

∫ from μ-3σ to μ+3σ of (1/(σ√(2π))) e^(-(x-μ)²/(2σ²)) dx ≈ 0.997
 Data aggregation is any process in which data is brought together and
conveyed in a summary form. It is typically used prior to the performance
of a statistical analysis.
 Combining two or more attributes (or objects) into a single attribute (or
object). Data aggregation is an element of business intelligence (BI)
solutions.
 Data aggregation generally works on big data or data marts that do not provide enough information value as a whole.
 Data aggregation is useful for everything from finance or business strategy
decisions to product, pricing, operations, and marketing strategies.
 Purpose
◦ Data reduction
 Reduce the number of attributes or objects
◦ Change of scale
 Cities aggregated into regions, states, countries, etc.
 Days aggregated into weeks, months, or years
◦ More “stable” data
 Aggregated data tends to have less variability
Examples of Data Aggregation
Companies often collect data from their online customers and website visitors.
For example, I use Google Analytics to see where my users are from, what kind of content they like, and so on.
For example,
• Google collects data in the form of cookies to show targeted advertisements to its users.
• Facebook does the same thing by collecting and analyzing information and showing ads to its users.
In marketing: you can aggregate your data from a particular campaign, looking at how it performed over time and with specific cohorts.

The retail industry: retailers must continuously gather fresh information about their competitors' product offerings, promotions, and prices.

The travel industry: uses include competitive price monitoring, competitor research, gaining market intelligence, customer sentiment analysis, and capturing images and descriptions for the services on their online travel sites.

The healthcare industry: data aggregation can help maintain transparency and trust between healthcare providers and patients.
 Time aggregation
◦ Aggregating data points for a single resource over a specified period.
 Spatial aggregation
◦ Aggregating data points for a group of resources over a specified period (see the sketch below).
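
A minimal sketch of time and spatial aggregation with pandas (the column names and sample values are illustrative assumptions, not from the slides):

import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=6, freq="D"),
    "region": ["East", "West", "East", "West", "East", "West"],
    "sales": [100, 80, 120, 90, 110, 95],
})

# Time aggregation: roll daily data points up to weekly totals.
weekly = df.resample("W", on="date")["sales"].sum()

# Spatial aggregation: summarize data points per group of resources (regions).
by_region = df.groupby("region")["sales"].agg(["sum", "mean"])

print(weekly)
print(by_region)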
 Aggregating data can be a remarkably manual process, especially if you need it in the early stages.
➢ Go through an Excel sheet and reformat it so it looks like your other data sources.
➢ Then create charts to compare the performance/budget/progress of your multiple analyses.

 If you want to go for the automated process, it typically means implementing third-party software/code/algorithms, sometimes called middleware, that can pull data automatically from your database sources.
 So, manual and automated data aggregation are both possible, depending on your domain's requirements.
 When dimensionality increases, data becomes increasingly sparse in the space that it occupies.
 The curse of dimensionality basically means that the error increases with the increase in the number of features.
 Complexity (running time) increases with the dimension d.
 If we have more features than observations, we run the risk of massively overfitting our model; this would generally result in terrible out-of-sample performance.
 Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful (illustrated in the sketch below).
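
The following small experiment (an illustration added here, not from the slides) shows this effect: as the dimension grows, the gap between the nearest and farthest neighbour of a query point shrinks relative to the nearest distance, so distance-based notions lose contrast.

import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.random((500, d))          # 500 random points in d dimensions
    query = rng.random(d)                  # a random query point
    dists = np.linalg.norm(points - query, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d = {d:5d}   relative distance contrast = {contrast:.3f}")
# The contrast shrinks toward 0 as d increases.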
 Dimensionality reduction is a method of converting high-dimensional variables into lower-dimensional variables while retaining the essential information in the data.
 Dimensionality reduction is used to reduce the feature space to a set of principal features.
 Purpose:
◦ Avoid curse of dimensionality
◦ Reduce amount of time and memory required by data mining algorithms
◦ Allow data to be more easily visualized
◦ May help to eliminate irrelevant features or reduce noise

 Techniques
◦ Principal Components Analysis (PCA)
◦ Singular Value Decomposition
◦ Others: supervised and non-linear techniques
 Feature selection
 Feature Extraction (reduction)
A process that chooses an optimal subset of features according to an objective function
 Objectives
◦ To reduce dimensionality and remove noise
◦ To improve mining performance
 Speed of learning
 Predictive accuracy
 Simplicity and comprehensibility of mined results
 Another way to reduce dimensionality of data
 Redundant features
◦ Duplicate much or all of the information contained in
one or more other attributes
◦ Example: purchase price of a product and the amount of
sales tax paid
 Irrelevant features
◦ Contain no information that is useful for the data mining
task at hand
◦ Example: students' ID is often irrelevant to the task of
predicting students' GPA
 Many techniques developed, especially for
classification
 Create new attributes that can capture the
important information in a data set much
more efficiently than the original attributes

 Three general methodologies:


◦ Feature extraction
 Example: extracting edges from images
◦ Feature construction
 Example: dividing mass by volume to get density
◦ Mapping data to new space
 Example: Fourier and wavelet analysis
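
A small sketch of feature construction and of mapping data to a new space (illustrative Python with NumPy; the signal and attributes are made-up examples, not from the slides):

import numpy as np

# Feature construction: divide mass by volume to get density.
mass = np.array([10.0, 20.0, 30.0])
volume = np.array([2.0, 5.0, 10.0])
density = mass / volume                    # a new, more informative attribute

# Mapping to a new space: describe a signal by its frequency content (Fourier).
t = np.linspace(0, 1, 256, endpoint=False)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)
spectrum = np.abs(np.fft.rfft(signal))
print(density)                             # [5. 4. 3.]
print(np.sort(spectrum.argsort()[-2:]))    # dominant frequency bins: [ 5 20 ]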
 Feature reduction refers to the mapping of the
original high-dimensional data onto a lower
dimensional space
 Given a set of data points {x1, x2, ..., xn} in d dimensions, compute their low-dimensional representation:
xi ∈ R^d → yi ∈ R^p (p << d)
The criterion for feature reduction can differ based on the problem setting:
◦ Unsupervised setting: minimize the information loss
◦ Supervised setting: maximize the class discrimination
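
A minimal sketch of this mapping using PCA from scikit-learn (PCA follows the unsupervised criterion of keeping the directions that preserve the most variance; the random data is purely illustrative, not from the slides):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))         # n = 100 points in d = 10 dimensions

pca = PCA(n_components=2)              # target dimension p = 2 (p << d)
Y = pca.fit_transform(X)               # the low-dimensional representation y_i

print(Y.shape)                         # (100, 2)
print(pca.explained_variance_ratio_)   # variance retained by each component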
 Feature reduction
◦ All original features are used
◦ The transformed features are linear combinations
of the original features
 Feature selection
◦ Only a subset of the original features are selected
 Filter model
◦ Separating feature selection from classifier learning
◦ Relying on general characteristics of data (information,
distance, dependence, consistency)
◦ No bias toward any learning algorithm, fast
 Wrapper model
◦ Relying on a predetermined classification algorithm
◦ Using predictive accuracy as goodness measure
◦ High accuracy, computationally expensive
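
A hedged sketch contrasting the two models with scikit-learn (the dataset and the choice of estimator are illustrative assumptions, not from the slides):

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Filter model: rank features by a general statistic (ANOVA F-score),
# independently of any learning algorithm - fast, no learner bias.
filter_sel = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print("Filter keeps features:", filter_sel.get_support(indices=True))

# Wrapper model: recursive feature elimination driven by a predetermined
# classifier - usually more accurate but computationally more expensive.
wrapper_sel = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
print("Wrapper keeps features:", wrapper_sel.get_support(indices=True))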
 Typical Error Matrix (columns = Reference Data, rows = Classified Data):

                             Reference Data
                             Class A            Class B
 Classified Data   Class A   TRUE POSITIVE      FALSE POSITIVE
                   Class B   FALSE NEGATIVE     TRUE NEGATIVE

• Diagonals represent sites classified correctly according to reference data.
• Off-diagonals were misclassified.
➢ Overall Accuracy essentially tells us, out of all of the reference sites, what proportion were mapped correctly.
Overall Accuracy = (TP+TN)/(TP+TN+FP+FN)

➢ Individual Class Accuracy is calculated by dividing the number of correctly classified pixels in each category by either the total number of pixels in the corresponding column (Producer's accuracy) or the corresponding row (User's accuracy).

➢ Producer's Accuracy is the map accuracy from the point of view of the map maker (the producer).
Producerʼs Accuracy (Class A) = TP/(TP+FN)
Producerʼs Accuracy (Class B) = TN/(FP+TN)
➢ User's Accuracy is the accuracy from the point of view of a map user, not the map maker. The User's accuracy essentially tells us how often the class on the map will actually be present on the ground. This is referred to as reliability.
Userʼs Accuracy (Class A) = TP/(TP+FP)
Userʼs Accuracy (Class B) = TN/(TN+FN)
Overall Accuracy = (TP+TN)/(TP+TN+FP+FN)
Producerʼs Accuracy (Class A) = TP/(TP+FN)
Producerʼs Accuracy (Class B) = TN/(FP+TN)
Userʼs Accuracy (Class A) = TP/(TP+FP)
Userʼs Accuracy (Class B) = TN/(TN+FN)
Accuracy on preceding slide:
• Overall Accuracy = 92.4%
• Producerʼs Accuracy (Class A) = 89.9%
• Producerʼs Accuracy (Class B) = 94.7%
• Userʼs Accuracy (Class A) = 94.2%
• Userʼs Accuracy (Class B) = 90.7%
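
A minimal sketch that computes these measures for a two-class error matrix (the TP/FP/FN/TN counts below are made-up illustrative values, not the ones behind the percentages above):

# Accuracy measures derived from a 2x2 error matrix.
def accuracy_measures(tp, fp, fn, tn):
    return {
        "overall":    (tp + tn) / (tp + tn + fp + fn),
        "producer_A": tp / (tp + fn),   # column-wise (reference class A)
        "producer_B": tn / (fp + tn),   # column-wise (reference class B)
        "user_A":     tp / (tp + fp),   # row-wise (classified as A)
        "user_B":     tn / (tn + fn),   # row-wise (classified as B)
    }

print(accuracy_measures(tp=50, fp=5, fn=10, tn=85))
# {'overall': 0.9, 'producer_A': 0.833..., 'producer_B': 0.944...,
#  'user_A': 0.909..., 'user_B': 0.894...}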
