
APEX INSTITUTE OF TECHNOLOGY

DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

Machine Learning (21CSH-286)


Faculty: Prof. (Dr.) Madan Lal Saini(E13485)

Lecture - 7
Data Transformation, Normalization,
Dimensionality Reduction
Machine Learning: Course Objectives
COURSE OBJECTIVES
The Course aims to:
1. Understand and apply various data handling and visualization techniques.
2. Understand basic learning algorithms and techniques and their applications, as well as general questions related to analysing and handling large data sets.
3. Develop skills in supervised and unsupervised learning techniques and implement them to solve real-life problems.
4. Develop basic knowledge of machine learning techniques to build an intelligent machine for making decisions on behalf of humans.
5. Develop skills for selecting an algorithm and model parameters, and apply them to design optimized machine learning applications.
COURSE OUTCOMES

On completion of this course, the students shall be able to:

CO1: Describe and apply various data pre-processing and visualization techniques on datasets.
CO2: Understand basic learning algorithms and analyse their applications, as well as general questions related to analysing and handling large data sets.
CO3: Describe machine learning techniques to build an intelligent machine for making decisions on behalf of humans.
CO4: Develop supervised and unsupervised learning techniques and implement them to solve real-life problems.
CO5: Analyse the performance of a machine learning model and apply optimization techniques to improve it.
Unit-1 Syllabus

Unit-1 Data Pre-processing Techniques


Data Pre-Processing: Data Frame Basics, CSV File, Libraries for Pre-processing, Handling Missing Data, Encoding Categorical Data, Feature Scaling, Handling Time Series Data.

Feature Extraction: Dimensionality Reduction: Feature Selection Techniques, Feature Extraction Techniques; Data Transformation, Data Normalization.

Data Visualization: Different types of plots, Plotting fundamentals using Matplotlib, Plotting fundamentals using Seaborn.
SUGGESTIVE READINGS
TEXT BOOKS:
• T1: Tom M. Mitchell, “Machine Learning”, McGraw Hill, International Edition, 2018.
• T2: Ethem Alpaydin, “Introduction to Machine Learning”, Eastern Economy Edition, Prentice Hall of India, 2015.
• T3: Andreas C. Müller, Sarah Guido, “Introduction to Machine Learning with Python”, O’Reilly, 2018.

REFERENCE BOOKS:
• R1: Sebastian Raschka, Vahid Mirjalili, “Python Machine Learning”, Packt Publishing, 2019.
• R2: Aurélien Géron, “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow”, O’Reilly, 2nd Edition, 2019.
• R3: Christopher Bishop, “Pattern Recognition and Machine Learning”, Springer, 2016.
Data Transformation
• Data Transformation
• Data transformation is one of the fundamental steps of data processing. When I first learnt feature scaling, the terms scale, standardise, and normalise were often used interchangeably, and it was hard to find guidance on which of them to use and when.
• What Does Feature Scaling Mean?
• In practice, we often encounter different types of variables in the same dataset. A significant issue is that the ranges of these variables may differ a lot. Using the original scales may put more weight on the variables with a large range. To deal with this problem, we rescale the independent variables (features) of the data during pre-processing. The terms normalisation and standardisation are sometimes used interchangeably, but they usually refer to different things.
Data Transformation
Consider a dataset that contains a dependent variable (Purchased) and three independent variables (Country, Age, and Salary). We can easily notice that the variables are not on the same scale: Age ranges from 27 to 50, while Salary ranges from 48 K to 83 K. The range of Salary is much wider than the range of Age. This causes issues for many machine learning models, such as k-means clustering and nearest-neighbour classification, which are based on the Euclidean distance.
Focusing on Age and Salary: when we calculate the Euclidean distance, the Salary term (x2 − x1)² is much bigger than the Age term (y2 − y1)², which means the distance will be dominated by Salary if we do not apply feature scaling; the difference in Age contributes very little. Therefore, we should use feature scaling to bring all values to the same magnitude and thus solve this issue. There are primarily two methods for doing so: Standardisation and Normalisation. A small sketch of the problem follows.
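To make this concrete, here is a minimal Python sketch, using hypothetical Age/Salary values (the slide's dataset is not reproduced here), showing how Salary dominates the raw Euclidean distance and how standardisation balances the two features:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical two-person dataset: columns are [Age, Salary].
X = np.array([[27.0, 48000.0],
              [50.0, 83000.0]])

# Raw distance: the salary gap (35000) swamps the age gap (23).
print(np.linalg.norm(X[0] - X[1]))                 # ~35000.0

# After standardisation, both features contribute equally.
X_scaled = StandardScaler().fit_transform(X)
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))   # ~2.83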
Data Transformation
• Standardisation
Standardisation (also called Z-score normalisation) rescales a feature so that it has a mean of 0 and a standard deviation of 1:
X_stand = (X − mean(X)) / std(X)
Normalization
• Max-Min Normalization
Another common approach is the so-called Max-Min Normalization (Min-Max scaling). This technique re-scales a feature to a distribution with values between 0 and 1: for every feature, the minimum value gets transformed into 0, and the maximum value gets transformed into 1. The general equation is shown below:

X_norm = (X − X_min) / (X_max − X_min)

• Standardisation vs Max-Min Normalization

• In contrast to standardisation, we obtain smaller standard deviations through Max-Min Normalisation. Let me illustrate this using the above dataset, with a small sketch below.
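A minimal sketch of the comparison, assuming a small hypothetical Age/Salary table (scikit-learn's StandardScaler and MinMaxScaler stand in for the two methods):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data: columns are [Age, Salary].
X = np.array([[27, 48000],
              [35, 54000],
              [41, 61000],
              [50, 83000]], dtype=float)

for name, scaler in [("Standardisation", StandardScaler()),
                     ("Max-Min Normalisation", MinMaxScaler())]:
    X_t = scaler.fit_transform(X)
    print(name, "-> std devs:", X_t.std(axis=0))
# Standardised columns have std dev 1.0; max-min-normalised columns have
# smaller std devs, i.e. the values are more concentrated.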
Normalization
[Figure: distribution plots of Age and Salary after Standardisation and after Max-Min Normalisation]
Normalization

• From the above graphs, we can clearly see that applying Max-Min Normalisation to our dataset produces smaller standard deviations (for Salary and Age) than the Standardisation method. This implies the data are more concentrated around the mean when scaled with Max-Min Normalisation.
• As a result, if you have outliers in a feature (column), normalising your data will scale most of it into a small interval: all features end up on the same scale, but the outliers are not handled well. Standardisation is more robust to outliers, and in many cases it is preferable to Max-Min Normalisation.
Dimensionality Reduction
• What is Dimensionality Reduction?
In machine learning classification problems, there are often too many
factors on the basis of which the final classification is done. These
factors are basically variables called features. The higher the number of
features, the harder it gets to visualize the training set and then work
on it. Sometimes, most of these features are correlated, and hence
redundant. This is where dimensionality reduction algorithms come
into play. Dimensionality reduction is the process of reducing the
number of random variables under consideration, by obtaining a set of
principal variables. It can be divided into feature selection and feature
extraction.

Dimensionality Reduction

Components of Dimensionality Reduction

There are two components of dimensionality reduction (a small sketch follows this list):
• Feature selection: we try to find a subset of the original set of variables, or features, to get a smaller subset which can be used to model the problem. It usually involves one of three approaches:
  • Filter
  • Wrapper
  • Embedded
• Feature extraction: this reduces the data in a high-dimensional space to a lower-dimensional space, i.e. a space with a smaller number of dimensions.
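As a sketch of the difference, the snippet below (using scikit-learn's built-in Iris data purely as an illustrative stand-in) selects a subset of the original features with a filter method and extracts new features with PCA:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)            # 150 samples, 4 features

# Feature selection (filter): keep the 2 original columns that score best.
X_sel = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Feature extraction: build 2 new features as combinations of all 4.
X_pca = PCA(n_components=2).fit_transform(X)

print(X.shape, X_sel.shape, X_pca.shape)     # (150, 4) (150, 2) (150, 2)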
Dimensionality Reduction
Methods of Dimensionality Reduction
• The various methods used for dimensionality reduction include:
• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
• Generalized Discriminant Analysis (GDA)
• Dimensionality reduction may be either linear or non-linear, depending upon the method used. The prime linear method, called Principal Component Analysis, or PCA, is discussed below; a brief usage sketch of these methods follows.
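A brief usage sketch of two of the listed methods in scikit-learn (Iris again serves as a stand-in dataset; GDA is omitted as it has no direct scikit-learn class):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

X_pca = PCA(n_components=2).fit_transform(X)      # unsupervised: ignores y
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # supervised
print(X_pca.shape, X_lda.shape)                   # (150, 2) (150, 2)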
Principal Component Analysis
The main idea of principal component analysis (PCA) is to reduce the
dimensionality of a data set consisting of many variables correlated
with each other, either heavily or lightly, while retaining the variation
present in the dataset, up to the maximum extent. The same is done by
transforming the variables to a new set of variables, which are known
as the principal components (or simply, the PCs) and are orthogonal,
ordered such that the retention of variation present in the original
variables decreases as we move down in the order. So, in this way, the
1st principal component retains the maximum variation that was present in the original variables. The principal components are the eigenvectors of the covariance matrix, and hence they are orthogonal.

How PCA works?
Step 1: Normalize the data
The first step is to normalize the data we have so that PCA works properly. This is done by subtracting the respective mean from each number in its column. So if we have two dimensions X and Y, all X become x − x̄ and all Y become y − ȳ. This produces a dataset whose mean is zero.
Step 2: Calculate the covariance matrix
Since the dataset we took is 2-dimensional, this will result in a 2x2
Covariance matrix.

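A minimal numpy sketch of Steps 1 and 2, using a small hypothetical 2-D dataset:

import numpy as np

# Hypothetical dataset: columns are the two dimensions X and Y.
data = np.array([[2.5, 2.4],
                 [0.5, 0.7],
                 [2.2, 2.9],
                 [1.9, 2.2],
                 [3.1, 3.0]])

centered = data - data.mean(axis=0)       # Step 1: subtract the column means
cov = np.cov(centered, rowvar=False)      # Step 2: 2x2 covariance matrix
print(cov)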
How PCA works?
Step 3: Calculate the eigenvalues and eigenvectors
Next step is to calculate the eigenvalues and eigenvectors for the
covariance matrix. This is possible because the covariance matrix is a square matrix. λ is an eigenvalue of a matrix A if it is a solution of the characteristic equation:
det(λI − A) = 0
where I is the identity matrix of the same dimension as A (a required condition for the matrix subtraction) and det is the determinant of the matrix. For each eigenvalue λ, a corresponding eigenvector v can be found by solving:
(λI − A)v = 0
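Continuing the sketch, Step 3 in numpy (np.linalg.eigh is used because a covariance matrix is symmetric):

import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
cov = np.cov(data - data.mean(axis=0), rowvar=False)

eigenvalues, eigenvectors = np.linalg.eigh(cov)   # solves det(λI − A) = 0
print(eigenvalues)     # returned in ascending order
print(eigenvectors)    # one eigenvector per column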
How PCA works?
Step 4: Choosing components and forming a feature vector:
We order the eigenvalues from largest to smallest, which gives us the components in order of significance. Here comes the dimensionality reduction part. If we have a dataset with n variables, then we have n corresponding eigenvalues and eigenvectors. It turns out that the eigenvector corresponding to the highest eigenvalue is the principal component of the dataset, and it is our call how many eigenvalues we choose to proceed with. To reduce the dimensions, we choose the first p eigenvalues and ignore the rest. We do lose some information in the process, but if the discarded eigenvalues are small, we do not lose much.
Next we form a feature vector, which is a matrix of vectors: in our case, only those eigenvectors we want to proceed with. Since we have just 2 dimensions in the running example, we can either choose the eigenvector corresponding to the greater eigenvalue or simply take both:
Feature Vector = (eig1, eig2)
How PCA works?
Step 5: Forming Principal Components:
This is the final step, where we actually form the principal components using all the math we did so far. We take the transpose of the feature vector and left-multiply it by the transpose of the scaled (mean-centered) version of the original dataset:
NewData = FeatureVectorT x ScaledDataT
Here,
NewData is the matrix consisting of the principal components,
FeatureVector is the matrix of the eigenvectors we chose to keep, and
ScaledData is the scaled version of the original dataset.
(‘T’ in the superscript denotes the transpose of a matrix, which is formed by interchanging rows and columns. In particular, a 2x3 matrix has a transpose of size 3x2.) A numpy sketch of Steps 4 and 5 follows.
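Continuing the sketch, Steps 4 and 5: rank the eigenpairs, keep the top p, and project the centered data, reproducing NewData = FeatureVectorT x ScaledDataT:

import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
scaled = data - data.mean(axis=0)                 # ScaledData (mean-centered)
eigvals, eigvecs = np.linalg.eigh(np.cov(scaled, rowvar=False))

order = np.argsort(eigvals)[::-1]                 # Step 4: largest eigenvalue first
p = 1                                             # keep one principal component
feature_vector = eigvecs[:, order[:p]]            # chosen eigenvectors as columns

new_data = feature_vector.T @ scaled.T            # Step 5: 1 x 5 matrix of PCs
print(new_data)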
Questions?

• How Do You Handle Missing or Corrupted Data in a Dataset?

• How Can You Choose a Classifier Based on a Training Set Data Size?

• What Are the Three Stages of Building a Model in Machine Learning?

• What Are the Different Types of Machine Learning?

• What Are the ‘Training Set’ and ‘Test Set’ in a Machine Learning Model? How Much Data Will You Allocate for Your Training, Validation, and Test Sets?
References
Book:
• Ethem Alpaydin, “Introduction to Machine Learning”, Eastern Economy Edition, Prentice Hall of India, 2015.
• Andreas C. Müller, Sarah Guido, “Introduction to Machine Learning with Python”, O’Reilly, 2018.
Research Paper:
• Bi, Qifang, et al. "What is machine learning? A primer for the epidemiologist." American journal of
epidemiology 188.12 (2019): 2222-2239.
• Jordan, Michael I., and Tom M. Mitchell. "Machine learning: Trends, perspectives, and
prospects." Science 349.6245 (2015): 255-260.
Websites:
• https://www.geeksforgeeks.org/machine-learning/
• https://www.javatpoint.com/machine-learning
Videos:
• https://www.youtube.com/playlist?list=PLIg1dOXc_acbdJo-AE5RXpIM_rvwrerwR
THANK YOU

For queries
Email: [email protected]
