
4

Introduction to Machine Learning

4.1 Introduction to Machine Learning


Nowadays most computer systems are fed with large amounts of relevant data generated from a particular problem domain [1]. Artificial intelligence is considered a major part of making computers understand that data [2]. Machine learning is a subset of artificial intelligence that specifies a set of algorithms to help computers learn from the data automatically, without any human intervention [3].
The main theme behind using machine learning is to feed machines with data and specify features so that they can understand it and adapt to new data without explicit programming [4]. The computer observes changes in a new data set, identifies the patterns, and learns their behavior in order to make predictions [5].

Role of Machine Learning in Data Science


Many concepts of data science, such as analysis of data, extraction of data features, and business decision making, are automated and performed by machine learning and artificial intelligence [6].
Large chunks of data are analyzed automatically by machine learning [7]. It performs data analysis and makes predictions on real-time data without any human intervention [8]. Machine learning algorithms have become part of the data science life cycle, since they automate the building of data sets, predict further changes in the data automatically, and train the machine for further processing [9].
The machine learning process starts by feeding in the data to be analyzed for specific features in order to build a data model [10]. The data model is then trained to generate new conclusions using a machine learning algorithm, and it subsequently performs predictions for newly uploaded data sets [11].

Kolla Bhanu Prakash. Data Science Handbook: A Practical Approach, (97–122) © 2022 Scrivener
Publishing LLC


Steps of Machine Learning in the Data Science Lifecycle


[Figure: The machine learning process - (1) Get Data, (2) Clean, Prepare & Manipulate Data, (3) Train Model, (4) Test Data, (5) Improve.]

• Collection of Data
The primary step of machine learning is the collection of data from the real-time domain area where the problem occurs. The collected data should be reliable and relevant so as to improve its quality [12].
• Preparation of Data
Preparation of data begins with data cleaning, which makes the data ready for analysis. Unwanted and error-prone data points are removed from the data set, all data is converted into a standard format, and the data is then partitioned into two parts, one for training and the other for performance evaluation [13].
• Model Training
The part of the data set reserved for training helps in output value prediction. The predicted output shows considerable deviation from the expected desired value in the first iteration [14]. The epochs, or iterations, are repeated while making adjustments to the initial values, and the prediction accuracy on the training data increases incrementally.
• Evaluation Model
The rest of the data, which is not used for training the model, is used for performance evaluation [15]. Testing the model against this held-out data estimates how well the data model will provide effective solutions for real-time problems.

• Prediction
After the training and evaluation of the data model are complete, it is time to deploy the model in real-time environments and improve its accuracy by parameter tuning. Once deployed, the model needs to learn from new data and predict accurate outputs to answer new questions.

Machine Learning Techniques for Data Science


When you have a dataset, you can classify the problem into three types:

• Regression
• Classification
• Clustering

1) Regression
Regression is used when the output variables lie in a continuous space. It follows the curve-fitting methodology of mathematics: it fits the data to the equation of a curve and predicts the output value. Linear regression, neural networks and the perceptron are popular implementations of regression mechanisms. Many financial institutions, such as stock markets, try to predict the growth of the investments made by shareholders, and rental brokers use prediction of house prices in a given location to manage their real estate business.
2) Classification
Classification is the process of managing output variables that are discrete, and it is meant for identifying the categories of data. Most classification algorithms process the data and divide it into categories; it is like finding the different categories of curves that fit the data points. An example scenario is labeling emails as spam in Gmail, a classification problem in which different factors of an email are checked and the email is categorized as spam when roughly 80%-90% of the anomaly patterns match. Naïve Bayes, K-Nearest Neighbors, support vector machines, neural networks and logistic regression are popular examples of classification algorithms.

3) Clustering
Grouping unlabeled data that has similar features is the mechanism of clustering. Similarity functions are used to group the data points with similar characteristics; dissimilar features distinguish the different clusters from one another, and unique patterns can be identified among unlabeled data sets. K-means and agglomerative clustering are popular examples, and customer purchases can be categorized using clustering techniques.
Supervised learning models: Regression and Classification.
Unsupervised learning model: Clustering.

Popular Real-Time Use Case Scenarios of Machine Learning in Data Science

Machine learning has had its roots of implementation for many years, even without our knowledge of using it in daily life. Many popular industry sectors, from finance to entertainment, apply machine learning techniques to manage their tasks effectively. Popular mobile apps like Google Maps and Amazon online shopping use machine learning in the background to respond to users with relevant information. Some of the popular real-time scenarios where machine learning is used with data science are as follows:

• Fraud Detection
Banking sectors implement machine learning algorithms to detect fraudulent transactions and ensure customer safety. Popular machine learning algorithms are used to train the system to identify transactions with suspicious features, so that faulty transaction patterns are detected in no time while an authorized customer performs normal transactions. The huge amount of daily transactional data is thus used to train the machine learning model to detect fraud in time and keep customers safe while they use online banking services.
• Speech Recognition
Popular assistants like Alexa, Siri and Google Assistant work on machine learning mechanisms, along with natural language processing, to respond to their users instantly by listening to their audio. A large amount of audio input is used to train the system on the different accents of users and to prepare the responses.
• Online Recommendation Engines
Most recommendation systems are built using machine learning to automatically track customer interests while they shop online, query search engines for relevant information, or browse websites for gaming. The behavioral characteristics of consumers are tracked by machine learning mechanisms, which provide better suggestions for the business domain to improve its features and attract customers. Popular applications such as Amazon shopping track customer interests and show only the specific products the customer is interested in, YouTube delivers relevant video results based on user interest, and Facebook makes better friend suggestions, all by using efficiently trained machine learning models.

4.2 Types of Machine Learning Algorithms


Machine learning (ML) algorithms are of three types:

1. Supervised Learning Algorithms:

It uses a mapping function f that maps labeled training data for an input variable X to an output variable Y. In simple terms it solves the following equation:

Y = f(X)

The above equation generates accurate outputs for given new inputs.
Classification and Regression are the two ML mechanisms that come under supervised learning.
Classification is an ML mechanism that predicts, for the sample data, the output variable in the form of categories. For example, from the sample data of a patient's health record and symptoms, classification tries to categorize the profile by labeling it as either "sick" or "healthy".

Regression is an ML mechanism that predicts, for the sample data, the output variable in the form of real values. For example, most regression models work on predicting the weather report, such as the intensity of rainfall for a particular year, based on the available factors of sample data on different weather conditions.
Popular algorithms like linear and logistic regression, Naïve Bayes, CART and KNN are of the supervised learning type.
Ensembling is a newer type of ML mechanism in which two or more popular algorithms are used for training, and all of their appropriate features are used to predict accurately on the sample data. Random Forest bagging and XGBoost boosting are popular ensemble techniques.
2. Unsupervised Learning Algorithms:
Learning models that process the input variable X without relating it to any specific output variable are called unsupervised learning. Most unsupervised learning works on unlabeled data without any specific structure defined for it.
Three important techniques come under unsupervised learning: (i) Association, (ii) Clustering, (iii) Dimensionality reduction.
Association is a technique that correlates the occurrence of items in a specific collection. Market basket analysis is a good example: it correlates the purchases made by customers, for instance, a customer who visits the grocery store to buy bread is 80% likely to also purchase eggs.
Clustering is a technique of grouping input variables with similar features from a given sample of data. It tries to find specific criteria for grouping the sample data and for differentiating each cluster from the others.
Dimensionality reduction is a technique of choosing specific criteria to reduce the input data sample while conveying the information relevant to the problem solution. Selecting specific input variables that satisfy the criteria is the mechanism of feature selection; similarly, extracting sample data that fits the solution is known as feature extraction. Thus feature selection performs the selection of specific input variables satisfying the criteria for the solution, and feature extraction simplifies the data collection to suit the solution space.
Popular algorithms like Apriori, K-means and PCA come under these unsupervised learning techniques.
3. Reinforcement Learning:
This is the learning model in which an agent makes decisions, choosing the best action based on its current learning behavior to improve a reward value. It aims to provide the optimal solution to the problem space, obtaining a better gain by performing appropriate actions. Most automated solutions use this mechanism to improve toward an optimal solution. For example, in a gaming application the reinforcement learning mechanism is applied to a player object which initially learns the game by moving randomly to gain points, but slowly finds an optimal way of gaining points with appropriate moves so as to achieve the maximum points within an optimal time.

1. Linear Regression
Most algorithms in machine learning quantify the relationship between the input variable (x) and the output variable (y) with a specific function. In linear regression the equation y = f(x) = a + bx is used to establish the relationship between x and y, where a and b are the coefficients to be evaluated: 'a' represents the intercept and 'b' the slope of the straight line. Fig 4.1 shows the plotted values of random points (x, y) of a particular data set.

[Fig 4.1 Plot of the points for the equation y = a + bx: b is the slope of the regression line, and the distance from the line to a typical data point is the "error" between the line and that y value.]

The major objective is to construct a straight line that is nearest to all the random points. The error value is computed for each point as its distance from the line at its y value.
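To make this concrete, here is a minimal sketch (not from the book) that fits y = a + bx with scikit-learn on synthetic data; the data values and variable names are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data around y = 2 + 3x with some noise (illustrative values)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(100, 1))
y = 2 + 3 * x[:, 0] + rng.normal(0, 1, size=100)

model = LinearRegression().fit(x, y)     # minimizes the squared error to the line
print("intercept a:", model.intercept_)  # close to 2
print("slope b:", model.coef_[0])        # close to 3
print("prediction at x=5:", model.predict([[5.0]])[0])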

2. Logistic Regression
Predictions made by linear regression are on data of a continuous type, such as rainfall in cm for a given location, whereas predictions made by logistic regression are on data of a discrete type, such as the number of students who passed or failed a given exam, obtained by applying a transformation function.
Logistic regression is used for binary classification, where the data sets denote y in two classes, either 0 or 1. Most event predictions have only two possibilities: either the event occurs, denoted by 1, or it does not, denoted by 0. For instance, a patient's health can be predicted as sick using 1 and as not sick using 0 in the given data set.
The transformation function used for logistic regression is h(x) = 1 / (1 + e^-x); it represents an s-shaped curve.
The output of the logistic expression is in the form of a probability, and its value always ranges from 0 to 1. If the probability of the patient being sick is 0.98, the output is assigned to class 1. Thus the output value is generated by log-transforming the x value with the function h(x) = 1 / (1 + e^-x). A binary classification is then realized by applying a threshold to these probabilities.

[Fig 4.2 Plot of the transformation function h(x) = 1 / (1 + e^-x): an s-shaped curve rising from 0 to 1 as x goes from -8 to 8.]


In Fig 4.2 the binary classification of whether a tumor is malignant or not is computed using the transformation function h(x). The various x-values of the instantaneous tumor data are mapped to outputs ranging between 0 and 1. Any data point whose output crosses the horizontal line shown is over the threshold and is classified as a malignant tumor.
The logistic expression p(x) = e^(b0 + b1x) / (1 + e^(b0 + b1x)) can be transformed into ln(p(x) / (1 - p(x))) = b0 + b1x. Solving for the coefficients b0 and b1 with the help of the training data set minimizes the error between the actual and the estimated outcomes. The technique called maximum likelihood estimation can be used to identify the coefficients.
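As an illustration, the following minimal sketch (not from the book) fits a logistic regression with scikit-learn; the tumor-size values and the 0.5 threshold are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative data: tumor size (x) and malignant label (y)
X = np.array([[1.2], [2.3], [3.1], [4.8], [5.9], [7.4], [8.0], [9.5]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print("b0 (intercept):", clf.intercept_[0])
print("b1 (slope):", clf.coef_[0][0])

# Probability h(x) for a new sample, thresholded at 0.5 into class 0 or 1
proba = clf.predict_proba([[6.5]])[0, 1]
print("P(malignant | x = 6.5) =", proba, "-> class", int(proba >= 0.5))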

3. CART
Classification and Regression Trees (CART) are one implementation of decision trees.
A classification and regression tree contains non-terminal (internal) nodes and terminal (leaf) nodes. One of the internal nodes acts as the root node, and all non-terminal nodes act as decision-making nodes on an input variable (x), splitting the node into two branches; this branching stops at the leaf nodes, which give the output variable (y).

[Fig 4.3 Example of CART: the root node tests whether the person is over 30 yrs; if yes, an internal node tests whether the person is married (yes -> mini-van, no -> sports car); if no, the leaf node is sports car.]


Thus these trees act as a path of prediction: walking through the complete path of internal nodes leads to the output result at the terminal node.
Fig 4.3 is an example decision tree that uses CART features to find whether a person will purchase a sports car or a minivan by considering the factors of age and marital status. The decision factors considered at the internal nodes are: if the person is over 30 years old and married, the result is the purchase of a minivan; if the person is not over 30 years old, the result is a sports car; and if the person is over 30 years old and not married, the result is also a sports car.
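The same toy decision can be reproduced with scikit-learn's decision tree; this minimal sketch assumes the two factors are encoded as 0/1 (an illustrative encoding, not from the book).

from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [age_over_30, married] encoded as 0/1; labels: 0 = sports car, 1 = mini-van
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 0, 0, 0]

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=["age_over_30", "married"]))
print(tree.predict([[1, 1]]))  # -> [1], i.e. mini-van for an over-30, married person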

4. Naïve Bayes
Bayes' theorem computes the probability of an event occurring given the evidence that has already occurred in real time. The probability in Bayes' theorem is computed from a given hypothesis (h) and prior knowledge (d):

Pr(h|d) = (Pr(d|h) Pr(h)) / Pr(d)

where:
• Pr(h|d) represents the posterior probability: the probability that hypothesis h is true given the data d. Under the naïve independence assumption, for data with attributes d1, ..., dn,
Pr(h|d) = Pr(d1|h) Pr(d2|h) ... Pr(dn|h) Pr(h) / Pr(d)

Table 4.1 Data set for Naïve Bayes computation.


Weather Play
Sunny No
Overcast Yes
Rainy Yes
Sunny Yes
Sunny Yes
Overcast Yes
Rainy No
Rainy No
Sunny Yes
Rainy Yes
Sunny No
Overcast Yes
Overcast Yes
Rainy No
• Pr(d|h) represents the likelihood: the probability of the data d given that hypothesis h is true.
• Pr(h) represents the class prior probability: the probability of hypothesis h being true (irrespective of any data).
• Pr(d) represents the predictor prior probability: the probability of the data (irrespective of the hypothesis).

This algorithm is called 'naïve' because it assumes that all the variables are independent of each other, which is a naïve assumption to make in real-world examples.
Using the data in Table 4.1 above, what is the outcome if weather = 'sunny'?
To determine whether the outcome is play = 'yes' or 'no' given the value of the variable weather = 'sunny', calculate Pr(yes|sunny) and Pr(no|sunny) and choose the outcome with the higher probability.

Pr(yes|sunny) = (Pr(sunny|yes) * Pr(yes)) / Pr(sunny) = (3/9 * 9/14) / (5/14) = 0.60

Pr(no|sunny) = (Pr(sunny|no) * Pr(no)) / Pr(sunny) = (2/5 * 5/14) / (5/14) = 0.40

Thus, if the weather = 'sunny', the outcome is play = 'yes'.
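The same computation can be reproduced directly from the counts in Table 4.1; the short sketch below (plain Python, not from the book) recovers the 0.60 and 0.40 values.

from collections import Counter

# Table 4.1 as (weather, play) pairs
data = [("Sunny", "No"), ("Overcast", "Yes"), ("Rainy", "Yes"), ("Sunny", "Yes"),
        ("Sunny", "Yes"), ("Overcast", "Yes"), ("Rainy", "No"), ("Rainy", "No"),
        ("Sunny", "Yes"), ("Rainy", "Yes"), ("Sunny", "No"), ("Overcast", "Yes"),
        ("Overcast", "Yes"), ("Rainy", "No")]

play_counts = Counter(play for _, play in data)   # {'Yes': 9, 'No': 5}
weather_counts = Counter(w for w, _ in data)      # {'Sunny': 5, ...}
joint = Counter(data)                             # {('Sunny', 'No'): 2, ...}

def posterior(play, weather):
    # Pr(play | weather) = Pr(weather | play) * Pr(play) / Pr(weather)
    n = len(data)
    likelihood = joint[(weather, play)] / play_counts[play]
    prior = play_counts[play] / n
    evidence = weather_counts[weather] / n
    return likelihood * prior / evidence

print(round(posterior("Yes", "Sunny"), 2))  # 0.6
print(round(posterior("No", "Sunny"), 2))   # 0.4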

5. KNN
The K-Nearest Neighbors algorithm uses a data set in which all the data is treated as training data.
The KNN algorithm works through the entire data set to find the K instances that are nearest to (most similar to) the new record, and then outputs the mean of those instances for a regression problem, or the mode (the most frequent class) for a classification problem, for the specified value of K. Similarity is computed using measures such as Euclidean distance and Hamming distance.
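A minimal classification sketch with scikit-learn (the toy points and k = 3 are illustrative assumptions):

from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D points with two classes
X = [[1, 1], [1, 2], [2, 1], [6, 5], [7, 7], [8, 6]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean").fit(X, y)
print(knn.predict([[2, 2], [7, 6]]))  # -> [0 1], the mode of each point's 3 nearest labels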

Unsupervised learning algorithms


6. Apriori
The Apriori algorithm generates association rules by mining frequent item sets from a transactional database.

For a rule X -> Y:

Support = frq(X, Y) / N

Confidence = frq(X, Y) / frq(X)

Lift = Support / (Supp(X) × Supp(Y))

Fig 4.4 Rule definitions for the support, confidence and lift formulae.

Market basket analysis is a good example: it identifies the products that are purchased more frequently in combination from the available database of customer purchases. An association rule has the form f: X -> Y, meaning that if a customer purchases X then he also purchases item Y.
Example: the association rule stating that a customer who purchases milk and sugar will very likely buy coffee powder can be written as {milk, sugar} -> coffee powder. Association rules are generated whenever the support and confidence cross their thresholds.
Fig 4.4 gives the support, confidence and lift formulae for X and Y. The support measure helps in pruning the number of candidate item sets when generating frequent item sets, as specified by the Apriori principle. The Apriori principle states that if an item set is frequent, then all of its subsets must also be frequent.
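The following minimal sketch (plain Python, with illustrative transactions that are not from the book) computes support, confidence and lift for the {milk, sugar} -> coffee powder rule.

transactions = [
    {"milk", "sugar", "coffee powder"},
    {"milk", "bread"},
    {"milk", "sugar", "coffee powder", "bread"},
    {"sugar", "coffee powder"},
    {"milk", "sugar"},
]
N = len(transactions)

def support(itemset):
    # Fraction of transactions containing every item in itemset
    return sum(itemset <= t for t in transactions) / N

X, Y = {"milk", "sugar"}, {"coffee powder"}
supp_rule = support(X | Y)
confidence = supp_rule / support(X)
lift = supp_rule / (support(X) * support(Y))
print(supp_rule, confidence, lift)  # 0.4, 0.666..., 1.111...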

7. K-means
The K-means algorithm groups similar data into clusters over multiple iterations. It computes the centroid of each of the k clusters and assigns a data point to the cluster whose centroid is at the least distance from that data point.
Working of the K-means algorithm:
Consider the value k = 3. In Fig 4.5 there are 3 clusters, to which each data point is initially assigned at random. The centroid is computed for each cluster; the red, blue and green markers are treated as the centroids of the three clusters. Next, each data point is reassigned to its closest centroid: the top data points are assigned to the blue centroid, and similarly the other nearby data points are grouped to the red and green centroids. Then the centroids are computed for the new clusters: the old centroids turn into gray stars, while the new centroids are shown as red, green and blue stars.
[Fig 4.5 Pictorial representation of the working of the k-means algorithm: initialization, associating each observation to a cluster, recalculating the centroids, and exit of the k-means algorithm.]

Finally, the steps of reassigning data points to the nearest centroid, letting points switch from one cluster to another, and recomputing the centroids are repeated until the centroids are the same in two consecutive steps, and then the algorithm exits.
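A minimal sketch with scikit-learn (the three synthetic groups are illustrative assumptions):

import numpy as np
from sklearn.cluster import KMeans

# Three loose groups of 2-D points
rng = np.random.default_rng(42)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(20, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(20, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(20, 2)),
])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print(km.cluster_centers_)   # final centroids of the 3 clusters
print(km.labels_[:10])       # cluster assignment of the first 10 points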

8. PCA
Principal Component Analysis (PCA) explores and visualizes the data using a smaller number of input variables. The reduction is achieved by re-expressing the data in a new coordinate system whose axes are called "principal components".
Each component is a linear combination of the original variables, and the components are orthogonal to one another. Orthogonality implies that the correlation between components is zero, as shown in Fig 4.6.
The first principal component captures the direction of maximum variance in the data; similarly, the second principal component is computed from the variance remaining in the data beyond what the first component captured. The other principal components are constructed from the remaining variance, uncorrelated with the previous components.
[Fig 4.6 Construction of PCA: the original data space (Gene 1, Gene 2, Gene 3) is projected by PCA onto the component space spanned by PC 1 and PC 2.]

Ensemble learning techniques:

The combination of two or more learning techniques to improve the results, by voting or averaging, is called ensembling. Voting is used for classification and averaging for regression. Ensemblers try to improve the results by combining two or more learners. Bagging, boosting and stacking are three types of ensembling techniques.

9. Bagging with Random Forests

Bagging uses the bootstrap sampling method to create multiple model data sets, where each training data set comprises random subsamples taken from the original data set.
The training data sets are of the same size as the original data set, but some records are repeated multiple times and some are missing, so the entire original data set can be used for testing. If the original data set is of size N, then each generated training set is also of size N, with about 2N/3 unique records, and the test data set is of size N.
The second step in bagging is to build multiple models with the same algorithm on the different generated training sets.
Random forests are the result of the bagging technique. A random forest looks similar to a decision tree, in which each node is split to minimize the error, but in a random forest a randomly selected set of features is used for constructing the best split. The reason for using randomness over a plain decision tree is that multiple data sets are chosen for the random splits. Splitting over a random subset of features means less correlation among the predictions of the resulting subtrees.
The splitting parameter used in a random forest provides a wide variety of features to search over at each split point. Thus bagging results in a random forest construction in which each tree is built from a random sample of records and each split draws on a further random sample of predictors.
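A minimal sketch with scikit-learn's bagged-tree implementation (the iris data and parameter values are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators bootstrap-sampled trees; max_features limits the random feature subset per split
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))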

10. Boosting with AdaBoost

Adaptive boosting is popularly known as AdaBoost. Bagging is an ensemble technique in which a model is built in parallel for each data set, whereas boosting is a sequential ensemble technique in which each new model is constructed based on the misclassifications of the previous model.
Bagging involves a simple voting mechanism, in which each ensemble member votes to obtain the final outcome, and the earlier models are treated in parallel as multiple models when determining the result. In boosting a weighted voting mechanism is used, in which each classifier's vote toward the final outcome carries a weight, and the sequential models are built by assigning greater weights to the data points misclassified by the previous models.
Fig 4.7 gives a graphical illustration of the AdaBoost algorithm, in which each weak learner is a decision stump: a 1-level decision tree that makes a prediction based on the value of one feature, with the root node directly connected to the leaf nodes.
The construction of weak learners continues up to a user-defined number of weak learners, or until training yields no further improvement.

[Fig 4.7 Steps of the AdaBoost algorithm: four panels in the (x1, x2) plane showing the successive decision stumps and the final combined classifier.]


The algorithm finally ends, in step 4, with three decision stumps learned from the previous models and three splitting rules applied.
First, one decision tree stump is trained to make a decision on one input variable. Equal weights are applied to all data points when classifying them as a circle or a triangle, and the decision stump generates a horizontal line in the top half to separate these points. Step 1 of Fig 4.7 clearly shows that two circles are incorrectly predicted as triangles, so higher weights are assigned to these two circles and another decision stump is applied.
In the second splitting rule, the decision stump is made on another input variable. Because the misclassified circles were assigned heavier weights, they are now categorized correctly by the vertical line on the left, but three small circles at the top, which do not carry that heavier weight, are not handled by the current decision stump. Hence higher weights are assigned to these three circles and another stump is trained.
In the third splitting rule, the decision tree stump again makes a decision on another input variable. The three circles misclassified by the second decision stump are given heavier weights, so a vertical line separates them from the rest of the circles and triangles, as shown in the figure.
In the fourth step, all the decision stumps of the previous models are combined to define a complex rule that classifies the data points correctly from these weak learners.
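A minimal sketch with scikit-learn (synthetic data; the default base learner of AdaBoostClassifier is a 1-level decision stump):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Synthetic 2-feature binary classification problem
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=1)

# 50 sequential decision stumps, each reweighting the previous stump's mistakes
ada = AdaBoostClassifier(n_estimators=50, random_state=1)
ada.fit(X, y)
print("training accuracy:", ada.score(X, y))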

Dimensionality Reduction
In machine learning, classification problems very often consider many factors for the final classification. The factors considered for classification are known as variables or features. The more features there are, the harder it becomes to visualize the training set and to work on it. Most of the features are correlated, so the possibility of redundancy is high. Identifying and removing such redundant features from a given training data set is the job of a dimensionality reduction algorithm. Dimensionality reduction is a mechanism in which the number of random variables is reduced based on the principal variables available in a given data set. The major steps involved in dimensionality reduction are feature extraction and feature selection.

Why is Dimensionality Reduction Important in Machine Learning and Predictive Modeling?
A predominant example for understanding dimensionality reduction is classifying the simple e-mail messages we receive in our inbox as spam or not. Many features can be considered for classifying the e-mail messages, such as the subject title, the content, and the use of templates, and some of these features can overlap. Another simple classification example is predicting the humidity and rainfall for a given day: most of the features used are correlated to a high degree, hence we need to reduce the features and then classify. 3-D data classifications are hard to visualize, whereas 2-D data can easily be mapped onto a two-dimensional space and 1-D data can be placed on a straight line.

Components of Dimensionality Reduction

Dimensionality reduction is carried out in two major steps, sketched in the code that follows this list:

• Selection of Features: A subset of the original data set, with specified variables or features, is tried in order to get a minimal data set that can still provide a solution to the problem. Three popular techniques are used for choosing the minimal data set: the filter technique, the wrapper technique and the embedded technique.
• Extraction of Features: In this mechanism the higher-dimensional space is reduced to a lower-dimensional space, so that the test data has a smaller number of dimensions.
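A minimal sketch contrasting the two components on the iris data (scikit-learn's SelectKBest and PCA are illustrative choices, not the book's code):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Selection of features: keep the 2 best-scoring original columns (a filter technique)
selected = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)
print("selected shape:", selected.shape)    # (150, 2), original columns kept

# Extraction of features: build 2 new columns as combinations of all 4 originals
extracted = PCA(n_components=2).fit_transform(X)
print("extracted shape:", extracted.shape)  # (150, 2), new derived columns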

Methods of Dimensionality Reduction


The various methods used for dimensionality reduction include:

• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
• Generalized Discriminant Analysis (GDA)

Advantages of Dimensionality Reduction

• Requires less storage space and promotes high data compression
• Less computation time
• Eliminates redundant features, where available

Disadvantages of Dimensionality Reduction

• Data loss can occur after the elimination of redundant data.
• Undesirable output can occur for data sets whose features are related in ways that are not simply linear correlations.
• It fails in cases where the mean and covariance are not sufficient to define the data set.
• The number of principal components to retain is uncertain; rules of thumb are used to resolve the choice.

4.3 Explanatory Factor Analysis

Segment 2 - Explanatory factor analysis

Factor analysis on iris dataset
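The code listing for this segment did not survive extraction; the following is a minimal sketch of what such a segment might contain, assuming scikit-learn's FactorAnalysis applied to the iris data.

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import FactorAnalysis

iris = load_iris()
X = iris.data  # sepal length, sepal width, petal length, petal width

# Reduce the four measurements to two latent factors
fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(X)

# Loadings: how strongly each original variable contributes to each factor
loadings = pd.DataFrame(fa.components_, columns=iris.feature_names)
print(loadings)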



4.4 Principal Component Analysis (PCA)


The most popular dimensionality reduction technique is principal component analysis (PCA). It transforms a large data set into a smaller number of dimensions that still contain much of the information needed to represent the original data set. As we reduce the selected features the accuracy gets reduced, but the major feature of the PCA algorithm is that it simplifies the data set with little change in accuracy. PCA results in smaller data sets that are easy to process and can be visualized and analyzed properly without loss of information or variables. Thus PCA preserves the important features of the available data, which gives more clarity on the solution space.

Step by Step Explanation of PCA


Step 1: Standardization
This is a procedure for bringing the ranges of the continuous initial variables onto a comparable scale so that each contributes equally to the analysis. Standardization is done prior to PCA because otherwise the variables with large ranges would dominate over those with small ranges when the variances of the initial data set are computed, which would give a biased result. Transforming the data onto comparable scales prevents this issue.
The formula below standardizes a data variable by subtracting the mean from the variable value and dividing by its standard deviation:

z = (value - mean) / standard deviation

Standardization always brings the data onto a single common scale.
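As a minimal sketch (with illustrative values), the formula in NumPy:

import numpy as np

# One feature column with a wide range of values
values = np.array([12.0, 15.0, 14.0, 10.0, 18.0, 20.0])

z = (values - values.mean()) / values.std()
print(np.round(z.mean(), 6), np.round(z.std(), 6))  # ~0.0 and 1.0 after standardization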

Step 2: Covariance Matrix Computation

The correlations among the standardized variables must be identified. This step is important because it identifies how the input variables vary from the mean with respect to each other, so that the data carrying redundant, highly correlated information can be reduced. The covariance matrix thus helps in identifying strongly correlated data.
Below is an example covariance matrix for three-dimensional data, which checks all possible correlations of the variables x, y and z:

[ Cov(x, x)  Cov(x, y)  Cov(x, z) ]
[ Cov(y, x)  Cov(y, y)  Cov(y, z) ]
[ Cov(z, x)  Cov(z, y)  Cov(z, z) ]

Covariance Matrix for 3-Dimensional Data


The diagonal entries are the variances of each variable with itself, and since Cov(a, b) = Cov(b, a), the lower and upper triangular portions of the matrix contain the same values. A positive covariance indicates a direct correlation between two variables, and a negative covariance indicates an inverse correlation.
Thus the covariance matrix helps us summarize the correlations between all possible pairs of variables.
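A minimal NumPy sketch (with illustrative observations of three standardized variables):

import numpy as np

# Rows are observations, columns are the variables x, y, z
data = np.array([
    [ 0.5, -1.2,  0.8],
    [ 1.1,  0.3, -0.4],
    [-0.9,  0.7,  1.5],
    [-0.7,  0.2, -1.9],
])

# rowvar=False: treat columns as variables when computing covariances
cov = np.cov(data, rowvar=False)
print(cov)  # 3x3 symmetric matrix of Cov(x,x), Cov(x,y), ...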

Step 3: Compute the Eigenvectors and Eigenvalues of the Covariance Matrix to Identify the Principal Components

Eigenvectors and eigenvalues are the linear algebra concepts that must be computed on the covariance matrix to determine the principal components of the data. Principal components are constructed as linear combinations, or mixtures, of the original variables. The new combinations are made in such a way that the components are uncorrelated and most of the information within the initial variables is compressed, or squeezed, into the first components. Depending on the dimension of the data, that many principal components can be created. The principal components always try to put the maximum possible information into the first component, then the next maximum of the remaining information into the second component, and so on.
Fig 4.8 shows how the data can be grouped into principal components. Organizing the data this way, without loss of information, reduces the unwanted or uncorrelated data: the principal components carrying little data can be neglected and the remaining ones used for further processing.
Pictorially, the principal components represent the directions of the data that give the maximum computed variance, i.e. the lines that capture most of the information in the data. The relationship between variance and information is that the larger the variance carried by a line, the larger the dispersion of the data along it, and the more information it provides. The differences between the data points can be clearly observed along the principal component axes.
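A minimal NumPy sketch of this step (the covariance matrix values are illustrative):

import numpy as np

# Covariance matrix from the previous step
cov = np.array([
    [1.00, 0.80, 0.30],
    [0.80, 1.00, 0.25],
    [0.30, 0.25, 1.00],
])

# eigh suits symmetric matrices; sort components by descending eigenvalue
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

explained = eigenvalues / eigenvalues.sum()
print(explained)  # share of the variance captured by each principal component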

[Fig 4.8 Percentage of explained variance (information) for each principal component: the first components carry most of the variance, and the percentage falls off for the later components.]



Step 4: Feature Vector

Continuing from the previous step, constructing the principal components from the eigenvectors and ordering them by their eigenvalues in descending order allows us to identify their significance. In this step we choose which components to discard (those with low eigenvalues); the remaining components form the feature vector.
The feature vector is simply a matrix whose columns are the eigenvectors of the components that will be used for further operations. This is the first step toward dimensionality reduction: if only p eigenvectors are kept out of n, the final data set will have only p dimensions.

Last Step: Recast the Data Along the Principal Component Axes
From all the steps above it is clear that after standardization the only change you make is to select the principal components and form the new feature vector; the given input data itself always stays the same.
In this last step, the aim is to use the feature vector formed from the eigenvectors of the covariance matrix to reorient the data from the original axes to the axes represented by the principal components (hence the name principal component analysis). This is done by multiplying the transpose of the feature vector by the transpose of the standardized original data set:

FinalDataSet = FeatureVector^T * StandardizedOriginalDataSet^T
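A compact end-to-end sketch of the above steps in NumPy (illustrative random data; in practice a library routine such as scikit-learn's PCA would be used):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))                      # illustrative 4-feature data set

# Step 1: standardization
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized data
cov = np.cov(Z, rowvar=False)

# Step 3: eigenvectors and eigenvalues, sorted by descending eigenvalue
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvectors = eigenvectors[:, order]

# Step 4: feature vector keeping the top p = 2 components
feature_vector = eigenvectors[:, :2]

# Last step: FinalDataSet = FeatureVector^T * StandardizedOriginalDataSet^T
final = feature_vector.T @ Z.T
print(final.T.shape)  # (100, 2): the data recast onto 2 principal component axes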

Advantages of Principal Component Analysis


1. Separates correlated features:
In real-time scenarios there can be a large data set with a variable number of features. It is difficult to run an algorithm on all the features and visualize them graphically, so it is necessary to reduce the number of features to understand the data set. The correlation among the features helps in selecting features with a closeness of understanding that is practically impossible with manual intervention. Thus PCA constructs principal components from feature vectors, which helps us find the strongly correlated features and remove the redundancy from the original data.
2. Improves the performance of algorithms:
The performance of most algorithms depends on the quality of the data supplied as input. If the data is not valid, performance degrades and wrong results follow; if the redundant, highly correlated data is removed, the performance of the algorithms improves significantly. So when there is a lot of input data to process, PCA is a good choice for reducing the correlated data.
3. Reduces overfitting:
Overfitting is a common issue when many variable features are used in the data set. Since PCA reduces the number of features, it results in less overfitting.
4. Improves visualization:
High-dimensional data is difficult to visualize. PCA transforms high-dimensional data into a lower dimension to improve the visibility of the data. For example, the IRIS data with four dimensions can be transformed to two dimensions by PCA, which improves the data visualization for processing.

Disadvantages of Principal Component Analysis


1. Interpretation of the independent variables is difficult:
PCA results in linear combinations in which the original features of the data are no longer present, and these resulting principal components are less interpretable than the original features.
2. PCA depends on data standardization:
Optimal principal components are not possible if the input data is not standardized. The scaling factor is very important among the chosen data: any strong variation gives biased results, which leads to wrong output. To get optimal performance from machine learning algorithms we need to standardize the data to mean 0 and standard deviation 1.
3. Loss of information:
The principal components try to cover as much as possible of the highly correlated data across a wide set of features, but some information may be lost in the compression compared with the original features.

Segment 3 - Principal component analysis (PCA)

PCA on the iris dataset
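As with Segment 2, the code listing did not survive extraction; the following is a minimal sketch, assuming scikit-learn's PCA on the iris data, of what the segment might contain (the loadings printed at the end correspond to the component heatmap below).

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)   # standardize before PCA

pca = PCA(n_components=4)
scores = pca.fit_transform(X)

print(pca.explained_variance_ratio_)            # variance captured by each component

# Component loadings for each original iris feature
loadings = pd.DataFrame(pca.components_, columns=iris.feature_names)
print(loadings)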


[Heatmap of the PCA component loadings (components 0-3) for the iris features: sepal length (cm), sepal width (cm), petal length (cm) and petal width (cm), with values ranging from about -0.6 to 0.6.]
References
1. Mitchell, Tom (1997). Machine Learning. New York: McGraw Hill. ISBN
0-07-042807-7. OCLC 36417892.
2. Hu, J.; Niu, H.; Carrasco, J.; Lennox, B.; Arvin, F., “Voronoi-Based Multi-
Robot Autonomous Exploration in Unknown Environments via Deep
Reinforcement Learning” IEEE Transactions on Vehicular Technology, 2020.
3. Bishop, C. M. (2006), Pattern Recognition and Machine Learning, Springer,
ISBN 978-0-387-31073-2

4. Machine learning and pattern recognition “can be viewed as two facets of the
same field.”[4]: vii
5. Friedman, Jerome H. (1998). “Data Mining and Statistics: What’s the con-
nection?”. Computing Science and Statistics. 29 (1): 3–9.
6. “What is Machine Learning?”. www.ibm.com. Retrieved 2021-08-15.
7. Zhou, Victor (2019-12-20). “Machine Learning for Beginners: An
Introduction to Neural Networks”. Medium. Retrieved 2021-08-15.
8. Domingos 2015, Chapter 6, Chapter 7.
9. Ethem Alpaydin (2020). Introduction to Machine Learning (Fourth ed.).
MIT. pp. xix, 1–3, 13–18. ISBN 978-0262043793.
10. Samuel, Arthur (1959). “Some Studies in Machine Learning Using the Game
of Checkers”. IBM Journal of Research and Development. 3 (3): 210–229.
CiteSeerX 10.1.1.368.2254. doi:10.1147/rd.33.0210.
11. Prakash K.B. Content extraction studies using total distance algorithm,
2017, Proceedings of the 2016 2nd International Conference on Applied and
Theoretical Computing and Communication Technology, iCATccT 2016,
10.1109/ICATCCT.2016.7912085
12. Prakash K.B. Mining issues in traditional indian web documents,2015, Indian
Journal of Science and Technology,8(32),10.17485/ijst/2015/v8i1/77056
13. Prakash K.B., Rajaraman A., Lakshmi M. Complexities in developing
multilingual on-line courses in the Indian context, 2017, Proceedings
of the 2017 International Conference On Big Data Analytics and
Computational Intelligence, ICBDACI 2017, 8070860, 339-342, 10.1109/
ICBDACI.2017.8070860
14. Prakash K.B., Kumar K.S., Rao S.U.M. Content extraction issues in online
web education, 2017,Proceedings of the 2016 2nd International Conference
on Applied and Theoretical Computing and Communication Technology,
iCATccT 2016, 7912086,680-685,10.1109/ICATCCT.2016.7912086
15. Prakash K.B., Rajaraman A., Perumal T., Kolla P. Foundations to fron-
tiers of big data analytics,2016,Proceedings of the 2016 2nd International
Conference on Contemporary
