
Module 6 : Dimensionality Reduction

6.4.3 Principal in Principal Component Analysis (P.C.A.)
6.4.4 PC1, PC2 Axes in P.C.A.
6.4.5 Calculation of Principal Components
6.4.6 Principal Component Analysis (P.C.A.)
6.4.7 Working Rule of PCA
6.4.8 Use of P.C.A. in Machine Learning
6.4.9 Output of PCA
6.4.10 Limitations of PCA
6.4.11 Comparison between PCA and Cluster Analysis
6.5 Feature Extraction and Selection
6.5.1 Feature Selection Ranking
6.5.2 Advantages of PCA
6.5.3 Disadvantages of PCA
6.6 Dimensionality Reduction Techniques
    UQ. Why is Dimensionality Reduction a very important step in Machine Learning ? (Ref. MU (Comp.) - May 19, Dec. 19, 5 Marks)
6.6.1 Dimension Reduction Techniques in ML
6.6.1(A) Feature Selection
6.6.1(B) Feature Extraction
6.7 Principal Component Analysis
    UQ. Explain in detail Principal Component Analysis for Dimension Reduction. (Ref. MU (Comp.) - May 15, May 16, 10 Marks)
    UQ. Use Principal Component Analysis (PCA) to arrive at the transformed matrix for the given data A = [2 1 0; 4 3 1]. (Ref. MU (Comp.) - May 17, 10 Marks)
    UQ. Describe the steps to reduce dimensionality using the Principal Component Analysis method by clearly stating the mathematical formulas used. (Ref. MU (Comp.) - May 19, Dec. 19, 5 Marks)
    UQ. Write a short note on ISA and compare it with PCA. (Ref. MU (Comp.) - May 19, 5 Marks)
Chapter Ends

6.1 CURSE OF DIMENSIONALITY

The curse of dimensionality refers to various phenomena that arise when we analyse and organise data in high-dimensional spaces. These phenomena do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience.

Dimensionally cursed phenomena occur in domains such as numerical analysis, sampling, combinatorics, machine learning, data mining and databases.

The actual and common problem is that when the dimensionality increases, the volume of the space increases, and it increases so fast that the available data becomes sparse. In order to obtain a reliable result, the amount of data needed often grows exponentially with the dimensionality.

Also, organising and searching data often relies on detecting areas where objects form groups with similar properties. But in high-dimensional data all objects appear to be sparse and dissimilar in many ways, and that prevents common data organisation strategies from being efficient.

6.1.1 Combinatorics

In some problems, each variable can take one of several discrete values, or the range of possible values is divided to give a finite number of possibilities. Taking the variables together, a huge number of combinations of values must be considered. This effect is known as combinatorial explosion.

Even in the simplest case of d binary variables, the number of possible combinations is 2^d. Thus, each additional dimension doubles the effort needed to try all combinations.

6.1.2 Sampling

There is an exponential increase in volume associated with adding extra dimensions to a mathematical space. For example, 10^2 = 100 evenly spaced sample points suffice to sample a unit interval with a spacing of 10^-2 = 0.01 between points. An equivalent sampling of a 10-dimensional unit hypercube with a lattice that has a spacing of 10^-2 = 0.01 between adjacent points would require 10^20 = (10^2)^10 sample points. This effect is a combination of the combinatorics problems, as the short sketch below illustrates.
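The following small Python sketch (an illustrative addition, not part of the original text) simply counts how many lattice points a spacing of 0.01 requires as the number of dimensions grows, reproducing the figures quoted above.

```python
# Illustrative sketch: number of evenly spaced lattice points needed to
# sample the unit hypercube with spacing 0.01 along every axis.
spacing = 0.01
points_per_axis = int(round(1 / spacing))  # 100 points per dimension

for d in (1, 2, 3, 10):
    total_points = points_per_axis ** d
    print(f"{d:2d}-dimensional unit hypercube: {total_points:.3e} sample points")

# The count grows from 1e2 points for the unit interval to 1e20 points
# for the 10-dimensional hypercube.
```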
6.1.3 Optimisation

When solving dynamic optimisation problems by numerical backward induction, the objective function must be computed for each combination of values. This is a significant obstacle when the dimension of the "state variable" is large.

6.1.4 Anomaly Detection

We come across the following problems when searching for anomalies in high-dimensional data:
1. Concentration of scores and distances : derived values such as distances become numerically similar.
2. Irrelevant attributes : in high-dimensional data, a significant number of attributes may be irrelevant.
3. Definition of reference sets : for local methods, reference sets are often nearest-neighbour based.
4. Incomparable scores for different dimensionalities : different subspaces produce incomparable scores.
5. Interpretability of scores : the scores often no longer convey a semantic meaning.
6. Exponential search space : the search space can no longer be systematically scanned.
7. Data-snooping bias : given the large search space, for every desired significance a hypothesis can be found.
8. Hubness : certain objects occur more frequently in neighbour lists than others.

6.1.5 Machine Learning

In machine learning, a marginal increase in dimensionality also requires a large increase in the volume of data in order to maintain the same level of performance. The curse of dimensionality is the by-product of a phenomenon which appears with high-dimensional data.

6.1.6 How to Combat the COD ?

Combating the COD is not such a big deal as long as we have dimensionality reduction. Dimensionality reduction is the process of reducing the number of input variables in a dataset. This is known as converting high-dimensional variables into lower-dimensional variables without changing their attributes. It does not add any extra variables, which makes it very simple for analysts to analyse the data, leading to faster results for algorithms.

6.1.7 Mitigating the Curse of Dimensionality

To mitigate the problems associated with high-dimensional data, techniques generally referred to as 'dimensionality reduction' techniques are used. Dimensionality reduction techniques fall into one of two categories : 'feature selection' or 'feature extraction'. We shall deal with these techniques in the next section.

6.2 STATISTICAL FEATURES

6.2.1 Introduction

Statistical features are the most important features used in statistical data science. They are the first statistical technique applied while exploring a dataset, and involve expressions like bias, mean, variance, standard deviation, median, moments, percentiles etc.

The field of utility of statistics has been increasing steadily, and thus different people have expressed the features of statistics differently according to the developments of the subject. In the old days statistics was regarded as the "science of statecraft", but today it covers almost every sphere of natural and human activity. It is more exhaustive and elaborate in approach.

6.2.2 Selected Features

(1) "Statistics are the classified facts representing the conditions of the people in a state, specially those facts which can be stated in number or in tables of numbers or in any tabular or classified arrangement." - Webster
(2) "Statistics are numerical statements of facts in any department of enquiry placed in relation to each other." - Bowley
(3) "By statistics we mean quantitative data affected to a marked extent by multiplicity of causes." - Yule and Kendall
(4) "Statistics is the aggregate of facts affected to a marked extent by multiplicity of causes, numerically expressed, enumerated or estimated according to a reasonable standard of accuracy, collected in a systematic manner, for a predetermined purpose and placed in relation to each other." - Prof. Horace Secrist

6.2.3 Analysis of Features

Statistical feature analysis includes the frequency table, bar chart, pie chart, histogram, normal probability plot, cumulative frequency table etc. The term 'analysis feature' is used with 'attribute' and 'variable'; it is an independent variable.

Remarks

(1) According to Webster, features of statistics are only numerical facts. This restricts the domain of statistics to the affairs of a state, i.e. to the social sciences, economics etc. But this is very inadequate for modern statistics, which covers all sciences - social, physical and natural.
(2) Bowley describes statistics in a more general way : it relates to numerical data in any department of enquiry. It also refers to the comparative study of the figures, as against mere classification and tabulation.
(3) In the case of social, economic and business phenomena, Yule and Kendall's feature refers to data affected by a multiplicity of causes. For example, the supply of a particular commodity is affected by supply, demand, imports, exports, money in circulation, competitive products in the market and so on. Similarly, the yield of a particular crop depends upon a multiplicity of factors like quality of seed, fertility of soil, method of cultivation, irrigation facilities, weather conditions, fertilizers used and so on.
(4) Secrist's feature of statistics is more exhaustive. We study the elements of it :
    (a) Aggregate of facts : Simple or isolated items cannot be termed statistics unless they are a part of an aggregate of facts relating to a particular field of enquiry. The aggregate of figures of births, deaths, sales, purchases, production, profits etc. over different times, places etc. constitutes statistics.
    (b) Affected by multiplicity of causes : In physical sciences it is possible to isolate the effect of various factors on a single item, but it is very difficult to do so in social sciences, particularly when the effect of some of the factors cannot be measured quantitatively. However, statistical techniques have been devised to study the joint effect of a number of factors on a single item, or the isolated effect of a single factor on the given item, provided the effect of each of the factors can be measured quantitatively.

Definition of a Feature

A feature is defined as something that gives or brings special attention to someone or something; for example, a part of the face, a quality, a special attraction, etc. An example : the nose.

Feature in Data

Each feature represents a measurable piece of data that can be used for analysis, for example name, age, sex, fare and so on. Features are also sometimes referred to as 'variables' or 'attributes'. Depending upon the type of analysis, the features can vary widely.

6.2.4 Features of Big Data

Important issues regarding data in traditional files are :
1. Volume
2. Velocity
3. Variety
4. Variability
5. Complexity

Fig. 6.2.1 : Important issues regarding data in traditional files

1. Volume

Nowadays the volume of data regarding different fields is high and is potentially increasing day by day. Organizations collect data from a variety of sources, including business transactions, social media and other sources of information.
2. Velocity

The configuration of a system with a single processor, limited RAM and limited storage capacity cannot store and manage a high volume of data.

3. Variety

The form of the data coming from different sources is different.

4. Variability

The flow of data coming from sources like social media is inconsistent because of daily emerging new trends. It can show a sudden increase in the size of the data, which is difficult to manage.

5. Complexity

As the data comes from various sources, it is difficult to link, match and transform such data across systems. It is necessary to connect and correlate relationships, hierarchies and multiple data linkages of the data.

All these issues are addressed by the new advanced Big Data technology.

6.2.5 Features in a Model

A feature model is a model that defines features and their dependencies, typically in the form of a feature diagram (with constraints). It could also be given as a table of possible combinations.

Features are parts or patterns of an object in an image that help to identify it. For example, a triangle has three corners and three edges; these are features of the triangle, and they help us to identify it as a 'triangle'. Features include properties like corners, edges, regions of interest points, ridges etc.

6.2.6 Four Basic Elements of Statistics

They are : population, sample, parameter and statistic (singular); together with the variable, they form the basic vocabulary of statistics.

6.2.7 Types of Variables in Statistics

(1) Independent and dependent variables
(2) Active and attribute variables
(3) Continuous variable
(4) Discrete variable
(5) Categorical variable
(6) Extraneous variable, and
(7) Demographic variable

6.2.8 Statistical Methods

In the statistical analysis of raw research data, the methods used are mathematical formulae, models and (mathematical) techniques. The application of statistical methods extracts information from research data and provides different ways to assess the strength of research outputs. We use numerical (discrete, continuous), categorical and ordinal methods for the research output.

6.2.9 Statistical Function

Statistics helps in providing a better understanding and an accurate description of nature's phenomena, and in the proper and efficient planning of a statistical inquiry in any field of study. It also helps in collecting appropriate quantitative data. The three basic elements in statistics are measurement, comparison and variation.

A statistical function, such as the mean, median or variance, summarizes a sample of values by a single value. Such functions expect their parameters to be a probabilistic value represented by a random sample of values over the run index. A short sketch of these summary functions follows.
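As a small illustration, the hedged NumPy sketch below computes the statistical features named in this section (mean, median, variance, standard deviation, percentiles) for a hypothetical sample of values.

```python
import numpy as np

# A small sample of values (hypothetical data, used only for illustration).
sample = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])

# Each statistical function summarizes the whole sample by a single value.
print("mean                :", sample.mean())
print("median              :", np.median(sample))
print("variance (sample)   :", sample.var(ddof=1))   # divides by N - 1
print("std. deviation      :", sample.std(ddof=1))
print("25th/75th percentile:", np.percentile(sample, [25, 75]))
```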

The following functions are also statistical functions; they are very similar to functions such as SUM, MAX and AVERAGE :

1. COUNT function

We use the COUNT function when we need to count the number of cells containing a number.

2. COUNTA function

While the COUNT function only counts numeric values, the COUNTA function counts all the cells in a range that are not empty. The function is useful for counting cells containing any type of information, including error values and empty text.

3. COUNTBLANK function

The COUNTBLANK function counts the number of empty cells in a range of cells. Cells with formulae that return empty text are also counted, but cells with zero values are not counted. This is an important function for summarizing empty cells while analyzing any data.
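The three counting functions above are spreadsheet functions; a rough pandas analogue (an illustrative assumption, not part of the original text) is sketched below.

```python
import numpy as np
import pandas as pd

# A column mixing numbers, text, empty text and missing cells (hypothetical data).
cells = pd.Series([10, 2.5, "error", "", None, 0, np.nan])

numeric = pd.to_numeric(cells, errors="coerce")

count_numbers = numeric.notna().sum()                    # ~COUNT: cells containing a number
count_non_empty = cells.notna().sum()                    # ~COUNTA: all non-missing cells
count_blank = (cells.isna() | (cells == "")).sum()       # ~COUNTBLANK: missing or empty-text cells

print(count_numbers, count_non_empty, count_blank)
```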
6.3 FEATURES AND CLASSIFICATION

Classification is the act of forming a class or classes according to some common relations or attributes, while a feature is a distinguishable characteristic of a person or thing.

6.3.1 Feature Extraction (In Machine Learning)

In real life, all the data we collect is available in large amounts. To understand this data we need a process; manually it is not possible to process it. This is where the concept of feature extraction comes in. Feature extraction is a general term for methods of constructing combinations of the variables that get around these problems while still describing the data with sufficient accuracy. Properly optimized feature extraction is the key to effective model construction.

6.3.2 Why Feature Extraction is Useful ?

The technique of extracting features is useful when you have a large data set and need to reduce the number of resources without losing any important or relevant information. Feature extraction helps to reduce the amount of redundant data in the data set. In the end, the reduction of the data helps to build the model with less machine effort and also increases the speed of the learning and generalization steps in the machine learning process.

(1) The goal of statistical pattern feature extraction (SPFE) is 'low dimensional reduction'.
(2) Five measures of statistical features are : (i) the most extreme values in the data set (the maximum and minimum values), (ii) the lower and upper quartiles, and (iii) the median.
(3) These values are presented together and ordered from lowest to highest : minimum value, lower quartile (Q1), median value (Q2), upper quartile (Q3), maximum value (see the sketch after this list).
(4) Statistical features of an image are : area, perimeter, centre of area, axis of least second moment, Euler number, projection, thinness ratio and aspect ratio. They are the binary object features. Area, centre of area, axis of least second moment and perimeter tell us something about where the object is; the other four tell us about the shape of the object.
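A minimal NumPy sketch of the five-number summary described in points (2) and (3) above (the sample values are hypothetical):

```python
import numpy as np

data = np.array([7, 15, 36, 39, 40, 41, 42, 43, 47, 49])  # hypothetical sample

# Five-number summary, ordered from lowest to highest.
minimum = data.min()
q1, median, q3 = np.percentile(data, [25, 50, 75])  # lower quartile, median, upper quartile
maximum = data.max()

print(minimum, q1, median, q3, maximum)
```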
6.3.3 Time Domain Features

Time domain features refer to the analysis of mathematical functions, physical signals or time series of economic or environmental data with respect to time. In the time domain, the value of a signal or function is known for all real numbers in the case of continuous time, or at various separate instants in the case of discrete time.
For time series data, feature extraction can be performed using various time series analysis and decomposition techniques. Also, features can be obtained by sequence comparison techniques such as dynamic time warping, and by subsequence discovery techniques such as motif analysis.

A time-domain feature shows how a signal changes with time, while a frequency-domain graph shows how much of the signal lies within each given frequency band over a range of frequencies.

Feature Values

A feature value is a determination made by the customer about which feature is most important to the customers.

Shape Features

Shape features provide an alternative way of describing an object using its most important characteristics, and they reduce the amount of information stored. The algorithm is comprised of a curvature approximation technique.

Global Features

Global features describe the visual content of the whole image and represent an image by one vector, whereas local features describe it as a set of vectors.

6.3.4 Applications of Feature Extraction

(1) Bag of Words : Bag-of-Words is the most used technique for natural language processing. In this process we extract the words, or the features, from a sentence, document, website, etc. and then classify them by their frequency of use. In this whole process feature extraction is one of the most important parts (a short sketch follows this list).
(2) Image Processing : Image processing is one of the best and most interesting domains. In this domain you basically start playing with your images in order to understand them. Here we use many techniques, which include feature extraction as well as algorithms to detect features such as shapes, edges or motion in a digital image or video.
(3) Auto-encoders : The main purpose of auto-encoders is efficient data coding, which is unsupervised in nature. This process comes under unsupervised learning, so feature extraction is applicable here to identify the key features from the data, to code by learning from the coding of the original data set, and to derive new ones.
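As a brief sketch of the bag-of-words idea in application (1) above, the following hedged example (assuming a recent scikit-learn is available; the documents are made up) extracts word-count features from a few sentences.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents used only to illustrate the idea.
documents = [
    "feature extraction reduces redundant data",
    "bag of words counts word frequency",
    "feature selection keeps a subset of features",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(documents)   # sparse document-term count matrix

print(vectorizer.get_feature_names_out())   # extracted vocabulary (scikit-learn >= 1.0)
print(bow.toarray())                        # word counts per document
```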


6.4 PRINCIPAL COMPONENT ANALYSIS (P.C.A.)

6.4.1 Introduction

Principal components represent a multivariate data table as a smaller set of variables in order to observe trends, jumps, clusters and outliers. This overview may give us the relationships between observations and variables, and also among the variables themselves.

6.4.2 Definition of P.C.A.

P.C.A. is a technique used for the identification of a smaller number of uncorrelated variables, known as principal components, from a large set of data. The technique is widely used to emphasise variation and capture strong patterns in data.

Mathematically, P.C.A. can be described as follows : the principal components of a collection of points in a real coordinate space are a sequence of p unit vectors, where the i-th vector is the direction of a line that best fits the data and is orthogonal to the first (i - 1) vectors.

6.4.3 Principal in Principal Component Analysis (P.C.A.)

P.C.A. is a mathematical algorithm that reduces the dimensionality of the data while retaining most of the variation in the data set. It accomplishes this by identifying directions, called principal components (PCs), along which the variation in the data is maximal.

6.4.4 PC1, PC2 Axes in P.C.A.

In PCA, the directions with the largest variances are the most important (i.e. most principal). The PC1 axis is the first principal direction, along which the samples show the largest variation. PC2 is the second most important direction, and it is orthogonal to the PC1 axis (refer to article 6.4.2 above).

6.4.5 Calculation of Principal Components

1. Consider the whole dataset consisting of (d + 1) dimensions and ignore the labels, so that our new dataset becomes d-dimensional.
2. Compute the mean of every dimension of the whole dataset.
3. Compute the covariance matrix of the whole dataset.
4. Compute the eigenvalues and eigenvectors of the covariance matrix (i.e. for each dimension).

The explained variance is calculated as the ratio of the eigenvalue of a particular principal component (eigenvector) to the total of the eigenvalues. A short NumPy sketch of these steps follows.
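The four steps listed above can be written down directly in NumPy. The sketch below is an illustrative implementation of this recipe (the random data and variable names are assumptions, not from the textbook):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # 100 samples, d = 5 dimensions (labels already dropped)

# Step 2: mean of every dimension.
mean = X.mean(axis=0)

# Step 3: covariance matrix of the centred data set.
C = np.cov(X - mean, rowvar=False)

# Step 4: eigenvalues and eigenvectors of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(C)           # eigh: C is symmetric
order = np.argsort(eigvals)[::-1]              # arrange eigenvalues in decreasing order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Explained-variance ratio of each principal component.
explained = eigvals / eigvals.sum()
print(explained)

# Keep the m largest components and project the data onto them.
m = 2
scores = (X - mean) @ eigvecs[:, :m]
print(scores.shape)                            # (100, 2)
```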
6.4.6 Principal Component Analysis (P.C.A.)

PCA is a method used for maximizing the information content in high-dimensional data. In PCA, a large number of correlated variables are transformed into a smaller number of uncorrelated variables, and these are called the principal components.

In mathematical terms, PCA finds a set of m orthogonal vectors in the data space having the largest variance, and then we project the n-dimensional data onto the m-dimensional subspace, where m << n; in this way we are able to reduce the dimensionality of the given data.

6.4.7 Working Rule of PCA

(i) Calculate the covariance matrix.

First consider a two-dimensional (bivariate) data set along the X- and Y-axes, and let (x_i, y_i) be the coordinates of the points. Let \bar{x} and \bar{y} be the means of the two variables. Covariance measures the relationship between two variables, while variance operates on a single variable :

cov(x, y) = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})    ...(i)

var(x) = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})    ...(ii)

The covariance matrix is

C = \begin{bmatrix} cov(x, x) & cov(x, y) \\ cov(y, x) & cov(y, y) \end{bmatrix} = \begin{bmatrix} var(x) & cov(x, y) \\ cov(y, x) & var(y) \end{bmatrix}    ...(iii)

since cov(x, x) = var(x) and cov(y, y) = var(y).

(ii) Calculate the eigenvalues and eigenvectors of the covariance matrix.

Let \lambda_1, \lambda_2, ..., \lambda_p be the eigenvalues. We arrange them in decreasing order, i.e. \lambda_1 \ge \lambda_2 \ge ... \ge \lambda_p. Let e_1, e_2, ..., e_p be the corresponding eigenvectors; then the eigenvector e_1 will represent the axis with maximum variance.

(iii) Once the eigenvalues are determined and arranged in decreasing order, we ignore the eigenvalues of lesser significance, and the dimensionality of the original data is reduced. If there are n eigenvalues and we decide to keep only the m largest eigenvalues,
then the corresponding m eigenvectors form the columns of a matrix called the 'feature vector'. We then project the original data set onto the feature space formed by the feature vector.

6.4.8 Use of P.C.A. in Machine Learning

PCA is an unsupervised learning algorithm that is used for dimensionality reduction in machine learning.

PCA works by considering the variance of each attribute, because a high-variance attribute shows a good split between the classes, and hence it reduces dimensionality.

P.C.A. is a statistical process that converts the observations of correlated features into a set of linearly uncorrelated features with the help of orthogonal transformations. These new transformed features are called the principal components.

It is one of the popular tools used for exploratory data analysis and predictive modeling. It is a technique to draw strong patterns from the given dataset by reducing the variance.

PCA tries to find the lower-dimensional surface onto which to project the high-dimensional data. A usage sketch follows.
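In practice this is rarely coded by hand. The hedged sketch below shows one common way to apply PCA as a preprocessing step with scikit-learn (standardising first, in line with the limitations noted below); the random data is an assumption used only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))                 # hypothetical unlabelled data

# Standardise, then keep enough components to explain 95% of the variance.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=0.95))
X_reduced = pipeline.fit_transform(X)

pca = pipeline.named_steps["pca"]
print(X_reduced.shape)                  # no more columns than X has
print(pca.explained_variance_ratio_)    # variance captured by each principal component
```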
6.4.9 Output of PCA

We have just mentioned that PCA is a dimensionality reduction algorithm; it helps in reducing the dimensions of our data. PCA gives as output the eigenvectors in decreasing order of importance, such as PC1, PC2, ..., and these become the new axes for our data.

6.4.10 Limitations of PCA

(i) Low interpretability of principal components : principal components are linear combinations of the features of the original data, but they are not so easy to interpret.
(ii) In dimensionality reduction there is information loss.
(iii) On applying P.C.A., the independent features become less interpretable, because these principal components are also not readable or interpretable.
(iv) Data must be standardized before implementing PCA, otherwise PCA will not be able to find the optimal principal components.
(v) If care is not taken in selecting the principal components, the method may miss some information.

6.4.11 Comparison between PCA and Cluster Analysis

(i) Cluster analysis groups observations, while PCA groups variables rather than observations.
(ii) PCA can be used as a final method, or to reduce the number of variables in order to conduct another analysis such as regression analysis or other data mining techniques.
(iii) Clustering reduces the number of data points by summarising several points by their expectation / mean.

6.5 FEATURE EXTRACTION AND SELECTION

Principal Component Analysis (PCA) is the process of computing the principal components and using them to perform a change of basis on the data, sometimes using only the first few principal components and ignoring the rest.

PCA is used in exploratory data analysis and for making predictive models. It is commonly used for dimensionality reduction by projecting each data point onto only the first few principal components. Thereby lower-dimensional data is obtained, and it preserves as much of the data's variation as possible.

The first principal component is defined as a direction that maximises the variance of the projected data. The i-th principal component can be taken as a direction orthogonal to the first (i - 1) principal components that maximises the variance of the projected data.
The principal components are eigenvectors of the data's covariance matrix. They are often computed by eigendecomposition of the data covariance matrix or by singular value decomposition of the data matrix.

PCA is closely related to factor analysis. Factor analysis incorporates more domain-specific assumptions about the underlying structure and solves eigenvectors of a slightly different matrix.

PCA is also related to canonical correlation analysis (CCA). CCA defines coordinate systems that optimally describe the cross-covariance between two datasets, while PCA defines a new orthogonal coordinate system that optimally describes variance in a single dataset.

Thus PCA is defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some scalar projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.

Consider an n x p data matrix X with columnwise zero empirical mean, where each of the n rows represents a different repetition of the experiment, and each of the p columns gives a particular kind of feature.

Mathematically, the transformation is defined by a set of size l of p-dimensional vectors of weights or coefficients w_{(k)} = (w_1, ..., w_p)_{(k)} that map each row vector x_{(i)} of X to a new vector of principal component scores t_{(i)} = (t_1, ..., t_l)_{(i)}, given by t_{k(i)} = x_{(i)} \cdot w_{(k)} for i = 1, 2, ..., n and k = 1, ..., l. (l is usually selected as strictly less than p in order to reduce dimensionality.)

First component

In order to maximize variance, the first weight vector w_{(1)} satisfies

w_{(1)} = \arg\max_{\lVert w \rVert = 1} \sum_{i} (t_1)_{(i)}^2 = \arg\max_{\lVert w \rVert = 1} \sum_{i} (x_{(i)} \cdot w)^2

(a numerical check of this definition is sketched after Table 6.5.1).

Further components

The k-th component can be found by subtracting the first k - 1 principal components from X.

Table 6.5.1 : Symbols and abbreviations

Symbol      | Meaning                                                                        | Dimensions   | Indices
X = (x_ij)  | Data matrix consisting of the set of all data vectors, one vector per row      | n x p        | i = 1, ..., n ; j = 1, ..., p
n           | The number of row vectors in the data set                                      | 1 x 1 scalar |
p           | The number of elements in each row vector (dimension)                          | 1 x 1 scalar |
L           | The number of dimensions in the dimensionally reduced subspace, 1 <= L <= p    | 1 x 1 scalar |
u = (u_j)   | Vector of empirical means, one mean for each column j of the data matrix       | p x 1        | j = 1, ..., p
s = (s_j)   | Vector of empirical standard deviations, one per column j of the data matrix   | p x 1        | j = 1, ..., p
h = (h_i)   | Vector of all 1's                                                               | 1 x n        | i = 1, ..., n
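A small numerical check of the first-component definition above (an illustrative sketch, not part of the textbook): the leading right singular vector of the centred data matrix attains the maximum projected variance among unit vectors.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.2])   # columns with different spread
X = X - X.mean(axis=0)                                       # columnwise zero empirical mean

# w(1) from the SVD: the right singular vector with the largest singular value.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
w1 = Vt[0]

# Projected variance of t1 = X w(1) versus many random unit vectors.
var_w1 = np.var(X @ w1)
random_w = rng.normal(size=(1000, 3))
random_w /= np.linalg.norm(random_w, axis=1, keepdims=True)
print(var_w1 >= np.var(X @ random_w.T, axis=0).max())        # True: w(1) maximises variance
```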
6.5.1 Feature Selection Ranking

Many feature selection methods used in classification are directly applicable to ranking. Specifically, for each feature we use its value to rank the training instances, and we define the ranking performance measure, or a loss function, as the importance of the feature.


Feature ranking can be used for feature selection by evaluating the information gain of each variable in the context of the target variable. Effective feature selection can improve both the accuracy and the efficiency of learning to rank. The sketch below illustrates one such ranking.
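One common way to realise the 'information gain of each variable' mentioned above is mutual information. The following hedged sketch (assuming scikit-learn and its bundled iris data set) ranks features in that way.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)
feature_names = load_iris().feature_names

# Score each feature against the target, then rank from most to least informative.
scores = mutual_info_classif(X, y, random_state=0)
ranking = sorted(zip(feature_names, scores), key=lambda pair: pair[1], reverse=True)

for name, score in ranking:
    print(f"{name:25s} {score:.3f}")
```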
6.5.2 Advantages of PCA

1. Lack of redundancy in the data, given the orthogonal components.
2. Principal components are independent of each other, so correlated features are removed.
3. PCA improves the performance of the ML algorithm, as it eliminates correlated variables that do not contribute to any decision making.
4. PCA helps in overcoming data overfitting issues by decreasing the number of features.
5. PCA results in high variance and thus improves visualization.
6. Reduction of noise, since the maximum-variation basis is chosen and so the small variations in the background are ignored automatically.

6.5.3 Disadvantages of PCA

1. It is difficult to evaluate the covariance in a proper way.
2. Even the simplest invariance cannot be captured by PCA unless the training data explicitly provide this information.
3. Data needs to be standardized before implementing PCA, else it becomes difficult to identify the optimal principal components.
4. Though PCA covers the maximum variance among the data features, sometimes it may skip a bit of information in comparison to the actual list of features.
5. Implementing PCA on a dataset means transforming the actual features into linear combinations of the actual features (the principal components), which are therefore difficult to read or interpret compared to the actual features.

6.6 DIMENSIONALITY REDUCTION TECHNIQUES

UQ. Why is Dimensionality Reduction a very important step in Machine Learning ? (Ref. MU (Comp.) - May 19, Dec. 19, 5 Marks)

Dimension reduction refers to the process of converting a set of data having vast dimensions into data with fewer dimensions while ensuring that it conveys similar information concisely. These methods are normally used while tackling machine learning problems to obtain better features for a classification or regression problem.

Let us look at the picture shown below. It indicates two dimensions P1 and P2, which are, let us say, measurements of several objects in cm (P1) and in inches (P2). Now, if we were to use both measurements in machine learning, they would convey similar information and introduce a lot of noise in the system, so it is better to use just one dimension. Here the dimension of the data is converted from 2D (P1 and P2) to 1D (Q1), to make the data relatively easier to explain.

Fig. 6.6.1 : Example of Dimension Reduction (scatter of P1 (cm) against P2 (inches), with the points projected onto a single axis Q1)
There are a number of methods with which we can get k dimensions by reducing the n dimensions (k < n) of a dataset. These k dimensions can be directly identified (filtered), or they can be a combination of dimensions (weighted averages of dimensions), or new dimension(s) that represent the existing multiple dimensions well. One of the most well-known applications of this procedure is image processing.

Now we will see the significance of applying a dimension reduction method :

- It assists in data compression and reduces the storage space required.
- It decreases the time required for performing the same computations. If there are fewer dimensions there is less processing; an added advantage of having fewer dimensions is that it permits the use of algorithms unfit for a large number of dimensions.
- It takes care of multi-collinearity, which improves the performance of the model, and it removes redundant features. For example, it makes no sense to store a value in two different units (meters and inches).
- Reducing the dimensions of the data to 3D or 2D may enable us to plot and visualize it exactly. You can then observe patterns more clearly. Below we can see how 3D data is converted into 2D : the 2D plane is identified, and the points are then represented on the two new axes z1 and z2.

Fig. 6.6.2 : Example of Dimension Reduction (3D data projected onto a 2D plane with new axes z1 and z2)

- It is also helpful in noise removal, and because of this we can enhance the performance of models.

The classifier's performance usually degrades for a large number of features.

Fig. 6.6.3 : Classifier performance versus the number of variables

6.6.1 Dimension Reduction Techniques in ML

6.6.1(A) Feature Selection

Given a set of features F = {X1, ..., XN}, the feature selection problem is to find a subset F' of F that maximizes the learner's ability to classify the patterns. Finally, F' should maximize some scoring function.

Feature selection keeps a subset of the original features, while feature extraction computes new features from them :

Feature selection : [x_1, x_2, ..., x_N]^T \rightarrow [x_{i1}, x_{i2}, ..., x_{iM}]^T
Feature extraction : [x_1, x_2, ..., x_N]^T \rightarrow [y_1, y_2, ..., y_M]^T = f([x_1, x_2, ..., x_N]^T)

Feature Selection Steps

Feature selection is an optimization problem.
Step 1 : Search the space of possible feature subsets.
Step 2 : Pick the subset that is optimal or near-optimal with respect to some objective function.

Subset selection

There are d initial features and 2^d possible subsets. The criterion to decide which subset is best : the classifier based on these m features has the lowest probability of error of all such classifiers. It is not possible to go over all 2^d possibilities, so we need some heuristics; here we select uncorrelated features.

Forward search
- Start from an empty set of features.
- Try each of the remaining features.
- Estimate the classification / regression error for adding a specific feature.
- Select the feature that gives the maximum improvement in validation error.
- Stop when there is no significant improvement (a sketch of this loop is given below).

Backward search
- Start with the original set of size d.
- Drop the features with the smallest impact on the error.
Backward search Dimensional reduction is regularly the best beginning
Start with original set of size d stage when dealing with high dimensional information.
Drop features with smallest impact on error. It is utilized for an assortment of reasons, from
perception to denoising, and in a wide range of uses,
a 6.6.1 (B) Feature Extraction
from signal processing to bioinformatics.
Suppose a set of features F = (X1, ., XN is
A standout amongst the most broadly utilized
given. The Feature Extraction task is to map F to some dimensional reduction tools is Principal Componeni
feature set F" that will maximize the learner's ability to Analysis (PCA).
classify patterns.

PCA implicitly assumes that the dataset under consideration is normally distributed, and it chooses the subspace which maximizes the projected variance. We consider a centered data set and construct the sample covariance matrix S; q-dimensional PCA is then identical to projecting onto the q-dimensional subspace spanned by the q eigenvectors of S with the largest eigenvalues.

In this technique, variables are transformed into a new set of variables, which are linear combinations of the original variables. These new sets of variables are known as principal components. They are calculated so that the first principal component accounts for the largest portion of the possible variation of the original data, after which each succeeding component has the highest possible variance.

The second principal component has to be orthogonal to the first principal component. In other words, it does its best to capture the variance in the data that is not captured by the first principal component. For a two-dimensional dataset there can be only two principal components. Below is a depiction of the data and its first and second principal components; you can see that the second principal component is orthogonal to the first.

Fig. 6.7.1 : A two-dimensional data set with its first and second principal components

The principal components are sensitive to the scale of measurement; to settle this issue we should always standardize the variables before applying PCA. On applying PCA, your data set loses some of its interpretability; if interpretability of the results is critical for your investigation, PCA is not the correct technique for your task.

Ex. 6.7.1 : Apply PCA on the following data and find the principal component.

X : 2.5  0.5  2.2  1.9  3.1  2.3  2.0  1.0  1.5  1.1
Y : 2.4  0.7  2.9  2.2  3.0  2.7  1.6  1.1  1.6  0.9

Soln. :

First we will find the mean values :

X_m = \frac{\sum X}{N}, where N = number of data points = 10, giving X_m = 1.81
Y_m = \frac{\sum Y}{N} = 1.91

Now we will find the covariance matrix C = \begin{bmatrix} C_{XX} & C_{XY} \\ C_{YX} & C_{YY} \end{bmatrix} :

C_{XX} = \frac{\sum (X - X_m)(X - X_m)}{N - 1} = 0.6165
C_{XY} = \frac{\sum (X - X_m)(Y - Y_m)}{N - 1} = 0.61544
C_{YY} = \frac{\sum (Y - Y_m)(Y - Y_m)}{N - 1} = 0.7165

C = \begin{bmatrix} 0.6165 & 0.61544 \\ 0.61544 & 0.7165 \end{bmatrix}

Now, to find the eigenvalues the following equation is used :

|C - \lambda I| = 0, i.e. \begin{vmatrix} 0.6165 - \lambda & 0.61544 \\ 0.61544 & 0.7165 - \lambda \end{vmatrix} = 0

Solving the above determinant gives a quadratic equation, and solving that equation we get the two eigenvalues \lambda_1 = 0.0489 and \lambda_2 = 1.284.

Now we will find the eigenvectors corresponding to these eigenvalues.
First we will find the first eigenvector, corresponding to the first eigenvalue \lambda_1 = 0.0489, from C V_1 = \lambda_1 V_1 :

\begin{bmatrix} 0.6165 & 0.61544 \\ 0.61544 & 0.7165 \end{bmatrix} \begin{bmatrix} V_{11} \\ V_{12} \end{bmatrix} = 0.0489 \begin{bmatrix} V_{11} \\ V_{12} \end{bmatrix}

From the above equation we get two equations :

0.6165 V_{11} + 0.61544 V_{12} = 0.0489 V_{11}    ...(1)
0.61544 V_{11} + 0.7165 V_{12} = 0.0489 V_{12}    ...(2)

To find the eigenvector we can take either Equation (1) or (2); for both equations the answer will be the same. Taking the first equation :

0.6165 V_{11} + 0.61544 V_{12} = 0.0489 V_{11}
0.5676 V_{11} + 0.61544 V_{12} = 0
V_{11} = -1.0842 V_{12}

Now assume V_{12} = 1, so that V_{11} = -1.0842 and V_1 = \begin{bmatrix} -1.0842 \\ 1 \end{bmatrix}.

Since we assumed the value of V_{12}, we have to normalize V_1 as follows :

V_{1N} = \frac{1}{\sqrt{1.0842^2 + 1}} \begin{bmatrix} -1.0842 \\ 1 \end{bmatrix} = \begin{bmatrix} -0.735 \\ 0.677 \end{bmatrix}

Similarly, we find the second eigenvector, corresponding to the second eigenvalue \lambda_2 = 1.284, from C V_2 = \lambda_2 V_2 :

\begin{bmatrix} 0.6165 & 0.61544 \\ 0.61544 & 0.7165 \end{bmatrix} \begin{bmatrix} V_{21} \\ V_{22} \end{bmatrix} = 1.284 \begin{bmatrix} V_{21} \\ V_{22} \end{bmatrix}

From the above equation we get two equations :

0.6165 V_{21} + 0.61544 V_{22} = 1.284 V_{21}
0.61544 V_{21} + 0.7165 V_{22} = 1.284 V_{22}

To find the eigenvector we can take either equation; for both the answer will be the same. Taking the first equation :

1.284 V_{21} - 0.6165 V_{21} = 0.61544 V_{22}
0.6675 V_{21} = 0.61544 V_{22}
V_{21} = 0.922 V_{22}

Now assume V_{22} = 1, so that V_{21} = 0.922 and V_2 = \begin{bmatrix} 0.922 \\ 1 \end{bmatrix}.

Since we assumed the value of V_{22}, we have to normalize V_2 as follows :

V_{2N} = \frac{1}{\sqrt{0.922^2 + 1}} \begin{bmatrix} 0.922 \\ 1 \end{bmatrix} = \begin{bmatrix} 0.677 \\ 0.735 \end{bmatrix}

Now we have to find the principal component. It is equal to the eigenvector corresponding to the maximum eigenvalue; here \lambda_2 is maximum, hence the principal component is the eigenvector V_{2N} :

Principal component = \begin{bmatrix} 0.677 \\ 0.735 \end{bmatrix}
Chapter Ends..