
Chapter 4: Normalized Principal Components Analysis

Dr. Lassad El Moubarki


Tunis Business School

November 10, 2020

L. El Moubarki Normalized PCA November 10, 2020 1 / 23


1 Initial data

2 Example of data

3 Objectives from a technical point of view.

4 Individuals Scatter plot analysis

5 Variable scatterplot analysis

6 Inertia and the choice of the number of principal axes

7 Interpretation of variables factor maps

8 Interpretation of the individual factors map

9 PCA with R



Initial data

Initial data

We have a rectangular data table wherein columns are quantitative variables
(turnover, rate, weight, . . .) and rows represent statistical individuals (basic
units such as human beings, countries, years).

$$
X = \begin{pmatrix}
x_{11} & \dots & x_{1j} & \dots & x_{1p} \\
\vdots & & \vdots & & \vdots \\
x_{i1} & \dots & x_{ij} & \dots & x_{ip} \\
\vdots & & \vdots & & \vdots \\
x_{n1} & \dots & x_{nj} & \dots & x_{np}
\end{pmatrix}; \qquad
X^j = \begin{pmatrix} x_{1j} \\ \vdots \\ x_{ij} \\ \vdots \\ x_{nj} \end{pmatrix}; \qquad
X_i = \begin{pmatrix} x_{i1} & \dots & x_{ij} & \dots & x_{ip} \end{pmatrix}
$$



Example of data

Example of data

PETI CHER DISPO GOUT ARRG SUCRE SATISF


cocacola 7.47 3.75 8.16 7.73 5.15 7.27 6.04
dietpepsi 6.19 4.11 7.43 4.23 6.87 4.70 4.00
sevenup 7.09 4.05 7.65 7.21 4.25 6.54 6.34
rootbeer 4.71 4.70 5.26 5.45 6.27 6.74 4.47
mountaind 6.14 5.79 5.99 5.12 5.50 6.46 4.88
pepsicola 7.56 4.35 7.98 6.85 5.35 6.93 5.40
fresta 6.28 4.46 6.67 5.98 6.22 6.07 5.16
crush 5.76 4.91 5.74 6.84 5.53 6.58 5.92
sprite 6.58 4.61 6.80 6.74 4.70 7.00 6.55



Objectives from a technical point of view.

Objectives from a technical point of view.

Reduce the dimensionality of the data by looking for the best
visualization planes, obtained by applying orthogonal projections to
the data.
Group homogeneous individuals together and identify exceptional
individuals.
Analyze the relationships between the variables.



Objectives from a technical point of view.

[Figure: illustration of the orthogonal projection of the data cloud onto a factorial plane.]


Objectives from a technical point of view.

Case studies

Study the perception of a brand by the consumer. (Example:
boisson.csv).
Study the evolution of the financial situation of a company over time.
(Example: groupe petrolier.csv).
Compare several car brands on the market. (Example: voiture.csv).



Objectives from a technical point of view.

Data standardization: centring and reducing the data

$$
Z = \left( z_{ij} \right)_{n \times p}, \qquad z_{ij} = \frac{x_{ij} - \bar{X}_j}{\sigma_j}
$$

Remark: throughout the rest of the chapter we assume that the data is
normalized.
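As a concrete illustration, centring and reducing takes a few lines; this is a minimal numpy sketch with invented data (the course itself works in R):

```python
import numpy as np

# Toy data matrix X: n = 4 individuals (rows), p = 3 quantitative variables (columns).
X = np.array([
    [7.0, 3.0, 8.0],
    [6.0, 4.0, 7.0],
    [5.0, 5.0, 5.0],
    [6.0, 4.0, 6.0],
])

# Centring and reducing: z_ij = (x_ij - mean_j) / sigma_j,
# with the population standard deviation (ddof = 0), as in the slides.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Each standardized column now has mean 0 and variance 1.
print(np.allclose(Z.mean(axis=0), 0))  # True
print(np.allclose(Z.var(axis=0), 1))   # True
```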



Individuals Scatter plot analysis

Identification of the principal axes.

Principal axes:
1 The principal axes $\Delta_1, \Delta_2, \dots, \Delta_p$ are identified by computing the
eigenvalues and eigenvectors of the correlation matrix $R = {}^t Z D Z$
(where $D = \frac{1}{n} I_n$ is the weight matrix).
2 We then sort the eigenvalues in decreasing order,
$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p$, and denote by $U$ the
$(p \times p)$ matrix whose columns are the corresponding eigenvectors $u_j$.
Principal components:
1 The coordinates of the individuals over the new axes formed by the
eigenvectors are given by the scalar products:
$$
C = ZU = \begin{pmatrix}
C_1^1 & \dots & C_1^\alpha & \dots & C_1^p \\
\vdots & & \vdots & & \vdots \\
C_n^1 & \dots & C_n^\alpha & \dots & C_n^p
\end{pmatrix}
$$
2 Any pair of columns of the matrix $C$ defines a factor map.
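The two steps above (diagonalize $R$, then project) can be sketched with numpy; the data here is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))                  # 50 individuals, 4 variables (toy data)
Z = (X - X.mean(axis=0)) / X.std(axis=0)      # normalized data
n = Z.shape[0]

# Correlation matrix R = Z' D Z with weight matrix D = (1/n) I_n.
R = Z.T @ Z / n

# Eigenvalues / eigenvectors of R, sorted in decreasing order of eigenvalue.
eigval, eigvec = np.linalg.eigh(R)
order = np.argsort(eigval)[::-1]
lam, U = eigval[order], eigvec[:, order]

# Principal components C = ZU: column alpha holds the coordinates of all
# individuals on the principal axis Delta_alpha.
C = Z @ U

# Var(C^alpha) = lambda_alpha, as stated in the remarks slide.
print(np.allclose(C.var(axis=0), lam))  # True
```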



Individuals Scatter plot analysis

Factors map and projection quality

Absolute contribution:
The absolute contribution of a given point $i$ to the inertia projected
on the axis $\alpha$ is:
$$
ACTR(i, \alpha) = \frac{p_i \, (C_i^\alpha)^2}{\lambda_\alpha}
$$
Relative contribution or cosine square:
The relative contribution of a given point $i$ on the axis $\alpha$ is:
$$
RCTR(i, \alpha) = \cos^2(z_i, \hat{z}_i^\alpha) = \frac{(C_i^\alpha)^2}{\|z_i\|^2}
$$
where $\hat{z}_i^\alpha$ is the orthogonal projection of $z_i$ on the axis $\alpha$.
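Both contributions follow directly from the components; a numpy sketch on invented data, with the two classical sanity checks:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
Z = (X - X.mean(axis=0)) / X.std(axis=0)
n = Z.shape[0]

eigval, eigvec = np.linalg.eigh(Z.T @ Z / n)
order = np.argsort(eigval)[::-1]
lam, U = eigval[order], eigvec[:, order]
C = Z @ U

# Absolute contribution of individual i to axis alpha:
# ACTR(i, alpha) = p_i * (C_i^alpha)^2 / lambda_alpha, with weights p_i = 1/n.
ACTR = (C ** 2) / (n * lam)

# Relative contribution (squared cosine) of individual i on axis alpha:
# RCTR(i, alpha) = (C_i^alpha)^2 / ||z_i||^2.
RCTR = (C ** 2) / (Z ** 2).sum(axis=1, keepdims=True)

# Each axis's ACTR sums to 1 over the individuals; each individual's RCTR
# sums to 1 over all p axes (its cosines with a full orthonormal basis).
print(np.allclose(ACTR.sum(axis=0), 1))  # True
print(np.allclose(RCTR.sum(axis=1), 1))  # True
```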



Individuals Scatter plot analysis

Remarks

$C^1$ is the variable which gives the best description of the data
dispersion.
The best plane for visualizing the data is the factor map formed by
the two axes $C^1$ and $C^2$.
The variables $C^\alpha$ ($\alpha = 1, \dots, p$) are orthogonal (not correlated).
The variables $C^\alpha$ ($\alpha = 1, \dots, p$) are linear combinations of the $Z^j$
variables, so each $C^\alpha$ is also centred.
For all $\alpha \le p$: $\operatorname{Var}(C^\alpha) = \lambda_\alpha$.



Variable scatterplot analysis

Variable scatterplot analysis

 
The coordinates of the $j$-th variable point:
$$
Z^j = \begin{pmatrix} z_{1j} \\ z_{2j} \\ \vdots \\ z_{nj} \end{pmatrix}
$$
The eigenvectors $(v_1, v_2, \dots)$ defining the principal axes of this
second scatterplot are given by the transition formula:
$$
v_\alpha = \frac{1}{\sqrt{\lambda_\alpha}} Z u_\alpha
$$
The factor coordinate of each variable point $j$ on the axis $\alpha$ is
given by:
$$
S_j^\alpha = \sqrt{\lambda_\alpha} \, u_\alpha^j
$$
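The variable coordinates can be checked numerically; in this numpy sketch (invented data), $S_j^\alpha$ is also the correlation between variable $Z^j$ and component $C^\alpha$, which is why the variable points live in the unit circle:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 5))
Z = (X - X.mean(axis=0)) / X.std(axis=0)
n = Z.shape[0]

eigval, eigvec = np.linalg.eigh(Z.T @ Z / n)
order = np.argsort(eigval)[::-1]
lam, U = eigval[order], eigvec[:, order]

# Factor coordinates of the variable points: S_j^alpha = sqrt(lambda_alpha) * u_alpha^j.
S = U * np.sqrt(lam)      # row j = variable j, column alpha = axis alpha

# Over all p axes, sum_alpha (S_j^alpha)^2 = R_jj = 1, so each variable point
# sits exactly on the unit (correlation) circle; on a 2-axis factor map it
# falls inside it, with Com(j, (1,2)) = (S_j^1)^2 + (S_j^2)^2 <= 1.
radius = np.sqrt((S ** 2).sum(axis=1))
print(np.allclose(radius, 1))  # True
```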



Variable scatterplot analysis

Projection quality of the variables points

The relative contribution (or cosine square) of the projection of the
variable point $Z^j$ on the axis $\alpha$ is given by the cosine square of
the angle formed by the vector $Z^j$ and its projection $\hat{z}^{j,\alpha}$:
$$
CTR(j, \alpha) = \cos^2(Z^j, \hat{z}^{j,\alpha}) = (S_j^\alpha)^2
$$
The communality according to the first factor map is:
$$
Com(j, (1, 2)) = (S_j^1)^2 + (S_j^2)^2
$$



Variable scatterplot analysis

Correlation circle

The projections $\hat{z}^j$ of the variables $Z^j$ onto the principal maps lie
inside a circle of center O and radius 1: this circle is named the
correlation circle.



Inertia and the choice of the number of principal axes

Inertia and the choice of the number of axes to retain

The total inertia of the initial scatterplot of individuals is equal to:
$$
I = \sum_{i=1}^{p} \lambda_i = p
$$
The overall quality of the representation of the scatterplot on the
principal subspace spanned by $(u_1, u_2, \dots, u_q)$ is measured by the
proportion of the inertia absorbed by this subspace; it is worth:
$$
\frac{\lambda_1 + \dots + \lambda_q}{p}
$$
So, the rate of inertia absorbed by the first factor map (or principal
factor map) is:
$$
\frac{\lambda_1 + \lambda_2}{p}
$$
Inertia and the choice of the number of principal axes

Number of axes to retain: inertia criterion

The criterion usually employed to measure PCA quality is the percentage
of total inertia explained by the first $k$ chosen axes. It is defined by:
$$
\text{Rate}_k = \frac{\lambda_1 + \dots + \lambda_k}{\sum_{i=1}^{p} \lambda_i} = \frac{\lambda_1 + \dots + \lambda_k}{p}
$$
This rate measures the explanatory power of the first $k$ axes (or factors):
it represents the share of the total variance taken into account by these
$k$ axes. However, its appreciation must take into account the number of
variables and the number of individuals. For example, an inertia rate of
10% for one axis can be an important value if we have 100 variables, but a
low one if there are only 7 variables.
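$\text{Rate}_k$ is a one-liner once the eigenvalues are sorted; a numpy sketch on invented data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(25, 6))
Z = (X - X.mean(axis=0)) / X.std(axis=0)
n, p = Z.shape

# Eigenvalues of the correlation matrix, sorted in decreasing order.
lam = np.sort(np.linalg.eigvalsh(Z.T @ Z / n))[::-1]

# In a normalized PCA the total inertia equals p, the number of variables.
print(np.isclose(lam.sum(), p))  # True

# Rate_k for k = 1, ..., p: share of total inertia explained by the first k axes.
rates = np.cumsum(lam) / p
print(np.round(rates, 3))  # increases toward 1.0
```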



Inertia and the choice of the number of principal axes

Number of axes to retain: eigenvalue (Kaiser) criterion

This criterion consists in keeping, in a normalized PCA, only those axes
whose eigenvalue is greater than 1 (i.e. the average inertia, since the
total inertia $p$ is spread over $p$ axes).
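With hypothetical eigenvalues for $p = 7$ variables (invented, chosen to sum to 7), the rule reads:

```python
import numpy as np

# Hypothetical eigenvalues of a normalized PCA on p = 7 variables.
lam = np.array([2.9, 1.6, 1.1, 0.6, 0.4, 0.3, 0.1])

# Keep only the axes whose eigenvalue exceeds the average inertia, which is
# 1 in a normalized PCA (total inertia p spread over p axes).
k = int((lam > 1).sum())
print(k)  # 3 axes retained
```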



Interpretation of variables factor maps

Interpretation of variables factor maps

1 Variables to keep: we keep only the variables well represented on the
factor map (i.e. variables close to the correlation circle).
2 Variable-axis: variables strongly correlated with a factor contribute
to the definition of this axis.
3 Variable-variable:
The proximity of the projections of two variables indicates a strong
positive correlation between them.
Diametrically opposite projections indicate a negative correlation
between them.
4 Nearly orthogonal directions indicate a weak linear correlation.



Interpretation of variables factor maps

Interpretation of variable factor maps

[Figure: example correlation circle with the variables v1, . . . , v6 discussed below.]


Interpretation of variables factor maps

1 Interpretation according to the principal axes:
Exclude the variable v6 from the analysis.
The first principal component is strongly correlated with the variables
v1, v2 and v4.
The first component is very weakly correlated with v3 and v5.
The first component opposes the variable v4 to the variables v1 and
v2.
The second component opposes the variable v3 to the variable v5.
2 Interpretation with respect to the positions of the variables:
Positive correlation between v1 and v2.
Negative correlation between v3 and v5.
Negative correlation between {v1, v2} and v4.
No correlation between {v1, v2} and {v5, v3}.
No correlation between v4 and {v5, v3}.



Interpretation of variables factor maps

Rotation

To ease the interpretation task, it may be convenient, once the number of
factors is determined, to rotate the axes. Rotation (the varimax method,
. . .) brings the solution closer to a simple structure:
Each component is strongly correlated with some variables and little
correlated with the others.
Each variable is correlated with a single component. In this case, the
information restored by the factorial plane remains the same, but that
restored by the individual axes changes.
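A varimax rotation can be sketched with the classical iterative SVD algorithm; the loadings below are invented, and this is an illustrative implementation rather than the one used in class:

```python
import numpy as np

def varimax(S, max_iter=100, tol=1e-8):
    """Varimax rotation of a loadings matrix S (variables x factors).

    Classical iterative SVD formulation; returns the rotated loadings.
    """
    p, k = S.shape
    rot = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        d_old = d
        lam = S @ rot
        # Gradient of the varimax criterion for the current rotation.
        grad = lam ** 3 - lam @ np.diag((lam ** 2).sum(axis=0)) / p
        u, s, vt = np.linalg.svd(S.T @ grad)
        rot = u @ vt
        d = s.sum()
        if d_old != 0 and d / d_old < 1 + tol:
            break
    return S @ rot

# Loadings deliberately split between two factors; rotation pushes each
# variable toward a single factor while preserving the communalities.
S = np.array([[0.7, 0.7], [0.7, -0.7], [0.6, 0.6], [0.6, -0.6]])
L = varimax(S)
print(np.round(L, 2))
```

Because the rotation is orthogonal, each variable's communality (its squared row norm) is unchanged, which is exactly why the information restored by the plane stays the same while the per-axis shares change.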



Interpretation of the individual factors map

Interpretation of the individual factors map

[Figure: example factor map of the individuals.]


PCA with R

Case Studies

This is the list of data sets available on the drive:


voiture.csv
depenses.csv
budgetemps.csv
chaises.csv
groupe petrolier.csv
Remark: these data sets will be analysed in class.
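Before the class sessions, the full normalized PCA pipeline can be previewed on the soft-drink table from the "Example of data" slide; this is a numpy sketch (in R the same analysis would typically use prcomp with scale. = TRUE, or FactoMineR's PCA):

```python
import numpy as np

# Soft-drink ratings from the "Example of data" slide (7 variables).
names = ["cocacola", "dietpepsi", "sevenup", "rootbeer", "mountaind",
         "pepsicola", "fresta", "crush", "sprite"]
X = np.array([
    [7.47, 3.75, 8.16, 7.73, 5.15, 7.27, 6.04],
    [6.19, 4.11, 7.43, 4.23, 6.87, 4.70, 4.00],
    [7.09, 4.05, 7.65, 7.21, 4.25, 6.54, 6.34],
    [4.71, 4.70, 5.26, 5.45, 6.27, 6.74, 4.47],
    [6.14, 5.79, 5.99, 5.12, 5.50, 6.46, 4.88],
    [7.56, 4.35, 7.98, 6.85, 5.35, 6.93, 5.40],
    [6.28, 4.46, 6.67, 5.98, 6.22, 6.07, 5.16],
    [5.76, 4.91, 5.74, 6.84, 5.53, 6.58, 5.92],
    [6.58, 4.61, 6.80, 6.74, 4.70, 7.00, 6.55],
])
n, p = X.shape

# Normalized PCA: standardize, diagonalize the correlation matrix, project.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
eigval, eigvec = np.linalg.eigh(Z.T @ Z / n)
order = np.argsort(eigval)[::-1]
lam, U = eigval[order], eigvec[:, order]
C = Z @ U  # coordinates of the individuals on the principal axes

# Inertia absorbed by the first factor map (axes 1 and 2).
print(f"first factor map: {(lam[0] + lam[1]) / p:.1%} of total inertia")
for name, (c1, c2) in zip(names, C[:, :2]):
    print(f"{name:10s} C1 = {c1:6.2f}   C2 = {c2:6.2f}")
```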

