Principal Components Analysis (Part I): Data Science

Principal Components Analysis (PCA) is introduced as a technique to obtain a low-dimensional summary of NBA team statistics data. PCA works by transforming the data to a new coordinate system such that the greatest variance by any projection of the data lies on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. This has the effect of maximizing the variance explained by each successive component. An example visualization using the first two principal components projects the NBA teams as points in a two-dimensional space that accounts for most of the variance in the original high-dimensional data.

Principal Components Analysis (part I)

Data Science
UTB

CC BY-SA 4.0
Introduction

2 / 77
NBA Team Stats

- NBA Team Stats: regular season (2016-17)
- GitHub file: data/nba-teams-2017.csv
- Source: stats.nba.com
- http://stats.nba.com/teams/traditional/#!?sort=GP&dir=-1

3 / 77
# import the data
dat <- read.csv('data/nba-teams-2017.csv')

# dimensions: 30 teams, 27 variables
dim(dat)

[1] 30 27

# variable names
names(dat)

 [1] "team"                 "games_played"         "wins"
 [4] "losses"               "win_prop"             "minutes"
 [7] "points"               "field_goals"          "field_goals_attempted"
[10] "field_goals_prop"     "points3"              "points3_attempted"
[13] "points3_prop"         "free_throws"          "free_throws_att"
[16] "free_throws_prop"     "off_rebounds"         "def_rebounds"
[19] "rebounds"             "assists"              "turnovers"
[22] "steals"               "blocks"               "block_fga"
[25] "personal_fouls"       "personal_fouls_drawn" "plus_minus"

5 / 77
Exploratory Data Analysis

For illustration purposes, let's focus on the following variables (see the R sketch after this list):

- wins
- losses
- points
- field goals
- assists
- turnovers
- steals
- blocks
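A minimal R sketch of this subsetting step, assuming the dat data frame loaded earlier (column names as printed by names(dat)):

# keep only the eight variables used in the following examples
vars <- c("wins", "losses", "points", "field_goals",
          "assists", "turnovers", "steals", "blocks")
teams <- dat[ , vars]
rownames(teams) <- dat$team   # label rows with the team names
summary(teams)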

6 / 77
EDA: Objects and Variables Perspectives
[Figure: the centered data matrix X, with n rows (objects) indexed by i, p columns (variables) indexed by j, and generic entry xij. Two derived matrices: XXᵀ, the inner products of rows, and (1/n) XᵀX, the covariance matrix.]
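A short sketch of these two matrices in R, assuming the teams data frame defined above:

# centered data matrix (objects in rows, variables in columns)
X <- scale(as.matrix(teams), center = TRUE, scale = FALSE)
n <- nrow(X)

cross_rows <- X %*% t(X)            # X Xᵀ: inner products between teams (n x n)
cov_mat    <- (1 / n) * t(X) %*% X  # (1/n) XᵀX: covariance matrix (p x p)

# R's cov() divides by n-1 instead of n
all.equal(cov_mat * n / (n - 1), cov(X), check.attributes = FALSE)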

7 / 77
EDA: Objects and Variables Perspectives

Data Perspectives
We are interested in analyzing a data set from both perspectives: objects and variables.
At its simplest, we are interested in 2 fundamental purposes:
- Study resemblance among individuals (resemblance among NBA teams)
- Study relationship among variables (relationship among team statistics)

8 / 77
EDA

Exploration
Likewise, we can explore variables at different stages:
- Univariate: one variable at a time
- Bivariate: two variables simultaneously
- Multivariate: multiple variables
Let's see a shiny-app demo (see the apps/ folder in the GitHub repo)

9 / 77
[Figure: one small chart per NBA team (30 teams, from Spurs to Suns) showing the eight selected variables: points, field_goals, losses, assists, wins, turnovers, blocks, steals]
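The original figure appears to show one small profile chart per team; a base-R star-plot sketch in that spirit, assuming the teams data frame from earlier:

# one star per team, segments scaled to each variable's range
stars(teams, labels = rownames(teams), scale = TRUE)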

10 / 77
Correlation heatmap

[Figure: correlation heatmap of wins, losses, points, field_goals, assists, turnovers, steals, blocks; the fill color encodes the Pearson correlation on a scale from -1.0 to 1.0]
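A sketch of how such a heatmap can be drawn, assuming the ggplot2 and reshape2 packages are available (the original figure was likely produced with similar code):

library(ggplot2)
library(reshape2)

# correlation matrix in "long" format: one row per (Var1, Var2) pair
cor_long <- melt(cor(teams), varnames = c("Var1", "Var2"))

ggplot(cor_long, aes(x = Var2, y = Var1, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       limits = c(-1, 1), name = "Pearson\nCorrelation") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))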

11 / 77
[Figure: scatterplot matrix of wins, losses, points, field_goals, assists, turnovers, steals, blocks]
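A minimal base-R sketch of this scatterplot matrix, assuming the teams data frame from earlier:

# all pairwise scatterplots of the eight selected variables
pairs(teams, pch = 19, col = "gray30", cex = 0.8)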

12 / 77
What if we could get a better
low-dimensional summary of the data?

13 / 77
[Figure: PCA map of individuals, with the 30 NBA teams plotted on Dim 1 (46.01%) and Dim 2 (20.22%)]

14 / 77
[Figure: PCA circle of correlations, with the eight variables (wins, losses, points, field_goals, assists, turnovers, steals, blocks) plotted on Dim 1 (46.01%) and Dim 2 (20.22%)]
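A sketch of how these two maps can be obtained, assuming the FactoMineR package (whose default plots label the axes "Dim 1 (...%)", as above); base R's prcomp() would give equivalent results:

library(FactoMineR)

pca <- PCA(teams, scale.unit = TRUE, graph = FALSE)

pca$eig                  # eigenvalues and percentages of explained inertia
plot(pca, choix = "ind") # map of individuals (teams)
plot(pca, choix = "var") # circle of correlations (variables)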

15 / 77
About PCA

16 / 77
Data Structure

Principal Components Analysis (PCA) is a multivariate


method that allows us to study and explore a set of
quantitative variables measured on some objects.

17 / 77
Landmarks

- PCA was first introduced by Karl Pearson (1901):
  "On lines and planes of closest fit to systems of points in space"
- Further developed by Harold Hotelling (1933):
  "Analysis of a complex of statistical variables into principal components"
- Singular Value Decomposition (SVD) theorem by Eckart and Young (1936):
  "The approximation of one matrix by another of lower rank"
- Computationally implemented in the 1960s

18 / 77
Core Idea

With PCA we seek to reduce the dimensionality of a data set (condense the information carried by its variables) while retaining as much as possible of the variation present in the data.

19 / 77
PCA: Overall Goals

- Summarize a data set with the help of a small number of synthetic variables (i.e. the Principal Components).
- Visualize the position (resemblance) of individuals.
- Visualize how variables are correlated.
- Interpret the synthetic variables.

20 / 77
Applications

PCA can be used for


1. Dimension Reduction
2. Visualization
3. Feature Extraction
4. Data Compression
5. Smoothing of Data
6. Detection of Outliers
7. Preliminary process for further analyses

21 / 77
About PCA

Approaches:
PCA can be presented using several different but equivalent approaches. Each approach corresponds to a unique perspective and way of thinking about the data.
- Data dispersion from the individuals' standpoint
- Data variability from the variables' standpoint
- Data that follows a decomposition model
I will present PCA by mixing and connecting all of these approaches.

22 / 77
Geometric Approach

23 / 77
Geometric mindset

PCA for Data Visualization


One way to present PCA is based on a data visualization
approach.

To help you understand the main idea of PCA from a geometric standpoint, I'd like to begin by showing you my mug-data example.

24 / 77
Imagine a data set in a "high-dimensional space"

25 / 77
We are looking for Candidate Subspaces

[Figure: three candidate subspaces HA, HB, HC]

26 / 77
with the best low-dimensional representation

[Figure: the data projected onto the candidate subspaces HA, HB, HC]

27 / 77
Best low-dimensional projection

[Figure: the candidate subspace (among HA, HB, HC) giving the best low-dimensional projection]

28 / 77
Geometric Idea

Looking at the cloud of points


Under a purely geometric approach, PCA aims to represent
the cloud of points in a space with reduced dimensionality in
an “optimal” way.

We look for the “best” graphical representation that allows us


to visualize the cloud of individuals in a low dimensional space
(usually 2-dimensions).

29 / 77
Objects in a high-dimensional space

[Figure: cloud of points in R^p]

30 / 77
We look for a subspace such that

[Figure: a candidate subspace in R^p]

31 / 77
the projection of points on it

[Figure: the points projected onto the subspace]

32 / 77
is the best low-dimensional representation

How do you find the associated axes?


33 / 77
Focus on Distances

Distances between individuals


Looking for the best low-dimensional projection means that we want to find a subspace in which the projected distances among points are as similar as possible to the original distances.

34 / 77
Focus on distances between objects

[Figure: two objects i and h with squared distance d^2(i, h), the centroid g, and a subspace H]

35 / 77
We want projected distances to preserve original distances

[Figure: objects i and h and their projections onto the subspace H, with original squared distance d^2(i, h) and projected squared distance dH^2(i, h)]

We want d^2(i, h) to be as close as possible to dH^2(i, h)

36 / 77
Focus on projected distances

The idea is to project the cloud of points onto a plane (or a low-dimensional space) of R^p, chosen in such a manner as to distort the distances between individuals as little as possible.

37 / 77
Distances and Dispersion

Dispersion of Data
Focusing on distances among all pairs of objects implicitly
entails taking into account the dispersion or spread (i.e.
variation) of the data.

Data Configuration
The reason to pay attention to distances and dispersion is to
summarize in a quantitative way the original configuration of
the data points.

38 / 77
How to measure dispersion?
The concept of Inertia

39 / 77
Sum of Squared Distances

Pairwise squared distances

One way to consider the dispersion of the data (in a mathematical form) is by adding the squared distances among all pairs of points.

Squared distances from the centroid

Another way to measure the dispersion of the data is by considering the squared distances of all points around the center of gravity (i.e. the centroid).

40 / 77
Imagine 3 points and their centroid

[Figure: three teams (GSW, UTA, LAL) as points in the space of variables X1, X2, ..., Xp, together with their centroid]

Centroid g is the “average” team.

41 / 77
Dispersion: Sum of all squared distances

[Figure: the same three teams (GSW, UTA, LAL) with the pairwise distances among them]

SSD = 2 d^2(LAL, GSW) + 2 d^2(LAL, UTA) + 2 d^2(GSW, UTA)

42 / 77
SSD = 2n × (sum of squared distances w.r.t. the centroid)

[Figure: the same three teams with their squared distances to the centroid g]

SSD = (2 × 3) × {d^2(LAL, g) + d^2(GSW, g) + d^2(UTA, g)}

43 / 77
Inertia
One way to take into account the dispersion of the data is
with the concept of Inertia.
- Inertia is a term borrowed from the moment of inertia in mechanics (physics).
- This involves thinking about data as a rigid body (i.e. particles).
- We use the term Inertia to convey the idea of dispersion in the data.
- In multivariate methods, the term Inertia generalizes the notion of variance.
- Think of Inertia as a "multidimensional variance"

44 / 77
Cloud of teams in p-dimensional space
[Figure: cloud of the 30 teams (e.g. GSW, LAL) in the p-dimensional space of variables X1, X2, ..., Xp]

45 / 77
Centroid (i.e. the average team)
[Figure: the cloud of teams together with its centroid (the average team)]

46 / 77
Formula of Total Inertia

The Total Inertia, I, is a weighted sum of squared distances among all pairs of objects:

$$I = \frac{1}{2n^2} \sum_{i=1}^{n} \sum_{h=1}^{n} d^2(i, h)$$

47 / 77
Overall variation/spread (around centroid)
[Figure: the cloud of teams with its overall spread around the centroid]

48 / 77
Formula of Total Inertia

Equivalently, the Total Inertia can be calculated in terms of the centroid g:

$$I = \frac{1}{n} \sum_{i=1}^{n} d^2(x_i, g)$$

The Inertia is an average sum of squared distances around the centroid g.

49 / 77
Centered data: centroid is the origin
[Figure: the centered cloud of teams, with the centroid at the origin]

50 / 77
Computing Inertia

$$\begin{aligned}
\text{Inertia} &= \sum_{i=1}^{n} m_i \, d^2(x_i, g) \\
&= \sum_{i=1}^{n} \frac{1}{n} (x_i - g)^\mathsf{T} (x_i - g) \\
&= \frac{1}{n} \mathrm{tr}(X^\mathsf{T} X) \\
&= \frac{1}{n} \mathrm{tr}(X X^\mathsf{T})
\end{aligned}$$

where m_i is the mass (i.e. weight) of individual i, usually 1/n.
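A short numerical check in R, assuming the teams data frame from earlier (note that scale() standardizes with the n-1 denominator, so the total comes out as p(n-1)/n rather than exactly p):

X <- scale(teams)                 # centered and standardized data matrix
n <- nrow(X)

inertia_dist  <- mean(rowSums(X^2))           # (1/n) sum of d^2(x_i, g), with g = 0
inertia_trace <- sum(diag(crossprod(X))) / n  # (1/n) tr(X'X)

c(inertia_dist, inertia_trace)    # identical values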

51 / 77
Finding Principal Components

52 / 77
Inertia Concept

Inertia and PCA


In PCA we look for a low-dimensional subspace having
Projected Inertia as close as possible to the Original Inertia.

Criterion
The criterion used for dimensionality reduction implies that the
inertia of a cloud of points in the optimal subspace is
maximum (but less than the inertia in the original space).

53 / 77
Criterion

Maximize Projected Inertia


We want to maximize the Projected Inertia on the subspace H:

$$\max \sum_{i} d_H^2(x_i, g)$$

Axis of Inertia
To find the subspace H we can look for each of its axes ∆1, ∆2, ..., ∆k and their corresponding vectors v1, v2, ..., vk (k < p).

54 / 77
Looking for a 1st axis

[Figure: cloud of points spanning the axes X1, X2, ..., Xp]

NBA teams in a p-dimensional space

55 / 77
1st axis

[Figure: a first axis (axis 1) passing through the cloud of points]

We want a 1st axis that retains most of the projected inertia

56 / 77
First Axis and Principal Component

Projection of object i on axis ∆1, generated by vector v1:

$$x_i^\mathsf{T} v_1 = \sum_{j=1}^{p} x_{ij} v_{1j}$$

The 1st component z1 is the projection of all points on v1:

$$X v_1 = z_1$$

(we don't really manipulate the axis ∆1, but its associated vector v1)

57 / 77
First Axis and Principal Component

- The axis ∆1 passes through the centroid g (with centered data, g is the origin)
- The axis ∆1 is created by the unit-norm vector v1, the eigenvector of (1/n) XᵀX associated with the largest eigenvalue λ1
- The inertia explained by the axis ∆1 is equal to λ1
- With standardized data, the proportion of inertia explained by ∆1 is λ1/p
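A sketch in R, assuming the teams data frame from earlier. The eigenvectors of (1/n) XᵀX and of the correlation matrix R are the same (only the eigenvalues are rescaled), so cor() and eigen() can be used directly:

X  <- scale(teams)                # standardized data
R  <- cor(teams)                  # correlation matrix

e  <- eigen(R)
v1 <- e$vectors[, 1]              # unit-norm vector generating the first axis
z1 <- X %*% v1                    # first principal component: z1 = X v1

e$values[1] / ncol(X)             # proportion of inertia explained by axis 1

# prcomp() returns the same component, possibly with the opposite sign
p1 <- prcomp(teams, scale. = TRUE)$x[, 1]
all.equal(abs(as.numeric(z1)), abs(as.numeric(p1)))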

58 / 77
2nd axis
[Figure: the cloud of points with a second axis (axis 2) orthogonal to the first (axis 1)]

We want a 2nd axis, orthogonal to ∆1, that retains most of the remaining projected inertia

59 / 77
Second Axis and Principal Component

Projection of object i on axis ∆2, generated by vector v2:

$$x_i^\mathsf{T} v_2 = \sum_{j=1}^{p} x_{ij} v_{2j}$$

The 2nd component z2 is the projection of all points on v2:

$$X v_2 = z_2$$

(we don't really manipulate the axis ∆2, but its associated vector v2)

60 / 77
Second Axis and Principal Component

- The axis ∆2 passes through the centroid g and is perpendicular to ∆1
- The axis ∆2 is created by the unit-norm vector v2, the eigenvector of (1/n) XᵀX associated with the second largest eigenvalue λ2
- The inertia explained by the axis ∆2 is equal to λ2
- With standardized data, the proportion of inertia explained by ∆2 is λ2/p

61 / 77
Computational note

In practice, most software routines for PCA don't really work with the population covariance matrix (1/n) XᵀX.

Instead, most programs work with the sample covariance matrix: (1/(n-1)) XᵀX.

Notice that with standardized data, (1/(n-1)) XᵀX = R, the (sample) correlation matrix.
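A quick check of the n-1 convention in R, again assuming the teams data frame:

X <- scale(teams)   # standardized data
n <- nrow(X)

# sample covariance of standardized data = correlation matrix of the raw data
all.equal(crossprod(X) / (n - 1), cov(X), check.attributes = FALSE)
all.equal(crossprod(X) / (n - 1), cor(teams), check.attributes = FALSE)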

62 / 77
Looking at Variables

63 / 77
Looking at the cloud of standardized variables

[Figure: two standardized variables Xj and Xl shown as vectors in R^n, with the angle θjl between them]

64 / 77
Looking at the cloud of standardized variables

- With standardized data, the p variables are located within a hypersphere of radius 1 in an n-dimensional space.
- We represent them graphically as vectors.
- The scalar product between two variables Xj and Xl is:

$$\langle X_j, X_l \rangle = \sum_{i=1}^{n} x_{ij} x_{il} = \|x_j\| \, \|x_l\| \cos(\theta_{jl})$$

- Notice that:

$$\cos(\theta_{jl}) = \frac{x_j^\mathsf{T} x_l}{\|x_j\| \, \|x_l\|} = \mathrm{cor}(X_j, X_l)$$
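A small R check of this property, assuming the teams data frame from earlier; centering is enough for the cosine to equal the correlation:

x <- scale(teams$wins,   center = TRUE, scale = FALSE)
y <- scale(teams$points, center = TRUE, scale = FALSE)

cos_angle <- sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

c(cos_angle, cor(teams$wins, teams$points))   # identical values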

65 / 77
Projecting the cloud of standardized variables

- The property cos(θjl) = cor(Xj, Xl) is essential in PCA
- A representation of the cloud of variables can be used to visualize the correlations (through the angles between variables)
- The cloud of variables is also projected onto a low dimensional space.
- In this case, the distance between two variables is computed with inner products.

66 / 77
Projection of best subspace

[Figure: variables A, B, C, D shown as vectors, with their projections HA, HB, HC, HD onto the main plane]

Projection of the scatterplot of variables on the main plane of variability

67 / 77
Projecting the cloud of standardized variables

The projection of variable j onto an axis k is equal to the cosine of the angle θjk.

The criterion maximizes:

$$\sum_{j=1}^{p} \cos^2(\theta_{jk}) = \sum_{j=1}^{p} \mathrm{cor}^2(x_j, z_k)$$

where zk is the new variable that is the most correlated with all of the original variables.

68 / 77
Finding subspace for variables

Projection of variable j on axis H1, generated by vector u1:

$$x_j^\mathsf{T} u_1 = \sum_{i=1}^{n} x_{ij} u_{i1}$$

The synthetic variable u1 can be used to obtain a factor q1:

$$X^\mathsf{T} u_1 = q_1$$

(we don't really manipulate the axis H1, but its associated vector u1)

69 / 77
Finding subspace for variables

Solution: u1 is the first eigenvector of (1/n) XXᵀ, the matrix of inner products between individuals:

$$\frac{1}{n} X X^\mathsf{T} u_1 = \lambda_1 u_1$$

The subsequent dimensions are given by the other eigenvectors u2, u3, ...

And the corresponding variable factors are given by:

$$Q = X^\mathsf{T} U$$

70 / 77
Finding subspace for variables

It can be shown that the PCs can also be obtained as:

$$Z = \frac{1}{\sqrt{n}} \, U \Lambda^{1/2}$$

where:
- Λ is the diagonal matrix of eigenvalues of (1/n) XXᵀ
- U is the matrix of eigenvectors of (1/n) XXᵀ

But keep in mind that PCs can be rescaled.

Note that most PCA programs work with n − 1 instead of n.

71 / 77
Relationship between the representations of Individuals and Variables

72 / 77
Link between representations

SVD of the scaled data matrix:

$$\frac{1}{\sqrt{n-1}} X = U D V^\mathsf{T}$$

$$Z = X V = \frac{1}{\sqrt{n-1}} X Q D^{-1} \quad \Rightarrow \quad V = \frac{1}{\sqrt{n-1}} Q D^{-1}$$

$$Q = X^\mathsf{T} U = \frac{1}{\sqrt{n-1}} X^\mathsf{T} Z D^{-1} \quad \Rightarrow \quad U = \frac{1}{\sqrt{n-1}} Z D^{-1}$$
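A numerical check of these identities with base R's svd(), assuming the standardized teams data from before:

X <- scale(teams)
n <- nrow(X)

s <- svd(X / sqrt(n - 1))    # (1/sqrt(n-1)) X = U D V'
U <- s$u; D <- diag(s$d); V <- s$v

Z <- X %*% V                 # principal components (scores)
Q <- t(X) %*% U              # factors for the variables

all.equal(V, (1 / sqrt(n - 1)) * Q %*% solve(D), check.attributes = FALSE)
all.equal(U, (1 / sqrt(n - 1)) * Z %*% solve(D), check.attributes = FALSE)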

73 / 77
Link between representations

Principal Components or Scores


$$z_{ik} = \frac{1}{\sqrt{n-1}} \times \frac{1}{\sqrt{\lambda_k}} \times \sum_{j=1}^{p} x_{ij} q_{jk}$$

Factors for Variables

$$q_{jk} = \frac{1}{\sqrt{n-1}} \times \frac{1}{\sqrt{\lambda_k}} \times \sum_{i=1}^{n} x_{ij} z_{ik}$$

74 / 77
Principal Components?

Meaning of Principal
The term Principal, as used in PCA, has to do with the
notion of principal axis from geometry and linear algebra

Principal Axis
A principal axis is a certain line in a Euclidean space associated with an ellipsoid or hyperboloid, generalizing the major and minor axes of an ellipse

75 / 77
References

- Exploratory Multivariate Analysis by Example Using R by Husson, Lê and Pagès (2010). Chapter 1: Principal Component Analysis (PCA). CRC Press.
- An R and S-Plus Companion to Multivariate Analysis by Brian Everitt (2004). Chapter 3: Principal Components Analysis. Springer.
- Principal Component Analysis by Ian Jolliffe (2002). Springer.
- Data Mining and Statistics for Decision Making by Stéphane Tufféry (2011). Chapter 7: Factor Analysis. Editions Technip, Paris.

76 / 77
References (French Literature)

- Statistique Exploratoire Multidimensionnelle by Lebart et al. (2004). Chapter 3, section 3: Analyse factorielle discriminante. Dunod, Paris.
- Probabilités, analyse des données et statistique by Gilbert Saporta (2011). Chapter 6: Analyse en Composantes Principales. Editions Technip, Paris.
- Statistique: Méthodes pour décrire, expliquer et prévoir by Michel Tenenhaus (2008). Chapter 10: L'analyse discriminante. Dunod, Paris.
- Analyses factorielles simples et multiples by Brigitte Escofier and Jérôme Pagès (2016, 5th edition). Chapter 2: L'analyse discriminante. Dunod, Paris.

77 / 77
