Tutorial 4
R Tutorial 4
15 November 2024
Important Instructions
• These weekly exercises are highly relevant to the group assignment.
1 Principal Component Analysis: A Summary
Running a Principal Component Analysis means that we transform some standardized
variables X = [X_1, X_2, ..., X_N] into a new set of variables (principal components)
Z = [Z_1, Z_2, ..., Z_N] by multiplying them by a matrix Φ = [ϕ_1, ϕ_2, ..., ϕ_N].
That is:
\[
\underset{T \times N}{X} \; \underset{N \times N}{\Phi} \;=\; \underset{T \times N}{Z} \qquad (1)
\]
For example, in our tutorial, the columns of X will represent standardized asset
returns observed over T periods (that is why X is a T × N matrix). The key feature
of this transformation is that the first principal component has the largest variance
possible, and each subsequent component has the largest variance possible subject to
being uncorrelated with the preceding ones. As a result, all principal components in Z
are mutually uncorrelated.
To obtain the matrix Φ we do an eigendecomposition of the variance-covariance
matrix of X:
\[
\operatorname{Var}(X) = \Phi \Lambda \Phi^{T} \qquad (2)
\]
Each column of Φ is an eigenvector from this decomposition, and the diagonal
elements of Λ, namely λ_1, λ_2, ..., λ_N, are the corresponding eigenvalues. These
eigenvalues are equal to the variances of the principal components in Z. That is,
λ_1 = Var(Z_1), λ_2 = Var(Z_2), ..., λ_N = Var(Z_N).
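The relationships in Equations 1 and 2 can be checked numerically. The following is a minimal sketch in base R on simulated data (the data, the seed, and the names T_obs, raw, Phi and lambda are illustrative assumptions, not part of the tutorial dataset). It builds Φ from the eigendecomposition of the covariance matrix and confirms that the eigenvalues match the variances of the resulting components.

```r
set.seed(1)

# Simulated stand-in for T = 200 observations of N = 4 correlated series
T_obs <- 200
N <- 4
raw <- matrix(rnorm(T_obs * N), T_obs, N) %*% matrix(runif(N * N), N, N)

# Standardize each column (mean 0, variance 1)
X <- scale(raw)

# Equation (2): eigendecomposition of the variance-covariance matrix of X
eig <- eigen(cov(X))
Phi <- eig$vectors    # columns are the eigenvectors
lambda <- eig$values  # eigenvalues, sorted in decreasing order

# Equation (1): principal components Z = X Phi
Z <- X %*% Phi

# The eigenvalues should equal the variances of the columns of Z
print(cbind(lambda = lambda, var_Z = apply(Z, 2, var)))

# The covariance matrix of Z should be diagonal (up to rounding error)
print(round(cov(Z), 10))
```

Note that eigenvectors are only defined up to sign, so components obtained this way may differ from those returned by other routines by a factor of -1.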
2 Question 1
For this tutorial, we will use data on total return series for several indices that track
the performance of global markets (equity indices, fixed income indices, etc.). The
data can be found in “Indices.xlsx”, sheet “data”. The sheet “info” contains their
complete name and their asset class classification.1 Our goal is to extract Principal
Components (PCs), and illustrate how one could (potentially) “identify” them. In
the tutorial, we will focus on the identification of the first one, but of course, you
can work on the identification of the others on your own.
1. Load “Indices.xlsx” into R. To this end, you can use the function read_excel
from the library readxl. Make sure you format the date correctly. Next,
compute 52-week returns for every series. Visualise how correlated these series
are (Hint: you can look at the package ggcorrplot).
2. Use the function prcomp to run a Principal Components Analysis on the time
series of returns computed in the previous item. Make sure to deal with
empty values beforehand, as prcomp will otherwise give you errors. Analyse
the output. What is the proportion of variance explained by the first principal
component?
3. Extract the eigenvalues. You can do so directly from the output of the prcomp
function, or use the function get_eig from the package factoextra.
4. Extract the principal components (PCs, from now on). Visualise the first
three. Can you associate the behaviour of any of them with some macroeconomic
variable?
5. For each PC, compute its variance and the proportion it represents of the
total variance (the sum of the variances of all PCs). Which principal
component has the largest variance? Compare your calculations to the output
of the summary function applied to the result of prcomp.
6. Confirm that the PCs you obtain are uncorrelated. Confirm as well that you
can compute the exact same PCs by using Equation 1.
7. Keep only the first four principal components. Compute the correlation of
each index return in the dataset with each of the principal components. You
can find information about the index (i.e., full name, asset class) under the
sheet “info”. Which assets are more correlated with the first and second
principal components?
1 These indices have been collected from a Bloomberg terminal (the tickers are the same as the
column headers).
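The workflow in the items above can be sketched as follows. Since “Indices.xlsx” is not reproduced here, this sketch simulates weekly index levels instead; the tickers IDX1 to IDX5, the seed, and all variable names are hypothetical. With the real file, a call such as readxl::read_excel("Indices.xlsx", sheet = "data") would replace the simulation, and the returns are assumed here to be simple 52-week returns. This is one possible approach, not a model solution.

```r
set.seed(42)

# Hypothetical stand-in for the "data" sheet: weekly dates plus index levels
n_weeks <- 300
n_idx <- 5
common <- rnorm(n_weeks, sd = 0.02)  # shock shared by all indices
shocks <- sapply(1:n_idx, function(i) 0.001 + common + rnorm(n_weeks, sd = 0.01))
px <- apply(shocks, 2, function(r) 100 * cumprod(1 + r))
colnames(px) <- paste0("IDX", 1:n_idx)  # hypothetical tickers
dates <- seq(as.Date("2015-01-02"), by = "week", length.out = n_weeks)

# 52-week simple returns: P_t / P_{t-52} - 1 (the first 52 rows are NA)
h <- 52
ret <- rbind(matrix(NA_real_, h, n_idx),
             px[(h + 1):n_weeks, ] / px[1:(n_weeks - h), ] - 1)
colnames(ret) <- colnames(px)

# prcomp cannot handle missing values, so drop incomplete rows first
keep <- complete.cases(ret)
ret_clean <- ret[keep, ]
dates_clean <- dates[keep]  # matching dates, e.g. for plotting the PCs

# PCA on standardized returns
pca <- prcomp(ret_clean, center = TRUE, scale. = TRUE)

# Eigenvalues are the variances of the PCs
eigenvalues <- pca$sdev^2
prop_var <- eigenvalues / sum(eigenvalues)
print(round(prop_var, 3))
print(summary(pca)$importance)  # should agree with prop_var

# The PCs are uncorrelated, and Equation 1 reproduces them
Z <- scale(ret_clean) %*% pca$rotation
print(round(cor(pca$x), 10))
print(all.equal(Z, pca$x, check.attributes = FALSE))

# Correlation of each index return with the first four PCs
print(round(cor(ret_clean, pca$x[, 1:4]), 2))
```

For the correlation plot in item 1, the matrix cor(ret_clean) is the natural input to a plotting function such as the one suggested in the hint.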
3 Question 2
The objective of this question is to classify the customers of a given firm based
on their transactions included in the dataset ProductTransactions. In this exercise,
you will practice how to estimate a clustering model through k-means and review
how to construct variables by grouping.
1. The database contains returns and other strange entries. To delete them,
drop every observation with negative prices or quantities and those whose
StockCode is not numeric. Report how many observations were dropped.
4. Some of the variables take very extreme values. To limit their impact,
winsorize the variables at the 5% level. To reduce the skewness of the
quantities in italics, take logs. Provide the summary statistics.
5. Apply the k-means algorithm to the data and separate clients into two groups.
Make sure you scale the variables properly before doing this. Provide the
number of clients in each group and the center of each group. Discuss which
variables differentiate the groups the most.
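The winsorizing, log and k-means steps can be sketched as follows. The customer-level data are simulated and the column names (total_quantity, avg_price, n_invoices) are hypothetical, since the actual variables depend on the earlier items. The sketch also assumes that "at the 5%" means capping at the 5th and 95th percentiles, and that total_quantity is one of the variables to be logged; both are assumptions to adjust as needed.

```r
set.seed(7)

# Hypothetical customer-level variables (names are illustrative only)
n <- 200
clients <- data.frame(
  total_quantity = rlnorm(n, meanlog = 4, sdlog = 1),
  avg_price      = rlnorm(n, meanlog = 1, sdlog = 0.5),
  n_invoices     = rpois(n, 5) + 1
)

# Winsorize: cap values below the p-th and above the (1 - p)-th quantile
winsorize <- function(x, p = 0.05) {
  q <- quantile(x, probs = c(p, 1 - p), na.rm = TRUE)
  pmin(pmax(x, q[1]), q[2])
}
clients_w <- as.data.frame(lapply(clients, winsorize))

# Logs reduce right skew (all simulated values are strictly positive)
clients_w$total_quantity <- log(clients_w$total_quantity)

print(summary(clients_w))

# Scale so that no variable dominates the Euclidean distance, then cluster
scaled <- scale(clients_w)
km <- kmeans(scaled, centers = 2, nstart = 25)

print(table(km$cluster))  # number of clients in each group
print(km$centers)         # group centers, in standardized units
```

Because the centers are reported in standardized units, the variables with the largest gap between the two rows of km$centers are the ones that separate the groups most strongly.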
2 The average should be quantity-weighted.
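A quantity-weighted average by group can be computed as in the sketch below. It assumes the average in question is an average price per customer, and the column names CustomerID, Quantity and Price are hypothetical; the toy transactions are made up for illustration.

```r
# Hypothetical transactions (column names are assumptions)
tx <- data.frame(
  CustomerID = c("A", "A", "B", "B", "B"),
  Quantity   = c(10, 30, 1, 1, 2),
  Price      = c(2.0, 4.0, 5.0, 7.0, 3.0)
)

# Quantity-weighted average price per customer:
# sum(Price * Quantity) / sum(Quantity) within each group
spend <- tapply(tx$Price * tx$Quantity, tx$CustomerID, sum)
units <- tapply(tx$Quantity, tx$CustomerID, sum)

by_client <- data.frame(
  CustomerID     = names(units),
  total_quantity = as.numeric(units),
  wavg_price     = as.numeric(spend / units)
)
print(by_client)
```

In this toy example the weighted averages are 3.5 for customer A and 4.5 for customer B, whereas the unweighted means of the prices would be 3 and 5, so the weighting can change the result noticeably.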