
Data Science Methods in Finance

R Tutorial 4

15 November 2024

Important Instructions
• These weekly exercises are highly relevant to the group assignment.

• They are optional, but we strongly encourage you to work through them.

No write-up of your answers or submission is required.

Principal Component Analysis: A Summary
Running a Principal Component Analysis means that we transform some standard-
ized variables X = [X1 , X2 , . . . , XN ] into a new set of variables (principal compo-
nents) Z = [Z1 , Z2 , . . . , ZN ] by multiplying them by a matrix Φ = [ϕ1 , ϕ2 , . . . , ϕN ].
That is:
$$\underset{T \times N}{X} \; \underset{N \times N}{\Phi} = \underset{T \times N}{Z} \tag{1}$$

For example, in our tutorial, the columns of X will represent standardized asset
returns observed over T periods (that is why X is a T × N matrix). The key feature
of this transformation is that the principal components in Z are mutually
uncorrelated and ordered by variance: Z1 has the largest variance attainable by a
unit-length linear combination of the columns of X, and each subsequent component
has the largest variance possible subject to being uncorrelated with the preceding ones.
To obtain the matrix Φ we do an eigendecomposition of the variance-covariance
matrix of X:
$$\operatorname{Var}(X) = \Phi \Lambda \Phi^{T} \tag{2}$$
Each column of Φ is an eigenvector of this decomposition, and Λ is the diagonal
matrix whose diagonal entries λ1 , λ2 , . . . , λN are the corresponding eigenvalues.
These eigenvalues are equal to the variances of the principal components in Z.
That is, λ1 = Var(Z1 ), λ2 = Var(Z2 ), . . . , λN = Var(ZN ).
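
The summary above can be checked numerically. The following is a minimal sketch on simulated data (not the tutorial's dataset); the names `T_obs`, `raw`, `pc_var`, `max_offdiag` and `recon_err` are illustrative assumptions. It builds Φ and the eigenvalues with `eigen`, forms Z as in Equation 1, and compares the variances of the columns of Z with the eigenvalues.

```r
# Minimal sketch on SIMULATED data; all object names are illustrative.
set.seed(1)
T_obs <- 500
N <- 4

# Simulate correlated series through a common factor, then standardise.
f <- rnorm(T_obs)
raw <- sapply(1:N, function(i) 0.8 * f + rnorm(T_obs))
X <- scale(raw)                       # T x N, zero mean and unit variance

eig <- eigen(cov(X))                  # Equation (2): Var(X) = Phi Lambda Phi^T
Phi <- eig$vectors                    # N x N, columns are eigenvectors
lambda <- eig$values                  # eigenvalues, in decreasing order

Z <- X %*% Phi                        # Equation (1): T x N principal components

# The variances of the PCs equal the eigenvalues, and the PCs are uncorrelated.
pc_var <- apply(Z, 2, var)
cov_Z <- cov(Z)
max_offdiag <- max(abs(cov_Z[upper.tri(cov_Z)]))

# The decomposition reproduces the variance-covariance matrix of X.
recon_err <- max(abs(Phi %*% diag(lambda) %*% t(Phi) - cov(X)))
```

Printing `rbind(pc_var, lambda)` shows the two rows side by side, and `max_offdiag` and `recon_err` should both be zero up to floating-point error.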

Question 1
For this tutorial, we will use data on total return series for several indices that track
the performance of global markets (equity indices, fixed income indices, etc.). The
data can be found in “Indices.xlsx”, sheet “data”. The sheet “info” contains their
complete names and their asset class classification.[1] Our goal is to extract Principal
Components (PCs), and illustrate how one could (potentially) “identify” them. In
the tutorial, we will focus on the identification of the first one, but of course, you
can work on the identification of the others on your own.

1. Load “Indices.xlsx” into R. To this end, you can use the function read_excel
from the library readxl. Make sure you format the date correctly. Next,
compute 52-week returns for every series. Visualise how correlated these series
are (Hint: you can look at the package ggcorrplot).

2. Use the function prcomp to run a Principal Components Analysis on the time
series of returns computed in the previous item. Make sure to deal with
empty values beforehand, as prcomp will otherwise give you errors. Analyse
the output. What is the proportion of variance explained by the first principal
component?

3. Extract the eigenvalues. You can do so directly from the output of the prcomp
function, or use the function get_eig from the package factoextra.

4. Extract the principal components (PCs, from now on). Visualise the first
three. Can you associate the behaviour of any of them with some macroeco-
nomic variable?

5. For each PC, compute its variance and the proportion it represents of
the total variance (the aggregate variance of all PCs). Which principal
component has the largest variance? Compare your calculations with the
output of the summary function applied to the object returned by
prcomp.

6. Confirm that the PCs you obtain are uncorrelated. Confirm as well that you
can compute the exact same PCs by using Equation 1.

7. Keep only the first four principal components. Compute the correlation of
each index return in the dataset with each of the principal components. You
can find information about the index (i.e., full name, asset class) under the
sheet “info”. Which assets are more correlated with the first and second
principal components?
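
The seven steps can be sketched as follows. This is one possible outline under stated assumptions, not a model answer: it runs on simulated weekly index levels with hypothetical tickers `IDX1` to `IDX5`, it assumes one observation per week (so a 52-week return spans 52 rows), and it uses only base R. With the real file, the levels would come from `readxl::read_excel` instead, and the correlation matrix could be passed to `ggcorrplot::ggcorrplot`.

```r
# Sketch of steps 1-7 on SIMULATED weekly index levels (hypothetical tickers).
set.seed(42)
n_weeks <- 400
n_idx <- 5
dates <- seq(as.Date("2010-01-01"), by = "week", length.out = n_weeks)
common <- rnorm(n_weeks, 0, 0.02)
weekly <- sapply(1:n_idx, function(i) 0.001 + 0.7 * common + rnorm(n_weeks, 0, 0.01))
lvl <- 100 * apply(1 + weekly, 2, cumprod)
colnames(lvl) <- paste0("IDX", 1:n_idx)

# Step 1: 52-week returns (assumes weekly observations) and their correlations.
h <- 52
ret <- lvl[(h + 1):n_weeks, ] / lvl[1:(n_weeks - h), ] - 1
ret_dates <- dates[(h + 1):n_weeks]
corr_mat <- cor(ret)

# Step 2: keep complete rows only, then run PCA on standardised returns.
keep <- complete.cases(ret)
ret <- ret[keep, , drop = FALSE]
ret_dates <- ret_dates[keep]
pca <- prcomp(ret, center = TRUE, scale. = TRUE)

# Step 3: eigenvalues are the squared standard deviations reported by prcomp.
eigenvalues <- pca$sdev^2

# Step 4: the principal components (scores) are stored in pca$x.
pcs <- pca$x
# matplot(ret_dates, pcs[, 1:3], type = "l")   # quick look at the first three

# Step 5: variance of each PC and its share of the total variance.
pc_var <- apply(pcs, 2, var)
prop_var <- pc_var / sum(pc_var)
summary_prop <- summary(pca)$importance["Proportion of Variance", ]

# Step 6: PCs are uncorrelated and can be rebuilt as Z = X Phi (Equation 1).
pc_cor <- cor(pcs)
Z <- scale(ret) %*% pca$rotation

# Step 7: correlation of each index return with the first four PCs.
index_pc_cor <- cor(ret, pcs[, 1:4])
```

Note that `summary` rounds the proportions it reports, so `prop_var` and `summary_prop` agree only up to that rounding. Setting `scale. = TRUE` is what makes `prcomp` work on standardized returns, in line with the summary section.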

[1] These indices have been collected from a Bloomberg terminal (the tickers are the same as the
column headers).

Question 2
The objective of this question is to classify the customers of a given firm based
on their transactions included in the dataset ProductTransactions. In this exercise,
you will practice how to estimate a clustering model through k-means and review
how to construct variables by grouping.

1. The database contains product returns and other irregular entries. To delete
them, drop every observation with negative prices or quantities and those whose
StockCode is not numeric. Report how many observations were dropped.

2. Construct the premium of each transaction, defined as $(p - \bar{p})/\bar{p}$, where $p$ is the unit
price and $\bar{p}$ is the average price of that product in the sample. Provide the
summary statistics of the new variable.

3. Construct a database in which each observation is a customer and contains
the following variables:

• Average number of different articles in an invoice.
• Number of invoices of that customer.
• Total number of articles bought.
• Total value of articles bought.
• Standard deviation of the total quantity per invoice.
• Standard deviation of the total value per invoice.
• The average transaction premium.[2]

Provide the summary statistics of these variables.

4. Some of the variables take very extreme values. To avoid them having a
big impact, winsorize the variables at the 5% level. To reduce the skewness of the
quantities in italics, take logs. Provide the summary statistics.

5. Apply the k-means algorithm to the data and separate clients into two groups.
Make sure you scale the variables properly before doing this. Provide the
number of clients in each group and the center of the groups. Discuss which
variables differentiate the groups more.
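
The five steps can be sketched as follows, on simulated transactions and in base R only. Several choices are assumptions rather than part of the exercise: only `StockCode` is named in the text, so `InvoiceNo`, `CustomerID`, `Quantity` and `UnitPrice` are hypothetical column names; the premium is read as (p − p̄)/p̄; winsorizing "at the 5%" is read as clipping at the 5th and 95th percentiles; and logs are applied to the two totals purely for illustration, since the italicised variables cannot be identified from the text.

```r
# Sketch on SIMULATED transactions; column names other than StockCode are assumed.
set.seed(7)
n <- 3000
cust_of_invoice <- sample(1:120, 600, replace = TRUE)   # one customer per invoice
tx <- data.frame(
  InvoiceNo = sample(1:600, n, replace = TRUE),
  StockCode = sample(c(as.character(10001:10030), "POST", "M"), n, replace = TRUE),
  Quantity  = sample(c(-5, 1:20), n, replace = TRUE),
  UnitPrice = round(rlnorm(n, 1, 0.5), 2),
  stringsAsFactors = FALSE
)
tx$CustomerID <- cust_of_invoice[tx$InvoiceNo]

# Step 1: drop negative prices or quantities and non-numeric stock codes.
n_before <- nrow(tx)
tx <- tx[tx$Quantity >= 0 & tx$UnitPrice >= 0 & grepl("^[0-9]+$", tx$StockCode), ]
n_dropped <- n_before - nrow(tx)

# Step 2: premium relative to the product's average price in the sample.
avg_price <- ave(tx$UnitPrice, tx$StockCode)             # mean price per product
tx$premium <- (tx$UnitPrice - avg_price) / avg_price
premium_summary <- summary(tx$premium)

# Step 3: aggregate to invoices first, then to customers.
tx$value <- tx$Quantity * tx$UnitPrice
iid <- tx$InvoiceNo
inv <- data.frame(
  InvoiceNo  = sort(unique(iid)),
  CustomerID = as.vector(tapply(tx$CustomerID, iid, function(x) x[1])),
  n_articles = as.vector(tapply(tx$StockCode, iid, function(s) length(unique(s)))),
  qty        = as.vector(tapply(tx$Quantity, iid, sum)),
  value      = as.vector(tapply(tx$value, iid, sum))
)
cid <- inv$CustomerID
cust <- data.frame(
  CustomerID   = sort(unique(cid)),
  avg_articles = as.vector(tapply(inv$n_articles, cid, mean)),
  n_invoices   = as.vector(tapply(inv$InvoiceNo, cid, length)),
  total_qty    = as.vector(tapply(inv$qty, cid, sum)),
  total_value  = as.vector(tapply(inv$value, cid, sum)),
  sd_qty       = as.vector(tapply(inv$qty, cid, sd)),
  sd_value     = as.vector(tapply(inv$value, cid, sd))
)
# Quantity-weighted average premium per customer.
wp <- tapply(tx$premium * tx$Quantity, tx$CustomerID, sum) /
  tapply(tx$Quantity, tx$CustomerID, sum)
cust$avg_premium <- as.vector(wp[as.character(cust$CustomerID)])

# Step 4: winsorize at the 5th/95th percentiles, then log the (assumed) totals.
winsorize <- function(x, p = 0.05) {
  q <- quantile(x, c(p, 1 - p), na.rm = TRUE)
  pmin(pmax(x, q[1]), q[2])
}
vars <- setdiff(names(cust), "CustomerID")
cust[vars] <- lapply(cust[vars], winsorize)
log_vars <- c("total_qty", "total_value")
cust[log_vars] <- lapply(cust[log_vars], log)

# Step 5: scale, then k-means with two groups (customers with NA are dropped).
cust <- cust[complete.cases(cust), ]
Xs <- scale(cust[vars])
km <- kmeans(Xs, centers = 2, nstart = 25)
group_sizes <- table(km$cluster)
group_centers <- km$centers
```

Customers with a single invoice have an undefined standard deviation and are dropped before clustering here; whether to drop them or set the value to zero is a modelling choice the exercise leaves open. Because the variables are scaled, the entries of `group_centers` are in standard-deviation units, so the variables with the largest gap between the two rows are the ones that separate the groups most.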

[2] The average should be quantity-weighted.
