Data Science Methods in Finance

R Tutorial 4

15 November 2024

Important Instructions
• These weekly exercises are highly relevant to the group assignment.

• They are optional, but we strongly encourage you to work through them.

NO write-up of your answers or submission is required.

1 Principal Component Analysis: A Summary
Running a Principal Component Analysis means that we transform some standardized
variables $X = [X_1, X_2, \ldots, X_N]$ into a new set of variables (principal
components) $Z = [Z_1, Z_2, \ldots, Z_N]$ by multiplying them by a matrix
$\Phi = [\phi_1, \phi_2, \ldots, \phi_N]$. That is:
\[
\underset{T \times N}{X} \; \underset{N \times N}{\Phi} = \underset{T \times N}{Z} \tag{1}
\]

For example, in our tutorial, the columns of $X$ will represent standardized asset
returns observed over $T$ periods (that is why $X$ is a $T \times N$ matrix). The key feature
of this transformation is that the principal components in $Z$ are mutually
uncorrelated and ordered by variance: $Z_1$ has the largest variance attainable by
a unit-length linear combination of the columns of $X$, and each subsequent
component has the largest variance possible subject to being uncorrelated with the
preceding ones.
To obtain the matrix $\Phi$ we compute the eigendecomposition of the variance-covariance
matrix of $X$:
\[
\operatorname{Var}(X) = \Phi \Lambda \Phi^{\top} \tag{2}
\]
Each column of $\Phi$ is an eigenvector of this decomposition, and
$\Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_N)$ holds the eigenvalues. These eigenvalues are equal to
the variances of the principal components in $Z$. That is, $\lambda_1 = \operatorname{Var}(Z_1)$,
$\lambda_2 = \operatorname{Var}(Z_2)$, \ldots, $\lambda_N = \operatorname{Var}(Z_N)$.
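
The relationship between Equations 1 and 2 can be checked numerically. The following is a minimal sketch in base R on simulated data, not on the tutorial's dataset; the sample size, the number of series, and the way the series are generated are all assumptions made only for illustration.

```r
# Minimal sketch of Equations 1 and 2 on simulated data (base R only).
set.seed(1)
n_obs <- 200   # T: number of periods (hypothetical)
n_var <- 4     # N: number of series (hypothetical)

# Simulated correlated "returns": a common factor plus idiosyncratic noise.
common <- rnorm(n_obs)
raw <- sapply(1:n_var, function(i) 0.7 * common + rnorm(n_obs))

# Standardize each column (mean 0, standard deviation 1).
X <- scale(raw)

# Equation 2: eigendecomposition of the variance-covariance matrix of X.
eig <- eigen(cov(X), symmetric = TRUE)
Phi <- eig$vectors      # columns are the eigenvectors
lambda <- eig$values    # eigenvalues, sorted in decreasing order

# Equation 1: principal components Z = X Phi.
Z <- X %*% Phi

# The variances of the PCs equal the eigenvalues, and the PCs are uncorrelated.
pc_var <- apply(Z, 2, var)
cov_Z <- cov(Z)
max_var_gap <- max(abs(pc_var - lambda))
max_offdiag_cov <- max(abs(cov_Z[upper.tri(cov_Z)]))
print(round(cbind(lambda, pc_var), 4))
```

Because the columns of X are standardized, the eigenvalues sum to N, so each eigenvalue divided by N is the share of total variance carried by the corresponding component.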

2 Question 1
For this tutorial, we will use data on total return series for several indices that track
the performance of global markets (equity indices, fixed income indices, etc.). The
data can be found in “Indices.xlsx”, sheet “data”. The sheet “info” contains their
complete name and their asset class classification.¹ Our goal is to extract Principal
Components (PCs), and illustrate how one could (potentially) “identify” them. In
the tutorial, we will focus on the identification of the first one, but of course, you
can work on the identification of the others on your own.

1. Load “Indices.xlsx” into R. To this end, you can use the function read_excel
from the library readxl. Make sure you format the date correctly. Next,
compute 52-week returns for every series. Visualise how correlated these series
are (Hint: you can look at the package ggcorrplot).

2. Use the function prcomp to run a Principal Components Analysis on the time
series of returns computed in the previous item. Make sure to deal with
empty values beforehand, as prcomp will otherwise give you errors. Analyse
the output. What is the proportion of variance explained by the first principal
component?

3. Extract the eigenvalues. You can do so directly from the output of the prcomp
function, or use the function get_eig from the package factoextra.

4. Extract the principal components (PCs, from now on). Visualise the first
three. Can you associate the behaviour of any of them with some
macroeconomic variable?

5. For each PC, compute its variance and the proportion it represents of
the total variance (the aggregate variance of all PCs). Which principal
component has the largest variance? Compare your calculations with the output
of the summary function applied to the object returned by
prcomp.

6. Confirm that the PCs you obtain are uncorrelated. Confirm as well that you
can compute the exact same PCs by using Equation 1.

7. Keep only the first four principal components. Compute the correlation of
each index return in the dataset with each of the principal components. You
can find information about the index (i.e., full name, asset class) under the
sheet “info”. Which assets are most correlated with the first and second
principal components?
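
As a rough illustration of the mechanics behind items 1 to 5, the sketch below computes 52-week returns, removes missing values, runs prcomp, and recovers the eigenvalues and variance shares. It uses simulated weekly index levels rather than “Indices.xlsx”, so the ticker names, the sample length, and the data-generating process are assumptions; with the real data, the matrix of levels would come from read_excel instead.

```r
# Sketch for items 1-5 on simulated weekly index levels (base R only).
set.seed(42)
n_weeks <- 300
tickers <- c("IDX_A", "IDX_B", "IDX_C", "IDX_D", "IDX_E")  # hypothetical names

# Simulated total return indices driven by one common weekly shock.
shock <- rnorm(n_weeks, sd = 0.02)
weekly <- sapply(seq_along(tickers),
                 function(i) 0.001 + shock + rnorm(n_weeks, sd = 0.01))
levels_mat <- apply(1 + weekly, 2, cumprod) * 100
colnames(levels_mat) <- tickers
levels_mat[10, "IDX_C"] <- NA   # mimic an empty cell in the spreadsheet

# 52-week simple returns: P_t / P_{t-52} - 1.
lag_weeks <- 52
idx <- (lag_weeks + 1):n_weeks
ret52 <- levels_mat[idx, ] / levels_mat[idx - lag_weeks, ] - 1

# Drop rows with missing values before calling prcomp.
ret52 <- ret52[complete.cases(ret52), ]

# Correlation matrix of the returns (the object one would visualise).
cor_mat <- cor(ret52)

# PCA on standardized returns.
pca <- prcomp(ret52, center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the PCs.
eigenvalues <- pca$sdev^2
prop_var <- eigenvalues / sum(eigenvalues)

# The same shares appear in the "Proportion of Variance" row of summary().
prop_from_summary <- summary(pca)$importance["Proportion of Variance", ]
print(round(prop_var, 4))
```

Dropping incomplete rows is only one way to handle empty values; whether it is appropriate for the real dataset depends on how many observations it removes.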

¹ These indices have been collected from a Bloomberg terminal (the tickers are the same as the
column headers).
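
The checks asked for in items 6 and 7 of Question 1 can be sketched in the same way. The example below again relies on simulated returns with hypothetical ticker names; it confirms that the PCs returned by prcomp are uncorrelated, that they coincide with the product in Equation 1, and then correlates each series with the first four PCs.

```r
# Sketch for items 6-7 on simulated returns (base R only).
set.seed(7)
n_obs <- 250
common <- rnorm(n_obs)
returns <- sapply(1:6, function(i) 0.6 * common + rnorm(n_obs))
colnames(returns) <- paste0("IDX_", 1:6)   # hypothetical tickers

pca <- prcomp(returns, center = TRUE, scale. = TRUE)
Z <- pca$x            # principal components (scores)
Phi <- pca$rotation   # eigenvectors (loadings)

# Item 6: correlations between different PCs should be numerically zero.
pc_cor <- cor(Z)
max_offdiag <- max(abs(pc_cor[upper.tri(pc_cor)]))

# Item 6: Equation 1, Z = X Phi, with X the standardized returns.
X <- scale(returns)
Z_manual <- X %*% Phi
max_diff <- max(abs(Z_manual - Z))

# Item 7: correlation of each index return with the first four PCs.
asset_pc_cor <- cor(returns, Z[, 1:4])
print(round(asset_pc_cor, 2))

# Series ranked by absolute correlation with the first PC.
ranked_pc1 <- names(sort(abs(asset_pc_cor[, "PC1"]), decreasing = TRUE))
print(ranked_pc1)
```

The sign of a principal component is arbitrary, which is why the ranking uses absolute correlations.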

3 Question 2
The objective of this question is to classify the customers of a given firm based
on their transactions included in the dataset ProductTransactions. In this exercise,
you will practice how to estimate a clustering model through k-means and review
how to construct variables by grouping.

1. The database contains returns and other strange entries. To delete them,
drop every observation with negative prices or quantities and those whose
StockCode is not numeric. Report how many observations were dropped.

2. Construct the premium of each transaction, defined as $(p - \bar{p}) / \bar{p}$, where $p$ is the unit
price and $\bar{p}$ is the average price of that product in the sample. Provide the
summary statistics of the new variable.

3. Construct a database in which each observation is a customer and contains
the following variables:

• Average number of different articles in an invoice.
• Number of invoices of that customer.
• Total number of articles bought.
• Total value of articles bought.
• Standard deviation of the total quantity per invoice.
• Standard deviation of the total value per invoice.
• The average transaction premium.²

Provide the summary statistics of these variables.

4. Some of the variables take very extreme values. To limit their
impact, winsorize the variables at the 5% level. To reduce the skewness of the
quantities in italics, take logs. Provide the summary statistics.

5. Apply the k-means algorithm to the data and separate clients into two groups.
Make sure you scale the variables properly before doing this. Provide the
number of clients in each group and the centres of the groups. Discuss which
variables differentiate the groups the most.
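
One possible way to organise items 1 to 3 is sketched below on a small toy table. Only the column name StockCode comes from the exercise; InvoiceNo, CustomerID, Quantity, UnitPrice and all values are hypothetical stand-ins for the ProductTransactions dataset. The average product price is taken here as a simple mean, which is also an assumption.

```r
# Toy stand-in for ProductTransactions; all column names except StockCode
# are assumptions made for illustration.
tx <- data.frame(
  InvoiceNo  = c("A1", "A1", "A2", "B1", "B1", "B2", "C1", "C1"),
  CustomerID = c(1, 1, 1, 2, 2, 2, 3, 3),
  StockCode  = c("1001", "1002", "1001", "1001", "POST", "1002", "1002", "1001"),
  Quantity   = c(2, 1, 4, 3, 1, -2, 5, 1),
  UnitPrice  = c(10, 5, 12, 8, 3, 5, 6, 10),
  stringsAsFactors = FALSE
)

# Item 1: drop negative prices or quantities and non-numeric stock codes.
n_before <- nrow(tx)
keep <- tx$Quantity >= 0 & tx$UnitPrice >= 0 & grepl("^[0-9]+$", tx$StockCode)
tx <- tx[keep, ]
n_dropped <- n_before - nrow(tx)

# Item 2: premium relative to the product's average price (simple mean).
avg_price <- ave(tx$UnitPrice, tx$StockCode)
tx$premium <- (tx$UnitPrice - avg_price) / avg_price
tx$value <- tx$Quantity * tx$UnitPrice

# Item 3, step 1: one row per invoice.
inv <- aggregate(cbind(Quantity, value) ~ CustomerID + InvoiceNo,
                 data = tx, FUN = sum)
art <- aggregate(StockCode ~ CustomerID + InvoiceNo, data = tx,
                 FUN = function(x) length(unique(x)))
inv <- merge(inv, art)   # StockCode now holds the distinct-article count

# Item 3, step 2: one row per customer.
cust <- do.call(rbind, lapply(split(inv, inv$CustomerID), function(d) {
  data.frame(
    CustomerID   = d$CustomerID[1],
    avg_articles = mean(d$StockCode),
    n_invoices   = nrow(d),
    total_qty    = sum(d$Quantity),
    total_value  = sum(d$value),
    sd_qty       = sd(d$Quantity),   # NA when the customer has one invoice
    sd_value     = sd(d$value)
  )
}))

# Quantity-weighted average premium per customer.
wprem <- sapply(split(tx, tx$CustomerID),
                function(d) weighted.mean(d$premium, d$Quantity))
cust$avg_premium <- unname(wprem[as.character(cust$CustomerID)])
print(cust)
```

Customers with a single invoice have undefined standard deviations in this sketch; how to treat them (drop, set to zero) is a modelling choice the exercise leaves open.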

² The average should be quantity-weighted.
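
The winsorizing, log, scaling and clustering steps in items 4 and 5 of Question 2 can be sketched as follows. The customer table is simulated and its variable names are hypothetical. Two further assumptions are made: winsorizing "at the 5%" is read as capping each tail at the 5th and 95th percentiles, and the logged variables are simply those that are strictly positive and skewed in this simulated table.

```r
# Sketch for items 4-5 on a simulated customer table (base R only).
set.seed(123)
n_cust <- 200
cust <- data.frame(
  n_invoices  = c(rpois(150, 3) + 1, rpois(50, 30) + 1),
  total_value = c(rlnorm(150, meanlog = 5), rlnorm(50, meanlog = 8)),
  avg_premium = rnorm(n_cust, sd = 0.1)
)

# Winsorize: cap values below the 5th and above the 95th percentile.
winsorize <- function(x, p = 0.05) {
  q <- quantile(x, probs = c(p, 1 - p), na.rm = TRUE)
  pmin(pmax(x, q[1]), q[2])
}
cust_w <- as.data.frame(lapply(cust, winsorize))

# Log-transform the skewed, strictly positive variables (assumed choice).
cust_w$n_invoices <- log(cust_w$n_invoices)
cust_w$total_value <- log(cust_w$total_value)

# Scale to mean 0 and standard deviation 1 so that no variable
# dominates the Euclidean distance used by k-means.
cust_s <- scale(cust_w)

# k-means with two clusters; nstart > 1 reduces sensitivity to the
# random initial centres.
km <- kmeans(cust_s, centers = 2, nstart = 25)
print(km$size)      # number of clients in each group
print(km$centers)   # group centres in standardized units
```

Since the centres are expressed in standardized units, the variables with the largest gap between the two rows of km$centers are the ones that separate the groups most.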
