Tutorial 1

The document is a tutorial for applying data science methods in finance using R, focusing on equity data from the CRSP. It includes exercises on loading datasets, filtering samples, calculating market returns, and creating predictive features for stock returns. Additionally, it outlines preprocessing steps for machine learning models and references key academic papers related to the methods discussed.

Data Science Methods in Finance

R Tutorial 1

25 October 2024

Important Instructions
• These weekly exercises are highly relevant to the group assignment.

• They are optional, but we strongly encourage you to work through them.

• No write-up of your answers or submission is required.

Question 1
The task in this question is to familiarize yourself with the equity data
from the Center for Research in Security Prices (CRSP).

1. Load the following dataset into R: “CRSP Monthly.csv”. Before loading a
comma-separated values file into R, note that the following functions load text
files (such as csv) into R: read.csv from Base R, read_delim from readr, and
fread from data.table. fread from data.table is much faster than the other
methods when the dataset is large.

2. Examine the dataset using the head() and summary() functions. You should
see that the price variable contains positive and negative values. Find the
reason why price takes on negative values and solve the problem. Hint: consult
the variable descriptions tab of the CRSP Monthly Stock File on the WRDS
website.
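
The two steps above can be sketched as follows, using a tiny made-up file so that the example runs on its own. The column names (permno, date, prc, ret) are assumptions about the CRSP export, and data.table::fread(path) could replace read.csv(path) for a large file. In CRSP, a negative price flags a bid/ask average that is reported when no closing price is available, so taking the absolute value is one common remedy.

```r
# Write a tiny stand-in for "CRSP Monthly.csv" (made-up numbers).
path <- tempfile(fileext = ".csv")
writeLines(c(
  "permno,date,prc,ret",
  "10001,20000131,25.50,0.02",
  "10001,20000229,-26.00,0.0196",
  "10002,20000131,-3.25,-0.10"
), path)

crsp <- read.csv(path)   # for large files: crsp <- data.table::fread(path)

head(crsp)
summary(crsp$prc)        # the minimum is negative

# A negative prc marks a bid/ask average, not a negative price.
crsp$prc <- abs(crsp$prc)
```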

Question 2
The purpose of this exercise is to get you to start thinking about restricting
a sample for a specific purpose. In research, filters are almost always used
to convert the raw data into a relevant sample. After filtering your data, you
will construct a value-weighted portfolio return.

1. Restrict the sample to common stocks. The variable shrcd can be used for this
purpose. More information is in the variable descriptions tab of the CRSP
Monthly Stock File on the WRDS website.

2. Restrict the sample to stocks that trade on the following exchanges: New York
Stock Exchange (NYSE), American Stock Exchange (AMEX), National Association
of Securities Dealers Automated Quotations (NASDAQ). The variable
exchcd can be used for this purpose.

3. Calculate the value-weighted market returns of this sample. Make sure you
use the correct return definition, which includes dividends and adjustments for
corporate events. In a value-weighted portfolio every stock is assigned a weight
proportional to its market capitalization. This is tricky: the observation for
the 31st of January contains the price and shares outstanding as of the 31st of
January but the return earned during the month of January, so the weight must
come from the previous month. Hint: Lag market value and look up the function
“weighted.mean” (it belongs to the stats package that ships with R and can be
called inside dplyr's summarise). Alternatively, construct the weights yourself.

4. Optional: Check the correlation between the U.S. market return you calculated
and the market factor available on Ken French’s website.
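
Steps 1 to 3 can be sketched on a toy panel as follows. This is a minimal illustration, not a reference solution: the numbers are made up, the column names (shrcd, exchcd, prc, shrout, ret) are assumed to match the CRSP export, share codes 10 and 11 are taken to identify common stocks, and exchange codes 1, 2 and 3 are taken to identify NYSE, AMEX and NASDAQ. The row-based lag also assumes that each stock has no missing months.

```r
# Toy panel: three stocks, three months (made-up numbers).
crsp <- data.frame(
  permno = rep(c(1, 2, 3), each = 3),
  date   = rep(as.Date(c("2000-01-31", "2000-02-29", "2000-03-31")), times = 3),
  shrcd  = rep(c(10, 11, 73), each = 3),   # 73 is not a common share
  exchcd = rep(c(1, 3, 1), each = 3),
  prc    = c(10, 11, 12, 20, 18, 21, 5, 5, 5),
  shrout = rep(c(100, 50, 10), each = 3),
  ret    = c(NA, 0.10, 0.0909, NA, -0.10, 0.1667, NA, 0, 0)
)

# Filters: common stocks (10, 11) on NYSE (1), AMEX (2) or NASDAQ (3).
smp <- crsp[crsp$shrcd %in% c(10, 11) & crsp$exchcd %in% c(1, 2, 3), ]

# Market value, lagged one month within each stock.
smp <- smp[order(smp$permno, smp$date), ]
smp$mv <- abs(smp$prc) * smp$shrout
smp$lag_mv <- ave(smp$mv, smp$permno, FUN = function(x) c(NA, head(x, -1)))

# Value-weighted return per month, weighted by last month's market value.
ok <- !is.na(smp$ret) & !is.na(smp$lag_mv)
vw <- sapply(split(smp[ok, ], as.character(smp$date[ok])),
             function(d) weighted.mean(d$ret, d$lag_mv))
vw
```

The first month drops out because no lagged market value exists for it.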

Question 3
The task in this question is to create features (characteristics) that we
will later use to predict returns with supervised learning methods.

1. Construct the variable Short-Term Reversal based on the paper of Jegadeesh
(1990). We recommend you stick to the notation we introduce here (left-hand
side of the equations):

\[ \mathrm{ret\_1\_0} = \frac{RI_t}{RI_{t-1}} - 1 = (1 + \mathrm{ret}_t) - 1, \tag{1} \]

where RI is equal to the cumulative return.¹ Also construct the variable
Momentum 1-12 Months based on the paper of Fama and French (1996):

\[ \mathrm{ret\_12\_1} = \frac{RI_{t-1}}{RI_{t-12}} - 1 \tag{2} \]

2. Lag the characteristics by 1 month to prepare the data for creating portfolios.

3. Construct a portfolio that takes a long position in the stocks that are in the
top 10% of the distribution of the variable “Momentum 1-12 Months” in a
specific month. Take a short position in the stocks that are in the bottom
10% of the same distribution in the same month. We recommend that you use
the quantile function to create the cut-off points you need to allocate stocks
into different portfolios.

4. After having assigned stocks to portfolios, calculate the value-weighted
return of the long leg (top 10% of stocks based on the variable Momentum
1-12 Months) and the short leg (bottom 10% of stocks based on the variable
Momentum 1-12 Months).

5. Create the “factor” as the return of a long-short portfolio strategy.

6. Show that this strategy delivers a positive alpha relative to the Capital Asset
Pricing Model (CAPM).²
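
Steps 3 to 5 can be sketched for a single month as follows. The data are simulated, so the resulting numbers carry no empirical meaning, and the column names lag_ret_12_1 and lag_mv are hypothetical names for the lagged characteristic and the lagged market value. In the full exercise the same logic would be repeated for every month.

```r
set.seed(1)
n <- 200
one_month <- data.frame(
  permno       = seq_len(n),
  lag_ret_12_1 = rnorm(n, 0.10, 0.30),   # lagged Momentum 1-12 Months
  lag_mv       = exp(rnorm(n, 6, 1)),    # lagged market value
  ret          = rnorm(n, 0.01, 0.08)    # return in the holding month
)

# Cut-off points at the 10th and 90th percentiles of the characteristic.
cuts <- quantile(one_month$lag_ret_12_1, probs = c(0.10, 0.90), na.rm = TRUE)

long_leg  <- one_month[one_month$lag_ret_12_1 >= cuts[[2]], ]
short_leg <- one_month[one_month$lag_ret_12_1 <= cuts[[1]], ]

# Value-weighted leg returns and the long-short "factor" return.
ret_long   <- weighted.mean(long_leg$ret,  long_leg$lag_mv)
ret_short  <- weighted.mean(short_leg$ret, short_leg$lag_mv)
factor_ret <- ret_long - ret_short
```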

¹ Note that the cumprod function (Base R) does not work if there are gaps in the data.
² Hint: you have the return of the long-short portfolio, your dependent variable, and the
value-weighted market return from Question 2, your independent variable.
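
The regression described in the hint can be sketched as follows. Both series are simulated here purely to make the example runnable, so the estimates say nothing about real data; mkt and ls_ret are hypothetical names for the value-weighted market return and the long-short return. A textbook CAPM test would typically use the market return in excess of the risk-free rate, which this sketch leaves out.

```r
set.seed(42)
n_months <- 240
mkt    <- rnorm(n_months, 0.006, 0.045)                    # market return (simulated)
ls_ret <- 0.008 - 0.2 * mkt + rnorm(n_months, 0, 0.03)     # long-short return (simulated)

# Regress the long-short return on the market return.
capm <- lm(ls_ret ~ mkt)
summary(capm)$coefficients   # row "(Intercept)" holds alpha and its t-statistic

alpha <- coef(capm)[["(Intercept)"]]
beta  <- coef(capm)[["mkt"]]
```

A positive and statistically significant intercept is what "positive alpha" refers to.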

List of additional characteristics you can test

1. Construct the variable Momentum 1-3 Months based on the paper of Jegadeesh
and Titman (1993).

\[ \mathrm{ret\_3\_1} = \frac{RI_{t-1}}{RI_{t-3}} - 1 = (1 + \mathrm{ret}_{t-1}) \times (1 + \mathrm{ret}_{t-2}) - 1, \tag{3} \]

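2. Construct the variable Momentum 1-6 Months based on the paper of Jegadeesh
and Titman (1993).

\[ \mathrm{ret\_6\_1} = \frac{RI_{t-1}}{RI_{t-6}} - 1 = (1 + \mathrm{ret}_{t-1}) \times (1 + \mathrm{ret}_{t-2}) \times (1 + \mathrm{ret}_{t-3}) \times (1 + \mathrm{ret}_{t-4}) \times (1 + \mathrm{ret}_{t-5}) - 1 \tag{4} \]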

3. Construct the variable Momentum 1-9 Months based on the paper of Jegadeesh
and Titman (1993).

\[ \mathrm{ret\_9\_1} = \frac{RI_{t-1}}{RI_{t-9}} - 1 \tag{5} \]

4. Construct the variable Momentum 7-12 Months based on the paper of Novy-Marx
(2013).

\[ \mathrm{ret\_12\_7} = \frac{RI_{t-7}}{RI_{t-12}} - 1 \tag{6} \]

5. Construct the variable Momentum 13-36 Months based on the paper of De Bondt
and Thaler (1985).

\[ \mathrm{ret\_36\_13} = \frac{RI_{t-13}}{RI_{t-36}} - 1 \tag{7} \]

6. Construct the variable Long-Term Reversal based on the paper of De Bondt and
Thaler (1987).

\[ \mathrm{ret\_60\_12} = \frac{RI_{t-12}}{RI_{t-60}} - 1 \tag{8} \]
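
All of the characteristics above share the form RI at one lag divided by RI at a longer lag, minus one, so a single helper can cover them. The following is a sketch under stated assumptions: ret_window is a hypothetical helper name, its input is one stock's monthly return series sorted by date, and the series has no missing months (cumprod would otherwise propagate the gap).

```r
# ret_window(ret, far, near) returns RI[t - near] / RI[t - far] - 1 for each t.
# Example: far = 12, near = 1 corresponds to ret_12_1.
ret_window <- function(ret, far, near) {
  ri <- cumprod(1 + ret)                 # cumulative return index RI
  n  <- length(ri)
  # Shift a vector back by k periods, padding the start with NA.
  shift <- function(x, k) {
    if (k >= n) rep(NA_real_, n) else c(rep(NA_real_, k), head(x, n - k))
  }
  shift(ri, near) / shift(ri, far) - 1
}

ret <- c(0.10, -0.05, 0.02, 0.03, -0.01)   # made-up monthly returns
ret_3_1 <- ret_window(ret, far = 3, near = 1)
ret_3_1
```

In a panel, the helper would be applied stock by stock, for example with ave() or a grouped mutate. Calling ret_window(ret, far = 1, near = 0) reproduces the Short-Term Reversal definition in equation (1).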

Question 4
This question deals with preprocessing of data for return predictions.
Load into R: “CRSP Monthly Including Lagged Characteristics.csv”. The features
are already lagged by one month to prevent the use of future information. Note:
it is essentially the same dataset as in the previous question. If you want, you can
use fread from data.table to load files (it is faster than most alternatives when the
dataset is large). In dealing with dates, we recommend the lubridate package.
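
As a small illustration of the date handling, the sketch below derives year and month from a date column in base R. It assumes that dates arrive as YYYYMMDD integers, which may not match the actual file. The lubridate functions ymd(), year() and month() do the same job more conveniently.

```r
dat <- data.frame(permno = c(10001L, 10001L),
                  date   = c(20010131L, 20010228L))   # made-up rows

# Assumption: dates are YYYYMMDD integers; adjust `format` otherwise.
dat$date  <- as.Date(as.character(dat$date), format = "%Y%m%d")
dat$year  <- as.integer(format(dat$date, "%Y"))   # lubridate::year(dat$date)
dat$month <- as.integer(format(dat$date, "%m"))   # lubridate::month(dat$date)
dat
```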
1. Select (create if not present in the data) the following variables:
(a) Permno (permno)
(b) Date (date)
(c) Year (year)
(d) Month (month)
(e) Lagged Market Value (lag1.market_value)
(f) Return (ret)
(g) Short-Term Reversal (ret_1_0)
(h) Momentum 1-3 Months (ret_3_1)
(i) Momentum 1-6 Months (ret_6_1)
(j) Momentum 1-9 Months (ret_9_1)
(k) Momentum 7-12 Months (ret_12_1)
2. Machine learning models cannot be used with missing data. To solve this
problem, we will follow the steps suggested by Gu, Kelly, and Xiu (2020).
Impute missing features with their cross-sectional median as follows: (i)
calculate the cross-sectional median for each stock-level predictive characteristic,
(ii) check whether the stock-level predictive characteristic is missing, and (iii)
replace the stock-level predictive characteristic with its cross-sectional median
if it is missing.³
3. Restrict the sample to 2001 and onward (you lose the first year due to the
construction of the variables) and to firm × date observations with non-missing
lagged market values. Set missing returns (ret) to zero.
4. Calculate the summary statistics. You will probably notice that some of the
returns are extremely large. To prevent outliers from influencing your results,
winsorize the returns at the 0.5% level (e.g., use the Winsorize function from the
DescTools library).
5. Normalize all features between -1 and 1 in the cross-section:

\[ \frac{2 \times (x - \min(x))}{\max(x) - \min(x)} - 1 \]

6. Replace the stock-level predictive characteristics with 0 if missing after the
previous steps and drop all the rows for which the ret column contains a
missing value.
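
Steps 4 to 6 can be sketched in base R as follows. The data are simulated, scale_pm1 is a hypothetical helper name, and the sketch reads "0.5% level" as clipping at the 0.5% and 99.5% quantiles, which is an interpretation. The quantile-based clipping stands in for the Winsorize function from DescTools.

```r
set.seed(7)
dat <- data.frame(
  date    = rep(as.Date(c("2001-01-31", "2001-02-28")), each = 50),
  ret     = c(rnorm(99, 0.01, 0.10), 25),   # one extreme return
  ret_3_1 = rnorm(100, 0.03, 0.20)
)

# Step 4: winsorize returns at the 0.5% and 99.5% quantiles.
lims <- quantile(dat$ret, probs = c(0.005, 0.995), na.rm = TRUE)
dat$ret <- pmin(pmax(dat$ret, lims[[1]]), lims[[2]])

# Step 5: scale a feature to [-1, 1] within each month (the cross-section).
scale_pm1 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  2 * (x - rng[1]) / (rng[2] - rng[1]) - 1
}
month_key <- format(dat$date, "%Y-%m")
dat$ret_3_1 <- ave(dat$ret_3_1, month_key, FUN = scale_pm1)

# Step 6: zero-fill features still missing, drop rows without a return.
dat$ret_3_1[is.na(dat$ret_3_1)] <- 0
dat <- dat[!is.na(dat$ret), ]
```

Scaling within each month keeps the feature comparable across dates, since the minimum and maximum are recomputed for every cross-section.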
³ This problem can be solved for all stock-level predictive characteristics using the mutate_at
function from dplyr.
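
A minimal sketch of the median imputation with mutate_at, on made-up data, could look like this. It assumes the dplyr package is installed. Newer dplyr versions also offer across() for the same job.

```r
library(dplyr)

dat <- data.frame(
  date    = rep(c("2001-01-31", "2001-02-28"), each = 3),
  ret_1_0 = c(0.02, NA, 0.06, -0.01, 0.03, NA),
  ret_3_1 = c(NA, 0.10, 0.30, 0.05, NA, 0.15)
)

# Within each month, replace a missing characteristic by that month's median.
dat <- dat %>%
  group_by(date) %>%
  mutate_at(vars(ret_1_0, ret_3_1),
            ~ ifelse(is.na(.), median(., na.rm = TRUE), .)) %>%
  ungroup()
dat
```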

References
De Bondt, Werner F. M. and Richard Thaler. 1985. “Does the Stock Market Overreact?”
The Journal of Finance 40 (3):793–805.

De Bondt, Werner F. M. and Richard H. Thaler. 1987. “Further Evidence on Investor
Overreaction and Stock Market Seasonality.” The Journal of Finance 42 (3):557–581.

Fama, Eugene F. and Kenneth R. French. 1996. “Multifactor Explanations of Asset
Pricing Anomalies.” The Journal of Finance 51 (1):55–84.

Gu, Shihao, Bryan Kelly, and Dacheng Xiu. 2020. “Empirical Asset Pricing via Machine
Learning.” The Review of Financial Studies 33 (5):2223–2273.

Jegadeesh, Narasimhan. 1990. “Evidence of Predictable Behavior of Security Returns.”
The Journal of Finance 45 (3):881–898.

Jegadeesh, Narasimhan and Sheridan Titman. 1993. “Returns to Buying Winners and
Selling Losers: Implications for Stock Market Efficiency.” The Journal of Finance
48 (1):65–91.

Novy-Marx, Robert. 2013. “The Other Side of Value: The Gross Profitability Premium.”
Journal of Financial Economics 108 (1):1–28.
