Tidy Finance with Python
This textbook shows how to bring theoretical concepts from finance and econometrics to the
data. Focusing on coding and data analysis with Python, we show how to conduct research in
empirical finance from scratch. We start by introducing the concepts of tidy data and coding
principles using pandas, numpy, and plotnine. Code is provided to prepare common open-
source and proprietary financial data sources (CRSP, Compustat, Mergent FISD, TRACE) and
organize them in a database. We reuse these data in all the subsequent chapters, which we keep
as self-contained as possible. The empirical applications range from key concepts of empirical
asset pricing (beta estimation, portfolio sorts, performance analysis, Fama-French factors) to
modeling and machine learning applications (fixed effects estimation, clustering standard er-
rors, difference-in-difference estimators, ridge regression, Lasso, Elastic net, random forests,
neural networks) and portfolio optimization techniques.
Key Features:
• Self-contained chapters on the most important applications and methodologies in finance,
which can easily be used for the reader’s research or as a reference for courses on empirical
finance.
• Each chapter is reproducible in the sense that the reader can replicate every single figure,
table, or number by simply copying and pasting the code we provide.
• A full-fledged introduction to machine learning with scikit-learn based on tidy principles to
show how factor selection and option pricing can benefit from Machine Learning methods.
• We show how to retrieve and prepare the most important datasets in financial economics: CRSP
and Compustat, including detailed explanations of the most relevant data characteristics.
• Each chapter provides exercises based on established lectures and classes which are designed
to help students to dig deeper. The exercises can be used for self-studying or as a source of
inspiration for teaching exercises.
Chapman & Hall/CRC
The Python Series
About the Series
Python has been ranked as the most popular programming language, and it is widely used in education and
industry. This book series will offer a wide range of books on Python for students and professionals. Titles
in the series will help users learn the language at an introductory and advanced level, and explore its many
applications in data science, AI, and machine learning. Series titles can also be supplemented with Jupyter
notebooks.
Python Packages
Tomas Beuzen and Tiffany-Anne Timbers
Christoph Scheuch
Stefan Voigt
Patrick Weiss
Christoph Frey
Designed cover image: © Christoph Frey, Christoph Scheuch, Stefan Voigt and Patrick Weiss
First edition published 2024
by CRC Press
2385 NW Executive Center Drive, Suite 320, Boca Raton FL 33431
© 2024 Christoph Frey, Christoph Scheuch, Stefan Voigt and Patrick Weiss
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot as-
sume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have
attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders
if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please
write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or
utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including pho-
tocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission
from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the
Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are
not available on CCC please contact [email protected]
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for iden-
tification and explanation without intent to infringe.
DOI: 10.1201/9781032684307
Publisher’s note: This book has been prepared from camera-ready copy provided by the authors.
Contents
Preface ix
Author Biographies xv
I Getting Started 1
1 Setting Up Your Environment 3
1.1 Python Environment 3
1.2 Installation via Anaconda 3
1.3 Python IDE 4
1.4 Installation via RStudio 5
II Financial Data 23
3 Accessing and Managing Financial Data 25
3.1 Fama-French Data 25
3.2 q-Factors 27
3.3 Macroeconomic Predictors 28
3.4 Other Macroeconomic Data 29
3.5 Setting Up a Database 30
3.6 Managing SQLite Databases 32
3.7 Exercises 32
Appendices 209
A Colophon 209
B Proofs 213
B.1 Optimal Portfolio Choice 213
B.1.1 Minimum variance portfolio 213
B.1.2 Efficient portfolio 213
B.1.3 Equivalence between certainty equivalent maximization and minimum variance optimization 214
Bibliography 235
Index 243
Preface
• The first part helps you to set up your Python development environment and introduces
you to essential programming concepts around which our approach to Tidy Finance
revolves.
• The second part provides tools to organize your data and prepare the most common
datasets used in financial research. Although many important data are behind paywalls,
we start by describing different open-source data and how to download them. We then
move on to prepare two of the most popular datasets in financial research: CRSP and
Compustat. Then, we cover corporate bond data from TRACE. We reuse the data from
these chapters in all subsequent chapters. The last chapter of this part contains an
overview of common alternative data providers for which direct access via Python packages
exists.
• The third part deals with key concepts of empirical asset pricing, such as beta estimation,
portfolio sorts, performance analysis, and asset pricing regressions.
• In the fourth part, we apply linear models to panel data and machine learning methods
to problems in factor selection and option pricing.
• The last part provides approaches for portfolio optimization and backtesting procedures.
Each chapter is self-contained and can be read individually. Yet, the data chapters provide
an important background necessary for data management in all other chapters.
practitioners and have a hands-on focus. Our book, in contrast, emphasizes reproducibil-
ity and starts with the applications right away to utilize Python as the tool to perform
data transformations and statistical analysis. Hence, we clearly focus on state-of-the-art
applications for academic research in finance. Thus, we fill a niche that allows aspiring
researchers or instructors to rely on a well-designed code base.
• Furthermore, Weiming (2019) and Kelliher (2022) are comprehensive introductions to
quantitative finance with a greater focus on option pricing, quantitative modeling for
various markets besides equity, and algorithmic trading. Again, these books are primarily
written for finance professionals to introduce Python or enhance their Python knowledge.
• Coqueret and Guida (2023) constitutes a great compendium to our book concerning
applications related to return prediction and portfolio formation. The book primarily
targets practitioners and has a hands-on focus. Our book, in contrast, relies on the
typical databases used in financial research and focuses on the preparation of such
datasets for academic applications. In addition, our chapter on machine learning focuses
on factor selection instead of return prediction.
Although we emphasize the importance of reproducible workflow principles, we do not
provide introductions to some of the core tools that we relied on to create and maintain
this book:
• Version control systems such as Git2 are vital in managing any programming project.
Originally designed to organize the collaboration of software developers, even solo data
analysts will benefit from adopting version control. Git also makes it simple to publicly
share code and allow others to reproduce your findings. We refer to Bryan (2022) for a
gentle introduction to the (sometimes painful) life with Git.
• Good communication of results is a key ingredient to reproducible and transparent
research. To compile this book, we heavily draw on a suite of fantastic open-source
tools. First, Kibirige (2023b) provides a highly customizable yet easy-to-use system
for creating data visualizations based on the Grammar of Graphics (Wilkinson, 2012).
Second, in our daily work and to compile this book, we used Quarto, an open-source
scientific and technical publishing system described in Allaire et al. (2023). Markdown
documents are fully reproducible and support static and dynamic output formats. We
do not provide introductions to these tools, as the resources above already provide easily
accessible tutorials.
• Good writing is also important for the presentation of findings. We neither claim to be
experts in this domain nor do we try to sound particularly academic. On the contrary,
we deliberately use a more colloquial language to describe all the methods and results
presented in this book in order to allow our readers to relate more easily to the rather
technical content. For those who desire more guidance with respect to formal academic
writing for financial economics, we recommend Kiesling (2003), Cochrane (2005), and
Jacobsen (2014), who all provide essential tips (condensed to a few pages).
Why Tidy?
As you start working with data, you quickly realize that you spend a lot of time reading,
cleaning, and transforming your data. In fact, it is often said that more than 80 percent of
2 https://git-scm.com/
data analysis is spent on preparing data. By tidying data, we want to structure datasets to
facilitate further analyses. As Wickham (2014) puts it:
[T]idy datasets are all alike, but every messy dataset is messy in its own way. Tidy
datasets provide a standardized way to link the structure of a dataset (its physical
layout) with its semantics (its meaning).
In its essence, tidy data follows these three principles:
1. Every column is a variable.
2. Every row is an observation.
3. Every cell is a single value.
Throughout this book, we try to follow these principles as best as possible. If you want to
learn more about tidy data principles in an informal manner, we refer you to this vignette3
as part of Wickham and Girlich (2022).
In addition to the data layer, there are also tidy coding principles outlined in the tidy tools
manifesto4 that we try to follow:
1. Reuse existing data structures.
2. Compose simple functions with the pipe.
3. Embrace functional programming.
4. Design for humans.
Why Python?
Python (Python Software Foundation, 2023) is an open-source, general-purpose, high-level
programming language widely used across various industries. Python is prevalent for data
science according to the Python Developers Survey (Foundation and JetBrains, 2022), par-
ticularly for financial applications. Similar to R, the Python community values readable
and straightforward code. Thus, Python is an excellent choice for first-time programmers.
At the same time, experienced researchers in financial economics and analysts benefit from
the wide range of possibilities to express complex ideas with concise and understandable
code. Some of the highlights of Python include:
• Open-source: Python uses an open-source license, making it usable and distributable for academic
and commercial use.
• Flexibility: Python’s extensive ecosystem of standard libraries and community-contributed
modules allows for all kinds of unique projects. It seamlessly integrates various data sources
and APIs, facilitating efficient data retrieval and processing.
• Versatility: Python is a cross-platform, multipurpose language that can be used to write
fast low-level executable code, large applications, and even graphical user interfaces (GUI).
• Speed: Python is fast. In addition, parallelization is straightforward to implement in order
to tackle big data problems without hassle.
• Robustness: Python provides robust tools for data manipulation, analysis, and visualiza-
tion, which are crucial components in finance research.
3 https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
4 https://tidyverse.tidyverse.org/articles/manifesto.html
• Importance: The language’s active community support and continuous development en-
sure access to cutting-edge technologies and methodologies. Learning Python enhances
one’s ability to conduct sophisticated financial analysis, making it a valuable skill for
professionals across diverse fields.
The so-called Zen of Python by Tim Peters summarizes its major syntax guidelines for struc-
tured, tidy, and human-readable code. It is easily accessible in every Python environment
through:
import this
License
This book is licensed to you under Creative Commons Attribution-NonCommercial-
ShareAlike 4.0 International CC BY-NC-SA 4.0. The code samples in this book are licensed
under Creative Commons CC0 1.0 Universal (CC0 1.0), i.e., public domain. You can cite
the Python project as follows:
Scheuch, C., Voigt, S., Weiss, P., & Frey, C. (2024). Tidy Finance with Python (1st ed.).
Chapman and Hall/CRC. https://www.tidy-finance.org.
You can also use the following BibTeX entry:
@book{Scheuch2024,
title = {Tidy Finance with Python (1st ed.)},
author = {Scheuch, Christoph and Voigt, Stefan and Weiss, Patrick
and Frey, Christoph},
year = {2024},
publisher = {Chapman and Hall/CRC},
edition = {1st},
url = {https://tidy-finance.org/python}
}
5 https://tidy-finance.org/python/changelog.html
Author Biographies
Christoph Scheuch is the Head of Artificial Intelligence at the social trading platform
wikifolio.com. He is responsible for researching, designing, and prototyping of cutting-edge
AI-driven products using R and Python. Before his focus on AI, he was responsible for
product management and business intelligence at wikifolio.com and an external lecturer at
the Vienna University of Economics and Business, Austria, where he taught finance students
how to manage empirical projects.
Stefan Voigt is an Assistant Professor of Finance at the Department of Economics at the
University of Copenhagen, Denmark and a research fellow at the Danish Finance Institute. His
research focuses on blockchain technology, high-frequency trading, and financial econometrics.
Stefan’s research has been published in the leading finance and econometrics journals and
he received the Danish Finance Institute Teaching Award 2022 for his courses for students
and practitioners on empirical finance based on Tidy Finance.
Patrick Weiss is an Assistant Professor of Finance at Reykjavik University, Iceland, and
an external lecturer at the Vienna University of Economics and Business, Austria. His
research activity centers around the intersection of empirical asset pricing and corporate
finance, with his research appearing in leading journals in financial economics. Patrick is
especially passionate about empirical asset pricing and strives to understand the impact of
methodological uncertainty on research outcomes.
Christoph Frey is a Quantitative Researcher and Portfolio Manager at a family office
in Hamburg, Germany, and a Research Fellow at the Centre for Financial Economet-
rics, Asset Markets and Macroeconomic Policy at Lancaster University, UK. Prior to
this, he was the leading quantitative researcher for systematic multi-asset strategies at
Berenberg Bank and worked as an Assistant Professor at the Erasmus Universiteit Rotterdam,
Netherlands. Christoph published research on Bayesian Econometrics and specializes in
financial econometrics and portfolio optimization problems.
Part I
Getting Started
1 Setting Up Your Environment
We aim to lower the bar for starting empirical research in financial economics. We want
to make using Python easy for you. However, given that Tidy Finance is a platform that
supports multiple programming languages, we also consider the possibility that you are
unfamiliar with Python. Maybe you are transitioning from R to Python, i.e., following the
journey of Tidy Finance, which started in R. Hence, we provide you with a simple guide
to get started with Python. If you have not used Python before, you will be able to use it
after reading this chapter.
For the installation, we use the Anaconda Python distribution you downloaded in the step
before.2 Additionally, you need the packages listed in the provided requirements.txt-file3
in a dedicated folder for the project. You can find the detailed list of packages in Appendix A.
We recommend you start with the package installation right away. After you have prepared
your system, you can open the Anaconda prompt and install your virtual environment with
the following commands:
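The command block itself is missing from this excerpt. A sketch of the typical Anaconda prompt commands, assuming the environment path mentioned below and the Python version pin referenced in the footnote (the exact syntax is an assumption, not the book's verified code):
conda create -p C:\Apps\Anaconda3\envs\tidy_finance_environment python==3.10.11
conda activate C:\Apps\Anaconda3\envs\tidy_finance_environment
pip install -r requirements.txt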
All other packages found with the command pip list are installed automati-
cally as dependencies with the required packages in the file requirements.txt.
Note that we make reference to two distinct folders. The first one,
C:\Apps\Anaconda3\envs\tidy_finance_environment refers to the location of your
Python environment used for Tidy Finance. Apart from that, you should store your data,
program codes, and scripts in another location: Your Tidy Finance working folder.
Now, you are basically ready to go. However, you will now need a Python integrated devel-
opment environment (IDE) to make your coding experience pleasant.
the previous step to use Python version 3.10.11. The neat aspect is Python’s capability to accommodate
version control in this respect.
3 https://github.com/tidy-finance/website/blob/main/requirements.txt
4 https://www.spyder-ide.org/
5 https://code.visualstudio.com/docs/python/python-tutorial
6 https://wiki.python.org/moin/IntegratedDevelopmentEnvironments
7 https://posit.co/download/rstudio-desktop/
environment in RStudio. Alternatively, you can also start with the installation guide starting
from RStudio, which we present below.
Now you are ready to execute all code that you can find in this book or its sibling Tidy
Finance with R.12
8 https://github.com/tidy-finance/environment
9 https://cran.r-project.org/
10 https://posit.co/download/rstudio-desktop/
11 https://github.com/tidy-finance/environment
12 https://www.tidy-finance.org/r/index.html
2 Introduction to Tidy Finance
The main aim of this chapter is to familiarize yourself with pandas (Wes McKinney, 2010)
and numpy (Harris et al., 2020), the main workhorses for data analysis in Python. We start
by downloading and visualizing stock data from Yahoo!Finance. Then, we move to a simple
portfolio choice problem and construct the efficient frontier. These examples introduce you
to our approach of Tidy Finance.
Note that import pandas as pd implies that we can call all pandas functions later with a
simple pd.function(). Instead, utilizing from pandas import * is generally discouraged,
as it leads to namespace pollution. This statement imports all functions and classes from
pandas into your current namespace, potentially causing conflicts with functions you define
or those from other imported libraries. Using the pd abbreviation is a very convenient way
to prevent this.
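The chapter's own import block is not part of this excerpt; the chunks below presuppose the conventional aliases, in particular yfinance imported as yf, which the calls to yf.download() rely on:
import pandas as pd
import numpy as np
import yfinance as yf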
We first download daily prices for one stock symbol, e.g., the Apple stock, AAPL, directly
from the data provider Yahoo!Finance. To download the data, you can use the function
yf.download(). The data from Yahoo!Finance comes as a dataframe, a two-dimensional,
tabular data structure in which each row is indexed, and each column has a name. After
the download, we apply a set of functions directly on the dataframe. First, we put the date
index into a separate column. Second, we add the column symbol that stores the ticker
information, and finally, we rename all columns to lowercase names. Dataframes allow for
chaining all these operations sequentially by using the . (dot) operator.
prices = (yf.download(
tickers="AAPL",
start="2000-01-01",
end="2022-12-31",
progress=False
)
.reset_index()
.assign(symbol="AAPL")
.rename(columns={
"Date": "date",
"Open": "open",
"High": "high",
"Low": "low",
"Close": "close",
"Adj Close": "adjusted",
"Volume": "volume"}
)
)
prices.head().round(3)
yf.download() downloads stock market data from Yahoo!Finance. The above code chunk
returns a dataframe with eight quite self-explanatory columns: date, the market prices at
the open, high, low, and close, the adjusted price in USD, the daily volume (in the
number of traded shares), and the symbol. The adjusted prices are corrected for anything
that might affect the stock price after the market closes, e.g., stock splits and dividends.
These actions affect the quoted prices but have no direct impact on the investors who hold
the stock. Therefore, we often rely on adjusted prices when it comes to analyzing the returns
an investor would have earned by holding the stock continuously.
Next, we use the plotnine (Kibirige, 2023b) package to visualize the time series of adjusted
prices in Figure 2.1. This package takes care of visualization tasks based on the principles
of the Grammar of Graphics (Wilkinson, 2012). Note that generally, we do not recommend
using the * import style. However, we use it here only for the plotting functions, which are
distinct to plotnine and have very plotting-related names. So, the risk of misuse through
a polluted namespace is marginal.
from plotnine import *
Creating figures becomes very intuitive with the Grammar of Graphics, as the following
code chunk demonstrates.
prices_figure = (
ggplot(prices,
aes(y="adjusted", x="date")) +
geom_line() +
labs(x="", y="",
title="Apple stock prices from 2000 to 2022")
)
prices_figure.draw()
Figure 2.1: The figure shows Apple stock prices between the beginning of 2000 and the end
of 2022. Prices are in USD, adjusted for dividend payments and stock splits.
The daily net return is defined as $r_t = p_t/p_{t-1} - 1$, where $p_t$ is the adjusted price on day $t$. In that context, the function pct_change() is
helpful because it computes exactly this percentage change.
returns = (prices
.sort_values("date")
.assign(ret=lambda x: x["adjusted"].pct_change())
.get(["symbol", "date", "ret"])
)
The resulting dataframe contains three columns, where the last contains the daily returns
(ret). Note that the first entry naturally contains a missing value (NaN) because there is
no previous price. Obviously, the use of pct_change() would be meaningless if the time
series is not ordered by ascending dates. The function sort_values() provides a convenient
way to order observations in the correct way for our application. In case you want to order
observations by descending dates, you can use the parameter ascending=False.
For the upcoming examples, we remove missing values, as these would require separate
treatment when computing, e.g., sample averages. In general, however, make sure you un-
derstand why NA values occur and carefully examine if you can simply get rid of these
observations. The dataframe dropna() method kicks out all rows that contain a missing
value in any column.
returns = returns.dropna()
Next, we visualize the distribution of daily returns in a histogram in Figure 2.2, where we
also introduce the mizani (Kibirige, 2023a) package for formatting functions. Additionally,
we add a dashed line that indicates the five percent quantile of the daily returns to the
histogram, which is a (crude) proxy for the worst return of the stock with a probability
of at most five percent. The five percent quantile is closely connected to the (historical)
value-at-risk, a risk measure commonly monitored by regulators. We refer to Tsay (2010)
for a more thorough introduction to stylized facts of returns.
from mizani.formatters import percent_format
quantile_05 = returns["ret"].quantile(0.05)
returns_figure = (
ggplot(returns, aes(x="ret")) +
geom_histogram(bins=100) +
geom_vline(aes(xintercept=quantile_05),
linetype="dashed") +
labs(x="", y="",
title="Distribution of daily Apple stock returns") +
scale_x_continuous(labels=percent_format())
)
returns_figure.draw()
Here, bins=100 determines the number of bins used in the illustration and, hence, implicitly
the width of the bins. Before proceeding, make sure you understand how to use the geom
geom_vline() to add a dashed line that indicates the five percent quantile of the daily
returns. A typical task before proceeding with any data is to compute summary statistics
for the main variables of interest.
pd.DataFrame(returns["ret"].describe()).round(3).T
We see that the maximum daily return was 13.9 percent. Perhaps not surprisingly, the
average daily return is close to but slightly above 0. In line with the illustration above,
the large losses on the day with the minimum returns indicate a strong asymmetry in the
distribution of returns.
You can also compute these summary statistics for each year individually by imposing
.groupby(returns["date"].dt.year), where the call .dt.year returns the year of a date
variable. More specifically, the few lines of code below compute the summary statistics from
above for individual groups of data defined by year. The summary statistics, therefore, allow
an eyeball analysis of the time-series dynamics of the return distribution.
Figure 2.2: The figure shows a histogram of daily Apple stock returns in percent. The dotted
vertical line indicates the historical five percent quantile.
(returns["ret"]
.groupby(returns["date"].dt.year)
.describe()
.round(3)
)
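The definition of the symbol list used in the next step is not included in this excerpt. A minimal sketch of how it could be constructed from the Dow Jones constituents page referenced in the footnote below (the table position and column name on that page are assumptions):
# Hypothetical construction of the ticker list from the Wikipedia table of
# Dow Jones constituents; the table index and column name are assumed.
url = "https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average"
symbols = pd.read_html(url)[1]["Symbol"].tolist()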
Next, we can use yf.download() to download prices for all stock symbols in the above list
and again chain a couple of pandas dataframe functions to create a tidy dataset.
index_prices = (yf.download(
tickers=symbols,
start="2000-01-01",
end="2022-12-31",
progress=False
)
.melt(ignore_index=False, var_name=["variable", "symbol"])
.reset_index()
.pivot(index=["Date", "symbol"], columns="variable", values="value")
1 https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average
.reset_index()
.rename(columns={
"Date": "date",
"Open": "open",
"High": "high",
"Low": "low",
"Close": "close",
"Adj Close": "adjusted",
"Volume": "volume"}
)
)
The resulting dataframe contains 173,610 daily observations for 30 different stock symbols.
Figure 2.3 illustrates the time series of downloaded adjusted prices for each of the con-
stituents of the Dow Jones index. We again draw on the mizani package, but this time we
use its useful date formatting function to get nicer axis labels. Make sure you understand
every single line of code! What are the arguments of aes()? Which alternative geoms could
you use to visualize the time series? Hint: If you do not know the answers, try to change
the code to see what difference your intervention causes.
from mizani.breaks import date_breaks
from mizani.formatters import date_format
index_prices_figure = (
ggplot(index_prices,
aes(y="adjusted", x="date", color="symbol")) +
geom_line() +
labs(x="", y="", color="",
title="Stock prices of DOW index constituents") +
theme(legend_position="none") +
scale_x_datetime(date_breaks="5 years", date_labels="%Y")
)
index_prices_figure.draw()
Do you notice the small differences relative to the code we used before?
yf.download(symbols) returns a dataframe for several symbols as well. All we need to
do to illustrate all symbols simultaneously is to include color="symbol" in the ggplot
aesthetics. In this way, we generate a separate line for each symbol. Of course, there are
simply too many lines on this graph to identify the individual stocks properly, but it
illustrates the point well.
The same holds for stock returns. Before computing the returns, we use groupby("symbol")
such that the assign() command is performed to calculate the returns for each symbol
individually and assign it to the variable ret in the dataframe. The same logic also applies
to the computation of summary statistics: groupby("symbol") is the key to aggregating
the time series into symbol-specific variables of interest.
all_returns = (index_prices
.assign(ret=lambda x: x.groupby("symbol")["adjusted"].pct_change())
.get(["symbol", "date", "ret"])
.dropna(subset="ret")
)
Figure 2.3: The figure shows the stock prices of DOW index constituents. Prices are in USD,
adjusted for dividend payments and stock splits.
(all_returns
.groupby("symbol")["ret"]
.describe()
.round(3)
)
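The chunk that aggregates daily trading volume is not part of this excerpt. A sketch that is consistent with the plotting code and figures below, where the exact construction (conversion to billion USD and the lagged column) is an assumption:
# Hypothetical aggregation: dollar volume per stock, summed across the DOW
# constituents for each day, expressed in billion USD, plus a one-day lag.
trading_volume = (index_prices
  .assign(trading_volume=lambda x: x["volume"]*x["adjusted"]/1e9)
  .groupby("date")["trading_volume"]
  .sum()
  .reset_index()
  .assign(trading_volume_lag=lambda x: x["trading_volume"].shift(1))
)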
trading_volume_figure = (
ggplot(trading_volume,
aes(x="date", y="trading_volume")) +
geom_line() +
labs(x="", y="",
title=("Aggregate daily trading volume of DOW index constituents "
"in billion USD")) +
scale_x_datetime(date_breaks="5 years", date_labels="%Y")
)
trading_volume_figure.draw()
Figure 2.4: The figure shows the total daily trading volume in billion USD.
Figure 2.4 indicates a clear upward trend in aggregated daily trading volume. In particular,
since the outbreak of the COVID-19 pandemic, markets have processed substantial trading
volumes, as analyzed, for instance, by Goldstein et al. (2021). One way to illustrate the
persistence of trading volume would be to plot volume on day $t$ against volume on day $t-1$
as in the example below. In Figure 2.5, we add a dotted 45°-line to indicate a hypothetical
one-to-one relation by geom_abline(), addressing potential differences in the axes’ scales.
trading_volume_figure = (
ggplot(trading_volume,
aes(x="trading_volume_lag", y="trading_volume")) +
geom_point() +
geom_abline(aes(intercept=0, slope=1), linetype="dashed") +
labs(x="Previous day aggregate trading volume",
y="Aggregate trading volume",
title=("Persistence in daily trading volume of DOW constituents "
"in billion USD"))
)
trading_volume_figure.draw()
Figure 2.5: The figure shows a scatterplot of aggregate trading volume against previous-day
aggregate trading volume.
Portfolio selection considers investors that prefer higher future returns but dislike future return volatil-
ity (defined as the square root of the return variance, i.e., the risk): the mean-variance
investor (Markowitz, 1952).
An essential tool to evaluate portfolios in the mean-variance context is the efficient fron-
tier, the set of portfolios that satisfies the condition that no other portfolio exists with a
higher expected return but with the same volatility, see, e.g., Merton (1972). We compute
and visualize the efficient frontier for several stocks. First, we extract each asset’s monthly
returns. In order to keep things simple, we work with a balanced panel and exclude DOW
constituents for which we do not observe a price on every single trading day since the year
2000.
prices = (index_prices
.groupby("symbol")
.apply(lambda x: x.assign(counts=x["adjusted"].dropna().count()))
.reset_index(drop=True)
.query("counts == counts.max()")
)
Next, we transform the returns from a tidy dataframe into a $(T \times N)$ matrix with one
column for each of the $N$ symbols and one row for each of the $T$ trading days to compute
the sample average return vector
$$\hat{\mu} = \frac{1}{T}\sum_{t=1}^{T} r_t, \qquad (2.2)$$
where $r_t$ is the $N$-dimensional vector of returns on date $t$, and the sample covariance matrix
$$\hat{\Sigma} = \frac{1}{T-1}\sum_{t=1}^{T}\left(r_t-\hat{\mu}\right)\left(r_t-\hat{\mu}\right)'. \qquad (2.3)$$
We achieve this by using pivot() with the new column names from the column symbol and
setting the values to adjusted.
In financial econometrics, a core focus falls on problems that arise if the investor has to rely
on estimates μ̂ and Σ̂ instead of using the vector of expected returns μ and the variance-
covariance matrix Σ. We highlight the impact of estimation uncertainty on the portfolio
performance in various backtesting applications in Chapter 17 and Chapter 18.
For now, we focus on a much more restricted set of assumptions: The N assets are fixed, and
the first two moments of the distribution of the returns are determined by the parameters
μ and Σ. Thus, even though we proceed with the vector of sample average returns and
the sample variance-covariance matrix, those will be handled as the true parameters of the
return distribution for the rest of this chapter. We, therefore, refer to Σ and μ instead of
explicitly highlighting that the sample moments are estimates.
returns_matrix = (prices
.pivot(columns="symbol", values="adjusted", index="date")
.resample("m")
.last()
.pct_change()
.dropna()
)
mu = np.array(returns_matrix.mean()).T
sigma = np.array(returns_matrix.cov())
Then, we compute the minimum variance portfolio weights $\omega_\text{mvp}$ as well as the expected
return $\omega_\text{mvp}'\mu$ and volatility $\sqrt{\omega_\text{mvp}'\Sigma\omega_\text{mvp}}$ of this portfolio. Recall that the minimum variance
portfolio is the vector of portfolio weights that are the solution to
$$\omega_\text{mvp} = \arg\min_{\omega} \omega'\Sigma\omega \quad \text{s.t.} \quad \sum_{i=1}^{N}\omega_i = 1. \qquad (2.4)$$
The constraint that weights sum up to one simply implies that all funds are distributed
across the available asset universe, i.e., there is no possibility to retain cash. It is easy to
show analytically that $\omega_\text{mvp} = \frac{\Sigma^{-1}\iota}{\iota'\Sigma^{-1}\iota}$, where $\iota$ is a vector of ones and $\Sigma^{-1}$ is the inverse of $\Sigma$.
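The chunk computing these quantities is missing from this excerpt; a minimal sketch that matches the variable names used in the later chunks (iota, sigma_inv, mvp_weights, mvp_return, mvp_volatility) and is an assumption rather than the book's exact code:
# Analytical minimum variance portfolio weights.
iota = np.ones(sigma.shape[0])             # vector of ones
sigma_inv = np.linalg.inv(sigma)           # inverse of the covariance matrix
mvp_weights = sigma_inv @ iota/(iota.T @ sigma_inv @ iota)

# Monthly expected return and volatility of the minimum variance portfolio.
mvp_return = mu.T @ mvp_weights
mvp_volatility = np.sqrt(mvp_weights.T @ sigma @ mvp_weights)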
value
average_ret 0.008
volatility 0.032
$$\omega_\text{eff}(\bar{\mu}) = \arg\min_{\omega} \omega'\Sigma\omega \quad \text{s.t.} \quad \omega'\iota = 1 \ \text{ and } \ \omega'\mu \geq \bar{\mu}. \qquad (2.5)$$
In Appendix B, we show that the efficient portfolio takes the form (for $\bar{\mu} \geq D/C = \mu'\omega_\text{mvp}$)
$$\omega_\text{eff}(\bar{\mu}) = \omega_\text{mvp} + \frac{\tilde{\lambda}}{2}\left(\Sigma^{-1}\mu - \frac{D}{C}\Sigma^{-1}\iota\right),$$
where $C = \iota'\Sigma^{-1}\iota$, $D = \iota'\Sigma^{-1}\mu$, $E = \mu'\Sigma^{-1}\mu$, and $\tilde{\lambda} = 2\,\frac{\bar{\mu} - D/C}{E - D^2/C}$.
The code below implements the analytic solution to this optimization problem for a benchmark
return $\bar{\mu}$, which we set to 3 times the expected return of the minimum variance
portfolio. We encourage you to verify that it is correct.
benchmark_multiple = 3
mu_bar = benchmark_multiple*mvp_return
C = iota.T @ sigma_inv @ iota
D = iota.T @ sigma_inv @ mu
E = mu.T @ sigma_inv @ mu
lambda_tilde = 2*(mu_bar-D/C)/(E-D**2/C)
efp_weights = mvp_weights+lambda_tilde/2*(sigma_inv @ mu-D*mvp_weights)
To understand the code better, make sure to familiarize yourself with the inner workings of the
for loop.
length_year = 12
a = np.arange(-0.4, 2.0, 0.01)
res = pd.DataFrame(columns=["mu", "sd"], index=a).astype(float)
for i in a:
w = (1-i)*mvp_weights+i*efp_weights
res.loc[i, "mu"] = (w.T @ mu)*length_year
res.loc[i, "sd"] = np.sqrt(w.T @ sigma @ w)*np.sqrt(length_year)
The code above proceeds in two steps: First, we compute a vector of combination weights $a$,
and then we evaluate the resulting linear combination with $a \in \mathbb{R}$:
$$\omega^* = a\,\omega_\text{eff}(\bar{\mu}) + (1-a)\,\omega_\text{mvp} = \omega_\text{mvp} + \frac{\lambda^*}{2}\left(\Sigma^{-1}\mu - \frac{D}{C}\Sigma^{-1}\iota\right) \qquad (2.6)$$
with $\lambda^* = 2\,\frac{a\bar{\mu} + (1-a)D/C - D/C}{E - D^2/C}$. It follows that $\omega^* = \omega_\text{eff}\!\left(a\bar{\mu} + (1-a)D/C\right)$; in other words, $\omega^*$ is
an efficient portfolio that proves the mutual fund separation theorem.
Finally, it is simple to visualize the efficient frontier alongside the two efficient portfolios
within one powerful figure using the ggplot function from plotnine (see Figure 2.6). We
also add the individual stocks in the same call. We compute annualized returns based on
the simple assumption that monthly returns are independent and identically distributed.
Thus, the average annualized return is just twelve times the expected monthly return.
mvp_return = (mu.T @ mvp_weights)*length_year
mvp_volatility = (np.sqrt(mvp_weights.T @ sigma @ mvp_weights)*
np.sqrt(length_year))
efp_return = mu_bar*length_year
efp_volatility = (np.sqrt(efp_weights.T @ sigma @ efp_weights)*
np.sqrt(length_year))
res_figure = (
ggplot(res, aes(x="sd", y="mu")) +
geom_point() +
geom_point(
pd.DataFrame({"mu": [mvp_return, efp_return],
"sd": [mvp_volatility, efp_volatility]}),
size=4
) +
geom_point(
pd.DataFrame({"mu": mu*length_year,
"sd": np.sqrt(np.diag(sigma))*np.sqrt(length_year)})
) +
labs(x="Annualized standard deviation",
y="Annualized expected return",
title="Efficient frontier for DOW index constituents") +
scale_x_continuous(labels=percent_format()) +
scale_y_continuous(labels=percent_format())
)
res_figure.draw()
Figure 2.6: The figure shows the efficient frontier for DOW index constituents. The big dots
indicate the location of the minimum variance and the efficient portfolio that delivers three
times the expected return of the minimum variance portfolio, respectively. The small dots
indicate the location of the individual constituents.
The line in Figure 2.6 indicates the efficient frontier: the set of portfolios a mean-variance
efficient investor would choose from. Compare the performance relative to the individual
assets (the dots); it should become clear that diversifying yields massive performance gains
(at least as long as we take the parameters Σ and 𝜇𝜇 as given).
2.6 Exercises
1. Download daily prices for another stock market symbol of your choice from Ya-
hoo!Finance with yf.download() from the yfinance package. Plot two time
series of the ticker’s unadjusted and adjusted closing prices. Explain the differ-
ences.
2. Compute daily net returns for an asset of your choice and visualize the distribution
of daily returns in a histogram using 100 bins. Also, use geom_vline() to add
a dashed red vertical line that indicates the five percent quantile of the daily
returns. Compute summary statistics (mean, standard deviation, minimum and
maximum) for the daily returns.
3. Take your code from before and generalize it such that you can perform all the
computations for an arbitrary vector of tickers (e.g., ticker = ["AAPL", "MMM",
"BA"]). Automate the download, the plot of the price time series, and create a
table of return summary statistics for this arbitrary number of assets.
4. Are days with high aggregate trading volume often also days with large absolute
returns? Find an appropriate visualization to analyze the question using the ticker
AAPL.
5. Compute monthly returns from the downloaded stock market prices. Compute
the vector of historical average returns and the sample variance-covariance matrix.
Compute the minimum variance portfolio weights and the portfolio volatility and
average returns. Visualize the mean-variance efficient frontier. Choose one of your
assets and identify the portfolio which yields the same historical volatility but
achieves the highest possible average return.
6. In the portfolio choice analysis, we restricted our sample to all assets trading
every day since 2000. How is such a decision a problem when you want to infer
future expected portfolio performance from the results?
7. The efficient frontier characterizes the portfolios with the highest expected return
for different levels of risk. Identify the portfolio with the highest expected return
per standard deviation. Which famous performance measure is close to the ratio
of average returns to the standard deviation of returns?
Part II
Financial Data
3 Accessing and Managing Financial Data
In this chapter, we suggest a way to organize your financial data. Everybody who has ex-
perience with data is also familiar with storing data in various formats like CSV, XLS,
XLSX, or other delimited value storage. Reading and saving data can become very cum-
bersome when using different data formats and across different projects. Moreover, storing
data in delimited files often leads to problems with respect to column type consistency. For
instance, date-type columns frequently lead to inconsistencies across different data formats
and programming languages.
This chapter shows how to import different open-source datasets. Specifically, our data
comes from the application programming interface (API) of Yahoo!Finance, a downloaded
standard CSV file, an XLSX file stored in a public Google Drive repository, and other
macroeconomic time series. We store all the data in a single database, which serves as the
only source of data in subsequent chapters. We conclude the chapter by providing some tips
on managing databases.
First, we load the Python packages that we use throughout this chapter. Later on, we load
more packages in the sections where we need them.
import pandas as pd
import numpy as np
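The excerpt does not show the remaining imports this chapter relies on; the chunks below presuppose at least pandas-datareader (aliased as pdr, as in pdr.DataReader()) and sqlite3:
import pandas_datareader as pdr
import sqlite3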
Moreover, we initially define the date range for which we fetch and store the financial data,
making future data updates tractable. In case you need another time frame, you can adjust
the dates below. Our data starts with 1960 since most asset pricing studies use data from
1962 on.
start_date = "1960-01-01"
end_date = "2022-12-31"
We can use the pdr.DataReader() function of the pandas-datareader package to download monthly Fama-
French factors. The set Fama/French 3 Factors contains the return time series of the market
(mkt_excess), size (smb), and value (hml) factors alongside the risk-free rates (rf). Note
that we have to do some manual work to parse all the columns correctly and scale them
appropriately, as the raw Fama-French data comes in a unique data format. For precise
descriptions of the variables, we suggest consulting Prof. Kenneth French’s finance data
library directly. If you are on the website, check the raw data files to appreciate the time
you can save thanks to pandas_datareader.
factors_ff3_monthly_raw = pdr.DataReader(
name="F-F_Research_Data_Factors",
data_source="famafrench",
start=start_date,
end=end_date)[0]
factors_ff3_monthly = (factors_ff3_monthly_raw
.divide(100)
.reset_index(names="month")
.assign(month=lambda x: pd.to_datetime(x["month"].astype(str)))
.rename(str.lower, axis="columns")
.rename(columns={"mkt-rf": "mkt_excess"})
)
We also download the set 5 Factors (2x3), which additionally includes the return time series
of the profitability (rmw) and investment (cma) factors. We demonstrate how the monthly
factors are constructed in Chapter 11.
factors_ff5_monthly_raw = pdr.DataReader(
name="F-F_Research_Data_5_Factors_2x3",
data_source="famafrench",
start=start_date,
end=end_date)[0]
factors_ff5_monthly = (factors_ff5_monthly_raw
.divide(100)
.reset_index(names="month")
.assign(month=lambda x: pd.to_datetime(x["month"].astype(str)))
.rename(str.lower, axis="columns")
.rename(columns={"mkt-rf": "mkt_excess"})
)
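The chunk downloading the daily three-factor data (factors_ff3_daily_raw), together with its introductory sentence, is not part of this excerpt; a sketch by analogy with the monthly downloads above, where the dataset name passed to pdr.DataReader() is an assumption:
factors_ff3_daily_raw = pdr.DataReader(
  name="F-F_Research_Data_Factors_daily",
  data_source="famafrench",
  start=start_date,
  end=end_date)[0]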
factors_ff3_daily = (factors_ff3_daily_raw
.divide(100)
.reset_index(names="date")
.rename(str.lower, axis="columns")
.rename(columns={"mkt-rf": "mkt_excess"})
)
In a subsequent chapter, we also use the monthly returns from ten industry portfolios, so
let us fetch that data, too.
industries_ff_monthly_raw = pdr.DataReader(
name="10_Industry_Portfolios",
data_source="famafrench",
start=start_date,
end=end_date)[0]
industries_ff_monthly = (industries_ff_monthly_raw
.divide(100)
.reset_index(names="month")
.assign(month=lambda x: pd.to_datetime(x["month"].astype(str)))
.rename(str.lower, axis="columns")
)
It is worth taking a look at all available portfolio return time series from
Kenneth French’s homepage. You should check out the other sets by calling
pdr.famafrench.get_available_datasets().
3.2 q-Factors
In recent years, the academic discourse experienced the rise of alternative factor models, e.g.,
in the form of the Hou et al. (2014) q-factor model. We refer to the extended background1
information provided by the original authors for further information. The q-factors can be
downloaded directly from the authors’ homepage from within pd.read_csv().
We also need to adjust this data. First, we discard information we will not use in the
remainder of the book. Then, we rename the columns with the “R_” prefix using regular
expressions and write all column names in lowercase. We then query the data to select
observations between the start and end dates. Finally, we use the double asterisk (**)
notation in the assign function to apply the same transform of dividing by 100 to all four
factors by iterating through them. You should always try sticking to a consistent style for
naming objects, which we try to illustrate here - the emphasis is on try. You can check out
style guides available online, e.g., Hadley Wickham’s tidyverse style guide.2
factors_q_monthly_link = (
"https://global-q.org/uploads/1/2/2/6/122679606/"
"q5_factors_monthly_2022.csv"
)
factors_q_monthly = (pd.read_csv(factors_q_monthly_link)
.assign(
month=lambda x: (
pd.to_datetime(x["year"].astype(str) + "-" +
x["month"].astype(str) + "-01"))
)
1 http://global-q.org/background.html
2 https://style.tidyverse.org/index.html
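The chunk above is cut off by the page break. The remaining steps described in the text (dropping unused columns, stripping the R_ prefix, lowercasing, filtering to the sample period, and dividing the four factor columns by 100) could look as follows; all column names here are assumptions about the raw CSV, not the book's verified code:
factors_q_monthly = (factors_q_monthly
  .drop(columns=["R_F", "R_MKT", "year"], errors="ignore")
  .rename(columns=lambda x: x.replace("R_", "").lower())
  .query("month >= @start_date and month <= @end_date")
  .assign(**{col: (lambda x, col=col: x[col]/100)
             for col in ["me", "ia", "roe", "eg"]})
  .get(["month", "me", "ia", "roe", "eg"])
)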
Next, we read in the new data and transform the columns into the variables that we later
use:
1. The dividend price ratio (dp), the difference between the log of dividends and
the log of prices, where dividends are 12-month moving sums of dividends paid
on the S&P 500 index, and prices are monthly averages of daily closing prices
(Campbell and Shiller, 1988; Campbell and Yogo, 2006).
2. Dividend yield (dy), the difference between the log of dividends and the log of
lagged prices (Ball, 1978).
3. Earnings price ratio (ep), the difference between the log of earnings and the log
of prices, where earnings are 12-month moving sums of earnings on the S&P 500
index (Campbell and Shiller, 1988).
4. Dividend payout ratio (de), the difference between the log of dividends and the
log of earnings (Lamont, 1998).
5. Stock variance (svar), the sum of squared daily returns on the S&P 500 index
(Guo, 2006).
6. Book-to-market ratio (bm), the ratio of book value to market value for the Dow
Jones Industrial Average (Kothari and Shanken, 1997).
3 https://sites.google.com/view/agoyal145
7. Net equity expansion (ntis), the ratio of 12-month moving sums of net issues
by NYSE listed stocks divided by the total end-of-year market capitalization of
NYSE stocks (Campbell et al., 2008).
8. Treasury bills (tbl), the 3-Month Treasury Bill: Secondary Market Rate from the
economic research database at the Federal Reserve Bank of St. Louis (Campbell,
1987).
9. Long-term yield (lty), the long-term government bond yield from Ibbotson’s
Stocks, Bonds, Bills, and Inflation Yearbook (Welch and Goyal, 2008).
10. Long-term rate of returns (ltr), the long-term government bond returns from
Ibbotson’s Stocks, Bonds, Bills, and Inflation Yearbook (Welch and Goyal, 2008).
11. Term spread (tms), the difference between the long-term yield on government
bonds and the Treasury bill (Campbell, 1987).
12. Default yield spread (dfy), the difference between BAA and AAA-rated corporate
bond yields (Fama and French, 1989).
13. Inflation (infl), the Consumer Price Index (All Urban Consumers) from the
Bureau of Labor Statistics (Campbell and Vuolteenaho, 2004).
For variable definitions and the required data transformations, you can consult the material
on Amit Goyal’s website.4
macro_predictors = (
pd.read_csv(macro_predictors_link, thousands=",")
.assign(
month=lambda x: pd.to_datetime(x["yyyymm"], format="%Y%m"),
dp=lambda x: np.log(x["D12"])-np.log(x["Index"]),
dy=lambda x: np.log(x["D12"])-np.log(x["D12"].shift(1)),
ep=lambda x: np.log(x["E12"])-np.log(x["Index"]),
de=lambda x: np.log(x["D12"])-np.log(x["E12"]),
tms=lambda x: x["lty"]-x["tbl"],
dfy=lambda x: x["BAA"]-x["AAA"]
)
.rename(columns={"b/m": "bm"})
.get(["month", "dp", "dy", "ep", "de", "svar", "bm",
"ntis", "tbl", "lty", "ltr", "tms", "dfy", "infl"])
.query("month >= @start_date and month <= @end_date")
.dropna()
)
cpi_monthly = (pdr.DataReader(
name="CPIAUCNS",
data_source="fred",
start=start_date,
end=end_date
)
.reset_index(names="month")
.rename(columns={"CPIAUCNS": "cpi"})
.assign(cpi=lambda x: x["cpi"]/x["cpi"].iloc[-1])
)
Note that we use the assign() in the last line to set the current (latest) price level as
the reference inflation level. To download other time series, we just have to look it up on
the FRED website and extract the corresponding key from the address. For instance, the
producer price index for gold ores can be found under the PCU2122212122210 key.6
An SQLite-database is easily created - the code below is really all there is. You do not need
any external software. Otherwise, date columns are stored and retrieved as integers. We will
use the resulting file tidy_finance_python.sqlite in the subfolder data for all subsequent chapters to
retrieve our data.
tidy_finance = sqlite3.connect(database="data/tidy_finance_python.sqlite")
Next, we create a remote table with the monthly Fama-French factor data. We do so with
the pandas function to_sql(), which copies the data to our SQLite-database.
(factors_ff3_monthly
.to_sql(name="factors_ff3_monthly",
con=tidy_finance,
if_exists="replace",
index=False)
)
6 https://fred.stlouisfed.org/series/PCU2122212122210
7 https://SQLite.org/
8 https://en.wikipedia.org/wiki/SQL
Now, if we want to have the whole table in memory, we need to call pd.read_sql_query()
with the corresponding query. You will see that we regularly load the data into the memory
in the next chapters.
pd.read_sql_query(
sql="SELECT month, rf FROM factors_ff3_monthly",
con=tidy_finance,
parse_dates={"month"}
)
month rf
0 1960-01-01 0.0033
1 1960-02-01 0.0029
2 1960-03-01 0.0035
3 1960-04-01 0.0019
4 1960-05-01 0.0027
... ... ...
751 2022-08-01 0.0019
752 2022-09-01 0.0019
753 2022-10-01 0.0023
754 2022-11-01 0.0029
755 2022-12-01 0.0033
The last couple of code chunks are really all there is to organizing a simple database! You
can also share the SQLite database across devices and programming languages.
Before we move on to the next data source, let us also store the other six tables in our new
SQLite database.
data_dict = {
"factors_ff5_monthly": factors_ff5_monthly,
"factors_ff3_daily": factors_ff3_daily,
"industries_ff_monthly": industries_ff_monthly,
"factors_q_monthly": factors_q_monthly,
"macro_predictors": macro_predictors,
"cpi_monthly": cpi_monthly
}
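The loop that writes these tables to the database is not shown in this excerpt; a sketch that simply mirrors the to_sql() call from above for every entry of data_dict:
for key, value in data_dict.items():
    value.to_sql(name=key,
                 con=tidy_finance,
                 if_exists="replace",
                 index=False)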
From now on, all you need to do to access data that is stored in the database is to follow
two steps: (i) Establish the connection to the SQLite-database and (ii) execute the query
to fetch the data. For your convenience, the following steps show all you need in a compact
fashion.
import pandas as pd
import sqlite3
tidy_finance = sqlite3.connect(database="data/tidy_finance_python.sqlite")
factors_q_monthly = pd.read_sql_query(
sql="SELECT * FROM factors_q_monthly",
con=tidy_finance,
parse_dates={"month"}
)
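The chunk issuing the statement that the next paragraph refers to is not included in this excerpt; a minimal sketch (the exact call is an assumption):
# Reclaim unused space in the SQLite file.
tidy_finance.execute("VACUUM")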
The VACUUM command actually performs a couple of additional cleaning steps, which you
can read about in this tutorial.9
3.7 Exercises
1. Download the monthly Fama-French factors manually from Kenneth French’s
data library10 and read them in via pd.read_csv(). Validate that you get the
same data as via the pandas-datareader package.
2. Download the daily Fama-French 5 factors using the pdr.DataReader() function.
After the successful download and conversion to the column format
that we used above, compare the rf, mkt_excess, smb, and hml columns of
factors_ff3_daily to factors_ff5_daily. Discuss any differences you might
find.
9 https://www.sqlitetutorial.net/sqlite-vacuum/
10 https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html
4 WRDS, CRSP, and Compustat
This chapter shows how to connect to Wharton Research Data Services (WRDS),1 a popular
provider of financial and economic data for research applications. We use this connection to
download the most commonly used data for stock and firm characteristics, CRSP and Com-
pustat. Unfortunately, this data is not freely available, but most students and researchers
typically have access to WRDS through their university libraries. Assuming that you have
access to WRDS, we show you how to prepare and merge the databases and store them
in the SQLite database introduced in the previous chapter. We conclude this chapter by
providing some tips for working with the WRDS database.
If you don’t have access to WRDS but still want to run the code in this book, we refer to
Appendix C, where we show how to create a dummy database that contains the WRDS
tables and corresponding columns. With this database at hand, all code chunks in this book
can be executed with this dummy database.
First, we load the Python packages that we use throughout this chapter. Later on, we load
more packages in the sections where we need them. The last two packages are used for
plotting.
import pandas as pd
import numpy as np
import sqlite3
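The plotting imports that the sentence above refers to ("the last two packages") are missing from this excerpt; presumably the same plotting stack as in the earlier chapters, which is an assumption:
from plotnine import *
from mizani.formatters import percent_format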
We use the same date range as in the previous chapter to ensure consistency. However, we
have to use the date format that the WRDS database expects.
start_date = "01/01/1960"
end_date = "12/31/2022"
WRDS hosts their data in an SQL database, although they use the PostgreSQL2 engine. This database engine is
just as easy to handle with Python as SQLite. We use the sqlalchemy package to establish a
connection to the WRDS database because it already contains a suitable driver.3
from sqlalchemy import create_engine
To establish a connection, you use the function create_engine() with a connection string.
Note that you need to replace the WRDS_USER and WRDS_PASSWORD arguments with your
own credentials. We defined environment variables for the purpose of this book because
we obviously do not want (and are not allowed) to share our credentials with the rest of
the world (these environment variables are stored in the .env file of our project directory,
loaded with load_dotenv(), and then called with the os.getenv() function).
Additionally, you have to use two-factor authentication since May 2023 when establishing a
remote connection to WRDS. You have two choices to provide the additional identification.
First, if you have Duo Push enabled for your WRDS account, you will receive a push
notification on your mobile phone when trying to establish a connection with the code
below. Upon accepting the notification, you can continue your work. Second, you can log
in to a WRDS website that requires two-factor authentication with your username and
the same IP address. Once you have successfully identified yourself on the website, your
username-IP combination will be remembered for 30 days, and you can comfortably use the
remote connection below.
import os
from dotenv import load_dotenv
load_dotenv()
connection_string = (
"postgresql+psycopg2://"
f"{os.getenv('WRDS_USER')}:{os.getenv('WRDS_PASSWORD')}"
"@wrds-pgdata.wharton.upenn.edu:9737/wrds"
)
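The chunk that actually creates the connection object is not shown here. The later queries use a variable wrds, so a minimal sketch could be the following, where the pool_pre_ping argument is an assumption:
wrds = create_engine(connection_string, pool_pre_ping=True)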
The remote connection to WRDS is very useful. Yet, the database itself contains many
different tables. You can check the WRDS homepage4 to identify the table’s name you are
looking for (if you go beyond our exposition).
We use the two remote tables to fetch the data we want to put into our local database. Just
as above, the idea is that we let the WRDS database do all the work and just download the
data that we actually need. We apply common filters and data selection criteria to narrow
down our data of interest: (i) we keep only data in the time windows of interest, (ii) we keep
only US-listed stocks as identified via no special share types (sharetype = 'NS'), security
type equity (securitytype = 'EQTY'), security sub type common stock (securitysubtype
= 'COM'), issuers that are a corporation (issuertype in ('ACOR', 'CORP')), and (iii)
we keep only months within permno-specific start dates (secinfostartdt) and end dates
(secinfoenddt). As of July 2022, there is no need to additionally download delisting infor-
mation since it is already contained in the most recent version of msf.
crsp_monthly_query = (
"SELECT msf.permno, msf.mthcaldt AS date, "
"date_trunc('month', msf.mthcaldt)::date AS month, "
"msf.mthret AS ret, msf.shrout, msf.mthprc AS altprc, "
"msf.primaryexch, msf.siccd "
"FROM crsp.msf_v2 AS msf "
"LEFT JOIN crsp.stksecurityinfohist AS ssih "
"ON msf.permno = ssih.permno AND "
"ssih.secinfostartdt <= msf.mthcaldt AND "
"msf.mthcaldt <= ssih.secinfoenddt "
f"WHERE msf.mthcaldt BETWEEN '{start_date}' AND '{end_date}' "
"AND ssih.sharetype = 'NS' "
"AND ssih.securitytype = 'EQTY' "
"AND ssih.securitysubtype = 'COM' "
"AND ssih.usincflg = 'Y' "
"AND ssih.issuertype in ('ACOR', 'CORP')"
)
crsp_monthly = (pd.read_sql_query(
sql=crsp_monthly_query,
con=wrds,
dtype={"permno": int, "siccd": int},
parse_dates={"date", "month"})
.assign(shrout=lambda x: x["shrout"]*1000)
)
Now, we have all the relevant monthly return data in memory and proceed with preparing
the data for future analyses. We perform the preparation step at the current stage since
we want to avoid executing the same mutations every time we use the data in subsequent
chapters.
The first additional variable we create is market capitalization (mktcap), which is the prod-
uct of the number of outstanding shares (shrout) and the last traded price in a month
(altprc). Note that in contrast to returns (ret), these two variables are not adjusted ex-
post for any corporate actions like stock splits. We also keep the market cap in millions
of USD just for convenience, as we do not want to print huge numbers in our figures and
tables. In addition, we set zero market capitalization to missing as it makes conceptually
little sense (i.e., the firm would be bankrupt).
crsp_monthly = (crsp_monthly
  .assign(mktcap=lambda x: x["shrout"]*x["altprc"]/1000000)
  .assign(mktcap=lambda x: x["mktcap"].replace(0, np.nan))
)
The next variable we frequently use is the one-month lagged market capitalization. Lagged
market capitalization is typically used to compute value-weighted portfolio returns, as we
demonstrate in a later chapter. The simplest and most consistent way to add a column
with lagged market cap values is to add one month to each observation and then join the
information to our monthly CRSP data.
mktcap_lag = (crsp_monthly
.assign(
month=lambda x: x["month"]+pd.DateOffset(months=1),
mktcap_lag=lambda x: x["mktcap"]
)
.get(["permno", "month", "mktcap_lag"])
)
crsp_monthly = (crsp_monthly
.merge(mktcap_lag, how="left", on=["permno", "month"])
)
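The assignment below relies on a helper assign_exchange() that maps CRSP primary exchange
codes to exchange names and is not shown in the excerpt. A minimal sketch, assuming the
usual codes N, A, and Q for NYSE, AMEX, and NASDAQ:

def assign_exchange(primaryexch):
    # Map CRSP primary exchange codes to readable exchange names (sketch).
    if primaryexch == "N":
        return "NYSE"
    elif primaryexch == "A":
        return "AMEX"
    elif primaryexch == "Q":
        return "NASDAQ"
    else:
        return "Other"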
crsp_monthly["exchange"] = (crsp_monthly["primaryexch"]
.apply(assign_exchange)
)
Similarly, we transform industry codes to industry descriptions following Bali et al. (2016).
Notice that there are also other categorizations of industries (e.g., Fama and French, 1997)
that are commonly used.
def assign_industry(siccd):
    if 1 <= siccd <= 999:
        return "Agriculture"
    elif 1000 <= siccd <= 1499:
        return "Mining"
    elif 1500 <= siccd <= 1799:
        return "Construction"
    elif 2000 <= siccd <= 3999:
        return "Manufacturing"
    elif 4000 <= siccd <= 4899:
        return "Transportation"
    elif 4900 <= siccd <= 4999:
        return "Utilities"
    elif 5000 <= siccd <= 5199:
        return "Wholesale"
    elif 5200 <= siccd <= 5999:
        return "Retail"
    elif 6000 <= siccd <= 6799:
        return "Finance"
    elif 7000 <= siccd <= 8999:
        return "Services"
    elif 9000 <= siccd <= 9999:
        return "Public"
    else:
        return "Missing"
crsp_monthly["industry"] = (crsp_monthly["siccd"]
.apply(assign_industry)
)
Next, we compute excess returns by subtracting the monthly risk-free rate provided by our
Fama-French data. As we base all our analyses on the excess returns, we can drop the risk-
free rate from our data frame. Note that we ensure excess returns are bounded by -1 from
below as a return less than -100% makes no sense conceptually. Before we can adjust the
returns, we have to connect to our database and load the data frame factors_ff3_monthly.
tidy_finance = sqlite3.connect(database="data/tidy_finance_python.sqlite")
factors_ff3_monthly = pd.read_sql_query(
sql="SELECT month, rf FROM factors_ff3_monthly",
con=tidy_finance,
parse_dates={"month"}
)
crsp_monthly = (crsp_monthly
.merge(factors_ff3_monthly, how="left", on="month")
.assign(ret_excess=lambda x: x["ret"]-x["rf"])
.assign(ret_excess=lambda x: x["ret_excess"].clip(lower=-1))
.drop(columns=["rf"])
)
Since excess returns and market capitalization are crucial for all our analyses, we can safely
exclude all observations with missing returns or market capitalization.
crsp_monthly = (crsp_monthly
.dropna(subset=["ret_excess", "mktcap", "mktcap_lag"])
)
With the basic preparation done, we store the monthly CRSP data in our local database.

(crsp_monthly
  .to_sql(name="crsp_monthly",
          con=tidy_finance,
          if_exists="replace",
          index=False)
)
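The figure code below plots a securities_per_exchange table that is not computed in the
excerpt. A minimal sketch that counts listed securities per exchange and date, mirroring the
industry count shown later in this section:

securities_per_exchange = (crsp_monthly
  .groupby(["exchange", "date"])
  .size()
  .reset_index(name="n")
)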
securities_per_exchange_figure = (
ggplot(securities_per_exchange,
aes(x="date", y="n", color="exchange", linetype="exchange")) +
geom_line() +
labs(x="", y="", color="", linetype="",
title="Monthly number of securities by listing exchange") +
scale_x_datetime(date_breaks="10 years", date_labels="%Y") +
scale_y_continuous(labels=comma_format())
)
securities_per_exchange_figure.draw()
Figure 4.1: The figure shows the monthly number of stocks in the CRSP sample listed at
each of the US exchanges.

Next, we look at the aggregate market capitalization grouped by the respective listing
exchanges in Figure 4.2. To ensure that we look at meaningful data that is comparable over
time, we adjust the nominal values for inflation. In fact, we can use the tables that are
already in our database to calculate aggregate market caps by listing exchange. All values
in Figure 4.2 are in terms of the end of 2022 USD to ensure intertemporal comparability.
NYSE-listed stocks have by far the largest market capitalization, followed by NASDAQ-
listed stocks.

cpi_monthly = pd.read_sql_query(
  sql="SELECT * FROM cpi_monthly",
  con=tidy_finance,
  parse_dates={"month"}
)

market_cap_per_exchange = (crsp_monthly
  .merge(cpi_monthly, how="left", on="month")
  .groupby(["month", "exchange"])
  .apply(
    lambda group: pd.Series({
      "mktcap": group["mktcap"].sum()/group["cpi"].mean()
    })
  )
  .reset_index()
)
market_cap_per_exchange_figure = (
ggplot(market_cap_per_exchange,
aes(x="month", y="mktcap/1000",
color="exchange", linetype="exchange")) +
geom_line() +
labs(x="", y="", color="", linetype="",
title=("Monthly market cap by listing exchange "
"in billions of Dec 2022 USD")) +
scale_x_datetime(date_breaks="10 years", date_labels="%Y") +
scale_y_continuous(labels=comma_format())
)
market_cap_per_exchange_figure.draw()
Figure 4.2: The figure shows the monthly market capitalization by listing exchange. Market
capitalization is measured in billion USD, adjusted for consumer price index changes such
that the values on the vertical axis reflect the buying power of billion USD in December
2022.

Next, we look at the same descriptive statistics by industry. Figure 4.3 plots the number of
stocks in the sample for each of the SIC industry classifiers. For most of the sample period,
the largest share of stocks is in manufacturing, albeit the number peaked in the 1990s. The
number of firms associated with public administration seems to be the only category on the
rise in recent years, even surpassing manufacturing at the end of our sample period.
securities_per_industry = (crsp_monthly
.groupby(["industry", "date"])
.size()
.reset_index(name="n")
)
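The figure below cycles through a small set of line types via linetypes and n_industries,
which are not defined in the excerpt. A minimal sketch of these helpers (the specific line
types are an assumption):

linetypes = ["-", "--", "-.", ":"]
n_industries = securities_per_industry["industry"].nunique()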
securities_per_industry_figure = (
ggplot(securities_per_industry,
aes(x="date", y="n", color="industry", linetype="industry")) +
geom_line() +
labs(x="", y="", color="", linetype="",
title="Monthly number of securities by industry") +
scale_x_datetime(date_breaks="10 years", date_labels="%Y") +
scale_y_continuous(labels=comma_format()) +
scale_linetype_manual(
values=[linetypes[l % len(linetypes)] for l in range(n_industries)]
)
)
securities_per_industry_figure.draw()
Figure 4.3: The figure shows the monthly number of stocks in the CRSP sample associated
with different industries.
We also compute the market cap of all stocks belonging to the respective industries and
show the evolution over time in Figure 4.4. All values are again in terms of billions of end
of 2022 USD. At all points in time, manufacturing firms comprise the largest portion of
market capitalization. Toward the end of the sample, however, financial firms and services
begin to make up a substantial portion of the market cap.
market_cap_per_industry = (crsp_monthly
.merge(cpi_monthly, how="left", on="month")
.groupby(["month", "industry"])
.apply(
lambda group: pd.Series({
"mktcap": (group["mktcap"].sum()/group["cpi"].mean())
})
)
.reset_index()
)
market_cap_per_industry_figure = (
  ggplot(market_cap_per_industry,
         aes(x="month", y="mktcap/1000",
             color="industry", linetype="industry")) +
  geom_line() +
  labs(x="", y="", color="", linetype="",
       title=("Monthly market cap by industry "
              "in billions of Dec 2022 USD")) +
  scale_x_datetime(date_breaks="10 years", date_labels="%Y") +
  scale_y_continuous(labels=comma_format()) +
  scale_linetype_manual(
    values=[linetypes[l % len(linetypes)] for l in range(n_industries)]
  )
)
market_cap_per_industry_figure.draw()

Figure 4.4: The figure shows the total market capitalization in billion USD, adjusted for
consumer price index changes such that the values on the y-axis reflect the buying power
of billion USD in December 2022.
The daily CRSP data are too large to download and process for all stocks at once, so we
download the data in small batches of stocks consecutively. Such operations can be
implemented in for-loops, where we download, prepare, and store the data for a small number
of stocks in each iteration. This operation might take around five minutes, depending on your
internet connection. To keep track of the progress, we create ad-hoc progress updates using
print(). Notice that we also use the function to_sql() here with the option to append the new
data to an existing table when we process the second and all following batches. As for the
monthly CRSP data, there is no need to adjust for delisting returns in the daily CRSP data
since July 2022.
factors_ff3_daily = pd.read_sql(
sql="SELECT * FROM factors_ff3_daily",
con=tidy_finance,
parse_dates={"date"}
)
permnos = list(crsp_monthly["permno"].unique().astype(str))
batch_size = 500
batches = np.ceil(len(permnos)/batch_size).astype(int)
for j in range(1, batches+1):

    permno_batch = permnos[
        ((j-1)*batch_size):(min(j*batch_size, len(permnos)))
    ]

    permno_batch_formatted = (
        ", ".join(f"'{permno}'" for permno in permno_batch)
    )
    permno_string = f"({permno_batch_formatted})"

    crsp_daily_sub_query = (
        "SELECT permno, dlycaldt AS date, dlyret AS ret "
        "FROM crsp.dsf_v2 "
        f"WHERE permno IN {permno_string} "
        f"AND dlycaldt BETWEEN '{start_date}' AND '{end_date}'"
    )

    crsp_daily_sub = (pd.read_sql_query(
        sql=crsp_daily_sub_query,
        con=wrds,
        dtype={"permno": int},
        parse_dates={"date"})
      .dropna()
    )

    if not crsp_daily_sub.empty:

        crsp_daily_sub = (crsp_daily_sub
          .assign(
            month=lambda x:
              x["date"].dt.to_period("M").dt.to_timestamp()
          )
          .merge(factors_ff3_daily[["date", "rf"]],
                 on="date", how="left")
          .assign(
            ret_excess=lambda x:
              ((x["ret"]-x["rf"]).clip(lower=-1))
          )
          .get(["permno", "date", "month", "ret_excess"])
        )

        if j == 1:
            if_exists_string = "replace"
        else:
            if_exists_string = "append"

        crsp_daily_sub.to_sql(
          name="crsp_daily",
          con=tidy_finance,
          if_exists=if_exists_string,
          index=False
        )

    # Ad-hoc progress update as described above.
    print(f"Batch {j} out of {batches} done.")
Eventually, we end up with more than 71 million rows of daily return data. Note that we
only store the identifying information that we actually need, namely permno, date, and
month alongside the excess returns. We thus ensure that our local database contains only
the data that we actually use.
We also need annual firm fundamentals from Compustat, which we download from the funda
table with the following query.

compustat_query = (
"SELECT gvkey, datadate, seq, ceq, at, lt, txditc, txdb, itcb, pstkrv, "
"pstkl, pstk, capx, oancf, sale, cogs, xint, xsga "
"FROM comp.funda "
"WHERE indfmt = 'INDL' "
"AND datafmt = 'STD' "
"AND consol = 'C' "
f"AND datadate BETWEEN '{start_date}' AND '{end_date}'"
)
compustat = pd.read_sql_query(
sql=compustat_query,
con=wrds,
dtype={"gvkey": str},
parse_dates={"datadate"}
)
Next, we calculate the book value of equity (be), accounting for preferred stock, and the
operating profitability (op), following the variable definitions in Kenneth French's data
library.7 Note that we set negative or zero book equity to missing, which is a common
practice when working with book-to-market ratios (see Fama and French, 1992, for details).
compustat = (compustat
.assign(
be=lambda x:
(x["seq"].combine_first(x["ceq"]+x["pstk"])
.combine_first(x["at"]-x["lt"])+
x["txditc"].combine_first(x["txdb"]+x["itcb"]).fillna(0)-
x["pstkrv"].combine_first(x["pstkl"])
.combine_first(x["pstk"]).fillna(0))
)
.assign(
be=lambda x: x["be"].apply(lambda y: np.nan if y <= 0 else y)
)
.assign(
op=lambda x:
((x["sale"]-x["cogs"].fillna(0)-
x["xsga"].fillna(0)-x["xint"].fillna(0))/x["be"])
)
)
We keep only the last available information for each firm-year group (by using the tail(1)
pandas function for each group). Note that datadate defines the time the corresponding
financial data refers to (e.g., annual report as of December 31, 2022). Therefore, datadate
is not the date when data was made available to the public. Check out the Exercises for
more insights into the peculiarities of datadate.
compustat = (compustat
.assign(year=lambda x: pd.DatetimeIndex(x["datadate"]).year)
.sort_values("datadate")
.groupby(["gvkey", "year"])
.tail(1)
.reset_index()
)

7 https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html
We also compute the investment ratio (inv), according to Kenneth French's variable
definitions, as the change in total assets from one fiscal year to another. Note that we again
construct the lagged values with joins, as introduced with the CRSP data above.
compustat_lag = (compustat
.get(["gvkey", "year", "at"])
.assign(year=lambda x: x["year"]+1)
.rename(columns={"at": "at_lag"})
)
compustat = (compustat
.merge(compustat_lag, how="left", on=["gvkey", "year"])
.assign(inv=lambda x: x["at"]/x["at_lag"]-1)
.assign(inv=lambda x: np.where(x["at_lag"] <= 0, np.nan, x["inv"]))
)
With the last step, we are already done preparing the firm fundamentals. Thus, we can store
them in our local database.
(compustat
.to_sql(name="compustat",
con=tidy_finance,
if_exists="replace",
index=False)
)
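To link CRSP with Compustat, the read below assumes a ccmxpf_linktable_query string that
is not shown in the excerpt. A sketch of such a query against crsp.ccmxpf_linktable (the
specific link filters are assumptions reflecting common CCM-link practice):

ccmxpf_linktable_query = (
  "SELECT lpermno AS permno, gvkey, linkdt, "
  "COALESCE(linkenddt, CURRENT_DATE) AS linkenddt "
  "FROM crsp.ccmxpf_linktable "
  "WHERE linktype IN ('LU', 'LC') "
  "AND linkprim IN ('P', 'C') "
  "AND usedflag = 1"
)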
ccmxpf_linktable = pd.read_sql_query(
sql=ccmxpf_linktable_query,
con=wrds,
dtype={"permno": int, "gvkey": str},
parse_dates={"linkdt", "linkenddt"}
)
We use these links to create a new table with a mapping between stock identifier, firm
identifier, and date. We then use this mapping to add the Compustat gvkey to our monthly
stock data.
ccm_links = (crsp_monthly
.merge(ccmxpf_linktable, how="inner", on="permno")
.query("~gvkey.isnull() & (date >= linkdt) & (date <= linkenddt)")
.get(["permno", "gvkey", "date"])
)
crsp_monthly = (crsp_monthly
.merge(ccm_links, how="left", on=["permno", "date"])
)
As the last step, we update the previously prepared monthly CRSP file with the linking
information in our local database.
(crsp_monthly
.to_sql(name="crsp_monthly",
con=tidy_finance,
if_exists="replace",
index=False)
)
Before we close this chapter, let us look at an interesting descriptive statistic of our data.
As the book value of equity plays a crucial role in many asset pricing applications, it
is interesting to know for how many of our stocks this information is available. Hence,
Figure 4.5 plots the share of securities with book equity values for each exchange. It turns
out that the coverage is pretty bad for AMEX- and NYSE-listed stocks in the 1960s but
hovers around 80 percent for all periods thereafter. We can ignore the erratic coverage of
securities that belong to the other category since there are only a handful of them in our
sample anyway.
share_with_be = (crsp_monthly
.assign(year=lambda x: pd.DatetimeIndex(x["month"]).year)
.sort_values("date")
.groupby(["permno", "year"])
.tail(1)
.reset_index()
.merge(compustat, how="left", on=["gvkey", "year"])
.groupby(["exchange", "year"])
.apply(
lambda x: pd.Series({
"share": x["permno"][~x["be"].isnull()].nunique()/x["permno"].nunique()
})
)
.reset_index()
)
share_with_be_figure = (
ggplot(share_with_be,
aes(x="year", y="share", color="exchange", linetype="exchange")) +
geom_line() +
labs(x="", y="", color="", linetype="",
title="Share of securities with book equity values by exchange") +
scale_y_continuous(labels=percent_format()) +
coord_cartesian(ylim=(0, 1))
)
share_with_be_figure.draw()
Figure 4.5: The figure shows the end-of-year share of securities with book equity values by
listing exchange.
4.7 Exercises
1. Compute mktcap_lag using a grouped shift() of mktcap rather than using joins as
above. Filter out all the rows where the lag-based market capitalization measure is different
from the one we computed above. Why are the two measures different?
2. Plot the average market capitalization of firms for each exchange and industry,
respectively, over time. What do you find?
3. In the compustat table, datadate refers to the date to which the fiscal year of
a corresponding firm refers. Count the number of observations in Compustat by
month of this date variable. What do you find? What does the finding suggest
about pooling observations with the same fiscal year?
4. Go back to the original Compustat and extract rows where the same firm has
multiple rows for the same fiscal year. What is the reason for these observations?
5. Keep the last observation of crsp_monthly by year and join it with the compustat
table. Create the following plots: (i) aggregate book equity by exchange over time
and (ii) aggregate annual book equity by industry over time. Do you notice any
different patterns to the corresponding plots based on market capitalization?
6. Repeat the analysis of market capitalization for book equity, which we computed
from the Compustat data. Then, use the matched sample to plot book equity
against market capitalization. How are these two variables related?
5
TRACE and FISD
In this chapter, we dive into the US corporate bond market. Bond markets are far more
diverse than stock markets, as most issuers have multiple bonds outstanding simultaneously
with potentially very different indentures. This market segment is exciting due to its size
(roughly ten trillion USD outstanding), heterogeneity of issuers (as opposed to government
bonds), market structure (mostly over-the-counter trades), and data availability. We in-
troduce how to use bond characteristics from FISD and trade reports from TRACE and
provide code to download and clean TRACE in Python.
Many researchers study liquidity in the US corporate bond market, with notable contribu-
tions from Bessembinder et al. (2006), Edwards et al. (2007), and O’Hara and Zhou (2021),
among many others. We do not cover bond returns here, but you can compute them from
TRACE data. Instead, we refer to studies on the topic such as Bessembinder et al. (2008),
Bai et al. (2019), and Kelly et al. (2021) and a survey by Huang and Shi (2021).
This chapter also draws on the resources provided by the project Open Source Bond Asset
Pricing1 and their related publication, i.e., Dickerson et al. (2023). We encourage you to
visit their website to check out the additional resources they provide. Moreover, WRDS
provides bond returns computed from TRACE data at a monthly frequency.
The current chapter relies on the following set of Python packages.
import pandas as pd
import numpy as np
import sqlite3
import httpimport
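In addition to these four packages, the rest of this chapter reuses the setup from the previous
chapters: access to environment variables, the sqlalchemy connection to WRDS, and the
plotnine and mizani plotting helpers. These imports are not shown in the excerpt, so the
following list is an assumption reconstructed from the code used below.

import os

from plotnine import *
from mizani.breaks import date_breaks
from mizani.formatters import comma_format, date_format
from sqlalchemy import create_engine
from dotenv import load_dotenv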
Compared to previous chapters, we load httpimport (Torakis, 2023) to source code provided
in a public Gist.2 Note that you should be careful when loading anything from the web via
this method, and using unsecured HTTP links is highly discouraged. Also, you might run
into problems when working on a corporate computer whose firewall prevents downloading
data this way.
1 https://openbondassetpricing.com/
2 https://docs.github.com/en/get-started/writing-on-github/editing-and-sharing-content-with-gists/creating-gists
load_dotenv()

connection_string = (
  "postgresql+psycopg2://"
  f"{os.getenv('WRDS_USER')}:{os.getenv('WRDS_PASSWORD')}"
  "@wrds-pgdata.wharton.upenn.edu:9737/wrds"
)

wrds = create_engine(connection_string)
tidy_finance = sqlite3.connect(database="data/tidy_finance_python.sqlite")
fisd_query = (
"SELECT complete_cusip, maturity, offering_amt, offering_date, "
"dated_date, interest_frequency, coupon, last_interest_date, "
"issue_id, issuer_id "
"FROM fisd.fisd_mergedissue "
"WHERE security_level = 'SEN' "
"AND (slob = 'N' OR slob IS NULL) "
"AND security_pledge IS NULL "
"AND (asset_backed = 'N' OR asset_backed IS NULL) "
"AND (defeased = 'N' OR defeased IS NULL) "
"AND defeased_date IS NULL "
"AND bond_type IN ('CDEB', 'CMTN', 'CMTZ', 'CZ', 'USBN') "
"AND (pay_in_kind != 'Y' OR pay_in_kind IS NULL) "
"AND pay_in_kind_exp_date IS NULL "
"AND (yankee = 'N' OR yankee IS NULL) "
"AND (canadian = 'N' OR canadian IS NULL) "
"AND foreign_currency = 'N' "
"AND coupon_type IN ('F', 'Z') "
"AND fix_frequency IS NULL "
"AND coupon_change_indicator = 'N' "
"AND interest_frequency IN ('0', '1', '2', '4', '12') "
)
fisd = pd.read_sql_query(
sql=fisd_query,
con=wrds,
dtype={"complete_cusip": str, "interest_frequency": str,
"issue_id": int, "issuer_id": int},
parse_dates={"maturity", "offering_date",
"dated_date", "last_interest_date"}
)
We also pull issuer information from fisd_mergedissuer regarding the industry and coun-
try of the firm that issued a particular bond. Then, we filter to include only US-domiciled
firms’ bonds. We match the data by issuer_id.
fisd_issuer_query = (
"SELECT issuer_id, sic_code, country_domicile "
"FROM fisd.fisd_mergedissuer"
)
fisd_issuer = pd.read_sql_query(
sql=fisd_issuer_query,
con=wrds,
dtype={"issuer_id": int, "sic_code": str, "country_domicile": str}
)
fisd = (fisd
.merge(fisd_issuer, how="inner", on="issuer_id")
.query("country_domicile == 'USA'")
.drop(columns="country_domicile")
)
Finally, we save the bond characteristics to our local database. This selection of bonds also
constitutes the sample for which we will collect trade reports from TRACE below.
(fisd
.to_sql(name="fisd",
con=tidy_finance,
if_exists="replace",
index=False)
)
The FISD database also contains other data. The issue-based file contains information on
covenants, i.e., restrictions included in bond indentures to limit specific actions by firms
(e.g., Handler et al., 2021). The FISD redemption database also contains information on
callable bonds. Moreover, FISD also provides information on bond ratings. We do not need
either here.
5.3 TRACE
The Financial Industry Regulatory Authority (FINRA) provides the Trade Reporting and
Compliance Engine (TRACE). In TRACE, dealers that trade corporate bonds must report
such trades individually. Hence, we observe trade messages in TRACE that contain informa-
tion on the bond traded, the trade time, price, and volume. TRACE comes in two variants:
standard and enhanced TRACE. We show how to download and clean enhanced TRACE
as it contains uncapped volume, a crucial quantity missing in the standard distribution.
Moreover, enhanced TRACE also provides information on the respective parties’ roles and
the direction of the trade report. These items become essential in cleaning the messages.
Why do we repeatedly talk about cleaning TRACE? Trade messages are submitted within
a short time window after a trade is executed (less than 15 minutes). These messages can
contain errors, and the reporters subsequently correct them or they cancel a trade altogether.
The cleaning needs are described by Dick-Nielsen (2009) in detail, and Dick-Nielsen (2014)
shows how to clean the enhanced TRACE data using SAS. We do not go into the cleaning
steps in detail, since the code is lengthy and not our focus. However, downloading and
cleaning enhanced TRACE data is straightforward with our setup.
We store code for cleaning enhanced TRACE with Python on the following GitHub Gist.6
Appendix D also contains the code for reference. We only need to source the code from the
Gist, which we can do with the code below using httpimport. In the chunk, we interpret the
Gist as a module and explicitly load the necessary function (alternatively, you could import
it as a module and precede the function calls with the module's name). You can also go to
the Gist, download it, and execute it manually. The clean_enhanced_trace() function takes
a list of CUSIPs, a connection to WRDS as explained in Chapter 4, and a start and end
date, respectively.
gist_url = (
"https://gist.githubusercontent.com/patrick-weiss/"
"86ddef6de978fbdfb22609a7840b5d8b/raw/"
"8fbcc6c6f40f537cd3cd37368be4487d73569c6b/"
)
with httpimport.remote_repo(gist_url):
from clean_enhanced_TRACE_python import clean_enhanced_trace
6 https://gist.githubusercontent.com/patrick-weiss/86ddef6de978fbdfb22609a7840b5d8b

The TRACE database is very large. Therefore, we only download subsets of the data at once.
Specifying too many CUSIPs over a long time horizon will result in very long download
times and a potential failure due to the size of the request to WRDS. The size limit depends
on many parameters, and we cannot give you a guideline here. For the applications in this
book, we need data around the Paris Agreement in December 2015 and download the data
in sets of 1,000 bonds, which we define below.
cusips = list(fisd["complete_cusip"].unique())
batch_size = 1000
batches = np.ceil(len(cusips)/batch_size).astype(int)
Finally, we run a loop in the same style as in Chapter 4 where we download daily returns
from CRSP. For each of the CUSIP sets defined above, we call the cleaning function and
save the resulting output. We add new data to the existing table for batch two and all
following batches.
for j in range(1, batches+1):

    cusip_batch = cusips[
        ((j-1)*batch_size):(min(j*batch_size, len(cusips)))
    ]

    cusip_batch_formatted = (
        ", ".join(f"'{cusip}'" for cusip in cusip_batch)
    )
    cusip_string = f"({cusip_batch_formatted})"

    trace_enhanced_sub = clean_enhanced_trace(
        cusips=cusip_string,
        connection=wrds,
        start_date="'01/01/2014'",
        end_date="'11/30/2016'"
    )

    if not trace_enhanced_sub.empty:

        if j == 1:
            if_exists_string = "replace"
        else:
            if_exists_string = "append"

        trace_enhanced_sub.to_sql(
            name="trace_enhanced",
            con=tidy_finance,
            if_exists=if_exists_string,
            index=False
        )
5.4 Insights into Corporate Bonds
The TRACE database contains trade-level information, which may be new to students.
Therefore, we provide some insights by showing a few summary statistics.
We start by looking into the number of bonds outstanding over time and compare it to the
number of bonds traded in our sample. First, we compute the number of bonds outstanding
for each quarter around the Paris Agreement from 2014 to 2016.
dates = pd.date_range(start="2014-01-01", end="2016-11-30", freq="Q")
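The computation of the outstanding counts is not shown in the excerpt. A minimal sketch
based on the fisd characteristics, assuming a bond is outstanding between its offering_date
and its maturity, and labeling the series to match the traded counts below:

bonds_outstanding = (pd.DataFrame({"date": dates})
  .merge(fisd.get(["complete_cusip", "offering_date", "maturity"]),
         how="cross")
  .query("offering_date <= date & maturity >= date")
  .groupby("date")
  .aggregate(count=("complete_cusip", "nunique"))
  .reset_index()
  .assign(type="Outstanding")
)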
Next, we look at the bonds traded each quarter in the same period. Notice that we load
the complete trace table from our database, as we only have a single part of it in the
environment from the download loop above.
trace_enhanced = pd.read_sql_query(
sql=("SELECT cusip_id, trd_exctn_dt, rptd_pr, entrd_vol_qt, yld_pt "
"FROM trace_enhanced"),
con=tidy_finance,
parse_dates={"trd_exctn_dt"}
)
bonds_traded = (trace_enhanced
.assign(
date=lambda x: (
(x["trd_exctn_dt"]-pd.offsets.MonthBegin(1))
.dt.to_period("Q").dt.start_time
)
)
.groupby("date")
.aggregate(count=("cusip_id", "nunique"))
.reset_index()
.assign(type="Traded")
)
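The figure below plots a bonds_combined table. A minimal sketch that stacks the outstanding
and traded counts, assuming the bonds_outstanding sketch from above:

bonds_combined = pd.concat([bonds_outstanding, bonds_traded], ignore_index=True)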
bonds_figure = (
ggplot(bonds_combined,
aes(x="date", y="count", color="type", linetype="type")) +
geom_line() +
labs(x="", y="", color="", linetype="",
title="Number of bonds outstanding and traded each quarter") +
scale_x_datetime(breaks=date_breaks("1 year"), labels=date_format("%Y")) +
scale_y_continuous(labels=comma_format())
)
bonds_figure.draw()
Figure 5.1: Number of corporate bonds outstanding each quarter as reported by Mergent
FISD and the number of traded bonds from enhanced TRACE between 2014 and end of
2016.
We see that the number of bonds outstanding increases steadily between 2014 and 2016.
During our sample period of trade data, we see that the fraction of bonds trading each
quarter is roughly 60 percent. The relatively small number of traded bonds means that
many bonds do not trade through an entire quarter. This lack of trading activity illustrates
the generally low level of liquidity in the corporate bond market, where it can be hard
to trade specific bonds. Does this lack of liquidity mean that corporate bond markets are
irrelevant in terms of their size? With over 7,500 traded bonds each quarter, it is hard to
say that the market is small. However, let us also investigate the characteristics of issued
corporate bonds. In particular, we consider maturity (in years), coupon, and offering amount
(in million USD).
average_characteristics = (fisd
.assign(
maturity=lambda x: (x["maturity"] - x["offering_date"]).dt.days/365,
offering_amt=lambda x: x["offering_amt"]/10**3
)
.melt(var_name="measure",
value_vars=["maturity", "coupon", "offering_amt"],
value_name="value")
.dropna()
.groupby("measure")["value"]
.describe(percentiles=[.05, .50, .95])
.drop(columns="count")
)
average_characteristics.round(2)
We see that the average bond in our sample period has an offering amount of over 357
million USD with a median of 200 million USD, which both cannot be considered small.
The average bond has a maturity of ten years and pays around 6 percent in coupons.
Finally, let us compute some summary statistics for the trades in this market. To this end,
we show summary statistics based on daily aggregated information. In particular, we consider
the trade size (in million USD) and the number of trades.
average_trade_size = (trace_enhanced
.groupby("trd_exctn_dt")
.aggregate(
trade_size=("entrd_vol_qt", lambda x: (
sum(x*trace_enhanced.loc[x.index, "rptd_pr"]/100)/10**6)
),
trade_number=("trd_exctn_dt", "size")
)
.reset_index()
.melt(id_vars=["trd_exctn_dt"], var_name="measure",
value_vars=["trade_size", "trade_number"], value_name="value")
.groupby("measure")["value"]
.describe(percentiles=[.05, .50, .95])
.drop(columns="count")
)
average_trade_size.round(0)
On average, nearly 26 billion USD of corporate bonds are traded daily in nearly 13,000
transactions. We can, hence, conclude that the corporate bond market is indeed significant
in terms of trading volume and activity.
5.5 Exercises
1. Compute the amount outstanding across all bonds over time. Make sure to sub-
tract all matured bonds. How would you describe the resulting plot?
2. Compute the number of days each bond is traded (accounting for the bonds’
maturities and issuances). Start by looking at the number of bonds traded each
day in a graph similar to the one above. How many bonds trade on more than 75
percent of trading days?
3. WRDS provides more information from Mergent FISD such as ratings in the table
fisd_ratings. Download the ratings table and plot the distribution of ratings
for the different rating providers. How would you map the different providers to
a common numeric rating scale?
6
Other Data Providers
In the previous chapters, we introduced many ways to get financial data that researchers
regularly use. We showed how to load data into Python from Yahoo!Finance (using the
yfinance package) and from Kenneth French’s data library1 (using the pandas-datareader
package). We also presented commonly used file types, such as comma-separated or Excel
files. Then, we introduced remotely connecting to WRDS and downloading data from there.
However, this is only a subset of the vast amounts of data available online these days.
In this chapter, we provide an overview of common alternative data providers for which
direct access via Python packages exists. Such a list requires constant adjustments because
both data providers and access methods change. However, we want to emphasize two main
insights. First, the number of Python packages that provide access to (financial) data is
large; in fact, it is too large to survey exhaustively here. Instead, we can only cover the tip of
the iceberg. Second, Python provides the functionalities to access any form of files or data
available online. Thus, even if a desired data source does not come with a well-established
Python package, chances are high that data can be retrieved by establishing your own API
connection (using the Python requests package) or by scraping the content.
In our non-exhaustive list below, we restrict ourselves to listing data sources accessed
through easy-to-use Python packages. For further inspiration on potential data sources,
we recommend reading the Awesome Quant curated list of insanely awesome libraries, pack-
ages, and resources for Quants (Quantitative Finance).2 In fact, the pandas-datareader
package provides comprehensive access to a lot of databases, including some of those listed
below.
Also, the requests library in Python provides a versatile and direct way to interact with
APIs (Application Programming Interfaces) offered by various financial data providers. The
package simplifies the process of making HTTP requests,3 handling authentication, and
parsing the received data.
Apart from the list below, we want to advertise some amazing data compiled by others. First,
there is Open Source Asset Pricing4 related to Chen and Zimmermann (2022). They provide
return data for over 200 trading strategies with different time periods and specifications.
The authors also provide signals and explanations of the factor construction. Moreover, in
the same spirit, Global factor data5 provides the data related to Jensen et al. (2022). They provide
return data for characteristic-managed portfolios from around the world. The database
includes factors for 153 characteristics in 13 themes, using data from 93 countries. Finally,
we want to mention the TAQ data providing trades and quotes data for NYSE, Nasdaq,
and regional exchanges, which is available via WRDS.
1 https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html
2 https://github.com/wilsonfreitas/awesome-quant
3 https://requests.readthedocs.io/
4 https://www.openassetpricing.com/data
5 https://www.jkpfactors.com
6.1 Exercises
1. Select one of the data sources in the table above and retrieve some data. Browse
the homepage of the data provider or the package documentation to find inspira-
tion on which type of data is available to you and how to download the data into
your Python session.
2. Generate summary statistics of the data you retrieved and provide some use-
ful visualization. The possibilities are endless: Maybe there is some interesting
economic event you want to analyze, such as stock market responses to Twitter
activity.
Part III
Asset Pricing
7
Beta Estimation
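This chapter's introductory discussion and package imports are not included in the excerpt.
The code below relies on at least the following packages; the import list is an assumption
reconstructed from the functions used in this chapter, and prettify_result is assumed to come
from the regtabletotext helper package.

import pandas as pd
import numpy as np
import sqlite3
import statsmodels.api as sm

from itertools import product
from statsmodels.regression.rolling import RollingOLS
from joblib import Parallel, delayed, cpu_count
from plotnine import *
from mizani.breaks import date_breaks
from mizani.formatters import date_format, percent_format
from regtabletotext import prettify_result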
tidy_finance = sqlite3.connect(database="data/tidy_finance_python.sqlite")
crsp_monthly = (pd.read_sql_query(
sql="SELECT permno, month, industry, ret_excess FROM crsp_monthly",
con=tidy_finance,
parse_dates={"month"})
.dropna()
)
factors_ff3_monthly = pd.read_sql_query(
sql="SELECT month, mkt_excess FROM factors_ff3_monthly",
con=tidy_finance,
parse_dates={"month"}
)
crsp_monthly = (crsp_monthly
.merge(factors_ff3_monthly, how="left", on="month")
)
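The regression output shown next comes from a CAPM estimation for a single example stock
whose code is not part of the excerpt. A minimal sketch of such a call, assuming the permno
filter (14593) purely for illustration:

# Estimate the CAPM for one example stock and print a compact summary.
model_beta = (sm.OLS.from_formula(
  formula="ret_excess ~ mkt_excess",
  data=crsp_monthly.query("permno == 14593"))
  .fit()
)
prettify_result(model_beta)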
OLS Model:
ret_excess ~ mkt_excess
Coefficients:
Estimate Std. Error t-Statistic p-Value
Intercept 0.010 0.005 2.004 0.046
mkt_excess 1.389 0.111 12.467 0.000
Summary statistics:
- Number of observations: 504
- R-squared: 0.236, Adjusted R-squared: 0.235
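The rolling-window parameters used below are not defined in the excerpt. The figure captions
describe five years of monthly data, so a plausible parameterization is the following sketch
(the minimum-observation threshold is an assumption):

# Five years of monthly observations; require most of the window to be present.
window_size = 60
min_obs = 48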
valid_permnos = (crsp_monthly
.dropna()
.groupby("permno")["permno"]
.count()
.reset_index(name="counts")
.query(f"counts > {window_size}+1")
)
Before we proceed with the estimation, one important issue is worth emphasizing:
RollingOLS returns the estimated parameters of a linear regression by incorporating a
window of the last window_size rows. Whenever monthly returns are implicitly missing
(which means there is simply no entry recorded, e.g., because a company was delisted and
only traded publicly again later), such a fixed window size may cause outdated observations
to influence the estimation results. We thus recommend making such implicit missing rows
explicit.
We hence collect information about the first and last listing date of each permno.
permno_information = (crsp_monthly
.merge(valid_permnos, how="inner", on="permno")
.groupby(["permno"])
.aggregate(first_month=("month", "min"),
last_month=("month", "max"))
.reset_index()
)
To complete the missing observations in the CRSP sample, we obtain all possible permno-
month combinations.
70 Beta Estimation
unique_permno = crsp_monthly["permno"].unique()
unique_month = factors_ff3_monthly["month"].unique()
all_combinations = pd.DataFrame(
product(unique_permno, unique_month),
columns=["permno", "month"]
)
Finally, we expand the CRSP sample and include a row (with missing excess returns) for
each possible permno-month observation that falls within the start and end date where the
respective permno has been publicly listed.
returns_monthly = (all_combinations
.merge(crsp_monthly.get(["permno", "month", "ret_excess"]),
how="left", on=["permno", "month"])
.merge(permno_information, how="left", on="permno")
.query("(month >= first_month) & (month <= last_month)")
.drop(columns=["first_month", "last_month"])
.merge(crsp_monthly.get(["permno", "month", "industry"]),
how="left", on=["permno", "month"])
.merge(factors_ff3_monthly, how="left", on="month")
)
The following function implements the CAPM regression for a dataframe (or a part thereof)
containing at least min_obs observations to avoid huge fluctuations if the time series is too
short. If the condition is violated (i.e., the time series is too short) the function returns a
missing value.
def roll_capm_estimation(data, window_size, min_obs):
    """Calculate rolling CAPM estimation."""
    data = data.sort_values("month")

    result = (RollingOLS.from_formula(
        formula="ret_excess ~ mkt_excess",
        data=data,
        window=window_size,
        min_nobs=min_obs,
        missing="drop")
      .fit()
      .params.get("mkt_excess")
    )
    result.index = data.index

    return result
Before we approach the whole CRSP sample, let us focus on a couple of examples for
well-known firms.
examples = pd.DataFrame({
  "permno": [14593, 10107, 93436, 17778],
  "company": ["Apple", "Microsoft", "Tesla", "Berkshire Hathaway"]
})
It is actually quite simple to perform the rolling-window estimation for an arbitrary number
of stocks, which we visualize in the following code chunk and the resulting Figure 7.1.
beta_example = (returns_monthly
.merge(examples, how="inner", on="permno")
.groupby(["permno"])
.apply(lambda x: x.assign(
beta=roll_capm_estimation(x, window_size, min_obs))
)
.reset_index(drop=True)
.dropna()
)
plot_beta = (
ggplot(beta_example,
aes(x="month", y="beta", color="company", linetype="company")) +
geom_line() +
labs(x="", y="", color="", linetype="",
title="Monthly beta estimates for example stocks") +
scale_x_datetime(breaks=date_breaks("5 year"), labels=date_format("%Y"))
)
plot_beta.draw()
Figure 7.1: The figure shows monthly beta estimates for example stocks using five years of
data. The CAPM betas are estimated with monthly data and a rolling window of length
five years based on adjusted excess returns from CRSP. We use market excess returns from
Kenneth French data library.
To estimate betas for the whole CRSP sample, we parallelize the rolling-window estimation
across stocks. The following helper function performs the estimation for a single permno and
works for both monthly and daily data.

def roll_capm_estimation_for_joblib(permno, group):
    """Calculate rolling CAPM estimation using joblib."""
    if "date" in group.columns:
        group = group.sort_values(by="date")
    else:
        group = group.sort_values(by="month")

    beta_values = (RollingOLS.from_formula(
        formula="ret_excess ~ mkt_excess",
        data=group,
        window=window_size,
        min_nobs=min_obs,
        missing="drop"
      )
      .fit()
      .params.get("mkt_excess")
    )

    result = pd.DataFrame(beta_values)
    result.columns = ["beta"]
    result["month"] = group["month"].values
    result["permno"] = permno

    try:
        result["date"] = group["date"].values
        result = result[
            (result.groupby("month")["date"].transform("max")) == result["date"]
        ]
    except KeyError:
        pass

    return result
permno_groups = (returns_monthly
.merge(valid_permnos, how="inner", on="permno")
.groupby("permno", group_keys=False)
)
n_cores = cpu_count()-1
beta_monthly = (
pd.concat(
Parallel(n_jobs=n_cores)
(delayed(roll_capm_estimation_for_joblib)(name, group)
for name, group in permno_groups)
)
.dropna()
.rename(columns={"beta": "beta_monthly"})
)
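The daily beta estimation below additionally requires the daily Fama-French factors, which
are not loaded in the excerpt. A minimal sketch, assuming the factors_ff3_daily table created
earlier contains date and mkt_excess columns:

factors_ff3_daily = pd.read_sql_query(
  sql="SELECT date, mkt_excess FROM factors_ff3_daily",
  con=tidy_finance,
  parse_dates={"date"}
)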
unique_date = factors_ff3_daily["date"].unique()
For the daily data, we consider around three months of data (i.e., 60 trading days), require
at least 50 observations, and estimate betas in batches of 500.
window_size = 60
min_obs = 50
permnos = list(crsp_monthly["permno"].unique().astype(str))
batch_size = 500
batches = np.ceil(len(permnos)/batch_size).astype(int)
We then proceed to perform the same steps as with the monthly CRSP data, just in batches:
Load in daily returns, transform implicit missing returns to explicit ones, keep only valid
stocks with a minimum number of rows, and parallelize the beta estimation across stocks.
beta_daily = []

for j in range(1, batches+1):

    permno_batch = permnos[
        ((j-1)*batch_size):(min(j*batch_size, len(permnos)))
    ]

    permno_batch_formatted = (
        ", ".join(f"'{permno}'" for permno in permno_batch)
    )
    permno_string = f"({permno_batch_formatted})"

    crsp_daily_sub_query = (
        "SELECT permno, month, date, ret_excess "
        "FROM crsp_daily "
        f"WHERE permno IN {permno_string}"
    )

    crsp_daily_sub = pd.read_sql_query(
        sql=crsp_daily_sub_query,
        con=tidy_finance,
        dtype={"permno": int},
        parse_dates={"date", "month"}
    )

    valid_permnos = (crsp_daily_sub
      .groupby("permno")["permno"]
      .count()
      .reset_index(name="counts")
      .query(f"counts > {window_size}+1")
      .drop(columns="counts")
    )

    permno_information = (crsp_daily_sub
      .merge(valid_permnos, how="inner", on="permno")
      .groupby(["permno"])
      .aggregate(first_date=("date", "min"),
                 last_date=("date", "max"))
      .reset_index()
    )

    unique_permno = permno_information["permno"].unique()

    all_combinations = pd.DataFrame(
        product(unique_permno, unique_date),
        columns=["permno", "date"]
    )

    returns_daily = (crsp_daily_sub
      .merge(all_combinations, how="right", on=["permno", "date"])
      .merge(permno_information, how="left", on="permno")
      .query("(date >= first_date) & (date <= last_date)")
      .drop(columns=["first_date", "last_date"])
      .merge(factors_ff3_daily, how="left", on="date")
    )

    permno_groups = (returns_daily
      .groupby("permno", group_keys=False)
    )

    beta_daily_sub = (
        pd.concat(
            Parallel(n_jobs=n_cores)
            (delayed(roll_capm_estimation_for_joblib)(name, group)
            for name, group in permno_groups)
        )
        .dropna()
        .rename(columns={"beta": "beta_daily"})
    )

    beta_daily.append(beta_daily_sub)

beta_daily = pd.concat(beta_daily)
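The industry comparison below relies on a beta_industries table that is not constructed in
the excerpt. A minimal sketch that merges the monthly beta estimates with the industry
information in crsp_monthly (the column selection is an assumption based on the plotting
code):

beta_industries = (beta_monthly
  .merge(crsp_monthly, how="inner", on=["permno", "month"])
  .dropna(subset=["beta_monthly"])
  .get(["permno", "month", "industry", "beta_monthly"])
)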
industry_order = (beta_industries
.groupby("industry")["beta_monthly"]
.aggregate("median")
.sort_values()
.index.tolist()
)
plot_beta_industries = (
ggplot(beta_industries,
aes(x="industry", y="beta_monthly")) +
geom_boxplot() +
coord_flip() +
labs(x="", y="",
title="Firm-specific beta distributions by industry") +
scale_x_discrete(limits=industry_order)
)
plot_beta_industries.draw()
Figure 7.2: The box plots show the distribution of firm-specific beta estimates by industry.
Next, we illustrate the time-variation in the cross-section of estimated betas. Figure 7.3
shows the monthly deciles of estimated betas (based on monthly data) and indicates an
interesting pattern: First, betas seem to vary over time in the sense that during some
periods, there is a clear trend across all deciles. Second, the sample exhibits periods where
the dispersion across stocks increases in the sense that the lower decile decreases and the
upper decile increases, which indicates that for some stocks, the correlation with the market
increases, while for others it decreases. Note also here: stocks with negative betas are a rare
exception.
beta_quantiles = (beta_monthly
.groupby("month")["beta_monthly"]
.quantile(q=np.arange(0.1, 1.0, 0.1))
.reset_index()
.rename(columns={"level_1": "quantile"})
.assign(quantile=lambda x: (x["quantile"]*100).astype(int))
.dropna()
)
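The decile plot below cycles through a small set of line types via linetypes and n_quantiles,
which are not defined in the excerpt. A minimal sketch of these helpers (the specific line
types are an assumption):

linetypes = ["-", "--", "-.", ":"]
n_quantiles = beta_quantiles["quantile"].nunique()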
plot_beta_quantiles = (
ggplot(beta_quantiles,
aes(x="month", y="beta_monthly",
color="factor(quantile)", linetype="factor(quantile)")) +
geom_line() +
labs(x="", y="", color="", linetype="",
title="Monthly deciles of estimated betas") +
scale_x_datetime(breaks=date_breaks("5 year"), labels=date_format("%Y")) +
scale_linetype_manual(
values=[linetypes[l % len(linetypes)] for l in range(n_quantiles)]
)
)
plot_beta_quantiles.draw()
Figure 7.3: The figure shows monthly deciles of estimated betas. Each line corresponds to
the monthly cross-sectional quantile of the estimated CAPM beta.
To compare the difference between daily and monthly data, we combine the beta estimates
into a single table. Then, we use the table to plot a comparison of beta estimates for our
example stocks in Figure 7.4.
beta = (beta_monthly
.get(["permno", "month", "beta_monthly"])
.merge(beta_daily.get(["permno", "month", "beta_daily"]),
how="outer", on=["permno", "month"])
)
beta_comparison = (beta
  .merge(examples, on="permno")
  .melt(id_vars=["permno", "month", "company"], var_name="name",
        value_vars=["beta_monthly", "beta_daily"], value_name="value")
  .dropna()
)
plot_beta_comparison = (
ggplot(beta_comparison,
aes(x="month", y="value", color="name")) +
geom_line() +
facet_wrap("~company", ncol=1) +
labs(x="", y="", color="",
title="Comparison of beta estimates using monthly and daily data") +
scale_x_datetime(breaks=date_breaks("10 years"),
labels=date_format("%Y")) +
theme(figure_size=(6.4, 6.4))
)
plot_beta_comparison.draw()
The estimates in Figure 7.4 look as expected. As you can see, the beta estimates really
depend on the estimation window and data frequency. Nevertheless, one can observe a
clear connection between daily and monthly betas in this example, in magnitude and the
dynamics over time.
Finally, we write the estimates to our database so that we can use them in later chapters.
(beta.to_sql(
name="beta",
con=tidy_finance,
if_exists="replace",
index=False
)
)
Whenever you perform some kind of estimation, it also makes sense to do rough plausibility
tests. A possible check is to plot the share of stocks with beta estimates over time. This
descriptive analysis helps us discover potential errors in our data preparation or the esti-
mation procedure. For instance, suppose there was a gap in our output without any betas.
In this case, we would have to go back and check all previous steps to find out what went
wrong. Figure 7.5 does not indicate any troubles, so let us move on to the next check.
beta_long = (crsp_monthly
.merge(beta, how="left", on=["permno", "month"])
.melt(id_vars=["permno", "month"], var_name="name",
value_vars=["beta_monthly", "beta_daily"], value_name="value")
)
beta_shares = (beta_long
.groupby(["month", "name"])
.aggregate(share=("value", lambda x: sum(~x.isna())/len(x)))
.reset_index()
)
Figure 7.4: The figure shows the comparison of beta estimates using monthly and daily data.
CAPM betas are computed using five years of monthly or three months of daily data. The
two lines show the monthly estimates based on a rolling window for few exemplary stocks.
plot_beta_long = (
ggplot(beta_shares,
aes(x="month", y="share", color="name", linetype="name")) +
geom_line() +
labs(x="", y="", color="", linetype="",
title="End-of-month share of securities with beta estimates") +
scale_y_continuous(labels=percent_format()) +
scale_x_datetime(breaks=date_breaks("10 year"), labels=date_format("%Y"))
)
plot_beta_long.draw()
Figure 7.5: The figure shows end-of-month share of securities with beta estimates. The two
lines show the share of securities with beta estimates using five years of monthly or three
months of daily data.
We also encourage everyone to always look at the distributional summary statistics of vari-
ables. You can easily spot outliers or weird distributions when looking at such tables.
beta_long.groupby("name")["value"].describe().round(2)
The summary statistics also look plausible for the two estimation procedures.
Finally, since we have two different estimators for the same theoretical object, we expect
the estimators to be at least positively correlated (although not perfectly as the estimators
are based on different sample periods and frequencies).
beta.get(["beta_monthly", "beta_daily"]).corr().round(2)
beta_monthly beta_daily
beta_monthly 1.00 0.32
beta_daily 0.32 1.00
Indeed, we find a positive correlation between our beta estimates. In the subsequent chapters,
we mainly use the estimates based on monthly data, as most readers should be able to
replicate them and should not encounter potential memory limitations that might arise
with the daily data.
7.6 Exercises
1. Compute beta estimates based on monthly data using one, three, and five years
of data and impose a minimum number of observations of 10, 28, and 48 months
with return data, respectively. How strongly correlated are the estimated betas?
2. Compute beta estimates based on monthly data using five years of data and im-
pose different numbers of minimum observations. How does the share of permno-
month observations with successful beta estimates vary across the different re-
quirements? Do you find a high correlation across the estimated betas?
3. Instead of using joblib, perform the beta estimation in a loop (using either
monthly or daily data) for a subset of 100 permnos of your choice. Verify that
you get the same results as with the parallelized code from above.
4. Filter out the stocks with negative betas. Do these stocks frequently exhibit neg-
ative betas, or do they resemble estimation errors?
5. Compute beta estimates for multi-factor models such as the Fama-French three-
factor model. For that purpose, you extend your regression to

$$r_{i,t} - r_{f,t} = \alpha_i + \sum_{j=1}^{k} \beta_{i,j}\,\left(r_{j,t} - r_{f,t}\right) + \varepsilon_{i,t} \qquad (7.2)$$

where $r_{j,t}$ are the $k$ factor returns. Thus, you estimate four parameters ($\alpha_i$ and
the slope coefficients). Provide some summary statistics of the cross-section of
firms and their exposure to the different factors.
8
Univariate Portfolio Sorts
In this chapter, we dive into portfolio sorts, one of the most widely used statistical method-
ologies in empirical asset pricing (e.g., Bali et al., 2016). The key application of portfolio
sorts is to examine whether one or more variables can predict future excess returns. In
general, the idea is to sort individual stocks into portfolios, where the stocks within each
portfolio are similar with respect to a sorting variable, such as firm size. The different portfo-
lios then represent well-diversified investments that differ in the level of the sorting variable.
You can then attribute the differences in the return distribution to the impact of the sorting
variable. We start by introducing univariate portfolio sorts (which sort based on only one
characteristic) and tackle bivariate sorting in Chapter 10.
A univariate portfolio sort considers only one sorting variable $x_{i,t-1}$. Here, $i$ denotes the
stock and $t-1$ indicates that the characteristic is observable by investors at time $t$. The
objective is to assess the cross-sectional relation between $x_{i,t-1}$ and, typically, stock excess
returns $r_{i,t}$ at time $t$ as the outcome variable. To illustrate how portfolio sorts work, we use
estimates for market betas from the previous chapter as our sorting variable.
The current chapter relies on the following set of Python packages.
import pandas as pd
import numpy as np
import sqlite3
import statsmodels.api as sm
from plotnine import *
from mizani.formatters import percent_format
# prettify_result is assumed to come from the regtabletotext helper package.
from regtabletotext import prettify_result
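The queries below read from our local database. The connection is not shown in the excerpt,
so we open it as in the previous chapters:

tidy_finance = sqlite3.connect(database="data/tidy_finance_python.sqlite")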
crsp_monthly = (pd.read_sql_query(
sql="SELECT permno, month, ret_excess, mktcap_lag FROM crsp_monthly",
con=tidy_finance,
parse_dates={"month"})
)
factors_ff3_monthly = pd.read_sql_query(
sql="SELECT month, mkt_excess FROM factors_ff3_monthly",
con=tidy_finance,
parse_dates={"month"}
)
beta = (pd.read_sql_query(
sql="SELECT permno, month, beta_monthly FROM beta",
con=tidy_finance,
parse_dates={"month"})
)
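The merge below uses one-month lagged betas (beta_lag), whose construction is not shown in
the excerpt. A minimal sketch that lags the monthly estimates by one month, mirroring the
market-cap lag from Chapter 4 (column names chosen to match the merge):

beta_lag = (beta
  .assign(month=lambda x: x["month"]+pd.DateOffset(months=1))
  .get(["permno", "month", "beta_monthly"])
  .rename(columns={"beta_monthly": "beta_lag"})
  .dropna()
)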
data_for_sorts = (crsp_monthly
.merge(beta_lag, how="inner", on=["permno", "month"])
)
The first step to conduct portfolio sorts is to calculate periodic breakpoints that you can use
to group the stocks into portfolios. For simplicity, we start with the median lagged market
beta as the single breakpoint. We then compute the value-weighted returns for each of the
two resulting portfolios, which means that the lagged market capitalization determines the
weight in np.average().
beta_portfolios = (data_for_sorts
.groupby("month")
.apply(lambda x: (x.assign(
portfolio=pd.qcut(
x["beta_lag"], q=[0, 0.5, 1], labels=["low", "high"]))
)
)
.reset_index(drop=True)
.groupby(["portfolio","month"])
.apply(lambda x: np.average(x["ret_excess"], weights=x["mktcap_lag"]))
.reset_index(name="ret")
)
We compute the average return and the corresponding standard error to test whether the
long-short portfolio yields on average positive or negative excess returns. In the asset pricing
literature, one typically adjusts for autocorrelation by using Newey and West (1987)
t-statistics to test the null hypothesis that average portfolio excess returns are equal to zero.
One necessary input for Newey-West standard errors is a chosen bandwidth based on the
number of lags employed for the estimation. Researchers often default to choosing a pre-
specified lag length of six months (which is not a data-driven approach). We do so in the
fit() function by indicating the cov_type as HAC and providing the maximum lag length
through an additional keywords dictionary.
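The regression below uses a beta_longshort data frame that is not constructed in the excerpt.
A minimal sketch based on the median sort above and the analogous long-short construction
later in this chapter:

beta_longshort = (beta_portfolios
  .pivot_table(index="month", columns="portfolio", values="ret")
  .reset_index()
  .assign(long_short=lambda x: x["high"]-x["low"])
)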
model_fit = (sm.OLS.from_formula(
formula="long_short ~ 1",
data=beta_longshort
)
.fit(cov_type="HAC", cov_kwds={"maxlags": 6})
)
prettify_result(model_fit)
OLS Model:
long_short ~ 1
Coefficients:
Estimate Std. Error t-Statistic p-Value
Intercept -0.0 0.001 -0.008 0.994
Summary statistics:
- Number of observations: 695
- R-squared: -0.000, Adjusted R-squared: -0.000
We now generalize the sorting into a function that assigns stocks to an arbitrary number of
portfolios based on quantile breakpoints of the sorting variable.

def assign_portfolio(data, sorting_variable, n_portfolios):
    """Assign stocks into portfolios based on quantile breakpoints."""
    breakpoints = np.quantile(
        data[sorting_variable].dropna(),
        np.linspace(0, 1, n_portfolios+1),
        method="linear"
    )

    assigned_portfolios = pd.cut(
        data[sorting_variable],
        bins=breakpoints,
        labels=range(1, breakpoints.size),
        include_lowest=True,
        right=False
    )

    return assigned_portfolios
We can use the above function to sort stocks into ten portfolios each month using lagged
betas and compute value-weighted returns for each portfolio. Note that we transform the
portfolio column to a factor variable because it provides more convenience for the figure
construction below.
beta_portfolios = (data_for_sorts
.groupby("month")
.apply(lambda x: x.assign(
portfolio=assign_portfolio(x, "beta_lag", 10)
)
)
.reset_index(drop=True)
.groupby(["portfolio", "month"])
.apply(lambda x: x.assign(
ret=np.average(x["ret_excess"], weights=x["mktcap_lag"])
)
)
.reset_index(drop=True)
.merge(factors_ff3_monthly, how="left", on="month")
)
Figure 8.1 illustrates the CAPM alphas of beta-sorted portfolios. It shows that low-beta
portfolios tend to exhibit positive alphas, while high-beta portfolios exhibit negative alphas.
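Figure 8.1 and the subsequent regressions rely on a beta_portfolios_summary table, whose
construction is not part of the excerpt. A minimal sketch that computes the CAPM alpha
and beta as well as the average excess return for each decile portfolio (column names are
chosen to match the plotting code):

def capm_summary(data):
    """Sketch: CAPM alpha and beta plus mean excess return for one portfolio."""
    fit = sm.OLS.from_formula("ret ~ 1 + mkt_excess", data=data).fit()
    return pd.Series({
        "alpha": fit.params["Intercept"],
        "beta": fit.params["mkt_excess"],
        "ret": data["ret"].mean()
    })

beta_portfolios_summary = (beta_portfolios
  .drop_duplicates(subset=["portfolio", "month"])
  .groupby("portfolio")
  .apply(capm_summary)
  .reset_index()
)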
plot_beta_portfolios_summary = (
  ggplot(beta_portfolios_summary,
         aes(x="portfolio", y="alpha", fill="portfolio")) +
  geom_bar(stat="identity") +
  labs(x="Portfolio", y="CAPM alpha", fill="Portfolio",
       title="CAPM alphas of beta-sorted portfolios") +
  theme(legend_position="none")
)
plot_beta_portfolios_summary.draw()

Figure 8.1: The figure shows CAPM alphas of beta-sorted portfolios. Portfolios are sorted
into deciles each month based on their estimated CAPM beta. The bar charts indicate the
CAPM alpha of the resulting portfolio returns during the entire CRSP period.
These results suggest a negative relation between beta and future stock returns, which con-
tradicts the predictions of the CAPM. According to the CAPM, returns should increase with
beta across the portfolios and risk-adjusted returns should be statistically indistinguishable
from zero.
sml_capm = (sm.OLS.from_formula(
formula="ret ~ 1 + beta",
data=beta_portfolios_summary
)
.fit()
.params
)
plot_sml_capm = (
ggplot(beta_portfolios_summary,
aes(x="beta", y="ret", color="portfolio")) +
geom_point() +
geom_abline(intercept=0,
slope=factors_ff3_monthly["mkt_excess"].mean(),
linetype="solid") +
geom_abline(intercept=sml_capm["Intercept"],
slope=sml_capm["beta"],
linetype="dashed") +
labs(x="Beta", y="Excess return", color="Portfolio",
title="Average portfolio excess returns and beta estimates") +
scale_x_continuous(limits=(0, 2)) +
scale_y_continuous(labels=percent_format(),
limits=(0, factors_ff3_monthly["mkt_excess"].mean()*2))
)
plot_sml_capm.draw()
To provide more evidence against the CAPM predictions, we again form a long-short strat-
egy that buys the high-beta portfolio and shorts the low-beta portfolio.
beta_longshort = (beta_portfolios
.assign(
portfolio=lambda x: (
x["portfolio"].apply(
lambda y: "high" if y == x["portfolio"].max()
else ("low" if y == x["portfolio"].min()
else y)
)
)
)
.query("portfolio in ['low', 'high']")
.pivot_table(index="month", columns="portfolio", values="ret")
.assign(long_short=lambda x: x["high"]-x["low"])
.merge(factors_ff3_monthly, how="left", on="month")
)
Again, the resulting long-short strategy does not exhibit statistically significant returns.
Figure 8.2: The figure shows average portfolio excess returns and beta estimates. Excess
returns are computed as CAPM alphas of the beta-sorted portfolios. The horizontal axis
indicates the CAPM beta of the resulting beta-sorted portfolio return time series. The
dashed line indicates the slope coefficient of a linear regression of excess returns on portfolio
betas.

model_fit = (sm.OLS.from_formula(
  formula="long_short ~ 1",
  data=beta_longshort
  )
  .fit(cov_type="HAC", cov_kwds={"maxlags": 1})
)
prettify_result(model_fit)
OLS Model:
long_short ~ 1
Coefficients:
Estimate Std. Error t-Statistic p-Value
Intercept 0.002 0.003 0.549 0.583
Summary statistics:
- Number of observations: 695
- R-squared: 0.000, Adjusted R-squared: 0.000
- F-statistic not available
However, controlling for the effect of beta, the long-short portfolio yields a statistically sig-
nificant negative CAPM-adjusted alpha, although the average excess stock returns should
be zero according to the CAPM. The results thus provide no evidence in support of the
CAPM. The negative value has been documented as the so-called betting-against-beta factor
(Frazzini and Pedersen, 2014). Betting-against-beta corresponds to a strategy that shorts
high-beta stocks and takes a (levered) long position in low-beta stocks. If borrowing con-
straints prevent investors from taking positions on the security market line, they are instead incentivized to buy high-beta stocks, which leads to a relatively higher price (and therefore lower expected returns than implied by the CAPM) for such high-beta stocks. As a result, the betting-against-beta strategy earns a premium from providing liquidity to capital-constrained investors with lower risk aversion.
model_fit = (sm.OLS.from_formula(
formula="long_short ~ 1 + mkt_excess",
data=beta_longshort
)
.fit(cov_type="HAC", cov_kwds={"maxlags": 1})
)
prettify_result(model_fit)
OLS Model:
long_short ~ 1 + mkt_excess
Coefficients:
Estimate Std. Error t-Statistic p-Value
Intercept -0.004 0.002 -1.931 0.053
mkt_excess 1.129 0.069 16.392 0.000
Summary statistics:
- Number of observations: 695
- R-squared: 0.439, Adjusted R-squared: 0.438
- F-statistic: 268.700 on 1 and 693 DF, p-value: 0.000
Figure 8.3 shows the annual returns of the extreme beta portfolios we are mainly interested in. The figure shows no consistently striking pattern across the years; each portfolio exhibits periods with positive and negative annual returns.
beta_longshort_year = (beta_longshort
.assign(year=lambda x: x["month"].dt.year)
.groupby("year")
.aggregate(
low=("low", lambda x: 1-(1+x).prod()),
high=("high", lambda x: 1-(1+x).prod()),
long_short=("long_short", lambda x: 1-(1+x).prod())
)
.reset_index()
.melt(id_vars="year", var_name="name", value_name="value")
)
plot_beta_longshort_year = (
ggplot(beta_longshort_year,
aes(x="year", y="value", fill="name")) +
geom_col(position='dodge') +
facet_wrap("~name", ncol=1) +
labs(x="", y="", title="Annual returns of beta portfolios") +
scale_color_discrete(guide=False) +
scale_y_continuous(labels=percent_format()) +
theme(legend_position="none")
)
plot_beta_longshort_year.draw()
Figure 8.3: The figure shows annual returns of beta portfolios. We construct portfolios by sorting stocks into high and low based on their estimated CAPM beta. Long-short indicates a strategy that goes long high-beta stocks and short low-beta stocks.
Overall, this chapter shows how functional programming can be leveraged to form an arbi-
trary number of portfolios using any sorting variable and how to evaluate the performance
of the resulting portfolios. In the next chapter, we dive deeper into the many degrees of
freedom that arise in the context of portfolio analysis.
8.7 Exercises
1. Take the two long-short beta strategies based on different numbers of portfolios
and compare the returns. Is there a significant difference in returns? How do
the Sharpe ratios compare between the strategies? Find one additional portfolio
evaluation statistic and compute it.
2. We plotted the alphas of the ten beta portfolios above. Write a function that tests
these estimates for significance. Which portfolios have significant alphas?
3. The analysis here is based on betas from monthly returns. However, we also
computed betas from daily returns. Re-run the analysis and point out differences
in the results.
4. Given the results in this chapter, can you define a long-short strategy that yields
positive abnormal returns (i.e., alphas)? Plot the cumulative excess return of your
strategy and the market excess return for comparison.
9
Size Sorts and p-Hacking
In this chapter, we continue with portfolio sorts in a univariate setting. Yet, we consider firm
size as a sorting variable, which gives rise to a well-known return factor: the size premium.
The size premium arises from buying small stocks and selling large stocks. Prominently,
Fama and French (1993) include it as a factor in their three-factor model. Apart from that,
asset managers commonly include size as a key firm characteristic when making investment
decisions.
We also introduce new choices in the formation of portfolios. In particular, we discuss listing
exchanges, industries, weighting regimes, and periods. These choices matter for the portfolio
returns and result in different size premiums (see Hasler, 2021; Soebhag et al., 2022; and Walter et al., 2022, for more insights into decision nodes and their effect on premiums).
Exploiting these ideas to generate favorable results is called p-hacking. There is arguably
a thin line between p-hacking and conducting robustness tests. Our purpose here is to
illustrate the substantial variation that can arise along the evidence-generating process.
The chapter relies on the following set of Python packages:
import pandas as pd
import numpy as np
import sqlite3

from plotnine import *
from mizani.formatters import percent_format
from itertools import product
from joblib import Parallel, delayed, cpu_count

tidy_finance = sqlite3.connect(
  database="data/tidy_finance_python.sqlite"
)
crsp_monthly = pd.read_sql_query(
sql="SELECT * FROM crsp_monthly",
con=tidy_finance,
parse_dates={"month"}
)
factors_ff3_monthly = pd.read_sql_query(
sql="SELECT * FROM factors_ff3_monthly",
con=tidy_finance,
parse_dates={"month"}
)
plot_market_cap_concentration = (
ggplot(market_cap_concentration,
aes(x="month", y="value",
color="name", linetype="name")) +
geom_line() +
scale_y_continuous(labels=percent_format()) +
scale_x_date(name="", date_labels="%Y") +
labs(x="", y="", color="", linetype="",
title=("Percentage of total market capitalization in "
"largest stocks")) +
theme(legend_title=element_blank())
)
plot_market_cap_concentration.draw()
Figure 9.1: The figure shows the percentage of total market capitalization accounted for by the largest stocks. We report the aggregate market capitalization of all stocks that belong to the largest 1, 5, 10, and 25 percent of firms in the monthly cross-section, relative to the market capitalization of all stocks in that month.
Next, firm sizes also differ across listing exchanges. The primary listings of stocks were
important in the past and are potentially still relevant today. Figure 9.2 shows that the
New York Stock Exchange (NYSE) was and still is the largest listing exchange in terms of
market capitalization. More recently, NASDAQ has gained relevance as a listing exchange.
Do you know what the small peak in NASDAQ’s market cap around the year 2000 was?
market_cap_share = (crsp_monthly
.groupby(["month", "exchange"])
.aggregate({"mktcap": "sum"})
.reset_index(drop=False)
.groupby("month")
.apply(lambda x:
x.assign(total_market_cap=lambda x: x["mktcap"].sum(),
share=lambda x: x["mktcap"]/x["total_market_cap"]
)
)
.reset_index(drop=True)
)
plot_market_cap_share = (
ggplot(market_cap_share,
aes(x="month", y="share",
fill="exchange", color="exchange")) +
geom_area(position="stack", stat="identity", alpha=0.5) +
geom_line(position="stack") +
scale_y_continuous(labels=percent_format()) +
scale_x_date(name="", date_labels="%Y") +
labs(x="", y="", fill="", color="",
title="Share of total market capitalization per listing exchange") +
theme(legend_title=element_blank())
)
plot_market_cap_share.draw()
Figure 9.2: The figure shows the share of total market capitalization per listing exchange.
Years are on the horizontal axis and the corresponding share of total market capitalization
per listing exchange on the vertical axis.
Finally, we consider the distribution of firm size across listing exchanges and create summary
statistics. The function describe() does not include all statistics we are interested in, which
is why we create the function compute_summary() that adds the standard deviation and
the number of observations. Then, we apply it to the most current month of our CRSP data
on each listing exchange. We also add a row with the overall summary statistics.
The resulting table shows that firms listed on NYSE in December 2021 are significantly
larger on average than firms listed on the other exchanges. Moreover, NASDAQ lists the
largest number of firms. This discrepancy between firm sizes across listing exchanges mo-
tivated researchers to form breakpoints exclusively on the NYSE sample and apply those
breakpoints to all stocks. In the following, we use this distinction to update our portfolio
sort procedure.
def compute_summary(data, variable, filter_variable, percentiles):
  """Compute summary statistics for a given variable and percentiles."""
  summary = (data
    .get([filter_variable, variable])
    .groupby(filter_variable)
    .describe(percentiles=percentiles)
  )
  summary.columns = summary.columns.droplevel(0)
  summary_overall = (data
    .get(variable)
    .describe(percentiles=percentiles)
  )
  summary.loc["Overall", :] = summary_overall
  return summary.round(0)
compute_summary(
crsp_monthly[crsp_monthly["month"] == crsp_monthly["month"].max()],
variable="mktcap",
filter_variable="exchange",
percentiles=[0.05, 0.5, 0.95]
)
def assign_portfolio(data, exchanges, sorting_variable, n_portfolios):
  """Assign portfolio for a given sorting variable."""
  breakpoints = (data
    .query(f"exchange in {exchanges}")
    .get(sorting_variable)
    .quantile(np.linspace(0, 1, num=n_portfolios+1),
              interpolation="linear")
    .drop_duplicates()
  )
  breakpoints.iloc[[0, -1]] = [-np.Inf, np.Inf]
  assigned_portfolios = pd.cut(
    data[sorting_variable],
    bins=breakpoints,
    labels=range(1, breakpoints.size),
    include_lowest=True,
    right=False
  )
  return assigned_portfolios
def calculate_returns(data, value_weighted):
  """Calculate (value-weighted) returns."""
  if value_weighted:
    return np.average(data["ret_excess"], weights=data["mktcap_lag"])
  else:
    return data["ret_excess"].mean()
def compute_portfolio_returns(n_portfolios=10,
                              exchanges=["NYSE", "NASDAQ", "AMEX"],
                              value_weighted=True,
                              data=crsp_monthly):
  """Compute (value-weighted) portfolio returns."""
  returns = (data
    .groupby("month")
    .apply(lambda x: x.assign(
      portfolio=assign_portfolio(x, exchanges,
                                 "mktcap_lag", n_portfolios))
    )
    .reset_index(drop=True)
    .groupby(["portfolio", "month"])
    .apply(lambda x: x.assign(
      ret=calculate_returns(x, value_weighted))
    )
    .reset_index(drop=True)
    .groupby("month")
    .apply(lambda x:
      pd.Series({"size_premium": x.loc[x["portfolio"].idxmin(), "ret"]-
                 x.loc[x["portfolio"].idxmax(), "ret"]}))
    .reset_index(drop=True)
    .aggregate({"size_premium": "mean"})
  )
  return returns
ret_all = compute_portfolio_returns(
  n_portfolios=2,
  exchanges=["NYSE", "NASDAQ", "AMEX"],
  value_weighted=True,
  data=crsp_monthly
)
ret_nyse = compute_portfolio_returns(
n_portfolios=2,
exchanges=["NYSE"],
value_weighted=True,
data=crsp_monthly
)
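The comparison table below can be assembled, for instance, as in the following sketch; the scaling to percent is an assumption on our side.
# Sketch (assumed): collect the two size premiums in a small table.
size_premiums = pd.DataFrame(
  {"Premium": [ret_all["size_premium"]*100, ret_nyse["size_premium"]*100]},
  index=["NYSE, NASDAQ & AMEX", "NYSE"]
).round(2)
size_premiums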
Premium
NYSE, NASDAQ & AMEX 0.08
NYSE 0.16
The table shows that the size premium is more than 60 percent larger if we consider only stocks from NYSE to form the breakpoint each month. The NYSE-specific breakpoints are larger, and more than 50 percent of the stocks in the entire universe end up in the resulting small portfolio because NYSE firms are larger on average. The impact of this choice is thus not negligible.
From a malicious perspective, these modeling choices give the researcher multiple chances
to find statistically significant results. Yet this is considered p-hacking, which renders the
statistical inference invalid due to multiple testing (Harvey et al., 2016).
Nevertheless, the multitude of options creates a problem since there is no single correct way
of sorting portfolios. How should a researcher convince a reader that their results do not
come from a p-hacking exercise? To circumvent this dilemma, academics are encouraged
to present evidence from different sorting schemes as robustness tests and report multiple
approaches to show that a result does not depend on a single choice. Thus, the robustness
of premiums is a key feature.
Below, we conduct a series of robustness tests, which could also be interpreted as a p-
hacking exercise. To do so, we examine the size premium in different specifications presented
in the list p_hacking_setup. The function itertools.product() produces the Cartesian product of its arguments, i.e., all possible combinations of the specification choices. Note that we use the argument data to exclude financial firms and truncate the time series.
n_portfolios = [2, 5, 10]
exchanges = [["NYSE"], ["NYSE", "NASDAQ", "AMEX"]]
value_weighted = [True, False]
data = [
crsp_monthly,
crsp_monthly[crsp_monthly["industry"] != "Finance"],
crsp_monthly[crsp_monthly["month"] < "1990-06-01"],
crsp_monthly[crsp_monthly["month"] >= "1990-06-01"],
]
p_hacking_setup = list(
product(n_portfolios, exchanges, value_weighted, data)
)
To speed up the computation, we parallelize the many different sorting procedures using the joblib package, as in Chapter 7. Finally, we report the resulting size premiums in descending order. There are indeed substantial size premiums possible in our data, in particular when we use equal-weighted portfolios.
n_cores = cpu_count()-1
p_hacking_results = pd.concat(
Parallel(n_jobs=n_cores)
(delayed(compute_portfolio_returns)(x, y, z, w)
for x, y, z, w in p_hacking_setup)
)
p_hacking_results = p_hacking_results.reset_index(name="size_premium")
p_hacking_results_figure = (
ggplot(p_hacking_results,
aes(x="size_premium")) +
geom_histogram(bins=len(p_hacking_results)) +
scale_x_continuous(labels=percent_format()) +
labs(x="", y="",
title="Distribution of size premiums for various sorting choices") +
geom_vline(aes(xintercept=factors_ff3_monthly["smb"].mean()),
linetype="dashed")
)
p_hacking_results_figure.draw()
Figure 9.3: The figure shows the distribution of size premiums for various sorting choices.
The dashed vertical line indicates the average Fama-French SMB premium.
9.7 Exercises
1. We gained several insights on the size distribution above. However, we did not
analyze the average size across listing exchanges and industries. Which listing
exchanges/industries have the largest firms? Plot the average firm size for the
three listing exchanges over time. What do you conclude?
2. We compute breakpoints but do not take a look at them in the exposition above.
This might cover potential data errors. Plot the breakpoints for ten size portfolios
over time. Then, take the difference between the two extreme portfolios and plot
it. Describe your results.
10
Value and Bivariate Sorts
In this chapter, we extend the univariate portfolio analysis of Chapter 8 to bivariate sorts,
which means we assign stocks to portfolios based on two characteristics. Bivariate sorts are
regularly used in the academic asset pricing literature and are the basis for factors in the
Fama-French three-factor model. However, some scholars also use sorts with three grouping
variables. Conceptually, portfolio sorts are easily applicable in higher dimensions.
We form portfolios on firm size and the book-to-market ratio. To calculate book-to-market
ratios, accounting data is required, which necessitates additional steps during portfolio
formation. In the end, we demonstrate how to form portfolios on two sorting variables using
so-called independent and dependent portfolio sorts.
The current chapter relies on this set of Python packages.
import pandas as pd
import numpy as np
import datetime as dt
import sqlite3
Compared to previous chapters, we introduce the datetime module that is part of the
Python standard library for manipulating dates.
crsp_monthly = (pd.read_sql_query(
sql=("SELECT permno, gvkey, month, ret_excess, mktcap, "
"mktcap_lag, exchange FROM crsp_monthly"),
con=tidy_finance,
parse_dates={"month"})
.dropna()
)
Further, we utilize accounting data. The most common source of accounting data is Compu-
stat. We only need book equity data in this application, which we select from our database.
Additionally, we convert the variable datadate to its monthly value, as we only consider
monthly returns here and do not need to account for the exact date.
book_equity = (pd.read_sql_query(
sql="SELECT gvkey, datadate, be FROM compustat",
con=tidy_finance,
parse_dates={"datadate"})
.dropna()
.assign(
month=lambda x: (
pd.to_datetime(x["datadate"]).dt.to_period("M").dt.to_timestamp()
)
)
)
1 The definition of a time lag is another choice a researcher has to make, similar to breakpoint choices as
we describe in Chapter 9.
.rename(columns={"mktcap": "me"})
.get(["permno", "sorting_date", "me"])
)
bm = (book_equity
.merge(crsp_monthly, how="inner", on=["gvkey", "month"])
.assign(bm=lambda x: x["be"]/x["mktcap"],
sorting_date=lambda x: x["month"]+pd.DateOffset(months=6))
.assign(comp_date=lambda x: x["sorting_date"])
.get(["permno", "gvkey", "sorting_date", "comp_date", "bm"])
)
data_for_sorts = (crsp_monthly
.merge(bm,
how="left",
left_on=["permno", "gvkey", "month"],
right_on=["permno", "gvkey", "sorting_date"])
.merge(me,
how="left",
left_on=["permno", "month"],
right_on=["permno", "sorting_date"])
.get(["permno", "gvkey", "month", "ret_excess",
"mktcap_lag", "me", "bm", "exchange", "comp_date"])
)
data_for_sorts = (data_for_sorts
.sort_values(by=["permno", "gvkey", "month"])
.groupby(["permno", "gvkey"])
.apply(lambda x: x.assign(
bm=x["bm"].fillna(method="ffill"),
comp_date=x["comp_date"].fillna(method="ffill")
)
)
.reset_index(drop=True)
.assign(threshold_date = lambda x: (x["month"]-pd.DateOffset(months=12)))
.query("comp_date > threshold_date")
.drop(columns=["comp_date", "threshold_date"])
.dropna()
)
The last step of preparation for the portfolio sorts is the computation of breakpoints. We
continue to use the same function, allowing for the specification of exchanges to be used
for the breakpoints. Additionally, we reintroduce the argument sorting_variable into the
function for defining different sorting variables.
def assign_portfolio(data, exchanges, sorting_variable, n_portfolios):
  """Assign portfolio for a given sorting variable."""
  breakpoints = (data
    .query(f"exchange in {exchanges}")
    .get(sorting_variable)
    .quantile(np.linspace(0, 1, num=n_portfolios+1),
              interpolation="linear")
    .drop_duplicates()
  )
  breakpoints.iloc[0] = -np.Inf
  breakpoints.iloc[breakpoints.size-1] = np.Inf
  assigned_portfolios = pd.cut(
    data[sorting_variable],
    bins=breakpoints,
    labels=range(1, breakpoints.size),
    include_lowest=True,
    right=False
  )
  return assigned_portfolios
After these data preparation steps, we present bivariate portfolio sorts on an independent
and dependent basis.
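For the independent sort, both portfolios are assigned within each month without conditioning on the other sorting variable. The exact code is not shown in this excerpt; a minimal sketch, mirroring the dependent sort presented later in the chapter, looks as follows.
# Sketch (assumed): independent 5x5 sort on size (me) and book-to-market
# (bm) with NYSE breakpoints, followed by value-weighted portfolio returns.
value_portfolios = (data_for_sorts
  .groupby("month")
  .apply(lambda x: x.assign(
    portfolio_bm=assign_portfolio(
      data=x, sorting_variable="bm", n_portfolios=5, exchanges=["NYSE"]
    ),
    portfolio_me=assign_portfolio(
      data=x, sorting_variable="me", n_portfolios=5, exchanges=["NYSE"]
    )
  ))
  .reset_index(drop=True)
  .groupby(["month", "portfolio_bm", "portfolio_me"])
  .apply(lambda x: pd.Series({
    "ret": np.average(x["ret_excess"], weights=x["mktcap_lag"])
  }))
  .reset_index()
)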
Equipped with our monthly portfolio returns, we are ready to compute the value premium.
However, we still have to decide how to invest in the five high and the five low book-to-
market portfolios. The most common approach is to weigh these portfolios equally, but this
is yet another researcher’s choice. Then, we compute the return differential between the
high and low book-to-market portfolios and show the average value premium.
value_premium = (value_portfolios
.groupby(["month", "portfolio_bm"])
.aggregate({"ret": "mean"})
.reset_index()
.groupby("month")
.apply(lambda x: pd.Series({
"value_premium": (
x.loc[x["portfolio_bm"] == x["portfolio_bm"].max(), "ret"].mean() -
x.loc[x["portfolio_bm"] == x["portfolio_bm"].min(), "ret"].mean()
)
})
)
.aggregate({"value_premium": "mean"})
)
The resulting monthly value premium is 0.43 percent with an annualized return of 5.3
percent.
To implement the dependent sorts, we first create the size portfolios by calling
assign_portfolio() with sorting_variable="me". Then, we group our data again by
month and by the size portfolio before assigning the book-to-market portfolio. The rest of
the implementation is the same as before. Finally, we compute the value premium.
value_portfolios = (data_for_sorts
.groupby("month")
.apply(lambda x: x.assign(
portfolio_me=assign_portfolio(
data=x, sorting_variable="me", n_portfolios=5, exchanges=["NYSE"]
)
)
)
.reset_index(drop=True)
.groupby(["month", "portfolio_me"])
.apply(lambda x: x.assign(
portfolio_bm=assign_portfolio(
data=x, sorting_variable="bm", n_portfolios=5, exchanges=["NYSE"]
)
)
)
.reset_index(drop=True)
.groupby(["month", "portfolio_bm", "portfolio_me"])
.apply(lambda x: pd.Series({
"ret": np.average(x["ret_excess"], weights=x["mktcap_lag"])
})
)
.reset_index()
)
value_premium = (value_portfolios
.groupby(["month", "portfolio_bm"])
.aggregate({"ret": "mean"})
.reset_index()
.groupby("month")
.apply(lambda x: pd.Series({
"value_premium": (
x.loc[x["portfolio_bm"] == x["portfolio_bm"].max(), "ret"].mean() -
x.loc[x["portfolio_bm"] == x["portfolio_bm"].min(), "ret"].mean()
)
})
)
.aggregate({"value_premium": "mean"})
)
The monthly value premium from dependent sorts is 0.38 percent, which translates to an
annualized premium of 4.6 percent per year.
Overall, we show how to conduct bivariate portfolio sorts in this chapter. In one case, we
sort the portfolios independently of each other. Yet we also discuss how to create dependent
portfolio sorts. Along the lines of Chapter 9, we see how many choices a researcher has to
make to implement portfolio sorts, and bivariate sorts increase the number of choices.
10.5 Exercises
1. In Chapter 9, we examine the distribution of market equity. Repeat this analysis
for book equity and the book-to-market ratio (alongside a plot of the breakpoints,
i.e., deciles).
2. When we investigate the portfolios, we focus on the returns exclusively. However,
it is also of interest to understand the characteristics of the portfolios. Write
a function to compute the average characteristics for size and book-to-market
across the 25 independently and dependently sorted portfolios.
3. As for the size premium, also the value premium constructed here does not follow
Fama and French (1993). Implement a p-hacking setup as in Chapter 9 to find a
premium that comes closest to their HML premium.
11
Replicating Fama-French Factors
In this chapter, we provide a replication of the famous Fama-French factor portfolios. The
Fama-French factor models are a cornerstone of empirical asset pricing (see Fama and French, 1992; Fama and French, 2015). On top of the market factor represented by
the traditional CAPM beta, the three-factor model includes the size and value factors to
explain the cross section of returns. Its successor, the five-factor model, additionally includes
profitability and investment as explanatory factors.
We start with the three-factor model. We already introduced the size and value factors in
Chapter 10, and their definition remains the same: size is the SMB factor (small-minus-big)
that is long small firms and short large firms. The value factor is HML (high-minus-low)
and is long in high book-to-market firms and short in low book-to-market counterparts.
After the replication of the three-factor model, we move to the five-factors by constructing
the profitability factor RMW (robust-minus-weak) as the difference between the returns of
firms with high and low operating profitability and the investment factor CMA (conservative-
minus-aggressive) as the difference between firms with high versus low investment rates.
The current chapter relies on this set of Python packages.
import pandas as pd
import numpy as np
import sqlite3
import statsmodels.formula.api as smf

from regtabletotext import prettify_result
crsp_monthly = (pd.read_sql_query(
sql=("SELECT permno, gvkey, month, ret_excess, mktcap, "
"mktcap_lag, exchange FROM crsp_monthly"),
con=tidy_finance,
parse_dates={"month"})
.dropna()
)
compustat = (pd.read_sql_query(
sql="SELECT gvkey, datadate, be, op, inv FROM compustat",
con=tidy_finance,
parse_dates={"datadate"})
.dropna()
)
factors_ff3_monthly = pd.read_sql_query(
sql="SELECT month, smb, hml FROM factors_ff3_monthly",
con=tidy_finance,
parse_dates={"month"}
)
factors_ff5_monthly = pd.read_sql_query(
sql="SELECT month, smb, hml, rmw, cma FROM factors_ff5_monthly",
con=tidy_finance,
parse_dates={"month"}
)
Yet when we start merging our dataset for computing the premiums, there are a few differences compared to Chapter 10. First, Fama and French form their portfolios in June of year t, whereby the returns of July are the first monthly return for the respective portfolio. For firm size, they consequently use the market capitalization recorded for June. It is then held constant until June of year t+1.
Second, Fama and French also have a different protocol for computing the book-to-market ratio. They use market equity as of the end of year t-1 and the book equity reported in year t-1, i.e., the datadate is within the last year. Hence, the book-to-market ratio can be based on accounting information that is up to 18 months old. Market equity also does not necessarily reflect the same point in time as book equity.
To implement all these time lags, we again employ the temporary sorting_date-column.
Notice that when we combine the information, we want to have a single observation per
year and stock since we are only interested in computing the breakpoints held constant for
the entire year. We ensure this by a call of drop_duplicates() at the end of the chunk
below.
size = (crsp_monthly
.query("month.dt.month == 6")
.assign(sorting_date=lambda x: (x["month"]+pd.DateOffset(months=1)))
.get(["permno", "exchange", "sorting_date", "mktcap"])
.rename(columns={"mktcap": "size"})
)
market_equity = (crsp_monthly
  .query("month.dt.month == 12")
  .assign(sorting_date=lambda x: (x["month"]+pd.DateOffset(months=7)))
  .get(["permno", "gvkey", "sorting_date", "mktcap"])
  .rename(columns={"mktcap": "me"})
)
book_to_market = (compustat
.assign(
sorting_date=lambda x: (pd.to_datetime(
(x["datadate"].dt.year+1).astype(str)+"0701", format="%Y%m%d")
)
)
.merge(market_equity, how="inner", on=["gvkey", "sorting_date"])
.assign(bm=lambda x: x["be"]/x["me"])
.get(["permno", "sorting_date", "me", "bm"])
)
sorting_variables = (size
.merge(book_to_market, how="inner", on=["permno", "sorting_date"])
.dropna()
.drop_duplicates(subset=["permno", "sorting_date"])
)
def assign_portfolio(data, sorting_variable, percentiles):
  """Assign portfolio for a given sorting variable."""
  breakpoints = (data
    .query("exchange == 'NYSE'")
    .get(sorting_variable)
    .quantile(percentiles, interpolation="linear")
  )
  breakpoints.iloc[0] = -np.Inf
  breakpoints.iloc[breakpoints.size-1] = np.Inf
  assigned_portfolios = pd.cut(
    data[sorting_variable],
    bins=breakpoints,
    labels=pd.Series(range(1, breakpoints.size)),
    include_lowest=True,
    right=False
  )
  return assigned_portfolios
portfolios = (sorting_variables
.groupby("sorting_date")
.apply(lambda x: x
.assign(
portfolio_size=assign_portfolio(x, "size", [0, 0.5, 1]),
portfolio_bm=assign_portfolio(x, "bm", [0, 0.3, 0.7, 1])
)
)
.reset_index(drop=True)
.get(["permno", "sorting_date", "portfolio_size", "portfolio_bm"])
)
Next, we merge the portfolios to the return data for the rest of the year. To implement this step, we create a new column sorting_date in our return data by setting the date to sort on to July of year t-1 if the month is June (of year t) or earlier, or to July of year t if the month is July or later.
portfolios = (crsp_monthly
.assign(
sorting_date=lambda x: (pd.to_datetime(
x["month"].apply(lambda x: str(x.year-1)+
"0701" if x.month <= 6 else str(x.year)+"0701")))
)
.merge(portfolios, how="inner", on=["permno", "sorting_date"])
)
Based on the merged data, we compute value-weighted returns for the six size and book-to-market portfolios and construct the replicated factors:
factors_replicated = (portfolios
  .groupby(["portfolio_size", "portfolio_bm", "month"])
  .apply(lambda x: pd.Series({
    "ret": np.average(x["ret_excess"], weights=x["mktcap_lag"])
  }))
  .reset_index()
  .groupby("month")
  .apply(lambda x: pd.Series({
    "smb_replicated": (
      x["ret"][x["portfolio_size"] == 1].mean() -
      x["ret"][x["portfolio_size"] == 2].mean()),
    "hml_replicated": (
      x["ret"][x["portfolio_bm"] == 3].mean() -
      x["ret"][x["portfolio_bm"] == 1].mean())
  }))
  .reset_index()
)
factors_replicated = (factors_replicated
.merge(factors_ff3_monthly, how="inner", on="month")
.round(4)
)
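The output below comes from regressing the original Fama-French factors on our replicated series. The code for these tests is not shown in this excerpt; it presumably mirrors the tests used for the five-factor replication later in the chapter, for example for SMB (the HML test is analogous):
model_smb = (smf.ols(
    formula="smb ~ smb_replicated",
    data=factors_replicated
  )
  .fit()
)
prettify_result(model_smb)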
OLS Model:
smb ~ smb_replicated
Coefficients:
Estimate Std. Error t-Statistic p-Value
Intercept -0.000 0.000 -1.334 0.183
smb_replicated 0.994 0.004 231.343 0.000
Summary statistics:
- Number of observations: 726
- R-squared: 0.987, Adjusted R-squared: 0.987
OLS Model:
hml ~ hml_replicated
Coefficients:
Estimate Std. Error t-Statistic p-Value
Intercept 0.000 0.000 1.633 0.103
hml_replicated 0.963 0.007 133.678 0.000
Summary statistics:
- Number of observations: 726
- R-squared: 0.961, Adjusted R-squared: 0.961
- F-statistic: 17,869.727 on 1 and 724 DF, p-value: 0.000
The replication of the HML factor is also a success, although at a slightly lower level, with a coefficient of 0.96 and an R-squared of around 96 percent.
The evidence allows us to conclude that we did a relatively good job in replicating the
original Fama-French size and value premiums, although we do not know their underlying
code. From our perspective, a perfect match is only possible with additional information
from the maintainers of the original data.
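The five-factor sorts additionally require operating profitability (op) and investment (inv) from Compustat. The data frame other_sorting_variables used below is not shown in this excerpt; a sketch of how it can be built, mirroring book_to_market above, is:
# Sketch (assumed): sorting variables for the five-factor model,
# combining book equity, profitability, and investment with market equity.
other_sorting_variables = (compustat
  .assign(
    sorting_date=lambda x: (pd.to_datetime(
      (x["datadate"].dt.year+1).astype(str)+"0701", format="%Y%m%d")
    )
  )
  .merge(market_equity, how="inner", on=["gvkey", "sorting_date"])
  .assign(bm=lambda x: x["be"]/x["me"])
  .get(["permno", "sorting_date", "me", "bm", "op", "inv"])
)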
sorting_variables = (size
.merge(other_sorting_variables, how="inner", on=["permno", "sorting_date"])
.dropna()
.drop_duplicates(subset=["permno", "sorting_date"])
)
In each month, we independently sort all stocks into the two size portfolios. The value,
profitability, and investment portfolios, on the other hand, are the results of dependent
sorts based on the size portfolios. We then merge the portfolios to the return data for the
rest of the year just as above.
portfolios = (sorting_variables
.groupby("sorting_date")
.apply(lambda x: x
.assign(
portfolio_size=assign_portfolio(x, "size", [0, 0.5, 1])
)
)
.reset_index(drop=True)
.groupby(["sorting_date", "portfolio_size"])
.apply(lambda x: x
.assign(
portfolio_bm=assign_portfolio(x, "bm", [0, 0.3, 0.7, 1]),
portfolio_op=assign_portfolio(x, "op", [0, 0.3, 0.7, 1]),
portfolio_inv=assign_portfolio(x, "inv", [0, 0.3, 0.7, 1])
)
)
.reset_index(drop=True)
.get(["permno", "sorting_date",
"portfolio_size", "portfolio_bm",
"portfolio_op", "portfolio_inv"])
)
portfolios = (crsp_monthly
.assign(
sorting_date=lambda x: (pd.to_datetime(
x["month"].apply(lambda x: str(x.year-1)+
"0701" if x.month <= 6 else str(x.year)+"0701")))
)
.merge(portfolios, how="inner", on=["permno", "sorting_date"])
)
Now, we want to construct each of the factors, but this time, the size factor actually comes
last because it is the result of averaging across all other factor portfolios. This dependency
is the reason why we keep the table with value-weighted portfolio returns as a separate
object that we reuse later. We construct the value factor, HML, as above by going long
the two portfolios with high book-to-market ratios and shorting the two portfolios with low
book-to-market.
portfolios_value = (portfolios
.groupby(["portfolio_size", "portfolio_bm", "month"])
.apply(lambda x: pd.Series({
"ret": np.average(x["ret_excess"], weights=x["mktcap_lag"])
})
)
.reset_index()
)
factors_value = (portfolios_value
.groupby("month")
.apply(lambda x: pd.Series({
"hml_replicated": (
x["ret"][x["portfolio_bm"] == 3].mean() -
x["ret"][x["portfolio_bm"] == 1].mean())})
)
.reset_index()
)
For the profitability factor, RMW (robust-minus-weak), we take a long position in the two
high profitability portfolios and a short position in the two low profitability portfolios.
portfolios_profitability = (portfolios
.groupby(["portfolio_size", "portfolio_op", "month"])
.apply(lambda x: pd.Series({
"ret": np.average(x["ret_excess"], weights=x["mktcap_lag"])
})
)
.reset_index()
)
factors_profitability = (portfolios_profitability
.groupby("month")
.apply(lambda x: pd.Series({
"rmw_replicated": (
x["ret"][x["portfolio_op"] == 3].mean() -
x["ret"][x["portfolio_op"] == 1].mean())})
)
.reset_index()
)
For the investment factor, CMA (conservative-minus-aggressive), we go long the two low
investment portfolios and short the two high investment portfolios.
portfolios_investment = (portfolios
.groupby(["portfolio_size", "portfolio_inv", "month"])
.apply(lambda x: pd.Series({
"ret": np.average(x["ret_excess"], weights=x["mktcap_lag"])
})
)
.reset_index()
)
factors_investment = (portfolios_investment
.groupby("month")
.apply(lambda x: pd.Series({
"cma_replicated": (
x["ret"][x["portfolio_inv"] == 1].mean() -
x["ret"][x["portfolio_inv"] == 3].mean())})
)
.reset_index()
)
Finally, the size factor, SMB, is constructed by going long the six small portfolios and short
the six large portfolios.
factors_size = (
pd.concat(
[portfolios_value, portfolios_profitability, portfolios_investment],
ignore_index=True
)
.groupby("month")
.apply(lambda x: pd.Series({
"smb_replicated": (
x["ret"][x["portfolio_size"] == 1].mean() -
x["ret"][x["portfolio_size"] == 2].mean())})
)
.reset_index()
)
We then join all factors together into one dataframe and construct again a suitable table to
run tests for evaluating our replication.
factors_replicated = (factors_size
.merge(factors_value, how="outer", on="month")
.merge(factors_profitability, how="outer", on="month")
.merge(factors_investment, how="outer", on="month")
)
factors_replicated = (factors_replicated
.merge(factors_ff5_monthly, how="inner", on="month")
.round(4)
)
Let us start the replication evaluation again with the size factor:
model_smb = (smf.ols(
formula="smb ~ smb_replicated",
data=factors_replicated
)
.fit()
)
prettify_result(model_smb)
OLS Model:
smb ~ smb_replicated
Coefficients:
Estimate Std. Error t-Statistic p-Value
Intercept -0.00 0.000 -1.493 0.136
smb_replicated 0.97 0.004 222.489 0.000
Summary statistics:
- Number of observations: 714
- R-squared: 0.986, Adjusted R-squared: 0.986
- F-statistic: 49,501.421 on 1 and 712 DF, p-value: 0.000
The results for the SMB factor are quite convincing, as all three criteria outlined above are
met and the coefficient is 0.97 and the R-squared is at 99 percent.
model_hml = (smf.ols(
formula="hml ~ hml_replicated",
data=factors_replicated
)
.fit()
)
prettify_result(model_hml)
OLS Model:
hml ~ hml_replicated
Coefficients:
Estimate Std. Error t-Statistic p-Value
Intercept 0.000 0.00 1.588 0.113
hml_replicated 0.991 0.01 96.694 0.000
Summary statistics:
- Number of observations: 714
- R-squared: 0.929, Adjusted R-squared: 0.929
- F-statistic: 9,349.714 on 1 and 712 DF, p-value: 0.000
The replication of the HML factor is also a success, although with a slightly higher coefficient of 0.99 and an R-squared of around 93 percent.
model_rmw = (smf.ols(
formula="rmw ~ rmw_replicated",
data=factors_replicated
)
.fit()
)
prettify_result(model_rmw)
OLS Model:
rmw ~ rmw_replicated
Coefficients:
Estimate Std. Error t-Statistic p-Value
Intercept 0.000 0.000 0.268 0.789
rmw_replicated 0.954 0.009 107.038 0.000
Summary statistics:
- Number of observations: 714
- R-squared: 0.941, Adjusted R-squared: 0.941
- F-statistic: 11,457.150 on 1 and 712 DF, p-value: 0.000
We are also able to replicate the RMW factor quite well with a coefficient of 0.95 and an
R-squared around 94 percent.
model_cma = (smf.ols(
formula="cma ~ cma_replicated",
data=factors_replicated
)
.fit()
)
prettify_result(model_cma)
OLS Model:
cma ~ cma_replicated
Coefficients:
Estimate Std. Error t-Statistic p-Value
Intercept 0.001 0.000 4.002 0.0
cma_replicated 0.964 0.008 117.396 0.0
Summary statistics:
- Number of observations: 714
- R-squared: 0.951, Adjusted R-squared: 0.951
- F-statistic: 13,781.776 on 1 and 712 DF, p-value: 0.000
Finally, the CMA factor also replicates well with a coefficient of 0.96 and an R-squared
around 95 percent.
Overall, our approach seems to replicate the Fama-French five-factor model just as well as the three-factor model.
11.6 Exercises
1. Fama and French (1993) claim that their sample excludes firms until they have ap-
peared in Compustat for two years. Implement this additional filter and compare
the improvements of your replication effort.
2. On his homepage, Kenneth French1 provides instructions on how to construct the
most common variables used for portfolio sorts. Try to replicate the univariate
portfolio sort return time series for E/P (earnings/price) provided on his homepage
and evaluate your replication effort using regressions.
1 https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/Data_Library/variable_definitions.html
12
Fama-MacBeth Regressions
In this chapter, we present a simple implementation of Fama and MacBeth (1973), a re-
gression approach commonly called Fama-MacBeth regressions. Fama-MacBeth regressions
are widely used in empirical asset pricing studies. We use individual stocks as test assets to
estimate the risk premium associated with the three factors included in Fama and French
(1993).
Researchers use the two-stage regression approach to estimate risk premiums in various
markets, but predominately in the stock market. Essentially, the two-step Fama-MacBeth
regressions exploit a linear relationship between expected returns and exposure to (priced)
risk factors. The basic idea of the regression approach is to project asset returns on factor
exposures or characteristics that resemble exposure to a risk factor in the cross-section in
each time period. Then, in the second step, the estimates are aggregated across time to test
if a risk factor is priced. In principle, Fama-MacBeth regressions can be used in the same
way as portfolio sorts introduced in previous chapters.
The Fama-MacBeth procedure is a simple two-step approach: The first step uses the exposures (characteristics) as explanatory variables in $T$ cross-sectional regressions. For example, if $r_{i,t+1}$ denotes the excess return of asset $i$ in month $t+1$, then the famous Fama-French three-factor model implies the following return generating process (see also Campbell et al., 1998):
$$ r_{i,t+1} = \alpha_i + \lambda^{M}_t \beta^{M}_{i,t} + \lambda^{SMB}_t \beta^{SMB}_{i,t} + \lambda^{HML}_t \beta^{HML}_{i,t} + \epsilon_{i,t}. \qquad (12.1) $$
Here, we are interested in the compensation $\lambda^{f}_t$ for the exposure to each risk factor $\beta^{f}_{i,t}$ at each time point, i.e., the risk premium. Note the terminology: $\beta^{f}_{i,t}$ is an asset-specific characteristic, e.g., a factor exposure or an accounting variable. If there is a linear relationship between expected returns and the characteristic in a given month, we expect the regression coefficient to reflect the relationship, i.e., $\lambda^{f}_t \neq 0$.
In the second step, the time-series average $\frac{1}{T}\sum_{t=1}^{T} \hat{\lambda}^{f}_t$ of the estimates $\hat{\lambda}^{f}_t$ can then be interpreted as the risk premium for the specific risk factor $f$. We follow Zaffaroni and Zhou (2022) and consider the standard cross-sectional regression to predict future returns. If the characteristics are replaced with time $t+1$ variables, then the regression approach captures risk attributes rather than risk premiums.
Before we move to the implementation, we want to highlight that the characteristics, e.g., $\hat{\beta}^{f}_i$, are often estimated in a separate step before applying the actual Fama-MacBeth methodology. You can think of this as a step 0. You might thus worry that the estimation errors of $\hat{\beta}^{f}_i$ impact the risk premiums' standard errors. Measurement error in $\hat{\beta}^{f}_i$ indeed affects the risk premium estimates, i.e., it leads to biased estimates. The literature provides adjustments for this bias (see, e.g., Shanken, 1992; Kim, 1995; Chen et al., 2015, among others) but also shows that the bias goes to zero as $T \to \infty$. We refer to Gagliardini et al. (2016) for an in-depth discussion that also covers the case of time-varying betas. Moreover, if you plan to use Fama-MacBeth regressions with individual stocks, Hou et al. (2020) advocate using weighted least squares to estimate the coefficients so that they are not biased toward small firms. Without this adjustment, the high number of small firms would drive the coefficient estimates.
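As an aside, the weighted-least-squares idea can be sketched as follows. This is an illustration rather than the book's code: the variable names mirror those constructed later in this chapter, and keeping a column mktcap in the monthly cross-section is an assumption.
import statsmodels.formula.api as smf

def cross_section_wls(data):
  """Run one value-weighted cross-sectional regression (illustrative sketch)."""
  # Weight each stock by its market capitalization, in the spirit of
  # Hou et al. (2020); "mktcap" is an assumed column in the data.
  return (smf.wls(
      formula="ret_excess_lead ~ beta + log_mktcap + bm",
      data=data,
      weights=data["mktcap"]
    )
    .fit()
    .params
  )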
The current chapter relies on this set of Python packages.
import pandas as pd
import numpy as np
import sqlite3
import statsmodels.formula.api as smf
crsp_monthly = pd.read_sql_query(
sql="SELECT permno, gvkey, month, ret_excess, mktcap FROM crsp_monthly",
con=tidy_finance,
parse_dates={"month"}
)
compustat = pd.read_sql_query(
sql="SELECT datadate, gvkey, be FROM compustat",
con=tidy_finance,
parse_dates={"datadate"}
)
beta = pd.read_sql_query(
sql="SELECT month, permno, beta_monthly FROM beta",
con=tidy_finance,
parse_dates={"month"}
)
We use the Compustat and CRSP data to compute the book-to-market ratio and the (log)
market capitalization. Furthermore, we also use the CAPM betas based on monthly returns
we computed in the previous chapters.
characteristics = (compustat
.assign(month=lambda x: x["datadate"].dt.to_period("M").dt.to_timestamp())
.merge(crsp_monthly, how="left", on=["gvkey", "month"], )
.merge(beta, how="left", on=["permno", "month"])
.assign(
bm=lambda x: x["be"]/x["mktcap"],
log_mktcap=lambda x: np.log(x["mktcap"]),
sorting_date=lambda x: x["month"]+pd.DateOffset(months=6)
)
.get(["gvkey", "bm", "log_mktcap", "beta_monthly", "sorting_date"])
.rename(columns={"beta_monthly": "beta"})
)
data_fama_macbeth = (crsp_monthly
.merge(characteristics,
how="left",
left_on=["gvkey", "month"], right_on=["gvkey", "sorting_date"])
.sort_values(["month", "permno"])
.groupby("permno")
.apply(lambda x: x.assign(
beta=x["beta"].fillna(method="ffill"),
bm=x["bm"].fillna(method="ffill"),
log_mktcap=x["log_mktcap"].fillna(method="ffill")
)
)
.reset_index(drop=True)
)
data_fama_macbeth_lagged = (data_fama_macbeth
.assign(month=lambda x: x["month"]-pd.DateOffset(months=1))
.get(["permno", "month", "ret_excess"])
.rename(columns={"ret_excess": "ret_excess_lead"})
)
data_fama_macbeth = (data_fama_macbeth
.merge(data_fama_macbeth_lagged, how="left", on=["permno", "month"])
.get(["permno", "month", "ret_excess_lead", "beta", "log_mktcap", "bm"])
.dropna()
)
Next, we run the cross-sectional regressions of future excess returns on the characteristics for each month:
risk_premiums = (data_fama_macbeth
  .groupby("month")
  .apply(lambda x: smf.ols("ret_excess_lead ~ beta + log_mktcap + bm", x)
    .fit()
    .params
  )
  .reset_index()
)
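The monthly estimates are then aggregated into average risk premiums. The exact aggregation code is not shown in this excerpt; a sketch, with the scaling to percent and the simple t-statistic as assumptions, is:
# Sketch (assumed): average the monthly slope estimates per factor and
# compute a simple t-statistic based on their time-series variation.
price_of_risk = (risk_premiums
  .melt(id_vars="month", var_name="factor", value_name="estimate")
  .groupby("factor")
  .apply(lambda x: pd.Series({
    "risk_premium": 100*x["estimate"].mean(),
    "t_statistic": x["estimate"].mean()/x["estimate"].std()*np.sqrt(len(x))
  }))
  .reset_index()
)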
It is common to adjust for autocorrelation when reporting standard errors of risk premi-
ums. As in Chapter 8, the typical procedure for this is computing Newey and West (1987)
standard errors.
price_of_risk_newey_west = (risk_premiums
.melt(id_vars="month", var_name="factor", value_name="estimate")
.groupby("factor")
.apply(lambda x: (
x["estimate"].mean()/
smf.ols("estimate ~ 1", x)
.fit(cov_type="HAC", cov_kwds={"maxlags": 6}).bse
)
)
.reset_index()
.rename(columns={"Intercept": "t_statistic_newey_west"})
)
(price_of_risk
.merge(price_of_risk_newey_west, on="factor")
.round(3)
)
Finally, let us interpret the results. Stocks with higher book-to-market ratios earn higher
expected future returns, which is in line with the value premium. The negative value for log
market capitalization reflects the size premium for smaller stocks. Consistent with results
from earlier chapters, we detect no relation between beta and future stock returns.
12.4 Exercises
1. Download a sample of test assets from Kenneth French’s homepage and reevaluate
the risk premiums for industry portfolios instead of individual stocks.
2. Use individual stocks with weighted-least squares based on a firm’s size as sug-
gested by Hou et al. (2020). Then, repeat the Fama-MacBeth regressions without
the weighting-scheme adjustment but drop the smallest 20 percent of firms each
month. Compare the results of the three approaches.
Part IV
13
Fixed Effects and Clustered Standard Errors
In this chapter, we provide an intuitive introduction to the two popular concepts of fixed
effects regressions and clustered standard errors. When working with regressions in empirical
finance, you will sooner or later be confronted with discussions around how you deal with
omitted variables bias and dependence in your residuals. The concepts we introduce in this
chapter are designed to address such concerns.
We focus on a classical panel regression common to the corporate finance literature (e.g.,
Fazzari et al., 1988; Erickson and Whited, 2012; Gulen and Ion, 2015): firm investment
modeled as a function that increases in firm cash flow and firm investment opportunities.
Typically, this investment regression uses quarterly balance sheet data provided via Com-
pustat because it allows for richer dynamics in the regressors and more opportunities to
construct variables. As we focus on the implementation of fixed effects and clustered stan-
dard errors, we use the annual Compustat data from our previous chapters and leave the
estimation using quarterly data as an exercise. We demonstrate below that the regression
based on annual data yields qualitatively similar results to estimations based on quarterly
data from the literature, namely confirming the positive relationships between investment
and the two regressors.
The current chapter relies on the following set of Python packages.
import pandas as pd
import numpy as np
import sqlite3
import datetime as dt
import itertools
import linearmodels as lm

from regtabletotext import prettify_result
tidy_finance = sqlite3.connect(
database="data/tidy_finance_python.sqlite"
)
crsp_monthly = pd.read_sql_query(
sql="SELECT gvkey, month, mktcap FROM crsp_monthly",
con=tidy_finance,
parse_dates={"month"}
)
compustat = pd.read_sql_query(
sql=("SELECT datadate, gvkey, year, at, be, capx, oancf, txdb "
"FROM compustat"),
con=tidy_finance,
parse_dates={"datadate"}
)
The classical investment regression models the capital investment of a firm as a function of operating cash flows and Tobin's q, a measure of a firm's investment opportunities. We
start by constructing investment and cash flows which are usually normalized by lagged total
assets of a firm. In the following code chunk, we construct a panel of firm-year observations,
so we have both cross-sectional information on firms as well as time-series information for
each firm.
data_investment = (compustat
.assign(
month=lambda x: (
pd.to_datetime(x["datadate"]).dt.to_period("M").dt.to_timestamp()
)
)
.merge(compustat.get(["gvkey", "year", "at"])
.rename(columns={"at": "at_lag"})
.assign(year=lambda x: x["year"]+1),
on=["gvkey", "year"], how="left")
.query("at > 0 and at_lag > 0")
.assign(investment=lambda x: x["capx"]/x["at_lag"],
cash_flows=lambda x: x["oancf"]/x["at_lag"])
)
data_investment = (data_investment
.merge(data_investment.get(["gvkey", "year", "investment"])
.rename(columns={"investment": "investment_lead"})
.assign(year=lambda x: x["year"]-1),
on=["gvkey", "year"], how="left")
)
Tobin’s q is the ratio of the market value of capital to its replacement costs. It is one of
the most common regressors in corporate finance applications (e.g., Fazzari et al., 1988;
Erickson and Whited, 2012). We follow the implementation of Gulen and Ion (2015) and
compute Tobin’s q as the market value of equity (mktcap) plus the book value of assets (at)
minus book value of equity (be) plus deferred taxes (txdb), all divided by book value of
assets (at). Finally, we only keep observations where all variables of interest are non-missing,
and the reported book value of assets is strictly positive.
data_investment = (data_investment
.merge(crsp_monthly, on=["gvkey", "month"], how="left")
.assign(
tobins_q=lambda x: (
(x["mktcap"]+x["at"]-x["be"]+x["txdb"])/x["at"]
)
)
.get(["gvkey", "year", "investment_lead", "cash_flows", "tobins_q"])
.dropna()
)
As the variable construction typically leads to extreme values that are most likely related
to data issues (e.g., reporting errors), many papers include winsorization of the variables
of interest. Winsorization involves replacing values of extreme outliers with quantiles on
the respective end. The following function implements the winsorization for any percentage
cut that should be applied on either end of the distributions. In the specific example, we
winsorize the main variables (investment, cash_flows, and tobins_q) at the one percent
level.1
def winsorize(x, cut):
  """Winsorize returns at cut level."""
  tmp_x = x.copy()
  upper_quantile = np.nanquantile(tmp_x, 1-cut)
  lower_quantile = np.nanquantile(tmp_x, cut)
  tmp_x[tmp_x > upper_quantile] = upper_quantile
  tmp_x[tmp_x < lower_quantile] = lower_quantile
  return tmp_x
data_investment = (data_investment
.assign(
investment_lead=lambda x: winsorize(x["investment_lead"], 0.01),
cash_flows=lambda x: winsorize(x["cash_flows"], 0.01),
tobins_q=lambda x: winsorize(x["tobins_q"], 0.01)
)
)
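The summary statistics shown below can be tabulated, for instance, with the describe() pattern used later in the book; the following is a sketch rather than the chapter's exact code.
# Sketch (assumed): summary statistics of the winsorized variables.
data_investment_summary = (data_investment
  .melt(var_name="measure", value_name="value",
        value_vars=["investment_lead", "cash_flows", "tobins_q"])
  .groupby("measure")
  .describe(percentiles=[0.05, 0.5, 0.95])
)
np.round(data_investment_summary, 2)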
value
count mean std min 5% 50% 95% max
measure
cash_flows 127468.0 0.01 0.27 -1.56 -0.47 0.06 0.27 0.48
investment_lead 127468.0 0.06 0.08 0.00 0.00 0.03 0.21 0.46
tobins_q 127468.0 2.00 1.70 0.57 0.79 1.39 5.37 10.91
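The output below corresponds to a pooled regression of future investment on cash flows and Tobin's q. A sketch of this baseline model, estimated with linearmodels so that it is comparable to the fixed effects models that follow (the exact specification is an assumption), is:
# Sketch (assumed): pooled OLS baseline with an explicit constant.
model_ols = lm.PanelOLS.from_formula(
  formula="investment_lead ~ cash_flows + tobins_q + 1",
  data=data_investment.set_index(["gvkey", "year"]),
).fit()
prettify_result(model_ols)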
Coefficients:
Estimate Std. Error t-Statistic p-Value
Intercept 0.042 0.000 127.470 0.0
cash_flows 0.049 0.001 61.787 0.0
tobins_q 0.007 0.000 57.099 0.0
Summary statistics:
- Number of observations: 127,468
- R-squared (incl. FE): 0.043, Within R-squared: 0.039
As expected, the regression output shows significant coefficients for both variables. Higher
cash flows and investment opportunities are associated with higher investment. However,
the simple model actually may have a lot of omitted variables, so our coefficients are most
likely biased. As there is a lot of unexplained variation in our simple model (indicated by
the rather low adjusted R-squared), the bias in our coefficients is potentially severe, and
the true values could be above or below zero. Note that there are no clear cutoffs to decide
when an R-squared is high or low, but it depends on the context of your application and
on the comparison of different models for the same data.
One way to tackle the issue of omitted variable bias is to get rid of as much unexplained
variation as possible by including fixed effects; i.e., model parameters that are fixed for
specific groups (e.g., Wooldridge, 2010). In essence, each group has its own mean in fixed
effects regressions. The simplest group that we can form in the investment regression is the
firm level. The firm fixed effects regression is then
$$ \text{Investment}_{i,t+1} = \alpha_i + \beta_1\, \text{Cash Flows}_{i,t} + \beta_2\, \text{Tobin's q}_{i,t} + \varepsilon_{i,t}, \qquad (13.2) $$
where $\alpha_i$ is the firm fixed effect and captures the firm-specific mean investment across
all years. In fact, you could also compute firms’ investments as deviations from the firms’
average investments and estimate the model without the fixed effects. The idea of the firm
fixed effect is to remove the firm’s average investment, which might be affected by firm-
specific variables that you do not observe. For example, firms in a specific industry might
invest more on average. Or you observe a young firm with large investments but only small
concurrent cash flows, which will only happen in a few years. This sort of variation is
unwanted because it is related to unobserved variables that can bias your estimates in any
direction.
To include the firm fixed effect, we use gvkey (Compustat’s firm identifier) as follows:
model_fe_firm = lm.PanelOLS.from_formula(
formula="investment_lead ~ cash_flows + tobins_q + EntityEffects",
data=data_investment.set_index(["gvkey", "year"]),
).fit()
prettify_result(model_fe_firm)
Coefficients:
Estimate Std. Error t-Statistic p-Value
cash_flows 0.014 0.001 15.219 0.0
tobins_q 0.011 0.000 82.098 0.0
Summary statistics:
- Number of observations: 127,468
- R-squared (incl. FE): 0.585, Within R-squared: 0.057
The regression output shows a lot of unexplained variation at the firm level that is taken
care of by including the firm fixed effect as the adjusted R-squared rises above 50 percent.
In fact, it is more interesting to look at the within R-squared that shows the explanatory
power of a firm’s cash flow and Tobin’s q on top of the average investment of each firm. We
can also see that the coefficients changed slightly in magnitude but not in sign.
There is another source of variation that we can get rid of in our setting: average investment
across firms might vary over time due to macroeconomic factors that affect all firms, such as
economic crises. By including year fixed effects, we can take out the effect of unobservables
that vary over time. The two-way fixed effects regression is then
$$ \text{Investment}_{i,t+1} = \alpha_i + \alpha_t + \beta_1\, \text{Cash Flows}_{i,t} + \beta_2\, \text{Tobin's q}_{i,t} + \varepsilon_{i,t}, \qquad (13.3) $$
where $\alpha_t$ is the time fixed effect. Here you can think of higher investments during an economic expansion with simultaneously high cash flows. You can include time fixed effects by adding "TimeEffects" to the formula of PanelOLS.
model_fe_firmyear = lm.PanelOLS.from_formula(
formula=("investment_lead ~ cash_flows + tobins_q + EntityEffects"
" + TimeEffects"),
data=data_investment.set_index(["gvkey", "year"]),
).fit()
prettify_result(model_fe_firmyear)
Coefficients:
Estimate Std. Error t-Statistic p-Value
cash_flows 0.018 0.001 19.361 0.0
tobins_q 0.010 0.000 75.419 0.0
Summary statistics:
- Number of observations: 127,468
- R-squared (incl. FE): 0.606, Within R-squared: 0.057
The inclusion of time fixed effects only marginally affects the R-squared and the coefficients, which we can interpret as a good sign: it indicates that the coefficients are not driven by an omitted variable that varies over time.
How can we further improve the robustness of our regression results? Ideally, we want to get
rid of unexplained variation at the firm-year level, which means we need to include more
variables that vary across firm and time and are likely correlated with investment. Note
that we cannot include firm-year fixed effects in our setting because then cash flows and
Tobin’s q are colinear with the fixed effects, and the estimation becomes void.
Before we discuss the properties of our estimation errors, we want to point out that regres-
sion tables are at the heart of every empirical analysis, where you compare multiple models.
Fortunately, the results.compare() function provides a convenient way to tabulate the
regression output (with many parameters to customize and even print the output in LaTeX).
We recommend printing t-statistics rather than standard errors in regression tables because the latter are typically very hard to interpret across coefficients that vary in size. We also do not print p-values because they are sometimes misinterpreted to signal the importance of observed effects (Wasserstein and Lazar, 2016). The t-statistics provide a consistent way to interpret changes in estimation uncertainty across different model specifications.
prettify_result([model_ols, model_fe_firm, model_fe_firmyear])
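The comparison below also includes specifications with clustered standard errors. The firm-level clustering is not shown in this excerpt; a sketch using the clustered covariance options of linearmodels (the exact formula is an assumption) is:
# Sketch (assumed): two-way fixed effects with standard errors
# clustered at the firm level.
model_cluster_firm = lm.PanelOLS.from_formula(
  formula=("investment_lead ~ cash_flows + tobins_q + EntityEffects"
           " + TimeEffects"),
  data=data_investment.set_index(["gvkey", "year"]),
).fit(cov_type="clustered", cluster_entity=True)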
model_cluster_firmyear = lm.PanelOLS.from_formula(
  formula=("investment_lead ~ cash_flows + tobins_q + EntityEffects"
           " + TimeEffects"),
  data=data_investment.set_index(["gvkey", "year"]),
).fit(cov_type="clustered", cluster_entity=True, cluster_time=True)
The table below shows the comparison of the different assumptions behind the standard errors. In the first column, we can see highly significant coefficients on both cash flows and Tobin's q. By clustering the standard errors on the firm level, the t-statistics of both coefficients drop roughly in half, indicating a high correlation of residuals within firms. If we additionally cluster by year, we see a further drop, particularly for Tobin's q. Even after relaxing the assumptions behind our standard errors, both coefficients are still comfortably significant, as the t-statistics are well above the usual critical values of 1.96 or 2.576 for two-tailed significance tests.
prettify_result([
model_fe_firmyear, model_cluster_firm, model_cluster_firmyear
])
13.4 Exercises
1. Estimate the two-way fixed effects model with two-way clustered standard errors
using quarterly Compustat data from WRDS.
2. Following Peters and Taylor (2017), compute Tobin’s q as the market value of
outstanding equity mktcap plus the book value of debt (dltt + dlc) minus the
current assets atc and everything divided by the book value of property, plant
and equipment ppegt. What is the correlation between the measures of Tobin’s
q? What is the impact on the two-way fixed effects regressions?
14
Difference in Differences
Compared to previous chapters, we introduce the scipy.stats module from the scipy package (Virtanen et al., 2020) for simple retrieval of quantiles of the standard normal distribution.
fisd = (pd.read_sql_query(
sql="SELECT complete_cusip, maturity, offering_amt, sic_code FROM fisd",
con=tidy_finance,
parse_dates={"maturity"})
.dropna()
)
trace_enhanced = (pd.read_sql_query(
sql=("SELECT cusip_id, trd_exctn_dt, rptd_pr, entrd_vol_qt, yld_pt "
"FROM trace_enhanced"),
con=tidy_finance,
parse_dates={"trd_exctn_dt"})
.dropna()
)
We start our analysis by preparing the sample of bonds. We only consider bonds with a
time to maturity of more than one year to the signing of the PA, so that we have sufficient
data to analyze the yield behavior after the treatment date. This restriction also excludes
all bonds issued after the agreement. We also consider only the first two digits of the SIC
industry code to identify the polluting industries (in line with Seltzer et al., 2022).
treatment_date = pd.to_datetime("2015-12-12")
polluting_industries = [
49, 13, 45, 29, 28, 33, 40, 20, 26, 42, 10, 53, 32, 99, 37
]
bonds = (fisd
.query("offering_amt > 0 & sic_code != 'None'")
.assign(
time_to_maturity=lambda x: (x["maturity"]-treatment_date).dt.days / 365,
sic_code=lambda x: x["sic_code"].astype(str).str[:2].astype(int),
log_offering_amt=lambda x: np.log(x["offering_amt"])
)
.query("time_to_maturity >= 1")
.rename(columns={"complete_cusip": "cusip_id"})
.get(["cusip_id", "time_to_maturity", "log_offering_amt", "sic_code"])
.assign(polluter=lambda x: x["sic_code"].isin(polluting_industries))
.reset_index(drop=True)
)
We aggregate the individual transactions into daily bond yields, following the approach commonly used in the literature (see, e.g., Bessembinder et al., 2008). We weigh each transaction by volume to reflect a trade's relative importance and avoid emphasizing small trades. Moreover, we only consider transactions with reported prices rptd_pr larger than 25 (to exclude bonds that are close to default) and only bond-day observations with at least five trades on a corresponding day (to exclude prices based on too few, potentially non-representative transactions).
trace_enhanced = (trace_enhanced
.query("rptd_pr > 25")
.assign(weight=lambda x: x["entrd_vol_qt"]*x["rptd_pr"])
.assign(weighted_yield=lambda x: x["weight"]*x["yld_pt"])
)
trace_aggregated = (trace_enhanced
.groupby(["cusip_id", "trd_exctn_dt"])
.aggregate(
weighted_yield_sum=("weighted_yield", "sum"),
weight_sum=("weight", "sum"),
trades=("rptd_pr", "count")
)
.reset_index()
.assign(avg_yield=lambda x: x["weighted_yield_sum"]/x["weight_sum"])
.dropna(subset=["avg_yield"])
.query("trades >= 5")
.assign(trd_exctn_dt=lambda x: pd.to_datetime(x["trd_exctn_dt"]))
.assign(month=lambda x: x["trd_exctn_dt"]-pd.offsets.MonthBegin())
)
date_index = (trace_aggregated
.groupby(["cusip_id", "month"])["trd_exctn_dt"]
.idxmax()
)
trace_aggregated = (trace_aggregated
.loc[date_index]
.get(["cusip_id", "month", "avg_yield"])
)
By combining the bond-specific information from Mergent FISD for our bond sample with
the aggregated TRACE data, we arrive at the main sample for our analysis.
bonds_panel = (bonds
.merge(trace_aggregated, how="inner", on="cusip_id")
.dropna()
)
Before we can run the first regression, we need to define the treated indicator,1 which
is the product of the post_period (i.e., all months after the signing of the PA) and the
polluter indicator defined above.
1 Note that by using a generic name here, everybody can replace ours with their sample data and run the code.
bonds_panel = (bonds_panel
.assign(
post_period=lambda x: (
x["month"] >= (treatment_date-pd.offsets.MonthBegin())
)
)
.assign(treated=lambda x: x["polluter"] & x["post_period"])
.assign(month_cat=lambda x: pd.Categorical(x["month"], ordered=True))
)
As usual, we tabulate summary statistics of the variables that enter the regression to check
the validity of our variable definitions.
bonds_panel_summary = (bonds_panel
.melt(var_name="measure",
value_vars=["avg_yield", "time_to_maturity", "log_offering_amt"])
.groupby("measure")
.describe(percentiles=[0.05, 0.5, 0.95])
)
np.round(bonds_panel_summary, 2)
value
count mean std min 5% 50% 95% max
measure
avg_yield 127546.0 4.08 4.21 0.06 1.27 3.38 8.11 127.97
log_offering_amt 127546.0 13.27 0.82 4.64 12.21 13.22 14.51 16.52
time_to_maturity 127546.0 8.55 8.41 1.01 1.50 5.81 27.41 100.70
model_without_fe = lm.PanelOLS.from_formula(
formula=("avg_yield ~ treated + post_period + polluter + log_offering_amt"
" + time_to_maturity + 1"),
data=bonds_panel.set_index(["cusip_id", "month"]),
).fit()
model_with_fe = lm.PanelOLS.from_formula(
formula=("avg_yield ~ treated + EntityEffects + TimeEffects"),
data=bonds_panel.set_index(["cusip_id", "month"]),
).fit()
prettify_result([model_without_fe, model_with_fe])
model_without_fe_time = (smf.ols(
formula=("avg_yield ~ polluter + month_cat:polluter + time_to_maturity"
" + log_offering_amt"),
data=bonds_panel)
.fit()
.summary()
)
model_without_fe_coefs = (
pd.DataFrame(model_without_fe_time.tables[1].data[1:],
columns=["term", "estimate", "std_error",
"t_stat", "p_value", "ci_1", "ci_2"])
.query("term.str.contains('month_cat')")
.assign(
month=lambda x:
x["term"].str.extract(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
)
.assign(month=lambda x: pd.to_datetime(x["month"]))
.assign(treatment=lambda x: x["term"].str.contains("True"))
.assign(estimate=lambda x: x["estimate"].astype(float),
std_error=lambda x: x["std_error"].astype(float))
.assign(ci_up=lambda x: x["estimate"]+norm.ppf(0.975)*x["std_error"],
ci_low=lambda x: x["estimate"]+norm.ppf(0.025)*x["std_error"])
)
polluters_plot = (
ggplot(model_without_fe_coefs,
aes(x="month", y="estimate",
color="treatment", linetype="treatment", shape="treatment")) +
geom_vline(xintercept=pd.to_datetime(treatment_date) -
pd.offsets.MonthBegin(), linetype="dashed") +
geom_hline(yintercept=0, linetype="dashed") +
geom_errorbar(aes(ymin="ci_low", ymax="ci_up"), alpha=0.5) +
geom_point() +
guides(linetype=None) +
labs(x="", y="Yield", shape="Polluter?", color="Polluter?",
title="Polluters respond stronger than green firms") +
scale_linetype_manual(values=["solid", "dashed"]) +
scale_x_datetime(breaks=date_breaks("1 year"), labels=date_format("%Y"))
)
polluters_plot.draw()
Figure 14.1 shows that throughout most of 2014, the yields of the two groups changed in
unison. However, starting at the end of 2014, the yields start to diverge, reaching the highest
difference around the signing of the PA. Afterward, the yields for both groups fall again,
and the polluters arrive at the same level as at the beginning of 2014. The non-polluters, on
the other hand, even experience significantly lower yields than polluters after the signing of
the agreement.
Figure 14.1: The figure shows the coefficient estimates and 95 percent confidence intervals for OLS regressions estimating the treatment effect of the Paris Agreement on bond yields (in percent) for polluters and non-polluters. The horizontal line represents the benchmark yield of polluters before the Paris Agreement. The vertical line indicates the date of the agreement (December 12, 2015).
Instead of plotting both groups using the simple model approach, we can also use the fixed-effects model and focus on the polluters' yield response to the signing relative to the non-polluters. To perform this estimation, we need to replace the treated indicator with
separate time dummies for the polluters, each marking a one-month period relative to the
treatment date.
bonds_panel_alt = (bonds_panel
.assign(
diff_to_treatment=lambda x: (
np.round(
((x["month"]-(treatment_date-
pd.offsets.MonthBegin())).dt.days/365)*12, 0
).astype(int)
)
)
)
variables = (bonds_panel_alt
.get(["diff_to_treatment", "month"])
.drop_duplicates()
.sort_values("month")
.copy()
.assign(variable_name=np.nan)
.reset_index(drop=True)
)
In the next code chunk, we assemble the model formula and regress the monthly yields on
the set of time dummies and cusip_id and month fixed effects.
formula = "avg_yield ~ 1 + "
for j in range(variables.shape[0]):
if variables["diff_to_treatment"].iloc[j] != 0:
old_names=list(bonds_panel_alt.columns)
bonds_panel_alt["new_var"] = (
bonds_panel_alt["diff_to_treatment"] ==
variables["diff_to_treatment"].iloc[j]
) & bonds_panel_alt["polluter"]
diff_to_treatment_value=variables["diff_to_treatment"].iloc[j]
direction="lag" if diff_to_treatment_value < 0 else "lead"
abs_diff_to_treatment=int(abs(diff_to_treatment_value))
new_var_name=f"{direction}{abs_diff_to_treatment}"
variables.at[j, "variable_name"]=new_var_name
bonds_panel_alt[new_var_name]=bonds_panel_alt["new_var"]
formula += (f" + {new_var_name}" if j > 0 else new_var_name)
model_with_fe_time = (lm.PanelOLS.from_formula(
formula=formula,
data=bonds_panel_alt.set_index(["cusip_id", "month"]))
.fit()
.summary
)
We then collect the regression results into a dataframe that contains the estimates and
corresponding 95 percent confidence intervals. Note that we also add a row with zeros for
the (omitted) reference point of the time dummies.
lag0_row = pd.DataFrame({
"term": ["lag0"],
"estimate": [0],
"ci_1": [0],
"ci_2": [0],
"ci_up": [0],
"ci_low": [0],
"month": [treatment_date - pd.offsets.MonthBegin()]
})
model_with_fe_time_coefs = (
pd.DataFrame(model_with_fe_time.tables[1].data[1:],
columns=["term", "estimate", "std_error",
"t_stat", "p_value", "ci_1", "ci_2"])
.assign(term=lambda x: x["term"].str.replace("[T.True]", ""))
.assign(estimate=lambda x: x["estimate"].astype(float),
std_error=lambda x: x["std_error"].astype(float))
  .assign(ci_up=lambda x: x["estimate"]+norm.ppf(0.975)*x["std_error"],
          ci_low=lambda x: x["estimate"]+norm.ppf(0.025)*x["std_error"])
)
model_with_fe_time_coefs = pd.concat(
[model_with_fe_time_coefs, lag0_row],
ignore_index=True
)
The resulting graph shown in Figure 14.2 confirms the main conclusion of the previous figure: polluters' yield patterns show a considerable anticipation effect starting toward the end of 2014. Yields only marginally increase after the signing of the agreement. However, as opposed to the simple model, we do not see a complete reversal back to the pre-agreement level. Yields of polluters stay at a significantly higher level even one year after the signing.
Notice that during the year after the PA was signed, the 45th president of the United States
was elected (on November 8, 2016). During his campaign there were some indications of
intentions to withdraw the US from the PA, which ultimately happened on November 4,
2020. Hence, reversal effects are potentially driven by these actions.
14.4 Exercises
1. The 46th president of the US rejoined the Paris Agreement on February 19, 2021.
Repeat the difference in differences analysis for the day of his election victory.
Note that you will also have to download new TRACE data. How did polluters’
yields react to this action?
2. Based on the exercise on ratings in Chapter 5, include ratings as a control variable
in the analysis above. Do the results change?
Figure 14.2: The figure shows the coefficient estimates and 95 percent confidence intervals
for OLS regressions estimating the treatment effect of the Paris Agreement on bond yields
(in percent) for polluters. The horizontal line represents the benchmark yield of polluters
before the Paris Agreement. The vertical line indicates the date of the agreement (December
12, 2015).
15
Factor Selection via Machine Learning
The aim of this chapter is twofold. From a data science perspective, we introduce
scikit-learn, a collection of packages for modeling and machine learning (ML).
scikit-learn comes with a handy workflow for all sorts of typical prediction tasks. From
a finance perspective, we address the notion of the factor zoo (Cochrane, 2011) using ML methods. We introduce Lasso, Ridge, and Elastic Net regressions as special cases of penalized regression models. Then, we explain the concept of cross-validation for model tuning with
Elastic Net regularization as a popular example. We implement and showcase the entire
cycle from model specification, training, and forecast evaluation within the scikit-learn
universe. While the tools can generally be applied to an abundance of interesting asset pric-
ing problems, we apply penalized regressions for identifying macroeconomic variables and
asset pricing factors that help explain a cross-section of industry portfolios.
In previous chapters, we illustrate that stock characteristics such as size provide valuable
pricing information in addition to the market beta. Such findings question the usefulness
of the Capital Asset Pricing Model. In fact, during the last decades, financial economists
discovered a plethora of additional factors which may be correlated with the marginal
utility of consumption (and would thus deserve a prominent role in pricing applications).
The search for factors that explain the cross-section of expected stock returns has produced
hundreds of potential candidates, as noted more recently by Harvey et al. (2016), Harvey (2017), McLean and Pontiff (2016), and Hou et al. (2020). Therefore, given the multitude of proposed risk factors, the challenge these days is rather: Do we believe in the relevance of hundreds of risk factors? During recent years, promising methods from the field of ML have been applied to common finance applications. We refer to Mullainathan and Spiess (2017) for a
treatment of ML from the perspective of an econometrician, Nagel (2021) for an excellent
review of ML practices in asset pricing, Easley et al. (2020) for ML applications in (high-
frequency) market microstructure, and Dixon et al. (2020) for a detailed treatment of all
methodological aspects.
To set the stage, we start with the definition of a linear model: Suppose we have data $(y_t, x_t)$, $t = 1, \dots, T$, where $x_t$ is a $(K \times 1)$ vector of regressors and $y_t$ is the response for observation $t$. The linear model takes the form $y_t = \beta' x_t + \varepsilon_t$ with some error term $\varepsilon_t$ and has been studied in abundance. For $K < T$, the well-known ordinary least squares (OLS) estimator for the $(K \times 1)$ vector $\beta$ minimizes the sum of squared residuals and is then

$$\hat{\beta}^{\text{ols}} = \left(\sum_{t=1}^{T} x_t x_t'\right)^{-1} \sum_{t=1}^{T} x_t y_t. \qquad (15.1)$$

While we are often interested in the estimated coefficient vector $\hat{\beta}^{\text{ols}}$, ML is most of the time about predictive performance. For a new observation $\tilde{x}_t$, the linear model generates predictions such that

$$\hat{y}_t = E(y \mid x = \tilde{x}_t) = \hat{\beta}^{\text{ols}\,\prime}\,\tilde{x}_t. \qquad (15.2)$$
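To make equations (15.1) and (15.2) concrete, the following minimal numpy sketch (our own illustration with simulated data, not part of the chapter's dataset) computes the OLS estimator and a prediction for a new observation.

import numpy as np

# simulate a small dataset consistent with the linear model above (illustrative numbers)
rng = np.random.default_rng(42)
T, K = 200, 3
X = rng.normal(size=(T, K))
beta = np.array([0.5, -1.0, 0.25])
y = X@beta+rng.normal(scale=0.1, size=T)

# OLS estimator from equation (15.1)
beta_ols = np.linalg.solve(X.T@X, X.T@y)

# prediction for a new observation as in equation (15.2)
x_new = rng.normal(size=K)
y_hat = x_new@beta_ols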
Is this the best we can do? Not necessarily: instead of minimizing the sum of squared residuals, penalized linear models can improve predictive performance by choosing other estimators $\hat{\beta}$ with lower variance than the estimator $\hat{\beta}^{\text{ols}}$. At the same time, it seems appealing to restrict the set of regressors to a few meaningful ones, if possible. In other words, if $K$ is large (such as for the number of proposed factors in the asset pricing literature), it may be a desirable feature to select reasonable factors and set $\hat{\beta}_k = 0$ for some redundant factors.
It should be clear that the promised benefits of penalized regressions, i.e., reducing the mean squared error (MSE), come at a cost. In most cases, reducing the variance of the estimator introduces a bias such that $E(\hat{\beta}) \neq \beta$. What is the effect of such a bias-variance trade-off? To understand the implications, assume the following data-generating process for $y$:

$$y = f(x) + \varepsilon, \qquad \varepsilon \sim (0, \sigma_\varepsilon^2). \qquad (15.3)$$

We want to recover $f(x)$, which denotes some unknown functional form that maps the relationship between $x$ and $y$. While the properties of $\hat{\beta}^{\text{ols}}$ as an unbiased estimator may be desirable under some circumstances, they are certainly not if we consider predictive accuracy. Alternative predictors $\hat{f}(x)$ could be more desirable: For instance, the MSE depends on our model choice as follows:
$$\begin{aligned}
MSE &= E\big((y - \hat{f}(x))^2\big) = E\big((f(x) + \varepsilon - \hat{f}(x))^2\big)\\
&= \underbrace{E\big((f(x) - \hat{f}(x))^2\big)}_{\text{total quadratic error}} + \underbrace{E(\varepsilon^2)}_{\text{irreducible error}}\\
&= E\big(\hat{f}(x)^2\big) + E\big(f(x)^2\big) - 2\,E\big(f(x)\hat{f}(x)\big) + \sigma_\varepsilon^2\\
&= \underbrace{\text{Var}\big(\hat{f}(x)\big)}_{\text{variance of model}} + \underbrace{\big(E(\hat{f}(x)) - f(x)\big)^2}_{\text{squared bias}} + \sigma_\varepsilon^2.
\end{aligned} \qquad (15.4)$$

While no model can reduce $\sigma_\varepsilon^2$, a biased estimator with small variance may have a lower MSE than an unbiased estimator.
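The following small simulation is our own illustration of this point (it is not part of the chapter's analysis): it compares the estimation MSE of OLS with that of a deliberately biased estimator that shrinks the OLS coefficients halfway toward zero. With weak signals relative to the noise, the shrunken estimator achieves a lower MSE despite its bias.

import numpy as np

rng = np.random.default_rng(1)
T, K, n_sim = 50, 10, 2000
beta = np.full(K, 0.2)  # weak signals relative to unit-variance noise
mse_ols = mse_shrunk = 0.0
for _ in range(n_sim):
    X = rng.normal(size=(T, K))
    y = X@beta+rng.normal(size=T)
    beta_ols = np.linalg.solve(X.T@X, X.T@y)
    beta_shrunk = 0.5*beta_ols  # biased toward zero, but with lower variance
    mse_ols += np.sum((beta_ols-beta)**2)/n_sim
    mse_shrunk += np.sum((beta_shrunk-beta)**2)/n_sim
print(mse_ols, mse_shrunk)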
$$\hat{\beta}^{\text{ridge}} = (X'X + \lambda I)^{-1} X'y, \qquad (15.5)$$

where $I$ is the identity matrix of dimension $K$. A couple of observations are worth noting: $\hat{\beta}^{\text{ridge}} = \hat{\beta}^{\text{ols}}$ for $\lambda = 0$ and $\hat{\beta}^{\text{ridge}} \to 0$ for $\lambda \to \infty$. Also, for $\lambda > 0$, $(X'X + \lambda I)$ is non-singular even if $X'X$ is not, which means that $\hat{\beta}^{\text{ridge}}$ exists even if $\hat{\beta}^{\text{ols}}$ is not defined. However, note also that the Ridge estimator requires a careful choice of the hyperparameter $\lambda$, which controls the amount of regularization: a larger value of $\lambda$ implies more shrinkage of the regression coefficients toward zero; a smaller value of $\lambda$ reduces the bias of the resulting estimator.

Note that $X$ usually contains an intercept column with ones. As a general rule, the associated intercept coefficient is not penalized. In practice, this often implies that $y$ is simply demeaned before computing $\hat{\beta}^{\text{ridge}}$.

What about the statistical properties of the Ridge estimator? First, the bad news is that $\hat{\beta}^{\text{ridge}}$ is a biased estimator of $\beta$. However, the good news is that (under homoscedastic error terms) the variance of the Ridge estimator is guaranteed to be smaller than the variance of the OLS estimator. We encourage you to verify these two statements in the exercises. As a result, we face a trade-off: The Ridge regression sacrifices some unbiasedness to achieve a smaller variance than the OLS estimator.
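To make the closed-form expression in equation (15.5) tangible, here is a minimal numpy sketch; the simulated data and the helper name ridge_estimator are our own illustrative choices and not part of the chapter's scikit-learn workflow.

import numpy as np

def ridge_estimator(X, y, lam):
    """Closed-form Ridge estimator from equation (15.5): (X'X + lam*I)^{-1} X'y."""
    K = X.shape[1]
    return np.linalg.solve(X.T@X+lam*np.eye(K), X.T@y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X@np.array([1.0, 0.5, 0.0, -0.5, 2.0])+rng.normal(scale=0.2, size=100)
print(ridge_estimator(X, y, lam=0.01))   # close to the OLS solution
print(ridge_estimator(X, y, lam=100.0))  # strongly shrunken toward zero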
15.1.2 Lasso
An alternative to Ridge regression is the Lasso (least absolute shrinkage and selection
operator). Similar to Ridge regression, the Lasso (Tibshirani, 1996) is a penalized and
biased estimator. The main difference to Ridge regression is that Lasso does not only shrink
coefficients but effectively selects variables by setting coefficients for irrelevant variables to
zero. The Lasso implements an $L_1$ penalization on the parameters such that:

$$\hat{\beta}^{\text{Lasso}} = \arg\min_\beta \;(Y - X\beta)'(Y - X\beta) \quad \text{s.t.} \quad \sum_{k=1}^{K} |\beta_k| < c(\lambda). \qquad (15.6)$$

There is no closed-form solution for $\hat{\beta}^{\text{Lasso}}$ in the above minimization problem, but efficient algorithms exist (e.g., the glmnet package for R and Python). Like for Ridge regression, the hyperparameter $\lambda$ has to be specified beforehand. The corresponding Lagrangian reads as follows:

$$\hat{\beta}_\lambda^{\text{Lasso}} = \arg\min_\beta \;(Y - X\beta)'(Y - X\beta) + \lambda \sum_{k=1}^{K} |\beta_k|. \qquad (15.7)$$
Now, we have to choose two hyperparameters: the shrinkage factor $\lambda$ and the weighting parameter $\rho$. The Elastic Net resembles the Lasso for $\rho = 0$ and Ridge regression for $\rho = 1$. While
the glmnet package provides efficient algorithms to compute the coefficients of penalized
regressions, it is a good exercise to implement Ridge and Lasso estimation on your own
before you use the scikit-learn back-end.
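As one possible starting point for such an exercise, the sketch below applies plain coordinate descent with soft-thresholding to the Lasso Lagrangian in equation (15.7). It is a simplified illustration under our own assumptions (no intercept, no convergence check), not the algorithm used by glmnet or scikit-learn.

import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator used in coordinate descent."""
    return np.sign(z)*np.maximum(np.abs(z)-gamma, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Coordinate descent for (Y-Xb)'(Y-Xb) + lam*sum(|b_k|) without an intercept."""
    K = X.shape[1]
    beta = np.zeros(K)
    for _ in range(n_iter):
        for k in range(K):
            # partial residual that excludes the contribution of feature k
            r_k = y-X@beta+X[:, k]*beta[k]
            z_k = X[:, k]@r_k
            beta[k] = soft_threshold(z_k, lam/2)/(X[:, k]@X[:, k])
    return beta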
• Monthly q-factor returns from Hou et al. (2014). The factors contain the size factor, the
investment factor, the return-on-equity factor, and the expected growth factor.
Next, we include macroeconomic predictors that may predict the general stock market. Macroeconomic variables effectively serve as conditioning information such that
their inclusion hints at the relevance of conditional models instead of unconditional asset
pricing. We refer the interested reader to Cochrane (2009) on the role of conditioning infor-
mation.
• Our set of macroeconomic predictors comes from Welch and Goyal (2008). The data
has been updated by the authors until 2021 and contains monthly variables that have
been suggested as good predictors for the equity premium. Some of the variables are the
dividend price ratio, earnings price ratio, stock variance, net equity expansion, treasury
bill rate, and inflation.
Finally, we need a set of test assets. The aim is to understand which of the many factor and macroeconomic variable combinations prove helpful in explaining our test assets' cross-section of returns. In line with many existing papers, we use monthly portfolio returns from
ten different industries according to the definition from Kenneth French’s homepage1 as
test assets.
tidy_finance = sqlite3.connect(database="data/tidy_finance_python.sqlite")
factors_ff3_monthly = (pd.read_sql_query(
sql="SELECT * FROM factors_ff3_monthly",
con=tidy_finance,
parse_dates={"month"})
.add_prefix("factor_ff_")
)
factors_q_monthly = (pd.read_sql_query(
sql="SELECT * FROM factors_q_monthly",
con=tidy_finance,
parse_dates={"month"})
.add_prefix("factor_q_")
)
macro_predictors = (pd.read_sql_query(
sql="SELECT * FROM macro_predictors",
con=tidy_finance,
parse_dates={"month"})
.add_prefix("macro_")
)
industries_ff_monthly = (pd.read_sql_query(
sql="SELECT * FROM industries_ff_monthly",
con=tidy_finance,
parse_dates={"month"})
.melt(id_vars="month", var_name="industry", value_name="ret")
)
1 https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/Data_Library/det_10_ind_port.html
Our data contains 22 columns of regressors with the 13 macro-variables and 8 factor returns
for each month. Figure 15.1 provides summary statistics for the 10 monthly industry excess
returns in percent. One can see that the dispersion in the excess returns varies widely across
industries.
data_plot = (ggplot(data,
aes(x="industry", y="ret_excess")) +
geom_boxplot() +
coord_flip() +
labs(x="", y="",
title="Excess return distributions by industry in percent") +
scale_y_continuous(labels=percent_format())
)
data_plot.draw()
Figure 15.1: The box plots show the monthly dispersion of returns for 10 different industries.
new_column_values = []
for macro_column, factor_column in column_combinations:
new_column_values.append(data[macro_column] * data[factor_column])
preprocessor = ColumnTransformer(
  transformers=[
    ("scale", StandardScaler(),
     # column selection reconstructed: scale all regressors and interaction terms,
     # i.e., every column except the response and the identifiers (an assumption)
     [col for col in data.columns if col not in ["ret_excess", "month", "industry"]])
  ],
  remainder="drop"
)
$$\hat{\beta}_\lambda^{\text{Lasso}} = \arg\min_\beta \;(Y - X\beta)'(Y - X\beta) + \lambda \sum_{k=1}^{K} |\beta_k|. \qquad (15.10)$$
In the application at hand, $X$ contains 104 columns with all possible interactions between factor returns and macroeconomic variables. We want to emphasize that the workflow for any model is very similar, irrespective of the specific model. As you will see further below, it is straightforward to fit Ridge regression coefficients and, later, neural networks or random forests with similar code. For now, we start with the linear regression model with an arbitrarily chosen value for the penalty factor $\lambda$ (denoted as alpha=0.007 in the code below). In the setup below, l1_ratio denotes the value of $1 - \rho$, hence setting l1_ratio=1 implies the Lasso.
lm_model = ElasticNet(
alpha=0.007,
l1_ratio=1,
max_iter=5000,
fit_intercept=False
)
lm_pipeline = Pipeline([
("preprocessor", preprocessor),
("regressor", lm_model)
])
That's it, we are done! The object lm_pipeline contains the definition of our model with all required information, in particular the pre-processing steps and the regression model.
data_manufacturing_training = (data_manufacturing
.query(f"month<'{training_date}'")
)
lm_fit = lm_pipeline.fit(
data_manufacturing_training,
data_manufacturing_training.get("ret_excess")
)
First, we focus on the in-sample predicted values $\hat{y}_t = x_t'\hat{\beta}^{\text{Lasso}}$. Figure 15.2 illustrates the projections for the entire time series of the manufacturing industry portfolio returns.
predicted_values = (pd.DataFrame({
"Fitted value": lm_fit.predict(data_manufacturing),
"Realization": data_manufacturing.get("ret_excess")
})
.assign(month = data_manufacturing["month"])
.melt(id_vars="month", var_name="Variable", value_name="return")
)
predicted_values_plot = (
ggplot(predicted_values,
aes(x="month", y="return",
color="Variable", linetype="Variable")) +
annotate(
"rect",
xmin=data_manufacturing_training["month"].max(),
xmax=data_manufacturing["month"].max(),
ymin=-np.inf, ymax=np.inf,
alpha=0.25, fill="#808080"
) +
geom_line() +
labs(x="", y="", color="", linetype="",
title="Monthly realized and fitted manufacturing risk premia") +
scale_x_datetime(breaks=date_breaks("5 years"),
labels=date_format("%Y")) +
scale_y_continuous(labels=percent_format())
)
predicted_values_plot.draw()
What do the estimated coefficients look like? To analyze these values, it is worth computing the coefficients $\hat{\beta}^{\text{Lasso}}$ directly. The code below estimates the coefficients for the Lasso and Ridge regressions for the processed training data sample for a grid of different $\lambda$'s.
x = preprocessor.fit_transform(data_manufacturing)
y = data_manufacturing["ret_excess"]
coefficients_lasso = []
for a in alphas:
lasso = Lasso(alpha=a, fit_intercept=False)
coefficients_lasso.append(lasso.fit(x, y).coef_)
Figure 15.2: The figure shows monthly realized and fitted manufacturing industry risk pre-
mium. The grey area corresponds to the out of sample period.
coefficients_lasso = (pd.DataFrame(coefficients_lasso)
.assign(alpha=alphas, model="Lasso")
.melt(id_vars=["alpha", "model"])
)
coefficients_ridge = []
for a in alphas:
ridge = Ridge(alpha=a, fit_intercept=False)
coefficients_ridge.append(ridge.fit(x, y).coef_)
coefficients_ridge = (pd.DataFrame(coefficients_ridge)
.assign(alpha=alphas, model="Ridge")
.melt(id_vars=["alpha", "model"])
)
Figure 15.3: The figure shows estimated coefficient paths for different penalty factors. The
penalty parameters are chosen iteratively to resemble the path from no penalization to a
model that excludes all variables.
How should you pick $K$? Large values of $K$ are preferable because the training data better imitates the original data. However, larger values of $K$ come with much higher computation time. scikit-learn provides all required tools to conduct $K$-fold cross-validation. We just have to update our model specification. In our case, we specify the penalty factor $\lambda$ as well as the mixing factor $\rho$ as free parameters.
For our sample, we consider a time-series cross-validation scheme. This means that we tune our models on a sequence of training samples of five years of data, each followed by a validation period of four years. For a grid of possible hyperparameters, we then fit the model for each fold and evaluate the mean squared prediction error (MSPE) in the corresponding validation set. Finally, we select the model specification with the lowest MSPE in the validation set. First, we define the cross-validation folds based on our training data only.
Then, we evaluate the performance for a grid of different penalty values. scikit-learn provides functionalities to construct a suitable grid of hyperparameters with GridSearchCV(). The code chunk below creates a grid that combines the penalty values with the three mixing parameters. Then, the method fit() evaluates all the models for each fold.
initial_years = 5
assessment_months = 48
n_splits = int(len(data_manufacturing)/assessment_months) - 1
length_of_year = 12
alphas = np.logspace(-6, 2, 100)
data_folds = TimeSeriesSplit(
n_splits=n_splits,
test_size=assessment_months,
max_train_size=initial_years * length_of_year
)
params = {
"regressor__alpha": alphas,
"regressor__l1_ratio": (0.0, 0.5, 1)
}
finder = GridSearchCV(
lm_pipeline,
param_grid=params,
scoring="neg_root_mean_squared_error",
cv=data_folds
)
finder = finder.fit(
data_manufacturing, data_manufacturing.get("ret_excess")
)
After the tuning process, we collect the evaluation metrics (the root mean-squared error in our example) to identify the optimal model. Figure 15.4 illustrates the average validation set's root mean-squared error for each value of $\lambda$ and $\rho$.
validation = (pd.DataFrame(finder.cv_results_)
.assign(
mspe=lambda x: -x["mean_test_score"],
param_regressor__alpha=lambda x: pd.to_numeric(
x["param_regressor__alpha"], errors="coerce"
)
)
)
validation_plot = (ggplot(validation,
aes(x="param_regressor__alpha", y="mspe",
color="param_regressor__l1_ratio",
shape="param_regressor__l1_ratio",
group="param_regressor__l1_ratio")) +
geom_point() +
geom_line() +
labs(x ="Penalty factor (lambda)", y="Root MSPE",
title="Root MSPE for different penalty factors",
color="Proportion of Lasso Penalty",
shape="Proportion of Lasso Penalty") +
scale_x_log10() +
guides(linetype="none")
)
validation_plot.draw()
Figure 15.4 shows that the MSPE drops faster for the Lasso and Elastic Net compared to Ridge regressions as the penalty factor increases. However, for higher penalty factors, the MSPE for
Ridge regressions dips below the others, which both slightly increase again above a certain
threshold. Recall that the larger the regularization, the more restricted the model becomes.
The best-performing model yields a penalty parameter (alpha) of 0.0063 and a mixing factor ($\rho$) of 0.5.
Figure 15.4: The figure shows root MSPE for different penalty factors. Evaluation of man-
ufacturing excess returns for different penalty factors (lambda) and proportions of Lasso
penalty (rho). 1.0 indicates Lasso, 0.5 indicates Elastic Net, and 0.0 indicates Ridge.
we choose a simplified approach: To illustrate the entire workflow, we now run the penalized
regressions for all ten industries. We want to identify relevant variables by fitting Lasso
models for each industry's return time series. More specifically, we perform cross-validation for each industry to identify the optimal penalty factor $\lambda$.
First, we define the Lasso model with one tuning parameter.
lm_model = Lasso(fit_intercept=False, max_iter=5000)
lm_pipeline = Pipeline([
("preprocessor", preprocessor),
("regressor", lm_model)
])
The following task can be easily parallelized to reduce computing time, but we use a simple
loop for ease of exposition.
all_industries = data["industry"].drop_duplicates()
results = []
for industry in all_industries:
print(industry)
finder = GridSearchCV(
lm_pipeline,
param_grid=params,
scoring="neg_mean_squared_error",
cv=data_folds
)
finder = finder.fit(
data.query("industry == @industry"),
data.query("industry == @industry").get("ret_excess")
)
results.append(
pd.DataFrame(finder.best_estimator_.named_steps.regressor.coef_ != 0)
)
selected_factors = (
pd.DataFrame(
lm_pipeline[:-1].get_feature_names_out(),
columns=["variable"]
)
.assign(variable = lambda x: (
x["variable"].str.replace("factor_|ff_|q_|macro_",""))
)
.assign(**dict(zip(all_industries, results)))
.melt(id_vars="variable", var_name ="industry")
.query("value == True")
)
What has just happened? In principle, exactly the same as before, but instead of computing the Lasso coefficients for one industry, we did it for all ten industries sequentially. Now, we just have to do some housekeeping and keep only the variables that the Lasso does not set to zero. We illustrate the results in a heat map in Figure 15.5.
selected_factors_plot = (
ggplot(selected_factors,
aes(x="variable", y="industry")) +
geom_tile() +
labs(x="", y="",
title="Selected variables for different industries") +
coord_flip() +
scale_x_discrete(limits=reversed) +
theme(axis_text_x=element_text(rotation=70, hjust=1),
figure_size=(6.4, 6.4))
)
selected_factors_plot.draw()
The heat map in Figure 15.5 conveys two main insights: First, we see that many factors, macroeconomic variables, and interaction terms are not relevant for explaining the cross-section of returns across the industry portfolios. In fact, only factor_ff_mkt_excess and its interaction with macro_bm play a role for several industries. Second, there seems to be quite some heterogeneity across different industries. While barely any variable is selected by the Lasso for Utilities, many factors are selected for, e.g., Durable and Manufacturing, but the selected factors do not necessarily coincide. In other words, there seems to be a clear picture that we do not need many factors, but the Lasso does not identify a single factor that consistently provides pricing ability across industries.
15.5 Exercises
1. Write a function that requires three inputs, namely, y (a $T$ vector), X (a $(T \times K)$ matrix), and lambda and then returns the Ridge estimator (a $K$ vector) for a given penalization parameter $\lambda$. Recall that the intercept should not be penalized. Therefore, your function should indicate whether $X$ contains a vector of ones as the first column, which should be exempt from the $L_2$ penalty.
2. Compute the $L_2$ norm ($\beta'\beta$) for the regression coefficients based on the predictive regression from the previous exercise for a range of $\lambda$'s and illustrate the effect of penalization in a suitable figure.
3. Now, write a function that requires three inputs, namely, y (a $T$ vector), X (a $(T \times K)$ matrix), and $\lambda$ and then returns the Lasso estimator (a $K$ vector) for a given penalization parameter $\lambda$. Recall that the intercept should not be penalized. Therefore, your function should indicate whether $X$ contains a vector of ones as the first column, which should be exempt from the $L_1$ penalty.
4. After you understand what Ridge and Lasso regressions are doing, familiarize yourself with the glmnet package's documentation. It is a thoroughly tested and well-established package that provides efficient code to compute the penalized regression coefficients for Ridge and Lasso and for combinations, commonly called Elastic Nets.
Figure 15.5: The figure shows selected variables for different industries. Dark areas indicate
that the estimated Lasso regression coefficient is not set to zero. White fields show which
variables get assigned a value of exactly zero.
16
Option Pricing via Machine Learning
This chapter covers machine learning methods in option pricing. First, we briefly introduce
regression trees, random forests, and neural networks; these methods are advocated as highly
flexible universal approximators, capable of recovering highly non-linear structures in the
data. As the focus is on implementation, we leave a thorough treatment of the statistical
underpinnings to other textbooks from authors with a real comparative advantage on these
issues. We show how to implement random forests and deep neural networks with tidy
principles using scikit-learn.
Machine learning (ML) is seen as a part of artificial intelligence. ML algorithms build
a model based on training data in order to make predictions or decisions without being
explicitly programmed to do so. While ML can be specified along a vast array of different
branches, this chapter focuses on so-called supervised learning for regressions. The basic idea
of supervised learning algorithms is to build a mathematical model for data that contains
both the inputs and the desired outputs. In this chapter, we apply well-known methods
such as random forests and neural networks to a simple application in option pricing. More
specifically, we create an artificial dataset of option prices for different values based on the
Black-Scholes pricing equation for call options. Then, we train different models to learn how
to price call options without prior knowledge of the theoretical underpinnings of the famous
option pricing equation by Black and Scholes (1973).
Throughout this chapter, we need the following Python packages.
import pandas as pd
import numpy as np
from itertools import product
from scipy.stats import norm
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
$$R_1(j, s) = \{x \mid x_j < s\} \quad \text{and} \quad R_2(j, s) = \{x \mid x_j \geq s\}. \qquad (16.1)$$

To pick $j$ and $s$, we find the pair that minimizes the residual sum of squares (RSS):

$$\sum_{i:\, x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i:\, x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2. \qquad (16.2)$$
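For readers who want to see the split search spelled out, the following sketch (our own illustration, closely related to the exercises at the end of this chapter) brute-forces the pair (j, s) that minimizes the RSS in equation (16.2).

import numpy as np

def best_single_split(X, y):
    """Search over all features j and observed split points s for the lowest RSS."""
    best_j, best_s, best_rss = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left = X[:, j] < s
            right = ~left
            if left.sum() == 0 or right.sum() == 0:
                continue  # skip splits that leave one region empty
            rss = np.sum((y[left]-y[left].mean())**2)+np.sum((y[right]-y[right].mean())**2)
            if rss < best_rss:
                best_j, best_s, best_rss = j, s, rss
    return best_j, best_s, best_rss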
As in Chapter 15 in the context of penalized regressions, the first relevant question is:
What are the hyperparameter decisions? Instead of a regularization parameter, trees are
fully determined by the number of branches used to generate a partition (sometimes one
specifies the minimum number of observations in each final branch instead of the maximum
number of branches).
Models with a single tree may suffer from high predictive variance. Random forests address
these shortcomings of decision trees. The goal is to improve the predictive performance and
reduce instability by averaging multiple regression trees. A forest basically implies creating
many regression trees and averaging their predictions. To assure that the individual trees
are not the same, we use a bootstrap to induce randomness. More specifically, we build 𝐵𝐵
decision trees 𝑇𝑇1 , … , 𝑇𝑇𝐵𝐵 using the training sample. For that purpose, we randomly select
features to be included in the building of each tree. For each observation in the test set, we then form a prediction $\hat{y} = \frac{1}{B}\sum_{i=1}^{B} \hat{y}_{T_i}$.
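The averaging logic can be sketched in a few lines. The helper below is our own illustration (it expects numpy arrays and uses scikit-learn's DecisionTreeRegressor); in the applications that follow we rely on RandomForestRegressor instead.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def forest_predict(X_train, y_train, X_test, B=100, seed=0):
    """Average the predictions of B trees grown on bootstrap samples with random feature subsets."""
    rng = np.random.default_rng(seed)
    predictions = np.zeros((B, X_test.shape[0]))
    for b in range(B):
        idx = rng.integers(0, len(y_train), len(y_train))  # bootstrap sample
        tree = DecisionTreeRegressor(max_features="sqrt", random_state=b)
        tree.fit(X_train[idx], y_train[idx])
        predictions[b] = tree.predict(X_test)
    return predictions.mean(axis=0)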
the future data, similar to the fitted value in a regression analysis. Neural networks have
theoretical underpinnings as universal approximators for any smooth predictive association
(Hornik, 1991). Their complexity, however, ranks neural networks among the least transpar-
ent, least interpretable, and most highly parameterized ML tools. In finance, applications
of neural networks can be found in many different contexts, e.g., Avramov et al. (2022),
Chen et al. (2023), and Gu et al. (2020).
Each neuron applies a non-linear activation function $f$ to its aggregated signal before sending its output to the next layer

$$x_k^l = f\left(\theta_k^0 + \sum_{j=1}^{N^l} z_j \theta_{k,j}^l\right). \qquad (16.3)$$

Here, $\theta$ are the parameters to fit, $N^l$ denotes the number of units (a hyperparameter to tune), and $z_j$ are the input variables, which can be either the raw data or, in the case of multiple chained layers, the outcome from a previous layer $z_j = x_j^{l-1}$. While the easiest case with $f(x) = \alpha + \lambda x$ resembles linear regression, typical activation functions are the sigmoid (i.e., $f(x) = (1 + e^{-x})^{-1}$) or the ReLU (i.e., $f(x) = \max(x, 0)$).
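Translating equation (16.3) into code is short; the function names below are our own illustration of a single unit, not the implementation used by scikit-learn.

import numpy as np

def sigmoid(x):
    return 1/(1+np.exp(-x))

def relu(x):
    return np.maximum(x, 0)

def neuron_output(z, theta_0, theta, f=relu):
    """One unit as in equation (16.3): apply f to the affine aggregation of its inputs."""
    return f(theta_0+z@theta)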
Neural networks gain their flexibility from chaining multiple layers together. Naturally, this
imposes many degrees of freedom on the network architecture for which no clear theoretical
guidance exists. The specification of a neural network requires, at a minimum, a stance
on depth (number of hidden layers), the activation function, the number of neurons, the
connection structure of the units (dense or sparse), and the application of regularization
techniques to avoid overfitting. Finally, learning means to choose optimal parameters relying
on numerical optimization, which often requires specifying an appropriate learning rate.
Despite these computational challenges, implementation in Python is not tedious at all because we can rely on high-level interfaces such as scikit-learn.
$$C(S, T) = \Phi(d_1)\,S - \Phi(d_2)\,K e^{-r_f T}, \quad d_1 = \frac{\ln(S/K) + (r_f + \sigma^2/2)\,T}{\sigma\sqrt{T}}, \quad d_2 = d_1 - \sigma\sqrt{T},$$

where $C(S, T)$ is the price of the option as a function of today's stock price of the underlying, $S$, with strike price $K$ and time to maturity $T$, $r_f$ is the risk-free interest rate, and $\sigma$ is the volatility of the underlying stock return. $\Phi$ is the cumulative distribution function of a standard normal random variable.

The Black-Scholes equation provides a way to compute the arbitrage-free price of a call option once the parameters $S$, $K$, $r_f$, $T$, and $\sigma$ are specified (arguably, in a realistic context, all parameters are easy to specify except for $\sigma$, which has to be estimated). A simple Python function allows computing the price as we do below.
def black_scholes_price(S, K, r, T, sigma):
"""Calculate Black Scholes option price."""
d1 = (np.log(S/K)+(r+sigma**2/2)*T)/(sigma*np.sqrt(T))
d2 = d1-sigma*np.sqrt(T)
price = S*norm.cdf(d1)-K*np.exp(-r*T)*norm.cdf(d2)
return price
S = np.arange(40, 61)
K = np.arange(20, 91)
r = np.arange(0, 0.051, 0.01)
T = np.arange(3/12, 2.01, 1/12)
sigma = np.arange(0.1, 0.81, 0.1)
option_prices = pd.DataFrame(
product(S, K, r, T, sigma),
columns=["S", "K", "r", "T", "sigma"]
)
option_prices["black_scholes"] = black_scholes_price(
option_prices["S"].values,
option_prices["K"].values,
option_prices["r"].values,
option_prices["T"].values,
option_prices["sigma"].values
)
option_prices = (option_prices
.assign(
    observed_price=lambda x: (
      x["black_scholes"]+np.random.normal(scale=0.15, size=len(x))
    )
)
)
The code above generates more than 1.5 million parameter constellations (in the definition of the option_prices dataframe). For each of these values, the true prices reflecting the Black-Scholes model are given, and a random innovation term pollutes the observed prices. The intuition of this application is simple: the simulated data provides many observations of option prices, and by using the Black-Scholes equation, we can evaluate the actual predictive performance of an ML method, which would be hard in a realistic context where the actual arbitrage-free price would be unknown.
Next, we split the data into a training set (which contains 1 percent of all the observed option prices) and a test set that will only be used for the final evaluation. Note that the entire grid of possible combinations contains more than 1.5 million different specifications. Thus, the sample to learn the Black-Scholes price contains only 31,489 observations and is therefore relatively small.
train_data, test_data = train_test_split(
option_prices,
test_size=0.01, random_state=random_state
)
We process the training dataset further before we fit the different ML models. We define a ColumnTransformer() that collects all processing steps for that purpose. For our specific case, we want to explain the observed price by the five variables that enter the Black-Scholes equation. The true price (stored in column black_scholes) should obviously not be used to fit the model. The preprocessor also reflects that we standardize all predictors via StandardScaler() to ensure that each variable exhibits a sample average of zero and a sample standard deviation of one.
preprocessor = ColumnTransformer(
transformers=[(
"normalize_predictors",
StandardScaler(),
["S", "K", "r", "T", "sigma"]
)],
remainder="drop"
)
feed-forward neural network. The specification below defines a single layer feed-forward
neural network with ten hidden units. We set the number of training iterations to
max_iter=1000.
max_iter = 1000
nnet_model = MLPRegressor(
hidden_layer_sizes=10,
max_iter=max_iter,
random_state=random_state
)
We can follow the straightforward workflow as in the chapter before: define a pipeline, equip it with the preprocessor, and specify the associated model. Finally, fit the model with the training data.
nnet_pipeline = Pipeline([
("preprocessor", preprocessor),
("regressor", nnet_model)
])
nnet_fit = nnet_pipeline.fit(
train_data.drop(columns=["observed_price"]),
train_data.get("observed_price")
)
One word of caution regarding the training of neural networks: For illustrative purposes, we sequentially update the parameters by reiterating through the data 1,000 times (max_iter=1000). Typically, however, early stopping rules are advised that aim to interrupt the process of updating parameters as soon as the predictive performance on the validation set seems to deteriorate. A detailed discussion of these stopping rules would go beyond the scope of this book.
Once you are familiar with the scikit-learn workflow, it is a piece of cake to fit other models. For instance, the model below initializes a random forest with 50 trees in the ensemble, where we require at least 2,000 observations in each terminal node. The random forests are trained using the function RandomForestRegressor().
rf_model = RandomForestRegressor(
n_estimators=50,
min_samples_leaf=2000,
random_state=random_state
)
Fitting the model follows exactly the same convention as for the neural network before.
rf_pipeline = Pipeline([
("preprocessor", preprocessor),
("regressor", rf_model)
])
rf_fit = rf_pipeline.fit(
train_data.drop(columns=["observed_price"]),
train_data.get("observed_price")
)
deepnnet_pipeline = Pipeline([
("preprocessor", preprocessor),
("regressor", deepnnet_model)
])
deepnnet_fit = deepnnet_pipeline.fit(
train_data.drop(columns=["observed_price"]),
train_data.get("observed_price")
)
lm_fit = lm_pipeline.fit(
train_data.get(["S", "K", "r", "T", "sigma"]),
train_data.get("observed_price")
)
predictive_performance = (pd.concat(
[test_data.reset_index(drop=True),
pd.DataFrame({"Random forest": rf_fit.predict(test_X),
"Single layer": nnet_fit.predict(test_X),
"Deep NN": deepnnet_fit.predict(test_X),
"Lasso": lm_fit.predict(test_X)})
], axis=1)
.melt(
id_vars=["S", "K", "r", "T", "sigma",
"black_scholes", "observed_price"],
var_name="Model",
value_name="Predicted"
)
.assign(
moneyness=lambda x: x["S"]-x["K"],
pricing_error=lambda x: np.abs(x["Predicted"]-x["black_scholes"])
)
)
In the lines above, we use each of the fitted models to generate predictions for the entire
test dataset of option prices. We evaluate the absolute pricing error as one possible measure
of pricing accuracy, defined as the absolute value of the difference between the predicted option price and the theoretically correct option price from the Black-Scholes model. We show the
results graphically in Figure 16.1.
predictive_performance_plot = (
ggplot(predictive_performance,
aes(x="moneyness", y="pricing_error")) +
geom_point(alpha=0.05) +
facet_wrap("Model") +
labs(x="Moneyness (S - K)", y="Absolut prediction error (USD)",
title="Prediction errors of call options for different models") +
theme(legend_position="")
)
predictive_performance_plot.draw()
1. All ML methods seem to be able to price call options after observing the training set.
2. Random forest and the Lasso seem to perform consistently worse in predicting
option prices than the neural networks.
Figure 16.1: The figure shows the absolute prediction error in USD for the different fitted methods. The prediction error is evaluated on a sample of call options that were not used for training.
3. For random forest and Lasso, the average prediction errors increase for far in-the-
money options.
4. The increased complexity of the deep neural network relative to the single-layer
neural network results in lower prediction errors.
16.6 Exercises
1. Write a function that takes y and a matrix of predictors X as inputs and returns a
characterization of the relevant parameters of a regression tree with one branch.
2. Create a function that creates predictions for a new matrix of predictors newX
based on the estimated regression tree.
Part V
Portfolio Optimization
17
Parametric Portfolio Policies
In this chapter, we apply different portfolio performance measures to evaluate and compare
portfolio allocation strategies. For this purpose, we introduce a direct way to estimate
optimal portfolio weights for large-scale cross-sectional applications. More precisely, the
approach of Brandt et al. (2009) proposes to parametrize the optimal portfolio weights as a
function of stock characteristics instead of estimating the stock’s expected return, variance,
and covariances with other stocks in a prior step. We choose weights as a function of
characteristics that maximize the expected utility of the investor. This approach is feasible
for large portfolio dimensions (such as the entire CRSP universe) and has been proposed
by Brandt et al. (2009). See the review paper by Brandt (2010) for an excellent treatment
of related portfolio choice methods.
The current chapter relies on the following set of Python packages:
import pandas as pd
import numpy as np
import sqlite3
import statsmodels.formula.api as smf
from itertools import product, starmap
from scipy.optimize import minimize
Compared to previous chapters, we introduce the scipy.optimize module from the scipy
(Virtanen et al., 2020) for solving optimization problems.
crsp_monthly = (pd.read_sql_query(
sql=("SELECT permno, month, ret_excess, mktcap, mktcap_lag "
"FROM crsp_monthly"),
con=tidy_finance,
parse_dates={"month"})
.dropna()
)
Next, we retrieve some stock characteristics that have been shown to have an effect on the
expected returns or expected variances (or even higher moments) of the return distribution.
In particular, we record the lagged one-year return momentum (momentum_lag), defined as the compounded return between months $t-12$ and $t-1$ for each firm. In finance, momentum is the empirically observed tendency for rising asset prices to rise further and falling prices to keep falling (Jegadeesh and Titman, 1993). The second characteristic is
the firm’s market equity (size_lag), defined as the log of the price per share times the
number of shares outstanding (Banz, 1981). To construct the correct lagged values, we use
the approach introduced in Chapter 4.
crsp_monthly_lags = (crsp_monthly
.assign(month=lambda x: x["month"]+pd.DateOffset(months=13))
.get(["permno", "month", "mktcap"])
)
crsp_monthly = (crsp_monthly
.merge(crsp_monthly_lags,
how="inner", on=["permno", "month"], suffixes=["", "_13"])
)
data_portfolios = (crsp_monthly
.assign(
momentum_lag=lambda x: x["mktcap_lag"]/x["mktcap_13"],
size_lag=lambda x: np.log(x["mktcap_lag"])
)
.dropna(subset=["momentum_lag", "size_lag"])
)
Where do the stock characteristics show up? We parameterize the optimal portfolio weights as a function of the stock characteristics $\hat{x}_{i,t}$ with the following linear specification for the portfolio weights:

$$\omega_{i,t} = \bar{\omega}_{i,t} + \frac{1}{N_t}\,\theta'\hat{x}_{i,t}, \qquad (17.2)$$

where $\bar{\omega}_{i,t}$ is a stock's weight in a benchmark portfolio (we use the value-weighted or naive portfolio in the application below), $\theta$ is a vector of coefficients which we are going to estimate, and $\hat{x}_{i,t}$ are the characteristics of stock $i$, cross-sectionally standardized to have zero mean and unit standard deviation.
Intuitively, the portfolio strategy is a form of active portfolio management relative to a
performance benchmark. Deviations from the benchmark portfolio are derived from the
individual stock characteristics. Note that, by construction, the weights sum up to one, as $\sum_{i=1}^{N_t}\hat{x}_{i,t} = 0$ due to the standardization. Moreover, the coefficients are constant across
assets and over time. The implicit assumption is that the characteristics fully capture all
aspects of the joint distribution of returns that are relevant for forming optimal portfolios.
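As a quick sanity check of equation (17.2), the following toy example (with made-up numbers) tilts naive benchmark weights by two standardized characteristics and confirms that the resulting weights still sum to one.

import numpy as np

N_t = 4
omega_bar = np.full(N_t, 1/N_t)  # naive benchmark weights
x = np.array([[1.2, -0.5], [-0.3, 1.0], [0.1, -0.8], [-1.0, 0.3]])
x_hat = (x-x.mean(axis=0))/x.std(axis=0)  # cross-sectional standardization
theta = np.array([0.5, -1.0])  # coefficients on the two characteristics
omega = omega_bar+(x_hat@theta)/N_t  # tilted weights from equation (17.2)
print(omega, omega.sum())  # the tilts sum to zero, so the weights sum to one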
We first implement cross-sectional standardization for the entire CRSP universe. We also
keep track of (lagged) relative market capitalization relative_mktcap, which will represent
the value-weighted benchmark portfolio, while n denotes the number of traded assets 𝑁𝑁𝑡𝑡 ,
which we use to construct the naive portfolio benchmark.
data_portfolios = (data_portfolios
.groupby("month")
.apply(lambda x: x.assign(
relative_mktcap=x["mktcap_lag"]/x["mktcap_lag"].sum()
)
)
.reset_index(drop=True)
.set_index("month")
.groupby(level="month")
.transform(
lambda x: (x-x.mean())/x.std() if x.name.endswith("lag") else x
)
.reset_index()
.drop(["mktcap_lag"], axis=1)
)
and the $N_t(N_t + 1)/2$ free elements of the covariance matrix, all we need to focus on in our application is the vector $\theta$, which contains only two elements: the relative deviation from the benchmark due to size and momentum.
To get a feeling for the performance of such an allocation strategy, we start with an arbitrary initial vector $\theta_0$. The next step is to choose $\theta$ optimally to maximize the objective function. We automatically detect the number of parameters by counting the number of columns with lagged values. Note that the value of 1.5 for $\theta$ is an arbitrary choice.
lag_columns = [i for i in data_portfolios.columns if "lag" in i]
n_parameters = len(lag_columns)
theta = pd.DataFrame({"theta": [1.5]*n_parameters}, index=lag_columns)
The following function computes the optimal portfolio weights in the way just described.
def compute_portfolio_weights(theta,
data,
value_weighting=True,
allow_short_selling=True):
"""Compute portfolio weights for different strategies."""
data = (data
.groupby("month")
.apply(lambda x: x.assign(
characteristic_tilt=x[theta.index] @ theta / x.shape[0]
)
)
.reset_index(drop=True)
.assign(
      weight_benchmark=lambda x:
        x["relative_mktcap"] if value_weighting else 1/x["n"],
      # tilted weight from equation (17.2); reconstructed from the chapter's description
      weight_tilt=lambda x: x["weight_benchmark"]+x["characteristic_tilt"]
    )
  )
if not allow_short_selling:
data = (data
.assign(weight_tilt=lambda x: np.maximum(0, x["weight_tilt"]))
)
data = (data
.groupby("month")
.apply(lambda x: x.assign(
weight_tilt=lambda x: x["weight_tilt"]/x["weight_tilt"].sum()))
.reset_index(drop=True)
)
return data
In the next step, we compute the portfolio weights for the arbitrary vector $\theta_0$. In the
example below, we use the value-weighted portfolio as a benchmark and allow negative
portfolio weights.
weights_crsp = compute_portfolio_weights(
theta,
data_portfolios,
value_weighting=True,
allow_short_selling=True
)
$$u_\gamma(r) = \frac{(1 + r)^{1-\gamma}}{1 - \gamma}, \qquad (17.5)$$

where $\gamma$ is the investor's relative risk aversion.

def power_utility(r, gamma=5):
  """Calculate power utility for a return r and risk aversion gamma."""
  # the def line is reconstructed; gamma=5 is an assumed default because the function
  # is called below with the realized return as its only argument
  utility = ((1+r)**(1-gamma))/(1-gamma)
  return utility
We want to note that Gehrig et al. (2020) warn that, in the leading case of constant relative
risk aversion (CRRA), strong assumptions on the properties of the returns, the variables
used to implement the parametric portfolio policy, and the parameter space are necessary
to obtain a well-defined optimization problem.
No doubt, there are many other ways to evaluate a portfolio. The function below provides a
summary of all kinds of interesting measures that can be considered relevant. Do we need all
these evaluation measures? It depends: The original paper by Brandt et al. (2009) only cares about the expected utility to choose $\theta$. However, if you want to choose optimal values that
achieve the highest performance while putting some constraints on your portfolio weights,
it is helpful to have everything in one function.
def evaluate_portfolio(weights_data,
full_evaluation=True,
capm_evaluation=True,
length_year=12):
"""Calculate portfolio evaluation measures."""
evaluation = (weights_data
.groupby("month")
.apply(lambda x: pd.Series(
np.average(x[["ret_excess", "ret_excess"]],
weights=x[["weight_tilt", "weight_benchmark"]],
axis=0),
["return_tilt", "return_benchmark"])
)
.reset_index()
.melt(id_vars="month", var_name="model",
value_vars=["return_tilt", "return_benchmark"],
value_name="portfolio_return")
.assign(model=lambda x: x["model"].str.replace("return_", ""))
)
evaluation_stats = (evaluation
.groupby("model")["portfolio_return"]
.aggregate([
("Expected utility", lambda x: np.mean(power_utility(x))),
("Average return", lambda x: np.mean(length_year*x)*100),
("SD return", lambda x: np.std(x)*np.sqrt(length_year)*100),
("Sharpe ratio", lambda x: (np.mean(x)/np.std(x)*
np.sqrt(length_year)))
])
)
if capm_evaluation:
evaluation_capm = (evaluation
.merge(factors_ff_monthly, how="left", on="month")
.groupby("model")
.apply(lambda x:
smf.ols(formula="portfolio_return ~ 1 + mkt_excess", data=x)
.fit().params
)
.rename(columns={"const": "CAPM alpha",
"mkt_excess": "Market beta"})
)
evaluation_stats = (evaluation_stats
.merge(evaluation_capm, how="left", on="model")
)
if full_evaluation:
evaluation_weights = (weights_data
.melt(id_vars="month", var_name="model",
value_vars=["weight_benchmark", "weight_tilt"],
value_name="weight")
.groupby(["model", "month"])["weight"]
.aggregate([
("Mean abs. weight", lambda x: np.mean(abs(x))),
("Max. weight", lambda x: max(x)),
("Min. weight", lambda x: min(x)),
("Avg. sum of neg. weights", lambda x: -np.sum(x[x < 0])),
("Avg. share of neg. weights", lambda x: np.mean(x < 0))
])
.reset_index()
.drop(columns=["month"])
.groupby(["model"])
.aggregate(lambda x: np.average(x)*100)
.reset_index()
.assign(model=lambda x: x["model"].str.replace("weight_", ""))
)
evaluation_stats = (evaluation_stats
.merge(evaluation_weights, how="left", on="model")
.set_index("model")
)
evaluation_stats = (evaluation_stats
.transpose()
.rename_axis(columns=None)
)
return evaluation_stats
Let us take a look at the different portfolio strategies and evaluation measures.
evaluate_portfolio(weights_crsp).round(2)
benchmark tilt
Expected utility -0.25 -0.26
Average return 6.65 0.19
SD return 15.46 21.11
Sharpe ratio 0.43 0.01
Intercept 0.00 -0.01
Market beta 0.99 0.95
Mean abs. weight 0.03 0.08
Max. weight 4.05 4.22
Min. weight 0.00 -0.17
Avg. sum of neg. weights 0.00 78.03
Avg. share of neg. weights 0.00 49.08
The value-weighted portfolio delivers an annualized return of more than six percent and
clearly outperforms the tilted portfolio, irrespective of whether we evaluate expected util-
ity, the Sharpe ratio, or the CAPM alpha. We can conclude that the market beta is close to
one for both strategies (naturally almost identically one for the value-weighted benchmark
portfolio). When it comes to the distribution of the portfolio weights, we see that the bench-
mark portfolio weight takes less extreme positions (lower average absolute weights and lower
maximum weight). By definition, the value-weighted benchmark does not take any negative
positions, while the tilted portfolio also takes short positions.
def objective_function(theta, data, objective_measure="Expected utility",
                       value_weighting=True, allow_short_selling=True):
  """Compute the negative of the chosen evaluation measure for the tilt portfolio."""
  # The signature, the conversion of theta, and the sign flip are reconstructed from the
  # surrounding text and the minimize() call below; the remaining body follows the original.
  theta = pd.DataFrame(theta, index=lag_columns, columns=["theta"])
  processed_data = compute_portfolio_weights(
    theta, data, value_weighting, allow_short_selling
  )
  objective_function = evaluate_portfolio(
    processed_data,
    capm_evaluation=False,
    full_evaluation=False
  )
  objective_function = -objective_function.loc[objective_measure, "tilt"]
  return objective_function
You may wonder why we return the negative value of the objective function. This is simply
due to the common convention for optimization procedures to search for minima as a default.
By minimizing the negative value of the objective function, we get the maximum value as a
result. In its most basic form, optimization in Python uses the function minimize() from scipy.optimize. As main inputs, the function requires an initial guess of the parameters and the objective function to minimize. Now, we are fully equipped to compute the optimal values of $\hat{\theta}$, which maximize the hypothetical expected utility of the investor.
optimal_theta = minimize(
fun=objective_function,
x0=[1.5]*n_parameters,
args=(data_portfolios, "Expected utility", True, True),
method="Nelder-Mead",
tol=1e-2
)
(pd.DataFrame(
optimal_theta.x,
columns=["Optimal theta"],
index=["momentum_lag", "size_lag"]).T.round(3)
)
momentum_lag size_lag
Optimal theta 0.344 -1.827
The resulting values of $\hat{\theta}$ are easy to interpret: intuitively, expected utility increases by
tilting weights from the value-weighted portfolio toward smaller stocks (negative coefficient
for size) and toward past winners (positive value for momentum). Both findings are in line
with the well-documented size effect (Banz, 1981) and the momentum anomaly (Jegadeesh
and Titman, 1993).
def evaluate_optimal_performance(data, objective_measure,
                                 value_weighting, allow_short_selling):
    """Evaluate the performance of the optimally tilted portfolio."""
    optimal_theta = minimize(
        fun=objective_function,
        x0=[1.5]*n_parameters,
        args=(data, objective_measure, value_weighting, allow_short_selling),
        method="Nelder-Mead",
        tol=10e-2
    ).x

    processed_data = compute_portfolio_weights(
        optimal_theta, data,
        value_weighting, allow_short_selling
    )

    portfolio_evaluation = evaluate_portfolio(processed_data)

    # Column labels encode the benchmark weighting scheme and whether short
    # selling is allowed (the exact label construction is an assumption).
    weight_text = "VW" if value_weighting else "EW"
    short_text = "" if allow_short_selling else " (no s)"

    strategy_name_dict = {
        "benchmark": weight_text,
        "tilt": f"{weight_text} Optimal{short_text}"
    }

    portfolio_evaluation.columns = [
        strategy_name_dict[i] for i in portfolio_evaluation.columns
    ]

    return portfolio_evaluation
Finally, we can compare the results. The table below shows summary statistics for all
possible combinations: equal- or value-weighted benchmark portfolio, with or without short-
selling constraints, and tilted toward maximizing expected utility.
data = [data_portfolios]
value_weighting = [True, False]
allow_short_selling = [True, False]
objective_measure = ["Expected utility"]
permutations = product(
data, objective_measure,
value_weighting, allow_short_selling
)
results = list(starmap(
evaluate_optimal_performance,
permutations
))
performance_table = (pd.concat(results, axis=1)
.T.drop_duplicates().T.round(3)
)
performance_table.get(["EW", "VW"])
                                EW      VW
Expected utility            -0.251  -0.250
Average return              10.009   6.649
SD return                   20.352  15.462
Sharpe ratio                 0.492   0.430
Intercept                    0.002   0.000
Market beta                  1.125   0.994
Mean abs. weight             0.000   0.030
Max. weight                  0.000   4.053
Min. weight                  0.000   0.000
Avg. sum of neg. weights     0.000   0.000
Avg. share of neg. weights   0.000   0.000

                            EW Optimal  VW Optimal
Expected utility                -5.840      -0.261
Average return               -4948.462       0.194
SD return                    14729.881      21.106
Sharpe ratio                    -0.336       0.009
Intercept                       -3.697      -0.005
Market beta                    -78.566       0.952
Mean abs. weight                62.166       0.077
Max. weight                   1119.202       4.216
Min. weight                   -221.305      -0.174
Avg. sum of neg. weights     77916.229      78.025
Avg. share of neg. weights      51.878      49.076
The results indicate that the average annualized Sharpe ratio of the equal-weighted portfolio exceeds the Sharpe ratio of the value-weighted benchmark portfolio. Nevertheless, starting with the value-weighted portfolio as a benchmark and tilting optimally with respect to momentum and small stocks yields the highest Sharpe ratio across all specifications. Finally, allowing short sales, i.e., not imposing short-sale constraints, does not improve the performance of the portfolios in our application.
17.7 Exercises
1. How do the estimated parameters $\hat\theta$ and the portfolio performance change if your objective is to maximize the Sharpe ratio instead of the hypothetical expected utility?
2. The code above is very flexible in the sense that you can easily add new firm characteristics. Construct a new characteristic of your choice and evaluate the corresponding coefficient $\hat\theta_i$.
3. Tweak the function optimal_theta() such that you can impose additional performance constraints in order to determine $\hat\theta$, which maximizes expected utility under the constraint that the market beta is below 1.
4. Does the portfolio performance resemble a realistic out-of-sample backtesting procedure? Verify the robustness of the results by first estimating $\hat\theta$ based on past data only. Then, use more recent periods to evaluate the actual portfolio performance.
5. By formulating the portfolio problem as a statistical estimation problem, you can easily obtain standard errors for the coefficients of the weight function. Brandt et al. (2009) provide the relevant derivations in their paper in Equation (10). Implement a small function that computes standard errors for $\hat\theta$.
18
Constrained Optimization and Backtesting
industry_returns = (pd.read_sql_query(
sql="SELECT * FROM industries_ff_monthly",
con=tidy_finance,
parse_dates={"month"})
.drop(columns=["month"])
)
where $\Sigma$ is the $(N \times N)$ covariance matrix of the returns. The optimal weights $\omega_\text{mvp}$ can be found analytically and are

$$\omega_\text{mvp} = \frac{\Sigma^{-1}\iota}{\iota'\Sigma^{-1}\iota}.$$

In terms of code, the math is equivalent to the following chunk.
n_industries = industry_returns.shape[1]
mu = np.array(industry_returns.mean()).T
sigma = np.array(industry_returns.cov())
w_mvp = np.linalg.inv(sigma) @ np.ones(n_industries)
w_mvp = w_mvp/w_mvp.sum()
weights_mvp = pd.DataFrame({
"Industry": industry_returns.columns.tolist(),
"Minimum variance": w_mvp
})
weights_mvp.round(3)
Next, consider an investor who aims to achieve minimum variance given a required expected portfolio return $\bar\mu$ such that she chooses

$$\omega_\text{eff}(\bar\mu) = \arg\min \omega'\Sigma\omega \quad \text{s.t.} \quad \omega'\iota = 1 \text{ and } \omega'\mu \geq \bar\mu. \tag{18.2}$$

We leave it as an exercise below to show that the portfolio choice problem can equivalently be formulated for an investor with mean-variance preferences and risk aversion factor $\gamma$. That means the investor aims to choose portfolio weights as the solution to

$$\omega^*_\gamma = \arg\max \omega'\mu - \frac{\gamma}{2}\omega'\Sigma\omega \quad \text{s.t.} \quad \omega'\iota = 1. \tag{18.3}$$
To prove this statement, we refer to the derivations in Appendix B. Empirically, this classical solution imposes many problems. In particular, the estimates of $\mu$ are noisy over short horizons, and the $(N \times N)$ matrix $\Sigma$ contains $N(N+1)/2$ distinct elements, so estimation error is huge. Seminal papers on the effect of ignoring estimation uncertainty are, among others, Brown (1976), Jobson and Korkie (1980), Jorion (1986), and Chopra and Ziemba (1993). Even worse, if the asset universe contains more assets than available time periods $(N > T)$, the sample covariance matrix is no longer positive definite, and the inverse $\Sigma^{-1}$ does not exist anymore. To address estimation issues for vast-dimensional covariance matrices, regularization techniques (see, e.g., Ledoit and Wolf, 2003, 2004, 2012; Fan et al., 2008) and the parametric approach from the previous chapter are popular tools.
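For instance, a linear shrinkage estimator in the spirit of Ledoit and Wolf (2004) is readily available in scikit-learn. The following minimal sketch, which we do not use in the remainder of the chapter, shows how a shrunk covariance matrix of the industry returns could be obtained:

from sklearn.covariance import LedoitWolf

# Ledoit-Wolf shrinkage estimate of the covariance matrix of industry returns
sigma_shrunk = LedoitWolf().fit(industry_returns).covariance_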
While the uncertainty associated with estimated parameters is challenging, the data-generating process is also unknown to the investor. In other words, model uncertainty reflects that it is ex-ante not even clear which parameters require estimation (for instance, if returns are driven by a factor model, selecting the universe of relevant factors imposes model uncertainty). Wang (2005) and Garlappi et al. (2007) provide theoretical analyses of optimal portfolio choice under model and estimation uncertainty. In the most extreme case, Pflug et al. (2012) show that the naive portfolio, which allocates equal wealth to all assets, is the optimal choice for an investor averse to model uncertainty.
On top of the estimation uncertainty, transaction costs are a major concern. Rebalancing
portfolios is costly, and, therefore, the optimal choice should depend on the investor’s cur-
rent holdings. In the presence of transaction costs, the benefits of reallocating wealth may
be smaller than the costs associated with turnover. This aspect has been investigated theo-
retically, among others, for one risky asset by Magill and Constantinides (1976) and Davis
and Norman (1990). Subsequent extensions to the case with multiple assets have been pro-
posed by Balduzzi and Lynch (1999) and Balduzzi and Lynch (2000). More recent papers
on empirical approaches that explicitly account for transaction costs include Gârleanu and
Pedersen (2013), DeMiguel et al. (2014), and DeMiguel et al. (2015).
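Before turning to the implementation, it is useful to see the logic behind the code below. If transaction costs are quadratic in rebalancing, i.e., $\nu_t(\omega, \omega_{t^+}, \beta) = \frac{\beta}{2}(\omega - \omega_{t^+})'(\omega - \omega_{t^+})$ (the specification we assume here), the investor's problem

$$\omega^*_{t+1} = \arg\max_{\omega'\iota = 1} \omega'\mu - \nu_t(\omega, \omega_{t^+}, \beta) - \frac{\gamma}{2}\omega'\Sigma\omega$$

can be rewritten, up to a constant, as a standard mean-variance problem with adjusted inputs,

$$\omega^*_{t+1} = \arg\max_{\omega'\iota = 1} \omega'\mu^* - \frac{\gamma}{2}\omega'\Sigma^*\omega, \qquad \mu^* = \mu + \beta\omega_{t^+}, \quad \Sigma^* = \Sigma + \frac{\beta}{\gamma}I_N.$$

This adjustment is exactly what sigma_processed and mu_processed implement in the function below.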
def compute_efficient_weight(sigma, mu, gamma=2, beta=0, w_prev=None):
    n = sigma.shape[1]
    iota = np.ones(n)
    w_prev = iota/n if w_prev is None else w_prev
    sigma_processed = sigma+(beta/gamma)*np.eye(n)
    mu_processed = mu+beta*w_prev
    sigma_inverse = np.linalg.inv(sigma_processed)
    # Closed form: minimum variance weights plus a self-financing tilt
    w_mvp = sigma_inverse @ iota/(iota @ sigma_inverse @ iota)
    w_opt = w_mvp+(1/gamma)*(
        sigma_inverse-np.outer(w_mvp, iota) @ sigma_inverse) @ mu_processed
    return w_opt
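To obtain the efficient portfolio in the absence of transaction costs, a call along the following lines (using the defaults $\gamma = 2$ and $\beta = 0$) produces the vector w_efficient used below:

w_efficient = compute_efficient_weight(sigma, mu)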
weights_efficient = pd.DataFrame({
"Industry": industry_returns.columns.tolist(),
"Efficient portfolio": w_efficient
})
weights_efficient.round(3)
The portfolio weights above indicate the efficient portfolio for an investor with risk aversion coefficient $\gamma = 2$ in the absence of transaction costs. Some of the positions are negative, which implies short-selling, and most of the positions are rather extreme. For instance, a position of $-1$ implies that the investor takes a short position worth their entire wealth to lever long positions in other assets. What is the effect of transaction costs or different levels of risk aversion on the optimal portfolio choice? The following few lines of code analyze the distance between the minimum variance portfolio and the portfolio implemented by the investor for different values of the transaction cost parameter $\beta$ and risk aversion $\gamma$.
gammas = [2, 4, 8, 20]
betas = 20*expon.ppf(np.arange(1, 100)/100, scale=1)
transaction_costs = (pd.DataFrame(
list(product(gammas, betas)),
columns=["gamma", "beta"]
)
.assign(
weights=lambda x: x.apply(lambda y:
compute_efficient_weight(
sigma, mu, gamma=y["gamma"], beta=y["beta"]/10000, w_prev=w_mvp),
axis=1
),
concentration=lambda x: x["weights"].apply(
lambda x: np.sum(np.abs(x-w_mvp))
)
)
)
The code chunk above computes the optimal weights in the presence of transaction costs for different values of $\beta$ and $\gamma$, but with the same initial allocation, namely the theoretically optimal minimum variance portfolio. Starting from this initial allocation, the investor chooses their optimal allocation along the efficient frontier to reflect their own risk preferences. If transaction costs were absent, the investor would simply implement the mean-variance efficient allocation. If rebalancing is costly, the optimal portfolio only partially shifts toward the efficient portfolio, while the current portfolio anchors the investment.
rebalancing_plot = (
ggplot(transaction_costs,
aes(x="beta", y="concentration",
color="factor(gamma)", linetype="factor(gamma)")) +
geom_line() +
guides(linetype=None) +
labs(x="Transaction cost parameter", y="Distance from MVP",
color="Risk aversion",
title=("Portfolio weights for different risk aversion and "
"transaction cost"))
)
rebalancing_plot.draw()
Figure 18.1 shows rebalancing from the initial portfolio (which we always set to the minimum variance portfolio weights in this example). The higher the transaction cost parameter $\beta$, the smaller is the rebalancing away from the initial portfolio. In addition, if risk aversion $\gamma$ increases, the efficient portfolio is closer to the minimum variance portfolio weights, such that the investor desires less rebalancing away from the initial holdings.
Figure 18.1: The figure shows portfolio weights for different levels of risk aversion and transaction costs. The vertical axis indicates the distance from the empirical minimum variance portfolio, measured by the sum of the absolute deviations of the chosen portfolio weights from the benchmark; the horizontal axis shows the transaction cost parameter.
def objective_mvp(w):
    return 0.5*w.T @ sigma @ w

def gradient_mvp(w):
    return sigma @ w

def equality_constraint(w):
    return np.sum(w)-1

def jacobian_equality(w):
    return np.ones_like(w)

constraints = (
    {"type": "eq", "fun": equality_constraint, "jac": jacobian_equality}
)
options = {
"tol":1e-20,
"maxiter": 10000,
"method":"SLSQP"
}
w_mvp_numerical = minimize(
x0=w_initial,
fun=objective_mvp,
jac=gradient_mvp,
constraints=constraints,
tol=options["tol"],
options={"maxiter": options["maxiter"]},
method=options["method"]
)
def objective_efficient(w):
    return 2*0.5*w.T @ sigma @ w-(1+mu) @ w

def gradient_efficient(w):
    return 2*sigma @ w-(1+mu)

w_efficient_numerical = minimize(
    x0=w_initial,
    fun=objective_efficient,
    jac=gradient_efficient,
    constraints=constraints,
    tol=options["tol"],
    options={"maxiter": options["maxiter"]},
    method=options["method"]
)
The result above shows that the numerical procedure indeed recovered the optimal weights
for a scenario where we already know the analytic solution.
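A quick sanity check of this claim, assuming the objects defined above, is to compare the numerical and analytical solutions directly:

# Both comparisons should return True if the optimizer converged
print(np.allclose(w_mvp, w_mvp_numerical.x, atol=1e-4))
print(np.allclose(w_efficient, w_efficient_numerical.x, atol=1e-4))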
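The vector w_no_short_sale reported next results from an analogous optimization. A minimal sketch of one possible specification (imposing the element-wise constraint $\omega \geq 0$; using bounds in minimize() would work equally well) looks as follows:

w_no_short_sale = minimize(
    x0=w_initial,
    fun=objective_efficient,
    jac=gradient_efficient,
    constraints=(
        {"type": "eq", "fun": equality_constraint, "jac": jacobian_equality},
        {"type": "ineq", "fun": lambda w: w}  # w >= 0 rules out short sales
    ),
    tol=options["tol"],
    options={"maxiter": options["maxiter"]},
    method=options["method"]
)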
weights_no_short_sale = pd.DataFrame({
"Industry": industry_returns.columns.tolist(),
"No short-sale": w_no_short_sale.x
})
weights_no_short_sale.round(3)
Industry No short-sale
0 nodur 0.610
1 durbl 0.000
2 manuf 0.000
3 enrgy 0.211
4 hitec 0.000
5 telcm 0.000
6 shops 0.000
7 hlth 0.179
8 utils 0.000
9 other 0.000
As expected, the resulting portfolio weights are all positive (up to numerical precision).
Typically, the holdings in the presence of short-sale constraints are concentrated among far fewer assets than in the unrestricted case. You can verify that np.sum(w_no_short_sale.x)
returns 1. In other words, minimize() provides the numerical solution to a portfolio choice
problem for a mean-variance investor with risk aversion gamma = 2, where negative holdings
are forbidden.
minimize() can also handle more complex problems. As an example, we show how to compute optimal weights subject to the so-called Regulation-T constraint,1 which requires that the sum of all absolute portfolio weights is smaller than 1.5, that is, $\sum_{i=1}^N |\omega_i| \leq 1.5$. The constraint enforces that at most 50 percent of the allocated wealth can be used for short positions, thus implying an initial margin requirement of 50 percent. Imposing such a margin requirement reduces portfolio risks because extreme portfolio weights are no longer attainable. The implementation of Regulation-T rules is numerically interesting because the margin requirement implies a non-linear constraint on the portfolio weights.
1 https://en.wikipedia.org/wiki/Regulation_T
reg_t = 1.5
def inequality_constraint(w):
    return reg_t-np.sum(np.abs(w))

def jacobian_inequality(w):
    return -np.sign(w)

def objective_reg_t(w):
    return -w @ (1+mu)+2*0.5*w.T @ sigma @ w

def gradient_reg_t(w):
    return -(1+mu)+2*np.dot(sigma, w)
constraints = (
{"type": "eq", "fun": equality_constraint, "jac": jacobian_equality},
{"type": "ineq", "fun": inequality_constraint, "jac": jacobian_inequality}
)
w_reg_t = minimize(
x0=w_initial,
fun=objective_reg_t,
jac=gradient_reg_t,
constraints=constraints,
tol=options["tol"],
options={"maxiter": options["maxiter"]},
method=options["method"]
)
weights_reg_t = pd.DataFrame({
"Industry": industry_returns.columns.tolist(),
"Regulation-T": w_reg_t.x
})
weights_reg_t.round(3)
Industry Regulation-T
0 nodur 0.733
1 durbl -0.000
2 manuf -0.144
3 enrgy 0.264
4 hitec 0.000
5 telcm -0.000
6 shops 0.031
7 hlth 0.222
8 utils -0.000
9 other -0.106
Figure 18.2 shows the optimal allocation weights across all ten industries for the four different strategies considered so far: minimum variance, efficient portfolio with $\gamma = 2$, efficient portfolio with short-sale constraints, and the Regulation-T constrained portfolio.
weights = (weights_mvp
.merge(weights_efficient)
.merge(weights_no_short_sale)
.merge(weights_reg_t)
.melt(id_vars="Industry", var_name="Strategy", value_name="weights")
)
weights_plot = (
ggplot(weights,
aes(x="Industry", y="weights", fill="Strategy")) +
geom_bar(stat="identity", position="dodge", width=0.7) +
coord_flip() +
labs(y="Allocation weight", fill="",
title="Optimal allocations for different strategies") +
scale_y_continuous(labels=percent_format())
)
weights_plot.draw()
Figure 18.2: The figure shows optimal allocation weights for the ten industry portfolios and
the four different allocation strategies.
The results clearly indicate the effect of imposing additional constraints: the extreme holdings the investor implements if they follow the (theoretically optimal) efficient portfolio vanish under, e.g., the Regulation-T constraint. You may wonder why an investor would deviate from what is theoretically the optimal portfolio by imposing potentially arbitrary constraints. The short answer is: the efficient portfolio is only efficient if the true parameters of the data-generating process correspond to the estimated parameters $\hat\Sigma$ and $\hat\mu$. Estimation error in these parameters typically translates into poor out-of-sample performance, which is why imposing (seemingly suboptimal) constraints can improve the realized allocation.
def compute_efficient_weight_L1_TC(mu, sigma, gamma, beta, initial_weights):
    """Compute efficient portfolio weights with L1 transaction costs."""
    def objective(w):
        return (gamma*0.5*w.T @ sigma @ w-(1+mu) @ w
                +(beta/10000)/2*np.sum(np.abs(w-initial_weights)))

    def gradient(w):
        return (-mu+gamma*sigma @ w
                +(beta/10000)*0.5*np.sign(w-initial_weights))

    constraints = (
        {"type": "eq", "fun": equality_constraint, "jac": jacobian_equality}
    )

    result = minimize(
        x0=initial_weights,
        fun=objective,
        jac=gradient,
        constraints=constraints,
        tol=options["tol"],
        options={"maxiter": options["maxiter"]},
        method=options["method"]
    )

    return result.x
In each period of the backtest, the investor adjusts their portfolio weights, which is costly and affects the performance. The portfolio weights determine the portfolio return. A period later, the current portfolio weights have changed and form the basis for the transaction costs incurred in the next period. We consider three different competing strategies: the mean-variance efficient portfolio, the mean-variance efficient portfolio with ex-ante adjustment for transaction costs, and the naive portfolio, which allocates wealth equally across the different assets.
window_length = 120
periods = industry_returns.shape[0]-window_length
beta = 50
gamma = 2
n_industries = industry_returns.shape[1]
w_prev_1 = w_prev_2 = w_prev_3 = np.ones(n_industries)/n_industries
We also define two helper functions: One to adjust the weights due to returns and one for
performance evaluation, where we compute realized returns net of transaction costs.
def adjust_weights(w, next_return):
    w_prev = 1+w*next_return
    return np.array(w_prev/np.sum(np.array(w_prev)))

def evaluate_performance(w, w_previous, next_return):
    """Compute realized raw and net-of-transaction-cost returns plus turnover."""
    raw_return = np.dot(next_return, w)
    turnover = np.sum(np.abs(w-w_previous))
    net_return = raw_return-beta/10000*turnover
    return np.array([raw_return, turnover, net_return])
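The strategy computations in the next chunks are embedded in a rolling-window loop. A sketch of its structure (the results container and the strategy labels other than "Naive" are assumptions) looks as follows:

performance_values = {
    strategy: np.empty((periods, 3))
    for strategy in ["MV (TC)", "Naive", "MV"]
}

for p in range(periods):
    returns_window = industry_returns.iloc[p:(p+window_length)]
    next_return = industry_returns.iloc[p+window_length]
    # The chunks shown below (moment estimation, weight computation, and
    # performance bookkeeping for each strategy) run at this point in
    # every iteration.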
sigma_window = np.array(returns_window.cov())
mu = 0*np.array(returns_window.mean())
w_1 = compute_efficient_weight_L1_TC(
mu=mu, sigma=sigma_window,
beta=beta,
gamma=gamma,
initial_weights=w_prev_1
)
# Naive portfolio
w_2 = np.ones(n_industries)/n_industries
performance_values["Naive"][p, :] = evaluate_performance(
w_2, w_prev_2, next_return
)
w_prev_2 = adjust_weights(w_2, next_return)
Finally, we get to the evaluation of the portfolio strategies net-of-transaction costs. Note
that we compute annualized returns and standard deviations.
performance = pd.DataFrame()

for i in enumerate(performance_values.keys()):
    tmp_data = pd.DataFrame(
        performance_values[i[1]],
        columns=["raw_return", "turnover", "net_return"]
    )
    tmp_data["strategy"] = i[1]
    performance = pd.concat([performance, tmp_data], axis=0)
length_year = 12
performance_table = (performance
.groupby("strategy")
.aggregate(
mean=("net_return", lambda x: length_year*100*x.mean()),
sd=("net_return", lambda x: np.sqrt(length_year)*100*x.std()),
sharpe_ratio=("net_return", lambda x: (
(length_year*100*x.mean())/(np.sqrt(length_year)*100*x.std())
if x.mean() > 0 else np.nan)
),
turnover=("turnover", lambda x: 100*x.mean())
)
.reset_index()
)
performance_table.round(3)
The results clearly speak against mean-variance optimization. Turnover is huge when the
investor only considers their portfolio’s expected return and variance. Effectively, the mean-
variance portfolio generates a negative annualized return after adjusting for transaction
costs. At the same time, the naive portfolio turns out to perform very well. In fact, the
performance gains of the transaction-cost adjusted mean-variance portfolio are small. The
out-of-sample Sharpe ratio is slightly higher than for the naive portfolio. Note the extreme
effect of turnover penalization on turnover: MV (TC) effectively resembles a buy-and-hold
strategy which only updates the portfolio once the estimated parameters $\hat\mu_t$ and $\hat\Sigma_t$ indicate
that the current allocation is too far away from the optimal theoretical portfolio.
18.7 Exercises
1. Consider the portfolio choice problem for transaction-cost adjusted certainty equivalent maximization with risk aversion parameter $\gamma$,
$$\omega^*_{t+1} = \arg\max_{\omega \in \mathbb{R}^N,\ \iota'\omega = 1} \omega'\mu - \nu_t(\omega, \omega_{t^+}, \beta) - \frac{\gamma}{2}\omega'\Sigma\omega, \tag{18.10}$$
where $\Sigma$ and $\mu$ are (estimators of) the variance-covariance matrix of the returns and the vector of expected returns. Assume for now that transaction costs are quadratic in rebalancing and proportional to stock illiquidity such that
$$\nu_t(\omega, \omega_{t^+}, \beta) = \frac{\beta}{2}(\omega - \omega_{t^+})'B(\omega - \omega_{t^+}), \tag{18.11}$$
where $B = \text{diag}(ill_1, \ldots, ill_N)$ is a diagonal matrix and $ill_1, \ldots, ill_N$ are measures of the individual assets' illiquidity. Derive a closed-form solution for the mean-variance efficient portfolio $\omega^*_{t+1}$ based on the transaction cost specification above. Discuss the effect of illiquidity $ill_i$ on the individual portfolio weights relative to an investor that myopically ignores transaction costs in their decision.
2. Use the solution from the previous exercise to update the function compute_efficient_weight() such that you can compute optimal weights conditional on a matrix $B$ with illiquidity measures.
3. Illustrate the evolution of the optimal weights from the naive portfolio to the efficient portfolio in the mean-standard deviation diagram.
4. Is it always optimal to choose the same $\beta$ in the optimization problem as the value used in evaluating the portfolio performance? In other words, can it be optimal to choose theoretically sub-optimal portfolios based on transaction cost considerations that do not reflect the actual incurred costs? Evaluate the out-of-sample Sharpe ratio after transaction costs for a range of different imposed values of $\beta$.
A
Colophon
In this appendix chapter, we provide details on the package versions used in this edition.
The book was built with Python (Python Software Foundation, 2023) version 3.10.11 and
the following packages:
Package Version
0 anyio 4.0.0
1 appdirs 1.4.4
2 appnope 0.1.3
3 argon2-cffi 23.1.0
4 argon2-cffi-bindings 21.2.0
5 arrow 1.3.0
6 astor 0.8.1
7 asttokens 2.4.0
8 async-lru 2.0.4
9 attrs 23.1.0
10 Babel 2.13.0
11 backcall 0.2.0
12 beautifulsoup4 4.12.2
13 bleach 6.1.0
14 certifi 2023.7.22
15 cffi 1.16.0
16 charset-normalizer 3.3.0
17 click 8.1.7
18 comm 0.1.4
19 contourpy 1.1.1
20 cycler 0.12.1
21 Cython 3.0.3
22 debugpy 1.8.0
23 decorator 5.1.1
24 defusedxml 0.7.1
25 et-xmlfile 1.1.0
26 exceptiongroup 1.1.3
27 executing 2.0.0
28 fastjsonschema 2.18.1
29 fonttools 4.43.1
30 formulaic 0.6.6
31 fqdn 1.5.1
32 frozendict 2.3.8
33 html5lib 1.1
34 httpimport 1.3.1
35 idna 3.4
36 importlib-metadata 6.8.0
37 interface-meta 1.3.0
38 ipykernel 6.25.2
39 ipython 8.16.1
40 ipython-genutils 0.2.0
41 ipywidgets 8.1.1
42 isoduration 20.11.0
43 jedi 0.18.2
44 Jinja2 3.1.2
45 joblib 1.3.2
46 json5 0.9.14
47 jsonpointer 2.4
48 jsonschema 4.19.1
49 jsonschema-specifications 2023.7.1
50 jupyter 1.0.0
51 jupyter-cache 0.6.1
52 jupyter-console 6.6.3
53 jupyter-events 0.7.0
54 jupyter-lsp 2.2.0
55 jupyter_client 8.3.1
56 jupyter_core 5.4.0
57 jupyter_server 2.7.3
58 jupyter_server_terminals 0.4.4
59 jupyterlab 4.0.6
60 jupyterlab-pygments 0.2.2
61 jupyterlab-widgets 3.0.9
62 jupyterlab_server 2.25.0
63 kiwisolver 1.4.5
64 linearmodels 5.3
65 lxml 4.9.3
66 MarkupSafe 2.1.3
67 matplotlib 3.8.0
68 matplotlib-inline 0.1.6
69 mistune 3.0.2
70 mizani 0.9.3
71 multitasking 0.0.11
72 mypy-extensions 1.0.0
73 nbclient 0.7.4
74 nbconvert 7.9.2
75 nbformat 5.9.2
76 nest-asyncio 1.5.8
77 notebook 7.0.4
78 notebook_shim 0.2.3
79 numpy 1.26.0
80 openpyxl 3.1.2
81 overrides 7.4.0
82 packaging 23.2
83 pandas 2.1.1
84 pandas-datareader 0.10.0
85 pandocfilters 1.5.0
86 parso 0.8.3
87 patsy 0.5.3
88 peewee 3.16.3
89 pexpect 4.8.0
90 pickleshare 0.7.5
91 Pillow 10.0.1
92 platformdirs 3.11.0
93 plotnine 0.12.3
94 prometheus-client 0.17.1
95 prompt-toolkit 3.0.39
96 psutil 5.9.5
97 psycopg2-binary 2.9.9
98 ptyprocess 0.7.0
99 pure-eval 0.2.2
100 pycparser 2.21
101 Pygments 2.16.1
102 pyhdfe 0.2.0
103 pyparsing 3.1.1
104 python-dateutil 2.8.2
105 python-dotenv 1.0.0
106 python-json-logger 2.0.7
107 pytz 2023.3.post1
108 PyYAML 6.0.1
109 pyzmq 25.1.1
110 qtconsole 5.4.4
111 QtPy 2.4.0
112 referencing 0.30.2
113 regtabletotext 0.0.12
114 requests 2.31.0
115 rfc3339-validator 0.1.4
116 rfc3986-validator 0.1.1
117 rpds-py 0.10.4
118 scikit-learn 1.3.1
119 scipy 1.11.3
120 Send2Trash 1.8.2
121 setuptools-scm 7.1.0
122 six 1.16.0
123 sniffio 1.3.0
124 soupsieve 2.5
125 SQLAlchemy 2.0.21
126 stack-data 0.6.3
127 statsmodels 0.14.0
128 tabulate 0.9.0
129 terminado 0.17.1
130 threadpoolctl 3.2.0
131 tinycss2 1.2.1
132 tomli 2.0.1
133 tornado 6.3.3
134 traitlets 5.11.2
135 types-python-dateutil 2.8.19.14
136 typing_extensions 4.8.0
137 tzdata 2023.3
138 uri-template 1.3.0
139 urllib3 2.0.6
140 wcwidth 0.2.8
141 webcolors 1.13
142 webencodings 0.5.1
143 websocket-client 1.6.4
144 widgetsnbextension 4.0.9
145 wrapt 1.15.0
146 yfinance 0.2.31
147 zipp 3.17.0
B
Proofs
In this appendix chapter, we collect the proofs that we refer to throughout the book.
The minimum variance portfolio weights solve $\min_\omega \omega'\Sigma\omega$ subject to $\iota'\omega = 1$, with Lagrangian $\mathcal{L}(\omega) = \omega'\Sigma\omega + \lambda(1 - \iota'\omega)$. The first-order condition reads

$$\frac{\partial \mathcal{L}(\omega)}{\partial \omega} = 0 \Leftrightarrow 2\Sigma\omega = \lambda\iota \Leftrightarrow \omega = \frac{\lambda}{2}\Sigma^{-1}\iota.$$

Next, the constraint that the weights have to sum up to one delivers $1 = \iota'\omega = \frac{\lambda}{2}\iota'\Sigma^{-1}\iota$, hence $\lambda = \frac{2}{\iota'\Sigma^{-1}\iota}$. Finally, plug in the derived value of $\lambda$ to get

$$\omega_\text{mvp} = \frac{\Sigma^{-1}\iota}{\iota'\Sigma^{-1}\iota}.$$

For the efficient portfolio with target return $\bar\mu$, we minimize $\omega'\Sigma\omega$ subject to $\iota'\omega = 1$ and $\omega'\mu \geq \bar\mu$. Analogously to above, the first-order condition with Lagrange multipliers $\lambda$ and $\tilde\lambda$ reads

$$2\Sigma\omega = \lambda\iota + \tilde\lambda\mu \Rightarrow \omega = \frac{\lambda}{2}\Sigma^{-1}\iota + \frac{\tilde\lambda}{2}\Sigma^{-1}\mu.$$
$$1 = \iota'\omega = \frac{\lambda}{2}\underbrace{\iota'\Sigma^{-1}\iota}_{C} + \frac{\tilde\lambda}{2}\underbrace{\iota'\Sigma^{-1}\mu}_{D} \Rightarrow \lambda = \frac{2 - \tilde\lambda D}{C}$$

$$\bar\mu = \mu'\omega = \frac{\lambda}{2}\underbrace{\mu'\Sigma^{-1}\iota}_{D} + \frac{\tilde\lambda}{2}\underbrace{\mu'\Sigma^{-1}\mu}_{E} = \frac{1}{2}\left(\frac{2 - \tilde\lambda D}{C}\right)D + \frac{\tilde\lambda}{2}E = \frac{D}{C} + \frac{\tilde\lambda}{2}\left(E - \frac{D^2}{C}\right)$$

$$\Rightarrow \tilde\lambda = 2\,\frac{\bar\mu - D/C}{E - D^2/C}.$$
As a result, the efficient portfolio weights take the form (for $\bar\mu \geq D/C = \mu'\omega_\text{mvp}$)

$$\omega_\text{eff}(\bar\mu) = \omega_\text{mvp} + \frac{\tilde\lambda}{2}\left(\Sigma^{-1}\mu - \frac{D}{C}\Sigma^{-1}\iota\right).$$

Thus, the efficient portfolio allocates wealth to the minimum variance portfolio $\omega_\text{mvp}$ and a levered (self-financing) portfolio to increase the expected return.

Note that the portfolio weights sum up to one because

$$\iota'\left(\Sigma^{-1}\mu - \frac{D}{C}\Sigma^{-1}\iota\right) = D - D = 0, \quad \text{so } \iota'\omega_\text{eff} = \iota'\omega_\text{mvp} = 1.$$

Finally, the expected return of the efficient portfolio is

$$\mu'\omega_\text{eff} = \frac{D}{C} + \bar\mu - \frac{D}{C} = \bar\mu.$$
Note the difference: In the first case, the investor has a (known) risk aversion $\gamma$ which determines their optimal balance between risk ($\omega'\Sigma\omega$) and return ($\mu'\omega$). In the second case, the investor has a target return they want to achieve while minimizing the volatility. Intuitively, both approaches are closely connected if we consider that the risk aversion $\gamma$ determines the desirable return $\bar\mu$. More risk-averse investors (higher $\gamma$) will choose a lower target return to keep their volatility low. The efficient frontier then spans all possible portfolios depending on the risk aversion $\gamma$, starting from the minimum variance portfolio ($\gamma \to \infty$).
To prove this equivalence, consider first the optimal portfolio weights for a certainty equivalent maximizing investor. The first-order condition reads

$$\mu - \gamma\Sigma\omega = \lambda\iota \Leftrightarrow \omega = \frac{1}{\gamma}\Sigma^{-1}(\mu - \lambda\iota).$$

The resulting weights correspond to the efficient portfolio with desired return $\bar\mu$ such that (in the notation of the book)

$$\frac{1}{\gamma} = \frac{\tilde\lambda}{2} = \frac{\bar\mu - D/C}{E - D^2/C},$$

which implies that the desired return is just

$$\bar\mu = \frac{D}{C} + \frac{1}{\gamma}\left(E - \frac{D^2}{C}\right).$$

For instance, letting $\gamma \to \infty$ implies $\bar\mu = \frac{D}{C} = \mu'\omega_\text{mvp}$, the expected return of the minimum variance portfolio, as expected.
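The equivalence is easy to verify numerically. The following sketch with random inputs (not part of the book's replication code) computes both sets of weights and confirms that they coincide:

import numpy as np

np.random.seed(42)
n, gamma = 4, 4
A = np.random.rand(n, n)
sigma = A @ A.T+np.eye(n)  # positive definite covariance matrix
mu = np.random.rand(n)
iota = np.ones(n)
sigma_inv = np.linalg.inv(sigma)

C = iota @ sigma_inv @ iota
D = iota @ sigma_inv @ mu
E = mu @ sigma_inv @ mu

# Certainty equivalent maximizer with risk aversion gamma
lam = (D-gamma)/C
w_ce = (1/gamma)*sigma_inv @ (mu-lam*iota)

# Efficient portfolio with the implied target return mu_bar
mu_bar = D/C+(1/gamma)*(E-D**2/C)
lambda_tilde = 2*(mu_bar-D/C)/(E-D**2/C)
w_mvp = sigma_inv @ iota/C
w_eff = w_mvp+lambda_tilde/2*(sigma_inv @ mu-D/C*sigma_inv @ iota)

print(np.allclose(w_ce, w_eff))  # should print True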
C
WRDS Dummy Data
In this appendix chapter, we alleviate the constraints of readers who don’t have access to
WRDS and hence cannot run the code that we provide. We show how to create a dummy
database that contains the WRDS tables and corresponding columns such that all code
chunks in this book can be executed with this dummy database. We do not create dummy
data for tables of macroeconomic variables because they can be freely downloaded from the
original sources; check out Chapter 3.
We deliberately use the dummy label because the data is not meaningful in the sense that it does not allow readers to actually replicate the results of the book. For legal reasons, the data does not contain any samples of the original data. We merely generate random numbers for all columns of the tables that we use throughout the book.
To generate the dummy database, we use the following packages:
import pandas as pd
import numpy as np
import sqlite3
import string
Since we draw random numbers for most of the columns, we also define a seed to ensure that the generated numbers are replicable. We also initialize vectors of dates of different frequencies over twenty years that we then use to create yearly, monthly, and daily data, respectively.
np.random.seed(1234)
start_date = pd.Timestamp("2003-01-01")
end_date = pd.Timestamp("2022-12-31")
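The remaining setup defines the date vectors and the numbers of securities that we rely on below. A minimal sketch (the counts of stocks and bonds are assumptions) looks as follows:

dummy_years = np.arange(start_date.year, end_date.year+1)
dummy_months = pd.date_range(start_date, end_date, freq="M")
dummy_days = pd.date_range(start_date, end_date, freq="D")
number_of_stocks = 100  # assumed number of dummy securities
number_of_bonds = 100  # assumed number of dummy bonds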
industries = pd.DataFrame({
"industry": ["Agriculture", "Construction", "Finance",
"Manufacturing", "Mining", "Public", "Retail",
"Services", "Transportation", "Utilities", "Wholesale"],
"n": [81, 287, 4682, 8584, 1287, 1974, 1571, 4277, 1249, 457, 904],
"prob": [0.00319, 0.0113, 0.185, 0.339, 0.0508, 0.0779,
0.0620, 0.169, 0.0493, 0.0180, 0.03451]
})
exchanges = pd.DataFrame({
"exchange": ["AMEX", "NASDAQ", "NYSE"],
"n": [2893, 17236, 5553],
"prob": [0.113, 0.671, 0.216]
})
stock_identifiers_list = []
for x in range(1, number_of_stocks+1):
    exchange = np.random.choice(exchanges["exchange"], p=exchanges["prob"])
    industry = np.random.choice(industries["industry"], p=industries["prob"])

    exchcd_mapping = {
        "NYSE": np.random.choice([1, 31]),
        "AMEX": np.random.choice([2, 32]),
        "NASDAQ": np.random.choice([3, 33])
    }

    siccd_mapping = {
        "Agriculture": np.random.randint(1, 1000),
        "Mining": np.random.randint(1000, 1500),
        "Construction": np.random.randint(1500, 1800),
        "Manufacturing": np.random.randint(1800, 4000),
        "Transportation": np.random.randint(4000, 4900),
        "Utilities": np.random.randint(4900, 5000),
        "Wholesale": np.random.randint(5000, 5200),
        "Retail": np.random.randint(5200, 6000),
        "Finance": np.random.randint(6000, 6800),
        # SIC ranges for the remaining industries (assumed)
        "Services": np.random.randint(6800, 9000),
        "Public": np.random.randint(9000, 10000)
    }

    stock_identifiers_list.append({
        "permno": x,
        "gvkey": str(x+10000),
        "exchange": exchange,
        "industry": industry,
        "exchcd": exchcd_mapping[exchange],
        "siccd": siccd_mapping[industry]
    })
stock_identifiers = pd.DataFrame(stock_identifiers_list)
Next, we construct three panels of stock data with varying frequencies: yearly, monthly, and daily. We begin by creating the stock_panel_yearly panel. To achieve this, we combine the stock_identifiers table with a new table containing the variable year from dummy_years. Taking all possible combinations of the two tables (an expand-grid operation, here implemented with np.tile() and np.repeat()) ensures that each stock appears in every year. After combining, we select only the gvkey and year columns for our final yearly panel.

Next, we construct the stock_panel_monthly panel. Similar to the yearly panel, we combine stock_identifiers with a new table that has the month variable from dummy_months, again taking all possible combinations. After merging, we select the columns permno, gvkey, month, siccd, industry, exchcd, and exchange to form our monthly panel.

Lastly, we create the stock_panel_daily panel. We combine stock_identifiers with a table containing the date variable from dummy_days. After merging, we retain only the permno and date columns for our daily panel.
stock_panel_yearly = pd.DataFrame({
"gvkey": np.tile(stock_identifiers["gvkey"], len(dummy_years)),
"year": np.repeat(dummy_years, len(stock_identifiers))
})
stock_panel_monthly = pd.DataFrame({
"permno": np.tile(stock_identifiers["permno"], len(dummy_months)),
"gvkey": np.tile(stock_identifiers["gvkey"], len(dummy_months)),
"month": np.repeat(dummy_months, len(stock_identifiers)),
"siccd": np.tile(stock_identifiers["siccd"], len(dummy_months)),
"industry": np.tile(stock_identifiers["industry"], len(dummy_months)),
"exchcd": np.tile(stock_identifiers["exchcd"], len(dummy_months)),
"exchange": np.tile(stock_identifiers["exchange"], len(dummy_months))
})
stock_panel_daily = pd.DataFrame({
"permno": np.tile(stock_identifiers["permno"], len(dummy_days)),
"date": np.repeat(dummy_days, len(stock_identifiers))
})
(beta_dummy
.to_sql(name="beta",
con=tidy_finance,
if_exists="replace",
index = False)
)
commands = {
col: np.random.rand(len(stock_panel_yearly)) for col in relevant_columns
}
compustat_dummy = (
stock_panel_yearly
.assign(
datadate=lambda x: pd.to_datetime(x["year"].astype(str)+"-12-31")
)
.assign(**commands)
)
(compustat_dummy
.to_sql(name="compustat",
con=tidy_finance,
if_exists="replace",
index=False)
)
(crsp_monthly_dummy
.to_sql(name="crsp_monthly",
con=tidy_finance,
if_exists="replace",
index=False)
)
(crsp_daily_dummy
.to_sql(name="crsp_daily",
con=tidy_finance,
if_exists="replace",
index=False)
)
def generate_cusip():
    """Generate a random CUSIP-like identifier (the 9-character length is an assumption)."""
    characters = list(string.ascii_uppercase+string.digits)
    cusip = "".join(np.random.choice(characters, size=9))
    return cusip
fisd_dummy = (pd.DataFrame({
"complete_cusip": [generate_cusip() for _ in range(number_of_bonds)]
})
.assign(
maturity=lambda x: np.random.choice(dummy_days, len(x), replace=True),
offering_amt=lambda x: np.random.choice(
np.arange(1, 101)*100000, len(x), replace=True
)
)
.assign(
offering_date=lambda x: (
x["maturity"]-pd.to_timedelta(
np.random.choice(np.arange(1, 26)*365, len(x), replace=True),
unit="D"
)
)
)
.assign(
dated_date=lambda x: (
x["offering_date"]-pd.to_timedelta(
np.random.choice(np.arange(-10, 11), len(x), replace=True),
unit="D"
)
),
interest_frequency=lambda x: np.random.choice(
[0, 1, 2, 4, 12], len(x), replace=True
),
coupon=lambda x: np.random.choice(
np.arange(0, 2.1, 0.1), len(x), replace=True
)
)
.assign(
last_interest_date=lambda x: (
x[["maturity", "offering_date", "dated_date"]].max(axis=1)
),
issue_id=lambda x: x.index+1,
issuer_id=lambda x: np.random.choice(
np.arange(1, 251), len(x), replace=True
),
sic_code=lambda x: (np.random.choice(
np.arange(1, 10)*1000, len(x), replace=True)
).astype(str)
)
)
(fisd_dummy
.to_sql(name="fisd",
con=tidy_finance,
if_exists="replace",
index=False)
)
bonds_panel = pd.DataFrame({
    "cusip_id": np.tile(
        fisd_dummy["complete_cusip"],
        (end_date-start_date).days+1
    ),
    "trd_exctn_dt": np.repeat(
        # Assumed completion: repeat each calendar day once per bond
        pd.date_range(start_date, end_date, freq="D"), len(fisd_dummy)
    )
})
trace_enhanced_dummy = (pd.concat([bonds_panel]*5)
.assign(
trd_exctn_tm = lambda x: pd.to_datetime(
x["trd_exctn_dt"].astype(str)+" " +
np.random.randint(0, 24, size=len(x)).astype(str)+":" +
np.random.randint(0, 60, size=len(x)).astype(str)+":" +
np.random.randint(0, 60, size=len(x)).astype(str)
),
rptd_pr=np.random.uniform(10, 200, len(bonds_panel)*5),
entrd_vol_qt=1000*np.random.choice(
range(1,21), len(bonds_panel)*5, replace=True
),
yld_pt=np.random.uniform(-10, 10, len(bonds_panel)*5),
rpt_side_cd=np.random.choice(
["B", "S"], len(bonds_panel)*5, replace=True
),
cntra_mp_id=np.random.choice(
["C", "D"], len(bonds_panel)*5, replace=True
)
)
.reset_index(drop=True)
)
(trace_enhanced_dummy
.to_sql(name="trace_enhanced",
con=tidy_finance,
if_exists="replace",
index=False)
)
As stated in the introduction, the data does not contain any samples of the original data.
We merely generate random numbers for all columns of the tables that we use throughout
this book.
D
Clean Enhanced TRACE with Python
This appendix contains code to clean enhanced TRACE with Python. It is also available via
the following GitHub gist.1 Hence, you could also source the file with the following chunk.
gist_url = (
"https://gist.githubusercontent.com/patrick-weiss/"
"86ddef6de978fbdfb22609a7840b5d8b/raw/"
"8fbcc6c6f40f537cd3cd37368be4487d73569c6b/"
)
with httpimport.remote_repo(gist_url):
from clean_enhanced_TRACE_python import clean_enhanced_trace
We need this function in Chapter 5 to download and clean enhanced TRACE trade messages
following Dick-Nielsen (2009) and Dick-Nielsen (2014) for enhanced TRACE specifically.
This code is based on the resources provided by the project Open Source Bond Asset Pricing2
and their related publication Dickerson et al. (2023). We encourage you to acknowledge their effort. Relatedly, WRDS provides SAS code to clean enhanced TRACE data.
The function takes a vector of CUSIPs (in cusips), a connection to WRDS (connection)
explained in Chapter 3, and a start and end date (start_date and end_date, respec-
tively). Specifying too many CUSIPs will result in very slow downloads and a potential
failure due to the size of the request to WRDS. The dates should be within the coverage of
TRACE itself, i.e., starting after 2002, and the dates should be supplied as a string indicat-
ing MM/DD/YYYY. The output of the function contains all valid trade messages for the
selected CUSIPs over the specified period.
def clean_enhanced_trace(cusips,
connection,
start_date="'01/01/2002'",
end_date="'12/31/2023'"):
"""Clean enhanced TRACE data."""
import pandas as pd
import numpy as np
1 https://gist.githubusercontent.com/patrick-weiss/86ddef6de978fbdfb22609a7840b5d8b
2 https://openbondassetpricing.com/
trace_all = pd.read_sql_query(
sql=trace_query,
con=connection,
parse_dates={"trd_exctn_dt","trd_rpt_dt", "stlmnt_dt"}
)
# Post 2012-06-02
## Trades (trc_st = T) and correction (trc_st = R)
trace_post_TR = (trace_all
.query("trc_st in ['T', 'R']")
.query("trd_rpt_dt >= '2012-06-02'")
)
# Reversals (trc_st = Y)
trace_post_Y = (trace_all
.query("trc_st == 'Y'")
.query("trd_rpt_dt >= '2012-06-02'")
.get(["cusip_id", "orig_msg_seq_nb", "entrd_vol_qt",
"rptd_pr", "rpt_side_cd", "cntra_mp_id",
"trd_exctn_dt", "trd_exctn_tm"])
.assign(drop=True)
.rename(columns={"orig_msg_seq_nb": "msg_seq_nb"})
)
# Clean reversals
## Match the orig_msg_seq_nb of Y-message to msg_seq_nb of main message
trace_post = (trace_post_TR
.merge(trace_post_Y, how="left")
.query("drop != True")
.drop(columns="drop")
)
# Pre 2012-06-02
## Trades (trc_st = T)
trace_pre_T = (trace_all
.query("trd_rpt_dt < '2012-06-02'")
)
# Cancellations (trc_st = C)
trace_pre_C = (trace_all
.query("trc_st == 'C'")
.query("trd_rpt_dt < '2012-06-02'")
.get(["cusip_id", "orig_msg_seq_nb", "entrd_vol_qt",
"rptd_pr", "rpt_side_cd", "cntra_mp_id",
"trd_exctn_dt", "trd_exctn_tm"])
.assign(drop=True)
.rename(columns={"orig_msg_seq_nb": "msg_seq_nb"})
)
# Corrections (trc_st = W)
trace_pre_W = (trace_all
.query("trc_st == 'W'")
.query("trd_rpt_dt < '2012-06-02'")
)
## Correction loop
while (correction_control > 0):
# Create placeholder
## Only identifying columns of trace_pre_T (for joins)
placeholder_trace_pre_T = (trace_pre_T
.get(["cusip_id", "trd_exctn_dt", "msg_seq_nb"])
.rename(columns={"msg_seq_nb": "orig_msg_seq_nb"})
.assign(matched_T=True)
)
# Create placeholder
## Only identifying columns of trace_pre_W_correcting (for anti-joins)
placeholder_trace_pre_W_correcting = (trace_pre_W_correcting
.get(["cusip_id", "trd_exctn_dt", "orig_msg_seq_nb"])
.rename(columns={"orig_msg_seq_nb": "msg_seq_nb"})
.assign(corrected=True)
)
if correction_control == correction_control_last:
break
else:
correction_control_last = len(trace_pre_W)
continue
# Reversals (asof_cd = R)
## Record reversals
trace_pre_R = (trace_pre_T
.query("asof_cd == 'R'")
.sort_values(["cusip_id", "trd_exctn_dt",
trace_pre["seq"] = (trace_pre
.groupby(["cusip_id", "trd_exctn_dt", "entrd_vol_qt",
"rptd_pr", "rpt_side_cd", "cntra_mp_id"])
.cumcount()
)
"entrd_vol_qt", "rptd_pr"])
.assign(matched=True)
)
# Non-agency
trace_nonagency = (trace_clean
.query("cntra_mp_id == 'C'")
)
# Agency cleaned
trace_clean = pd.concat([trace_nonagency,
trace_agency_sells,
trace_agency_buys_filtered])
# Additional Filters
trace_add_filters = (trace_clean
.assign(
days_to_sttl_ct2 = lambda x: (
(x["stlmnt_dt"]-x["trd_exctn_dt"]).dt.days
)
)
.assign(
days_to_sttl_ct = lambda x: pd.to_numeric(
x["days_to_sttl_ct"], errors='coerce'
)
)
.query("days_to_sttl_ct.isnull() | days_to_sttl_ct <= 7")
.query("days_to_sttl_ct2.isnull() | days_to_sttl_ct2 <= 7")
.query("wis_fl == 'N'")
.query("spcl_trd_fl.isnull() | spcl_trd_fl == ''")
.query("asof_cd.isnull() | asof_cd == ''")
)
return trace_final
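A hypothetical call then looks like the following, where the CUSIP list is a placeholder and wrds stands for the WRDS connection established in Chapter 3:

trace_sample = clean_enhanced_trace(
    cusips=["00101JAH9"],  # placeholder CUSIP
    connection=wrds,
    start_date="'01/01/2014'",
    end_date="'12/31/2016'"
)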
E
Cover Image
The cover of the book is inspired by the fast-growing generative art community in R. Generative art refers to art that, in whole or in part, has been created with the use of an autonomous system. Instead of creating random dynamics, we rely on what is core to the book: the evolution of financial markets. Each circle corresponds to one of the twelve Fama-French industry portfolios, whereas each bar represents the average annual return between 1927 and 2022. The bar color is determined by the standard deviation of returns for each industry. The few lines of code below replicate the entire figure.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pandas_datareader as pdr
industries_ff_daily_raw = pdr.DataReader(
name="12_Industry_Portfolios_daily",
data_source="famafrench",
start="1927-01-01",
end="2022-12-31")[0]
industries_ff_daily = (industries_ff_daily_raw
.divide(100)
.reset_index(names="date")
.assign(date=lambda x: pd.to_datetime(x["date"].astype(str)))
.rename(str.lower, axis="columns")
)
industries_long = (industries_ff_daily
.melt(id_vars="date", var_name="name", value_name="value")
)
industries_order = sorted(industries_long["name"].unique())
data_plot = (industries_long
.assign(year=industries_long["date"].dt.to_period("Y"))
.groupby(["year", "name"])
.aggregate(total=("value", "mean"),
vola=("value", "std"))
.reset_index()
.assign(
vola_ntile=lambda x: pd.qcut(x["vola"], 42, labels=False)
)
)
dpi = 300
width = 2400/dpi
height = 1800/dpi
num_cols = 4
num_rows = int(len(industries_order)/num_cols)
fig, axs = plt.subplots(
num_rows, num_cols,
constrained_layout=True,
subplot_kw={"projection": "polar"},
figsize=(width, height),
dpi=dpi
)
axs = axs.flatten()
for i in enumerate(industries_order):
df = data_plot.copy().query(f'name == "{i[1]}"')
min_value = df["total"].min()
max_value = df["total"].max()
std_value = df["total"].std()
df["total"] = 2*(df["total"]-min_value)/(max_value-min_value)-1
ax = axs[i[0]]
ax.set_theta_offset(offset)
ax.set_ylim(-std_value*1400, 1)
ax.set_frame_on(False)
ax.xaxis.grid(False)
ax.yaxis.grid(False)
ax.set_xticks([])
ax.set_yticks([])
color_values = df["vola_ntile"].values
normalize = plt.Normalize(min(color_values), max(color_values))
colors = colormap(normalize(color_values))
ax.bar(
angles, values,
plt.tight_layout()
plt.subplots_adjust(wspace=-0.2, hspace=-0.1)
plt.gcf().savefig(
"images/cover-image.png", dpi = 300, pad_inches=0, transparent=False
)
Bibliography
Abadie, A., Athey, S., Imbens, G. W., and Wooldridge, J. (2017). When should you adjust
standard errors for clustering? Working Paper.
Allaire, J., Teague, C., Scheidegger, C., Xie, Y., and Dervieux, C. (2023). Quarto. https://github.com/quarto-dev/quarto-cli. Version 1.4.
Arnott, R., Harvey, C. R., and Markowitz, H. (2019). A backtesting protocol in the era of
machine learning. The Journal of Financial Data Science, 1(1):64–74.
Aroussi, R. (2023). yfinance. https://pypi.org/project/yfinance/. Version 0.2.31.
Avramov, D., Cheng, S., and Metzker, L. (2022). Machine learning vs. economic restrictions:
Evidence from stock return predictability. Management Science, 69(5):2547–3155.
Avramov, D., Cheng, S., Metzker, L., and Voigt, S. (2023). Integrating factor models. The
Journal of Finance, 78(3):1593–1646.
Bai, J., Bali, T. G., and Wen, Q. (2019). Common risk factors in the cross-section of
corporate bond returns. Journal of Financial Economics, 131(3):619–642.
Balduzzi, P. and Lynch, A. W. (1999). Transaction costs and predictability: Some utility
cost calculations. Journal of Financial Economics, 52(1):47–78.
Balduzzi, P. and Lynch, A. W. (2000). Predictability and transaction costs: The impact on
rebalancing rules and behavior. The Journal of Finance, 55(5):2285–2309.
Bali, T. G., Engle, R. F., and Murray, S. (2016). Empirical asset pricing: The cross section
of stock returns. John Wiley & Sons.
Ball, R. (1978). Anomalies in relationships between securities’ yields and yield-surrogates.
Journal of Financial Economics, 6(2–3):103–126.
Banz, R. W. (1981). The relationship between return and market value of common stocks.
Journal of Financial Economics, 9(1):3–18.
Bayer, M. (2012). Sqlalchemy. In Brown, A. and Wilson, G., editors, The Architecture of
Open Source Applications Volume II: Structure, Scale, and a Few More Fearless Hacks.
aosabook.org.
Bessembinder, H., Kahle, K. M., Maxwell, W. F., and Xu, D. (2008). Measuring abnormal
bond performance. Review of Financial Studies, 22(10):4219–4258.
Bessembinder, H., Maxwell, W., and Venkataraman, K. (2006). Market transparency, liquid-
ity externalities, and institutional trading costs in corporate bonds. Journal of Financial
Economics, 82(2):251–288.
Black, F. and Scholes, M. (1973). The pricing of options and corporate liabilities. Journal
of Political Economy, 81(3):637–654.
Brandt, M. W. (2010). Portfolio choice problems. In Ait-Sahalia, Y. and Hansen, L. P., edi-
tors, Handbook of Financial Econometrics: Tools and Techniques, volume 1 of Handbooks
in Finance, pages 269–336. North-Holland.
Brandt, M. W., Santa-Clara, P., and Valkanov, R. (2009). Parametric portfolio policies:
Exploiting characteristics in the cross-section of equity returns. Review of Financial
Studies, 22(9):3411–3447.
Brown, S. J. (1976). Optimal portfolio choice under uncertainty: A Bayesian approach. Phd
thesis, University of Chicago.
Bryan, J. (2022). Happy Git and GitHub for the useR. https://github.com/jennybc/happy-git-with-r.
Bryzgalova, S., Pelger, M., and Zhu, J. (2022). Forest through the trees: Building cross-
sections of stock returns. Working Paper.
Cameron, A. C., Gelbach, J. B., and Miller, D. L. (2011). Robust inference with multiway
clustering. Journal of Business & Economic Statistics, 29(2):238–249.
Campbell, J. Y. (1987). Stock returns and the term structure. Journal of Financial Eco-
nomics, 18(2):373–399.
Campbell, J. Y., Hilscher, J., and Szilagyi, J. (2008). In search of distress risk. The Journal
of Finance, 63(6):2899–2939.
Campbell, J. Y., Lo, A. W., MacKinlay, A. C., and Whitelaw, R. F. (1998). The economet-
rics of financial markets. Macroeconomic Dynamics, 2(4):559–562.
Campbell, J. Y. and Shiller, R. J. (1988). Stock prices, earnings, and expected dividends.
The Journal of Finance, 43(3):661–676.
Campbell, J. Y. and Vuolteenaho, T. (2004). Inflation illusion and stock prices. American
Economic Review, 94(2):19–23.
Campbell, J. Y. and Yogo, M. (2006). Efficient tests of stock return predictability. Journal
of Financial Economics, 81(1):27–60.
Chen, A. Y. and Zimmermann, T. (2022). Open source cross-sectional asset pricing. Critical
Finance Review, 11(2):207–264.
Chen, H.-Y., Lee, A. C., and Lee, C.-F. (2015). Alternative errors-in-variables models and
their applications in finance research. The Quarterly Review of Economics and Finance,
58:213–227.
Chen, L., Pelger, M., and Zhu, J. (2023). Deep learning in asset pricing. Management
Science.
Chopra, V. K. and Ziemba, W. T. (1993). The effect of errors in means, variances, and
covariances on optimal portfolio choice. Journal of Portfolio Management, 19(2):6–11.
Cochrane, J. H. (2005). Writing tips for PhD students. Note.
Cochrane, J. H. (2009). Asset pricing (revised edition). Princeton University Press.
Cochrane, J. H. (2011). Presidential address: Discount rates. The Journal of Finance,
66(4):1047–1108.
Coqueret, G. and Guida, T. (2020). Machine learning for factor investing: R version.
Chapman and Hall/CRC.
Coqueret, G. and Guida, T. (2023). Machine learning for factor investing: Python version.
Chapman and Hall/CRC.
Davis, M. H. A. and Norman, A. R. (1990). Portfolio selection with transaction costs.
Mathematics of Operations Research, 15(4):676–713.
De Prado, M. L. (2018). Advances in financial machine learning. John Wiley & Sons.
DeMiguel, V., Martín-Utrera, A., and Nogales, F. J. (2015). Parameter uncertainty in mul-
tiperiod portfolio optimization with transaction costs. Journal of Financial and Quanti-
tative Analysis, 50(6):1443–1471.
DeMiguel, V., Nogales, F. J., and Uppal, R. (2014). Stock return serial dependence and
out-of-sample portfolio performance. Review of Financial Studies, 27(4):1031–1073.
Dick-Nielsen, J. (2009). Liquidity biases in TRACE. The Journal of Fixed Income, 19(2):43–
55.
Dick-Nielsen, J. (2014). How to clean enhanced TRACE data. Working Paper.
Dick-Nielsen, J., Feldhütter, P., and Lando, D. (2012). Corporate bond liquidity before and
after the onset of the subprime crisis. Journal of Financial Economics, 103(3):471–492.
Dickerson, A., Mueller, P., and Robotti, C. (2023). Priced risk in corporate bonds. Journal
of Financial Economics, 150(2):103707.
Dixon, M. F., Halperin, I., and Bilokon, P. (2020). Machine learning in finance. Springer.
Donald, S. and Lang, K. (2007). Inference with difference-in-differences and other panel
data. The Review of Economics and Statistics, 89(2):221–233.
Easley, D., de Prado, M., O’Hara, M., and Zhang, Z. (2020). Microstructure in the machine
age. Review of Financial Studies, 34(7):3316–3363.
Edwards, A. K., Harris, L. E., and Piwowar, M. S. (2007). Corporate bond market trans-
action costs and transparency. The Journal of Finance, 62(3):1421–1451.
Erickson, T. and Whited, T. M. (2012). Treating measurement error in Tobin’s q. Review
of Financial Studies, 25(4):1286–1329.
Fama, E. F. and French, K. R. (1989). Business conditions and expected returns on stocks
and bonds. Journal of Financial Economics, 25(1):23–49.
Fama, E. F. and French, K. R. (1992). The cross-section of expected stock returns. The
Journal of Finance, 47(2):427–465.
Fama, E. F. and French, K. R. (1993). Common risk factors in the returns on stocks and
bonds. Journal of Financial Economics, 33(1):3–56.
Fama, E. F. and French, K. R. (1997). Industry costs of equity. Journal of Financial
Economics, 43(2):153–193.
Fama, E. F. and French, K. R. (2015). A five-factor asset pricing model. Journal of Financial
Economics, 116(1):1–22.
Fama, E. F. and MacBeth, J. D. (1973). Risk, return, and equilibrium: Empirical tests.
Journal of Political Economy, 81(3):607–636.
Fan, J., Fan, Y., and Lv, J. (2008). High dimensional covariance matrix estimation using a
factor model. Journal of Econometrics, 147(1):186–197.
Fazzari, S. M., Hubbard, R. G., Petersen, B. C., Blinder, A. S., and Poterba, J. M. (1988).
Financing constraints and corporate investment. Brookings Papers on Economic Activity,
1988(1):141–206.
Foundation, P. S. and JetBrains (2022). Python developers survey 2022 results. https://lp.jetbrains.com/python-developers-survey-2022/.
Frazzini, A. and Pedersen, L. H. (2014). Betting against beta. Journal of Financial Eco-
nomics, 111(1):1–25.
Gagliardini, P., Ossola, E., and Scaillet, O. (2016). Time-varying risk premium in large
cross-sectional equity data sets. Econometrica, 84(3):985–1046.
Gareth, J., Daniela, W., Trevor, H., and Robert, T. (2013). An introduction to statistical
learning: With applications in R. Springer.
Garlappi, L., Uppal, R., and Wang, T. (2007). Portfolio selection with parameter and model
uncertainty: A multi-prior approach. Review of Financial Studies, 20(1):41–81.
Gârleanu, N. and Pedersen, L. H. (2013). Dynamic trading with predictable returns and
transaction costs. The Journal of Finance, 68(6):2309–2340.
Gehrig, T., Sögner, L., and Westerkamp, A. (2020). Making portfolio policies work. Working
Paper.
Goldstein, I., Koijen, R. S. J., and Mueller, H. M. (2021). COVID-19 and its impact on
financial markets and the real economy. Review of Financial Studies, 34(11):5135–5148.
Gu, S., Kelly, B., and Xiu, D. (2020). Empirical asset pricing via machine learning. Review
of Financial Studies, 33(5):2223–2273.
Gulen, H. and Ion, M. (2015). Policy uncertainty and corporate investment. Review of
Financial Studies, 29(3):523–564.
Guo, H. (2006). On the out-of-sample predictability of stock market returns. The Journal
of Business, 79(2):645–670.
Halling, M., Yu, J., and Zechner, J. (2021). Primary corporate bond markets and social
responsibility. Working Paper.
Handler, L., Jankowitsch, R., and Pasler, A. (2022). The effects of ESG performance and
preferences on us corporate bond prices. Working Paper.
Handler, L., Jankowitsch, R., and Weiss, P. (2021). Covenant prices of US corporate bonds.
Working Paper.
Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau,
D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van
Kerkwijk, M. H., Brett, M., Haldane, A., del Río, J. F., Wiebe, M., Peterson, P., Gérard-
Marchant, P., Sheppard, K., Reddy, T., Weckesser, W., Abbasi, H., Gohlke, C., and
Oliphant, T. E. (2020). Array programming with NumPy. Nature, 585(7825):357–362.
Harvey, C. R. (2017). Presidential address: The scientific outlook in financial economics.
The Journal of Finance, 72(4):1399–1440.
Harvey, C. R., Liu, Y., and Zhu, H. (2016). … and the cross-section of expected returns.
Review of Financial Studies, 29(1):5–68.
Hasler, M. (2021). Is the value premium smaller than we thought? Working Paper.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The elements of statistical learning:
Data mining, inference and prediction. Springer, 2 edition.
Hautsch, N., Kyj, L. M., and Malec, P. (2015). Do high-frequency data improve high-
dimensional portfolio allocations? Journal of Applied Econometrics, 30(2):263–290.
Hautsch, N. and Voigt, S. (2019). Large-scale portfolio allocation under transaction costs
and model uncertainty. Journal of Econometrics, 212(1):221–240.
Hilpisch, Y. (2018). Python for Finance. O’Reilly Media, Inc., second edition.
Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Applications to nonorthogonal
problems. Technometrics, 12(1):69–82.
Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural
Networks, 4(2):251–257.
Hou, K., Xue, C., and Zhang, L. (2014). Digesting anomalies: An investment approach.
Review of Financial Studies, 28(3):650–705.
Hou, K., Xue, C., and Zhang, L. (2020). Replicating anomalies. Review of Financial Studies,
33(5):2019–2133.
Huang, J.-Z. and Shi, Z. (2021). What do we know about corporate bond returns? Annual
Review of Financial Economics, 13(1):363–399.
Hull, J. C. (2020). Machine learning in business. An introduction to the world of data
science. Independently published.
Huynh, T. D. and Xia, Y. (2021). Climate change news risk and corporate bond returns.
Journal of Financial and Quantitative Analysis, 56(6):1985–2009.
Jacobsen, B. (2014). Some research and writing tips. Note.
Jagannathan, R. and Ma, T. (2003). Risk reduction in large portfolios: Why imposing the
wrong constraints helps. The Journal of Finance, 58(4):1651–1684.
Jegadeesh, N. and Titman, S. (1993). Returns to buying winners and selling losers: Impli-
cations for stock market efficiency. The Journal of Finance, 48(1):65–91.
Jensen, T. I., Kelly, B. T., Malamud, S., and Pedersen, L. H. (2022). Machine learning and
the implementable efficient frontier. Working Paper.
Jiang, J., Kelly, B. T., and Xiu, D. (2023). (Re-)Imag(in)ing Price Trends. The Journal of
Finance, 78(6):3193–3249.
Jobson, D. J. and Korkie, B. (1980). Estimation for Markowitz efficient portfolios. Journal
of the American Statistical Association, 75(371):544–554.
Jorion, P. (1986). Bayes-Stein estimation for portfolio analysis. Journal of Financial and
Quantitative Analysis, 21(03):279–292.
Kan, R. and Zhou, G. (2007). Optimal portfolio choice with parameter uncertainty. Journal
of Financial and Quantitative Analysis, 42(3):621–656.
Kelliher, C. (2022). Quantitative Finance with Python. Chapman and Hall/CRC, 1st edition.
Kelly, B. T., Palhares, D., and Pruitt, S. (2021). Modeling corporate bond returns. Working
Paper.
Kibirige, H. (2023a). mizani: A port of the scales package for use with plotnine. https://pypi.org/project/mizani/. Version 0.9.3.
Kibirige, H. (2023b). plotnine: An implementation of the grammar of graphics in python.
https://pypi.org/project/plotnine/. Version 0.12.3.
Kiesling, L. (2003). Writing tips for economics (and pretty much anything else). Note.
Kim, D. (1995). The errors in the variables problem in the cross-section of expected stock
returns. The Journal of Finance, 50(5):1605–1634.
Kothari, S. P. and Shanken, J. A. (1997). Book-to-market, dividend yield, and expected
market returns: A time-series analysis. Journal of Financial Economics, 44(2):169–203.
Kraft, D. (1994). Algorithm 733: TOMP–Fortran Modules for Optimal Control Calculations.
ACM Trans. Math. Softw., 20(3):262–281.
Lamont, O. (1998). Earnings and expected returns. The Journal of Finance, 53(5):1563–
1587.
Ledoit, O. and Wolf, M. (2003). Improved estimation of the covariance matrix of stock re-
turns with an application to portfolio selection. Journal of Empirical Finance, 10(5):603–
621.
Ledoit, O. and Wolf, M. (2004). Honey, I shrunk the sample covariance matrix. The Journal
of Portfolio Management, 30(4):110–119.
Ledoit, O. and Wolf, M. (2012). Nonlinear shrinkage estimation of large-dimensional covari-
ance matrices. The Annals of Statistics, 40(2):1024–1060.
Lintner, J. (1965). Security prices, risk, and maximal gains from diversification. The Journal
of Finance, 20(4):587–615.
Magill, M. J. P. and Constantinides, G. M. (1976). Portfolio selection with transactions
costs. Journal of Economic Theory, 13(2):245–263.
Markowitz, H. (1952). Portfolio selection. The Journal of Finance, 7(1):77–91.
McLean, R. D. and Pontiff, J. (2016). Does academic research destroy stock return predictability? The Journal of Finance, 71(1):5–32.
Menkveld, A. J., Dreber, A., Holzmeister, F., Huber, J., Johannesson, M., Kirchler, M.,
Neusüss, S., Razen, M., and Weitzel, U. (2021). Non-standard errors. Working Paper.
Merton, R. C. (1972). An analytic derivation of the efficient portfolio frontier. Journal of
Financial and Quantitative Analysis, 7(4):1851–1872.
Mossin, J. (1966). Equilibrium in a capital asset market. Econometrica, 34(4):768–783.
Mullainathan, S. and Spiess, J. (2017). Machine learning: An applied econometric approach.
Journal of Economic Perspectives, 31(2):87–106.
Nagel, S. (2021). Machine learning in asset pricing. Princeton University Press.
Newey, W. K. and West, K. D. (1987). A simple, positive semi-definite, heteroskedasticity
and autocorrelation consistent covariance matrix. Econometrica, 55(3):703–708.
O’Hara, M. and Zhou, X. A. (2021). Anatomy of a liquidity crisis: Corporate bonds in the
COVID-19 crisis. Journal of Financial Economics, 142(1):46–68.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel,
M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau,
D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in
Python. Journal of Machine Learning Research, 12:2825–2830.
Peters, R. H. and Taylor, L. A. (2017). Intangible capital and the investment-q relation.
Journal of Financial Economics, 123(2):251–272.
Petersen, M. A. (2008). Estimating standard errors in finance panel data sets: Comparing
approaches. Review of Financial Studies, 22(1):435–480.
Pflug, G., Pichler, A., and Wozabal, D. (2012). The 1/N investment strategy is optimal
under high model ambiguity. Journal of Banking & Finance, 36(2):410–417.
Python Software Foundation (2023). Python language reference, version 3.10.11.
Roberts, M. R. and Whited, T. M. (2013). Endogeneity in empirical corporate finance. In
Handbook of the Economics of Finance, volume 2, pages 493–572. Elsevier.
Scheuch, C., Voigt, S., and Weiss, P. (2023). Tidy Finance with R. Chapman and Hall/CRC,
1st edition.
Seabold, S. and Perktold, J. (2010). statsmodels: Econometric and statistical modeling with Python. In Proceedings of the 9th Python in Science Conference.
Seltzer, L. H., Starks, L., and Zhu, Q. (2022). Climate regulatory risk and corporate bonds.
Working Paper.
Shanken, J. (1992). On the estimation of beta-pricing models. Review of Financial Studies,
5(1):1–33.
Sharpe, W. F. (1964). Capital asset prices: A theory of market equilibrium under conditions
of risk. The Journal of Finance, 19(3):425–442.
Sheppard, K. (2023). linearmodels: Instrumental variable and linear panel models for Python. https://pypi.org/project/linearmodels/. Version 5.3.
Soebhag, A., Van Vliet, B., and Verwijmeren, P. (2022). Mind your sorts. Working Paper.
Joblib Development Team (2023). joblib: Running Python functions as pipeline jobs. https://joblib.readthedocs.io/. Version 1.3.2.
Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO. Journal of the
Royal Statistical Society. Series B (Methodological), 58(1):267–288.
Torakis, J. (2023). httpimport: Module for remote in-memory Python package/module loading through HTTP. https://pypi.org/project/httpimport/. Version 1.3.1.
Tsay, R. S. (2010). Analysis of financial time series. John Wiley & Sons.
Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D.,
Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M.,
Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson,
E., Carey, C. J., Polat, İ., Feng, Y., Moore, E. W., VanderPlas, J., Laxalde, D., Perktold,
J., Cimrman, R., Henriksen, I., Quintero, E. A., Harris, C. R., Archibald, A. M., Ribeiro,
A. H., Pedregosa, F., van Mulbregt, P., and SciPy 1.0 Contributors (2020). SciPy 1.0:
Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–
272.
Walter, D., Weber, R., and Weiss, P. (2022). Non-standard errors in portfolio sorts. Working
Paper.
Wang, Z. (2005). A shrinkage approach to model uncertainty and asset allocation. Review
of Financial Studies, 18(2):673–705.
Wasserstein, R. L. and Lazar, N. A. (2016). The ASA Statement on p-Values: Context,
process, and purpose. The American Statistician, 70(2):129–133.
Weiming, J. M. (2019). Mastering Python for Finance. Packt, 2nd edition.
Welch, I. and Goyal, A. (2008). A comprehensive look at the empirical performance of
equity premium prediction. Review of Financial Studies, 21(4):1455–1508.
Wes McKinney (2010). Data Structures for Statistical Computing in Python. In Stéfan
van der Walt and Jarrod Millman, editors, Proceedings of the 9th Python in Science
Conference, pages 56–61.
Wickham, H. (2014). Tidy data. Journal of Statistical Software, 59(1):1–23.
Wickham, H. and Girlich, M. (2022). tidyr: Tidy messy data. R package version 1.2.1.
Wilkinson, L. (2012). The grammar of graphics. Springer.
Wooldridge, J. M. (2010). Econometric analysis of cross section and panel data. The MIT
Press.
Zaffaroni, P. and Zhou, G. (2022). Asset pricing: Cross-section predictability. Working
Paper.
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net.
Journal of the Royal Statistical Society. Series B (Statistical Methodology), 67(2):301–
320.