Machine Learning Methods in Finance
Abstract
We study how researchers can apply machine learning (ML) methods in finance. We first
establish that the three distinct categories of ML (supervised learning, unsupervised learn-
ing, reinforcement learning and others) address fundamentally different problems than tra-
ditional econometric approaches. Then, we review the current state of research of ML in
finance and identify three archetypes of applications: i) the construction of superior and
novel measures, ii) the reduction of prediction error, and iii) the extension of the standard
econometric toolset. With this taxonomy, we provide an outlook on potential future direc-
tions for both researchers and practitioners. We finally apply ML to typical problems in
finance. Our results suggest large benefits of ML methods compared to traditional ap-
proaches and indicate that ML holds great potential for future research in finance.
† We appreciate helpful comments and suggestions made by Andreas Benz, Martin Ruckes, and Fabian Silbereis.
∗ Hoang and Wiegratz are with the Karlsruhe Institute of Technology (KIT). Address correspondence to Daniel Hoang,
Institute of Finance, Banking, and Insurance, Karlsruhe Institute of Technology, Kaiserstr. 12, 76131 Karlsruhe, Ger-
many, Phone: +49 721 608-44768 or e-mail: [email protected].
1. Motivation
Artificial intelligence is increasingly entering our day-to-day life with impressive applications: face
detection enables safe and efficient airport travel, voice recognition allows us to talk to personal
assistants on our smartphones and smart home devices, and ever more firms are using chatbots
for quick customer support. Almost everyone interacts with modern artificial intelligence many
The main technology behind artificial intelligence is machine learning (ML). ML methods enable
machines to conduct such complex tasks as detecting faces, understanding speech, or answering
messages. Given the power of ML technology, it is natural to ask whether we can also apply ML
methods elsewhere. This paper addresses the use of ML to solve problems in financial economics.
Varian (2014) describes ML as an appropriate tool in the economic analysis of big data and
presents some ML methods with examples in economics. He further hints at potential ML appli-
cations in econometrics. Mullainathan and Spiess (2017) identify prediction problems as the main
use case of ML in economics and present different categories of existing and potential future
applications. Athey and Imbens (2019) illustrate the most relevant ML methods from an econo-
metric perspective. They also provide an overview of ML’s potential beyond pure prediction, for
instance, for causal inference.
In financial economics, while the number of published ML papers is still limited, an increasing
number of recent applications exploit the potential of ML. For instance, Bandiera et al. (2020)
analyze CEO behavior with ML methods. Gu, Kelly, and Xiu (2020) use ML to predict risk
premiums as a typical problem in empirical asset pricing. Bertsch et al. (2020) study bank mis-
conduct with an ML-based measure. Overall, the literature on ML in financial
economics has greatly expanded recently. However, it is still mostly unclear where and how to
apply ML in the field.
The contribution of this paper is threefold. First, we give a high-level introduction to ML aimed
at financial economists. We illuminate the different types of ML, their purposes and functionali-
ties, and the available methods for each type. Given our focus on financial economics, we place
special emphasis on the difference between traditional econometrics and ML. Our concise intro-
duction allows researchers in the field to quickly grasp the ML essentials that are relevant for
their own work. Second, we develop a taxonomy of ML applications in financial economics.
Given the increasing number of recent studies, earlier classifications do not capture existing ap-
plications well. We review the up-to-date literature in the field and divide it into three distinct
archetypes. Our taxonomy allows researchers to better understand the current state of the litera-
ture and how different contributions relate to each other. Furthermore, it serves as guidance for
future research.
Third, we apply ML to solve typical problems at the intersection of financial economics and real
estate finance: the pricing of real estate assets and credit risk prediction. To study the accuracy
and usefulness of ML predictions for heterogeneous real estate assets, we exploit a unique dataset
of more than four million German real estate properties listed for sale on five German property
portals and in major newspapers between 2000 and 2020. Our results show that the ML-based
price estimates for individual properties exhibit dramatically lower pricing errors than traditional
approaches such as hedonic pricing with ordinary least squares (OLS). Hence, ML can directly
help participants in real estate markets make more informed decisions and improve aggregate
market efficiency. In general, our application illustrates how ML adds value in solving a typical
problem in financial economics.
Traditional econometrics aims to provide causal explanations for economic phenomena by analyz-
ing relationships between economic variables. ML, in contrast, serves different purposes. There
are two major types of ML: supervised and unsupervised learning. Supervised learning provides
us with predictions that exhibit low out-of-sample prediction errors by automatically considering
nonlinearities and interaction effects. Unsupervised learning infers structural information from the
given data. Hence, ML is suited for different kinds of applications than traditional econometrics.
Based on our review of the financial economics literature, we classify ML applications into three
archetypes: 1) construction of superior and novel measures, 2) reduction of prediction
error in economic prediction problems, and 3) extension of the existing econometric toolset.
First, researchers can use ML to construct superior and novel measures. ML methods are able to
extract information from unconventional data such as text or images. The extracted information
can then serve as a superior or novel measure of an economic variable. Superior ML measures
exhibit lower measurement error and therefore allow more precise estimates of economic relation-
ships than traditional measures can. Novel measures allow us to conduct analyses with previously
unmeasurable economic aspects.
Second, researchers can use ML to reduce prediction error in economic prediction problems. There
are certain problems in financial economics that are prediction problems at their core. For in-
stance, the fundamental problem in credit risk is the prediction of credit default. Given that the
main functionality of supervised ML is prediction, ML methods are able to provide better results
than traditional approaches.
Third, researchers can use ML to extend the existing econometric toolset. Econometric tools often
contain a prediction component. For instance, the first stage of an instrumental variable design is
effectively a prediction problem. ML methods can enhance such existing econometric tools by
improving the performance of their prediction component. Furthermore, some ML methods them-
selves directly serve as new econometric tools. For instance, clustering methods from unsupervised
learning can group similar observations.
In our empirical application, we use ML for real
estate pricing, which is particularly relevant in the areas of household finance and real estate
economics. More specifically, we predict the prices of real estate assets in Germany with various
ML methods and compare their accuracy to estimates from traditional hedonic pricing with OLS.
Figure 1 illustrates our key results. The two charts compare the actual property prices with the
OLS estimates and with the price predictions of our best-performing ML method (boosted regres-
sion trees). On average, the price predictions from the ML approach are much closer to the actual
prices than the OLS estimates. The difference in pricing accuracy is especially pronounced at the
upper end of the price range: while the OLS estimates show large deviations from the actual
prices, the ML-based price predictions are much closer. As we show in more detail below, ML is
especially able to improve pricing accuracy for more expensive real estate assets, which traditional
hedonic regression cannot price well. Furthermore, our results indicate that nonlinearities and
interaction effects are most relevant for real estate assets at the upper end of the price range.
Figure 1. Comparison of pricing accuracy between hedonic pricing (OLS) and ML
This figure depicts the pricing accuracy of traditional hedonic pricing (OLS) and ML. On average, the
ML-based price estimates are much closer to the actual prices than the OLS estimates are. The benefit
of ML is most pronounced at the upper end of the price range, where OLS performs especially poorly.
The superior pricing power of ML compared to traditional hedonic pricing with OLS becomes
even more apparent if we look at different performance metrics. While hedonic pricing with OLS
can only explain approximately 40% of the price variation as measured by R², ML almost doubles
R² to approximately 77%. On average, the OLS estimates misprice real estate assets by almost
44%, which ML can reduce to less than 27%. In monetary terms, ML reduces the average mis-
pricing by more than 82,000 EUR. Given that the average property price in our sample is 393,000
EUR, the improvement in pricing accuracy is not only statistically significant but also economi-
cally meaningful.
In the top price quintile where hedonic pricing with OLS performs especially poorly, ML shows
even stronger performance. The ML-based price predictions exhibit an average pricing error of
less than 24% compared to over 50% for the OLS estimates. In monetary terms, ML reduces the
average mispricing by more than 240,000 EUR, which is economically very large, given the average
property price in this segment.
The remainder of this paper is organized as follows. Section 2 contains a high-level introduction
to ML. In Section 3, we present the three types of ML applications and review the corresponding
literature. In Section 4, we apply ML in real estate pricing to illustrate the benefits of ML. In
Section 5, we conclude.
2. Introduction to ML
In this section, we provide a high-level introduction to ML to facilitate a better understanding of
its applications in financial economics later in this paper. We focus on the fundamental problems
that ML can solve, how the different types of ML work, and which methods exist. Since our main
target audience is economists, we place special emphasis on the differences between ML and
traditional econometrics.
Most studies in empirical economics aim for causal explanations of economic phenomena by ana-
lyzing relationships between economic variables. For instance, we might want to explain the cross-
sectional differences in real estate prices. We then mainly care about how different influencing
factors, such as location or number of bedrooms, affect price. Traditional econometric methods
provide us with estimates 𝛽̂ for the direction and strength of these influencing factors.
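To make this concrete, the following minimal sketch estimates such 𝛽̂ coefficients with OLS on simulated housing data; the statsmodels library and all variable names and magnitudes are our own illustrative assumptions, not taken from the dataset used later in this paper.

import numpy as np
import statsmodels.api as sm

# Simulated housing data (illustrative assumption): price depends on the
# number of bedrooms and a location score.
rng = np.random.default_rng(0)
n = 1_000
bedrooms = rng.integers(1, 6, size=n)
location = rng.normal(size=n)
price = 100_000 + 25_000 * bedrooms + 40_000 * location + rng.normal(0, 30_000, n)

X = sm.add_constant(np.column_stack([bedrooms, location]))
results = sm.OLS(price, X).fit()
print(results.params)   # beta-hat: direction and strength of each factor
print(results.pvalues)  # statistical significance of each estimate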
ML, in contrast, serves different purposes. Instead of providing direct insights into the relation-
ships between economic variables, ML serves as a method for prediction or for data structure
inference. In prediction, we use the given observations to infer estimates for the dependent variable
𝑦̂ of new observations based on their covariates 𝑋. For instance, we might want to use the observed
prices and property characteristics in the real estate market to predict prices of previously unob-
served properties based on their characteristics. The first major type of ML, supervised learning,
provides us with such predictions.
In data structure inference, we derive different kinds of structural information from given data 𝑋.
For instance, we might want to identify clusters in the data to learn how different observations
relate to each other. The second major type of ML, unsupervised learning, provides us with
such structural information.
Figure 2 gives an overview of the differences between traditional econometrics and the two major
types of ML, supervised and unsupervised learning. Most importantly, the three approaches serve
different purposes. As explained above, traditional econometrics aims for explanations; that is, it
solves 𝛽̂-problems. Supervised learning provides
predictions; that is, it solves so-called 𝑦̂-problems (Mullainathan and Spiess, 2017). Unsupervised
learning infers the data structure from given data, so it solves 𝑋-problems.
Figure 2. Differences between traditional econometrics and the two major types of ML:
supervised and unsupervised learning
This figure gives an overview of how traditional econometrics and the two major types of ML, supervised
and unsupervised learning, differ with regard to their methodological process and purpose. Traditional
econometrics enables explanations of economic phenomena, while supervised learning provides predictions
and unsupervised learning infers data structure.
The three approaches also differ with regard to their methodological process. Every approach
starts from data. In traditional econometrics, we have a dependent variable 𝑦 and multiple inde-
pendent variables 𝑋. In ML jargon, we call such data “labeled data”, since there is a special label
𝑦 for each observation. The dominant method in traditional econometrics is linear regression,
mainly due to its simplicity and interpretability. We usually require unbiased estimates of the
strength and direction of economic relations; thus, the OLS estimator, as the best linear unbiased
estimator, is most common. As a result, we obtain an explanatory model in the form of a regres-
sion line and different metrics of statistical significance, such as t-values and p-values. Finally,
we use these results to explain the economic phenomenon of interest.
In supervised learning, we also start with labeled data. Here, the special label 𝑦 represents the
target variable that we want to predict based on the predictor variables 𝑋. Applying a supervised
ML method on the given data yields a prediction model as well as estimates for its expected
prediction performance. We can use the prediction model to make out-of-sample predictions, that
is, predictions for the value of the target variable of previously unseen examples based on their
characteristics.
In unsupervised learning, we start with unlabeled data, which is the defining distinction between
unsupervised and supervised learning in the literature (Hastie, Tibshirani, and Friedman, 2009,
pp. 485-486). Unlabeled data means that there is no special variable; all variables are considered
equal. Applying an unsupervised ML method to the given data provides us with a data structure
model and data structure characteristics. Finally, we can use both results to infer structural
information from the given data.1
In the following subsections, we describe the two major types of ML – supervised and unsupervised
learning – in more detail and give an overview of the relevant methods for each type. In the last
subsection, we briefly cover other types of ML, such as reinforcement learning.
2.1 Supervised Learning
The purpose of supervised learning is prediction. More specifically, we aim for out-of-sample pre-
dictions with high prediction performance. To accurately assess the expected prediction perfor-
mance on previously unseen observations, we typically split the given data into training data and
test data. We apply a supervised ML method on the training data to build a prediction model.
Then, we apply the prediction model on the test data to derive an estimate for the expected out-
of-sample prediction performance.
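The following sketch illustrates this workflow on simulated data; the scikit-learn estimator and all data are illustrative assumptions.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 10))                       # predictor variables
y = X @ rng.normal(size=10) + rng.normal(size=1_000)   # target variable

# Hold out 20% of the observations as test data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)       # build the prediction model

# Estimate of the expected out-of-sample prediction performance.
print(r2_score(y_test, model.predict(X_test)))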
To build a prediction model, there exist various supervised ML methods of differing complexity.
In general, more complex methods enable higher prediction performance but reduce interpreta-
bility. Hence, the choice of method involves a tradeoff between
prediction performance and interpretability. We can further distinguish between different classes
of methods based on similarities in their general approach and based on the data types for which
they are best suited.
The simplest method is linear regression with the OLS estimator. As stated before, OLS provides
unbiased estimates of linear relations but typically only limited prediction performance.
A simple way to improve the prediction performance of the linear OLS model would be to add
nonlinear transformations and interactions of the original predictor variables to the model speci-
fication. In many cases, however, it is ex ante unclear which nonlinearities and interactions are
actually relevant. Including all possible combinations is infeasible since it results in an exorbitant
1 The more abstract descriptions of the individual steps of unsupervised learning result from the fact that various
methods with many different goals fall under the umbrella term unsupervised learning. See Section 2.2 for an overview
of the different categories and methods in unsupervised ML.
number of variables, which quickly exceeds the number of observations. In many cases, the sheer
number of candidate terms renders this approach impractical.
Since OLS is only the best linear unbiased estimator (BLUE), a more feasible way to improve its
prediction performance is to give up unbiasedness. Prediction prob-
lems do not require unbiased variable coefficients. Instead, we only aim for maximal prediction
performance. Regularized linear methods offer a way to systematically introduce bias to improve
the prediction performance of OLS (Hastie, Tibshirani, and Friedman, 2009, pp. 61-79). More
specifically, regularization means that such methods shrink the coefficients of the predictor varia-
bles to increase prediction performance.2 The most common method for regularized linear regres-
sion is the least absolute shrinkage and selection operator (LASSO). LASSO works similarly to
OLS but introduces bias by adding a penalty term in its optimization function to penalize large
variable coefficients with little informational content. The specific functional form of the penalty
term tends to drive irrelevant coefficients to exactly zero.3 Hence, LASSO is often used for variable
selection in addition to pure prediction and thereby provides relatively good interpretability.
In addition to LASSO, there are other regularized linear methods that differ with regard to the
functional form of their penalty terms. Ridge regression uses a penalty term that does not drive
coefficients to exactly zero and is therefore less interpretable.4 However, ridge regression often
provides superior prediction performance compared to LASSO. Elastic net regression combines
the two methods (Zou and Hastie, 2005). Its penalty term is a linear combination of the penalty
terms of LASSO and ridge regression.5
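The following sketch contrasts the three regularized linear methods on simulated data; the scikit-learn estimators and the regularization strengths are illustrative assumptions.

import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=500)   # only 2 of 50 predictors matter

lasso = Lasso(alpha=0.1).fit(X, y)                     # alpha controls regularization
ridge = Ridge(alpha=1.0).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # mixes both penalty terms

# LASSO drives irrelevant coefficients to exactly zero (variable selection),
# whereas ridge only shrinks them toward zero.
print("zero coefficients, LASSO:", (lasso.coef_ == 0).sum())
print("zero coefficients, ridge:", (ridge.coef_ == 0).sum())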
In contrast to the linear methods just discussed, more complex ML methods automatically con-
sider relevant nonlinearities and interaction effects. For numerical data, tree-based ML methods
are widespread (Hastie, Tibshirani, and Friedman, 2009, pp. 305-334). The simplest tree-based
2 The introduction of bias can increase prediction performance because of the bias-variance tradeoff. See, for instance, Hastie, Tibshirani, and Friedman (2009, pp. 37-38, 219-228) for technical details.
3 LASSO uses the penalty term 𝛼 ∑ⱼ|𝛽ⱼ|, where 𝛼 is a parameter that controls the amount of regularization and 𝛽ⱼ are the variable coefficients.
4 Ridge regression uses the penalty term 𝛼 ∑ⱼ 𝛽ⱼ², where 𝛼 is a parameter that controls the amount of regularization and 𝛽ⱼ are the variable coefficients.
5 Elastic net regression uses the penalty term 𝛼₁ ∑ⱼ|𝛽ⱼ| + 𝛼₂ ∑ⱼ 𝛽ⱼ², where 𝛼₁ and 𝛼₂ are parameters that control the amount of regularization and the proportion of the two subparts, and 𝛽ⱼ are the variable coefficients.
method is the decision tree, which at the same time acts as the building block of all other tree-
based methods. Figure 4 depicts a simplified decision tree trained for house price prediction. It
consists of nodes at which the tree splits depending on the value of a certain predictor variable.
Decision trees typically contain multiple layers of nodes, so they implicitly consider interactions
between multiple variables. When the tree reaches a leaf node, that is, a node after which there
is no further split, the tree returns a prediction value. For more details on decision trees and how
to build them algorithmically from training data, see, for example, Loh (2011). Given that we can
observe the relevant predictor variables and thresholds in the splits, decision trees are character-
ized by high interpretability.
Random forests combine multiple decision trees (Breiman, 2001). More specifically, the random
forest method repeatedly draws bootstrap samples from the given data and builds a separate
decision tree from each sample. The prediction of a random forest is then the average prediction
value of the different trees. Random forests typically achieve much higher prediction performance
than single decision trees, at the cost of lower interpretability.
Figure 4. Illustrative depiction of a decision tree trained for house price prediction
This figure depicts a simplified version of a decision tree trained for house price prediction. Nodes repre-
sent splits according to the value of a certain predictor variable. Trained decision trees typically consist
of multiple layers, so they implicitly consider variable interactions. At leaf nodes, the decision tree returns
a prediction value. Decision trees in practice usually consist of many more layers than are shown in this
illustrative example.
Boosted regression trees extend the concept of random forests to further improve their prediction
performance (Hastie, Tibshirani, and Friedman, 2009, pp. 353-358). Instead of combining many
independent decision trees, the boosted regression tree method builds the trees iteratively and
considers which observations the previous trees could not predict well. Boosted regression trees
typically not only outperform random forests but often are among the winning algorithms in data
science competitions.
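The following sketch compares the three tree-based methods on simulated data with nonlinearities and interactions; hyperparameters and data are illustrative assumptions.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(2_000, 5))
# Target with an interaction (x0 * x1) and a nonlinearity (threshold in x2).
y = 10 * X[:, 0] * X[:, 1] + 5 * (X[:, 2] > 0.5) + rng.normal(scale=0.5, size=2_000)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for model in (DecisionTreeRegressor(max_depth=4, random_state=0),
              RandomForestRegressor(n_estimators=300, random_state=0),
              GradientBoostingRegressor(n_estimators=300, random_state=0)):
    model.fit(X_tr, y_tr)
    print(type(model).__name__, round(model.score(X_te, y_te), 3))  # out-of-sample R²

On such data, the out-of-sample R² typically increases from the single tree to the random forest to the boosted trees, mirroring the performance ordering described above.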
While tree-based ML methods and, in particular, boosted regression trees achieve state-of-the-art
prediction performance with numerical data, neural networks excel with unconventional data such
as text, images, or videos. Figure 5 depicts a small feed-forward neural network. A neural network
consists of neurons with links among each other and arranged in layers (Hastie, Tibshirani, and
Friedman, 2009, pp. 389-415). In their most basic version, each neuron can be thought of as a
single linear regression model combined with a nonlinear “activation” function. The links describe
the flow of data between the neurons. First, a neural network’s input layer receives the predictor
variables, for instance, pixel-level image data. Then, the hidden layers process the data and deliver
them to the output layer, which returns the final prediction value. The neural network in our
example is a highly simplified version of the neural networks used in practice. Neural networks
for real applications are much larger, with many hidden layers and millions of neurons and links.
Furthermore, they do not have to be fully connected, so not every neuron of a layer necessarily
links to every neuron of the next layer.
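Complementing the figure, the following sketch trains a small fully connected feed-forward network with scikit-learn; the layer sizes and data are illustrative assumptions.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 8))     # the input layer receives these predictors
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=1_000)

# Two hidden layers with 32 neurons each; every neuron is a linear model
# followed by a nonlinear activation (here ReLU).
net = MLPRegressor(hidden_layer_sizes=(32, 32), activation="relu",
                   max_iter=2_000, random_state=0).fit(X, y)
print(net.predict(X[:3]))           # the output layer returns the predictions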
Our exemplary neural network uses a simple feed-forward architecture, which means that the
neurons come in their most basic variant and that no backlinks exist; thus, data simply flows from
left to right. More advanced neural networks employ more complex neurons and architectures.
Recurrent neural networks (RNNs) are designed for sequential data such as text (Medsker and
Jain, 1999). The hidden layer neurons in RNNs have an additional memory feature that allows
them to accumulate information over multiple related observations (for instance, words in a sen-
tence). There are different ways to design the memory feature of the neu-
rons. Widespread design examples are gated recurrent units (GRU) and long short-term memory
(LSTM).
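As a minimal sketch of the memory feature, the following PyTorch snippet runs a batch of sequences through an LSTM layer; all sizes are illustrative assumptions.

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
head = nn.Linear(32, 1)           # maps the final hidden state to a prediction

x = torch.randn(8, 20, 16)        # 8 sequences, 20 steps (e.g., words), 16 features
output, (h_n, c_n) = lstm(x)      # h_n: accumulated memory after the last step
prediction = head(h_n[-1])        # one prediction per sequence
print(prediction.shape)           # torch.Size([8, 1])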
Convolutional neural networks (CNNs) are even more complex neural networks whose general
architecture fits well with visual data such as images and videos (Albawi, Mohammed, and Al-
Zawi, 2017). Simply put, their hidden layers represent trainable filters that iteratively detect
increasingly complex structures. The architecture of CNNs is typically highly customized towards
a specific application. Adequately designed CNNs show outstanding performance for tasks such
as image classification and object recognition.
Due to their high complexity, neural networks are inherently hard to interpret. In general, we can
infer very little information from the hidden layers, which represent the learned knowledge of a
neural network. Improving the interpretability of neural networks is subject to ongoing research
in computer science.
In addition to the methods just discussed, there are older methods that typically achieve worse
prediction performance and/or provide lower interpretability than newer methods. Since many
early studies that applied ML in financial economics used these methods extensively, we also
briefly cover them here.
A widespread example is the naïve Bayes method (Rish, 2001), which uses Bayes’ theorem to
classify observations into categories. For instance, we might want to classify loan applications as
accept or reject. Naïve Bayes then calculates the probability of each possible classification cate-
gory (accept or reject) for a new loan application, conditional on its characteristics and the given training data. Its final
classification decision is the category with the highest probability. Modern methods such as
boosted regression trees typically outperform naïve Bayes by a wide margin, so it is much less
common in recent studies.
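A minimal sketch of such a classification, assuming scikit-learn's Gaussian naïve Bayes and simulated loan data:

import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))     # illustrative features: income, debt, age
y = (X[:, 0] - X[:, 1] + rng.normal(size=500) > 0).astype(int)  # 1 = accept

clf = GaussianNB().fit(X, y)
new_application = np.array([[1.2, -0.3, 0.5]])
print(clf.predict_proba(new_application))  # probability of reject vs. accept
print(clf.predict(new_application))        # category with the highest probability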
Methods based on the support vector machine (SVM) are also common among older studies
(Hastie, Tibshirani, and Friedman, 2009, pp. 417-455). In support vector classification (SVC), we
separate observations in different classification categories (for instance, positive and negative ex-
amples) with a hyperplane, the generalization of a line or plane to higher dimensions. We position the
hyperplane between the training examples in such a way that the margin between examples of
different categories is maximized. The hyperplane then allows us to classify new examples de-
pending on which side of the hyperplane they lie. Support vector regression (SVR) extends the
idea of SVM to regression problems, that is, predictions of continuous values instead of categories.
In general, SVM-based methods provide lower prediction performance than newer methods such
as boosted regression trees. Hence, we also see them less and less often in current studies.
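A minimal sketch of both variants, assuming scikit-learn and simulated data:

import numpy as np
from sklearn.svm import SVC, SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

svc = SVC(kernel="linear").fit(X, y)            # maximum-margin hyperplane
print(svc.predict([[1.0, 1.0], [-1.0, -1.0]]))  # side of the hyperplane decides

svr = SVR().fit(X, X[:, 0] + X[:, 1])           # same idea for continuous targets
print(svr.predict([[0.5, 0.5]]))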
2.2 Unsupervised Learning
The purpose of unsupervised learning is data structure inference. Since data structure subsumes
many different kinds of information, we divide the methods of unsupervised learning into different
categories. Figure 6 gives an overview of the most important categories in unsupervised learning.
There are two major categories: clustering and dimensionality reduction. Further categories in-
clude association rule mining, outlier detection, and synthetic data generation.
The first major category of unsupervised learning is clustering. Methods for clustering group the
given observations in a way that results in high within-group similarity and low cross-group sim-
ilarity. There exist various kinds of methods for clustering. Centroid-based methods form clusters
around centroids. After the initial positioning of the centroids, they iteratively update their posi-
tion to arrive at suitable clusters. A common example of a very early but still heavily used cen-
troid-based method is K-means (MacQueen, 1967). Density-based methods build clusters depend-
ing on the differing density in the space of observations. Simply put, they group observations with
many similar observations nearby into clusters. An example of a density-based clustering method
is DBSCAN from Ester et al. (1996), which is also one of the most famous clustering methods.
Distribution-based methods assign observations to clusters based on whether they likely belong
to the same statistical distribution. Hence, they require us to know the distribution of the under-
lying data process in advance. For normally distributed data, Gaussian mixture models are wide-
spread (Rasmussen, 1999). Finally, hierarchical methods construct clusters that consider the hi-
erarchical relationship in the data. They start with initial clusters, where each cluster consists of
a single observation. Then, they iteratively combine smaller clusters into larger clusters to build
a hierarchy. A common method for hierarchical clustering is BIRCH (Zhang, Ramakrishnan, and
Livny, 1996). While the method classes just discussed are most common for clustering, it should
be noted that there are additional but much less often used methods.
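The following sketch applies a centroid-based and a density-based method to simulated data with three groups; the cluster parameters are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans, DBSCAN

rng = np.random.default_rng(0)
centers = np.array([[0, 0], [5, 5], [0, 5]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

# K-means requires the number of clusters; DBSCAN infers it from density.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
print(np.unique(kmeans_labels), np.unique(dbscan_labels))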
The second major category is dimensionality reduction. Methods in this category try to increase
the information density of the given data by decreasing their dimensionality while retaining most
of the inherent information. There are various methods for dimensionality reduction; we cover
only the two most common ones. First, methods based on principal component analysis (PCA)
derive linear combinations of the original variables (“principal components”) that cover as much
of the data’s variance as possible. While the basic variant of PCA is inherently linear, nonlinear
generalizations also exist. For more details on the different PCA-based methods, see, for instance,
Hastie, Tibshirani, and Friedman (2009, pp. 534-552). Second, methods based on neural networks
reduce dimensionality with special architectures. A widely used method is the autoencoder neural
network (Goodfellow, Bengio, and Courville, 2016, pp. 499-523). An autoencoder consists of an
encoder network that creates a condensed representation of the input data and a subsequent
decoder network that reconstructs the original data from the condensed representation. A special
bottleneck layer connects the encoder and decoder networks to train them on given data. If the
autoencoder is able to reconstruct the original data well, then the condensed data representation
in the bottleneck layer has successfully retained most of the information in the data while reducing
its dimensionality.
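The following sketch compresses simulated data with PCA; the number of components and the data-generating process are illustrative assumptions.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(1_000, 3))    # 3 true underlying factors
X = latent @ rng.normal(size=(3, 20)) + rng.normal(scale=0.1, size=(1_000, 20))

pca = PCA(n_components=3).fit(X)
X_reduced = pca.transform(X)            # 1,000 x 3 condensed representation
print(pca.explained_variance_ratio_.sum())  # share of variance retained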
In addition to the two major categories, there are minor categories of unsupervised ML methods.
Association rule mining tries to identify relations between variables (Agrawal, Imieliński, and
Swami, 1993). For instance, it can learn from customer purchase data which products are often
bought together. The identified relations can often directly affect decision making. Outlier detec-
tion methods try to find observations that substantially differ from the remaining data. While
many traditional methods for outlier detection exist, ML-based methods often provide superior
performance, especially in high-dimensional settings (Domingues et al., 2018). In synthetic data
generation, we try to generate new data that satisfies certain requirements. Generative adversarial
networks, for instance, use neural networks to create new, synthetic data that closely mimics the
given training data (Goodfellow et al., 2020). Their neural network architecture makes them
especially useful for unconventional data, for example, to create artificial images that are similar
to existing images. While the categories just discussed are the most common ones in unsupervised
learning, it should be noted that there are even more but less commonly used categories and
methods.
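As one concrete example from these minor categories, the following sketch detects planted outliers with an isolation forest; the contamination rate and data are illustrative assumptions.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(size=(980, 10))
outliers = rng.normal(loc=6.0, size=(20, 10))   # 20 planted anomalies
X = np.vstack([normal, outliers])

labels = IsolationForest(contamination=0.02, random_state=0).fit_predict(X)
print((labels == -1).sum())                     # -1 marks detected outliers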
2.3 Other Types of ML
In addition to the two major types of ML, supervised and unsupervised learning, there are other
types of ML, which we briefly cover here. Figure 7 gives an overview of these additional types of
ML. One of the more common types is reinforcement learning, which is suitable for sequential
decision problems with a long-term goal. Such problems are common, for instance, in robotics
(Sutton and Barto, 2018). We usually model such problems with a Markov decision process that
consists of an environment and an agent whose actions change the environment and bring rewards.
Reinforcement learning methods then try to find a policy for the agent that maximizes the expected
cumulative reward.
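A minimal sketch of this idea is tabular Q-learning on a toy chain environment; the environment, rewards, and learning parameters are illustrative assumptions.

import numpy as np

n_states, n_actions = 5, 2            # actions: 0 = move left, 1 = move right
Q = np.zeros((n_states, n_actions))   # expected reward of each action per state
alpha, gamma, epsilon = 0.1, 0.9, 0.1
rng = np.random.default_rng(0)

for _ in range(2_000):                # episodes of interaction with the environment
    s = 0
    while s < n_states - 1:
        # Explore occasionally, otherwise act greedily.
        a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0    # reward only at the goal state
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.argmax(axis=1))               # learned policy: always move right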
Semi-supervised learning combines supervised and unsupervised learning. It aims for prediction
as in supervised learning, but it uses data in a way more similar to unsupervised learning (Zhu,
2005). More specifically, the training data in semi-supervised learning consists of few labeled
examples and many unlabeled examples. While we cannot directly train a prediction model with
unlabeled data, we can still obtain information about the probability distribution of the data,
which can improve the prediction model.
Active learning is a closely related variant of semi-supervised learning (Settles, 2009). It is useful
in cases where obtaining additional labels is costly, for example, because an expert has to manually
assign labels to the examples. In active learning, we first train the model with the already labeled
examples. Then, we can calculate an importance score for each unlabeled example based on re-
latedness to other examples and prediction uncertainty. Finally, the expert must label only those
observations that help improve our prediction model the most, which can vastly reduce costs.
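A minimal sketch of uncertainty sampling, a common active learning strategy, assuming scikit-learn and simulated data:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 5))
y_true = (X[:, 0] + X[:, 1] > 0).astype(int)          # labels the expert could give

labeled = rng.choice(1_000, size=20, replace=False)   # few already labeled examples
model = LogisticRegression().fit(X[labeled], y_true[labeled])

unlabeled = np.setdiff1d(np.arange(1_000), labeled)
proba = model.predict_proba(X[unlabeled])[:, 1]
uncertainty = -np.abs(proba - 0.5)                    # close to 0.5 = most uncertain
ask_expert = unlabeled[np.argsort(uncertainty)[-10:]] # 10 most informative examples
print(ask_expert)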
Figure 7. Overview of reinforcement learning and other types of ML
This figure gives an overview of types of ML other than supervised and unsupervised learning. Among
these other ML types, reinforcement learning and semi-supervised learning are two of the more
common. Less common types include deductive learning, federated learning, and genetic algorithms.
In addition to these more common ML types, reinforcement learning and semi-supervised learning,
there are other types of ML. In deductive learning, we try to algorithmically infer valid logical
statements from other logical statements. Applications for deductive learning can be found, for
instance, in natural language processing (Cambria and White, 2014). In federated learning, we
train ML models across multiple machines that do not share the same data. Federated learning
offers benefits in special domains, for instance, if we need to preserve data privacy (Yang et al.,
2019). Finally, genetic algorithms use evolutionary principles such as mutation and selection mech-
anisms to derive optimal solutions to various kinds of problems (Mitchell, 1998). Practical appli-
cations of genetic algorithms include, for instance, complex optimization problems.
3. Applications of ML in Financial Economics
The use of ML in financial economics is growing rapidly, and ever more studies that apply ML
get published. However, many researchers are still unaware of how and where to apply ML
in the field of financial economics. In this section, we present a taxonomy of existing ML applica-
tions, which serves multiple purposes. First, it outlines where ML can add value in financial
economics research. Second, it structures the growing literature in the
field of financial economics. Third, it allows us to better understand new contributions and how
they relate to the existing literature. Finally, it guides researchers in discovering possible applica-
tions for their own research.
The workhorse model of financial economics research, linear regression with OLS, has one direct
purpose: the explanation of economic
phenomena. In contrast, ML provides us with predictions that minimize prediction error or infers
structural information from given data. Our taxonomy shows the categories in which these func-
tionalities add value.
We identify three archetypical applications of ML from the existing financial economics literature:
1) construction of superior and novel measures, 2) reduction of prediction error in economic pre-
diction problems, and 3) extension of the existing econometric toolset. Figure 8 illustrates our
taxonomy and depicts characteristic equations for each of the three application categories to
illustrate how they differ.
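Since Figure 8 is not reproduced here, one way to sketch such characteristic equations, under our own notation rather than the figure's exact formulation, is:

\begin{align*}
&\text{1) Measure construction:} && y = \alpha + \beta\,\hat{x}^{ML} + \varepsilon, \quad \hat{x}^{ML} \text{ constructed by ML} \\
&\text{2) Prediction:} && \hat{y} = \hat{f}^{ML}(X) \\
&\text{3) Toolset extension (e.g., IV first stage):} && \hat{x} = \hat{f}^{ML}(Z)
\end{align*}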
Studies in the first category use ML to construct a superior or novel measure for one of the
independent variables 𝑋. The main analysis still relies on a traditional linear model, which we
estimate with OLS. In the second category, studies use ML to reduce the prediction error of
predictions 𝑦̂ in economic prediction problems. Supervised ML methods achieve superior predic-
tion performance by using flexible functional forms 𝑓(⋅) in the prediction model. Studies in the
third category use ML to extend the existing econometric toolset. ML methods either serve as
new econometric methods themselves or optimize some part of a traditional econometric method.
In the following subsections, we review the relevant literature for each of the three categories of
ML applications.
3.1 Construction of Superior and Novel Measures
The first category of ML applications in financial economics is the construction of superior and
novel measures. Figure 9 shows the general methodological process of studies in this category.
The process starts with unconventional and nonnumerical data such as text, images, or videos.
To use the information from such data in econometric analyses, we need to construct a numerical
measure. For textual data, traditional approaches use word counts based on domain-specific dic-
tionaries.6 For image and video data, only human assessments have been available for a long time.
6 See Loughran and McDonald (2016) for an overview of mostly traditional text analytics methods in accounting and
finance.
ML-based approaches provide easier and, at the same time, more powerful access to the infor-
mation contained in unconventional data. All types of ML are applicable here. We can use pre-
dictions from supervised learning, data structure information from unsupervised learning, or re-
sults from other types of ML.
By applying ML to the given unconventional data, we can construct a superior or novel measure
for an economic variable. From the current literature, we identify three subcategories of typical
ML-based measures, which we present below.
Finally, the superior or novel measure serves as an independent variable in the main analysis of
an economic relation. Using superior measures with lower measurement error than existing
measures reduces attenuation bias, so we obtain more precise estimates for the parameters de-
scribing an economic relation. Novel measures allow us to conduct new analyses with previously
unmeasurable economic aspects. In the main analysis, most studies that construct ML-based
measures apply traditional econometric methods such as linear regression with OLS.
Figure 10 presents an overview of the relevant studies that use ML to construct superior or novel
measures. In the following, we present them in three subcategories: 1) measures of market senti-
ment, 2) measures of corporate executives’ personality and decisions, and 3) measures of firm
characteristics and corporate policies.
3.1.1 Measures of Market Sentiment
Measures of market sentiment describe opinions and moods of market-participating agents. Most
studies in this subcategory construct measures of market sentiment from textual data. There are
both traditional and ML-based approaches to extract market
sentiment from textual data. Loughran and McDonald (2011) present a dictionary approach to
derive market sentiment from financial texts. More specifically, they count negative words based
on a finance-specific word list. Dictionary approaches, however, miss the context of words within
a sentence (Loughran and McDonald, 2016). In contrast, flexible ML-based approaches can con-
sider not only the context of words within a sentence but also how different sentences interrelate
with each other. For an extensive review of sentiment measurement with traditional econometric and ML-based
approaches, we refer to the survey literature.
Figure 10. Overview of studies that use ML to construct superior and novel measures
This figure presents an overview of the relevant studies in financial economics that apply ML to construct
superior and novel measures. There are three main categories: measures of market sentiment, measures
of corporate executives’ personality and decisions, and measures of firm characteristics and corporate
policies.
Sentiment exists for many topics and is derived from many sources. In financial economics, our
interest lies in the sentiment of markets such as the stock market. There is, however, a large array
of potential sources. For instance, analyst reports are a common source of stock market sentiment.
Recently, novel sources of market sentiment, such as social media, have also become more perva-
sive. Kearney and Liu (2014) present different sources of textual sentiment in finance as well as
methods to extract it.
The most common ML-based measure of market sentiment in the literature relates to stock market
sentiment. Several studies construct a measure of stock market sentiment from social media.
Antweiler and Frank (2004) use the ML methods naïve Bayes and SVM to classify user posts on
the Yahoo Finance message board as positive or negative. Then, they aggregate their classifica-
tions to construct a measure of stock market sentiment. Renault (2017) similarly classifies user
posts on the finance-focused social network StockTwits to construct a measure of stock market
sentiment. Bartov, Faurel, and Mohanram (2017) derive stock market sentiment from user posts
on Twitter.
In addition to social media, news articles are another source of stock market sentiment. Barbon
et al. (2019) enhance the naïve Bayes method to build a stock market sentiment variable based
on firm-specific news. Ke, Kelly, and Xiu (2019) implement a customized ML-based approach that
specializes in extracting information that is relevant for stock returns. Their method then allows
them to extract a measure of stock market sentiment from Dow Jones Newswire articles.
Other studies use traditional analyst reports for stock market sentiment. For example, Huang,
Zang, and Zheng (2014) apply naïve Bayes to analyst reports to construct measures of stock
market sentiment.
The main analysis in all discussed studies concerns the effect of stock market sentiment on future
stock returns. Antweiler and Frank (2004) additionally analyze the effect on stock volatility. Re-
nault (2017) studies intraday stock index returns. Bartov, Faurel, and Mohanram (2017) focus on
returns around earnings announcements and study the effect of sentiment on quarterly earnings.
Huang, Zang, and Zheng (2014) additionally analyze sentiment’s effect on earnings growth. In the
study by Barbon et al. (2019), stock market sentiment only serves as a control variable in their
main analysis.
Manela and Moreira (2017) deviate from the traditional positive vs. negative market sentiment
measures. Instead, they construct a measure of stock market uncertainty from Wall Street Journal
front-page articles. Their novel measure allows them to analyze the effect of news on equity risk
premia. Vamossy (2020) measures investor emotions by extracting different emotional states from
StockTwits posts with textual analysis based on deep learning. He then studies the effect of
these investor emotions on stock returns.
Slightly different from the stock market sentiment discussed above, Liew and Wang (2016) con-
struct a measure of pre-IPO sentiment. They use a commercial ML-based service to extract sen-
timent information from Twitter posts. Finally, they study the effect of pre-IPO sentiment on
IPO returns.
While most studies that construct ML-based measures of market sentiment deal with stock market
sentiment, Tang (2018) examines product market sentiment. The study uses a commercial service
to create a measure of product and brand sentiment based on Twitter posts. The subsequent main
analysis then studies the effect of product market sentiment on firm sales.
3.1.2 Measures of Corporate Executives’ Personality and Decisions
The prominent role of a firm’s leadership has led to a vast amount of financial economics literature
that studies various aspects of corporate executives. Related to this stream of literature, ML can
help us construct superior and novel measures of executives’ personality and decisions. While
most measures in this subcategory are based on textual data, some studies construct measures
from image and video data.
Multiple studies construct ML-based measures of executives’ personality. Gow et al. (2016) use
ML to extract a measure of CEO personality from the Q&A part of conference call transcripts.
The extracted measure then allows them to analyze the effect of personality on financing and
investment choices and on operating performance. Similarly, Hrazdil et al. (2020) measure CEO
and CFO personality traits from conference calls by using the commercial service IBM Watson
Personality Insights. Based on the extracted personality traits, they construct a measure of risk
tolerance to analyze the effect of executives’ risk tolerance on audit fees. Hsieh et al. (2020)
leverage recent advances in ML-based face detection technology to extract a measure of trustwor-
thiness from executives’ business headshot images. More specifically, they detect and use certain
facial features (for instance, eyebrow angle or face roundness) to predict perceived trustworthiness.
Their main analysis then studies the effect of executives’ trustworthiness on audit fees. Du et al.
(2019) analyze the personality of mutual fund managers. They use mutual fund managers’ letters
to shareholders to construct a measure of managers’ level of confidence. Their main analysis then
studies how managers’ confidence relates to fund performance.
Rather than measuring executives’ personalities, several studies construct ML-based measures of
executives’ general decisions. Bandiera et al. (2020) use CEO survey data to construct a measure
of CEO behavior that captures whether CEOs perform more low-level or more high-level tasks.
In their main analysis, they use this measure to analyze the effect of CEO behavior on firm
performance. Barth, Mansouri, and Woebbeking (2020) study how executives withhold infor-
mation from shareholders. They create an ML-based measure of executives’ obstruction of infor-
mation from transcripts of earnings conference calls. More specifically, their measure captures how
executives answer questions in such calls. Finally, they analyze the effect of information obstruc-
tion by management on abnormal stock returns and implied volatility. Hu and Ma (2020) leverage
the fact that many venture capital firms have started to require startups to publish their appli-
cation pitch video on YouTube. The authors use a combination of different commercial services
on these publicly available pitch videos to construct measures of three distinct aspects: how the
founders speak, what they say, and how they present themselves visually. Finally, they analyze
the effect of the three dimensions of founders’ behavior on the probability of obtaining a venture
capital investment.
3.1.3 Measures of Firm Characteristics and Corporate Policies
Studies in the third subcategory construct measures of firm characteristics and corporate policies
with ML methods. We can further distinguish these measures by the company type involved:
corporates or financials. In the literature on corporates, Li et al. (2020) extract aspects of corpo-
rate culture from conference call transcripts with ML and build measures of five different corpo-
rate culture values. Using these measures allows them to analyze the effect of corporate culture
on firm policies such as executive compensation and risk-taking. Furthermore, they study the
effect on firm performance metrics such as operational efficiency and firm value. Buehlmaier and
Whited (2018) apply ML to annual reports to construct a measure of financial constraints. Their
ML-based measure achieves superior performance compared to existing measures of financial con-
straints. Lowry, Michaely, and Volkova (2020) analyze firms’ communications with the SEC prior
to IPOs. They apply ML to extract the discussed topics from SEC letters and construct a measure
of regulatory IPO concern. Finally, they use their measure to study the effect of regulatory concern
on IPO outcomes.
In the literature on financials, Hanley and Hoberg (2019) construct a measure of the aggregate
risk exposure of the financial sector from individual banks’ annual reports by using a commercial
ML-based service. They use their measure to study the effect of financial sector risk on banks’
stock returns and volatility as well as bank failure. Bubna, Das, and Prabhala (2020) study ven-
ture capital syndications and create a measure of venture capital relatedness. More specifically,
they cluster venture capital firms using ML to identify syndication groups. Finally, they analyze
the effect of venture capital relatedness on startup maturation and innovation. Bertsch et al.
(2020) construct an ML-based measure of bank misconduct from customer complaint texts sent
to the regulator. They use their measure to study how bank misconduct affects online lending
demand.
3.2 Reduction of Prediction Error in Economic Prediction Problems
The second category of ML applications in financial economics is the reduction of
prediction error in economic prediction problems. While many problems in economics require
estimates of causal relationships between economic variables, some problems directly require prediction. ML
can reduce the prediction error in such problems, that is, generate more accurate predictions than
simpler approaches such as fitted values from linear regression with OLS.
Figure 11 shows the general methodological process of studies in this category. We can create
predictions based on numerical data as well as unconventional data such as text, images, or videos.
Since the purpose of ML in this category is to minimize prediction error in economic prediction
problems, only supervised ML is directly applicable here. Given the large number of available ML
methods, most studies use a multitude of different methods to assess which method works best
on the given data. For numerical data, regularized linear methods (LASSO, ridge, and elastic
net), tree-based methods (random forest and boosted regression trees) and SVM-based methods
(SVC and SVR) are most common. Unconventional data such as text, images, and videos require
more complex methods, so neural network-based methods such as deep learning models are most
common here.
Applying supervised ML methods then results in predictions for an economic variable, which
directly helps in solving an economic prediction problem. From the literature, we identify three
subcategories of typical economic prediction problems, which we present below.
Finally, some studies also try to derive explanations from ML models and their predictions. ML
models are often known as black boxes; that is, they produce predictions, but we cannot directly
observe how the algorithm has generated them. The field of interpretable ML offers methods that
deliver explanations on how an ML model has derived its prediction results. There are three
main kinds of such methods. First, feature importance methods such as
permutation importance yield importance scores for the different predictor variables. For instance,
in predicting real estate prices, such methods can tell us whether the number of bedrooms or the
lot size is more important in the prediction of house prices. Second, feature dependence methods
(such as partial dependence plots) uncover the relations between predictor variables and the pre-
diction target. For instance, they can show the possibly nonlinear dependency between lot size
and house price. Third, single prediction analysis methods such as Shapley Additive Explanation
(SHAP) values disentangle the contribution of every predictor variable to a specific prediction
value. For instance, the SHAP value for the predictor variable swimming pool can tell us how
much the presence or absence of a swimming pool contributes to the price of a specific house.
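The following sketch applies one method from each of the three kinds to a fitted house-price model; scikit-learn's inspection tools, the third-party shap library, and the column meanings are illustrative assumptions.

import numpy as np
import shap
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance, partial_dependence

rng = np.random.default_rng(0)
X = rng.uniform(size=(1_000, 3))   # columns: 0 = bedrooms, 1 = lot size, 2 = pool
y = 20 * X[:, 0] + 50 * X[:, 1] ** 2 + 30 * (X[:, 2] > 0.5) + rng.normal(size=1_000)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# 1) Feature importance: which predictor matters most overall?
imp = permutation_importance(model, X, y, random_state=0)
print(imp.importances_mean)

# 2) Feature dependence: the (possibly nonlinear) relation of lot size to price.
dep = partial_dependence(model, X, features=[1])
print(dep["average"][0][:5])

# 3) Single prediction analysis: each variable's contribution to one prediction.
shap_values = shap.TreeExplainer(model).shap_values(X[:1])
print(shap_values)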
Figure 11. General methodological process of studies that use ML to reduce prediction
error in economic prediction problems
This figure depicts the general methodological process of studies in the second category of ML applica-
tions in financial economics. Based on conventional or unconventional data, ML methods create predic-
tions for an economic prediction problem. Some studies also try to derive explanations from ML models
and their predictions.
Figure 12 gives an overview of the relevant studies that use ML in economic prediction problems
to reduce prediction error. In the following, we present these studies in the three subcategories of
1) prediction of capital market behavior, 2) prediction of credit risk, and 3) prediction of firm
characteristics and corporate policies.
3.2.1 Prediction of Capital Market Behavior
In capital markets, we are most often interested in the future behavior of security prices. Hence,
capital markets provide many different kinds of prediction problems, in which ML can reduce the
prediction error. We can distinguish between predictions in five different areas of capital market
behavior: individual stock returns, the equity risk premium, the stochastic discount factor, option
prices, and other aspects of capital market behavior.
Figure 12. Overview of studies that use ML in economic prediction problems
This figure presents an overview of relevant studies in financial economics that apply ML in economic
prediction problems to reduce prediction error. There are three main categories of economic prediction
problems in which ML is relevant: prediction of capital market behavior, prediction of credit risk, and
prediction of firm characteristics and corporate policies.
Most common in ML-based prediction of capital market behavior is the prediction of individual
stock returns, which is closely related to the field of cross-sectional asset pricing. Rasekhschaffe
and Jones (2019) provide an overview of the use of ML for predicting the cross-section of stock
returns and for the selection of individual stocks. Martin and Nagel (2019) additionally present
the challenges of cross-sectional asset pricing with high-dimensional data. Gu, Kelly, and Xiu
(2020) directly predict future stock returns based on firm characteristics, historical returns, and
macroeconomic indicators. They use ML methods with varying complexity, from regularized linear
models to neural networks. Furthermore, they analyze which predictor variables are most informa-
tive to predict the cross-section of stock returns. Rossi (2018) predicts future stock returns and
future volatility based on established predictor variables from Welch and Goyal (2008). The stud-
ies from Moritz and Zimmermann (2016), Kelly, Pruitt, and Su (2019), Gu, Kelly, and Xiu (2019),
and Freyberger, Neuhierl, and Weber (2020) all predict future stock returns based on firm char-
acteristics and historical returns. However, they differ with respect to the specific ML methods
they apply. Grammig et al. (2020) construct a hybrid approach that combines traditional methods
based on financial theory with ML to predict future excess stock returns. Chinco, Clark‐Joseph,
and Ye (2019) predict ultra-short-term future stock returns based on the cross-section of ultra-
short-term historical returns with LASSO. Amel-Zadeh et al. (2020) predict abnormal stock re-
turns around earnings announcements based on financial statement variables. They use LASSO,
random forests, and neural networks, and they analyze which financial statement variables are
most informative.
In addition to predictions of individual stock returns, ML can reduce the prediction error in
predicting the aggregate stock market behavior, particularly the equity risk premium. Jacobsen,
Jiang, and Zhang (2019) predict the equity risk premium based on established stock market
predictor variables from Welch and Goyal (2008) with an ensemble of multiple ML models.
Routledge (2019) predicts the equity risk premium from macroeconomic indicators and FOMC
texts. Adämmer and Schüssler (2020) extract topics discussed in general news articles with ML
and use them to predict the equity risk premium.
Additionally, two studies directly predict the stochastic discount factor. Chen, Pelger, and Zhu
(2019) use generative adversarial networks based on deep neural networks on different predictors,
such as firm characteristics, historical returns, and macroeconomic indicators. Kozak, Nagel, and
Santosh (2018) develop a custom ML method based on Bayesian priors to predict the stochastic
discount factor.
The studies discussed above all predict future capital market behavior. However, predicting cur-
rent market prices can also be useful. In particular, the pricing of derivatives was a very early
application of ML in finance. Hutchinson, Lo, and Poggio (1994) predict option prices on the S&P
500 future based on the Black-Scholes variables with an early variant of neural networks. Similarly,
Yao, Li, and Tan (2000) predict option prices on the Nikkei 225 future. In more recent work,
Spiegeleer et al. (2018) find that ML methods can price derivatives much faster than advanced
traditional pricing techniques.
Some studies use ML to predict aspects of capital market behavior other than the aspects dis-
cussed above. Bianchi, Büchner, and Tamoni (2020) predict future excess returns of US treasury
bonds from general yield data and macroeconomic indicators. Reichenbacher, Schuster, and Uhrig‐
Homburg (2020) predict future bond liquidity using elastic net and random forests on bond trans-
action and characteristics data. Colombo, Forte, and Rossignoli (2019) predict the direction of
changes in exchange rates based on indicators of market uncertainty. Two studies focus on finan-
cial market volatility: Kogan et al. (2009) predict future stock volatility based on annual reports,
and Osterrieder et al. (2020) predict the intraday volatility index VIX from option prices. McInish
et al. (2019) focus on the market microstructure. More specifically, they predict the lifespan of
orders based on order characteristics and market data. Finally, Rossi and Utkus (2020) study
investors’ behavior and the effects of robo-advising. They apply boosted regression trees to predict
investors’ portfolio allocation and performance based on different investor characteristics. They
also study which characteristics are most important in the prediction and how exactly they affect
the predictions.
3.2.2 Prediction of Credit Risk
Credit risk is a typical economic prediction problem: we ultimately want to know which prospec-
tive borrowers will eventually default. As such, ML can help us lower prediction error and improve
decision making, for instance, in loan origination. We can divide the current literature on ML-
based predictions of credit risk into three areas: consumer credit risk, real estate credit risk, and
corporate credit risk.
Studies on consumer credit risk apply ML to make default predictions for any type of consumer
credit. Albanesi and Vamossy (2019) study general consumer credit default. They use advanced
ML methods such as boosted regression trees and deep neural networks to derive more accurate
predictions from credit bureau data compared to standard credit scoring models. Furthermore,
they analyze which predictors are most relevant with feature importance methods and how the
different predictors affect the predictions with feature dependence methods. Khandani, Kim, and
Lo (2010) predict consumer credit card default based on transaction data and traditional credit
bureau data. Similarly, Butaru et al. (2016) predict credit card default but consider more general
account data and macroeconomic indicators. They both use tree-based ML methods that auto-
matically consider nonlinearities and interactions between predictor variables. Butaru et al. (2016)
also use feature importance methods to identify which predictor variables drive default predic-
tions. Björkegren and Grissen (2018, 2019) focus on bill payment and use random forests on
mobile phone metadata to predict the payment of consumer bills in developing countries. Being
able to make credit risk predictions based on easily obtainable data from mobile phones can help
unbanked people in developing countries without a credit score obtain access to loans. Slightly
different from the studies above, Gathergood et al. (2019) use credit card transaction data to
predict credit card repayment patterns. They predict not whether customers pay their credit card
bills but how customers split repayment on multiple cards with different interest rates. They also
apply various ML methods and analyze which predictors are most informative.
Whenever algorithm-based decisions affect people, algorithmic bias is a potential issue. Since ML-
based predictions in consumer credit risk directly affect credit approval decisions, we need to
ensure that the algorithm does not discriminate against people based on attributes such as gender
or race. The literature does not paint a uniform picture of whether ML reduces or increases bias
in consumer credit decisions. Rambachan et al. (2020a) and Rambachan et al. (2020b) argue that
discrimination by algorithms crucially depends on the given data. Since algorithms base their
decisions on the data on which they have been trained, they might propagate biases present in
the data. Fuster et al. (2020) apply ML to a concrete dataset to create an ML model for credit
decisions. They find that ML increases the disparity between and within different groups relative
rithms as soon as their predictions influence decisions that directly affect people, such as lending.
On the other hand, there are also studies showing that ML use can decrease bias in consumer
credit decisions. Based on a theoretical model, Philippon (2019) shows how algorithms can reduce
discrimination in credit markets. Dobbie et al. (2018) train an ML model to maximize expected
profit from credit applications and find that the resulting lending decisions eliminate bias. Klein-
berg et al. (2018) show that including problematic variables such as gender and race in ML models
can actually reduce discrimination. To conclude the discussion on algorithmic bias in consumer
credit risk, there is no uniform picture in the literature yet. Some studies find that using ML to
determine consumer credit risk increases bias, while other studies find that it decreases bias.
The second area of ML-based credit risk predictions, real estate credit risk, involves the risk of
mortgages and commercial real estate loans. Sirignano, Sadhwani, and Giesecke (2018) use deep
neural networks for the prediction of mortgage loan risk from mortgage origination and perfor-
mance data as well as macroeconomic indicators. They also analyze which predictor variables are
most important and how they affect predictions. Cowden, Fabozzi, and Nazemi (2019) use various ML techniques to predict the default of commercial real estate properties.
Corporate credit risk is another area in which ML can provide superior credit risk predictions.
Jones, Johnstone, and Wilson (2015) predict firms’ credit rating changes based on firm funda-
mentals, analyst forecasts, and macroeconomic indicators. Gündüz and Uhrig-Homburg (2011)
predict CDS prices as market-based indicators of credit risk based on observed CDS prices for
other time horizons and from other firms. Tian, Yu, and Guo (2015) directly predict corporate
bankruptcy from firms’ financial statements and market data by using the LASSO method.
Lahmiri and Bekiros (2019) similarly predict bankruptcy from firm fundamentals but additionally
include general risk indicators. They use more sophisticated neural networks. Croux et al. (2020)
apply LASSO to predict fintech loan default from loan and borrower characteristics as well as
macroeconomic indicators. In contrast to the above studies, Nazemi and Fabozzi (2018) focus on
the time after credit default and predict the recovery rates of corporate bonds based on bond characteristics and macroeconomic variables.
Firm characteristics and corporate policies, as the fundamental subject of study in the field of
corporate finance, can also be the target of ML-based predictions. Depending on the specific
target of the prediction, we can divide the current literature in this category into three areas:
prediction of firm fundamentals, prediction of accounting fraud, and prediction of startups’ suc-
cess.
Two studies use ML to predict different firm fundamentals. Amini, Elmore, and Strauss (2019)
study firms’ capital structure as a typical problem in corporate finance. They predict corporate
leverage based on different hypothesized predictors from the literature (Frank and Goyal, 2009)
with various ML methods. Furthermore, they analyze which predictors are actually informative
for capital structure and how they influence predictions in detail. The study by Van Binsbergen,
Han, and Lopez-Lira (2020) applies ML to predict firms’ future earnings based on their accounting data.
Another typical prediction problem in the subcategory of firm characteristics and corporate poli-
cies is accounting fraud. While there are traditional approaches to predict accounting fraud (such
as the Beneish (1999) model for earnings manipulation), ML can provide superior prediction
accuracy. Bao et al. (2020) use boosted regression trees on raw financial statement variables to
predict accounting fraud. They find that the ML-based predictions outperform simpler existing
fraud models. Brown, Crowley, and Elliott (2020) apply ML to extract the topics discussed in
annual reports and use them to predict accounting fraud. They also employ feature importance
and feature dependence methods to further analyze which topics are most informative and how they affect the predictions.
Finally, studies in the field of entrepreneurial finance use ML to predict startups’ success. Xiang
et al. (2012) predict startup acquisitions based on firms’ fundamental data and firm-specific news.
Similarly, Ang, Chia, and Saghafian (2020) apply ML to predict startups’ valuations and their
probabilities of success.
Studies in the third category of ML applications extend the existing econometric toolset. Many
commonly used econometric methods contain a prediction component. For instance, the first stage
of instrumental variable regression with 2SLS is effectively a prediction problem, as only the fitted
(predicted) value of the instrumented variable enters the second stage. ML methods can provide
superior predictions and hence improve the capabilities of such econometric methods. On the
other hand, some ML methods already serve similar purposes as existing econometric methods.
For instance, clustering is a known problem in econometrics and in ML. ML-based methods often
provide superior performance, so they can directly extend the econometric toolset. Figure 13 gives an overview of this category, which comprises
causal ML that uses ML for the estimation of treatment effects and other isolated applications of
ML in econometrics. Within the subcategory of causal ML, we can further divide the literature
into ML-enhanced methods for instrumental variable regression, the novel methods of causal trees
and causal forests, and other approaches related to causal ML. In the following, we briefly review each of these strands.
Figure 13. Overview of ML-based methods that extend the existing econometric toolset
This figure shows the different categories of ML-based methods that extend the existing econometric
toolset. The largest subcategory is causal ML for the estimation of treatment effects. ML enhances exist-
ing methods, such as instrumental variable regression, or introduces new methods, such as causal trees
and causal forests. ML also provides other methods relevant for the estimation of treatment effects, such
as verifying the balance between treatment and control groups. The second subcategory includes special
applications of ML in econometric approaches in addition to treatment effects, such as the generation of
simulated data.
3.3.1 Causal ML
While traditional econometric methods aim for causality, ML methods are designed for prediction
or for data structure inference. The field of causal ML tries to combine the advantages of both to
create superior econometric methods suitable for causality and especially for the estimation of
treatment effects. The most developed methods within causal ML are ML-enhanced instrumental
variable regression and the novel methods of causal trees and forests.
As noted before, ML can directly improve the first stage of instrumental variable regression. By
providing better predictions for the instrumented variable, the coefficient of determination R² of
the first stage improves, resulting in more precise estimates in the second stage. Concrete imple-
mentations of this idea already exist for different ML methods, including LASSO (Belloni et al.,
2012), ridge regression (Carrasco, 2012; Hansen and Kozbur, 2014), and neural networks (Hart-
ford et al., 2017). However, Angrist and Frandsen (2019) argue that ML-enhanced instrumental
variable methods might not be superior to existing specialized approaches in selecting instrumen-
tal variables.
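To illustrate the idea, the following minimal sketch uses synthetic data and a naive plug-in approach: LASSO predicts the endogenous regressor from many candidate instruments, and OLS then runs on the fitted values. All variable names are hypothetical, and actual implementations such as Belloni et al. (2012) add safeguards (for example, sample splitting) against regularization bias.

```python
# A minimal sketch of an ML-enhanced first stage in 2SLS, on synthetic data.
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(0)
n, n_instruments = 1000, 50
Z = rng.normal(size=(n, n_instruments))                  # many candidate instruments
u = rng.normal(size=n)                                   # unobserved confounder
x = Z[:, 0] + 0.5 * Z[:, 1] + u + rng.normal(size=n)     # endogenous regressor
y = 2.0 * x - u + rng.normal(size=n)                     # true coefficient is 2

# First stage: predict the endogenous regressor from the instruments with LASSO.
first_stage = LassoCV(cv=5).fit(Z, x)
x_hat = first_stage.predict(Z)

# Second stage: regress the outcome on the fitted (instrumented) values.
second_stage = LinearRegression().fit(x_hat.reshape(-1, 1), y)
print(second_stage.coef_)  # should be close to the true value of 2
```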
For the estimation of treatment effects with ML, causal trees and causal forests are other well-
developed methods. The seminal work from Athey and Imbens (2016) introduced the causal tree
approach, which uses tree-based ML methods to partition data into subpopulations with different
magnitudes of treatment effects. Causal forests from Athey and Wager (2019) extend this concept
by using an entire ensemble of causal trees. There already exist studies that apply causal forests
to concrete problems in financial economics. Gulen, Jens, and Page (2020) apply causal forests to
estimate heterogeneous treatment effects of debt covenant violations on firms’ investment levels.
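The following sketch conveys the intuition behind causal trees on synthetic data with randomized treatment: a tree partitions the covariate space, and within each leaf we compare the mean outcomes of treated and control observations. This naive variant is only an illustration; the actual method of Athey and Imbens (2016) relies on an honest splitting criterion and separate estimation samples.

```python
# A simplified illustration of the causal-tree idea on synthetic data:
# partition the covariate space with a tree, then estimate the treatment
# effect within each leaf as the treated-control difference in means.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n = 5000
X = rng.uniform(-1, 1, size=(n, 2))
w = rng.integers(0, 2, size=n)             # random treatment assignment
tau = np.where(X[:, 0] > 0, 2.0, 0.5)      # heterogeneous true effect
y = X[:, 1] + tau * w + rng.normal(size=n)

# Grow a shallow tree on the outcome to define subpopulations.
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
leaf = tree.apply(X)                       # leaf index per observation

for l in np.unique(leaf):
    mask = leaf == l
    effect = y[mask & (w == 1)].mean() - y[mask & (w == 0)].mean()
    print(f"leaf {l}: estimated treatment effect {effect:.2f}")
```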
In addition to causal trees and causal forests, other approaches use ML to improve the estimation
of treatment effects. Lee, Lessler, and Stuart (2010) estimate the propensity score with ML. Mul-
lainathan and Spiess (2017) suggest the use of ML for verifying the balance between treatment
and control groups. They argue that if we can predict the treatment assignment with ML, then
the split into treatment and control group cannot be balanced. However, this idea works in only
one direction: we can infer imbalance but not balance from applying ML to predict the treatment
assignment. It is always possible that our chosen ML methods are just not powerful enough to
predict the treatment assignment of imbalanced data.
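The following sketch illustrates this balance check on synthetic data in which treatment assignment depends on a covariate: a classifier that predicts assignment clearly better than chance (cross-validated AUC well above 0.5) reveals the imbalance, whereas an AUC near 0.5 would not certify balance.

```python
# A sketch of the balance check suggested by Mullainathan and Spiess (2017):
# if a classifier can predict treatment assignment from covariates out of
# sample, the groups are imbalanced (the converse does not hold).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n = 2000
X = rng.normal(size=(n, 10))
# Imbalanced assignment: treatment probability depends on the first covariate.
p = 1 / (1 + np.exp(-X[:, 0]))
w = rng.binomial(1, p)

auc = cross_val_score(RandomForestClassifier(n_estimators=200), X, w,
                      cv=5, scoring="roc_auc").mean()
print(f"cross-validated AUC: {auc:.2f}")  # well above 0.5 -> imbalance
```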
Chernozhukov et al. (2017, 2018) directly calculate treatment effects from ML-based predictions for treatment assignment and outcome.
Finally, Athey et al. (2019a) predict the counterfactual with ensemble methods to estimate treatment effects.
While causal ML for the estimation of treatment effects is currently the most developed application in this category, ML also extends the econometric toolset in several other, more isolated ways.
3.3.2 Other Applications of ML in Econometrics
Above, we have presented how ML can create measures for economic variables. By generalizing
this concept, ML can also construct a predictability measure for entire economic theories.
Peysakhovich and Naecker (2017) introduce the notion that we can use ML to derive an upper
bound for the predictive power of theories: the explainable variation of the dependent variable
from a given dataset with ML methods. Fudenberg et al. (2019) extend this idea to construct a
completeness measure for economic theories. They calculate completeness by comparing two pre-
diction errors: the error achieved from using the model and variables hypothesized by an economic
theory and the error achievable with ML. In general, different datasets contain different levels of predictable variation. By benchmarking the prediction errors of theories against those achievable with ML methods, we can create a fairer and more informative measure of completeness.
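As a stylized illustration, a completeness measure along these lines can be computed from three prediction errors on the same held-out data; the numbers below are hypothetical.

```python
# A sketch of a completeness measure in the spirit of Fudenberg et al. (2019):
# the share of the achievable error reduction (naive baseline vs. best ML
# predictor) that the economic theory captures. All inputs are hypothetical.
def completeness(error_naive: float, error_theory: float, error_ml: float) -> float:
    """Fraction of the feasible improvement captured by the theory."""
    return (error_naive - error_theory) / (error_naive - error_ml)

# Hypothetical mean squared errors on a held-out test set:
print(completeness(error_naive=1.00, error_theory=0.40, error_ml=0.25))  # 0.8
```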
ML can also help with problems of imbalanced or censored data: in loan performance data, actual defaults are much rarer than uneventful repayments. Sigrist and
Hirnschall (2019) combine ML with traditional econometric methods to address such problem
types. More specifically, they use boosted regression trees to enhance the traditional Tobit model.
They also illustrate the advantages of their method in a concrete problem by applying it to loan
defaults in Switzerland.
In the field of simulation, Athey et al. (2019b) use generative adversarial networks instead of
traditional Monte Carlo methods to simulate data that more closely mimics real data. They
illustrate their method by using simulated data for performance comparisons between different
econometric estimators.
Finally, Ludwig, Mullainathan, and Spiess (2019) introduce ML-augmented pre-analysis plans for
the avoidance of p-hacking. They augment standard linear regression with new regressors from
ML. The new regressors aggregate many potentially relevant variables into a single index. Hence,
their method avoids the otherwise necessary pre-specification of concrete analysis choices in standard pre-analysis plans.
4. Real Estate Price Prediction
Real estate is one of the most important asset classes in the economy. In the US, the total value
of real estate assets is comparable to the size of the equities and fixed income markets. For most
households, real estate is the greatest source of wealth. The Global Financial Crisis in 2007/08
exemplified how spillover effects from the real estate sector can destabilize economies around the
world.
In contrast to other asset classes, real estate assets show a high level of heterogeneity, which makes
real estate pricing challenging. The traditional approach to derive price estimates for individual
properties is hedonic pricing. In hedonic pricing, we first regress the observed property prices on the property characteristics with OLS to obtain a linear pricing model. Then, we can use this
model to obtain price estimates for new, previously unobserved properties. We can also interpret
the regression coefficients as the characteristics’ shadow prices. However, hedonic pricing relies on
an inherently linear model and therefore does not directly consider nonlinearities and interaction
effects. For instance, we can assume relevant interactions between lot size and location: an addi-
tional m² in lot size for a property in a city center is likely worth more than in a suburb. While
we could manually add such specific effects to the linear model, there may exist a plethora of
unknown nonlinear and interaction effects. By ignoring these effects, the linear model of hedonic pricing likely produces imprecise price estimates. Supervised ML can potentially provide us with price predictions that exhibit lower pricing error than the linear model from hedonic pricing. In this section, we study whether and how ML provides superior price predictions for real estate.
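For concreteness, the following sketch runs a hedonic OLS regression on simulated listing data. All variable names and coefficients are hypothetical, and the simulated interaction between lot size and location is exactly the kind of effect the linear model cannot capture unless added by hand.

```python
# A minimal sketch of hedonic pricing with OLS on hypothetical data;
# the variable names (size, lot_size, city_center) are illustrative.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 1000
df = pd.DataFrame({
    "size": rng.uniform(60, 250, n),        # living area in m²
    "lot_size": rng.uniform(100, 1500, n),  # lot size in m²
    "city_center": rng.integers(0, 2, n),   # location dummy
})
# Simulated log prices with an interaction the linear model ignores.
df["log_price"] = (10 + 0.006 * df["size"] + 0.0003 * df["lot_size"]
                   + 0.0004 * df["lot_size"] * df["city_center"]
                   + rng.normal(0, 0.2, n))

X = sm.add_constant(df[["size", "lot_size", "city_center"]])
hedonic = sm.OLS(df["log_price"], X).fit()
print(hedonic.params)  # coefficients = shadow prices of the characteristics
```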
We use a unique dataset of real estate listing data in Germany to train different ML models for
the prediction of individual property prices and compare them with the linear OLS model from
hedonic pricing. Figure 14 shows the key result of our analysis. ML methods strongly improve the
accuracy of price predictions over the OLS baseline. Our best-performing ML method, boosted
regression trees, dramatically increases out-of-sample R² to 77%, compared to 40% for OLS; thus,
it almost doubles the amount of explained price variation. On average, the predictions from
boosted regression trees deviate from the actual prices by approximately 27%, compared to 44%
for OLS. In monetary terms, the superior prediction performance of boosted regression trees
corresponds to an average pricing error of approximately 94,000 EUR, compared to 176,000 EUR
for OLS. Since the mean property price in our sample is 393,000 EUR, the improvements in
pricing accuracy from ML are not only statistically significant but also economically large.
Figure 14. Prediction performance of hedonic pricing (OLS) and different ML methods
This figure compares the prediction performance (R²) of traditional hedonic pricing (OLS) with different
ML methods. While most ML methods outperform OLS, the boosted regression trees method performs
best by far and almost doubles the OLS performance.
While the improvements in pricing accuracy induced by ML are already impressive on average,
their benefits become even more pronounced at the upper end of the price range. Figure 15 depicts
the prediction performance of the best-performing ML method, boosted regression trees, com-
pared to that of OLS in the five property price quintiles. The boosted regression trees method
outperforms OLS in all quintiles. While OLS performs worst at the extremes of the price range,
ML is especially useful in reducing the pricing error for the most expensive properties. In the
highest price quintile, the boosted regression trees method lowers the average pricing error to
24%, compared to 50% for OLS. In monetary units, the superior prediction performance of
boosted regression trees relative to that of OLS corresponds to a reduction of the average pricing
error by more than 240,000 EUR in the highest price quintile. Given that the average property
price in the top quintile is approximately 884,000 EUR, the improvements in pricing power from
ML are dramatic. Our results indicate that nonlinearities and interaction effects are relevant in
real estate pricing and especially important for the most expensive properties.
Existing research on real estate pricing almost exclusively focuses on the aggregate real estate
market instead of individual properties. Ghysels et al. (2013) forecast the aggregate price level in
the real estate market with mathematical forecasting methods such as GARCH or ARMA models.
Other studies investigate the relationship between aggregate real estate price levels and stock
prices (Quan and Titman, 1999), banks’ mortgage lending behavior (Hott, 2011), or subprime
lending (Pavlov and Wachter, 2011). The few studies that analyze individual property prices tend
to focus on the effect of specific influencing factors such as environmental variables (Din, Hoesli,
and Bender, 2001), spatial factors (Bourassa, Cantoni, and Hoesli, 2007), or climate change (Bal-
dauf, Garlappi, and Yannelis, 2020). Closest to our work is the study by Park and Bae (2015), who apply ML algorithms to predict individual housing prices in Fairfax County, Virginia.
Figure 15. Average pricing error of the best-performing ML method, boosted regression
trees, compared to that of OLS in the five property price quintiles
This figure shows the average pricing error (measured by mean absolute error (MAE)) for the best-
performing ML method, boosted regression trees, and for the OLS baseline in the five price quintiles. In
all quintiles, the boosted regression trees method significantly outperforms OLS. The reduction in pricing
error from ML is most pronounced in the highest price quintile, where OLS performs relatively poorly.
Our contribution in this section is threefold. First, we demonstrate the benefit of using ML to
reduce the prediction error in economic prediction problems. We present evidence that ML can
yield a statistically and economically significant reduction in prediction error compared to tradi-
tional linear regression with OLS in addressing the problem of real estate price prediction. By
using the most common ML methods and relevant metrics, we further illustrate how exactly
researchers can apply ML to solve such problems. Hence, our analysis can serve as a blueprint for future studies that want to apply ML to other economic prediction problems.
Second, we identify that nonlinearities and interaction effects are highly relevant in real estate
pricing. ML methods that automatically consider such effects significantly outperform the linear
OLS method traditionally used in hedonic pricing. We further find that the pricing effect of
nonlinearities and interaction effects is most pronounced at the upper end of the property price
range. While linear OLS struggles in the pricing of expensive properties, ML achieves much lower pricing errors in this segment.
Third, we provide researchers as well as practitioners with a real estate pricing model that yields
much more accurate price predictions than traditional methods. Researchers can use the predic-
tions from our model to impute prices at the individual property level for dates at which no
observed prices exist. We therefore facilitate subsequent studies that require estimates for indi-
vidual property prices, for instance, to study real estate price behavior or to use as control varia-
bles. For practitioners, multiple usage scenarios exist. Realtors and prospective sellers can use our
pricing model to obtain an immediate estimate of a reasonable price for the property they are
selling. Hence, they may be able to sell faster and avoid selling at too low a price. Our model can
also help prospective buyers assess whether a certain offer is reasonably priced. Finally, home-
owners can use our model to obtain more precise estimates for the value of their properties and
subsequently their net worth, for instance, to make more informed consumption decisions.
This section is organized as follows. In Subsection 4.1, we describe our dataset and our variable
construction process as well as the specific ML methods that we use. Subsection 4.2 then presents
our prediction results and their interpretation in detail. Subsection 4.3 concludes the section and provides an outlook on future research.
4.1 Data and Methodology
In contrast to other asset classes such as equities or fixed income, no regular market prices exist
in the real estate market. In many cases, transaction prices are also not publicly available, as they
are the result of private negotiations. Instead, we often have to rely on list prices to study real
estate price behavior. List prices are set by sellers or realtors to attract potential buyers and then
merely serve as a starting point for the subsequent negotiation process. As such, list prices deviate
from realized transaction prices. Empirical evidence from various real estate markets, however,
shows that the deviations between list and transaction prices are relatively small: on average, list
prices overestimate transaction prices by less than 10% (Yavas and Yang, 1995; Palmon, Smith,
and Sopranzetti, 2004; Haurin et al., 2010). Hence, we work with the assumption that, especially
over a longer time period, the bias from using list prices instead of transaction prices is negligible.
We construct our sample based on a unique, proprietary dataset from a specialized German real
estate data provider. The dataset consists of a comprehensive collection of detailed real estate
listings for the entire German residential market from five German property portals and major
newspapers between January 2000 and September 2020. We restrict our analyses to single-family
houses, as they are the most common property type in the dataset. Table A1 in the Appendix
gives an overview of the available variables. For our sample, we first eliminate observations with
missing values for any continuous variable. To reduce the influence of outliers and data errors, we
then truncate all continuous variables at the 0.01st and 99.99th percentiles.7 Our sample construction procedure yields the final dataset for our analyses.
For the prediction of real estate prices, we follow the literature (for example, Mullainathan and
Spiess, 2017) and choose the natural logarithm of the list price as our actual prediction target.8
In our main specification, we use the most relevant factors influencing real estate prices from the
given variables. Physical attribute variables describe the characteristics of the property: general
house type, size, number of rooms, lot size, and construction year. As a macro location variable,
we use a property’s county to describe its approximate location within Germany.9 The granular
location variables horizontal geocoordinates and vertical geocoordinates capture a property’s loca-
tion more precisely. Note that they describe not a property’s exact location but the approximate
center of the city district implied by the property’s zip code. They still capture, for instance,
whether a property is located in a city center or in a suburb. Finally, offer variables describe offer-
7 Given the huge size of our dataset and the already high data quality, we choose relatively small outlier percentiles. Using different percentiles has only a minor effect on the results.
8 In unreported analyses, we use price or price per square meter as the prediction target. These alternative specifications produce qualitatively similar results but slightly lower prediction performance.
9 In unreported analyses, we use city and state, in addition to and instead of county, to capture macro location effects. Prediction performance and conclusion are virtually unchanged.
specific features: offer year captures time trends and price-level effects, online listing indicates
whether the sale offer is listed on an online platform, and seller type describes who is selling the
property (realtor, developer, or private owner). We include all categorical variables as dummy
variables in our specification and finally arrive at 388 predictor variables. We also tested an alter-
native specification with a set of additional property characteristics, such as the number of bal-
conies or bathrooms (see Table A1 in the Appendix), which are available for only a limited subset
of our sample. The results were qualitatively similar to those of the main specification.
To accurately assess and compare the out-of-sample prediction performance of different prediction
methods, we divide the sample into two subsamples: training data and test data (also called hold-
out data). We train our prediction models on the training data and subsequently determine their
prediction performance on the test data. Since the algorithm has not seen the test data before,
the test performance provides an unbiased estimate of the expected out-of-sample prediction performance. Many studies that use cross-sectional data assign the dataset’s observa-
tions into training and test data at random. However, our data exhibits a time component. A
random assignment would imply that our ML models can learn from future information (look-
ahead bias): for instance, we would train on some observations from 2020 to predict prices from
2000. Hence, our measured prediction performance would be biased upwards. To avoid this issue,
we split our sample into disjoint time periods. We hereby follow common practice from panel
studies that also have to consider the temporal order in the data (for example, Gu, Kelly, and
Xiu, 2020). We assign observations from 2000 to 2019 as training data and observations from 2020
as test data. Thereby, we take the standpoint of a practitioner who uses all historical data to
learn the pricing mechanism and predicts prices for the most recent observations. We use sample
weights to take into account that the observations from 2019 are more informative for price pre-
dictions in 2020 than observations from 2000. More specifically, we weight the training data line-
arly depending on the offer year: observations from 2000 have a weight of 1, while observations from more recent years receive proportionally higher weights.10
10 In unreported analyses, we also use alternative weighting schemes such as hyperbolic weighting. The results remain qualitatively unchanged.
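The following sketch illustrates the temporal split and the linear weighting scheme on simulated data; the data-generating process and the exact weights are hypothetical stand-ins for our proprietary dataset.

```python
# A sketch of the temporal train/test split and linear sample weights.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical stand-in for the listing data used in the paper.
rng = np.random.default_rng(4)
n = 5000
df = pd.DataFrame({
    "offer_year": rng.integers(2000, 2021, n),
    "size": rng.uniform(60, 250, n),
    "rooms": rng.integers(2, 9, n),
})
df["log_price"] = (10 + 0.005 * df["size"]
                   + 0.02 * (df["offer_year"] - 2000)
                   + rng.normal(0, 0.3, n))
X_cols = ["offer_year", "size", "rooms"]

# Temporal split: train on 2000-2019, evaluate on 2020 (no look-ahead bias).
train, test = df[df["offer_year"] <= 2019], df[df["offer_year"] == 2020]

# Linear weights: observations from 2000 get weight 1, later years more.
weights = train["offer_year"] - 2000 + 1

model = GradientBoostingRegressor().fit(
    train[X_cols], train["log_price"], sample_weight=weights)
print(model.score(test[X_cols], test["log_price"]))  # out-of-sample R²
```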
Table 1 shows summary statistics for the continuous variables in the total sample, the training
sample, and the test sample. The differences in all variables other than list price are negligible.
List prices in the test sample, which covers the most recent observations from 2020, are much
higher than those in the training sample and total sample, which cover previous years, as a result
of price-level effects. We account for such price-level effects by having the offer year variable in
our specification and, as discussed above, by using year-dependent weights in the training data.
Table 1. Summary statistics for the continuous variables of the total sample and the training
and test samples
This table reports summary statistics for the continuous variables in our three samples: the total sample
of all observations, the training sample on which we train our prediction models, and the test sample on
which we evaluate prediction performance. The training sample consists of observations from 2000 to
2019, while the test sample covers 2020.
To predict real estate prices, we apply linear regression with OLS (traditional hedonic regression)
and various supervised ML methods. The pricing performance of the OLS estimates serves as our
baseline against which we compare the performance of the different ML methods. We choose
different classes of supervised ML methods that are widespread in the current literature. Regularized linear regression stays close to traditional OLS but introduces bias to potentially improve prediction performance. We apply the
most common methods of regularized linear regression: LASSO, ridge, and elastic net. Tree-based
methods are especially well suited for capturing nonlinearities and interaction effects. We also
apply the most common of these methods: decision tree, random forest, and boosted regression
trees. Finally, we leverage the common ensemble learning concept and build an ensemble model
that returns the unweighted average of all other models’ predictions.11 We derive suitable hy-
perparameters for each ML model by using fivefold cross-validation.12 For a detailed description of the individual methods, we refer to Section 2.
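In scikit-learn terms, the set of methods we compare looks roughly as follows; the hyperparameter grid shown for the boosted trees is illustrative rather than the one we actually tune.

```python
# The model classes used above, sketched with scikit-learn.
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

models = {
    "ols": LinearRegression(),                # hedonic pricing baseline
    "lasso": Lasso(),
    "ridge": Ridge(),
    "elastic_net": ElasticNet(),
    "tree": DecisionTreeRegressor(),
    "random_forest": RandomForestRegressor(),
    "boosted_trees": GridSearchCV(            # fivefold cross-validation
        GradientBoostingRegressor(),
        param_grid={"n_estimators": [100, 500], "max_depth": [3, 5]},
        cv=5),
}
# After fitting each model, the ensemble prediction is the unweighted
# average of the individual models' predictions.
```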
In addition to the abovementioned methods, there are many more ML methods to make predic-
tions. Currently, a very popular ML method is deep learning with neural networks. Neural net-
works, however, do not perform particularly well for pure prediction based on original numerical
data. Instead, they are the method of choice for unconventional data such as images, videos, or
text (see Section 2). In an unreported analysis, we nevertheless trained a basic feed-forward neural
network on our data. As expected, it not only achieved worse prediction performance compared
to the other methods but also required much higher computational effort.
Various metrics exist to assess prediction performance: R², mean squared error, mean absolute error, etc. Since R² is also a common metric in many empirical studies in economics and hence familiar to most readers, we use it as our primary measure of prediction performance. The different methods’ prediction performance on the test data is most meaningful in
assessing the expected out-of-sample prediction performance. To derive 95% confidence intervals
for the test data performance, we follow Mullainathan and Spiess (2017) and use bootstrap sam-
pling with fixed prediction functions (see their Online Appendix for a more detailed description
of the method). We further calculate the relative improvement of each method over the OLS
baseline by quintile of property price (based on mean squared error). In addition to reporting the
11 In an unreported analysis, we also build a more complex ensemble model that uses a weighted average of the other models’ predictions. We follow the linear regression approach from Mullainathan and Spiess (2017) to derive the optimal weights. The complex ensemble model puts a large weight on the boosted regression tree method and hence performs very similarly.
12 Common practice in the literature is five- to tenfold cross-validation. Given our large dataset and the resulting long computation times, we choose the computationally less demanding fivefold cross-validation.
test data metrics, we report the performance of each method on the training data (in-sample) to
allow comparisons with traditional studies and to illustrate the amount of overfitting.13
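A sketch of the bootstrap with fixed prediction functions described above: the model is estimated only once, and the test observations are resampled to recompute the metric; the function and variable names are hypothetical.

```python
# Bootstrap confidence intervals with a fixed prediction function:
# the model is NOT refit; only the test observations are resampled
# and the metric recomputed on each resample.
import numpy as np
from sklearn.metrics import r2_score

def bootstrap_r2_ci(y_true, y_pred, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # resample with replacement
        stats.append(r2_score(y_true[idx], y_pred[idx]))
    return np.percentile(stats, [2.5, 97.5])  # 95% confidence interval

# Usage: lo, hi = bootstrap_r2_ci(y_test.to_numpy(), model.predict(X_test))
```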
4.2 Prediction Results
Table 2 shows the prediction performance of the OLS baseline and the different ML methods. The
table reveals five main results. First, the prediction performance on the test data is lower than
that on the training data for every method. This observation illustrates the effect of overfitting:
prediction models can closely fit the given training data by picking up noise, so their performance
is lower for test data that the algorithm has not seen before. Thus, our results highlight the well-
established fact that the (in-sample) performance of prediction models on training data is upwards
biased, so we need to evaluate them on held-out test data to derive unbiased estimates for out-of-sample performance.
13 Overfitting refers to the phenomenon that complex prediction methods can flexibly adapt to the given data and possibly pick up noise that does not generalize beyond the training data.
Second, every ML method outperforms the OLS baseline on average, and more complex ML
models achieve higher prediction performance on the test data. Regularized linear regression
methods (LASSO, ridge, elastic net) and the simple decision tree method achieve only modest
improvements over the OLS baseline. More complex methods (random forest, boosted regression
trees, ensemble), however, can strongly improve the prediction performance. The performance
ranking of the different methods is also in line with typical expectations (see Figure 3 in Section
2). Our results strongly indicate that the nonlinearities and interaction effects captured by more complex methods are highly relevant for real estate pricing.
Third, most ML methods do not outperform OLS in every price quintile. Especially in the lowest
quintile, only the boosted regression trees method and the ensemble model achieve superior pre-
diction performance. Hence, we need complex ML methods to achieve not only maximal perfor-
mance on average but also consistent outperformance relative to OLS over the entire price range.
Fourth, the boosted regression trees method outperforms the OLS baseline and every other ML
method on average as well as in every price quintile. Given that boosted regression trees is a
highly optimized ML method that captures complex nonlinearities and interaction effects, this
result further strengthens our previous indication that nonlinear effects are relevant for real estate
pricing.
Fifth, the outperformance of our best-performing ML method, boosted regression trees, monoton-
ically increases by price quintile. For low-priced properties, the improvement induced by ML is
relatively modest even with the best ML method. For high-priced properties, however, the pre-
diction performance of ML dramatically improves over that of OLS. Hence, our results indicate
that nonlinearities and interaction effects are most relevant for properties at the upper end of the
price range.
Having established that advanced ML models outperform OLS in real estate price prediction, we
now analyze the economic magnitude of our findings. To make statements about economic rele-
vance, R² values are less suited. Instead, we use metrics that are more interpretable. First, the
mean absolute percentage error (MAPE) quantifies by what percent a model’s predictions deviate
from the actual prices on average. Based on the MAPE, we calculate the improvement of each
ML model over the OLS baseline on average and per price quintile, and we report their statistical significance. Table 3 shows our results.
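In code, the two interpretability metrics look as follows, assuming that the log-price predictions are first converted back to EUR price levels (an implementation detail not spelled out above).

```python
# MAPE and MAE, assuming log-price predictions are converted back to EUR.
import numpy as np

def mape(prices, predicted_log_prices):
    """Mean absolute percentage error, in percent."""
    predicted = np.exp(predicted_log_prices)
    return 100 * np.mean(np.abs(prices - predicted) / prices)

def mae(prices, predicted_log_prices):
    """Mean absolute error, in EUR."""
    predicted = np.exp(predicted_log_prices)
    return np.mean(np.abs(prices - predicted))
```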
Table 3. Improvements in prediction accuracy for different ML methods
This table shows the MAPE values (in %) for OLS and different ML methods as well as the improvements
over the OLS baseline on average and by quintile of property price. The numbers in brackets show the
respective t-values.
The results from using the MAPE metric are consistent with those from using R². Overly simple
methods (LASSO, ridge, elastic net, and decision tree) achieve a statistically significant but not
economically meaningful improvement over the OLS baseline, with a maximum improvement of
3.6 percentage points. The more complex methods perform much better. The improvements in
pricing accuracy from the best-performing ML method, boosted regression trees, are not only
statistically significant but also economically large. On average, we achieve a pricing error of
26.8% with boosted regression trees compared to 43.6% with OLS. While the average reduction
in pricing error by 16.8 percentage points is already highly meaningful, the improvements from
ML become even larger at the upper end of the price range. In the highest price quintile, boosted
regression trees reduce the average pricing error by 26.2 percentage points. Hence, complex ML
methods, especially the boosted regression trees method, yield price predictions with much higher
accuracy than the OLS approach from hedonic pricing by considering nonlinearities and interac-
tion effects.
To quantify the value of superior prediction performance in monetary units (EUR), we use the
mean absolute error (MAE) metric calculated from the different methods’ price predictions.
Again, we also determine the improvement of each ML method over the OLS baseline on average
and per price quintile and report their statistical significance. Table 4 shows our results. While
the OLS estimates exhibit an average pricing error of over 176,000 EUR, the boosted regression
trees predictions lower the error to approximately 94,000 EUR. Given that the average property
price in the sample is 393,000 EUR, the reduction in pricing error by more than 82,000 EUR is
economically very large. In the highest price quintile, boosted regression trees reduce the average
pricing error by more than 240,000 EUR for an average property price of approximately 884,000
EUR. Hence, we conclude that ML, especially the boosted regression trees method, is able to
reduce the prediction error in real estate pricing in a statistically significant and economically
meaningful way.
4.3 Concluding Remarks and Outlook
In this section, we applied state-of-the-art ML methods to predict real estate prices based on
property characteristics, location, and offer details. As our data source, we used a proprietary set
of real estate listings in Germany from various online and offline sources. While simple methods
such as LASSO or decision trees already outperform traditional hedonic pricing with OLS,
we found that the more complex boosted regression trees method yields much lower pricing errors.
While the average pricing error is almost 44% for OLS, boosted regression trees lowers that value
to less than 27%. In monetary units, the improved pricing accuracy corresponds to a reduction in
pricing error by approximately 82,000 EUR for an average property price of 393,000 EUR. We
infer that nonlinearities and interaction effects captured by complex ML methods are relevant for
real estate pricing. They become even more important at the upper end of the price range: in the
highest price quintile, ML reduces the average pricing error by more than 240,000 EUR for an average property price of approximately 884,000 EUR.
The biggest limitation of our approach is the reliance on listing data instead of transaction data.
List prices often serve as a mere starting point in the subsequent price negotiation. Depending on
the state of the market at the time of selling, the final transaction price might be higher or lower.
Furthermore, it is possible that certain listed properties do not get sold at all. Empirical evidence,
however, indicates that the differences between list prices and transaction prices are rather small
on average. Nevertheless, future studies might look into repeating our prediction exercise with actual transaction data.
Future research might also consider integrating further data sources to enhance prediction perfor-
mance. For instance, macroeconomic data such as GDP or inflation data could provide additional
information that is relevant for real estate prices but is not yet included in our dataset.
Another future research avenue is model interpretation. We only predicted prices without analyz-
ing how our ML models arrive at their predictions and which influencing factors are most im-
portant. To identify relevant predictor variables, feature importance methods such as permutation
importance have become common. However, most feature importance methods produce highly
misleading results, especially if there are strong dependencies between the predictor variables
(Hooker and Mentch, 2019). As we cannot rule out relevant dependencies between our real estate
variables (for instance, size and number of rooms are highly correlated), specialized methods such as those discussed by Hooker and Mentch (2019) would be required.
Finally, it might be interesting to see whether the large benefits of using ML over using OLS for
real estate price prediction also hold in countries other than Germany.
6. Conclusion
In this paper, we studied the question of how researchers can leverage ML technology in financial
economics. First, we identified that different types of ML solve different problems than traditional
linear regression with OLS does. While the properties of OLS are beneficial for explanation prob-
lems, supervised ML is the superior method for prediction problems. The other major type of
ML, unsupervised learning, deals with data structure inference. We also briefly covered other, less common types of ML, such as reinforcement learning.
In the second part of this paper, we identified the three main application categories of ML in financial economics: 1) construction of superior and novel measures, 2) reduction of the prediction
error in economic prediction problems, and 3) extension of the existing econometric toolset. For
each application category in our taxonomy, we further identified subcategories and reviewed the
existing literature.
In the third part, we applied ML in typical prediction problems in economics: real estate price
prediction and credit risk prediction. In real estate price prediction, we compared different ML
methods with OLS and found that the ML-based price predictions achieve dramatically lower
pricing errors than OLS. We also identified that nonlinearities and interaction effects are highly
relevant for real estate pricing, especially at the upper end of the price range. Our analysis can
further serve as a blueprint for studies that want to apply ML to other economic prediction
problems.
Given its already successful applications in practice, we expect ML to become much more wide-
spread in financial economics research in the coming years. ML is likely to contribute to all three
application categories identified above. In constructing superior and novel measures, ML can im-
prove traditional measures that do not work well, create novel measures from traditional datasets
already used in financial economics, or create novel measures from entirely new data sources
outside of traditional financial economics research. To use ML for the reduction of prediction error
in economic prediction problems, the identification of suitable prediction problems is crucial. Fi-
nally, multiple ML-enhanced econometric methods are already available for application in tradi-
tional and novel problems in financial economics and beyond. We hope that this paper can serve
as a guide for researchers who want to apply ML in financial economics and conduct some of the research outlined above.
References
Adämmer, P. and Schüssler, R.A., 2020. Forecasting the Equity Premium: Mind the News! Review
Agrawal, R., Imieliński, T., and Swami, A., 1993. Mining Association Rules between Sets of Items
in Large Databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Man-
Albanesi, S. and Vamossy, D.F., 2019. Predicting Consumer Default: A Deep Learning Approach.
Albawi, S., Mohammed, T.A., and Al-Zawi, S., 2017. Understanding of a Convolutional Neural
Algaba, A., Ardia, D., Bluteau, K., Borms, S., and Boudt, K., 2020. Econometrics Meets Senti-
ment: An Overview of Methodology and Applications. Journal of Economic Surveys 34, 512–47.
Amel-Zadeh, A., Calliess, J.-P., Kaiser, D., and Roberts, S., 2020. Machine Learning-Based Fi-
nancial Statement Analysis. SSRN Working Paper No. 3520684. Social Science Research Network.
Amini, S., Elmore, R., and Strauss, J., 2019. Can Machines Learn Capital Structure? SSRN
Ang, Y.Q., Chia, A., and Saghafian, S., 2020. Using Machine Learning to Demystify Startups
Funding, Post-Money Valuation, and Success. HKS Working Paper No. RWP20-028. Harvard
Kennedy School.
Angrist, J. and Frandsen, B., 2019. Machine Labor. NBER Working Paper No. 26584. National
Antweiler, W. and Frank, M.Z., 2004. Is All That Talk Just Noise? The Information Content of
Athey, S., Bayati, M., Imbens, G.W., and Qu, Z., 2019. Ensemble Methods for Causal Effects in
Athey, S. and Imbens, G.W., 2016. Recursive Partitioning for Heterogeneous Causal Effects. In
Athey, S. and Imbens G.W., 2019. Machine Learning Methods That Economists Should Know
Athey, S., Imbens, G.W., Metzger J., and Munro E.M., 2019. Using Wasserstein Generative Ad-
versarial Networks for the Design of Monte Carlo Simulations. NBER Working Paper No. 26566.
Athey, S. and Wager, S., 2019. Estimating Treatment Effects with Causal Forests: An Application.
Baldauf, M., Garlappi, L., and Yannelis, C., 2020. Does Climate Change Affect Real Estate
Prices? Only If You Believe In It. The Review of Financial Studies 33, 1256–95.
Bandiera, O., Prat, A., Hansen, S., and Sadun, R., 2020. CEO Behavior and Firm Performance.
Bao, Y., Ke, B., Li, B., Yu, Y.J., and Zhang, J., 2020. Detecting Accounting Fraud in Publicly
Traded U.S. Firms Using a Machine Learning Approach. Journal of Accounting Research 58, 199–
235.
Barbon, A., Di Maggio, M., Franzoni, F., and Landier, A., 2019. Brokers and Order Flow Leakage:
Barth, A., Mansouri, S., and Woebbeking, F., 2020. Econlinguistics. SSRN Working Paper No.
Bartov, E., Faurel, L., and Mohanram, P.S., 2017. Can Twitter Help Predict Firm-Level Earnings
Belloni, A., Chen, D., Chernozhukov, V., and Hansen, C., 2012. Sparse Models and Methods for
Beneish, M.D, 1999. The Detection of Earnings Manipulation. Financial Analysts Journal 55, 24–
36.
Bertsch, C., Hull, I., Qi, Y., and Zhang, X., 2020. Bank Misconduct and Online Lending. Journal
Bianchi, D., Büchner, M., and Tamoni, A., 2020. Bond Risk Premiums with Machine Learning.
Binsbergen, J.H.v., Han, X., and Lopez-Lira, A., 2020. Man versus Machine Learning: Earnings
Expectations and Conditional Biases. NBER Working Paper No. 27843. National Bureau of Eco-
nomic Research.
Björkegren, D. and Grissen, D., 2018. The Potential of Digital Credit to Bank the Poor. AEA
Björkegren, D. and Grissen, D., 2019. Behavior Revealed in Mobile Phone Usage Predicts Credit
Bourassa, S.C., Cantoni, E., and Hoesli, M., 2007. Spatial Dependence, Housing Submarkets, and
House Price Prediction. The Journal of Real Estate Finance and Economics 35, 143–60.
Brown, N.C., Crowley, R.M., and Elliott, W.B., 2020. What Are You Saying? Using Topic to
Bubna, A., Das, S.R., and Prabhala, N., 2020. Venture Capital Communities. Journal of Financial
Buehlmaier, M.M. and Whited, T.M., 2018. Are Financial Constraints Priced? Evidence from
Butaru, F., Chen, Q., Clark, B., Das, S., Lo, A.W., and Siddique, A., 2016. Risk and Risk Man-
agement in the Credit Card Industry. Journal of Banking & Finance 72, 218–39.
Cambria, E. and White, B., 2014. Jumping NLP Curves: A Review of Natural Language Pro-
Carrasco, M., 2012. A Regularization Approach to the Many Instruments Problem. Journal of
Chen, L., Pelger, M., and Zhu, J., 2019. Deep Learning in Asset Pricing. SSRN Working Paper
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., and Newey, W., 2017. Double/Debiased/Neyman Machine Learning of Treatment Effects. American Economic Review 107, 261–65.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins,
J., 2018. Double/Debiased Machine Learning for Treatment and Structural Parameters. The
Chinco, A., Clark‐Joseph, A.D., and Ye, M., 2019. Sparse Signals in the Cross-Section of Returns.
Colombo, E., Forte, G., and Rossignoli, R., 2019. Carry Trade Returns with Support Vector
Cowden, C., Fabozzi, F.J., and Nazemi, A., 2019. Default Prediction of Commercial Real Estate
Properties Using Machine Learning Techniques. The Journal of Portfolio Management, Forth-
coming.
Croux, C., Jagtiani, J., Korivi, T., and Vulanovic, M., 2020. Important Factors Determining
Fintech Loan Default: Evidence from a Lendingclub Consumer Platform. Journal of Economic
Din, A., Hoesli, M., and Bender, A., 2001. Environmental Variables and Real Estate Prices. Urban
Dobbie, W., Liberman, A., Paravisini, D., and Pathania, V., 2018. Measuring Bias in Consumer
Lending. NBER Working Paper No. 24953. National Bureau of Economic Research.
Domingues, R., Filippone, M., Michiardi, P., and Zouaoui, J., 2018. A Comparative Evaluation
of Outlier Detection Algorithms: Experiments and Analyses. Pattern Recognition 74, 406–421.
Du, Q., Jiao, Y., Ye, P., and Fan, W., 2019. When Mutual Fund Managers Write Confidently.
Ester, M., Kriegel, H.-P., Sander, J., and Xu, X., 1996. A Density-Based Algorithm for Discovering
Frank, M.Z. and Goyal, V.K., 2009. Capital Structure Decisions: Which Factors Are Reliably
Freyberger, J., Neuhierl, A., and Weber, M., 2020. Dissecting Characteristics Nonparametrically.
Fudenberg, D., Kleinberg, J., Liang, A., and Mullainathan, S., 2019. Measuring the Completeness
of Theories. SSRN Working Paper No. 3018785. Social Science Research Network.
Fuster, A., Goldsmith-Pinkham, P., Ramadorai, T., and Walther, A., 2020. Predictably Unequal?
The Effects of Machine Learning on Credit Markets. SSRN Working Paper No. 3072038. Social
Gathergood, J., Mahoney, N., Stewart, N., and Weber, J., 2019. How Do Individuals Repay Their
Ghysels, E., Plazzi, A., Valkanov, R., and Torous, W., 2013. Chapter 9 - Forecasting Real Estate
Prices. In Handbook of Economic Forecasting, edited by Graham Elliott and Allan Timmermann,
Goodfellow, I., Bengio, Y., and Courville, A., 2016. Deep Learning. Vol. 1. MIT Press, Cambridge,
MA, USA.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A.,
and Bengio, Y., 2020. Generative Adversarial Networks. Communications of the ACM 63, 139–
144.
Gow, I.D., Kaplan, S.N., Larcker, D.F., and Zakolyukina, A.A., 2016. CEO Personality and Firm
Policies. SSRN Working Paper No. 2805635. Social Science Research Network.
Grammig, J., Hanenberg, C., Schlag, C., and Sönksen, J., 2020. Diverging Roads: Theory-Based
vs. Machine Learning-Implied Stock Risk Premia. SSRN Working Paper No. 3536835. Social Sci-
Gu, S., Kelly, B.T., and Xiu, D., 2019. Autoencoder Asset Pricing Models. SSRN Working Paper
Gu, S., Kelly, B.T., and Xiu, D., 2020. Empirical Asset Pricing via Machine Learning. The Review
Gulen, H., Jens, C., and Page, T.B., 2020. An Application of Causal Forest in Corporate Finance:
How Does Financing Affect Investment? SSRN Working Paper No. 3583685. Social Science Re-
search Network.
Gündüz, Y. and Uhrig-Homburg, M., 2011. Predicting Credit Default Swap Prices with Financial
Hanley, K.W. and Hoberg, G., 2019. Dynamic Interpretation of Emerging Risks in the Financial
Hansen, C. and Kozbur, D., 2014. Instrumental Variables Estimation with Many Weak Instru-
Hartford, J., Lewis, G., Leyton-Brown, K., and Taddy, M., 2017. Deep IV: A Flexible Approach
Hastie, T., Tibshirani, R., and Friedman, J., 2009. The Elements of Statistical Learning: Data
Mining, Inference, and Prediction, Second Edition. Springer Science & Business Media, Luxem-
bourg.
Haurin, D.R., Haurin, J.L., Nadauld, T., and Sanders, A., 2010. List Prices, Sale Prices and
Marketing Time: An Application to U.S. Housing Markets. Real Estate Economics 38, 659–85.
Hooker, G. and Mentch, L., 2019. Please Stop Permuting Features: An Explanation and Alterna-
Hott, C., 2011. Lending Behavior and Real Estate Prices. Journal of Banking & Finance 35, 2429–
42.
Hrazdil, K., Novak, J., Rogo, R., Wiedman, C., and Zhang, R., 2020. Measuring Executive Per-
sonality Using Machine-Learning Algorithms: A New Approach and Audit Fee-Based Validation
Hsieh, T.-S., Kim, J.-B., Wang, R.R., and Wang, Z., 2020. Seeing Is Believing? Executives’ Facial
Trustworthiness, Auditor Tenure, and Audit Fees. Journal of Accounting and Economics 69,
101260.
Hu, A., and Ma, S., 2020. Human Interactions and Financial Investment: A Video-Based Ap-
proach. SSRN Working Paper No. 3583898. Social Science Research Network.
Huang, A.H., Zang, A.Y., and Zheng, R., 2014. Evidence on the Information Content of Text in
Hutchinson, J.M., Lo, A.W., and Poggio, T., 1994. A Nonparametric Approach to Pricing and
Hedging Derivative Securities Via Learning Networks. The Journal of Finance 49, 851–89.
Jacobsen, B., Jiang, F., and Zhang, H., 2019. Ensemble Machine Learning and Stock Return
Predictability. SSRN Working Paper No. 3310289. Social Science Research Network.
Jones, S., Johnstone, D., and Wilson, R., 2015. An Empirical Evaluation of the Performance of
Binary Classifiers in the Prediction of Credit Ratings Changes. Journal of Banking & Finance 56,
72–85.
Ke, Z.T., Kelly, B.T., and Xiu, D., 2019. Predicting Returns With Text Data. NBER Working
Kearney, C. and Liu, S., 2014. Textual Sentiment in Finance: A Survey of Methods and Models.
Kelly, B.T., Pruitt, S., and Su, Y., 2019. Characteristics Are Covariances: A Unified Model of
Khandani, A.E., Kim, A.J., and Lo, A.W., 2010. Consumer Credit-Risk Models via Machine-
Kleinberg, J., Ludwig, J., Mullainathan, S., and Sunstein, C.R., 2018. Discrimination in the Age
Kogan, S., Levin, D., Routledge, B.R., Sagi, J.S., and Smith, N.A., 2009. Predicting Risk from
Financial Reports with Regression. In Proceedings of Human Language Technologies: The 2009
Annual Conference of the North American Chapter of the Association for Computational Linguis-
tics, 272–280.
Kozak, S., Nagel, S., and Santosh, S., 2018. Shrinking the Cross Section. SSRN Working Paper
Lahmiri, S. and Bekiros, S., 2019. Can Machine Learning Approaches Predict Corporate Bank-
ruptcy? Evidence from a Qualitative Experimental Design. Quantitative Finance 19, 1569–77.
Lee, B.K., Lessler, J., and Stuart, E.A., 2010. Improving Propensity Score Weighting Using Ma-
Li, K., Mai, F., Shen, R., and Yan, X., 2020. Measuring Corporate Culture Using Machine Learn-
ing. SSRN Working Paper No. 3256608. Social Science Research Network.
Liew, J.K.-S. and Wang, G.Z., 2016. Twitter Sentiment and IPO Performance: A Cross-Sectional
Liu, L. and Özsu, M.T., 2009. Encyclopedia of Database Systems. Vol. 6. Springer, New York,
USA.
Loh, W.-Y., 2011. Classification and Regression Trees. WIREs Data Mining and Knowledge Dis-
covery 1, 14–23.
Loughran, T. and McDonald, B., 2011. When Is a Liability Not a Liability? Textual Analysis,
Loughran, T. and McDonald, B., 2016. Textual Analysis in Accounting and Finance: A Survey.
Lowry, M., Michaely, R., and Volkova, E., 2020. Information Revealed through the Regulatory
Process: Interactions between the SEC and Companies Ahead of Their IPO. The Review of Fi-
Ludwig, J., Mullainathan, S., and Spiess, J., 2019. Augmenting Pre-Analysis Plans with Machine
MacQueen, J., 1967. Some Methods for Classification and Analysis of Multivariate Observations.
In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability 1, 281–
297.
Manela, A. and Moreira, A., 2017. News Implied Volatility and Disaster Concerns. Journal of
Martin, I. and Nagel, S., 2019. Market Efficiency in the Age of Big Data. NBER Working Paper
McInish, T.H., Nikolsko‐Rzhevska, O., Nikolsko‐Rzhevskyy, A., and Panovska, I., 2019. Fast and
Medsker, L.R. and Jain, L.C., 1999. Recurrent Neural Networks: Design and Applications. CRC
Mitchell, M., 1998. An Introduction to Genetic Algorithms. MIT Press, Cambridge, MA, USA.
Moritz, B. and Zimmermann, T., 2016. Tree-Based Conditional Portfolio Sorts: The Relation
between Past and Future Stock Returns. SSRN Working Paper No. 2740751. Social Science Re-
search Network.
Mullainathan, S. and Spiess, J., 2017. Machine Learning: An Applied Econometric Approach.
Nazemi, A. and Fabozzi, F.J., 2018. Macroeconomic Variable Selection for Creditor Recovery
O’Malley, T., 2020. The Impact of Repossession Risk on Mortgage Default. The Journal of Fi-
nance, Forthcoming.
Osterrieder, J., Kucharczyk, D., Rudolf, S., and Wittwer, D., 2020. Neural Networks and Arbi-
trage in the VIX. SSRN Working Paper No. 3591755. Social Science Research Network.
Palmon, O., Smith, B., and Sopranzetti, B., 2004. Clustering in Real Estate Prices: Determinants
Park, B. and Bae, J.K., 2015. Using Machine Learning Algorithms for Housing Price Prediction:
The Case of Fairfax County, Virginia Housing Data. Expert Systems with Applications 42, 2928–
34.
Pavlov, A. and Wachter S., 2011. Subprime Lending and Real Estate Prices. Real Estate Econom-
Peysakhovich, A. and Naecker, J., 2017. Using Methods from Machine Learning to Evaluate Be-
havioral Models of Choice under Risk and Ambiguity. Journal of Economic Behavior & Organi-
Philippon, T., 2019. On Fintech and Financial Inclusion. NBER Working Paper No. 26330. Na-
Quan, D.C. and Titman, S., 1999. Do Real Estate Prices and Stock Prices Move Together? An
Rambachan, A., Kleinberg, J., Ludwig, J., and Mullainathan, S., 2020. An Economic Perspective
Rambachan, A., Kleinberg, J., Mullainathan, S., and Ludwig, J., 2020. An Economic Approach
to Regulating Algorithms. NBER Working Paper No. 27111. National Bureau of Economic Re-
search.
Rasekhschaffe, K.C. and Jones, R.C., 2019. Machine Learning for Stock Selection. Financial Analysts Journal 75, 70–88.
Rasmussen, C., 1999. The Infinite Gaussian Mixture Model. Advances in Neural Information Processing Systems 12, 554–60.
Reichenbacher, M., Schuster, P., and Uhrig-Homburg, M., 2020. Expected Bond Liquidity. SSRN Working Paper. Social Science Research Network.
Renault, T., 2017. Intraday Online Investor Sentiment and Return Patterns in the U.S. Stock Market. Journal of Banking & Finance 84, 25–40.
Rish, I., 2001. An Empirical Study of the Naive Bayes Classifier. In IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence 3, 41–46.
Rossi, A.G., 2018. Predicting Stock Market Returns with Machine Learning. Working Paper.
Rossi, A.G. and Utkus, S.P., 2020. Who Benefits from Robo-Advising? Evidence from Machine
Learning. SSRN Working Paper No. 3552671. Social Science Research Network.
Routledge, B.R., 2019. Machine Learning and Asset Allocation. Financial Management 48, 1069–
94.
Settles, B., 2009. Active Learning Literature Survey. Technical Report. University of Wisconsin-
Madison.
Sigrist, F. and Hirnschall, C., 2019. Grabit: Gradient Tree-Boosted Tobit Models for Default Prediction. Journal of Banking & Finance 102, 177–92.
Sirignano, J., Sadhwani, A., and Giesecke, K., 2018. Deep Learning for Mortgage Risk. arXiv Working Paper No. 1607.02470.
Spiegeleer, J.D., Madan, D.B., Reyners, S., and Schoutens, W., 2018. Machine Learning for Quan-
titative Finance: Fast Derivative Pricing, Hedging and Fitting. Quantitative Finance 18, 1635–43.
Sutton, R.S. and Barto, A.G., 2018. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA.
Tang, V.W., 2018. Wisdom of Crowds: Cross-Sectional Variation in the Informativeness of Third-Party-Generated Product Information on Twitter. Journal of Accounting Research 56, 989–1034.
Tian, S., Yu, Y., and Guo, H., 2015. Variable Selection and Corporate Bankruptcy Forecasts. Journal of Banking & Finance 52, 89–100.
Vamossy, D.F., 2020. Investor Emotions and Earnings Announcements. SSRN Working Paper. Social Science Research Network.
Varian, H.R., 2014. Big Data: New Tricks for Econometrics. Journal of Economic Perspectives 28, 3–28.
Welch, I. and Goyal, A., 2008. A Comprehensive Look at The Empirical Performance of Equity Premium Prediction. The Review of Financial Studies 21, 1455–1508.
Xiang, G., Zheng, Z., Wen, M., Hong, J., Rose, C., and Liu, C., 2012. A Supervised Approach to
Predict Company Acquisition with Factual and Topic Features Using Profiles and News Articles
on TechCrunch. In Proceedings of the Sixth International AAAI Conference on Weblogs and Social
Media 4.
Yang, Q., Liu, Y., Chen, T., and Tong, Y., 2019. Federated Machine Learning: Concept and Applications. ACM Transactions on Intelligent Systems and Technology 10, 1–19.
Yao, J., Li, Y., and Tan, C.L., 2000. Option Price Forecasting Using Neural Networks. Omega 28,
455–66.
Yavas, A. and Yang, S., 1995. The Strategic Role of Listing Price in Marketing Real Estate: Theory and Evidence. Real Estate Economics 23, 347–68.
Zhang, T., Ramakrishnan, R., and Livny, M., 1996. BIRCH: An Efficient Data Clustering Method for Very Large Databases. ACM SIGMOD Record 25, 103–14.
Zhu, X.J., 2005. Semi-Supervised Learning Literature Survey. Technical Report. University of
Wisconsin-Madison.
Zou, H. and Hastie, T., 2005. Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society: Series B 67, 301–20.
Appendix
Table A1. Overview of variables with definitions
This table gives an overview of the variables used in our analysis and provides their definitions. The target variable we predict is the list price. The predictor variables fall into four groups: physical attribute variables, macro location variables, granular location variables, and offer variables. Variables that are available for only a limited subset of the sample enter an additional specification.

Variable | Definition | Original German Variable Name

Target variable
List price | Price of the property in EUR as given in the listing | Preis

Physical attribute variables
House type | Type of house: detached, semi-detached, or terraced/townhouse | Haustyp: EFH, DHH/REH, RH
Size | Size of the property in m² | Wohnfläche
Rooms | Number of rooms of the property | Zimmer
Lot size | Lot size of the property in m² | Grundstücksfläche
Construction year | Construction year of the building | Baujahr

Macro location variables
County | The county where the property is located | Landkreis

Granular location variables
Horizontal geocoordinate | Precise latitude of the center of the city district implied by the property's zip code | Horizontale Geo-Koordinate
Vertical geocoordinate | Precise longitude of the center of the city district implied by the property's zip code | Vertikale Geo-Koordinate

Offer variables
Offer year | Year of the listing | Angebotsjahr
Online listing | Indicates whether the sale offer is listed on an online platform | Online-Angebot
Seller type | Seller type as stated in the listing: realtor, developer, or private owner | Verkäufer

Variables for additional specification (available for only a limited subset of the sample)
Patios | Number of the property's patios | Terrassen
Balconies | Number of the property's balconies | Balkone
Garages | Number of garages belonging to the property | Garagen
Parking lots | Number of parking lots belonging to the property | Kfz-Stellplätze
Bathrooms | Number of the property's bathrooms | Bäder
Renovation status | The property's status of renovation: necessary, partly renovated, or renovated | Zustand
Basement type | The property's type of basement: basement, livable, fully finished, or partially finished | Art des Kellers
Balcony type | The property's type of balcony: balcony or loggia | Art des Balkons
Leased lot | Indicates whether the property's lot is leased | Erbbaupacht
Wintergarden | Indicates whether the property has a wintergarden | Wintergarten
Rooftop | Indicates whether the property has a rooftop terrace | Dachterrasse
Solar | Indicates whether the property has solar panels on the roof | Solaranlage
Other usage | Indicates whether other usage (e.g., commercial) is possible for the property | Alternativnutzung
Hillside | Indicates whether the property is located on a hillside | Hanglage
Studio | Indicates whether the property's top floor is a studio | Dachstudio
Large kitchen | Indicates whether the property contains an extra-large kitchen | Wohnküche
Recreation room | Indicates whether there is a recreational room in the property | Hobbyraum
Sauna | Indicates whether there is a sauna in the property | Sauna
Gallery | Indicates whether the property contains a gallery | Galerie
Fireplace | Indicates whether there is a fireplace in the property | Kamin
Underfloor heat | Indicates whether underfloor heating is available in the property | Fußbodenheizung
Pool | Indicates whether there is a pool on the property | Schwimmbad
Hardwood floors | Indicates whether the property's rooms have hardwood floors | Parkett
Prefab | Indicates whether the building has been prefabricated | Fertighaus
Separate flat | Indicates whether a separate flat belongs to the property | Einliegerwohnung
Attic finished | Indicates whether the property's attic is finished | Ausgebautes Dachgeschoss
Garden | Indicates whether there is a garden on the property | Garten
Pond | Indicates whether there is a pond on the property | Teich
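To make the setup concrete, the following sketch shows how the variables in Table A1 could be assembled into a supervised learning pipeline that predicts list prices. It is a minimal illustration under stated assumptions, not our exact implementation: the file name listings.csv, the column names, the random forest learner, and all hyperparameters are placeholders chosen for exposition.

```python
# Minimal sketch of a list-price prediction pipeline based on the variables
# in Table A1. The file name "listings.csv", all column names, and the
# random forest hyperparameters are illustrative assumptions, not the
# paper's exact setup.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.read_csv("listings.csv")  # hypothetical export of the listings data

numerical = ["size", "rooms", "lot_size", "construction_year",
             "horizontal_geocoordinate", "vertical_geocoordinate",
             "offer_year"]
categorical = ["house_type", "county", "online_listing", "seller_type"]

# One-hot encode categorical predictors; numerical predictors pass through
# unchanged (tree-based learners need no feature scaling).
preprocess = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), categorical)],
    remainder="passthrough",
)

model = Pipeline([
    ("prep", preprocess),
    ("forest", RandomForestRegressor(n_estimators=500, random_state=0)),
])

X = df[numerical + categorical]
y = df["list_price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model.fit(X_train, y_train)
print(f"Out-of-sample R^2: {model.score(X_test, y_test):.3f}")
```

One-hot encoding avoids imposing an artificial ordering on categorical variables such as house type or county; the additional-specification variables in the lower panel of Table A1 could be appended to the feature lists in the same way for the richer specification.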