Random Forest For Binary Classification
Submitted to
Hon-Prof. Dr. Martin Schmidberger
Faculty of Economics and Business
Goethe University Frankfurt
Frankfurt am Main
By: Huifeng Li
Email: [email protected]
Contents
1 Introduction
2 Methods
2.1 Decision Tree
2.2 Random Forest
3 Data
3.1 Data Preprocessing
3.1.1 Data Quality
3.1.2 Missing Values
3.2 Exploratory Data Analysis
3.2.1 Target Feature
3.2.2 Interactions between Input Features and Target Features
5 Conclusion
6 References
1 Introduction
The main goal of this seminar paper is to apply the random forest algorithm to a real binary classification problem, using a customer analytics dataset provided by ING-DiBa, a German direct bank based in Frankfurt. With this dataset we would like to gain insights into cross-selling the Bank's checking account product to its existing clients, who have already bought at least one product such as a loan, a mortgage, or an investment product. These insights can help the Bank increase the response rate of its future marketing campaigns among its existing clients and thereby sell more checking accounts.
Since the focus of this seminar paper lies on practical application rather than theoretical innovation, I devote little effort to reviewing the abundant literature on the theoretical and practical aspects of the random forest algorithm. Instead, most of the effort goes into the implementation of the random forest algorithm on this data set, covering data preprocessing, feature engineering and hyperparameter search.
As a data science project, this seminar paper is intended to provide a playground where I can gain hands-on experience with data processing, data visualization, and predictive modelling. As an integral part of this seminar paper, I also provide the code (a Jupyter notebook in R) used for this project; the results presented in this paper are reproducible. Since an understanding of the random forest algorithm is essential for this project, I also provide a detailed description of how the decision tree and random forest algorithms work.
This seminar paper is organized as follows. In Chapter 2, I describe the random forest algorithm, which is the machine learning algorithm I apply to the cross-selling data set "xsell" of the Bank; since a random forest is built on an ensemble of decision trees, I first introduce the decision tree algorithm. This chapter provides the methodological basis of the seminar paper. In Chapter 3, I describe the data set used in this paper, including the target feature and the input features, the data quality checks, and the data preprocessing. More importantly, I perform exploratory data analysis to show how the input features interact with the target feature, which aims to give the reader a first impression of which input features are important for the predictive modelling. In Chapter 4, I apply the random forest algorithm to the "xsell" data set; this chapter covers the feature engineering, the split of the "xsell" data set into training, validation and test sets, the hyperparameter search and finally the results. Chapter 5 concludes.
The results achieved in this seminar paper are promising. After feature engineering and hyperparameter search, the best model I have identified achieves an Area under the ROC Curve (AUC) of 75.5% on the test dataset. This predictive performance is comparable to that of the model used in production at the Bank. It has to be noted that this result is achieved with only a moderate amount of hyperparameter tuning; the predictive performance of the random forest algorithm is expected to improve further if the hyperparameter search is extended.
2 Methods
2.1 Decision Tree
It is possible to grow a set of different decision trees that are all consistent with the training instances by splitting the dataset using different ordered sets of descriptive features. The CART algorithm (Breiman et al., 1984) uses the depth of the resulting decision trees as the criterion to decide among them: shallow decision trees are preferred. To build a shallow decision tree, one must put the most discriminating feature with respect to the target feature at the root of the tree and apply the same rule recursively to partition all the subsequent subsets resulting from the first split. Choosing the most discriminating feature to split each partition of the dataset is the inductive bias inherent in decision tree models (Kelleher, Mac Namee and D'Arcy, 2015, p. 123).
To measure the discriminatory power of a descriptive feature with respect to the different levels of the target feature, one can use the measure of information gain from information theory; this is why decision trees are information-based machine learning algorithms. Information gain measures the reduction in the impurity, or heterogeneity, of the data points in the dataset with respect to the target feature that results from splitting the dataset on a feature. In the CART algorithm, the measure of impurity of a dataset
with respect to a target feature is called the Gini index. The Gini index measures the probability that a randomly chosen instance from the data set would be labeled incorrectly if it were labeled at random according to the distribution of the target levels. For a data set $S$ with target feature $t$, whose levels are denoted by $l$, the Gini index is defined as

$$\mathrm{Gini}(t, S) = 1 - \sum_{l \in \mathrm{levels}(t)} P(t = l)^2$$

For example, for a binary target feature with levels $l_1$ and $l_2$, the Gini index can be written as

$$\mathrm{Gini}(t, S) = 1 - P(t = l_1)^2 - P(t = l_2)^2$$
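To make the definition concrete, the following R sketch computes the Gini index of a target vector; the function name and the toy data are illustrative, and the function is reused in the split-search sketch further below.

# Gini index of a target vector t: 1 - sum over levels l of P(t = l)^2
gini <- function(t) {
  p <- table(t) / length(t)  # empirical probabilities P(t = l)
  1 - sum(p^2)
}

gini(c(1, 1, 0, 0))  # 0.5: maximal impurity for a balanced binary target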
CART Algorithm
Data: a set of input features D, a target feature t, and a set of m training instances.
Specify the early stopping criteria for the induction of the decision tree, such as the maximum depth of the tree, the minimal number of instances in each leaf node, and so on.
1. Find the feature d and its level $l_k$ that optimally split the training instances into two subsets:

$$(d, l_k) = \underset{d \in D,\; l_k \in \mathrm{levels}(d)}{\arg\min}\; J(d, l_k), \qquad J(d, l_k) = \frac{m_{\mathrm{left}}}{m}\,\mathrm{Gini}_{\mathrm{left}} + \frac{m_{\mathrm{right}}}{m}\,\mathrm{Gini}_{\mathrm{right}},$$

where $\mathrm{Gini}_{\mathrm{left}}$ denotes the impurity of the left subset and $m_{\mathrm{left}}/m$ the fraction of instances it contains.
2. Add a node d to the tree and grow a binary subtree with the two branches $l_k$ and non-$l_k$, thereby partitioning the dataset into the two subsets $m_{\mathrm{left}}$ and $m_{\mathrm{right}}$.
3. Apply steps 1 and 2 recursively to each of the two subsets.
4. Stop growing the tree and add a leaf node whenever the early stopping criteria are met.
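As a rough illustration of step 1, the following R sketch searches for the best binary split of a single numeric feature by minimizing the weighted impurity J; it is a simplified sketch under strong assumptions (one numeric feature, thresholds as split points), not a full CART implementation.

# weighted impurity J of splitting the numeric feature x at threshold s,
# reusing the gini() function defined above
split_cost <- function(x, t, s) {
  left <- t[x <= s]
  right <- t[x > s]
  (length(left) / length(t)) * gini(left) +
    (length(right) / length(t)) * gini(right)
}

# exhaustive search over all candidate thresholds of one feature
best_split <- function(x, t) {
  candidates <- head(sort(unique(x)), -1)  # drop the maximum: it leaves the right subset empty
  costs <- sapply(candidates, function(s) split_cost(x, t, s))
  list(threshold = candidates[which.min(costs)], cost = min(costs))
}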
Decision trees are very sensitive to small variations in the dataset and therefore prone to overfitting (Kelleher, Mac Namee and D'Arcy, 2015). To achieve good generalizability beyond the training dataset, it is usually necessary to regularize the induction of the tree with respect to the maximum depth of the tree, the minimum number of samples a node must have before it can be split, the minimum number of samples a leaf node must have, the maximum number of leaf nodes, and so on. Regularization limits the extent to which a data set can be split. Largely due to regularization, the instances in the leaf nodes are usually not pure in terms of the levels of the target feature, and the probability the decision tree predicts for an instance belonging to a class is the fraction of samples of that class in the leaf node to which the instance is assigned. To predict whether an instance belongs to a class, it is usually necessary to set a threshold on these probability values: instances whose probability of belonging to a class is above the threshold are classified as belonging to that class.
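To illustrate such regularization and the probability threshold, the following R sketch grows a single regularized tree with the package "rpart"; the data frames "train" and "valid", the factor target "xsell" with levels "0"/"1", and all control values are assumptions for illustration, not the settings used in this paper.

library(rpart)

# a single regularized decision tree (illustrative control values)
fit <- rpart(xsell ~ ., data = train, method = "class",
             control = rpart.control(maxdepth = 6,    # maximum depth of the tree
                                     minsplit = 100,  # min. samples a node needs before a split
                                     minbucket = 50)) # min. samples per leaf node

# predicted probability = fraction of same-class samples in the assigned leaf
p <- predict(fit, newdata = valid, type = "prob")[, "1"]
pred <- ifelse(p > 0.5, 1, 0)  # classify with a probability threshold of 0.5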
2.2 Random Forest
The random forest algorithm for binary classification can be described as follows (Kelleher, Mac Namee and D'Arcy, 2015):
1. Set the number of decision trees (n) contained in the random forest.
2. Grow each of the n decision trees on a bootstrap sample of the training set, considering only a random subset of the input features at each split.
3. Aggregate the predictions (average the probabilities) from the n decision trees into the final predictions.
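A minimal sketch of these three steps with the R package "randomForest" could look as follows; the data frames and parameter values are illustrative assumptions. For classification, the package reports the fraction of trees voting for each class, which plays the role of the averaged probability in step 3.

library(randomForest)

# steps 1 and 2: grow n trees, each on a bootstrap sample of the training set,
# considering a random subset of mtry features at each split;
# xsell is assumed to be coded as a factor
rf <- randomForest(xsell ~ ., data = train,
                   ntree = 500,  # n, the number of trees
                   mtry = 6)     # features tried at each split (illustrative)

# step 3: aggregate the per-tree predictions into final probabilities
p <- predict(rf, newdata = valid, type = "prob")[, "1"]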
3 Data
The dataset "xsell_giro" is provided by ING-DiBa, a German direct bank based in Frankfurt. With this dataset ING-DiBa would like to gain insights into the cross-selling of its checking account product to its existing clients, who have already bought at least one product such as a loan, a mortgage, or an investment product.
This data set is a typical marketing dataset and contains 100,000 clients, with about 40 features per client. The features for each client were gathered before a 6-month marketing campaign began. The marketing campaign took place via various channels (mailing, e-mailing, online, etc.) during this 6-month period. At the end of the marketing campaign, the target feature "xsell", indicating whether a client opened a checking account, was observed and gathered.
Most of the data preprocessing work has already been performed by ING-DiBa, and the dataset is already in tidy form. The majority of the features in this data set, as shown in the following table, are created through an ETL process, which extracts, transforms and loads the raw data from the databases in the productive system of the bank into a separate analytics environment.
In addition to the features originating from the productive system of the Bank, the data set contains 7 features ("ext_city_size", "ext_house_size", "ext_purchase_power", "ext_share_new_houses", "ext_share_new_cars", "ext_car_power", "ext_living_duration") from external sources, such as Deutsche Post or Bertelsmann, which are merged with the features from the productive system.
3.1 Data Preprocessing
3.1.1 Data Quality
These outliers are regarded as implausible, and the rows containing them are therefore removed from the dataset; this leads to the removal of 167 rows. After removing the outliers, the boxplot of "age" by occupational group is drawn and shown in Figure 2.
3.1.2 Missing Values
To train the random forest algorithm on this dataset, we first need to impute the missing values of these features. For this purpose, I use the function "na.roughfix()" from the R package "randomForest", which replaces missing values in numeric features with the column medians and missing values in factor features with the most frequent level (breaking ties at random).
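In code, this imputation is a single call; "xsell_data" is a placeholder name for the raw data frame.

library(randomForest)

# replace NAs with column medians (numeric) or the most frequent level (factor)
xsell_data <- na.roughfix(xsell_data)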
3.2.2 Interactions between Input Features and Target Features
In Figure 4, the x-axis shows the different levels of the selected numeric features ("age", "customer_tenure_months", "directmails", and "nr_products"), and the y-axis shows the rate of successful cross-selling, or xsell-likelihood, for each subgroup corresponding to a level of a feature, computed as the number of clients with xsell = 1 divided by the total number of clients in the subgroup. It is clearly visible that younger clients, and clients with a short tenure since onboarding, show a higher xsell-likelihood; up to a certain level, the likelihood of selling a checking account increases with the total number of mailings in the last year and with the total number of products (accounts) already bought by the client.
Figure 4: xsell_rates per age, customer tenure, direct mails and number of products
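The xsell-likelihood per subgroup can be computed along the following lines in base R; the column names and the age breaks are assumptions for illustration.

# xsell rate per age group: #(xsell = 1) / #(clients in the group)
age_group <- cut(xsell_data$age, breaks = seq(15, 90, by = 5))
xsell_rate <- tapply(xsell_data$xsell, age_group, mean)  # mean of a 0/1 target = rate
barplot(xsell_rate, las = 2, ylab = "xsell rate")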
Figure 5 shows the probability densities of the feature "last_acc_opening_days" (the number of days since the last account opening) for the two levels of the target feature "xsell". It is easy to see that if a client opened his or her last account with the Bank fewer than 1,000 days ago, this client is much more likely to respond to the marketing campaign and open a checking account with the Bank.
For each of the following six categorical features: giro_mailing, maritial_status, member_get_member_active, member_get_member_passive, occup and gender, Figure 6 shows the xsell-likelihood for each subgroup of clients defined by its levels.
From this figure we can make the following observations. The xsell-likelihood of existing clients with giro_mailing equal to 1 is more than twice that of clients with giro_mailing equal to 0. Existing clients who are not married or living together with a partner show a comparatively higher xsell-likelihood. Existing clients who have recommended a client to the Bank, or have been recommended by a client of the Bank, are more than 3 times more likely to open a checking account with the Bank after being approached by the marketing campaign. The xsell-likelihood of existing clients belonging to the occupational groups "Azubis", "Schüler" and "Studenten" is more than twice the average xsell-likelihood in the data set; this might be due to the financial needs of the clients in these groups, who are more responsive to the financial incentive offered by the marketing campaign (50 Euros).
To capture the change in a client's debit volume between 6 months ago and now, I create a feature "volume_change" as "volume_debit - volume_debit_6months" and use it instead of "volume_debit" and "volume_debit_6months" in the random forest model.
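In R, this feature is a simple difference of the two columns, shown here on the placeholder data frame "xsell_data":

# change of the debit volume over the last six months
xsell_data$volume_change <- xsell_data$volume_debit - xsell_data$volume_debit_6months
# drop the two original columns, since volume_change replaces them
xsell_data$volume_debit <- NULL
xsell_data$volume_debit_6months <- NULL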
4.2 Models
After data preprocessing and feature engineering, the data set "xsell" is partitioned into three datasets: a training set (70%), a validation set (15%) and a test set (15%).
The random forest models are trained on the training set and tuned on the validation set. Since the random forest algorithm has many hyperparameters, and different hyperparameter configurations yield models that perform differently, it is necessary to search over many configurations in order to find one with good performance.
To perform systematic hyperparameter tuning, I use the random grid search in the R package "h2o". I first specify the hyperparameter space to search over and the tuning strategy; a number of random forest models, with hyperparameters randomly selected from this space, are then trained on the training dataset. To evaluate the trained models, I apply them to the validation set and compare the performance metric "Area under the ROC Curve (AUC)". After the hyperparameter tuning is finished, all trained models are ranked by their AUC on the validation dataset; the model with the highest validation AUC is chosen as the best model and used for prediction on the test set.
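A sketch of this workflow with the package "h2o" is given below; the hyperparameter ranges, seeds and frame names are illustrative assumptions, not the exact search space used for the results in Chapter 4.3.

library(h2o)
h2o.init()

hf <- as.h2o(xsell_data)
hf$xsell <- as.factor(hf$xsell)  # binary classification needs a factor target

# 70% / 15% / 15% split into training, validation and test set
splits <- h2o.splitFrame(hf, ratios = c(0.70, 0.15), seed = 42)
train <- splits[[1]]; valid <- splits[[2]]; test <- splits[[3]]

# illustrative hyperparameter space and random search strategy
hyper_params <- list(ntrees = c(100, 300, 500),
                     max_depth = c(10, 20, 30),
                     mtries = c(4, 6, 8),
                     min_rows = c(10, 50, 100))
search_criteria <- list(strategy = "RandomDiscrete", max_models = 42, seed = 42)

grid <- h2o.grid("randomForest", grid_id = "rf_grid",
                 x = setdiff(names(train), "xsell"), y = "xsell",
                 training_frame = train, validation_frame = valid,
                 hyper_params = hyper_params, search_criteria = search_criteria)

# rank all trained models by their AUC on the validation set
ranked <- h2o.getGrid("rf_grid", sort_by = "auc", decreasing = TRUE)
best <- h2o.getModel(ranked@model_ids[[1]])
h2o.auc(h2o.performance(best, newdata = test))  # AUC of the best model on the test set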
4.3 Results
In the random grid search, only a random subset (42 models) of all possible combinations of the hyperparameters specified in the hyperparameter space is chosen and trained on the training dataset. After ranking the 42 models by their AUC on the validation dataset, I obtain the best model with the following hyperparameters:
On the validation dataset, the best model achieves an AUC of 0.756. Applying this model to the test set, I obtain an AUC of 0.7546 for the best model on the test set.
Recall that the direct predictions of the random forest algorithm for binary classification on the xsell dataset are "scores", with values between 0 and 1, for a client opening a checking account with the Bank. These scores denote the propensity of a client to open a checking account; since they lie between 0 and 1, they can also be viewed as probabilities of belonging to a class. The AUC can be understood as the probability that a random positive example receives a larger score than a random negative example (Engelmann, Hayden and Tasche, 2003). If AUC = 1, the classifier is able to perfectly distinguish all positive examples from all negative examples.
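This pairwise interpretation can be checked directly: the following R sketch estimates the AUC as the fraction of positive/negative pairs in which the positive example receives the larger score, with ties counted as one half; "p" and "y" are placeholder names for the score and label vectors.

auc_pairwise <- function(p, y) {
  pos <- p[y == 1]
  neg <- p[y == 0]
  # compare every positive score with every negative score
  wins <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
  mean(wins)
}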
Figure 8 ranks the features in the best model by their importance. The features "age", "last_acc_opening_days", and "customer_tenure_months" have the largest predictive power in the best model. This is consistent with our first impressions from Chapter 3.2.2.
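With h2o, such a ranking can be read directly from the trained model; "best" is the placeholder name for the best model from the grid search sketched above, and the feature cutoff is illustrative.

h2o.varimp(best)                             # variable importance table
h2o.varimp_plot(best, num_of_features = 15)  # importance plot similar to Figure 8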
5 Conclusion
In this seminar paper, I use a data set on cross-selling to develop a random forest model. The work done for this seminar paper covers the whole process of a data science project, including data preprocessing, data quality checks, exploratory data analysis, feature engineering, and model development and deployment.
The results achieved are promising. After feature engineering and hyperparameter search, the best model achieves an AUC of 75.5% on the test dataset, a predictive performance comparable to that of the model used in production at the Bank. Notably, this result was achieved with only a moderate amount of hyperparameter tuning, and the predictive performance is expected to improve further if the hyperparameter search is extended.
An interesting lesson from this project is that one should always check the quality of the dataset before beginning with the model development, as shown in Chapter 3.1.1. A data set of good quality is the basis for successful model development. In addition,
exploratory data analysis is also an integral part of a data science project, through which we can get first impressions of how the input features associate with the target feature and whether the data for the input features are plausible.
6 References
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Chapman and Hall/CRC.
Engelmann, B., Hayden, E., & Tasche, D. (2003). Measuring the Discriminative Power of Rating Systems. Deutsche Bundesbank Discussion Paper.
Kelleher, J. D., Mac Namee, B., & D'Arcy, A. (2015). Fundamentals of Machine Learning for Predictive Data Analytics. MIT Press, 117-170.
7 Statutory Declaration
I hereby declare that I have written this thesis independently and without the use of any sources or aids other than those indicated. Sentences or parts of sentences taken verbatim from other sources are marked as quotations; other borrowings are indicated, with respect to content and scope, by references to the source. This thesis has not been submitted in the same or a similar form to any examination authority and has not been published. It has not been used, in whole or in part, for any other examination or course credit.
____________________________________ ___________________________________
Place, Date Signature