Random Forest For Binary Classification
Submitted to
Hon-Prof. Dr. Martin Schmidberger
Faculty of Economics and Business
Goethe University Frankfurt
Frankfurt am Main
By: Huifeng Li
Email: [email protected]
Contents
1 Introduction
2 Methods
2.1 Decision Tree
2.2 Random Forest
3 Data
3.1 Data Preprocessing
3.1.1 Data Quality
3.1.2 Missing Values
3.2 Exploratory Data Analysis
3.2.1 Target Feature
3.2.2 Interactions between Input Features and Target Features
5 Conclusion
6 References
1 Introduction
The main goal of this seminar paper is to apply the random forest algorithm to a real binary classification problem, using a customer analytics dataset provided by ING-DiBa, a German direct bank based in Frankfurt. With this dataset we would like to gain insights into cross-selling the Bank's checking account product to its existing clients, who have already bought at least one product such as a loan, a mortgage, or an investment product. These insights can help the Bank increase the response rate of its future marketing campaigns among its existing clients and thereby sell more checking accounts.
Since the focus of this seminar paper lies on practical application rather than theoretical innovation, I devote little effort to reviewing the abundant literature on the theoretical and practical aspects of the random forest algorithm. Instead, most of the effort goes into the implementation of the random forest algorithm on this data set, covering data preprocessing, feature engineering and hyperparameter search.
As a data science project, this seminar paper is intended to provide a playground where I can gain hands-on experience with data processing, data visualization, and predictive modelling. As an integral part of this seminar paper, I also provide the code (a Jupyter notebook in R) used for this project; the results presented in this paper are reproducible. Since an understanding of the random forest algorithm is essential for this project, I also provide a detailed description of how the decision tree and random forest algorithms work.
This seminar paper is organized as follows. In Chapter 2, I describe the random forest algorithm, which is the machine learning algorithm I apply to the cross-selling data set "xsell" of the Bank; since a random forest is built on an ensemble of decision trees, I first introduce the decision tree algorithm. This chapter provides the methodological basis of the seminar paper. In Chapter 3, I describe the data set used in this paper, including the target feature and the input features, the data quality checks, and the data preprocessing. More importantly, I perform exploratory data analysis to show how the input features interact with the target feature, which aims to give the reader a first impression of which input features are important for the predictive modelling. In Chapter 4, I apply the random forest algorithm to the "xsell" data set; this chapter covers the feature engineering, the split of the "xsell" data set into training, validation and test sets, the hyperparameter search and finally the results. Chapter 5 concludes.
The results achieved in this seminar paper are promising. After feature engineering and hyperparameter search, the best model I have identified achieves an Area under the ROC Curve (AUC) of 75.5% on the test dataset. This predictive performance is comparable to that of the model used in production at the Bank. It has to be noted that this result is achieved with only a moderate amount of hyperparameter tuning; the predictive performance of the random forest algorithm is expected to improve further if the hyperparameter search is extended.
2 Methods
2.1 Decision Tree
It is possible to grow a set of different decision trees that are all consistent with the training instances by splitting the dataset using different ordered sets of descriptive features. The CART algorithm (Breiman et al., 1984) uses the depth of the resulting decision trees as the criterion to decide among them: shallow decision trees are preferred. To build a shallow decision tree, one must put the most discriminating feature with respect to the target feature at the root of the tree and apply the same rule recursively to partition all the subsequent subsets resulting from the first split. Choosing the most discriminating feature to split each partition of the dataset is the inductive bias inherent in decision tree models (Kelleher, Mac Namee and D'Arcy, 2015, p. 123).
To measure the discriminatory power of a descriptive feature with respect to the different levels of the target feature, one can use the measure of information gain from information theory; this is why decision trees are information-based machine learning algorithms. Information gain measures the reduction in the impurity, or heterogeneity, of the data points in the dataset with respect to the target feature that results from splitting the dataset on a feature. In the CART algorithm, the measure of impurity of a dataset
with respect to a target feature is called the Gini index. The Gini index measures the probability that a randomly chosen instance from the data set would be labeled incorrectly if it were labeled at random according to the distribution of the target levels. For a data set $S$ with target feature $t$, whose levels are denoted by $l$, the Gini index is defined as

$$\mathrm{Gini}(t, S) = 1 - \sum_{l \in \mathrm{levels}(t)} P(t = l)^2$$

For example, for a binary target feature with levels $l_1$ and $l_2$, the Gini index can be written as

$$\mathrm{Gini}(t, S) = 1 - P(t = l_1)^2 - P(t = l_2)^2$$
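To make the definition concrete, the following R sketch computes the Gini index of a target vector; the function name and the toy data are illustrative, and the function is reused in the split-search sketch further below.

# Gini index of a target vector t: 1 - sum over levels l of P(t = l)^2
gini <- function(t) {
  p <- table(t) / length(t)  # empirical probabilities P(t = l)
  1 - sum(p^2)
}

gini(c(1, 1, 0, 0))  # 0.5: maximal impurity for a balanced binary target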
CART Algorithm
Data: a set of input features D, a target feature t, and a set of m training instances.
Specify the early stopping criteria for the induction of the decision tree, such as the maximum depth of the tree, the minimal number of instances in each leaf node, and so on.
1. Find the feature d and its level $l_k$ that optimally split the training instances into two subsets:

$$(d, l_k) = \underset{d \in D,\; l_k \in \mathrm{levels}(d)}{\arg\min}\; J(d, l_k), \qquad J(d, l_k) = \frac{m_{\mathrm{left}}}{m}\,\mathrm{Gini}_{\mathrm{left}} + \frac{m_{\mathrm{right}}}{m}\,\mathrm{Gini}_{\mathrm{right}},$$

where $\mathrm{Gini}_{\mathrm{left}}$ denotes the impurity of the left subset and $m_{\mathrm{left}}/m$ the fraction of instances it contains.
2. Add a node d to the tree and grow a binary subtree with the two branches $l_k$ and non-$l_k$, thereby partitioning the dataset into the two subsets $m_{\mathrm{left}}$ and $m_{\mathrm{right}}$.
3. Apply steps 1 and 2 recursively to each of the two subsets.
4. Stop growing the tree and add a leaf node whenever the early stopping criteria are met.
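As a rough illustration of step 1, the following R sketch searches for the best binary split of a single numeric feature by minimizing the weighted impurity J; it is a simplified sketch under strong assumptions (one numeric feature, thresholds as split points), not a full CART implementation.

# weighted impurity J of splitting the numeric feature x at threshold s,
# reusing the gini() function defined above
split_cost <- function(x, t, s) {
  left <- t[x <= s]
  right <- t[x > s]
  (length(left) / length(t)) * gini(left) +
    (length(right) / length(t)) * gini(right)
}

# exhaustive search over all candidate thresholds of one feature
best_split <- function(x, t) {
  candidates <- head(sort(unique(x)), -1)  # drop the maximum: it leaves the right subset empty
  costs <- sapply(candidates, function(s) split_cost(x, t, s))
  list(threshold = candidates[which.min(costs)], cost = min(costs))
}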
Decision trees are very sensitive to small variations in the dataset and therefore prone to overfitting (Kelleher, Mac Namee and D'Arcy, 2015). To achieve good generalizability beyond the training dataset, it is usually necessary to regularize the induction of the tree with respect to the maximum depth of the tree, the minimum number of samples a node must have before it can be split, the minimum number of samples a leaf node must have, the maximum number of leaf nodes, and so on. Regularization limits the extent to which a data set can be split. Largely due to regularization, the instances in the leaf nodes are usually not pure in terms of the levels of the target feature, and the probability the decision tree predicts for an instance belonging to a class is the fraction of samples of that class in the leaf node to which the instance is assigned. To predict whether an instance belongs to a class, it is usually necessary to set a threshold on these probability values: instances whose probability of belonging to a class is above the threshold are classified as belonging to that class.
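To illustrate such regularization and the probability threshold, the following R sketch grows a single regularized tree with the package "rpart"; the data frames "train" and "valid", the factor target "xsell" with levels "0"/"1", and all control values are assumptions for illustration, not the settings used in this paper.

library(rpart)

# a single regularized decision tree (illustrative control values)
fit <- rpart(xsell ~ ., data = train, method = "class",
             control = rpart.control(maxdepth = 6,    # maximum depth of the tree
                                     minsplit = 100,  # min. samples a node needs before a split
                                     minbucket = 50)) # min. samples per leaf node

# predicted probability = fraction of same-class samples in the assigned leaf
p <- predict(fit, newdata = valid, type = "prob")[, "1"]
pred <- ifelse(p > 0.5, 1, 0)  # classify with a probability threshold of 0.5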
2.2 Random Forest
The random forest algorithm for binary classification can be described as follows (Kelleher, Mac Namee and D'Arcy, 2015):
1. Set the number of decision trees (n) contained in the random forest.
2. Grow each of the n decision trees on a bootstrap sample of the training set, considering only a random subset of the input features at each split.
3. Aggregate the predictions (average the probabilities) from the n decision trees into the final predictions.
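A minimal sketch of these three steps with the R package "randomForest" could look as follows; the data frames and parameter values are illustrative assumptions. For classification, the package reports the fraction of trees voting for each class, which plays the role of the averaged probability in step 3.

library(randomForest)

# steps 1 and 2: grow n trees, each on a bootstrap sample of the training set,
# considering a random subset of mtry features at each split;
# xsell is assumed to be coded as a factor
rf <- randomForest(xsell ~ ., data = train,
                   ntree = 500,  # n, the number of trees
                   mtry = 6)     # features tried at each split (illustrative)

# step 3: aggregate the per-tree predictions into final probabilities
p <- predict(rf, newdata = valid, type = "prob")[, "1"]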
3 Data
The dataset "xsell_giro" is provided by ING-DiBa, a German direct bank based in Frankfurt. With this dataset ING-DiBa would like to gain insights into the cross-selling of its checking account product to its existing clients, who have already bought at least one product such as a loan, a mortgage, or an investment product.
This data set is a typical marketing dataset and contains 100,000 clients, with about 40 features per client. The features for each client were gathered before a 6-month marketing campaign began. The marketing campaign took place via various channels (mailing, e-mailing, online, etc.) during this 6-month period. At the end of the marketing campaign, the target feature "xsell", indicating whether a client opened a checking account, was observed and gathered.
Most of the data preprocessing work has already been performed by ING-DiBa, and the dataset is already in tidy form. The majority of the features in this data set, as shown in the following table, are created through an ETL process, which extracts, transforms and loads the raw data from the databases in the productive system of the bank into a separate analytics environment.
In addition to the features originating from the productive system of the Bank, the data set contains 7 features ("ext_city_size", "ext_house_size", "ext_purchase_power", "ext_share_new_houses", "ext_share_new_cars", "ext_car_power", "ext_living_duration") from external sources, such as Deutsche Post or Bertelsmann, which are merged with the features from the productive system.
3.1 Data Preprocessing
3.1.1 Data Quality
These outliers are regarded as implausible, and the rows containing them are therefore removed from the dataset; this leads to the removal of 167 rows. After removing the outliers, the boxplot of "age" by occupational group is drawn and shown in Figure 2.
3.1.2 Missing Values
To train the random forest algorithm on this dataset, we first need to impute the missing values of these features. For this purpose, I use the function "na.roughfix()" from the R package "randomForest", which replaces missing values in numeric features with the column medians and missing values in factor features with the most frequent level (breaking ties at random).
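In code, this imputation is a single call; "xsell_data" is a placeholder name for the raw data frame.

library(randomForest)

# replace NAs with column medians (numeric) or the most frequent level (factor)
xsell_data <- na.roughfix(xsell_data)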
3.2.2 Interactions between Input Features and Target Features
In Figure 4, the x-axis shows the different levels of the selected numeric features ("age", "customer_tenure_months", "directmails", and "nr_products"), and the y-axis shows the rate of successful cross-selling, or xsell-likelihood, for each subgroup corresponding to a level of a feature, computed as the number of clients with xsell = 1 divided by the total number of clients in the subgroup. It is clearly visible that younger clients, and clients with a short tenure since onboarding, show a higher xsell-likelihood; up to a certain level, the likelihood of selling a checking account increases with the total number of mailings in the last year and with the total number of products (accounts) already bought by the client.
Figure 4: xsell_rates per age, customer tenure, direct mails and number of products
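The xsell-likelihood per subgroup can be computed along the following lines in base R; the column names and the age breaks are assumptions for illustration.

# xsell rate per age group: #(xsell = 1) / #(clients in the group)
age_group <- cut(xsell_data$age, breaks = seq(15, 90, by = 5))
xsell_rate <- tapply(xsell_data$xsell, age_group, mean)  # mean of a 0/1 target = rate
barplot(xsell_rate, las = 2, ylab = "xsell rate")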
Figure 5 shows the probability densities of the feature "last_acc_opening_days" (the number of days since the last account opening) for the two levels of the target feature "xsell". It is easy to see that if a client opened his or her last account with the Bank fewer than 1,000 days ago, this client is much more likely to respond to the marketing campaign and open a checking account with the Bank.
For each of the following six categorical features: giro_mailing, maritial_status, member_get_member_active, member_get_member_passive, occup and gender, Figure 6 shows the xsell-likelihood for each subgroup of clients defined by its levels.
From this figure we can make the following observations. The xsell-likelihood of existing clients with giro_mailing equal to 1 is more than twice that of clients with giro_mailing equal to 0. Existing clients who are not married or living together with a partner show a comparatively higher xsell-likelihood. Existing clients who have recommended a client to the Bank, or have been recommended by a client of the Bank, are more than 3 times more likely to open a checking account with the Bank after being approached by the marketing campaign. The xsell-likelihood of existing clients belonging to the occupational groups "Azubis", "Schüler" and "Studenten" is more than twice the average xsell-likelihood in the data set; this might be due to the financial needs of the clients in these groups, who are more responsive to the financial incentive offered by the marketing campaign (50 Euros).
To capture the change in a client's debit volume between 6 months ago and now, I create a feature "volume_change" as "volume_debit - volume_debit_6months" and use it instead of "volume_debit" and "volume_debit_6months" in the random forest model.
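In R, this feature is a simple difference of the two columns, shown here on the placeholder data frame "xsell_data":

# change of the debit volume over the last six months
xsell_data$volume_change <- xsell_data$volume_debit - xsell_data$volume_debit_6months
# drop the two original columns, since volume_change replaces them
xsell_data$volume_debit <- NULL
xsell_data$volume_debit_6months <- NULL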
4.2 Models
After data preprocessing and feature engineering, the data set "xsell" is partitioned into three datasets: a training set (70%), a validation set (15%) and a test set (15%).
The random forest models are trained on the training set and tuned on the validation set. Since the random forest algorithm has many hyperparameters, and different hyperparameter configurations yield models that perform differently, it is necessary to search over many configurations in order to find one with good performance.
To perform systematic hyperparameter tuning, I use the random grid search in the R package "h2o". I first specify the hyperparameter space to search over and the tuning strategy; a number of random forest models, with hyperparameters randomly selected from this space, are then trained on the training dataset. To evaluate the trained models, I apply them to the validation set and compare the performance metric "Area under the ROC Curve (AUC)". After the hyperparameter tuning is finished, all trained models are ranked by their AUC on the validation dataset; the model with the highest validation AUC is chosen as the best model and used for prediction on the test set.
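A sketch of this workflow with the package "h2o" is given below; the hyperparameter ranges, seeds and frame names are illustrative assumptions, not the exact search space used for the results in Chapter 4.3.

library(h2o)
h2o.init()

hf <- as.h2o(xsell_data)
hf$xsell <- as.factor(hf$xsell)  # binary classification needs a factor target

# 70% / 15% / 15% split into training, validation and test set
splits <- h2o.splitFrame(hf, ratios = c(0.70, 0.15), seed = 42)
train <- splits[[1]]; valid <- splits[[2]]; test <- splits[[3]]

# illustrative hyperparameter space and random search strategy
hyper_params <- list(ntrees = c(100, 300, 500),
                     max_depth = c(10, 20, 30),
                     mtries = c(4, 6, 8),
                     min_rows = c(10, 50, 100))
search_criteria <- list(strategy = "RandomDiscrete", max_models = 42, seed = 42)

grid <- h2o.grid("randomForest", grid_id = "rf_grid",
                 x = setdiff(names(train), "xsell"), y = "xsell",
                 training_frame = train, validation_frame = valid,
                 hyper_params = hyper_params, search_criteria = search_criteria)

# rank all trained models by their AUC on the validation set
ranked <- h2o.getGrid("rf_grid", sort_by = "auc", decreasing = TRUE)
best <- h2o.getModel(ranked@model_ids[[1]])
h2o.auc(h2o.performance(best, newdata = test))  # AUC of the best model on the test set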
4.3 Results
In the random grid search, only a random subset (42 models) of all possible combinations of the hyperparameters specified in the hyperparameter space is chosen and trained on the training dataset. After ranking the 42 models by their AUC on the validation dataset, I obtain the best model with the following hyperparameters:
On the validation dataset, the best model achieves an AUC of 0.756. Applying this model to the test set, I obtain an AUC of 0.7546 for the best model on the test set.
Recall that the direct predictions of the random forest algorithm for binary classification on the xsell dataset are "scores", with values between 0 and 1, for a client opening a checking account with the Bank. These scores denote the propensity of a client to open a checking account; since they lie between 0 and 1, they can also be viewed as probabilities of belonging to a class. The AUC can be understood as the probability that a random positive example receives a larger score than a random negative example (Engelmann, Hayden and Tasche, 2003). If AUC = 1, the classifier is able to perfectly distinguish all positive examples from all negative examples.
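This pairwise interpretation can be checked directly: the following R sketch estimates the AUC as the fraction of positive/negative pairs in which the positive example receives the larger score, with ties counted as one half; "p" and "y" are placeholder names for the score and label vectors.

auc_pairwise <- function(p, y) {
  pos <- p[y == 1]
  neg <- p[y == 0]
  # compare every positive score with every negative score
  wins <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
  mean(wins)
}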
Figure 8 ranks the features in the best model by their importance. The features "age", "last_acc_opening_days", and "customer_tenure_months" have the largest predictive power in the best model. This is consistent with our first impressions from Chapter 3.2.2.
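With h2o, such a ranking can be read directly from the trained model; "best" is the placeholder name for the best model from the grid search sketched above, and the feature cutoff is illustrative.

h2o.varimp(best)                             # variable importance table
h2o.varimp_plot(best, num_of_features = 15)  # importance plot similar to Figure 8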
5 Conclusion
In this seminar paper, I use a data set on cross-selling to develop a random forest model. The work done for this seminar paper covers the whole process of a data science project, including data preprocessing, data quality checks, exploratory data analysis, feature engineering, and model development and deployment.
The results achieved are promising. After feature engineering and hyperparameter search, the best model achieves an AUC of 75.5% on the test dataset, a predictive performance comparable to that of the model used in production at the Bank. Notably, this result was achieved with only a moderate amount of hyperparameter tuning, and the predictive performance is expected to improve further if the hyperparameter search is extended.
An interesting lesson from this project is that one should always check the quality of the dataset before beginning with the model development, as shown in Chapter 3.1.1. A data set of good quality is the basis for successful model development. In addition,
exploratory data analysis is also an integral part of a data science project, through which we can get first impressions of how the input features associate with the target feature and whether the data for the input features are plausible.
6 References
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Chapman and Hall/CRC.
Engelmann, B., Hayden, E., & Tasche, D. (2003). Measuring the Discriminative Power of Rating Systems. Deutsche Bundesbank Discussion Paper.
Kelleher, J. D., Mac Namee, B., & D'Arcy, A. (2015). Fundamentals of Machine Learning for Predictive Data Analytics. MIT Press, 117-170.
7 Statutory Declaration
I hereby declare that I have written this thesis independently and without the use of any sources or aids other than those indicated. Sentences or parts of sentences taken verbatim from other sources are marked as quotations; other borrowings are indicated, with respect to content and scope, by references to the source. This thesis has not been submitted in the same or a similar form to any examination authority and has not been published. It has not been used, in whole or in part, for any other examination or course credit.
____________________________________ ___________________________________
Place, Date Signature