
World Academy of Science, Engineering and Technology

International Journal of Computer and Information Engineering


Vol:13, No:1, 2019

Comparison between XGBoost, LightGBM and CatBoost Using a Home Credit Dataset

Essam Al Daoud

E. Al-Daoud is with the Faculty of Information Technology, Computer Science Department, Zarka University, Jordan (phone: +96279668000, e-mail: [email protected]).

Abstract—Gradient boosting methods have been proven to be a very important strategy. Many successful machine learning solutions were developed using XGBoost and its derivatives. The aim of this study is to investigate and compare the efficiency of three gradient boosting methods. The home credit dataset is used in this work; it contains 219 features and 356251 records. However, new features are generated, and several techniques are used to rank and select the best features. The implementation indicates that LightGBM is faster and more accurate than CatBoost and XGBoost using a variant number of features and records.

Keywords—Gradient boosting, XGBoost, LightGBM, CatBoost, home credit.

I. INTRODUCTION

Despite the recent re-rise and popularity of artificial neural networks (ANN), boosting methods are still more useful for medium-sized datasets, because the training time is relatively short and they do not require a long time to tune their parameters.

Boosting is an ensemble strategy that endeavors to build an accurate classifier from various weak classifiers. This is done by dividing the training data and using each part to train different models, or one model with different settings, and then combining the results using a majority vote. AdaBoost was the first effective boosting method discovered for binary classification [1]. When AdaBoost makes its first iteration, all records are weighted identically, but in the next iterations, more weight is given to the misclassified records, and the model continues until an efficient classifier is constructed. Soon after AdaBoost was presented, it was noted that the test error does not grow even if the number of iterations is increased [2]. Thus, AdaBoost is a suitable model with respect to the overfitting problem. In recent years, three efficient gradient boosting methods based on decision trees have been suggested: XGBoost, CatBoost and LightGBM. The new methods have been used successfully in industry, academia and competitive machine learning [3].

The rest of this paper is organized as follows: Section II provides a short introduction to the gradient boosting algorithms and the recent developments. Section III explores the home credit dataset and exploits the knowledge of the domain to generate new features. Section IV implements the gradient boosting algorithms and discusses a new mechanism to generate useful random features, and the conclusion is provided in Section V.

II. RELATED WORK

Gradient boosting methods construct the solution in a stage-wise fashion and address the overfitting problem by optimizing the loss function. For example, assume a custom base learner h(x, θ) (such as a decision tree) and a loss function ψ(y, f(x)); it is challenging to estimate the parameters directly, and thus an iterative model is suggested such that, at each iteration, the model is updated and a new base-learner function h(x, θ_t) is selected, where the increment is guided by the negative gradient

    g_t(x) = E_y[ ∂ψ(y, f(x)) / ∂f(x) | x ]                                  (1)

This allows the substitution of the hard optimization problem with the usual least-squares optimization problem:

    (ρ_t, θ_t) = arg min_{ρ,θ} Σ_{i=1}^{N} [ −g_t(x_i) + ρ h(x_i, θ) ]²        (2)

Algorithm 1 summarizes the Friedman algorithm.

Algorithm 1 Gradient Boost
1- Let f_0 be a constant
2- For t = 1 to M
   a. Compute g_t(x) using (1)
   b. Train the base-learner function h(x, θ_t)
   c. Find ρ_t using (2)
   d. Update the function f_t = f_{t-1} + ρ_t h(x, θ_t)
3- End

The algorithm starts with a single leaf, and then the learning rate is optimized for each node and each record [4]-[6].
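As a concrete illustration of Algorithm 1, the following is a minimal sketch for the squared-error loss, where the negative gradient g_t(x) reduces to the residual. The base learner (a shallow regression tree) and the fixed shrinkage factor used in place of the line search for ρ_t are assumptions of this sketch, not details taken from the paper.

```python
# Minimal sketch of Algorithm 1 (gradient boosting) for the squared-error loss.
# Assumptions: shallow regression trees as the base learners h(x, theta) and a
# fixed shrinkage factor in place of the line search for rho_t.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, M=100, shrinkage=0.1, max_depth=3):
    y = np.asarray(y, dtype=float)
    f0 = float(np.mean(y))                     # step 1: constant initial model f_0
    f = np.full(len(y), f0)
    learners = []
    for _ in range(M):                         # step 2: M boosting iterations
        g = y - f                              # negative gradient of 0.5*(y-f)^2, i.e. the residual (eq. (1))
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, g)  # step 2b: fit h(x, theta_t)
        f = f + shrinkage * tree.predict(X)    # steps 2c-2d: shrinkage stands in for rho_t
        learners.append(tree)
    return f0, learners

def boosted_predict(f0, learners, X, shrinkage=0.1):
    return f0 + shrinkage * sum(tree.predict(X) for tree in learners)
```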
eXtreme Gradient Boosting (XGBoost) is a highly scalable, flexible and versatile tool; it was engineered to exploit resources correctly and to overcome the limitations of previous gradient boosting implementations. The main difference between XGBoost and other gradient boosting methods is that it uses a new regularization technique to control overfitting. Therefore, it is faster and more robust during model tuning. The regularization is done by adding a new term to the loss function:

    L(f) = Σ_i L(y_i, ŷ_i) + Σ_k Ω(δ_k)

with

    Ω(δ) = α|δ| + 0.5 β ‖w‖²

where || is the number of branches, w is the value of each leaf given categorical feature, totalCountis the number of previous
and is the regularization function. XGBoost uses a new gain objects and prior is specified by the starting parameters [9]-
function, as: [11].

𝐺 ∑∈ 𝑔 III. HOME CREDIT DATASET


𝐻 ∑∈ ℎ The aim of the home credit dataset is to predict the
capabilities of the clients repayment by using a variety of
𝐺𝑎𝑖𝑛 𝛼 alternative data [1], [12]. Due to shortage or non-existent
records of loan repayment, home credit attempts to expand the
safe borrowing experience for the unbanked clients by
where collecting and extracting more information about the clients
𝑔 𝜕 𝐿 𝑦 ,𝑦 from different resources as follows:
and 1- Application_{train|test}.csv: Each row in this file is
ℎ 𝜕 𝐿 𝑦 ,𝑦 considered one loan, the file application_train.csv
contains a target column, while application_test.csv does
G is the score of the right child, H is the score of the left child not contain a target column. The number of the clients in
andGain is the score in the case no new child [7]. this file is 307511, and the number of the features is 123
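As a worked example of the gain formula, the helper below scores a candidate split from per-record gradients and Hessians. The function name and the default values of α and β are assumptions for illustration only.

```python
# Sketch of the XGBoost-style split gain computed from per-record gradients g
# and Hessians h (NumPy arrays); alpha plays the role of the per-leaf penalty
# and beta the L2 term in the denominators. Defaults are illustrative.
import numpy as np

def split_gain(g, h, left_mask, alpha=0.0, beta=1.0):
    GL, HL = g[left_mask].sum(), h[left_mask].sum()
    GR, HR = g[~left_mask].sum(), h[~left_mask].sum()
    score = lambda G, H: G * G / (H + beta)
    # score of left child + score of right child - score with no split, minus the leaf penalty
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - alpha

# For a binary target such as TARGET, logistic-loss derivatives could be used:
# g = p - y and h = p * (1 - p), with p the currently predicted probability.
```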

To reduce the implementation time, a team from Microsoft developed the light gradient boosting machine (LightGBM) in April 2017 [8]. The main difference is that the decision trees in LightGBM are grown leaf-wise, expanding the leaf with the largest loss reduction, instead of level-wise as in XGBoost, as shown in Figs. 1 and 2. All the attributes are sorted and grouped into bins; this implementation is called the histogram implementation. LightGBM has several advantages, such as better accuracy, faster training speed, the ability to handle large-scale data, and support for GPU learning.

Fig. 1 XGBoost Level-wise tree growth

Fig. 2 LightGBM Leaf-wise tree growth
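The leaf-wise growth and histogram binning described above map directly onto LightGBM parameters. The following sketch shows the relevant knobs; the values are illustrative defaults, not the settings used in the paper's experiments.

```python
# Illustrative LightGBM configuration: num_leaves controls leaf-wise growth
# (no fixed depth), max_bin controls the histogram binning of the features.
import lightgbm as lgb

model = lgb.LGBMClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    num_leaves=31,     # grow each tree by expanding the best leaf anywhere
    max_bin=255,       # bucket each feature into at most 255 histogram bins
)
# model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], eval_metric="auc")
```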
CatBoost (for "categorical boosting") focuses on categorical columns using permutation techniques, the one_hot_max_size (OHMS) parameter, and target-based statistics. CatBoost avoids the exponential growth of feature combinations by using a greedy method at each new split of the current tree. For each feature that has more categories than OHMS (an input parameter), CatBoost uses the following steps:
1. Divide the records into subsets randomly,
2. Convert the labels to integer numbers, and
3. Transform the categorical feature to a numerical one, as:

    avgTarget = (countInClass + prior) / (totalCount + 1)

where countInClass is the number of ones in the target for the given categorical value, totalCount is the number of previous objects, and prior is specified by the starting parameters [9]-[11].
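A small sketch of the target statistic above, computed in an ordered fashion so that each record only uses the targets of records that precede it in a random permutation (the idea behind CatBoost's permutation technique). The helper name, the prior value, and the column names in the usage comment are assumptions.

```python
# Sketch of the ordered target statistic: each record is encoded using only the
# targets of records that precede it in a random permutation.
import numpy as np

def ordered_target_statistic(cat_values, target, prior=0.05, seed=0):
    """Encode one categorical column as (countInClass + prior) / (totalCount + 1)."""
    cat_values, target = np.asarray(cat_values), np.asarray(target)
    order = np.random.default_rng(seed).permutation(len(cat_values))
    count_in_class, total_count = {}, {}
    encoded = np.empty(len(cat_values))
    for i in order:
        c = cat_values[i]
        encoded[i] = (count_in_class.get(c, 0) + prior) / (total_count.get(c, 0) + 1)
        total_count[c] = total_count.get(c, 0) + 1
        count_in_class[c] = count_in_class.get(c, 0) + int(target[i] == 1)
    return encoded

# Hypothetical usage on the home credit data:
# df["GENDER_TS"] = ordered_target_statistic(df["CODE_GENDER"], df["TARGET"])
```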

III. HOME CREDIT DATASET

The aim of the home credit dataset is to predict the repayment capabilities of the clients by using a variety of alternative data [1], [12]. Due to scarce or non-existent records of loan repayment, Home Credit attempts to expand the safe borrowing experience for unbanked clients by collecting and extracting more information about the clients from different resources, as follows:
1- Application_{train|test}.csv: Each row in this file is considered one loan. The file application_train.csv contains a target column, while application_test.csv does not. The number of clients in this file is 307511, and the number of features is 123, such as SK_ID_CURR, NAME_CONTRACT_TYPE, CODE_GENDER, FLAG_OWN_CAR, FLAG_OWN, CNT_CHILDREN, AMT_INCOME, AMT_CREDIT, AMT_ANNUITY, TARGET, etc. The target variable defines whether the loan was repaid or not.
2- Bureau.csv: The previous applications of each client at other financial institutions. A client could have several applications, so the number of records in this file is larger than the number of clients. This file has 1716428 rows and 17 features. Fig. 3 shows a snapshot of this data.

Fig. 3 Snapshot of Bureau data

3- Bureau_balance.csv: The balance of each month for every previous credit. This file has 27299925 rows and three features. Fig. 4 shows a snapshot of this data.

Fig. 4 Snapshot of Bureau balance data

4- POS_CASH_balance.csv: The snapshots of monthly balance for every previous point of sale (POS). This file has 10001358 rows and eight features.


5- Credit_card_balance.csv: The snapshots of monthly balance for every previous credit card with Home Credit. This file has 3840312 rows and 23 features.
6- Previous_application.csv: Each row in this file represents a previous application related to client loans. This file has 1670214 rows and 37 features.
7- Installments_payments.csv: The history of the previous repayments at Home Credit, where some rows outline missed installments and other rows describe payments made. This file has 13605401 rows and eight features.

When the home credit dataset is explored, we can note that the target label is imbalanced: in most of the records the target column has the value 0 (about 91%), which means that the client made the installments successfully, while 24000 applicants (about 9%) had difficulties in repaying the loan. Another important observation that can be exploited is that males are more prone than females to failing to repay the loan or make the installments successfully, as shown in Fig. 5.

Fig. 5 Gender differences in repaying the loan

More features can be generated by using domain knowledge and aggregations, as shown in Fig. 6; a sketch of this style of aggregation is given after Table I. Table I summarizes the number of features before and after feature generation.

Fig. 6 Snapshot of feature generation using python

TABLE I
THE NUMBER OF FEATURES BEFORE AND AFTER FEATURE GENERATION

File                    #Records    #Features   #Features after generation
Application             356251      123         240
Bureau                  1716428     17          80
Bureau_balance          27299925    3           15
Pos-cash                10001358    8           18
Credit card balance     3840312     23          113
Previous applications   1670214     37          219
Installments payments   13605401    8           36
Total                               219         721
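The following is a minimal sketch of this aggregation-style feature generation using pandas, in the spirit of Fig. 6. The file and column names follow the public home credit dataset, but the chosen aggregates and ratios are illustrative assumptions.

```python
# Sketch of aggregation-based feature generation with pandas (cf. Fig. 6).
import pandas as pd

app = pd.read_csv("application_train.csv")
bureau = pd.read_csv("bureau.csv")

# Aggregate the bureau records per client (SK_ID_CURR); the chosen statistics
# (mean/max/sum and a loan count) are illustrative, not the paper's exact set.
num_cols = bureau.select_dtypes("number").drop(columns=["SK_ID_CURR"])
agg = num_cols.groupby(bureau["SK_ID_CURR"]).agg(["mean", "max", "sum"])
agg.columns = ["BURO_" + "_".join(col).upper() for col in agg.columns]
agg["BURO_LOAN_COUNT"] = bureau.groupby("SK_ID_CURR").size()

# Domain-knowledge ratios on the main table, e.g. credit-to-income.
app["CREDIT_INCOME_RATIO"] = app["AMT_CREDIT"] / app["AMT_INCOME_TOTAL"]

# Merge the aggregated features back onto the application table.
features = app.merge(agg, left_on="SK_ID_CURR", right_index=True, how="left")
```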
IV. EXPERIMENTAL RESULTS

To compare the gradient boosting methods, the home credit dataset is used and tested by implementing XGBoost, LightGBM and CatBoost. The number of rows is reduced by deleting any row that has more than 75% missing values or a low importance rank. Five-fold validation is applied on a variant number of rows. Tables II-IV show that LightGBM has the best area under the curve (AUC) and the fastest training time, while XGBoost has the worst training time and CatBoost has the worst AUC. However, these results cannot be generalized to other datasets. For example, if the dataset has more categorical features, we expect that CatBoost will outperform the other methods; the implementation time seems to be more independent and has a low correlation with the feature types.

TABLE II
TIME AND AUC USING XGBOOST

#Rows     AUC        Time
307507    0.788320   4306
250000    0.784516   3550
200000    0.781219   2892
150000    0.773347   2098
100000    0.772771   1219
50000     0.768899   9487

TABLE III
TIME AND AUC USING LIGHTGBM

#Rows     AUC        Time
307507    0.789996   786
250000    0.788589   638
200000    0.786344   512
150000    0.786215   393
100000    0.782477   263
50000     0.777649   121

TABLE IV
TIME AND AUC USING CATBOOST

#Rows     AUC        Time
307507    0.787629   1803
250000    0.784402   1257
200000    0.782895   851
150000    0.780762   567
100000    0.776168   442
50000     0.770666   286
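A minimal sketch of the five-fold evaluation loop behind Tables II-IV is shown below. The hyper-parameters are close to library defaults, and X and y are assumed to come from the merged table of the previous sketch, so the numbers it prints will differ from the tables.

```python
# Sketch of the five-fold comparison of XGBoost, LightGBM and CatBoost.
import time
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# Assumed inputs: `features` is the merged table built in the previous sketch.
X = features.select_dtypes("number").drop(columns=["TARGET", "SK_ID_CURR"]).to_numpy()
y = features["TARGET"].to_numpy()

models = {
    "XGBoost": XGBClassifier(n_estimators=500),
    "LightGBM": LGBMClassifier(n_estimators=500),
    "CatBoost": CatBoostClassifier(n_estimators=500, verbose=0),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, model in models.items():
    aucs, start = [], time.time()
    for train_idx, valid_idx in cv.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        proba = model.predict_proba(X[valid_idx])[:, 1]
        aucs.append(roc_auc_score(y[valid_idx], proba))
    print(f"{name}: AUC={np.mean(aucs):.6f}  time={time.time() - start:.0f}s")
```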


Table V illustrates the effect of feature preprocessing on the time and AUC. From the table, it can be noted that normalization, the collinearity filter, and deleting features using a missing-value threshold below 75% are not beneficial. Figs. 7 and 8 show the feature rankings obtained using LightGBM and CatBoost, respectively.

TABLE V
THE EFFECT OF THE FEATURE PREPROCESSING ON LIGHTGBM PERFORMANCE

Preprocessing                            #Features   AUC        Time
Full data                                721         0.789804   1748
Miss 75%                                 696         0.789933   1685
Miss 75% + normalization                 696         0.789868   1716
Miss 80 + Importance 1                   392         0.790115   1437
Miss 75 + Importance 5                   200         0.789996   786
Miss 75 + Importance 7                   158         0.789897   645
Miss 75 + Importance 10 + Collinear 95   113         0.788780   515
Miss 50 + Importance 7 + Collinear 95    105         0.779643   432
Miss 50 + Importance 7                   122         0.77310    533
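The configurations in Table V combine three filters: a missing-value threshold, a minimum importance rank, and a collinearity cut. The sketch below shows one plausible way to apply such filters; the thresholds, helper name and exact procedure are assumptions rather than the paper's implementation.

```python
# One plausible implementation of the Table V filters: missing-value threshold,
# minimum importance, and collinearity cut. Thresholds mirror the
# "Miss 75 / Importance 5 / Collinear 95" row but are assumptions.
import numpy as np

def filter_features(df, importance, miss_thresh=0.75, importance_thresh=5, corr_thresh=0.95):
    """df: numeric feature DataFrame; importance: Series of importances per column."""
    keep = list(df.columns[df.isna().mean() <= miss_thresh])               # missing-value filter
    keep = [c for c in keep if importance.get(c, 0) >= importance_thresh]  # importance filter
    corr = df[keep].corr().abs()                                           # collinearity filter
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    drop = {c for c in upper.columns if (upper[c] > corr_thresh).any()}
    return [c for c in keep if c not in drop]
```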
Fig. 7 Feature ranking using LightGBM

Fig. 8 Feature ranking using CatBoost

Fig. 9 shows the distribution of a high-rank feature (EXT_SOURCE1) and a low-rank feature (BURO_DAYS).

Fig. 9 The distribution of low and high rank features

Discovering new features can enhance the accuracy significantly; however, domain knowledge is not sufficient to find all the important features. Thus, a random feature generation mechanism is adopted, using random operations (*, ^, /, +, -, max, …) on two or three of the top features. To prevent the exponential growth of the random features, a simple and fast rejection technique, such as signal-to-noise feature ranking, is used. By combining the above operations, thousands of new features are generated; however, only 150 features are found to have an acceptable rank. The AUC improves after adding the newly discovered features and becomes 0.79304. Fig. 10 shows a new random feature (b1n11) among the top features in the LightGBM ranking.
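A sketch of the random feature generation with signal-to-noise rejection described above. The operator set is a subset of the one listed in the text (the power operator is omitted for numeric safety), and the scoring threshold and helper names are assumptions.

```python
# Sketch of random feature generation with a signal-to-noise rejection test.
import numpy as np
import pandas as pd

OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "div": lambda a, b: a / b.replace(0, np.nan),
    "max": lambda a, b: np.maximum(a, b),
}

def signal_to_noise(feature, target):
    """Class-mean separation over pooled standard deviation."""
    f1, f0 = feature[target == 1], feature[target == 0]
    return abs(f1.mean() - f0.mean()) / (f1.std() + f0.std() + 1e-9)

def random_features(df, target, top_cols, n_candidates=1000, threshold=0.05, seed=0):
    rng = np.random.default_rng(seed)
    accepted = {}
    for _ in range(n_candidates):
        a, b = rng.choice(top_cols, size=2, replace=False)
        op_name = str(rng.choice(list(OPS)))
        new = OPS[op_name](df[a], df[b]).fillna(0)
        if signal_to_noise(new, target) > threshold:    # fast rejection of weak candidates
            accepted[f"{a}_{op_name}_{b}"] = new
    return pd.DataFrame(accepted, index=df.index)

# Hypothetical usage:
# new_feats = random_features(features, features["TARGET"],
#                             top_cols=["EXT_SOURCE_1", "AMT_CREDIT", "AMT_ANNUITY"])
```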

V. CONCLUSION

Boosting methods iteratively train a set of weak learners, where the weights of the records are updated according to the results of the loss function of the previous learners. In this study, we compared three state-of-the-art gradient boosting methods (XGBoost, CatBoost and LightGBM) in terms of CPU runtime and accuracy. LightGBM seems to be significantly faster than the other gradient boosting methods and more accurate given the same time budget for hyper-parameter optimization. The results can be improved by generating new features and selecting the best set.


Fig. 10 New features ranking using LightGBM

REFERENCES
[1] Y. Freund, R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119-139, 1997.
[2] P. Kontschieder, M. Fiterau, A. Criminisi, S. Rota Bulo, "Deep neural decision forests," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1467-1475, 2015.
[3] J. C. Wang, T. Hastie, "Boosted varying-coefficient regression models for product demand prediction," Journal of Computational and Graphical Statistics, vol. 23, no. 2, pp. 361-382, 2014.
[4] E. Al Daoud, "Intrusion Detection Using a New Particle Swarm Method and Support Vector Machines," World Academy of Science, Engineering and Technology, vol. 77, pp. 59-62, 2013.
[5] E. Al Daoud, H. Turabieh, "New empirical nonparametric kernels for support vector machine classification," Applied Soft Computing, vol. 13, no. 4, pp. 1759-1765, 2013.
[6] E. Al Daoud, "An Efficient Algorithm for Finding a Fuzzy Rough Set Reduct Using an Improved Harmony Search," I.J. Modern Education and Computer Science, vol. 7, no. 2, pp. 16-23, 2015.
[7] Y. Zhang, A. Haghani, "A gradient boosting method to improve travel time prediction," Transportation Research Part C: Emerging Technologies, vol. 58, pp. 308-324, 2015.
[8] K. Guolin, M. Qi, F. Thomas, W. Taifeng, C. Wei, M. Weidong, Y. Qiwei, L. Tie-Yan, "LightGBM: A Highly Efficient Gradient Boosting Decision Tree," Advances in Neural Information Processing Systems, vol. 30, pp. 3149-3157, 2017.
[9] A. Dorogush, V. Ershov, A. Gulin, "CatBoost: gradient boosting with categorical features support," NIPS, pp. 1-7, 2017.
[10] M. Qi, K. Guolin, W. Taifeng, C. Wei, Y. Qiwei, M. Weidong, L. Tie-Yan, "A Communication-Efficient Parallel Algorithm for Decision Tree," Advances in Neural Information Processing Systems, vol. 29, pp. 1279-1287, 2016.
[11] A. Klein, S. Falkner, S. Bartels, P. Hennig, F. Hutter, "Fast Bayesian optimization of machine learning hyperparameters on large datasets," in Proceedings of Machine Learning Research (PMLR), vol. 54, pp. 528-536, 2017.
[12] J. H. Aboobyda, M. A. Tarig, "Developing Prediction Model of Loan Risk in Banks Using Data Mining," Machine Learning and Applications: An International Journal (MLAIJ), vol. 3, no. 1, pp. 1-9, 2016.

