Comparison Between XGBoost, LightGBM and CatBoost Using a Home Credit Dataset
Abstract—Gradient boosting methods have proven to be a very important strategy, and many successful machine learning solutions have been developed using XGBoost and its derivatives. The aim of this study is to investigate and compare the efficiency of three gradient boosting methods. The Home Credit dataset, which contains 219 features and 356251 records, is used in this work. In addition, new features are generated, and several techniques are used to rank and select the best features. The implementation indicates that LightGBM is faster and more accurate than CatBoost and XGBoost over varying numbers of features and records.

Keywords—Gradient boosting, XGBoost, LightGBM, CatBoost, home credit.

I. INTRODUCTION
The rest of this paper is organized as follows: related work is reviewed in Section II, the dataset and the feature generation are described in Section III, the experimental results are presented in Section IV, and the conclusions are provided in Section V.

II. RELATED WORK
Gradient boosting methods construct the solution in a stage-wise fashion and address the overfitting problem by optimizing a loss function. For example, assume a custom base learner h(x, θ) (such as a decision tree) and a loss function ψ(y, f(x)); it is challenging to estimate the parameters directly, and thus an iterative model is suggested such that, at each iteration, the model is updated and a new base-learner function h(x, θt) is selected, where the increment is guided by the negative gradient:

g_t(x) = E_y[ ∂ψ(y, f(x)) / ∂f(x) | x ], evaluated at f(x) = f_{t-1}(x)
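As a rough illustration of this stage-wise procedure, the following sketch fits decision-tree base learners to the negative gradient of a logistic loss; the function name, parameters, and loss choice are illustrative assumptions, not the authors' implementation.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_stages=100, learning_rate=0.1, max_depth=3):
    """Stage-wise additive model F(x) = F0 + lr * sum_t h(x, theta_t) for y in {0, 1}."""
    # F0 is the log-odds of the positive class (the constant initial model).
    p = np.clip(y.mean(), 1e-6, 1 - 1e-6)
    F = np.full(len(y), np.log(p / (1 - p)))
    trees = []
    for _ in range(n_stages):
        # The negative gradient of the logistic loss w.r.t. F is y - sigmoid(F).
        pseudo_residual = y - 1.0 / (1.0 + np.exp(-F))
        # Each new base learner h(x, theta_t) is fitted to this pseudo-residual.
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, pseudo_residual)
        F += learning_rate * tree.predict(X)
        trees.append(tree)
    return trees, F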
where T is the number of branches (leaves), w is the value of each leaf, and Ω is the regularization function. XGBoost uses a new gain function to score candidate splits, as follows:

Gain = ½ [ G_L² / (H_L + λ) + G_R² / (H_R + λ) − (G_L + G_R)² / (H_L + H_R + λ) ] − γ

where G_L, G_R and H_L, H_R are the sums of the first- and second-order gradients of the loss in the left and right child nodes, and γ and λ are regularization parameters.

To reduce the implementation time, a team from Microsoft developed the light gradient boosting machine (LightGBM) in April 2017 [8]. The main difference is that the decision trees in LightGBM are grown leaf-wise, instead of checking all of the previous leaves for each new leaf, as shown in Figs. 1 and 2. All the attributes are sorted and grouped into bins; this implementation is called the histogram implementation. LightGBM has several advantages, such as better accuracy, faster training speed, the ability to handle large-scale data, and support for GPU learning.

CatBoost handles categorical features by replacing each category value with an ordered target statistic of the form (countInClass + prior) / (totalCount + 1), where countInClass is the number of positive targets observed so far for the given categorical feature value, totalCount is the number of previous objects, and prior is specified by the starting parameters [9]-[11].
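As a minimal sketch of this ordered target statistic, the following function encodes one categorical column with (countInClass + prior) / (totalCount + 1), using only the rows seen before the current one; the column names and the prior value in the usage comment are assumptions, not settings reported in the paper.

import pandas as pd

def ordered_target_statistic(categories, targets, prior=0.5):
    # Running counts of positive targets and of objects per category value.
    count_in_class, total_count, encoded = {}, {}, []
    for cat, y in zip(categories, targets):
        n_pos = count_in_class.get(cat, 0)
        n_all = total_count.get(cat, 0)
        # (countInClass + prior) / (totalCount + 1) over the preceding objects only.
        encoded.append((n_pos + prior) / (n_all + 1))
        count_in_class[cat] = n_pos + y
        total_count[cat] = n_all + 1
    return pd.Series(encoded, index=categories.index)

# Hypothetical usage on the application table:
# df["CODE_GENDER_TS"] = ordered_target_statistic(df["CODE_GENDER"], df["TARGET"])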
III. HOME CREDIT DATASET
The Home Credit dataset consists of several files:
1- Application.csv: The main application file, with 356251 rows and 123 features such as SK_ID_CURR, NAME_CONTRACT_TYPE, CODE_GENDER, FLAG_OWN_CAR, FLAG_OWN, CNT_CHILDREN, AMT_INCOME, AMT_CREDIT, AMT_ANNUITY, TARGET, etc. The target variable defines whether the loan was repaid or not.
2- Bureau.csv: The previous applications of each client with other financial institutions; a client could have several applications, so the number of records in this file is larger than the number of clients. This file has 1716428 rows and 17 features. Fig. 3 shows a snapshot of this data.
5- Credit_card_balance.csv: The snapshots of the monthly balance for every credit card with Home Credit. This file has 3840312 rows and 23 features.
6- Previous_application.csv: Each row in this file represents a previous application related to client loans. This file has 1670214 rows and 37 features.
7- Installments_payments.csv: The history of the previous repayments in Home Credit, where some rows outline missed installments and other rows describe payments made. This file has 13605401 rows and eight features.

When the Home Credit dataset is explored, we can note that the target label is imbalanced: the target column in most of the records has the value 0 (about 91%), which means that the client made the installments successfully, while 24000 applicants (about 9%) had difficulties in repaying the loan. Another important observation that can be exploited is that males are more prone than females to fail to repay the loan or make the installments successfully, as shown in Fig. 5.

Fig. 5 Gender differences in repaying the loan

New features are generated from these files, as illustrated in Fig. 6, and Table I summarizes the number of features before and after feature generation.

Fig. 6 Snapshot of feature generation using Python

TABLE I
THE NUMBER OF FEATURES BEFORE AND AFTER FEATURE GENERATION
File                    #Records   #Features   #Features after generation
Application             356251     123         240
Bureau                  1716428    17          80
Bureau_balance          27299925   3           15
Pos-cash                10001358   8           18
Credit card balance     3840312    23          113
Previous applications   1670214    37          219
Installments payments   13605401   8           36
Total                              219         721
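The feature-generation step itself is only shown as a snapshot in Fig. 6; a plausible sketch, assuming pandas and the file and column names listed above (e.g. Bureau.csv keyed by SK_ID_CURR), is to aggregate each secondary file per client and join the aggregates onto the application table:

import pandas as pd

# Hypothetical reconstruction of the feature-generation step: aggregate the
# numeric columns of Bureau.csv per client and merge them into the application table.
app = pd.read_csv("Application.csv")    # one row per client (SK_ID_CURR)
bureau = pd.read_csv("Bureau.csv")      # several rows per client

numeric_cols = bureau.select_dtypes("number").columns.drop("SK_ID_CURR")
bureau_agg = bureau.groupby("SK_ID_CURR")[list(numeric_cols)].agg(["mean", "max", "min", "sum"])
# Flatten the (column, statistic) MultiIndex into names such as BUREAU_AMT_CREDIT_SUM_mean.
bureau_agg.columns = ["BUREAU_" + "_".join(col) for col in bureau_agg.columns]
bureau_agg = bureau_agg.reset_index()

features = app.merge(bureau_agg, how="left", on="SK_ID_CURR")

The same pattern can be repeated for the other files, which is consistent with the feature counts reported in Table I.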
IV. EXPERIMENTAL RESULTS
To compare the gradient boosting methods, the Home Credit dataset is used and tested by implementing XGBoost, LightGBM and CatBoost. The number of features is reduced by deleting any feature that has more than 75% missing values or a low importance rank. Five-fold cross-validation is applied on a varying number of rows. Tables II-IV show that LightGBM has the best area under the curve (AUC) and the fastest training time, while XGBoost has the worst training time and CatBoost has the worst AUC. However, these results cannot be generalized to other datasets; for example, if a dataset has more categorical features, we expect CatBoost to outperform the other methods, while the implementation time seems to be more independent and has a low correlation with the feature types.
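A minimal sketch of this comparison, assuming the merged feature table from the previous step and default hyperparameters (the paper does not report its exact settings), is shown below.

import time
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

# Assumed inputs: the `features` table built earlier, with TARGET as the label;
# only numeric columns are kept here for simplicity.
X = features.drop(columns=["TARGET", "SK_ID_CURR"]).select_dtypes("number").to_numpy()
y = features["TARGET"].to_numpy()

models = {
    "XGBoost": XGBClassifier(n_estimators=500),
    "LightGBM": LGBMClassifier(n_estimators=500),
    "CatBoost": CatBoostClassifier(n_estimators=500, verbose=0),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, model in models.items():
    aucs, start = [], time.time()
    for train_idx, valid_idx in cv.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        preds = model.predict_proba(X[valid_idx])[:, 1]
        aucs.append(roc_auc_score(y[valid_idx], preds))
    print(f"{name}: mean AUC = {np.mean(aucs):.6f}, total time = {time.time() - start:.0f}s")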
TABLE II
TIME AND AUC USING XGBOOST
#Rows     AUC        Time
307507    0.788320   4306
250000    0.784516   3550
200000    0.781219   2892
150000    0.773347   2098
100000    0.772771   1219
50000     0.768899   9487

TABLE III
TIME AND AUC USING LIGHTGBM
#Rows     AUC        Time
307507    0.789996   786
250000    0.788589   638
200000    0.786344   512
150000    0.786215   393
100000    0.782477   263
50000     0.777649   121

TABLE IV
TIME AND AUC USING CATBOOST
#Rows     AUC        Time
307507    0.787629   1803
250000    0.784402   1257
200000    0.782895   851
150000    0.780762   567
100000    0.776168   442
50000     0.770666   286

Table V illustrates the effect of the feature preprocessing on the time and AUC. From the table, it can be noted that normalization, removing collinear features, or deleting the features which have less than 75% missing values is unfeasible. Figs. 7 and 8 show the feature rankings obtained using LightGBM and CatBoost, respectively.
TABLE V
THE EFFECT OF THE FEATURES PREPROCESSING ON LIGHTGBM PERFORMANCE
Preprocessing                          #Features   AUC        Time
Full data                              721         0.789804   1748
Miss 75%                               696         0.789933   1685
Miss 75%, normalization                696         0.789868   1716
Miss 80, Importance 1                  392         0.790115   1437
Miss 75, Importance 5                  200         0.789996   786
Miss 75, Importance 7                  158         0.789897   645
Miss 75, Importance 10, Collinear 95   113         0.788780   515
Miss 50, Importance 7, Collinear 95    105         0.779643   432
Miss 50, Importance 7                  122         0.77310    533

Fig. 7 Feature ranking using LightGBM
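The "Importance" and "Collinear" variants in Table V can be reproduced in spirit with LightGBM's built-in feature importances and a pairwise-correlation filter; the thresholds below mirror the table's labels (importance 7, collinear 95), but the exact selection procedure is an assumption.

import numpy as np
from lightgbm import LGBMClassifier

# Assumed inputs: a pandas DataFrame `features` (numeric columns) and labels y.
model = LGBMClassifier(n_estimators=500).fit(features, y)

# Keep features whose split importance reaches the threshold ("Importance 7").
importance = dict(zip(features.columns, model.feature_importances_))
keep = [c for c in features.columns if importance[c] >= 7]

# Drop one feature from each highly correlated pair ("Collinear 95").
corr = features[keep].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
collinear = [c for c in upper.columns if (upper[c] > 0.95).any()]
reduced = features[[c for c in keep if c not in collinear]]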
REFERENCES
[1] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.