iXGB: Improving the Interpretability of XGBoost Using Decision Rules and Counterfactuals

Mir Riyanul Islam* (https://orcid.org/0000-0003-0730-4405), Mobyen Uddin Ahmed (https://orcid.org/0000-0003-1953-6086) and Shahina Begum (https://orcid.org/0000-0002-1212-7637)

Artificial Intelligence and Intelligent Systems Research Group, School of Innovation Design and Engineering,
Mälardalen University, Universitetsplan 1, 722 20 Västerås, Sweden

* Corresponding Author

Keywords: Counterfactuals, Explainability, Explainable Artificial Intelligence, Interpretability, Regression, Rule-Based Explanation, XGBoost.

Abstract: Tree-ensemble models, such as Extreme Gradient Boosting (XGBoost), are renowned Machine Learning models which achieve higher prediction accuracy than traditional tree-based models. This higher accuracy, however, comes at the cost of reduced interpretability. Moreover, the decision path or prediction rule of XGBoost is not explicit as in tree-based models. This paper proposes iXGB (interpretable XGBoost), an approach to improve the interpretability of XGBoost. iXGB approximates a set of rules from the internal structure of XGBoost and the characteristics of the data. In addition, iXGB generates a set of counterfactuals from the neighbourhood of the test instances to support end-users' understanding of their operational relevance. The performance of iXGB in generating rule sets is evaluated with experiments on real and benchmark datasets, which demonstrated reasonable interpretability. The evaluation results also support the idea that the interpretability of XGBoost can be improved without using surrogate methods.

1 INTRODUCTION

Tree-ensembles are a class of Machine Learning (ML) models which have gained recent popularity for their efficacy in handling a diverse array of tabular data in real-world applications (Sagi and Rokach, 2021). These tree-ensemble models, e.g., Random Forests (Breiman, 2001), Gradient Boosted Trees (Friedman, 2001), Extreme Gradient Boosting (XGBoost) (Chen and Guestrin, 2016), etc., operate by combining the predictive power of multiple decision trees. One of their key strengths is their ability to manage complex relationships within data, making them particularly suitable for datasets characterised by heterogeneity, while very little preprocessing is required on the data before model training. The collective strength of individual trees, each contributing a unique perspective, results in a powerful ensemble capable of tackling various predictive tasks.

A major weakness of the tree-ensemble models (e.g., XGBoost) is that they lose interpretability while improving the prediction accuracy. This was showcased by Gunning and Aha (2019) with a notional diagram in their secondary study on the research field of Explainable Artificial Intelligence (XAI). Precisely, these ensemble models divide the input space into small regions and predict from those regions. The number of small regions is generally large; theoretically, these regions represent a large number of rules for prediction. This excessive number of rules makes the decision process less interpretable for end-users. Hara and Hayashi (2016) proposed a post-processing method that improves the interpretability of tree-ensemble models and demonstrated their approach by interpreting predictions from XGBoost. The authors also showed that smaller decision regions correspond to more transparent and understandable models. In another work, Blanchart (2021) described a method for computing the decision regions of tree-ensemble models for classification tasks. The authors also utilised counterfactual reasoning alongside the decision regions to interpret the models' decisions. Sagi and Rokach (2021) proposed an approach for approximating an ensemble of trees into an interpretable decision tree for classification problems. Nalenz and Augustin (2022) developed Compressed Rule Ensemble (CRE) to interpret the output of tree-ensemble classifiers. These studies are the only notable ones found in the literature which contributed to improving the interpretability of tree-ensemble models for classification tasks, with indications towards their use in regression tasks.


Figure 1: Example of explanation generated for a single instance of flight TOT delay prediction using LIME. The red and
green horizontal bars correspond to the contributions for increasing and decreasing the delay respectively, and the blue bar
corresponds to the predicted delay.

From the literature, it is evident that less effort has been devoted to making the ensemble models (e.g., XGBoost) interpretable for regression tasks. Moreover, different state-of-the-art methods produce explanations that differ in the contents of the output. Under these circumstances, this study aims to improve the interpretability of XGBoost by utilising its mechanisms by design. The main contribution of this study is twofold –

• Explaining the predictions of XGBoost regression models using decision rules extracted from the trained model.

• Generating counterfactuals from the actual neighbourhood of the test instance.

1.1 Motivation

The work presented in this paper is further motivated by a real-world regression application for the aviation industry. Particularly, the regression task is to predict the flight take-off time (TOT) delay from historical data to support the responsibilities of the Air Traffic Controllers (ATCO). It is worth mentioning that the aviation industry experiences a loss of approximately 100 Euros on average per minute of Air Traffic Flow Management (ATFM) delay (Cook and Tanner, 2015). The Federal Aviation Administration (FAA)¹ reported in 2019 that the estimated cost due to delay, considering passengers, airlines, lost demand, and indirect costs, was thirty-three billion dollars (Lukacs, 2020). These significant expenses provide the rationale for increased attention towards predicting TOT and reducing the delays of flights (Dalmau et al., 2021).

¹ https://www.faa.gov/

To solve the problem of predicting flight TOT delay, an interpretable system was developed to incorporate the existing operational interface of the ATCOs. In the process, the prediction model was developed with XGBoost and its predictions were made interpretable with the help of several popular XAI tools, such as LIME – Local Interpretable Model-agnostic Explanation. A qualitative evaluation in the form of a user survey was conducted for the developed system with the following scenario –

The current time is 0810 hrs. AFR141 is at the gate and expected to take off from runway 09 at 0910 hrs. It is predicted that this flight will be delayed for an unknown number of minutes. After this, the aircraft has 2 more flights in the day. Concurrently, SAS652 is in the last flight leg of the day and is expected to land on runway 09 at 0916 hrs. Moreover, there is a scheduled runway inspection at 0920 hrs.

The target users of the survey were the ATCOs, both professionals and students. Participants were prompted with several scenarios similar to the one stated above and corresponding predictions of the delay with explanations as illustrated in Figure 1, which varied based on the explainability tool used to generate the explanation. At the end of each scenario, the participants were asked to respond to questions to evaluate the effectiveness of the XAI methods in explaining the prediction results.

The outcome of the user survey was deduced as follows: the contribution to the final delay of the selected


features from the XAI methods would not impact the operational relevance of the information received, though the explanations are understandable. This rationalisation was also reflected in the qualitative interviews, including the preference for user-centric feature selection in the explanations and their corresponding values on which the practitioners can act to mitigate the issues of delays. Extensive details on the presented use case can be found in a prior work by the authors (Jmoona et al., 2023).

Based on the outcome of the previous study, the aim was to generate a rule set and counterfactuals in support of the prediction from XGBoost so that the understanding of the operational relevance of the selected features is improved. Particularly, XGBoost is an ensemble of decision trees that are interpretable by nature, as the prediction rules from a single decision tree are easily obtained (Gunning and Aha, 2019). This intrinsic characteristic of XGBoost created the hypothesis of this work: to extract decision rules from the trained XGBoost model and generate counterfactuals that suggest changes in the feature values influencing the prediction.

2 iXGB – INTERPRETABLE XGBOOST

The mechanism of the proposed iXGB is illustrated in Figure 2, which utilises the trained XGBoost regression model as the starting point. The principal components of iXGB are the XGBoost regressor, the rule extractor and the counterfactual generator. The last two are described in the following subsections, together with formal definitions in the context of a regression problem, in which the first component is also addressed.

Figure 2: Overview of the mechanism of the proposed iXGB. The grey-coloured boxes with lighter shades depict the principal components of iXGB.

2.1 Definitions

The regression model Ω is defined to predict a continuous target variable y_i ∈ Y, based on a set of m independent features or attributes a_1, ..., a_m represented by the vector x_i = [x_i1, ..., x_im], with x_i ∈ X. The dataset consists of n observations, each comprising a feature vector x_i and its corresponding target value y_i, where i = 1, ..., n. The objective of the regression model is to learn a mapping function f(x_i) = ŷ_i on (X_train, Y_train) that can accurately estimate the target variable y_i ∈ Y_test given the input feature vector x_i ∈ X_test. Here, (X_train, Y_train) and (X_test, Y_test) are the training and test sets respectively, split from the given dataset at a prescribed ratio.

In this study, Ω refers to an XGBoost (Chen and Guestrin, 2016) regression model for which the corresponding f computes the sum of residuals δ from p decision trees d_k, where k = 1, ..., p and, by definition, \delta_{d_1} > \delta_{d_2} > \cdots > \delta_{d_p}. Therefore, f is formalised as:

f(x_i) = \sum_{k=1}^{p} \delta_{d_k}    (1)

iXGB explains f(x_i) as a pair of objects ⟨r, Φ⟩, where r = c → ŷ_i is a rule describing f(x_i) = ŷ_i. Here, c contains the conditions on the features a_1, ..., a_m, and Φ is the set of counterfactuals. A counterfactual is defined as an instance x_i′ as close as possible to a given x_i with different values for at least one feature a, but for which f outputs a different prediction ŷ_i′, i.e., ŷ_i ≠ ŷ_i′.
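To make Equation 1 concrete, the per-tree residual terms δ_dk of a fitted model can be recovered by differencing cumulative raw predictions over the first k trees. The following is a minimal sketch, not taken from the paper, assuming the xgboost Python package with a fitted XGBRegressor named model and a feature matrix X_test:

    import numpy as np
    import xgboost as xgb

    booster = model.get_booster()            # model: fitted xgboost.XGBRegressor
    dm = xgb.DMatrix(X_test)
    n_trees = booster.num_boosted_rounds()

    # Raw (margin) predictions using only the first k trees, for k = 1..p;
    # iteration_range selects the half-open tree interval [0, k).
    cum = np.stack([booster.predict(dm, output_margin=True, iteration_range=(0, k))
                    for k in range(1, n_trees + 1)])
    deltas = np.diff(cum, axis=0)            # per-tree contributions for trees 2..p

    # The margin over all p trees equals the model's final prediction f(x_i).
    assert np.allclose(cum[-1], model.predict(X_test), atol=1e-4)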
2.2 Extraction of Rules

The decision rule r supporting the prediction ŷ_i by the trained XGBoost regressor f is extracted from the last tree (δ_{d_p}) while the regressor predicts y for the q closest neighbours of the instance x_i. The intuition behind using the last tree is that it generates the lowest residual by definition of XGBoost; in other words, its prediction is more accurate than that of the other trees in f. The closest neighbours of x_i are determined using the Euclidean distance metric. The value of q can be determined by varying it and observing the quality of the generated rules. Finally, all rules from the decision paths of the closest neighbours and x_i are merged for each feature, and r is obtained. The decision paths of the closest instances are included to obtain a generalised rule for the decision region. Algorithm 1 presents the steps of extracting rules with iXGB.


Algorithm 1: Rule Extraction.

Input: f: regressor, x_i: test instance, X_test: test set, q: number of neighbours
Output: r: decision rule
1. CN = {cn_1, ..., cn_q} ← q closest neighbours of x_i from X_test within its cluster
2. DP = {dp_{x_i}, dp_{cn_1}, ..., dp_{cn_q}} ← decision paths from δ_{d_p} of f for {x_i} ∪ CN
3. r ← merge the conditions from DP for each feature a_j, where j = 1, ..., m
4. return r
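Below is a minimal sketch of how Algorithm 1 could be realised with the xgboost and scikit-learn APIs. The helper names (path_conditions, extract_rule) are hypothetical; for brevity the sketch searches the whole test set rather than only the cluster of x_i, and widening the per-feature bounds so the merged rule covers every decision path is one plausible reading of the merge in step 3:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def path_conditions(tree_df, leaf_id):
        # Map each child node ID to its parent ID and the branch direction taken.
        parent = {}
        for _, row in tree_df.iterrows():
            if row["Feature"] != "Leaf":
                parent[row["Yes"]] = (row["ID"], "<")    # left branch: value < split
                parent[row["No"]] = (row["ID"], ">=")    # right branch: value >= split
        conds, node = [], leaf_id
        while node in parent:                            # walk from leaf up to root
            pid, op = parent[node]
            prow = tree_df[tree_df["ID"] == pid].iloc[0]
            conds.append((prow["Feature"], op, float(prow["Split"])))
            node = pid
        return conds

    def extract_rule(model, x_i, X_test, q=5):
        # Decision paths are read from the last tree (δ_dp) of the ensemble.
        trees = model.get_booster().trees_to_dataframe()
        last_id = trees["Tree"].max()
        last = trees[trees["Tree"] == last_id]
        # q closest neighbours of x_i in the test set (Euclidean distance).
        nn = NearestNeighbors(n_neighbors=q).fit(X_test)
        _, idx = nn.kneighbors(x_i.reshape(1, -1))
        batch = np.vstack([x_i.reshape(1, -1), X_test[idx[0]]])
        leaves = model.apply(batch)[:, last_id]          # leaf reached in last tree
        bounds = {}
        for leaf in leaves:
            for feat, op, thr in path_conditions(last, f"{last_id}-{int(leaf)}"):
                lo, hi = bounds.get(feat, (np.inf, -np.inf))
                if op == "<":
                    hi = max(hi, thr)                    # widen upper bound
                else:
                    lo = min(lo, thr)                    # widen lower bound
                bounds[feat] = (lo, hi)
        # Sides never constrained on a path remain unbounded.
        return {f: (lo if lo != np.inf else -np.inf, hi if hi != -np.inf else np.inf)
                for f, (lo, hi) in bounds.items()}       # {feature: (lower, upper)}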
2.3 Generation of Counterfactuals

The pseudo-code for generating counterfactuals is stated in Algorithm 2. In the process of generating the counterfactuals, all the instances of the test set are clustered arbitrarily to form decision boundaries around the instances based on their characteristics. In this study, K-Means clustering is used, and the number of clusters is determined with the Elbow method (Yuan and Yang, 2019). Then, the closest neighbours of the test instance x_i in clusters other than its own are selected. The differences in the feature values and the changes in predicted values are calculated for x_i versus the closest neighbours. Lastly, the pairs of differences in feature values and changes in prediction form the set of counterfactuals Φ.
Algorithm 2: Counterfactual Generation.

Input: f: regressor, x_i: test instance, X_test: test set, q: number of neighbours
Output: Φ: set of counterfactuals
1. C ← form an arbitrary number of clusters with the instances of X_test
2. CN′ = {cn′_1, ..., cn′_q} ← q closest neighbours of x_i in X_test which are in a different cluster than x_i, i.e., C(x_i) ≠ C(cn′_j), where j = 1, ..., q
3. {ΔA_1, ..., ΔA_q} ← differences in the feature values of x_i and CN′
4. {Δy′_1, ..., Δy′_q} ← differences in the predictions with f for x_i and CN′
5. Φ ← {(ΔA_1, Δy′_1), ..., (ΔA_q, Δy′_q)}
6. return Φ
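A compact sketch of Algorithm 2 under the same assumptions follows; generate_counterfactuals is a hypothetical helper, and the number of clusters would in practice be chosen with the Elbow method as described above:

    import numpy as np
    from sklearn.cluster import KMeans

    def generate_counterfactuals(model, x_i, X_test, q=5, n_clusters=7):
        # Cluster the test instances to form decision boundaries around them.
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X_test)
        labels = km.labels_
        own = km.predict(x_i.reshape(1, -1))[0]
        # Candidates: instances lying in clusters other than x_i's own.
        others = X_test[labels != own]
        # The q closest candidates by Euclidean distance.
        d = np.linalg.norm(others - x_i, axis=1)
        cn = others[np.argsort(d)[:q]]
        # Pair feature-value differences with the corresponding prediction changes.
        delta_A = cn - x_i
        delta_y = model.predict(cn) - model.predict(x_i.reshape(1, -1))
        return list(zip(delta_A, delta_y))               # the set Φ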
3 MATERIALS AND METHODS

The implementation of iXGB was done using Python scripts. The Scikit-Learn (Pedregosa et al., 2011) interface was used to build the XGBoost regressor and the K-Means clustering models. The visualisations were generated using Matplotlib (Hunter, 2007) and Seaborn (Waskom, 2021). The datasets and metrics used to evaluate the performance of iXGB are discussed in the following subsections.
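For context, a plausible skeleton of this setup is sketched below; the split ratio and hyperparameters are illustrative, as the paper does not report them:

    from sklearn.model_selection import train_test_split
    from xgboost import XGBRegressor

    # X, y: feature matrix and continuous target from one of the datasets.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                        random_state=0)
    model = XGBRegressor(n_estimators=100, max_depth=6)  # illustrative values
    model.fit(X_train, y_train)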


3.1 Datasets

Three different datasets were used in the experiments conducted for this study. Among them, the first one is the real-world dataset associated with the motivating study described in Section 1.1, and the other two are benchmark datasets. A summary of the datasets is presented in Table 1, followed by brief descriptions.

Table 1: Summary of the datasets used for evaluating the performance of iXGB.

Dataset          Features  Instances
Flight Delay     5         1000
Auto MPG         7         392
Boston Housing   13        516

The real dataset was collected and processed by EUROCONTROL² from the Enhanced Tactical Flow Management System (ETFMS) flight data messages containing all flights in Europe throughout the year 2019, from May to October. For this study, the dataset was acquired from the Aviation Data for Research Repository³. The dataset consists of fundamental details of the flights, flight status, preceding flight legs, ATFM regulations, weather conditions, calendar information, etc. The definitions of the features in the dataset are described in the works of Koolen and Coliban (2020) and Dalmau et al. (2021). Here, the target variable is the flight take-off time delay in minutes. The acquired dataset contained 42 features, whereas only 5 features were considered for this study. The exclusion of features was based on the observation of predicting flight take-off delay from two different subsets of the data, as illustrated in Figure 3. In the figure, the prediction performance of XGBoost improves until the top 5 most important features are used. Here, the feature importance values are obtained from the global weights generated by XGBoost.

² https://www.eurocontrol.int/
³ https://www.eurocontrol.int/dashboard/rnd-data-archive

Figure 3: Prediction performance of XGBoost in terms of MAE for flight delay prediction with different numbers of features ranked by XGBoost feature importance from two different subsets of the data.

The benchmark datasets used in the experiments are commonly used to evaluate models built for regression tasks. The first benchmark dataset is the Auto MPG dataset (Quinlan, 1993), containing information about various car models, including attributes such as cylinders, displacement, horsepower, weight, acceleration, model year, and origin as numerical features. The target variable is the miles per gallon, representing the fuel efficiency of the cars. The other benchmark dataset is the Boston Housing dataset (Harrison and Rubinfeld, 1978). It contains both numerical and categorical features, such as the per capita crime rate, the average number of rooms per dwelling, the distance to employment centres, and others. Here, the target variable is the median value of owner-occupied homes, which is generally utilised as a proxy for housing prices.
3.2 Metrics

The prediction performances of the models are evaluated using the Mean Absolute Error (MAE) and the standard deviation of the Absolute Error (σ_AE). MAE is the average difference between the actual observation y_i and the prediction ŷ_i from the model. σ_AE signifies the dispersion of the absolute error around the MAE. The measures were calculated using Equations 2 and 3 respectively.

MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|    (2)

\sigma_{AE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( |y_i - \hat{y}_i| - MAE \right)^2}    (3)
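A direct NumPy transcription of Equations 2 and 3, as a sketch (the function name is ours):

    import numpy as np

    def mae_and_sigma_ae(y, y_hat):
        abs_err = np.abs(y - y_hat)
        mae = abs_err.mean()                               # Equation 2
        sigma_ae = np.sqrt(np.mean((abs_err - mae) ** 2))  # Equation 3
        return mae, sigma_ae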
To assess the quality of the extracted decision rules, the metric coverage or support (Molnar, 2022) was utilised. Coverage is the percentage of instances from the dataset which follow the given set of rules. It is calculated using Equation 4:

coverage = \frac{|\text{instances to which the rule applies}|}{|\text{instances in the dataset}|}    (4)
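Coverage can be computed directly from a rule's per-feature intervals. A sketch, assuming the {feature: (lower, upper)} rule representation used in the earlier sketches:

    import numpy as np

    def coverage(X, rule, col_of):
        # rule: {feature_name: (lower, upper)}; col_of maps names to column indices.
        mask = np.ones(len(X), dtype=bool)
        for feat, (lo, hi) in rule.items():
            mask &= (X[:, col_of[feat]] >= lo) & (X[:, col_of[feat]] < hi)
        return 100.0 * mask.mean()  # percentage of covered instances (Equation 4)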


4 EVALUATION AND RESULTS

The proposed approach was evaluated through a series of experiments within the context of regression problems. The experimental procedures and the results of the evaluation experiments are presented in this section.

To evaluate the extracted rules and the predictions from iXGB, LIME (Ribeiro et al., 2016) is considered as the baseline, which is widely used in recent literature to generate rule-based explanations (Islam et al., 2022). LIME is developed based on the assumption that the behaviour of an instance can be explained by fitting an interpretable model (e.g., linear regression) with a simplified representation of the instance and its closest neighbours. When explaining a single prediction of a black-box model, LIME generates an interpretable representation of the input instance. In this step, it standardises the input by modifying the values of the measurement unit. The standardisation causes LIME to lose the original proportion of values for regression. In the next step, LIME perturbs the values of the simplified input and predicts using the black-box model, thus generating the data on which the interpretable model trains. Next, LIME draws samples from the generated data based on their similarity to select the closest neighbours. Lastly, a linear regression model is trained with the sampled neighbours. With the prediction from the linear regression model and the value ranges from the neighbourhood, LIME presents the local explanation with rules.
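For reference, the LIME baseline could be invoked for a single test instance roughly as follows; this is a sketch assuming the lime package, with X_train, feature_names, x_i and model standing in for the experiment's actual objects:

    from lime.lime_tabular import LimeTabularExplainer

    explainer = LimeTabularExplainer(X_train, feature_names=feature_names,
                                     mode="regression")
    exp = explainer.explain_instance(x_i, model.predict, num_features=7)
    print(exp.as_list())  # rule-like conditions paired with local weights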
4.1 Prediction Performance

The first evaluation experiment was conducted to assess the prediction performance of the proposed approach. For each dataset described in Section 3.1, the MAE and σ_AE were calculated using Equations 2 and 3. For iXGB, the predictions remain unchanged, as they are taken directly from the XGBoost models and compared with the target values from the datasets. For LIME, the predictions are compared with both the predictions from XGBoost and the target values from the datasets. The results of all the calculations of MAE and σ_AE are illustrated in Figure 4. For the Boston Housing and Flight Delay datasets, it is observed that the prediction error of iXGB is lower than that of the LIME predictions. Moreover, for all the datasets, the predictions from LIME are more erroneous than those of iXGB when compared to the original target values from the datasets. These observations support the claim that iXGB retains the prediction performance of the XGBoost regressor, unlike the surrogate LIME.

Figure 4: Comparison of prediction performance of iXGB and LIME in terms of MAE with three different datasets: (a) Flight Delay, (b) Auto MPG, (c) Boston Housing. Blue-coloured box-plots are for iXGB predictions compared with the target values. Red- and green-coloured box-plots are for LIME predictions compared with the XGBoost predictions and the target values respectively. The mean values presented on the corresponding box-plots are 0.200, 0.517 and 0.545 minutes (Flight Delay); 1.864, 2.182 and 2.865 miles per gallon (Auto MPG); and 2.365, 3.084 and 3.380 thousand $ (Boston Housing).

By design, LIME perturbs the input values to generate samples to train an interpretable model (e.g., linear regression) and uses that model for generating the local explanations. However, the literature advises against modification of measurement units for regression tasks, since this operation destroys the original proportion of the input values (Letzgus et al., 2022). On the other hand, while explanations are generated with iXGB, the prediction performance of XGBoost is not compromised. Under these circumstances, iXGB can be utilised in place of the surrogate models for rule-based explanation (e.g., LIME) when performing regression tasks with XGBoost.

4.2 Coverage of Decision Rule

To evaluate the quality of the rules generated by iXGB, they were compared with the rules extracted from LIME. For simplicity, only the rules extracted for a single instance of prediction from the Auto MPG dataset by iXGB and LIME are presented. Using Algorithm 1, the following rule (r) is extracted from iXGB considering the 5 closest instances from the test set:

IF (cylinders < 4.00) AND
   (displacement <= 74.50) AND
   (horsepower >= 96.50) AND
   (2305.00 <= weight < 2337.50) AND
   (13.10 <= acceleration < 13.75) AND
   (model_year <= 72.00) AND
   (origin >= 3.00)
THEN (mpg = 19.00)

And the decision rule extracted from LIME is:

IF (cylinders <= 4.00) AND
   (displacement <= 98.00) AND
   (88.00 < horsepower <= 120.00) AND
   (2157.00 < weight <= 2672.00) AND
   (acceleration <= 14.15) AND
   (model_year <= 73.00) AND
   (origin > 2.00)
THEN (mpg = 23.66)

In both decision rules, all the features from the dataset are present. Particularly, for the feature weight, the value range is smaller in the rule extracted from iXGB than in the rule extracted from LIME. Again, the conditions differ for the feature origin, but both rules indicate values greater than or equal to 3.00. When rules were generated for all the datasets, it was observed that the value ranges in the rules extracted from iXGB are smaller than in the rules from LIME for the same instances.

Table 2: Coverage scores (average ± standard deviation) of the rules extracted from iXGB and LIME. For local explanation, lower values are better (emphasised with blue fonts in the original).

Dataset          Coverage (iXGB)  Coverage (LIME)
Auto MPG         2.71 ± 1.55      7.24 ± 13.89
Boston Housing   2.53 ± 1.56      1.36 ± 0.87
Flight Delay     3.06 ± 1.41      20.50 ± 22.29


Furthermore, the coverage of the rules from iXGB and LIME was calculated using Equation 4. The results for all the datasets are presented in Table 2. The coverage values of rules for classification models are expected to be higher for better generalisation (Guidotti et al., 2019). In the case of local interpretability, however, the rule needs to define a single instance of prediction, which is the opposite of generalisation (Ribeiro et al., 2018). This claim is also supported in the work of Sagi and Rokach (2021), who argued that tree-ensemble models create several trees to improve the performance of the model, resulting in a large number of decision rules for prediction, a mechanism that makes the models harder for end-users to understand. Thus, smaller coverage values are considered better in this evaluation.

4.3 Counterfactuals

For all the datasets, the sets of counterfactuals (Φ) were generated by selecting a random instance from the test set to assess the impact on the target when the feature values are changed. The process described in Algorithm 2 was followed to generate the counterfactuals. Here, the counterfactuals are the instances around the boundary of the clusters closest to the selected instance. The number of clusters was chosen with the Elbow method (Yuan and Yang, 2019), which was 7 for the Auto MPG dataset and 5 for both the Boston Housing and Flight Delay datasets. Unlike counterfactuals from a classification task, the boundaries of the clusters formed with the test instances can be referred to as decision boundaries, as the instances are clustered based on the characteristics of the data.

Table 3: Sample set of counterfactuals generated using iXGB from the Auto MPG dataset.

Change in Feature Values                                                    Change in Target
cylinders  displacement  hp   weight  acceleration  model year  origin
+1         +43           -2   +42     +2            -2          0           -50%
0          -27           +23  -29     -4            -3          +2          -10%
0          +5            +20  +22     -2            -11         0           +20%
0          +15           +30  -8      -6            -11         +2          +45%
0          -22           +11  -13     +2            -11         +2          +75%
0          -27           +23  -36     -4            -1          +2          +90%

Table 4: Sample set of counterfactuals generated using iXGB from the Boston Housing dataset.

Change in Feature Values                                                    Change in Target
crim  zn  indus  chas  nox  rm  age  dis  rad  tax  ptratio  black  lstat
+1    0   0      0     0    +1  +2   0    0    0    0        -287   +3     -600%
+4    0   0      0     0    0   0    0    0    0    0        -152   +9     -275%
+7    0   0      0     0    0   +5   0    0    0    0        -83    +6     -200%
-1    0   0      0     0    0   +4   0    0    0    0        -33    +5     +40%
+1    0   0      0     0    0   +22  0    0    0    0        -69    +8     +50%

The sample set of counterfactuals from the Auto MPG dataset is presented in Table 3. From the table, it is found that the target value changes when all the feature values are changed, except the feature cylinders in the first counterfactual. Likewise, for the Boston Housing dataset (Table 4), 8 out of 13 features did not need to be changed to find the counterfactuals. Again, changing the values of only 3 features can decrease the target value by 275%. Lastly, the counterfactuals from the Flight Delay dataset are presented in Table 5, which can be interpreted in a similar way to the previous two tables. For all the tables with counterfactuals, the feature names are shown as they appear in the dataset, since the names are not directly subjected to the mechanism of the proposed iXGB.


Table 5: Sample set of counterfactuals generated using iXGB from the Flight Delay dataset.

Change in Feature Values                                                    Change in Target
ts leg to ts   flight duration leg   ts to ta leg   ta leg ts   ifp to ts
-118           -56                   +9             +4          +3          -10%
+2             +2                    +2             +2          +2          -5%
-14            +7                    +11            -36         -89         -5%
-21            +35                   +25            -5          -34         +5%
+29            +46                   -25            -8          -49         +5%

The set of counterfactuals for any regression task can support end-users when they need to modify some feature values to achieve a certain target. Such a question can be: what would it take to increase the target value by some percentage? However, the change can be measured both in percentage and in absolute values. Overall, the counterfactuals would facilitate the decision-making process of end-users by maintaining operational relevance.


5 CONCLUSION AND FUTURE WORKS

XGBoost is widely adopted in regression tasks because of its higher accuracy compared to other tree-based ML models, at the cost of interpretability. Generally, interpretability is induced into XGBoost through various XAI methods. These XAI methods (e.g., LIME) rely on perturbed samples to provide explanations for XGBoost predictions. In this paper, iXGB is proposed, utilising the internal structure of XGBoost to generate rule-based explanations and counterfactuals from the same data on which the model trains for prediction tasks. The proposed approach is functionally evaluated on three different datasets in terms of local accuracy and quality of the rules, which shows the ability of iXGB to reasonably improve the interpretability of XGBoost. Future research directions include theoretically grounded evaluation of the proposed approach on more diverse datasets and different real-world problems. Moreover, further investigations are also required to adapt the proposed iXGB to binary and multi-class classification tasks.

ACKNOWLEDGEMENTS

This study was supported by the following projects: i) ARTIMATION (Transparent Artificial Intelligence and Automation to Air Traffic Management Systems), funded by the SESAR JU under the European Union's Horizon 2020 Research and Innovation programme (Grant Agreement No. 894238), and ii) xApp (Explainable AI for Industrial Applications), funded by VINNOVA (Sweden's Innovation Agency) (Diary No. 2021-03971).

REFERENCES

Blanchart, P. (2021). An Exact Counterfactual-Example-based Approach to Tree-ensemble Models Interpretability. ArXiv, (arXiv:2105.14820v1 [cs.LG]).

Breiman, L. (2001). Random Forests. Machine Learning, 45(1):5–32.

Chen, T. and Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, San Francisco, CA, USA. ACM.

Cook, A. J. and Tanner, G. (2015). European Airline Delay Cost Reference Values. Technical report, University of Westminster, London, UK.

Dalmau, R., Ballerini, F., Naessens, H., Belkoura, S., and Wangnick, S. (2021). An Explainable Machine Learning Approach to Improve Take-off Time Predictions. Journal of Air Transport Management, 95:102090.

Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics, 29(5).

Guidotti, R., Monreale, A., Giannotti, F., Pedreschi, D., Ruggieri, S., and Turini, F. (2019). Factual and Counterfactual Explanations for Black Box Decision Making. IEEE Intelligent Systems, 34(6):14–23.

Gunning, D. and Aha, D. W. (2019). DARPA's Explainable Artificial Intelligence Program. AI Magazine, 40(2):44–58.

Hara, S. and Hayashi, K. (2016). Making Tree Ensembles Interpretable. ArXiv, (arXiv:1606.05390v1 [stat.ML]).

Harrison, D. and Rubinfeld, D. L. (1978). Hedonic Housing Prices and the Demand for Clean Air. Journal of Environmental Economics and Management, 5(1):81–102.

Hunter, J. D. (2007). Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering, 9(3):90–95.

Islam, M. R., Ahmed, M. U., Barua, S., and Begum, S. (2022). A Systematic Review of Explainable Artificial Intelligence in Terms of Different Application Domains and Tasks. Applied Sciences, 12(3):1353.

Jmoona, W., Ahmed, M. U., Islam, M. R., Barua, S., Begum, S., Ferreira, A., and Cavagnetto, N. (2023). Explaining the Unexplainable: Role of XAI for Flight Take-Off Time Delay Prediction. In Maglogiannis, I., Iliadis, L., MacIntyre, J., and Dominguez, M., editors, Artificial Intelligence Applications and Innovations – AIAI 2023, volume 676, pages 81–93, León, Spain. Springer Nature Switzerland.

Koolen, H. and Coliban, I. (2020). Flight Progress Messages Document. Technical report, EUROCONTROL, Brussels, Belgium.

Letzgus, S., Wagner, P., Lederer, J., Samek, W., Müller, K.-R., and Montavon, G. (2022). Toward Explainable Artificial Intelligence for Regression Models: A Methodological Perspective. IEEE Signal Processing Magazine, 39(4):40–58.

Lukacs, M. (2020). Cost of Delay Estimates. Technical report, Federal Aviation Administration, Washington, DC, USA.

Molnar, C. (2022). Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. Christoph Molnar, Munich, Germany, 2nd edition.

Nalenz, M. and Augustin, T. (2022). Compressed Rule Ensemble Learning. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics, pages 9998–10014. PMLR.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, É. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(85):2825–2830.

Quinlan, J. R. (1993). Combining Instance-based and Model-based Learning. In Proceedings of the Tenth International Conference on Machine Learning (ICML 1993), pages 236–243, Amherst, MA, USA. Morgan Kaufmann Publishers Inc.

Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2016), pages 1135–1144, San Francisco, CA, USA. ACM.

Ribeiro, M. T., Singh, S., and Guestrin, C. (2018). Anchors: High-Precision Model-Agnostic Explanations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.

Sagi, O. and Rokach, L. (2021). Approximating XGBoost with an Interpretable Decision Tree. Information Sciences, 572:522–542.

Waskom, M. (2021). Seaborn: Statistical Data Visualization. Journal of Open Source Software, 6(60):3021.

Yuan, C. and Yang, H. (2019). Research on K-Value Selection Method of K-Means Clustering Algorithm. J, 2(2):226–235.