iXGB: Improving the Interpretability of XGBoost Using Decision Rules and Counterfactuals
Artificial Intelligence and Intelligent Systems Research Group, School of Innovation Design and Engineering,
Mälardalen University, Universitetsplan 1, 722 20 Västerås, Sweden
Abstract: Tree-ensemble models, such as Extreme Gradient Boosting (XGBoost), are well-known machine learning models that achieve higher prediction accuracy than traditional tree-based models. This higher accuracy, however, comes at the cost of reduced interpretability. Moreover, unlike single-tree models, XGBoost does not expose an explicit decision path or prediction rule. This paper proposes iXGB (interpretable XGBoost), an approach to improve the interpretability of XGBoost. iXGB approximates a set of rules from the internal structure of XGBoost and the characteristics of the data. In addition, iXGB generates a set of counterfactuals from the neighbourhood of the test instances to support end-users in understanding their operational relevance. The performance of iXGB in generating rule sets is evaluated with experiments on real and benchmark datasets, which demonstrated reasonable interpretability. The evaluation results also support the idea that the interpretability of XGBoost can be improved without using surrogate methods.
Islam, M., Ahmed, M. and Begum, S.
iXGB: Improving the Interpretability of XGBoost Using Decision Rules and Counterfactuals.
DOI: 10.5220/0012474000003636
Paper published under CC license (CC BY-NC-ND 4.0)
In Proceedings of the 16th International Conference on Agents and Artificial Intelligence (ICAART 2024) - Volume 3, pages 1345-1353
ISBN: 978-989-758-680-4; ISSN: 2184-433X
Proceedings Copyright © 2024 by SCITEPRESS – Science and Technology Publications, Lda.
ICAART 2024 - 16th International Conference on Agents and Artificial Intelligence
Figure 1: Example of an explanation generated for a single instance of flight TOT delay prediction using LIME. The red and green horizontal bars correspond to the contributions towards increasing and decreasing the delay, respectively, and the blue bar corresponds to the predicted delay.
tree-ensemble classifiers. These studies are the only notable ones found in the literature which contributed to improving the interpretability of tree-ensemble models for classification tasks, with indications towards their use in regression tasks.

From the literature, it is evident that less effort has been directed towards making ensemble models (e.g., XGBoost) interpretable for regression tasks. Moreover, different state-of-the-art methods produce explanations that differ in the contents of the output. Under these circumstances, this study aims to improve the interpretability of XGBoost by utilising its mechanisms by design. The main contribution of this study is twofold:

• Explaining the predictions of XGBoost regression models using decision rules extracted from the trained model.

• Generation of counterfactuals from the actual neighbourhood of the test instance.

1.1 Motivation

The work presented in this paper is further motivated by a real-world regression application for the aviation industry. Particularly, the regression task is to predict the flight take-off time (TOT) delay from historical data to support the responsibilities of Air Traffic Controllers (ATCOs). It is worth mentioning that the aviation industry experiences a loss of approximately 100 Euros on average per minute of Air Traffic Flow Management (ATFM) delay (Cook and Tanner, 2015). The Federal Aviation Administration (FAA, https://www.faa.gov/) reported in 2019 that the estimated cost due to delay, considering passengers, airlines, lost demand, and indirect costs, was thirty-three billion dollars (Lukacs, 2020). These significant expenses provide the rationale for increased attention towards predicting TOT and reducing flight delays (Dalmau et al., 2021).

To solve the problem of predicting flight TOT delay, an interpretable system was developed to incorporate the existing operational interface of the ATCOs. In the process, the prediction model was developed with XGBoost, and its predictions were made interpretable with the help of several popular XAI tools, such as LIME (Local Interpretable Model-agnostic Explanations). A qualitative evaluation in the form of a user survey was conducted for the developed system with the following scenario:

The current time is 0810 hrs. AFR141 is at the gate and expected to take off from runway 09 at 0910 hrs. It is predicted that this flight will be delayed for unknown minutes. After this, the aircraft has 2 more flights in the day. Concurrently, SAS652 is in the last flight leg of the day and is expected to land on runway 09 at 0916 hrs. Moreover, there is a scheduled runway inspection at 0920 hrs.

The target users of the survey were ATCOs, both professionals and students. Participants were prompted with several scenarios similar to the one stated above, together with the corresponding delay predictions and explanations as illustrated in Figure 1, which varied based on the explainability tool used to generate the explanation. At the end of each scenario, the participants were asked to respond to questions evaluating the effectiveness of the XAI methods in explaining the prediction results.

The outcome of the user survey was deduced as the contribution to the final delay of the selected
[Figure 3: line plot of MAE (y-axis, roughly 0.10 to 0.35) against the number of features (x-axis, 1 to 42).]
Figure 3: Prediction performance of XGBoost in terms of MAE for flight delay prediction with different numbers of features ranked by XGBoost feature importance, from two different subsets of the data.

containing information about various car models, including attributes such as cylinders, displacement, horsepower, weight, acceleration, model year, and origin as numerical features. The target variable is miles per gallon, representing the fuel efficiency of the cars. The other benchmark dataset was the Boston Housing dataset (Harrison and Rubinfeld, 1978). It contains both numerical and categorical features, such as per capita crime rate, average number of rooms per dwelling, distance to employment centres, and others. Here, the target variable is the median value of owner-occupied homes, which is generally utilised as a proxy for housing prices.

3.2 Metrics

The prediction performances of the models are evaluated using the Mean Absolute Error (MAE) and the standard deviation of the Absolute Error (σAE). MAE is the average absolute difference between the actual observation yᵢ and the prediction ŷᵢ from the model. σAE signifies the dispersion of the absolute error around the MAE. The measures were calculated using Equations 2 and 3 respectively.

MAE = (1/n) ∑ᵢ₌₁ⁿ |yᵢ − ŷᵢ|    (2)

σAE = √[ (1/n) ∑ᵢ₌₁ⁿ ( |yᵢ − ŷᵢ| − MAE )² ]    (3)

To assess the quality of the extracted decision rules, the metric coverage, or support (Molnar, 2022), was utilised. Coverage is the percentage of instances from the dataset which follow the given set of rules. It is calculated using Equation 4:

coverage = |instances to which the rule applies| / |instances in the dataset|    (4)

series of experiments within the context of regression problems. The experimental procedures and the results of the evaluation experiments are presented in this section.

To evaluate the extracted rules and the predictions from iXGB, LIME (Ribeiro et al., 2016) is considered as the baseline, as it is widely used in recent literature to generate rule-based explanations (Islam et al., 2022). LIME is built on the assumption that the behaviour of a black-box model around an instance can be explained by fitting an interpretable model (e.g., linear regression) on a simplified representation of the instance and its closest neighbours. When explaining a single prediction of a black-box model, LIME first generates an interpretable representation of the input instance. In this step, it standardises the input by modifying the values of the measurement unit; this standardisation causes LIME to lose the original proportion of values for regression. Next, LIME perturbs the values of the simplified input and predicts with the black-box model, thus generating the data on which the interpretable model is trained. LIME then draws samples from the generated data based on their similarity to select the closest neighbours. Lastly, a linear regression model is trained on the sampled neighbours. With the prediction from the linear regression model and the value ranges from the neighbourhood, LIME presents the local explanation as rules.

4.1 Prediction Performance

The first evaluation experiment was conducted to assess the prediction performance of the proposed approach. For each dataset described in Section 3.1, the MAE and σAE were calculated using Equations 2 and 3. For iXGB, the predictions remain unchanged, as they are taken directly from the XGBoost models and compared with the target values from the datasets. For LIME, the predictions are compared with both the predictions from XGBoost and the target values from the datasets. The results of all the MAE and σAE calculations are illustrated in Figure 4. For the Boston Housing and Flight Delay datasets, it is observed that the prediction error of iXGB is lower than that of the LIME predictions. Moreover, for all the datasets, the predictions from LIME are more erroneous than those of iXGB when compared to the original target values from the datasets. These observations advocate that iXGB retains the prediction performance of the XGBoost regressor better than the surrogate LIME does. Under these circumstances, iXGB can be utilised in place of surrogate models for rule-based explanation (e.g., LIME) when performing regression tasks with XGBoost.

[Figure 4: bar charts of prediction performance per dataset; x-axis 'Predictor vs Baseline' with categories XGBoost vs Data, LIME vs XGBoost, and LIME vs Data. Recoverable panel titles: 'Prediction Performance for Flight Delay Dataset' (MAE in minutes), '(b) Auto MPG Dataset', 'Prediction Performance for Boston Housing Dataset' (MAE in thousand $); visible bar values include 2.865, 2.182, and 1.864.]

(displacement <= 74.50) AND
(horsepower >= 96.50) AND
(2305.00 <= weight < 2337.50) AND
(13.10 <= acceleration < 13.75) AND
(model_year <= 72.00) AND
(origin >= 3.00)
THEN (mpg = 19.00)

And the decision rule extracted from LIME is:

IF (cylinders <= 4.00) AND
(displacement <= 98.00) AND
(88.00 < horsepower <= 120.00) AND
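The metrics of Section 3.2 and rules in this IF/AND/THEN form can be exercised together in a short sketch. The representation (a rule as a list of predicates over a dict of feature values), the function names, and the toy instances below are our own illustration, not the paper's implementation:

```python
import math

def mae(y_true, y_pred):
    # Equation 2: mean absolute error
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

def sigma_ae(y_true, y_pred):
    # Equation 3: standard deviation of the absolute errors around the MAE
    m = mae(y_true, y_pred)
    return math.sqrt(sum((abs(y - p) - m) ** 2
                         for y, p in zip(y_true, y_pred)) / len(y_true))

def rule_applies(rule, instance):
    # A rule is a conjunction of predicates; it fires only if all conjuncts hold.
    return all(predicate(instance) for predicate in rule)

def coverage(rule, instances):
    # Equation 4: fraction of instances to which the rule applies
    return sum(1 for x in instances if rule_applies(rule, x)) / len(instances)

# The iXGB rule for the Auto MPG example above, written as predicates.
rule = [
    lambda x: x["displacement"] <= 74.50,
    lambda x: x["horsepower"] >= 96.50,
    lambda x: 2305.00 <= x["weight"] < 2337.50,
    lambda x: 13.10 <= x["acceleration"] < 13.75,
    lambda x: x["model_year"] <= 72.00,
    lambda x: x["origin"] >= 3.00,
]

car = {"displacement": 72.0, "horsepower": 97.0, "weight": 2310.0,
       "acceleration": 13.2, "model_year": 71, "origin": 3}
other = dict(car, weight=2400.0)  # violates the weight interval

print(rule_applies(rule, car))      # True -> the rule predicts mpg = 19.00
print(coverage(rule, [car, other])) # 0.5
print(mae([10.0, 12.0, 9.0], [11.0, 12.0, 7.0]))  # 1.0
```

Representing each conjunct as a closed-form predicate keeps mixed interval shapes (one-sided `<=`, two-sided `low <= x < high`) uniform, which is convenient when computing coverage over a whole dataset.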
Table 3: Sample set of counterfactuals generated using iXGB from the Auto MPG dataset.
                     Change in Feature Values                      Change
cylinders  displacement   hp   weight  acceleration  model year  origin  in Target
   +1          +43        -2    +42        +2           -2          0      -50%
    0          -27       +23    -29        -4           -3         +2      -10%
    0           +5       +20    +22        -2          -11          0      +20%
    0          +15       +30     -8        -6          -11         +2      +45%
    0          -22       +11    -13        +2          -11         +2      +75%
    0          -27       +23    -36        -4           -1         +2      +90%
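Each row of Table 3 (and of Tables 4 and 5 below) is a vector of per-feature deltas paired with a relative change in the target. Reading a row against a concrete instance is plain arithmetic; in the sketch below, only the deltas come from the first row above, while the baseline instance and target value are invented for illustration:

```python
def apply_counterfactual(instance, deltas, target, target_change_pct):
    # Shift each feature by its delta and scale the target by the stated percentage.
    shifted = {f: v + deltas.get(f, 0) for f, v in instance.items()}
    new_target = target * (1 + target_change_pct / 100.0)
    return shifted, new_target

# Hypothetical Auto MPG instance; deltas from Table 3, row 1 (-50% target change).
instance = {"cylinders": 4, "displacement": 90.0, "hp": 95.0, "weight": 2200.0,
            "acceleration": 15.0, "model_year": 76, "origin": 1}
deltas = {"cylinders": +1, "displacement": +43, "hp": -2, "weight": +42,
          "acceleration": +2, "model_year": -2, "origin": 0}

shifted, new_mpg = apply_counterfactual(instance, deltas, target=30.0,
                                        target_change_pct=-50)
print(shifted["displacement"], new_mpg)  # 133.0 15.0
```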
Table 4: Sample set of counterfactuals generated using iXGB from the Boston Housing dataset.
                        Change in Feature Values                                Change
crim  zn  indus  chas  nox  rm  age  dis  rad  tax  ptratio  black  lstat    in Target
 +1    0    0      0    0   +1   +2   0    0    0      0      -287    +3       -600%
 +4    0    0      0    0    0    0   0    0    0      0      -152    +9       -275%
 +7    0    0      0    0    0   +5   0    0    0      0       -83    +6       -200%
 -1    0    0      0    0    0   +4   0    0    0      0       -33    +5        +40%
 +1    0    0      0    0    0  +22   0    0    0      0       -69    +8        +50%
Furthermore, the coverage of the rules from iXGB and LIME was calculated using Equation 4. The results for all the datasets are presented in Table 2. The coverage values of rules for classification models are expected to be higher for better generalisation (Guidotti et al., 2019). In the case of local interpretability, however, the rule needs to define a single instance of prediction, which is the opposite of generalisation (Ribeiro et al., 2018). This claim is also supported by the work of Sagi and Rokach (2021). The authors argued that tree-ensemble models create several trees to improve the performance of the model, resulting in a large number of decision rules for prediction; this mechanism makes the rules harder for end users to understand. Thus, smaller coverage values are considered better in this evaluation.

4.3 Counterfactuals

For all the datasets, the sets of counterfactuals (Φ) were generated by selecting a random instance from the test set to assess the impact on the target when the feature values are changed. The process described in Algorithm 2 was followed to generate the counterfactuals. Here, the counterfactuals are the instances around the boundary of the clusters closest to the selected instance. The number of clusters was chosen with the Elbow method (Yuan and Yang, 2019), which gave 7 for the Auto MPG dataset and 5 for both the Boston Housing and Flight Delay datasets. Unlike counterfactuals from a classification task, the boundaries of the clusters formed with the test instances can be referred to as decision boundaries, as they are clustered based on the characteristics of the data.

The sample set of counterfactuals from the Auto MPG dataset is presented in Table 3. From the table, it is found that the target value changes when all the feature values are changed except the feature cylinders in the first counterfactual. Likewise, for the Boston Housing dataset (Table 4), 8 out of 13 features did not need to be changed to find the counterfactuals. Again, changing the values of only 3 features can decrease the target value by 275%. Lastly, the counterfactuals from the Flight Delay dataset are presented in Table 5, which can be interpreted in a similar way to the last two tables. For all the tables with counterfactuals, the feature names are shown as they appear in the dataset, since the names are not directly subjected to the mechanism of the proposed iXGB.

The set of counterfactuals for any regression task can support end users when they need to modify some feature values to achieve a target. Such a question can be: what would it take to increase the target value by some percentage? The change can be measured both in percentage and absolute values. After all, the counterfactuals would facilitate the decision-making process of end users by
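The neighbourhood-clustering step described above can be illustrated with a toy reconstruction. This is not the paper's Algorithm 2: the distance measure, the deterministic initialisation (plain Lloyd's k-means seeded with the first k points), and the boundary criterion (take the nearest members of the closest *other* cluster) are our own simplifications:

```python
import math

def dist(a, b):
    # Euclidean distance between two equal-length tuples.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=20):
    # Plain Lloyd's algorithm, deterministically seeded with the first k points.
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members)
                                     for v in zip(*members))
    return centroids, labels

def counterfactual_candidates(test_point, points, k, n=3):
    # Cluster the neighbourhood, then return the n members of the closest
    # *other* cluster, nearest first: they sit around the inter-cluster boundary.
    centroids, labels = kmeans(points, k)
    own = min(range(k), key=lambda c: dist(test_point, centroids[c]))
    other = min((c for c in range(k) if c != own),
                key=lambda c: dist(test_point, centroids[c]))
    members = [p for p, l in zip(points, labels) if l == other]
    return sorted(members, key=lambda p: dist(test_point, p))[:n]

# Three well-separated 2-D blobs; the first three points seed the clusters.
pts = [(0.0, 0.0), (3.0, 0.0), (0.0, 5.0),
       (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
       (3.1, 0.0), (3.0, 0.1),
       (0.1, 5.0), (0.0, 5.1)]
print(counterfactual_candidates((0.05, 0.05), pts, k=3, n=2))
```

In the paper, k would instead be chosen per dataset with the Elbow method, and the candidates would be instances of the test set rather than synthetic 2-D points.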
Table 5: Sample set of counterfactuals generated using iXGB from the Flight Delay dataset.
Change in Feature Values Change
ts leg to ts flight duration leg ts to ta leg ta leg ts ifp to ts in Target
-118 -56 +9 +4 +3 -10%
+2 +2 +2 +2 +2 -5%
-14 +7 +11 -36 -89 -5%
-21 +35 +25 -5 -34 +5%
+29 +46 -25 -8 -49 +5%