CatBoost based supervised machine learning classification
Energy Reports
journal homepage: www.elsevier.com/locate/egyr
Research paper
article info

Article history:
Received 3 March 2021
Received in revised form 5 June 2021
Accepted 5 July 2021
Available online 24 July 2021

Keywords:
CatBoost algorithm
NTL detection
Smart meters
Feature engineering
Machine learning model interpretation

abstract

This paper presents a novel supervised machine learning-based electric theft detection approach using the feature engineered-CatBoost algorithm in conjunction with the SMOTETomek algorithm. Contrary to the previous literature, where the missing observations in data are either ignored or imputed with average values, this work utilizes the k-nearest neighbor technique for missing data imputation; thus, an accurate and realistic estimation of the missing data is achieved. To mitigate the bias toward the majority data class, the proposed model utilizes the SMOTETomek algorithm, which neutralizes the mentioned effect by managing a proper balance between over-sampling and under-sampling techniques. The FeatuRe Extraction based on Scalable Hypothesis tests (FRESH) algorithm is utilized at the later stage of the proposed NTL detection framework to extract and select the most relevant data features from the provided dataset. Afterward, the model is trained using the CatBoost algorithm to classify the consumers into two distinct categories, i.e., genuine and theft. Finally, to interpret the model’s decision for the corresponding predictions, the tree-SHAP algorithm is utilized. To validate the efficacy of the proposed ML-based theft detection approach, its performance is compared with that of the traditional gradient boosting ML algorithms such as XGBoost, LightGBM, Ensemble bagging, boosting ML models, and other conventional ML models using five of the most widely used performance metrics, i.e., precision, accuracy, F1score, Kappa, and MCC. The proposed technique achieved an accuracy of 93% and a detection rate of 92%, which are significantly higher than those of all the considered competing algorithms under an identical dataset and hyperparameters.
© 2021 Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/).
1. Introduction

1.1. Background

The transmission and distribution (T&D) of electricity suffers from two major categories of losses, i.e., technical and non-technical. The technical losses account for the energy losses that occur in equipment that is essential for implementing the T&D of electricity. On the other hand, the non-technical losses (NTL) in any power system account for power theft, billing irregularities, and corruption within utility workers. According to a report, the utilities around the globe are losing approximately US$96 billion every year due to NTLs (Northeast Group, 2017). The mentioned scenario is precisely depicted in Fig. 1, which shows the intensity of the NTL issue in different parts of the world.

Owing to such massive economic loss, the power utilities and researchers in the field of data mining, computer science, and electrical engineering are trying several intelligent and effective methods to minimize NTLs. One of the efficient methods to counter the electric theft issue is the implementation of smart meters. Such energy meters can monitor and record the consumers’ consumption data remotely and precisely and provide the information to the utility directly in case of any suspicious activity. However, despite the vast number of benefits, smart meters are not feasible for countries suffering from severe economic issues due to huge expenditures associated with their implementation and operation. Furthermore, increasing cyber threats still

∗ Corresponding author.
E-mail addresses: [email protected] (I. Khan), [email protected] (A. Khan).
https://doi.org/10.1016/j.egyr.2021.07.008
2352-4847/© 2021 Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
S. Hussain, M.W. Mustafa, T.A. Jumani et al. Energy Reports 7 (2021) 4425–4436
class are removed to balance the data class. Such data balancing techniques are easy to implement; however, they may cause substantial data loss, resulting in lower accuracy of the developed NTL detection model. To avoid this issue, the current study utilizes an efficient statistical technique called SMOTETomek (Batista et al., 2004a). It combines SMOTE (over-sampling) and Tomek links (under-sampling) to balance the data class distribution.

As discussed earlier in this section, the third critical issue in supervised NTL detection methods is the selection of the most relevant features for the model training. The efficiency of classification-based theft detection methods is highly dependent on the type of input features selected. Since smart meter data is generally high dimensional and contains many redundant and irrelevant features, it is essential to extract and select the most relevant features and discard the unnecessary ones. In this study, the mentioned issue is solved by using an efficient feature extraction and selection process, which is an effective practice for reducing the data dimensions and the redundant information in ML-based NTL approaches. It is worthwhile to mention that, unlike most of the ML-based NTL detection approaches in the literature, where either a feature extraction or a feature selection process is adopted for model training, this research work utilizes both to acquire highly relevant features from the considered smart meter dataset. The proposed approach uses the FeatuRe Extraction based on Scalable Hypothesis tests (FRESH) algorithm to accomplish the mentioned task. It does so by utilizing more than 60 time-series analytical methods to capture 794 features from each dataset sample. The extracted features are reduced to the 300 most relevant features through the Benjamini–Yekutieli procedure. The resulting final set of features is a combination of essential user consumption and newly extracted features.

Once the feature engineering process is completed, the next challenge is to select an appropriate classifier for efficiently segregating the genuine and theft consumers. In this study, the CatBoost algorithm is utilized for the model training due to its efficient handling of categorical features. In most traditional ML models, these categorical features are handled during the pre-processing phase, which consequently increases the computational time and complexity. CatBoost, in contrast, handles these features during the training process and thus avoids the mentioned problems faced by conventional classifiers. Furthermore, it utilizes ordered boosting, which avoids the prediction shift problem faced by XGBoost and its variants. Also, by enabling the overfitting detector feature in its framework, the trained model can achieve an improved generalization ability.

Another important aspect of the proposed theft detection model is the interpretability of its outcomes. Mostly, site inspections are initiated on the list of suspected consumers generated by the model trained on genuine and theft consumers’ data. However, a model’s prediction to place the consumer in a particular category based on a given input feature set is not justified logically. A few studies in the literature, such as Batista et al. (2004b) and Christ (2018), have employed simplistic decision tree diagrams to interpret the model outcomes. However, the latest state-of-the-art theft detection models employing deep learning, gradient boosting machines and ensemble ML techniques incorporate a diverse range of complex prediction strategies, making them extremely difficult to comprehend through simplistic tree diagrams. To deal with the mentioned issue, tree-SHapley Additive exPlanations (tree-SHAP) is utilized in the current study. It assists in opening the black-box ML model’s outcomes by explaining how the model reached a decision for a particular prediction.

It is fair to mention and highlight the studies in the literature that are most relevant to the current research work. One such study was carried out by Gunturi and Sarkar (2020), where the authors developed an ensemble machine learning-based theft detection model. In another study, Punmiya and Choe (2019) proposed a gradient boosted theft detector framework, which employs the latest XGBoost, LightGBM, and CatBoost for model training. The current study differentiates itself from the mentioned research works in its novel data class balancing and feature engineering approach. Furthermore, unlike the quoted studies, where the model’s outcome interpretability was not evaluated, this research work utilizes the tree-SHAP algorithm to accomplish the mentioned task.

Concluding the detailed discussion, the list of steps executed sequentially in order to accomplish the proposed supervised ML-based NTL detection framework is presented as follows.

i. The k-Nearest Neighbors imputation technique is employed to handle the missing and erroneous data values in the acquired dataset.
ii. The SMOTETomek-based resampling technique is utilized to tackle the data class imbalance issue.
iii. The FRESH algorithm is used to extract and select the most relevant statistical features from raw smart meter data.
iv. The state-of-the-art CatBoost algorithm is implemented and compared with other well-known ML classifiers for identifying the NTLs.
v. Interpretation of the model outcomes is performed through the tree-SHAP algorithm.
vi. To validate the effectiveness of the proposed theft detection framework, an extensive performance evaluation is made based on five of the most widely utilized performance metrics.
vii. The proposed NTL framework achieves the highest detection rate and the lowest false positive rate among all the compared algorithms.

The rest of the paper is divided into three sections. Section 2 presents the proposed research methodology and is further sub-categorized to discuss the CatBoost algorithm’s theoretical background, considered performance metrics, and proposed framework results and interpretations. In Section 3, the proposed model’s comparative analysis against the latest gradient boosting decision trees (GBDTs) and traditional ML models is discussed in detail. Finally, the conclusion is made in Section 4 of this research work.

2. Research methodology

In this section, the proposed NTL detection framework is presented. The overall framework is broadly classified into three major stages, i.e., the data pre-processing stage, the feature engineering stage, and the model training-testing and interpretation stage. Each of the stages is depicted in Fig. 2 and described in detail in the subsequent subsections.

2.1. Stage-1: Data pre-processing stage

Data pre-processing is required to transform the raw data into a meaningful data structure. The electricity consumption data acquired from the State Grid Corporation of China (SGCC) (Zheng et al., 2018) is used for testing the efficacy of the proposed theft detection model. Table 1 presents the metadata information of the acquired dataset.

As presented in Table 1, the dataset contains the daily electricity consumption of 42372 consumers for approximately 1035 days (2014-Jan to 2016-Oct). It comprises 91.46% genuine and 8.54% theft consumers. Fig. 3 and Fig. 4 depict the electric power consumption patterns for a few of the theft and the genuine consumers, respectively. It can be observed from the mentioned figures that the consumption patterns of the theft consumers are highly irregular and contain low periodicity. On the other hand, the patterns for the genuine consumers are periodical and exhibit a correlation between the identical periods of the consecutive years.

Table 1
Statistics of obtained SGCC data.
Description                                    Value
The time window for electricity consumption    2014-01-01 to 2016-10-31 (1035 days)
Number of total consumers                      42372
Number of electricity thieves                  3615
Number of genuine consumers                    38757
Total data records                             42372 * 1035 = 43855020

To check the missing information in the data, the NaN values were computed for each consumer. It was found that 25.6% of the 43855020 data entries contain NaN or missing values, which is a significantly high proportion for any data set in the field of data mining. The distribution of computed null values in terms of the histogram is shown in Fig. 5. The histogram bar values depict the number of consumers falling in each missing-values range.

The computed histogram illustrates that 22.6% of total consumers fall into the range of more than 700 missing values per consumer. Correctly estimating these consumers’ missing data values becomes extremely challenging since a significant portion of the information is unavailable in the acquired dataset. Therefore, the viable option left is to drop such highly inadequate entries from the rest of the dataset. The missing values in the remaining consumers are imputed using the kNN imputation technique (Troyanskaya et al., 2001). The kNN is a non-parametric and lazy learner algorithm that matches an observation in multidimensional space to its k nearest neighbors. The kNN’s capability of dealing with almost all types of missing data makes it a suitable candidate for the missing value imputation. It accomplishes the imputation task by utilizing the Euclidean distance metric to first find the consumer’s k nearest neighbors and then imputing the missing feature value using the mean of the selected k neighbors. The current study utilizes the KNNImputer module available in the Scikit-learn ML package to impute the missing data slots (Pedregosa et al., 2011). A few random consumers’ consumption samples are plotted to visualize the newly imputed values in consumers’ consumption data, as shown in Fig. 6.

2.2. Stage-2: Data class balancing and feature engineering

This stage is further divided into two sub-stages, i.e., data class balancing and feature engineering, as depicted in Fig. 2. Each of the mentioned sub-stages is explained in subsequent subsections.

2.2.1. Data class balancing

For an efficient and unbiased classifying performance of a supervised ML classifier, it is essential to extract and select the most suitable features from a balanced dataset. Since the smart meter dataset considered in the current study is unbalanced, as is the case for most NTL detection data sets, it is necessary to balance the class distribution before the feature extraction and selection process. In order to solve this issue, the SMOTETomek (Batista et al., 2004b) algorithm is utilized in the current study. SMOTETomek combines the SMOTE and Tomek links techniques to over- and under-sample the data classes simultaneously. It accomplishes the mentioned task by discarding the majority class links until both classes reach an equal number of entities. Even though the SMOTE technique alone can mitigate the imbalanced data class distribution issue, it skews the class distributions. Since, in most real-world smart meter datasets, the clusters formed by different data classes are not well defined, a set of samples belonging to the minority or majority class is expected to be dominated during the SMOTE technique’s oversampling period. Consequently, feeding such biased data to the learning classifier will lead to model overfitting.

On the other hand, SMOTETomek not only helps in producing a well-defined data class distribution but also balances the data class clusters. The data class distribution for the current study before and after using SMOTETomek is shown in Fig. 7.
Fig. 7. Data class distribution before and after using the SMOTETomek technique.
Fig. 7 shows that, before applying SMOTETomek, genuine consumers are significantly higher in number than those engaged in fraud. In contrast, both the classes are well balanced after employing the proposed technique.

2.2.2. Feature engineering

In this section, the proposed feature engineering process is discussed in detail. Feature engineering is the process of extracting and selecting the most important features from the given data, typically done to enhance the ML model’s learning ability. It is important to note that the dataset acquired from the smart meters lacks statistical characteristics. For a theft detection model to be efficient, the features fed to the model must appropriately reflect the underlying abnormalities in consumers’ consumption data. Therefore, the additional characteristics of the provided dataset are extracted using the feature extraction and selection process. In this study, both the tasks are accomplished using the FRESH algorithm, which extracts and selects useful features from the given balanced dataset. For ease of computation, the FRESH algorithm authors have developed a standardized Python-based package called ‘‘tsfresh’’, which makes use of the FRESH algorithm within its framework. The source code and GitHub page of the tsfresh package can be found in the link provided in Christ (2018). A complete list of extracted features and their mathematical description can be found in Christ et al. (2016), while the simplified pictorial version of the feature extraction and selection process employing the FRESH algorithm is presented in Fig. 8.

Fig. 8. Feature extraction and selection process using the FRESH algorithm.

The FRESH algorithm implementation using the tsfresh module is carried out in two steps, as depicted in Fig. 8. Initially, 794 features are extracted automatically from each consumer’s consumption data using more than 60 time-series characterization methods. These extracted features can be broadly classified into temporal, statistical and spectral domains, as depicted in Fig. 9.

Features such as entropy, zero-crossing points, spectral variation, Mel-Frequency Cepstral Coefficients (MFCC), skewness, kurtosis, trend, linear and non-linear characteristics, correlation, and various statistical test-based features provide in-depth knowledge of each consumer’s consumption sample. Due to space limitations, not all the extracted features are shown in Fig. 9. For the interested reader, as mentioned above, the detailed documentation of each feature, along with the source code for its implementation, can be found on the webpage provided by the authors (Christ, 2018).

In the second step, the derived features and consumers’ actual consumption data are combined to select only the highly important features. This selection is made by initially arranging the features in descending order based on their significance, gauged through various statistical tests. Afterwards, the Benjamini and Yekutieli (2001) procedure is employed, which sets a threshold for the feature selection criteria; thus, the features with a negligible contribution to the target variable are discarded automatically.

Since the feature set selected by the FRESH algorithm contains diverse data points scattered over a wide range, the features with higher magnitudes will cause bias during the model training. Therefore, it is crucial to standardize the accumulated features to a common scale. The current study utilizes the feature-wise Min–Max data standardization method to overcome the mentioned issue. Min–Max converts each numerical attribute to the range of 0 to 1 by using the following mathematical expression.

\[ f(x_i) = \frac{x_i - \min(X)}{\max(X) - \min(X)} \tag{1} \]

where X is a vector composed of the daily electricity consumption values x_i, while min(X) and max(X) are the minimum and maximum values of X, respectively.

2.3. Stage-3: Model training and evaluation stage

In this section, the training and evaluation of the proposed NTL detection model are discussed in detail. For ease of understanding and interpretation, this section is divided into three sub-sections.

2.3.1. Performance evaluation metrics

In any supervised ML technique, the labeled data is initially provided to the learning classifier for training. The trained model is then evaluated for its ability to predict and generalize to un-labeled data efficiently. The performance of such models is assessed based on a number of performance evaluation metrics, such as those mentioned in Messinis and Hatziargyriou (2018). However, it is not feasible to evaluate and analyze all the metrics mentioned in the stated study; therefore, a few of the most important metrics such as accuracy (Acc), recall, confusion matrix (CM), precision (P), Cohen’s kappa coefficient (kappa), Matthews correlation coefficient (MCC), and F1score are utilized to evaluate the performance of the proposed classifier. The mathematical expressions for calculating the mentioned metrics are given in Eqs. (2)–(9).

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2} \]

\[ \text{Recall or Detection rate (DR)} = \frac{TP}{TP + FN} \tag{3} \]

\[ \text{False-positive rate} = FPR = \frac{FP}{FP + TN} \tag{4} \]

\[ \text{False-negative rate} = FNR = \frac{FN}{FN + TP} \tag{5} \]

\[ \text{Precision} = PR = \frac{TP}{TP + FP} \tag{6} \]

\[ \text{F1score} = 2 \cdot \frac{\text{Precision} \cdot \text{DR}}{\text{Precision} + \text{DR}} = \frac{2TP}{2TP + FP + FN} \tag{7} \]

\[ \text{Kappa} = \frac{\rho_o - \rho_e}{1 - \rho_e} \tag{8} \]

\[ MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{9} \]

where FP and TP denote the false positives and true positives, respectively, while FN and TN represent the false negatives and true negatives, respectively. ρo is the observed agreement between the predicted and actual labels (i.e., the accuracy), and ρe is the agreement expected by chance.
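To make Eqs. (2)–(9) concrete, the following minimal sketch computes all eight quantities from the four confusion-matrix counts. The function name theft_metrics and the example counts are hypothetical and are not taken from the study; the chance agreement ρe is computed in the usual way for Cohen's kappa, from the row and column totals of the confusion matrix. The sketch assumes every denominator is non-zero.

```python
import math


def theft_metrics(tp, tn, fp, fn):
    """Evaluate Eqs. (2)-(9) from confusion-matrix counts.
    Assumes every denominator below is non-zero."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total                      # Eq. (2)
    recall = tp / (tp + fn)                           # Eq. (3), detection rate
    fpr = fp / (fp + tn)                              # Eq. (4)
    fnr = fn / (fn + tp)                              # Eq. (5)
    precision = tp / (tp + fp)                        # Eq. (6)
    f1 = 2 * tp / (2 * tp + fp + fn)                  # Eq. (7)
    # Eq. (8): observed agreement vs. agreement expected by chance,
    # the latter taken from predicted and actual class totals.
    rho_o = accuracy
    rho_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (rho_o - rho_e) / (1 - rho_e)
    mcc = (tp * tn - fp * fn) / math.sqrt(            # Eq. (9)
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {"accuracy": accuracy, "recall": recall, "fpr": fpr, "fnr": fnr,
            "precision": precision, "f1": f1, "kappa": kappa, "mcc": mcc}


# Hypothetical counts for 100 theft and 100 genuine consumers.
scores = theft_metrics(tp=90, tn=95, fp=5, fn=10)
```

With these illustrative counts the accuracy is 0.925 and the kappa is 0.85; note that FNR is simply 1 minus the recall, which is why a high precision alone says little about missed theft cases.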
In addition to the appropriate selection of performance assessment metrics, the performance evaluation of the considered ML model on different test datasets is also important. Therefore, the k-fold cross-validation technique is recommended in most of the literature (Salman Saeed et al., 2020; Saeed et al., 2019). In the k-fold cross-validation technique, the entire dataset is initially divided into k folds. Afterwards, in each of k rounds, k−1 folds are used to train the model and the remaining fold is used for validation, so that every fold serves as the validation set exactly once. Finally, the outcomes of all the considered evaluation metrics are averaged to depict the performance of the learning classifier.

2.3.2. CatBoost classification algorithm: Theoretical background and its implementation in current classification problem

In this study, the CatBoost classification algorithm is utilized for model training and evaluation. CatBoost is a refined version of the GBDTs, which utilizes a complex ensemble learning technique based on the gradient descent framework. During model training, a set of decision trees (DTs) is constructed sequentially such that each subsequent tree decreases the loss. In other words, each DT learns from the preceding tree and influences the next tree to boost the model performance, thus building a strong learner. The CatBoost algorithm differs from the rest of the GBDTs in two prominent features, i.e., efficient handling of categorical features and ordered boosting (Prokhorenkova et al., 2018). The learning classifiers handle numerical features quite efficiently during the model training phase; however, interpreting categorical features is complicated for them. Therefore, in conventional approaches, categorical features are transformed into useful information using the one-hot encoding technique (Daniele, 2001) or gradient statistics (Ke et al., 2017). In the former technique, each category of the original categorical features is replaced by binary values, while in the latter technique, an estimated value is generated by using gradient statistics to replace the original categorical feature at each boosting step. Nevertheless, in the case of categorical features with high repeatability, both the mentioned techniques require large memory and other computational resources. To avoid the mentioned problem, the CatBoost algorithm utilizes efficient modified target-based statistics to appropriately handle the categorical features during training time, thus saving considerable computational time and resources.

Another important aspect of the CatBoost algorithm is its ordered boosting mechanism. In traditional GBDTs, all the training samples are provided to construct a prediction model after executing several boosting steps. This approach causes a prediction shift in the constructed model, which consequently leads to a special kind of target leakage problem. The CatBoost algorithm avoids the stated issue by utilizing the ordered boosting framework. Furthermore, contrary to the conventional learning classifiers, the CatBoost algorithm elegantly handles the overfitting issue by using several permutations of the training dataset; this is the key motivation behind utilizing it in the current study.

For the effective implementation of the proposed CatBoost algorithm in the current NTL detection problem, the designed model is initially trained on the data developed in Stage-2. Afterward, a 10-fold cross-validation (CV) technique employing the considered performance metrics is utilized for performance evaluation of the designed model. The corresponding outcomes are depicted in Table 2.

Table 2
10-folds cross-validation results achieved using the proposed model.

As can be seen from Table 2, the CatBoost model attained an average accuracy and precision of 0.9338 and 0.9508 with a standard deviation (SD) of 0.0029 and 0.0035, respectively. It is essential to mention that in almost all data-oriented NTL detection systems, accuracy and precision are two of the most widely used metrics. Nevertheless, these metrics cannot be considered a conclusive measure to assess the performance of NTL detection classifiers. For example, precision is a critical performance metric; however, it lacks significant information regarding false-negative (FN) instances. The FN value represents consumers who are involved in theft yet classified as genuine; hence a failure of this kind can cause permanent financial loss.

For that reason, the proposed approach’s performance is further validated by computing the recall, F1score, kappa, and MCC. The recall or detection rate (DR) value specifies a classifier’s hit rate in accurately classifying the theft instances. The proposed technique attained a high average DR value of 0.9237 with a standard deviation (SD) of 0.0033. On the other hand, MCC is a more balanced and informative statistical metric, which provides a high score only if the prediction has achieved good scores in all four confusion matrix categories. The MCC score ranges from −1 (total conflict between outcome and observation) to +1 (perfect prediction). The average value of MCC attained in this study is 0.8677 with an SD of 0.0059, which implies that the proposed technique correctly classifies most of the theft and genuine cases from the provided dataset.

2.3.3. Proposed model’s outcomes interpretability using the tree-SHAP algorithm

In this section, the proposed theft detection model outcomes or predictions are interpreted using Shapley values computed by the tree-SHAP algorithm. The Shapley values assist in opening the black-box ML model outcomes extensively. These values provide a solution for fairly assigning the gains and costs to several features working in alliance for predicting the model outcomes. In simple words, these values assist in explaining how the model has reached a decision for a particular prediction. In this study, the Shapley values are computed using a recently introduced technique called tree-SHAP, developed by Lundberg et al. (2020). The tree-SHAP algorithm is specially designed for tree-based models and ensemble gradient boosted machines. One of the important features of this algorithm is that it computes the local feature interaction, which in turn facilitates the interpretation of the global model structure for each prediction. A detailed explanation and source code of the tree-SHAP technique can be found on the SHAP webpage (https://shap.readthedocs.io/). Fig. 10 shows the summary plot generated by the tree-SHAP algorithm that helps in interpreting the predicted outcomes of the proposed theft detection model.

The summary plot shown in Fig. 10 plots the consumers’ extracted features against the computed Shapley values. The Shapley values are computed for each feature value of every consumer and plotted against the selected features in order to evaluate their impact on the model outcome. Since it is quite challenging to show all the features and their corresponding Shapley values in the summary plot, only the 20 most essential features are depicted in ascending order based on their significance in predicting the model outcomes. For example, the entropy feature attained the highest importance in terms of predicting the target variable, as shown in Fig. 10. It implies that most of the consumers with high entropy values (i.e., red color) obtain a positive SHAP value, thus impacting the model outcomes positively. Further aspects of interpreting the ML model using the SHAP technique can be found in Molnar (2018).

3. Comparative analysis of proposed method with conventional ML classification methods

In this section, the performance of the proposed theft detection framework is compared against the latest GBDTs and other well-known conventional ML models under an identical feature set. The 10-fold cross-validation technique is employed in conjunction with the five most widely utilized performance metrics, i.e., precision, accuracy, F1score, Kappa, and MCC, to evaluate the performance of all studied classifiers. The proposed framework is implemented sequentially on an 8th-generation Intel Core i5 machine with 8 GB of RAM. It took approximately 280 s for the model training and testing, while the feature extraction and selection process took around 600 s. Since the classifier utilized in the proposed framework is a modified variant of tree-based models, its performance is compared with other tree-based models such as RF, ET, AdaBoost, XGBoost, and LightGBM. The outcomes of this comparison are depicted in Fig. 11.

As evident from Fig. 11, the proposed technique outperforms all the conventional ML methods in terms of accuracy, recall, precision, F1score, Kappa, and MCC, thus proving its effectiveness and significance. Another performance evaluation-based comparison of the proposed method with a few of the well-known conventional ML methods is made on identical performance evaluation metrics. The corresponding outcomes are depicted in Fig. 12. Once again, the proposed method’s performance superiority can be observed from the outcomes depicted in Fig. 12. It achieves an accuracy, recall, precision, F1score, Kappa, and MCC of 93.38%, 92%, 95%, 93.7%, and 87%, respectively, which are significantly higher than those of all the competing models.
Fig. 10. SHAP value of the proposed model. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this
article.)
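The Shapley values plotted in the summary plot can be understood through a brute-force sketch that enumerates every feature coalition. This is only an illustration of the definition: tree-SHAP obtains the same kind of attribution far more efficiently for tree ensembles, whereas the enumeration below is exponential in the number of features. The toy model, the choice of replacing absent features with an all-zero baseline, and all names are assumptions made for the example.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, n_features):
    """Exact Shapley values for a coalition value function that maps a
    frozenset of present feature indices to a model output."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [p for p in players if p != i]
        for size in range(len(others) + 1):
            # Probability weight of a coalition of this size preceding i.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical model with one additive term and one interaction term.
def model(x):
    return 2 * x[0] + 3 * x[1] * x[2]


sample, baseline = (1.0, 1.0, 1.0), (0.0, 0.0, 0.0)


def coalition_value(present):
    # Assumption: features outside the coalition are set to the baseline.
    mixed = [sample[i] if i in present else baseline[i] for i in range(3)]
    return model(mixed)


phi = shapley_values(coalition_value, 3)
```

In this toy case the additive feature receives its full contribution of 2, the two interacting features share their joint contribution of 3 equally, and the attributions sum to the difference between the prediction for the sample and the prediction for the baseline.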
4. Conclusion

In this paper, a novel feature engineered CatBoost-based NTL detection framework is developed. At the initial stage of the proposed NTL detection framework, the missing slots in the acquired data set were imputed using the kNN missing value imputer. To avoid the data class imbalance, the SMOTETomek algorithm was utilized, which simultaneously over- and under-samples the data classes. The FRESH algorithm was utilized at the later stage to extract and select the most relevant features from the acquired smart meter data set, which consequently led to lowering the computational time and enhancing the proposed classifier’s learning capability. To classify the data into genuine and theft consumers, the CatBoost algorithm was employed. Finally, the model’s decision for a particular outcome was interpreted using the tree-SHAP algorithm. To prove the proposed framework’s superior classification performance, its performance was compared with that of the latest gradient boosted machines and traditional ML models based on a few of the well-known performance evaluation metrics. The proposed technique outperformed all the considered competing algorithms and achieved 93% accuracy, 92% recall and 95% precision.
Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors agree with this submission. They contributed equally to the manuscript and its revision.

References

Adil, M., Javaid, N., Qasim, U., Ullah, I., Shafiq, M., Choi, J.-G., 2020. LSTM and bat-based RUSBoost approach for electricity theft detection. Appl. Sci. 10 (12), 4378. http://dx.doi.org/10.3390/app10124378, 2020-06-25.
Badrinath Krishna, V., Iyer, R.K., Sanders, W.H., 2016. ARIMA-Based Modeling and Validation of Consumption Readings in Power Grids. Springer International Publishing, pp. 199–210.
Batista, G.E., Prati, R.C., Monard, M.C., 2004a. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor. Newsl. 6 (1), 20–29.
Batista, G.E.A.P.A., Prati, R.C., Monard, M.C., 2004b. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor. Newsl. 6 (1), 20–29. http://dx.doi.org/10.1145/1007730.1007735.
Benjamini, Y., Yekutieli, D., 2001. The control of the false discovery rate in multiple testing under dependency. Ann. Statist. 1165–1188.
Buzau, M.-M., Tejedor-Aguilera, J., Cruz-Romero, P., Gomez-Exposito, A., 2018a. Detection of non-technical losses using smart meter data and supervised learning. IEEE Trans. Smart Grid 1. http://dx.doi.org/10.1109/tsg.2018.2807925.
Buzau, M.M., Tejedor-Aguilera, J., Cruz-Romero, P., Gómez-Expósito, A., 2018b. Detection of non-technical losses using smart meter data and supervised learning. IEEE Trans. Smart Grid 10 (3), 2661–2670.
Chen, L., Chee-Wooi, T., Shiyan, H., 2013. Strategic FRTU deployment considering cybersecurity in secondary distribution network. 4, (3), pp. 1264–1274. http://dx.doi.org/10.1109/tsg.2013.2256939.
Christ, M., 2018. tsfresh, python library for FRESH algorithm-Documentation webpage. https://tsfresh.readthedocs.io/en/latest/index.html. (Accessed).
Christ, M., Kempa-Liehr, A., Feindt, M., 2016. Distributed and parallel time series feature extraction for industrial big data applications. arXiv, vol. abs/1610.07717.
Daniele, M.-B., 2001. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. SIGKDD Explor. Newsl. 3 (1), 27–32. http://dx.doi.org/10.1145/507533.507538.
Ferreira, A.M.S., Cavalcante, C.A.M.T., Fontes, C.H.O., Marambio, J.E.S., 2013. A new method for pattern recognition in load profiles to support decision-making in the management of the electric sector. Int. J. Electr. Power Energy Syst. 53, 824–831. http://dx.doi.org/10.1016/j.ijepes.2013.06.001.
Gunturi, S.K., Sarkar, D., 2020. Ensemble machine learning models for the detection of energy theft. Electr. Power Syst. Res. http://dx.doi.org/10.1016/j.epsr.2020.106904.
Hasan, M., Toma, R.N., Nahid, A.-A., Islam, M.M., Kim, J.-M., 2019. Electricity theft detection in smart grid systems: a CNN-LSTM based approach. Energies 12 (17), 3310.
Hussain, S., Mustafa, M.W., Jumani, T.A., Baloch, S.K., Saeed, M.S., 2020. A novel unsupervised feature-based approach for electricity theft detection using robust PCA and outlier removal clustering algorithm. Int. Trans. Electr. Energy Syst. 30 (11), e12572.
Jaiswal, S., Ballal, M.S., 2020. Fuzzy inference based electricity theft prevention system to restrict direct tapping over distribution line. J. Electr. Eng. Technol. 15 (3), 1095–1106. http://dx.doi.org/10.1007/s42835-020-00408-7.
Jindal, A., Dua, A., Kaur, K., Singh, M., Kumar, N., Mishra, S., 2016. Decision tree and SVM-based data analytics for theft detection in smart grid. IEEE Trans. Ind. Inf. 12 (3), 1005–1016.
Joenssen, D.W., Bankhofer, U., 2012. Hot deck methods for imputing missing data. In: Machine Learning and Data Mining in Pattern Recognition. Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 63–75.
Jokar, P., Arianpoo, N., Leung, V.C.M., 2016. Electricity theft detection in AMI using customers’ consumption patterns. IEEE Trans. Smart Grid 7 (1), 216–226. http://dx.doi.org/10.1109/tsg.2015.2425222.
Ke, G., et al., 2017. Lightgbm: A highly efficient gradient boosting decision tree. In: Advances in Neural Information Processing Systems, Vol. 30. NIPS 2017. pp. 3146–3154.
Lundberg, S.M., et al., 2020. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2 (1), 56–67. http://dx.doi.org/10.1038/s42256-019-0138-9.
Messinis, G.M., Hatziargyriou, N.D., 2018. Review of non-technical loss detection methods. Electr. Power Syst. Res. 158, 250–266. http://dx.doi.org/10.1016/j.epsr.2018.01.005.
Molnar, C., 2018. A guide for making black box models explainable. URL: https://christophm.github.io/interpretable-ml-book.
Mwaura, F.M., 2012. Adopting electricity prepayment billing system to reduce non-technical energy losses in Uganda: Lesson from Rwanda. 23, pp. 72–79. http://dx.doi.org/10.1016/j.jup.2012.05.004.
Never, B., 2015. Social norms, trust and control of power theft in Uganda: Does bulk metering work for MSEs? Energy Policy 82, 197–206. http://dx.doi.org/10.1016/j.enpol.2015.03.020.
Northeast Group, 2017. Electricity Theft and Non-Technical Losses: Global Markets, Solutions and Vendors, 2017. Northeast Group, LLC, [Online]. Available: http://www.northeast-group.com/reports/Brochure-Electricity%20Theft%20&%20Non-Technical%20Losses%20-%20Northeast%20Group.pdf.
Passos Júnior, L.A., et al., 2016. Unsupervised non-technical losses identification through optimum-path forest. Electr. Power Syst. Res. 140, 413–423. http://dx.doi.org/10.1016/j.epsr.2016.05.036.
Pedregosa, F., et al., 2011. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A.V., Gulin, A., 2018. CatBoost: unbiased boosting with categorical features. pp. 6638–6648, https://arxiv.org/abs/1810.11363v1.
Punmiya, R., Choe, S., 2019. Energy theft detection using gradient boosting theft detector with feature engineering-based preprocessing. IEEE Trans. Smart Grid 10 (2), 2326–2329. http://dx.doi.org/10.1109/tsg.2019.2892595.
Roth, P.L., Switzer, F.S., 1995. A Monte Carlo analysis of missing data techniques in a HRM setting. J. Manage. 21 (5), 1003–1023. http://dx.doi.org/10.1177/014920639502100511.
Rusitschka, S., Eger, K., Gerdes, C., 2010. Smart Grid Data Cloud: A Model for Utilizing Cloud Computing in the Smart Grid Domain. IEEE, http://dx.doi.org/10.1109/smartgrid.2010.5622089.
Saad, M., Tariq, M.F., Nawaz, A., Jamal, M.Y., 2017. Theft Detection Based GSM Prepaid Electricity System. IEEE, http://dx.doi.org/10.1109/ccsse.2017.8087973.
Saeed, M.S., Mustafa, M.W., Sheikh, U.U., Jumani, T.A., Mirjat, N.H., 2019. Ensemble bagged tree based classification for reducing non-technical losses in multan electric power company of Pakistan. Electronics 8 (8), 860.
Salman Saeed, M., et al., 2020. An efficient boosted C5.0 decision-tree-based classification approach for detecting non-technical losses in power utilities. Energies 13 (12), 3242. http://dx.doi.org/10.3390/en13123242.
Troyanskaya, O., et al., 2001. Missing value estimation methods for DNA microarrays. Bioinformatics 17 (6), 520–525. http://dx.doi.org/10.1093/bioinformatics/17.6.520.
Tureczek, A.M., Nielsen, P.S., 2017. Structured literature review of electricity consumption classification using smart meter data. Energies 10 (5), 584.
Viegas, J.L., Esteves, P.R., Melício, R., Mendes, V.M.F., Vieira, S.M., 2017. Solutions for detection of non-technical losses in the electricity grid: A review. Renew. Sustain. Energy Rev. 80, 1256–1268. http://dx.doi.org/10.1016/j.rser.2017.05.193.
Winther, T., 2012. Electricity theft as a relational issue: A comparative look at Zanzibar, Tanzania, and the Sunderban Islands, India. Energy Sustain. Dev. 16 (1), 111–119. http://dx.doi.org/10.1016/j.esd.2011.11.002.
Xiao, Z., Xiao, Y., Du, D.H.-C., 2013. Exploring malicious meter inspection in neighborhood area smart grids. IEEE Trans. Smart Grid 4 (1), 214–226. http://dx.doi.org/10.1109/tsg.2012.2229397.
Yurtseven, Ç., 2015. The causes of electricity theft: An econometric analysis of the case of Turkey. Util. Policy 37, 70–78. http://dx.doi.org/10.1016/j.jup.2015.06.008.
Zhang, S., Zhang, J., Zhu, X., Qin, Y., Zhang, C., 2008. Missing value imputation based on data clustering. In: Transactions on Computational Science I. Springer, pp. 128–138.
Zheng, Z., Yang, Y., Niu, X., Dai, H.-N., Zhou, Y., 2018. Wide and deep convolutional neural networks for electricity-theft detection to secure smart grids. IEEE Trans. Ind. Inf. 14 (4), 1606–1615. http://dx.doi.org/10.1109/tii.2017.2785963.