Energy Reports 7 (2021) 4425–4436

Contents lists available at ScienceDirect

Energy Reports
journal homepage: www.elsevier.com/locate/egyr

Research paper

A novel feature engineered-CatBoost-based supervised machine learning framework for electricity theft detection

Saddam Hussain a, Mohd. Wazir Mustafa a, Touqeer A. Jumani b, Shadi Khan Baloch c, Hammad Alotaibi d, Ilyas Khan e,∗, Afrasyab Khan f

a School of Electrical Engineering, Universiti Teknologi Malaysia, Johor Bahru 81310, Malaysia
b Department of Electrical Engineering, Mehran University of Engineering and Technology SZAB Campus Khairpur, Mirs 66020, Pakistan
c Department of Mechatronics Engineering, Mehran University of Engineering and Technology Jamshoro, Sindh, 76062, Pakistan
d Department of Mathematics, College of Science, Taif University, P.O. Box, 11099, Taif, 21944, Saudi Arabia
e Department of Mathematics, College of Science Al-Zulfi, Majmaah University, Al-Majmaah 11952, Saudi Arabia
f Institute of Engineering and Technology, Department of Hydraulics and Hydraulic and Pneumatic Systems, South Ural State University, Lenin Prospect 76, Chelyabinsk, 454080, Russian Federation

article info

Article history:
Received 3 March 2021
Received in revised form 5 June 2021
Accepted 5 July 2021
Available online 24 July 2021

Keywords:
CatBoost algorithm
NTL detection
Smart meters
Feature engineering
Machine learning model interpretation

abstract

This paper presents a novel supervised machine learning-based electric theft detection approach using the feature engineered-CatBoost algorithm in conjunction with the SMOTETomek algorithm. Contrary to the previous literature, where the missing observations in data are either ignored or imputed with average values, this work utilizes the k-Nearest neighbor technique for missing data imputation; thus, an accurate and realistic estimation of the missing data is achieved. To mitigate the bias towards the majority data class, the proposed model utilizes the SMOTETomek algorithm, which neutralizes the mentioned effect by managing a proper balance between over-sampling and under-sampling techniques. The Feature Extraction and Scalable Hypothesis (FRESH) algorithm is utilized at the later stage of the proposed NTL detection framework to extract and select the most relevant data features from the provided dataset. Afterward, the model is trained using the CatBoost algorithm to classify the consumers into two distinct categories, i.e., genuine and theft. Finally, to interpret the model's decision for the corresponding predictions, the tree-SHAP algorithm is utilized. To validate the efficacy of the proposed ML based theft detection approach, its performance is compared with that of the traditional gradient boosting ML algorithms such as XGBoost, lightGBM, Ensemble bagging, boosting ML models, and other conventional ML models using five of the most widely used performance metrics, i.e., precision, accuracy, F1score, Kappa and MCC. The proposed technique achieved an accuracy of 93% and a detection rate of 92%, which is significantly higher than all the considered competing algorithms under identical dataset and hyperparameters.

© 2021 Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

∗ Corresponding author.
E-mail addresses: [email protected] (I. Khan), [email protected] (A. Khan).

https://doi.org/10.1016/j.egyr.2021.07.008
2352-4847/© 2021 Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

1.1. Background

The transmission and distribution (T&D) of electricity suffers from two major categories of losses, i.e., technical and non-technical. The technical losses account for the energy losses that occur in equipment that is essential for implementing the T&D of electricity. On the other hand, the non-technical losses (NTLs) in any power system account for power theft, billing irregularities, and corruption among utility workers. According to a report, utilities around the globe are losing approximately US$96 billion every year due to NTLs (Northeast Group, 2017). The mentioned scenario is depicted in Fig. 1, which shows the intensity of the NTL issue in different parts of the world.

Owing to such massive economic loss, the power utilities and researchers in the fields of data mining, computer science, and electrical engineering are trying several intelligent and effective methods to minimize NTLs. One of the efficient methods to counter the electric theft issue is the implementation of smart meters. Such energy meters can monitor and record the consumers' consumption data remotely and precisely and provide the information to the utility directly in case of any suspicious activity. However, despite the vast number of benefits, smart meters are not feasible for countries suffering from severe economic issues due to the huge expenditures associated with their implementation and operation. Furthermore, increasing cyber threats still
need to be addressed appropriately for the wide-scale implementation of such devices. In addition, the high-frequency data gathered from smart meters poses serious data storage and analysis issues. It was estimated that the volume of data obtained from two million consumers' smart meters might exceed 22 GB per day (Rusitschka et al., 2010). Therefore, it is extremely challenging to identify suspicious consumers' profiles from such a huge dataset.

Fig. 1. Non-technical loss in billion dollars country-wise.

The NTL detection approaches can be broadly classified into three major categories, i.e., theoretical, hardware, and non-hardware based methods (Viegas et al., 2017). The theoretical methods utilize the relationship between the socio-economic and demographic factors for framing the policies to counter NTLs (Winther, 2012; Never, 2015; Yurtseven, 2015; Mwaura, 2012). On the other hand, the hardware-based or state-based methods utilize physical instruments such as sensors, detection devices, transformers, and other electrical devices to detect NTLs (Chen et al., 2013; Xiao et al., 2013; Jaiswal and Ballal, 2020; Saad et al., 2017). In these methods, voltage, power, and current sensors are installed at various network nodes and trigger an alarm whenever malicious customers attempt to manipulate the actual grid characteristics at any network point. Despite an effortless working mechanism, these methods are not feasible for various power utilities due to additional maintenance and sensor deployment costs. Contrary to the hardware-based methods, the non-hardware-based energy-theft detection approaches do not require any additional NTL detection device. These methods are generally classified into two major categories, i.e., game-based and data-driven systems. In the former approach, the theft detection method is developed as a game between the power thief and the service provider using game theory. Even though these approaches require a comparatively lower cost, they pose a severe challenge in identifying the key position of players, offenders, regulating authorities, and distributors, which makes them too complex to implement. The second category of the non-hardware-based techniques is data-driven methods. These methods are further classified into unsupervised and supervised machine learning approaches. The former methods utilize a clustering approach to segment consumers' load profiles based on similarity or dissimilarity metric measures (Badrinath Krishna et al., 2016; Ferreira et al., 2013; Passos Júnior et al., 2016; Hussain et al., 2020).

On the other hand, the supervised or classification-based theft detection methods utilize pre-labeled data (i.e., “Genuine” and “Theft”) to train the model at the initial stage. Based on the information acquired from the training process, the model is made to classify the unlabeled data into the two mentioned categories, thus minimizing the expenses and labor of site inspections. Since this research work proposes a supervised ML-based approach, a detailed description of the most relevant literature in the mentioned research field is provided in the subsequent subsection.

1.2. Positioning of our work in literature

The supervised NTL detection methods generally face five major challenges, i.e., handling missing data values during data pre-processing, data class unbalancing, selecting the most relevant features, choosing an appropriate classifier, and interpreting the model's prediction. This subsection reviews the most relevant literature pertaining to the challenges mentioned above in conjunction with the significance of the current research work.

Jokar et al. (2016) presented a consumption pattern-based energy theft detection (CPBETD) algorithm to identify the malicious consumption patterns in a smart grid network. The proposed CPBETD algorithm was made to detect the high energy theft areas at the transformer level by utilizing the data collected from the various distribution transformer meters. In another study (Jindal et al., 2016), the authors developed a highly accurate energy theft detection framework by utilizing the support vector machine (SVM) intelligence in conjunction with the decision tree algorithm. Even though both studies proposed very effective theft detection frameworks, neither of them tackled the missing data issue. Furthermore, the authors in Tureczek and Nielsen (2017), after a detailed review of 34 research papers on theft detection based on supervised ML methods, concluded that only half of the considered articles had addressed the issue of missing data values. Since the current research work handles the mentioned problem in detail, it is essential to emphasize its repercussions.

The consumption data obtained through the smart meters is generally inconsistent and often contains null values. Several factors, such as smart meter malfunction, inaccurate estimation of data transferred, unplanned device repair, and storage problems, can be the root cause of this problem. It is extremely difficult for a learning classifier to handle and learn patterns from such data types. To overcome the stated issue in ML based classification methods, various data imputation strategies have been proposed in the literature, such as the hot deck imputation method (Joenssen and Bankhofer, 2012), data clustering based imputation (Zhang et al., 2008), and the Monte Carlo missing values imputation method (Roth and Switzer, 1995). Two of the most widely practiced solutions to counter this issue are to delete the missing entries from the original data (listwise or pairwise) or to impute the missing datapoints with mean values between the adjacent data entries, as witnessed in references (Buzau et al., 2018a; Adil et al., 2020). The mentioned data adjusting methods are elementary and reasonable; nevertheless, the former method produces a significant information loss while the latter provides noisy, inconsistent, and outlying data values. To overcome the stated issues, this study utilizes the k-Nearest neighbor-based imputer, which imputes the average value of a pre-selected number k of nearest neighbors in a given sample of data, thus providing more reliable estimates.

Another critical issue in smart meters' labeled data sets for NTL detection application is the data class unbalancing. It makes it difficult for the learning systems to learn the concept related to the minority class (theft cases), thus biasing ML models towards the majority samples. In order to achieve an effective and unbiased ML model performance, a balanced dataset is essentially required. Two of the prominent studies that have tackled the mentioned issue include Hasan et al. (2019) and Gunturi and Sarkar (2020). Both studies have utilized the Synthetic minority oversampling technique (SMOTE) to balance the data class with reasonable accuracy. Since the SMOTE algorithm oversamples the minority class randomly, it results in overfitting and low generalization ability of the model. In another study (Buzau et al., 2018b), the authors have utilized an under-sampling technique where few samples of the majority
class are removed to balance the data class. Such data balancing techniques are easy to implement; however, they may cause substantial data loss, resulting in lower accuracy of the developed NTL detection model. To avoid this issue, the current study utilizes an efficient statistical technique called SMOTETomek (Batista et al., 2004a). It combines the intelligence of SMOTE (oversampling) and Tomek link (under-sampling) to balance the data class distribution.

As discussed earlier in this section, the third critical issue in supervised NTL detection methods is the selection of the most relevant features for the model training. The efficiency of classification-based theft detection methods is highly dependent on the type of input features selected. Since the smart meters' data is generally high dimensional data containing many redundant and irrelevant features, it is essential to extract and select the most relevant features and discard the unnecessary ones. In this study, the mentioned issue is solved by using an efficient feature extraction and selection process. Feature extraction and selection is an effective practice for reducing the increased data dimensions and redundant information in ML-based NTL approaches. It is worthwhile to mention that unlike most of the ML-based NTL detection approaches in literature, where either a feature extraction or a selection process is adopted for model training, this research work utilizes both for acquiring highly relevant features from the considered smart meter dataset. The proposed approach utilizes the Feature Extraction and Scalable Hypothesis (FRESH) algorithm to accomplish the mentioned task. It does so by utilizing more than 60 time-series analytical methods to capture 794 features from each dataset sample. The extracted features are reduced to the 300 most relevant features through the Benjamini–Yekutieli statistical test. The resulting final set of features is a combination of essential user consumption and newly extracted features.

Once the feature engineering process is completed, the next challenge is to select an appropriate classifier for efficiently segregating the genuine and theft consumers. In this study, the CatBoost algorithm is utilized for the model training due to its efficient handling of the categorical features. These categorical features are handled during the pre-processing phase in most of the traditional ML models, which consequently increases the computational time and complexity. On the other hand, CatBoost efficiently handles these features during the training process, thus avoiding the mentioned problems faced by conventional classifiers. Furthermore, it utilizes ordered boosting, which avoids the prediction shift problem faced by XGBoost and its variants. Also, by enabling the overfitting detector feature in its framework, the trained model can achieve an improved generalization ability.

Another important aspect of the proposed theft detection model is the interpretability of its outcomes. Mostly, site inspections are initiated on the list of suspected consumers generated by the trained model on genuine and theft consumers' data. However, a model's prediction to place the consumer in a particular category based on a given input feature set is not justified logically. Nevertheless, few studies in literature, such as Batista et al. (2004b) and Christ (2018), have employed simplistic decision tree diagrams to interpret the model outcomes. However, the latest state-of-the-art theft detection models employing deep learning, gradient boosting machines and ensemble ML techniques incorporate a diverse range of complex prediction strategies, making them extremely difficult to comprehend through simplistic tree diagrams. To deal with the mentioned issue, tree-SHapley Additive exPlanations (tree-SHAP) is utilized in the current study. It assists in opening the black-box ML model's outcomes by explaining how the model reached a decision for a particular prediction.

It is fair to mention and highlight the studies in the literature that are most relevant to the current research work. One such study was carried out by Gunturi and Sarkar (2020), where the authors developed an ensemble machine learning-based theft detection model. In another study, Punmiya and Choe (2019) proposed a gradient boosted theft detector framework, which employs the latest XGBoost, lightGBM, and CatBoost for model training. The current study differentiates itself from the mentioned research works in its novel data class balancing and feature engineering approach. Furthermore, unlike the quoted studies where the model's outcome interpretability was not evaluated, this research work utilizes the tree-SHAP algorithm to accomplish the mentioned task.

Concluding the detailed discussion, the list of steps executed sequentially in order to accomplish the proposed supervised ML-based NTL detection framework is presented as follows.

i. k-Nearest Neighbors imputation technique is employed to handle the missing and erroneous data values in the acquired dataset.
ii. SMOTE-Tomek based resampling technique is utilized to tackle the data class imbalance issue.
iii. The FRESH algorithm is used to extract and select the most relevant statistical features from raw smart meter data.
iv. The implementation of the state-of-the-art CatBoost algorithm and its comparative analysis with other well-known ML classifiers is carried out for identifying the NTLs.
v. Interpretation of the model outcomes is performed through the tree-SHAP algorithm.
vi. To validate the effectiveness of the proposed theft detection framework, an extensive performance evaluation is made based on five of the most widely utilized performance metrics.
vii. The proposed NTL framework achieves the highest detection rate and the lowest false positive rates among all the compared algorithms.

The rest of the paper is divided into three sections. Section 2 presents the proposed research methodology and is further sub-categorized to discuss the CatBoost algorithm's theoretical background, considered performance metrics, and proposed framework results and interpretations. In Section 3, the proposed model's comparative analysis against the latest gradient boosting decision trees (GBDTs) and traditional ML models is discussed in detail. Finally, the conclusion is made in Section 4 of this research work.

2. Research methodology

In this section, the proposed NTL detection framework is presented. The overall framework is broadly classified into three major stages, i.e., the data pre-processing stage, the feature engineering stage, and the model training-testing and interpretation stage. Each of the stages is depicted in Fig. 2 and described in detail in the subsequent subsections.

Fig. 2. Proposed theft detection framework.

2.1. Stage-1: Data pre-processing stage

Data pre-processing is required to transform the raw data into a meaningful data structure. The electricity consumption data acquired from the State Grid Corporation of China (SGCC) (Zheng et al., 2018) is used for testing the efficacy of the proposed theft detection model. Table 1 presents the metadata information of the acquired dataset.

As presented in Table 1, the dataset covers the daily electricity consumption of 42372 consumers for approximately 1035 days (2014-Jan to
2016-Oct). It comprises 91.46% genuine and 8.54% theft consumers. Fig. 3 and Fig. 4 depict the electric power consumption patterns for a few of the theft and the genuine consumers, respectively. It can be observed from the mentioned figures that the consumption patterns of the theft consumers are highly irregular and exhibit low periodicity. On the other hand, the patterns for the genuine consumers are periodic and exhibit a correlation between the identical periods of consecutive years.

Table 1
Statistics of obtained SGCC data.
Description | Value
The time window for electricity consumption | 2014-01-01 to 2016-10-31 (1035 days)
Number of total consumers | 42372
Number of electricity thieves | 3615
Number of genuine consumers | 38757
Total data records | 42372 * 1035 = 43855020

To check the missing information in the data, the NaN values were computed for each consumer. It was found that 25.6% of the 43855020 data entries contain NaN or missing values, which is significantly high for a data set in the field of data mining. The distribution of the computed null values is shown as a histogram in Fig. 5. The histogram bar values depict the number of consumers falling in each missing-values range.

The computed histogram illustrates that 22.6% of total consumers have more than 700 missing values per consumer. Correctly estimating these consumers' missing data values is extremely challenging since a significant portion of the information is unavailable in the acquired dataset. Therefore, the viable option left is to drop such highly inadequate entries from the dataset. The missing values in the remaining consumers are imputed using the kNN interpolation technique (Troyanskaya et al., 2001). The kNN is a non-parametric and lazy learner algorithm that matches an observation in multidimensional space to its k nearest neighbors. The kNN's capability of dealing with almost all types of missing data makes it a suitable candidate for the missing value imputation. It accomplishes the imputation task by utilizing the Euclidean distance metric to initially find the consumer's k nearest neighbors and then imputes the missing feature value using the mean of the selected k neighbors. The current study utilizes the KNN-imputer module available in the Scikit-learn ML package to impute the missing data slots (Pedregosa et al., 2011). A few random consumers' consumption samples are plotted to visualize the newly imputed values in consumers' consumption data, as shown in Fig. 6.

2.2. Stage-2: Data class balancing and feature engineering

This stage is further divided into two sub-stages, i.e., data class balancing and feature engineering, as depicted in Fig. 2. Each of the mentioned sub-stages is explained in subsequent subsections.

2.2.1. Data class balancing

For an efficient and unbiased classifying performance of a supervised ML classifier, it is essential to extract and select the most suitable features from a balanced dataset. Since the considered smart meter dataset for the current study is unbalanced, as occurs in most NTL detection data sets, it is necessary to balance the class distribution before the feature extraction and selection process. In order to solve this issue, the SMOTETomek (Batista et al., 2004b) algorithm is utilized in the current study. SMOTETomek combines the intelligence of SMOTE and Tomek links techniques to over- and under-sample data classes simultaneously. It accomplishes the mentioned task by discarding the majority class links until both classes reach an equal number of entities. Even though the SMOTE technique alone can mitigate the imbalanced data class distribution issue, it skews the class distributions. In most of the real-world smart meter datasets, clusters formed by different data classes are not well expressed. Therefore, a set of samples belonging to the minority or majority class is expected to be dominated during the SMOTE technique's oversampling period. Consequently, feeding such biased data to the learning classifier will lead to model overfitting.

On the other hand, SMOTETomek not only helps in producing a well-defined data class distribution, but also generates data class clusters equally. The data class distribution for the current study before and after using SMOTETomek is shown in Fig. 7.

Fig. 3. Electric consumption samples of consumers involved in power theft.

Fig. 4. Electric consumption samples of genuine consumers.

Fig. 5. Histogram of missing values present in SGCC dataset.
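
The missing-value audit summarised in Fig. 5 can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the authors' code: the array contents, the histogram bin edges, and the helper name `drop_sparse_consumers` are assumptions made for the example; only the cut-off of 700 missing values per consumer is taken from the text.

```python
import numpy as np

def drop_sparse_consumers(X, max_missing=700):
    """Return (kept rows, boolean keep-mask, per-row NaN counts).

    X is a (consumers x days) array of daily readings with NaN for gaps.
    Rows with more than `max_missing` NaNs are dropped.
    """
    nan_counts = np.isnan(X).sum(axis=1)
    keep = nan_counts <= max_missing
    return X[keep], keep, nan_counts

# Synthetic stand-in for the consumption matrix: 6 consumers x 1035 days.
rng = np.random.default_rng(0)
X = rng.gamma(shape=2.0, scale=3.0, size=(6, 1035))
X[0, :800] = np.nan    # 800 gaps: above the cut-off, dropped
X[1, :700] = np.nan    # exactly 700 gaps: kept
X[2, 10:20] = np.nan   # 10 gaps: kept

X_kept, keep, nan_counts = drop_sparse_consumers(X)

# Overall share of missing entries and a histogram of gaps per consumer
# (bin edges are illustrative, not those of Fig. 5).
missing_share = np.isnan(X).mean()
hist, edges = np.histogram(nan_counts, bins=[0, 100, 300, 500, 700, 1035])
```

The rows that survive this filter are the ones passed on to the imputation step.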


Fig. 6. Missing value imputation using the K-nearest neighbor technique.
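
As a rough illustration of the imputation step shown in Fig. 6, the sketch below applies scikit-learn's `KNNImputer` to a toy matrix. The data and the choice `n_neighbors=2` are assumptions for the example; the value of k used in the study is not stated in this section. Note that scikit-learn measures distance with a NaN-aware Euclidean metric, which skips coordinates that are missing in either row.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy consumption matrix: rows = consumers, columns = days (NaN = missing).
X = np.array([
    [1.0, 2.0, np.nan],   # consumer with one missing reading
    [1.0, 2.0, 3.0],      # close neighbour
    [1.1, 2.1, 5.0],      # close neighbour
    [9.0, 9.0, 9.0],      # distant consumer
])

# k = 2 is an illustrative choice, not the paper's setting.
imputer = KNNImputer(n_neighbors=2, weights="uniform", metric="nan_euclidean")
X_imputed = imputer.fit_transform(X)

# The gap is filled with the mean of the two nearest consumers' values
# for that day: (3.0 + 5.0) / 2.
filled_value = X_imputed[0, 2]
```

With uniform weights the imputed value is the plain mean of the selected neighbours, which matches the description in the text; distance-weighted averaging is also available via `weights="distance"`.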

Fig. 7. Data class distribution before and after using the SMOTETomek technique.
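
To make the resampling behind Fig. 7 concrete, the following is a from-scratch NumPy sketch of the two ingredients: SMOTE-style interpolation between minority samples, followed by removal of Tomek links (pairs of mutual nearest neighbours with different labels). It is not the implementation used in the study, and all function names and the synthetic data are hypothetical. In practice, the imbalanced-learn package provides a `SMOTETomek` class for this purpose. Here both members of each Tomek link are removed, which is one common cleaning variant and an assumption of this sketch.

```python
import numpy as np

def pairwise_dist(A, B):
    """Euclidean distance matrix between rows of A and rows of B."""
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))

def smote(X_min, n_new, k=3, rng=None):
    """Create n_new synthetic points, each on the segment between a random
    minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    d = pairwise_dist(X_min, X_min)
    np.fill_diagonal(d, np.inf)                # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    nb = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))               # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[nb] - X_min[base])

def tomek_link_mask(X, y):
    """Boolean mask of samples that are part of a Tomek link."""
    d = pairwise_dist(X, X)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)                      # nearest neighbour of each sample
    mutual = nn[nn] == np.arange(len(X))       # i and nn[i] point at each other
    return mutual & (y != y[nn])

def smote_tomek(X, y, minority=1, k=3, seed=0):
    """Oversample the minority class to parity, then drop Tomek links."""
    X_min = X[y == minority]
    n_new = int((y != minority).sum() - (y == minority).sum())
    X_res = np.vstack([X, smote(X_min, n_new, k=k, rng=seed)])
    y_res = np.concatenate([y, np.full(n_new, minority)])
    keep = ~tomek_link_mask(X_res, y_res)
    return X_res[keep], y_res[keep]

# Synthetic, imbalanced two-class data: 40 "genuine" (0) vs 8 "theft" (1).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, (40, 2)), rng.normal(2.0, 1.0, (8, 2))])
y = np.array([0] * 40 + [1] * 8)

X_bal, y_bal = smote_tomek(X, y)
counts = np.bincount(y_bal)                    # samples per class after resampling
```

Because each Tomek link contains exactly one sample from each class, removing both members keeps the two classes at equal size after the oversampling step.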

Fig. 7 shows that, before applying SMOTETomek, genuine consumers are significantly higher in number than those engaged in fraud. In contrast, both the classes are well balanced after employing the proposed technique.

2.2.2. Feature engineering

In this section, the proposed feature engineering process is discussed in detail. Feature engineering is the process of extracting and selecting the most important features from given data, typically done to enhance the ML model's learning ability. It is important to note that the dataset acquired from the smart meters lacks statistical characteristics. For a theft detection model to be efficient, features fed to the model must appropriately reflect underlying abnormalities in consumers' consumption data. Therefore, the additional characteristics of the provided dataset are extracted using the feature extraction and selection process. In this study, both the tasks are accomplished using the FRESH algorithm, which simultaneously extracts and selects useful features from the given balanced dataset. For ease in computation, the FRESH algorithm authors have developed a standardized Python-based package called “ts-fresh”, which makes use of the FRESH algorithm within its framework. The source code and GitHub page of the ts-fresh package can be found in the link provided in Christ (2018). A complete list of extracted features and their mathematical description can be found in Christ et al. (2016),
while the simplified pictorial version of the feature extraction and selection process employing the FRESH algorithm is presented in Fig. 8.

Fig. 8. Feature extraction and selection process using the FRESH algorithm.

The FRESH algorithm implementation using the ts-fresh module is carried out in two steps, as depicted in Fig. 8. Initially, 794 features are extracted automatically from each consumer's consumption data using more than 60 time-series characterization methods. These extracted features can be broadly classified into temporal, statistical and spectral domains as depicted in Fig. 9.

Features such as entropy, zero-crossing points, spectral variation, Mel-Frequency Cepstral Coefficients (MFCC), skewness, kurtosis, trend, linear and non-linear characteristics, correlation, and various statistical test-based features provide in-depth knowledge of each consumer consumption sample. Due to space limitations, not all the extracted features are shown in Fig. 9; for the interested reader, as mentioned above, the detailed documentation of each feature along with source code for its implementation can be found on the webpage provided by the authors (Christ, 2018).

In the second step, the derived features and consumers' actual consumption data are combined to select only highly important features. This selection is made by initially arranging the features in descending order based on their significance gauged through various statistical tests. Afterwards, the Benjamini and Yekutieli (2001) procedure is employed, which sets a threshold for the feature selection criteria; thus, the features with negligible contribution to the target variable are discarded automatically.

Since the feature set selected by the FRESH algorithm contains diverse data points scattered over a wide range, the features with higher magnitudes will cause bias during the model training. Therefore, it is crucial to standardize the accumulated features to a common scale. The current study utilizes the feature-wise Min–Max data standardization method to overcome the mentioned issue. Min–Max converts each numerical attribute to the range of 0 to 1 by using the following mathematical expression.

\[ f(x_i) = \frac{x_i - \min(X)}{\max(X) - \min(X)} \qquad (1) \]

where X is a vector composed of the daily electricity consumption values x_i, while min(X) and max(X) are the minimum and maximum values of X, respectively.

2.3. Stage-3: Model training and evaluation stage

In this section, the training and evaluation of the proposed NTL detection model are discussed in detail. For ease of understanding and interpretation, this section is divided into three sub-sections.

2.3.1. Performance evaluation metrics

In any supervised ML technique, labeled data is initially provided to the learning classifier for training. The trained model is then evaluated for its ability to predict and generalize the un-labeled data efficiently. The performance of such models is assessed based on a number of performance evaluation metrics, such as those mentioned in Messinis and Hatziargyriou (2018). However, it is not feasible to evaluate and analyze all the metrics mentioned in the stated study; therefore, a few of the most important metrics such as accuracy (Acc), recall, confusion matrix (CM), precision (P), Cohen's kappa coefficient (kappa), Matthews correlation coefficient (MCC), and F1score are utilized to evaluate the performance of the proposed classifier. The mathematical expressions for calculating the mentioned metrics are given in Eqs. (2)–(9).

\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (2) \]

\[ \text{Recall or Detection rate (DR)} = \frac{TP}{TP + FN} \qquad (3) \]

\[ \text{False-positive rate} = FPR = \frac{FP}{FP + TN} \qquad (4) \]

\[ \text{False-negative rate} = FNR = \frac{FN}{FN + TP} \qquad (5) \]

\[ \text{Precision} = PR = \frac{TP}{TP + FP} \qquad (6) \]

\[ \text{F1score} = 2 \cdot \frac{\text{Precision} \cdot DR}{\text{Precision} + DR} = \frac{2TP}{2TP + FP + FN} \qquad (7) \]

\[ \text{Kappa} = \frac{\rho_o - \rho_e}{1 - \rho_e} \qquad (8) \]

\[ MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (9) \]

where FP and TP denote the false positives and true positives, respectively, while FN and TN represent the false negatives and true negatives, respectively. DR is the detection rate (recall) of Eq. (3), \rho_o is the observed agreement between the predicted and actual labels, and \rho_e is the agreement expected by chance.
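
As a worked example of the metrics in Eqs. (2)–(9), the sketch below computes them from the four confusion-matrix counts. The counts are hypothetical and are not results from the paper; the function name `theft_metrics` is likewise illustrative. For Cohen's kappa, the sketch uses the standard definitions: the observed agreement equals the accuracy, and the chance agreement is obtained from the row and column totals of the confusion matrix.

```python
import math

def theft_metrics(tp, tn, fp, fn):
    """Binary classification metrics from confusion-matrix counts."""
    n = tp + tn + fp + fn
    accuracy = (tp + tn) / n
    recall = tp / (tp + fn)                    # detection rate (DR)
    fpr = fp / (fp + tn)
    fnr = fn / (fn + tp)
    precision = tp / (tp + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    # Cohen's kappa: observed vs chance agreement.
    p_o = accuracy
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (p_o - p_e) / (1 - p_e)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy, "recall": recall, "fpr": fpr, "fnr": fnr,
        "precision": precision, "f1": f1, "kappa": kappa, "mcc": mcc,
    }

# Hypothetical counts for 200 test consumers (100 theft, 100 genuine).
m = theft_metrics(tp=90, tn=85, fp=15, fn=10)
```

For these counts the accuracy is 0.875, the detection rate is 0.9, and kappa is 0.75. Kappa and MCC are lower than the accuracy because both correct for agreement that would occur by chance.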

Fig. 9. Extracted features using the FRESH algorithm.
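
The selection and scaling steps of the feature engineering stage can be illustrated independently of any feature-extraction package. The sketch below is a minimal NumPy version, under stated assumptions: the p-values are made up (in the study they come from per-feature significance tests inside the FRESH algorithm), the false discovery rate level of 0.05 is an illustrative choice, and the function names are hypothetical. The first function applies the Benjamini–Yekutieli step-up threshold; the second applies the feature-wise Min–Max formula of Eq. (1).

```python
import numpy as np

def benjamini_yekutieli(p_values, fdr_level=0.05):
    """Boolean mask of features kept by the BY step-up procedure."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                        # most significant first
    c_m = np.sum(1.0 / np.arange(1, m + 1))      # harmonic correction term
    thresholds = np.arange(1, m + 1) * fdr_level / (m * c_m)
    passed = p[order] <= thresholds
    selected = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])        # largest rank meeting its threshold
        selected[order[:k + 1]] = True
    return selected

def min_max_scale(X):
    """Column-wise Min-Max scaling to [0, 1], as in Eq. (1)."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)       # constant columns map to 0
    return (X - lo) / span

# Made-up p-values for five candidate features.
p_values = [0.001, 0.8, 0.004, 0.03, 0.5]
selected = benjamini_yekutieli(p_values)

# Toy feature matrix (3 consumers x 2 features) on very different scales.
features = np.array([[1.0, 10.0], [3.0, 20.0], [5.0, 40.0]])
scaled = min_max_scale(features)
```

Here only the first and third features pass the threshold. Scaling each column separately means a feature with large raw values does not dominate one with small raw values during training.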

In addition to the appropriate selection of performance assessment metrics, evaluating the considered ML model on different test datasets is also important. Therefore, the k-fold cross-validation technique is recommended in most of the literature (Salman Saeed et al., 2020; Saeed et al., 2019). In the k-fold cross-validation technique, the entire dataset is initially divided into k folds. Afterwards, k1 folds are used to train the model, and the remaining (k − k1) folds are used for validation; in the standard scheme k1 = k − 1, and the procedure is repeated so that each fold serves once for validation. Finally, the outcomes of all the considered evaluation metrics are averaged to depict the performance of the learning classifier.

2.3.2. CatBoost classification algorithm: Theoretical background and its implementation in current classification problem

In this study, the CatBoost classification algorithm is utilized for model training and evaluation. CatBoost is a refined version of the GBDTs, which utilizes an ensemble learning technique based on the gradient descent framework. During model training, a set of decision trees (DTs) is constructed sequentially so that each subsequent tree decreases the loss. In other words, each DT learns from the preceding tree and influences the next tree to boost the model performance, thus building a strong learner. The CatBoost algorithm differs from the rest of the GBTs in two prominent features, i.e., efficient handling of categorical features and ordered boosting (Prokhorenkova et al., 2018). Learning classifiers handle numerical features quite efficiently during the model training phase; however, interpreting categorical features is complicated for them. Therefore, in conventional approaches, categorical features are transformed into useful information using the one-hot encoding technique (Daniele, 2001) or gradient statistics (Ke et al., 2017). In the former technique, each category of the original categorical feature is replaced by binary values, while in the latter technique, an estimated value is generated by using gradient statistics to replace the original categorical feature at each boosting step. Nevertheless, in the case of categorical features with high repeatability, both the mentioned techniques require large memory and other computational resources. To avoid this problem, the CatBoost algorithm utilizes efficient modified target-based statistics to handle the categorical features during training time, thus saving considerable computational time and resources.

Another important aspect of the CatBoost algorithm is its ordered boosting mechanism. In traditional GBTs, all the training samples are provided to construct a prediction model after executing several boosting steps. This approach causes a prediction shift in the constructed model, which consequently leads to a special kind of target leakage problem. The CatBoost algorithm avoids the stated issue by utilizing the ordered boosting framework. Furthermore, contrary to conventional learning classifiers, the CatBoost algorithm handles the overfitting issue by using several permutations of the training dataset; this is the key motivation behind utilizing it in the current study.

For the effective implementation of the proposed CatBoost algorithm in the current NTL detection problem, the designed model is initially trained on the data developed in Stage-2. Afterward, a 10-fold cross-validation (CV) technique employing the considered performance metrics is utilized for performance evaluation of the designed model. The corresponding outcomes are depicted in Table 2.

Table 2
10-fold cross-validation results achieved using the proposed model.

As can be seen from Table 2, the CatBoost model attained an average accuracy and precision of 0.9338 and 0.9508 with a standard deviation (SD) of 0.0029 and 0.0035, respectively. It is essential to mention that in almost all data-oriented NTL detection systems, accuracy and precision are two of the most widely used metrics. Nevertheless, these metrics cannot be considered a conclusive measure of the performance of NTL detection classifiers. For example, precision is a critical performance metric; however, it lacks significant information regarding false-negative (FN) instances. The FN value implies consumers involved
in theft yet classified as genuine; hence, a failure of this kind can cause permanent financial loss.

For that reason, the proposed approach's performance is further authenticated by computing the recall, F1score, Kappa, and MCC. The recall or detection rate (DR) value specifies a classifier's hit rate in accurately classifying the theft instances. The proposed technique attained a high average DR value of 0.9237 with a standard deviation (SD) of 0.0033. On the other hand, MCC is a more balanced and informative statistical metric, which provides a high score only if the prediction has achieved good scores in all four confusion matrix categories. The MCC score ranges from −1 (total conflict between outcome and observation) to +1 (perfect prediction). The average value of MCC attained in this study is 0.8677 with an SD of 0.0059, which implies that the proposed technique correctly classifies most of the theft and genuine cases from the provided dataset.

2.3.3. Proposed model's outcomes interpretability using the tree-SHAP algorithm

In this section, the proposed theft detection model's outcomes or predictions are interpreted using Shapley values computed by the tree-SHAP algorithm. The Shapley values assist in opening up the black-box ML model outcomes. These values provide a solution for fairly assigning the gains and costs to several features working in alliance for predicting the model outcomes. In simple words, these values assist in explaining how the model has reached a decision for a particular prediction. In this study, the Shapley values are computed using a recently introduced technique called tree-SHAP, developed by Lundberg et al. (2020). The tree-SHAP algorithm is specially designed for tree-based models and ensemble gradient boosted machines. One of the important features of this algorithm is that it computes the local feature interaction, which in turn facilitates the interpretation of the global model structure for each prediction. A detailed explanation and the source code of the tree-SHAP technique are presented on the tree-SHAP webpage (https://shap.readthedocs.io/). Fig. 10 shows the summary plot generated by the tree-SHAP algorithm, which helps in interpreting the predicted outcomes of the proposed theft detection model.

The summary plot shown in Fig. 10 plots the consumers' extracted features against the computed Shapley values. The Shapley values are computed for each feature value of every consumer and plotted against the selected features in order to evaluate their impact on the model outcome. Since it is quite challenging to show all the features and their corresponding Shapley values in the summary plot, only the 20 most essential features are depicted in ascending order based on their significance in predicting the model outcomes. For example, the entropy feature attained the highest importance in terms of predicting the target variable, as shown in Fig. 10. It implies that most of the consumers with high entropy values (i.e., red color) obtain a positive SHAP value, thus impacting the model outcomes positively. Further aspects of interpreting the ML model using the SHAP technique can be found in Molnar (2018).

3. Comparative analysis of proposed method with conventional ML classification methods

In this section, the performance of the proposed theft detection framework is compared against the latest GBDTs and other well-known conventional ML models under an identical feature set. The 10-fold cross-validation technique is employed in conjunction with the five most widely utilized performance metrics, i.e., precision, accuracy, F1score, Kappa, and MCC, to evaluate the performance of all studied classifiers. The proposed framework is sequentially implemented using an 8th generation Intel Core-i5 unit with 8 GB of RAM. It took approximately 280 s for the model training and testing, while the feature extraction and selection process took around 600 s. Since the classifier utilized in the proposed framework is a modified variant of tree-based models, its performance is compared with other tree-based models such as RF, ET, AdaBoost, XGBoost light and GBM. The outcomes of this comparison are depicted in Fig. 11.

As evident from Fig. 11, the proposed technique outperforms all the conventional ML methods in terms of accuracy, recall, precision, F1score, Kappa, and MCC, thus proving its effectiveness and significance. Another performance comparison of the proposed method with a few well-known conventional ML methods is made on identical performance evaluation metrics. The corresponding outcomes are depicted in Fig. 12. Once again, the proposed method's superior performance can be observed from the outcomes depicted in Fig. 12. It achieves an accuracy, recall, precision, F1score, and MCC of 93.38%, 92%, 95%, 93.7%, and 87%, respectively, which are significantly higher than those of all the competing models.
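The k-fold protocol applied to every classifier in this comparison can be sketched in a few lines. The example below is only an illustration under stated assumptions: the helper names (`k_fold_indices`, `cross_validate`), the one-feature threshold classifier, and the synthetic data are all hypothetical stand-ins, and the sample standard deviation is used because the paper does not state which SD estimator was applied.

```python
import random
import statistics


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices once and split them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(fit, predict, X, y, k=10, seed=0):
    """Return (mean, SD) of validation accuracy over k rounds.

    In each round one fold is held out for validation and the
    remaining k - 1 folds are used for training.
    """
    scores = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in held_out)
        scores.append(hits / len(held_out))
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical stand-in classifier: a threshold halfway between class means.
def fit_threshold(X_train, y_train):
    pos = [x for x, label in zip(X_train, y_train) if label == 1]
    neg = [x for x, label in zip(X_train, y_train) if label == 0]
    return (statistics.mean(pos) + statistics.mean(neg)) / 2


def predict_threshold(threshold, x):
    return 1 if x > threshold else 0


# Synthetic one-feature data: label 0 = genuine, label 1 = theft.
rng = random.Random(42)
X = [rng.gauss(0.3, 0.1) for _ in range(100)] + [rng.gauss(0.7, 0.1) for _ in range(100)]
y = [0] * 100 + [1] * 100

mean_acc, sd_acc = cross_validate(fit_threshold, predict_threshold, X, y, k=10)
print(f"accuracy: {mean_acc:.4f} (SD {sd_acc:.4f})")
```

Using the same `seed` for every classifier keeps the folds identical across models, which is one way to make a comparison like the one in this section fair.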

Fig. 10. SHAP value of the proposed model. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this
article.)
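To make the quantity plotted in Fig. 10 more concrete, the sketch below computes exact Shapley values for a single consumer by enumerating all feature coalitions. This is not the tree-SHAP algorithm, which obtains the same attributions efficiently for tree ensembles; the toy scoring function, the feature names other than entropy, and the baseline values are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values by enumerating every coalition (exponential cost)."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Probability that exactly this coalition precedes feature i
            # in a uniformly random ordering of the features.
            weight = (
                factorial(size) * factorial(n_features - size - 1)
                / factorial(n_features)
            )
            for coalition in combinations(others, size):
                members = set(coalition)
                gain = value_fn(members | {i}) - value_fn(members)
                phi[i] += weight * gain
    return phi


# Hypothetical scoring model over (entropy, mean_load, variance).
def model(x):
    entropy, mean_load, variance = x
    return 2.0 * entropy + 0.5 * mean_load + entropy * variance


consumer = (0.9, 0.4, 0.2)   # feature values of the explained consumer
baseline = (0.5, 0.5, 0.5)   # reference values for "absent" features


def value_fn(coalition):
    """Model output when only the features in the coalition are known."""
    mixed = tuple(
        consumer[j] if j in coalition else baseline[j] for j in range(3)
    )
    return model(mixed)


phi = shapley_values(value_fn, 3)
print("Shapley values:", [round(v, 4) for v in phi])
print("Sum of values:", round(sum(phi), 4))
print("Output shift:", round(model(consumer) - model(baseline), 4))
```

The attributions sum to the difference between the consumer's score and the baseline score, which is the property that lets a summary plot be read as a decomposition of each prediction.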

Fig. 11. Performance evaluation of studied tree-based ML models.
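One ingredient that separates CatBoost from the other tree-based models compared in Fig. 11 is the target-based statistic it uses for categorical features (Section 2.3.2). The sketch below follows the ordered target statistic described by Prokhorenkova et al. (2018): each sample is encoded using only the labels of samples that precede it in a permutation, so it never sees its own label. The function name, the smoothing weight `a`, the tariff-class categories, and the labels are hypothetical; this is an illustration of the idea, not the library's implementation.

```python
import random


def ordered_target_statistics(categories, targets, prior, a=1.0, order=None, seed=0):
    """Encode a categorical feature with ordered target statistics.

    encoded[k] = (sum of targets of earlier samples with the same
                  category + a * prior) / (count of those samples + a)
    """
    if order is None:
        order = list(range(len(categories)))
        random.Random(seed).shuffle(order)
    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for idx in order:
        cat = categories[idx]
        seen_sum = sums.get(cat, 0.0)
        seen_count = counts.get(cat, 0)
        # Only samples processed earlier in the permutation contribute.
        encoded[idx] = (seen_sum + a * prior) / (seen_count + a)
        sums[cat] = seen_sum + targets[idx]
        counts[cat] = seen_count + 1
    return encoded


# Hypothetical tariff classes and theft labels (1 = theft, 0 = genuine).
categories = ["res", "com", "res", "res", "com"]
targets = [1, 0, 1, 0, 0]
prior = sum(targets) / len(targets)  # overall theft rate used as the prior

encoded = ordered_target_statistics(
    categories, targets, prior, a=1.0, order=[0, 1, 2, 3, 4]
)
print(encoded)
```

Because the first sample of each category has no history, it falls back to the prior, which is the behaviour that avoids leaking a sample's own label into its encoding.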

4. Conclusion

In this paper, a novel feature-engineered CatBoost-based NTL detection framework is developed. At the initial stage of the proposed NTL detection framework, the missing slots in the acquired dataset were imputed using the kNN missing value imputer. To avoid data class imbalance, the SMOTETomek algorithm was utilized, which simultaneously over- and under-samples the data classes. The FRESH algorithm was utilized at the later stage to extract and select the most relevant features from the acquired smart meter dataset, which consequently led to lowering the computational time and enhancing the proposed classifier's learning capability. To classify consumers as genuine or theft, the CatBoost algorithm was employed. Finally, the model's decision for a particular outcome was interpreted using the tree-SHAP algorithm. To prove the proposed framework's superior classification performance, its performance was compared with that of the latest gradient boosted machines and traditional ML models based on a few well-known performance evaluation metrics. The proposed technique outperformed all the considered competing algorithms and achieved 93% accuracy, 92% recall, and 95% precision.

Fig. 12. Performance evaluation of studied conventional ML models.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors agree with this submission. They contributed equally to the manuscript and its revision.

References

Adil, M., Javaid, N., Qasim, U., Ullah, I., Shafiq, M., Choi, J.-G., 2020. LSTM and bat-based RUSBoost approach for electricity theft detection. Appl. Sci. 10 (12), 4378. http://dx.doi.org/10.3390/app10124378.
Badrinath Krishna, V., Iyer, R.K., Sanders, W.H., 2016. ARIMA-Based Modeling and Validation of Consumption Readings in Power Grids. Springer International Publishing, pp. 199–210.
Batista, G.E., Prati, R.C., Monard, M.C., 2004a. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor. Newsl. 6 (1), 20–29.
Batista, G.E.A.P.A., Prati, R.C., Monard, M.C., 2004b. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor. Newsl. 6 (1), 20–29. http://dx.doi.org/10.1145/1007730.1007735.
Benjamini, Y., Yekutieli, D., 2001. The control of the false discovery rate in multiple testing under dependency. Ann. Statist. 1165–1188.
Buzau, M.-M., Tejedor-Aguilera, J., Cruz-Romero, P., Gomez-Exposito, A., 2018a. Detection of non-technical losses using smart meter data and supervised learning. IEEE Trans. Smart Grid 1. http://dx.doi.org/10.1109/tsg.2018.2807925.
Buzau, M.M., Tejedor-Aguilera, J., Cruz-Romero, P., Gómez-Expósito, A., 2018b. Detection of non-technical losses using smart meter data and supervised learning. IEEE Trans. Smart Grid 10 (3), 2661–2670.
Chen, L., Chee-Wooi, T., Shiyan, H., 2013. Strategic FRTU deployment considering cybersecurity in secondary distribution network. 4, (3), pp. 1264–1274. http://dx.doi.org/10.1109/tsg.2013.2256939.
Christ, M., 2018. tsfresh, python library for FRESH algorithm-Documentation webpage. https://tsfresh.readthedocs.io/en/latest/index.html. (Accessed).
Christ, M., Kempa-Liehr, A., Feindt, M., 2016. Distributed and parallel time series feature extraction for industrial big data applications. arXiv, vol. abs/1610.07717.
Daniele, M.-B., 2001. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. SIGKDD Explor. Newsl. 3 (1), 27–32. http://dx.doi.org/10.1145/507533.507538.
Ferreira, A.M.S., Cavalcante, C.A.M.T., Fontes, C.H.O., Marambio, J.E.S., 2013. A new method for pattern recognition in load profiles to support decision-making in the management of the electric sector. Int. J. Electr. Power Energy Syst. 53, 824–831. http://dx.doi.org/10.1016/j.ijepes.2013.06.001.
Gunturi, S.K., Sarkar, D., 2020. Ensemble machine learning models for the detection of energy theft. Electr. Power Syst. Res. http://dx.doi.org/10.1016/j.epsr.2020.106904.
Hasan, M., Toma, R.N., Nahid, A.-A., Islam, M.M., Kim, J.-M., 2019. Electricity theft detection in smart grid systems: a CNN-LSTM based approach. Energies 12 (17), 3310.
Hussain, S., Mustafa, M.W., Jumani, T.A., Baloch, S.K., Saeed, M.S., 2020. A novel unsupervised feature-based approach for electricity theft detection using robust PCA and outlier removal clustering algorithm. Int. Trans. Electr. Energy Syst. 30 (11), e12572.
Jaiswal, S., Ballal, M.S., 2020. Fuzzy inference based electricity theft prevention system to restrict direct tapping over distribution line. J. Electr. Eng. Technol. 15 (3), 1095–1106. http://dx.doi.org/10.1007/s42835-020-00408-7.
Jindal, A., Dua, A., Kaur, K., Singh, M., Kumar, N., Mishra, S., 2016. Decision tree and SVM-based data analytics for theft detection in smart grid. IEEE Trans. Ind. Inf. 12 (3), 1005–1016.
Joenssen, D.W., Bankhofer, U., 2012. Hot deck methods for imputing missing data. In: Machine Learning and Data Mining in Pattern Recognition. Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 63–75.
Jokar, P., Arianpoo, N., Leung, V.C.M., 2016. Electricity theft detection in AMI using customers' consumption patterns. IEEE Trans. Smart Grid 7 (1), 216–226. http://dx.doi.org/10.1109/tsg.2015.2425222.
Ke, G., et al., 2017. Lightgbm: A highly efficient gradient boosting decision tree. In: Advances in Neural Information Processing Systems, Vol. 30. NIPS 2017. pp. 3146–3154.
Lundberg, S.M., et al., 2020. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2 (1), 56–67. http://dx.doi.org/10.1038/s42256-019-0138-9.
Messinis, G.M., Hatziargyriou, N.D., 2018. Review of non-technical loss detection methods. Electr. Power Syst. Res. 158, 250–266. http://dx.doi.org/10.1016/j.epsr.2018.01.005.
Molnar, C., 2018. A guide for making black box models explainable. URL: https://christophm.github.io/interpretable-ml-book.
Mwaura, F.M., 2012. Adopting electricity prepayment billing system to reduce non-technical energy losses in Uganda: Lesson from Rwanda. 23, pp. 72–79. http://dx.doi.org/10.1016/j.jup.2012.05.004.
Never, B., 2015. Social norms, trust and control of power theft in Uganda: Does bulk metering work for MSEs? Energy Policy 82, 197–206. http://dx.doi.org/10.1016/j.enpol.2015.03.020.
Northeast Group, 2017. Electricity Theft and Non-Technical Losses: Global Markets, Solutions and Vendors, 2017. Northeast Group, LLC, [Online]. Available: http://www.northeast-group.com/reports/Brochure-Electricity%20Theft%20&%20Non-Technical%20Losses%20-%20Northeast%20Group.pdf.
Passos Júnior, L.A., et al., 2016. Unsupervised non-technical losses identification through optimum-path forest. Electr. Power Syst. Res. 140, 413–423. http://dx.doi.org/10.1016/j.epsr.2016.05.036.
Pedregosa, F., et al., 2011. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A.V., Gulin, A., 2018. CatBoost: unbiased boosting with categorical features. pp. 6638–6648, https://arxiv.org/abs/1810.11363v1.
Punmiya, R., Choe, S., 2019. Energy theft detection using gradient boosting theft detector with feature engineering-based preprocessing. IEEE Trans. Smart Grid 10 (2), 2326–2329. http://dx.doi.org/10.1109/tsg.2019.2892595.
Roth, P.L., Switzer, F.S., 1995. A Monte Carlo analysis of missing data techniques in a HRM setting. J. Manage. 21 (5), 1003–1023. http://dx.doi.org/10.1177/014920639502100511.


Rusitschka, S., Eger, K., Gerdes, C., 2010. Smart Grid Data Cloud: A Model for Utilizing Cloud Computing in the Smart Grid Domain. IEEE, http://dx.doi.org/10.1109/smartgrid.2010.5622089.
Saad, M., Tariq, M.F., Nawaz, A., Jamal, M.Y., 2017. Theft Detection Based GSM Prepaid Electricity System. IEEE, http://dx.doi.org/10.1109/ccsse.2017.8087973.
Saeed, M.S., Mustafa, M.W., Sheikh, U.U., Jumani, T.A., Mirjat, N.H., 2019. Ensemble bagged tree based classification for reducing non-technical losses in multan electric power company of Pakistan. Electronics 8 (8), 860.
Salman Saeed, M., et al., 2020. An efficient boosted C5.0 decision-tree-based classification approach for detecting non-technical losses in power utilities. Energies 13 (12), 3242. http://dx.doi.org/10.3390/en13123242.
Troyanskaya, O., et al., 2001. Missing value estimation methods for DNA microarrays. Bioinformatics 17 (6), 520–525. http://dx.doi.org/10.1093/bioinformatics/17.6.520.
Tureczek, A.M., Nielsen, P.S., 2017. Structured literature review of electricity consumption classification using smart meter data. Energies 10 (5), 584.
Viegas, J.L., Esteves, P.R., Melício, R., Mendes, V.M.F., Vieira, S.M., 2017. Solutions for detection of non-technical losses in the electricity grid: A review. Renew. Sustain. Energy Rev. 80, 1256–1268. http://dx.doi.org/10.1016/j.rser.2017.05.193.
Winther, T., 2012. Electricity theft as a relational issue: A comparative look at Zanzibar, Tanzania, and the Sunderban Islands, India. Energy Sustain. Dev. 16 (1), 111–119. http://dx.doi.org/10.1016/j.esd.2011.11.002.
Xiao, Z., Xiao, Y., Du, D.H.-C., 2013. Exploring malicious meter inspection in neighborhood area smart grids. IEEE Trans. Smart Grid 4 (1), 214–226. http://dx.doi.org/10.1109/tsg.2012.2229397.
Yurtseven, Ç., 2015. The causes of electricity theft: An econometric analysis of the case of Turkey. Util. Policy 37, 70–78. http://dx.doi.org/10.1016/j.jup.2015.06.008.
Zhang, S., Zhang, J., Zhu, X., Qin, Y., Zhang, C., 2008. Missing value imputation based on data clustering. In: Transactions on Computational Science I. Springer, pp. 128–138.
Zheng, Z., Yang, Y., Niu, X., Dai, H.-N., Zhou, Y., 2018. Wide and deep convolutional neural networks for electricity-theft detection to secure smart grids. IEEE Trans. Ind. Inf. 14 (4), 1606–1615. http://dx.doi.org/10.1109/tii.2017.2785963.
