Article
Power Transformer Fault Diagnosis Using Neural Network
Optimization Techniques
Vasiliki Rokani 1, * , Stavros D. Kaminaris 1 , Petros Karaisas 1 and Dimitrios Kaminaris 2
1 Department of Electrical and Electronics Engineering, University of West Attica, GR-12244 Egaleo, Greece;
[email protected] (S.D.K.); [email protected] (P.K.)
2 Institute of Physics, Ecole Polytechnique Federale de Lausanne (EPFL), 1015 Lausanne, Switzerland;
[email protected]
* Correspondence: [email protected]
Abstract: Artificial Intelligence (AI) techniques are considered the most advanced approaches for diagnosing faults in power transformers. Dissolved Gas Analysis (DGA) is the conventional approach widely adopted for diagnosing incipient faults in power transformers. The Rogers Ratio Method of the IEC-599 standard is a widely adopted method for evaluating the DGA results. However, all the classical approaches have limitations because they cannot diagnose all faults accurately. Precisely diagnosing defects in power transformers is a significant challenge due to their extensive quantity and dispersed placement within the power network. To deal with this concern and to improve the reliability and precision of fault diagnosis, different Artificial Intelligence techniques are presented. In this manuscript, an artificial neural network (ANN) is implemented to enhance the accuracy of the Rogers Ratio Method. On the other hand, it should be noted that the complexity of an ANN demands a large amount of storage and computing power. In order to address this issue, an optimization technique is implemented with the objective of maximizing the accuracy and minimizing the architectural complexity of the ANN. All the procedures are simulated using the MATLAB R2023a software. Firstly, the authors choose the most effective classification model by automatically training five classifiers in the Classification Learner app (CLA). After selecting the artificial neural network (ANN) as the most suitable classification model, we trained 30 ANNs with different parameters and determined the 5 models with the best accuracy. We then tested these five ANNs using the Experiment Manager app and ultimately selected the ANN with the best performance. The network structure is determined to consist of three layers, taking into consideration both diagnostic accuracy and computing efficiency. Ultimately, a (100-50-5) layered ANN was selected to optimize its hyperparameters. As a result, following the implementation of the optimization techniques, the suggested ANN exhibited a high level of accuracy, up to 90.7%. The conclusions of the proposed model indicate that the optimization of hyperparameters and the increase in the number of data samples enhance the accuracy while minimizing the complexity of the ANN. The optimized ANN is simulated and tested in MATLAB R2023a—Deep Network Designer, resulting in an accuracy of almost 90%. Moreover, compared to the Rogers Ratio Method, which exhibits an accuracy rate of just 63.3%, this approach successfully addresses the constraints associated with the conventional Rogers Ratio Method. Thus, the ANN has proven to be a superior diagnostic method in the realm of power transformer fault diagnosis.
Keywords: power transformers; fault diagnosis; DGA; ANN; MATLAB; neural network optimization
MSC: 68T07
1. Introduction
A power transformer (PT) is a device that transfers electrical energy between circuits via electromagnetic induction. It is an integral component of the electrical power system (EPS) and steps up or steps down the voltage of an alternating current
power supply. Power transformers are essential in ensuring electrical energy is efficiently
and reliably transmitted over long distances. The fundamental idea behind transformer
theory is the utilization of a magnetic field generated by one coil to create an electromotive
force (EMF) in a second coil [1]. The electrical apparatus comprises windings made of
either copper or aluminum, a core composed of thin sheets of magnetic steel, and insulating
materials such as high-density paper and mineral oil [2]. The transformer is a very intricate
piece of equipment. Numerous forces are at play inside the tank, including phenomena
such as ageing, chemical processes, electric and magnetic fields, thermal expansion and
contraction, fluctuations in load, and the force of gravity. The transformer is subject to
several external factors, such as through-faults, significant ambient temperature variations,
voltage surges, and other forces like gravity and the Earth’s magnetic field. Transformers
are susceptible to a diverse range of problems [3].
The categorization of faults is as follows:
• Electrical: Partial Discharge, Corona, Arcing, Oil’s Breakdown Voltage;
• Thermal: Cellulose Overheating, Oil Overheating;
• Mechanical: Winding, Core Deformations.
The transformer operates at high voltage levels that create a strong electric field. This field stresses the insulation, and the rate at which the insulation breaks down depends on its current condition and the surrounding operating conditions. The assessment of the insulation
properties has significant implications for the longevity of a PT since it is strongly correlated
with the durability of the insulation material. Several important characteristics may be
assessed in relation to the insulating components of PTs. The frequently used methods are
Dissolved Gas Analysis (DGA) in the insulating oil, assessment of insulation resistance,
evaluation of partial discharges, and measurement of the power factor or tangent delta [2,3].
The DGA technique is a widely used approach in contemporary monitoring systems [4]. The
literature has presented many methodologies for evaluating the DGA method, including
the Key Gas, IEC ratio, Duval triangle, Doernenburg ratio, Rogers Ratio, and logarithmic
nomograph methods. The fault classification techniques use reference tables and charts
that have been developed based on the quantities or specific ratios of gases [5,6]. The issue that must be dealt with is that conventional DGA methods have limits because they mainly rely on empirical approaches and lack mathematical formulations, which hinders their ability to analyze all types of faults effectively. Consequently, in several instances, an inaccurate or
unresolved diagnosis is seen. This occurs when more than one fault arises in a transformer
or when the concentration of gases is near the threshold. In order to address this issue and
improve the dependability of defect detection, the current research initiative employs an
artificial neural network (ANN) [7,8].
After conducting a literature review on power transformer fault diagnosis (PTFD)
employing DGA and artificial neural networks (ANNs), several research gaps and areas
for further investigation were identified [7]. From our point of view, the most crucial
research gap could be the limited study of model optimization techniques. Multiple studies
concentrate on applying ANNs to PTFD but often skip the specifics of optimizing the ANN
architecture and training process to improve the diagnostic precision. These issues can be
addressed using optimization algorithms, regularization techniques, and hyperparameter
tuning; all these will be further defined in Section 2.
To enhance the diagnostic precision of traditional DGA, an artificial neural network
(ANN) was developed to improve the diagnostic model's resilience and accuracy. Moreover, we chose to combine the conventional DGA technique, the Rogers Ratio Method, with an ANN, as this method is widely used in the literature and gives accurate predictions for
power transformer faults [7,8]. Furthermore, an optimization procedure is conducted to
maximize the accuracy and decrease the architectural complexity of the proposed network.
The optimization procedure is executed by adjusting the model’s hyperparameters. The
available literature on hyperparameter tuning is limited, with Machine Learning (ML)
algorithm developers only supplying brief descriptions of their functions [9]. While scientific papers and online tutorials offer some additional information, there is a lack of
more extensive studies that investigate significant inquiries, such as the extent to which a
model can be enhanced using tuning, the most effective tuning strategy, and the influence
of tuning a specific hyperparameter on different datasets. More research on these topics
would be beneficial in order to establish the best practices for hyperparameter tuning. This
study aims to address the research concerns previously indicated within the realm of PTFD.
Our goals are twofold:
1. Maximizing the Accuracy for PTFD.
2. Minimizing the Architectural Complexity of the ANN.
The present publication utilizes the MATLAB R2023a software to construct an opti-
mized Multi-Layer Feedforward Backpropagation Neural Network. The experiment is
conducted as follows: We train 30 NNs and then choose the top 5 models based on the
validation accuracy. In order to determine which of these five ANNs was the most effective,
we put them through a series of tests using the Experiment Management app.
Finally, we pick the optimal ANN to optimize its hyperparameters and, decisively,
to provide the model with the greatest accuracy. Three different automated methods
(a Bayesian optimizer, a random search, and a grid search), as well as three different
optimizers (Adam, SGDM, and RMSprop), are compared and contrasted to assist in the
hyperparameter selection for our ANN.
The performance and simulation of the optimized ANN are achieved by using MAT-
LAB R2023a software. The accuracy of the conventional Rogers Method and the ANN method is calculated using the formula (A) = (Total Samples Correctly Classified)/(Total Samples).
PTFD is accomplished via the use of a coding system that is developed from the
boundaries of gas concentrations, expressed in ppm, as given in Table 1. Table 2 provides the
gas ratio code for each fault type, indicating the presence of 12 distinct fault categories [8,13].
The 12 different types of faults are classified into five categories (F1, F2, F3, F4, F5) in
order to obtain only five outputs for the ANN. A reduced number of outputs makes the
network more functional and flexible in the MATLAB R2023a software. The numbers of
fault types according to the Rogers Ratio Method are included in parentheses.
The five different types of faults are:
• F1: No-Fault (1);
• F2: Low Energy Discharge (2), (12);
• F3: High Energy Discharge (9), (10), (11);
• F4: Low and Medium Thermal Faults (3), (4), (5), (7), (8);
• F5: High-Temperature Thermal Faults (6).
The data samples and transformer states were created utilizing the IEEE DGA datasets [15]
and the dataset from our previous work [8].
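As a small illustration (a sketch only; the function name rogersToClass is hypothetical and not part of the project code), the grouping of the 12 Rogers fault types into the five output classes listed above can be expressed as a simple lookup in MATLAB:
% Map the 12 Rogers Ratio fault types (1-12) to the five ANN output classes F1-F5.
function class = rogersToClass(faultType)
    switch faultType
        case 1
            class = 1;                 % F1: No-Fault
        case {2, 12}
            class = 2;                 % F2: Low Energy Discharge
        case {9, 10, 11}
            class = 3;                 % F3: High Energy Discharge
        case {3, 4, 5, 7, 8}
            class = 4;                 % F4: Low and Medium Thermal Faults
        case 6
            class = 5;                 % F5: High-Temperature Thermal Faults
        otherwise
            error('Unknown Rogers fault type: %d', faultType);
    end
end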
Figure 1. Scheme of an artificial neural network.
Backpropagation
The multi-layer network should be trained using a backpropagation algorithm, which is also used as the learning rule of deep learning [20]. The method of backpropagation allows the network to obtain knowledge from the training data by modifying the weights of the neurons according to the error or loss experienced during the training process. This method can be explained as follows [21,22]:
Forward propagation: For each neuron in layer l, the weighted sum z is computed as:
z = W_l ∗ a_prev + b_l (1)
where W_l represents the weight matrix, a_prev stands for the activation from the previous layer, and b_l is the bias vector for layer l. Then, the output a of each neuron is obtained by applying an activation function f, a = f(z).
• Loss function: The choice of the loss function depends on the specific problem. If we denote the predicted output of the ANN as ŷ and the target output as y, then the loss function L measures the discrepancy between ŷ and y, such as the mean squared error for regression or cross-entropy for classification.
Backpropagation: For each neuron in the output layer, the partial derivative of the loss
function with respect to its output is computed:
δ_out = ∂L/∂ŷ (2)
Then, the gradient with respect to the neuron's input is computed by multiplying δ_out by the derivative of the activation function f′(z):
δ = δ_out ∗ f′(z) (3)
Weight update: The weights and biases are updated using the gradient descent
algorithm. For each weight connecting neuron j in layer l to neuron i in layer (l + 1), the
weight update is computed as:
∆w_ij = −r ∗ δ_i ∗ a_j (4)
where r is the learning rate, δ_i is the error gradient of neuron i, and a_j is the activation of neuron j.
Backpropagation through hidden layers: In order to compute the error gradients for
the neurons in the hidden layers, the gradients from the subsequent layers are propagated
backward. For each neuron in layer l, the error gradient is computed as:
δ_l = f′(z_l) ∗ [W^T ∗ δ_(l+1)] (5)
where W^T is the transpose of the weight matrix connecting layer l to layer (l + 1).
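To make the chain of Equations (1)–(5) concrete, a minimal MATLAB sketch of one forward and one backward pass for a single training sample is given below; the layer sizes, the squared-error loss, and the tanh activation are illustrative assumptions and not the network used later in this work:
x  = rand(4,1);  t = [0; 1; 0; 0; 0];           % one input sample and its target vector
W1 = 0.1*randn(8,4);   b1 = zeros(8,1);         % hidden-layer weights and biases
W2 = 0.1*randn(5,8);   b2 = zeros(5,1);         % output-layer weights and biases
r  = 0.01;                                      % learning rate
% Forward propagation, Equation (1): z = W*a_prev + b, followed by a = f(z)
z1 = W1*x  + b1;   a1 = tanh(z1);
z2 = W2*a1 + b2;   y  = z2;                     % linear output paired with a squared-error loss
% Output gradient, Equation (2): for L = 0.5*||y - t||^2, dL/dy = y - t
delta2 = y - t;
% Hidden-layer gradient, Equation (5): delta_l = f'(z_l) .* (W' * delta_{l+1})
delta1 = (1 - tanh(z1).^2) .* (W2' * delta2);   % tanh'(z) = 1 - tanh(z)^2
% Weight updates, Equation (4): delta_w_ij = -r * delta_i * a_j
W2 = W2 - r*(delta2*a1');   b2 = b2 - r*delta2;
W1 = W1 - r*(delta1*x');    b1 = b1 - r*delta1;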
• Dropout rate during the dropout regularization.
No activation function is needed for the input layer. The kind of activation function to
be employed in the hidden layers and the output layer is determined by the problem we
aim to solve.
Although there are many approaches to finding the optimal ANN architecture, none
can guarantee the optimal solution for all real prediction problems. An ANN can comprise more than one hidden layer. Kolmogorov's theorem states that 'an ANN with one hidden
layer could be selected if someone chooses a suitable quantity of neurons in the hidden
layer’ [30]. Moreover, the basic empirical principle of machine learning, Ockham’s Razor,
states that ‘The best models are simple models that fit the data well’. The complexity of
an ANN demands a large amount of storage and computing power, so we try to construct
models with low complexity [23].
• The bias–variance dilemma [31,32].
Bias: The bias is the error between the values predicted by the network and the true values. Bias defines the model's capability to learn from the training data. A high-bias network produces a large error on both the training and testing data. An algorithm should have low bias to avoid the problem of underfitting.
Variance: The variance is the difference between the training accuracy and the testing accuracy. It defines how well a network can generalize to the test data.
The bias–variance tradeoff appears as it is complicated to minimize bias and variance
simultaneously. When the complexity of an ANN increases, its bias decreases, and its
variance increases. Contrarily, when the complexity of an ANN decreases, its bias increases,
and its variance decreases. The optimum complexity of an ANN relies on the available
dataset and the specific model [33].
2.4.2. Optimization of Hyperparameters That Determine the State of the Network Training
and Directly Control the Training Process
The hyperparameters that define the way the network is trained and directly control the training process are [26,34,35]:
Learning rule—the optimizer type: Reveals the algorithm employed for the network
training. There are various types of optimizers commonly used in ANNs. Some of them are
gradient descent (backpropagation) [36], stochastic gradient descent (SGD) [36], Stochastic
Gradient Descent with Momentum (SGDM), the Levenberg–Marquardt algorithm [23],
Bayesian regularization [23], RMSprop (Root Mean Square Propagation) and Adam (Adap-
tive Moment Estimation) [37]. Choosing an optimizer often depends on the specific problem, architecture, and dataset; in our project, we use a trial-and-error method to find the best one for PTFD.
Weight initialization methods: Weight initialization refers to arranging the initial
values of the weights in an ANN. Appropriate initialization is essential because it can
affect an ANN’s convergence speed and performance. Some standard weight initialization
methods include random initialization, Xavier initialization, and He initialization. These
methods aim to set the initial weights to balance the activation values and gradients
during training.
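As a brief illustration of these schemes (one common form is assumed here; several variants exist in the literature), the scaling depends on the layer fan-in and fan-out:
fanIn = 100;  fanOut = 50;                                  % illustrative layer sizes
Wrandom = 0.01*randn(fanOut, fanIn);                        % plain small random initialization
Wxavier = randn(fanOut, fanIn)*sqrt(2/(fanIn + fanOut));    % Xavier/Glorot (normal variant)
Whe     = randn(fanOut, fanIn)*sqrt(2/fanIn);               % He initialization, suited to ReLU layers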
Type of loss (or cost) function: The cost function estimates the difference between the
forecasted output of an ANN and the real output. It quantifies the error of the model’s
predictions and is operated to direct the learning process. The selection of the loss function
depends on the model type and the data’s nature. Standard loss functions are the mean
squared error (MSE), Huber Loss, and categorical cross-entropy.
Learning rate: This hyperparameter controls the step size at which the model updates
its weights during training.
Batch size: It defines the amount of training data that is processed before the model's weights are updated. Employing a larger batch size during training can result in a shorter training time but may demand more memory. Smaller batch sizes can supply more stochasticity and better generalization, but they take longer to converge.
Number of epochs: An epoch represents a full pass through the complete training
dataset during the training process of an ANN. The number of epochs determines how
frequently the model passes and learns from the complete dataset. When the number of
epochs rises, the model’s performance is enhanced, but it also poses a risk of overfitting.
The optimum number of epochs relies on the problem’s complexity and the sample’s size;
moreover, it is usually defined using experimentation.
Training steps: Training steps refer to the iterations or updates made to the model’s
parameters during the training process. Each training step involves feeding a batch of
training examples to the model, computing the loss, and updating the weights based on the
gradients. The number of training steps depends on factors such as the batch size, training dataset size, and convergence criteria; it is determined by the number of epochs and the number of mini-batches per epoch.
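As a small numerical illustration of this relationship (the mini-batch size here is an assumed value, not the one used in our experiments):
numObservations = 350;                                    % size of the training set described later
miniBatchSize   = 32;                                     % assumed for illustration only
numEpochs       = 30;
stepsPerEpoch   = ceil(numObservations/miniBatchSize);    % 11 weight updates per epoch
totalSteps      = numEpochs*stepsPerEpoch;                % 330 training steps in total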
R = λ ∗ ∑W² (11)
where λ once again controls the strength of the regularization. The regularized loss function has
the same form as Equation (5). The gradient of the regularized loss function with respect to
the weights is computed as:
∂L_reg/∂W = ∂L/∂W + 2 ∗ λ ∗ W (12)
For each weight w_ij connecting neuron j in layer l to neuron i in layer (l + 1), the weight update is computed as:
∆w_ij = −r ∗ (∂L/∂w_ij + 2 ∗ λ ∗ w_ij) (13)
By incorporating the L1 and L2 regularization terms into the loss function and ad-
justing the weight update rule, we can attain sparsity in the learned weights. This can
be advantageous in reducing the ANN’s complexity and enhancing its knowledge to
generalize well to unseen data.
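A compact MATLAB sketch of how Equations (11)–(13) translate into a single update step is given below; the weight matrix, the gradient, and the numerical values are placeholders for illustration only:
W      = randn(5,10);                  % illustrative weight matrix of one layer
gradL  = randn(size(W));               % placeholder for dL/dW obtained by backpropagation
lambda = 0.01;                         % regularization strength (illustrative value)
r      = 0.001;                        % learning rate
R        = lambda*sum(W(:).^2);        % Equation (11): penalty term added to the loss
gradLreg = gradL + 2*lambda*W;         % Equation (12): gradient of the regularized loss
W        = W - r*gradLreg;             % Equation (13): weight update including the penalty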
Lambda parameter—λ in L1 and L2 regularization
The λ factor assumes a significant role in determining the degree of L1 and L2
regularization.
Different values of λ correspond to different levels of regularization:
• When λ equals 0, no regularization is applied;
• When λ equals 1, complete regularization comes into effect;
• The default setting for Keras is λ = 0.01.
The differences between L1 and L2 regularization are presented in Table 3.
Table 3. Differences between L1 and L2 regularization.
• L1 regularization: shrinks weight magnitudes toward 0; penalizes the sum of the absolute values of the weights; the cost of outliers present in the data rises linearly; preferable when the model is simple.
• L2 regularization: shrinks weight magnitudes to be small but not exactly 0; penalizes the sum of the squared values of the weights; the cost of outliers present in the data rises quadratically; preferable when the model is complex.
2.5. Visualize and Estimate the Performance of a Classifier in the Classification Learner App (CLA)
The experiment is executed in MATLAB R2023a in the Classification Learner app [38]. After training an ANN, the CLA automatically creates the confusion matrix (CM) and the receiver operating characteristic (ROC) curve of the model.
Figure 3. Confusion matrix.
Mathematically, these values can be expressed as follows:
• True positives (TP): The number of occurrences where the actual class is positive and the model predicted it as positive;
• True negatives (TN): The number of occurrences where the actual class is negative and the model predicted it as negative. This is often more relevant in binary classification;
• False positives (FP): The number of occurrences where the actual class is negative, but the model predicted it as positive;
• False negatives (FN): The number of occurrences where the actual class is positive, but the model predicted it as negative.
Using these values, various metrics can be calculated to assess the performance of the ANN [41,42], such as:
• Accuracy = (TP + TN)/(TP + TN + FP + FN);
• Precision = TP/(TP + FP);
• Recall = TP/(TP + FN);
• F1 Score = 2 ∗ (Precision ∗ Recall)/(Precision + Recall);
• Specificity = TN/(TN + FP).
Recall concentrates on the model's capability to identify positive instances correctly.
F1 Score balances precision and recall, which is useful when both false positives and false negatives matter.
Specificity measures the model's ability to correctly identify negative instances, which is essential in applications where false positives are unacceptable.
These metrics provide different perspectives on the performance of the ANN. The confusion matrix and the derived metrics show where the model is making correct predictions and where it is making mistakes, which is essential for improving the model or choosing the right model for a given task.
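A short MATLAB sketch of these computations on a toy label vector (not our dataset) might look as follows; confusionmat is assumed to be available from the Statistics and Machine Learning Toolbox:
yTrue = categorical([1 1 2 2 3 3 4 4 5 5]);     % toy true labels
yPred = categorical([1 2 2 2 3 3 4 5 5 5]);     % toy predicted labels
CM = confusionmat(yTrue, yPred);                % rows: true classes, columns: predicted classes
accuracy = sum(diag(CM))/sum(CM(:));            % overall fraction of correctly classified samples
TP = diag(CM);                                  % per-class counts, treating each class as "positive"
FP = sum(CM,1)' - TP;
FN = sum(CM,2)  - TP;
TN = sum(CM(:)) - TP - FP - FN;
precision   = TP./(TP + FP);
recall      = TP./(TP + FN);
f1          = 2*(precision.*recall)./(precision + recall);
specificity = TN./(TN + FP);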
2.6. Experiment
In this manuscript, the MATLAB R2023a software is applied to develop an optimized
Multi-Layer Feedforward Backpropagation Neural Network. In a few words, the experi-
ment is carried out as follows: After training 30 NNs, we select the five models with the
highest validation accuracy. We then conduct tests on these five ANNs using the Experi-
ment Manager app and ultimately select the neural network with the best performance.
Finally, the best ANN is selected to optimize the hyperparameters and, conclusively,
to present the model with the best accuracy. In this task, we examine three automated
techniques, Bayesian optimizer, random search, and grid search, and three optimizers,
Adam, SGDM, and RMSprop, to assist the hyperparameter selection for our ANN. The
workflow of our method is as follows:
Flowchart:
1. Data extraction;
2. Load data in the Classification Learner app;
3. Select classifier options;
4. Train classifiers;
5. Choose the best classifier type (ANN);
6. Train 30 ANNs with different parameters;
7. Select the five most accurate networks;
8. Train and test 5 ANNs and select the most accurate model;
9. ANN hyperparameter optimization;
10. Visualize and assess the ANN’s performance;
11. Select the ANN with the best accuracy for PTFD.
• Model size: the model’s size if it were exported with no training data.
Experiment settings:
1. Data extraction.
The four gas ratios are the inputs (predictors), and the output classes (response) are the transformer's five incipient fault categories. The total number of data samples is 400.
2. Load data in the Classification Learner app.
The predictor data are combined into a table (350 × 5), and the other 50 samples
constitute the test data, where the first four columns consist of the gas ratios and the last
column comprises the five classes. Each row of the table indicates one observation, so we
have 350 observations for the train set and 50 observations for the test set (Table 4).
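A minimal sketch of this preparation step is shown below; the file name dga_ratios.csv and the simple sequential split are assumptions for illustration, while the FAULT response column follows the naming used later in Box 1:
data = readtable('dga_ratios.csv');            % 400 x 5 table: four gas ratios plus the FAULT class
data.FAULT  = categorical(data.FAULT);         % treat the response as a categorical class label
trainingData = data(1:350, :);                 % 350 observations for training and validation
testData     = data(351:400, :);               % 50 observations held out for testing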
Classifiers | Classifier Type | Accuracy % (Validation) | Cost (Validation) | Prediction Speed (obj/s) | Training Time (s) | Model Size (KB)
Decision trees | Fine Tree | 78 | 77 | 6388.3 | 6.37 | 11.991
Decision trees | Medium Tree | 78 | 77 | 21,522.5 | 0.87 | 11.991
Decision trees | Coarse Tree | 67.4 | 77 | 20,652.2 | 0.37 | 5.011
SVM | Quadratic | 77.7 | 78 | 9763.7 | 0.7 | 60.282
SVM | Cubic | 80.9 | 67 | 10,405.0 | 0.6 | 58.026
SVM | Fine Gaussian | 81.1 | 66 | 9530.5 | 0.7 | 65.810
SVM | Medium Gaussian | 80.6 | 68 | 10,127.4 | 0.6 | 65.042
SVM | Coarse Gaussian | 72.0 | 98 | 10,758.9 | 0.5 | 76.082
KNN | Fine | 80.3 | 69 | 7547.1 | 0.7 | 28.934
KNN | Medium | 77.4 | 79 | 9726.3 | 0.4 | 28.934
KNN | Cubic | 76.9 | 81 | 10,022.9 | 0.4 | 28.934
KNN | Weighted | 81.4 | 65 | 10,189.9 | 0.4 | 28.952
ANN | Narrow | 81.4 | 65 | 13,372.7 | 3.8 | 6.143
ANN | Medium | 82.9 | 60 | 21,741.1 | 2.5 | 7.343
ANN | Wide | 81.1 | 66 | 20,582.1 | 2.8 | 13.343
ANN | Bilayered | 82.0 | 63 | 22,651.2 | 3.6 | 7.943
ANN | Trilayered | 81.7 | 64 | 21,268.2 | 4.6 | 9.743
Naïve Bayes | Kernel | 70.0 | 105 | 6570.1 | 1.1 | 92.444
NumLayers | Activation Function | Layer 1 Size | Layer 2 Size | Layer 3 Size | Validation Accuracy (%)
1 | Sigmoid | 240 | 5 | 0 | 75.0
3 | Tanh | 200 | 100 | 5 | 80.1
3 | ReLU | 300 | 200 | 5 | 80.2
3 | ReLU | 100 | 5 | 0 | 80.2
3 | Tanh | 100 | 50 | 5 | 80.9
We then conducted tests on these five ANNs using the ExpM app. We manually trained
and tested them and ultimately selected the neural network with the best performance. We
selected the ANN with the following parameters (Table 7) to optimize its hyperparameters
and, conclusively, to present the model with the best accuracy.
Table 8. Automated optimization.
This experiment's results are illustrated in the following figures: Figures 4–11. We present the confusion matrix and the ROC curve for each model. Model 12 (Bayesian optimizer) and Model 4 (grid search) stand out as the most significant models in Table 8 because of their high validation and test accuracy. Their significance indicates their importance in the analysis, so we present a summary of the experimental process and the minimum classification error plot only for these two models.
Figure 4 reveals an overview of the training and test results according to the performance indicators for Model 4 and Model 12, as reported by the MATLAB R2023a software. These CLA classifier performance indicators have been described in Section 2.6.1.
Figure 4. (a) Summary of Model 4; (b) Summary of Model 12.
From Figure 5, the minimum classification error plot for Bayesian optimization (this plot was created automatically by the MATLAB R2023a software), we can observe the following outcomes, which are very important for our experiment:
• The estimated minimum classification error is represented by each light blue point. This estimate is obtained by the optimization procedure, which takes into account all the combinations of hyperparameter values that have been evaluated, including the present iteration.
• The observed minimum classification error is represented in the graph by every dark blue point, which refers to the calculated error obtained during the optimization procedure.
• The optimized hyperparameters are represented by the red square, which marks the iteration with the best performance. As we can see from the figure, the best-point hyperparameter is the L2 regularization with value 0.00000287, and the observed minimum classification error is 0.172 in the eighth iteration. The optimized hyperparameters may not consistently provide the reported minimum classification error. Here, we can also mention that the application employs Bayesian optimization for hyperparameter tuning. It selects the combination of hyperparameter values that minimizes the upper confidence bound of the objective model's classification error, instead of minimizing the classification error itself.
• The hyperparameters that result in the smallest classification error are represented by the yellow point, indicating the corresponding iteration. Here, we observed that the L2 regularization with value 0.00000286 resulted in the minimum classification error of 0.172 in the 6th iteration.
Figure 5. Min. Classification error plot of Model 12.
From Figure 6, for the grid search model, we can perceive that:
• The best-point hyperparameter is the L2 regularization with value 0.00006155, the observed minimum classification error is 0.152 in the 55th iteration, and the L2 regularization with value 0.0000369 resulted in the minimum classification error of 0.152 in the 29th iteration. According to these two diagrams, we can conclude that grid search results in a smaller classification error. Hence, it has better accuracy than Bayesian optimization, as we have noticed from the confusion matrix and ROC curve. However, the minimum classification error plots indicate that Bayesian optimization has a higher convergence speed, as it reaches the minimum error in the sixth iteration.
Figures 7a, 8a, 9a and 10a illustrate the confusion matrix (CM) that summarizes the
performance of each model. CM is a 5 × 5 matrix, as we have five classes for prediction
in our ANNs. There is a general description of the CM in paragraph 2.5.2. and a precise
description in articles [41,42].
The diagonal elements (from top-left to bottom-right) give the correct predictions for each class, while the off-diagonal elements denote misclassifications (incorrect predictions for each class). The rows designate the true (actual) classes, and the columns designate the predicted classes. The different colors in every cell are used to visually emphasize the various categories in the confusion matrix, making it easier to interpret the performance of the model at a glance. The choice of colors can differ based on the visualization tool or
software being used. We will describe in detail the CM of Model 4 (Grid search) due to its
superior validation and test accuracy performance.
Figure 6. Min. Classification error plot of Model 4.
Figure 7. (a) Validation confusion matrix for Model 4; (b) validation ROC curve for Model 4.
Figure 8. (a) Validation confusion matrix for Model 12; (b) validation ROC curve for Model 12.
Figure 9. (a) Validation confusion matrix for Model 8; (b) validation ROC curve for Model 8.
Figure 10. (a) Validation confusion matrix for Model 2; (b) validation ROC curve for Model 2.
Figure 11. The progress of training for Adam.
2. Adaptive Learning Rate Method
The impact of the stochastic gradient descent (SGD) approach is significantly influenced by the manually controlled learning rate. Setting a suitable value for the learning rate is a difficult challenge. The learning rate may be automatically adjusted using several adaptive approaches [38,39]. These techniques do not need parameter adjustments, converge quickly, and often provide good outcomes. In this section, we examine three of them: Adaptive Moment Estimation (Adam), Root Mean Square Propagation (RMSprop), and Stochastic Gradient Descent with Momentum (SGDM).
Adam [37]: An improved SGD approach incorporating an adjustable learning rate
for every parameter. Moreover, it combines the techniques of adjustable learning rate and
momentum. The purpose of this architecture is to effectively modify the parameters of a
model as it is being trained.
RMSprop: The fundamental concept behind RMSprop is to dynamically modify the
learning rate assigned to each variable, by considering the size of the gradients. This
concept is accomplished by continuously updating and calculating the average of the
squared gradients for each parameter. The running average serves as a means of adjusting
the learning rate, enabling it to be increased for parameters with smaller gradients and decreased for parameters with larger gradients.
SGDM algorithm: In the conventional SGD approach, the model parameters are
updated by considering the gradient of the loss function estimated on an individual
training sample at each iteration. Noisy updates and weak convergence may occur, mainly
when there is a substantial variation in the gradients. This problem is effectively addressed
by introducing a momentum factor into the SGDM algorithm. The momentum term refers
to a computed value representing the average of the gradients obtained from previous
iterations. This approach smooths out irregular updates and keeps the optimization process moving in a desirable direction, even amid variations in the gradients. The update rule of SGDM has two primary elements: the
present gradient and the momentum term. The existing gradient is scaled using a learning
rate, which handles the magnitude of the update stage. The momentum term is multiplied
by a coefficient that regulates the impact of past gradients on the present update. The
model parameters are updated by combining these two components.
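In symbols, one common formulation of this update (assumed here for illustration, with parameters θ, loss L, learning rate r, and momentum coefficient γ) is:
v_t = γ ∗ v_(t−1) + r ∗ ∇L(θ_(t−1)),  θ_t = θ_(t−1) − v_t
where v_t is the momentum (velocity) term that accumulates an exponentially weighted average of past gradients.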
Experiment 3B.
In this section, we test the three optimizers, as mentioned above, in order to compare
their classification accuracy. To achieve this, we adjust the hyperparameters (the learning
rate, momentum, L2 regularization, and batch size) following a ‘trial-and-error’ process and
notice which ones work best for the network [43,44]. There are specific hyperparameters
for each optimizer that can be changed to improve the training performance. We will refer
to them in the next paragraph. The momentum term and the learning rate are dynamically
adjusted to improve the convergence speed. The L2 regularization term is adjusted to
prevent over-fitting. Moreover, a proper batch size helps converge faster.
Adjustment of L2 regularization.
First, the standard cross-entropy function with the L2 regularization term is formed
to prevent over-fitting. When the regularization parameter is set to zero, it poses the
risk of overfitting and decreases the network’s capacity to generalize. The regularization
parameter is modified in order not to have an impact on the other precisely adjusted
parameters inside the model. However, the consequences of it may be detected during the
convergence of the loss function. The regularization parameter is often chosen from a set of commonly used values that are logarithmically distributed within the range of 0 to 0.1. We will try L2 values of 0.1, 0.001, and 0.00001.
Afterwards, we employ the same parameter initialization when comparing the three optimization algorithms. Finally, optimal values for the hyperparameters, such as the momentum and learning rate, are determined using exhaustive grid searching, and the resulting predictions are reported.
In the editor of MATLAB, we write the code for training and testing the three optimiz-
ers. We present an example of the code that processes the data for training the models in
Box 1.
% Prepare the training data for the classifier (Box 1).
inputTable = trainingData;                                  % table with the gas ratios and fault labels
predictorNames = {'INPUT1', 'INPUT2', 'INPUT3', 'INPUT4'};  % the four gas-ratio predictors
predictors = inputTable(:, predictorNames);
response = inputTable.FAULT;                                % fault class of each observation
isCategoricalPredictor = [false, false, false, false];
classNames = [1; 2; 3; 4; 5];                               % the five output classes F1-F5
A. Adam solver
We train the ANN using the Adam solver and set its properties.
The adjustable hyperparameters are:
• Squared Gradient Decay Factor;
• Learning rate;
• Gradient Decay Factor (controls the strength of L2 regularization factor);
• Epsilon (ε): A tiny value which is added to the denominator to prevent division by
zero, usually in the range of 1 × 10−7 or 1 × 10−8 . Machine learning usually encoun-
ters Epsilon when computing ratios, gradients, or other mathematical procedures
involving division. Adding Epsilon guarantees that the division stays defined even if
the denominator is near zero and that the computation does not result in numerical
instability or errors.
Set ‘Gradient Threshold’ as 1.
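A sketch of how these settings might be expressed with the Deep Learning Toolbox trainingOptions function is given below; the learning rate, gradient decay factor, gradient threshold, epoch count, and validation frequency follow the values reported for the Adam runs in this work, while the remaining values are assumed defaults rather than the exact script used in our experiments:
options = trainingOptions('adam', ...
    'InitialLearnRate', 0.001, ...                % learning rate
    'GradientDecayFactor', 0.999, ...             % decay factor for the gradient moving average
    'SquaredGradientDecayFactor', 0.999, ...      % assumed (MATLAB default value)
    'Epsilon', 1e-8, ...                          % small constant preventing division by zero
    'GradientThreshold', 1, ...                   % gradient clipping threshold set above
    'MaxEpochs', 30, ...
    'ValidationFrequency', 50, ...                % validation data itself is supplied via 'ValidationData'
    'Plots', 'training-progress');
% The options object is then passed to trainNetwork together with the layer array and the
% training data prepared in Box 1; selecting 'sgdm' or 'rmsprop' instead of 'adam' exposes
% the corresponding solver-specific options (e.g., 'Momentum' for SGDM).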
Experimentation outcomes:
Following an intensive search based on the various criteria discussed above (Section 2, Adaptive Learning Rate Method), we determined that the following parameter values provide the optimal results, as illustrated below:
The results obtained by employing the Adam optimizer are:
• Best validation accuracy: 90.4%
• Best test accuracy: 0.907
• Gradient Decay Factor: 0.999
• Learning rate: 0.001
We can observe the progress of the network's training, validation, and testing in Figures 11 and 12a and the confusion matrix in Figure 12b.
Figure 11 illustrates the training progress diagram, which displays various training metrics at each iteration. The diagram provides information on the training time, training cycle, and other parameters, which are displayed on the right side. The shaded backdrop serves as a visual representation of each training epoch (a full pass across the whole of the dataset).
The classification accuracy on each specific mini-batch is presented using a light blue curve; however, the program gives a smoothed version of the training accuracy (blue curve). We have provided a validation set, so the black curve shows the classification accuracy on the complete validation set. We set the validation frequency to 50 for estimating the ANN on the validation data every 50 iterations and the maximum quantity of epochs to 30. Moreover, we set the iterations to 450; every iteration comprises the estimate of the gradient and the subsequent updating of the network parameters. After the end of the training process, the figure displays the ultimate validation accuracy and the rationale for the termination of training, which is the completion of the maximum number of epochs (30). After this training process, the validation accuracy reaches 90.38% in the 10th epoch after 150 iterations.
The loss function or cross-entropy loss diagram illustrates the loss received on every mini-batch. Both the actual curve and its smoothed form are observable. Furthermore, we can see the loss curve on the validation dataset. The cross-entropy loss reaches 0.38 in the 27th epoch after 400 iterations.
Figure 12. (a) The progress of the network's training for Adam; (b) test confusion matrix for Adam.
Figure 12a exhibits a table created by the MATLAB software when we use the 'trainNetwork' function. It displays various training metrics every 50 iterations until 450 iterations
according to the progress of the network’s training. The max validation accuracy is 90.38%
in the 10th epoch after 150 iterations and at the fourth second of elapsed time.
The minimum validation loss is 0.3830 in the 27th epoch after 400 iterations and at the
sixth second of elapsed time. Moreover, the test accuracy reaches 0.907.
Figure 12b presents the CM for the test data. The dataset used for testing consists
of 54 instances. Out of these, the correctly predicted classes are 49, and the misclassified
classes are 5.
• We carry out an identical process for both the SGDM optimizer and the RMSprop opt.
The findings are presented in the following tables and figures.
B. SGDM optimization
After conducting a thorough investigation using a range of criteria as previously
described, we have identified the magnitudes of parameters that result in the most favor-
able results.
The adjustable hyperparameters are:
Momentum
Learning rate
The results obtained by employing SGDM optimizer are:
• Best validation accuracy: 82.7%
• Best test accuracy: 81%
• Momentum: 0.9
• Learning rate: 0.00008
Experimentation outcomes:
The magnitudes of the parameters that provide the optimum results are summarized
in Table 9.
We can monitor the progress of the network's training, validation, and testing in Figures 13 and 14a and the confusion matrix in Figure 14b.
The max validation accuracy is 82.69% in the 17th epoch after 50 iterations and at the
third second of elapsed time.
The minimum validation loss is 0.5371 in the 34th epoch after 100 iterations and at the third second of elapsed time. Moreover, the test accuracy reaches 0.814.
Figure 14b presents the CM for the test data. The dataset used for testing consists
of 54 instances. Out of these, the correctly predicted classes are 44, and the misclassified
classes are 10.
Figure 14. (a) Progress of the network's training for SGDM; (b) test confusion matrix for SGDM.
C. RMSprop
We train the ANN using the RMSprop solver and set its properties.
The adjustable hyperparameters are:
Squared Gradient Decay Factor
Learning rate
Gradient Decay Factor (controls the strength of the L2 regularization factor)
Epsilon (ε): A very small value added to prevent dividing by zero when updating the parameters; typically, values are 1 × 10−7 or 1 × 10−8.
Experimentation outcomes:
At the first training of the model, we observe an overfitting with these values:
• Validation accuracy: 67.31%
• Test accuracy: 0.75
• Momentum: 0.99
• Learning rate: 0.0003
In order to prevent overfitting, we adjust the momentum and learning rate hyper-
parameters. Firstly, we change the Gradient Decay Factor that controls the strength of
L2 regularization to the value 0.999. After many training procedures with different sets
of hyperparameters, we select those that provide the best accuracy to the model. The
magnitudes of the parameters that provide the optimum results are illustrated in Table 9.
The best results obtained by employing the RMSprop optimizer are:
• Best validation accuracy: 86%
• Best test accuracy: 83%
• Momentum: 0.999
• Learning rate: 0.001
We can observe the progress of the network's training, validation, and testing in Figures 15 and 16a and the confusion matrix in Figure 16b.
The max validation accuracy is 86.54% in the 17th epoch after 50 iterations and at the
fourth second of elapsed time.
The minimum validation loss is 0.5198 in the 17th epoch after 50 iterations and at the
fourth second of elapsed time. Moreover, the test accuracy reaches 0.814.
Figure 16b presents the CM for the test data. The dataset used for testing consists
of 54 instances. Out of these, the correctly predicted classes are 45, and the misclassified
classes are 9.
Figure 16. (a) Training progress for RMSprop; (b) test confusion matrix for RMSprop.
Table 10 illustrates the validation and testing accuracy of the Adam, SGDM, and RMSprop algorithms.
Table 10. Validation and training accuracy for Adam, SGDM, and RMSprop.
Based on the outcomes obtained from our experimentation, it can be inferred that
the Adam optimization method has several benefits in comparison to other optimization
techniques. The technique demonstrates computational efficiency, demands less memory compared to other approaches, and has shown quicker convergence in several instances.
Additionally, the model exhibits robustness in selecting hyperparameters and demonstrates
satisfactory performance when using default values. Consequently, the final model of our
optimization approach is an ANN with the subsequent values:
• Number of layers: 3
• Layers size: 100, 50, 5
• Activation function: Tanh
• Best validation accuracy: 90.4%
• Best test accuracy: 90.7%
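A minimal sketch of how this selected (100-50-5) architecture with tanh activations could be expressed as a layer array for trainNetwork or the Deep Network Designer is shown below; the input layer size of four follows the gas-ratio predictors described earlier, and the layer names are illustrative:
layers = [
    featureInputLayer(4, 'Name', 'gas_ratios')       % the four DGA gas ratios
    fullyConnectedLayer(100, 'Name', 'fc1')
    tanhLayer('Name', 'tanh1')
    fullyConnectedLayer(50, 'Name', 'fc2')
    tanhLayer('Name', 'tanh2')
    fullyConnectedLayer(5, 'Name', 'fc3')             % one output per fault class F1-F5
    softmaxLayer('Name', 'softmax')
    classificationLayer('Name', 'faults')];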
Finally, in our model, we start with a validation accuracy of 80.9% with a test accuracy
of 79%, and after the optimization method, the validation accuracy reaches 90.4% with a
test accuracy of 90.7%. So, the optimized ANN with 90.7% accuracy is a superior method
for PTFD.
This mathematical optimization of ANNs is investigated to overcome the limitations of the IEC-599 standard Rogers Ratio Method. The optimized ANN is
simulated and tested in MATLAB R2023a—Deep Network Designer. The dataset used in
this study consisted of 60 data samples obtained from the IEEE DGA datasets [15] and the
dataset used in our previous work [8].
The results are illustrated in Table 11. The accuracy of each approach is calculated
using the formula (A) = (Total Samples Correctly Classified)/(Total Samples). Column (1) gives the serial number of every sample. Columns (2) to (6) indicate the concentrations of the gases in ppm. The real faults of the transformers under examination are provided in column (7).
(1) S.N | (2) H2 | (3) CH4 | (4) C2H6 | (5) C2H4 | (6) C2H2 | (7) Real Fault | (8) Rogers | (9) Agr. | (10) ANN | (11) Agr.
√ √
1 13 138 83 16 0 F4 F4 √ F4 √
2 762 93 38 54 126 F3 F3 √ F3 √
3 43 116 65 139 0 F4 F4 √ F4 √
4 179 306 73 579 0 F4 F4 √ F4 √
5 57 141 38 51 0 F4 F4 √ F4 √
6 40 8 34 15 0 F4 F4 F4 √
7 35 283 121 222 0 F4 U.F —– F4
8 15 159 29 87 0 F4 U.F —–
√ F5 —–
√
9 55 159 114 493 0 F4 F4 √ F4 √
10 37 123 67 52 0 F4 F4 √ F4 √
11 723 191 110 293 288 F3 F3 √ F3 √
12 7 15 78 58 0 F4 F4 √ F4 √
13 30 51 12 54 0 F4 F4 √ F4 √
14 31 56 33 77 0 F4 F4 √ F4 √
15 109 226 68 192 0 F4 F4 √ F4 √
16 137 279 66 505 0 F4 F4 √ F4 √
17 59 119 36 70 0 F4 F4 √ F4 √
18 151 242 68 232 0 F4 F4 √ F4 √
19 870 77 73 54 14 F2 F2 √ F2 √
20 376 575 146 1092 0 F4 F4 F4 √
21 269 1081 347 1725 25 F5 U.F —–
√ F5 √
22 10 10 8 1 0.01 F4 F4 √ F4 √
23 30 22 14 4.10 0.1 F1 F1 √ F1 √
24 2.90 2 2 0.3 0.1 F1 F1 √ F1 √
25 4 99 82 4 0.1 F4 F4 F4
√
26 21 34 5 47 62 F3 U.F —–
√ F3 √
27 50 100 51 305 9 F4 F4 F4 √
28 120 17 32 4 23 F1 U.F —–
√ F1 √
29 980 73 58 12 0.01 F2 F2 F2 √
30 1607 615 80 916 1294 F3 U.F —– F3
31 14.7 3.7 10.5 2.7 0.2 F4 U.F —– F5 —–
√
32 181 262 41 28 0.01 F4 U.F —–
√ F4 √
33 173 334 172 812.5 33.7 F4 F4 F4 √
34 127 107 11 154 224 F3 F4 —– F3 √
35 60 40 6.9 110 70 F3 F4 —– F3 √
36 980 73 58 12 0.01 F2 F3 —– F2
37 86 187 136 363 0.01 F4 F3 —– F5 —–
√
38 10 24 372 24 0.01 F4 U.F —–
√ F4 √
39 260 3 18 2 0.01 F2 F2 F2 √
40 586 19 77 6 0.01 F2 F4 —–
√ F2 √
41 20 175 92 14 0.02 F4 F4 √ F4 √
42 801 87 45 62 150 F3 F3 √ F3 √
43 51 99 75 150 0.03 F4 F4 F4
44 200 298 69 602 0.05 F5 F4 —–
√ F4 —–
√
45 60 154 41 49 0 F4 F4 F4
46 40 8 34 15 0.2 F1 F4 —– F4 —–
√
47 45 283 158 199 0 F4 U.F —– F4 √
48 21 159 22 91 0.02 F4 U.F —– F4 √
49 55 159 128 502 0 F5 F4 —–
√ F5
50 41 223 71 52 0 F3 F4 √ F4 —–
√
51 689 203 129 301 362 F2 F3 √ F2 √
52 10 24 95 45 0.02 F4 F4 F4 √
53 45 69 7 45 0.003 F5 F4 —–
√ F5 √
54 45 59 45 89 0.01 F4 F4 √ F4 √
55 98 198 70 201 0.04 F4 F4 F4 √
56 204 302 57 495 0 F5 F4 —–
√ F5 √
57 45 125 48 82 0 F4 F4 √ F4 √
58 201 256 54 224 0 F4 F4 √ F4 √
59 905 83 81 63 12 F2 F2 F2 √
60 402 604 99 998 0.02 F5 F4 —– F5
√
Key: UF = Unidentified fault, √ = fault type diagnosed correctly, —– = fault type not diagnosed correctly, Agr. = agreement with real fault.
In columns (9) and (11), the outcomes of each method are contrasted with the actual
faults. In column (8), the faults are obtained using the conventional Rogers Ratio Method.
This method cannot predict 11 faults (the primary limitation of the Rogers Method), and
there are also 11 incorrect estimations, as we can notice in column (9).
The Rogers Ratio Method’s accuracy is 63.3%, which is relatively low.
Column (10) depicts the output of the ANN. The ANN model accurately predicts 54 faults and incorrectly predicts 6 instances, as seen in column (11). The level of precision the ANN shows is remarkably high; it reaches 90%. The aforementioned experiments showed a test accuracy of 90.7%, which is extremely close to this value.
So, the ANN method's accuracy is 90%, which is remarkably high.
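For completeness, both accuracy figures follow directly from the counts above, with 60 test samples in total:
A_Rogers = 38/60 ≈ 63.3%,  A_ANN = 54/60 = 90%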
4. Conclusions
This research introduces an approach for improving the diagnostic accuracy of power transformer fault identification by employing an optimized ANN. The choice of optimizer depends on the particular problem, the dataset, and the network architecture, and it usually takes experimentation to find the most suitable one. Moreover, the parameters of an ANN do not have definitive optimum values. The primary objective of the ANN designer is therefore to identify a learning algorithm that enhances the generalization capability of the model. Given this context, an intensive search was performed over the training algorithm and its hyperparameters, including the number of neurons, the learning rate, the activation functions, the L2 regularization factor, and the momentum, in order to obtain the desired results.
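To make this search space concrete, the fragment below sketches, using MATLAB's Deep Learning Toolbox, how such a set of hyperparameters can be expressed for a layered feedforward classifier such as the (100-50-5) network selected in this work. It is an illustrative sketch only: the input size, the activation functions, and the numeric values are placeholders, not the exact settings adopted here.

% Illustrative only: one way to express the tuned hyperparameters in MATLAB.
numFeatures = 3;                        % placeholder: number of input gas ratios
layers = [
    featureInputLayer(numFeatures)
    fullyConnectedLayer(100)            % first hidden layer (number of neurons)
    reluLayer                           % activation function (placeholder choice)
    fullyConnectedLayer(50)             % second hidden layer
    reluLayer
    fullyConnectedLayer(5)              % output layer: fault classes F1-F5
    softmaxLayer
    classificationLayer];

options = trainingOptions('sgdm', ...   % training algorithm
    'InitialLearnRate', 1e-2, ...       % learning rate (placeholder value)
    'Momentum',         0.9, ...        % momentum (placeholder value)
    'L2Regularization', 1e-4, ...       % L2 regularization factor (placeholder value)
    'MaxEpochs',        200, ...
    'Shuffle',          'every-epoch', ...
    'Verbose',          false);

% net = trainNetwork(XTrain, YTrain, layers, options);  % XTrain, YTrain: hypothetical data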
The confusion matrices and ROC curves of all search methods show that the class with the best prediction score (100%) is Class 2, the Low Energy Discharge faults, which is also the class with the most data samples. This observation indicates that increasing the number of samples available for a class raises its classification accuracy. Our dataset contains 400 gas-ratio samples; as the literature suggests and our experiments confirm, reaching higher accuracies requires a sufficient number of training samples. However, obtaining large amounts of data is difficult in the field of PTFD, and this limited data availability restricts the generalizability of ANN-based PTFD results. More extensive and diverse datasets are therefore needed to train and validate ANN models effectively.
Collecting additional samples is necessary to achieve better classification accuracy and to reduce architectural complexity. Moreover, research on enhancing data quality and managing inconsistencies is crucial for accurate fault diagnosis, since the consistency and quality of DGA data can differ significantly between sensor stations and maintenance operations. Another important issue for future work is multi-fault diagnosis. This work, like most current studies, concentrates on single-fault diagnosis, i.e., detecting one distinct fault type such as a thermal fault. It would be valuable to design ANN-based models capable of diagnosing multiple faults that occur simultaneously, as well as more complicated fault conditions.
The results confirm that hyperparameters play an essential role in controlling the learning process of a model and that the specific values assigned to them have a significant impact on its overall performance. Consequently, hyperparameter optimization and the augmentation of the data samples are key factors in improving the accuracy and reducing the complexity of the ANN.
Furthermore, the optimized ANN model was systematically compared with the Rogers Ratio Method, which achieves an accuracy of only 63.3%, whereas the ANN model yields an accuracy of 90%. These findings highlight the potential of the optimized ANN model as an advanced and accurate solution for transformer health assessment and maintenance, effectively overcoming the limitations of the conventional DGA interpretation technique, the Rogers Ratio Method.
Author Contributions: Conceptualization, V.R. and S.D.K.; methodology, V.R.; software, V.R.; valida-
tion, S.D.K. and P.K.; formal analysis, V.R. and S.D.K.; investigation, V.R. and D.K.; resources, V.R.
and D.K.; data curation, V.R. and S.D.K.; writing—original draft preparation, V.R.; writing—review
and editing, V.R. and D.K.; visualization, V.R.; supervision, P.K.; project administration, S.D.K. and
P.K. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Data Availability Statement: The data are available in a publicly accessible repository that does not issue DOIs. Publicly available datasets were analyzed in this study. The dataset can be found at https://ieee-dataport.org/documents/dissolved-gas-data-transformer-oil-fault-diagnosis-power-transformers-membership-degree#files (accessed on 24 January 2023).
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Thango, B.A. Dissolved Gas Analysis and Application of Artificial Intelligence Technique for Fault Diagnosis in Power Trans-
formers: A South African Case Study. Energies 2022, 15, 9030. [CrossRef]
2. Alsuhaibani, S.; Khan, Y.; Beroual, A.; Malik, N.H. A Review of Frequency Response Analysis Methods for Power Transformer
Diagnostics. Energies 2016, 9, 879. [CrossRef]
3. Bhalla, D.; Bansal, R.K.; Gupta, H. Application of Artificial Intelligence Techniques for Dissolved Gas Analysis of Transformers—A
Review. World Acad. Sci. Eng. Technol. 2010, 62, 221–229.
4. Singh, J.; Sood, Y.; Jarial, R. Condition Monitoring of Power Transformers Bibliography Survey. IEEE Electr. Insul. Mag. 2008, 24,
11–25. [CrossRef]
5. Papadopoulos, A.E.; Psomopoulos, C.S. The contribution of dissolved gas analysis as a diagnostic tool for the evaluation of the
corrosive sulphur activity in oil insulated traction transformers. In Proceedings of the 6TH IET Conference on Railway Condition
Monitoring (RCM), University of Birmingham, Birmingham, UK, 17–18 September 2014.
6. Koroglu, S. A Case Study on Fault Detection in Power Transformers Using Dissolved Gas Analysis and Electrical Test Methods. J.
Electr. Syst. 2016, 12, 442–459.
7. Siva Sarma, D.V.S.S.; Kalyani, G.N.S. ANN approach for condition monitoring of power transformers using DGA. In Proceedings
of the IEEE Region 10 Conference TENCON, Chiang Mai, Thailand, 24 November 2004.
8. Rokani, V.; Kaminaris, S.D. Power Transformers Fault Diagnosis Using AI Techniques. AIP Conf. Proc. 2020, 2307, 020056.
[CrossRef]
9. Barkas, D.A.; Kaminaris, S.D.; Kalkanis, K.K.; Ioannidis, G.C.; Psomopoulos, C.S. Condition Assessment of Power Transformers
through DGA Measurements Evaluation Using Adaptive Algorithms and Deep Learning. Energies 2023, 16, 54. [CrossRef]
10. Patel, D.M.K.; Patel, D.A.M. Simulation and analysis of dga analysis for power transformer using advanced control methods.
Asian J. Converg. Technol. 2021, 7, 102–109. [CrossRef]
11. Ciulavu, C.; Helerea, E. Power Transformer Incipient Faults Monitoring. Ann. Univ. Craiova-Electr. Eng. Ser. 2008, 32, 72–77.
12. Dhini, A.; Faqih, A.; Kusumoputro, B.; Surjandari, I.; Kusiak, A. Data-driven Fault Diagnosis of Power Transformers using
Dissolved Gas Analysis (DGA). Int. J. Technol. 2020, 11, 388–399. [CrossRef]
13. Rogers, R.R. IEEE and IEC Codes to Interpret Incipient Faults in Transformers, Using Gas in Oil Analysis. IEEE Trans. Electr. Insul.
1978, 5, 349–354. [CrossRef]
14. IEEE. C57.104-1991-IEEE Guide for the Interpretation of Gases Generated in Oil-Immersed Transformers; IEEE: New York, NY, USA,
1992. [CrossRef]
15. IEEE DataPort. Available online: https://ieee-dataport.org/documents/dissolved-gas-data-transformer-oil-fault-diagnosis-power-transformers-membership-degree#files (accessed on 17 April 2023).
16. Hussein, A.R.; Yaacob, M.; Othman, M. Ann Expert System for Diagnosing Faults and Assessing the Quality Insulation Oil of
Power Transformer Depending on the DGA Method. J. Theor. Appl. Inf. Technol. 2015, 78, 278.
17. Sarma, J.J.; Sarma, R. Fault Analysis of High Voltage Power. Int. J. Adv. Res. Electr. Electron. Instrum. Eng. 2017, 6, 2411–2419.
[CrossRef]
18. Zhang, Y.; Tang, Y.; Liu, Y.; Liang, Z. Fault Diagnosis of Transformer Using Artificial Intelligence: A Review. Front. Energy Res.
2022, 10, 1006474. [CrossRef]
19. Lopes, S.M.d.A.; Flauzino, R.A.; Altafim, R.A.C. Incipient Fault Diagnosis in Power Transformers by Data-Driven Models with
over-Sampled Dataset. Electr. Power Syst. Res. 2021, 201, 107519. [CrossRef]
20. Bishop, C.M. Pattern Recognition and Machine Learning. In Information Science and Statistics; Springer: Berlin/Heidelberg,
Germany, 2006; ISBN 978-0-387-31073-2.
21. Russell, S.; Norvig, P. Artificial Intelligence: A Modern Approach. In Always Learning; Pearson: London, UK, 2016; ISBN
978-1-292-15396-4.
22. Ling-fang, H. Artificial Intelligence. In Proceedings of the 2010 The 2nd International Conference on Computer and Automation
Engineering (ICCAE), Singapore, 26–28 February 2010; Volume 4, pp. 575–578.
23. Haykin, S.S. Neural Networks and Learning Machines. In Pearson International Edition; Pearson: London, UK, 2009; ISBN
978-0-13-129376-2.
24. Hutter, F.; Lücke, J.; Schmidt-Thieme, L. Beyond Manual Tuning of Hyperparameters. KI—Kunstl. Intell. 2015, 29, 329–337.
[CrossRef]
25. Bartz, E.; Bartz-Beielstein, T.; Zaefferer, M.; Mersmann, O. Hyperparameter Tuning for Machine and Deep Learning with R; Springer: Singapore, 2023; ISBN 978-981-19516-9-5.
26. Gridin, I. Hyperparameter Optimization. In Automated Deep Learning Using Neural Network Intelligence: Develop and Design PyTorch
and TensorFlow Models Using Python; Apress: Berkeley, CA, USA, 2022; pp. 31–110. ISBN 978-1-4842-8149-9.
27. Bi, C.; Tian, Q.; Chen, H.; Meng, X.; Wang, H.; Liu, W.; Jiang, J. Optimizing a Multi-Layer Perceptron Based on an Improved Gray
Wolf Algorithm to Identify Plant Diseases. Mathematics 2023, 11, 3312. [CrossRef]
28. Xu, T.; Gao, Z.; Zhuang, Y. Fault Prediction of Control Clusters Based on an Improved Arithmetic Optimization Algorithm and
BP Neural Network. Mathematics 2023, 11, 2891. [CrossRef]
29. Asimakopoulou, G.; Kontargyri, V.; Tsekouras, G.; Asimakopoulou, F.; Gonos, I.; Stathopulos, I. Artificial Neural Network
Optimisation Methodology for the Estimation of the Critical Flashover Voltage on Insulators. IET Sci. Meas. Technol. 2009, 3,
90–104. [CrossRef]
30. Kussul, E.M.; Baidyk, T.; Wunsch, D.C. Neural Networks and Micromechanics; Springer: Berlin/Heidelberg, Germany; New York, NY, USA, 2010; ISBN 9783642025341.
31. Geman, S.; Bienenstock, E.; Doursat, R. Neural Networks and the Bias/Variance Dilemma. In Neural Computation; MIT Press:
Cambridge, MA, USA, 1992; Volume 4, pp. 1–58.
32. Girolami, M. A First Course in Machine Learning; CRC Press: Boca Raton, FL, USA, 2015; ISBN 978-1-4987-5960-1.
33. Alemu, H.Z.; Wu, W.; Zhao, J. Feedforward Neural Networks with a Hidden Layer Regularization Method. Symmetry 2018, 10,
525. [CrossRef]
34. Sun, S.; Cao, Z.; Zhu, H.; Zhao, J. A Survey of Optimization Methods from a Machine Learning Perspective. IEEE Trans. Cybern.
2020, 50, 3668–3681. [CrossRef] [PubMed]
35. Bejani, M.M.; Ghatee, M. A Systematic Review on Overfitting Control in Shallow and Deep Neural Networks; Springer: Dordrecht, The Netherlands, 2021; Volume 54.
36. Tian, Y.; Zhang, Y.; Zhang, H. Recent Advances in Stochastic Gradient Descent in Deep Learning. Mathematics 2023, 11, 682.
[CrossRef]
37. Kingma, D.P.; Ba, J.L. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on
Learning Representations, ICLR 2015—Conference Track Proceedings, San Diego, CA, USA, 7–9 May 2015; pp. 1–15.
38. Beale, M.H.; Hagan, M.T.; Demuth, H.B. Deep Learning Toolbox™ User's Guide. 2020. Available online: https://www.mathworks.com/help/deeplearning/ (accessed on 17 April 2023).
39. MathWorks. Available online: https://www.mathworks.com/content/dam/mathworks/mathworks-dot-com/campaigns/portals/files/machine-learning-resource/machine-learning-with-matlab.pdf (accessed on 1 April 2022).
40. Kim, P. MATLAB Deep Learning; Apress: Berkeley, CA, USA, 2017; ISBN 978-1-4842-2844-9.
41. Biswas, S.; Nayak, P.K.; Panigrahi, B.K.; Pradhan, G. An Intelligent Fault Detection and Classification Technique Based on
Variational Mode Decomposition-CNN for Transmission Lines Installed with UPFC and Wind Farm. Electr. Power Syst. Res. 2023,
223, 109526. [CrossRef]
42. Biswas, S.; Nayak, P.K. A New Approach for Protecting TCSC Compensated Transmission Lines Connected to DFIG-Based Wind
Farm. IEEE Trans. Ind. Inf. 2021, 17, 5282–5291. [CrossRef]
43. Bouzar-Benlabiod, L.; Rubin, S.H.; Benaida, A. Optimizing Deep Neural Network Architectures: An Overview. In Proceedings of
the 2021 IEEE 22nd International Conference on Information Reuse and Integration for Data Science, Las Vegas, NV, USA, 10–12
August 2021; pp. 25–32. [CrossRef]
44. Kamalov, F.; Leung, H.H. Deep Learning Regularization in Imbalanced Data. In Proceedings of the 2020 IEEE International
Conference on Communications, Computing, Cybersecurity, and Informatics, Sharjah, United Arab Emirates, 3–5 November
2020; pp. 17–21. [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.