
© 2023 IJNRD | Volume 8, Issue 1 January 2023 | ISSN: 2456-4184 | IJNRD.ORG

Solar PV Module Fault Classification using Artificial Intelligence and Machine Learning Techniques

Jitamitra Mohanty, Itun Srangi, Jagadish Chandra Pati
Krupajal Engineering College

Abstract

Fault analysis in solar photovoltaic (PV) arrays is essential to increase reliability and improve efficiency and safety in PV systems. Conventional fault protection methods are usually employed to overcome the challenge; however, conventional protection is only effective in isolating faulty circuits when large currents flow, remains inactive for low fault currents, and may cause problems in the long run. The model of the different faults emulates the different PV fault conditions which are essential for a healthy PV power system analysis. The model is a solution to classify the potential faults during fault conditions, to cut down on the time and cost invested in fault analysis through human analysis. The model is achieved through the use of Artificial Intelligence and Machine Learning techniques. The model performance is matched against specified vectors to check the accuracy using the confusion matrix, to ensure good performance in the design. The simulated results determined that the fault diagnosis scheme can correctly classify faults with high efficiency, making the power plant troubleshooting process easier. The entire plant characteristics are obtained from the fed data, and the model is trained to capture the entire system behaviour for future instance classification.

Keywords: Faults, Photovoltaics, Simulink, Simulation, Confusion matrix, Power characteristics, Machine learning, Decision tree, Classifiers.

Introduction

Photovoltaic systems provide a promising solution to the world's energy problem. The solar energy industry is currently rising in popularity following the maturity of the technologies and the consequent reduction in material costs. However, the capital and maintenance costs for PV panels are still high because they are mostly installed outdoors, where they suffer both mechanical and electrical stress, resulting in additional power losses, hot-spot formation, and other complications in PV modules such as fire outbreaks, in turn leading to reduced PV efficiency or even complete breakdown in production. If PV systems are not monitored, faults may propagate within the modules and cause a complete failure of the PV array. Fault detection methods for PV systems are generally grouped into visual analysis, thermal testing, and electrical techniques. Electrical fault analysis is the most effective and promising for efficient monitoring and diagnostics of PV systems. Today, electricity is predicted to be largely

IJNRD2301123 International Journal of Novel Research and Development (www.ijnrd.org) b170



supplied by solar power. It is reasonable to focus on the design of smart systems to monitor such solar power systems and classify the fault type that might be present, for reliability. PV systems provide several advantages over other conventional energy systems. The energy provided is modular, in that the capacity to be generated depends on the amount required, and it also provides easy options to expand the power system to meet demand. Regardless of the massive initial cost of setting up a PV power system, there is no cost for machinery like transformers, generators, and transmission equipment. Overall maintenance of a PV system is more modular and easily accessible. The above attributes have resulted in an expansion of photovoltaics, and India has invested enough to improve the sector, as shown in Figure 1.

Figure 1: Solar market in India by installed capacity

There has been a progressive increase in the installation of solar power plants. With the continued rate of installation, a future with more clean and reliable energy will be guaranteed to improve the energy sector. Monitoring systems that can capture real-time analysis of power plants are being designed to improve the reliability and stability of power systems, improving energy utility by the industries and avoiding the risk of fires or other hazards.

Modeling and Simulation of PV Modules and Data Generation

A solar cell exhibits a non-linear output characteristic as in Figure 2, and the curve varies with irradiance and temperature levels. Solar cells are connected in a series and parallel combination to form a module. If modules are connected in series the system voltage increases, and if connected in parallel the system current increases, as in Figure 2. Every solar cell design should account for the parameters that affect the amount of generated current, like irradiance, temperature, and type of semiconductor material.

Figure 2: I-V characteristics of a solar cell

A PV module is usually composed of several solar cells with identical characteristics, and the modules in a series and parallel combination form a PV array. In real working conditions, PV modules may work at different irradiance, different ambient temperature, and even under different fault conditions; this makes the I-V curve of a PV array completely different from the ideal case of Figure 2.

Many models for solar cells have been designed to meet different conditions. However, the best model should simply be accurate enough to account for most solar cell parameters. The numerical approach models the PV module using the equations that define its basic working. Several circuit software packages can be used to design this model, but this work uses Simulink in Matlab to build a 1.3KW PV system. The approach to building the system is shown in the sequence of Figure 3. The 1.3KW plant is then subjected to different fault configurations for data generation, for future fault prediction on the plant.

Figure 3: The sequence for modeling the 1.3KW PV system

Solar Cell Models

Solar cells have non-linear I-V characteristics that vary with irradiance, so it is not suitable to model a solar cell as a constant voltage source; instead, solar cells are modelled as a current source. Among the different circuit designs, the single-diode and double-diode models are the most used to describe the characteristics of the solar cell. Rs describes the ohmic losses in the solar cell contacts and metal-semiconductor interfaces. For the single-diode model, it is assumed for the sake of simplicity that there is no recombination around the junction region. Especially for semiconductor materials with larger bandgaps, this assumption leads to deviations between the actual and simulated characteristic curves of the solar cell, but the double-diode model attempts to incorporate the recombination in the junction. The equivalent circuits for the single-diode model and the double-diode model are shown in Figure 4(a) and Figure 4(b).

Figure 4: Solar cell circuit models (a) Single-diode model (b) Double-diode model

For the single-diode model in Figure 4(a), the solar cell current equation (1) indicates that more current flows through the load if Iph is higher, which depends on the irradiance levels. Iph is the generated current and becomes maximum during noontime. V is the working voltage of the cell, and any system design must meet this voltage level for the smooth working of the PV system. I represents the load current, indicating the demand drawn by the load. Io is the saturation current and depends on the semiconductor material used. The parameters in equation (1) are shown in Table 1.

I = Iph − Io·(e^(q(V + I·Rs)/(n·k·T·Ns)) − 1) − (V + I·Rs)/Rsh        Equation (1)

Table 1: Solar cell parameters

Symbol | Parameter
I      | Solar cell current (A)
V      | Solar cell voltage (V)
Iph    | Light-generated current (A)
Ish    | Shunt resistance current (A)
Io     | Saturation current of the diode (A)
Rsh    | Solar cell shunt resistance (ohms)
Rs     | Series resistance (ohms)
n      | Diode ideality factor
Ns     | Number of series-connected cells
k      | Boltzmann's constant = 1.38×10⁻²³ J/K
q      | Electron charge = 1.6×10⁻¹⁹ C
T      | Ambient temperature (K)

The double-diode model incorporates a second diode, as in equation (2), which represents the losses due to recombination at the surface and the junction of the solar cell. Equation (2) provides a more accurate description of a solar cell but requires higher computation power.

I = Iph − Io1·(e^(q(V + I·Rs)/(n·k·T·Ns)) − 1) − Io2·(e^(q(V + I·Rs)/(n·k·T·Ns)) − 1) − (V + I·Rs)/Rsh        Equation (2)

The single-diode model of a solar cell is used for simulation because the model is accurate enough and converges faster than the double-diode model. The design is achieved in Matlab by using an inbuilt solar module by "1-Soltech" to develop a 1.3KW PV system.
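Since I appears on both sides of equation (1), it must be solved iteratively. A minimal numeric sketch in Python follows; the parameter values (ideality factor, saturation current, resistances, 36 series cells) are illustrative assumptions, not the paper's module data.

```python
import math

# Illustrative single-diode parameters (assumed values, not from the 1-Soltech datasheet)
q   = 1.602e-19   # electron charge (C)
k   = 1.381e-23   # Boltzmann's constant (J/K)
T   = 298.0       # ambient temperature (K)
n   = 1.3         # diode ideality factor
Ns  = 36          # series-connected cells in the module (assumed)
Iph = 7.5         # light-generated current (A)
Io  = 1e-9        # diode saturation current (A)
Rs  = 0.2         # series resistance (ohm)
Rsh = 300.0       # shunt resistance (ohm)

def cell_current(V, iters=200):
    """Solve equation (1) for I by damped fixed-point iteration:
    I = Iph - Io*(exp(q(V+I*Rs)/(n*k*T*Ns)) - 1) - (V+I*Rs)/Rsh."""
    Vt = n * k * T * Ns / q          # combined thermal-voltage term of the exponent
    I = Iph                          # start from the photo-generated current
    for _ in range(iters):
        I_new = Iph - Io * math.expm1((V + I * Rs) / Vt) - (V + I * Rs) / Rsh
        I = 0.5 * I + 0.5 * I_new    # damped update for stable convergence
    return I

# At short circuit (V = 0) the current is close to Iph, less a small shunt loss
print(round(cell_current(0.0), 3))
```

The damping factor of 0.5 is a simple way to keep the iteration stable near open circuit, where the exponential term changes rapidly with I.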


Simulation in MATLAB/Simulink

The proposed solar model of Figure 4 is implemented in Matlab using the inbuilt solar module by "1-Soltech" through simulation in the Simscape toolbox, which offers an open and flexible interface for modeling numerical and electrical systems. Figure 5 shows the design in Matlab for the 1.3KW power system. The 1.3KW power system is made of two strings with three modules in series in each string. Each module generates an optimum current of 7.47 A at an optimum voltage of 29.3 V; the short-circuit rating of the module is 7.79 A and the open-circuit voltage is 36.6 V. The panel has an efficiency of 14% and a maximum working voltage of 600 V, as shown in Table 2.

Figure 5: The 1.3KW power system circuit diagram

Table 2: The PV panel by 1-Soltech specifications

Parameter                    | Value
Open Circuit Voltage Voc (V) | 36.6
Short Circuit Current Isc (A)| 7.79
Voltage Vmp (V)              | 29.3
Current Imp (A)              | 7.47
Panel Efficiency             | 14.0%
Fill Factor                  | 0.754
System Voltage Vmax (V)      | 600

The PV system demonstrated above is tested in MATLAB/Simulink using the suggested simulation model. The PV modules can have different I-V curves for different irradiance levels and for different module conditions, a feature that is essentially useful for fault studies. In this approach, we note the healthy operating points: for the developed PV system at an irradiance of 1000 W/m², the system parameters are Imp = 14.93 A, Vmp = 87.91 V, and Pmax = 1312 W, as in the simulation results obtained in Figure 6.

Figure 6: The 1.3KW healthy I-V and P-V characteristics (a) I-V characteristic (b) P-V characteristics
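The reported array operating point follows directly from the two-strings-of-three layout: series modules add voltage, parallel strings add current. A quick sanity check in Python, using the module values from Table 2 (the ~1 W difference from the simulated 1312 W is attributable to simulation losses and rounding):

```python
# Module maximum-power-point values from Table 2 (1-Soltech panel)
vmp_module = 29.3   # voltage at maximum power point (V)
imp_module = 7.47   # current at maximum power point (A)

# Array layout from the text: 2 parallel strings, 3 modules in series per string
n_series, n_parallel = 3, 2

vmp_array = n_series * vmp_module      # series modules add voltage
imp_array = n_parallel * imp_module    # parallel strings add current
pmax_array = vmp_array * imp_array     # ideal array power, no mismatch losses

print(round(vmp_array, 2), round(imp_array, 2), round(pmax_array, 1))  # → 87.9 14.94 1313.2
```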


The capacities indicate the normal working conditions of the plant, and any deviation from this operating point suggests a defect in the system. Data indicating the possible deviation margin from these conditions under different fault conditions were generated. The different introduced conditions show the plant behavior during fault, and this data is exported for model design. The designed model understands the characteristics of the plant during fault and can isolate the type of fault in future instances using its previous learning experience, hence it can predict present fault conditions. The simulation process to capture the system fault configurations was done incorporating real-time weather. To achieve authenticity, irradiance and temperature were varied according to the levels received in the Puri time zone and weather conditions.

Data Generation

To capture the real-time characteristics of the 1.3KW PV system, real weather conditions were introduced based on the winter of 2020-2021 in Puri, where the plant is to be set. Average weather measurements for winter were recorded for the simulation purpose. The first step is to identify the sunrise and sunset hours, so as to consider only the generation periods of the power system; for the winter of 2020-2021 in Puri the average sunrise hour was 7 am and sunset was at 4 pm, and the obtained average temperatures for the winter of 2020-2021 are shown in Figure 7. Average measurements may not exactly give the actual behavior of the plant, but they give an overall and general behavior of the plant. The measurements were recorded for two months, and the average irradiance levels and temperatures corresponding to the time of day were recorded. The irradiance was low during morning hours and maximum at noontime. The different irradiance levels were used in simulating all fault configurations. Temperatures were varied in the simulation following the recorded levels.

Figure 7: The measured average temperatures in Puri in winter of 2020

The measured temperatures were used for characterizing the system from healthy conditions to different fault conditions for data generation. Different fault configurations were introduced into the power system to capture the system characteristics as much as possible. A dataset with over 1062 (6×177) data points was generated to be used in machine learning to train the model that can fully characterize the system and classify the possible faults in future instances of the plant. The temperatures over two months were recorded and the average temperature noted for simulation in Simulink. The recorded temperatures correspond only to the generation period of the plant, and these temperatures were used for the different fault conditions to generate data. The hourly temperatures were recorded and the average calculated as in Figure 8.
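The 1062-row dataset structure (6 conditions × 177 operating points each) could be assembled along the following lines. The condition names and feature ranges here are assumptions for illustration; the actual rows would come from the exported Simulink signals, not random placeholders.

```python
import random

random.seed(0)

# Assumed condition labels; the paper reports 6 classes and 177 operating
# points per class (hourly irradiance/temperature samples over the winter)
conditions = ["healthy", "shading", "line_to_line", "upper_ground",
              "lower_ground", "open_circuit"]
points_per_condition = 177

dataset = []
for label in conditions:
    for _ in range(points_per_condition):
        # Placeholder features standing in for the exported Simulink signals
        irradiance = random.uniform(100, 1000)   # W/m^2 over the generation hours
        temperature = random.uniform(15, 30)     # deg C, assumed winter range in Puri
        v, i = random.uniform(0, 88), random.uniform(0, 15)
        dataset.append({"G": irradiance, "T": temperature,
                        "V": v, "I": i, "label": label})

print(len(dataset))  # → 1062, matching the reported dataset size
```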

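Once such a labelled dataset is exported, a classifier can be trained on it. As a minimal, self-contained illustration of the lazy-learner (k-nearest-neighbour) approach discussed later in this work, the sketch below classifies by majority vote over the k closest stored samples; the two features (normalized power and current) and their toy values are assumptions, not the paper's data.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Lazy learner: no training step - just store `train` and, at prediction
    time, take a majority vote among the k nearest stored points."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))
    votes = Counter(label for _, label in nearest[:k])
    return votes.most_common(1)[0][0]

# Toy 2-feature samples: (normalized power, normalized current) -> condition
train = [((1.00, 1.00), "healthy"), ((0.95, 0.98), "healthy"),
         ((0.33, 0.40), "shading"), ((0.30, 0.35), "shading"),
         ((0.70, 1.20), "line_to_line"), ((0.68, 1.15), "line_to_line")]

print(knn_predict(train, (0.97, 0.99)))  # → healthy
print(knn_predict(train, (0.31, 0.38)))  # → shading
```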

Figure 8: Average temperatures in winter

PV System Fault Classification

The energy demand is exponentially increasing, and PV energy production is by far the fastest-growing energy technology to meet the demand. The PV industry is sustained by the reduced cost of materials and hence the reduced production cost of energy. In the last few years, the industry has also seen a qualitative improvement regarding growth in grid connectivity. The PV system suffers losses from a variety of faults. The different common faults include ground faults, line to line faults, hotspots, bypass mismatch, and arc faults, which all result in high current inflows with the potential to cause a fire. Fault analysis and protection, besides improving the efficiency and reliability of the PV system, if ignored can lead to a reduction in the power generated and a breakdown of the power system.

Classification of faults in PV system

There are many types of faults, either electrical or non-electrical, that affect a power system. Broadly classifying all possible faults is close to impossible, and throughout the years many different types of research have been carried out to isolate different fault types as much as possible. Some of the common faults like mismatch faults, ground faults, line to line faults, bypass diode faults, and arcing are explained here, as they are common in PV systems.

Table 3: Classification of various types of faults in a PV system

Type of fault      | Subclass of fault              | Description
Mismatch faults    | Partial shading                | Caused by shielding or blockage on the panels
Mismatch faults    | Uneven irradiance distribution | Due to varying intensity levels at different times of the day
Mismatch faults    | Soiling                        | Due to dust and dirt on the solar panels
Mismatch faults    | Hotspot                        | Partial shading of one part of the solar panel
Ground faults      | Upper ground fault             | Shorting of the last two modules of the PV string to the ground
Ground faults      | Lower ground fault             | Shorting of the second and third modules of the PV string to the ground, resulting in substantial back-feed current
Arc faults         | Series arc fault               | Discontinuity in any current-carrying conductor
Arc faults         | Parallel arc fault             | Insulation failure in the current-carrying conductors
Line to line fault |                                | Accidental short-circuiting of two strings of solar cells
Bypass diode fault |                                | Short-circuiting due to incorrect connection
Degradation fault  |                                | Delamination and yellowing of modules, insertion of bubbles in the modules, cracking, and defects in the anti-reflection coating of the panels
Open circuit fault |                                | Unplugging of connection wires in the junction box
Inverter faults    |                                | Failure in the components
Outage             |                                | Blackout due to weather conditions like lightning, storms, and hurricanes

Mismatch faults in the solar panels

Mismatch faults are most common in PV arrays, resulting in power loss and permanent damage to the modules. Mismatch faults in PV modules occur

when the electrical parameters of some panels are significantly different from the others, or mismatched. The possible reasons for the mismatch are varied irradiance levels (Figure 9) on different panels or different temperature levels. Mismatch faults are further classified into two types:

Temporal mismatch faults: Caused by shading of panels from structures, clouds, foliage, dust on panels, and anything else that blocks radiation.

Permanent mismatch faults: Mainly caused by hotspots and aging of the modules.

The shading effect results in an uneven distribution of the irradiance on the PV array, as shown in Figure 9, causing reduced power production.

Figure 9: Shading of the PV array

Ground faults in Photovoltaic systems

A PV panel design comprises noncurrent-carrying (NCC) metals (e.g., the module frame and the metal enclosures) to provide mechanical support during normal operation of the panels. The conduits can accidentally short-circuit the current-carrying wires of the panel for various reasons. To prevent a short circuit, all NCC conductors are connected to an equipment grounding conductor (EGC) to the ground, and the conducting conductors are well insulated. The potential reasons for ground faults are:

i. Degradation and liquid entry leading to a short circuit between EGC.
ii. Animal infestation resulting in damaged cable insulation.
iii. Insulation damage to cables due to aging, corrosion due to water, damaged panels, or incorrect installation.
iv. Short circuits in the PV combiner box.

Figure 10: Ground fault

Ground faults are easily detected by continuously monitoring the direction of the inverter current; once a reverse current is detected, the fault alarm becomes active, and no additional sensors are required. Different ground fault detection devices, including the Ground Fault Detection Interruption (GFDI) fuse and the Residual Current Device (RCD), are also used.

Arc faults

Various factors cause arcs within the module, and these persistent burns over a long time interval cause massive damage in the PV system. Arcs burn at very high temperatures, which depend on the

available energy and thermal characteristics of the panel. The current carrier generation in a module depends on the irradiance and therefore can aid in stable ignition conditions for electrical arcs. If not put out within a short period, the direct radiation from the arc may start a fire.

Arc faults occur in a variety of locations, for example in a fuse, terminals, inverters, bypass diodes, and also within the PV modules at joint locations. Classification of arc faults:

i. Parallel arc fault to the ground
ii. Cross-string parallel arc fault
iii. Intra-string parallel arc fault

The line to line faults

A line to line fault is an accidental short circuit between two or more random points in an array that are operating at different potentials. Line to line faults are more difficult to isolate with any conventional fault clearing device. Line to line faults depict distinct behavior under low irradiance and between night time and day time. A line to line fault can be represented as in Figure 11, shorting two points. There are two different techniques for the analysis of faults in PV arrays:

i. Steady-state analysis
ii. Transient fault analysis

Figure 11: Line to Line fault

Bypass Diode Faults

During shading conditions, the bypass diode bypasses the non-generating group of panels at low voltages, and this reduces the risk of hotspots and minimizes the shading effect. Any damage to the diode results in local hotspots, causing heating and effectively damaging the solar cells and the panel. Bypass diodes play a very essential role, without which the entire panel breaks down over time. The diodes are connected across each module, or across a group of panels in systems with many panels.

Results

Successful fault classification in PV systems is essential for reliability in power production. Any fault configuration results in a shift from the optimal operating conditions of the power plant, resulting in a reduced capacity and potentially a total power system breakdown. Fast identification and isolation of the fault will ensure satisfactory customer service, but the key depends on pinpointing the exact type of fault the system is suffering from at the time. Some example simulated conditions on the plant are:

Healthy condition: The power capacity during normal operation of the "1-Soltech" solar panel is 1.3KW, as in Figure 12(a); the generation resides around this capacity during good sunshine hours, and it is desired to operate at these conditions.

Shading faults: The shading effect on the PV capacity depends on how much shade is affecting the panels; in the case where all generating units are blocked and no radiation reaches the panels, the production is zero. The shading effect shifts the PV system operating point, affecting the efficiency of the system. Figure 12(b) shows the simulation results for 30% shading on the plant, and the observed capacity reduces to 435.98 W.

The line to line fault: The line to line fault results in an additional current path in the system, reducing the current to the load and ultimately shifting the operating point of the PV system. The capacity reduces, and a line fault risks heating of NCC conductors if in


contact, and may cause a fire and damage to insulation. All line to line faults simulated on the 1.3KW system resulted in a reduced capacity of 914 W, as in Figure 12(c).

Figure 12: The different P-V characteristics of the 1.3KW PV plant (a) Healthy (b) Shaded condition (c) Line to Line fault.

Artificial Intelligence and Machine Learning

Artificial intelligence (AI) deals with the ability of a computer system to perform tasks commonly associated with human intelligence. AI involves developing systems endowed with intellectual processes that mimic human characteristics like the ability to reason, understand, generalize, and learn from experience. Despite all the continuing advances in the technology, there are still challenges in the processing speed and memory capacity of the systems, and there have not been programs that can match human flexibility in general to meet the human level of intelligence. In recent developments, some programs have attained the performance levels of human experts and professionals at performing certain tasks. In that regard, we can say artificial intelligence is limited in the sense that it is efficient in applications of specialization as diverse as medical applications, search engines, and voice or handwriting recognition, and not necessarily in generalization-based tasks as humans are.

The classification task

Classification refers to categorizing a given set of data into classes and can be performed on both unstructured and structured data. The target is predicting the class of given data based on familiar knowledge or experience. The classes are often referred to as a target or label. The classification modeling process is the process of approximating some function from a set of input variables onto output variables. The goal is to identify the category that new data best fits. The ability to do so depends on the learning experience of the classifier; therefore, the most important part of classification is selecting the best form of the learning process for the classifier. To understand the different types of learners available, the main types are:


Lazy Learners: Lazy learners just store the training data and await the testing data to appear. The classification depends on how closely the stored data is related. They have more prediction time compared to eager learners. An example of such is the k-nearest neighbour.

Eager Learners: Eager learners construct a general classification approach based on particular training data to design models with the ability to predict correctly on future instances. The model commits to a strong hypothesis that will work for the entire domain. More time is invested in training and less time on testing. Examples of such learners are ANNs, Decision Tree, and Naive Bayes.

Classification terminologies

The modeling process highly depends on the data visualization process. A successful data analysis depends on knowledge of data mining and statistics; hence, knowledge of terminologies and definitions is important for our analysis. After analysis of the data, a classifier is selected depending on the purpose of the model design. The most important part of the classification is selecting the best classifier.

Classifier: An algorithm that is used to map the input data to a specific category by isolating the clusters or patterns in the data.

Classification Model: A design that predicts or draws a conclusion on the inputted data from the given training data and makes predictions on the category in which the new data best fits.

Feature: A measurable or observable property of the phenomenon being observed; it refers to the parts or properties that form the entire system.

Binary Classification: An isolation process with only two outcomes in the results.

Multi-Class Classification: An isolation process with more than two classes, where each sample is assigned to one and only one label.

Training: A process of feeding data to an algorithm (F(x,y)) with the ability to learn the data.

Prediction: The process of decoding future instances based on the training obtained from previous data.

Evaluate: The analysis of the model by checking performance parameters.

Machine Learning Models (ML)

Supervised Learning is a technique that deals with the prediction of outcomes on data based on previous data. It involves teaching a model to learn patterns and functions that help map the desired outcome in a future instance. The learned model is simply a numerical model designed based on the labelled data. The algorithm uses this previous knowledge from the model in predicting the outcomes of future data. Supervised Learning utilizes regression strategies to fabricate these models. Classification algorithms are used only on discrete or categorical targets. Model design for any machine learning requires well-labelled data and selection of the best algorithm for the design. The selected model is then fed with data for training and exploring the patterns in the data. The performance is improved by tuning the parameters of the classifier. After the design, new data is fed to the trained model to give a prediction, and the accuracy is cross-validated through analysis. The machine learning classification models in this work include Random Forest, Artificial Neural Networks, K-Nearest Neighbor, Support Vector Machines, and Decision Tree.

K-Nearest Neighbor

K-nearest neighbor (KNN) is one of the lazy learner-based algorithms. This means that the model makes no assumptions on the data

distribution; the classifier design follows the data. Lazy learning algorithms keep all the training data for prediction on the outcome of future instances. The premise of KNN is the fact that similar objects appear close to each other. Objects are classified based on a majority vote from their neighbors and are assigned to the category closest to their neighbors. K is a positive integer parameter passed to the KNN for tuning to improve accuracy.

Random forest

Random forest algorithms use ensemble methods that create several decision trees, as in Figure 14, from the training dataset to predict the outcomes of future instances. Once data is fed to the algorithm, it sets rules from it which are used for the prediction of future outcomes on new data. The classifiers have a top-down analysis approach on the nodes, starting from the root node and applying a binary split first on the most predictive features, creating further nodes down through the process. This continues until leaf nodes are generated with no possibility of further splitting. This process is based on the calculation of entropy. The model predicts a target class for each leaf node upwards to the actual class. Random forests, as the name suggests, create several trees, called a forest of decisions, based on the training data depicting the patterns and behavior contained in the data. The random forest is called a bagging classifier. Random forests pick a random set of features from the fed data and differ from decision trees in that regard.

Support Vector Machines

Support vector machines (SVM) are good for non-linear data that has clusters. The premise of an SVM model is locating a hyperplane in a multidimensional space of features that is able to separate or categorize future instances easily. For example, in the case of a 2-dimensional space, the hyperplane is a line that can isolate the clusters with high accuracy. This line is at an equal distance away from the support vectors. Support vectors refer to the closest points to the hyperplane. SVMs explore the different possibilities of the several lines and then select the one farthest from the support vectors. In the case of non-linear classification data, the algorithm utilizes several kernels that map the points onto different planes to isolate the clusters accurately. In this work, the SVM classifier applies the OVR (one-versus-rest) strategy to build a binary classifier for every class. The classifier focuses on the current class and treats it as positive, leaving the rest negative. A cluster is treated as a single class and is fit against the rest of the clusters.

Decision Tree

Decision trees are very common algorithms, and their learning methods have a wide range of applications. New data is classified by sorting from the top down from the root node, following the attributes tested in the previous nodes. Each branch below a node indicates a possible value for an attribute. The process is repeated until a leaf node is reached, hence reaching the classification of the instance. Construction of a decision tree starts with the selection of a suitable attribute to put in the root node, followed by the creation of branches for every possible value of the attribute, dividing the sample set into subsets. The process is recursive for each branch until all instances at a node result
IJNRD2301123 International Journal of Novel Research and Development (www.ijnrd.org) b180
© 2023 IJNRD | Volume 8, Issue 1 January 2023 | ISSN: 2456-4184 | IJNRD.ORG
in the same classification, at which point tree development stops. The key to developing a good decision tree is the selection of an optimal method for splitting the data, that is, selecting the attribute that is most useful for classifying the samples. Visiting each decision node recursively and selecting the optimal split until no further splits are possible is the basic premise of decision trees. Decision trees use the concept of entropy reduction to optimize the splitting process.

Entropy measures how well an attribute separates the training examples according to their target label, based on a measure of their information. In a binary classification setup, the entropy of a set X is calculated using equation (3), which gives the information in the message:

E(X) = -Pp log2(Pp) - Pn log2(Pn) ---------- Equation (3)

where Pp represents the proportion of positive examples in X and Pn the proportion of negative examples in X. The entropy of a split can then be calculated as the weighted sum of the entropies of the resulting subsets, as in equation (4), which shows the weightage of information; Information Gain measures the reduction in entropy achieved by partitioning the training data. The unit of information is called a bit. As a measure of the average uncertainty in X, the entropy is always non-negative and indicates the average number of bits required to describe the random variable; a higher entropy indicates that the variable contains more information. The key to applying the concept of entropy lies in the proper specification of the random variable X and the definition of the probability function P(x), which depend on the individual application.

Hs(T) = Σt P(t) H(t) ---------- Equation (4)

Design and Predictions

After capturing the PV system characteristics under the different fault conditions, the data is exported to design a smart model that predicts future fault conditions in the PV system. The design depends on previous experience: the more data there is, the better the learning experience for the model. The data is split into training and testing points to ensure the model predicts correctly. The problem statement is the ability of the designed system to correctly isolate the type of fault occurring in the system, so as to initiate quick troubleshooting and cut down on power outages. The data analysis is carried out in Anaconda's Jupyter notebook.

(Figure: model design sequential steps)

Feature Selection and Engineering

Successful feature extraction on the power system data is followed by an analysis of the features to isolate those with high weightage in the classification process. For better classification, features that tend to be noise or that are highly correlated are removed. This is called feature selection; it isolates redundant or irrelevant variables while preserving the information in the data. The process reduces overfitting, which is essential for higher performance and accuracy. The feature selection process involves a quick scan of the data for pattern identification. Feature selection techniques are classified into:

Wrapper methods: The wrapper technique focuses on training with the most functional feature subset, giving priority to those features only.
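The wrapper idea can be sketched in a few lines. The snippet below is a toy illustration only, not code from this work: it greedily adds the feature whose inclusion most improves a crude leave-one-out 1-nearest-neighbour score, which plays the role of the wrapped classifier, and the small dataset is invented for demonstration.

```python
# Toy wrapper-style forward feature selection (illustrative only).
# A wrapper repeatedly trains/evaluates a classifier on candidate feature
# subsets and keeps the subset that scores best, adding one feature per round.

def loo_1nn_score(features, X, y):
    """Leave-one-out 1-nearest-neighbour accuracy using only `features`."""
    correct = 0
    for i, xi in enumerate(X):
        best_j, best_d = None, float("inf")
        for j, xj in enumerate(X):
            if i == j:
                continue
            d = sum((xi[f] - xj[f]) ** 2 for f in features)
            if d < best_d:
                best_j, best_d = j, d
        correct += int(y[best_j] == y[i])
    return correct / len(X)

def forward_select(X, y, n_features, k):
    """Greedily add the feature that most improves the wrapped score."""
    selected = []
    for _ in range(k):
        remaining = [f for f in range(n_features) if f not in selected]
        best = max(remaining, key=lambda f: loo_1nn_score(selected + [f], X, y))
        selected.append(best)
    return selected

X = [(0.0, 5.0), (0.1, 5.0), (5.0, 5.0), (5.1, 5.0)]  # feature 1 is uninformative
y = [0, 0, 1, 1]
print(forward_select(X, y, n_features=2, k=1))  # → [0]
```

In practice a library implementation such as scikit-learn's RFE or SequentialFeatureSelector would be used instead of hand-rolled loops like these.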
This process is recursive: the decision to set aside the most significant subset repeats, either adding or removing a feature, until the desired feature subsets are attained. The feature subsets are cross-validated for their performance using the chosen learning classifier. These methods search extensively and can easily find the best features for the training model. However, features selected using wrappers are only great for the particular data on which the model was trained and may not perform well on future instances, risking overfitting to the data.

Filter methods: Filter methods find relevant features by analyzing the relationship between the target class and the features. This technique ranks the entire dataset in addition to selecting the best features; ranking helps during performance enhancement and analysis. Filters are not good at providing the best feature subsets; instead they give a general model design. The performance of a model based on this technique is usually lower than that of designs implemented using wrappers, but with lower computational requirements and freedom from overfitting. These methods can sometimes be used for pre-processing before applying wrapper methods.

Embedded methods: The embedded technique is a hybrid of the filter and wrapper methods. Unlike filters and wrappers, the technique combines feature selection with the incorporation of a learning algorithm on the selected features. The method begins by selecting the most significant features at every iteration, based on the intrinsic nature of the method, and repeats until there is no further improvement in the performance of the model. The technique is less prone to overfitting than wrappers.

Data Analysis and Interpretation

To fully understand the nature of the data and its meaning we employ a Jupyter notebook using the Anaconda IDE. The different libraries are employed to display and interpret the data, and the different features are checked in parallel to understand how they relate to each other and to interpret their significance in the system. To get a general description of the data we use the "describe" command; Table 4 gives the summary of the data.

Table 4: General description of the dataset

        Irradiance  Temperature  Imp (A)  Vmp (V)  Pmax (W)
Count   176.0       176.0        176.0    176.0    176.0
Mean    645.5       21.5         7.0      40.2     288.3
Std     206.7       5.4          3.1      20.4     216.5
Min     300         12           1.3      27.9     41.7
25%     500         16           4.5      30.4     152.6
50%     600         24           6.8      31.5     228.2
75%     800         26           8.9      32.3     326.4
Max     1000        28           14.9     93.6     1302.2

Correlation is an important element employed in feature selection to avoid overfitting in the model design process. Highly correlated features suggest dropping some of the features to improve the machine's ability to retain accuracy. The correlation in the data is shown in Table 5, where we observe that the highest correlation, 0.844, is between temperature and irradiance, which confirms the strong relationship between the two parameters. Current is highly dependent on the irradiance, and this can be confirmed by the correlation of 0.733. In situations where the
dataset is massive, feature engineering involves dropping the highly correlated features, as they carry almost the same meaning about the system characteristics. Temperature and voltage show a negative correlation, which means they are inversely proportional: a rise in temperature reduces the cell voltage with less effect on the generated current. Both voltage and current show a high correlation with the maximum power (Pmax), of 0.726 and 0.671 respectively.

Table 5: Correlation between the different features

             Irradiance  Temperature  Imp     Vmp     Pmax
Irradiance   1           0.844        0.733   -0.032  0.408
Temperature  0.844       1            0.621   -0.040  0.340
Imp          0.733       0.621        1       -0.084  0.671
Vmp          -0.032      -0.040       -0.084  1       0.726
Pmax         0.408       0.340        0.671   0.726   1

Model training and Testing

After all the data cleaning and a full understanding of the data, we start designing the model using the different packages and the different algorithms potentially able to execute the work. The cleaned data is encoded for all non-numerical features. The classification process involves splitting the data frame into the target element versus the rest of the features (a mapping between the target and the selected features). The data is then split into training and testing data for the classifier, as shown in the figure.

(Figure: the data before splitting, and after splitting into training and testing points)

After splitting the data, the appropriate library and algorithm are employed to train and test the model. In this project, the algorithms used are Decision tree, ANN, Support vector machine, and Random forest. To understand the working of a classifier, the figure shows how linear and polynomial based classifiers follow the testing points. Observe that the linear classifier has a high bias but is likely to predict correctly on future samples, avoiding overfitting; the polynomial classifier has high variance and gives 100 percent accuracy on the testing data but fails terribly on future samples, hence giving a poor model. In general, 80% of the data is used for training and 20% for testing the trained model. The accuracy of each model depends on the data, hence data visualization is a very important part of the design process to ensure that the best algorithm for that particular dataset is selected. The learning process ensures that the model explores the data and captures the hidden patterns and characteristics in the data.

(Figure: classifier performance in predicting the testing data: linear classifier and polynomial classifier)
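The split-and-train step described above can be sketched with scikit-learn as follows. The arrays below are random stand-ins for the exported PV dataset (4 features, 6 hypothetical fault classes), so the printed test scores are meaningless placeholders.

```python
# Sketch of the 80/20 split and the four classifiers used in this project.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(6 * 177, 4))          # 1062 samples, as in the text
y = rng.integers(0, 6, size=X.shape[0])    # hypothetical fault labels

# 80/20 split as described above.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
    "ANN": MLPClassifier(max_iter=500, random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, round(accuracy_score(y_te, model.predict(X_te)), 3))
```

With real, structured fault data the same loop yields the per-classifier test accuracies that are compared in the results section.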
Accuracy Check and Results

After designing the models, we select the best-fit model by checking the accuracy of each using the accuracy score command. This is the last step in the design process and determines whether the
model is successful; if not, the process is repeated after further feature engineering and analysis.

Confusion Matrix CM(i,j): Also referred to as the error matrix, it uses a table to describe the efficiency of a model on data with known (true) outcomes. It gives a clear description of the errors that a particular model makes, representing the classifier's efficiency in the prediction process. For a binary classification problem, the confusion matrix is represented as in Table 6.

Table 6: General confusion matrix for a binary classifier

                 Predicted Class 1   Predicted Class 0
Actual Class 1   True Positive       False Negative
Actual Class 0   False Positive      True Negative

Analysis: The diagonal entries indicate the correctly predicted classes and the off-diagonal entries the misclassified classes. The indices i and j of the element nij in equation (5) indicate the row and the column, counting the cases of class i identified as class j. Hence, the diagonal elements nii are the correctly classified cases, while the off-diagonals are the misclassified cases. The total number of cases N is given by equation (5), and the complete parameters of a three-class matrix are shown in Table 7.2.

N = Σ (i=1 to M) Σ (j=1 to M) nij ---------- Equation (5)

Despite the confusion matrix containing all the information about the possible outcomes of a classifier, reporting full matrices (as noted, for example, in the brain-computer interface (BCI) field) is not ideal, as they are difficult both to compare and to interpret. Hence, only some parameters derived from the confusion matrix are considered for analysis.

Table 7.1: Totals derived from the confusion matrix

Symbol   Formula
X        X1 + X2 + X3
Y        Y1 + Y2 + Y3
W        W1 + W2 + W3
A        X1 + Y1 + W1
B        X2 + Y2 + W2
C        X3 + Y3 + W3
T        A + B + C = X + Y + W

Table 7.2: Confusion matrix with the different parameters

                 Predicted Class A   Predicted Class B   Predicted Class C   Total
Actual Class A   X1                  Y1                  W1                  A
Actual Class B   X2                  Y2                  W2                  B
Actual Class C   X3                  Y3                  W3                  C
Total            X                   Y                   W                   T

The confusion matrix CM(i,j) is a square (n × n) matrix with rows and columns referring to the actual and predicted classes of the data respectively. It follows that the diagonal (i = j) indicates the correct classification decisions. For many applications, normalization of the confusion matrix is useful for easy analysis and understanding of the data. This can be achieved in a number of ways, the first of which involves dividing each
element of CM(i,j) by the total number of samples in the dataset, that is, the sum of the elements of the matrix, as in equation (6):

CMn(i,j) = CM(i,j) / [Σ (i=1 to Nc) Σ (j=1 to Nc) CM(i,j)] ---------- Equation (6)

The second type of normalization is done row-wise, by dividing each element of the confusion matrix by the sum of the elements of the respective row (the true population of the class that has been mapped onto that row). After this normalization we can discard the information related to the size of each class: all classes are considered to be of equal size and the dataset is now class-balanced. This normalization, as in equation (7), gives the elements a per-unit analysis:

CMn(i,j) = CM(i,j) / [Σ (n=1 to Nc) CM(i,n)] ---------- Equation (7)

From the confusion matrix before normalization we can extract three useful performance measures. The first is the overall accuracy (Acc) of the classifier, which represents the fraction of samples of the dataset that have been correctly classified. The overall accuracy in equation (8) is computed by dividing the sum of the diagonal elements by the total sum of the elements of the matrix (T):

Acc = Σ (m=1 to Nc) CM(m,m) / [Σ (m=1 to Nc) Σ (n=1 to Nc) CM(m,n)] ---------- Equation (8)

Two other class-specific measures describe how well the classification algorithm performs on each class: recall, Re(i), and precision, Pr(i). Recall is defined as the proportion of data with true class label i that was correctly assigned to class i, and is computed as in equation (9):

Re(i) = CM(i,i) / [Σ (m=1 to Nc) CM(i,m)] ---------- Equation (9)

where Σ (m=1 to Nc) CM(i,m) is the total number of samples belonging to class i. If the confusion matrix is row-wise normalized, then Σ (m=1 to Nc) CM(i,m) = 1, giving Re(i) = CM(i,i), implying that the diagonal elements of the normalized matrix are the recall values.

Pr(i) is the fraction of samples correctly classified to class i, taking into account all the samples that were classified to that class. Precision is a measure of accuracy on a per-class basis and is defined according to equation (10):

Pr(i) = CM(i,i) / [Σ (m=1 to Nc) CM(m,i)] ---------- Equation (10)

where Σ (m=1 to Nc) CM(m,i) represents the total number of samples that were classified to class i. Note that if all classes contain the same number of samples, then all three performance measures can be computed from any version of the confusion matrix, with or without normalization. Otherwise, if the classes are not balanced, the second (row-wise) normalization method will yield different performance results from the first, standard normalization scheme. An important measure that combines the values of precision and recall is the F measure, which is computed as the harmonic mean of the precision and recall values, as in equation (11):

F(i) = 2 Re(i) Pr(i) / (Pr(i) + Re(i)) ---------- Equation (11)

Following Table 7.2 for easier understanding, equations (12) to (14) adopt its nomenclature; for i = 1 (class A) we can see that X1 = CM(i,i) and X = Σ (m=1 to Nc) CM(m,i), and the performance measures are:

i. Pr(i=1) = CM(i,i) / [Σ (m=1 to Nc) CM(m,i)] = X1 / X ---------- Equation (12)

ii. Re(i=1) = CM(i,i) / [Σ (m=1 to Nc) CM(i,m)] = X1 / (X1 + Y1 + W1) ---------- Equation (13)

iii. Acc = Σ (m=1 to Nc) CM(m,m) / [Σ (m=1 to Nc) Σ (n=1 to Nc) CM(m,n)] = (X1 + Y2 + W3) / T ---------- Equation (14)
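As a minimal, self-contained illustration of equations (8) to (11) (this is not code from this work), the helper below computes accuracy, recall, precision and the F measure from a raw confusion matrix; the 3-class matrix used here is invented demo data.

```python
# Accuracy, recall, precision and F-measure computed directly from a raw
# confusion matrix cm (rows = actual class, columns = predicted class),
# mirroring equations (8)-(11).

def cm_metrics(cm):
    n = len(cm)
    total = sum(sum(row) for row in cm)
    acc = sum(cm[i][i] for i in range(n)) / total         # equation (8)
    rec = [cm[i][i] / sum(cm[i]) for i in range(n)]       # equation (9): row sums
    pre = [cm[i][i] / sum(cm[m][i] for m in range(n))     # equation (10): column sums
           for i in range(n)]
    f = [2 * r * p / (r + p) if (r + p) else 0.0          # equation (11)
         for r, p in zip(rec, pre)]
    return acc, rec, pre, f

cm = [[5, 1, 0],
      [1, 4, 1],
      [0, 2, 4]]   # invented 3-class demo matrix
acc, rec, pre, f1 = cm_metrics(cm)
print(round(acc, 4), [round(r, 3) for r in rec])
```

Library equivalents exist (for example scikit-learn's classification_report), but the hand computation makes the row-sum versus column-sum distinction between recall and precision explicit.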
Prediction Results

The dataset of 1062 (6 × 177) data points was split, with 80% of the data used for training and around 20% for testing, and a confusion matrix was obtained for the predictions on the testing data points. To calculate the accuracy, equation (14) is applied to each of the four algorithms and the algorithm with the highest score is selected. Table 8 shows the detailed confusion matrix for the decision tree, since it gave the highest score; summary data for all four algorithms is reported in Table 9, and the diagonal and off-diagonal elements obtained for the other three classifiers are listed in Table 10.

Table 8: The confusion matrix obtained using the Decision tree classifier

4 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 2 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 2 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 4 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 4 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 3 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 3 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 5 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 2 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1

Using equation (14) on this matrix, Σ CM(m,m) = 31 is the sum of the diagonal elements and Σ Σ CM(m,n) = T = 36 is the total number of elements, so the obtained accuracy score is Acc = 31/36 = 0.8611. Similarly, the data in Tables 9 and 10 is used to calculate the accuracy of the other classifiers.

Table 9: Summary data obtained from the confusion matrices of the classifiers

Parameter                         Decision Tree   Random Forest   ANN      SVM
Σ CM(m,m) (correctly classified)  31              11              14       19
Misclassified samples             5               25              22       17
Total samples T                   36              36              36       36
Accuracy Acc                      0.8611          0.3056          0.3888   0.5277

Table 10: Diagonal and non-zero off-diagonal elements of the confusion matrices of the other classifiers

Classifier      Diagonal elements                    Non-zero off-diagonal elements
Random Forest   [0,2,2,3,0,0,0,0,0,0,0,4,0,0,0,0]    [1,2,2,3,1,1,1,1,1,3,1,2,3,1,1,1]
ANN             [0,2,2,4,0,0,0,1,0,0,0,5,0,0,0,0]    [1,1,2,1,1,2,2,3,2,3,1,1,1,1]
SVM             [0,2,2,4,0,0,1,1,1,1,1,5,0,0,1]      [2,3,3,1,2,2,4]

Using equation (14), the accuracy score of each classifier is calculated from the data in Tables 9 and 10; Table 11 summarizes the performance of each classifier used in the classification process, with the decision tree giving the highest score of 86%, which is good enough for model design.

Table 11: Accuracy score of the different classifiers

Model                           Testing Accuracy
Decision tree                   86%
Random Forest                   31%
ANN                             38%
Support Vector Machine (SVM)    53%
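The reported accuracies can be cross-checked in a few lines from the diagonal sums quoted in Table 9, using equation (14):

```python
# Sanity check: Acc = (sum of diagonal elements) / (total test samples),
# per equation (14), using the per-classifier figures from Table 9.
diag = {"Decision tree": 31, "Random Forest": 11, "ANN": 14, "SVM": 19}
total = 36
acc = {name: d / total for name, d in diag.items()}
print({name: round(a, 4) for name, a in acc.items()})
```

Rounded, these give 0.8611, 0.3056, 0.3889 and 0.5278, matching the 86%, 31%, 38% and 53% summarized in Table 11.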
Conclusion and Future Work

The data was successfully generated following the weather conditions in Puri during the winter period, and the dataset captured the system behavior under fault. The different fault configurations were incorporated, and the data was visualized in a Jupyter notebook using Python to infer the meaning hidden in the behavior of the 1.3 kW plant. Four different algorithms were used; the most accurate, with an accuracy of 86%, was the decision tree, which was used to implement the design. The biggest challenge was exhausting all the possible fault configurations to capture the entire system behavior: for example, it was impossible to simulate a ground fault on the designed DC power system. This suggests that the model may not effectively classify instances representing fault conditions that were not incorporated during model training. This is the biggest drawback of using generated synthetic data rather than real-time data collected from a plant over the years. Future work can include expanding the dataset to incorporate all-year seasonal conditions, implementing the fault classification techniques on real plant data, and improving the model design through tuning of the classifier parameters using new Python packages.
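As a hedged starting point for the classifier-parameter tuning suggested above, a grid search over decision-tree hyperparameters might look like the sketch below; the data is synthetic and the parameter ranges are arbitrary choices, not values from this work.

```python
# Illustrative grid search over decision-tree hyperparameters on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_classes=3, random_state=0)
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [3, 5, None], "criterion": ["gini", "entropy"]},
    cv=5,  # 5-fold cross-validation for each parameter combination
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

The best-scoring parameter combination is then available in `grid.best_params_` and can be refit on the full training set.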