
IEEE SENSORS JOURNAL, VOL. 23, NO. 13, 1 JULY 2023

Optimized Dissolved Oxygen Prediction Using Genetic Algorithm and Bagging Ensemble Learning for Smart Fish Farm

Prince Waqas Khan, Member, IEEE, and Yung Cheol Byun

Abstract—The field of aquaculture is one of the numerous scientific disciplines that benefit greatly from machine learning (ML). The amount of dissolved oxygen (DO), an important indicator of water quality in sustainable fish farming, affects the yield of aquatic production. It is essential to make DO projections in fishing ponds to carry out the process of artificial aeration. We present DO forecasts utilizing time series analysis based on data obtained from Hanwha Aqua Planet Jeju, located in South Korea. This information could form the data foundation for an early detection system and improved aquaculture farm management. This research presents a genetic algorithm (GA)-based XGBoost, CatBoost, and extra tree (GA-XGCBXT) bagging ensemble model, built on extreme gradient boosting (XGBoost), CatBoost (CB), and extra trees (XTs). To select the most relevant features, various methodologies that exhibit a strong association with the primary data were applied. The performance of the proposed model was evaluated against observed sensor data on both the training and validation sets, and the accuracy of the predictions of the recommended GA-XGCBXT model was determined using various performance indices. Using the suggested strategy, we obtained a root mean square error of 0.310. Our objective is to enhance the ML model for aquaculture so that academics and practitioners can employ applications for smart fish farming with complete reliability.
Index Terms— Bagging ensemble learning, dissolved oxygen (DO), electrical conductivity (EC), genetic algorithm (GA),
oxidation-reduction potential (ORP), sensor data processing, smart fish farm.

Manuscript received 20 April 2023; revised 12 May 2023; accepted 18 May 2023. Date of publication 26 May 2023; date of current version 29 June 2023. This work was supported by the Ministry of Small and Medium-Sized Enterprises (SMEs) and Startups (MSS), South Korea, through the "Regional Specialized Industry Development Plus Program (Research and Development)," supervised by the Korea Technology and Information Promotion Agency for SMEs (TIPA), under Grant S3246057. The associate editor coordinating the review of this article and approving it for publication was Dr. Avik Santra. (Corresponding author: Yung Cheol Byun.) Prince Waqas Khan is with the School of Computing, Gachon University, Seongnam-si, Gyeonggi-do 13120, South Korea (e-mail: [email protected]). Yung Cheol Byun is with the Department of Computer Engineering, Major of Electronic Engineering, Institute of Information Science and Technology, Jeju National University, Jeju 63243, South Korea (e-mail: [email protected]). Digital Object Identifier 10.1109/JSEN.2023.3278719. 1558-1748 © 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

I. INTRODUCTION

Dissolved oxygen (DO) is essential to aquaculture water quality in smart fish farms. If the DO content is correct, water plants and animals can grow and develop [1], [2]. The amount of DO required in aquaculture farms is typically determined by the type of fish being raised and the temperature of the water. At concentrations below 3 mg/L, most species exhibit increased mortality and morbidity, which is associated with stress in aquatic species [3]. If a decline in DO concentration can be accurately predicted, aquaculture farmers can take urgent corrective action to avoid this problem by increasing the DO concentration, such as by activating an aeration system. The system can be automated with machine learning (ML), and the outcomes are also better and more valuable. Forecasting DO in fishery ponds is critical for using artificial aeration [4]. Many scientific domains, including aquaculture, rely on ML, and numerous models have been developed to forecast the DO in bodies of water.

For DO prediction in aquaculture, Li et al. [5] employed a mixed model consisting of a long short-term memory (LSTM) network and a temporal convolutional network. Support vector regression (SVR) was utilized by Wei et al. [6] to estimate DO levels in sea cucumber farming. They take into account
potential hydrogen (PH), electrical conductivity (EC), water temperature, and DO. Liu et al. [7] also employed the least squares SVR variation to predict DO in river crab culture. They did, however, improve on the particle swarm optimization technique. Ta and Wei [8] employed a convolutional neural network (CNN) for DO prediction in recirculating aquaculture systems. Eze and Ajmal [3] presented a hybrid model for projecting DO in aquaculture, employing an LSTM neural network based on ensemble empirical mode decomposition (EEMD).

The proposed bagging ensemble model includes CatBoost (CB), extreme gradient boosting (XGBoost), and an extra tree regressor. The performance of the proposed model was evaluated on training and validation sets against observed sensor data. The accuracy of the results predicted by the proposed model was assessed using a variety of performance indices, including mean absolute error (MAE), mean square error, root mean square error, and root mean square log error. Additionally, the performance of bagging ensemble approaches used in smart fish farming is examined along with the data, algorithms, and ensemble techniques. We want to improve the ML model for aquaculture so that researchers and professionals can confidently use smart fish farming applications. This can provide the data foundation for an early warning system and better management of aquaculture farms. This study is focused on addressing the following research question.
1) RQ1: Can an ML model accurately predict DO levels in fish farming ponds to improve farm management and the yield of aquatic production?
The objectives of the study are to collect and analyze DO sensor data, identify the most important factors affecting DO levels, develop an ML model that can accurately predict DO levels, evaluate the model's performance, and provide recommendations for using the model to improve aquaculture farm management and the yield of aquatic production.

A. Contributions
We proposed a prediction model for DO using cutting-edge ML algorithms. We used the actual dataset from the smart fish farm in our experiments for the forecasting model. A genetic algorithm-based XGBoost, CB, and extra tree (GA-XGCBXT) bagging ensemble model is proposed for DO prediction, putting forth the idea of bagging ensemble ML. A genetic algorithm (GA) is used for optimal feature selection. The XGBoost, CB, and extra tree models are the components of the bagging ensemble model, which performs better than the standalone models. We set up a smart fish farm to collect experimental data at Hanwha Aqua Planet Jeju in South Korea. The significant contributions of this article are as follows:
1) obtaining data from the smart fish farm with oxidation-reduction potential (ORP) sensors, EC sensors, DO sensors, and PH sensors, along with two temperature sensors;
2) utilizing GA and SHapley Additive exPlanations (SHAP) for optimal feature selection;
3) presenting a GA-XGCBXT bagging ensemble model;
4) comparing the proposed model with several state-of-the-art ML-based prediction algorithms.

Fig. 1. Flow diagram of proposed methodology.

B. Article Organization
The rest of the article is structured as follows: The proposed approach and the ML models are detailed in Section II. The procedure for data curation and data summarization is provided in Section III. In this section, we also look into dataset smoothing and preprocessing. The performance of the suggested model is discussed in detail in Section IV. We also compare our findings to those of other models that are in use. In the final section, we wrap up this article's work and offer suggestions for more research.

II. BACKGROUND REVIEW

Predicting DO, an important water quality parameter, is essential for aquatic managers in charge of maintaining the health of ecosystems and running reservoirs [9]. Most DO prediction models are complicated. Additionally, reliable data to develop and calibrate new DO models need to be

improved. Thus, ML-based methods have improved forecasts of complex and essential indicators of aquatic ecosystem health like DO in recent decades. The study by Kisi et al. [10] suggests a new technique, called Bayesian model averaging, for figuring out how much DO is in the water every hour. Multiple metrics were used to evaluate the methods. When various sets of inputs were examined, it was found that the water temperature is the primary variable, while the specific conductivity has almost no effect on the hourly DO. It is suggested by Xiao et al. [11] that aquaculture DO can be predicted using the backpropagation neural network approach with a mix of activation functions. Details about the input, hidden, and output layers and the weight adjustment process are given. The outcomes of the experiments show that the neural network has the most accurate predictions. All the predicted values are within the 5% error limit, which means they can be used in real life. The prediction model can help improve water quality monitoring in aquaculture. In the study by Olyaie et al. [12], DO predictions in the Delaware River at Trenton, USA, were made using three different AI methods: artificial neural networks, linear genetic programming, and support vector machines (SVMs). When the estimation accuracy of the different intelligence models was compared, it was clear that the SVM could make the most accurate model for DO estimation. It was also found that the linear genetic programming model works better than both artificial neural network models. A new hybrid DO prediction model based on an LSTM optimized by an improved sparrow search algorithm (ISSA) is suggested by Wu et al. [13] to enhance the DO accuracy rate and understand its shifting patterns. XGBoost selects essential components with a higher correlation with DO as input parameters to reduce redundant information and speed up model calculation. ISSA optimizes the LSTM to find the optimal initial weights and thresholds for an XGBoost-ISSA-LSTM DO prediction model. They examine water quality prediction, which can predict short- and long-term DO. Research findings show that the proposed model outperforms other models in adaptability and accuracy rate.

Two key flaws in conventional forecasting systems are their poor accuracy and generalization capabilities. This article addresses these issues and presents a novel GA-based bagging ensemble model. The best features have been chosen using various methods that correlate well with the initial data.

III. MATERIALS AND METHODS

A GA-XGCBXT bagging ensemble model is proposed for DO prediction. For this purpose, the data were collected at Hanwha Aqua Planet Jeju in July 2022. A fish tank with numerous sensors makes up the experimental setup. On top of the fish tank is an LED light. Two temperature sensors, A and B, were part of it. The fish tank contains temperature sensor A, additional sensors, and fish. To counteract the impact of the fish, temperature sensor B is installed behind the divider. ORP sensors, EC sensors, DO sensors, and PH sensors are among the additional sensors. Fig. 1 shows the flow of the proposed methodology. We got the fish farm data from Hanwha Aqua Planet Jeju. The flow process consists of checking for null values, smoothing the data using the Savitzky-Golay filter, and generating features at the feature engineering step. We applied a GA to select optimal features. Then we combined XGBoost, an extra tree (XT) regressor, and CB models to generate a hybrid model. The mean absolute error, the mean-squared error (MSE), and the root MSE (RMSE) were used for validation and then forecasting.

A. Experimental Setup
The data were recorded at Hanwha Aqua Planet Jeju in the month of July 2022. Aqua Planet Jeju is the largest aquarium complex in all of East Asia and a renowned family-friendly attraction that is home to over 45 000 marine animals from both domestic and foreign regions. The centerpiece of Aqua Planet Jeju is its main tank, which holds 27 000 fish and 6000 tons of water and displays marine life up close [14]. The experimental setup consists of a fish tank with multiple sensors. Fig. 2 shows the experimental setup of the smart fish farm with sensors. A smart fish farm water tank was installed with a depth of 100 cm and 2200 L of fresh water. An LED light is placed on the top of the fish tank. The fish farm includes two temperature sensors: temperature sensor A and temperature sensor B. Temperature sensor A is placed inside the fish tank along with the other sensors and the fish, while temperature sensor B is placed behind the partition to avoid the effects of the fish. The other sensors consist of ORP sensors, EC sensors, DO sensors, and PH sensors.

Fig. 2. Experimental setup of the smart fish farm with sensor system integration.

As experimental organisms, 30 goldfish with a strong survival rate were stocked. For biological breeding, 20 g of compound feed was fed three times per day, at 11 A.M., 2 P.M., and 5 P.M., and no water exchange was performed. After feeding, filtered and stable circulating water was measured at the site at 10 A.M. daily for pH, DO, and water temperature. An external filter (Eheim, 1000 L), which is most often used in large aquariums or farms, was installed in the control tank. A natural water circulation structure similar to a river or lake was designed using the Cedrapump (50 W, ten revolutions/day).

B. Bagging Ensemble Learning
Multiple models are trained using the same learning algorithm in an ensemble learning concept. A group of base learners, or models, known as an ensemble, work together to

make a superior final prediction [15], [16]. Due to excessive variance or strong bias, a single model, also known as a base or weak learner, may not perform effectively. However, weak learners can combine to become strong learners when aggregated, which minimizes bias or variance and improves model performance. Bagging is a type of ensemble learning that involves creating additional training data from the dataset, utilizing combinations with repetitions to make multiple sets of the original data, which makes it possible to reduce the variance of the prediction [17]. A GA-XGCBXT bagging ensemble model is proposed for DO prediction. We used a GA for optimal feature selection and these optimal features for bagging ensemble learning.
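Because the article describes the combination only at a high level, the following is a minimal sketch of one plausible realization of the XGCBXT ensemble, assuming equal-weight averaging of the three base learners via scikit-learn's VotingRegressor; the hyperparameter values are placeholders, not the tuned values of Table I.

# Sketch of an XGBoost + CatBoost + extra trees ensemble for DO regression.
# Equal-weight averaging via VotingRegressor is an assumption; the paper does
# not spell out the exact aggregation rule of GA-XGCBXT.
from sklearn.ensemble import ExtraTreesRegressor, VotingRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor

def build_ensemble() -> VotingRegressor:
    xgb = XGBRegressor(n_estimators=500, learning_rate=0.05, random_state=42)
    cb = CatBoostRegressor(iterations=500, learning_rate=0.05,
                           random_seed=42, verbose=0)
    xt = ExtraTreesRegressor(n_estimators=500, random_state=42)
    # VotingRegressor fits all three and averages their predictions.
    return VotingRegressor([("xgb", xgb), ("cb", cb), ("xt", xt)])

# Hypothetical usage: X_train/y_train hold the GA-selected sensor features
# (temperature, pH, ORP, EC, ...) and the DO target, respectively.
# model = build_ensemble().fit(X_train, y_train)
# do_forecast = model.predict(X_test)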
1) Genetic Algorithm: The GA is an optimization technique created using the principles of natural evolution; the concepts of genetic inheritance and natural selection are employed [18]. Starting from a random initial population and searching only in the low-cost regions of the space, it employs a guided random search, which differs from earlier algorithms. The GA is used for large-scale feature selection [19], [20], [21].

The genetic makeup and behavior of the population's chromosomes serve as the foundation of the GA. Each chromosome represents a potential solution; as a result, the population is made up of chromosomes. Each individual in the population is described by a fitness function: the fitter the individual, the better the solution. The best individuals are chosen from the available individuals in the population to generate the children of the following generation. As a result of crossover, the offspring have traits from both parents; a mutation is a minor alteration to the structure of a gene.

2) CatBoost: CB expands on gradient boosting and decision tree theory. The primary concept of boosting is the sequential combination of numerous weak models [22]. Gradient boosting fits the decision trees sequentially, allowing them to learn from earlier trees' errors and reduce the overall error. New functions are continuously added to the existing ones until the chosen loss function is no longer minimized. CB has different attributes such as the number of trees, random seed, learning rate, depth, and loss function. CB is used in many areas for forecasting, including electricity price forecasting [23], weather forecasting [24], PV power forecasting [25], and coastal water quality prediction [26].

3) XGBoost: XGBoost is a distributed, scalable gradient-boosted decision tree (GBDT) ML framework. Gradient boosting is a supervised learning process that combines the predictions of a number of weaker, simpler models to attempt to predict a target variable properly [27]. XGBoost is one of the top ML libraries for regression, classification, and ranking problems. It offers parallel tree boosting and is a distributed gradient-boosting library that has been developed to be very effective, adaptable, and portable. Recent practical ML competitions for structured or tabular data have been won by the XGBoost algorithm. When solving supervised learning tasks, like predicting a target variable using training data (with numerous features), researchers employ XGBoost. XGBoost is used for predicting protein submitochondrial localization [28], home network traffic classification [29], and fault detection [30]. When it comes to structured or tabular datasets for classification and regression predictive modeling issues, XGBoost outperforms other gradient-boosting solutions in terms of speed.

4) XT Regressor: The XTs regressor is an ensemble ML system that aggregates the forecasts from numerous decision trees [31]. The XT regressor does not do bootstrap aggregation; it uses a random subset of the data without replacement, and nodes are split at random rather than using the best splits. Randomness in the XT regressor therefore originates from the random splitting of the data, not from bootstrap aggregation. The XT regressor is used for forecasting indoor temperature in smart buildings [32], short-term energy forecasting [33], and rainfall prediction [34].

C. Hyperparameters
Hyperparameter tuning is a crucial step in ML model development that involves finding the combination of hyperparameters that achieves the best performance. For the proposed approach, we used a grid search method to explore the hyperparameter space and find the combination of hyperparameters that yielded the highest performance. Table I shows the hyperparameters selected for the XGBoost (XGB), XT regressor, and CB regressor models, along with their corresponding optimal values. The hyperparameters common to all three models are listed first, followed by the model-specific hyperparameters. The optimal values were obtained through the hyperparameter tuning process.

TABLE I
Hyperparameters Used for XGB Regressor, XTs Regressor, and CB Regressor Models
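As an illustration of this tuning step, the sketch below wires an XGBoost regressor into scikit-learn's GridSearchCV; the grid values are hypothetical stand-ins rather than the actual search space behind Table I, and TimeSeriesSplit is our assumption for keeping the folds chronological.

# Sketch of hyperparameter tuning with a grid search. The grid below is a
# hypothetical example; Table I lists the values actually selected.
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from xgboost import XGBRegressor

param_grid = {
    "n_estimators": [200, 500, 1000],
    "max_depth": [4, 6, 8],
    "learning_rate": [0.01, 0.05, 0.1],
}

search = GridSearchCV(
    estimator=XGBRegressor(random_state=42),
    param_grid=param_grid,
    scoring="neg_root_mean_squared_error",  # RMSE is the paper's headline metric
    cv=TimeSeriesSplit(n_splits=5),         # respects the temporal ordering
)
# search.fit(X_train, y_train)  # X_train/y_train as in the earlier sketch
# print(search.best_params_)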
IV. DATA CURATION AND ANALYSIS

The dataset used in this study was obtained from Hanwha Aqua Planet Jeju, a marine theme park located in South Korea. The initial dataset consisted of 25 490 rows and four columns. To improve the performance of the model, feature engineering techniques were applied, resulting in a total of 13 features. Data collection involved monitoring the aquatic environment in real time using various sensors and measuring devices. The collected data were preprocessed to remove any outliers or missing values before being used for model training and evaluation. Table II summarizes the specifications of the dataset, where the asterisk (*) marks the number of features without the target variable DO.

TABLE II
Dataset Specifications

TABLE III
Descriptive Statistics of the Dataset
Table III presents descriptive statistics for the variables temperature (temp), DO, pH, ORP, and EC in a dataset of 25 490 observations. The mean, standard deviation, minimum, maximum, and quartile values are displayed for each variable. This information provides an overview of the central tendency and variability of the data, as well as the range of values observed. Descriptive statistics analysis is useful in understanding the nature of the dataset and in making informed decisions about the appropriate statistical methods to apply.
Fig. 3 shows the bar chart of mean feature values according to each day. Fig. 3(a) shows the daily mean bar chart of DO, where the total mean value of DO is 4.17. Fig. 3(b) shows the daily mean bar chart of the temperature, where the total mean temperature value is 22.48 °C. Fig. 3(c) shows the daily mean bar chart of pH readings, where the total mean value of pH is 7.61. Fig. 3(d) shows the daily mean bar chart of EC readings, where the total mean value of EC is 66 636.23. Fig. 3(e) shows the daily mean bar chart of ORP readings, where the total mean value of ORP is 112.26.

Fig. 3. Mean feature values according to each day. (a) DO. (b) Temperature. (c) pH. (d) EC. (e) ORP.

Fig. 4 shows all readings of the features. Fig. 4(a) shows the graph of DO, where the minimum DO value is 0.5 mg/L and the maximum is 6.8 mg/L. Fig. 4(b) shows the temperature graph, where the minimum temperature is 21.0 °C and the maximum is 23.1 °C. Fig. 4(c) shows the graph of pH readings, where the minimum pH value is 6.95 and the maximum is 8.15.

Fig. 4. Features according to each day. (a) DO. (b) Temperature. (c) pH.

A. Data Smoothing
The Savitzky-Golay filter is used for smoothing. The Savitzky-Golay filter is a digital filter that employs data points to smooth the graph [35]. Using the least-squares method, a small window is created, a polynomial is fitted to the data in that window, and the polynomial is then used to determine the window's center point. Once all the neighbors have been roughly adjusted with one another, the window is shifted by one data point. Fig. 5 shows a graphical comparison of DO before and after smoothing.

Fig. 5. Data before and after smoothing.
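A minimal sketch of this smoothing step with SciPy's savgol_filter follows; the window length, polynomial order, and the column name "do" are illustrative assumptions, since the article does not report the filter settings used.

# Sketch of Savitzky-Golay smoothing of the DO series. window_length and
# polyorder are illustrative assumptions; the paper does not report them.
import pandas as pd
from scipy.signal import savgol_filter

def smooth_do(df: pd.DataFrame) -> pd.DataFrame:
    # Fit a low-order polynomial inside a sliding window, keep the window's
    # center value, then slide the window one sample at a time.
    df["do_smooth"] = savgol_filter(df["do"], window_length=51, polyorder=3)
    return df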

B. Training and Testing Data
The dataset was collected in the months of April, May, and June 2022; hence, it contains data from April 1, 2022, to June 30, 2022, and the total number of rows in the dataset was 25 490. We divided it into 80% training data and 20% testing data. Fig. 6 shows a graphical representation of this division. The training data start on April 1, 2022 and end on June 11, 2022, consisting of 20 392 rows, whereas the testing data start on June 12, 2022 and end on June 30, 2022, consisting of 5098 rows. Table IV summarizes the training and testing division.

Fig. 6. Train and test data.

TABLE IV
Dataset Division According to Training and Testing Data
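A sketch of this chronological split follows, assuming the readings sit in a pandas DataFrame sorted by timestamp with the target column named "do" (our placeholder name); with 25 490 rows, the 80% cut reproduces the 20 392/5098 division of Table IV.

# Chronological 80/20 split, as in Table IV: the first 20 392 rows
# (Apr. 1-Jun. 11, 2022) train the model, the last 5098 rows
# (Jun. 12-Jun. 30, 2022) test it. `df` is the smoothed DataFrame from the
# previous sketch, assumed sorted by timestamp.
split = int(len(df) * 0.8)                     # 25 490 rows -> 20 392 / 5098
train_df, test_df = df.iloc[:split], df.iloc[split:]

feature_cols = [c for c in df.columns if c != "do"]
X_train, y_train = train_df[feature_cols].values, train_df["do"].values
X_test, y_test = test_df[feature_cols].values, test_df["do"].values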

V. RESULTS AND DISCUSSION

This section covers the test-bed environment description, feature selection results, error loss, a graphical representation of the predictions, and different evaluation metrics. We have compared our proposed model with several state-of-the-art ML algorithms. Our test bed consists of two different environments: IoT and ML. The IoT environment consists of a smart fish farm with sensors, and the ML environment is used to employ the bagging ensemble learning model. The coding is done in the Python language. The detailed ML implementation environment is described in Table V.

TABLE V
ML Implementation Environment

Fig. 7 shows a graphical comparison of actual and predicted DO and their difference. The light green line shows the actual DO values, and the orange line shows the values predicted by the proposed GA-based bagging ensemble model. The purple color between them represents the difference between the actual and predicted values.

Fig. 7. Graphical comparison of actual and predicted DO and their difference.

Fig. 8 shows the boxplot of errors between the actual and predicted values of the different models. The box represents the interquartile range (IQR) of the errors, with the line inside the box representing the median error. The whiskers extending from the box represent the range of errors within 1.5× the IQR, and any points outside the whiskers are considered outliers. The plot allows us to visualize the spread and distribution of the errors, providing insights into the predictive models' performance. The presence of outliers in the CB and XGBoost models indicates that they may have instances where their predictions are significantly off from the true values, leading to higher errors. The box plot of the XTs model, by contrast, does not show any outliers, which means that the errors for this model are relatively consistent and do not deviate significantly from the median.

Fig. 8. Boxplot of errors between actual and predicted values by different models.

A. Feature Selection
The relevance of a feature is one of the numerous strategies for improving model correctness [36]. Estimating the contribution of each data feature to the model's prediction using feature significance is helpful. After doing feature importance tests, we determine which characteristics significantly influence the model's decision-making. As a result, we can take action by eliminating features that have little bearing on the model's predictions and concentrating on enhancing the more essential aspects. This has a significant impact on model performance. We evaluated feature importance using
GA and SHAP analysis and the cover, gain, weight, and total cover parameters.

1) Feature Selection Using GA: A GA can be used for optimal feature selection [37]. The GA searches for the subset of features that yields the highest performance. In this study, the GA was employed to select the optimal subset of features for the ML models. The GA implementation used in this study involves encoding feature subsets as binary strings and then applying genetic operations such as crossover and mutation to generate new candidate feature subsets. Fig. 9(a) and (b) show the feature importance before and after applying the GA, respectively. As can be seen, the GA helped to identify the most important features for the prediction task, resulting in a reduced set of features while maintaining or even improving the model's performance. The fitness function is used to evaluate each chromosome, which represents a candidate solution. It is calculated using the following equation:

f_c = \frac{1}{1 + e_c}   (1)

where f_c represents the fitness value of the chromosome and e_c is the error value of the chromosome.

Fig. 9. Feature importance (a) before and (b) after applying GA.
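A compact sketch of this GA loop follows: chromosomes are Boolean masks over the feature columns, fitness follows (1), and single-point crossover plus bit-flip mutation generate new candidates. Taking the error term e_c to be the validation RMSE of an XGBoost model, as well as all population settings, are our assumptions; the paper leaves them unspecified.

# Sketch of GA-based feature selection over NumPy feature matrices.
import numpy as np
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

rng = np.random.default_rng(42)

def fitness(mask, X_tr, y_tr, X_val, y_val):
    # Eq. (1): f = 1 / (1 + e), with e taken as the validation RMSE
    # (an assumption; the paper does not name the error metric).
    if not mask.any():
        return 0.0                                  # empty subsets are invalid
    model = XGBRegressor(n_estimators=100, verbosity=0)
    model.fit(X_tr[:, mask], y_tr)
    err = mean_squared_error(y_val, model.predict(X_val[:, mask])) ** 0.5
    return 1.0 / (1.0 + err)

def ga_select(X_tr, y_tr, X_val, y_val, pop_size=20, generations=30, p_mut=0.1):
    n_feats = X_tr.shape[1]
    population = rng.random((pop_size, n_feats)) < 0.5   # random initial masks
    for _ in range(generations):
        scores = np.array([fitness(c, X_tr, y_tr, X_val, y_val)
                           for c in population])
        parents = population[np.argsort(scores)][-pop_size // 2:]  # fittest half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = int(rng.integers(1, n_feats))          # single-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(n_feats) < p_mut         # bit-flip mutation
            children.append(child)
        population = np.vstack([parents] + children)
    scores = np.array([fitness(c, X_tr, y_tr, X_val, y_val) for c in population])
    return population[np.argmax(scores)]                 # best feature mask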
2) Feature Selection Using SHAP: SHAP feature importance is a substitute for permutation feature importance. SHAP was developed based on the magnitude of feature attributions [38]. By plotting the SHAP values of each feature for each sample, we can gain an understanding of which features are most important for the model. The distribution of each feature's effects on the model output can be visualized using a SHAP summary plot [39]. Fig. 10 displays the distribution of each feature's effects on the proposed model output, with features sorted according to the sum of SHAP value magnitudes across all samples. The color of the plot represents the feature value, with blue indicating a low feature value and red indicating a high feature value. The SHAP value for a specific feature i of a given instance x can be calculated using the following equation:

\mathrm{SHAP}_i = \phi_{i,0} + \sum_{j=1}^{M} \phi_{i,j} x_{i,j}   (2)

where \mathrm{SHAP}_i is the SHAP value for feature i, \phi_{i,0} is the base value, \phi_{i,j} is the jth feature's weight, x_{i,j} is the jth feature's value for the ith instance, and M is the total number of features.

Fig. 10. SHAP feature importance.
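A short sketch of how such a summary plot is typically produced with the shap library follows; xgb_model and X_test are placeholders for a fitted tree-based model (e.g., the XGBoost component of the ensemble) and the test features, not the authors' exact objects.

# Sketch of computing SHAP values for a fitted tree model and drawing the
# summary plot of Fig. 10.
import shap

explainer = shap.TreeExplainer(xgb_model)     # supports XGBoost/CatBoost/trees
shap_values = explainer.shap_values(X_test)   # one attribution per feature and sample
# Beeswarm summary: features sorted by total |SHAP|, colored by feature value
# (blue = low, red = high), as in Fig. 10.
shap.summary_plot(shap_values, X_test)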
3) Features Analysis: Besides GA and SHAP, we have used different parameters to analyze the feature importance. These parameters include cover, gain, weight, and total cover. Fig. 11 shows the cover, gain, and weight analysis of the features.

Cover compares the number of training data points routed through the splits that use a feature with the number of times the feature is used to split the data across all trees [40]; the scale shows the proportion of observations associated with that feature. Fig. 11(b) shows the cover score of the features, where the day of the week has the lowest score, and Fig. 11(a) shows the total cover score of the features, where EC has the lowest score. Gain is the typical decrease in training loss when a particular feature is used [41]; it measures how much each feature of each tree contributed relative to the contributions of the other features in the model. Fig. 11(c) shows the gain score of the features, where EC and temperature have the lowest scores. Frequency, often known as weight, is the percentage of splits in the model's trees that use a particular feature; to compute it, the weight of one feature is taken as a percentage of the weight of all features. Fig. 11(d) shows the weight score of the features, where EC and temperature have the lowest scores.

Fig. 11. Cover, gain, and weight analysis of features. (a) Total cover. (b) Cover score. (c) Gain score. (d) Weight score.

B. Evaluation Metrics
We have used different evaluation metrics to assess the performance of our proposed model: MAE, MSE, RMSE, and root mean-squared log error (RMSLE). Table VI shows a comparison of MSE, RMSE, RMSLE, R^2, and MAPE. We have compared the performance of our proposed GA-XGCBXT bagging ensemble model with several state-of-the-art ML models.

TABLE VI
Comparison of MSE, RMSE, RMSLE, R^2, and MAPE

1) Mean Absolute Error: The MAE score is generated from the average of the absolute error values [42]. The mathematical function abs() turns a negative value positive; as a result, while calculating the MAE, the difference between an expected and a forecast value may be positive or negative but always contributes a positive value. Using (3) to calculate MAE, we obtained an MAE score of 0.21 from our GA-XGCBXT model. Fig. 12 shows a graphical comparison of MAE across different models. In (3), y_a represents the actual DO values obtained from the sensor data in the fish farming pond, while y_p represents the predicted DO values obtained from the proposed GA-XGCBXT bagging ensemble model:

\mathrm{MAE} = \frac{1}{N} \sum |y_a - y_p|.   (3)

Fig. 12. Comparison of MAE.

2) Mean-Squared Error: A common error metric for regression problems is the MSE. It is also a crucial loss function for fitting or optimizing algorithms when a regression problem is framed in terms of least squares [43]; minimizing the MSE between the forecasts and the expected values is what is meant by "least squares" in this context. The MSE is calculated as the mean of the squared differences between the predicted and expected target values in a dataset. We obtained an MSE score of 0.11, using (4) to calculate MSE, where y_a is the ground-truth DO values and y_p is the DO values predicted by the model:

\mathrm{MSE} = \frac{1}{N} \sum (y_a - y_p)^2.   (4)

3) Root MSE: The RMSE is an extension of the MSE in which the square root of the error is taken. This is significant because it means that the RMSE has the same units as the target value being forecast [44]. We obtained an RMSE score of 0.310, using (5) to calculate RMSE:

\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum (y_a - y_p)^2}.   (5)

4) Root Mean-Squared Log Error: The RMSLE is the RMSE of the difference between the log-transformed predicted and actual values [16], [45]. To prevent taking the natural log of potential 0 (zero) values, the RMSLE adds 1 to both the actual and predicted values before taking the natural logarithm. We obtained an RMSLE score of 0.058, using (6) to calculate RMSLE:

\mathrm{RMSLE} = \sqrt{\frac{1}{N} \sum \left(\log(y_a + 1) - \log(y_p + 1)\right)^2}.   (6)
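The four error measures in (3)-(6) translate directly into code; a NumPy sketch follows, with y_a and y_p as the actual and predicted DO arrays from the equations.

# Direct NumPy implementations of Eqs. (3)-(6).
import numpy as np

def mae(y_a, y_p):
    return np.mean(np.abs(y_a - y_p))                    # Eq. (3)

def mse(y_a, y_p):
    return np.mean((y_a - y_p) ** 2)                     # Eq. (4)

def rmse(y_a, y_p):
    return np.sqrt(mse(y_a, y_p))                        # Eq. (5)

def rmsle(y_a, y_p):
    # log1p(x) = log(x + 1) avoids taking log(0) for zero-valued readings.
    return np.sqrt(np.mean((np.log1p(y_a) - np.log1p(y_p)) ** 2))  # Eq. (6)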
5) R-Squared: R-squared (R^2) is a measure of how well the model fits the data; it ranges from 0 to 1, with higher values indicating a better fit.

6) Mean Absolute Percentage Error: MAPE is a measure of the accuracy of the model's predictions, expressed as a percentage of the actual value. Lower values of MAPE indicate higher accuracy.

7) Diebold-Mariano Test: The Diebold-Mariano test is a statistical test used to compare the forecast accuracy of two or more prediction models. It determines whether one model significantly outperforms another in terms of forecast errors. Equation (7) is used to calculate the DM values:

\mathrm{DM} = \frac{\bar{d}}{\sqrt{\mathrm{var}(d)/T}}   (7)

where \bar{d} is the mean difference in forecast errors between the two models, \mathrm{var}(d) is the variance of the forecast error differences, and T is the number of forecast observations. The GA-XGCBXT model shows
slightly better performance compared to the XT regressor and CB regressor models, with DM values of 1.50 and 1.00, respectively. However, the difference is not statistically significant based on the p-values of 0.065 and 0.380. The comparison between the GA-XGCBXT model and XGB yields a DM value of 0.70 and a p-value of 0.242, indicating that there is no significant difference in accuracy between these models.
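A sketch of the test in (7) follows, comparing the error series of two models under squared-error loss with a normal approximation for the p-value; both choices are our assumptions, and this plain version omits small-sample corrections such as Harvey's adjustment.

# Sketch of the Diebold-Mariano statistic of Eq. (7) for two competing forecasts.
import numpy as np
from scipy.stats import norm

def diebold_mariano(y_true, pred_1, pred_2):
    # Loss differential under squared-error loss (an assumption).
    d = (y_true - pred_1) ** 2 - (y_true - pred_2) ** 2
    t = len(d)
    dm = d.mean() / np.sqrt(d.var(ddof=1) / t)           # Eq. (7)
    p_value = 2 * (1 - norm.cdf(abs(dm)))                # two-sided, normal approx.
    return dm, p_value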
C. Loss Curve
A high loss value typically denotes inaccurate output from the model, whereas a low loss value denotes fewer faults in the model. A cost function is typically used to quantify the loss, and it can measure the error in several ways [46]; the cost function selected is typically determined by the problem being solved and the data being used. The training loss indicates how well a deep learning model fits the training data, while the validation loss is a statistic that evaluates a model's performance on the validation set. Fig. 13 shows the training and testing loss curves, where the blue line represents the training loss and the orange line represents the test loss. To avoid overfitting, we implemented an early stopping technique during the training process: the model's performance on a validation set is monitored, and training stops when the performance on the validation set stops improving.

Fig. 13. Error loss according to epochs.
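A sketch of early stopping for the XGBoost component follows; the 50-round patience and 2000-round budget are illustrative, and note that the argument placement varies across XGBoost versions (constructor in recent releases, fit() in older ones).

# Sketch of early stopping on a held-out set; X_train/X_test are the splits
# from the earlier sketch. In XGBoost >= 1.6 early_stopping_rounds belongs to
# the constructor; older versions take it in fit().
from xgboost import XGBRegressor

model = XGBRegressor(n_estimators=2000, early_stopping_rounds=50,
                     eval_metric="rmse")
model.fit(X_train, y_train,
          eval_set=[(X_train, y_train), (X_test, y_test)],  # train/test curves
          verbose=False)
# model.evals_result() holds the per-round losses plotted in Fig. 13.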
data, which is common in many real-world datasets.
By combining these two algorithms, GA-XGCBXT can
VI. D ISCUSSION
leverage the strengths of each to improve performance.
The proposed method in this study, which is the GA- 3) Feature Selection: GA-XGCBXT uses a feature selec-
XGCBXT bagging ensemble model based on GAs, utilizes tion technique to select the most important features
time series analysis to make accurate forecasts of DO values in for the model, which can help reduce overfitting and
fish farming ponds. This method incorporates various feature improve generalization performance. By selecting the
selection techniques to identify the most relevant features that most informative features, GA-XGCBXT can improve
have a strong association with the primary data. The high its ability to make accurate predictions on new, unseen
accuracy of the proposed model can help address the research data.
question (RQ1) by providing precise DO projections, which 4) Ensemble Learning: GA-XGCBXT uses an ensemble
are crucial for carrying out the process of artificial aeration in learning approach, which combines multiple models
fish farming ponds. The proposed method’s ability to handle to make a final prediction. This can help reduce the
time series data and identify the most relevant features makes variance of the model and improve its robustness, as well
it a powerful tool for forecasting and managing water quality as potentially improve performance by leveraging the
in aquaculture systems. In terms of time complexity, our strengths of multiple models.
proposed approach has an O(N LlogL) complexity, where N The proposed approach has some limitations that may
is the number of samples and L is the number of features. This affect its applicability in certain contexts. One limitation is
is due to the use of the GA to perform feature selection, which that the method relies on accurate and complete data for
has an O(N LlogL) time complexity. The XGBoost model training the model, which may not always be available in
used in our approach has an O(N Ld) complexity, where d is practice. Additionally, the specific parameters chosen during
the maximum depth of the trees in the model. This means that the training process may affect the model’s performance.
the overall time complexity of our approach is dominated by Out-of-distribution (OOD) refers to the situation where the
the feature selection step rather than the model training step. model encounters inputs that are significantly different from
In comparison, other models, such as random forest and XTs, the training data, leading to a mismatch between the model’s
also have an O(N LlogL) time complexity. Therefore, our training and deployment environments. This can cause the
proposed approach has a similar time complexity to other tree- model to make erroneous predictions and negatively impact
based models while offering feature selection benefits through its performance and reliability [47]. The proposed model is
a GA. The factors that make GA-XGCBXT outperform the trained on the source and target domain data; it can accurately
other models may vary depending on the specific dataset and predict the output for inputs within the same distribution.
problem being addressed, but generally, the advantage of GA- Therefore, it is less likely to face the OOD problem. However,
XGCBXT can be attributed to the following factors. further investigation is needed to address the OOD problem.
1) GA Optimization: GA-XGCBXT is optimized using a In future work, this work can be extended using optimized
GA, which is a powerful optimization technique that models such as optimized XGBoost, and optimized Random
can effectively explore a large search space and find Forest.
an optimal set of hyperparameters. This allows GA-
XGCBXT to tune its parameters better and potentially VII. C ONCLUSION
find a better solution than other models. To solve DO prediction in smart fish farms, this article
2) XGBoost With CB Regressor: GA-XGCBXT uses a proposes a new GA-XGCBXT method. The best features have
combination of the XGBoost and CB regressors, which been chosen using various methods that correlate well with the

Authorized licensed use limited to: West Virginia Univ Institute of Technology. Downloaded on July 22,2023 at 05:42:54 UTC from IEEE Xplore. Restrictions apply.
initial data. The results show that feature selection improves the accuracy of prediction. A GA-XGCBXT bagging ensemble model is proposed for DO prediction. The performance of the proposed model was evaluated on training and validation sets against observed sensor data, and the accuracy of the results predicted by the proposed model was assessed using a variety of performance indices. We used the actual dataset from the smart fish farm in our experiments for the forecasting model. The fish farm includes two temperature sensors and other ORP, EC, DO, and pH sensors. We have used different evaluation metrics to assess the performance of our proposed model, obtaining an MAE score of 0.21, an MSE score of 0.11, an RMSE score of 0.31, and an RMSLE score of 0.058. We have compared the performance of our proposed bagging ensemble model with several state-of-the-art ML models. The proposed work can provide the data foundation for an early warning system and better management of aquaculture farms. Further work will focus on enhancing model accuracy through parameter optimization.

REFERENCES

[1] J. Huan, H. Li, M. Li, and B. Chen, "Prediction of dissolved oxygen in aquaculture based on gradient boosting decision tree and long short-term memory network: A study of Chang Zhou fishery demonstration base, China," Comput. Electron. Agricult., vol. 175, Aug. 2020, Art. no. 105530.
[2] M. Luo and Q. Wang, "A reflective optical fiber SPR sensor with surface modified hemoglobin for dissolved oxygen detection," Alexandria Eng. J., vol. 60, no. 4, pp. 4115–4120, Aug. 2021.
[3] E. Eze and T. Ajmal, "Dissolved oxygen forecasting in aquaculture: A hybrid model approach," Appl. Sci., vol. 10, no. 20, p. 7079, Oct. 2020.
[4] W. Li, H. Wu, N. Zhu, Y. Jiang, J. Tan, and Y. Guo, "Prediction of dissolved oxygen in a fishery pond based on gated recurrent unit (GRU)," Inf. Process. Agricult., vol. 8, no. 1, pp. 185–193, Mar. 2021.
[5] W. Li, Y. Wei, D. An, Y. Jiao, and Q. Wei, "LSTM-TCN: Dissolved oxygen prediction in aquaculture, based on combined model of long short-term memory network and temporal convolutional network," Environ. Sci. Pollut. Res., vol. 29, no. 26, pp. 39545–39556, Jun. 2022.
[6] Y. Wei, D. Li, H. Tai, J. Wang, and Q. Ding, "Prediction of dissolved oxygen content in aquaculture of sea cucumber using support vector regression," Sensor Lett., vol. 9, no. 3, pp. 1075–1082, Jun. 2011.
[7] S. Liu et al., "Prediction of dissolved oxygen content in river crab culture based on least squares support vector regression optimized by improved particle swarm optimization," Comput. Electron. Agricult., vol. 95, pp. 82–91, Jul. 2013.
[8] X. Ta and Y. Wei, "Research on a dissolved oxygen prediction method for recirculating aquaculture systems based on a convolution neural network," Comput. Electron. Agricult., vol. 145, pp. 302–310, Feb. 2018.
[9] B. F. Z. Sami et al., "Machine learning algorithm as a sustainable tool for dissolved oxygen prediction: A case study of Feitsui Reservoir, Taiwan," Sci. Rep., vol. 12, no. 1, pp. 1–12, Mar. 2022.
[10] O. Kisi, M. Alizamir, and A. D. Gorgij, "Dissolved oxygen prediction using a new ensemble method," Environ. Sci. Pollut. Res., vol. 27, no. 9, pp. 9589–9603, Mar. 2020.
[11] Z. Xiao, L. Peng, Y. Chen, H. Liu, J. Wang, and Y. Nie, "The dissolved oxygen prediction method based on neural network," Complexity, vol. 2017, pp. 1–6, Oct. 2017.
[12] E. Olyaie, H. Z. Abyaneh, and A. D. Mehr, "A comparative analysis among computational intelligence techniques for dissolved oxygen prediction in Delaware river," Geosci. Frontiers, vol. 8, no. 3, pp. 517–527, May 2017.
[13] Y. Wu, L. Sun, X. Sun, and B. Wang, "A hybrid XGBoost-ISSA-LSTM model for accurate short-term and long-term dissolved oxygen prediction in ponds," Environ. Sci. Pollut. Res., vol. 29, no. 12, pp. 18142–18159, Mar. 2022.
[14] (Jun. 2022). Hanwha Aqua Planet Jeju. [Online]. Available: https://english.visitkorea.or.kr/enu/ATR/SI_EN_3_1_1_1.jsp?cid=2350810
[15] G. I. Webb and Z. Zheng, "Multistrategy ensemble learning: Reducing error by combining ensemble learning techniques," IEEE Trans. Knowl. Data Eng., vol. 16, no. 8, pp. 980–991, Aug. 2004.
[16] P. W. Khan and Y.-C. Byun, "Multi-fault detection and classification of wind turbines using stacking classifier," Sensors, vol. 22, no. 18, p. 6955, Sep. 2022.
[17] Q. Sun and B. Pfahringer, "Bagging ensemble selection," in Proc. Australas. Joint Conf. Artif. Intell. Cham, Switzerland: Springer, 2011, pp. 251–260.
[18] S. Mirjalili, "Genetic algorithm," in Evolutionary Algorithms and Neural Networks. Cham, Switzerland: Springer, 2019, pp. 43–55.
[19] W. Siedlecki and J. Sklansky, "A note on genetic algorithms for large-scale feature selection," in Handbook of Pattern Recognition and Computer Vision. Singapore: World Scientific, 1993, pp. 88–107.
[20] O. H. Babatunde, L. Armstrong, J. Leng, and D. Diepeveen, "A genetic algorithm-based feature selection," Int. J. Electron. Commun. Comput. Eng., vol. 5, no. 4, pp. 899–905, 2014.
[21] I.-S. Oh, J.-S. Lee, and B.-R. Moon, "Hybrid genetic algorithms for feature selection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 11, pp. 1424–1437, Nov. 2004.
[22] A. V. Dorogush, V. Ershov, and A. Gulin, "CatBoost: Gradient boosting with categorical features support," 2018, arXiv:1810.11363.
[23] F. Zhang and H. Fleyeh, "Short term electricity spot price forecasting using CatBoost and bidirectional long short term memory neural network," in Proc. 16th Int. Conf. Eur. Energy Market (EEM), Sep. 2019, pp. 1–6.
[24] D. Niu, L. Diao, Z. Zang, H. Che, T. Zhang, and X. Chen, "A machine-learning approach combining wavelet packet denoising with catboost for weather forecasting," Atmosphere, vol. 12, no. 12, p. 1618, Dec. 2021.
[25] M. Massaoudi, S. S. Refaat, H. Abu-Rub, I. Chihi, and F. S. Wesleti, "A hybrid Bayesian ridge Regression-CWT-Catboost model for PV power forecasting," in Proc. IEEE Kansas Power Energy Conf. (KPEC), Jul. 2020, pp. 1–5.
[26] L. Grbčić et al., "Coastal water quality prediction based on machine learning with feature interpretation and spatio-temporal analysis," 2021, arXiv:2107.03230.
[27] T. Chen, T. He, M. Benesty, V. Khotilovich, Y. Tang, and H. Cho, "Xgboost: Extreme gradient boosting," R Package Version, vol. 1, no. 4, pp. 1–4, Aug. 2015.
[28] B. Yu et al., "SubMito-XGBoost: Predicting protein submitochondrial localization by fusing multiple feature information and eXtreme gradient boosting," Bioinformatics, vol. 36, no. 4, pp. 1074–1081, Feb. 2020.
[29] I. L. Cherif and A. Kortebi, "On using eXtreme gradient boosting (XGBoost) machine learning algorithm for home network traffic classification," in Proc. Wireless Days (WD), Apr. 2019, pp. 1–6.
[30] P. Trizoglou, X. Liu, and Z. Lin, "Fault detection by an ensemble framework of extreme gradient boosting (XGBoost) in the operation of offshore wind turbines," Renew. Energy, vol. 179, pp. 945–962, Dec. 2021.
[31] V. John, N. M. Karunakaran, C. Guo, K. Kidono, and S. Mita, "Free space, visible and missing lane marker estimation using the PsiNet and extra trees regression," in Proc. 24th Int. Conf. Pattern Recognit. (ICPR), Aug. 2018, pp. 189–194.
[32] S. Alawadi, D. Mera, M. Fernández-Delgado, F. Alkhabbas, C. M. Olsson, and P. Davidsson, "A comparison of machine learning algorithms for forecasting indoor temperature in smart buildings," Energy Syst., vol. 13, pp. 689–705, Jan. 2020.
[33] P.-P. Phyo, Y.-C. Byun, and N. Park, "Short-term energy forecasting using machine-learning-based ensemble voting regression," Symmetry, vol. 14, no. 1, p. 160, Jan. 2022.
[34] A. Y. Barrera-Animas, L. O. Oyedele, M. Bilal, T. D. Akinosho, J. M. D. Delgado, and L. A. Akanbi, "Rainfall prediction: A comparative analysis of modern machine learning algorithms for time-series forecasting," Mach. Learn. with Appl., vol. 7, Mar. 2022, Art. no. 100204.
[35] R. Schafer, "What is a Savitzky–Golay filter? [Lecture notes]," IEEE Signal Process. Mag., vol. 28, no. 4, pp. 111–117, Jul. 2011.
[36] S.-C. Lo, "The effects of feature selection and model selection on the correctness of classification," in Proc. IEEE Int. Conf. Ind. Eng. Eng. Manage., Dec. 2010, pp. 989–993.
