Optimized Dissolved Oxygen Prediction Using GA
Wei et al. [6] employed support vector regression (SVR) to predict DO in sea cucumber aquaculture using inputs such as potential hydrogen (pH), electrical conductivity (EC), water temperature, and DO. Liu et al. [7] also employed a least-squares SVR variant to predict DO in river crab culture; in addition, they improved on the particle swarm optimization technique used to tune it. Ta and Wei [8] employed a convolutional neural network (CNN) for DO prediction in recirculating aquaculture systems. Eze and Ajmal [3] presented a hybrid model for projecting DO in aquaculture, employing an LSTM neural network based on ensemble empirical mode decomposition (EEMD).

The proposed bagging ensemble model includes CatBoost (CB), extreme gradient boosting (XGBoost), and an extra tree regressor (a minimal illustrative sketch of such an ensemble is given after the Article Organization subsection below). The performance of the proposed model was evaluated on training and validation sets against observed sensor data. The accuracy of the results predicted by the proposed model was assessed using a variety of performance indices, including mean absolute error (MAE), mean square error (MSE), root mean square error (RMSE), and root mean square log error (RMSLE). Additionally, the performance of bagging ensemble approaches used in smart fish farming …

A. Contributions
We propose a prediction model for DO using cutting-edge ML algorithms, and we use an actual dataset from a smart fish farm in our forecasting experiments. A genetic algorithm-based XGBoost, CB, and extra tree (GA-XGCBXT) bagging ensemble model is proposed for DO prediction: a genetic algorithm (GA) is used for optimal feature selection, and the XGBoost, CB, and extra tree models are the components of the bagging ensemble, which performs better than the standalone models. We set up a smart fish farm at Hanwha Aqua Planet Jeju in South Korea to collect the experimental data. The significant contributions of this article are as follows:
1) obtaining data from the smart fish farm with oxidation-reduction potential (ORP), EC, DO, and pH sensors, along with two temperature sensors;
2) utilizing a GA and SHapley Additive exPlanations (SHAP) for optimal feature selection;
3) presenting a GA-XGCBXT bagging ensemble model;
4) comparing the proposed model with several state-of-the-art ML-based prediction algorithms.

Fig. 1. Flow diagram of the proposed methodology.

B. Article Organization
The rest of this article is structured as follows: Section II details the proposed approach and the ML models. Section III describes the procedure for data curation and summarization, including dataset smoothing and preprocessing. Section IV discusses the performance of the proposed model in detail and compares our findings with those of other models in use. The final section concludes this article's work and offers suggestions for further research.
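As referenced above, the following is a minimal illustrative sketch of how a bagging ensemble of this kind can be assembled: XGBoost, CB, and extra-trees regressors are each fitted on bootstrap resamples, and their predictions are averaged. The hyperparameters, bag count, and NumPy-array inputs are assumptions for the sketch, not the exact configuration of GA-XGCBXT.

    import numpy as np
    from sklearn.ensemble import ExtraTreesRegressor
    from xgboost import XGBRegressor
    from catboost import CatBoostRegressor

    def make_base_learners():
        """Fresh, unfitted base learners (hyperparameters are illustrative)."""
        return [
            XGBRegressor(n_estimators=300, max_depth=6, learning_rate=0.05),
            CatBoostRegressor(iterations=300, depth=6, verbose=0),
            ExtraTreesRegressor(n_estimators=300, random_state=0),
        ]

    def fit_bagging_ensemble(X, y, n_bags=10, seed=42):
        """Fit every base learner on each of n_bags bootstrap resamples.

        X and y are assumed to be NumPy arrays.
        """
        rng = np.random.default_rng(seed)
        members = []
        for _ in range(n_bags):
            idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
            for est in make_base_learners():
                members.append(est.fit(X[idx], y[idx]))
        return members

    def predict_bagging_ensemble(members, X):
        """Average the member predictions to obtain the ensemble output."""
        return np.mean([m.predict(X) for m in members], axis=0)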
II. BACKGROUND REVIEW
Predicting DO, an important water quality parameter, is essential for aquatic managers in charge of maintaining the health of ecosystems and running reservoirs [9]. Most DO prediction models are complicated. Additionally, reliable data to develop and calibrate new DO models need to be …
TABLE II
Dataset Specifications

TABLE III
Descriptive Statistics of the Dataset
Fig. 3. Mean feature values according to each day. (a) DO. (b) Temperature. (c) pH. (d) EC. (e) ORP.

A. Data Smoothing
The Savitzky–Golay filter is used for smoothing. The Savitzky–Golay filter is a digital filter that uses neighboring data points to smooth a signal [35]. Using the least-squares method, a small window is created, a polynomial is fitted to the data in that window, and the fitted polynomial is then used to determine the window's center point. Once all the neighbors have been roughly adjusted with one another, the window is shifted by one data point. Fig. 5 shows a graphical comparison of DO before and after smoothing.
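For reference, this smoothing step can be reproduced with SciPy's Savitzky–Golay implementation; the input file name, window length, and polynomial order below are illustrative assumptions, not the exact values used to produce Fig. 5.

    import numpy as np
    from scipy.signal import savgol_filter

    # Raw DO sensor readings; the file name is a hypothetical placeholder.
    do_series = np.loadtxt("do_readings.csv")

    # Fit a cubic polynomial in a sliding 21-sample window; each window's
    # center point is replaced by the fitted value, and the window then
    # shifts by one sample, as described above.
    do_smoothed = savgol_filter(do_series, window_length=21, polyorder=3)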
B. Training and Testing Data
The dataset was collected in the months of April, May, and June 2022; hence, it contains data from April 1, 2022, to June 30, 2022, and the total rows of the dataset were …
Fig. 4. Features according to each day. (a) DO. (b) Temperature. (c) pH.
TABLE IV
Dataset Division According to Training and Testing Data

TABLE V
ML Implementation Environment
Fig. 11. Cover, gain, and weight analysis of features. (a) Total cover. (b) Cover score. (c) Gain score. (d) Weight score.
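The cover, gain, and weight scores of Fig. 11 are the standard XGBoost feature-importance measures, which can be read off a trained booster as sketched below; the synthetic data here is a placeholder, since in practice the fitted model from the experiments would be inspected instead.

    import numpy as np
    from xgboost import XGBRegressor

    # Synthetic placeholder data standing in for the real sensor features.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))
    y = X @ rng.normal(size=5)
    booster = XGBRegressor(n_estimators=100).fit(X, y).get_booster()

    # "weight": number of splits that use a feature; "gain": average loss
    # reduction of those splits; "cover"/"total_cover": number of samples
    # the splits affect. These correspond to the scores plotted in Fig. 11.
    for imp_type in ("weight", "gain", "cover", "total_cover"):
        print(imp_type, booster.get_score(importance_type=imp_type))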
The GA-XGCBXT model shows slightly better performance compared to the XT regressor and CB regressor models, with DM values of 1.50 and 1.00, respectively. However, the difference is not statistically significant based on the p-values of 0.065 and 0.380. The comparison between the GA-XGCBXT model and XGB yields a DM value of 0.70 and a p-value of 0.242, indicating that there is no significant difference in accuracy between these models.
C. Loss Curve
A high loss value typically denotes inaccurate output from the model, whereas a low loss value denotes fewer faults in the model. A cost function is typically used to quantify the loss; it can measure the error in several ways [46], and the choice of cost function is typically determined by the problem being solved and the data being used. The training loss indicates how well a deep-learning model fits the training data; the validation loss, on the other hand, is the statistic used to evaluate the model's performance on the validation set. Fig. 13 shows the training and testing loss curves, where the blue line represents the training loss and the orange line represents the test loss. To avoid overfitting, we implemented an early stopping technique during the training process: the model's performance on a validation set is monitored, and training stops when the performance on the validation set stops improving.
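Early stopping of this kind can be expressed directly in the scikit-learn API of XGBoost, as in the minimal sketch below; the synthetic data, 20% validation split, RMSE metric, and 50-round patience are illustrative assumptions (the constructor arguments assume xgboost >= 1.6).

    import numpy as np
    from sklearn.model_selection import train_test_split
    from xgboost import XGBRegressor

    # Synthetic placeholder features and DO targets.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=1000)

    # Hold out a validation set whose loss is monitored during training.
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, shuffle=False)

    # Training halts once the validation RMSE has not improved for 50 rounds.
    model = XGBRegressor(n_estimators=2000, learning_rate=0.05,
                         eval_metric="rmse", early_stopping_rounds=50)
    model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
    print("best iteration:", model.best_iteration)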
VI. DISCUSSION
The proposed method in this study, the GA-XGCBXT bagging ensemble model based on GAs, utilizes time series analysis to make accurate forecasts of DO values in fish farming ponds. The method incorporates various feature selection techniques to identify the most relevant features, i.e., those with a strong association with the primary data. The high accuracy of the proposed model can help address the research question (RQ1) by providing precise DO projections, which are crucial for carrying out artificial aeration in fish farming ponds. The proposed method's ability to handle time series data and to identify the most relevant features makes it a powerful tool for forecasting and managing water quality in aquaculture systems. In terms of time complexity, our proposed approach has an O(NL log L) complexity, where N is the number of samples and L is the number of features. This is due to the use of the GA to perform feature selection, which has an O(NL log L) time complexity. The XGBoost model used in our approach has an O(NLd) complexity, where d is the maximum depth of the trees in the model; hence, the overall time complexity of our approach is dominated by the feature selection step rather than the model training step. In comparison, other models, such as random forests and XTs, also have an O(NL log L) time complexity. Therefore, our proposed approach has a time complexity similar to that of other tree-based models while offering the feature selection benefits of a GA. The factors that make GA-XGCBXT outperform the other models may vary with the specific dataset and problem being addressed, but its advantage can generally be attributed to the following factors (a minimal GA feature-selection sketch follows this list):
1) GA Optimization: GA-XGCBXT is optimized using a GA, a powerful optimization technique that can effectively explore a large search space and find an optimal set of hyperparameters. This allows GA-XGCBXT to tune its parameters better and potentially find a better solution than other models.
2) XGBoost With CB Regressor: GA-XGCBXT uses a combination of the XGBoost and CB regressors, which are both powerful ML algorithms. XGBoost is known for its efficiency and its ability to handle large datasets, while the CB regressor can effectively handle categorical data, which is common in many real-world datasets. By combining these two algorithms, GA-XGCBXT can leverage the strengths of each to improve performance.
3) Feature Selection: GA-XGCBXT uses a feature selection technique to select the most important features for the model, which can help reduce overfitting and improve generalization performance. By selecting the most informative features, GA-XGCBXT can improve its ability to make accurate predictions on new, unseen data.
4) Ensemble Learning: GA-XGCBXT uses an ensemble learning approach, which combines multiple models to make a final prediction. This can help reduce the variance of the model and improve its robustness, as well as potentially improve performance by leveraging the strengths of multiple models.
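As referenced above, here is a minimal sketch of GA-based feature selection: bit masks encode feature subsets, fitness is a cross-validated error, and selection, one-point crossover, and bit-flip mutation evolve the population. The population size, generation count, mutation rate, and base estimator are illustrative assumptions rather than our tuned settings.

    import numpy as np
    from sklearn.ensemble import ExtraTreesRegressor
    from sklearn.model_selection import cross_val_score

    def fitness(mask, X, y):
        """Cross-validated negative MSE of a model using the masked features.

        X and y are assumed to be NumPy arrays with at least two features.
        """
        if not mask.any():
            return -np.inf
        model = ExtraTreesRegressor(n_estimators=100, random_state=0)
        return cross_val_score(model, X[:, mask], y, cv=3,
                               scoring="neg_mean_squared_error").mean()

    def ga_select(X, y, pop=20, gens=15, p_mut=0.1, seed=0):
        """Evolve boolean feature masks; return the best mask found."""
        rng = np.random.default_rng(seed)
        n = X.shape[1]
        population = rng.random((pop, n)) < 0.5              # random initial masks
        for _ in range(gens):
            scores = np.array([fitness(m, X, y) for m in population])
            parents = population[np.argsort(scores)[-(pop // 2):]]  # keep best half
            mates = np.roll(parents, 1, axis=0)
            cuts = rng.integers(1, n, size=len(parents))     # one-point crossover
            kids = np.array([np.concatenate((a[:c], b[c:]))
                             for a, b, c in zip(parents, mates, cuts)])
            kids ^= rng.random(kids.shape) < p_mut           # bit-flip mutation
            population = np.vstack((parents, kids))
        scores = np.array([fitness(m, X, y) for m in population])
        return population[int(scores.argmax())]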
The proposed approach has some limitations that may affect its applicability in certain contexts. One limitation is that the method relies on accurate and complete data for training the model, which may not always be available in practice. Additionally, the specific parameters chosen during the training process may affect the model's performance. Out-of-distribution (OOD) inputs are inputs that differ significantly from the training data, creating a mismatch between the model's training and deployment environments; this can cause the model to make erroneous predictions and negatively impact its performance and reliability [47]. Because the proposed model is trained on both the source and target domain data, it can accurately predict the output for inputs within the same distribution and is therefore less likely to face the OOD problem; however, further investigation is needed to address OOD inputs fully. In future work, this work can be extended using optimized models, such as an optimized XGBoost and an optimized random forest.
VII. CONCLUSION
To solve DO prediction in smart fish farms, this article proposes a new GA-XGCBXT method. The best features were chosen using various methods that correlate well with the initial data, and the results show that feature selection improves prediction accuracy. A GA-XGCBXT bagging ensemble model is proposed for DO prediction, and its performance was evaluated on training and validation sets against observed sensor data using a variety of performance indices. We used the actual dataset from the smart fish farm in our experiments for the forecasting model; the fish farm includes two temperature sensors along with ORP, EC, DO, and pH sensors. We obtained an MAE score of 0.21, an MSE score of 0.11, an RMSE score of 0.31, and an RMSLE score of 0.058, and we compared the performance of our proposed bagging ensemble model with several state-of-the-art ML models. The proposed work can provide the data foundation for an early warning system and better management of aquaculture farms. Further work will focus on enhancing model accuracy through parameter optimization.
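For reproducibility, the four reported indices can be computed as sketched below; the arrays are hypothetical stand-ins for observed and predicted DO values.

    import numpy as np
    from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                                 mean_squared_log_error)

    y_true = np.array([7.8, 8.1, 7.5, 8.4])   # observed DO (illustrative)
    y_pred = np.array([7.6, 8.0, 7.9, 8.2])   # predicted DO (illustrative)

    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))
    print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  RMSLE={rmsle:.3f}")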
REFERENCES
[1] J. Huan, H. Li, M. Li, and B. Chen, "Prediction of dissolved oxygen in aquaculture based on gradient boosting decision tree and long short-term memory network: A study of Chang Zhou fishery demonstration base, China," Comput. Electron. Agricult., vol. 175, Aug. 2020, Art. no. 105530.
[2] M. Luo and Q. Wang, "A reflective optical fiber SPR sensor with surface modified hemoglobin for dissolved oxygen detection," Alexandria Eng. J., vol. 60, no. 4, pp. 4115–4120, Aug. 2021.
[3] E. Eze and T. Ajmal, "Dissolved oxygen forecasting in aquaculture: A hybrid model approach," Appl. Sci., vol. 10, no. 20, p. 7079, Oct. 2020.
[4] W. Li, H. Wu, N. Zhu, Y. Jiang, J. Tan, and Y. Guo, "Prediction of dissolved oxygen in a fishery pond based on gated recurrent unit (GRU)," Inf. Process. Agricult., vol. 8, no. 1, pp. 185–193, Mar. 2021.
[5] W. Li, Y. Wei, D. An, Y. Jiao, and Q. Wei, "LSTM-TCN: Dissolved oxygen prediction in aquaculture, based on combined model of long short-term memory network and temporal convolutional network," Environ. Sci. Pollut. Res., vol. 29, no. 26, pp. 39545–39556, Jun. 2022.
[6] Y. Wei, D. Li, H. Tai, J. Wang, and Q. Ding, "Prediction of dissolved oxygen content in aquaculture of sea cucumber using support vector regression," Sensor Lett., vol. 9, no. 3, pp. 1075–1082, Jun. 2011.
[7] S. Liu et al., "Prediction of dissolved oxygen content in river crab culture based on least squares support vector regression optimized by improved particle swarm optimization," Comput. Electron. Agricult., vol. 95, pp. 82–91, Jul. 2013.
[8] X. Ta and Y. Wei, "Research on a dissolved oxygen prediction method for recirculating aquaculture systems based on a convolution neural network," Comput. Electron. Agricult., vol. 145, pp. 302–310, Feb. 2018.
[9] B. F. Z. Sami et al., "Machine learning algorithm as a sustainable tool for dissolved oxygen prediction: A case study of Feitsui Reservoir, Taiwan," Sci. Rep., vol. 12, no. 1, pp. 1–12, Mar. 2022.
[10] O. Kisi, M. Alizamir, and A. D. Gorgij, "Dissolved oxygen prediction using a new ensemble method," Environ. Sci. Pollut. Res., vol. 27, no. 9, pp. 9589–9603, Mar. 2020.
[11] Z. Xiao, L. Peng, Y. Chen, H. Liu, J. Wang, and Y. Nie, "The dissolved oxygen prediction method based on neural network," Complexity, vol. 2017, pp. 1–6, Oct. 2017.
[12] E. Olyaie, H. Z. Abyaneh, and A. D. Mehr, "A comparative analysis among computational intelligence techniques for dissolved oxygen prediction in Delaware river," Geosci. Frontiers, vol. 8, no. 3, pp. 517–527, May 2017.
[13] Y. Wu, L. Sun, X. Sun, and B. Wang, "A hybrid XGBoost-ISSA-LSTM model for accurate short-term and long-term dissolved oxygen prediction in ponds," Environ. Sci. Pollut. Res., vol. 29, no. 12, pp. 18142–18159, Mar. 2022.
[14] (Jun. 2022). Hanwha Aqua Planet Jeju. [Online]. Available: https://english.visitkorea.or.kr/enu/ATR/SI_EN_3_1_1_1.jsp?cid=2350810
[15] G. I. Webb and Z. Zheng, "Multistrategy ensemble learning: Reducing error by combining ensemble learning techniques," IEEE Trans. Knowl. Data Eng., vol. 16, no. 8, pp. 980–991, Aug. 2004.
[16] P. W. Khan and Y.-C. Byun, "Multi-fault detection and classification of wind turbines using stacking classifier," Sensors, vol. 22, no. 18, p. 6955, Sep. 2022.
[17] Q. Sun and B. Pfahringer, "Bagging ensemble selection," in Proc. Australas. Joint Conf. Artif. Intell. Cham, Switzerland: Springer, 2011, pp. 251–260.
[18] S. Mirjalili, "Genetic algorithm," in Evolutionary Algorithms and Neural Networks. Cham, Switzerland: Springer, 2019, pp. 43–55.
[19] W. Siedlecki and J. Sklansky, "A note on genetic algorithms for large-scale feature selection," in Handbook of Pattern Recognition and Computer Vision. Singapore: World Scientific, 1993, pp. 88–107.
[20] O. H. Babatunde, L. Armstrong, J. Leng, and D. Diepeveen, "A genetic algorithm-based feature selection," Int. J. Electron. Commun. Comput. Eng., vol. 5, no. 4, pp. 899–905, 2014.
[21] I.-S. Oh, J.-S. Lee, and B.-R. Moon, "Hybrid genetic algorithms for feature selection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 11, pp. 1424–1437, Nov. 2004.
[22] A. V. Dorogush, V. Ershov, and A. Gulin, "CatBoost: Gradient boosting with categorical features support," 2018, arXiv:1810.11363.
[23] F. Zhang and H. Fleyeh, "Short term electricity spot price forecasting using CatBoost and bidirectional long short term memory neural network," in Proc. 16th Int. Conf. Eur. Energy Market (EEM), Sep. 2019, pp. 1–6.
[24] D. Niu, L. Diao, Z. Zang, H. Che, T. Zhang, and X. Chen, "A machine-learning approach combining wavelet packet denoising with CatBoost for weather forecasting," Atmosphere, vol. 12, no. 12, p. 1618, Dec. 2021.
[25] M. Massaoudi, S. S. Refaat, H. Abu-Rub, I. Chihi, and F. S. Wesleti, "A hybrid Bayesian ridge regression-CWT-CatBoost model for PV power forecasting," in Proc. IEEE Kansas Power Energy Conf. (KPEC), Jul. 2020, pp. 1–5.
[26] L. Grbčić et al., "Coastal water quality prediction based on machine learning with feature interpretation and spatio-temporal analysis," 2021, arXiv:2107.03230.
[27] T. Chen, T. He, M. Benesty, V. Khotilovich, Y. Tang, and H. Cho, "Xgboost: Extreme gradient boosting," R Package Version, vol. 1, no. 4, pp. 1–4, Aug. 2015.
[28] B. Yu et al., "SubMito-XGBoost: Predicting protein submitochondrial localization by fusing multiple feature information and eXtreme gradient boosting," Bioinformatics, vol. 36, no. 4, pp. 1074–1081, Feb. 2020.
[29] I. L. Cherif and A. Kortebi, "On using eXtreme gradient boosting (XGBoost) machine learning algorithm for home network traffic classification," in Proc. Wireless Days (WD), Apr. 2019, pp. 1–6.
[30] P. Trizoglou, X. Liu, and Z. Lin, "Fault detection by an ensemble framework of extreme gradient boosting (XGBoost) in the operation of offshore wind turbines," Renew. Energy, vol. 179, pp. 945–962, Dec. 2021.
[31] V. John, N. M. Karunakaran, C. Guo, K. Kidono, and S. Mita, "Free space, visible and missing lane marker estimation using the PsiNet and extra trees regression," in Proc. 24th Int. Conf. Pattern Recognit. (ICPR), Aug. 2018, pp. 189–194.
[32] S. Alawadi, D. Mera, M. Fernández-Delgado, F. Alkhabbas, C. M. Olsson, and P. Davidsson, "A comparison of machine learning algorithms for forecasting indoor temperature in smart buildings," Energy Syst., vol. 13, pp. 689–705, Jan. 2020.
[33] P.-P. Phyo, Y.-C. Byun, and N. Park, "Short-term energy forecasting using machine-learning-based ensemble voting regression," Symmetry, vol. 14, no. 1, p. 160, Jan. 2022.
[34] A. Y. Barrera-Animas, L. O. Oyedele, M. Bilal, T. D. Akinosho, J. M. D. Delgado, and L. A. Akanbi, "Rainfall prediction: A comparative analysis of modern machine learning algorithms for time-series forecasting," Mach. Learn. Appl., vol. 7, Mar. 2022, Art. no. 100204.
[35] R. Schafer, "What is a Savitzky–Golay filter? [Lecture notes]," IEEE Signal Process. Mag., vol. 28, no. 4, pp. 111–117, Jul. 2011.
[36] S.-C. Lo, "The effects of feature selection and model selection on the correctness of classification," in Proc. IEEE Int. Conf. Ind. Eng. Eng. Manage., Dec. 2010, pp. 989–993.