Standardization of Rainfall Prediction in Bangladesh Using Machine Learning Approach
Abstract— Rainfall has a significant impact on human life in many areas, including agriculture and natural disasters such as droughts, floods, and landslides. Artificial intelligence technologies have a high chance of succeeding in these fields. The goal of this research is to develop a rain prediction model using Machine Learning. The model is built on a dataset of 2,391 records collected from the website named Bangladesh Jatiyo Tottho Batayon. We used a total of five models for the study. Each model was trained with eight input features and then validated on its rainfall predictions. All of the models performed well, but Random Forest predicted the dataset most accurately, making it the best model for this study. The Random Forest classifier delivered higher Accuracy, Recall, Precision, and F1 Score than the other models.

Keywords— Machine learning, Rainfall, Climate, Random Forest, Confusion matrix.

I. INTRODUCTION
Rainfall is the most important component of the hydrological cycle, and it is the principal source of water that can be used and regulated directly to meet consumption requirements. Water is the most valuable natural resource and plays a significant role in social and economic development. Water resources are generally under stress or facing a mismatch between supply and demand in numerous regions around the globe. This is due to a rise in demand, a shortage of accessible water, or poor management of available water resources. In fact, proper water planning and management are the main options for reducing water stress or closing the gap between supply and demand. Numerous steps have been taken around the world not only to predict rainfall but also to establish the relationship between rainfall and runoff, so that water management procedures can be adapted to make water accessible whenever it is needed [1].
Bangladesh's rainfall varies based on the season and region. Winter (November to February) is extremely dry, accounting for only around 4% of the yearly rainfall. The westerly disturbances that approach the country from the northwest cause rainfall to vary from 20 mm in the west and south to 40 mm in the northeast during this season. Bangladesh's average annual rainfall ranges from 1500 millimeters in the west-central regions to over 3000 millimeters in the northeast and southeast. The rainfall in Surma Valley and the surrounding hills is extremely high. Rainfall averages 4180 mm in Sylhet, 5330 mm at the foot of the steep Meghalaya Plateau in Sunamganj, and 6400 mm in Lalakhal, the highest in Bangladesh.
Scientists and researchers all over the world are working to improve rainfall forecasting accuracy. Several statistical and numerical models have been developed for forecasting rainfall. In the realm of forecasting, however, the importance of numerical models is fading due to ongoing developments in soft computing methodologies. Artificial neural networks have emerged as an inductive strategy for hydrological and climatological forecasting in recent decades, owing to their ability to model both linear and non-linear systems without any prior knowledge about catchment behavior and flow processes. For early warning of landslide occurrence, a daily rainfall forecasting system based on the BPNN model can be used. That research provides four backpropagation neural networks with various topologies capable of predicting the next day's rainfall intensity [2]. Rainfall thresholds and antecedent rainfall records were among the input features. Support vector machine (SVM), another innovative machine learning strategy, has proven to be effective for handling non-linear and complicated machine learning problems. Non-linear regression problems are solved using support vector regression (SVR), a modified form of SVM. Comparing several modern methodologies and algorithms helps to improve rainfall forecasting accuracy. One study evaluates the accuracy of monthly rainfall forecasts generated using time-series analysis approaches, with antecedent rainfall intensity as the single input parameter [3]. Linear regression, backpropagation neural network (BPNN), support vector regression (SVR), and the long short term memory (LSTM) recurrent neural network are the machine learning approaches used to make the predictions. To determine the likelihood of a landslide, rainfall estimates for a certain site can be compared to the appropriate landslide-triggering rainfall thresholds.
In this research study, we have made rainfall predictions based on daily climate data. Five machine learning algorithms have been trained and tested using the data. Those algorithms are Decision Tree, K Neighbors, Logistic Regression, Multinomial NB, and Random Forest. The remainder of the article is organized as follows. The related work section gives a comprehensive summary of the relevant articles. The methodology section describes the dataset and algorithms used in our paper. The result and discussion section goes over all of the models' findings and outcomes. The final section summarizes our work.

II. RELATED WORK
This section discusses some of the significant studies related to our research topic that have been completed by other
authors. Rani et al. [4] presented a flood detection system and an alerting system using machine learning techniques. For rainfall prediction, they used Linear Regression, Support Vector Machines, and Artificial Neural Networks in their research. Kader et al. [5] applied Particle Swarm Optimization (PSO) and Multi Layer Perceptron (MLP) techniques for rainfall forecasting. They experimented with four factors: high temperature, low temperature, humidity, and wind speed. Misra et al. [6] used a machine learning approach to anticipate average rainy season rainfall in the Indian state of Odisha. In their study, a maximum accuracy of 91% was achieved by Random Forest regression. Zhang et al. [7] used Support Vector Regression and a Multilayer Perceptron for maximum rainfall prediction in both monsoon and non-monsoon seasons. The parameters used in their research are monthly average temperature, wind velocity, humidity, and cloud cover. Adnan et al. [8] examined the applicability of four machine learning methods, including ANFIS-PSO, ANFIS-FCM, and MARS, and compared the results to those of EBA4SUB, a physically event-based technique. Anwar et al. [9] used historical meteorological data to build a rain prediction model with a rule-based Machine Learning technique. In their research, the decision tree model had the highest accuracy. Anwar et al. [10] used the Extreme Gradient Boosting model to predict rainfall. Their dataset consisted of weather data obtained by a weather station over the previous seven years. The 8 attributes of their dataset are Minimum temperature, Maximum temperature, Average temperature, Average Humidity, Sun exposure time, Maximum wind speed, Average wind speed, and Rainfall.
Singh et al. [11] proposed a method and built an application using a Raspberry Pi 3 B. This program uses real-time data from humidity, temperature, and pressure sensors to forecast whether it will rain today. Park et al. [12] used different machine learning algorithms for estimating the probability of flooding in South Korea's coastal districts. Dikshit et al. [13] predicted temporal drought trends in New South Wales, a south-eastern state of Australia, using machine learning approaches. Machine learning techniques were also used to predict the Standardized Streamflow Index for hydrological drought in [14]. Klompenburg et al. [15] predicted crop yield using machine learning techniques. Their research differs from others in terms of scale, geographical location, and crop. To forecast drought in Pakistan, researchers [16] employed Support Vector Machine (SVM), Artificial Neural Network (ANN), and k-Nearest Neighbour (KNN). In these studies, the authors worked on a specific region and used few parameters. In our paper, we used data from three different regions. Moreover, we used eight parameters for our dataset.

III. RESEARCH METHODOLOGY
This section goes through the data collection and preparation process, including the procedure and approaches used in this study.

A. Dataset
The dataset is the most significant and crucial aspect of our research study. We collected the data from the Bangladesh government's official website, Bangladesh Jatiyo Tottho Batayon [17], mainly from the Climate database [18] section of the Bangladesh Rice Research Institute section of this site. Our dataset contains climate data recorded on a daily basis at 3 district stations, Gazipur, Rangpur, and Barishal, from 2016 to 2019.

Fig. 1. Dataset creation procedure.

The process of creating the dataset after collecting the raw data files from the website is shown in Figure 1. Of the total 9 attributes, 8 are input attributes, and their details are shown in Table I. There are 2,391 records in the dataset, with 1,660 records indicating no rain and 731 records indicating rain. Of these, 80% of the data were used for model training and 20% for model testing. Because rain falls on relatively few days of the year, our dataset is imbalanced, as seen in Figure 2. On our dataset, we trained and tested five of the most popular machine learning models.

TABLE I. DATASET ATTRIBUTES IN DETAIL
Attribute                     Description                                               Type
Max Temp                      Maximum temperature of a day, in Celsius                  Numeric
Min Temp                      Minimum temperature of a day, in Celsius                  Numeric
Actual Evaporation            Actual evaporation of a day, in millimeters (mm)          Numeric
Relative Humidity (9.00 am)   Relative humidity recorded at 9.00 am, in percent (%)     Numeric
Relative Humidity (2.00 pm)   Relative humidity recorded at 2.00 pm, in percent (%)     Numeric
Sunshine                      Total hours of sunshine in a day (hrs/day)                Numeric
Cloudy                        Total hours of cloud in a day (hrs/day)                   Numeric
Solar Radiation               Total solar radiation in one day (cal/cm²)                Numeric
Rainfall                      0 = No Rain, 1 = Rain                                     Nominal
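
The 80/20 split and the class imbalance described above can be illustrated with a minimal sketch. It assumes a stratified split, which the text does not state, and uses placeholder labels that only reproduce the reported class counts (1,660 no-rain and 731 rain records). The function name `stratified_split` and the seed are hypothetical choices made for illustration.

```python
import random

def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_idx, test_idx) so each class keeps roughly the same share."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)

# Placeholder labels with the reported class counts (assumption: 0 = No Rain, 1 = Rain).
labels = [0] * 1660 + [1] * 731
train_idx, test_idx = stratified_split(labels)

rain_share_all = sum(labels) / len(labels)
rain_share_train = sum(labels[i] for i in train_idx) / len(train_idx)
print(len(train_idx), len(test_idx))
print(round(rain_share_all, 3), round(rain_share_train, 3))
```

Stratifying is one common way to keep the minority "rain" class equally represented in both partitions. A plain random split would also give an 80/20 partition, but the class shares could drift between the two parts.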
[Pie chart of the class distribution: Rain 31%, No Rain 69%]

[Sample records from the dataset]
Max Temp   Min Temp   Actual Evaporation   Relative Humidity (9.00 am)   Relative Humidity (2.00 pm)   Sunshine   Cloudy   Solar Radiation   Rainfall
10.5       3.2        1                    92                            60                            5.3        5.4      243.44            0
16.4       11.6       1.4                  93                            70                            6.1        4.7      272.67            1
37.4       25.8       5                    76                            42                            6.6        5.9      319.06            0
36         27         4.2                  84                            74                            5.7        7.2      307.18            1
35.3       21.8       4                    86                            40                            7.7        4.1      407.63            0

B. Classification algorithms
Decision Tree: The decision tree is a powerful and widely used tool for classification and prediction. A decision tree is a flowchart-like tree structure in which each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) stores a class label.
K Neighbors: The k-nearest neighbors algorithm (k-NN) is a non-parametric method used for both classification and regression. In both cases, the input consists of the k closest training examples in the data set. The output depends on whether k-NN is used for classification or regression: in classification it is the class chosen by a majority vote of the neighbors, and in regression it is the average of the neighbors' values.
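
The k-NN classification rule can be sketched in a few lines of plain Python. This is an illustrative sketch, not the implementation used in the study: the five training rows are the sample records shown in this section, the function name `knn_predict` is hypothetical, and Euclidean distance on unscaled features is an assumption.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by a majority vote among its k nearest training rows."""
    distances = sorted(
        (math.dist(row, query), label) for row, label in zip(train_x, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# The five sample records from this section: eight input features each.
train_x = [
    [10.5, 3.2, 1.0, 92, 60, 5.3, 5.4, 243.44],
    [16.4, 11.6, 1.4, 93, 70, 6.1, 4.7, 272.67],
    [37.4, 25.8, 5.0, 76, 42, 6.6, 5.9, 319.06],
    [36.0, 27.0, 4.2, 84, 74, 5.7, 7.2, 307.18],
    [35.3, 21.8, 4.0, 86, 40, 7.7, 4.1, 407.63],
]
train_y = [0, 1, 0, 1, 0]  # 0 = No Rain, 1 = Rain

# With k=1, a training row is its own nearest neighbor.
print(knn_predict(train_x, train_y, train_x[1], k=1))
```

Because the features are on very different scales, the solar radiation column dominates the Euclidean distance here. In practice the features are usually standardized before k-NN is applied.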
Logistic Regression: Logistic regression is a statistical
model that uses a logistic function to represent a binary
dependent variable in its most basic form, though there
are many more advanced variants. Logistic regression
(or logit regression) is a method of estimating the
parameters of a logistic model in regression analysis.
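
As a minimal sketch of how a fitted logistic model turns features into a rain probability, the code below applies the logistic function to a weighted sum of inputs. The weights, bias, and the choice of two features (morning relative humidity and sunshine hours) are hypothetical values chosen for illustration. They are not parameters estimated in this study.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(weights, bias, features):
    """Probability of the positive class (Rain = 1) under a logistic model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical parameters: humidity raises the probability, sunshine lowers it.
weights = [0.1, -0.3]
bias = -5.0

p_rain = predict_proba(weights, bias, [92, 5.3])  # humid morning, moderate sunshine
label = 1 if p_rain >= 0.5 else 0                 # 0.5 threshold is an assumption
print(round(p_rain, 3), label)
```

Estimating the weights and bias from training data, typically by maximum likelihood, is the step the text refers to as logistic regression.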
Multinomial NB: The multinomial Naive Bayes classifier is effective for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts, although fractional counts, such as TF-IDF, may also work in practice. The smoothing parameter alpha is a floating-point value with a default of 1.0. The probability formula of Naive Bayes is: