Rainfall Prediction System: (Peer-Reviewed, Open Access, Fully Refereed International Journal)
ABSTRACT
Global warming now affects the entire world, with a major impact on humankind and an accelerating effect on climate
change. As a result, the atmosphere and oceans are warming, sea levels are rising, and floods and droughts are
occurring. One of the major consequences is uneven rainfall (precipitation). Precipitation forecasting is a difficult
task that is receiving attention from most major authorities worldwide. Precipitation is a climatic factor that affects
many human activities, such as agricultural production, construction, energy production, and tourism. Accurate
rainfall forecasting is therefore of great importance. There are many ways to predict rainfall; the approach chosen
for this project is to use historical rainfall data collected over 10 years of measurements and to forecast rainfall
for the next day. The project seeks to find suitable machine learning models for predicting rainfall and to optimize
their results. We also compare the machine learning algorithms used: logistic regression, support vector
classification, random forest, CatBoost, and XGBoost. Each algorithm is evaluated individually using standard machine
learning metrics such as accuracy.
Keywords: accuracy, forecasting, machine learning algorithms, rainfall.
I. INTRODUCTION
Rainfall forecasting is a challenging and uncertain task that has a major impact on human society. Timely and accurate
forecasts help to reduce human and financial losses proactively. Heavy-rain forecasting is a major concern for
meteorological bureaus because it is closely tied to the economy and to people's lives. Irregular rainfall is a cause
of natural disasters such as floods and droughts that people all around the world face each year. As global warming
progresses, rainfall detection and prediction have become a major issue in countries where appropriate technology is
not available; done correctly, prediction can serve a variety of purposes such as agriculture, health, and drinking
water. The accuracy of precipitation forecasts is especially important where the economy depends heavily on
agriculture. Due to the dynamic nature of the atmosphere, statistical methods do not provide high accuracy in
predicting precipitation. The non-linearity of precipitation data makes machine learning and AI techniques better
suited to the task. Predictions help people take precautions, so they must be accurate. The purpose of this project is
to give non-experts easy access to the techniques and approaches used in precipitation prediction, to provide a
comparative study of various machine learning techniques, to provide early precipitation prediction, and to identify
the best-performing machine learning algorithm. The objective of the system is to understand and analyze the rainfall
data collected for all the states, to derive conclusive information and statistics about the rainfall pattern across
the country, and to design an interactive, user-friendly web interface that helps users understand the results
obtained from the input data they provide.
II. LITERATURE SURVEY
A. Kala and Dr. S. Ganesh Vaidyanathan [1] used an Artificial Neural Network to implement their idea. Their procedure
for rainfall prediction consists of gathering the weather data, preprocessing it, building a Feed Forward Neural
Network (FFNN) model with training data, validating it with testing data, and finally evaluating the model by
comparing desired and actual output. They reported an accuracy of 0.935 using the Artificial Neural Network. R. Kingsy
Grace and B. Suganya [2] implemented their model using machine learning. They compared various models such as Deep
Convolutional Neural Network, Genetic Programming, ANN, Linear
www.irjmets.com @International Research Journal of Modernization in Engineering, Technology and Science
[1]
e-ISSN: 2582-5208
International Research Journal of Modernization in Engineering Technology and
Science
( Peer-Reviewed, Open Access, Fully Refereed International Journal )
Volume:04/Issue:03/March-2022 Impact Factor- 6.752 www.irjmets.com
Regression, Hybrid Neural Network, Likelihood, LSTM, and ConvNet. Multiple Linear Regression achieved 99% accuracy.
Hiyam Abobaker Yousif Ahmed and Sondos W. A. Mohamed [3] implemented their model using linear regression. Their work
is organized in two major parts: first, data collection and selection, and second, data cleaning and transformation.
Their Multiple Linear Regression model gave 85% accuracy. CMAK Zeelan Basha, Nagulla Bhavana, Ponduru Bhavya, and
Sowmya V. [4] implemented a model on the same topic using machine learning and deep learning techniques. They used
Auto-Regressive Integrated Moving Average (ARIMA), Artificial Neural Network, Support Vector Machine, and Self
Organizing Map. Their results indicate that, in terms of MSE and RMSE, their architecture outperforms other
approaches. B. Vasantha, R. Tamilkodi, and L. Venkateswara Kiran [5] forecast rainfall using real-time global climate
parameters. They mainly used convolutional neural networks to predict weather parameters, which in turn yield
meaningful patterns for understanding the forecast. They expect the precision of the results to improve by 70%. Anjali
Samad, Bhagyanidhi, Vaibhav Gautam, Piyush Jain, Sangeeta, and Kanishka Sarkar [6] performed precipitation forecasting
using a Long Short Term Memory neural network. Their model is based on an Australian dataset and uses seasonal
decomposition methods. They claim that the model forecasts accurately on the test cases, although the error rate
increased somewhat when handling outliers. Rose Ellen N. Macabiog and Jennifer C. Dela Cruz [7] built a rainfall
prediction model using classification. Their steps were data collection, data preprocessing, building the predictive
model, model evaluation and selection, and finally testing and evaluation. They observed that the most accurate
results are obtained when all 5 attributes are used in the model. Coarse KNN, Fine Gaussian SVM, and Neural Network
gave the best accuracy of 0.811. Arief Bramanto Wicaksono Putra, Rheo Malani, Bedi Suprapty, and Achamad Fanany
Onnilita Gaffar [8] implemented a Deep Auto Encoder using Semi CNN. Their method includes an Auto-Encoder Neural
Network (AENN), a Convolution Neural Network (CNN), the proposed method, and an autoregressive (AR) time-series data
model. This is one of the more recent approaches to rainfall prediction, with a reported performance of 99%. Yuana
Ratna Sari, Esmeralda Contessa Djamal, and Fikri Nugraha [9] performed precipitation forecasting using a 1-D CNN. Their
pipeline consists of preprocessing followed by a convolutional neural network (CNN). They observed an accuracy of
0.9463 on training data and 0.8146 on testing data. They also concluded that the more layers the model uses, the
higher the accuracy. Eslam Hussein, Mehrdad Ghaziasgar, and Christopher Thron [10] applied Support Vector Machine
classification. Their methodology consists of data gathering, preprocessing, classification, comparison between
different SVM inputs, and comparison between regional predictions. Nikhil Oswal [11] used machine learning techniques
to implement his model. His methodology includes data exploration and analysis, data preprocessing, modelling, and
evaluation. He concluded that Australian rainfall is uncertain and that there is no particular relation between time
and rainfall. Still, he was able to find certain patterns and develop a high-performance model.
III. PROPOSED SYSTEM
The proposed solution is to predict precipitation through short-term forecasts by designing and developing a
precipitation forecasting system with a web interface (GUI). The system predicts precipitation a few days in advance at
a particular location. Accurate forecasts help in developing better strategies for agriculture and water storage, and
give warning of floods so that precautionary measures can be implemented. The prediction system is implemented by
comparing various machine learning algorithms such as SVR, ANN, and multiple regression. The data is analyzed and
visualized using histograms, graphs, etc. to derive meaningful information from the patterns in the precipitation
data. The results of the ML algorithms are compared on accuracy. The aim of the proposed study is to predict rainfall
effectively and efficiently, with maximum accuracy and precision.
IV. METHODOLOGY
The first and most important step in the process is data collection. We collected our data from the official website
of the Indian Government, covering the years 1951 to 2015 for all districts and all months. The next step is converting
the data into the correct format for experiments, i.e., preprocessing the dataset: replacing missing and null values
with the mean of the column, and detecting and removing outliers, which are visualized using boxplots. The next step
is analyzing the data and observing variations in rainfall patterns to derive conclusions and to determine the
correlation between the different parameters. After that, we visualize the analyzed data using bar graphs, heat maps,
histograms, etc. for an easier understanding of the information obtained. We then try to predict the average rainfall
by separating the data into training and testing datasets. We apply various statistical and machine
Figure 1: Methodology
Figure 2 shows the basic data flow of the entire methodology and each of the individual components of our approach.
The components are data collection, data cleaning and analysis, data visualization, splitting into training and
testing sets, implementing all the mentioned ML models, and entering the state, month, and other values through a web
interface to predict the rainfall.
Data pre-processing:
After loading the dataset, we checked whether it contained null values. As with most datasets, null values were
present: 21 out of 23 attributes had at least some null values. Checking the percentage of null values showed that 4
attributes had no data in more than 35% of rows. Replacing that much data with a single value, such as the mean or
median, shrinks the variance, since every filled entry sits at the centre of the distribution, and it also alters
visualizations considerably. For those 4 attributes, we therefore defined a function that fills each null or NaN with
a randomly chosen value. For the remaining columns with continuous numerical values, we replaced null values with the
median of the column. Attributes with categorical values then had their null values replaced with distinct dummy
values. At this point there were zero null values in the entire dataset.
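The two numeric imputation strategies described above can be sketched in a few lines. This is a minimal illustration
using only the Python standard library, not the project's actual function: the helper names, the use of None for
missing entries, and the choice to draw random fills from the observed values of the same column are all assumptions
made for the example. Only the 35% threshold and the median fill come from the text.

```python
import random
import statistics


def null_fraction(values):
    """Share of entries in the column that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def fill_with_median(values):
    """Replace missing entries with the median of the observed entries."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def fill_with_random_sample(values, rng):
    """Replace missing entries with values drawn at random from the observed
    entries of the same column (an assumed reading of 'randomly fit a value')."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]


def impute_column(values, rng, threshold=0.35):
    """Random fill for heavily missing columns, median fill otherwise."""
    if null_fraction(values) > threshold:
        return fill_with_random_sample(values, rng)
    return fill_with_median(values)


rng = random.Random(0)                        # fixed seed for repeatability
sparse = [7.0, None, None, 9.0, None, 8.0]    # 50% missing: random fill
dense = [12.0, 15.0, None, 11.0, 14.0]        # 20% missing: median fill
sparse_filled = impute_column(sparse, rng)
dense_filled = impute_column(dense, rng)
```

Drawing fills from the observed values keeps the spread of a heavily missing column closer to its original spread,
whereas a single repeated median would pile many rows onto one point.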
Our next aim was to convert categorical values into numerical values to be used later for prediction. Categorical
values such as yes and no are replaced with one and zero respectively; attributes such as location and wind direction
are assigned distinct representative numbers. After making the dataset completely numerical, we checked the attributes
for outliers using boxplots. Outliers can provide useful insight into the data under investigation, and they can also
distort statistical results; examining them helps to find inconsistencies and errors in the data. To handle outliers,
we first define a range using the interquartile range: a value inside the range is not considered an outlier, while a
value outside it is changed so that it does not disturb the prediction. Attribute values below the lower bound of the
range are replaced with the lower bound, and values above the upper bound are replaced with the upper bound. After
preprocessing, we also drew further graphs and plots alongside the boxplots, such as histograms, distplots,
countplots, and subplots, for visualization.
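The encoding and outlier-capping steps above can be illustrated with a short standard-library sketch. The function
names and sample values are hypothetical, and the multiplier k = 1.5 on the interquartile range is the conventional
boxplot choice, assumed here because the text does not state the factor that was used.

```python
import statistics


def encode_yes_no(values):
    """Map 'Yes' to 1 and 'No' to 0."""
    mapping = {"Yes": 1, "No": 0}
    return [mapping[v] for v in values]


def encode_categories(values):
    """Assign each distinct category an integer code (alphabetical order)."""
    codes = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes


def iqr_bounds(values, k=1.5):
    """Lower and upper limits of the accepted range: Q1 - k*IQR, Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap_outliers(values, k=1.5):
    """Replace values outside the accepted range with the nearest bound."""
    low, high = iqr_bounds(values, k)
    return [min(max(v, low), high) for v in values]


rain_today = encode_yes_no(["Yes", "No", "No"])
wind_dir, wind_codes = encode_categories(["NW", "SE", "NW", "E"])
readings = [1, 2, 3, 4, 5, 6, 7, 8, 100]   # hypothetical column with one outlier
capped = cap_outliers(readings)            # 100 is pulled down to the upper bound
```

Capping keeps every row in the dataset, which matters when rows are scarce; dropping outlier rows instead would be a
different design choice.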
Model validation:
Validation involves evaluating a trained model on a test set of similar data. The testing set is different from the
training set on which the model was trained. Model validation follows model training and aims to find the model with
the best performance. The testing dataset is used principally to check the generalization ability of a trained model.
The first thing we did was a train-test split, because a model cannot be evaluated fairly on the same data it was
trained on; it needs to be evaluated with fresh data. We used an 80-20 split of the total dataset: 80 percent for
training and the remaining 20 percent for testing.
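As a rough illustration of the 80-20 split and of accuracy as the evaluation metric, the sketch below uses only the
standard library. The helper names, the toy rows, the fixed seed, and the majority-class baseline are assumptions made
for the example; they are not the project's models or data.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical (feature, rained) pairs standing in for the real dataset.
rows = [(i, i % 3 == 0) for i in range(100)]
train, test = split_train_test(rows)

# Baseline: always predict the most frequent class seen in the training set.
train_labels = [label for _, label in train]
majority = max(set(train_labels), key=train_labels.count)
test_labels = [label for _, label in test]
baseline_accuracy = accuracy(test_labels, [majority] * len(test_labels))
```

A baseline like this gives a floor against which the accuracy of any trained classifier can be judged, which is
especially useful when the classes are imbalanced.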
Another problem we consider here is handling imbalanced data. An imbalanced dataset is one where one class has a very
high number of observations while the other has very few. A standard approach to dealing with an imbalanced dataset is
to resample the data. There are two kinds of resampling: oversampling and undersampling. Oversampling is more commonly
used than undersampling, because undersampling tends to remove instances that may contain valuable information. The
Synthetic Minority Oversampling Technique (SMOTE) generates synthetic samples for the minority class, thereby reducing
the overfitting problem posed by random oversampling. In this method, new instances are created in feature space by
interpolating between minority-class (positive) instances that lie close together. We therefore applied this technique
to obtain resampled data. Then we applied this data to seven different
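The interpolation idea behind SMOTE can be sketched as follows. This is a simplified standard-library illustration, not
the implementation used in the project: the function name, the neighbour count k, the seed, and the sample points are
assumptions, and a production library version handles many details omitted here.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points, each lying on the line segment between a
    randomly chosen minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        # k nearest neighbours of the base point by Euclidean distance
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority-class samples in a two-dimensional feature space.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.5, 2.5)]
new_points = smote_sketch(minority, n_new=6, k=2)
```

Because each synthetic point is a blend of two real minority points, it stays inside the region the minority class
already occupies rather than duplicating an existing row exactly.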
Feature selection:
Because the crucial features are determined during this step of data cleaning, feature selection is vital to the
process. By selecting the relevant features, we not only get rid of the unimportant ones but also improve the
performance of our model. We use the mutual_info_classif method in our project. This method selects features based on
mutual information (entropy) gain and is applicable to classification problems. In our case, filtering features this
way made the model more accurate. It is a univariate filter: each feature is scored separately rather than in groups.
As a consequence, the individually top-ranked variables may perform poorly when used together, which can result in the
selection of suboptimal features. Nevertheless, univariate filtering methods are relatively quick and can be used for
screening, which leads to better performance and less training time. We therefore computed the mutual information
value of every attribute, arranged the values in decreasing order (the higher the value, the greater the dependency),
and selected the top 10 attributes, which were well suited for prediction. Using these 10 attributes and the resampled
data, we used the CatBoost Classifier for the prediction model. Figure 3 shows the values from which we obtained the
best 10 attributes.
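The ranking step can be illustrated with a small standard-library sketch. It uses the simple plug-in estimate of
mutual information for discrete values, whereas scikit-learn's mutual_info_classif uses a nearest-neighbour estimator
for continuous features, so the scores will not match that library exactly. The feature names and data below are
hypothetical.

```python
import math
from collections import Counter


def mutual_information(feature, labels):
    """Plug-in estimate of I(X; Y) in nats for two discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    px = Counter(feature)
    py = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) )
        mi += (count / n) * math.log(count * n / (px[x] * py[y]))
    return mi


def top_k_features(columns, labels, k=10):
    """Score each column separately and return the k highest-scoring names."""
    scores = {name: mutual_information(col, labels) for name, col in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


rain_tomorrow = [1, 1, 0, 0, 1, 0, 1, 0]
columns = {
    "humidity_high": [1, 1, 0, 0, 1, 0, 1, 0],   # identical to the label
    "cloudy":        [1, 1, 0, 1, 1, 0, 1, 0],   # mostly agrees with the label
    "noise":         [0, 1, 0, 1, 1, 0, 0, 1],   # independent of the label
}
selected, scores = top_k_features(columns, rain_tomorrow, k=2)
```

Each column is scored on its own, which is what makes the filter univariate: two selected features may carry the same
information, a limitation the text above already notes.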
Interface building:
A web interface has been built to help users understand the importance of building prediction models. To create the
web interface, we used Flask, a Python web framework that makes it easy to create web applications. One of the
important parts of this interface is the analysis of the dataset, which we did using Power BI. Power BI
is a business intelligence tool that allows users to visualize and analyze raw data and present it in the form of
actionable insights. We used line charts, heatmaps, bar graphs, a decomposition tree, area charts, and radial charts to
analyze our data. We also used slicers for periodic or time-oriented analysis.
Figures 4 and 5 show our Power BI analysis dashboard, on which all the charts and plots can be seen. There are two
dedicated predictor pages for making predictions from the user's data. On one page, prediction uses the limited set of
attributes obtained through feature selection. On the other, the user can enter data for all 22 attributes.
REFERENCES
[1] Kala, A., & Vaidyanathan, S. G. (2018, July). Prediction of rainfall using artificial neural network. 2018
International Conference on Inventive Research in Computing Applications (ICIRCA).
[2] Grace, R. K., & Suganya, B. (2020, March). Machine Learning based Rainfall Prediction. 2020 6th International
Conference on Advanced Computing and Communication Systems (ICACCS).
[3] Ahmed, H. A. Y., & Mohamed, S. W. A. (2021, February 26). Rainfall Prediction using Multiple Linear Regressions
Model. 2020 International Conference on Computer, Control, Electrical, and Electronics Engineering (ICCCEEE).
[4] Basha, C. Z., Bhavana, N., Bhavya, P., & V, S. (2020, July). Rainfall Prediction using Machine Learning & Deep
Learning Techniques. 2020 International Conference on Electronics and Sustainable Communication Systems
(ICESC).
[5] Vasantha, B., Tamilkodi, R., & Kiran, L. V. (2019, March). Rainfall pattern prediction using real time global climate
parameters through machine learning. 2019 International Conference on Vision Towards Emerging Trends in
Communication and Networking (ViTECoN).
[6] Samad, A., Bhagyanidhi, Gautam, V., Jain, P., Sangeeta, & Sarkar, K. (2020, October 30). An approach for rainfall
prediction using long short term memory neural network. 2020 IEEE 5th International Conference on Computing
Communication and Automation (ICCCA).
[7] Macabiog, R. E. N., & Dela Cruz, J. C. (2019, November). Rainfall Predictive Approach for La Trinidad, Benguet
using Machine Learning Classification. 2019 IEEE 11th International Conference on Humanoid, Nanotechnology,
Information Technology, Communication and Control, Environment, and Management (HNICEM).
[8] Wicaksono Putra, A. B., Malani, R., Suprapty, B., & Onnilita Gaffar, A. F. (2020, July). A deep auto encoder semi
convolution neural network for yearly rainfall prediction. 2020 International Seminar on Intelligent Technology
and Its Applications (ISITIA).
[9] Sari, Y. R., Djamal, E. C., & Nugraha, F. (2020, September 15). Daily rainfall prediction using one dimensional
convolutional neural networks. 2020 3rd International Conference on Computer and Informatics Engineering
(IC2IE).
[10] Hussein, E., Ghaziasgar, M., & Thron, C. (2020, July). Regional rainfall prediction using support vector machine
classification of large-scale precipitation maps. 2020 IEEE 23rd International Conference on Information Fusion
(FUSION).
[11] Oswal, N. (2021). Predicting Rainfall using Machine Learning Techniques. Institute of Electrical and Electronics
Engineers (IEEE).