Integrating Temporal and Meteorological Metrics for Rainfall Prediction Using Machine Learning Models (2)

This study presents a machine-learning framework utilizing the Random Forest Classifier (RFC) to predict rainfall based on meteorological variables such as temperature, humidity, atmospheric pressure, wind speed, and sunshine duration, achieving an accuracy of 93.2%. The research emphasizes the importance of robust preprocessing methods and feature engineering to enhance dataset quality and capture temporal patterns. Future work aims to explore advanced techniques like LSTM networks and expand the dataset for improved generalization and reliability in rainfall predictions.

Integrating Temporal and Meteorological Metrics for Rainfall Prediction Using Random Forest ML Model

Varsha S, Aman Kumar C, H. Ramya Shree, Payaswini
B.E CSDS, Atria Institute of Technology, Bengaluru, India

Deepak N R, Kavitha Vasanth
Professor, Dept. of Information Science and Engineering, Atria Institute of Technology, VTU, Bengaluru, Karnataka, India

Abstract— Accurate rainfall forecasting is essential for applications such as agriculture, disaster management, and water resource planning. However, the complex, non-linear, and dynamic nature of rainfall makes it particularly challenging to predict. This study presents a machine-learning-based solution that uses variables such as temperature, humidity, atmospheric pressure, and sunshine duration. The Random Forest Classifier (RFC) serves as the core predictive model, achieving an accuracy of 93.2%, with precision, recall, and F1-score of 91.5%, 89.8%, and 90.6%, respectively. Robust preprocessing methods, including handling missing data and feature scaling, were employed to enhance dataset quality. Additionally, incorporating month-wise and daily feature variations captured essential temporal patterns. Feature importance analysis identified humidity and atmospheric pressure as key determinants, contributing to over 65% of the model's decisions. These results demonstrate the feasibility of using machine learning to reveal patterns in meteorological data that are arguably actionable for climate-sensitive sectors. Future work will focus on deep learning models that capture temporal dependencies, especially LSTM networks, and on extending the dataset to additional geographic regions for better generalization and reliability of the model.

I. Introduction

Rainfall prediction is a cornerstone of meteorological research due to its critical applications in agriculture, water resource management, and disaster mitigation. However, it is inherently challenging, given the non-linear and multifaceted interactions among climatic variables such as temperature, humidity, atmospheric pressure, and sunshine duration. These complexities limit the effectiveness of traditional statistical approaches, emphasizing the need for advanced computational techniques to identify patterns in large, noisy datasets. This study introduces a comprehensive machine-learning framework designed to predict rainfall based on key meteorological factors, including temperature, humidity, atmospheric pressure, wind speed, and sunshine duration. The Random Forest Classifier (RFC) is selected as the primary model for its ability to manage high-dimensional, non-linear datasets and its capacity to evaluate feature importance effectively. By incorporating seasonality features and employing rigorous preprocessing, we hypothesize that this model will significantly enhance prediction accuracy and generalization.

The dataset utilized in this study comprises daily weather observations collected over multiple years from publicly available repositories. Preprocessing steps, such as data cleaning, handling missing entries, normalizing numerical variables, and engineering temporal features, ensure the dataset is optimized for training.

This machine-learning-based approach aims to provide a robust solution for rainfall prediction, facilitating better-informed decisions in agriculture, disaster preparedness, and water resource management.

II. Methodology

The methodology adopted in this study comprises several critical stages: data acquisition, preprocessing, exploratory data analysis (EDA), model selection, implementation, and evaluation.

fig.1. Distribution of Rainfall

a) Data Acquisition

The dataset for this study was sourced from publicly available meteorological repositories, such as Kaggle and other trusted platforms. Key features relevant for rainfall prediction (temperature, humidity, atmospheric pressure, wind speed, and sunshine duration) were included. The dataset contains around 10,000 records, spanning multiple years of daily weather observations.

b) Data Preprocessing

To ensure data usability and integrity, the following preprocessing steps were applied:

i) Handling Missing Values: Missing data points were addressed by imputing median values, reducing the risk of skewed predictions.[12]

ii) Feature Scaling: Numerical variables were normalized using Min-Max scaling to ensure that all values fall within the 0–1 range, facilitating better model convergence.

iii) Categorical Encoding: Non-numerical variables were converted into numerical representations using label encoding to handle categorical weather classifications.

iv) Feature Engineering: New features, such as month and adjusted day numbers, were generated to capture seasonal variations in rainfall patterns.[13]

c) Exploratory Data Analysis (EDA)

EDA was performed to draw insights and understand relationships present in the dataset. Analysis included:

i) Correlation Heatmaps: High humidity and low atmospheric pressure were the conditions most strongly associated with rain.

fig.2. Heatmap

ii) Histograms and Box Plots: These show the distribution of features such as temperature and sunshine.

iii) Time Series Plots: Seasonal behaviours of rainfall are seen at various time scales.

fig.3. Distribution of Columns (Time Series)
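The preprocessing steps listed under b) can be sketched in plain Python to make the arithmetic explicit. This is a minimal illustration, not the authors' code: the helper names and sample values are hypothetical, and a real pipeline would more likely rely on library equivalents such as scikit-learn's SimpleImputer, MinMaxScaler, and LabelEncoder.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to the 0-1 range."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def label_encode(labels):
    """Map each distinct label to an integer, in sorted label order."""
    mapping = {label: idx for idx, label in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping


# Hypothetical sample values, not taken from the study's dataset.
humidity = [60.0, None, 80.0, 100.0]
sky = ["cloudy", "clear", "cloudy", "rain"]

humidity_filled = impute_median(humidity)         # median of 60, 80, 100 is 80
humidity_scaled = min_max_scale(humidity_filled)  # [0.0, 0.5, 0.5, 1.0]
sky_codes, sky_mapping = label_encode(sky)        # clear=0, cloudy=1, rain=2
```

When a held-out test set is used, the median and the min/max bounds would normally be computed on the training portion only and then reused on the test portion, so that no information leaks from test data into preprocessing.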


d) Model Selection

Multiple machine learning models were evaluated to identify the most suitable for rainfall prediction:

i) Random Forest Classifier (RFC): This was the principal model, chosen because it handles complicated, non-linear relationships well.[11]

ii) Support Vector Machines: These were tried for their efficiency in high-dimensional spaces.

iii) Logistic Regression: This was used as a baseline classifier, since the task is a two-class classification problem.

The Random Forest Classifier was ultimately chosen since it showed the best balance between accuracy and interpretability during pilot runs.[14]

e) Model Run and Evaluation

The dataset was split 80% for training and 20% for testing, so that overfitting can be detected on held-out data. During training, GridSearchCV was used for hyperparameter tuning over parameters such as the number of estimators and the tree depth. The resulting test metrics were:

● Accuracy: 93.2%
● Precision: 91.5%
● Recall: 89.8%
● F1-Score: 90.6%

f) Feature Importance

Feature importance analysis was run to check which predictors were most influential. Humidity and atmospheric pressure together accounted for more than 65% of the total feature importance.
Classification performance was further analyzed using confusion matrices and ROC curves to indicate areas of improvement and model reliability.

III. Literature Survey

[1] Bekele, A. This study employs machine learning algorithms, including Random Forest and XGBoost, to predict daily rainfall intensity based on meteorological data. The dataset spans 20 years and includes features such as temperature, humidity, and wind speed. The research highlights the effectiveness of XGBoost in handling non-linear relationships and achieving higher accuracy compared to traditional regression models. This work is significant for its comprehensive dataset and the evaluation of multiple algorithms.

[2] Mosavi, A., and Toth, B. The authors propose a fusion-based rainfall prediction model that integrates Decision Trees, K-Nearest Neighbors, and Support Vector Machines using fuzzy logic. By leveraging 12 years of historical weather data, the model outperformed individual classifiers, achieving robust accuracy for urban weather predictions. This study emphasizes the advantages of ensemble techniques for improving prediction reliability in smart city contexts.

[3] Adaryani, M. Adaryani introduces an ensemble approach combining XGBoost with other methods for rainfall prediction. The research demonstrates the ability of XGBoost to outperform traditional models in terms of both speed and accuracy. Feature selection using Pearson correlation coefficients ensures the most relevant predictors are utilized, making the study a strong reference for advanced ensemble methods.

[4] Rahman, K., and Singh, P. This paper explores the use of Long Short-Term Memory (LSTM) networks for rainfall prediction, emphasizing their ability to capture sequential dependencies in time-series data. By preprocessing historical weather records and focusing on features such as humidity and temperature, the study achieves a higher prediction accuracy compared to baseline machine learning models.

[5] Balamurugan, G., and Manojkumar, S. This comparative study evaluates the performance of machine learning and statistical methods for rainfall prediction in complex terrains. Random Forest and Gradient Boosting methods significantly outperformed traditional statistical approaches, with accuracy improvements of over 15%. The study underscores the versatility of machine learning models for diverse geographical conditions.

[6] Nguyen, T., Kim, Y., and Lee, H. Nguyen et al. combine Neural Networks with fuzzy logic to develop a hybrid model for rainfall forecasting. The study focuses on integrating non-linear patterns from meteorological data and demonstrates the model's effectiveness in short-term predictions. This research highlights the potential of hybrid methods to achieve high prediction accuracy.

[7] Wu, J., Zhang, L., and Chen, Q. This study integrates K-means clustering with machine learning classifiers for rainfall prediction. Clustering is used to group data into meaningful segments, which are then fed into classifiers like Random Forest and SVM. The approach improves prediction accuracy by addressing data variability across regions and seasons.

[8] Chen, J., and Wei, X. Chen and Wei propose a decision tree and neural network hybrid model for precipitation prediction. This study focuses on feature engineering to enhance model performance, achieving significant accuracy improvements in predicting heavy rainfall events.

[9] Singh, A., and Das, R. This research compares the performance of Support Vector Machines (SVM) and Artificial Neural Networks (ANN) in predicting rainfall over complex terrains. The study concludes that SVM performs better for binary classification tasks, while ANN is more effective for regression problems involving rainfall amounts.

[10] Wu, Y., and Zhao, L. The authors explore the application of deep learning models, including Convolutional Neural Networks (CNNs) and LSTMs, for short-term rainfall forecasting. Their findings highlight the superior performance of LSTMs in handling sequential data, with applications for both urban and rural weather forecasting scenarios.

IV. Proposed Work

This research introduces a machine-learning framework designed to predict rainfall using essential meteorological variables, including temperature, humidity, atmospheric pressure, wind speed, and sunshine duration. The Random Forest Classifier (RFC) serves as the primary predictive model, chosen for its efficiency in managing high-dimensional, non-linear data and its ability to identify significant features. By incorporating seasonal characteristics and advanced preprocessing, this study aims to improve both the accuracy and generalizability of the rainfall prediction model.

1. Data Collection and Preprocessing

The dataset, acquired from repositories like Kaggle, comprises daily meteorological records collected over several years. To ensure data readiness and quality before model training, the following preprocessing steps are undertaken:

● i) Handling Missing Data: Gaps in the dataset are addressed through median imputation, reducing potential bias caused by missing values.
● ii) Normalization: Numerical variables such as temperature and humidity are scaled to a uniform range using Min-Max normalization. This step minimizes the influence of feature magnitudes on model training.
● iii) Feature Engineering: Temporal patterns are captured by creating features such as adjusted day and month indicators. These additions enable the model to account for seasonal and long-term trends in rainfall behavior.

2. Model Selection and Training

The primary algorithm for this study is the Random Forest Classifier, an ensemble method that aggregates multiple decision trees for prediction. Its capability to handle complex, non-linear relationships in meteorological data, combined with its resilience to overfitting, makes it an ideal choice. The model is fine-tuned using GridSearchCV to optimize hyperparameters like the number of estimators and maximum tree depth. Alternative algorithms, such as Support Vector Machines (SVM) and Logistic Regression, were considered but found to be less effective for large datasets with intricate patterns.[15]

3. Evaluation Metrics

Model performance is assessed using the following metrics:

● i) Accuracy: Represents the proportion of correct predictions made by the model.
● ii) Precision, Recall, and F1-Score: Evaluate the model’s ability to handle imbalanced datasets while ensuring reliability.
● iii) Confusion Matrix: Provides insights into true positives, false positives, true negatives, and false negatives, highlighting areas for improvement.

4. Feature Importance Analysis

A key component of this study is the analysis of feature importance using the Random Forest algorithm. This analysis identifies the most influential variables affecting rainfall predictions. Preliminary results suggest that humidity and atmospheric pressure are the most significant predictors, consistent with findings from previous studies. This insight strengthens the understanding of the relationships between atmospheric factors and rainfall, aiding decision-making in related applications.[16]

5. Future Directions

Although the current model achieves high accuracy (above 93%), future efforts will focus on exploring advanced techniques, such as Long Short-Term Memory (LSTM) networks, to better capture sequential dependencies in meteorological data. Expanding the dataset to include observations from varied geographic regions will further enhance the model's applicability. Additionally, ensemble methods that integrate multiple models may be explored to achieve even greater predictive performance.
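The training and evaluation procedure described above (an 80/20 split, GridSearchCV over the number of estimators and tree depth, then accuracy, precision, recall, F1, a confusion matrix, and feature importances) could look roughly like the following scikit-learn sketch. It runs on synthetic data from `make_classification`, since the study's dataset is not reproduced here. The grid values, sample size, and random seeds are illustrative assumptions, not the settings used in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the weather table (assumption): 5 numeric features
# playing the role of temperature, humidity, pressure, wind speed, sunshine.
X, y = make_classification(n_samples=600, n_features=5, n_informative=3,
                           n_redundant=1, random_state=0)

# 80/20 train/test split, stratified so both sets keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Small illustrative grid over the two hyperparameters the text names.
param_grid = {"n_estimators": [50, 100], "max_depth": [5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

best_model = search.best_estimator_
y_pred = best_model.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
cm = confusion_matrix(y_test, y_pred)  # rows: actual, columns: predicted

# Impurity-based importances sum to 1, so a sum over selected features
# is their share of the total importance.
importances = best_model.feature_importances_
top_two_share = float(sum(sorted(importances, reverse=True)[:2]))
```

With real data, `top_two_share` would be replaced by the sum of the importances for the humidity and pressure columns specifically. Because the hyperparameters are chosen by cross-validation on the training split only, the test-set metrics remain an estimate of performance on unseen data.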
fig.4. Dataset Workflow

V. Dataset Overview

a) Description

The dataset utilized in this research for rainfall prediction was obtained from publicly accessible meteorological sources, including platforms like Kaggle. It includes daily weather observations collected over multiple years and provides essential information about atmospheric conditions. The main features of the dataset are as follows:

● Date: The recorded timestamp of each weather observation.
● Temperature: Average daily temperature measured in Celsius.
● Humidity: Atmospheric humidity expressed as a percentage.
● Pressure: Atmospheric pressure recorded in hPa.
● Wind Speed: Average daily wind speed in kilometers per hour.
● Sunshine Duration: Total hours of sunshine observed in a day.
● Rainfall Occurrence: A binary indicator where 1 signifies rainfall and 0 denotes no rainfall.

Comprising over 10,000 entries, this dataset spans several years of weather data, making it highly suitable for developing a reliable prediction model. Its extensive nature allows for meaningful analysis of weather patterns and the relationships between various meteorological features and rainfall.

fig.5. Parameters Considered

b) Data Preprocessing

Before beginning the analysis and model training, the dataset undergoes several preprocessing steps to ensure its quality and readiness for analysis:

1. Data Cleaning:
The dataset is inspected for duplicates or erroneous entries, which are removed to maintain data integrity.
2. Handling Missing Values:
Missing values in critical variables such as temperature, humidity, and pressure are imputed using median values. This approach minimizes potential biases while preserving the dataset's integrity.
3. Data Type Conversion:
All numerical columns, such as temperature, humidity, and wind speed, are checked to be in the correct format for mathematical operations and model training. For example, date-related columns are converted into a date-time format, and categorical variables are encoded as numerical values where necessary.
4. Feature Engineering:
New features are created to help capture temporal and seasonal patterns, such as:

a) Month and Day Features: Extracted from the date to represent seasonal trends in rainfall, allowing the model to account for month-wise and day-wise variations in weather patterns.

b) Adjusted Day: A custom feature created by calculating the cumulative day of the year to better represent seasonality.[17]

These preprocessing steps ensure that the data is clean, formatted, and structured correctly for effective analysis and model training.
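The data cleaning, date conversion, and temporal feature steps can be illustrated with the standard library alone. The sketch below assumes ISO-formatted date strings and reads "adjusted day" as the day of the year, following the description of it as the cumulative day of the year. The record layout and helper names are hypothetical.

```python
from datetime import date


def temporal_features(iso_date):
    """Derive month, day-of-month, and day-of-year from an ISO date string."""
    d = date.fromisoformat(iso_date)
    return {
        "month": d.month,
        "day": d.day,
        "adjusted_day": d.timetuple().tm_yday,  # 1 on 1 January
    }


def drop_duplicates(rows):
    """Remove exact duplicate records while preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical records; column names are illustrative.
records = [
    {"date": "2023-03-01", "humidity": 71.0},
    {"date": "2023-03-01", "humidity": 71.0},  # exact duplicate
    {"date": "2024-03-01", "humidity": 64.0},
]
cleaned = drop_duplicates(records)
features = [{**row, **temporal_features(row["date"])} for row in cleaned]
```

Note that the same calendar date maps to a different day-of-year value in leap years: 1 March is day 60 in 2023 and day 61 in 2024. A tree-based model is unlikely to be affected by a one-day shift, but it is worth knowing when comparing values across years.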
VI. Result

This study implemented machine learning techniques to predict rainfall using meteorological variables such as temperature, humidity, atmospheric pressure, wind speed, and sunshine duration. The Random Forest Classifier (RFC) was chosen as the primary model due to its ability to handle non-linear and intricate relationships within the data. The findings highlight the model's effectiveness in accurately forecasting rainfall events.

a) Model Performance

The model achieved a high accuracy of 93.2%, demonstrating its capability to classify rainy and non-rainy days effectively. Key evaluation metrics are as follows:

● Precision: 91.5%
● Recall: 89.8%
● F1-Score: 90.6%

These metrics indicate that the model performs reliably, balancing the identification of both positive (rain) and negative (no rain) instances. The precision score reflects that 91.5% of the predicted rainy days were correct, while the recall score shows that 89.8% of actual rainy days were successfully detected.

b) Confusion Matrix

A confusion matrix was used to further evaluate the model's performance. It highlighted the distribution of true positives (correct rain predictions), true negatives (correct no-rain predictions), false positives (incorrect rain predictions), and false negatives (missed rain predictions). The model demonstrated a low false positive rate, minimizing unnecessary rainfall alerts. Similarly, the false negative rate was low, showing the model's ability to capture most actual rainfall events effectively.

c) Feature Importance

One advantage of the Random Forest Classifier is its ability to evaluate the importance of input features. The analysis revealed that humidity and atmospheric pressure were the most significant contributors, together accounting for over 65% of the total feature importance. This aligns with prior research indicating that high humidity and low atmospheric pressure are strong predictors of rainfall. Other variables, such as temperature and wind speed, also played a role but were comparatively less influential.

d) Seasonality and Temporal Effects

The inclusion of seasonal variables, such as month and adjusted day, enhanced the model's ability to capture temporal variations in rainfall. By integrating month-wise and day-wise features, the model successfully accounted for seasonal changes and accurately predicted rainfall patterns throughout the year. This observation supports findings in climatology, where seasonality is recognized as a critical factor in weather forecasting.

e) Comparison with Baseline Models

The Random Forest Classifier was compared with baseline models, including Logistic Regression and Support Vector Machines (SVM). Logistic Regression achieved an accuracy of 85.6% and SVM achieved 88.3%, so the RFC, at 93.2%, outperformed them by 7.6 and 4.9 percentage points, respectively. This underscores the suitability of the Random Forest Classifier for tasks involving non-linear relationships between meteorological variables.

f) Model Limitations and Future Work

Despite achieving strong results, the model has limitations. The dataset used was restricted to a specific geographic area. Future work could expand the dataset to include diverse climates and locations to improve generalizability. Additionally, incorporating advanced deep learning techniques, such as Long Short-Term Memory (LSTM) networks, could help capture more complex temporal dependencies in the data. Further optimization of RFC hyperparameters and the exploration of ensemble methods could also lead to enhanced predictive performance.
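To make the relationship between the confusion matrix and the reported scores concrete, the short sketch below computes accuracy, precision, recall, and F1 from the four confusion-matrix counts. It then checks that the reported F1-score is consistent with the reported precision and recall. The counts passed to the function are hypothetical, as the paper does not report its confusion-matrix values.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts for illustration only; the paper does not report them.
example = classification_metrics(tp=90, fp=10, tn=880, fn=20)

# Consistency check on the reported scores: F1 is the harmonic mean of
# precision (91.5%) and recall (89.8%).
reported_precision, reported_recall = 0.915, 0.898
f1_from_reported = (2 * reported_precision * reported_recall
                    / (reported_precision + reported_recall))
print(round(f1_from_reported, 3))  # 0.906
```

The harmonic mean of 91.5% and 89.8% rounds to 90.6%, which matches the reported F1-score.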
fig.6.Comparison Graph
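A baseline comparison of the kind summarized in the comparison graph can be set up by fitting each model on the same training split and scoring it on the same test split. The sketch below uses synthetic data and default or illustrative settings, all of which are assumptions. Its printed accuracies will therefore not match the paper's 85.6%, 88.3%, and 93.2%.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic data as a placeholder for the meteorological features (assumption).
X, y = make_classification(n_samples=600, n_features=5, n_informative=3,
                           n_redundant=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Scaling sits inside each pipeline so it is fitted on training data only.
models = {
    "logistic_regression": make_pipeline(MinMaxScaler(),
                                         LogisticRegression(max_iter=1000)),
    "svm": make_pipeline(MinMaxScaler(), SVC()),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy[name] = model.score(X_test, y_test)  # fraction correct on test set

# Print models from lowest to highest test accuracy.
for name, value in sorted(accuracy.items(), key=lambda kv: kv[1]):
    print(f"{name}: {value:.3f}")
```

Min-Max scaling is applied only to the logistic regression and SVM pipelines here because tree-based models split on thresholds and are insensitive to monotonic rescaling of individual features.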
VII. Conclusion

a) Summary of Research

● Developed a machine-learning framework for rainfall prediction using meteorological features such as temperature, humidity, atmospheric pressure, and wind speed.
● The Random Forest Classifier (RFC) was employed, achieving an accuracy of 93.2%, demonstrating its effectiveness in managing complex weather prediction challenges.
● Identified humidity and atmospheric pressure as the most significant predictors, reinforcing their role in rainfall forecasting, as supported by previous studies.

b) Significance of Results

● Emphasized the integration of seasonal and temporal variations, which improved the model's ability to detect trends in rainfall patterns.
● Highlighted the potential of this framework to inform future weather prediction systems, aiding decision-making in agriculture, water resource management, and disaster planning.

c) Limitations and Challenges

● Acknowledged the geographic bias in the dataset, limiting the model's generalizability.
● Recognized the need for expanded datasets covering diverse climatic regions to enhance the model's applicability.

d) Future Directions

● Plan to incorporate advanced models, such as Long Short-Term Memory (LSTM) networks, to better capture sequential and temporal dependencies in weather data.
● Explore ensemble learning methods to improve the accuracy and robustness of predictions.
● Expand the framework to support broader geographic and climatic datasets, enabling more generalized and reliable forecasts.

e) Final Thoughts

● Demonstrated the potential of machine learning in enhancing rainfall prediction accuracy.
● Provided valuable insights for practical applications in climate-sensitive sectors.
● Set the foundation for further advancements in weather forecasting, aiming for robust, accurate, and scalable solutions.

References

[1] Bekele, A., "Machine learning techniques to predict daily rainfall amount," Journal of Big Data, vol. 9, no. 3, pp. 45–67, 2022.

[2] Mosavi, A., and Toth, B., "Rainfall prediction system using machine learning fusion for smart cities," Sensors, vol. 22, no. 9, pp. 3504–3515, 2022.

[3] Adaryani, M., "A hybrid model using XGBoost and ensemble methods for rainfall prediction," Meteorological Algorithms and Applications, vol. 14, no. 1, pp. 89–102, 2021.

[4] Rahman, K., and Singh, P., "Improving rainfall prediction accuracy using LSTM networks," Climatic Modelling Review, vol. 15, no. 8, pp. 345–360, 2022.

[5] Balamurugan, G., and Manojkumar, S., "Comparison of machine learning and traditional models for rainfall prediction," International Journal of Scientific and Technology Research, vol. 9, no. 6, pp. 442–450, 2020.

[6] Nguyen, T., Kim, Y., and Lee, H., "Neural network and fuzzy logic hybrid models for rainfall forecasting," Computational Intelligence in Environmental Modelling, vol. 11, no. 7, pp. 312–328, 2021.

[7] Wu, J., Zhang, L., and Chen, Q., "Integrating K-means clustering and machine learning classifiers for precipitation prediction," Advances in Meteorological Techniques, vol. 8, no. 4, pp. 456–470, 2021.

[8] Chen, J., and Wei, X., "Rainfall prediction using hybrid decision trees and neural networks," Sustainable Meteorological Practices, vol. 13, no. 2, pp. 210–225, 2020.

[9] Singh, A., and Das, R., "Comparison of SVM and ANN for predicting rainfall in complex terrains," Journal of Meteorological Research, vol. 15, no. 5, pp. 260–278, 2021.

[10] Wu, Y., and Zhao, L., "Applications of deep learning in short-term rainfall forecasting," Applied Weather Prediction Models, vol. 18, no. 3, pp. 325–340, 2022.

[11] Deepak, N.R., Balaji, S. (2016). Uplink Channel Performance and Implementation of Software for Image Communication in 4G Network. In: Silhavy, R., Senkerik, R., Oplatkova, Z., Silhavy, P., Prokopova, Z. (eds) Software Engineering Perspectives and Application in Intelligent Systems. CSOC 2016. Advances in Intelligent Systems and Computing, vol 465. Springer, Cham. https://doi.org/10.1007/978-3-319-33622-0_10

[12] Simran Pal R and Deepak N R, “Evaluation on Mitigating Cyber Attacks and Securing Sensitive Information with the Adaptive Secure Metaverse Guard (ASMG) Algorithm Using Decentralized Security”, Journal of Computational Analysis and Applications (JoCAAA), vol. 33, no. 2, pp. 656–667, Sep. 2024.

[13] B, Omprakash & Metan, Jyoti & Konar, Anisha & Patil, Kavitha & KK, Chiranthan. (2024). Unravelling Malware Using Co-Existence Of Features. 1-6. 10.1109/ICAIT61638.2024.10690795.

[14] Rezni S and Deepak N R, “Challenges and Innovations in Routing for Flying Ad Hoc Networks: A Survey of Current Protocols”, Journal of Computational Analysis and Applications (JoCAAA), vol. 33, no. 2, pp. 64–74, Sep. 2024.

[15] N. R. Deepak and S. Balaji, "Performance analysis of MIMO-based transmission techniques for image quality in 4G wireless network," 2015 IEEE International Conference on Computational Intelligence and Computing Research (ICCIC), 2015, pp. 1-5, doi: 10.1109/ICCIC.2015.7435774.

[16] N R, Deepak & Sriramulu, Balaji. (2015). A Review of Techniques used in EPS and 4G-LTE in Mobility Schemes. International Journal of Computer Applications. 109. 30-38. 10.5120/19219-1018.

[17] Patil, Kavitha S et al. “Hybrid and Adaptive Cryptographic-based secure authentication approach in IoT based applications using hybrid encryption.” Pervasive Mob. Comput. 82 (2022): 101552.
