ML - Report


Development of Explainable AI (XAI) based model for
prediction of high impact rain events using satellite data


A PROJECT REPORT

21CSC305P – MACHINE LEARNING


(2021 Regulation)
III Year/ V Semester
Academic Year: 2024 -2025

Submitted by
Saket Gudimella [RA2211026010111]
Vasan Lennin [RA2211026010106]
R Siddharth [RA2211026010079]
S Veena Maheswari [RA2211026010123]
Under the Guidance of
Dr. M.S.Abirami
Associate Professor
Department of Computational Intelligence

in partial fulfillment of the requirements for the degree of

BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE ENGINEERING
with specialization in
ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING

SCHOOL OF COMPUTING
COLLEGE OF ENGINEERING AND TECHNOLOGY
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
KATTANKULATHUR- 603 203

NOVEMBER 2024
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
KATTANKULATHUR – 603 203

BONAFIDE CERTIFICATE

Certified that 21CSC305P - MACHINE LEARNING project
report titled “Development of Explainable AI (XAI) based model for
prediction of high impact rain events using satellite data” is the
bonafide work of Saket Gudimella [RA2211026010111], Vasan
Lennin [RA2211026010106], R Siddharth [RA2211026010079], S
Veena Maheswari [RA2211026010123] who carried out the task of
completing the project within the allotted time.

SIGNATURE
Dr. M.S.Abirami
Course Faculty
Associate Professor
Department of Computational Intelligence
SRM Institute of Science and Technology
Kattankulathur

SIGNATURE
Dr. R.Annie Uthra
Head of the Department
Professor
Department of Computational Intelligence
SRM Institute of Science and Technology
Kattankulathur
ABSTRACT

With the increasing global frequency and severity of extreme weather events, accurate and timely
predictions of heavy rainfall are essential for effective disaster management and risk mitigation.
This project, titled Development of an Explainable AI (XAI) based Model for Prediction of
Heavy/High Impact Rain Events Using Satellite Data, addresses the challenge of forecasting
high-impact rainfall events by developing a machine learning model that not only achieves high
predictive accuracy but also emphasizes transparency through explainable AI (XAI) techniques.

Utilizing satellite data, this model integrates a range of meteorological parameters, including
atmospheric pressure, temperature, humidity, wind speed, and cloud cover, to create a
comprehensive dataset for predicting rainfall events. The project investigates machine learning
algorithms, such as Support Vector Machines (SVM) and Random Forest, to identify the most suitable
approach for predicting intense rain events. Each algorithm is evaluated for its predictive
performance, enabling a comparative analysis to determine the most accurate model.

A significant aspect of this project is its focus on explainability, achieved through SHapley Additive
exPlanations (SHAP) analysis. SHAP values are calculated for each model, providing insights into
the contribution of individual features (such as temperature fluctuations, cloud density, or humidity
levels) to the prediction outcome. This level of interpretability is critical in supporting
meteorologists and disaster management teams, as it allows them to understand the reasoning
behind each prediction, ultimately fostering trust and transparency in AI-driven forecasts.

The project demonstrates the efficacy of combining machine learning with XAI techniques in
improving early warning systems for high-impact rain events. The insights derived from the SHAP
analysis aid in understanding complex weather patterns, thereby enabling proactive and data-driven
decision-making. By prioritizing both predictive accuracy and interpretability, this project offers a
promising approach to enhancing preparedness, minimizing risks, and reducing the socioeconomic
impacts of heavy rainfall events through explainable, reliable AI-driven insights.
TABLE OF CONTENTS

S NO CHAPTER PAGE NO

1 INTRODUCTION
1.1 Background and Motivation
1.2 Objectives of the Study
1.3 Challenges in Heavy Rain Prediction
1.4 Importance of Explainable AI in Meteorology
1.5 Software Requirements Specification

2 LITERATURE SURVEY
2.1 Overview of Rainfall Prediction Models
2.2 Explainable AI in Meteorological Applications
2.3 Comparative Analysis of Machine Learning Techniques
2.4 Review of Satellite Data Sources for Meteorology
2.5 Summary and Key Takeaways

3 METHODOLOGY
3.1 Data Collection and Preprocessing
3.1.1 Satellite Data Acquisition and Processing
3.1.2 Feature Engineering and Selection
3.1.3 Data Splitting and Normalization
3.2 Model Architecture and Design
3.2.1 Support Vector Machine (SVM) Implementation
3.2.2 Random Forest Implementation
3.2.3 SHAP Analysis for Model Explainability

4 RESULTS AND DISCUSSIONS

5 CONCLUSION AND FUTURE ENHANCEMENT

6 REFERENCES
LIST OF FIGURES

Figure No Title of the Figure Page No

1 Architecture diagram 23
2 Annual Rainfall Data 29
3 Annual Rainfall Range Plot 29
4 Algorithm Accuracy Bar Plot 29
5 Prediction vs Actual Data ROC 29
6 Random Forest Result Comparison 20
7 Dataset Parameters 30
8 SVM Prediction Accuracy Parameters 30
9 Month-wise Rainfall Data 30
10 Average impact SHAP value 30
11 Random Forest Classifier Error 30
12 ROC curves for RFC 30
13 RFC Prediction Accuracy Parameters 30
14 RFC Confusion Matrix 31
15 SVM Classification Graph 31
LIST OF TABLES

Table No Title of the Table Page No

1 Literature Survey 15

ABBREVIATIONS

S. NO ACRONYM EXPANSION

1. SVM Support Vector Machine
2. RFC Random Forest Classifier
3. SHAP SHapley Additive exPlanations
4. AUC Area Under the Curve
5. XAI Explainable Artificial Intelligence
6. INSAT Indian National Satellite System
7. NWP Numerical Weather Prediction
8. WRF Weather Research and Forecasting
9. ECMWF European Centre for Medium-Range Weather Forecasts
10. RFE Recursive Feature Elimination
CHAPTER 1

INTRODUCTION

1.1 BACKGROUND AND MOTIVATION

Extreme rainfall events pose a serious challenge for meteorology and disaster response, often
leading to severe floods, landslides, and widespread infrastructure damage. The increasing
frequency and intensity of such events, largely influenced by climate change, underscores the need
for accurate and rapid forecasting. Traditional forecasting models, while valuable, sometimes lack
transparency in identifying the factors that drive predictions, which can make interpretation
challenging. However, advancements in space technology, including high-resolution satellite data
from platforms like INSAT-3D/3DR, provide a valuable resource for enhancing the accuracy and
timeliness of rainfall forecasts. Effectively translating this satellite data into actionable insights
requires advanced machine learning models that balance both accuracy and interpretability.

The motivation behind developing an Explainable AI (XAI) model for rainfall nowcasting stems
from the dual need for precision and transparency in high-stakes weather forecasting. Conventional
"black box" AI models, despite their predictive power, often lack interpretability, leaving users and
stakeholders uncertain about the underlying rationale—especially in situations where critical
decisions need to be made. This project aims to address this by integrating explainable features into
the model, offering clarity on key predictors and flagging potential conditions where model accuracy
might falter. This approach will help build trust in the model's predictions and empower
meteorologists, disaster response teams, and policymakers to make informed decisions. Ultimately,
the project will deliver a user-friendly web application that not only provides rainfall forecasts but
also reveals the AI model’s reasoning, making its outputs easier to understand and more reliable for
end-users.
1.2 OBJECTIVES OF THE STUDY

● Develop a Predictive Model for Heavy Rainfall: Create an AI model that accurately forecasts
heavy rain events using satellite data, enhancing early warning systems.

● Integrate Explainability: Ensure the model is explainable, providing insights into how it arrives
at its predictions to support transparency and trust.

● Identify Key Influencing Factors: Highlight the main factors contributing to heavy rainfall
predictions, allowing meteorologists to understand the drivers behind severe rain forecasts.

● Support Decision-Making for Disaster Preparedness: Provide weather experts with clear,
actionable insights to improve disaster readiness and response strategies.

● Enhance Early Warning Systems: Contribute to more reliable and understandable early
warnings for severe rainfall, aiding proactive disaster management.
1.3 CHALLENGES IN HEAVY RAIN PREDICTION

● Data Quality and Preprocessing: Satellite data often contains missing values, noise, and
inconsistencies, which require extensive cleaning and preprocessing. Ensuring the data is
properly normalized and formatted for use in the model is a complex and
resource-intensive task.

● Complexity of Weather Systems: The factors influencing heavy rainfall are numerous and
highly dynamic, making it difficult to capture all the relevant temporal and spatial patterns
accurately. Modeling these complex interactions to improve prediction accuracy is a major
challenge.

● Balancing Accuracy with Interpretability: While highly accurate models can be complex and
opaque, the goal is to create a model that is both precise and interpretable. Striking this balance
between performance and explainability presents a technical challenge, as more complex models
often sacrifice transparency.

● Real-Time Forecasting: To be useful for early warning systems, the model must provide
predictions in real-time. This requires optimizing the model for speed and low-latency
processing, ensuring that predictions are available when needed without sacrificing accuracy.

● Handling Model Limitations: Identifying conditions where the model may struggle or fail to
make accurate predictions is critical for improving its reliability. Developing strategies to
manage these limitations and minimize errors is an ongoing challenge.

● Designing an Accessible Interface: Creating a user-friendly web application that presents both
predictions and interpretability insights in a clear, concise manner is essential. Ensuring that the
interface is intuitive for meteorologists and non-experts alike requires thoughtful design and
testing.
1.4 IMPORTANCE OF EXPLAINABLE AI IN METEOROLOGY

Explainable AI (XAI) is crucial in meteorology because it enhances the transparency and
trustworthiness of AI-driven weather predictions. Traditional AI models are often viewed as “black
boxes,” making it difficult for meteorologists to understand how predictions are made. XAI
addresses this by providing clear explanations of the factors influencing predictions, which helps
meteorologists make more informed decisions and improve the accuracy of forecasts.

Furthermore, XAI supports model validation by identifying potential errors or biases, allowing for
timely corrections and refinements. This is particularly important in high-stakes scenarios, such as
predicting heavy rainfall or storms, where understanding the model’s reasoning is essential for
effective disaster preparedness and response.

Additionally, XAI promotes fairness by detecting biases related to geographical regions or
environmental factors, ensuring that predictions are reliable across diverse conditions. It also aids in
regulatory compliance by demonstrating that the model’s predictions are based on scientifically
sound principles. Finally, XAI enhances communication between meteorologists, AI experts, and
other stakeholders, ensuring that weather forecasts are understood and trusted by all parties involved.
This transparency ultimately fosters greater adoption of AI technologies in meteorology, improving
overall forecasting capabilities.
1.5 SOFTWARE REQUIREMENTS SPECIFICATION
Core Libraries

● Data Manipulation: numpy, pandas


● Data Visualization: matplotlib, seaborn

● Warnings Suppression: warnings

Data Preprocessing & Analysis

● Data Preprocessing: sklearn.utils.resample, sklearn.model_selection.train_test_split,
sklearn.preprocessing.MinMaxScaler

● Model Evaluation Metrics: sklearn.metrics (e.g., accuracy_score, roc_auc_score,
classification_report, confusion_matrix, cohen_kappa_score, plot_confusion_matrix, roc_curve).
Note that plot_confusion_matrix was removed in scikit-learn 1.2; current releases provide
ConfusionMatrixDisplay in its place.

Machine Learning Models

● Classifiers:
○ sklearn.linear_model.LogisticRegression
○ sklearn.neighbors.KNeighborsClassifier
○ sklearn.svm.SVC
○ sklearn.tree.DecisionTreeClassifier
○ sklearn.ensemble.RandomForestClassifier
○ sklearn.neural_network.MLPClassifier
○ lightgbm
○ catboost

○ xgboost

Data Analysis Functions

● Custom Data Exploration: data_explore() function for dataset overview, info, and column
analysis.

● Outlier Detection: IQR() function for outlier thresholding based on the interquartile range
(IQR).

Data Cleaning & Feature Engineering

● Missing Value Imputation: Manual imputation of missing values in categorical and numerical
columns
● Outlier Treatment: max_value() function for capping outliers

● Categorical Encoding: pd.get_dummies() for encoding categorical variables

Model Training & Evaluation

● ROC Curve Plotting: plot_roc_curve() function

● Generalized Model Training Function: model_run() function to train models, predict, and
output performance metrics.
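
The report names a model_run() helper without showing it. A minimal sketch of what such a helper might look like is given below. The signature, the returned metric names, and the synthetic dataset are all assumptions made for illustration; they are not taken from the project code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score, roc_auc_score
from sklearn.model_selection import train_test_split


def model_run(model, X_train, y_train, X_test, y_test):
    """Fit a classifier, predict on the test set, and return summary metrics.

    Hypothetical signature: the report mentions model_run() but does not list it.
    """
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # ROC AUC needs a continuous score per sample; use the class-1 probability.
    y_score = model.predict_proba(X_test)[:, 1]
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_score),
        "cohen_kappa": cohen_kappa_score(y_test, y_pred),
    }


# Synthetic stand-in for the rainfall dataset (assumption, not the report's data).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
metrics = model_run(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X_train, y_train, X_test, y_test,
)
print(metrics)
```

Because the sketch calls predict_proba, an SVC would need to be constructed with probability=True (or the helper adapted to use decision_function) before being passed in.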
CHAPTER 2

LITERATURE SURVEY

2.1 OVERVIEW OF RAINFALL PREDICTION MODELS

Rainfall prediction models are essential in meteorology and climate science, offering insights for
agriculture, water resource management, and disaster prevention. These models can be broadly
categorized into statistical, dynamical, and hybrid models. Statistical models leverage historical rainfall
data and other meteorological variables to identify patterns and trends. Techniques such as regression
analysis, autoregressive integrated moving average (ARIMA), and machine learning algorithms like
support vector machines (SVM) and artificial neural networks (ANN) are commonly used. These
models are computationally efficient and perform well with localized, short-term predictions but may
struggle with long-term forecasts or in regions lacking extensive historical data.

Dynamical models, in contrast, use physical laws and equations that govern atmospheric behavior to
simulate climate conditions. Numerical Weather Prediction (NWP) models are a widely used example,
incorporating data from satellites and radar to model large-scale weather systems. NWP models such as
the Weather Research and Forecasting (WRF) model and the European Centre for Medium-Range
Weather Forecasts (ECMWF) model are prominent examples. While these models provide a more
comprehensive understanding of rainfall distribution and atmospheric dynamics, they are
computationally intensive. Recent advancements have seen the development of hybrid models,
combining statistical methods with dynamical models to achieve better accuracy. The integration of
machine learning with NWP models is also promising, as it can enhance real-time adaptability and
improve prediction accuracy across diverse geographical areas.
2.2 EXPLAINABLE AI IN METEOROLOGICAL APPLICATIONS

Explainable AI (XAI) is increasingly vital in meteorological applications, where understanding the
reasoning behind predictions is crucial for decision-making, especially when it comes to weather
forecasting, disaster preparedness, and climate change predictions. Many machine learning models,
deep learning models in particular, act as “black boxes,” providing predictions without transparent
reasoning. XAI helps make these models interpretable, ensuring that meteorologists and disaster
response teams can trust and understand the factors behind weather forecasts.

In weather forecasting, XAI models provide clarity on the key atmospheric factors driving predictions,
such as temperature, humidity, and wind patterns. For example, in heavy rainfall prediction, XAI
techniques like SHAP or LIME can explain which variables (e.g., cloud cover, moisture levels)
contributed most to a rainfall forecast, improving prediction transparency.

XAI is also valuable in severe weather event prediction, such as hurricanes or tornadoes. It helps
meteorologists understand why certain conditions are likely to lead to extreme events, making forecasts
more actionable. In the case of climate modeling, XAI techniques reveal which factors (e.g., CO2
emissions or solar radiation) are influencing long-term climate changes, aiding policy decisions.

Moreover, in disaster management, XAI ensures that weather predictions for events like floods are
understandable, allowing timely and informed decisions. By explaining why a model has issued a
warning, XAI increases trust in early warning systems, helping authorities respond effectively.

Overall, XAI enhances the accuracy, transparency, and reliability of meteorological predictions, making
them more actionable and trustworthy for decision-makers, ultimately improving public safety and
climate resilience.
2.3 COMPARATIVE ANALYSIS OF MACHINE LEARNING
TECHNIQUES
Machine learning techniques like Support Vector Machines (SVM) and Random Forest have shown
significant improvements over traditional rainfall prediction models, particularly in handling complex,
nonlinear patterns in meteorological data. Traditional statistical models, such as multiple linear
regression and autoregressive models, rely on predefined relationships and often struggle with the
intricate and variable dynamics of rainfall. SVM and Random Forest, however, are capable of
identifying complex data patterns without extensive parameter assumptions. SVM is particularly
effective for classification problems, separating data into distinct categories using hyperplanes, while
Random Forest leverages an ensemble of decision trees to improve prediction accuracy and reduce
overfitting, making it robust for rainfall prediction tasks.

Compared to traditional models, SVM and Random Forest are better equipped to handle
high-dimensional data, often found in meteorological datasets due to the range of influencing factors
such as temperature, humidity, and atmospheric pressure. Traditional models may underperform when
numerous interdependent variables are present, as they struggle to model the nonlinear relationships
effectively. Machine learning models, by contrast, can adapt to complex dependencies and identify key
variables automatically, thus enhancing prediction precision. Random Forest, with its ensemble
approach, captures a broader set of features and reduces variance, which can lead to more accurate
predictions even with noisy data. This flexibility has made machine learning models more effective and
widely adopted in recent rainfall prediction studies.

In addition, the integration of SHAP (SHapley Additive exPlanations) for explainable AI (XAI) offers
new insights into machine learning predictions, addressing a key limitation of traditional models and
even early machine learning models, which often act as black boxes. SHAP values allow for an
interpretable analysis of feature importance, offering transparency into how each factor influences
model predictions. This capability is crucial for meteorology, where understanding the underlying
factors behind rainfall predictions is important for decision-making in policy, agriculture, and disaster
management. By using SHAP in conjunction with SVM and Random Forest, modern rainfall prediction
models not only achieve higher accuracy but also provide actionable insights into the contributing
factors, making them superior to traditional statistical models in both predictive power and
interpretability.
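
SHAP values build on Shapley values from cooperative game theory: each feature is credited with its average marginal contribution over all subsets of the other features. The sketch below computes exact Shapley values by brute force for a toy three-feature scoring function. It is only an illustration of the idea, not the shap library, which relies on faster approximations. The toy model, the feature names, and the baseline values are assumptions.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one input by enumerating all feature subsets.

    Features outside a subset are replaced by their baseline value.
    Cost grows as 2**n, so this is only practical for a handful of features.
    """
    n = len(x)
    features = range(n)

    def value(subset):
        # Real value for features in the subset, baseline value otherwise.
        mixed = [x[i] if i in subset else baseline[i] for i in features]
        return predict(mixed)

    phi = [0.0] * n
    for i in features:
        others = [j for j in features if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Toy "rain score" with one interaction term (purely illustrative).
def toy_model(v):
    humidity, cloud_cover, wind = v
    return 2.0 * humidity + 1.0 * cloud_cover + 0.5 * humidity * wind


x = [0.9, 0.8, 0.4]          # instance to explain
baseline = [0.5, 0.5, 0.5]   # reference input, e.g. feature means
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The contributions sum to the difference between the prediction for x and the prediction for the baseline, which is the additive property that gives SHAP its name.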
Paper Ref No. 1
Journal Name & Year: IEEE Transactions on Geoscience and Remote Sensing, 2022
Paper Title: Satellite Data-based Prediction of Heavy Rainfall Using Machine Learning Techniques
Techniques Used / Methodology: Random Forest, SVM, Gradient Boosting, Satellite Data Preprocessing
Inferences: Machine learning techniques provide accurate rainfall predictions. However, lack of
transparency in decision-making is a limitation.

Paper Ref No. 2
Journal Name & Year: Journal of Hydrology, 2021
Paper Title: Deep Learning Models for Extreme Rainfall Prediction from Remote Sensing Data
Techniques Used / Methodology: CNN, LSTM, Multimodal Data Fusion (combining satellite imagery,
precipitation data)
Inferences: CNN and LSTM models achieve high accuracy in predicting heavy rain, but interpretation of
the results is difficult due to the complexity of deep learning models.

Paper Ref No. 3
Journal Name & Year: Nature Scientific Reports, 2024
Paper Title: Explainable AI for Weather Forecasting: Applications of SHAP and LIME
Techniques Used / Methodology: SHAP, LIME, Random Forest, SVM, XGBoost
Inferences: SHAP and LIME are effective in making complex ML models interpretable, allowing users to
understand feature contributions.

Paper Ref No. 4
Journal Name & Year: Remote Sensing, 2022
Paper Title: Application of Explainable AI in Meteorological Predictions Using Satellite Data
Techniques Used / Methodology: SHAP, LIME, CNN, Random Forest, Data Fusion from multiple satellites
Inferences: SHAP and LIME help in understanding model predictions and improving user trust in
AI-driven weather forecasting models.

Paper Ref No. 5
Journal Name & Year: Atmospheric Research, 2024
Paper Title: Hybrid Machine Learning and XAI Framework for Improved Rainfall Forecasting
Techniques Used / Methodology: Hybrid Machine Learning (Random Forest + LSTM), SHAP, Meteorological
Satellite Data
Inferences: A hybrid approach combining traditional ML with XAI improves prediction accuracy while
providing explainability. Key meteorological variables influencing predictions are highlighted.

Paper Ref No. 6
Journal Name & Year: IEEE Access, 2024
Paper Title: Integrating XAI with Convolutional Neural Networks for Predicting Severe Weather Events
Techniques Used / Methodology: CNN, SHAP, LIME, Radar and Satellite Data Integration
Inferences: Integrating XAI techniques with CNN models makes the decision-making process more
interpretable, enhancing user confidence in predictions.

Paper Ref No. 7
Journal Name & Year: Climate Dynamics, 2023
Paper Title: Machine Learning Approaches for Real-Time Prediction of Heavy Rainfall Using Satellite Data
Techniques Used / Methodology: Random Forest, Decision Trees, Time-Series Data from Meteorological
Satellites
Inferences: Machine learning models based on time-series data from satellite observations show great
potential for real-time rainfall prediction. Decision trees provide partial interpretability, but need
enhancement.

Paper Ref No. 8
Journal Name & Year: Journal of Applied Meteorology and Climatology, 2022
Paper Title: The Role of Explainable AI in Meteorological Predictions
Techniques Used / Methodology: XAI methods (SHAP, LIME), Deep Learning, Meteorological Data Analysis
Inferences: XAI bridges the gap between model accuracy and interpretability, improving stakeholder
understanding in critical applications like rain prediction.

Paper Ref No. 9
Journal Name & Year: Earth and Space Science, 2022
Paper Title: Improving Predictability of Extreme Rainfall Events through Machine Learning and
Explainable AI
Techniques Used / Methodology: XGBoost, SHAP, Satellite Data, Real-Time Prediction
Inferences: Explainable AI can help clarify the factors contributing to extreme rainfall predictions,
making models more transparent and actionable in real-time contexts.

Paper Ref No. 10
Journal Name & Year: Journal of Climate, 2022
Paper Title: Interpretable Machine Learning for Climate Modeling: Applications to Rainfall Prediction
Techniques Used / Methodology: SHAP, Decision Trees, Random Forest, SVM
Inferences: Interpretable machine learning techniques allow for more meaningful predictions in climate
modeling, highlighting the most impactful variables in rainfall prediction.

Table 1: Literature Survey


2.4 REVIEW OF SATELLITE DATA SOURCES FOR
METEOROLOGY

INSAT (Indian National Satellite System)

The INSAT system, developed and operated by the Indian Space Research Organisation (ISRO),
consists of a series of multipurpose geostationary satellites. It plays a vital role in providing real-time
meteorological data for weather forecasting, disaster management, and communication, particularly for
India and its surrounding regions.

INSAT-3D/3DR: These advanced satellites are equipped with state-of-the-art imaging systems that
capture data related to cloud cover, sea surface temperatures, and various atmospheric parameters. The
high-resolution data from INSAT-3D/3DR is particularly beneficial for short-term weather forecasting,
such as predicting heavy rainfall, thunderstorms, and cyclones. Additionally, these satellites generate
imagery of the Earth's surface, which helps in analyzing weather systems such as monsoons and tropical
storms.

Key Parameters: Cloud imagery, sea surface temperature, atmospheric water vapor levels, wind patterns,
and rainfall estimation.

Strengths:
- Continuous, real-time data from a geostationary orbit allows for frequent and consistent monitoring of
weather conditions.
- High temporal resolution ensures timely observations, which are crucial for real-time weather
forecasting and early warning systems.
- Ideal for monitoring regional and tropical weather phenomena, making it highly effective for
forecasting monsoons, cyclones, and heavy rainfall.

Limitations:
- The spatial resolution is lower compared to polar-orbiting satellites, limiting the ability to capture fine
details of local weather events.
- As a geostationary satellite system, it observes only the fixed portion of the Earth visible from its
orbital position over the equator, and viewing quality degrades toward high latitudes. This restricts its
capability to capture weather data from polar regions and from areas outside its field of view.
CHAPTER 3

METHODOLOGY

3.1 DATA COLLECTION AND PROCESSING


1. Data Preparation
○ Data Gathering: Collect satellite data from reliable sources, such as Sentinel or Landsat
datasets, using remote sensing platforms like Google Earth Engine or directly accessing
public repositories.
○ Data Preprocessing: Process the satellite imagery to handle any missing or erroneous
values. This may include filling missing data through interpolation or removal if data is
unreliable.
○ Standardization and Normalization: Apply Min-Max scaling or Z-score normalization
to standardize features across temporal and spatial datasets, ensuring that variations in
data scale do not bias the model.
2. Feature Engineering
○ Spatial Features: Extract relevant spatial features such as pixel intensity, NDVI
(Normalized Difference Vegetation Index), and land cover classifications to capture
geographic patterns.
○ Temporal Features: Create temporal features by aggregating data over time periods
(e.g., monthly or seasonal averages), capturing temporal variations.
○ Meteorological Indices: Integrate external meteorological data, like temperature,
precipitation, and humidity, to enrich the feature set and improve predictive capabilities
for modeling natural processes.
3. Data Cleaning and Outlier Treatment
○ Outlier Detection: Use interquartile range (IQR) and other statistical methods to identify
and handle outliers, ensuring they do not skew model training.
○ Missing Value Imputation: Where applicable, apply interpolation or domain-specific
methods for imputing missing values, particularly for continuous data.
4. Data Splitting
○ Train-Test Split: Divide the processed data into training and test sets, ensuring that
temporal or spatial integrity is maintained (e.g., by year or geographic region) to avoid
data leakage and enhance model generalization.
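
The IQR-based outlier step described above can be sketched as follows. This is one possible reading of the report's IQR() and max_value() helpers rather than their actual implementation. The function names iqr_bounds and cap_outliers, the 1.5 multiplier, and the column names are assumptions.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return (lower, upper) outlier thresholds from the interquartile range."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap_outliers(df, columns, k=1.5):
    """Clip each listed column to its IQR bounds; returns a new DataFrame."""
    capped = df.copy()
    for col in columns:
        lower, upper = iqr_bounds(capped[col], k)
        capped[col] = capped[col].clip(lower=lower, upper=upper)
    return capped


# Illustrative values only; the column names are hypothetical.
df = pd.DataFrame({
    "humidity": [60.0, 62.0, 65.0, 63.0, 61.0, 64.0, 5.0],
    "wind_speed": [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0],
})
clean = cap_outliers(df, ["humidity", "wind_speed"])
print(clean)
```

In a full pipeline the bounds would normally be computed on the training split only and then applied to the validation and test splits, so that no information leaks from held-out data.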
3.1.1 SATELLITE DATA ACQUISITION AND PROCESSING

Satellite data for meteorology is collected from geostationary (e.g., INSAT-3D/3DR) and polar-orbiting
satellites. These satellites provide vital atmospheric data like cloud cover, temperature, humidity, and sea
surface temperatures. The data undergoes several preprocessing steps such as calibration,
georeferencing, data fusion, and cloud masking to ensure accuracy and consistency.

Feature engineering extracts relevant temporal and spatial patterns from the data, like cloud density
changes or temperature gradients, to predict rainfall. Missing values are handled, and data is
standardized to improve model performance. Machine learning models, such as decision trees or random
forests, are then trained on this processed data to predict rainfall events. Techniques like SHAP values
and feature importance scores enhance the interpretability of the model, allowing meteorologists to
understand the reasons behind predictions.

The model is validated using cross-validation and deployed for real-time forecasting, displayed through
a user-friendly interface for operational use in weather prediction and disaster preparedness.
3.1.2 FEATURE ENGINEERING AND SELECTION

Feature engineering transforms raw satellite data into useful features for rainfall prediction. This
involves creating temporal features (e.g., time-based patterns), spatial features (e.g., geographical
gradients), and meteorological indices (e.g., cloud thickness, sea surface temperature, humidity) that
capture key weather patterns.

Feature selection identifies the most relevant features by removing redundant or less impactful ones.
Techniques like correlation analysis, Recursive Feature Elimination (RFE), and decision tree-based
methods help identify important features. Feature scaling (normalization or standardization) ensures all
features contribute equally to the model. Dimensionality reduction methods like PCA can simplify the
dataset while retaining essential information.

The goal is to improve model performance, reduce complexity, and ensure the selected features capture
critical weather conditions for accurate rainfall forecasting.
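
As an illustration of the selection step, Recursive Feature Elimination can be run with scikit-learn as sketched below. The synthetic data, the generic feature names, the choice of a random forest as the ranking estimator, and the target of five features are assumptions for the example, not values reported by the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in for the engineered satellite features (assumption).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=4, n_redundant=2, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# RFE repeatedly fits the estimator and drops the least important feature
# (step=1) until only n_features_to_select remain.
selector = RFE(
    estimator=RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=5,
    step=1,
)
selector.fit(X, y)

selected = [name for name, keep in zip(feature_names, selector.support_) if keep]
print("Selected features:", selected)
print("Ranking (1 = kept):", selector.ranking_.tolist())
```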
3.1.3 DATA SPLITTING AND NORMALIZATION

Data Splitting: The dataset is divided into three sets:

● Training Set (70%): Used to train the model and learn from the data.

● Validation Set (15%): Helps tune hyperparameters, prevent overfitting, and optimize the model's
performance during training.

● Test Set (15%): Evaluates the model’s final performance on new, unseen data to assess its
generalization ability.

These splits ensure that the model is trained on one portion of the data, validated and optimized on
another, and then tested on a separate set to provide an unbiased performance evaluation.

Normalization: Features in satellite data (e.g., cloud cover, sea surface temperature, wind speed) often
have different ranges and units. Normalization ensures that all features are on the same scale, preventing
any single feature from disproportionately influencing the model. Common techniques include:

● Min-Max Scaling: Scales features to a fixed range (e.g., 0 to 1).

● Standardization: Centers features around a mean of 0 and scales to a unit variance, which is
helpful when features have varying distributions.

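Normalization is essential for models like SVMs, k-nearest neighbours, or neural networks, which are
sensitive to the scale of the input features. Tree-based models such as decision trees and random
forests split on one feature at a time and are largely unaffected by feature scale, so for them scaling
is harmless but not required. For scale-sensitive models, putting features on a common scale keeps any
one feature from dominating and helps the model learn efficiently.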

Together, data splitting and normalization ensure that the model trains effectively, generalizes well to
new data, and provides accurate, explainable rainfall forecasts.
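
A minimal sketch of the 70/15/15 split followed by Min-Max scaling is shown below. The random data and the 20% positive rate are placeholders. The key point it illustrates is that the scaler is fitted on the training set only and then reused for the other two sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Placeholder data: 1000 samples, 4 features, roughly 20% positive labels (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.2).astype(int)

# First split off 30%, then halve it into validation and test (15% each).
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=0
)

# Fit the scaler on training data only to avoid leaking test statistics.
scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)

print(len(X_train), len(X_val), len(X_test))
```

This sketch uses a random stratified split for brevity. For time-ordered satellite data, splitting by year or region, as noted in Section 3.1, would be the safer choice.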
3.2 MODEL ARCHITECTURE AND DESIGN

Figure 1: Architecture diagram


3.2.1 SUPPORT VECTOR MACHINE IMPLEMENTATION

● Data Preprocessing:
○ The raw satellite data (e.g., cloud top temperature, humidity, water vapor levels from
INSAT-3DR) and IMD rainfall data are preprocessed. This involves normalizing
the features, handling missing data, and ensuring the target labels are binary (e.g., heavy
rain vs. no heavy rain).

● Feature Extraction:
○ Key predictors are extracted from the satellite data, such as cloud motion vectors,
temperature gradients, and atmospheric pressure. These features are used as input for the
SVM model.

● SVM Model Architecture:

○ SVM works by finding an optimal hyperplane that separates the data points into two
classes (heavy rain vs. no heavy rain). The algorithm maximizes the margin between the
classes; the data points closest to the hyperplane (the support vectors) determine where
that boundary lies.
○ The kernel function plays a crucial role here. Depending on the complexity of the data, a
linear kernel (for linearly separable data) or a non-linear kernel like Radial Basis
Function (RBF) can be used to capture more complex relationships.

● Model Training:
○ The SVM is trained using historical satellite data and corresponding rainfall records. The
model learns the relationship between atmospheric features and rainfall occurrences.

● Prediction:
○ Once trained, the SVM predicts whether a new set of satellite features corresponds to
heavy rainfall. The output is a binary classification (heavy rain or no heavy rain) based
on the feature input.

● Model Evaluation:
○ The performance of the SVM is evaluated using standard metrics like accuracy,
precision, recall, and F1-score. Cross-validation is used to ensure the model generalizes
well on unseen data.
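
The SVM workflow above might look like the following sketch. It is an illustration under stated
assumptions, not the project's code: the data comes from scikit-learn's make_classification instead of
INSAT-3DR features, and the kernel, C, gamma, and fold count are placeholder choices. Scaling is placed
inside the pipeline so that each cross-validation fold fits its own scaler.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary problem standing in for (satellite features, heavy-rain label).
X, y = make_classification(n_samples=600, n_features=6, n_informative=4,
                           random_state=21)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=21)

# RBF-kernel SVM; C and gamma are assumed defaults, not tuned values.
svm_clf = make_pipeline(StandardScaler(),
                        SVC(kernel="rbf", C=1.0, gamma="scale"))

# 5-fold cross-validated F1-score on the training split.
cv_f1 = cross_val_score(svm_clf, X_train, y_train, cv=5, scoring="f1")
print("CV F1 per fold:", cv_f1.round(3))

# Final fit and held-out evaluation (precision, recall, F1-score, accuracy).
svm_clf.fit(X_train, y_train)
y_pred = svm_clf.predict(X_test)
print(classification_report(y_test, y_pred, digits=3))
```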
3.2.2 RANDOM FOREST IMPLEMENTATION

● Data Preprocessing:
○ As with the SVM, raw satellite data and IMD rainfall data need to be preprocessed. This
includes cleaning the data, filling missing values, and normalizing or standardizing the
features. The target variable is typically binary (e.g., heavy rain vs. no rain).

● Feature Extraction:
○ From the satellite data, relevant meteorological features (such as cloud cover,
temperature, water vapor content, and wind speed) are extracted. These features will
serve as the input variables for the Random Forest model.

● Random Forest Model Architecture:


○ Random Forest is an ensemble learning method that builds multiple decision trees during
training. Each tree is trained on a bootstrap sample of the training data and considers
only a random subset of the features at each split, which decorrelates the trees and
makes the ensemble more robust to overfitting than a single tree. At each node in a
decision tree, the algorithm chooses the best feature split based on criteria like Gini
impurity or entropy.

● Training the Random Forest:


○ The model creates many decision trees, each trained independently on different random
subsets of the data. Each tree produces a classification (rain/no rain), and the Random
Forest combines their predictions through majority voting to make the final decision.

● Model Training:
○ Historical satellite data and corresponding rainfall measurements are used to train the
forest. The trees learn different aspects of the atmospheric conditions that could predict
heavy rainfall by exploring various splits and combinations of features.

● Prediction:
○ Once trained, the Random Forest uses the collective predictions of its decision trees to
classify new satellite data as either heavy rain or no heavy rain. Each tree votes for a
class and the most common classification is output (majority vote); some
implementations, such as scikit-learn, instead average the trees' class probabilities and
pick the most probable class.

● Model Evaluation:
○ The Random Forest model is evaluated using metrics like accuracy, precision, recall, and
F1-score to assess how well it predicts rainfall events. Cross-validation is often used to
ensure the model’s performance is consistent across different subsets of the data.
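
A short sketch of the Random Forest steps above, again on synthetic data from make_classification
rather than the project's dataset. The hyperparameters are assumptions chosen for illustration. The
example also recomputes the ensemble decision by hand from the individual trees, to show how per-tree
outputs are combined.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=6, n_informative=4,
                           random_state=21)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=21)

# Bootstrap samples per tree, random feature subset per split, Gini criterion.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                criterion="gini", bootstrap=True,
                                random_state=21)
forest.fit(X_train, y_train)
forest_pred = forest.predict(X_test)

# Hard majority vote: each tree predicts a class, the most common one wins.
tree_votes = np.array([tree.predict(X_test) for tree in forest.estimators_])
hard_vote = (tree_votes.mean(axis=0) > 0.5).astype(int)

# Soft vote: average the per-tree class probabilities, then take the argmax.
mean_proba = np.mean([tree.predict_proba(X_test)
                      for tree in forest.estimators_], axis=0)
soft_vote = forest.classes_[mean_proba.argmax(axis=1)]

agreement = (hard_vote == forest_pred).mean()
print("hard vote agrees with forest.predict on", round(agreement * 100, 1), "% of samples")
print("soft vote matches forest.predict:", bool((soft_vote == forest_pred).all()))
```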
3.2.3 SHAP ANALYSIS FOR MODEL EXPLAINABILITY

1. SHAP Analysis for SVM Predictions

● Model Agnostic Explanation:

○ Since SVM is a black-box model, SHAP is used to make the model’s decision process
more interpretable. SHAP assigns each feature an importance score, showing its
contribution to the prediction (e.g., whether a particular cloud feature contributed
positively or negatively to forecasting heavy rainfall).

● Post-Training SHAP Application:

○ After training the SVM model on the satellite data, SHAP values are computed for each
feature. SHAP uses Shapley values from cooperative game theory, where each feature is
considered a "player" contributing to the final prediction.

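○ The Shapley value of a feature is its average marginal contribution over all possible
orderings (equivalently, all subsets) of the other features. Because the number of
subsets grows exponentially with the number of features, SHAP approximates this
average in practice for models such as SVM. The result shows how the presence or
absence of a feature affects the model’s decision.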

● Interpretation of SHAP Values:

○ A positive SHAP value indicates that the feature increased the likelihood of a certain
class (e.g., heavy rain), while a negative SHAP value suggests it decreased the likelihood
of that class.

○ For instance, a high positive SHAP value attached to a low cloud top temperature may
indicate that colder, and therefore higher, cloud tops are strongly associated with the
occurrence of heavy rainfall.

● Global and Local Interpretability:

○ Global interpretability: By aggregating SHAP values across all instances, you can
identify which features are generally most influential in predicting heavy rainfall. For
example, cloud cover or atmospheric moisture might consistently have high SHAP
values.

○ Local interpretability: SHAP can also provide explanations for individual predictions,
helping to understand why a specific instance was classified as "rain" or "no rain." For
instance, if an instance is predicted as heavy rain, SHAP can show that the cloud motion
feature was particularly influential for that prediction.
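
To make the game-theoretic definition concrete, the sketch below computes exact Shapley values for a
three-feature toy function by enumerating every subset of the other features. The toy model, the
instance x, and the all-zero baseline are hypothetical; "absent" features are simply replaced by their
baseline value, which is one common convention. Real SHAP explainers avoid this exhaustive enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values for a set function value_fn(frozenset) -> float."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley average.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i when added to coalition s.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


# Hypothetical toy model with one interaction term between features 0 and 2.
def model(v):
    return 2.0 * v[0] + 1.0 * v[1] + 3.0 * v[0] * v[2]


x = [1.0, 2.0, 1.0]          # instance being explained
baseline = [0.0, 0.0, 0.0]   # reference values used for "absent" features


def value_fn(present):
    mixed = [x[j] if j in present else baseline[j] for j in range(len(x))]
    return model(mixed)


phi = shapley_values(value_fn, len(x))
print([round(p, 6) for p in phi])  # [3.5, 2.0, 1.5]
print(round(sum(phi), 6), model(x) - model(baseline))  # both 7.0
```

The interaction term 3·x0·x2 contributes 3 to the output and is split equally between features 0 and 2,
and the values sum to the difference between the prediction and the baseline prediction.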
2. SHAP Analysis for Random Forest Predictions

● Model Agnostic Explanation:

○ Although each decision tree can be read directly, a Random Forest of hundreds of trees
is difficult to interpret as a whole. SHAP can be used to break down the contribution of
each feature across all the trees in the forest and provide a more transparent view of the
decision process.

● Post-Training SHAP Application:

○ After the Random Forest model is trained, SHAP values are computed for the predictions
made by the ensemble of decision trees. Each decision tree contributes to the final
classification, and SHAP attributes the resulting prediction to the input features rather
than to individual trees.

○ SHAP works by computing the contribution of each feature across all trees in the forest,
treating the feature as a "player" in the ensemble, and evaluating its importance relative
to the other features.

● Interpretation of SHAP Values:

○ Similar to SVM, SHAP values for Random Forest provide insights into feature
importance. Features with high SHAP values are seen as more influential in the model’s
decision to classify a data point (e.g., whether it will rain or not). For instance, cloud
cover or humidity levels might have a high SHAP value in predicting rainfall.

○ A positive SHAP value for a feature indicates its contribution toward predicting a
positive outcome (heavy rain), while a negative SHAP value suggests it worked against
predicting that outcome.

● Global and Local Interpretability:

○ Global interpretability: Averaging the absolute SHAP values of each feature over many
predictions gives a clear picture of which features are the most important in predicting
rainfall. Features that consistently shift the model output (e.g., cloud top temperature,
atmospheric pressure) will show high mean absolute SHAP values.

○ Local interpretability: For any given prediction, SHAP can show the exact contribution
of each feature for that specific instance. For example, for a data point where the model
predicts "rain," SHAP will show how much each feature (such as water vapor, cloud
cover, or wind speed) contributed to the final prediction.
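
The difference between global and local interpretation comes down to how a matrix of SHAP values
(one row per prediction, one column per feature) is read. The sketch below illustrates this with a
linear score model as a stand-in, because for a linear model with features treated as independent the
SHAP value of feature i is simply w_i · (x_i − mean(x_i)), so no SHAP library is needed. The feature
names, weights, and data are hypothetical; the same aggregation applies to SHAP values obtained from a
Random Forest.

```python
import numpy as np

feature_names = ["water_vapor", "cloud_cover", "wind_speed"]  # hypothetical
rng = np.random.default_rng(21)
X = rng.normal(size=(200, 3))            # synthetic, already standardized
weights = np.array([1.5, -0.5, 0.2])     # hypothetical linear score model
bias = 0.1


def score(data):
    return data @ weights + bias


# SHAP values of a linear model: w_i * (x_i - mean(x_i)) per sample and feature.
base_value = score(X).mean()
shap_matrix = weights * (X - X.mean(axis=0))

# Global view: mean absolute SHAP value per feature, ranked.
global_importance = np.abs(shap_matrix).mean(axis=0)
ranking = [feature_names[i] for i in np.argsort(global_importance)[::-1]]
print("global ranking:", ranking)

# Local view: signed contributions for one specific prediction.
row = 0
local_contrib = dict(zip(feature_names, shap_matrix[row].round(3)))
print("local contributions for sample 0:", local_contrib)

# Additivity: base value plus the row's SHAP values recovers the model output.
reconstructed = base_value + shap_matrix[row].sum()
print(np.isclose(reconstructed, score(X[row])))
```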
3. Model-Specific Considerations:

● SVM with SHAP:

○ With a linear kernel, the SVM coefficients can be read directly; with a non-linear kernel
they cannot, and SHAP still provides insights into how each feature moves a prediction
relative to the decision boundary.

○ The kernel trick used in SVM (e.g., RBF kernel) complicates the interpretation. SHAP
handles this case with a model-agnostic, sampling-based approximation (Kernel SHAP),
which can be slow on large datasets, so explanations are usually computed on a sample
of the data.

● Random Forest with SHAP:

○ Individual decision trees are easy to inspect, but a forest of many trees is not. For tree
ensembles, SHAP offers a tree-specific algorithm (Tree SHAP) that computes SHAP
values exactly and efficiently by following the splits in each tree, quantifying how the
feature choices along the decision paths contribute to final predictions.

○ The ensemble nature of Random Forest means that SHAP values give a comprehensive
view of feature importance across a large number of decision paths, making it especially
useful for understanding model behavior on complex datasets like satellite imagery.
CHAPTER 4

RESULT AND DISCUSSION

Figure 2: Annual Rainfall Data

Figure 3: Annual Rainfall Range Plot

Figure 4: Algorithm Accuracy Bar Plot

Figure 5: Prediction vs Actual data ROC

Figure 6: Random Forest Result Comparison

Figure 7: Dataset Parameters

Figure 8: SVM Prediction Accuracy Parameters

Figure 9: Month-wise Rainfall Data

Figure 10: Average impact SHAP value

Figure 11: Random Forest Classifier Error

Figure 12: ROC curves for RFC

Figure 13: RFC Prediction Accuracy Parameters

Figure 14: RFC Confusion Matrix

Figure 15: SVM Classification Graph
CHAPTER 5

CONCLUSION AND FUTURE ENHANCEMENT

Developing an Explainable AI (XAI) model for predicting high-impact rainfall events using satellite
data offers a vital tool for enhancing early warning systems and improving decision-making in
meteorology and disaster preparedness. By leveraging satellite data from systems like INSAT-3DR and
applying explainability techniques such as SHAP and LIME, this project provides not only accurate
rainfall predictions but also insights into key predictive factors. Additionally, we have designed a model
that interprets and analyzes its results, offering a clear understanding of the variables driving each
prediction. This transparency builds trust in model outputs and empowers stakeholders to take timely,
informed actions in response to potential extreme weather.

FUTURE ENHANCEMENT:

● Incorporating Additional Data: Adding data from sources like radar and historical weather
patterns for enhanced robustness.

● Exploring Advanced Architectures: Testing models such as CNN-LSTM hybrids to better
capture temporal and spatial patterns in rainfall.

● Implementing Real-Time Updating: Continuously retraining the model with new data to
maintain accuracy.

● Expanding Interpretability Tools: Using tools like Counterfactual Explanations to deepen
insight into predictions.

● Enhancing User Interface: Adding features for real-time alerts and interactive visualizations
for better accessibility to non-technical users.
CHAPTER 6

REFERENCES

● E. Collini, L. A. I. Palesi, P. Nesi, G. Pantaleo, N. Nocentini and A. Rosi, "Predicting and
Understanding Landslide Events With Explainable AI," in IEEE Access, vol. 10, pp.
31175-31189, 2022, doi: 10.1109/ACCESS.2022.3158328.

● A.S. Albahri, Yahya Layth Khaleel, Mustafa Abdulfattah Habeeb, Reem D. Ismael, Qabas A.
Hameed, Muhammet Deveci, Raad Z. Homod, O.S. Albahri, A.H. Alamoodi, Laith Alzubaidi, A
systematic review of trustworthy artificial intelligence applications in natural
disasters, Computers and Electrical Engineering, Volume 118, Part B, 2024.

● Başağaoğlu, Hakan, Debaditya Chakraborty, Cesar Do Lago, Lilianna Gutierrez, Mehmet Arif
Şahinli, Marcio Giacomoni, Chad Furl, Ali Mirchi, Daniel Moriasi, and Sema Sevinç Şengör.
2022. "A Review on Interpretable and Explainable Artificial Intelligence in Hydroclimatic
Applications" Water 14, no. 8: 1230. https://doi.org/10.3390/w14081230

● Sarmad Dashti Latif, Nur Alyaa Binti Hazrin, Chai Hoon Koo, Jing Lin Ng, Barkha Chaplot,
Yuk Feng Huang, Ahmed El-Shafie, Ali Najah Ahmed, Assessing rainfall prediction models:
Exploring the advantages of machine learning and remote sensing approaches, Alexandria
Engineering Journal, Volume 82, 2023.

● Ali Ulvi Galip Senocak, M. Tugrul Yilmaz, Sinan Kalkan, Ismail Yucel, Muhammad Amjad, An
explainable two-stage machine learning approach for precipitation forecast, Journal of
Hydrology, Volume 627, Part A, 2023, 130375, ISSN 0022-1694.


CHAPTER 7

APPENDIX
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from IPython.display import display
import time

import warnings
warnings.filterwarnings('ignore')

# for data preprocessing
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import (accuracy_score, roc_auc_score, classification_report,
                             confusion_matrix, cohen_kappa_score, roc_curve,
                             ConfusionMatrixDisplay)

# import different classifiers


from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import lightgbm
import catboost
import xgboost
from sklearn.neural_network import MLPClassifier

df = pd.read_csv("/kaggle/input/weather-dataset-rattle-package/weatherAUS.csv")

def data_explore(df):
    """Print a quick structural overview of the dataset."""
    display(df.head())
    print("*" * 30)
    print(f"shape of dataset {df.shape}")
    print("*" * 30)
    df.info()  # info() prints its report directly and returns None
    print("*" * 30)
    print("Dtypes: \n{}".format(df.dtypes.value_counts()))
    print("*" * 30)
    print(df.columns)
    print("*" * 30)
    print("Number of columns having null values: ", df.isnull().any().sum())

data_explore(df)
# describe for all numeric variables
df.describe().T
# describe for all categorical variables
df.describe(include=['object']).T

# data type plots
fig, ax = plt.subplots(1, 2, figsize = (12,6))

df.dtypes.value_counts().plot.pie(explode = [0.05,0.05], autopct = "%1.0f%%",
                                  shadow = True, ax = ax[1])
ax[1].set_title("datatype")

df.dtypes.value_counts().plot(kind = 'bar', ax = ax[0])
ax[0].set_title("datatype")

display(df['RainTomorrow'].value_counts())
display(df['RainTomorrow'].value_counts() * 100 / len(df))

df['RainTomorrow'].value_counts().plot(kind = 'bar', color = ['skyblue', 'navy'], rot = 0)

# conversion of target variable from categorical to numeric


df['RainTomorrow'] = df['RainTomorrow'].map({'No': 0, 'Yes': 1})

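# frequency plot of each categorical variable
# (cat_cols, the list of categorical column names, is defined elsewhere in the notebook)
plt.figure(figsize = (20,8))
for i, col in enumerate(cat_cols[1:]):
    plt.subplot(2, 3, i+1)
    sns.countplot(data = df, x = col)  # keyword form; a bare positional Series is ambiguous in recent seaborn
    plt.xticks(rotation = 90)
    plt.title(f"{col} has {df[col].nunique()} unique values")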

df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df[['Date', 'Year', 'Month', 'Day']].head()

# box plot of numerical variables
# (num_cols, the list of numeric column names, is defined elsewhere in the notebook)
plt.figure(figsize = (15,12))
for i, col in enumerate(num_cols):
    plt.subplot(4, 4, i+1)
    sns.boxplot(data = df, y = col, whis = 3)
    plt.title(col)

outlier_cols = ['Rainfall', 'Evaporation', 'WindSpeed9am', 'WindSpeed3pm']

# histogram plot to check distribution
plt.figure(figsize = (12,10))
for i, col in enumerate(outlier_cols):
    plt.subplot(2, 2, i+1)
    sns.histplot(data = df, x = col, bins = 20)
    plt.title(col)

def IQR(df, out_cols):
    """Print 3*IQR outlier bounds and the share of values above the upper bound."""
    for col in out_cols:
        iqr = df[col].quantile(0.75) - df[col].quantile(0.25)
        lower = df[col].quantile(0.25) - (iqr * 3)
        upper = df[col].quantile(0.75) + (iqr * 3)
        outlier_percent = round((df[df[col] > upper].shape[0] * 100)/len(df), 2)
        print(col, '\t', lower.round(2), '\t', upper.round(2),
              '\t', df[col].min(), '\t', df[col].max(), '\t',
              outlier_percent)

print('column \t\t lower \t high \t min \t max \t outlier_percent')
IQR(df, outlier_cols)

# Heatmap
sns.set_context('notebook', font_scale=1.0, rc = {'lines.linewidth': 2.5})
plt.figure(figsize = (15,12))

# correlations are defined for numeric columns only
# (recent pandas versions raise on non-numeric columns unless numeric_only is set)
corr = df.corr(numeric_only = True)

# mask the duplicate correlation values (upper triangle)
mask = np.zeros_like(corr, dtype = bool)
mask[np.triu_indices_from(mask, 1)] = True

a = sns.heatmap(corr, mask = mask, annot=True, fmt = '.2f', cmap = 'viridis')

rotx = a.set_xticklabels(a.get_xticklabels(), rotation = 90)
roty = a.set_yticklabels(a.get_yticklabels(), rotation = 30)

# Pair Plot for highly correlated variables
sns.pairplot(data = df,
             vars = ['MinTemp', 'MaxTemp', 'Temp9am', 'Temp3pm',
                     'WindGustSpeed', 'WindSpeed3pm', 'Pressure9am', 'Pressure3pm'],
             kind = 'scatter',
             diag_kind = 'hist',
             hue = 'RainTomorrow')

no = df[df['RainTomorrow'] == 0]
yes = df[df['RainTomorrow'] == 1]

yes_os = resample(yes, replace = True, n_samples=len(no), random_state=21)

df_os = pd.concat([no, yes_os])


print(df_os.shape)

fig = plt.figure(figsize = (8,5))
df_os['RainTomorrow'].value_counts(normalize = True).plot(kind = 'bar',
                                                          color = ['skyblue', 'navy'],
                                                          alpha = 0.9,
                                                          rot = 0)
plt.title('balanced dataset')

# categorical and numeric missing values in train and test datasets

cat_miss = pd.concat([pd.DataFrame(X_train[cat_cols].isnull().sum()),
pd.DataFrame(X_test[cat_cols].isnull().sum())],
axis = 1)
num_miss = pd.concat([pd.DataFrame(X_train[num_cols].isnull().sum()),
pd.DataFrame(X_test[num_cols].isnull().sum())],
axis = 1)
cat_miss.columns = ['train', 'test']
num_miss.columns = ['train', 'test']
display(cat_miss)
display(num_miss)

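# General method for model training
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay

def model_run(clf, X_train, y_train, X_test, y_test, verbose = 1):
    """Fit clf, print train/test metrics, and plot the ROC curve and confusion matrix."""
    tic = time.time()

    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    y_pred_train = clf.predict(X_train)
    # probability of the positive class (SVC needs probability=True to provide this)
    probs = clf.predict_proba(X_test)[:,1]

    accuracy_train = accuracy_score(y_train, y_pred_train)
    accuracy_test = accuracy_score(y_test, y_pred)
    # ROC AUC is computed from the predicted scores rather than the hard labels
    roc_auc = roc_auc_score(y_test, probs)
    coh_kap = cohen_kappa_score(y_test, y_pred)

    toc = time.time()
    time_taken = toc-tic

    print("Training Accuracy = {}".format(accuracy_train.round(2) * 100))
    print("Test Accuracy = {}".format(accuracy_test.round(2) * 100))
    print("ROC Area under Curve = {}".format(roc_auc.round(2)))
    print("Cohen's Kappa = {}".format(coh_kap.round(2)))
    print("Time taken = {}".format(time_taken))
    print(classification_report(y_test,y_pred,digits=5))

    # ROC curve (the plot_roc_curve helper called originally is not defined in this listing)
    fpr, tpr, threshold = roc_curve(y_test,probs)
    RocCurveDisplay(fpr = fpr, tpr = tpr, roc_auc = roc_auc).plot()

    # replaces plot_confusion_matrix, which was removed in scikit-learn 1.2
    ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test, cmap=plt.cm.Blues,
                                          normalize='all')

    return clf, accuracy_train,accuracy_test, roc_auc, coh_kap, time_taken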


# Decision Tree
param_dt = {'max_depth': 16, 'max_features': 'sqrt'}

clf_dt = DecisionTreeClassifier(**param_dt)
clf_dt, acc_tr_dt, acc_dt, roc_dt, coh_kap_dt, time_dt = model_run(
    clf_dt, X_train, y_train, X_test, y_test)

# Random Forest
params_rf = {'max_depth': 16,
             'min_samples_leaf': 1,
             'min_samples_split': 2,
             'n_estimators': 200,
             'random_state': 21}
clf_rf = RandomForestClassifier(**params_rf)
clf_rf, acc_tr_rf, acc_rf, roc_rf, coh_kap_rf, time_rf = model_run(
    clf_rf, X_train, y_train, X_test, y_test)

# calculating the most important features
# (clf_xgb, params_xgb and cols come from the XGBoost run, which is defined elsewhere in the notebook)
importance = clf_xgb.feature_importances_

feat_imp_df = pd.DataFrame({'Features': cols, 'Importance': importance})

# keep features with importance >= 0.01, sorted from most to least important
f_imp = (feat_imp_df[feat_imp_df['Importance'] >= 0.01]
         .sort_values('Importance', ascending=False)
         .reset_index(drop=True))
f_imp.shape

# model performance on important features
imp_feat = f_imp['Features']

clf_xgb_imp = xgboost.XGBClassifier(**params_xgb)
clf_xgb_imp, acc_tr_xgb_imp, acc_xgb_imp, roc_xgb_imp, coh_kap_xgb_imp, time_xgb_imp = model_run(
    clf_xgb_imp, X_train[imp_feat], y_train, X_test[imp_feat], y_test)

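import numpy as np
import shap
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
import pandas as pd

# a regressor fitted on the 0/1 target, so its output reads as a rain score
model = RandomForestRegressor()
model.fit(X_train, y_train)
X_test_sample = X_test.sample(100, random_state=42)  # Adjust sample size as needed

explainer = shap.TreeExplainer(model)
# for a regressor this is a single array of shape (n_samples, n_features)
shap_values = explainer.shap_values(X_test_sample)

shap.summary_plot(shap_values, X_test_sample, plot_type="bar", max_display=10)

# a regressor has one output, so there is no per-class index to select
base_value = np.ravel(explainer.expected_value)[0]
shap.force_plot(base_value, shap_values[0, :], X_test_sample.iloc[0, :])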
