ML - Report
Submitted by
Saket Gudimella [RA2211026010111]
Vasan Lennin [RA2211026010106]
R Siddharth [RA2211026010079]
S Veena Maheswari [RA2211026010123]
Under the Guidance of
Dr. M.S.Abirami
Associate Professor
Department of Computational Intelligence
BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE ENGINEERING
with specialization in
ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING
SCHOOL OF COMPUTING
COLLEGE OF ENGINEERING AND TECHNOLOGY
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
KATTANKULATHUR- 603 203
NOVEMBER 2024
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY
KATTANKULATHUR – 603 203
BONAFIDE CERTIFICATE
SIGNATURE
Dr. M.S.Abirami
Course Faculty
Associate Professor
Department of Computational Intelligence
SRM Institute of Science and Technology
Kattankulathur

SIGNATURE
Dr. R.Annie Uthra
Head of the Department
Professor
Department of Computational Intelligence
SRM Institute of Science and Technology
Kattankulathur
ABSTRACT
With the increasing global frequency and severity of extreme weather events, accurate and timely
predictions of heavy rainfall are essential for effective disaster management and risk mitigation.
This project, titled Development of an Explainable AI (XAI) based Model for Prediction of
Heavy/High Impact Rain Events Using Satellite Data, addresses the challenge of forecasting
high-impact rainfall events by developing a machine learning model that not only achieves high
predictive accuracy but also emphasizes transparency through explainable AI (XAI) techniques.
Utilizing satellite data, this model integrates a range of meteorological parameters, including
atmospheric pressure, temperature, humidity, wind speed, and cloud cover, to create a
comprehensive dataset for predicting rainfall events. The project investigates various machine
learning algorithms, such as Support Vector Machines (SVM) and Random Forest, to identify the
most suitable approach for predicting intense rain events. Each algorithm is rigorously evaluated for
its predictive performance, enabling a comparative analysis to determine the most accurate model.
A significant aspect of this project is its focus on explainability, achieved through SHapley Additive
exPlanations (SHAP) analysis. SHAP values are calculated for each model, providing insights into
the contribution of individual features (such as temperature fluctuations, cloud density, or humidity
levels) to the prediction outcome. This level of interpretability is critical in supporting
meteorologists and disaster management teams, as it allows them to understand the reasoning
behind each prediction, ultimately fostering trust and transparency in AI-driven forecasts.
The project demonstrates the efficacy of combining machine learning with XAI techniques in
improving early warning systems for high-impact rain events. The insights derived from the SHAP
analysis aid in understanding complex weather patterns, thereby enabling proactive and data-driven
decision-making. By prioritizing both predictive accuracy and interpretability, this project offers a
promising approach to enhancing preparedness, minimizing risks, and reducing the socioeconomic
impacts of heavy rainfall events through explainable, reliable AI-driven insights.
TABLE OF CONTENTS
S NO CHAPTER
1 INTRODUCTION
1.1 Background and Motivation
1.2 Objectives of the Study
1.3 Challenges in Heavy Rain Prediction
1.4 Importance of Explainable AI in Meteorology
1.5 Software Requirements Specification
2 LITERATURE SURVEY
2.1 Overview of Rainfall Prediction Models
2.2 Explainable AI in Meteorological Applications
2.3 Comparative Analysis of Machine Learning Techniques
2.4 Review of Satellite Data Sources for Meteorology
2.5 Summary and Key Takeaways
3 METHODOLOGY
3.1 Data Collection and Preprocessing
3.1.1 Satellite Data Acquisition and Processing
3.1.2 Feature Engineering and Selection
3.1.3 Data Splitting and Normalization
3.2 Model Architecture and Design
3.2.1 Support Vector Machine (SVM) Implementation
3.2.2 Random Forest Implementation
3.2.3 SHAP Analysis for Model Explainability
6 REFERENCES
LIST OF FIGURES
1 Architecture diagram
2 Annual Rainfall Data
3 Annual Rainfall Range Plot
4 Algorithm Accuracy Bar Plot
5 Prediction vs Actual Data ROC
6 Random Forest Result Comparison
7 Dataset Parameters
8 SVM Prediction Accuracy Parameters
9 Month-wise Rainfall Data
10 Average impact SHAP value
11 Random Forest Classifier Error
12 ROC curves for RFC
13 RFC Prediction Accuracy Parameters
14 RFC Confusion Matrix
15 SVM Classification Graph
LIST OF TABLES
1 Literature Survey
ABBREVIATIONS
CHAPTER 1
INTRODUCTION
1.1 BACKGROUND AND MOTIVATION
Extreme rainfall events pose a serious challenge for meteorology and disaster response, often
leading to severe floods, landslides, and widespread infrastructure damage. The increasing
frequency and intensity of such events, largely influenced by climate change, underscores the need
for accurate and rapid forecasting. Traditional forecasting models, while valuable, sometimes lack
transparency in identifying the factors that drive predictions, which can make interpretation
challenging. However, advancements in space technology, including high-resolution satellite data
from platforms like INSAT-3D/3DR, provide a valuable resource for enhancing the accuracy and
timeliness of rainfall forecasts. Effectively translating this satellite data into actionable insights
requires advanced machine learning models that balance both accuracy and interpretability.
The motivation behind developing an Explainable AI (XAI) model for rainfall nowcasting stems
from the dual need for precision and transparency in high-stakes weather forecasting. Conventional
"black box" AI models, despite their predictive power, often lack interpretability, leaving users and
stakeholders uncertain about the underlying rationale—especially in situations where critical
decisions need to be made. This project aims to address this by integrating explainable features into
the model, offering clarity on key predictors and flagging potential conditions where model accuracy
might falter. This approach will help build trust in the model's predictions and empower
meteorologists, disaster response teams, and policymakers to make informed decisions. Ultimately,
the project will deliver a user-friendly web application that not only provides rainfall forecasts but
also reveals the AI model’s reasoning, making its outputs easier to understand and more reliable for
end-users.
1.2 OBJECTIVES OF THE STUDY
● Develop a Predictive Model for Heavy Rainfall: Create an AI model that accurately forecasts
heavy rain events using satellite data, enhancing early warning systems.
● Integrate Explainability: Ensure the model is explainable, providing insights into how it arrives
at its predictions to support transparency and trust.
● Identify Key Influencing Factors: Highlight the main factors contributing to heavy rainfall
predictions, allowing meteorologists to understand the drivers behind severe rain forecasts.
● Support Decision-Making for Disaster Preparedness: Provide weather experts with clear,
actionable insights to improve disaster readiness and response strategies.
● Enhance Early Warning Systems: Contribute to more reliable and understandable early
warnings for severe rainfall, aiding proactive disaster management.
1.3 CHALLENGES IN HEAVY RAIN PREDICTION
● Data Quality and Preprocessing: Satellite data often contains missing values, noise, and
inconsistencies, which require extensive cleaning and preprocessing. Ensuring the data is
properly normalized and formatted for use in the model is a complex and
resource-intensive task.
● Complexity of Weather Systems: The factors influencing heavy rainfall are numerous and
highly dynamic, making it difficult to capture all the relevant temporal and spatial patterns
accurately. Modeling these complex interactions to improve prediction accuracy is a major
challenge.
● Balancing Accuracy with Interpretability: While highly accurate models can be complex and
opaque, the goal is to create a model that is both precise and interpretable. Striking this balance
between performance and explainability presents a technical challenge, as more complex models
often sacrifice transparency.
● Real-Time Forecasting: To be useful for early warning systems, the model must provide
predictions in real time. This requires optimizing the model for speed and low-latency
processing, ensuring that predictions are available when needed without sacrificing accuracy.
● Handling Model Limitations: Identifying conditions where the model may struggle or fail to
make accurate predictions is critical for improving its reliability. Developing strategies to
manage these limitations and minimize errors is an ongoing challenge.
● Designing an Accessible Interface: Creating a user-friendly web application that presents both
predictions and interpretability insights in a clear, concise manner is essential. Ensuring that the
interface is intuitive for meteorologists and non-experts alike requires thoughtful design and
testing.
1.4 IMPORTANCE OF EXPLAINABLE AI IN METEOROLOGY
Furthermore, XAI supports model validation by identifying potential errors or biases, allowing for
timely corrections and refinements. This is particularly important in high-stakes scenarios, such as
predicting heavy rainfall or storms, where understanding the model’s reasoning is essential for
effective disaster preparedness and response.
1.5 SOFTWARE REQUIREMENTS SPECIFICATION
● Classifiers:
○ sklearn.linear_model.LogisticRegression
○ sklearn.neighbors.KNeighborsClassifier
○ sklearn.svm.SVC
○ sklearn.tree.DecisionTreeClassifier
○ sklearn.ensemble.RandomForestClassifier
○ sklearn.neural_network.MLPClassifier
○ lightgbm
○ catboost
○ xgboost
● Custom Data Exploration: data_explore() function for dataset overview, info, and column
analysis.
● Outlier Detection: IQR() function for outlier thresholding based on the interquartile range
(IQR).
● Missing Value Imputation: Manual imputation of missing values in categorical and numerical
columns.
● Outlier Treatment: max_value() function for capping outliers.
● Generalized Model Training Function: model_run() function to train models, predict, and
output performance metrics.
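The outlier detection and capping steps listed above can be sketched as follows. This is a minimal illustration, not the project's own IQR() or max_value() implementation, which is not shown here: the function names iqr_bounds and cap_outliers, the multiplier k = 1.5, and the sample rainfall values are all assumptions made for the example.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) outlier thresholds from the interquartile range."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside the IQR thresholds to the nearest threshold."""
    lower, upper = iqr_bounds(series, k)
    return series.clip(lower=lower, upper=upper)


# Hypothetical daily rainfall values in mm; 120.0 is an obvious outlier.
rainfall = pd.Series([0.0, 1.2, 0.4, 3.1, 2.2, 0.0, 120.0, 1.8])
lower, upper = iqr_bounds(rainfall)
capped = cap_outliers(rainfall)
print(f"thresholds: [{lower:.4f}, {upper:.4f}]")
print(capped.tolist())
```

Capping keeps the row in the dataset instead of dropping it. That may matter for heavy-rain prediction, where extreme values can be genuine events and not measurement errors.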
CHAPTER 2
LITERATURE SURVEY
2.1 OVERVIEW OF RAINFALL PREDICTION MODELS
Rainfall prediction models are essential in meteorology and climate science, offering insights for
agriculture, water resource management, and disaster prevention. These models can be broadly
categorized into statistical, dynamical, and hybrid models. Statistical models leverage historical rainfall
data and other meteorological variables to identify patterns and trends. Techniques such as regression
analysis, autoregressive integrated moving average (ARIMA), and machine learning algorithms like
support vector machines (SVM) and artificial neural networks (ANN) are commonly used. These
models are computationally efficient and perform well with localized, short-term predictions but may
struggle with long-term forecasts or in regions lacking extensive historical data.
Dynamical models, in contrast, use physical laws and equations that govern atmospheric behavior to
simulate climate conditions. Numerical Weather Prediction (NWP) models are a widely used example,
incorporating data from satellites and radar to model large-scale weather systems. NWP models such as
the Weather Research and Forecasting (WRF) model and the European Centre for Medium-Range
Weather Forecasts (ECMWF) model are prominent examples. While these models provide a more
comprehensive understanding of rainfall distribution and atmospheric dynamics, they are
computationally intensive. Recent advancements have seen the development of hybrid models,
combining statistical methods with dynamical models to achieve better accuracy. The integration of
machine learning with NWP models is also promising, as it can enhance real-time adaptability and
improve prediction accuracy across diverse geographical areas.
2.2 EXPLAINABLE AI IN METEOROLOGICAL APPLICATIONS
In weather forecasting, XAI models provide clarity on the key atmospheric factors driving predictions,
such as temperature, humidity, and wind patterns. For example, in heavy rainfall prediction, XAI
techniques like SHAP or LIME can explain which variables (e.g., cloud cover, moisture levels)
contributed most to a rainfall forecast, improving prediction transparency.
XAI is also valuable in severe weather event prediction, such as hurricanes or tornadoes. It helps
meteorologists understand why certain conditions are likely to lead to extreme events, making forecasts
more actionable. In the case of climate modeling, XAI techniques reveal which factors (e.g., CO2
emissions or solar radiation) are influencing long-term climate changes, aiding policy decisions.
Moreover, in disaster management, XAI ensures that weather predictions for events like floods are
understandable, allowing timely and informed decisions. By explaining why a model has issued a
warning, XAI increases trust in early warning systems, helping authorities respond effectively.
Overall, XAI enhances the transparency and reliability of meteorological predictions and, by exposing
model errors and biases, can also support improvements in accuracy. This makes forecasts more
actionable and trustworthy for decision-makers, ultimately improving public safety and climate resilience.
2.3 COMPARATIVE ANALYSIS OF MACHINE LEARNING
TECHNIQUES
Machine learning techniques like Support Vector Machines (SVM) and Random Forest have shown
significant improvements over traditional rainfall prediction models, particularly in handling complex,
nonlinear patterns in meteorological data. Traditional statistical models, such as multiple linear
regression and autoregressive models, rely on predefined relationships and often struggle with the
intricate and variable dynamics of rainfall. SVM and Random Forest, however, are capable of
identifying complex data patterns without extensive parameter assumptions. SVM is particularly
effective for classification problems, separating data into distinct categories using hyperplanes, while
Random Forest leverages an ensemble of decision trees to improve prediction accuracy and reduce
overfitting, making it robust for rainfall prediction tasks.
Compared to traditional models, SVM and Random Forest are better equipped to handle
high-dimensional data, often found in meteorological datasets due to the range of influencing factors
such as temperature, humidity, and atmospheric pressure. Traditional models may underperform when
numerous interdependent variables are present, as they struggle to model the nonlinear relationships
effectively. Machine learning models, by contrast, can adapt to complex dependencies and identify key
variables automatically, thus enhancing prediction precision. Random Forest, with its ensemble
approach, captures a broader set of features and reduces variance, which can lead to more accurate
predictions even with noisy data. This flexibility has made machine learning models more effective and
widely adopted in recent rainfall prediction studies.
In addition, the integration of SHAP (SHapley Additive exPlanations) for explainable AI (XAI) offers
new insights into machine learning predictions, addressing a key limitation of traditional models and
even early machine learning models, which often act as black boxes. SHAP values allow for an
interpretable analysis of feature importance, offering transparency into how each factor influences
model predictions. This capability is crucial for meteorology, where understanding the underlying
factors behind rainfall predictions is important for decision-making in policy, agriculture, and disaster
management. By using SHAP in conjunction with SVM and Random Forest, modern rainfall prediction
models not only achieve higher accuracy but also provide actionable insights into the contributing
factors, making them superior to traditional statistical models in both predictive power and
interpretability.
Table 1: Literature Survey
Paper Ref No. | Journal Name & Year | Paper Title | Techniques Used / Methodology | Inferences
6 | IEEE Access, 2024 | Integrating XAI with Convolutional Neural Networks for Predicting Severe Weather Events | CNN, SHAP, LIME, Radar and Satellite Data Integration | Integrating XAI techniques with CNN models makes the decision-making process more interpretable, enhancing user confidence in predictions.
2.4 REVIEW OF SATELLITE DATA SOURCES FOR METEOROLOGY
The INSAT system, developed and operated by the Indian Space Research Organisation (ISRO),
consists of a series of multipurpose geostationary satellites. It plays a vital role in providing real-time
meteorological data for weather forecasting, disaster management, and communication, particularly for
India and its surrounding regions.
INSAT-3D/3DR: These advanced satellites are equipped with state-of-the-art imaging systems that
capture data related to cloud cover, sea surface temperatures, and various atmospheric parameters. The
high-resolution data from INSAT-3D/3DR is particularly beneficial for short-term weather forecasting,
such as predicting heavy rainfall, thunderstorms, and cyclones. Additionally, these satellites generate
imagery of the Earth's surface, which helps in analyzing weather systems such as monsoons and tropical
storms.
Key Parameters: Cloud imagery, sea surface temperature, atmospheric water vapor levels, wind patterns,
and rainfall estimation.
Strengths:
- Continuous, real-time data from a geostationary orbit allows for frequent and consistent monitoring of
weather conditions.
- High temporal resolution ensures timely observations, which are crucial for real-time weather
forecasting and early warning systems.
- Ideal for monitoring regional and tropical weather phenomena, making it highly effective for
forecasting monsoons, cyclones, and heavy rainfall.
Limitations:
- The spatial resolution is lower compared to polar-orbiting satellites, limiting the ability to capture fine
details of local weather events.
- As a geostationary satellite system, it observes only the fixed portion of the globe beneath its
position above the equator, and viewing quality degrades toward high latitudes. This restricts its
capability to capture weather data from polar regions and from areas outside its field of view.
CHAPTER 3
METHODOLOGY
3.1 DATA COLLECTION AND PREPROCESSING
1. Data Preparation
○ Data Gathering: Collect satellite data from reliable sources, such as Sentinel or Landsat
datasets, using remote sensing platforms like Google Earth Engine or directly accessing
public repositories.
○ Data Preprocessing: Process the satellite imagery to handle any missing or erroneous
values. This may include filling missing data through interpolation or removal if data is
unreliable.
○ Standardization and Normalization: Apply Min-Max scaling or Z-score normalization
to standardize features across temporal and spatial datasets, ensuring that variations in
data scale do not bias the model.
2. Feature Engineering
○ Spatial Features: Extract relevant spatial features such as pixel intensity, NDVI
(Normalized Difference Vegetation Index), and land cover classifications to capture
geographic patterns.
○ Temporal Features: Create temporal features by aggregating data over time periods
(e.g., monthly or seasonal averages), capturing temporal variations.
○ Meteorological Indices: Integrate external meteorological data, like temperature,
precipitation, and humidity, to enrich the feature set and improve predictive capabilities
for modeling natural processes.
3. Data Cleaning and Outlier Treatment
○ Outlier Detection: Use interquartile range (IQR) and other statistical methods to identify
and handle outliers, ensuring they do not skew model training.
○ Missing Value Imputation: Where applicable, apply interpolation or domain-specific
methods for imputing missing values, particularly for continuous data.
4. Data Splitting
○ Train-Test Split: Divide the processed data into training and test sets, ensuring that
temporal or spatial integrity is maintained (e.g., by year or geographic region) to avoid
data leakage and enhance model generalization.
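The data splitting step above, which keeps temporal integrity by holding out whole years, can be sketched as follows. This is only an illustration under stated assumptions: the helper split_by_year, the column names, and the sample records are hypothetical and do not come from the project's dataset.

```python
import pandas as pd


def split_by_year(df: pd.DataFrame, year_col: str, test_years: list[int]):
    """Hold out whole years for testing so no year appears in both sets."""
    is_test = df[year_col].isin(test_years)
    return df[~is_test].copy(), df[is_test].copy()


# Hypothetical records: one row per observation, tagged with its year.
data = pd.DataFrame({
    "year": [2019, 2019, 2020, 2020, 2021, 2021, 2022, 2022],
    "humidity": [71, 85, 64, 90, 77, 88, 69, 93],
    "heavy_rain": [0, 1, 0, 1, 0, 1, 0, 1],
})

train, test = split_by_year(data, "year", test_years=[2022])
print(len(train), "training rows;", len(test), "test rows")
```

Splitting on a grouping column such as the year, and not on shuffled rows, prevents observations from the same period from landing in both sets. Row-level shuffling is a common source of leakage in time-dependent data.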
3.1.1 SATELLITE DATA ACQUISITION AND PROCESSING
Satellite data for meteorology is collected from geostationary (e.g., INSAT-3D/3DR) and polar-orbiting
satellites. These satellites provide vital atmospheric data like cloud cover, temperature, humidity, and sea
surface temperatures. The data undergoes several preprocessing steps such as calibration,
georeferencing, data fusion, and cloud masking to ensure accuracy and consistency.
Feature engineering extracts relevant temporal and spatial patterns from the data, like cloud density
changes or temperature gradients, to predict rainfall. Missing values are handled, and data is
standardized to improve model performance. Machine learning models, such as decision trees or random
forests, are then trained on this processed data to predict rainfall events. Techniques like SHAP values
and feature importance scores enhance the interpretability of the model, allowing meteorologists to
understand the reasons behind predictions.
The model is validated using cross-validation and deployed for real-time forecasting, displayed through
a user-friendly interface for operational use in weather prediction and disaster preparedness.
3.1.2 FEATURE ENGINEERING AND SELECTION
Feature engineering transforms raw satellite data into useful features for rainfall prediction. This
involves creating temporal features (e.g., time-based patterns), spatial features (e.g., geographical
gradients), and meteorological indices (e.g., cloud thickness, sea surface temperature, humidity) that
capture key weather patterns.
Feature selection identifies the most relevant features by removing redundant or less impactful ones.
Techniques like correlation analysis, Recursive Feature Elimination (RFE), and decision tree-based
methods help identify important features. Feature scaling (normalization or standardization) ensures all
features contribute equally to the model. Dimensionality reduction methods like PCA can simplify the
dataset while retaining essential information.
The goal is to improve model performance, reduce complexity, and ensure the selected features capture
critical weather conditions for accurate rainfall forecasting.
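Recursive Feature Elimination, one of the selection techniques mentioned above, can be sketched with scikit-learn as shown below. The feature names and the synthetic data are assumptions made for illustration; by construction only the first two columns determine the label, so the example shows the mechanics and is not a result from this project.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Hypothetical feature names; the data below is synthetic, for illustration only.
feature_names = ["humidity", "cloud_cover", "wind_speed", "pressure", "temperature"]

rng = np.random.default_rng(0)
X = rng.normal(size=(400, len(feature_names)))
# By construction, only the first two columns determine the label.
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# RFE repeatedly fits the estimator and drops the weakest feature
# until n_features_to_select remain.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
selector.fit(X, y)

selected = [name for name, keep in zip(feature_names, selector.support_) if keep]
print("Selected features:", selected)
print("Elimination ranking (1 = kept):", dict(zip(feature_names, selector.ranking_)))
```

Because RFE ranks features by the fitted coefficients, the inputs should be on comparable scales before it is applied. The synthetic columns here already are.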
3.1.3 DATA SPLITTING AND NORMALIZATION
● Training Set (70%): Used to train the model and learn from the data.
● Validation Set (15%): Helps tune hyperparameters, prevent overfitting, and optimize the model's
performance during training.
● Test Set (15%): Evaluates the model’s final performance on new, unseen data to assess its
generalization ability.
These splits ensure that the model is trained on one portion of the data, validated and optimized on
another, and then tested on a separate set to provide an unbiased performance evaluation.
Normalization: Features in satellite data (e.g., cloud cover, sea surface temperature, wind speed) often
have different ranges and units. Normalization ensures that all features are on the same scale, preventing
any single feature from disproportionately influencing the model. Common techniques include:
● Min-Max Scaling: Rescales each feature to a fixed range, typically [0, 1].
● Standardization: Centers features around a mean of 0 and scales to a unit variance, which is
helpful when features have varying distributions.
Normalization is essential for models like SVMs or neural networks, which are sensitive to the scale of
the input features. Tree-based models such as decision trees and random forests are largely insensitive
to feature scale, but applying the same scaling to all models keeps the pipeline consistent. By ensuring
that all features contribute equally, the model can learn efficiently and make more accurate predictions.
Together, data splitting and normalization ensure that the model trains effectively, generalizes well to
new data, and provides accurate, explainable rainfall forecasts.
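The 70/15/15 split and standardization described above can be sketched as follows. This is a minimal example under stated assumptions: the helper names, the random shuffling, and the synthetic humidity and pressure values are illustrative and are not taken from the project's code. The scaling statistics are computed on the training set only and then reused for the validation and test sets, so that no information from held-out data influences training.

```python
import numpy as np


def split_70_15_15(X, y, seed=0):
    """Shuffle the rows and split them into 70% train, 15% validation, 15% test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = (70 * len(X)) // 100
    n_val = (15 * len(X)) // 100
    train, val, test = np.split(idx, [n_train, n_train + n_val])
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])


def fit_standardizer(X_train):
    """Compute per-feature mean and standard deviation on the training set only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return mean, std


def standardize(X, mean, std):
    """Apply z-score scaling with previously fitted statistics."""
    return (X - mean) / std


# Synthetic features on very different scales: humidity (%) and pressure (hPa).
rng = np.random.default_rng(1)
X = rng.normal(loc=[50.0, 1010.0], scale=[20.0, 5.0], size=(200, 2))
y = (X[:, 0] > 60).astype(int)

(train_X, train_y), (val_X, val_y), (test_X, test_y) = split_70_15_15(X, y)
mean, std = fit_standardizer(train_X)
train_Z = standardize(train_X, mean, std)
val_Z = standardize(val_X, mean, std)
test_Z = standardize(test_X, mean, std)
```

For time-ordered data, a chronological split may be preferable to the random shuffle used in this sketch.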
3.2 MODEL ARCHITECTURE AND DESIGN
3.2.1 SUPPORT VECTOR MACHINE (SVM) IMPLEMENTATION
● Data Preprocessing:
○ The raw satellite data (e.g., cloud top temperature, humidity, water vapor levels from
INSAT-3DR) and IMD rainfall data would be preprocessed. This involves normalizing
the features, handling missing data, and ensuring the target labels are binary (e.g., heavy
rain vs. no heavy rain).
● Feature Extraction:
○ Key predictors are extracted from the satellite data, such as cloud motion vectors,
temperature gradients, and atmospheric pressure. These features are used as input for the
SVM model.
● Model Training:
○ The SVM is trained using historical satellite data and corresponding rainfall records. The
model learns the relationship between atmospheric features and rainfall occurrences.
● Prediction:
○ Once trained, the SVM can predict whether a new set of satellite data will result in heavy
rainfall or not. The output is a binary classification (heavy rain or no heavy rain) based on
the feature input.
● Model Evaluation:
○ The performance of the SVM is evaluated using standard metrics like accuracy,
precision, recall, and F1-score. Cross-validation is used to ensure the model generalizes
well on unseen data.
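The SVM steps above (preprocessing, training, prediction, and evaluation) can be sketched with scikit-learn as follows. The data is a synthetic stand-in for satellite-derived predictors, and the RBF kernel, the hyperparameters, and the labelling rule are assumptions made for illustration. They are not the configuration or results of this project.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for three predictors (illustration only):
# column 0 ~ cloud top temperature, 1 ~ humidity, 2 ~ water vapor.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
# Synthetic labelling rule: 1 = heavy rain, 0 = no heavy rain.
y = (0.8 * X[:, 1] + 0.6 * X[:, 2] - 0.7 * X[:, 0] > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Scaling lives inside the pipeline so it is fitted on training data only,
# including within each cross-validation fold.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="binary"
)
cv_f1 = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
print("5-fold F1 scores:", np.round(cv_f1, 3))
```

Heavy rain events are usually the minority class, so precision, recall, and F1 tend to be more informative here than accuracy alone.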
3.2.2 RANDOM FOREST IMPLEMENTATION
● Data Preprocessing:
○ As with the SVM, raw satellite data and IMD rainfall data need to be preprocessed. This
includes cleaning the data, filling missing values, and normalizing or standardizing the
features. The target variable is typically binary (e.g., heavy rain vs. no rain).
● Feature Extraction:
○ From the satellite data, relevant meteorological features (such as cloud cover,
temperature, water vapor content, and wind speed) are extracted. These features will
serve as the input variables for the Random Forest model.
● Model Training:
○ Historical satellite data and corresponding rainfall measurements are used to train the
forest. The trees learn different aspects of the atmospheric conditions that could predict
heavy rainfall by exploring various splits and combinations of features.
● Prediction:
○ Once trained, the Random Forest uses the collective predictions of its decision trees to
classify new satellite data as either heavy rain or no rain. In the classic formulation each
tree casts a vote and the most common class is output (majority vote); some
implementations, such as scikit-learn, instead average the class probabilities predicted by
the trees and output the class with the highest average.
● Model Evaluation:
○ The Random Forest model is evaluated using metrics like accuracy, precision, recall, and
F1-score to assess how well it predicts rainfall events. Cross-validation is often used to
ensure the model’s performance is consistent across different subsets of the data.
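The Random Forest steps above can be sketched as follows. The feature names, the synthetic labelling rule, and the hyperparameters are assumptions made for illustration only. The sketch also reproduces the ensemble prediction by hand, averaging the per-tree class probabilities, which is how scikit-learn combines the trees.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical feature names; the data is synthetic, for illustration only.
feature_names = ["cloud_cover", "temperature", "water_vapor", "wind_speed"]
rng = np.random.default_rng(7)
X = rng.normal(size=(500, len(feature_names)))
# Synthetic rule: 1 = heavy rain, 0 = no rain.
y = ((X[:, 0] > 0) & (X[:, 2] > -0.5)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=7
)

forest = RandomForestClassifier(n_estimators=100, random_state=7)
forest.fit(X_train, y_train)

# Combine the trees by hand: average each tree's class probabilities,
# then pick the class with the highest mean probability.
tree_probas = np.stack([tree.predict_proba(X_test) for tree in forest.estimators_])
mean_proba = tree_probas.mean(axis=0)
ensemble_pred = forest.classes_[mean_proba.argmax(axis=1)]

test_accuracy = forest.score(X_test, y_test)
importances = dict(zip(feature_names, forest.feature_importances_))

print(f"test accuracy: {test_accuracy:.3f}")
print("impurity-based importances:", {k: round(v, 3) for k, v in importances.items()})
```

The impurity-based importances printed at the end are a quick built-in summary. They are a different measure from SHAP values and can rank features differently.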
3.2.3 SHAP ANALYSIS FOR MODEL EXPLAINABILITY
1. SHAP Analysis for SVM Predictions
○ Since SVM is a black-box model, SHAP is used to make the model’s decision process
more interpretable. SHAP assigns each feature an importance score, showing its
contribution to the prediction (e.g., whether a particular cloud feature contributed
positively or negatively to forecasting heavy rainfall).
○ After training the SVM model on the satellite data, SHAP values are computed for each
feature. SHAP uses Shapley values from cooperative game theory, where each feature is
considered a "player" contributing to the final prediction.
○ The algorithm computes the average contribution of each feature across all possible
permutations of features. This helps in understanding how the presence or absence of a
feature affects the model’s decision.
○ A positive SHAP value indicates that the feature increased the likelihood of a certain
class (e.g., heavy rain), while a negative SHAP value suggests it decreased the likelihood
of that class.
○ For instance, a high positive SHAP value for cloud top temperature on a given instance
indicates that the observed value (for example, a very cold cloud top, which corresponds
to a tall cloud) pushed the prediction toward heavy rainfall.
○ Global interpretability: By aggregating SHAP values across all instances, you can
identify which features are generally most influential in predicting heavy rainfall. For
example, cloud cover or atmospheric moisture might consistently have high SHAP
values.
○ Local interpretability: SHAP can also provide explanations for individual predictions,
helping to understand why a specific instance was classified as "rain" or "no rain." For
instance, if an instance is predicted as heavy rain, SHAP can show that the cloud motion
feature was particularly influential for that prediction.
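The Shapley computation described above can be illustrated with a small, exact implementation. This is a sketch under stated assumptions and not the project's SHAP pipeline: the scoring function heavy_rain_score, the feature values, and the all-zero baseline are hypothetical, and features outside a coalition are simply replaced by their baseline value. Exact enumeration grows as 2^n in the number of features, which is why SHAP libraries typically rely on approximations or model-specific algorithms.

```python
from itertools import combinations
from math import factorial

import numpy as np


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance.

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(x)
    phi = np.zeros(n)

    def value(coalition):
        z = baseline.copy()
        for j in coalition:
            z[j] = x[j]
        return predict(z)

    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi


def heavy_rain_score(z):
    """Hypothetical model: humidity and cloud cover interact; wind is ignored."""
    humidity, cloud_cover, wind = z
    return 0.5 * humidity + 0.3 * cloud_cover + 0.2 * humidity * cloud_cover


baseline = np.zeros(3)

# Local explanation for one instance.
x = np.array([1.0, 1.0, 0.5])
phi = shapley_values(heavy_rain_score, x, baseline)
print("local SHAP values:", phi)

# Global importance: mean absolute SHAP value over several instances.
X = np.array([[1.0, 1.0, 0.5], [0.2, 0.9, 0.1], [0.8, 0.1, 0.9]])
all_phi = np.array([shapley_values(heavy_rain_score, row, baseline) for row in X])
global_importance = np.abs(all_phi).mean(axis=0)
print("global importance:", global_importance)
```

In this toy model the values for one instance sum to the difference between its score and the baseline score, the interaction term is shared equally between humidity and cloud cover, and the unused wind feature receives a value of zero.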
2. SHAP Analysis for Random Forest Predictions
○ Like the SVM, a Random Forest can be difficult to interpret directly, in this case because
it is an ensemble of many decision trees. SHAP can be used to break down the contribution
of each feature across all the trees in the forest and provide a more transparent view of the
decision process.
○ After the Random Forest model is trained, SHAP values are computed for the predictions
made by the ensemble of decision trees. Each decision tree in the Random Forest
contributes to the final classification, and SHAP can break down how each tree's decision
influences the overall prediction.
○ SHAP works by computing the contribution of each feature across all trees in the forest,
treating the feature as a "player" in the ensemble, and evaluating its importance relative
to the other features.
○ Similar to SVM, SHAP values for Random Forest provide insights into feature
importance. Features with high SHAP values are seen as more influential in the model’s
decision to classify a data point (e.g., whether it will rain or not). For instance, cloud
cover or humidity levels might have a high SHAP value in predicting rainfall.
○ A positive SHAP value for a feature indicates its contribution toward predicting a
positive outcome (heavy rain), while a negative SHAP value suggests it worked against
predicting that outcome.
○ Global interpretability: SHAP values can be aggregated across all instances, typically as
the mean absolute SHAP value per feature, to give a clear picture of which features are the
most important in predicting rainfall. Features that consistently have a large influence on
the predictions (e.g., cloud top temperature, atmospheric pressure) will show high values.
○ Local interpretability: For any given prediction, SHAP can show the exact contribution
of each feature for that specific instance. For example, for a data point where the model
predicts "rain," SHAP will show how much each feature (such as water vapor, cloud
cover, or wind speed) contributed to the final prediction.
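The per-feature contribution described above can also be stated formally. As a sketch, using symbols that do not appear in the original text: N is the set of n features, v(S) is the model's expected output when only the features in the subset S are known, \phi_i is the SHAP value of feature i, and \phi_0 = v(\emptyset) is the baseline output.

```latex
\phi_i = \sum_{S \subseteq N \setminus \{i\}}
         \frac{|S|!\,(n - |S| - 1)!}{n!}\,
         \bigl[\, v(S \cup \{i\}) - v(S) \,\bigr],
\qquad
f(x) = \phi_0 + \sum_{i=1}^{n} \phi_i
```

The second identity is the additivity property: the SHAP values of one prediction sum to the difference between that prediction and the baseline. This is what allows positive and negative contributions to be read as pushing toward or away from the heavy rain class.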
3. Model-Specific Considerations:
○ While SVM models typically work well for linear separability, SHAP can still provide
insights into the decision boundary and how each feature contributes to the margin
separating the classes.
○ The kernel trick used in SVM (e.g., RBF kernel) complicates the interpretation, but
SHAP can effectively map out the feature contributions in a non-linear decision space.
○ Individual decision trees are naturally interpretable, but a forest of many trees is not.
SHAP values in this context allow for deeper understanding by quantifying how the
splits and feature choices of the individual trees contribute to final predictions.
○ The ensemble nature of Random Forest means that SHAP values give a comprehensive
view of feature importance across a large number of decision paths, making it especially
useful for understanding model behavior on complex datasets like satellite imagery.
CHAPTER 4
Figure 4: Algorithm Accuracy Bar Plot
Figure 5: Prediction vs Actual data ROC
Figure 10: Average impact SHAP value
Figure 11: Random Forest Classifier Error
Figure 12: ROC curves for RFC
Figure 13: RFC Prediction Accuracy Parameters
Figure 14: RFC Confusion Matrix
Figure 15: SVM Classification Graph
CHAPTER 5
Developing an Explainable AI (XAI) model for predicting high-impact rainfall events using satellite
data offers a vital tool for enhancing early warning systems and improving decision-making in
meteorology and disaster preparedness. By leveraging satellite data from systems like INSAT-3DR and
applying explainability techniques such as SHAP and LIME, this project provides not only accurate
rainfall predictions but also insights into key predictive factors. Additionally, we have designed a model
that interprets and analyzes its results, offering a clear understanding of the variables driving each
prediction. This transparency builds trust in model outputs and empowers stakeholders to take timely,
informed actions in response to potential extreme weather.
FUTURE ENHANCEMENT:
● Incorporating Additional Data: Adding data from sources like radar and historical weather
patterns for enhanced robustness.
● Implementing Real-Time Updating: Continuously retraining the model with new data to
maintain accuracy.
● Enhancing User Interface: Adding features for real-time alerts and interactive visualizations
for better accessibility to non-technical users.
APPENDIX
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
from IPython.display import display
import time
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv("/kaggle/input/weather-dataset-rattle-package/weatherAUS.csv")
def data_explore(df):
    display(df.head())
    print("*" * 30)
    print(f"shape of dataset {df.shape}")
    print("*" * 30)
    df.info()  # prints its summary directly and returns None
    print("*" * 30)
    print("Dtypes: \n{}".format(df.dtypes.value_counts()))
    print("*" * 30)
    print(df.columns)
    print("*" * 30)
    print("Number of columns having null values: ", df.isnull().any().sum())

data_explore(df)
# describe for all numeric variables
df.describe().T
# describe for all categorical variables
df.describe(include=['object']).T
display(df['RainTomorrow'].value_counts())
display(df['RainTomorrow'].value_counts() * 100 / len(df))
df['Date'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df[['Date', 'Year', 'Month', 'Day']].head()
# Heatmap
sns.set_context('notebook', font_scale=1.0, rc = {'lines.linewidth': 2.5})
plt.figure(figsize = (15,12))
no = df[df['RainTomorrow'] == 0]
yes = df[df['RainTomorrow'] == 1]
# Missing-value counts per column, side by side for the train and test splits
cat_miss = pd.concat([pd.DataFrame(X_train[cat_cols].isnull().sum()),
                      pd.DataFrame(X_test[cat_cols].isnull().sum())],
                     axis=1)
num_miss = pd.concat([pd.DataFrame(X_train[num_cols].isnull().sum()),
                      pd.DataFrame(X_test[num_cols].isnull().sum())],
                     axis=1)
cat_miss.columns = ['train', 'test']
num_miss.columns = ['train', 'test']
display(cat_miss)
display(num_miss)
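from sklearn.metrics import accuracy_score, cohen_kappa_score, roc_auc_score
# Signature, metrics and return order are inferred from the model_run calls in this listing
def model_run(clf, X_train, y_train, X_test, y_test):
    tic = time.time()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    y_pred_train = clf.predict(X_train)
    toc = time.time()
    time_taken = toc - tic
    probs = clf.predict_proba(X_test)[:, 1]
    return (clf, accuracy_score(y_train, y_pred_train), accuracy_score(y_test, y_pred),
            roc_auc_score(y_test, probs), cohen_kappa_score(y_test, y_pred), time_taken)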
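from sklearn.tree import DecisionTreeClassifier
# param_dt holds the decision tree hyperparameters (not shown in this listing)
clf_dt = DecisionTreeClassifier(**param_dt)
clf_dt, acc_tr_dt, acc_dt, roc_dt, coh_kap_dt, time_dt = model_run(
    clf_dt, X_train, y_train, X_test, y_test)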
# Keep features with importance >= 0.01, sorted from most to least important
f_imp = (feat_imp_df[feat_imp_df['Importance'] >= 0.01]
         .sort_values('Importance', ascending=False)
         .reset_index(drop=True))
f_imp.shape
import xgboost
# model performance on important features
imp_feat = f_imp['Features']
clf_xgb_imp = xgboost.XGBClassifier(**params_xgb)
clf_xgb_imp, acc_tr_xgb_imp, acc_xgb_imp, roc_xgb_imp, coh_kap_xgb_imp, time_xgb_imp = model_run(
    clf_xgb_imp, X_train[imp_feat], y_train, X_test[imp_feat], y_test)
import shap
from sklearn.ensemble import RandomForestRegressor
# Note: a regressor is used here, so the SHAP values explain a continuous
# predicted score rather than a class label.
model = RandomForestRegressor()
model.fit(X_train, y_train)
X_test_sample = X_test.sample(100, random_state=42)  # Adjust sample size as needed
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test_sample)