Rainfall Prediction using Machine Learning
Today there is no certain method by which we can predict whether there will be rainfall on a given day; even the meteorological department’s predictions fail sometimes. In this article, we will learn how to build a machine-learning model that can predict whether there will be rainfall today based on some atmospheric factors. This problem is a good fit for machine learning because such models tend to perform well on previously known tasks that used to require highly skilled individuals.
Pandas – This library helps to load the data frame in a 2D array format and has multiple functions to perform analysis tasks in one go.
Numpy – Numpy arrays are very fast and can perform large computations in a very short time.
Matplotlib/Seaborn – These libraries are used to draw visualizations.
Sklearn – This module contains multiple libraries having pre-implemented functions to perform tasks from data preprocessing to model development and evaluation.
XGBoost – This contains the eXtreme Gradient Boosting machine learning algorithm, one of the algorithms which helps us achieve high accuracy on predictions.
Imblearn – This module contains functions for handling problems related to data imbalance.
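If any of these packages are missing from your environment, they can be installed with pip (the standard PyPI package names are assumed here):
pip install numpy pandas matplotlib seaborn scikit-learn xgboost imbalanced-learn
Now let’s import the libraries we’ll use throughout this article.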
Python Code:-
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn import metrics
from xgboost import XGBClassifier
import warnings
warnings.filterwarnings('ignore')
Now let’s load the dataset into a pandas DataFrame and print its first five rows.
Python Code:-
df = pd.read_csv('Rainfall.csv')
df.head()
Output:-
Now let’s check the size of the dataset.
Python Code:-
df.shape
Let’s check which column of the dataset contains which type of data.
Python Code:-
df.info()
Output:-
As per the above information regarding the data in each column, it looks like there are no null values, but we will verify this explicitly during data cleaning.
Let’s look at the descriptive statistical measures of the data.
Python Code:-
df.describe().T
Data Cleaning
The data obtained from primary sources is termed raw data and requires a lot of preprocessing before we can derive any conclusions from it or do any modeling on it. Those preprocessing steps are known as data cleaning, and they include outlier removal, null value imputation, and the removal of discrepancies of any sort in the data inputs.
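First, let’s count the null values in each column (a minimal pandas check):
Python Code:-
# Number of missing entries per column.
df.isnull().sum()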
Output:-
So there is one null value in the ‘winddirection’ as well as the ‘windspeed’ column. But what’s up with the column name ‘winddirection’?
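Let’s print the column names to take a closer look.
Python Code:-
# Inspect the raw column names.
df.columns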
Output:-
Index(['day', 'pressure ', 'maxtemp', 'temperature', 'mintemp', 'dewpoint', 'humidity ', 'cloud ',
'rainfall', 'sunshine', ' winddirection', 'windspeed'], dtype='object')
Here we can observe that there are unnecessary spaces in the names of the columns, so let’s remove them.
Python Code:-
# Strip stray whitespace from every column name.
df.rename(str.strip,
          axis='columns',
          inplace=True)
df.columns
Now, as there are null values in the ‘winddirection’ and ‘windspeed’ columns, let’s impute them with the column mean.
Python Code:-
for col in df.columns:
    if df[col].isnull().sum() > 0:
        # Impute missing values with the column mean.
        val = df[col].mean()
        df[col] = df[col].fillna(val)
df.isnull().sum().sum()
Output: 0
Exploratory Data Analysis
EDA is an approach to analyzing data using visual techniques. It is used to discover trends and patterns, or to check assumptions, with the help of statistical summaries and graphical representations. Here we will see how to check the data imbalance and the skewness of the data.
Python Code:-
# Pie chart of the target's class distribution to check for imbalance.
plt.pie(df['rainfall'].value_counts().values,
        labels=df['rainfall'].value_counts().index,
        autopct='%1.1f%%')
plt.show()
Output:-
Python Code:-
df.groupby('rainfall').mean()
The observations we have drawn from the above dataset are very much similar to what is observed in real life as well.
Now let’s make a list of the continuous features, that is, every numeric column except ‘day’.
Python Code:-
# Collect the numeric columns; 'day' is only an index-like field.
features = list(df.select_dtypes(include=np.number).columns)
features.remove('day')
print(features)
Let’s check the distribution of the continuous features given in the dataset.
Python Code:-
plt.subplots(figsize=(15, 8))
for i, col in enumerate(features):
    plt.subplot(3, 4, i + 1)
    # sb.distplot is deprecated in newer seaborn;
    # sb.histplot(df[col], kde=True) is the modern equivalent.
    sb.distplot(df[col])
plt.tight_layout()
plt.show()
Output:-
Let’s draw boxplots for the continuous variables to detect the outliers present in the data.
Python Code:-
plt.subplots(figsize=(15, 8))
for i, col in enumerate(features):
    plt.subplot(3, 4, i + 1)
    sb.boxplot(df[col])
plt.tight_layout()
plt.show()
There are outliers in the data, but sadly we do not have much data, so we cannot remove them.
Before computing correlations, the target must be numeric. Assuming the ‘rainfall’ column holds ‘yes’/‘no’ labels (as seen in the pie chart above), let’s encode them as 1/0.
Python Code:-
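# Encode the 'yes'/'no' target labels as 1/0 so the column can take
# part in the numeric correlation analysis below.
df.replace({'yes': 1, 'no': 0}, inplace=True)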
Sometimes there are highly correlated features that just increase the dimensionality of the feature space and do no good for the model’s performance. So we must check whether there are highly correlated features in this dataset or not.
Python Code:-
plt.figure(figsize=(10, 10))
# Flag feature pairs whose correlation exceeds 0.8
# (an assumed threshold for 'highly correlated').
sb.heatmap(df.corr() > 0.8,
           annot=True,
           cbar=False)
plt.show()
Output:-
Now we will remove the highly correlated features ‘maxtemp’ and ‘mintemp’. But why not ‘temperature’ or ‘dewpoint’? This is because temperature and dewpoint provide distinct information regarding the weather and atmospheric conditions, so we keep them and drop only the temperature extremes, as sketched below.
Python Code:-
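# Drop the two highly correlated columns identified above.
df.drop(['maxtemp', 'mintemp'], axis=1, inplace=True)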
Model Training
Now we will separate the features and the target variable and split the data into training and validation sets, by using which we will select the model that performs best on the validation data.
Python Code:-
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

# Separate the independent features from the target column.
features = df.drop(['day', 'rainfall'], axis=1)
target = df['rainfall']

X_train, X_val, Y_train, Y_val = train_test_split(features,
                                                  target,
                                                  test_size=0.2,
                                                  stratify=target,
                                                  random_state=2)
# The target classes are imbalanced, so oversample the minority class.
ros = RandomOverSampler(sampling_strategy='minority',
                        random_state=22)
X, Y = ros.fit_resample(X_train, Y_train)
The features of the dataset were on different scales, so normalizing them before training will help us obtain optimum results faster, along with more stable training.
Python Code:-
from sklearn.preprocessing import StandardScaler

# Standardize the features to zero mean and unit variance.
scaler = StandardScaler()
X = scaler.fit_transform(X)
X_val = scaler.transform(X_val)
Now let’s train some state-of-the-art models for classification on our training data:
LogisticRegression
XGBClassifier
SVC
Python Code:-
models = [LogisticRegression(), XGBClassifier(), SVC(probability=True)]

for i in range(3):
    models[i].fit(X, Y)
    print(f'{models[i]} : ')
    train_preds = models[i].predict_proba(X)
    # ROC-AUC is assumed to be the accuracy metric reported below.
    print('Training Accuracy : ', metrics.roc_auc_score(Y, train_preds[:, 1]))
    val_preds = models[i].predict_proba(X_val)
    print('Validation Accuracy : ', metrics.roc_auc_score(Y_val, val_preds[:, 1]))
    print()
Output:-
LogisticRegression() :
XGBClassifier() :
SVC(probability=True) :
Model Evaluation
From the above accuracies, we can say that Logistic Regression and the support vector classifier are satisfactory, as the gap between the training and the validation accuracy is low. Let’s plot the confusion matrix as well for the validation data using the SVC model.
Python Code:-
# Confusion matrix for the SVC model on the validation data
# (ConfusionMatrixDisplay is the current sklearn API for this plot).
metrics.ConfusionMatrixDisplay.from_estimator(models[2], X_val, Y_val)
plt.show()
Let’s print the classification report as well for the validation data using the SVC model.
Python Code:-
print(metrics.classification_report(Y_val,
models[2].predict(X_val)))
Output:-
precision recall f1-score support
accuracy 0.85 74