Predicting Rainfall using Machine Learning Techniques
CSI5155 ML Project Report
Nikhil Oswal
1 Introduction
Rainfall prediction remains a serious concern and has attracted the attention
of governments, industries, risk management entities, as well as the scientific
community. Rainfall is a climatic factor that affects many human activities like
agricultural production, construction, power generation, forestry and tourism,
among others [1]. To this extent, rainfall prediction is essential since this vari-
able is the one with the highest correlation with adverse natural events such
as landslides, flooding, mass movements and avalanches. These incidents have
affected society for years [2]. Therefore, having an appropriate approach for rain-
fall prediction makes it possible to take preventive and mitigation measures for
these natural phenomena [3].
To address this uncertainty, we apply various machine learning techniques and
models to make accurate and timely predictions. This paper presents an end-to-end
machine learning life cycle, from data preprocessing through model implementation
to evaluation. The data preprocessing steps include imputing missing values,
feature transformation, encoding categorical features, feature scaling and
feature selection. We implemented models such as Logistic Regression, Decision
Tree, K Nearest Neighbour, rule-based classifiers and ensembles. For evaluation purposes,
we used Accuracy, Precision, Recall, F-Score and Area Under Curve as evaluation
metrics. For our experiments, we train our classifiers on Australian weather
data gathered from various weather stations in Australia.
The paper is organized as follows. First, we describe the data set under
consideration in Section 2. The adopted methods and techniques are presented in
Section 3, while the experiments and results are shown and discussed in Section
Section 4. Finally, conclusions are drawn in Section 5.
2 Case Study
In this paper, the data set under consideration contains daily weather observations
from numerous Australian weather stations. The target variable is RainTomorrow,
which indicates whether it rained the next day (Yes or No). The dataset is available
at https://www.kaggle.com/jsphyg/weather-dataset-rattle-package and the feature
definitions are adapted from http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml.
The data set consists of 23 features and roughly 142,000 instances, listed below.
Feature        Description
Date           The date of observation
Location       The common name of the location of the weather station
MinTemp        The minimum temperature in degrees Celsius
MaxTemp        The maximum temperature in degrees Celsius
Rainfall       The amount of rainfall recorded for the day, in mm
Evaporation    The so-called Class A pan evaporation (mm) in the 24 hours to 9am
Sunshine       The number of hours of bright sunshine in the day
WindGustDir    The direction of the strongest wind gust in the 24 hours to midnight
WindGustSpeed  The speed (km/h) of the strongest wind gust in the 24 hours to midnight
WindDir9am     Direction of the wind at 9am
WindDir3pm     Direction of the wind at 3pm
WindSpeed9am   Wind speed (km/h) averaged over 10 minutes prior to 9am
WindSpeed3pm   Wind speed (km/h) averaged over 10 minutes prior to 3pm
Humidity9am    Humidity (percent) at 9am
Humidity3pm    Humidity (percent) at 3pm
Pressure9am    Atmospheric pressure (hPa) reduced to mean sea level at 9am
Pressure3pm    Atmospheric pressure (hPa) reduced to mean sea level at 3pm
Cloud9am       Fraction of sky obscured by cloud at 9am
Cloud3pm       Fraction of sky obscured by cloud at 3pm
Temp9am        Temperature (degrees C) at 9am
Temp3pm        Temperature (degrees C) at 3pm
RainToday      1 if precipitation exceeds 1 mm, otherwise 0
RISK_MM        The amount of next-day rain, in mm
RainTomorrow   The target variable: did it rain tomorrow?
3 Methodology
In this paper, the overall architecture includes four major components: Data
Exploration and Analysis, Data Pre-processing, Model Implementation, and Model
Evaluation, as shown in Fig. 1.
Several other features also contain null values, which we impute in our
preprocessing steps. Looking at the distribution of the target variable, it is
clear that we have a class imbalance problem: the negative class ("No") has
110,316 instances, while the positive class ("Yes") has only 31,877.
The correlation matrix shows that the features MaxTemp, Pressure9am, Pressure3pm,
Temp3pm and Temp9am are negatively correlated with the target variable.
Hence, we can drop these features in our feature selection step later.
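A minimal sketch of this check, assuming the data lives in a pandas DataFrame
named df and the target has already been mapped to 0/1:

import pandas as pd

# Correlation of every numeric feature with the target, most negative first.
corr = df.corr(numeric_only=True)["RainTomorrow"].sort_values()
print(corr.head())  # candidates to drop in the feature selection step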
Missing Values: As per our EDA step, we learned that a few instances contain
null values, so imputation becomes an important step. To impute the missing
values, we group our instances by location and date and replace the null values
with their respective group means.
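A sketch of this grouped imputation, reading "location and date" as a
(Location, Month) grouping; df is the weather DataFrame and the columns listed
are an illustrative subset:

import pandas as pd

df["Month"] = pd.to_datetime(df["Date"]).dt.month
for col in ["MinTemp", "MaxTemp", "Rainfall"]:
    # Fill each missing value with the mean of its (Location, Month) group.
    df[col] = df.groupby(["Location", "Month"])[col].transform(
        lambda s: s.fillna(s.mean()))
    # Fall back to the global column mean for groups that are entirely null.
    df[col] = df[col].fillna(df[col].mean())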
Feature Expansion: The Date feature can be expanded into Day, Month and Year,
and these newly created features can then be used in the other preprocessing
steps.
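A minimal sketch of this expansion, assuming df["Date"] holds date strings such
as "2008-12-01" as in the Kaggle weatherAUS dataset:

import pandas as pd

dates = pd.to_datetime(df["Date"])
df["Day"] = dates.dt.day
df["Month"] = dates.dt.month
df["Year"] = dates.dt.year
df = df.drop(columns=["Date"])  # the raw string is no longer needed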
Categorical Values: A categorical feature is one that has two or more categories
with no intrinsic ordering among them. We have a few categorical features -
WindGustDir, WindDir9am and WindDir3pm, each with 16 unique values. Since models
are based on mathematical equations and calculations, they operate on numbers
rather than text, so we have to encode the categorical data. We tried two
different techniques here: one-hot encoding [5] and feature hashing [6].
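A sketch of both encodings; n_features=8 is an illustrative choice, not
necessarily the value used in our experiments:

import pandas as pd
from sklearn.feature_extraction import FeatureHasher

cat_cols = ["WindGustDir", "WindDir9am", "WindDir3pm"]

# One-hot encoding [5]: 16 directions -> 16 binary columns per feature.
one_hot = pd.get_dummies(df[cat_cols], prefix=cat_cols)

# Feature hashing [6]: map "column=value" tokens into a fixed number of
# columns, trading occasional collisions for lower dimensionality.
hasher = FeatureHasher(n_features=8, input_type="string")
tokens = ([f"{c}={v}" for c, v in zip(cat_cols, row)]
          for row in df[cat_cols].astype(str).values)
hashed = hasher.transform(tokens)  # scipy sparse matrix with 8 columns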
Feature Scaling: Our data set contains features with highly varying magnitudes
and ranges. Since most machine learning algorithms use the Euclidean distance
between data points in their computations, this is a problem: features with
high magnitudes weigh far more in the distance calculations than features with
low magnitudes. To suppress this effect, we bring all features to the same
level of magnitude through scaling. We did this using scikit-learn's min-max
scaler, bringing all features into the range 0 to 1 [7].
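A minimal sketch with scikit-learn's MinMaxScaler [7]; X_train and X_test are
assumed to exist, and fitting on the training split only avoids leaking
test-set statistics:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max on train only
X_test_scaled = scaler.transform(X_test)        # reuse the same ranges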
Handling Class Imbalance: We learned in our EDA step that our data set is
highly imbalanced. Imbalanced data leads to biased results, because the model
learns little about the minority class. We performed two experiments, one with
oversampled data and another with undersampled data, as sketched in the code
after this list.
– Undersampling: We used Imblearn's random under-sampler to eliminate instances
of the majority class [10], with the aim of losing as little information as
possible (figure 10).
– Oversampling: We used Imblearn's SMOTE technique to generate synthetic
instances for the minority class [10]: a subset of the minority class is taken
as a seed, and new, similar synthetic instances are created (figure 11).
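A sketch of the two resampling experiments with imbalanced-learn [10];
random_state=42 is an illustrative choice for reproducibility:

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

# Undersampling: drop majority-class instances until the classes balance.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)

# Oversampling: synthesize new minority-class instances between neighbours.
X_over, y_over = SMOTE(random_state=42).fit_resample(X_train, y_train)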
3.3 Models
We chose different classifiers, each belonging to a different model family
(linear, tree-based, distance-based, rule-based and ensemble). All the
classifiers were implemented using scikit-learn, except for the Decision Table,
which was implemented using Weka.
The following classification algorithms were used to build the prediction
models for the experiments:
Decision Trees have a natural if-then-else construction that makes them fit
easily into a programmatic structure. They are also well suited to
categorization problems, where attributes or features are systematically
checked to determine a final category. They work for both categorical and
continuous input and output variables. In this technique, we split the
population or sample into two or more homogeneous sets (sub-populations) based
on the most significant splitter/differentiator among the input variables.
These characteristics make the Decision Tree a good fit for our problem, since
our target variable is a binary categorical variable.
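A minimal decision-tree baseline; the hyperparameters shown are scikit-learn
defaults, not necessarily the exact configuration used in our experiments:

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(criterion="gini", random_state=42)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))  # mean accuracy on the test split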
A Decision Table provides a handy and compact way to represent complex business
logic. In a decision table, the logic is divided into conditions, actions
(decisions) and rules representing the various components that form the logic
[11]. This classifier was implemented using Weka.
Gradient Boosting: Here, many models are trained sequentially, each new model
gradually reducing the loss function of the whole system using the gradient
descent method. The learning method consecutively fits new models to give a
more accurate estimate of the response variable. The main idea behind this
algorithm is to construct new base learners that are maximally correlated with
the negative gradient of the loss function of the whole ensemble.
Our model's configuration: number of weak learners = 100, learning rate in
{0.05, 0.1, 0.25}, maximum features = 2, maximum depth = 2.
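A sketch of this configuration in scikit-learn; only the listed hyperparameters
come from the paper, the rest are library defaults:

from sklearn.ensemble import GradientBoostingClassifier

for lr in [0.05, 0.1, 0.25]:
    gb = GradientBoostingClassifier(n_estimators=100, learning_rate=lr,
                                    max_features=2, max_depth=2)
    gb.fit(X_train, y_train)
    print(f"learning_rate={lr}: test accuracy {gb.score(X_test, y_test):.3f}")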
3.4 Evaluation
Recall is the number of correct positive results divided by the number of all
relevant samples (all samples that should have been identified as positive).
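In terms of true positives (TP) and false negatives (FN), this is:

\[ \text{Recall} = \frac{TP}{TP + FN} \]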
F1 Score is the harmonic mean of precision and recall; its range is [0, 1]. It
tells how precise our classifier is (how many instances it classifies
correctly), as well as how robust it is (whether it misses a significant number
of instances). High precision with low recall yields a classifier that is
extremely accurate on the instances it labels positive, but one that misses a
large number of instances that are difficult to classify. The greater the F1
Score, the better the performance of our model.
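Equivalently, in terms of precision and recall:

\[ F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]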
4 Experiments and Results
For all the experiments and development of the classifiers, we used Python 3
and Google Colab's Jupyter notebooks, along with the Scikit-learn, Matplotlib,
Seaborn, Pandas, NumPy and Imblearn libraries. We used Weka for implementing
the Decision Table.
We carried out experiments with different input data: one with the original
dataset, one with the undersampled dataset and a last one with the oversampled
dataset. We split our dataset in a 75:25 ratio for training and testing, as
sketched below.
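A minimal sketch of the 75:25 split; stratifying on y is our assumption, made
to keep the class ratio consistent across the two splits:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)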
Experiment 1 - Original Dataset: After all the preprocessing steps (described
in the Methodology section), we ran all the implemented classifiers on the same
input data (Shape: 92037 x 4). Figure 12 depicts the two considered metrics
(10-fold stratified cross-validation accuracy and Area Under Curve) for all the
classifiers.
In terms of accuracy, Gradient Boosting with a learning rate of 0.25 performed
best; in terms of coverage, Random Forest and Decision Tree performed worst.
Experiment 2 - Undersampled Dataset: In terms of both accuracy and coverage,
Logistic Regression performed best and Decision Tree performed worst.
Experiment 3 - Oversampled Dataset: In terms of both accuracy and coverage,
Decision Tree performed best and Logistic Regression performed worst.
We thus obtained a varying range of results across the different input data and
classifiers. The remaining metrics are provided in the appendix.
5 Discussion
Through the issues with our original dataset and the preprocessing steps we
carried out to rectify them, we learned several things. The first important
lesson is the importance of knowing your data. While imputing missing values,
we grouped on two other features and calculated the group means instead of
directly calculating the mean over all instances; this way, our imputed values
were closer to the correct information. Another lesson concerns leaky features:
while exploring our data, we found that one of our features (RISK_MM) had been
used to generate the target variable, and hence it made no sense to use this
feature for predictions.
We learned about the curse of dimensionality while dealing with categorical
variables, which we addressed using feature hashing. We also learned two
techniques for performing feature selection - univariate selection and the
correlation heat map - and explored undersampling and oversampling techniques
while handling the class imbalance problem.
From the experiments with different input data, we also observed that in a few
cases a classifier achieved suspiciously high accuracy (Decision Tree), a
classic case of overfitting. The performance of the classifiers varied with the
input data: to name a few, Logistic Regression performed best with undersampled
data and worst with oversampled data, while KNN performed best with oversampled
data and worst with undersampled data. Hence, we can say that the input data
plays a very important role here. The ensembles, Gradient Boosting in
particular, performed consistently across all the experiments.
In this paper, we explored and applied several preprocessing steps and learned
their impact on the overall performance of our classifiers. We also carried out
a comparative study of all the classifiers with different input data and
observed how the input data can affect the model predictions.
We can conclude that Australian weather is uncertain and that there is no
strong correlation between rainfall and the respective region and time. We did
identify certain patterns and relationships in the data which helped in
determining important features; refer to the appendix section.
As we have a huge amount of data, we could also apply deep learning models such
as the Multilayer Perceptron, Convolutional Neural Networks, and others. It
would be interesting to perform a comparative study between the machine
learning classifiers and deep learning models.
References
1. World Health Organization: Climate Change and Human Health: Risks and Responses. World Health Organization, January 2003
2. Alcántara-Ayala, I.: Geomorphology, natural hazards, vulnerability and prevention of natural disasters in developing countries. Geomorphology 47(2-4), 107-124 (2002)
3. Nicholls, N.: Atmospheric and climatic hazards: Improved monitoring and prediction for disaster mitigation. Natural Hazards 23(2-3), 137-155 (2001)
4. [Online] InDataLabs, Exploratory Data Analysis: the Best way to Start a Data Science Project. Available: https://medium.com/@InDataLabs/why-start-a-data-science-project-with-exploratory-data-analysis-f90c0efcbe49
5. [Online] Pandas Documentation. Available: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
6. [Online] Scikit-learn Documentation. Available: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html
7. [Online] Scikit-learn Documentation. Available: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
8. [Online] Scikit-learn Documentation. Available: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html
Appendix
is a good feature for our predictions. Figures 15, 16, 17, 18 and 19 depict
these patterns.
Below are the evaluation metrics for all the experiments carried out.