
# For Linear Algebra Import Numpy As NP # For Data Processing Import Pandas As PD

This document outlines steps to build a machine learning model to predict rainfall using a weather dataset. It includes: 1. Importing libraries and loading/preprocessing the weather data, which involves removing unnecessary variables, null values, and outliers. 2. Exploratory data analysis using SelectKBest to identify the top three predictor variables of rainfall as rainfall, humidity, and whether it rained the previous day. 3. Building classification models using logistic regression, random forest, decision tree, and support vector machine to predict rainfall, and evaluating model accuracy on test data. Logistic regression results in 83% accuracy with a runtime of 0.17 seconds.

Uploaded by

Dilip Kumar
Copyright
© All Rights Reserved

Step 1: Import the required libraries

# For linear algebra
import numpy as np
# For data processing
import pandas as pd

Step 2: Load the data set


 
# Load the data set
df = pd.read_csv('. . . Desktop/weatherAUS.csv')
# Display the shape of the data set
print('Size of weather data frame is :', df.shape)
# Display the first five rows
print(df[0:5])

Step 3: Data Preprocessing

# Checking for null values: count the non-null entries in each column
print(df.count().sort_values())

[5 rows x 24 columns]
Sunshine          75625
Evaporation       82670
Cloud3pm          86102
Cloud9am          89572
Pressure9am      130395
Pressure3pm      130432
WindDir9am       134894
WindGustDir      135134
WindGustSpeed    135197
Humidity3pm      140953
WindDir3pm       141232
Temp3pm          141851
RISK_MM          142193
RainTomorrow     142193
RainToday        142199
Rainfall         142199
WindSpeed3pm     142398
Humidity9am      142806
Temp9am          143693
WindSpeed9am     143693
MinTemp          143975
MaxTemp          144199
Location         145460
Date             145460
dtype: int64
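During data preprocessing, it is good practice to remove the variables that are not significant,
since unnecessary data only increases our computations. Here we drop the four columns with the
fewest non-null values (Sunshine, Evaporation, Cloud3pm, Cloud9am) together with Location, Date,
and RISK_MM. RISK_MM records the amount of rain on the following day, so keeping it would leak
the answer into the model.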

df = df.drop(columns=['Sunshine','Evaporation','Cloud3pm','Cloud9am','Location','RISK_MM','Date'])
print(df.shape)
 
(145460, 17)
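
The column counts above can also be read as the share of missing values per column, which makes it easier to see which variables are too sparse to keep. The following is a minimal sketch on a made-up toy frame, not part of the original walkthrough; the helper name `missing_fraction` and the 0.5 cut-off are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def missing_fraction(frame: pd.DataFrame) -> pd.Series:
    """Return the share of missing values per column, largest first."""
    return frame.isna().mean().sort_values(ascending=False)


# Toy frame standing in for the weather data (values are made up).
toy = pd.DataFrame({
    "Sunshine": [np.nan, np.nan, 7.1, np.nan],
    "Rainfall": [0.0, 1.2, np.nan, 0.4],
    "MaxTemp": [22.9, 25.1, 25.7, 28.0],
})

fractions = missing_fraction(toy)
print(fractions)

# Columns missing in more than half of the rows (the 0.5 cut-off is arbitrary).
sparse_columns = fractions[fractions > 0.5].index.tolist()
print(sparse_columns)
```

On the real data frame, the same helper could be called as `missing_fraction(df)` before deciding which columns to drop.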

Next, we will remove all the null values in our data frame.

# Removing null values: drop every row that has at least one missing entry
df = df.dropna(how='any')
print(df.shape)
 
(112925, 17)

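After removing null values, we must also check our data set for outliers. An outlier is a data
point that differs significantly from other observations. Outliers often result from measurement
or recording errors during data collection. Here we compute the z-score of every numeric value
and keep only the rows in which all z-scores are below 3 in absolute value.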

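# stats comes from SciPy and has to be imported before use
from scipy import stats
# Absolute z-scores of the numeric columns
z = np.abs(stats.zscore(df.select_dtypes(include=np.number)))
print(z)
# Keep only the rows where every value lies within 3 standard deviations of its column mean
df = df[(z < 3).all(axis=1)]
print(df.shape)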
 
[[0.11756741 0.10822071 0.20666127 ... 1.14245477 0.08843526 0.04787026]
[0.84180219 0.20684494 0.27640495 ... 1.04184813 0.04122846 0.31776848]
[0.03761995 0.29277194 0.27640495 ... 0.91249673 0.55672435 0.15688743]
...
[1.44940294 0.23548728 0.27640495 ... 0.58223051 1.03257127 0.34701958]
[1.16159206 0.46462594 0.27640495 ... 0.25166583 0.78080166 0.58102838]
[0.77784422 0.4789471 0.27640495 ... 0.2085487 0.37167606 0.56640283]]
(107868, 17)
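
To see what the z-score filter does, it helps to run it on a small frame with one planted outlier. This is only an illustrative sketch: the toy values are made up, and the threshold of 3 is the one used in the step above.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy frame (made-up values): 20 evenly spaced readings per column.
toy = pd.DataFrame({
    "Humidity3pm": np.linspace(40, 60, 20),
    "Pressure9am": np.linspace(1005, 1025, 20),
})
toy.loc[0, "Humidity3pm"] = 500.0  # plant one obvious outlier in row 0

# Absolute z-score of every value, computed column by column.
z = np.abs(stats.zscore(toy.to_numpy()))

# Keep a row only if all of its z-scores are below 3.
filtered = toy[(z < 3).all(axis=1)]
print(toy.shape, "->", filtered.shape)
```

Row 0 is dropped because its humidity value lies more than 3 standard deviations from the column mean; the other rows pass the filter.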

Next, we’ll replace ‘No’ and ‘Yes’ with 0 and 1, respectively.

# Change No and Yes to 0 and 1 respectively for the RainToday and RainTomorrow variables
df['RainToday'] = df['RainToday'].map({'No': 0, 'Yes': 1})
df['RainTomorrow'] = df['RainTomorrow'].map({'No': 0, 'Yes': 1})
Normalise The Data 
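
The source gives no code under this heading, so the following is only a sketch of what such a step might look like, under stated assumptions. At this point the data frame still contains text columns for wind direction, and the chi-squared score used in the next step accepts only numeric, non-negative features. The sketch therefore one-hot encodes a text column with `pd.get_dummies` and rescales every column to the range [0, 1] with scikit-learn's `MinMaxScaler`. The toy values, and the choice of these two tools, are assumptions rather than the author's confirmed method.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame (made-up values) with a negative temperature and a text column.
toy = pd.DataFrame({
    "MinTemp": [-2.0, 5.5, 13.0, 20.5],
    "Humidity3pm": [20.0, 45.0, 70.0, 95.0],
    "WindGustDir": ["N", "SW", "N", "E"],
    "RainToday": [0, 1, 0, 1],
})

# One-hot encode the text column so that every feature is numeric.
encoded = pd.get_dummies(toy, columns=["WindGustDir"], dtype=float)

# Rescale each column to [0, 1]; this also removes negative values.
scaler = MinMaxScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(encoded),
    index=encoded.index,
    columns=encoded.columns,
)
print(scaled)
```

Min-max scaling keeps the row order and column names, so the scaled frame can be used in place of the original one in the steps that follow.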

Step 4: Exploratory Data Analysis (EDA)

Now that we’re done pre-processing the data set, it’s time to analyse it and identify
the significant variables that will help us predict the outcome. To do this, we will use the
SelectKBest function with the chi-squared (chi2) score, which requires all feature values to be
numeric and non-negative.

# Using SelectKBest to get the top features
from sklearn.feature_selection import SelectKBest, chi2
# Every column except the target is a candidate feature
X = df.loc[:, df.columns != 'RainTomorrow']
y = df['RainTomorrow']
# Keep the 3 features with the highest chi-squared score
selector = SelectKBest(chi2, k=3)
selector.fit(X, y)
X_new = selector.transform(X)
print(X.columns[selector.get_support(indices=True)])
 
Index(['Rainfall', 'Humidity3pm', 'RainToday'], dtype='object')

The output gives us the three most significant predictor variables:

1. Rainfall
2. Humidity3pm
3. RainToday
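
SelectKBest only reports which columns were kept. To see why a feature ranks highly, the fitted selector's `scores_` attribute can be inspected. The sketch below uses a made-up toy frame in which Humidity3pm clearly separates the two classes and WindSpeed9am does not; the values and the choice of k=1 are assumptions for illustration.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Toy frame (made-up, already scaled to [0, 1]).
toy = pd.DataFrame({
    "Humidity3pm": [0.9, 0.8, 0.85, 0.2, 0.3, 0.25, 0.95, 0.1],
    "WindSpeed9am": [0.5, 0.4, 0.6, 0.5, 0.6, 0.4, 0.5, 0.5],
    "RainTomorrow": [1, 1, 1, 0, 0, 0, 1, 0],
})
X_toy = toy.drop(columns="RainTomorrow")
y_toy = toy["RainTomorrow"]

# Fit the selector and rank the features by their chi-squared score.
selector = SelectKBest(chi2, k=1).fit(X_toy, y_toy)
scores = pd.Series(selector.scores_, index=X_toy.columns).sort_values(ascending=False)
print(scores)

# Names of the columns that survive the selection.
selected = X_toy.columns[selector.get_support()].tolist()
print(selected)
```

A higher score means the feature's values are distributed more unevenly between the rain and no-rain classes, which is what makes it useful as a predictor.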

The main aim of this demo is to show how Machine Learning works. Therefore, to simplify the
computations, we will use only one of these significant variables as the input.

# The important features are put in a data frame
df = df[['Humidity3pm','Rainfall','RainToday','RainTomorrow']]

# To simplify computations we will use only one feature (Humidity3pm) to build the model
X = df[['Humidity3pm']]   # input
y = df[['RainTomorrow']]  # output

Step 5: Building a Machine Learning Model

At this step, we will build the Machine Learning model by using the training data set and
evaluate the accuracy of the model by using the testing data set.

We’ll be building classification models, by using the following algorithms:

1. Logistic Regression
2. Random Forest
3. Decision Tree
4. Support Vector Machine

Logistic Regression

# Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import time

# Calculating the accuracy and the time taken by the classifier
t0 = time.time()
# Data splitting: hold out 25% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
clf_logreg = LogisticRegression(random_state=0)
# Building the model using the training data set
# (ravel() flattens the one-column y frame into the 1-D array scikit-learn expects)
clf_logreg.fit(X_train, y_train.values.ravel())

# Evaluating the model using testing data set
y_pred = clf_logreg.predict(X_test)
score = accuracy_score(y_test, y_pred)

# Printing the accuracy and the time taken by the classifier
print('Accuracy using Logistic Regression:', score)
print('Time taken using Logistic Regression:', time.time() - t0)
 
Accuracy using Logistic Regression: 0.8330181332740015
Time taken using Logistic Regression: 0.1741015911102295
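
The other three algorithms listed at the start of this step can be scored and timed in the same way. The following is a sketch under stated assumptions, not the source's code: the helper `evaluate`, the synthetic one-feature data, the fixed seeds, and the linear SVM kernel are all choices made here for illustration. No accuracy figures are quoted because they depend on the data.

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def evaluate(clf, X, y, test_size=0.25, seed=0):
    """Fit clf on a train split; return (test accuracy, elapsed seconds)."""
    t0 = time.time()
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    clf.fit(X_train, y_train)
    score = accuracy_score(y_test, clf.predict(X_test))
    return score, time.time() - t0


# Synthetic stand-in for the single Humidity3pm feature (made-up data):
# rain becomes more likely as the scaled humidity grows.
rng = np.random.default_rng(0)
humidity = rng.uniform(0, 1, size=400)
X_toy = humidity.reshape(-1, 1)
y_toy = (humidity + rng.normal(0, 0.15, size=400) > 0.7).astype(int)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Support Vector Machine": SVC(kernel="linear"),
}

results = {}
for name, clf in models.items():
    results[name] = evaluate(clf, X_toy, y_toy)
    print(f"Accuracy using {name}: {results[name][0]:.3f}")
    print(f"Time taken using {name}: {results[name][1]:.3f}")
```

On the real data, `evaluate(clf, X, y.values.ravel())` would play the same role. The support vector machine may take noticeably longer than the other models on a data set of this size.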
