# For linear algebra
import numpy as np
# For data processing
import pandas as pd
#Loading the data set (the file name used here is an assumption)
df = pd.read_csv('weatherAUS.csv')
#Dropping the columns that are not needed for the prediction
df = df.drop(columns=['Sunshine','Evaporation','Cloud3pm','Cloud9am','Location','RISK_MM','Date'])
print(df.shape)
(145460, 17)
Next, we will remove all the null values in our data frame.
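A minimal sketch of this step, assuming we simply drop every row that contains at least one missing value:
#Removing all rows that contain a null value (assumed approach)
df = df.dropna(how='any')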
After removing the null values, we must also check the data set for outliers. An outlier is a data point that differs significantly from the other observations; outliers usually arise from errors made while collecting the data. Here we compute the z-score of every numeric value and keep only the rows in which all values lie within three standard deviations of the mean.
from scipy import stats
#Computing the absolute z-score of every numeric column
z = np.abs(stats.zscore(df._get_numeric_data()))
print(z)
#Keeping only the rows whose values all lie within three standard deviations
df = df[(z < 3).all(axis=1)]
print(df.shape)
[[0.11756741 0.10822071 0.20666127 ... 1.14245477 0.08843526 0.04787026]
[0.84180219 0.20684494 0.27640495 ... 1.04184813 0.04122846 0.31776848]
[0.03761995 0.29277194 0.27640495 ... 0.91249673 0.55672435 0.15688743]
...
[1.44940294 0.23548728 0.27640495 ... 0.58223051 1.03257127 0.34701958]
[1.16159206 0.46462594 0.27640495 ... 0.25166583 0.78080166 0.58102838]
[0.77784422 0.4789471 0.27640495 ... 0.2085487 0.37167606 0.56640283]]
(107868, 17)
Next, we’ll replace the ‘YES’ and ‘NO’ labels in the data set with ‘1s’ and ‘0s’, as sketched below.
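A minimal sketch of this encoding step, assuming the Yes/No labels live in the RainToday column and in a RainTomorrow target column (the target column name is an assumption):
#Replacing the Yes/No labels with 1/0 (column names assumed)
df['RainToday'] = df['RainToday'].replace({'No': 0, 'Yes': 1})
df['RainTomorrow'] = df['RainTomorrow'].replace({'No': 0, 'Yes': 1})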
Now that we’re done pre-processing the data set, it’s time to perform the analysis and identify the significant variables that will help us predict the outcome. To do this, we will make use of the SelectKBest function (a sketch of this step follows the list below), which identifies the following features as the most significant:
1. Rainfall
2. Humidity3pm
3. RainToday
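A minimal sketch of the feature selection, restricted to the numeric columns and assuming ‘RainTomorrow’ is the target; the chi-squared scoring function and the min-max scaling are also assumptions:
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2
#Separating the numeric features from the assumed target column
X_all = df._get_numeric_data().drop(columns=['RainTomorrow'])
y_all = df['RainTomorrow']
#chi2 requires non-negative inputs, so the features are scaled to [0, 1] first
X_scaled = MinMaxScaler().fit_transform(X_all)
#Scoring every feature and keeping the three best ones
selector = SelectKBest(score_func=chi2, k=3)
selector.fit(X_scaled, y_all)
print(X_all.columns[selector.get_support()])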
The main aim of this demo is to help you understand how Machine Learning works; therefore, to simplify the computations, we will use only one of these significant variables as the input, as sketched below.
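For instance, a minimal sketch of this assignment (choosing Humidity3pm as the single input and RainTomorrow as the target are assumptions):
#Assigning one significant feature as the input and the outcome as the target
#(the choice of Humidity3pm and the name RainTomorrow are assumptions)
X = df[['Humidity3pm']]
y = df['RainTomorrow']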
In this step, we will build the Machine Learning models using the training data set and evaluate their efficiency using the testing data set. We will compare the following classifiers:
1. Logistic Regression
2. Random Forest
3. Decision Tree
4. Support Vector Machine
Logistic Regression
#Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import time
#Calculating the accuracy and the time taken by the classifier
t0=time.time()
#Splitting the data: 75% for training and 25% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
clf_logreg = LogisticRegression(random_state=0)
#Building the model using the training data set
clf_logreg.fit(X_train,y_train)
#Evaluating the model using testing data set
y_pred = clf_logreg.predict(X_test)
score = accuracy_score(y_test,y_pred)
#Printing the accuracy and the time taken by the classifier
print('Accuracy using Logistic Regression:',score)
print('Time taken using Logistic Regression:' , time.time()-t0)
Accuracy using Logistic Regression: 0.8330181332740015
Time taken using Logistic Regression: 0.1741015911102295