ML P L Lohitha 22-01-23 Business Report
Machine Learning
1 Problem 1
2 Sample of the dataset
3 Outlier Treatment
4 Univariate analysis
5 Bivariate analysis
6 Multivariate analysis
7 Data encoding, Scaling and Splitting
8 Logistic Regression
9 Linear Discriminant Analysis
10 K-Nearest Neighbor
11 Naïve Bayes Model
12 Random Forest
13 Bagging
14 Boosting
15 Model Comparison
16 Business Insights
17 Problem 2
18 Solution
Problem 1
Problem Statement:
You are hired by one of the leading news channels, CNBE, which wants to analyse recent elections. This survey was conducted on 1525 voters with 9 variables. You have to build a model to predict which party a voter will vote for on the basis of the given information, in order to create an exit poll that will help in predicting the overall win and the seats covered by a particular party.
Data Ingestion:
1.1 Read the dataset. Do the descriptive statistics and do the null value condition check. Write an
inference on it.
1.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for outliers.
Data Preparation:
1.3 Encode the data (having string values) for modelling. Is scaling necessary here or not? Data Split: Split the data into train and test (70:30).
Modelling:
1.4 Apply Logistic Regression and LDA (Linear Discriminant Analysis).
1.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
1.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model.
Final Model:
Table 1
The dataset has 8 integer variables and 2 object variables.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1525 entries, 0 to 1524
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1525 non-null int64
1 vote 1525 non-null object
2 age 1525 non-null int64
3 economic.cond.national 1525 non-null int64
4 economic.cond.household 1525 non-null int64
5 Blair 1525 non-null int64
6 Hague 1525 non-null int64
7 Europe 1525 non-null int64
8 political.knowledge 1525 non-null int64
9 gender 1525 non-null object
dtypes: int64(8), object(2)
memory usage: 119.3+ KB
The five-point summary of the dataset helped us understand the mean, median and standard deviation of the data, and it also helps us identify anomalies. The column 'Unnamed: 0' is just a serial number for each row, so we can drop it, as it doesn't help in model building. The minimum age of the voters is 24 and the maximum age is 93. The average age of the voters is 53.
Table 2
We dropped the column 'Unnamed: 0' and checked for duplicate values; the duplicate count is 0, so there are no duplicate rows in the data.
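A minimal sketch of these ingestion steps, assuming the survey file is named Election_Data.xlsx (the file name and format are assumptions):

import pandas as pd

# Load the survey data (file name assumed for illustration)
df = pd.read_excel("Election_Data.xlsx")

df.info()                      # dtypes and null-value check
print(df.describe().T)         # five-point summary

# 'Unnamed: 0' is just a serial number, so it adds no signal
df = df.drop(columns=["Unnamed: 0"])

print(df.duplicated().sum())   # 0 -> no duplicate rows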
Outlier Treatment:
There are no outliers in the data except for single outliers in the columns 'economic.cond.national' and 'economic.cond.household'. We chose not to treat these outliers because the columns represent a rating scale of the voters' economic conditions, and altering them might distort the model's predictions.
Figure 1
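As a sketch, an IQR-based check that flags these points (the 1.5 x IQR rule is the usual convention, assumed here):

# Count points outside 1.5 * IQR for the two rating columns
for col in ["economic.cond.national", "economic.cond.household"]:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
    print(col, mask.sum(), "outlier(s)")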
Univariate analysis:
The age of the voters is fairly symmetrically distributed; it is neither left-skewed nor right-skewed.
Figure 2
The column economic.cond.national is right-skewed and there is an outlier in the data.
Figure 3
The column economic.cond.household is right-skewed and there is an outlier in the data.
Figure 4
Figure 5
Figure 6
The majority of the voters' party of choice is Labour; only 30.30% prefer Conservative.
Figure 7
The voters are almost equally split between male and female.
Figure 8
Bivariate analysis:
We can infer from the graph below that there is very low or no correlation between the variables. Since the correlations are low, the features carry largely independent information, which is ideal for model building. There are no strong patterns in the data to infer from the graph.
Figure 9
From the heatmap we can infer that all the variables have very low correlation. The highest positive correlation, 0.35, is between economic.cond.national and economic.cond.household. The highest negative correlation, -0.30, is between Blair and Europe.
Figure 10
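A sketch of how such a correlation heatmap is typically produced (seaborn assumed):

import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix over the numeric columns only
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()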
Multivariate analysis:
The national economic condition ratings of voters whose party of choice is Labour are higher than those of voters whose party of choice is Conservative.
Figure 11
The majority of male voters have higher political knowledge than female voters, and, irrespective of gender, voters whose party of choice is Conservative have higher political knowledge than the others.
Figure 12
Younger voters rate the national economic condition higher, and the voters whose party of choice is Conservative are older on average than the others.
Figure 13
The voters with high political knowledge have high Eurosceptic sentiment.
Figure 14
Table 3
We will then check the standard deviation and variance of the variables to decide whether to scale the data or not.
Standard deviation:
vote                       0.459534
age                        15.706057
economic.cond.national     0.880680
economic.cond.household    0.929646
Blair                      1.174439
Hague                      1.230300
Europe                     3.296457
political.knowledge        1.082960
gender                     0.498945

Variance:
vote                       0.211172
age                        246.680211
economic.cond.national     0.775598
economic.cond.household    0.864243
Blair                      1.379307
Hague                      1.513638
Europe                     10.866629
political.knowledge        1.172801
gender                     0.248946
Since the standard deviation and variance of a few variables are very high, we decided to scale the data. The magnitudes of age and the other features also vary widely, so scaling standardises them onto a common range, which helps yield better predictions.
We then split the data into train and test sets in a 70:30 ratio. We split the data to avoid overfitting: we don't want the model to memorise the training data; we want it to learn a pattern that generalises.
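A minimal sketch of the encoding, scaling and 70:30 split described above (the encoder choice, random_state and stratification are assumptions; scikit-learn API):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Encode the two object columns as integer codes (mapping assumed)
df["vote"] = df["vote"].astype("category").cat.codes
df["gender"] = df["gender"].astype("category").cat.codes

X = df.drop(columns=["vote"])
y = df["vote"]

# 70:30 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

# Fit the scaler on the train set only to avoid leakage
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)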
Logistic Regression:
We have built a logistic regression model after splitting the data.
0.8397375820056232
We can understand from the above classification report that the accuracy of the model on the train set is 84%, and the recall is 0.91 and 0.69 for the two classes. Since both classes are equally important, we check the f1-score, which combines the precision and recall of the model: 0.89 and 0.73.
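A sketch of the fit and the train-side report (solver and parameters are assumptions):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

lr = LogisticRegression(max_iter=1000)      # parameters assumed
lr.fit(X_train, y_train)

print(lr.score(X_train, y_train))           # train accuracy
print(classification_report(y_train, lr.predict(X_train)))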
0.8231441048034934
              precision    recall  f1-score   support

           0       0.87      0.89      0.88       328
           1       0.70      0.65      0.68       130
Figure 15
The confusion matrix is plotted above; the numbers of true positives and true negatives are 292 and 85 on the test set, and the area under the ROC curve is 0.882387.
Figure 16
The above plot depicts the ROC curve; the model with the highest area under the ROC curve performs better than the other models.
To optimise the performance of the model, we will perform hyperparameter tuning using grid search. We used multiple parameters and identified the best parameter combination through grid search to improve model performance.
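A sketch of the grid search (the grid values themselves are assumptions):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {"C": [0.01, 0.1, 1, 10],       # grid values assumed
              "solver": ["lbfgs", "liblinear"]}
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))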
0.8209606986899564
Figure 17
The confusion matrix is plotted above; the numbers of true positives and true negatives are 292 and 84 on the test set, and the area under the ROC curve is 0.882387. There is no major difference in the confusion matrix or in the area under the ROC curve after tuning.
Figure 18
Linear Discriminant Analysis:
We have built an LDA model and the classification report is given below.
0.8369259606373008
We can understand from the above classification report that the accuracy of the model on the train set is 84%, and the recall is 0.90 and 0.70 for the two classes. Since both classes are equally important, we check the f1-score, which combines the precision and recall of the model: 0.88 and 0.73.
0.8187772925764192
We have verified the performance of the model on the test data: the accuracy is 82%, which hasn't dropped much from the train set, so there is no overfitting here.
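A sketch of the LDA fit (default solver assumed):

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)
print(lda.score(X_train, y_train))   # train accuracy
print(lda.score(X_test, y_test))     # test accuracy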
Figure 19
The confusion matrix is plotted above; the numbers of true positives and true negatives are 289 and 86 on the test set, and the area under the ROC curve is 0.883771.
Figure 20
The above plot depicts the ROC curve; the model with the highest area under the ROC curve performs better than the other models.
To optimise the performance of the model, we will perform hyperparameter tuning using grid search. We used multiple parameters and identified the best parameter combination through grid search to improve model performance.
0.8231441048034934
Figure 21
The confusion matrix is plotted above; the numbers of true positives and true negatives are 288 and 89 on the test set, and the area under the ROC curve is 0.885272. There is a slight difference in the confusion matrix and in the area under the ROC curve: the TP and TN changed from 289 and 86 to 288 and 89.
KNN Model:
We have built a KNN model and the classification report is given below.
0.8631677600749765
We can understand from the above classification report that the accuracy of the model on the train set is 86%, and the recall is 0.92 and 0.75 for the two classes. Since both classes are equally important, we check the f1-score, which combines the precision and recall of the model: 0.90 and 0.77.
0.8253275109170306
We have verified the performance of the model on the test data: the accuracy is 83%, which hasn't dropped much from the train set, so there is no overfitting here.
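A sketch of the KNN fit (the default n_neighbors=5 is an assumption):

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()         # n_neighbors=5 by default
knn.fit(X_train, y_train)
print(knn.score(X_train, y_train))   # train accuracy
print(knn.score(X_test, y_test))     # test accuracy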
Figure 23
The confusion matrix is plotted above; the numbers of true positives and true negatives are 286 and 92 on the test set, and the area under the ROC curve is 0.870556.
Figure 24
The above plot depicts the ROC curve; the model with the highest area under the ROC curve performs better than the other models.
To optimise the performance of the model, we will perform hyperparameter tuning using grid search. We used multiple parameters and identified the best parameter combination through grid search to improve model performance.
0.8253275109170306
Using those parameters, we checked the performance of the model on the test data; there is no difference in the accuracy or the f1-score.
Figure 25
The confusion matrix is plotted above; the numbers of true positives and true negatives are 286 and 92 on the test set, and the area under the ROC curve is 0.870556. There is no difference in the confusion matrix or the area under the ROC curve even after hyperparameter tuning.
Figure 26
Naïve Bayes Model:
We can understand from the above classification report that the accuracy of the model on the train set is 83%, and the recall is 0.88 and 0.72 for the two classes. Since both classes are equally important, we check the f1-score, which combines the precision and recall of the model: 0.88 and 0.73.
0.8253275109170306
We have verified the performance of the model on the test data: the accuracy is 83%, which hasn't dropped much from the train set, so there is no overfitting here.
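A sketch of the Naïve Bayes fit (the Gaussian variant is assumed for these numeric features):

from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X_train, y_train)
print(nb.score(X_train, y_train))   # train accuracy
print(nb.score(X_test, y_test))     # test accuracy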
Figure 27
The confusion matrix is plotted above; the numbers of true positives and true negatives are 284 and 94 on the test set, and the area under the ROC curve is 0.884545.
Figure 28
The above plot depicts the ROC curve; the model with the highest area under the ROC curve performs better than the other models.
To optimise the performance of the model, we used the SMOTE technique to address the class imbalance. We increased the sample size of class 1 to match class 0 so that the algorithm is not biased, then rebuilt the model and verified its performance on the test data.
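A sketch using the imbalanced-learn package (imblearn is assumed to be installed; random_state assumed):

from imblearn.over_sampling import SMOTE
from sklearn.naive_bayes import GaussianNB

sm = SMOTE(random_state=1)
# Oversample class 1 to match class 0 on the train set only
X_res, y_res = sm.fit_resample(X_train, y_train)

nb_sm = GaussianNB().fit(X_res, y_res)
print(nb_sm.score(X_test, y_test))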
0.7903930131004366
Figure 29
The confusion matrix is plotted above; the numbers of true positives and true negatives are 258 and 104 on the test set, and the area under the ROC curve is 0.884545. There is a drop in the TPs, and overall accuracy fell after applying SMOTE, so it is not ideal for this model and data. The ROC curve remained the same.
Figure 30
Model Tuning, Bagging and Boosting:
Random Forest:
We have built a Random Forest model to perform bagging, and the classification report is given below.
0.9990627928772259
We can understand from the above classification report that the accuracy of the model on the train set is 100%, and both the recall and the f1-score are 1.00 for both classes.
0.8209606986899564
We have verified the performance of the model on the test data: the accuracy is 82%. There is a major drop in accuracy from 100% to 82%, which is a clear case of overfitting.
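A sketch of the Random Forest fit (default parameters and random_state assumed):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=1)
rf.fit(X_train, y_train)
print(rf.score(X_train, y_train))   # ~1.00 on train
print(rf.score(X_test, y_test))     # ~0.82 on test -> overfitting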
Figure 31
The confusion matrix is plotted above; the numbers of true positives and true negatives are 285 and 91 on the test set, and the area under the ROC curve is 0.889962.
Figure 32
The above plot depicts the ROC curve; the model with the highest area under the ROC curve performs better than the other models.
To optimise the performance of the model and to address the overfitting, we will perform hyperparameter tuning using grid search. We used multiple parameters and identified the best parameter combination through grid search to improve model performance.
0.8209606986899564
Figure 33
The confusion matrix is plotted above; the numbers of true positives and true negatives are 290 and 86 on the test set, and the area under the ROC curve is 0.889962. There is a slight difference in the confusion matrix but no difference in the area under the ROC curve even after hyperparameter tuning.
Figure 34
Bagging:
We have built a bagging model using a random forest classifier, and the classification report is given below.
0.9662605435801312
We can understand from the above classification report that the accuracy of the model on the train set is 97%, and the recall is 0.99 and 0.92 for the two classes. Since both classes are equally important, we check the f1-score, which combines the precision and recall of the model: 0.98 and 0.94.
0.8362445414847162
We have verified the performance of the model on the test data: the accuracy is 84%. There is a major drop in accuracy from 97% to 84%, which is a clear case of overfitting.
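A sketch of bagging with a random-forest base estimator (scikit-learn; note the keyword was base_estimator before version 1.2 and estimator afterwards; the counts and random_state are assumptions):

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

bag = BaggingClassifier(estimator=RandomForestClassifier(random_state=1),
                        n_estimators=10, random_state=1)
bag.fit(X_train, y_train)
print(bag.score(X_train, y_train))   # ~0.97 on train
print(bag.score(X_test, y_test))     # ~0.84 on test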
Figure 35
The confusion matrix is plotted above; the numbers of true positives and true negatives are 291 and 92 on the test set, and the area under the ROC curve is 0.897291.
Figure 36
The above plot depicts the ROC curve; the model with the highest area under the ROC curve performs better than the other models.
To optimise the performance of the model and to address the overfitting, we will perform hyperparameter tuning using grid search. We used multiple parameters and identified the best parameter combination through grid search to improve model performance.
0.8318777292576419
Figure 37
The confusion matrix is plotted above; the numbers of true positives and true negatives are 291 and 90 on the test set, and the area under the ROC curve is 0.897291. There is a slight difference in the confusion matrix but no difference in the area under the ROC curve even after hyperparameter tuning.
Figure 38
Boosting:
We have built a boosting model, and the classification report is given below.
0.8865979381443299
We can understand from the above classification report that the accuracy of the model on the train set is 89%, and the recall is 0.93 and 0.79 for the two classes. Since both classes are equally important, we check the f1-score, which combines the precision and recall of the model: 0.92 and 0.81.
0.8318777292576419
We have verified the performance of the model on the test data: the accuracy is 83%. The drop in accuracy from 89% to 83% is small, so there is no problem of overfitting in the model.
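As one plausible sketch (the specific boosting algorithm is not stated in the report, so gradient boosting is assumed, as is the random_state):

from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(random_state=1)
gb.fit(X_train, y_train)
print(gb.score(X_train, y_train))   # ~0.89 on train
print(gb.score(X_test, y_test))     # ~0.83 on test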
Figure 39
The confusion matrix is plotted above; the numbers of true positives and true negatives are 285 and 96 on the test set, and the area under the ROC curve is 0.897291.
Figure 40
The above plot depicts the ROC curve; the model with the highest area under the ROC curve performs better than the other models.
To optimise the performance of the model, we will perform hyperparameter tuning using grid search.
We used multiple parameters and identified the best parameter combination through grid search to improve model performance.
0.8318777292576419
Figure 41
The confusion matrix is plotted above; the numbers of true positives and true negatives are 287 and 94 on the test set, and the area under the ROC curve is 0.904245. There is a slight difference in the confusion matrix, and the area under the ROC curve improved from 0.897291 to 0.904245 after hyperparameter tuning.
Figure 42
Model comparison:
The visualisations below depict the performance of the models on multiple metrics.
Figure 43
From the above graph we can infer that the Bagging and Boosting models cover a larger area under the ROC curve than the other models.
Figure 44, Figure 45
Figure 46
The above graphs display the testing accuracy, f1-score and precision of all the models; Bagging clearly performs well compared with the other models.
So, considering all the metrics (accuracy, confusion matrix, ROC curve, f1-score, precision and ROC_AUC score), the best performing model is Bagging, which is the model best suited to the given dataset.
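A sketch of how such a comparison can be assembled, reusing the model variables from the earlier sketches (all of those names are assumptions):

from sklearn.metrics import f1_score, precision_score, roc_auc_score

models = {"Logistic Regression": lr, "LDA": lda, "KNN": knn,
          "Naive Bayes": nb, "Random Forest": rf,
          "Bagging": bag, "Boosting": gb}
for name, m in models.items():
    pred = m.predict(X_test)
    auc = roc_auc_score(y_test, m.predict_proba(X_test)[:, 1])
    print(f"{name}: acc={m.score(X_test, y_test):.3f} "
          f"f1={f1_score(y_test, pred):.3f} "
          f"prec={precision_score(y_test, pred):.3f} auc={auc:.3f}")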
Business Insights:
The model's prediction of which party a voter will vote for depends heavily on the voter's attitude towards European integration, the voter's age, and the voter's assessment of the Conservative and Labour party leaders.
A voter is likely to vote for the Conservative party if he/she has high political knowledge and low Eurosceptic sentiment.
A voter is likely to vote for the Labour party if he/she has low political knowledge and high Eurosceptic sentiment.
Older people with lower Eurosceptic sentiment tend to vote for the Labour party, and younger people with lower Eurosceptic sentiment are voting for the Conservative party.
Problem 2:
In this project, we work on the inaugural corpus from nltk in Python. We will be looking at the speeches of three Presidents of the United States of America.
2.1 Find the number of characters, words, and sentences for the mentioned
documents.
Solution:
We imported the necessary libraries to perform the analysis and loaded the speeches of the three Presidents.
The number of characters in the mentioned documents is 25180.
The number of words in the mentioned documents is 5110.
The number of sentences in the mentioned documents is 189.
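A sketch of the counts using nltk's inaugural corpus (the file IDs are assumptions; the report names Roosevelt and Kennedy, and the third ID is a guess used purely for illustration):

import nltk
nltk.download("inaugural", quiet=True)
from nltk.corpus import inaugural

# File IDs assumed for illustration
speeches = ["1941-Roosevelt.txt", "1961-Kennedy.txt", "1973-Nixon.txt"]

chars = sum(len(inaugural.raw(f)) for f in speeches)
words = sum(len(inaugural.words(f)) for f in speeches)
sents = sum(len(inaugural.sents(f)) for f in speeches)
print(chars, words, sents)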
The top three words after removing stop words for the speech of President
Franklin D. Roosevelt are ‘Know’, ‘Spirit’, ‘Life’.
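A sketch of the top-word extraction (the exact preprocessing, lowercasing and alphabetic filtering, is an assumption):

import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import inaugural, stopwords
from nltk.probability import FreqDist

stop = set(stopwords.words("english"))
tokens = [w.lower() for w in inaugural.words("1941-Roosevelt.txt")
          if w.isalpha() and w.lower() not in stop]
print(FreqDist(tokens).most_common(3))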
Figure 47
The top three words after removing stop words for the speech of President
John F. Kennedy are ‘US’, ‘World’, ‘Let’.
Figure 48
The top words after removing stop words for the speech of the third President are 'Let', 'US', 'America', 'Peace'.
Figure 49
The top three words for all three speeches combined are 'US', 'America', 'World'.
Figure 50
The word clouds for all three speeches are shown below.
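A sketch using the wordcloud package (assumed to be installed; one speech shown, reusing the stop set from the sketch above):

from wordcloud import WordCloud
from nltk.corpus import inaugural
import matplotlib.pyplot as plt

wc = WordCloud(stopwords=stop, background_color="white").generate(
    inaugural.raw("1961-Kennedy.txt"))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()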
Figure 51, Figure 52
Figure 53