Machine Learning-2 Business Report
- Problem definition - Check shape, Data types, and statistical summary - Univariate analysis
- Multivariate analysis - Use appropriate visualizations to identify the patterns and insights -
Key meaningful observations on individual variables and the relationship between variables
Observations.
The data set contains 1525 rows and 9 columns (shape).
Observations:
The data set contains no missing values.
The data set contains two object-type columns, 'vote' and 'gender'.
The data set contains 7 integer columns and 2 object columns.
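The checks behind these observations can be sketched with pandas. This is a minimal illustration, not the report's actual code: the small data frame is a hypothetical stand-in, since the file name and full column list are not given here.

```python
import pandas as pd

# Hypothetical stand-in for the survey data (the real data has 1525 rows and 9 columns).
df = pd.DataFrame({
    "vote": ["Labour", "Conservative", "Labour"],
    "age": [43, 36, 35],
    "gender": ["female", "male", "male"],
})

print(df.shape)           # (rows, columns)
print(df.dtypes)          # data type of each column
print(df.isnull().sum())  # number of missing values per column
print(df.describe())      # statistical summary of the numeric columns

# Columns that are not numeric, i.e. the ones that will need encoding later
non_numeric_columns = df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_columns)
```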
Observations.
Age ranges from a minimum of 24 to a maximum of 93.
In every age group the labour party has got more votes than the conservative
party.
Female votes are considerably higher than the male votes in both parties.
In both genders, the labour party has got more votes than the conservative
party.
Observations
The Labour party has more votes overall.
Out of 82 people who gave a score of 5, 73 voted for the Labour party.
Out of 542 people who gave a score of 4, 450 voted for the Labour party.
This is the largest group of Labour voters.
Out of 607 people who gave a score of 3, 407 voted for the Labour party.
This is the second-largest group of Labour voters. The remaining 200 people,
who voted for the Conservative party, form the largest group of voters in
that party.
Out of 257 people who gave a score of 2, 117 voted for the Labour party and
140 voted for the Conservative party. Here the Conservative party got more
votes than the Labour party.
Out of 37 people who gave a score of 1, 16 voted for the Labour party and
21 voted for the Conservative party.
Scores of 3, 4 and 5 have more votes for the Labour party.
Out of 92 people who gave a score of 5, 69 voted for the Labour party.
Out of 440 people who gave a score of 4, 353 voted for the Labour party.
This is the second-largest group of Labour voters.
Out of 648 people who gave a score of 3, 450 voted for the Labour party.
This is the largest group of Labour voters. The remaining 198 people, who
voted for the Conservative party, form the largest group of voters in that
party.
Out of 280 people who gave a score of 2, 154 voted for the Labour party and
126 voted for the Conservative party.
Out of 65 people who gave a score of 1, 37 voted for the Labour party and
28 voted for the Conservative party.
Scores of 3, 4 and 5 have more votes for the Labour party.
In all instances, the Labour party has more votes than the Conservative
party.
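Counts of this kind can be produced with a cross-tabulation of the score against the vote. The sketch below uses a handful of hypothetical rows, and the column name "economic.cond.national" is an assumption, since the report does not name the score column here.

```python
import pandas as pd

# Hypothetical rows; the column name "economic.cond.national" is assumed.
df = pd.DataFrame({
    "vote": ["Labour", "Labour", "Conservative", "Labour", "Conservative", "Labour"],
    "economic.cond.national": [4, 3, 2, 4, 3, 5],
})

# Absolute counts per score and party, with row and column totals
counts = pd.crosstab(df["economic.cond.national"], df["vote"], margins=True)

# Share of each party within a score (each row sums to 1)
shares = pd.crosstab(df["economic.cond.national"], df["vote"], normalize="index")

print(counts)
print(shares.round(2))
```

The normalized table makes it easier to compare scores with very different numbers of respondents.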
Observations
The Labour party has more votes overall.
Out of 153 people who gave a score of 5, 150 voted for the Labour party.
The remaining 3 people, despite giving a score of 5 to the Labour leader,
chose to vote for the Conservative party.
Out of 836 people who gave a score of 4, 679 voted for the Labour party.
The remaining 157 people, despite giving a score of 4 to the Labour
leader, chose to vote for the Conservative party.
Only 1 person gave a score of 3, and that person voted for the
Conservative party.
Out of 438 people who gave a score of 2, 242 voted for the Conservative
party. The remaining 196 people, despite giving an unsatisfactory score
of 2 to the Labour leader, chose to vote for the Labour party.
Out of 97 people who gave a score of 1, 59 voted for the Conservative
party. The remaining 38 people, despite giving the lowest score of 1 to
the Labour leader, chose to vote for the Labour party.
Scores of 4 and 5 have more votes for the Labour party.
Scores of 1, 2 and 3 have more votes for the Conservative party.
Observations
Out of 338 people who gave a score of 11, 166 voted for the Labour party
and 172 voted for the Conservative party.
People who gave a score of 7 to 10 voted for Labour and Conservative
almost equally. The Conservative party seems to be slightly ahead in
these instances.
Out of 209 people who gave a score of 6, 173 voted for the Labour party
and 36 voted for the Conservative party.
People who gave a score of 1 to 6 predominantly voted for the Labour
party. In total, 770 people gave scores from 1 to 6, and 672 of them
voted for the Labour party. So, 87.27% of these people chose the Labour
party.
So, we can infer that the lower the 'Eurosceptic' sentiment, the higher
the votes for the Labour party.
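The share quoted for scores 1 to 6 is a simple ratio of the two counts given above. A minimal sketch of the arithmetic, where the helper name `share` is hypothetical:

```python
def share(part: int, total: int) -> float:
    """Return part / total expressed as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100 * part / total

# 672 Labour voters out of the 770 people who gave a 'Europe' score of 1 to 6
labour_share = share(672, 770)
print(f"{labour_share:.2f}%")
```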
Observations
Out of 250 people who gave a score of 3, 178 people have voted for the
labour party and 72 people have voted for the conservative party.
Out of 782 people who gave a score of 2, 498 people have voted for the
labour party and 284 people have voted for the conservative party.
Out of 38 people who gave a score of 1, 27 people have voted for the
labour party and 11 people have voted for the conservative party.
Out of 455 people who gave a score of 0, 360 people have voted for the
labour party and 95 people have voted for the conservative party.
We can see that, in all instances, labour party gets the higher number
of votes.
Out of 1525 people, 455 gave a score of 0. This means that about
29.84% of the people are casting their votes with a political knowledge
score of 0.
Observations
Pairplot tells us about the interaction of each variable with every other
variable present. As such there is no strong relationship present
between the variables. There is a mixture of positive and negative
relationships though which is expected.
Overall it is a rough estimate of the interactions; a clearer picture can be
obtained from the heatmap values and from other kinds of plots.
The pairplot is a combination of histograms and scatterplots.
From the histograms we can see that the 'Blair', 'Europe' and
'political.knowledge' variables are slightly left skewed.
All other variables seem to be approximately normally distributed.
From the scatterplot, we can see that there is mostly no correlation
between the variables.
We can use the correlation heatmap to view them more clearly.
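Skewness and correlation can also be checked numerically rather than read off the plots. The sketch below uses hypothetical scores; only the three column names come from the report. A correlation heatmap would typically be drawn from the same matrix.

```python
import pandas as pd

# Hypothetical scores; only the column names are taken from the report.
df = pd.DataFrame({
    "Blair": [4, 4, 2, 5, 4, 1, 4, 2],
    "Europe": [11, 6, 9, 3, 11, 8, 10, 11],
    "political.knowledge": [2, 0, 2, 3, 0, 2, 2, 1],
})

skewness = df.skew()     # negative values indicate left skew, positive values right skew
correlation = df.corr()  # Pearson correlation matrix (the default method)

print(skewness.round(2))
print(correlation.round(2))
```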
Observations
Data Pre-processing
Prepare the data for modelling: - Outlier Detection (treat, if needed) - Encode the data -
Data split - Scale the data (and state your reasons for scaling the features).
Observations
As we can see after treating the outliers with cap and floor technique
all the outliers have been adjusted.
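A cap and floor treatment can be sketched as follows. The report does not state which bounds were used, so the usual 1.5 x IQR whiskers are an assumption here, and the helper name `cap_and_floor` and the sample values are hypothetical.

```python
import pandas as pd

def cap_and_floor(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical score column with one low outlier (the value 1)
scores = pd.Series([3, 3, 4, 3, 2, 4, 3, 1, 3, 4])
treated = cap_and_floor(scores)
print(treated.tolist())
```

Values inside the bounds are left unchanged; only the points beyond the whiskers are moved to the nearest bound.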
Observations.
From the above results we can see that both variables contain only two
categories each.
We can use a simple categorical conversion (pd.Categorical(...).codes) or
dummy encoding with drop_first=True; both will work here.
Either approach converts the values into 0 and 1. As the categories have no
level or order, both encodings give the same result.
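Both encoding options can be sketched side by side. The category labels used for 'gender' below are hypothetical, since the report does not list the raw values.

```python
import pandas as pd

# Hypothetical raw values for the two object columns
df = pd.DataFrame({
    "vote": ["Labour", "Conservative", "Labour"],
    "gender": ["female", "male", "male"],
})

# Option 1: categorical codes (categories are sorted alphabetically by default)
coded = df.copy()
for column in ["vote", "gender"]:
    coded[column] = pd.Categorical(coded[column]).codes

# Option 2: dummy encoding, dropping the first level of each variable
dummies = pd.get_dummies(df, columns=["vote", "gender"], drop_first=True, dtype=int)

print(coded)
print(dummies)
```

With two categories per variable, both options yield one 0/1 column per variable; dummy encoding only changes the column names.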
Observations.
The info of the dataset does not contain any object datatype after
encoding the data.
The 'vote' and 'gender' variables are converted to 0 and 1 after encoding.
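The remaining pre-processing steps, the data split and the scaling, can be sketched as below. The 70/30 split ratio, the random seed and the sample rows are assumptions, not values from the report. Scaling matters mainly for distance-based models such as KNN, because a feature like age (24 to 93) would otherwise dominate scores that only run from 1 to 11.

```python
import pandas as pd

# Hypothetical encoded data; "age" and "Europe" sit on very different ranges.
df = pd.DataFrame({
    "age": [24, 35, 47, 52, 61, 70, 83, 93, 44, 58],
    "Europe": [1, 11, 6, 8, 3, 10, 11, 4, 7, 9],
    "vote": [1, 0, 1, 1, 1, 0, 0, 1, 1, 0],
})

# Assumed 70/30 split with a fixed seed for reproducibility
train = df.sample(frac=0.7, random_state=1)
test = df.drop(train.index)

features = ["age", "Europe"]
mean = train[features].mean()
std = train[features].std()

# The scaling is fitted on the train split only, then applied to both splits
train_scaled = (train[features] - mean) / std
test_scaled = (test[features] - mean) / std
print(train_scaled.round(2))
```

Fitting the scaling on the train split only avoids leaking information from the test split into the model.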
KNN Model - Observation
Train data:
• Accuracy: 84% • Precision: 86% • Recall: 91% • F1-Score: 89% • AUC: 90.4%
Test data:
• Accuracy: 83% • Precision: 86% • Recall: 90% • F1-Score: 88% • AUC: 90.4%
• The model is not over-fitted.
• The train data has 84% accuracy and the test data has 83% accuracy. The
difference is very small, so we can infer that the KNN model has performed
well.
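The metrics reported for each model follow from the counts in a confusion matrix. The sketch below shows the standard formulas; the counts are hypothetical and not taken from the report, and the function assumes none of the denominators is zero.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical confusion-matrix counts
metrics = classification_metrics(tp=90, fp=15, fn=10, tn=35)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Because F1 is the harmonic mean of precision and recall, it always lies between the two.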
Naive Bayes Model - Observation
Train data:
• Accuracy: 83% • Precision: 88% • Recall: 88% • F1-Score: 88% • AUC: 88.7%
Test data:
• Accuracy: 82% • Precision: 89% • Recall: 86% • F1-Score: 87% • AUC: 88.7%
• The model is not over-fitted or under-fitted.
• The error on the test data is slightly higher than on the train data, which
is acceptable because the gap is small and the error on both train and test
data is not too high.
Bagging Model - Observation
Train data:
Test data:
• Accuracy: 80% • Precision: 86% • Recall: 86% • F1-Score: 86% • AUC: 100%
• The model is over-fitted.
• The train data has 100% accuracy and the test data has 80% accuracy. The
gap is much larger in this model, so we can infer that the Bagging model has
not performed well.
Boosting Model - Observation
Train data:
• Accuracy: 89% • Precision: 91% • Recall: 93% • F1-Score: 92% • AUC: 95%
Test data:
• Accuracy: 83% • Precision: 89% • Recall: 87% • F1-Score: 88% • AUC: 95%
• The model is not over-fitted.
• The train data has 89% accuracy and the test data has 83% accuracy. The
difference is small, so we can infer that the Boosting model has performed
well.
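The over-fitting checks above all compare train accuracy with test accuracy. That comparison can be sketched as a small rule. The 0.10 threshold is an assumption chosen for illustration, and the second pair of figures is assumed to belong to the Naive Bayes model, since it matches that model's test metrics in the comparison summary.

```python
# (train accuracy, test accuracy) as reported in the observations above
reported_accuracy = {
    "KNN": (0.84, 0.83),
    "Naive Bayes": (0.83, 0.82),  # model name inferred from the comparison summary
    "Bagging": (1.00, 0.80),
    "Boosting": (0.89, 0.83),
}

MAX_GAP = 0.10  # assumed threshold: a larger train/test gap is treated as over-fitting

flagged = [
    name
    for name, (train_acc, test_acc) in reported_accuracy.items()
    if train_acc - test_acc > MAX_GAP
]
print(flagged)
```

With this threshold only the untuned Bagging model is flagged, which agrees with the observations above.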
Bagging
Before Tuning: The model performed well on the training data with an
accuracy of 1.00, indicating potential overfitting. On the test data, it
had an accuracy of 0.80, with balanced precision and recall for both
classes.
After Tuning: The model's test accuracy improved slightly to 0.83 after
tuning. Precision and recall improved for class 0, indicating better
performance in predicting the minority class. The overall F1 score and
accuracy improved, suggesting a better balance between precision and
recall.
Boosting
Before Tuning: The model performed well on the training data with an
accuracy of 0.89. On the test data, it had an accuracy of 0.83, with
balanced precision and recall for both classes. The gap between train and
test accuracy hints at mild overfitting.
After Tuning: The model's test accuracy was 0.82 after tuning, marginally
below the 0.83 before tuning, so tuning did not improve accuracy.
Precision and recall improved for class 0, indicating better
performance in predicting the minority class. The overall F1 score
improved, suggesting a better balance between precision and recall.
Compare all the model built so far - Select the final model with the
proper justification - Check the most important features in the final
model and draw inferences.
To compare all the models and select the final one, let's analyze the
performance based on several metrics: accuracy, F1-score, recall, precision
and AUC-ROC score. Below is the summary of each model:
KNN:
Accuracy: 0.83
Precision: 0.86
Recall: 0.90
F1-score: 0.88
AUC-ROC: 0.90
Naive Bayes:
Accuracy: 0.82
Precision: 0.89
Recall: 0.86
F1-score: 0.87
AUC-ROC: 0.887
Bagging (After Tuning):
Accuracy: 0.80
Precision: 0.80
Recall: 0.80
F1-score: 0.90
AUC-ROC: 0.82
Boosting (After Tuning):
Accuracy: 0.83
Precision: 0.80
Recall: 0.80
F1-score: 0.86
AUC-ROC: 0.81
Conclusion:
• All the tuned models score well and every model is usable. The most
consistent tuned model across train and test data is the Boosting model.
• The tuned gradient boosting model has an accuracy of 79% on train and 83%
on test, and an AUC of 81% on both train and test data, so its performance
does not degrade from train to test.
• It has a precision of 80% and a recall of 80%. KNN and Naive Bayes report
higher test precision, recall and AUC in the summary above, so the choice
rests on train/test consistency rather than on the highest individual scores.
On that basis, we conclude that the tuned Gradient Boosting model is the
best/optimized model.
Actionable Insights & Recommendations
Compare all four models - Conclude with the key takeaways for the
business.
KNN:
Accuracy: 0.83
Precision: 0.86
Recall: 0.90
F1-score: 0.88
AUC-ROC: 0.90
Naive Bayes:
Accuracy: 0.82
Precision: 0.89
Recall: 0.86
F1-score: 0.87
AUC-ROC: 0.887
Bagging (After Tuning):
Accuracy: 0.80
Precision: 0.80
Recall: 0.80
F1-score: 0.90
AUC-ROC: 0.82
Boosting (After Tuning):
Accuracy: 0.83
Precision: 0.80
Recall: 0.80
F1-score: 0.86
AUC-ROC: 0.81
Insights:
• The Labour party has more than double the votes of the Conservative party.
• Most people gave a score of 3 or 4 for the national economic condition, and
the average score is 3.245221.
• Most people gave a score of 3 or 4 for the household economic condition, and
the average score is 3.137772.
• Blair has a higher number of votes than Hague, and the scores are much
better for Blair than for Hague.
• The average score of Blair is 3.335531 and the average score of Hague is
2.749506. So, Blair has the better score.
• Some people who gave a low score of 1 to a party leader still decided to
vote for that leader's party instead of voting for the other party. This may
be because of a lack of political knowledge among the people.
• People with higher Eurosceptic sentiment have voted for the Conservative
party; the lower the Eurosceptic sentiment, the higher the votes for the
Labour party.
• Out of 454 people who gave a score of 0 for political knowledge, 360
voted for the Labour party and 94 voted for the Conservative party.
• All models performed well on the training data set as well as on the test
data set. The tuned models have performed better than the regular models.
Business recommendations:
• Gathering more data will help in training the models and thus improve
their predictive power.
• We can also create a function in which all the models predict the outcome in
sequence. This will help in better understanding the likely outcome and its
probability.
• Use the Gradient Boosting model without scaling to predict the outcome,
as it has the best optimized performance.
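The second recommendation, a function in which all the models predict in sequence, can be sketched as follows. The function name `predict_with_all` and the three rule-based stand-ins are hypothetical; in practice each entry would wrap the prediction call of a fitted model.

```python
from collections import Counter
from typing import Callable, Dict, Sequence

Model = Callable[[Sequence[float]], int]  # maps one feature row to class 0 or 1

def predict_with_all(models: Dict[str, Model], row: Sequence[float]) -> dict:
    """Run every model on the row and summarise how far they agree."""
    predictions = {name: model(row) for name, model in models.items()}
    tally = Counter(predictions.values())
    majority, votes = tally.most_common(1)[0]  # ties resolve to the class seen first
    return {
        "predictions": predictions,
        "majority": majority,
        "agreement": votes / len(predictions),  # share of models backing the majority
    }

# Hypothetical stand-ins for fitted models
models = {
    "knn": lambda row: 1 if row[0] >= 3 else 0,
    "naive_bayes": lambda row: 1 if row[1] <= 6 else 0,
    "boosting": lambda row: 1 if row[0] + row[2] >= 6 else 0,
}

result = predict_with_all(models, [4, 8, 3])
print(result)
```

The agreement share is only a rough confidence signal, not a calibrated probability.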
As we can see, '--' (63), 'us' (44) and 'new' (26) are the most frequent
tokens; '--' is a punctuation character rather than a word.
Observations:
• us - 44
• new - 26
• let - 25
• america - 15
• shall - 13
Here, 'every', 'peace' and 'people' all share 7th place because they occur the
same number of times. Most frequent tokens: '--', 'us' and 'new'.
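A word count of this kind can be sketched with the standard library. The stop-word list and the sample text below are hypothetical stand-ins, since the speeches and the stop words used in the report are not shown here. This tokenizer keeps only letters and apostrophes, so a token such as '--' would not be counted.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the report's actual list is not shown
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "that", "we", "our", "for", "is", "not"}

def top_words(text: str, n: int = 3) -> list:
    """Return the n most frequent words after lower-casing and stop-word removal."""
    tokens = re.findall(r"[a-z']+", text.lower())
    kept = [token for token in tokens if token not in STOP_WORDS]
    return Counter(kept).most_common(n)

# Hypothetical sample text
sample = "Let us begin anew. Let us explore the stars. Let both sides seek peace for us."
print(top_words(sample))
```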
Problem 2 - Plot Word cloud of all three speeches
Show the most common words used in all three speeches in the form of
word clouds.
We can see some highlighted words like
'let', 'us', 'new', 'nation', 'world', 'america', 'people', 'peace', etc.
In a word cloud, the bigger the word, the higher its frequency.
Insights:
Our objective was to look at all the speeches and analyse them to find
their strength and sentiment.
Based on the outputs, we can see that some words are present in all
the speeches.
These words may be the ones that inspired many people and helped the
speakers win the presidency of the United States of America.
Among all the speeches, "nation" is the word that is prominently
highlighted in all three.