
Machine Learning-2 Business Report

Define the problem and perform Exploratory Data Analysis

- Problem definition
- Check shape, data types, and statistical summary
- Univariate analysis
- Multivariate analysis
- Use appropriate visualizations to identify patterns and insights
- Key meaningful observations on individual variables and the relationships between variables

Observations:
 The dataset contains 1,525 rows and 9 columns (shape).
 The dataset contains no missing values.
 The dataset contains two object columns, named 'vote' and 'gender'.
 The dataset contains 7 integer columns and 2 object columns.
Observations:
 The minimum and maximum age are 24 and 93.
 The assessment of current national economic conditions ranges from 1 to 5.
 The assessment of current household economic conditions ranges from 1 to 5.
 The mean assessment of current national economic conditions is 3.245221.
 The mean assessment of current household economic conditions is 3.137772.
 The mean assessment of the Labour leader is 3.335531.
 The mean assessment of the Conservative leader is 2.749506.
 The mean of the 11-point scale measuring respondents' attitudes toward European integration is 6.740277. High scores represent 'Eurosceptic' sentiment.
 The mean knowledge of parties' positions on European integration is 1.540541.
 'Labour' is the most frequent vote, with 1,057 occurrences.
 'female' is the most frequent gender, with 808 occurrences.
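A minimal pandas sketch of how these summary figures could be reproduced; the file name is an assumption, while the column names follow the report:

    import pandas as pd

    # File name is illustrative; the report does not state it
    df = pd.read_excel("Election_Data.xlsx")

    print(df.shape)                      # (1525, 9)
    df.info()                            # 7 int64 columns, 2 object ('vote', 'gender')
    print(df.isnull().sum())             # no missing values
    print(df.describe().T)               # min, max and mean per numerical column
    print(df["vote"].value_counts())     # Labour appears 1057 times
    print(df["gender"].value_counts())   # female appears 808 times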


Observations:
Distplot and boxplot of age:
 The data is approximately normally distributed.
 Most people are aged between 40 and 70.
 Outliers are not present.
 The minimum value is 24 and the maximum value is 93.
 The mean value is 54.241.
Observations:
Distplots and boxplots of 'economic.cond.national', 'economic.cond.household', 'Blair', 'Hague', 'Europe' and 'political.knowledge':
 We can see that all the numerical variables are roughly normally distributed (not perfectly normal, and multimodal in some instances).
 There are outliers present in the 'economic.cond.national' and 'economic.cond.household' variables, which can also be seen from the boxplots on the right.
 Also, the boxplots do not show the minimum and maximum values of the variables very clearly; we can obtain them separately while checking for outliers (a plotting sketch follows below).
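A sketch of how such distribution and box plots could be produced with seaborn (histplot is the modern replacement for the deprecated distplot), assuming the dataframe 'df' from above:

    import matplotlib.pyplot as plt
    import seaborn as sns

    num_cols = ["age", "economic.cond.national", "economic.cond.household",
                "Blair", "Hague", "Europe", "political.knowledge"]

    for col in num_cols:
        fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 3))
        sns.histplot(df[col], kde=True, ax=ax_hist)  # distribution shape
        sns.boxplot(x=df[col], ax=ax_box)            # quartiles and outliers
        fig.suptitle(col)
        plt.show()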
Observations:
 We can clearly see that the Labour party has received more votes than the Conservative party.
 In every age group, the Labour party has received more votes than the Conservative party.
 Female votes are considerably higher than male votes in both parties.
 In both genders, the Labour party has received more votes than the Conservative party.
Observations:
 The Labour party has more votes overall.
 Out of 82 people who gave a score of 5, 73 voted for the Labour party.
 Out of 542 people who gave a score of 4, 450 voted for the Labour party. This is the largest group of Labour voters.
 Out of 607 people who gave a score of 3, 407 voted for the Labour party. This is the second-largest group of Labour voters. The remaining 200 people, who voted for the Conservative party, form the largest group of Conservative voters.
 Out of 257 people who gave a score of 2, 117 voted for the Labour party and 140 voted for the Conservative party. Here the Conservative party received more votes than the Labour party.
 Out of 37 people who gave a score of 1, 16 voted for the Labour party and 21 voted for the Conservative party.
 Scores of 3, 4 and 5 have more votes for the Labour party.
 Scores of 1 and 2 have more votes for the Conservative party.
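A sketch of the cross-tabulation such counts could come from; the same pattern applies to each of the other rating columns:

    import pandas as pd
    import seaborn as sns

    # Vote counts per national economic condition score
    print(pd.crosstab(df["economic.cond.national"], df["vote"], margins=True))

    # Visual comparison of votes per score
    sns.countplot(x="economic.cond.national", hue="vote", data=df)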


Observations:
 The Labour party has more votes overall.
 Out of 92 people who gave a score of 5, 69 voted for the Labour party.
 Out of 440 people who gave a score of 4, 353 voted for the Labour party. This is the second-largest group of Labour voters.
 Out of 648 people who gave a score of 3, 450 voted for the Labour party. This is the largest group of Labour voters. The remaining 198 people, who voted for the Conservative party, form the largest group of Conservative voters.
 Out of 280 people who gave a score of 2, 154 voted for the Labour party and 126 voted for the Conservative party.
 Out of 65 people who gave a score of 1, 37 voted for the Labour party and 28 voted for the Conservative party.
 Scores of 3, 4 and 5 have more votes for the Labour party.
 In every instance, the Labour party has more votes than the Conservative party.
Observations:
 The Labour party has more votes overall.
 Out of 153 people who gave a score of 5, 150 voted for the Labour party. The remaining 3, despite giving the Labour leader a score of 5, chose to vote for the Conservative party.
 Out of 836 people who gave a score of 4, 679 voted for the Labour party. The remaining 157, despite giving the Labour leader a score of 4, chose to vote for the Conservative party.
 Only 1 person gave a score of 3, and that person voted for the Conservative party.
 Out of 438 people who gave a score of 2, 242 voted for the Conservative party. The remaining 196, despite giving the Labour leader an unsatisfactory score of 2, chose to vote for the Labour party.
 Out of 97 people who gave a score of 1, 59 voted for the Conservative party. The remaining 38, despite giving the Labour leader the lowest score of 1, chose to vote for the Labour party.
 Scores of 4 and 5 have more votes for the Labour party.
 Scores of 1, 2 and 3 have more votes for the Conservative party.
Observations:
 The Labour party has more votes overall.
 Out of 73 people who gave a score of 5, 59 voted for the Conservative party. The remaining 14, despite giving the Conservative leader a score of 5, chose to vote for the Labour party.
 Out of 558 people who gave a score of 4, 287 voted for the Conservative party. The remaining 271, despite giving the Conservative leader a score of 4, chose to vote for the Labour party.
 Out of 37 people who gave a score of 3, 28 voted for the Labour party; the remaining 9 voted for the Conservative party.
 Out of 624 people who gave a score of 2, 528 voted for the Labour party. The remaining 96, despite giving the Conservative leader an unsatisfactory score of 2, chose to vote for the Conservative party.
 Out of 233 people who gave a score of 1, 222 voted for the Labour party. The remaining 11, despite giving the Conservative leader the lowest score of 1, chose to vote for the Conservative party.
 Scores of 4 and 5 have more votes for the Conservative party, although at a score of 4 the votes are almost equal between the two parties, with the Conservative party slightly ahead.
 Scores of 1, 2 and 3 have more votes for the Labour party. Still, a significant percentage of people who gave the Conservative leader a poor score chose to vote for 'Hague'.
Observations:
 Out of 338 people who gave a score of 11, 166 voted for the Labour party and 172 voted for the Conservative party.
 People who gave scores of 7 to 10 voted for Labour and Conservative almost equally, with the Conservative party slightly ahead in these instances.
 Out of 209 people who gave a score of 6, 173 voted for the Labour party and 36 voted for the Conservative party.
 People who gave scores of 1 to 6 predominantly voted for the Labour party. In total, 770 people gave scores from 1 to 6, and 672 of them voted for the Labour party; that is, 87.27% chose the Labour party.
 So, we can infer that the lower the 'Eurosceptic' sentiment, the higher the votes for the Labour party.
Observations:
 Out of 250 people who gave a score of 3, 178 voted for the Labour party and 72 voted for the Conservative party.
 Out of 782 people who gave a score of 2, 498 voted for the Labour party and 284 voted for the Conservative party.
 Out of 38 people who gave a score of 1, 27 voted for the Labour party and 11 voted for the Conservative party.
 Out of 455 people who gave a score of 0, 360 voted for the Labour party and 95 voted for the Conservative party.
 We can see that, in every instance, the Labour party gets the higher number of votes.
 Out of 1,525 people, 455 gave a score of 0. This means that 29.84% of the people are casting their votes without any political knowledge.
Observations:
 A pairplot shows the interaction of each variable with every other variable. As such, there is no strong relationship between the variables, though there is a mixture of positive and negative relationships, which is expected.
 Overall, it gives a rough estimate of the interactions; a clearer picture can be obtained from the heatmap values and other kinds of plots.
 A pairplot is a combination of histograms and scatterplots.
 From the histograms, we can see that the 'Blair', 'Europe' and 'political.knowledge' variables are slightly left-skewed.
 All other variables seem to be normally distributed.
 From the scatterplots, we can see that there is mostly no correlation between the variables.
 We can use the correlation heatmap to view this more clearly.
Observations:
 We can see from the heatmap that there is mostly no correlation in the dataset. Some variables are moderately positively correlated and some are slightly negatively correlated.
 'economic.cond.national' and 'economic.cond.household' have a moderate positive correlation.
 'Blair' has a moderate positive correlation with 'economic.cond.national' and 'economic.cond.household'.
 'Europe' and 'Hague' have a moderate positive correlation.
 'Hague' has a moderate negative correlation with 'economic.cond.national' and 'Blair'.
 'Europe' has a moderate negative correlation with 'economic.cond.national' and 'Blair'.
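A sketch of the pairplot and heatmap steps (seaborn silently drops the two object columns from the pairplot; numeric_only guards them in the correlation):

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Pairplot: histograms on the diagonal, scatterplots off the diagonal
    sns.pairplot(df)
    plt.show()

    # Correlation heatmap over the numerical columns
    sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
    plt.show()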

Data Pre-processing
Prepare the data for modelling:
- Outlier detection (treat, if needed)
- Encode the data
- Data split
- Scale the data (and state your reasons for scaling the features)
Observations:
 There are nearly no outliers in most of the numerical columns.
 Outliers are present only in the 'economic.cond.national' and 'economic.cond.household' variables, as can be seen from the boxplots.
 In Gaussian Naive Bayes, outliers will affect the shape of the Gaussian distribution and have the usual effects on the mean, etc. So, depending on our use case, it makes sense to treat the outliers.
Observations:
 As we can see, after treating the outliers with the cap-and-floor technique, all the outliers have been adjusted (see the sketch below).
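A sketch of a cap-and-floor treatment consistent with what the report describes; the 1.5×IQR whisker boundaries are an assumption:

    # Cap values outside the IQR whiskers to the whisker boundaries
    def cap_and_floor(series, k=1.5):
        q1, q3 = series.quantile(0.25), series.quantile(0.75)
        iqr = q3 - q1
        return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

    for col in ["economic.cond.national", "economic.cond.household"]:
        df[col] = cap_and_floor(df[col])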

Observations:
 From the above results, we can see that both variables contain only two classes each.
 We can use a simple categorical conversion (pd.Categorical(...).codes or dummy encoding with drop_first=True; both will work here), as sketched below.
 This will convert the values into 0 and 1. As there is no level or order in the subcategories, any encoding will give the same result.
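A sketch of the dummy-encoding option, assuming the dataframe 'df' from above:

    import pandas as pd

    # Each two-class object column becomes a single 0/1 column
    df = pd.get_dummies(df, columns=["vote", "gender"], drop_first=True, dtype=int)

    # Equivalent alternative for a single column:
    # df["vote"] = pd.Categorical(df["vote"]).codes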
Observations:
 The info of the dataset does not contain any object datatype after encoding.
 The 'vote' and 'gender' variables are converted to 0 and 1 after encoding.

Reasons for scaling the features:

 The dataset contains features that vary highly in magnitude, units and range between the 'age' column and the other columns.
 Since many machine learning algorithms use the Euclidean distance between two data points in their computations, this is a problem.
 If left alone, these algorithms only take in the magnitude of features, neglecting the units.
 The results would vary greatly between different units, e.g. 1 km and 1,000 metres.
 Features with high magnitudes will weigh in far more in the distance calculations than features with low magnitudes.
 To suppress this effect, we need to bring all features to the same level of magnitude. This can be achieved by scaling.
 In this case, we have a mix of encoded, ordinal, categorical and continuous variables, so we use the min-max scaler technique to scale the data (see the sketch below).
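A sketch of the split-then-scale step; the encoded target column name, test size and random state are assumptions, as the report does not state them:

    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler

    X = df.drop("vote_Labour", axis=1)   # encoded target column; name assumed
    y = df["vote_Labour"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=1, stratify=y)

    # Fit the scaler on the training data only, then apply to both splits
    scaler = MinMaxScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)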
Model Performance evaluation

- Check the confusion matrix and classification metrics for all the models (for both train and test datasets)
- ROC-AUC score and plot the curve
- Comment on all the models' performance
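A sketch of how these metrics could be computed for any of the fitted classifiers; 'model' is a placeholder for the fitted KNN, Naive Bayes, Bagging or Boosting estimator:

    from sklearn.metrics import (classification_report, confusion_matrix,
                                 roc_auc_score, RocCurveDisplay)

    def evaluate(model, X, y, label):
        pred = model.predict(X)
        proba = model.predict_proba(X)[:, 1]
        print(f"--- {label} ---")
        print(confusion_matrix(y, pred))
        print(classification_report(y, pred))        # accuracy, precision, recall, F1
        print("AUC:", roc_auc_score(y, proba))
        RocCurveDisplay.from_predictions(y, proba)   # ROC curve plot

    evaluate(model, X_train_scaled, y_train, "Train")
    evaluate(model, X_test_scaled, y_test, "Test")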
K-Nearest Neighbor Model - Observation

Train data:

• Accuracy: 84% • Precision: 86% • Recall: 91% • F1-Score: 89% • AUC: 90.4%

Test data:

• Accuracy: 83% • Precision: 86% • Recall: 90% • F1-Score: 88% • AUC: 90.4%

Validness of the model:

• The model is not over-fitted. • As we can see, the train data has an 84% accuracy and the test data has 83% accuracy. The difference is very small, so we can infer that the KNN model has performed well.

Naïve Bayes Model - Observation

Train data:

• Accuracy: 83% • Precision: 88% • Recall: 88% • F1-Score: 88% • AUC: 88.7%

Test data:

• Accuracy: 82% • Precision: 89% • Recall: 86% • F1-Score: 87% • AUC: 88.7%

Validness of the model:

• The model is not over-fitted or under-fitted. • The error on the test data is slightly higher than on the train data, which is acceptable because the margin is small and the error on both train and test data is not too high. Thus, the model is not over-fitted or under-fitted.
Bagging Model - Observation

Train data:

• Accuracy: 100% • Precision: 100% • Recall: 100% • F1-Score: 100% • AUC: 100%

Test data:

• Accuracy: 80% • Precision: 86% • Recall: 86% • F1-Score: 86% • AUC: 100%

Validness of the model:

• The model is over-fitted. • As we can see, the train data has a 100% accuracy and the test data has 80% accuracy. The difference is large in this model, so we can infer that the Bagging model has not performed well.

Boosting Model - Observation

Train data:

• Accuracy: 89% • Precision: 91% • Recall: 93% • F1-Score: 92% • AUC: 95%

Test data:

• Accuracy: 83% • Precision: 89% • Recall: 87% • F1-Score: 88% • AUC: 95%

Validness of the model:

• The model is not over-fitted. • As we can see, the train data has an 89% accuracy and the test data has 83% accuracy. The difference is small, so we can infer that the Boosting model has performed well.

Model Performance improvement

- Improve the performance of the bagging and boosting models by tuning them
- Comment on the model performance improvement on training and test data

After 10-fold cross-validation, the scores on both the train and test data sets are almost the same across all 10 folds (a sketch of the procedure follows below).

Hence, our model is valid.
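A sketch of the 10-fold cross-validation described above, reusing the scaled splits from the pre-processing step:

    from sklearn.model_selection import cross_val_score

    train_scores = cross_val_score(model, X_train_scaled, y_train, cv=10)
    test_scores = cross_val_score(model, X_test_scaled, y_test, cv=10)
    print("Train folds:", train_scores.round(3))
    print("Test folds: ", test_scores.round(3))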

Bagging

 Before tuning: The model performed well on the training data with an accuracy of 1.00, indicating potential overfitting. On the test data, it had an accuracy of 0.80, with balanced precision and recall for both classes.
 After tuning: The model's test accuracy improved slightly to 0.83. Precision and recall improved for class 0, indicating better performance in predicting the minority class. The overall F1-score and accuracy improved, suggesting a better balance between precision and recall.

Boosting

 Before tuning: The model performed well on the training data with an accuracy of 0.89. On the test data, it had an accuracy of 0.83, with balanced precision and recall for both classes.
 After tuning: The model's test accuracy was 0.82. Precision and recall improved for class 0, indicating better performance in predicting the minority class. The overall F1-score improved, suggesting a better balance between precision and recall.
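A sketch of how the tuning could be done with a grid search; the parameter grids are illustrative, as the report does not list the ones actually used:

    from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
    from sklearn.model_selection import GridSearchCV

    bag_grid = {"n_estimators": [50, 100, 200], "max_features": [0.6, 0.8, 1.0]}
    bag_search = GridSearchCV(BaggingClassifier(random_state=1), bag_grid,
                              cv=5, scoring="accuracy")
    bag_search.fit(X_train_scaled, y_train)

    gb_grid = {"n_estimators": [50, 100], "learning_rate": [0.05, 0.1],
               "max_depth": [2, 3, 4]}
    gb_search = GridSearchCV(GradientBoostingClassifier(random_state=1), gb_grid,
                             cv=5, scoring="accuracy")
    gb_search.fit(X_train_scaled, y_train)

    best_bagging, best_gb = bag_search.best_estimator_, gb_search.best_estimator_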

Final Model Selection

 Compare all the models built so far
 Select the final model with proper justification
 Check the most important features in the final model and draw inferences

To compare all the models and select the final one, let us analyse their performance on various metrics: accuracy, F1-score, recall, precision and AUC-ROC score. Below is a summary of each model:

KNN:

(Test Data Set)

 Accuracy - 0.83
 Precision - 0.86
 Recall - 0.90
 F1-score - 0.88
 AUC-ROC - 0.90

Naive Bayes:

(Test Data Set)

 Accuracy - 0.82
 Precision - 0.89
 Recall - 0.86
 F1-score - 0.87
 AUC-ROC - 0.887

Bagging (after tuning):

(Test Data Set)

 Accuracy - 0.80
 Precision - 0.80
 Recall - 0.80
 F1-score - 0.90
 AUC-ROC - 0.82

Boosting (after tuning):

(Test Data Set)

 Accuracy - 0.83
 Precision - 0.80
 Recall - 0.80
 F1-score - 0.86
 AUC-ROC - 0.81

Conclusion:

• There is no under-fitting or over-fitting in any of the tuned models.

• All the tuned models score well. But, as we can see, the most consistent tuned model on both train and test data is the Boosting model.

• The tuned gradient boosting model performs the best, with a 79% accuracy score on train and an 83% accuracy score on test. It also has the best AUC score of 81% on both train and test data, which is the highest of all the models.

• It also has a precision score of 80% and a recall of 80%, which are also the highest of all the models. So, we conclude that the tuned Gradient Boosting model is the best/optimized model.
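For the "most important features" part of the task, a sketch of how they could be read off the tuned gradient boosting model; 'best_gb' refers to the estimator obtained in the tuning sketch above:

    import pandas as pd

    # Feature importances of the final (tuned gradient boosting) model
    importances = pd.Series(best_gb.feature_importances_, index=X.columns)
    print(importances.sort_values(ascending=False))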
Actionable Insights & Recommendations

 Compare all four models
 Conclude with the key takeaways for the business

KNN:

(Test Data Set)

 Accuracy - 0.83
 Precision - 0.86
 Recall - 0.90
 F1-score - 0.88
 AUC-ROC - 0.90

Naive Bayes:

(Test Data Set)

 Accuracy - 0.82
 Precision - 0.89
 Recall - 0.86
 F1-score - 0.87
 AUC-ROC - 0.887

Bagging (after tuning):

(Test Data Set)

 Accuracy - 0.80
 Precision - 0.80
 Recall - 0.80
 F1-score - 0.90
 AUC-ROC - 0.82

Boosting (after tuning):

(Test Data Set)

 Accuracy - 0.83
 Precision - 0.80
 Recall - 0.80
 F1-score - 0.86
 AUC-ROC - 0.81

Insights:

• The Labour party has more than double the votes of the Conservative party.

• Most people gave a score of 3 or 4 for the national economic condition; the average score is 3.245221.

• Most people gave a score of 3 or 4 for the household economic condition; the average score is 3.137772.

• Blair has a higher number of votes than Hague, and the scores are much better for Blair than for Hague.

• The average score of Blair is 3.335531 and the average score of Hague is 2.749506, so Blair has the better score.

• On a scale of 0 to 3, about 30% of the total population has zero knowledge about politics/parties.

• People who gave a low score of 1 to a certain party's leader still decided to vote for that same party instead of the other party. This may be because of a lack of political knowledge among the people.

• People with higher Eurosceptic sentiment voted for the Conservative party; the lower the Eurosceptic sentiment, the higher the votes for the Labour party.

• Out of 455 people who gave a score of 0 for political knowledge, 360 voted for the Labour party and 95 voted for the Conservative party.

• All models performed well on the training data set as well as the test data set. The tuned models performed better than the regular models.

• There is no over-fitting in any model except the regular Random Forest and Bagging models.

• The tuned Gradient Boosting model is the best/optimized model.

Business recommendations:

• Hyper-parameter tuning is an important aspect of model building. There are limitations, as processing the many parameter combinations requires a huge amount of processing power, but if tuning can be done with more sets of parameters, we might get even better results.

• Gathering more data will also help in training the models and thus improve their predictive power.

• We can also create a function in which all the models predict the outcome in sequence. This will help in better understanding what the outcome, and its probability, will be.

• Use the Gradient Boosting model (which does not require scaling) for predicting the outcome, as it has the best optimized performance.

Problem 2 - Define the problem and Perform Exploratory Data Analysis

 Problem definition - Find the number of characters, words and sentences in all three speeches.
• President Franklin D. Roosevelt's speech has 1,323 words in total.

• President John F. Kennedy's speech has 1,364 words in total.

• President Richard Nixon's speech has 1,769 words in total.

• President Franklin D. Roosevelt's speech has 7,651 characters (including spaces).

• President John F. Kennedy's speech has 7,673 characters (including spaces).

• President Richard Nixon's speech has 10,106 characters (including spaces).

• The average word length (avg_word) is 4.78 in President Franklin D. Roosevelt's speech.

• The average word length is 4.62 in President John F. Kennedy's speech.

• The average word length is 4.71 in President Richard Nixon's speech.

• There are 632 stopwords in President Franklin D. Roosevelt's speech.

• There are 618 stopwords in President John F. Kennedy's speech.

• There are 899 stopwords in President Richard Nixon's speech.

• There are 14 numeric tokens in President Franklin D. Roosevelt's speech.

• There are 7 numeric tokens in President John F. Kennedy's speech.

• There are 10 numeric tokens in President Richard Nixon's speech.

• There is 1 uppercase word in President Franklin D. Roosevelt's speech.

• There are 5 uppercase words in President John F. Kennedy's speech.

• There are 13 uppercase words in President Richard Nixon's speech.

• There are 119 uppercase letters in President Franklin D. Roosevelt's speech.

• There are 94 uppercase letters in President John F. Kennedy's speech.

• There are 132 uppercase letters in President Richard Nixon's speech.
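A sketch of how these counts could be computed, assuming the three speeches are loaded from NLTK's inaugural corpus (the file ids are NLTK's standard ones for these three addresses):

    import nltk
    nltk.download("inaugural")
    nltk.download("punkt")
    from nltk.corpus import inaugural

    speeches = {
        "Roosevelt": inaugural.raw("1941-Roosevelt.txt"),
        "Kennedy": inaugural.raw("1961-Kennedy.txt"),
        "Nixon": inaugural.raw("1973-Nixon.txt"),
    }

    for name, text in speeches.items():
        words = text.split()
        avg_len = sum(len(w) for w in words) / len(words)
        print(name,
              "| characters:", len(text),                  # including spaces
              "| words:", len(words),
              "| sentences:", len(nltk.sent_tokenize(text)),
              "| avg word length:", round(avg_len, 2))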

Problem 2 - Text cleaning

 Stopword removal
 Stemming
 Find the three most common words used in all three speeches

After removal of stopwords:

• President Franklin D. Roosevelt's speech has 5,144 characters (including spaces).

• President John F. Kennedy's speech has 5,205 characters (including spaces).

• President Richard Nixon's speech has 6,557 characters (including spaces).

After removal of stopwords:

• President Franklin D. Roosevelt's speech has 662 words in total.

• President John F. Kennedy's speech has 723 words in total.

• President Richard Nixon's speech has 843 words in total.

 As we can see, '--' (63), 'us' (44) and 'new' (26) are the most frequent tokens (a sketch of the cleaning and counting step follows below).
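A sketch of the cleaning step; a plain whitespace split is assumed, which is why tokens such as '--' survive and show up in the counts:

    import nltk
    nltk.download("stopwords")
    from collections import Counter
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()

    def clean(text):
        # Lowercase, drop stopwords, then stem each remaining token
        tokens = [w.lower() for w in text.split() if w.lower() not in stop_words]
        return [stemmer.stem(w) for w in tokens]

    # Three most common tokens per cleaned speech
    for name, text in speeches.items():
        print(name, Counter(clean(text)).most_common(3))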
Observations:

The most frequent words used in all three speeches are:

• us - 44

• new - 26

• let - 25

• america - 15

• shall - 13

Here, 'every', 'peace' and 'people' all share 7th place with the same number of occurrences. The most frequent tokens overall are '--', 'us' and 'new'.
Problem 2 - Plot word clouds of all three speeches

 Show the most common words used in all three speeches in the form of word clouds.

We can see some highlighted words like 'let', 'us', 'new', 'nation', 'world', 'america', 'people', 'peace', etc. The bigger the word, the higher its frequency.
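A sketch of the word-cloud step using the wordcloud package, reusing the clean() helper and speeches dictionary from above:

    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    for name, text in speeches.items():
        # Word size scales with token frequency in the cleaned speech
        wc = WordCloud(width=800, height=400, background_color="white")
        wc.generate(" ".join(clean(text)))
        plt.imshow(wc, interpolation="bilinear")
        plt.axis("off")
        plt.title(name)
        plt.show()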

Insights:

 Our objective was to look at all the speeches and analyse them to find the strength and sentiment of each.
 Based on the outputs, we can see that some similar words are present in all the speeches.
 These words may reflect the message that inspired many people and helped win each speaker the seat of President of the United States of America.
 Among all the speeches, "nation" is the word that is significantly highlighted in all three.
