ML-2 Project Final

The document outlines a business report focused on leveraging machine learning to build predictive models for voter preferences based on a survey of 1525 voters. It details the steps involved in exploratory data analysis, data preprocessing, model building, and performance evaluation, along with actionable insights for a news channel's election coverage. Additionally, it includes a data dictionary and various analyses related to demographic and socio-economic factors influencing voter behavior.

MACHINE LEARNING-2

BUSINESS REPORT
CODED PROJECT
Table of Contents

PROBLEM 1
  Data Overview
    Business Context
    Objective
    Data Dictionary
  Problem 1.1 - Define the problem and perform Exploratory Data Analysis
    Problem definition - Check shape, data types, statistical summary
    Univariate Analysis
    Bivariate Analysis
  Problem 1.2 - Data Preprocessing
    Outlier Treatment
    Encode the Data
    Data Split
  Problem 1.3 - Model Building and Model Evaluation
    Metrics of Choice (justify the evaluation metrics)
    Model Building (KNN, Naive Bayes, Bagging, Boosting)
  Problem 1.4 - Model Performance Improvement
    Improve the performance of the bagging and boosting models by tuning - Comment on the performance improvement on training and test data
  Problem 1.5 - Final Model Selection
    Compare all the models built so far - Select the final model with proper justification - Check the most important features in the final model and draw inferences
  Problem 1.6 - Actionable Insights & Recommendations
    Compare all four models - Conclude with the key takeaways for the business

PROBLEM 2
  Data Overview
    Business Context
    Objective
    Data Dictionary
  Problem 2.1 - Define the problem and perform Exploratory Data Analysis
    Problem definition - Find the number of characters, words and sentences in all three speeches
    Number of Characters in all three speeches
    Number of Words in all three speeches
    Number of Sentences in all three speeches
  Problem 2.2 - Text Cleaning
    Stopword removal - Stemming - Find the three most common words used in all three speeches
    Removal of Stopwords
    Stemming
    Three most common words used in all three speeches
    Inferences
  Problem 2.3 - Plot word clouds of all three speeches
    Show the most common words used in all three speeches as word clouds
    Inferences
List of Figures

Fig 1.1.1  First five rows of the dataset
Fig 1.1.2  Last few rows of the dataset
Fig 1.1.3  Shape of the dataset
Fig 1.1.4  Data types of the dataset
Fig 1.1.5  Statistical summary of the dataset
Fig 1.1.6  Descriptive analysis of numerical and categorical variables
Fig 1.1.7  Duplicate check
Fig 1.1.8  Description and data types
Fig 1.1.9  Unique values for categorical variables
Fig 1.1.10 Histogram and boxplot depicting each numerical attribute
Fig 1.1.11 Age distribution plot
Fig 1.1.12 Barplot distribution of vote
Fig 1.1.13 Vote vs age boxplot distribution
Fig 1.1.14 Relationship between each attribute with respect to gender
Fig 1.1.15 Pair plot of the variables
Fig 1.1.16 Heatmap of attributes
Fig 1.2.1  Boxplot to check the presence of outliers in the variables
Fig 1.2.2  Boxplot after outlier treatment
Fig 1.2.3  Dataset after encoding
Fig 1.2.4  Data split ratio
Fig 1.3.1  Accuracy, confusion matrix and classification report of the train data
Fig 1.3.2  Accuracy, confusion matrix and classification report of the test data
Fig 1.3.3  Dataset used in KNN after scaling through z-score
Fig 1.3.4  Dataset shape used in the model building
Fig 1.3.5  Accuracy, confusion matrix and classification report of the train data
Fig 1.3.6  Accuracy, confusion matrix and classification report of the test data
Fig 1.3.7  Difference between training and testing accuracy of the model
Fig 1.3.8  Misclassification plot for different k values
Fig 1.4.1  Accuracy, confusion matrix and classification report of the train data
Fig 1.4.2  Accuracy, confusion matrix and classification report of the test data
Fig 1.4.3  Accuracy, confusion matrix and classification report of the train data
Fig 1.4.4  Accuracy, confusion matrix and classification report of the test data
Fig 1.4.5  Accuracy, confusion matrix and classification report of the train data
Fig 1.4.6  Accuracy, confusion matrix and classification report of the test data
Fig 1.4.7  Dataset shape with the accuracy score
Fig 2.1.1  Loaded text speeches into the dataset format
Fig 2.1.2  Information of the imported speech text
Fig 2.1.3  Number of characters in all three speeches
Fig 2.1.4  Number of words in all three speeches
Fig 2.1.5  Number of sentences in all three speeches
Fig 2.1.6  Count of stop words
Fig 2.1.7  Count of special characters
Fig 2.1.8  Count of numbers
Fig 2.1.9  Count of uppercase words
Fig 2.1.10 Count of uppercase letters
Fig 2.1.11 Number of listings in the complete speech set
Fig 2.2.1  Lowercase conversion of the speech
Fig 2.2.2  Removal of punctuation
Fig 2.2.3  Removal of stop words
Fig 2.2.4  FreqDist count of words after stemming
Fig 2.2.5  Three most common words used in all three speeches
Fig 2.2.6  Top 3 common words used in all three speeches after removing extended stop words
Fig 2.3.1  Word cloud for the inaugural speech (after cleaning)
PROBLEM 1

Data Overview
Business Context:
CNBE, a prominent news channel, is gearing up to provide insightful coverage of recent
elections, recognizing the importance of data-driven analysis. A comprehensive survey has
been conducted, capturing the perspectives of 1525 voters across various demographic and
socio-economic factors. This dataset encompasses 9 variables, offering a rich source of
information regarding voters' characteristics and preferences.

The primary objective is to leverage machine learning to build a predictive model capable of
forecasting which political party a voter is likely to support. This predictive model, developed
based on the provided information, will serve as the foundation for creating an exit poll. The
exit poll aims to contribute to the accurate prediction of the overall election outcomes,
including determining which party is likely to secure the majority of seats.
Objective:
The primary objective is to leverage machine learning to build a predictive model capable of
forecasting which political party a voter is likely to support.

1. Define the problem and perform Exploratory Data Analysis - Problem definition - Check shape, data types, statistical summary - Univariate analysis - Multivariate analysis - Use appropriate visualizations to identify patterns and insights - Key meaningful observations on individual variables and the relationships between variables.
2. Data Pre-processing - Prepare the data for modelling: Missing value treatment (if needed) - Outlier detection (treat, if needed) - Encode the data - Data split - Scale the data (and state the reasons for scaling the features).
3. Model Performance Evaluation - Check the confusion matrix and classification metrics for all the models (for both train and test datasets) - ROC-AUC score and plot of the curve - Comment on the performance of all the models.
4. Model Performance Improvement - Improve the performance of the bagging and boosting models by tuning them - Comment on the performance improvement on training and test data.
5. Final Model Selection - Compare all the models built so far - Select the final model with proper justification - Check the most important features in the final model and draw inferences.
6. Actionable Insights & Recommendations - Compare all four models - Conclude with the key takeaways for the business.

Data Dictionary:

1. vote: Party choice: Conservative or Labour


2. age: in years
3. economic.cond.national: Assessment of current national economic conditions, 1 to 5.
4. economic.cond.household: Assessment of current household economic conditions, 1 to 5.
5. Blair: Assessment of the Labour leader, 1 to 5.
6. Hague: Assessment of the Conservative leader, 1 to 5.
7. Europe: an 11-point scale that measures respondents' attitudes toward European
integration. High scores represent ‘Eurosceptic’ sentiment.
8. political.knowledge: Knowledge of parties' positions on European integration, 0 to 3.
9. gender: female or male.

Problem 1.1 - Define the problem and perform Exploratory Data Analysis

Problem definition - Check shape, Data types, statistical summary

The problem defined in this context is to build a predictive model leveraging machine learning
techniques. The goal is to forecast which political party a voter is likely to support based on a
comprehensive survey capturing the perspectives of 1525 voters across various demographic
and socio-economic factors. The predictive model serves a dual purpose: it is intended to
contribute to insightful coverage by a news channel (CNBE) of recent elections and to form
the foundation for an exit poll. The exit poll aims to accurately predict overall election
outcomes, specifically determining which political party is likely to secure the majority of
seats. Therefore, the objective is to develop a robust and accurate machine learning model
that can effectively predict voter preferences and, by extension, forecast election results.

The initial steps to get an overview of any dataset are to:

 observe the first few rows of the dataset, to check whether the dataset has been
loaded properly or not.
 get information about the number of rows and columns in the dataset.
 find out the data types of the columns to ensure that data is stored in the preferred
format and the value of each property is as expected.
 check the statistical summary of the dataset to get an overview of the numerical
columns of the data.
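These overview steps can be sketched with pandas; the small frame below is a synthetic stand-in for the survey data (the real workflow would load the actual survey file instead).

```python
import pandas as pd

# Synthetic stand-in for the voter survey; in the real workflow the data
# would be loaded from the survey file (e.g. with pd.read_excel).
df = pd.DataFrame({
    "vote": ["Labour", "Conservative", "Labour"],
    "age": [43, 36, 35],
    "gender": ["female", "male", "male"],
})

print(df.head())      # first few rows - sanity-check the load
print(df.shape)       # (number of rows, number of columns)
print(df.dtypes)      # storage type of each column
print(df.describe())  # statistical summary of the numeric columns
```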

 Displaying the first few rows of the dataset.

Fig: 1.1.1 First five rows of the dataset

 Displaying the last few rows of the dataset.

Fig: 1.1.2 Last few rows of the dataset

 Checking the shape of the dataset.

Fig: 1.1.3 Shape of the dataset

 Checking the data types of the columns for the dataset.

Fig:1.1.4 Data types of the dataset

 There are 1525 rows and 10 columns. The unnamed column contains only serial-number information, which is not required for further analysis, so it has been dropped from the data frame.
 From the information above it can be inferred that there are 2 categorical variables and 8 integer variables, and there are no missing values: every attribute has 1525 entries.

 Getting the statistical summary for the numerical variables.

Fig: 1.1.5 Statistical Summary of the dataset

 The dataset reveals some interesting patterns across its attributes. In terms of age, the
relatively high mean (54.18 years) suggests a generally older population.
 The quartiles and median distribution indicate a fairly even spread, with the 25th
percentile at 41 years, the median at 53 years, and the 75th percentile at 67 years.
 The economic conditions, both national and household, exhibit moderate means and
standard deviations, with quartiles indicating a reasonable spread of responses.
 Blair's rating, with a mean of 3.33, appears relatively balanced, but the larger standard
deviation implies some variability in opinions. Similarly, Hague's rating has a lower
mean of 2.75 and a notable standard deviation, indicating a more diverse range of
opinions.
 Europe's attribute demonstrates a higher mean (6.73) and a wider standard deviation,
suggesting varied sentiments towards European matters.
 Political knowledge scores, with a mean of 1.54, show a low average understanding,
and the quartiles reveal a concentration of scores at the lower end, potentially
indicating a lack of widespread political knowledge among respondents.
 Overall, while some attributes exhibit relatively stable and centred distributions,
others, such as political knowledge, display potential outliers or skewness that may
warrant further investigation.

 Descriptive Analysis of Numerical and Categorical variable.

Fig: 1.1.6 Descriptive analysis of Numerical and Categorical Variable

 The dataset provides information on voting patterns and gender distribution among
1525 respondents. In terms of voting, there are two distinct categories, with Labour
being the most frequently chosen option, garnering 1063 votes.
 This dominance suggests a significant preference for the Labour Party among the
respondents.

 The gender distribution is somewhat imbalanced, with a total of 812 females compared to 713 males, indicating a higher representation of females in the dataset.
 This imbalance may be relevant in the context of the study's objectives, potentially
influencing the overall perspectives and responses captured.
 Understanding the dynamics between voting patterns and gender in this dataset could
provide valuable insights into the political landscape and preferences of the surveyed
population.
 Further analyses focusing on the interplay between gender and voting choices could
unveil interesting nuances and contribute to a more comprehensive understanding of
the dataset.

 Duplicate Row check and Removal

Fig: 1.1.7 Duplicate Check

 8 rows had duplicate records which were dropped. After dropping the duplicate rows,
the shape of the data frame is 1517 rows and 9 columns with 7 integers and 2 object
variables.
 Upon dropping the duplicate rows and comparing the descriptive statistics with the
previous set, it appears that the attributes' central tendencies and dispersions remain
quite consistent.
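The cleanup described above can be sketched as follows; the toy frame stands in for the survey data, with the serial-number column named "Unnamed: 0" as pandas typically labels it.

```python
import pandas as pd

# Toy stand-in: a serial-number column plus one duplicated record
df = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3, 4],
    "vote": ["Labour", "Labour", "Conservative", "Labour"],
    "age": [43, 43, 36, 35],
})

# Drop the serial-number column, then count and remove duplicate rows
df = df.drop(columns=["Unnamed: 0"])
n_dupes = int(df.duplicated().sum())
df = df.drop_duplicates().reset_index(drop=True)
print(n_dupes, df.shape)  # 1 duplicate found; 3 rows remain
```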

 Columns description and data type.

Fig: 1.1.8 Description and Data types


 There are 9 columns (7 integer and 2 object variables) with 1517 observations after removing the duplicate entries and the unwanted column.

 Unique values for Categorical variable.

Fig: 1.1.9 Unique Values for Categorical Variable

 Further probing into the Categorical variables: The dataset provides insights into two
categorical variables: "VOTE" and "GENDER." For the "VOTE" variable, there are two
unique values - Conservative and Labour.
 Among the 1517 respondents, 1057 individuals voted for the Labour party, while 460
opted for the Conservative party. This distribution suggests a notable preference for
the Labour party among the surveyed population, with a substantial majority choosing
this political option.
 Regarding the "GENDER" variable, there are also two unique values - male and female.
Among the respondents, there were 808 females and 709 males, indicating a slight
imbalance in gender representation, with a higher number of females in the dataset.
 This gender distribution may be relevant for analysing how political preferences and
opinions vary across gender lines.
 In summary, the data suggests a predominant inclination towards the Labour party in
voting patterns, and the gender distribution shows a slight majority of female
respondents.
 These insights into the categorical variables contribute to a better understanding of
the political and demographic composition of the surveyed population. Further
analyses could explore potential relationships between voting preferences and gender
to uncover nuanced patterns within the dataset.
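These category counts can be reproduced with `value_counts()`; the miniature frame below is only illustrative.

```python
import pandas as pd

# Illustrative stand-in for the two categorical survey columns
df = pd.DataFrame({
    "vote": ["Labour", "Labour", "Conservative"],
    "gender": ["female", "male", "female"],
})

# Frequency of each level of the two categorical variables
for col in ["vote", "gender"]:
    print(df[col].value_counts())
```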

Univariate Analysis:

Fig: 1.1.10 Histogram and boxplot depicting each numerical Attributes

 Univariate analysis involves the examination of individual variables in isolation to
understand their distribution, characteristics, and behaviour.
 In the context of the Voter’s characteristics and preferences with respect to the party’s
choice measures dataset, univariate analysis entails scrutinizing each attribute
independently.
 This process includes assessing summary statistics such as mean, median, standard
deviation, and quartiles to gauge the central tendency, spread, and shape of the
variable's distribution.
 Visualization tools like histograms, kernel density plots, or box plots can be employed
to provide a visual representation of the data's distribution.
 Additionally, identifying potential outliers and understanding the presence of missing
values are crucial aspects of univariate analysis.
 This analysis aids in uncovering patterns, trends, and anomalies within each variable,
laying the groundwork for more in-depth exploration and guiding subsequent data
preprocessing and modeling decisions.
 Positive skewness (right skew) is observed in most of the variables, which gives insight into the asymmetry of their distributions.
 The above plots represent the behavioural characteristics of each attribute as it contributes to the election outcome, i.e. the party the voters choose.
 From the above plots, it can be seen that most of the variables are skewed to one side or the other, which indicates that the values require pre-processing before proceeding to the analysis.
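The skewness noted above can also be quantified numerically; here a right-skewed synthetic sample stands in for the survey columns, and a positive coefficient confirms the right skew.

```python
import numpy as np
import pandas as pd

# Exponential draws are right-skewed by construction, mimicking the
# asymmetry observed in several of the survey attributes.
rng = np.random.default_rng(2)
s = pd.Series(rng.exponential(scale=2.0, size=500))

print(s.skew())  # positive value => right (positive) skew
```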

Fig: 1.1.11 Age distribution plot

 Age distribution counts in terms of vote, with respect to the gender variable.

Fig: 1.1.12 barplot distribution of Vote

 The above bar plot represents the percentage share of each party as per the voters' preferences.
 Labour is likely to acquire 69.7% of the votes in the upcoming elections, whereas the Conservative party is likely to get 30.3%, based on the 1517 observations in the survey.

Bivariate Analysis:

Fig: 1.1.13 Vote Vs Age boxplot distribution

Fig:1.1.14 Relationship between each attribute with respect to gender

Fig: 1.1.15 Pair plot of the variables

Fig:1.1.16 Heatmap of Attributes

 The following is the interpretation of the correlation matrix based on the above
values:
 Age has a very weak positive correlation with economic conditions (national and
household), Blair's rating, and Europe's rating.
 It has a weak negative correlation with political knowledge.

 Economic conditions (national and household) are positively correlated, indicating
that individuals who perceive the national economic condition positively are likely to
perceive their household economic condition positively as well.
 Both economic conditions have positive correlations with Blair's rating, suggesting
that individuals with a positive economic outlook may be more inclined to support
Blair.
 Blair and Hague have a negative correlation, implying that respondents who rate
Blair higher tend to rate Hague lower and vice versa.
 Europe has a negative correlation with economic conditions and Blair, suggesting that individuals with a positive economic outlook or favourable views of Blair tend to be less Eurosceptic (recall that high Europe scores indicate Eurosceptic sentiment).
 Political knowledge has weak negative correlations with age, economic conditions,
and ratings for Blair, Hague, and Europe. This suggests that individuals with higher
political knowledge tend to be slightly younger and may have different perspectives
on economic conditions and political figures.
 These correlations provide insights into potential relationships between variables,
which can be valuable for feature selection in building predictive models. However,
keep in mind that correlation does not imply causation and further analysis or
experimentation may be needed to establish causal relationships.
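The correlation matrix behind the heatmap can be computed with `DataFrame.corr()`; random columns stand in for the survey attributes here, so the actual coefficients will differ.

```python
import numpy as np
import pandas as pd

# Random stand-ins for three of the numeric survey columns
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 3)),
                  columns=["age", "Blair", "Hague"])

# Pearson correlations between every pair of numeric columns
corr = df.corr()
print(corr.round(2))

# A heatmap such as Fig 1.1.16 would typically be rendered with, e.g.,
# seaborn: sns.heatmap(corr, annot=True, cmap="coolwarm")
```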

Problem 1.2 - Data Preprocessing

Outlier Treatment:

 To check for the presence of outliers, box plots were plotted.


 From the plots below it can be inferred that there are nearly no outliers in most of the numerical columns; the only outliers are in the economic.cond.household and economic.cond.national variables.
 In Gaussian Naive Bayes, outliers affect the shape of the fitted Gaussian distribution and have the usual effects on the mean and variance.
 So, depending on the use case, it makes sense to treat the outliers.
 The outliers were treated using the IQR method, and it can be seen from the following graph that they have been removed.

Fig: 1.2.1 Boxplot to check the presence of outliers in the variable

Fig: 1.2.2 Boxplot after outlier treatment

 Outliers in both economic.cond.household and economic.cond.national have been
removed using the IQR treatment.
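A minimal sketch of the IQR treatment for one column, assuming the usual 1.5×IQR whiskers and capping (rather than deleting) the offending values:

```python
import pandas as pd

# Toy stand-in for a 1-to-5 rating column such as economic.cond.household
s = pd.Series([3, 3, 4, 3, 2, 3, 4, 3, 1, 5, 3, 4])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Cap values outside the whiskers at the whisker limits
treated = s.clip(lower=lower, upper=upper)
print(lower, upper)  # 1.5 5.5
```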

Encode the Data:

 One-hot encoding was done for the 2 object variables. drop_first=True is used so that
only k-1 of the k dummy columns created from each categorical variable are kept;
including all k columns would introduce multicollinearity. This ensures we do not
land in the dummy-variable trap.
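A minimal illustration of one-hot encoding with drop_first; the column values below are hypothetical stand-ins for the dataset's object variables:

```python
import pandas as pd

df = pd.DataFrame({"gender": ["male", "female", "male"],
                   "vote":   ["Labour", "Conservative", "Labour"]})

# drop_first=True keeps k-1 dummies per categorical variable; the dropped
# level is implied by all remaining dummies being 0 (avoids the dummy trap)
encoded = pd.get_dummies(df, columns=["gender"], drop_first=True)
# 'gender_female' is dropped; only 'gender_male' remains
```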

Fig: 1.2.3 Data set after encoding


Data split:

 Train-Test Split: X and y were split into training and test sets in a 70:30 ratio with random_state=1.

Fig: 1.2.4 Data Split Ratio

 The dataset has been effectively split into training and testing sets, with the training
set comprising 1061 samples and the testing set consisting of 456 samples.
 Each sample in both sets includes 8 features, reflecting various demographic and socio-
economic attributes.
 The corresponding target variable, representing the political party vote, accompanies
each sample.
 This division is essential for training a machine learning model on a subset of the data
and evaluating its performance on unseen data, ensuring that the model can
generalize well beyond the training set.
 The provided shapes confirm the appropriate partitioning, setting the stage for model
training and assessment, ultimately contributing to the development of a robust
predictive model for forecasting political party preferences.

Problem 1.3 – Model Building and Model evaluation

Metrics of Choice (Justify the evaluation metrics):

 Gaussian Naïve Bayes

 Now the GaussianNB classifier is built and trained on the training data using the fit()
method. After fitting, the model is ready to make predictions; the predict() method is
called with the test-set features as its argument.
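The fit/predict workflow can be sketched like this, with synthetic data standing in for the actual voter features:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# synthetic stand-in with the same shape as the processed dataset
X, y = make_classification(n_samples=1517, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

nb = GaussianNB().fit(X_train, y_train)   # train the classifier
y_pred = nb.predict(X_test)               # predict on the held-out set

acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)     # rows: actual, columns: predicted
report = classification_report(y_test, y_pred)
```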

Fig: 1.3.1 Accuracy, Confusion matrix and Classification Report of the train data

 Here 0 stands for Conservative and 1 stands for Labour.


 The model achieved an accuracy of approximately 83.5% on the training set. This
indicates the proportion of correctly classified instances out of the total. Predicting the
Conservative class needs attention.
 A balanced evaluation of precision, recall, and F1-score for both classes becomes
crucial. In this scenario, where misclassifying instances from the Conservative class
may have significant implications, it is essential to consider the model's performance
on each class individually.
 True Positive (TP): 675 (Correctly predicted as Labour), True Negative (TN): 211
(Correctly predicted as Conservative).
 False Positive (FP): 96 (Incorrectly predicted as Labour), False Negative (FN): 79
(Incorrectly predicted as Conservative).
 Precision (for Labour) is 0.88: of all instances predicted as Labour, 88% were actually
Labour. This indicates a relatively low false positive rate.
 Recall (for Labour) is 0.90: of all actual Labour instances, 90% were correctly predicted
as Labour. This indicates a relatively low false negative rate.
 The overall accuracy of the model on the training set is 83.5%. This measures the
proportion of correctly classified instances among all instances.

 The F1-score, which is the harmonic mean of precision and recall, is commonly used
when there is an imbalance between classes. The weighted average F1-score is around
0.89.
 In summary, while the model may exhibit high accuracy, a more nuanced assessment
involving class-specific metrics should be conducted to capture the model's
effectiveness for both Labour and Conservative predictions.

Fig: 1.3.2 Accuracy, Confusion matrix and Classification Report of the test data

 The model achieved an accuracy of approximately 82.2% on the test set. This
indicates the proportion of correctly classified instances out of the total. Predicting the
Conservative class needs attention.
 A balanced evaluation of precision, recall, and F1-score for both classes becomes
crucial. In this scenario, where misclassifying instances from the Conservative class
may have significant implications, it is essential to consider the model's performance
on each class individually.
 True Positive (TP): 263 (Correctly predicted as Labour), True Negative (TN): 112
(Correctly predicted as Conservative).
 False Positive (FP): 41 (Incorrectly predicted as Labour), False Negative (FN): 40
(Incorrectly predicted as Conservative).
 Precision (for Labour) is 0.87: of all instances predicted as Labour, 87% were actually
Labour. This indicates a relatively low false positive rate.
 Recall (for Labour) is 0.87: of all actual Labour instances, 87% were correctly predicted
as Labour. This indicates a relatively low false negative rate.
 The overall accuracy of the model on the testing set is 82.2%. This measures the
proportion of correctly classified instances among all instances.
 The F1-score, which is the harmonic mean of precision and recall, is commonly used
when there is an imbalance between classes. The weighted average F1-score is around
0.87.
 Inferences:
 While the model demonstrates reasonably high accuracy on both the training and
testing sets, the persistent occurrence of false positives and false negatives suggests

that further model improvement is necessary. The model seems to generalize
reasonably well to the testing set, as indicated by the comparable accuracy scores.
However, addressing the imbalances in precision, recall, and F1-score, particularly for
class 0, should be a focus for further refinement.
 While calculating likelihoods of numerical features, the Naive Bayes algorithm
assumes each feature to be normally distributed and computes the probability from
that feature's mean and variance only; it also assumes that all the predictors are
independent of each other. Scale therefore does not matter, and performing feature
scaling for this algorithm has little effect.

Model Building (KNN, Naive Bayes, Bagging, Boosting) - Metrics of Choice (Justify the
evaluation metrics):

 KNN MODEL

 Good KNN performance generally requires preprocessing of the data so that all
variables are similarly scaled and centred. Let us now apply a z-score transformation
to the continuous columns and see the performance of KNN.
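A sketch of z-score scaling followed by KNN (synthetic stand-in data); fitting the scaler on the training set only avoids test-set leakage:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1517, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

scaler = StandardScaler().fit(X_train)  # z-score: (x - mean) / std, per column
X_train_z = scaler.transform(X_train)
X_test_z = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train_z, y_train)
test_acc = knn.score(X_test_z, y_test)
```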

Fig: 1.3.3 Dataset used in KNN after scaling through zscore

Fig: 1.3.4 Data set shape used in the model building

 In summary, both the training and testing sets have a consistent structure with 8
features per sample, suggesting that the datasets are well-prepared for training and
evaluating a machine learning model. The sizes of the sets (1061 samples in training
and 456 samples in testing) are reasonable for model development and validation, and

the dimensions align appropriately for compatibility with the chosen machine learning
algorithm.

Fig: 1.3.5 Accuracy, Confusion matrix and Classification Report of the train data

 The provided output from the K-Nearest Neighbours (KNN) model on the training set
indicates a high level of accuracy and overall good performance.
 The accuracy score of approximately 87.08% suggests that the model correctly
classified a substantial majority of instances in the training dataset.
 The confusion matrix reveals a balanced distribution of true positives and true
negatives, with 240 instances correctly predicted as class 0 (Conservative) and 684
instances correctly predicted as class 1 (Labour).
 However, there are notable false positives (81 instances predicted as Labour while
being Conservative) and false negatives (56 instances predicted as Conservative while
being Labour), indicating some misclassifications.
 The precision, recall, and F1-score metrics further characterize the model's
performance, with higher values for class 1, indicating that the model is more effective
at identifying Labour voters.
 The weighted average values of precision, recall, and F1-score, all around 87%, affirm
the model's balanced performance across both classes.
 Overall, the KNN model demonstrates robust predictive capabilities on the training set,
and further evaluation on the testing set would be essential to assess its generalization
to unseen data.

Fig: 1.3.6 Accuracy, Confusion matrix and Classification Report of the test data

 The output from the K-Nearest neighbours (KNN) model on the testing set indicates a
solid level of performance, although there are some areas for consideration.
 The accuracy score of approximately 82.23% suggests that the model correctly
classified a significant portion of instances in the unseen dataset.
 The confusion matrix reveals a distribution of true positives and true negatives, with
88 instances correctly predicted as class 0 (Conservative) and 287 instances correctly
predicted as class 1 (Labour).
 However, there are noticeable false positives (51 instances predicted as Labour while
being Conservative) and false negatives (30 instances predicted as Conservative while
being Labour), indicating some misclassifications on the testing set.
 The precision, recall, and F1-score metrics provide a more nuanced understanding of
the model's performance, with a higher precision, recall, and F1-score for class 1,
indicating that the model is more proficient at identifying Labour voters.
 The weighted average values of precision, recall, and F1-score, all around 82%, affirm
the model's overall effectiveness on the testing set.
 While the model maintains a solid level of performance, further fine-tuning or
alternative approaches may be explored to address misclassifications and enhance
predictive accuracy.

Fig: 1.3.7 Difference between training and testing accuracy of the model

 In summary, while a slight difference between training and testing accuracy is


common, a negative difference suggests a potential issue with overfitting or model
generalization. Further investigation into model complexity, hyperparameters, and
data characteristics may help identify opportunities for improvement.

 Plot misclassification error vs k (with k value on X-axis) using matplotlib

Fig: 1.3.8 misclassification plot for different k values
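A plot like the one above can be produced by looping over candidate k values and recording the test-set misclassification error; a sketch on synthetic data, with the matplotlib lines commented out so the computation stands alone:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1517, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

ks = list(range(1, 21))
errors = []
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    errors.append(1.0 - knn.score(X_test, y_test))  # misclassification error

# import matplotlib.pyplot as plt
# plt.plot(ks, errors, marker="o")
# plt.xlabel("k"); plt.ylabel("misclassification error")
# plt.show()
```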

 When comparing the differences between test and train scores for different values of
k, the goal is to select a model with a smaller difference, indicating better
generalization and reduced risk of overfitting. Let's analyse the provided differences:

For K=5, the difference is -0.0485.
For K=7, the difference is -0.0478.
For K=11, the difference is -0.0359.
For K=12, the difference is -0.0284.
For K=13, the difference is -0.0368.

 Among these options, the model with K=12 has the smallest difference between test
and train scores, suggesting better generalization. A smaller difference implies that the
model performs consistently on both the training and testing sets, reducing the
likelihood of overfitting.
 Therefore, based on the provided differences, the model with K=12 appears to be the
better choice for this particular scenario. However, it's essential to consider other
performance metrics (such as overall accuracy, precision, recall, and F1-score) and
possibly perform cross-validation to ensure the robustness and generalizability of the
chosen model.

Accuracy Score for K=7 is 0.8135964912280702
Accuracy Score for K=11 is 0.8245614035087719
Accuracy Score for K=12 is 0.831140350877193

 The accuracy scores for different values of k in your K-Nearest Neighbours (KNN) model
provide insights into how well the model is performing on your dataset. Let's analyse
the results:
 Accuracy scores: K=7: 0.8136; K=11: 0.8246; K=12: 0.8311.
 Optimal K: The increasing trend in accuracy scores suggests that higher values of k are
contributing to better performance. Among the mentioned values, K=12 has the
highest accuracy.
 Model Complexity: The choice of k influences the complexity of the KNN model.
Smaller k values result in more complex models, while larger k values lead to smoother
decision boundaries and potentially better generalization.
 Consistency: The consistency in the accuracy scores across K=11 and K=12 indicates
that the model's performance is stable. The small increase in accuracy from K=11 to
K=12 suggests that the model may not be overly sensitive to changes in k beyond a
certain point.
 Generalization: The increasing accuracy scores imply that the model is generalizing
well to the testing data. However, as with any model evaluation, it's crucial to consider
potential overfitting and validate the model's performance on a separate validation set
or through cross-validation.
 Fine-Tuning: Further exploration of different k values, especially in the vicinity of the
optimal value (around K=12), may provide additional insights into the model's
performance. Fine-tuning can help strike the right balance between bias and variance.
 In summary, the results indicate that the model performs well with higher values of k,
and K=12 provides the highest accuracy among the mentioned values. The choice of
the optimal k depends on the specific characteristics of your dataset, and additional
experimentation or validation techniques may be valuable.

Problem 1 .4 – Model Performance Improvement:

Improve the model performance of bagging and boosting models by tuning the model -
Comment on the model performance improvement on training and test data:

 Ada Boost

Fig: 1.4.1 Accuracy, Confusion matrix and Classification Report of the train data

 Accuracy: The AdaBoost model achieved an accuracy of approximately 84.26% on the


training set. This indicates that the model correctly predicted the political party
preference for 84.26% of the samples in the training data.
 Confusion Matrix: The confusion matrix shows that there were 219 true negatives, 675
true positives, 102 false positives, and 65 false negatives in the training set.
 Precision (the ratio of correctly predicted positive observations to the total predicted
positives) for class 0 is 0.77, and for class 1 is 0.87.
 Recall (the ratio of correctly predicted positive observations to all observations in the
actual class) for class 0 is 0.68, and for class 1 is 0.91.
 F1-score (the weighted average of Precision and Recall) is also reported.

Fig: 1.4.2 Accuracy, Confusion matrix and Classification Report of the test data

 Accuracy: The AdaBoost model achieved an accuracy of approximately 81.57% on the


test set. This indicates that the model correctly predicted the political party preference
for 81.57% of the samples in the test data.
 Confusion Matrix: The confusion matrix shows that there were 95 true negatives, 277
true positives, 44 false positives, and 40 false negatives in the test set.
 Precision, Recall, and F1-score: Similar to the training set, precision, recall, and F1-
score values are reported for both classes.
 Overall Inference: The model seems to perform well on both the training and test
datasets, with high accuracy and balanced precision and recall values. Precision and
recall for both classes (Conservative and Labour) are reasonably balanced, indicating
that the model is not biased towards one class. The F1-score considers both precision
and recall, providing a balance between false positives and false negatives.
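A hedged sketch of AdaBoost with a small hyperparameter grid; the grid actually tuned in the project is not shown in the report, so the values below are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1517, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

# grid-search over an illustrative hyperparameter grid with 3-fold CV
grid = GridSearchCV(
    AdaBoostClassifier(random_state=1),
    param_grid={"n_estimators": [50, 100],
                "learning_rate": [0.5, 1.0]},
    cv=3, scoring="accuracy")
grid.fit(X_train, y_train)

best = grid.best_estimator_
train_acc = best.score(X_train, y_train)
test_acc = best.score(X_test, y_test)
```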

 Gradient Boosting

Fig: 1.4.3 Accuracy, Confusion matrix and Classification Report of the train data

 Accuracy: The Gradient Boosting model achieved an accuracy of approximately 90%


on the training set. This indicates that the model correctly predicted the political party
preference for 90% of the samples in the training data.

 Confusion Matrix: The confusion matrix shows that there were 254 true negatives, 696
true positives, 67 false positives, and 44 false negatives in the training set.
 Precision, Recall, and F1-score: Precision for class 0 is 0.85, and for class 1 is 0.91. Recall
for class 0 is 0.79, and for class 1 is 0.94. F1-score is also reported.

Fig: 1.4.4 Accuracy, Confusion matrix and Classification Report of the test data

 Accuracy: The Gradient Boosting model achieved an accuracy of approximately 83%


on the test set. This indicates that the model correctly predicted the political party
preference for 83% of the samples in the test data.
 Confusion Matrix: The confusion matrix shows that there were 93 true negatives, 285
true positives, 46 false positives, and 32 false negatives in the test set.
 Precision, Recall, and F1-score: Precision for class 0 is 0.74, and for class 1 is 0.86. Recall
for class 0 is 0.67, and for class 1 is 0.90. F1-score is also reported.
 Overall Inference: The Gradient Boosting model demonstrates strong performance on
both the training and test datasets, with high accuracy and balanced precision and
recall values. Precision and recall for both classes (Conservative and Labour) are
reasonably balanced, indicating that the model is not biased towards one class.
 The F1-score considers both precision and recall, providing a balance between false
positives and false negatives. The model generalizes well to the test set, suggesting
that it is robust and not overfitting to the training data.
 This performance suggests that the Gradient Boosting Classifier is a promising model
for predicting political party preferences based on the provided features. Further
analysis, including hyperparameter tuning, could potentially enhance the model's
performance.
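A minimal Gradient Boosting sketch in the same spirit, using default hyperparameters since the report does not state the exact settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1517, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

# 100 sequential trees by default, each one fitting the errors of the ensemble so far
gb = GradientBoostingClassifier(random_state=1).fit(X_train, y_train)

train_acc = gb.score(X_train, y_train)
test_acc = gb.score(X_test, y_test)
cm_test = confusion_matrix(y_test, gb.predict(X_test))
```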

 BAGGING

 The performance metrics on the train and test data sets indicate that the Bagging
Classifier is performing exceptionally well on the training data, achieving near-perfect
accuracy.

 However, there is a noticeable drop in performance on the test data, with an accuracy
of 81.91%.

Fig: 1.4.5 Accuracy, Confusion matrix and Classification Report of the train data

 Accuracy: 99.92%: The model achieves extremely high accuracy on the training data,
which might suggest overfitting.
 Confusion Matrix: The confusion matrix on the training set shows that the model
predicts almost every instance of both classes correctly, with only a single
misclassification.

Fig: 1.4.6 Accuracy, Confusion matrix and Classification Report of the test data

 Accuracy: 81.91%: The model's accuracy drops on the test data, indicating that the
model might not generalize as well to unseen data.
 Confusion Matrix: The confusion matrix on the test set reveals that the model makes
some errors in predicting both classes. There are false positives and false negatives,
suggesting some misclassifications.

 Precision, Recall, F1-Score: The precision, recall, and F1-score for both classes (0 and
1) are provided in the classification report.
 Precision measures the accuracy of the positive predictions, recall measures the
coverage of actual positive instances, and the F1-score is the harmonic mean of
precision and recall. These metrics provide a more detailed view of the model's
performance.
 Inference: The model seems to have overfitted to the training data, achieving near-
perfect accuracy on it.
 The drop in accuracy on the test set suggests that the model might not generalize well
to new, unseen data. Further analysis of precision, recall, and F1-score for each class
can help understand the model's performance on specific classes and identify areas
for improvement.
 Consider regularization techniques or adjusting hyperparameters to improve
generalization performance. Additionally, further evaluation using cross-validation can
provide a more robust assessment of the model's capabilities.
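A Bagging sketch: with the default unpruned decision tree as base learner, near-perfect training accuracy (as observed above) is expected:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1517, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

# each of the 100 trees is trained on a bootstrap resample of the training set;
# predictions are made by majority vote across trees
bag = BaggingClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)

train_acc = bag.score(X_train, y_train)  # typically close to 1.0 (overfits)
test_acc = bag.score(X_test, y_test)
```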

 Cross Validation on Naïve Bayes Model

Cross Validation With CV=5:

Cross Validation training score

Mean training score

Cross Validation test score

Mean testing score

Cross Validation With CV=10:

Cross Validation training score

Mean training score

Cross Validation test score

Mean testing score

 Consistency: The model demonstrates a consistent performance across different folds


in both the training and testing sets, as indicated by the small standard deviations in
the cross-validated scores.
 Generalization: The model generalizes well to unseen data, with mean testing scores
comparable to mean training scores. This suggests that the NB model is not
significantly overfitting to the training data.
 Stability: The performance stability is evident as the mean training and testing scores
remain close across both cross-validation scenarios (cv=5 and cv=10).
 Slight Overfitting: While the model generalizes well, there is a small gap between mean
training and testing scores, indicating a slight overfitting tendency.
 Model Adequacy: The NB model seems adequate for the given task, achieving
consistent and reasonable accuracy on both training and testing sets. However, further
exploration or consideration of alternative models may be valuable.
 In summary, the Naive Bayes model demonstrates stable and consistent performance
across cross-validation folds, suggesting its adequacy for the task. The slight overfitting
observed could potentially be addressed through model tuning or regularization
techniques. Overall, the model appears to generalize well to new data.
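The cross-validation runs described above can be sketched with cross_val_score, again on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1517, n_features=8, random_state=1)

scores_cv5 = cross_val_score(GaussianNB(), X, y, cv=5)    # 5 folds
scores_cv10 = cross_val_score(GaussianNB(), X, y, cv=10)  # 10 folds

# a small standard deviation across folds indicates stable performance
mean_cv5, std_cv5 = scores_cv5.mean(), scores_cv5.std()
mean_cv10, std_cv10 = scores_cv10.mean(), scores_cv10.std()
```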

 SMOTE

Fig:1.4.7 data set shape with the accuracy score

 An accuracy score of 0.7741 means that your K-Nearest Neighbours (KNN) classifier,
trained on SMOTE-resampled data with K=12, achieved an accuracy of approximately
77.41% on the test set.

 Here are some inferences and considerations based on this result: The model exhibits
decent performance with an accuracy of over 77%. However, the interpretation of
accuracy should be cautious, especially in imbalanced datasets. It's crucial to consider
other evaluation metrics, especially when dealing with imbalanced classes.
 To determine which model is the best, you should consider multiple factors, including
accuracy scores, the specific characteristics of your data, and the problem you are
trying to solve. Here are some observations based on the provided information.
 Accuracy Scores: KNN with K=12 and SMOTE: 0.7741; KNN with K=12 (without
SMOTE): 0.8311; Naive Bayes (NB): 0.8223.
 Consideration of Accuracy Alone: The KNN model without SMOTE, specifically with
K=12, has the highest accuracy (0.8311) among the models mentioned.
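SMOTE was presumably applied via a library such as imbalanced-learn. Its core idea, generating synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours, can be sketched as follows (a simplified hand-rolled version, not the library implementation):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, n_new, k=5, seed=None):
    """Create n_new synthetic minority samples by linear interpolation
    between a chosen point and a random one of its k nearest neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)           # idx[:, 0] is the point itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))        # random minority point
        j = idx[i, rng.integers(1, k + 1)]  # random true neighbour of it
        lam = rng.random()                  # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)
```

In practice `imblearn.over_sampling.SMOTE`, fitted on the training split only, would be used rather than a hand-rolled version.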

Problem 1.5 - Final Model Selection

Compare all the models built so far - Select the final model with proper justification - Check
the most important features in the final model and draw inferences:

 Conclusion: We have conducted various analyses, including descriptive statistics, data


exploration, and model training and evaluation. Here are some observations and
recommendations for developing an exit poll predictive model:
 Descriptive Statistics: The dataset contains information on voters' characteristics,
economic conditions, political knowledge, and party preferences.
 Descriptive statistics provide insights into the distribution of variables, measures of
central tendency, and dispersion.
 Data Exploration: The dataset includes both numerical and categorical variables.
 Analysis of attributes such as age, economic conditions, and political knowledge has
been performed to understand their distributions and relationships.
 Model Training: Machine learning models, including K-Nearest Neighbours (KNN) with
and without SMOTE and Naive Bayes (NB), Bagging and Boosting have been trained.
 Model Comparison: Based on the provided information, it seems that the Gradient
Boosting model has demonstrated the best overall performance among the models
you've built. Here's a summary of the key findings:

 Final Model Selection


 Let's look at the performance of all the models on the Train Data set
 Recall refers to the percentage of total relevant results correctly classified by the
algorithm and hence we will compare Recall of class "1" for all models.

1. Gaussian Naive Bayes Model: Recall for class 1 is 0.90 and the model accuracy is 83.50.

0.8350612629594723

[[211 96]

[ 79 675]]

precision recall f1-score support

0 0.73 0.69 0.71 307

1 0.88 0.90 0.89 754

accuracy 0.84 1061

macro avg 0.80 0.79 0.80 1061

weighted avg 0.83 0.84 0.83 1061

2. KNN Model: Recall for class 1 is 0.92 and the model accuracy is 87.08.

0.8708765315739868

[[240 81]

[ 56 684]]

precision recall f1-score support

0 0.81 0.75 0.78 321

1 0.89 0.92 0.91 740

accuracy 0.87 1061

macro avg 0.85 0.84 0.84 1061

weighted avg 0.87 0.87 0.87 1061

3. KNN Model (For k=12): Recall for class 1 is 0.91 and the model accuracy is 85.95.

0.8595664467483506

[[242 79]

[ 70 670]]

precision recall f1-score support

0 0.78 0.75 0.76 321

1 0.89 0.91 0.90 740

accuracy 0.86 1061

macro avg 0.84 0.83 0.83 1061

weighted avg 0.86 0.86 0.86 1061

4. Ada Boost: Recall for class 1 is 0.91 and the model accuracy is 84.26.

0.8426013195098964

[[219 102]

[ 65 675]]

precision recall f1-score support

0 0.77 0.68 0.72 321

1 0.87 0.91 0.89 740

accuracy 0.84 1061

macro avg 0.82 0.80 0.81 1061

weighted avg 0.84 0.84 0.84 1061

5. Gradient Boosting: Recall for class 1 is 0.94 and the model accuracy is 89.53.

0.8953817153628653

[[254 67]

[ 44 696]]

precision recall f1-score support

0 0.85 0.79 0.82 321

1 0.91 0.94 0.93 740

accuracy 0.90 1061

macro avg 0.88 0.87 0.87 1061

weighted avg 0.89 0.90 0.89 1061

6. BAGGING: Recall for class 1 is 1.00 and the model accuracy is 99.91.

0.9991755976916735

[[377 0]

[ 1 835]]

precision recall f1-score support

0 1.00 1.00 1.00 377

1 1.00 1.00 1.00 836

accuracy 1.00 1213

macro avg 1.00 1.00 1.00 1213

weighted avg 1.00 1.00 1.00 1213

7. KNN Model (K=12) with SMOTE: the model accuracy is 77.41.

So as per the train data,

 Worst performing models are - Gaussian Naive Bayes Model and Ada Boost.
 Best Performing models are - Gradient Boost, KNN Model and Bagging.
 Gradient Boosting appears to be the most effective model among those considered,
as it provides the highest accuracy on the test set. Gradient Boosting often performs
well due to its ensemble nature, which combines multiple weak learners to create a
strong learner.
 However, are these best-performing models overfitted?
 Let's look at the performance on the test data set.
 Recall on the Test Data Set
 Recall refers to the percentage of total relevant results correctly classified by the
algorithm and hence we will compare Recall of class "1" for all models.

1. Gaussian Naive Bayes Model: Recall for class 1 is 0.87 and the model accuracy is 82.23.

0.8223684210526315

[[112 41]

[ 40 263]]

precision recall f1-score support

0 0.74 0.73 0.73 153

1 0.87 0.87 0.87 303

accuracy 0.82 456

macro avg 0.80 0.80 0.80 456

weighted avg 0.82 0.82 0.82 456

2. KNN Model: Recall for class 1 is 0.91 and the model accuracy is 82.23.

0.8223684210526315

[[ 88 51]

[ 30 287]]

precision recall f1-score support

0 0.75 0.63 0.68 139

1 0.85 0.91 0.88 317

accuracy 0.82 456

macro avg 0.80 0.77 0.78 456

weighted avg 0.82 0.82 0.82 456

3. KNN Model (For k=12): Recall for class 1 is 0.89 and the model accuracy is 83.11.

0.831140350877193

[[ 97 42]

[ 35 282]]

precision recall f1-score support

0 0.73 0.70 0.72 139

1 0.87 0.89 0.88 317

accuracy 0.83 456

macro avg 0.80 0.79 0.80 456

weighted avg 0.83 0.83 0.83 456

4. Ada Boost: Recall for class 1 is 0.87 and the model accuracy is 81.57.

0.8157894736842105

[[ 95 44]

[ 40 277]]

precision recall f1-score support

0 0.70 0.68 0.69 139

1 0.86 0.87 0.87 317

accuracy 0.82 456

macro avg 0.78 0.78 0.78 456

weighted avg 0.81 0.82 0.82 456

5. Gradient Boosting: Recall for class 1 is 0.90 and the model accuracy is 82.89.

0.8289473684210527

[[ 93 46]

[ 32 285]]

precision recall f1-score support

0 0.74 0.67 0.70 139

1 0.86 0.90 0.88 317

accuracy 0.83 456

macro avg 0.80 0.78 0.79 456

weighted avg 0.83 0.83 0.83 456

6. BAGGING: Recall for class 1 is 0.90 and the model accuracy is 81.90.

0.819078947368421

[[ 51 32]

[ 23 198]]

precision recall f1-score support

0 0.69 0.61 0.65 83

1 0.86 0.90 0.88 221

accuracy 0.82 304

macro avg 0.78 0.76 0.76 304

weighted avg 0.81 0.82 0.82 304

 Inferences:
 Models that did not perform well on the train data set also did not perform well on
the test data set. Bagging, which scored nearly 100% on the train data set, showed a
markedly poorer result on the test data set, a clear case of overfitting. Gradient Boost
and the KNN model also lose some accuracy from train to test, though far less severely.
 While the KNN model performs well on the test set, Gradient Boosting surpasses it
on the training set and generalizes comparably.

Problem 1.6 Actionable Insights & Recommendations:

Compare all four models - Conclude with the key takeaways for the business

 Model Performance on Test Data:


 KNN (K-Nearest Neighbours):
 Accuracy: Varies with different values of K, ranging from 0.822 to 0.831.
 Performance: Achieves the highest accuracy on the test set, but not the highest
accuracy on the train set.
 Naive Bayes:
 Accuracy: 0.822.
 Performance: Good accuracy, but not the highest among the models; it has the
lowest accuracy on the train set.
 Bagging:
 Accuracy: 0.819.
 Performance: Achieves near-perfect accuracy on the training set but drops
noticeably on the test set.
 Gradient Boosting:
 Accuracy: 0.828.
 Performance: Good accuracy on the test set, and the highest train-set accuracy
among the non-overfitted models.

 Conclusion:
 Models that did not perform well on the train data set also did not perform well on
the test data set, while Bagging, which scored nearly 100% on the train data set,
showed a poor result on the test data set, a clear case of overfitting.
 Gradient Boosting achieves the highest accuracy on the train set among the
non-overfitted models, while the KNN model (K=12) achieves the highest accuracy
on the test set.

 Key takeaways for the business:

 An exit poll aims to predict election outcomes, including determining the party likely
to secure the majority of seats.
 Model performance should be evaluated not only based on accuracy but also
considering precision, recall, and other relevant metrics.
 Consider the potential impact of class imbalance, as predicting the winning party
might be challenging if there's a significant imbalance.

PROBLEM 2
In this particular project, we are going to work on the inaugural corpora from the nltk in Python. We
will be looking at the following speeches of the Presidents of the United States of America:

President Franklin D. Roosevelt in 1941

President John F. Kennedy in 1961

President Richard Nixon in 1973

Code Snippet to extract the three speeches:

"

import nltk

nltk.download('inaugural')

from nltk.corpus import inaugural

inaugural.fileids()

inaugural.raw('1941-Roosevelt.txt')

inaugural.raw('1961-Kennedy.txt')

inaugural.raw('1973-Nixon.txt')

"

Data Overview

Business Context:

The goal is to perform an exploratory data analysis of the given texts of the presidential
inaugural speeches delivered in America in different years. We need to analyse this text
content through text cleaning and by plotting word clouds to find the most common
words used in all three speeches.

Objective:

The objective is to quantify each speech (number of characters, words, and sentences),
clean the text through stop-word removal and stemming, find the three most common
words used across the speeches, and visualize them as word clouds.

1. Define the problem and perform exploratory data analysis - Problem Definition - Find the
number of characters, words & sentences in all three speeches.
2. Text cleaning - Stop-word removal - Stemming - Find the 3 most common words used in all
three speeches.
3. Plot word clouds of all three speeches - Show the most common words used in all three
speeches in the form of word clouds.

Data Dictionary:

1. inaugural.raw('1941-Roosevelt.txt')
2. inaugural.raw('1961-Kennedy.txt')
3. inaugural.raw('1973-Nixon.txt')

Problem 2.1 - Define the problem and perform Exploratory Data Analysis

Problem Definition - Find the number of characters, words & sentences in
all three speeches.

 The problem is to find and analyse the following metrics in each of the three
presidential speeches: the number of characters, words, and sentences. This analysis
will provide insights into the length and structure of the speeches delivered by
Presidents Franklin D. Roosevelt in 1941, John F. Kennedy in 1961, and Richard Nixon
in 1973.
 Performing EDA involves examining and summarizing key characteristics of the data.
In this case, we want to explore the length of the speeches in terms of characters,
words, and sentences.

 Displaying the loaded dataset.

Fig:2.1.1 Loaded text speeches into the dataset format

 Displaying the information of the imported speech.

Fig:2.1.2 Information of the speech text imported

Number of Characters in all the three speeches.

Fig:2.1.3 Number of Characters in all the three speeches.


 President Franklin D. Roosevelt's 1941 speech contains 7,651 characters.
 President John F. Kennedy's 1961 speech contains 7,673 characters.
 President Richard Nixon's 1973 speech contains 10,106 characters.

Number of Words in all the three speeches.

Fig:2.1.4 Number of Words in all the three speeches.


 President Franklin D. Roosevelt's 1941 speech contains 1,323 words.
 President John F. Kennedy's 1961 speech contains 1,364 words.
 President Richard Nixon's 1973 speech contains 1,769 words.

Number of Sentences in all the three speeches.

Fig:2.1.5 Number of Sentences in all the three speeches.

 President Franklin D. Roosevelt's 1941 speech contains 69 sentences.
 President John F. Kennedy's 1961 speech contains 56 sentences.
 President Richard Nixon's 1973 speech contains 70 sentences.
 The analysis of the three presidential speeches reveals interesting insights into their lengths and structures. President Franklin D. Roosevelt's speech from 1941 is characterized by 7,651 characters, 1,323 words, and 69 sentences.
 President John F. Kennedy's speech from 1961 consists of 7,673 characters, 1,364 words, and 56 sentences.
 Lastly, President Richard Nixon's speech from 1973 is the longest, containing 10,106 characters, 1,769 words, and 70 sentences.
 Although Roosevelt's speech is slightly shorter than Kennedy's in characters and words, it has more sentences, suggesting shorter, more direct sentences, whereas Kennedy's fewer sentences imply longer, more elaborate ones. Nixon's speech stands out as the most extensive in characters and words, indicating a more detailed and comprehensive address.
 The varying lengths and sentence structures of these speeches reflect the unique communication styles and emphases of each president during their respective inaugural moments.
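The character, word, and sentence counts above can be reproduced with a short sketch. The report applies NLTK's tokenizers to the full corpus; the stand-in below uses plain-Python whitespace splitting and a crude regex sentence splitter on a small sample string, so the helper name and sample text are illustrative only.

```python
import re

def speech_stats(text):
    """Return (characters, words, sentences) for a raw speech string."""
    n_chars = len(text)                      # every character, spaces included
    n_words = len(text.split())              # whitespace-delimited tokens
    # crude sentence splitter: break on ., ! or ? followed by whitespace
    sentences = [s for s in re.split(r'[.!?]+\s+', text.strip()) if s]
    return n_chars, n_words, len(sentences)

sample_text = ("On each national day of inauguration since 1789, the people "
               "have renewed their sense of dedication. Lives of nations are "
               "not measured in years! Are they measured in the human spirit?")

chars, words, sents = speech_stats(sample_text)
print(chars, words, sents)
```

On the real speeches, nltk.word_tokenize and nltk.sent_tokenize would be used in place of the split calls, which is why the reported word counts differ slightly from a plain whitespace split.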

 Count of Stop Words:

Fig:2.1.6 Count of Stop words

 Count of special characters:

Fig:2.1.7 Count of Special Characters

 Count of Numbers:

Fig: 2.1.8 Count of Numbers

 Count of Uppercase Words:

Fig: 2.1.9 Count of Uppercase Words

 Count of Uppercase Letters:

Fig: 2.1.10 Count of Uppercase Letters

 Number of listings on the complete speech set:

Fig: 2.1.11 Number of listings on the complete speech set
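The auxiliary counts shown in Figs 2.1.6-2.1.10 can be sketched as follows. This is a minimal stand-in, assuming a small hand-written stop-word set in place of NLTK's full English list; the sample sentence is illustrative only.

```python
import re

# Small stand-in for NLTK's English stop-word list (the report uses the full list).
STOPWORDS = {"the", "of", "and", "a", "to", "in", "is", "on", "are", "have"}

def aux_counts(text):
    tokens = text.split()
    return {
        "stopwords": sum(t.lower().strip(".,!?") in STOPWORDS for t in tokens),
        "special_chars": len(re.findall(r"[^\w\s]", text)),     # punctuation etc.
        "numbers": len(re.findall(r"\b\d+\b", text)),
        "uppercase_words": sum(t.isupper() for t in tokens),    # fully capitalised tokens
        "uppercase_letters": sum(c.isupper() for c in text),
    }

print(aux_counts("The PEOPLE of America renewed 4 freedoms in 1941."))
```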

Problem 2.2 – Text Cleaning

Stop-word removal - Stemming - Find the 3 most common words used in all three speeches:

 Lower Case Conversion:

Fig: 2.2.1 Lower Case Conversion of the speech

 Removal of Punctuation:

Fig: 2.2.2 Removal of Punctuation

Removal of Stop Words

Fig:2.2.3 Removal of Stop Words

Stemming

Fig:2.2.4 FreqDist count of words after stemming
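The cleaning pipeline (lower-casing, punctuation removal, stop-word removal, stemming) can be sketched as below. This is a self-contained toy: the abridged stop-word set and the crude suffix stripper stand in for NLTK's full stop-word list and PorterStemmer, and the sample sentence is illustrative only.

```python
import re
from collections import Counter

STOPWORDS = {"the", "we", "for", "of", "and", "a", "to", "in", "is"}  # abridged stand-in

def crude_stem(word):
    # Toy stand-in for nltk.stem.PorterStemmer: strip a few common suffixes.
    for suffix in ("ing", "ness", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def clean_text(text):
    text = text.lower()                                        # lower-case conversion
    text = re.sub(r"[^\w\s]", " ", text)                       # punctuation removal
    tokens = [t for t in text.split() if t not in STOPWORDS]   # stop-word removal
    return [crude_stem(t) for t in tokens]                     # stemming

tokens = clean_text("We renewed the nation; the nations are renewing hopes for the nation.")
print(Counter(tokens).most_common(3))
```

Note how "nation" and "nations" collapse to a single stem, which is exactly why stemming is applied before counting word frequencies.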


Find the 3 most common words used in all three speeches

Fig:2.2.5 Three most common words used in all the three speeches

 The cleaned and processed words from each speech (roosevelt_cleaned, kennedy_cleaned, and nixon_cleaned) are combined into a single list (all_cleaned_words).
 The text of the three presidential speeches was cleaned and processed by removing stop words and applying stemming; the three most common words across all three speeches were then calculated.
 The clean_text function is defined to tokenize the text, remove stop words, and apply stemming using the Porter Stemmer. Stop words, common words that do not contribute much to the meaning of the text, are excluded from the analysis.
 The top 3 most common words used in all three speeches were "us", "new" and "let".

Fig:2.2.6 Top 3 common words used in all the three speeches after removing extended stop words

 "Us" refers to the United States, while "new" and "let" add little meaning on their own, so "new" and "let" were treated as extended stop words and the analysis was repeated to surface the actual most common words.
 From the above graph it is clear that "us", "nation" and "america" are the most commonly used words, as found by the text-analysis method.
 The FreqDist class from NLTK is used to calculate the frequency distribution of words in the combined list. The most_common(3) method is applied to retrieve the three most common words along with their frequencies.
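The effect of extending the stop-word list can be illustrated with collections.Counter as a stand-in for nltk.FreqDist. The token list below is illustrative only; the real all_cleaned_words is built from the three cleaned speeches.

```python
from collections import Counter

# Toy stand-in for the combined cleaned token list (all_cleaned_words in the report).
all_cleaned_words = (["us"] * 5 + ["new"] * 4 + ["let"] * 3 +
                     ["nation"] * 3 + ["america"] * 2)

print(Counter(all_cleaned_words).most_common(3))   # before extending stop words

extended_stopwords = {"new", "let"}
filtered = [w for w in all_cleaned_words if w not in extended_stopwords]
print(Counter(filtered).most_common(3))            # after extending stop words
```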

Inferences:

 The three most common words obtained from the code are [('us', 45), ('new', 26), ('let', 25)].
 After adding "new" and "let" to an extended stop-word list, the three most common words obtained are [('us', 45), ('nation', 37), ('america', 29)].
 The result suggests that the most frequently occurring words across all three speeches
are 'us,' 'nation,' and 'america.'
 The word 'us' may indicate a focus on collective identity or inclusiveness in the
speeches.
 The frequent occurrence of 'nation' reflects an emphasis on the country as a whole
and its role in the speeches.
 The word 'let' may be indicative of a call to action or encouragement in the context of
the speeches.
 The word 'new' may signal the promises of change or new beginnings made by the presidents in their speeches.
 The word 'america' appears frequently because the speeches address and represent the United States.

Problem 2 .3- Plot Word cloud of all three speeches

Show the most common words used in all three speeches in the form of word clouds:

Inferences:

 The analysis of the three presidential speeches (Roosevelt 1941, Kennedy 1961, Nixon
1973) revealed that the most frequently used words across all speeches are "us,"
"nation," and "america."
 This suggests a common theme of unity, patriotism, and a call to collective action.
 The repetition of these words indicates a shared emphasis on the nation's strength,
responsibility, and the importance of working together towards common goals.
 The presidents seem to highlight the idea of a united and empowered nation capable
of facing challenges and embracing opportunities.
 Understanding these recurring themes in the speeches can provide insights into the
leaders' priorities and the overall tone of their addresses.

Fig: 2.3.1 Word Cloud for the inaugural speeches (after cleaning)
