2021
Machine learning – Naïve Bayes, KNN, Bagging and
Boosting on Voter Mindset prediction on Election
Anil Ulchala
12/4/2021
1 Contents
1. Action Required:
2. Problem Statement:
2.1 Read the dataset. Do the descriptive statistics and do the null value condition check? Write an inference on it.
2.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
2.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split the data into train and test (70:30)
2.4 Apply Logistic Regression and LDA (linear discriminant analysis).
2.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
2.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
2.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models and write inference which model is best/optimized.
2.8 Based on these predictions, what are the insights? (5 marks)
3. Problem Statement:
3.1 Find the number of characters, words, and sentences for the mentioned documents.
3.2 Remove all the stopwords from all three speeches.
3.3 Which word occurs the most number of times in his inaugural address for each president? Mention the top three words. (after removing the stopwords)
3.4 Plot the word cloud of each of the three speeches. (after removing the stopwords)
List of Tables:
Table 1 – Comparison between various models
This Business Report is generated based on the Data set extracted from reliable sources
List of Figures:
Figure 1 - Loading Dataset into Jupyter notebook
Figure 2 – Dropping unwanted columns/variables
Figure 3 – Shape and Data type information of the data set
Figure 4 – Checking for Null Values
Figure 5 – Proportions checking for Categorical Variables
Figure 6 – Descriptive Information
Figure 7 – Distribution of the variables
Figure 8 – Count plot for Vote Variable
Figure 9 – Strip plot Age vs Vote
Figure 10 – Strip plot Vote vs Economic.cond.national
Figure 11 – Economic conditional household vs Age
Figure 12 – Strip plot against Vote feature
Figure 13 – Heat map of the Variables
Figure 14 – Pair Plot for the variables
Figure 15 – Outlier Treatment
Figure 16 – Box plot after outlier Treatment
Figure 17 – Converting Target Variable to Integer type
Figure 18 – Dataset info after encoding the variables
Figure 19 – Test_Train_Split of the data
Figure 20 – Logistic Regression Model building
Figure 21 – Model for LDA
Figure 22 – Model for Gaussian NB
Figure 23 – Scaling the Data Set for model building
Figure 24 – Model building
Figure 25 – Model building
Figure 26 – Calculating Misclassification error for various values of K
Figure 27 – Plotting Misclassification error for various values of K
Figure 28 – Model building for K = 19
Figure 29 – Model using Random Forest
Figure 30 – Model using Random Forest and applying Bagging
Figure 31 – Model using Random Forest and applying Boosting
Figure 32 – Confusion Matrix of Train Data
Figure 33 – Confusion Matrix of Test Data
Figure 34 – AUC and RoC of Train Data
Figure 35 – AUC and RoC of Test Data
Figure 36 – Confusion Matrix of Train and Test Data
Figure 37 – Classification Report of Train and Test Data
Figure 38 – AUC and RoC of Train and Test Data
Figure 39 – Confusion Matrix of Train Data
Figure 40 – Confusion Matrix of Test Data
Figure 41 – AUC and RoC of Train Data
Figure 42 – AUC and RoC of Test Data
Figure 43 – Confusion Matrix of Train Data
Figure 44 – Confusion Matrix of Test Data
Figure 45 – AUC and RoC of Train and Test Data
Figure 46 – Confusion Matrix of Train Data
Figure 47 – Confusion Matrix of Test Data
Figure 48 – AUC and RoC of Train and Test Data
Figure 49 – Confusion Matrix of Train Data
Figure 50 – Confusion Matrix of Test Data
Figure 51 – AUC and RoC of Train and Test Data
Figure 52 - Loading Dataset into Jupyter notebook
Figure 53 – Length of data
Figure 54 – No of words in each speech
Figure 55 – User defined function to count no of sentences in a Text file
Figure 56 – No of sentences in each presidential speech
Figure 57 – Importing stopwords to python
Figure 58 – Code to remove punctuation and unnecessary words
Figure 59 – Removing stopwords from the speeches
Figure 60 – Converting tokens into Lower case
Figure 61 – Lemmatization with POS tag
Figure 62 – Word Frequency calculation for Roosevelt Speech
Figure 63 – Word Frequency calculation for Kennedy Speech
Figure 64 – Word Frequency calculation for Nixon Speech
Figure 65 – Word cloud for Roosevelt speech
Figure 66 – Word cloud for Kennedy speech
Figure 67 – Word cloud for Nixon speech
Business Case
1. Action Required:
The purpose of this exercise is to explore the dataset and perform exploratory data analysis,
apply machine learning techniques such as Naïve Bayes and KNN to the data set, and compare the
resulting models. Insights and recommendations are then provided based on the output. In
addition, NLP techniques are applied to another data set containing the inaugural speeches of
former US presidents, and a word cloud is created.
2. Problem Statement:
You are hired by one of the leading news channels, CNBE, which wants to analyze the recent elections.
A survey was conducted on 1525 voters with 9 variables. You have to build a model to predict
which party a voter will vote for on the basis of the given information, in order to create an exit poll
that will help in predicting the overall win and the seats covered by a particular party.
2.1 Read the dataset. Do the descriptive statistics and do the null value
condition check? Write an inference on it.
Ans: To start the data analysis, we first need to load the data. The figure below shows a snapshot of
the data loading step.
Figure 1 - Loading Dataset into Jupyter notebook
Successfully loaded the data into Python.
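The loading step in Figure 1 can be sketched as follows. The notebook reads the survey file from disk; since the exact file name is not shown in this report, a small in-memory sample with the same nine columns is used here so the sketch is self-contained:

```python
import io
import pandas as pd

# The notebook would call pd.read_csv (or pd.read_excel) on the survey file.
# A two-row stand-in with the survey's nine columns keeps this runnable.
sample = io.StringIO(
    "vote,age,economic.cond.national,economic.cond.household,"
    "Blair,Hague,Europe,political.knowledge,gender\n"
    "Labour,43,3,3,4,1,2,2,female\n"
    "Conservative,67,4,4,2,4,11,2,male\n"
)
df = pd.read_csv(sample)
print(df.shape)   # (rows, columns) of the loaded data
print(df.head())  # first few records, as in Figure 1
```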
2.1.1. Dropping the unnecessary columns from the dataset
Figure 2 – Dropping unwanted columns/variables
2.1.2. The data set is now ready for Exploratory Data Analysis. As per the output, the dimension or
shape of the data set is (1525, 9); therefore the data set has 1525 rows and 9 columns. Refer to
the figure below.
Figure 3 – Shape and Data type information of the data set
As per the figure, we can observe there are 2 variables of 'object' type and the remaining variables are
of 'int' type. Also, there are no duplicate rows in the data set.
2.1.3. There are no null values in the data set. Refer to the figure below.
Figure 4 – Checking for Null Values
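The structural checks described above reduce to a few pandas calls, sketched here on a small stand-in frame (the real data has 1525 rows and 9 columns):

```python
import pandas as pd

# Stand-in frame with the same mix of object and integer columns.
df = pd.DataFrame({
    "vote": ["Labour", "Conservative", "Labour"],
    "age": [43, 67, 35],
    "gender": ["female", "male", "female"],
})
print(df.shape)               # dimensions of the dataset
print(df.dtypes)              # 'object' vs integer column types
print(df.duplicated().sum())  # count of duplicate rows (0 here)
print(df.isnull().sum())      # null-value check per column
```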
2.1.4. On the basis of the problem description it is clear that 'vote' is the dependent/target
variable and the remaining variables are independent variables. Going forward, the report uses the
terminology of dependent and independent variables to refer to the columns.
Hence, proportions for the categorical variables are calculated.
Figure 5 – Proportions checking for Categorical Variables
From the figure we can infer that 1063 respondents cast their vote for Labour and 462 for
Conservative. The data set is therefore split roughly 70:30 between the two classes, which is
adequate for creating a model. Female voters stand in first place with a count of 812 over male
voters with a count of 713. In addition, we can conclude that the data set has no bad values.
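The proportion check behind Figure 5 is a `value_counts` call on each categorical column; the counts below mirror the figures quoted above (1063 Labour, 462 Conservative):

```python
import pandas as pd

# Reconstructed target column with the counts reported in Figure 5.
votes = pd.Series(["Labour"] * 1063 + ["Conservative"] * 462, name="vote")
print(votes.value_counts())                       # raw class counts
print(votes.value_counts(normalize=True).round(2))  # ~0.70 / 0.30 split
```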
2.1.5. Descriptive Stats:
Figure 6 – Descriptive Information
As observed, the stats for the target variable show Labour with the higher number of votes cast.
Age ranges from 24 to 93 with a mean of 54 and a standard deviation of 16. The mean and
median are almost equal, suggesting a near-normal distribution.
economic.cond.national ranges from 1 to 5 with a mean of 3.24, where 1 means bad and 5
means good conditions. The mean and median are almost equal, again suggesting a near-normal
distribution.
economic.cond.household, Blair and Hague also range from 1 to 5, with means of 3.14, 3.33 and
2.7 respectively, where 1 means bad and 5 means good. Their means and medians are likewise
close, suggesting near-normal distributions.
The 'Europe' feature, describing Eurosceptic sentiment, has a mean of 6 on a scale of 1 to 15.
This suggests most voters are roughly neutral on the Europe sentiment feature.
Female voters outnumber male voters.
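The figures quoted above (age mean ~54, std ~16, range 24 to 93, mean versus median comparison) come from a single `describe()` call; a stand-in column shows the shape of the output:

```python
import pandas as pd

# Stand-in 'age' column spanning the reported range 24-93.
df = pd.DataFrame({"age": [24, 43, 54, 67, 93]})
stats = df.describe()                 # count, mean, std, min, quartiles, max
print(stats)
print(df["age"].mean(), df["age"].median())  # mean vs median comparison
```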
2.1.6. Distribution and boxplot of the variables
Inferences based on the boxplots and distribution plots:
1. The 'Age' variable is approximately normally distributed.
2. The remaining features (economic.cond.national, economic.cond.household, Blair, Hague
and political.knowledge) are ordinal rather than continuous variables, so multiple spikes
appear in their distributions.
Figure 7 – Distribution of the variables
2.2 Perform Univariate and Bivariate Analysis. Do exploratory data
analysis. Check for Outliers.
2.2.1. Univariate Analysis
Figure 8 – Count plot for Vote Variable
The majority of the voters are Labour, so the vote bank lies mostly with the Labour section.
2.2.2. Bivariate Analysis
1. Strip plot between Vote and Age
Figure 9 – Strip plot Age vs Vote
Based on the strip plot, age shows only a weak relationship with the vote.
Most of the Labour voters lie between the ages of 40 and 50.
2. Strip plot between Vote and Economic.cond.national
Figure 10 – Strip plot Vote vs Economic.cond.national
Based on the data, there is no clear correlation.
3. Strip plot between Economic conditional household and Age
Figure 11 – Economic conditional household vs Age
Based on the data, there is no clear correlation.
4. Strip plots of the remaining features against the Vote variable
Figure 12 – Strip plot against Vote feature
5. Correlation between the variables:
Figure 13 – Heat map of the Variables
Correlation between the variables is very weak as per the heat map.
Hague and economic.cond.national have a negative relation.
Europe and economic.cond.national also have a negative relation.
Age and political.knowledge also have a negative relation, which is quite opposite to what
one would expect.
6. Pairplot between the variables:
Figure 14 – Pair Plot for the variables
From the pair plot, we can infer that there is no strong relation between any pair of
variables.
Hence the dataset is not affected by multi-collinearity and is suitable for modelling.
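The correlation check behind the heat map is a `.corr()` call, which returns pairwise Pearson correlations; a small stand-in frame shows the call (the values here are illustrative, not the survey's):

```python
import pandas as pd

# Stand-in numeric columns; .corr() gives the matrix visualised in the heat map.
df = pd.DataFrame({
    "age": [24, 43, 54, 67],
    "Europe": [3, 6, 8, 11],
    "political.knowledge": [3, 2, 2, 0],
})
corr = df.corr()          # pairwise Pearson correlation matrix
print(corr.round(2))
```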
2.2.3. Treating Outliers
Before label encoding and model creation, we observed that there are some outliers in several
variables. These outliers are treated below.
Figure 15 – Outlier Treatment
Figure 16 – Box plot after outlier Treatment
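The report does not spell out its exact capping rule, so the sketch below assumes the common approach of clipping values to the 1.5×IQR whiskers, which matches the "after treatment" box plots showing no points beyond the whiskers:

```python
import pandas as pd

# Toy series with one obvious outlier (40).
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 40])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # whisker limits
capped = s.clip(lower=lower, upper=upper)       # pull outliers to the limits
print(capped.max())   # the outlier is capped at the upper whisker
```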
2.3 Encode the data (having string values) for Modelling. Is Scaling
necessary here or not? Data Split: Split the data into train and test
(70:30)
Ans: To create a model for the analysis we need to encode the categorical variables. We therefore
apply label encoding to the categorical variables and use the encoded data set for model building.
2.3.1. Data Encoding and Model building for Machine Learning Analysis
2.3.1.1. Converting Target Variable to Integer type using Categorical Function
Here ‘0’ represents ‘Conservative’ and ‘1’ represents ‘Labour’
Figure 17 – Converting Target Variable to Integer type
2.3.1.2. Data information after conversion.
Figure 18 – Dataset info after encoding the variables
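The encoding step above can be sketched with pandas categorical codes; the category order is fixed so that 0 maps to 'Conservative' and 1 to 'Labour', as stated in 2.3.1.1:

```python
import pandas as pd

df = pd.DataFrame({
    "vote": ["Labour", "Conservative", "Labour"],
    "gender": ["female", "male", "female"],
})
# Fix the category order so the codes match the report: 0 = Conservative, 1 = Labour.
df["vote"] = pd.Categorical(df["vote"],
                            categories=["Conservative", "Labour"]).codes
df["gender"] = pd.Categorical(df["gender"]).codes  # alphabetical: female=0, male=1
print(df.dtypes)   # all columns are now integer-typed
print(df.head())
```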
2.4 Apply Logistic Regression and LDA (linear discriminant analysis).
2.4.1. Model Building for Machine Learning Analysis – Split of data
Figure 19 – Test_Train_Split of the data
The data is not scaled here, since the independent variables do not differ much in
range; scaling is therefore not applied anywhere in the modelling except for KNN.
The train/test split is made with a 70:30 ratio.
X is an array of the independent variables and y is the target variable.
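The 70:30 split described above is a single `train_test_split` call; synthetic arrays stand in for the survey data, and the `random_state` value is an assumption for reproducibility:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# X stands for the independent variables, y for the encoded target.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)
print(X_train.shape, X_test.shape)   # 70% of rows in train, 30% in test
```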
2.4.2. Logistic Regression model generation using the Grid search method
Figure 20 – Logistic Regression Model building
Penalty: 'elasticnet', 'l2', 'none'
Solver: 'newton-cg', 'saga'
Tol: 0.001, 0.00001
Finally, based on the GridSearchCV technique, the best estimator is captured as best_model and
predictions are made using this best_model.
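The grid search described above can be sketched as follows, on synthetic stand-in data. The report's grid also tried the 'elasticnet' and 'none' penalties; only 'l2' is kept here so the sketch runs unchanged on any recent scikit-learn version:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the survey's train split.
X, y = make_classification(n_samples=200, n_features=8, random_state=1)
grid = {
    "solver": ["newton-cg", "saga"],
    "tol": [0.001, 0.00001],
    "penalty": ["l2"],
}
search = GridSearchCV(LogisticRegression(max_iter=1000), grid, cv=3)
search.fit(X, y)
best_model = search.best_estimator_   # predictions are made with this model
print(search.best_params_)
```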
2.4.3. Creating model for LDA
Figure 21 – Model for LDA
The data is not scaled here, since the independent variables do not differ much in
range; scaling is therefore not applied anywhere in the modelling except for KNN.
The train/test split is made with a 70:30 ratio.
X is an array of the independent variables and y is the target variable.
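The LDA model itself is a plain fit-and-score, sketched here on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Stand-in for the unscaled 70:30 train split described above.
X, y = make_classification(n_samples=200, n_features=8, random_state=1)
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print(lda.score(X, y))   # mean accuracy on the fitted data
```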
2.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
2.5.1. Creating model using Gaussian NB
Figure 22 – Model for Gaussian NB
The data is not scaled here, since the independent variables do not differ much in
range; scaling is therefore not applied anywhere in the modelling except for KNN.
The train/test split is made with a 70:30 ratio.
X is an array of the independent variables and y is the target variable.
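The Gaussian Naive Bayes model follows the same pattern, sketched on stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Stand-in for the unscaled train split; GaussianNB needs no hyperparameters here.
X, y = make_classification(n_samples=200, n_features=8, random_state=1)
nb = GaussianNB()
nb.fit(X, y)
print(nb.score(X, y))   # mean accuracy on the fitted data
```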
2.5.2. Creating model using KNN
2.5.2.1 Scaling the data set for KNN model building
Figure 23 – Scaling the Data Set for model building
Scaling is applied here because the KNN model is based on Euclidean distance
measurements, so a z-score transform is applied to the train data set.
The train/test split is made with a 70:30 ratio.
X is an array of the independent variables and y is the target variable.
2.5.2.2 Building Model:
Figure 24 – Model building
2.5.2.3 Building Model (K value =7):
Figure 25 – Model building
The K value is taken as 7.
The model score is 71.41%.
X is an array of the independent variables and y is the target variable.
2.5.2.4 Calculating Misclassification error of KNN model for a range from 1 to 19 at steps of 2
Figure 26 – Calculating Misclassification error for various values of K
It is observed that the misclassification error is lowest at K = 19.
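The K-selection loop above can be sketched as follows: z-score the features (KNN relies on Euclidean distance), then record the misclassification error for odd K from 1 to 19. Synthetic data stands in for the survey, so the best K found here need not be 19:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

# Fit the scaler on train only, then apply the same transform to test.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

errors = {}
for k in range(1, 20, 2):   # odd K from 1 to 19
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train_s, y_train)
    errors[k] = 1 - knn.score(X_test_s, y_test)   # misclassification error
best_k = min(errors, key=errors.get)              # K with the lowest error
print(errors)
print(best_k)
```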
2.5.2.5 Plotting Misclassification Error for range of K values
Figure 27 – Plotting Misclassification error for various values of K
2.5.2.6 Building model for K=19
Figure 28 – Model building for K =19
The model score is observed as 69.8%.
2.6 Model Tuning, Bagging (Random Forest should be applied for
Bagging), and Boosting.
2.6.1. Creating model using Random Forest Technique
Figure 29 – Model using Random Forest
The model is created using an ensemble technique, with 100 estimators.
The train score is observed as 99.9%.
The test score is observed as 84%, which indicates some overfitting. We will
therefore tune the model using methods like Bagging and Boosting.
2.6.2. Creating model using Random Forest model and applying Bagging
Figure 30 – Model using Random Forest and applying Bagging
The model is created using an ensemble technique, with RF_model as the base
estimator.
The train score is observed as 97.9%.
The test score is observed as 84.27%, which still indicates some overfitting, so we
also check model tuning using Boosting.
2.6.3. Creating model using Random Forest model and applying Boosting
Figure 31 – Model using Random Forest and applying Boosting
The model is created using the Gradient Boosting ensemble technique. (Note that
scikit-learn's Gradient Boosting builds its own trees; it does not take the Random
Forest model as a base estimator.)
The train score is observed as 88.75%.
The test score is observed as 83.84%; since the train and test scores are in line,
this can be considered the best model.
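The three ensemble steps above can be sketched together on synthetic data: a 100-tree Random Forest, a Bagging classifier wrapping that forest, and a Gradient Boosting classifier (which, as noted, builds its own trees). The base estimator is passed positionally so the sketch works across scikit-learn versions that name the parameter differently:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

rf = RandomForestClassifier(n_estimators=100, random_state=1)
rf.fit(X_train, y_train)

# Bagging over the Random Forest; the base estimator is the first positional arg.
bag = BaggingClassifier(rf, n_estimators=5, random_state=1)
bag.fit(X_train, y_train)

gb = GradientBoostingClassifier(random_state=1)
gb.fit(X_train, y_train)

for name, model in [("RF", rf), ("Bagging", bag), ("Boosting", gb)]:
    # Comparing train vs test accuracy exposes overfitting, as in 2.6.1-2.6.3.
    print(name, model.score(X_train, y_train), model.score(X_test, y_test))
```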
2.7 Performance Metrics: Check the performance of Predictions on Train
and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get
ROC_AUC score for each model. Final Model: Compare the models and
write inference which model is best/optimized.
2.7.1. Performance Metrics of Logistic Regression
2.7.1.1. Confusion Matrix, accuracy and other metrics of Train data
Figure 32 – Confusion Matrix of Train Data
Accuracy of the train data is 83.03%
Recall for class '1' on the train data is 0.91
2.7.1.2. Confusion Matrix, accuracy and other metrics of Test data
Figure 33 – Confusion Matrix of Test Data
Accuracy of the test data is 84.93%
Recall for class '1' on the test data is 0.92
The number of false positives comes down to 45 on the test data, which is a good
sign compared to 111 on the train data
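The metric calls behind the figures in this section are `confusion_matrix`, `accuracy_score` and `classification_report`; toy labels stand in for what `best_model.predict(X_test)` would return:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 0, 1, 1, 0, 1])   # stand-in predictions
print(confusion_matrix(y_true, y_pred))       # TN/FP over FN/TP layout
print(accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred))  # precision/recall/F1 per class
```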
2.7.1.3. Accuracy Score AUC_score and RoC Curve for the train data
Figure 34 – AUC and RoC of Train Data
AUC score of the train data is 87.72%
2.7.1.4. Accuracy Score, AUC_score and RoC Curve for the test data
Figure 35 – AUC and RoC of Test Data
AUC score of the test data is 87.72%
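The AUC and ROC values quoted for each model come from `roc_curve` and `roc_auc_score`; in the notebook the probabilities would come from `model.predict_proba(X_test)[:, 1]`, while toy values are used here:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 1, 0])
probs = np.array([0.2, 0.4, 0.8, 0.6, 0.9, 0.3])  # stand-in class-1 probabilities
fpr, tpr, thresholds = roc_curve(y_true, probs)   # points for the ROC plot
print(roc_auc_score(y_true, probs))               # area under the ROC curve
```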
2.7.2. Performance Metrics of LDA
2.7.2.1. Confusion Matrix of Train and Test data
Figure 36 – Confusion Matrix of Train and Test Data
The number of false positives for class '1' is 105 on the train data and 41 on the
test data
2.7.2.2. Classification report of Train and Test data
Figure 37 – Classification Report of Train and Test Data
Accuracy of the Train data is 82%
Accuracy of the Test Data is 84%
2.7.2.3. AUC_score and RoC Curve for the train and test data
Figure 38 – AUC and RoC of Train and Test Data
AUC score of the train data is 87.7%
AUC score of the test data is 91.6%
2.7.3. Performance Metrics of Naïve Bayes
2.7.3.1. Confusion Matrix, accuracy and other metrics of Train data
Figure 39 – Confusion Matrix of Train Data
Accuracy of the train data is 82.66%
Recall for class '1' on the train data is 0.88
2.7.3.2. Confusion Matrix, accuracy and other metrics of Test data
Figure 40 – Confusion Matrix of Test Data
Accuracy of the test data is 84.71%
Recall for class '1' on the test data is 0.90
The number of false positives comes down to 37 on the test data, which is a good
sign compared to 97 on the train data
2.7.3.3. Accuracy Score AUC_score and RoC Curve for the train data
Figure 41 – AUC and RoC of Train Data
AUC score of the train data is 87.52%
2.7.3.4. Accuracy Score, AUC_score and RoC Curve for the test data
Figure 42 – AUC and RoC of Test Data
AUC score of the test data is 91.02%
2.7.4. Performance Metrics of KNN Model with K value as 19
2.7.4.1. Confusion Matrix, accuracy and other metrics of Train data
Figure 43 – Confusion Matrix of Train Data
Accuracy of the train data is 69.82%
Recall for class '1' on the train data is 0.99
2.7.4.2. Confusion Matrix, accuracy and other metrics of Test data
Figure 44 – Confusion Matrix of Test Data
Accuracy of the test data is 67.46%
Recall for class '1' on the test data is 0.96
The number of false positives comes down to 137 on the test data, compared with
311 on the train data
2.7.4.3. AUC_score and RoC Curve for the train and test data
Figure 45 – AUC and RoC of Train and Test Data
AUC score of the train data is 61.30%
AUC score of the test data is 46.40%
2.7.5. Performance Metrics of Random Forest with Bagging Technique
2.7.5.1. Confusion Matrix, accuracy and other metrics of Train data
Figure 46 – Confusion Matrix of Train Data
Accuracy of the train data is 97.18%
Recall for class '1' on the train data is 0.99
2.7.5.2. Confusion Matrix, accuracy and other metrics of Test data
Figure 47 – Confusion Matrix of Test Data
Accuracy of the test data is 84.27%
Recall for class '1' on the test data is 0.92
2.7.5.3. AUC_score and RoC Curve for the train and test data
Figure 48 – AUC and RoC of Train and Test Data
AUC score of the train data is 99.70%
AUC score of the test data is 91.80%
2.7.6. Performance Metrics of Random Forest with Boosting Technique
2.7.6.1. Confusion Matrix, accuracy and other metrics of Train data
Figure 49 – Confusion Matrix of Train Data
Here the Gradient Boosting technique is used
Accuracy of the train data is 88.75%
Recall for class '1' on the train data is 0.94
2.7.6.2. Confusion Matrix, accuracy and other metrics of Test data
Figure 50 – Confusion Matrix of Test Data
Accuracy of the test data is 83.84%
Recall for class '1' on the test data is 0.92
2.7.6.3. AUC_score and RoC Curve for the train and test data
Figure 51 – AUC and RoC of Train and Test Data
AUC score of the train data is 94.80%
AUC score of the test data is 90.80%
2.7.7. Comparison between Models
                LR Model       LDA Model      NB Model       KNN Model      Bagging        Boosting
                Train   Test   Train   Test   Train   Test   Train   Test   Train   Test   Train   Test
Accuracy        83.03   84.93  82.75   84.71  82.66   84.71  69.82   67.46  97.18   84.27  88.75   83.84
Precision (1)   86      87     86      88     87      89     70      69     97      86     90      86
AUC             87.72   87.72  87.7    91.6   87.52   91.02  61.3    46.4   99.7    91.8   94.8    90.8
Recall (1)      91      92     89      91     88      90     99      96     99      92     94      92
F1-Score (1)    88      90     88      89     88      89     82      80     98      89     92      89
Table 1 – Comparison between various models
On comparing the models, the Random Forest technique with Boosting appears the
most consistent across the various model evaluation parameters.
The train and test scores for the Boosting model are close to each other and
consistent.
The accuracy of Boosting is in second place, with 88.75% on train and 83.84% on
test data. Bagging has the highest train accuracy at 97.18%, but its test accuracy
is almost 13 percentage points lower, indicating overfitting.
Finally, the Random Forest technique tuned with Boosting provides the better
model.
2.8 Based on these predictions, what are the insights? (5 marks).
On the whole, based on the outcomes of the models for this data set, the following insights are observed:
Female voters outnumber male voters, so parties will have to attract male voters by
suitable means.
Most of the Labour voters lie between the ages of 40 and 50, and the Labour party
appears to attract this age group, possibly through monetary benefits; a voter in this
age range therefore has a high chance of casting his or her vote for the Labour party.
Based on the strip plots, voter density for the Labour party is high where the national
economic condition is rated high, so voters in this zone have a high chance of casting
their vote for the Labour party.
The same holds for households with a high economic condition rating.
The Random Forest algorithm tuned with the Gradient Boosting technique provides the
better model, giving consistent results on both the training and test data.
3. Problem Statement:
In this particular project, we are going to work on the inaugural corpora from the nltk in Python.
We will be looking at the following speeches of the Presidents of the United States of America:
President Franklin D. Roosevelt in 1941
President John F. Kennedy in 1961
President Richard Nixon in 1973.
3.1 Find the number of characters, words, and sentences for the
mentioned documents.
Ans: To start the counting, we first need to load the data. The figure below shows a snapshot of
the data loading step.
Figure 52 - Loading Dataset into Jupyter notebook
The nltk package is downloaded and the inaugural speech corpus is imported.
The required presidential speeches are extracted and assigned to their respective
variables.
3.1.1. Calculate length of data
Ans: To calculate the length of the data, the code below is executed.
Figure 53 – Length of data
Number of characters in Roosevelt file is 7571.
Number of characters in Kennedy file is 7618.
Number of characters in Nixon file is 9991.
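The loading and character-count steps can be sketched as below. The nltk lines are commented out because they need a one-time corpus download; the short sample string is only a stand-in for the real speech text, and the corpus file identifiers shown are the usual nltk inaugural ones.

```python
# Loading the speeches (requires nltk and a one-time corpus download):
#   import nltk
#   nltk.download('inaugural')
#   from nltk.corpus import inaugural
#   roosevelt = inaugural.raw('1941-Roosevelt.txt')
#   kennedy   = inaugural.raw('1961-Kennedy.txt')
#   nixon     = inaugural.raw('1973-Nixon.txt')

# Stand-in text so the sketch runs without nltk:
roosevelt = "On each national day of inauguration since 1789, the people have renewed their sense of dedication."

# The number of characters is simply the length of the raw string.
num_chars = len(roosevelt)
print(f"Number of characters in Roosevelt file is {num_chars}.")
```

Applying `len()` to the full raw text of each speech gives the character counts reported above.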
3.1.2. Calculate No of words
Ans: To calculate the number of words in the data, the code below is executed. The split()
function is used to extract the words.
Figure 54 – No of words in each speech
Number of words in Roosevelt file is 1360.
Number of words in Kennedy file is 1390.
Number of words in Nixon file is 1819.
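The word-count step can be sketched as follows; the sample string again stands in for the raw speech text loaded from the inaugural corpus.

```python
# Stand-in for the raw speech text loaded from nltk's inaugural corpus:
roosevelt = "On each national day of inauguration since 1789, the people have renewed their sense of dedication."

# split() breaks the string on whitespace, giving a rough word list.
words = roosevelt.split()
num_words = len(words)
print(f"Number of words in Roosevelt file is {num_words}.")
```

Note that whitespace splitting keeps punctuation attached to words (e.g. "1789,"), which is fine for a simple count.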
This Business Report is generated based on the Data set extracted from reliable sources
30
3.1.3. Calculate No of Sentences
Ans: To calculate the number of sentences in the data, the code below is executed. The
sent_tokenize function is used to count the sentences.
3.1.3.1 A “sentence_count” function is defined to count the number of sentences in a particular speech
Figure 55 – User defined function to count no of sentences in a Text file
3.1.3.2 Code to calculate the number of sentences in each speech
Figure 56 – No of sentences in each presidential speech
Number of sentences in Roosevelt file is 68.
Number of sentences in Kennedy file is 52.
Number of sentences in Nixon file is 68.
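A sketch of the sentence-count helper is below. The report uses nltk's sent_tokenize; here a simple regex split on sentence-ending punctuation serves as a stand-in so the sketch runs without nltk, and the sample sentence is invented for illustration.

```python
import re

def sentence_count(text):
    # Split on runs of '.', '!' or '?' followed by whitespace or
    # end-of-string, then drop any empty pieces.
    sentences = [s for s in re.split(r'[.!?]+(?:\s+|$)', text) if s.strip()]
    return len(sentences)

# With nltk installed this would instead be:
#   from nltk.tokenize import sent_tokenize
#   len(sent_tokenize(text))

sample = "We face arduous days. Yet we act and act quickly. Shall we falter?"
print(sentence_count(sample))
```

Calling the function on each speech's raw text gives the sentence counts reported above.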
3.2 Remove all the stopwords from all three speeches.
3.2.1. Import predefined stopwords from the nltk
Ans: To import the predefined stopwords from nltk, the code below is executed.
Figure 57 – Importing stopwords to python
3.2.2. Step to remove all punctuation
Ans: In this step, punctuation and unnecessary words are removed from the text files.
Figure 58 – Code to remove punctuation and unnecessary words
The listed punctuation marks and additional unnecessary words are removed.
3.2.3. Removing stopwords in all the files
Ans: To remove the stopwords, the code below is executed.
Figure 59 – Removing stopwords from the speeches
The output shows that the punctuation has been removed, and the cleaned text is stored
in variables with the suffix ‘_clean’.
We can see that the comma has been removed from the token list.
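The stopword and punctuation removal can be sketched as below. The report pulls the full English stopword list from nltk (`stopwords.words('english')`); a tiny inline set and an invented token list stand in for it here so the sketch is self-contained.

```python
import string

# Stand-in for nltk's list:
#   from nltk.corpus import stopwords
#   stop_words = set(stopwords.words('english'))
stop_words = {'the', 'of', 'and', 'to', 'in', 'a', 'is', 'it', 'on'}

# Tokens as they might come out of word_tokenize:
tokens = ['On', 'each', 'national', 'day', ',', 'the', 'people', 'have',
          'renewed', 'their', 'dedication', '.']

# Drop punctuation tokens and stopwords (matched case-insensitively).
roosevelt_clean = [t for t in tokens
                   if t not in string.punctuation
                   and t.lower() not in stop_words]
print(roosevelt_clean)
```

The same filter applied to each tokenized speech yields the `_clean` variables shown in Figure 59.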
3.3 Which word occurs the most number of times in his inaugural address
for each president? Mention the top three words. (after removing the
stopwords)
3.3.1. Step to convert tokens into lower case.
Ans: To find the top-frequency words, all tokens are first converted to lower case. The algorithm
treats lower-case and upper-case words as different, which would distort the counts and
frequencies. Hence the built-in lower() function is used to convert all tokens to lower case and
maintain consistency across the three speeches. The lines of code below are executed.
Figure 60 – Converting tokens into Lower case
We can observe that words like ‘Mr’ are converted to ‘mr’ after the above
execution.
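The lower-casing step is a one-line comprehension; the token list here is an invented stand-in for the tokenized speech.

```python
# Tokens as they might come out of word_tokenize:
tokens = ['Mr', 'Vice', 'President', 'Mr', 'Chief', 'Justice']

# str.lower() makes later counting case-insensitive, so 'Mr' and 'mr'
# are treated as the same word.
tokens_lower = [t.lower() for t in tokens]
print(tokens_lower)
```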
3.3.2. Step to Lemmatization with POS
Ans: In this step, lemmatization is performed to convert all tokens to their root words. For this,
WordNetLemmatizer is imported into the Python notebook.
Figure 61 – Lemmatization with POS tag
In the above step, words are converted to their root words based on their
parts of speech.
Words that are not categorized under any part of speech are retained as they are.
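POS-aware lemmatization needs a mapping from the Penn Treebank tags produced by nltk's pos_tag to the WordNet tag set expected by WordNetLemmatizer. A common helper for that mapping is sketched below (the default-to-noun behaviour matches the note above that uncategorized words are left as they are); the lemmatizer call itself is shown in comments since it needs an nltk WordNet download.

```python
# Map a Penn Treebank tag (e.g. 'VBD', 'NNS') to a WordNet POS tag.
def penn_to_wordnet(tag):
    mapping = {'J': 'a',   # adjective
               'V': 'v',   # verb
               'N': 'n',   # noun
               'R': 'r'}   # adverb
    return mapping.get(tag[0], 'n')  # default to noun

# With nltk this feeds into the lemmatizer, e.g.:
#   from nltk.stem import WordNetLemmatizer
#   lemmatizer = WordNetLemmatizer()
#   lemmatizer.lemmatize('renewed', penn_to_wordnet('VBD'))

print(penn_to_wordnet('VBD'), penn_to_wordnet('NNS'))
```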
3.3.3. Top three word frequency calculation for each speech
3.3.3.1: Word frequency is calculated for the Roosevelt speech through the code below.
Figure 62 – Word Frequency calculation for Roosevelt Speech
Top three words in Roosevelt speech are “Nation”, “It” and “Life”
3.3.3.2: Word frequency is calculated for the Kennedy speech through the code below.
Figure 63 – Word Frequency calculation for Kennedy Speech
Top three words in Kennedy speech are “World”, “Let” and “Side”
3.3.3.3: Word frequency is calculated for the Nixon speech through the code below.
Figure 64 – Word Frequency calculation for Nixon Speech
Top three words in Nixon speech are “America”, “Peace” and “World”
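The frequency calculation for any of the cleaned speeches can be sketched with collections.Counter; the token list below is an invented stand-in for the cleaned, lower-cased tokens.

```python
from collections import Counter

# Cleaned, lower-cased tokens (stand-in for a real speech):
tokens = ['nation', 'life', 'nation', 'spirit', 'nation',
          'life', 'democracy', 'freedom', 'life', 'nation']

# Counter tallies token frequencies; most_common(3) gives the top three.
freq = Counter(tokens)
top_three = freq.most_common(3)
print(top_three)
```

nltk's own `FreqDist` offers the same `most_common` interface and could be used interchangeably here.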
3.4 Plot the word cloud of each of the three speeches. (after removing the
stopwords)
Ans: A word cloud is a visual representation of the words in a particular text file. The size of
each word indicates its frequency: the bigger the word, the higher its frequency.
3.4.1. Now we create the word cloud for the Roosevelt speech. WordCloud is imported from the
wordcloud package to create the word cloud.
Figure 65 – Word cloud for Roosevelt speech
3.4.2. Now we create the word cloud for the Kennedy speech. WordCloud is imported from the
wordcloud package to create the word cloud.
Figure 66 – Word cloud for Kennedy speech
3.4.3. Now we create the word cloud for the Nixon speech. WordCloud is imported from the
wordcloud package to create the word cloud.
Figure 67 – Word cloud for Nixon speech
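The word-cloud step for any of the speeches can be sketched as below. The rendering lines are commented out because they need the third-party wordcloud and matplotlib packages; the token list is an invented stand-in for a cleaned speech.

```python
from collections import Counter

# Cleaned tokens (stand-in for a real speech):
tokens = ['america', 'peace', 'world', 'america', 'peace', 'america']
freqs = Counter(tokens)

# Rendering requires the third-party 'wordcloud' and 'matplotlib' packages:
#   from wordcloud import WordCloud
#   import matplotlib.pyplot as plt
#   wc = WordCloud(width=800, height=400, background_color='white')
#   wc.generate_from_frequencies(freqs)
#   plt.imshow(wc, interpolation='bilinear')
#   plt.axis('off')
#   plt.show()

print(freqs.most_common(3))
```

Since the word cloud sizes words by frequency, the same Counter used for the top-three words drives the plot directly.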