2021
Machine learning – Naïve Bayes, KNN, Bagging and
Boosting on Voter Mindset prediction on Election
Anil Ulchala
12/4/2021
1 Contents
1. Action Required:
2. Problem Statement:
2.1 Read the dataset. Do the descriptive statistics and do the null value condition check? Write an inference on it.
2.2 Perform Univariate and Bivariate Analysis. Do exploratory data analysis. Check for Outliers.
2.3 Encode the data (having string values) for Modelling. Is Scaling necessary here or not? Data Split: Split the data into train and test (70:30)
2.4 Apply Logistic Regression and LDA (linear discriminant analysis).
2.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
2.6 Model Tuning, Bagging (Random Forest should be applied for Bagging), and Boosting.
2.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Final Model: Compare the models and write inference which model is best/optimized.
2.8 Based on these predictions, what are the insights? (5 marks)
3. Problem Statement:
3.1 Find the number of characters, words, and sentences for the mentioned documents.
3.2 Remove all the stopwords from all three speeches.
3.3 Which word occurs the most number of times in his inaugural address for each president? Mention the top three words. (after removing the stopwords)
3.4 Plot the word cloud of each of the three speeches. (after removing the stopwords)
List of Tables:
Table 1 – Comparison between various models
This Business Report is generated based on the Data set extracted from reliable sources
List of Figures:
Figure 1 - Loading Dataset into Jupyter notebook
Figure 2 – Dropping unwanted columns/variables
Figure 3 – Shape and Data type information of the data set
Figure 4 – Checking for Null Values
Figure 5 – Proportions checking for Categorical Variables
Figure 6 – Descriptive Information
Figure 7 – Distribution of the variables
Figure 8 – Count plot for Vote Variable
Figure 9 – Strip plot Age vs Vote
Figure 10 – Strip plot Vote vs Economic.cond.national
Figure 11 – Economic conditional household vs Age
Figure 12 – Strip plot against Vote feature
Figure 13 – Heat map of the Variables
Figure 14 – Pair Plot for the variables
Figure 15 – Outlier Treatment
Figure 16 – Box plot after outlier Treatment
Figure 17 – Converting Target Variable to Integer type
Figure 18 – Dataset info after encoding the variables
Figure 19 – Test_Train_Split of the data
Figure 20 – Logistic Regression Model building
Figure 21 – Model for LDA
Figure 22 – Model for Gaussian NB
Figure 23 – Scaling the Data Set for model building
Figure 24 – Model building
Figure 25 – Model building
Figure 26 – Calculating Misclassification error for various values of K
Figure 27 – Plotting Misclassification error for various values of K
Figure 28 – Model building for K = 19
Figure 29 – Model using Random Forest
Figure 30 – Model using Random Forest and applying Bagging
Figure 31 – Model using Random Forest and applying Boosting
Figure 32 – Confusion Matrix of Train Data
Figure 33 – Confusion Matrix of Test Data
Figure 34 – AUC and RoC of Train Data
Figure 35 – AUC and RoC of Test Data
Figure 36 – Confusion Matrix of Train and Test Data
Figure 37 – Classification Report of Train and Test Data
Figure 38 – AUC and RoC of Train and Test Data
Figure 39 – Confusion Matrix of Train Data
Figure 40 – Confusion Matrix of Test Data
Figure 41 – AUC and RoC of Train Data
Figure 42 – AUC and RoC of Test Data
Figure 43 – Confusion Matrix of Train Data
Figure 44 – Confusion Matrix of Test Data
Figure 45 – AUC and RoC of Train and Test Data
Figure 46 – Confusion Matrix of Train Data
Figure 47 – Confusion Matrix of Test Data
Figure 48 – AUC and RoC of Train and Test Data
Figure 49 – Confusion Matrix of Train Data
Figure 50 – Confusion Matrix of Test Data
Figure 51 – AUC and RoC of Train and Test Data
Figure 52 - Loading Dataset into Jupyter notebook
Figure 53 – Length of data
Figure 54 – No of words in each speech
Figure 55 – User defined function to count no of sentences in a Text file
Figure 56 – No of sentences in each presidential speech
Figure 57 – Importing stopwords to python
Figure 58 – Code to remove punctuation and unnecessary words
Figure 59 – Removing stopwords from the speeches
Figure 60 – Converting tokens into Lower case
Figure 61 – Lemmatization with POS tag
Figure 62 – Word Frequency calculation for Roosevelt Speech
Figure 63 – Word Frequency calculation for Kennedy Speech
Figure 64 – Word Frequency calculation for Nixon Speech
Figure 65 – Word cloud for Roosevelt speech
Figure 66 – Word cloud for Kennedy speech
Figure 67 – Word cloud for Nixon speech
Business Case
1. Action Required:
The purpose of this exercise is to explore the dataset and perform exploratory data analysis,
apply machine learning techniques such as Naïve Bayes and KNN to the data set, and compare the
resulting models. Insights and recommendations are then provided based on the output. In
addition, NLP techniques are applied to another data set containing the inaugural speeches of
former US presidents, and a word cloud is created.
2. Problem Statement:
You are hired by one of the leading news channels, CNBE, which wants to analyze the recent elections.
A survey was conducted on 1525 voters with 9 variables. You have to build a model to predict
which party a voter will vote for on the basis of the given information, in order to create an exit poll
that will help in predicting the overall win and the seats covered by a particular party.
2.1 Read the dataset. Do the descriptive statistics and do the null value
condition check? Write an inference on it.
Ans: To start the data analysis, we first need to load the data. The figure below shows a snapshot of
the data loading step.
Figure 1 - Loading Dataset into Jupyter notebook
Successfully loaded the data into Python.
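The loading step in Figure 1 can be sketched as follows. The notebook reads the survey file from disk; since the exact file name is not shown in this report, a small in-memory sample with the same nine columns is used here so the sketch is self-contained:

```python
import io
import pandas as pd

# The notebook would call pd.read_csv (or pd.read_excel) on the survey file.
# A two-row stand-in with the survey's nine columns keeps this runnable.
sample = io.StringIO(
    "vote,age,economic.cond.national,economic.cond.household,"
    "Blair,Hague,Europe,political.knowledge,gender\n"
    "Labour,43,3,3,4,1,2,2,female\n"
    "Conservative,67,4,4,2,4,11,2,male\n"
)
df = pd.read_csv(sample)
print(df.shape)   # (rows, columns) of the loaded data
print(df.head())  # first few records, as in Figure 1
```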
2.1.1. Dropping the unnecessary columns from the dataset
Figure 2 – Dropping unwanted columns/variables
2.1.2. The data set is now ready for Exploratory Data Analysis. As per the output, the dimension or
shape of the data set is (1525, 9); therefore the data set has 1525 rows and 9 columns. Refer to
the figure below.
Figure 3 – Shape and Data type information of the data set
As per the figure, we can observe there are 2 variables of 'object' type and the remaining variables are
of 'int' type. Also, there are no duplicate rows in the data set.
2.1.3. There are no null values in the data set. Refer to the figure below.
Figure 4 – Checking for Null Values
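The structural checks described above reduce to a few pandas calls, sketched here on a small stand-in frame (the real data has 1525 rows and 9 columns):

```python
import pandas as pd

# Stand-in frame with the same mix of object and integer columns.
df = pd.DataFrame({
    "vote": ["Labour", "Conservative", "Labour"],
    "age": [43, 67, 35],
    "gender": ["female", "male", "female"],
})
print(df.shape)               # dimensions of the dataset
print(df.dtypes)              # 'object' vs integer column types
print(df.duplicated().sum())  # count of duplicate rows (0 here)
print(df.isnull().sum())      # null-value check per column
```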
2.1.4. On the basis of the problem description it is clear that 'vote' is the dependent/target
variable and the remaining variables are independent variables. Going forward, the report uses the
terminology of dependent and independent variables to refer to the columns.
Hence, proportions for the categorical variables are calculated.
Figure 5 – Proportions checking for Categorical Variables
From the figure we can infer that 1063 respondents cast their vote for Labour and 462 for
Conservative. The data set is therefore split roughly 70:30 between the two classes, which is
adequate for creating a model. Female voters stand in first place with a count of 812 over male
voters with a count of 713. In addition, we can conclude that the data set has no bad values.
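The proportion check behind Figure 5 is a `value_counts` call on each categorical column; the counts below mirror the figures quoted above (1063 Labour, 462 Conservative):

```python
import pandas as pd

# Reconstructed target column with the counts reported in Figure 5.
votes = pd.Series(["Labour"] * 1063 + ["Conservative"] * 462, name="vote")
print(votes.value_counts())                       # raw class counts
print(votes.value_counts(normalize=True).round(2))  # ~0.70 / 0.30 split
```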
2.1.5. Descriptive Stats:
Figure 6 – Descriptive Information
As observed, the stats for the target variable show Labour with the higher number of votes cast.
Age ranges from 24 to 93 with a mean of 54 and a standard deviation of 16. The mean and
median are almost equal, suggesting a near-normal distribution.
economic.cond.national ranges from 1 to 5 with a mean of 3.24, where 1 means bad and 5
means good conditions. The mean and median are almost equal, again suggesting a near-normal
distribution.
economic.cond.household, Blair and Hague also range from 1 to 5, with means of 3.14, 3.33 and
2.7 respectively, where 1 means bad and 5 means good. Their means and medians are likewise
close, suggesting near-normal distributions.
The 'Europe' feature, describing Eurosceptic sentiment, has a mean of 6 on a scale of 1 to 15.
This suggests most voters are roughly neutral on the Europe sentiment feature.
Female voters outnumber male voters.
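The figures quoted above (age mean ~54, std ~16, range 24 to 93, mean versus median comparison) come from a single `describe()` call; a stand-in column shows the shape of the output:

```python
import pandas as pd

# Stand-in 'age' column spanning the reported range 24-93.
df = pd.DataFrame({"age": [24, 43, 54, 67, 93]})
stats = df.describe()                 # count, mean, std, min, quartiles, max
print(stats)
print(df["age"].mean(), df["age"].median())  # mean vs median comparison
```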
2.1.6. Distribution and boxplot of the variables
Inferences based on the boxplots and distribution plots:
1. The 'Age' variable is approximately normally distributed.
2. The remaining features (economic.cond.national, economic.cond.household, Blair, Hague
and political.knowledge) are ordinal rather than continuous variables, so multiple spikes
appear in their distributions.
Figure 7 – Distribution of the variables
2.2 Perform Univariate and Bivariate Analysis. Do exploratory data
analysis. Check for Outliers.
2.2.1. Univariate Analysis
Figure 8 – Count plot for Vote Variable
The majority of the voters are Labour, so the vote bank lies mostly with the Labour section.
2.2.2. Bivariate Analysis
1. Strip plot between Vote and Age
Figure 9 – Strip plot Age vs Vote
Based on the strip plot, age shows only a weak relationship with the vote.
Most of the Labour voters lie between the ages of 40 and 50.
2. Strip plot between Vote and Economic.cond.national
Figure 10 – Strip plot Vote vs Economic.cond.national
Based on the data, there is no clear correlation.
3. Strip plot between Economic conditional household and Age
Figure 11 – Economic conditional household vs Age
Based on the data, there is no clear correlation.
4. Strip plots of the remaining features against the Vote variable
Figure 12 – Strip plot against Vote feature
5. Correlation between the variables:
Figure 13 – Heat map of the Variables
Correlation between the variables is very weak as per the heat map.
Hague and economic.cond.national have a negative relation.
Europe and economic.cond.national also have a negative relation.
Age and political.knowledge also have a negative relation, which is quite opposite to what
one would expect.
6. Pairplot between the variables:
Figure 14 – Pair Plot for the variables
From the pair plot, we can infer that there is no strong relation between any pair of
variables.
Hence the dataset is not affected by multi-collinearity and is suitable for modelling.
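The correlation check behind the heat map is a `.corr()` call, which returns pairwise Pearson correlations; a small stand-in frame shows the call (the values here are illustrative, not the survey's):

```python
import pandas as pd

# Stand-in numeric columns; .corr() gives the matrix visualised in the heat map.
df = pd.DataFrame({
    "age": [24, 43, 54, 67],
    "Europe": [3, 6, 8, 11],
    "political.knowledge": [3, 2, 2, 0],
})
corr = df.corr()          # pairwise Pearson correlation matrix
print(corr.round(2))
```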
2.2.3. Treating Outliers
Before label encoding and model creation, we observed that there are some outliers in several
variables. These outliers are treated below.
Figure 15 – Outlier Treatment
Figure 16 – Box plot after outlier Treatment
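The report does not spell out its exact capping rule, so the sketch below assumes the common approach of clipping values to the 1.5×IQR whiskers, which matches the "after treatment" box plots showing no points beyond the whiskers:

```python
import pandas as pd

# Toy series with one obvious outlier (40).
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 40])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # whisker limits
capped = s.clip(lower=lower, upper=upper)       # pull outliers to the limits
print(capped.max())   # the outlier is capped at the upper whisker
```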
2.3 Encode the data (having string values) for Modelling. Is Scaling
necessary here or not? Data Split: Split the data into train and test
(70:30)
Ans: To create a model for the analysis we need to encode the categorical variables. We therefore
apply label encoding to the categorical variables and use the encoded data set for model building.
2.3.1. Data Encoding and Model building for Machine Learning Analysis
2.3.1.1. Converting Target Variable to Integer type using Categorical Function
Here ‘0’ represents ‘Conservative’ and ‘1’ represents ‘Labour’
Figure 17 – Converting Target Variable to Integer type
2.3.1.2. Data information after conversion.
Figure 18 – Dataset info after encoding the variables
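The encoding step above can be sketched with pandas categorical codes; the category order is fixed so that 0 maps to 'Conservative' and 1 to 'Labour', as stated in 2.3.1.1:

```python
import pandas as pd

df = pd.DataFrame({
    "vote": ["Labour", "Conservative", "Labour"],
    "gender": ["female", "male", "female"],
})
# Fix the category order so the codes match the report: 0 = Conservative, 1 = Labour.
df["vote"] = pd.Categorical(df["vote"],
                            categories=["Conservative", "Labour"]).codes
df["gender"] = pd.Categorical(df["gender"]).codes  # alphabetical: female=0, male=1
print(df.dtypes)   # all columns are now integer-typed
print(df.head())
```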
2.4 Apply Logistic Regression and LDA (linear discriminant analysis).
2.4.1. Model Building for Machine Learning Analysis – Split of data
Figure 19 – Test_Train_Split of the data
The data is not scaled here, since the independent variables do not differ much in
range; scaling is therefore not applied anywhere in the modelling except for KNN.
The train/test split is made with a 70:30 ratio.
X is an array of the independent variables and y is the target variable.
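The 70:30 split described above is a single `train_test_split` call; synthetic arrays stand in for the survey data, and the `random_state` value is an assumption for reproducibility:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# X stands for the independent variables, y for the encoded target.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)
print(X_train.shape, X_test.shape)   # 70% of rows in train, 30% in test
```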
2.4.2. Logistic Regression model generation using the Grid search method
Figure 20 – Logistic Regression Model building
Penalty: 'elasticnet', 'l2', 'none'
Solver: 'newton-cg', 'saga'
Tol: 0.001, 0.00001
Finally, based on the GridSearchCV technique, the best estimator is captured as best_model and
predictions are made using this best_model.
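The grid search described above can be sketched as follows, on synthetic stand-in data. The report's grid also tried the 'elasticnet' and 'none' penalties; only 'l2' is kept here so the sketch runs unchanged on any recent scikit-learn version:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the survey's train split.
X, y = make_classification(n_samples=200, n_features=8, random_state=1)
grid = {
    "solver": ["newton-cg", "saga"],
    "tol": [0.001, 0.00001],
    "penalty": ["l2"],
}
search = GridSearchCV(LogisticRegression(max_iter=1000), grid, cv=3)
search.fit(X, y)
best_model = search.best_estimator_   # predictions are made with this model
print(search.best_params_)
```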
2.4.3. Creating model for LDA
Figure 21 – Model for LDA
The data is not scaled here, since the independent variables do not differ much in
range; scaling is therefore not applied anywhere in the modelling except for KNN.
The train/test split is made with a 70:30 ratio.
X is an array of the independent variables and y is the target variable.
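The LDA model itself is a plain fit-and-score, sketched here on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Stand-in for the unscaled 70:30 train split described above.
X, y = make_classification(n_samples=200, n_features=8, random_state=1)
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print(lda.score(X, y))   # mean accuracy on the fitted data
```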
2.5 Apply KNN Model and Naïve Bayes Model. Interpret the results.
2.5.1. Creating model using Gaussian NB
Figure 22 – Model for Gaussian NB
The data is not scaled here, since the independent variables do not differ much in
range; scaling is therefore not applied anywhere in the modelling except for KNN.
The train/test split is made with a 70:30 ratio.
X is an array of the independent variables and y is the target variable.
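The Gaussian Naive Bayes model follows the same pattern, sketched on stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Stand-in for the unscaled train split; GaussianNB needs no hyperparameters here.
X, y = make_classification(n_samples=200, n_features=8, random_state=1)
nb = GaussianNB()
nb.fit(X, y)
print(nb.score(X, y))   # mean accuracy on the fitted data
```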
2.5.2. Creating model using KNN
2.5.2.1 Scaling the data set for KNN model building
Figure 23 – Scaling the Data Set for model building
Scaling is applied here because the KNN model is based on Euclidean distance
measurements, so a z-score transform is applied to the train data set.
The train/test split is made with a 70:30 ratio.
X is an array of the independent variables and y is the target variable.
2.5.2.2 Building Model:
Figure 24 – Model building
2.5.2.3 Building Model (K value =7):
Figure 25 – Model building
The K value is taken as 7.
The model score is 71.41%.
X is an array of the independent variables and y is the target variable.
2.5.2.4 Calculating Misclassification error of KNN model for a range from 1 to 19 at steps of 2
Figure 26 – Calculating Misclassification error for various values of K
It is observed that the misclassification error is lowest at K = 19.
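The K-selection loop above can be sketched as follows: z-score the features (KNN relies on Euclidean distance), then record the misclassification error for odd K from 1 to 19. Synthetic data stands in for the survey, so the best K found here need not be 19:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

# Fit the scaler on train only, then apply the same transform to test.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

errors = {}
for k in range(1, 20, 2):   # odd K from 1 to 19
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train_s, y_train)
    errors[k] = 1 - knn.score(X_test_s, y_test)   # misclassification error
best_k = min(errors, key=errors.get)              # K with the lowest error
print(errors)
print(best_k)
```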
2.5.2.5 Plotting Misclassification Error for range of K values
Figure 27 – Plotting Misclassification error for various values of K
2.5.2.6 Building model for K=19
Figure 28 – Model building for K =19
The model score is observed as 69.8%.
2.6 Model Tuning, Bagging (Random Forest should be applied for
Bagging), and Boosting.
2.6.1. Creating model using Random Forest Technique
Figure 29 – Model using Random Forest
The model is created using an ensemble technique, with 100 estimators.
The train score is observed as 99.9%.
The test score is observed as 84%, which indicates some overfitting. We will
therefore tune the model using methods like Bagging and Boosting.
2.6.2. Creating model using Random Forest model and applying Bagging
Figure 30 – Model using Random Forest and applying Bagging
The model is created using an ensemble technique, with RF_model as the base
estimator.
The train score is observed as 97.9%.
The test score is observed as 84.27%, which still indicates some overfitting, so we
also check model tuning using Boosting.
2.6.3. Creating model using Random Forest model and applying Boosting
Figure 31 – Model using Random Forest and applying Boosting
The model is created using the Gradient Boosting ensemble technique. (Note that
scikit-learn's Gradient Boosting builds its own trees; it does not take the Random
Forest model as a base estimator.)
The train score is observed as 88.75%.
The test score is observed as 83.84%; since the train and test scores are in line,
this can be considered the best model.
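The three ensemble steps above can be sketched together on synthetic data: a 100-tree Random Forest, a Bagging classifier wrapping that forest, and a Gradient Boosting classifier (which, as noted, builds its own trees). The base estimator is passed positionally so the sketch works across scikit-learn versions that name the parameter differently:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

rf = RandomForestClassifier(n_estimators=100, random_state=1)
rf.fit(X_train, y_train)

# Bagging over the Random Forest; the base estimator is the first positional arg.
bag = BaggingClassifier(rf, n_estimators=5, random_state=1)
bag.fit(X_train, y_train)

gb = GradientBoostingClassifier(random_state=1)
gb.fit(X_train, y_train)

for name, model in [("RF", rf), ("Bagging", bag), ("Boosting", gb)]:
    # Comparing train vs test accuracy exposes overfitting, as in 2.6.1-2.6.3.
    print(name, model.score(X_train, y_train), model.score(X_test, y_test))
```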
2.7 Performance Metrics: Check the performance of Predictions on Train
and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get
ROC_AUC score for each model. Final Model: Compare the models and
write inference which model is best/optimized.
2.7.1. Performance Metrics of Logistic Regression
2.7.1.1. Confusion Matrix, accuracy and other metrics of Train data
Figure 32 – Confusion Matrix of Train Data
Accuracy of the train data is 83.03%
Recall for class '1' on the train data is 0.91
2.7.1.2. Confusion Matrix, accuracy and other metrics of Test data
Figure 33 – Confusion Matrix of Test Data
Accuracy of the test data is 84.93%
Recall for class '1' on the test data is 0.92
The number of false positives comes down to 45 on the test data, which is a good
sign compared to 111 on the train data
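The metric calls behind the figures in this section are `confusion_matrix`, `accuracy_score` and `classification_report`; toy labels stand in for what `best_model.predict(X_test)` would return:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 1, 0, 0, 1, 1, 0, 1])   # stand-in predictions
print(confusion_matrix(y_true, y_pred))       # TN/FP over FN/TP layout
print(accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred))  # precision/recall/F1 per class
```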
2.7.1.3. Accuracy Score AUC_score and RoC Curve for the train data
Figure 34 – AUC and RoC of Train Data
AUC score of the train data is 87.72%
2.7.1.4. Accuracy Score, AUC_score and RoC Curve for the test data
Figure 35 – AUC and RoC of Test Data
AUC score of the test data is 87.72%
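The AUC and ROC values quoted for each model come from `roc_curve` and `roc_auc_score`; in the notebook the probabilities would come from `model.predict_proba(X_test)[:, 1]`, while toy values are used here:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 1, 0])
probs = np.array([0.2, 0.4, 0.8, 0.6, 0.9, 0.3])  # stand-in class-1 probabilities
fpr, tpr, thresholds = roc_curve(y_true, probs)   # points for the ROC plot
print(roc_auc_score(y_true, probs))               # area under the ROC curve
```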
2.7.2. Performance Metrics of LDA
2.7.2.1. Confusion Matrix of Train and Test data
Figure 36 – Confusion Matrix of Train and Test Data
The number of false positives for class '1' is 105 on the train data and 41 on the
test data
2.7.2.2. Classification report of Train and Test data
Figure 37 – Classification Report of Train and Test Data
Accuracy of the Train data is 82%
Accuracy of the Test Data is 84%
2.7.2.3. AUC_score and RoC Curve for the train and test data
Figure 38 – AUC and RoC of Train and Test Data
AUC score of the train data is 87.7%
AUC score of the test data is 91.6%
2.7.3. Performance Metrics of Naïve Bayes
2.7.3.1. Confusion Matrix, accuracy and other metrics of Train data
Figure 39 – Confusion Matrix of Train Data
Accuracy of the train data is 82.66%
Recall for class '1' on the train data is 0.88
2.7.3.2. Confusion Matrix, accuracy and other metrics of Test data
Figure 40 – Confusion Matrix of Test Data
Accuracy of the test data is 84.71%
Recall for class '1' on the test data is 0.90
The number of false positives comes down to 37 on the test data, which is a good
sign compared to 97 on the train data
2.7.3.3. Accuracy Score AUC_score and RoC Curve for the train data
Figure 41 – AUC and RoC of Train Data
AUC score of the train data is 87.52%
2.7.3.4. Accuracy Score, AUC_score and RoC Curve for the test data
Figure 42 – AUC and RoC of Test Data
AUC score of the test data is 91.02%
2.7.4. Performance Metrics of KNN Model with K value as 19
2.7.4.1. Confusion Matrix, accuracy and other metrics of Train data
Figure 43 – Confusion Matrix of Train Data
Accuracy of the train data is 69.82%
Recall for class '1' on the train data is 0.99
2.7.4.2. Confusion Matrix, accuracy and other metrics of Test data
Figure 44 – Confusion Matrix of Test Data
Accuracy of the test data is 67.46%
Recall for class '1' on the test data is 0.96
The number of false positives comes down to 137 on the test data, compared with
311 on the train data
2.7.4.3. AUC_score and RoC Curve for the train and test data
Figure 45 – AUC and RoC of Train and Test Data
AUC score of the train data is 61.30%
AUC score of the test data is 46.40%
2.7.5. Performance Metrics of Random Forest with Bagging Technique
2.7.5.1. Confusion Matrix, accuracy and other metrics of Train data
Figure 46 – Confusion Matrix of Train Data
Accuracy of the train data is 97.18%
Recall for class '1' on the train data is 0.99
2.7.5.2. Confusion Matrix, accuracy and other metrics of Test data
Figure 47 – Confusion Matrix of Test Data
Accuracy of the test data is 84.27%
Recall for class '1' on the test data is 0.92
2.7.5.3. AUC_score and RoC Curve for the train and test data
Figure 48 – AUC and RoC of Train and Test Data
AUC score of the train data is 99.70%
AUC score of the test data is 91.80%
2.7.6. Performance Metrics of Random Forest with Boosting Technique
2.7.6.1. Confusion Matrix, accuracy and other metrics of Train data
Figure 49 – Confusion Matrix of Train Data
Here the Gradient Boosting technique is used
Accuracy of the train data is 88.75%
Recall for class '1' on the train data is 0.94
2.7.6.2. Confusion Matrix, accuracy and other metrics of Test data
Figure 50 – Confusion Matrix of Test Data
Accuracy of the test data is 83.84%
Recall for class '1' on the test data is 0.92
2.7.6.3. AUC_score and RoC Curve for the train and test data
Figure 51 – AUC and RoC of Train and Test Data
AUC score of the train data is 94.80%
AUC score of the test data is 90.80%
2.7.7. Comparison between Models
                LR Model       LDA Model      NB Model       KNN Model      Bagging        Boosting
                Train   Test   Train   Test   Train   Test   Train   Test   Train   Test   Train   Test
Accuracy        83.03   84.93  82.75   84.71  82.66   84.71  69.82   67.46  97.18   84.27  88.75   83.84
Precision (1)   86      87     86      88     87      89     70      69     97      86     90      86
AUC             87.72   87.72  87.7    91.6   87.52   91.02  61.3    46.4   99.7    91.8   94.8    90.8
Recall (1)      91      92     89      91     88      90     99      96     99      92     94      92
F1-Score (1)    88      90     88      89     88      89     82      80     98      89     92      89
Table 1 – Comparison between various models
On comparing the models, the Random Forest technique with Boosting appears the
most consistent across the various model evaluation parameters.
The train and test scores for the Boosting model are close to each other and
consistent.
The accuracy of Boosting is in second place, with 88.75% on train and 83.84% on
test data. Bagging has the highest train accuracy at 97.18%, but its test accuracy
is almost 13 percentage points lower, indicating overfitting.
Finally, the Random Forest technique tuned with Boosting provides the better
model.
2.8 Based on these predictions, what are the insights? (5 marks).
On the whole, based on the outcomes of the models for this data set, the following insights are observed:
Female voters outnumber male voters, so parties will have to attract male voters by
suitable means.
Most of the Labour voters lie between the ages of 40 and 50, and the Labour party
appears to attract this age group, possibly through monetary benefits; a voter in this
age range therefore has a high chance of casting his or her vote for the Labour party.
Based on the strip plots, voter density for the Labour party is high where the national
economic condition is rated high, so voters in this zone have a high chance of casting
their vote for the Labour party.
The same holds for households with a high economic condition rating.
The Random Forest algorithm tuned with the Gradient Boosting technique provides the
better model, giving consistent results on both the training and test data.
3. Problem Statement:
In this particular project, we are going to work on the inaugural corpora from the nltk in Python.
We will be looking at the following speeches of the Presidents of the United States of America:
President Franklin D. Roosevelt in 1941
President John F. Kennedy in 1961
President Richard Nixon in 1973.
3.1 Find the number of characters, words, and sentences for the
mentioned documents.
Ans: To start the counting, we first need to load the data. The figure below shows a snapshot of
the data loading step.
Figure 52 - Loading Dataset into Jupyter notebook
The nltk package is downloaded and the inaugural speech corpus is imported.
The required presidential speeches are extracted and assigned to their respective
variables.
3.1.1. Calculate length of data
Ans: To calculate the length of the data, the code below is executed.
Figure 53 – Length of data
Number of characters in Roosevelt file is 7571.
Number of characters in Kennedy file is 7618.
Number of characters in Nixon file is 9991.
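The loading and character-count steps can be sketched as below. The nltk lines are commented out because they need a one-time corpus download; the short sample string is only a stand-in for the real speech text, and the corpus file identifiers shown are the usual nltk inaugural ones.

```python
# Loading the speeches (requires nltk and a one-time corpus download):
#   import nltk
#   nltk.download('inaugural')
#   from nltk.corpus import inaugural
#   roosevelt = inaugural.raw('1941-Roosevelt.txt')
#   kennedy   = inaugural.raw('1961-Kennedy.txt')
#   nixon     = inaugural.raw('1973-Nixon.txt')

# Stand-in text so the sketch runs without nltk:
roosevelt = "On each national day of inauguration since 1789, the people have renewed their sense of dedication."

# The number of characters is simply the length of the raw string.
num_chars = len(roosevelt)
print(f"Number of characters in Roosevelt file is {num_chars}.")
```

Applying `len()` to the full raw text of each speech gives the character counts reported above.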
3.1.2. Calculate No of words
Ans: To calculate the number of words in the data, the code below is executed. The split()
function is used to extract the words.
Figure 54 – No of words in each speech
Number of words in Roosevelt file is 1360.
Number of words in Kennedy file is 1390.
Number of words in Nixon file is 1819.
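The word-count step can be sketched as follows; the sample string again stands in for the raw speech text loaded from the inaugural corpus.

```python
# Stand-in for the raw speech text loaded from nltk's inaugural corpus:
roosevelt = "On each national day of inauguration since 1789, the people have renewed their sense of dedication."

# split() breaks the string on whitespace, giving a rough word list.
words = roosevelt.split()
num_words = len(words)
print(f"Number of words in Roosevelt file is {num_words}.")
```

Note that whitespace splitting keeps punctuation attached to words (e.g. "1789,"), which is fine for a simple count.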
This Business Report is generated based on the Data set extracted from reliable sources
30
3.1.3. Calculate No of Sentences
Ans: To calculate the number of sentences in the data, the code below is executed. The
sent_tokenize function is used to count the sentences.
3.1.3.1 A “sentence_count” function is defined to count the number of sentences in a particular speech
Figure 55 – User defined function to count no of sentences in a Text file
3.1.3.2 Code to calculate the number of sentences in each speech
Figure 56 – No of sentences in each presidential speech
Number of sentences in Roosevelt file is 68.
Number of sentences in Kennedy file is 52.
Number of sentences in Nixon file is 68.
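A sketch of the sentence-count helper is below. The report uses nltk's sent_tokenize; here a simple regex split on sentence-ending punctuation serves as a stand-in so the sketch runs without nltk, and the sample sentence is invented for illustration.

```python
import re

def sentence_count(text):
    # Split on runs of '.', '!' or '?' followed by whitespace or
    # end-of-string, then drop any empty pieces.
    sentences = [s for s in re.split(r'[.!?]+(?:\s+|$)', text) if s.strip()]
    return len(sentences)

# With nltk installed this would instead be:
#   from nltk.tokenize import sent_tokenize
#   len(sent_tokenize(text))

sample = "We face arduous days. Yet we act and act quickly. Shall we falter?"
print(sentence_count(sample))
```

Calling the function on each speech's raw text gives the sentence counts reported above.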
3.2 Remove all the stopwords from all three speeches.
3.2.1. Import predefined stopwords from the nltk
Ans: To import the predefined stopwords from nltk, the code below is executed.
Figure 57 – Importing stopwords to python
3.2.2. Step to remove all punctuation
Ans: In this step, punctuation and unnecessary words are removed from the text files.
Figure 58 – Code to remove punctuation and unnecessary words
The listed punctuation marks and additional unnecessary words are removed.
3.2.3. Removing stopwords in all the files
Ans: To remove the stopwords, the code below is executed.
Figure 59 – Removing stopwords from the speeches
The output shows that the punctuation has been removed, and the cleaned text is stored
in variables with the suffix ‘_clean’.
We can see that the comma has been removed from the token list.
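The stopword and punctuation removal can be sketched as below. The report pulls the full English stopword list from nltk (`stopwords.words('english')`); a tiny inline set and an invented token list stand in for it here so the sketch is self-contained.

```python
import string

# Stand-in for nltk's list:
#   from nltk.corpus import stopwords
#   stop_words = set(stopwords.words('english'))
stop_words = {'the', 'of', 'and', 'to', 'in', 'a', 'is', 'it', 'on'}

# Tokens as they might come out of word_tokenize:
tokens = ['On', 'each', 'national', 'day', ',', 'the', 'people', 'have',
          'renewed', 'their', 'dedication', '.']

# Drop punctuation tokens and stopwords (matched case-insensitively).
roosevelt_clean = [t for t in tokens
                   if t not in string.punctuation
                   and t.lower() not in stop_words]
print(roosevelt_clean)
```

The same filter applied to each tokenized speech yields the `_clean` variables shown in Figure 59.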
3.3 Which word occurs the most number of times in his inaugural address
for each president? Mention the top three words. (after removing the
stopwords)
3.3.1. Step to convert tokens into lower case.
Ans: To find the top-frequency words, all tokens are first converted to lower case. The algorithm
treats lower-case and upper-case words as different, which would distort the counts and
frequencies. Hence the built-in lower() function is used to convert all tokens to lower case and
maintain consistency across the three speeches. The lines of code below are executed.
Figure 60 – Converting tokens into Lower case
We can observe that words like ‘Mr’ are converted to ‘mr’ after the above
execution.
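The lower-casing step is a one-line comprehension; the token list here is an invented stand-in for the tokenized speech.

```python
# Tokens as they might come out of word_tokenize:
tokens = ['Mr', 'Vice', 'President', 'Mr', 'Chief', 'Justice']

# str.lower() makes later counting case-insensitive, so 'Mr' and 'mr'
# are treated as the same word.
tokens_lower = [t.lower() for t in tokens]
print(tokens_lower)
```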
3.3.2. Step to Lemmatization with POS
Ans: In this step, lemmatization is performed to convert all tokens to their root words. For this,
WordNetLemmatizer is imported into the Python notebook.
Figure 61 – Lemmatization with POS tag
In the above step, words are converted to their root words based on their
parts of speech.
Words that are not categorized under any part of speech are retained as they are.
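POS-aware lemmatization needs a mapping from the Penn Treebank tags produced by nltk's pos_tag to the WordNet tag set expected by WordNetLemmatizer. A common helper for that mapping is sketched below (the default-to-noun behaviour matches the note above that uncategorized words are left as they are); the lemmatizer call itself is shown in comments since it needs an nltk WordNet download.

```python
# Map a Penn Treebank tag (e.g. 'VBD', 'NNS') to a WordNet POS tag.
def penn_to_wordnet(tag):
    mapping = {'J': 'a',   # adjective
               'V': 'v',   # verb
               'N': 'n',   # noun
               'R': 'r'}   # adverb
    return mapping.get(tag[0], 'n')  # default to noun

# With nltk this feeds into the lemmatizer, e.g.:
#   from nltk.stem import WordNetLemmatizer
#   lemmatizer = WordNetLemmatizer()
#   lemmatizer.lemmatize('renewed', penn_to_wordnet('VBD'))

print(penn_to_wordnet('VBD'), penn_to_wordnet('NNS'))
```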
3.3.3. Top three word frequency calculation for each speech
3.3.3.1: Word frequency is calculated for the Roosevelt speech through the code below.
Figure 62 – Word Frequency calculation for Roosevelt Speech
Top three words in Roosevelt speech are “Nation”, “It” and “Life”
3.3.3.2: Word frequency is calculated for the Kennedy speech through the code below.
Figure 63 – Word Frequency calculation for Kennedy Speech
Top three words in Kennedy speech are “World”, “Let” and “Side”
3.3.3.3: Word frequency is calculated for the Nixon speech through the code below.
Figure 64 – Word Frequency calculation for Nixon Speech
Top three words in Nixon speech are “America”, “Peace” and “World”
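The frequency calculation for any of the cleaned speeches can be sketched with collections.Counter; the token list below is an invented stand-in for the cleaned, lower-cased tokens.

```python
from collections import Counter

# Cleaned, lower-cased tokens (stand-in for a real speech):
tokens = ['nation', 'life', 'nation', 'spirit', 'nation',
          'life', 'democracy', 'freedom', 'life', 'nation']

# Counter tallies token frequencies; most_common(3) gives the top three.
freq = Counter(tokens)
top_three = freq.most_common(3)
print(top_three)
```

nltk's own `FreqDist` offers the same `most_common` interface and could be used interchangeably here.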
3.4 Plot the word cloud of each of the three speeches. (after removing the
stopwords)
Ans: A word cloud is a visual representation of the words in a particular text file. The size of
each word indicates its frequency: the bigger the word, the higher its frequency.
3.4.1. Now we create the word cloud for the Roosevelt speech. WordCloud is imported from the
wordcloud package to create the word cloud.
Figure 65 – Word cloud for Roosevelt speech
3.4.2. Now we create the word cloud for the Kennedy speech. WordCloud is imported from the
wordcloud package to create the word cloud.
Figure 66 – Word cloud for Kennedy speech
3.4.3. Now we create the word cloud for the Nixon speech. WordCloud is imported from the
wordcloud package to create the word cloud.
Figure 67 – Word cloud for Nixon speech
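The word-cloud step for any of the speeches can be sketched as below. The rendering lines are commented out because they need the third-party wordcloud and matplotlib packages; the token list is an invented stand-in for a cleaned speech.

```python
from collections import Counter

# Cleaned tokens (stand-in for a real speech):
tokens = ['america', 'peace', 'world', 'america', 'peace', 'america']
freqs = Counter(tokens)

# Rendering requires the third-party 'wordcloud' and 'matplotlib' packages:
#   from wordcloud import WordCloud
#   import matplotlib.pyplot as plt
#   wc = WordCloud(width=800, height=400, background_color='white')
#   wc.generate_from_frequencies(freqs)
#   plt.imshow(wc, interpolation='bilinear')
#   plt.axis('off')
#   plt.show()

print(freqs.most_common(3))
```

Since the word cloud sizes words by frequency, the same Counter used for the top-three words drives the plot directly.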