ML-2 Project Final
BUSINESS REPORT
CODED PROJECT
Table of Contents
PROBLEM 1 ....................................................................................................................................... 5
Data Overview ............................................................................................................................... 5
Business Context:....................................................................................................................... 5
Objective: .................................................................................................................................. 5
Data Dictionary: ......................................................................................................................... 6
Problem 1.1 - Define the problem and perform Exploratory Data Analysis ..................................... 6
Problem definition - Check shape, Data types, statistical summary ............................................. 6
Univariate Analysis:.................................................................................................................. 13
Bivariate Analysis: .................................................................................................................... 15
Problem 1.2 - Data Preprocessing ................................................................................................ 19
Outlier Treatment: ................................................................................................................... 19
Encode the Data: ..................................................................................................................... 21
Data split: ................................................................................................................................ 21
Problem 1.3 – Model Building and Model evaluation ................................................................... 22
Metrics of Choice (Justify the evaluation metrics): ................................................................... 22
Model Building (KNN, Naive bayes, Bagging, Boosting): ............................................................... 24
Problem 1.4 – Model Performance Improvement: ....................................................................... 29
Improve the model performance of bagging and boosting models by tuning the model -
Comment on the model performance improvement on training and test data: ........................ 29
Problem 1.5 - Final Model Selection............................................................................................. 35
Compare all the models built so far - Select the final model with proper justification - Check
the most important features in the final model and draw inferences: ....................................... 35
Problem 1.6 Actionable Insights & Recommendations: ................................................................ 42
Compare all four models - Conclude with the key takeaways for the business .......................... 43
PROBLEM 2 ..................................................................................................................................... 44
Data Overview ............................................................................................................................. 44
Business Context:..................................................................................................................... 44
Objective: ................................................................................................................................ 44
Data Dictionary: ....................................................................................................................... 45
Problem 2.1 - Define the problem and perform Exploratory Data Analysis ................................... 45
Problem definition - Find the number of Characters, words & sentences in all
three speeches. ....................................................................................................................... 45
Number of Characters in all the three speeches. ...................................................................... 46
Number of Words in all the three speeches.............................................................................. 46
Number of Sentences in all the three speeches. ....................................................................... 47
Problem 2.2 – Text Cleaning ......................................................................................................... 49
Stopword removal - Stemming - Find the 3 most common words used in all three speeches: .... 49
Removal of Stopword............................................................................................................... 50
Stemming ................................................................................................................................ 50
Find the 3 most common words used in all three speeches....................................... 50
Inferences: ............................................................................................................................... 52
Problem 2.3 - Plot Word cloud of all three speeches .................................................... 52
Show the most common words used in all three speeches in the form of word clouds: ............ 52
Inferences: ............................................................................................................................... 52
List Of Figures
FIG: 1.1.1 FIRST FIVE ROWS OF THE DATASET ............................................................................................................... 7
FIG: 1.1.2 LAST FEW ROWS OF THE DATASET ................................................................................................................ 7
FIG: 1.1.3 SHAPE OF THE DATASET ............................................................................................................................ 7
FIG:1.1.4 DATA TYPES OF THE DATASET ....................................................................................................................... 8
FIG: 1.1.5 STATISTICAL SUMMARY OF THE DATASET ....................................................................................................... 9
FIG: 1.1.6 DESCRIPTIVE ANALYSIS OF NUMERICAL AND CATEGORICAL VARIABLE .................................................................. 10
FIG: 1.1.7 DUPLICATE CHECK................................................................................................................................. 11
FIG: 1.1.8 DESCRIPTION AND DATA TYPES ................................................................................................................. 11
FIG: 1.1.9 UNIQUE VALUES FOR CATEGORICAL VARIABLE .............................................................................................. 12
FIG: 1.1.10 HISTOGRAM AND BOXPLOT DEPICTING EACH NUMERICAL ATTRIBUTES .............................................................. 13
FIG: 1.1.11 AGE DISTRIBUTION PLOT ....................................................................................................................... 14
FIG: 1.1.12 BARPLOT DISTRIBUTION OF VOTE............................................................................................................. 15
FIG: 1.1.13 VOTE VS AGE BOXPLOT DISTRIBUTION ...................................................................................................... 15
FIG:1.1.14 RELATIONSHIP BETWEEN EACH ATTRIBUTE WITH RESPECT TO GENDER ................................................................ 16
FIG: 1.1.15 PAIR PLOT OF THE VARIABLES ................................................................................................................. 17
FIG:1.1.16 HEATMAP OF ATTRIBUTES...................................................................................................................... 18
FIG: 1.2.1 BOXPLOT TO CHECK THE PRESENCE OF OUTLIERS IN THE VARIABLE ...................................................................... 20
FIG: 1.2.2 BOXPLOT AFTER OUTLIER TREATMENT ......................................................................................................... 20
FIG: 1.2.3 DATA SET AFTER ENCODING ..................................................................................................................... 21
FIG: 1.2.4 DATA SPLIT RATIO ................................................................................................................................. 21
FIG: 1.3.1 ACCURACY, CONFUSION MATRIX AND CLASSIFICATION REPORT OF THE TRAIN DATA ................................................ 22
FIG: 1.3.2 ACCURACY, CONFUSION MATRIX AND CLASSIFICATION REPORT OF THE TEST DATA .................................................. 23
FIG: 1.3.3 DATASET USED IN KNN AFTER SCALING THROUGH ZSCORE ............................................................................... 24
FIG: 1.3.4 DATA SET SHAPE USED IN THE MODEL BUILDING ............................................................................................ 24
FIG: 1.3.5 ACCURACY, CONFUSION MATRIX AND CLASSIFICATION REPORT OF THE TRAIN DATA ................................................ 25
FIG: 1.3.6 ACCURACY, CONFUSION MATRIX AND CLASSIFICATION REPORT OF THE TEST DATA .................................................. 26
FIG: 1.3.7 DIFFERENCE BETWEEN TRAINING AND TESTING ACCURACY OF THE MODEL ............................................................ 26
FIG: 1.3.8 MISCLASSIFICATION PLOT FOR DIFFERENT K VALUES ........................................................................................ 27
FIG: 1.4.1 ACCURACY, CONFUSION MATRIX AND CLASSIFICATION REPORT OF THE TRAIN DATA ................................................ 29
FIG: 1.4.2 ACCURACY, CONFUSION MATRIX AND CLASSIFICATION REPORT OF THE TEST DATA .................................................. 30
FIG: 1.4.3 ACCURACY, CONFUSION MATRIX AND CLASSIFICATION REPORT OF THE TRAIN DATA ................................................ 30
FIG: 1.4.4 ACCURACY, CONFUSION MATRIX AND CLASSIFICATION REPORT OF THE TEST DATA .................................................. 31
FIG: 1.4.5 ACCURACY, CONFUSION MATRIX AND CLASSIFICATION REPORT OF THE TRAIN DATA ................................................ 32
FIG: 1.4.6 ACCURACY, CONFUSION MATRIX AND CLASSIFICATION REPORT OF THE TEST DATA .................................................. 32
FIG:1.4.7 DATA SET SHAPE WITH THE ACCURACY SCORE ................................................................................................ 34
FIG:2.1.1 LOADED TEXT SPEECHES INTO THE DATASET FORMAT ....................................................................................... 46
FIG:2.1.2 INFORMATION OF THE SPEECH TEXT IMPORTED .............................................................................................. 46
FIG:2.1.3 NUMBER OF CHARACTERS IN ALL THE THREE SPEECHES. ................................................................................... 46
FIG:2.1.4 NUMBER OF WORDS IN ALL THE THREE SPEECHES. ......................................................................................... 46
FIG:2.1.5 NUMBER OF SENTENCES IN ALL THE THREE SPEECHES. ..................................................................................... 47
FIG:2.1.6 COUNT OF STOP WORDS ......................................................................................................................... 48
FIG:2.1.7 COUNT OF SPECIAL CHARACTERS ............................................................................................................... 48
FIG: 2.1.8 COUNT OF NUMBERS............................................................................................................................. 48
FIG: 2.1.9 COUNT OF UPPERCASE WORDS ................................................................................................................ 48
FIG: 2.1.10 COUNT OF UPPERCASE LETTERS.............................................................................................................. 49
FIG: 2.1.11 NUMBER OF LISTINGS ON THE COMPLETE SPEECH SET ................................................................................... 49
FIG: 2.2.1 LOWER CASE CONVERSION OF THE SPEECH .................................................................................................. 49
FIG: 2.2.2 REMOVAL OF PUNCTUATION .................................................................................................................... 50
FIG:2.2.3 REMOVAL OF STOP WORD ....................................................................................................................... 50
FIG:2.2.4 FREQDIST COUNT OF WORDS AFTER STEMMING ............................................................................................ 50
FIG:2.2.5 THREE MOST COMMON WORDS USED IN ALL THE THREE SPEECHES ..................................................................... 50
FIG:2.2.6 TOP 3 COMMON WORDS USED IN ALL THE THREE SPEECHES AFTER REMOVING EXTENDED STOP WORD ......................... 51
FIG: 2.3.1 WORD CLOUD FOR INAUGURAL SPEECH (AFTER CLEANING)!! ........................................................................... 53
PROBLEM 1
Data Overview
Business Context:
CNBE, a prominent news channel, is gearing up to provide insightful coverage of recent
elections, recognizing the importance of data-driven analysis. A comprehensive survey has
been conducted, capturing the perspectives of 1525 voters across various demographic and
socio-economic factors. This dataset encompasses 9 variables, offering a rich source of
information regarding voters' characteristics and preferences.
The primary objective is to leverage machine learning to build a predictive model capable of
forecasting which political party a voter is likely to support. This predictive model, developed
based on the provided information, will serve as the foundation for creating an exit poll. The
exit poll aims to contribute to the accurate prediction of the overall election outcomes,
including determining which party is likely to secure the majority of seats.
Objective:
The primary objective is to leverage machine learning to build a predictive model capable of
forecasting which political party a voter is likely to support.
1. Define the problem and perform Exploratory Data Analysis - Problem definition - Check shape, Data
types, statistical summary - Univariate analysis - Multivariate analysis - Use appropriate
visualizations to identify the patterns and insights - Key meaningful observations on individual
variables and the relationship between variables.
2. Data Pre-processing - Prepare the data for modelling: - Missing value Treatment (if needed)
- Outlier Detection (treat, if needed) - Encode the data - Data split - Scale the data (and state
your reasons for scaling the features).
3. Model Building and Performance evaluation - Check the confusion matrix and classification metrics for
all the models (for both train and test datasets) - ROC-AUC score and plot the curve - Comment
on all the model performances.
4. Model Performance improvement - Improve the model performance of bagging and
boosting models by tuning the model - Comment on the model performance improvement on
training and test data.
5. Final Model Selection - Compare all the models built so far - Select the final model with
proper justification - Check the most important features in the final model and draw
inferences.
6. Actionable Insights & Recommendations - Compare all four models - Conclude with the
key takeaways for the business.
Data Dictionary:
Problem 1.1 - Define the problem and perform Exploratory Data Analysis
The problem defined in this context is to build a predictive model leveraging machine learning
techniques. The goal is to forecast which political party a voter is likely to support based on a
comprehensive survey capturing the perspectives of 1525 voters across various demographic
and socio-economic factors. The predictive model serves a dual purpose: it is intended to
contribute to insightful coverage by a news channel (CNBE) of recent elections and to form
the foundation for an exit poll. The exit poll aims to accurately predict overall election
outcomes, specifically determining which political party is likely to secure the majority of
seats. Therefore, the objective is to develop a robust and accurate machine learning model
that can effectively predict voter preferences and, by extension, forecast election results.
Observe the first few rows of the dataset, to check whether the dataset has been
loaded properly or not.
Get information about the number of rows and columns in the dataset.
Find out the data types of the columns to ensure that data is stored in the preferred
format and the value of each property is as expected.
Check the statistical summary of the dataset to get an overview of the numerical
columns of the data.
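A minimal pandas sketch of these checks (the file name below is an assumption; adjust it to the actual data source):

import pandas as pd

# Load the survey data (file name is illustrative)
df = pd.read_csv("election_survey.csv")

print(df.head())              # first five rows - verify the data loaded properly
print(df.shape)               # number of rows and columns
print(df.dtypes)              # data type of each column
print(df.describe())          # statistical summary of the numerical columns
print(df.duplicated().sum())  # number of duplicate rows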
Fig: 1.1.1 First five rows of the dataset
Fig:1.1.4 Data types of the dataset
There are 1525 rows and 10 columns. The unnamed column has been dropped from
the data frame, as it contains only serial-number information, which is not required
for further analysis.
From this information it can be inferred that there are 2 categorical variables and 8
integer variables, and there are no missing values: every attribute has 1525 non-null entries.
Fig: 1.1.5 Statistical Summary of the dataset
The dataset reveals some interesting patterns across its attributes. In terms of age, the
relatively high mean (54.18 years) suggests a generally older population.
The quartiles and median distribution indicate a fairly even spread, with the 25th
percentile at 41 years, the median at 53 years, and the 75th percentile at 67 years.
The economic conditions, both national and household, exhibit moderate means and
standard deviations, with quartiles indicating a reasonable spread of responses.
Blair's rating, with a mean of 3.33, appears relatively balanced, but the larger standard
deviation implies some variability in opinions. Similarly, Hague's rating has a lower
mean of 2.75 and a notable standard deviation, indicating a more diverse range of
opinions.
Europe's attribute demonstrates a higher mean (6.73) and a wider standard deviation,
suggesting varied sentiments towards European matters.
Political knowledge scores, with a mean of 1.54, show a low average understanding,
and the quartiles reveal a concentration of scores at the lower end, potentially
indicating a lack of widespread political knowledge among respondents.
Overall, while some attributes exhibit relatively stable and centred distributions,
others, such as political knowledge, display potential outliers or skewness that may
warrant further investigation.
Fig: 1.1.6 Descriptive analysis of Numerical and Categorical Variable
The dataset provides information on voting patterns and gender distribution among
1525 respondents. In terms of voting, there are two distinct categories, with Labour
being the most frequently chosen option, garnering 1063 votes.
This dominance suggests a significant preference for the Labour Party among the
respondents.
The gender distribution is somewhat skewed, with a total of 812 females compared to
713 males, indicating a higher representation of females in the dataset.
This imbalance may be relevant in the context of the study's objectives, potentially
influencing the overall perspectives and responses captured.
Understanding the dynamics between voting patterns and gender in this dataset could
provide valuable insights into the political landscape and preferences of the surveyed
population.
Further analyses focusing on the interplay between gender and voting choices could
unveil interesting nuances and contribute to a more comprehensive understanding of
the dataset.
8 rows had duplicate records which were dropped. After dropping the duplicate rows,
the shape of the data frame is 1517 rows and 9 columns with 7 integers and 2 object
variables.
Upon dropping the duplicate rows and comparing the descriptive statistics with the
previous set, the attributes' central tendencies and dispersions remain quite consistent.
The key patterns and trends in the dataset, such as the preference for the Labour party
in voting and the gender distribution, are unchanged from the previous analysis; any
differences are likely within the realm of expected variability, so the overall
interpretation and inferences drawn from the data remain largely the same.
Unique values for Categorical variable.
Further probing into the Categorical variables: The dataset provides insights into two
categorical variables: "VOTE" and "GENDER." For the "VOTE" variable, there are two
unique values - Conservative and Labour.
Among the 1517 respondents remaining after duplicate removal, 1057 individuals voted
for the Labour party, while 460 opted for the Conservative party. This distribution suggests a notable preference for
the Labour party among the surveyed population, with a substantial majority choosing
this political option.
Regarding the "GENDER" variable, there are also two unique values - male and female.
Among the respondents, there were 808 females and 709 males, indicating a slight
imbalance in gender representation, with a higher number of females in the dataset.
This gender distribution may be relevant for analysing how political preferences and
opinions vary across gender lines.
In summary, the data suggests a predominant inclination towards the Labour party in
voting patterns, and the gender distribution shows a slight majority of female
respondents.
These insights into the categorical variables contribute to a better understanding of
the political and demographic composition of the surveyed population. Further
analyses could explore potential relationships between voting preferences and gender
to uncover nuanced patterns within the dataset.
Univariate Analysis:
Univariate analysis involves the examination of individual variables in isolation to
understand their distribution, characteristics, and behaviour.
In the context of the Voter’s characteristics and preferences with respect to the party’s
choice measures dataset, univariate analysis entails scrutinizing each attribute
independently.
This process includes assessing summary statistics such as mean, median, standard
deviation, and quartiles to gauge the central tendency, spread, and shape of the
variable's distribution.
Visualization tools like histograms, kernel density plots, or box plots can be employed
to provide a visual representation of the data's distribution.
Additionally, identifying potential outliers and understanding the presence of missing
values are crucial aspects of univariate analysis.
This analysis aids in uncovering patterns, trends, and anomalies within each variable,
laying the groundwork for more in-depth exploration and guiding subsequent data
preprocessing and modeling decisions.
Positive skewness (right skew) is observed in several of the variables, providing insight
into the asymmetry of their distributions.
The above plots show how each attribute behaves and how it may contribute to the
election outcome through the voters' choice of party.
Most of the variables are skewed to the left or right, which indicates that the values
require pre-processing before proceeding with the analysis.
The age distribution is examined in terms of vote with respect to the gender variable (Fig: 1.1.11).
Fig: 1.1.12 barplot distribution of Vote
The above bar plot represents the percentage share of each party as per the voters'
preferences.
Labour accounts for 69.7% of the votes in the upcoming elections, whereas the
Conservative party is likely to get 30.3%, out of the 1517 observations in the survey.
Bivariate Analysis:
Fig:1.1.14 Relationship between each attribute with respect to gender
Fig: 1.1.15 Pair plot of the variables
Fig:1.1.16 Heatmap of Attributes
The following is the interpretation of the correlation matrix based on the above
values:
Age has a very weak positive correlation with economic conditions (national and
household), Blair's rating, and Europe's rating.
It has a weak negative correlation with political knowledge.
Economic conditions (national and household) are positively correlated, indicating
that individuals who perceive the national economic condition positively are likely to
perceive their household economic condition positively as well.
Both economic conditions have positive correlations with Blair's rating, suggesting
that individuals with a positive economic outlook may be more inclined to support
Blair.
Blair and Hague have a negative correlation, implying that respondents who rate
Blair higher tend to rate Hague lower and vice versa.
Europe has a negative correlation with the economic conditions and with Blair's rating,
suggesting that individuals with a positive economic outlook or favourable views of Blair
tend to hold less positive views towards Europe.
Political knowledge has weak negative correlations with age, economic conditions,
and ratings for Blair, Hague, and Europe. This suggests that individuals with higher
political knowledge tend to be slightly younger and may have different perspectives
on economic conditions and political figures.
These correlations provide insights into potential relationships between variables,
which can be valuable for feature selection in building predictive models. However,
keep in mind that correlation does not imply causation and further analysis or
experimentation may be needed to establish causal relationships.
Outlier Treatment:
Fig: 1.2.1 Boxplot to check the presence of outliers in the variable
Outliers in both economic.cond.household and economic.cond.national have been
removed using the IQR treatment.
One-hot encoding was applied to the 2 object variables. drop_first is used so that one
of the columns created from the levels of each categorical variable is excluded;
otherwise the dummy columns would be perfectly collinear. This is done to ensure that
we do not land in the dummy variable trap.
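A minimal sketch of this encoding step with pandas, assuming the cleaned data sits in a DataFrame df and that the two object columns are named vote and gender (the column names are assumptions):

import pandas as pd

# One-hot encode the two object columns; drop_first=True drops one level per
# variable so the dummy columns are not perfectly collinear (dummy variable trap)
df_encoded = pd.get_dummies(df, columns=["vote", "gender"], drop_first=True)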
Train-Test Split: Split X and y into training and test sets in a 70:30 ratio with random_state=1.
The dataset has been effectively split into training and testing sets, with the training
set comprising 1061 samples and the testing set consisting of 456 samples.
Each sample in both sets includes 8 features, reflecting various demographic and socio-
economic attributes.
The corresponding target variable, representing the political party vote, accompanies
each sample.
This division is essential for training a machine learning model on a subset of the data
and evaluating its performance on unseen data, ensuring that the model can
generalize well beyond the training set.
The provided shapes confirm the appropriate partitioning, setting the stage for model
training and assessment, ultimately contributing to the development of a robust
predictive model for forecasting political party preferences.
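A sketch of the split described above, assuming the encoded features are in X and the target in y:

from sklearn.model_selection import train_test_split

# 70:30 split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
print(X_train.shape, X_test.shape)  # expected: (1061, 8) (456, 8)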
Problem 1.3 – Model Building and Model evaluation
Now a GaussianNB classifier is built. The classifier is trained on the training data using
the fit() method. Once the classifier is built, the model is ready to make predictions,
which we obtain with the predict() method, passing the test-set features as its argument.
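A minimal sketch of this step with scikit-learn, following the description above:

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Train the Gaussian Naive Bayes classifier on the training data
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)

# Predict with the test-set features and evaluate
y_pred = nb_model.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))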
Fig: 1.3.1 Accuracy, Confusion matrix and Classification Report of the train data
The F1-score, which is the harmonic mean of precision and recall, is commonly used
when there is an imbalance between classes. The weighted average F1-score is around
0.89.
In summary, while the model may exhibit high accuracy, a more nuanced assessment
involving class-specific metrics should be conducted to capture the model's
effectiveness for both Labour and Conservative predictions.
Fig: 1.3.2 Accuracy, Confusion matrix and Classification Report of the test data
The model achieved an accuracy of approximately 82.2% on the test set. This
indicates the proportion of correctly classified instances out of the total. Predicting the
Conservative class needs attention.
A balanced evaluation of precision, recall, and F1-score for both classes becomes
crucial. In this scenario, where misclassifying instances from the Conservative class
may have significant implications, it is essential to consider the model's performance
on each class individually.
True Positive (TP): 263 (Correctly predicted as Labour), True Negative (TN): 112
(Correctly predicted as Conservative).
False Positive (FP): 41 (Incorrectly predicted as Labour), False Negative (FN): 40
(Incorrectly predicted as Conservative).
Precision (for Labour): 0.87. Of all instances predicted as Labour, 87% were Labour,
indicating a relatively low false positive rate.
Recall (for Labour): 0.87. Of all actual Labour instances, 87% were correctly predicted
as Labour, indicating a relatively low false negative rate.
The overall accuracy of the model on the testing set is 82.2%. This measures the
proportion of correctly classified instances among all instances.
The F1-score, which is the harmonic mean of precision and recall, is commonly used
when there is an imbalance between classes. The weighted average F1-score is around
0.87.
Inferences:
While the model demonstrates reasonably high accuracy on both the training and
testing sets, the persistent occurrence of false positives and false negatives suggests
that further model improvement is necessary. The model seems to generalize
reasonably well to the testing set, as indicated by the comparable accuracy scores.
However, addressing the imbalances in precision, recall, and F1-score, particularly for
class 0, should be a focus for further refinement.
The Naive Bayes algorithm, when calculating the likelihoods of numerical features,
assumes each feature to be normally distributed and computes the probability using the
mean and variance of that feature alone; it also assumes that all predictors are
independent of each other. The scale therefore does not matter, and performing feature
scaling for this algorithm may not have much effect.
Model Building (KNN, Naive bayes, Bagging, Boosting) - Metrics of Choice (Justify the
evaluation metrics):
KNN MODEL
Generally, good KNN performance requires preprocessing of the data so that all
variables are similarly scaled and centred. Now let's apply a z-score transformation to
the continuous columns and see how KNN performs.
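A sketch of this scaling step; the continuous column names are assumptions based on the variables discussed in this report:

from scipy.stats import zscore

# Standardize the continuous columns so all features contribute on a similar scale
num_cols = ["age", "economic.cond.national", "economic.cond.household",
            "Blair", "Hague", "Europe", "political.knowledge"]  # assumed names
X_scaled = X.copy()
X_scaled[num_cols] = X[num_cols].apply(zscore)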
In summary, both the training and testing sets have a consistent structure with 8
features per sample, suggesting that the datasets are well-prepared for training and
evaluating a machine learning model. The sizes of the sets (1061 samples in training
and 456 samples in testing) are reasonable for model development and validation, and
the dimensions align appropriately for compatibility with the chosen machine learning
algorithm.
Fig: 1.3.5 Accuracy, Confusion matrix and Classification Report of the train data
The provided output from the K-Nearest Neighbours (KNN) model on the training set
indicates a high level of accuracy and overall good performance.
The accuracy score of approximately 87.08% suggests that the model correctly
classified a substantial majority of instances in the training dataset.
The confusion matrix reveals a balanced distribution of true positives and true
negatives, with 240 instances correctly predicted as class 0 (Conservative) and 684
instances correctly predicted as class 1 (Labour).
However, there are notable false positives (81 instances predicted as Labour while
being Conservative) and false negatives (56 instances predicted as Conservative while
being Labour), indicating some misclassifications.
The precision, recall, and F1-score metrics further characterize the model's
performance, with higher values for class 1, indicating that the model is more effective
at identifying Labour voters.
The weighted average values of precision, recall, and F1-score, all around 87%, affirm
the model's balanced performance across both classes.
Overall, the KNN model demonstrates robust predictive capabilities on the training set,
and further evaluation on the testing set would be essential to assess its generalization
to unseen data.
Fig: 1.3.6 Accuracy, Confusion matrix and Classification Report of the test data
The output from the K-Nearest neighbours (KNN) model on the testing set indicates a
solid level of performance, although there are some areas for consideration.
The accuracy score of approximately 82.23% suggests that the model correctly
classified a significant portion of instances in the unseen dataset.
The confusion matrix reveals a distribution of true positives and true negatives, with
88 instances correctly predicted as class 0 (Conservative) and 287 instances correctly
predicted as class 1 (Labour).
However, there are noticeable false positives (51 instances predicted as Labour while
being Conservative) and false negatives (30 instances predicted as Conservative while
being Labour), indicating some misclassifications on the testing set.
The precision, recall, and F1-score metrics provide a more nuanced understanding of
the model's performance, with a higher precision, recall, and F1-score for class 1,
indicating that the model is more proficient at identifying Labour voters.
The weighted average values of precision, recall, and F1-score, all around 82%, affirm
the model's overall effectiveness on the testing set.
While the model maintains a solid level of performance, further fine-tuning or
alternative approaches may be explored to address misclassifications and enhance
predictive accuracy.
Fig: 1.3.7 Difference between training and testing accuracy of the model
Misclassification error is plotted against k (with the k value on the X-axis) using matplotlib.
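A sketch of this plot, assuming a KNN classifier is refit on the scaled data for a range of k values:

import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

# Misclassification error = 1 - accuracy, computed on the test set for each k
k_values = range(1, 21)
errors = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    errors.append(1 - knn.score(X_test, y_test))

plt.plot(k_values, errors, marker="o")
plt.xlabel("k")
plt.ylabel("Misclassification error")
plt.title("Misclassification error vs k")
plt.show()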
When comparing the differences between test and train scores for different values of
k, the goal is to select a model with a smaller difference, indicating better
generalization and reduced risk of overfitting. Let's analyse the provided differences:
Among these options, the model with K=12 has the smallest difference between test
and train scores, suggesting better generalization. A smaller difference implies that the
model performs consistently on both the training and testing sets, reducing the
likelihood of overfitting.
Therefore, based on the provided differences, the model with K=12 appears to be the
better choice for this particular scenario. However, it's essential to consider other
performance metrics (such as overall accuracy, precision, recall, and F1-score) and
possibly perform cross-validation to ensure the robustness and generalizability of the
chosen model.
The accuracy scores for different values of k in the K-Nearest Neighbours (KNN) model
provide insight into how well the model performs on this dataset. Let's analyse the
results:
Accuracy Score for K=7: 0.8135
Accuracy Score for K=11: 0.8245
Accuracy Score for K=12: 0.8311
Optimal K: The increasing trend in accuracy scores suggests that higher values of k are
contributing to better performance. Among the mentioned values, K=12 has the
highest accuracy.
Model Complexity: The choice of k influences the complexity of the KNN model.
Smaller k values result in more complex models, while larger k values lead to smoother
decision boundaries and potentially better generalization.
Consistency: The consistency in the accuracy scores across K=11 and K=12 indicates
that the model's performance is stable. The small increase in accuracy from K=11 to
K=12 suggests that the model may not be overly sensitive to changes in k beyond a
certain point.
Generalization: The increasing accuracy scores imply that the model is generalizing
well to the testing data. However, as with any model evaluation, it's crucial to consider
potential overfitting and validate the model's performance on a separate validation set
or through cross-validation.
Fine-Tuning: Further exploration of different k values, especially in the vicinity of the
optimal value (around K=12), may provide additional insights into the model's
performance. Fine-tuning can help strike the right balance between bias and variance.
In summary, the results indicate that the model performs well with higher values of k,
and K=12 provides the highest accuracy among the values considered. The optimal
choice of k depends on the specific characteristics of the dataset, and additional
experimentation or validation techniques may be valuable.
Problem 1.4 – Model Performance Improvement:
Improve the model performance of bagging and boosting models by tuning the model -
Comment on the model performance improvement on training and test data:
Ada Boost
Fig: 1.4.1 Accuracy, Confusion matrix and Classification Report of the train data
Fig: 1.4.2 Accuracy, Confusion matrix and Classification Report of the test data
Gradient Boosting
Fig: 1.4.3 Accuracy, Confusion matrix and Classification Report of the train data
Confusion Matrix: The confusion matrix shows that there were 254 true negatives, 696
true positives, 67 false positives, and 44 false negatives in the training set.
Precision, Recall, and F1-score: Precision for class 0 is 0.85, and for class 1 is 0.91. Recall
for class 0 is 0.79, and for class 1 is 0.94. F1-score is also reported.
Fig: 1.4.4 Accuracy, Confusion matrix and Classification Report of the test data
BAGGING
The performance metrics on the train and test data sets indicate that the Bagging
Classifier is performing exceptionally well on the training data, achieving near-perfect
accuracy.
However, there is a noticeable drop in performance on the test data, with an accuracy
of 81.91%.
Fig: 1.4.5 Accuracy, Confusion matrix and Classification Report of the train data
Accuracy: 99.92%: The model achieves extremely high accuracy on the training data,
which might suggest overfitting.
Confusion Matrix: The confusion matrix on the training set shows that the model
predicts nearly every instance of both classes correctly, with only a single
misclassification.
Fig: 1.4.6 Accuracy, Confusion matrix and Classification Report of the test data
Accuracy: 81.91%: The model's accuracy drops on the test data, indicating that the
model might not generalize as well to unseen data.
Confusion Matrix: The confusion matrix on the test set reveals that the model makes
some errors in predicting both classes. There are false positives and false negatives,
suggesting some misclassifications.
Precision, Recall, F1-Score: The precision, recall, and F1-score for both classes (0 and
1) are provided in the classification report.
Precision measures the accuracy of the positive predictions, recall measures the
coverage of actual positive instances, and the F1-score is the harmonic mean of
precision and recall. These metrics provide a more detailed view of the model's
performance.
Inference: The model seems to have overfitted to the training data, achieving near-
perfect accuracy on it.
The drop in accuracy on the test set suggests that the model might not generalize well
to new, unseen data. Further analysis of precision, recall, and F1-score for each class
can help understand the model's performance on specific classes and identify areas
for improvement.
Consider regularization techniques or adjusting hyperparameters to improve
generalization performance. Additionally, further evaluation using cross-validation can
provide a more robust assessment of the model's capabilities.
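One way to act on these recommendations is a grid search with cross-validation; a minimal sketch for the boosting model (the parameter grid shown is an assumption, not the grid used in this report):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Tune the gradient boosting model with 5-fold cross-validation
param_grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
grid = GridSearchCV(GradientBoostingClassifier(random_state=1),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)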
SMOTE
An accuracy score of 0.7741 means that the K-Nearest Neighbours (KNN) classifier,
trained on SMOTE-resampled data with K=12, achieved an accuracy of approximately
77.41% on the test set.
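A sketch of the resampling step, assuming SMOTE comes from the third-party imbalanced-learn package:

from imblearn.over_sampling import SMOTE
from sklearn.neighbors import KNeighborsClassifier

# Oversample the minority class on the training data only
smote = SMOTE(random_state=1)
X_res, y_res = smote.fit_resample(X_train, y_train)

# Refit KNN with k=12 on the balanced training set; score on the untouched test set
knn_smote = KNeighborsClassifier(n_neighbors=12)
knn_smote.fit(X_res, y_res)
print(knn_smote.score(X_test, y_test))  # reported above as ~0.7741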
Here are some inferences and considerations based on this result: the model exhibits
decent performance, with an accuracy of over 77%. However, accuracy should be
interpreted with caution, especially on imbalanced datasets; it is crucial to consider
other evaluation metrics when dealing with imbalanced classes.
To determine which model is best, multiple factors should be considered, including
accuracy scores, the specific characteristics of the data, and the problem being solved.
Here are some observations based on the provided information.
Accuracy Scores:
KNN with K=12 and SMOTE: 0.7741
KNN with K=12 (without SMOTE): 0.8311
Naive Bayes (NB): 0.8223
Consideration of Accuracy Alone: The KNN model without SMOTE, specifically with
K=12, has the highest accuracy (0.8311) among the models mentioned.
Problem 1.5 - Final Model Selection
Compare all the models built so far - Select the final model with proper justification - Check
the most important features in the final model and draw inferences:
Recall on the Train Data Set
1. Gaussian Naive Bayes Model: Recall for class 1 is 0.90 and the model accuracy is 83.50.
0.8350612629594723
[[211 96]
[ 79 675]]
2. KNN Model: Recall for class 1 is 0.92 and the model accuracy is 87.08.
0.8708765315739868
[[240 81]
[ 56 684]]
3. KNN Model (For k=12): Recall for class 1 is 0.91 and the model accuracy is 85.95.
0.8595664467483506
[[242 79]
[ 70 670]]
4. Ada Boost: Recall for class 1 is 0.91 and the model accuracy is 84.26.
0.8426013195098964
[[219 102]
[ 65 675]]
5. Gradient Boosting: Recall for class 1 is 0.94 and the model accuracy is 89.53.
0.8953817153628653
[[254 67]
[ 44 696]]
6. BAGGING: Recall for class 1 is 1.00 and the model accuracy is 99.91.
0.9991755976916735
[[377 0]
[ 1 835]]
The worst-performing models are the Gaussian Naive Bayes model and Ada Boost.
The best-performing models are Gradient Boosting, the KNN model, and Bagging.
Gradient Boosting appears to be the most effective model among those considered,
as it provides the highest accuracy on the test set. Gradient Boosting often performs
well due to its ensemble nature, which combines multiple weak learners to create a
strong learner.
However, are these best-performing models overfitted?
Let's look at the performance on the test data set.
Recall on the Test Data Set
Recall refers to the percentage of total relevant results correctly classified by the
algorithm; hence we will compare the recall of class "1" for all models.
1. Gaussian Naive Bayes Model: Recall for class 1 is 0.87 and the model accuracy is 82.23.
0.8223684210526315
[[112 41]
[ 40 263]]
2. KNN Model: Recall for class 1 is 0.91 and the model accuracy is 82.23.
0.8223684210526315
[[ 88 51]
[ 30 287]]
3. KNN Model (For k=12): Recall for class 1 is 0.89 and the model accuracy is 83.11.
0.831140350877193
[[ 97 42]
[ 35 282]]
4. Ada Boost: Recall for class 1 is 0.87 and the model accuracy is 81.57.
0.8157894736842105
[[ 95 44]
[ 40 277]]
5. Gradient Boosting: Recall for class 1 is 0.90 and the model accuracy is 82.89.
0.8289473684210527
[[ 93 46]
[ 32 285]]
6. BAGGING: Recall for class 1 is 0.90 and the model accuracy is 81.90.
0.819078947368421
[[ 51 32]
[ 23 198]]
Inferences:
Models that did not perform well on the train data set have also not performed well on
the test data set. However, Bagging, which had a near-100% score on the train data set,
shows a noticeably poorer result on the test data set: a clear case of overfitting.
Gradient Boosting and the KNN model also show a train-test gap, though a much smaller one.
While the KNN model also performs well, Gradient Boosting surpasses it in accuracy on
the train set and remains comparable on the test set.
Problem 1.6 Actionable Insights & Recommendations:
Compare all four models - Conclude with the key takeaways for the business
Conclusion:
Models that did not perform well on the train data set have also not performed well
on the test data set. Bagging, which had a near-100% score on the train data set,
shows a poorer result on the test data set: a clear case of overfitting.
Gradient Boosting achieves the highest accuracy on the train set among the
non-overfitted models, while the KNN model (k=12) achieves the highest accuracy on
the test set.
An exit poll aims to predict election outcomes, including determining the party likely
to secure the majority of seats.
Model performance should be evaluated not only based on accuracy but also
considering precision, recall, and other relevant metrics.
Consider the potential impact of class imbalance, as predicting the winning party
might be challenging if there's a significant imbalance.
PROBLEM 2
In this project, we work on the inaugural corpus from NLTK in Python. We will be looking at
the following speeches of the Presidents of the United States of America:
"
import nltk
nltk.download('inaugural')
inaugural.fileids()
inaugural.raw('1941-Roosevelt.txt')
inaugural.raw('1961-Kennedy.txt')
inaugural.raw('1973-Nixon.txt')
"
Data Overview
Business Context:
The goal is to perform an exploratory data analysis of the given speeches, delivered at
United States presidential inaugurations in different years. We need to analyse this text
content through text cleaning and by plotting word clouds to find the most common
words used in all three speeches.
Objective:
The goal is to perform an exploratory data analysis of the given speeches, delivered at
United States presidential inaugurations in different years. We need to analyse this text
content through text cleaning and by plotting word clouds to find the most common
words used in all three speeches.
1. Define the problem and perform Exploratory Data Analysis - Problem definition - Find the number
of characters, words & sentences in all three speeches.
2. Text cleaning - Stopword removal - Stemming - Find the 3 most common words used in all
three speeches.
3. Plot Word cloud of all three speeches - Show the most common words used in all three
speeches in the form of word clouds
Data Dictionary:
1. inaugural.raw('1941-Roosevelt.txt')
2. inaugural.raw('1961-Kennedy.txt')
3. inaugural.raw('1973-Nixon.txt')
Problem 2.1 - Define the problem and perform Exploratory Data Analysis
Problem definition - Find the number of characters, words & sentences in
all three speeches.
The problem is to find and analyse the following metrics in each of the three
presidential speeches: the number of characters, words, and sentences. This analysis
will provide insights into the length and structure of the speeches delivered by
Presidents Franklin D. Roosevelt in 1941, John F. Kennedy in 1961, and Richard Nixon
in 1973.
Performing EDA involves examining and summarizing key characteristics of the data.
In this case, we want to explore the length of the speeches in terms of characters,
words, and sentences.
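A minimal sketch of these counts with NLTK tokenizers (counts may differ slightly depending on the tokenizer used):

import nltk
nltk.download('punkt')
from nltk.corpus import inaugural
from nltk.tokenize import word_tokenize, sent_tokenize

for fileid in ['1941-Roosevelt.txt', '1961-Kennedy.txt', '1973-Nixon.txt']:
    text = inaugural.raw(fileid)
    print(fileid,
          "characters:", len(text),
          "words:", len(word_tokenize(text)),
          "sentences:", len(sent_tokenize(text)))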
Fig:2.1.1 Loaded text speeches into the dataset format
President Richard Nixon's 1973 speech has 10,106 characters.
President Franklin D. Roosevelt's 1941 speech has 69 sentences.
President John F. Kennedy's 1961 speech has 56 sentences.
President Richard Nixon's 1973 speech has 70 sentences.
The analysis of the three presidential speeches reveals interesting insights into their
lengths and structures. President Franklin D. Roosevelt's speech from 1941 is
characterized by 7,651 characters, 1,323 words, and 69 sentences.
President John F. Kennedy's speech from 1961 consists of 7,673 characters, 1,364
words, and 56 sentences.
Lastly, President Richard Nixon's speech from 1973 is the longest, containing 10,106
characters, 1,769 words, and 70 sentences.
From this information, it can be inferred that while Roosevelt's speech is slightly
shorter than Kennedy's in terms of characters and words, it has more sentences,
suggesting shorter, more direct sentences; Kennedy's fewer sentences over a similar
length imply longer, more elaborate ones. Nixon's speech stands out as the most
extensive in terms of characters and words, indicating a more detailed and
comprehensive address.
The varying lengths and sentence structures of these speeches reflect the unique
communication styles and emphases of each president during their respective
inaugural moments.
Fig:2.1.6 Count of Stop words
Count of Numbers:
Count of Uppercase Letters:
Problem 2.2 – Text Cleaning
Stopword removal - Stemming - Find the 3 most common words used in all three speeches:
Removal of Punctuation:
Fig: 2.2.2 Removal of Punctuation
Removal of Stopword
Stemming
Fig:2.2.5 Three most common words used in all the three speeches
The cleaned and processed words from each speech (roosevelt_cleaned,
kennedy_cleaned, and nixon_cleaned) are combined into a single list
(all_cleaned_words).
The text from the three presidential speeches was cleaned and processed by removing
stop words and applying stemming; the three most common words across all three
speeches were then calculated.
The text-cleaning function tokenizes the text, removes stop words, and applies
stemming using the Porter stemmer. Stop words, common words that do not
contribute much to the meaning of the text, are excluded from the analysis.
The top 3 most commonly used words in all the three speeches were “us”, “new” and
“let”.
Fig:2.2.6 Top 3 common words used in all the three speeches after removing extended stop
word
“Us” refers to the United States, since these speeches were delivered in the USA;
“new” and “let” were then added as extended stop words in order to surface the
actual most common content words.
From the above graph it is clear that “us”, “nation” and “america” are the most
commonly used words, as found by the text-analysis method.
The FreqDist class from NLTK is used to calculate the frequency distribution of words
in the combined list. The most_common(3) method is applied to retrieve the three
most common words along with their frequencies.
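A minimal sketch of the cleaning and counting pipeline described above (the extended stop words 'new' and 'let' follow the report; the function name is illustrative):

import nltk
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import inaugural, stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.probability import FreqDist

stemmer = PorterStemmer()
base_stops = set(stopwords.words('english'))

def clean_text(text, extra_stops=()):
    # Tokenize, lowercase, drop punctuation and stop words, then stem
    stops = base_stops | set(extra_stops)
    tokens = word_tokenize(text.lower())
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stops]

all_cleaned_words = []
for fileid in ['1941-Roosevelt.txt', '1961-Kennedy.txt', '1973-Nixon.txt']:
    all_cleaned_words += clean_text(inaugural.raw(fileid), extra_stops=('new', 'let'))

print(FreqDist(all_cleaned_words).most_common(3))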
Inferences:
The three most common words obtained from the code are [('us', 45), ('new', 26),
('let', 25)].
The three most common words obtained after adding “new” and “let” as extended
stop words are [('us', 45), ('nation', 37), ('america', 29)].
The result suggests that the most frequently occurring words across all three speeches
are 'us,' 'nation,' and 'america.'
The word 'us' may indicate a focus on collective identity or inclusiveness in the
speeches.
The frequent occurrence of 'nation' reflects an emphasis on the country as a whole
and its role in the speeches.
The word 'let' may be indicative of a call to action or encouragement in the context of
the speeches.
The word 'new' may indicate promises of change made by the presidents in their
speeches.
The word 'america' appears frequently since the speeches refer to and represent the USA.
Problem 2.3 - Plot Word cloud of all three speeches
Show the most common words used in all three speeches in the form of word clouds:
Inferences:
The analysis of the three presidential speeches (Roosevelt 1941, Kennedy 1961, Nixon
1973) revealed that the most frequently used words across all speeches are "us,"
"nation," and "america."
This suggests a common theme of unity, patriotism, and a call to collective action.
The repetition of these words indicates a shared emphasis on the nation's strength,
responsibility, and the importance of working together towards common goals.
The presidents seem to highlight the idea of a united and empowered nation capable
of facing challenges and embracing opportunities.
Understanding these recurring themes in the speeches can provide insights into the
leaders' priorities and the overall tone of their addresses.
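A sketch of the word-cloud step, assuming the third-party wordcloud package and the cleaned token list from the sketch above:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Build a word cloud from the cleaned, space-joined tokens of all three speeches
wc = WordCloud(background_color='white', max_words=100)
wc.generate(' '.join(all_cleaned_words))

plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()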
Fig: 2.3.1 Word Cloud for inaugural speech (after cleaning)