Statistical Machine Learning
Daniel Fransén
Andreas Hertzberg
Alhassan Jawad
Erik Magnusson
Abstract
The project aimed to answer whether Machine Learning (ML) methods can be used to
solve the classification problem of predicting the gender of a movie’s lead character.
Out of the five ML methods used: Logistic regression, Discriminant Analysis, k-Nearest
Neighbor, Random Forest, and AdaBoost, the project demonstrated that Quadratic
Discriminant Analysis was the optimal method for this classification problem.
1 Introduction
In this study, a variety of machine learning techniques were employed to determine the gender of the
lead actor in a movie. Using information from 1037 movies containing 13 inputs & 1 output, multiple
supervised classification models were trained individually, and their performance was compared to
determine the most effective model for the particular purpose of determining whether the lead actor is
male or female. The evaluation process involved splitting the data into training and validation sets through
k-fold cross-validation. The models tested were
1. Logistic regression
2. Linear Discriminant Analysis (LDA)
3. Quadratic Discriminant Analysis (QDA)
4. k-Nearest Neighbor
5. Tree-based methods: Random Forest
6. Boosting: AdaBoost
The optimal model will be proposed for implementation and will be further evaluated using a separate
test set of 387 films.
2 Pre-data Analysis
The following questions were answered by pre-analysing the given dataset & mainly printing out
different columns from it. The code can be seen in the appendix (section: Pre-data analysis).
Do men or women dominate speaking roles in Hollywood movies? To answer this question we can
consider two points of discussion:
1. Figure 1, which shows how many words were spoken by males vs females throughout the years.
2. The percentage of movies in which men spoke more than women, and vice versa.
Analysing Figure 1 shows that, in general, men speak more than women, at some points at roughly
double the rate. There exist some exceptions, but the conclusion is still that men speak more than
women. The reason Figure 1 is scaled the way it is, is that it shows the total number of words
(lead + non-lead) per gender per year.
Printing the percentage of movies where men spoke more than women, compared with movies where
women spoke more, shows that men spoke more in roughly 76.6% of the movies while women spoke
more in the remaining 23.4%.
So an answer to the question "Do men or women dominate speaking roles in Hollywood movies?"
is that men still dominate the speaking roles, speaking more in well over half of the movies.
Concerning the question about gender balance, it can be understood in two different ways:
1. Has the gender balance for a specific role changed over time (i.e. over the years)? (Figure 2a)
2. Has the gender balance (male/female) for speaking roles (mainly lead) changed over time? (Figure 2b)
Figure 2a answers the question regarding gender balance for a specific role by showing that the
percentage of female actors saw a slow increase around the 1970s but then stagnated at around 30%.
The percentage of female leads, on the other hand, has varied somewhat but is still under 50% of the
total leading roles.
Figure 2b shows two subplots: the upper plot compares the percentage of male and female leads,
while the lower plot compares the percentage of male and female actors in the industry. Both
subplots show two curves that appear to be reflections of each other, which brings us to the answer
that the gender balance has changed over time.
Figure 2: (a) Female leads vs female actors (percentage); (b) percentage of male/female leads and male/female actors.
2.3 Do films in which men do more speaking make a lot more money than films in which
women speak more?
Which movies make more money, the movies where men speak more or those in which women speak
more?
The solution can simply be obtained by computing the average gross for movies in which men
spoke more frequently than women and vice-versa. The calculations revealed that movies with a
predominant male dialogue had an average revenue of $118.6 million, while those with a female-led
dialogue had an average gross income of $86.6 million. Additionally, movies with a male lead actor
generated an average income of $115.2 million, compared to $98.7 million for movies with a female
lead. This indicates that movies centered around males tend to generate a higher gross than those
centered around females.
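As a minimal sketch of this calculation (using the dataset's column names as elsewhere in the report, and assuming the training data is loaded in the DataFrame train):

# Average gross depending on which gender speaks more (sketch)
men_speak_more = train[train['Number words male'] > train['Number words female']]
women_speak_more = train[train['Number words female'] > train['Number words male']]
print(f"Average gross, male-dominated dialogue: {men_speak_more['Gross'].mean():.1f}")
print(f"Average gross, female-dominated dialogue: {women_speak_more['Gross'].mean():.1f}")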
3 Implementation of Methods
Of the 13 known inputs, 12 (all except Total words) were used as a quantitative input subset.
The input Total words was excluded because it is collinear with other inputs, specifically
Number words female & Number words male. Furthermore, the models (all except kNN) were
tuned using grid search or random search as a methodological approach, and performance/evaluation
is determined using k-fold cross-validation. We chose accuracy as the performance metric
because it is intuitive to interpret. Mathematically, accuracy is the fraction of correctly classified
samples and is calculated using the cross_val_score function. The hyper parameter tuning is briefly
described below; for more detail, see the appendix (section: Choice of hyper parameters). Additionally,
the hyper parameter optimization results are summarized in Table 1.
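As a minimal sketch of how this metric is obtained (mirroring the appendix code; X and y are assumed to hold the 12 inputs and the Lead labels, and any of the tuned models can stand in for the estimator):

from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression  # placeholder estimator for illustration

cv = KFold(n_splits=14, shuffle=True, random_state=1)
scores = cross_val_score(LogisticRegression(max_iter=10000), X, y, scoring='accuracy', cv=cv)
print(f"Mean accuracy: {scores.mean() * 100:.1f}%")  # fraction of correctly classified samples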
Female=1. This limit can be adjusted according to the needs of individual situations and calculations.
In our case, this threshold is set to 0.5, meaning that if the probability of the class Male is 0.5 or
higher, the logistic regression method will classify that particular data point as Male. Finding a
good threshold to optimize this method is not always an easy task; finding the optimum value
usually means tuning it manually, and it could also require cross-referencing against other projects.
0.5 is the standard threshold [2], and given the time frame and the complexity of this project it
seems like a decent threshold to use.
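As a small illustration of how the threshold is applied (a sketch with hypothetical variable names; the column order of predict_proba follows the fitted model's classes_ attribute):

import numpy as np

# Classify as Male when P(Male) >= 0.5, otherwise Female (threshold = 0.5 as discussed above)
probabilities = LogReg_model.predict_proba(X_test)
p_male = probabilities[:, list(LogReg_model.classes_).index('Male')]
prediction = np.where(p_male >= 0.5, 'Male', 'Female')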
Linear Discriminant Analysis (LDA) is a Machine Learning model that is commonly used
for classification problems, in particular when there are more than two classes. For
our classification problem, the categorical variable Lead, which can be either Male or Female, is
transformed into the binary numbers 1 & 0 so that we have numerical variables to which LDA can be
applied. The method works by constructing an axis that 1) maximizes the distance between the two
classes' mean values while 2) keeping the two classes' internal variance minimized. The data points of
each class are then projected onto the shared axis, which creates the linear decision boundary that
helps in predicting, in our case, whether the Lead gender is Male=1 or Female=0. Worth knowing is that in
LDA, we assume that both classes share a covariance matrix, which is what results in a linear
decision boundary.
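For reference, the shared covariance assumption leads to the standard linear discriminant function (see e.g. [4]): a point $x$ is assigned to the class $k$ that maximizes
$$\delta_k(x) = x^{\mathsf{T}}\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^{\mathsf{T}}\Sigma^{-1}\mu_k + \ln \pi_k,$$
where $\mu_k$ is the class mean, $\Sigma$ the shared covariance matrix and $\pi_k$ the class prior. Since $\delta_k(x)$ is linear in $x$, the boundary between the two classes is linear.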
Quadratic Discriminant Analysis (QDA) is a Machine Learning model similar to LDA, but with
the key difference that each class is assumed to have its own covariance matrix, which results in a
quadratic decision boundary. Another thing of note is that QDA is a model made for classification
problems, meaning that there is no need to transform the categorical output variable into a numerical
one. There are some other differences in how QDA & LDA behave for different data types
(which model is more prone to overfit, etc.), and it is up to the programmer to choose how and which
model to implement.
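Correspondingly, with a class-specific covariance matrix $\Sigma_k$ the standard discriminant function (again see e.g. [4]) picks up a quadratic term in $x$,
$$\delta_k(x) = -\tfrac{1}{2}\ln|\Sigma_k| - \tfrac{1}{2}(x-\mu_k)^{\mathsf{T}}\Sigma_k^{-1}(x-\mu_k) + \ln \pi_k,$$
which is why the resulting decision boundary is quadratic.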
Both LDA & QDA were implemented using the sklearn.discriminant_analysis module and then fitted
using the package's GridSearchCV. As a pre-processing step, we normalized the dataset for LDA,
mainly to improve the performance metrics; the StandardScaler() function was used to scale the
inputs, while the outputs were transformed into binary numbers. We also tried to normalize the data
for QDA, but it did not improve the performance metrics in any noticeable way.
The more values that are passed into the param_grid variable, the more computationally demanding
and complex the grid search becomes, and this does not necessarily mean that better parameter options
will be found. After grid searching through 2016 & 4480 candidate fits for LDA & QDA, respectively,
the optimal tuning for each model was found, giving accuracies of 85.8% & 89.7%.
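These fit counts follow directly from the sizes of the parameter grids listed in the appendix: for LDA, 3 solver values × 4 tol values × 12 shrinkage values give $3 \cdot 4 \cdot 12 = 144$ combinations, and with 14-fold cross-validation this means $144 \cdot 14 = 2016$ fits; for QDA, $10 \cdot 4 \cdot 4 \cdot 2 = 320$ combinations of reg_param, tol, priors and store_covariance give $320 \cdot 14 = 4480$ fits.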
In this project, kNN works by setting a value for the number of data points used to make the
classification, often labeled k. As a simplification, if k is set to 5, the 5 nearest data points are
compared and the majority label "wins" and becomes the predicted value for x∗. Optimizing kNN
therefore means finding the k value that results in the minimum validation error.
According to the documentation [7], there are 8 hyper parameters in the
sklearn.neighbors.KNeighborsClassifier function. However, because n_neighbors was the
only parameter covered in this course, we deemed it better to exclude the others during model tuning.
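A minimal sketch of this k-selection, paralleling the function perform_kNN_cross_validation_to_get_k in the appendix (X and y are assumed to hold the inputs and labels):

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

cv = KFold(n_splits=14, shuffle=True, random_state=2)
K = range(1, 199)
# Validation error (1 - accuracy) for every candidate k
errors = [1 - cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y,
                              scoring='accuracy', cv=cv).mean() for k in K]
best_k = K[int(np.argmin(errors))]  # k with the minimum validation error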
For a large ensemble, the variance of the ensemble average is approximately ρσ². This is smaller than
the variance of the individual decision trees (σ²), and there is a similar result for classification.
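This refers to the standard expression (see e.g. [4]) for the variance of an average of $N$ identically distributed trees, each with variance $\sigma^2$ and pairwise correlation $\rho$:
$$\operatorname{Var}\!\left(\frac{1}{N}\sum_{n=1}^{N} z_n\right) = \rho\sigma^2 + \frac{1-\rho}{N}\,\sigma^2 \;\approx\; \rho\sigma^2 \quad \text{for large } N.$$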
The most obvious way of reducing the ensemble variance is to use a large ensemble (many
trees, large N), but this also implies a large computational cost. Apart from the ensemble size, the
remaining variables ρ and σ are both affected by the bagging of samples and features. For both
sample and feature bagging, there is an optimum in the number of samples/features drawn during the
bootstrapping, since training with a large number of samples/features makes the variance σ² of the
individual trees small, but increases their correlation (ρ).
for data $(x_i, y_i)$. After normalization of $w^{(b)}$, the remaining step in each iteration is to calculate the
optimal coefficient $\alpha^{(b)}$. It can be shown that the error of the boosted classifier formed so far by the
weak models can be minimized by setting $\alpha^{(b)} = \frac{1}{2}\ln\frac{1 - E_{\mathrm{train}}^{(b)}}{E_{\mathrm{train}}^{(b)}}$.
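As a rough sketch of one such boosting iteration under these definitions (the labels are assumed to be encoded as -1/+1 in y_pm, and a decision tree is used as the weak model; the appendix instead relies on sklearn's AdaBoostClassifier):

import numpy as np
from sklearn import tree

# One AdaBoost iteration (sketch); X holds the inputs, y_pm the -1/+1 labels (assumed)
w = np.ones(len(X)) / len(X)                   # initial, normalized sample weights
weak_model = tree.DecisionTreeClassifier(max_depth=5, min_samples_split=200)
weak_model.fit(X, y_pm, sample_weight=w)
pred = weak_model.predict(X)

E_train = np.sum(w[pred != y_pm])              # weighted training error of the new weak model
alpha = 0.5 * np.log((1 - E_train) / E_train)  # the coefficient alpha from the expression above

w = w * np.exp(-alpha * y_pm * pred)           # increase the weight of misclassified samples
w = w / np.sum(w)                              # re-normalize before the next iteration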
RandomizedSearchCV was not implemented for Random Forest or AdaBoost because it would have
decreased the number of searched fits/combinations to 196, which lowered the optimal model's
accuracy, even though RandomizedSearchCV would have decreased the computational time by more
than 97.5%.
Table 1: Some of the most important hyper parameters for each model, along with their respective
optimized values. The hyper parameters were optimized through grid or random search. The names
are from scikit-learn, and are largely self-explanatory, but see refs. [3], [5]–[9] for more details.
Method              | Parameter 1         | Parameter 2           | Parameter 3             | Parameter 4      | Parameter 5
LDA                 | shrinkage = None    |                       |                         |                  |
QDA                 | reg_param = 0.2     | priors = [0.25, 0.75] |                         |                  |
Logistic regression | penalty = 'l2'      | C = 1.0               | fit_intercept = False   |                  |
Random forest       | n_estimators = 1000 | bootstrap = False     | min_samples_split = 5   | max_depth = None | min_samples_leaf = 5
AdaBoost tree       | n_estimators = 1000 | learning_rate = 1.0   | min_samples_split = 200 | max_depth = 5    |
kNN                 | n_neighbors = 5     |                       |                         |                  |
3.6 Model Selection
For a good comparison with the implemented methods (sections 3.1-3.5), a naive classifier that
always predicts the gender of Lead as Male was created. This classifier's accuracy reached 75.6%,
and methods with accuracy higher than the naive classifier's will be deemed useful for the given
classification problem.
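The same baseline figure can be obtained directly as the majority-class share of the training labels, for example:

# Accuracy of always predicting 'Male' equals the fraction of male leads in the training data
naive_accuracy = (train['Lead'] == 'Male').mean() * 100
print(f"Naive classifier accuracy: {naive_accuracy:.1f}%")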
We compared the implemented methods using k-fold cross-validation with k=14, and the results are
shown in the box plot in Figure 3 below.
Looking at Figure 3, kNN can be eliminated as the chosen method because it has the worst
validation error of the examined model families. Random Forest, Logistic Regression & LDA
show a smaller validation error than kNN, but they too are eliminated because both AdaBoost &
QDA perform better. This leaves one of two choices, and we choose to put QDA "in production"
because it performs better than AdaBoost by a good margin.
4 Conclusion
The chosen method/model to be put into production for this classification problem is Quadratic
Discriminant Analysis (QDA), with the hyper parameters shown in the appendix (section: Choice of
hyper parameters). The script "Comparing the methods" shows that the implemented methods'
accuracies are all around 80%, with the QDA model performing better than the others with an
impressive accuracy of 89.7%.
Overall, this study provides valuable insights into the usage of machine learning methods and shows
that even tasks such as gender classification based on movie statistics can be solved using machine learning.
References
[1] I. B. Machines. "What is logistic regression?" IBM. (2023), [Online]. Available: https://www.ibm.com/topics/logistic-regression (visited on 02/25/2023).
[2] G. Harrison. "Calculating and setting thresholds to optimise logistic regression performance," Towards Data Science. (2023), [Online]. Available: https://towardsdatascience.com/calculating-and-setting-thresholds-to-optimise-logistic-regression-performance-c77e6d112d7e (visited on 02/22/2023).
[3] F. Pedregosa, G. Varoquaux, A. Gramfort, et al. "sklearn.linear_model.LogisticRegression," Scikit-learn. (2023), [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html (visited on 02/22/2023).
[4] A. Lindholm, N. Wahlström, F. Lindsten, and T. B. Schön, Machine Learning: A First Course for Engineers and Scientists, New Edition. Cambridge: Cambridge University Press, 2022, ISBN: 9781108843607.
[5] F. Pedregosa, G. Varoquaux, A. Gramfort, et al. "sklearn.discriminant_analysis.LinearDiscriminantAnalysis," Scikit-learn. (2023), [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html (visited on 02/22/2023).
[6] F. Pedregosa, G. Varoquaux, A. Gramfort, et al. "sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis," Scikit-learn. (2023), [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis.html (visited on 02/22/2023).
[7] F. Pedregosa, G. Varoquaux, A. Gramfort, et al. "sklearn.neighbors.KNeighborsClassifier," Scikit-learn. (2023), [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html (visited on 02/22/2023).
[8] F. Pedregosa, G. Varoquaux, A. Gramfort, et al. "sklearn.ensemble.RandomForestClassifier," Scikit-learn. (2023), [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html (visited on 02/22/2023).
[9] F. Pedregosa, G. Varoquaux, A. Gramfort, et al. "sklearn.ensemble.AdaBoostClassifier," Scikit-learn. (2023), [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html (visited on 02/22/2023).
Appendix
Choice of hyper parameters
The parameters dual, intercept_scaling, verbose, warm_start, l1_ratio & class_weight are not
something that has been covered in the course SML. Thus, they will not be used when tuning logistic
regression.
n_jobs controls how many CPU cores are used when working with Logistic Regression. According to
the documentation [3], n_jobs is "used when parallelizing over classes if multi_class=’ovr’", which
means that n_jobs will be set ONLY if the logistic regression code takes a long time to run and/or
loops through many parameter fits.
Linear Discriminant Analysis has 7 hyper parameters. The parameter solver specifies which
solver method is used for the described problem and is thus worth looping over; in addition, this
parameter was discussed when LDA was introduced in the course.
Concerning the parameter shrinkage, it controls the strength of the model's regularization term
(regularization applied to the covariance matrix of the input data), where smaller/larger values
may dictate whether the model overfits or underfits. Thus this parameter is important to include when
looping through different fits.
The parameters priors, n_components, store_covariance and covariance_estimator are not something
that has been covered in the course and will not be taken into account.
tol will be set to different values and looped through because of its importance for finding the optimal
LDA model. Observe that tol is the tolerance used in the solver's stopping/thresholding criterion.
Quadratic Discriminant Analysis has 4 hyper parameters, and all of them will be included in the
grid search even though two of them were not covered in the course. The four hyper parameters are
priors, reg_param, store_covariance & tol.
kNN has 8 hyper parameters, and of them only the most well-known one (namely n_neighbors) will be
used, because this is the only parameter discussed in the course. The other 7 parameters are:
1. weights
2. algorithm
3. leaf_size
4. p
5. metric
6. metric_params
7. n_jobs
Observe that n_jobs decides how many CPU cores are used for the calculations, but it will not be needed
as no more than one hyper parameter is used for kNN.
Random Forest has 18 different optional and required parameters, and of them only 7 will be
used, due to the code taking a long time to run. The used parameters are:
1. n_estimators
2. criterion
3. max_depth
4. min_samples_split
5. min_samples_leaf
6. bootstrap
7. random_state
Note that n_jobs for the GridSearchCV function will be manually set to -1 so that all CPU
processors/cores are used.
AdaBoost has 5 parameters, and besides the base_estimator (which will be passed in as a pre-made
parameter), the parameters:
1. n_estimators
2. learning_rate
3. random_state
will be included, as they are parameters that were covered in the course and/or feel important for
model tuning.
Due to time constraints, and because including 7 & 3 hyper parameters for Random Forest
& AdaBoost gives between 56000 and 700000 fits/combinations to search through (up to 320
minutes for one run), it was decided that each parameter would be limited to a maximum of 3
& 10 values for Random Forest & AdaBoost, respectively. This is done to conserve computational
power and computational time per run.
Main Code
The code is attached as a zip file and should be opened as a Jupyter Notebook. Some basic
comments that are not shown in the appendix can be found in the actual code. Observe that after the
individual models were tuned, their accuracy was calculated in the "Compare the methods" script so
that all methods use the same datasets and no method gets an advantage or disadvantage.
Imports
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
# sklearn modules under the aliases used throughout the notebook
import sklearn.discriminant_analysis as skl_da, sklearn.ensemble as skl_en
import sklearn.linear_model as skl_lm, sklearn.model_selection as skl_ms
import sklearn.neighbors as skl_nb
Pre-data steps
# Read the training data and save it into variable "train"
train = pd.read_csv('train.csv')
Pre-data Analysis
# First Question: Do men or women dominate speaking roles in Hollywood movies?
# Create sorted list of years with at least one movie
Years = train['Year'].unique()  # .unique() returns the distinct values in an unsorted array
Years.sort()

# Determines & calculates the percentage of when females/males spoke more and then prints the results
more_female_words = train.loc[(train['Number words male'] < train['Number words female'])]
more_male_words = train.loc[(train['Number words female'] < train['Number words male'])]
print(f"The percentage of movies where females speak more than males: {(len(more_female_words) / len(train)) * 100:.1f}%")
print(f"The percentage of movies where males speak more than females: {(len(more_male_words) / len(train)) * 100:.1f}%")

total_words_female = []
total_words_male = []
for year in Years:
    current_movies = train[train['Year'] == year]

plt.xlabel('Year')
plt.ylabel('Total number of words\nper gender')
plt.legend()
plt.show()
# Second Question: Has gender balance in speaking roles changed over time (i.e. years)?
Years = train['Year'].unique()  # .unique() returns the distinct values in an unsorted array
Years.sort()

female_lead_percentage = []
female_actors_percentage = []
male_lead_percentage = []
male_actors_percentage = []

for i in Years:
    data = train.loc[(train['Year'] == i)]
    # Calculates how many female leads there have been each year (given in percentage)
    lead_precentage_female = (data[data['Lead'] == 'Female'].shape[0]) / len(data) * 100
    female_lead_percentage.append(lead_precentage_female)
    # Calculates how many female actors there have been each year (given in percentage)
    actors_percentage_female = data['Number of female actors'].sum() \
        / (data['Number of female actors'].sum() + data['Number of male actors'].sum()) * 100
    female_actors_percentage.append(actors_percentage_female)

# Plot the results
# Figure 2: Female leads vs Female actors (percentage)
plt.figure(2)
plt.plot(Years, female_lead_percentage, 'r.', label='Percentage of female leads')
plt.plot(Years, female_actors_percentage, 'b.', label='Percentage of female actors')
plt.xlabel('Year')
plt.ylabel('Percentage $\%$')
plt.legend()
plt.show()

# Female leads vs Male leads (percentage)
ax1.plot(Years, female_lead_percentage, 'r.', label='Percentage of female leads')
ax1.plot(Years, male_lead_percentage, 'g.', label='Percentage of male leads')
ax1.set_ylabel('Percentage $\%$')
ax1.set_title('Percentage of female vs male leads')
ax1.legend()

# Female actors vs Male actors
ax2.plot(Years, female_actors_percentage, 'r.', label='Percentage of female actors')
ax2.plot(Years, male_actors_percentage, 'g.', label='Percentage of male actors')
ax2.set_xlabel('Year')
ax2.set_ylabel('Percentage $\%$')
ax2.set_title('Percentage of female vs male actors')
ax2.legend()
plt.show()
# Third Question: Do films in which men do more speaking make a lot more money than films in which women speak more?
Implementation of methods
# Used all 13 inputs except 'Total words', which is collinear with 'Number words female' & 'Number words male' and was excluded because of warnings
# Linear Discriminant Analysis (LDA)
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

# Implementation:
np.random.seed(1)  # For reproducibility
train = pd.read_csv('train.csv')
normalizer = StandardScaler()

# Create X & y
X = train[['Number words female', 'Number of words lead',
           'Difference in words lead and co-lead',
           'Number of male actors', 'Year', 'Number of female actors',
           'Number words male', 'Gross',
           'Mean Age Male', 'Mean Age Female', 'Age Lead', 'Age Co-Lead']]
X = normalizer.fit_transform(X)  # To normalize the data
y = train['Lead']

# Model structuring
LDA_model = skl_da.LinearDiscriminantAnalysis()

# Model tuning & fitting
cv = skl_ms.KFold(n_splits=14, shuffle=True, random_state=None)
param_grid = {'solver': ['svd', 'lsqr', 'eigen'],
              'tol': [0.001, 0.0001, 0.00001, 0.000001],
              'shrinkage': [None, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]}
grid_search = GridSearchCV(estimator=LDA_model, param_grid=param_grid, scoring='accuracy', verbose=1, cv=cv)
grid_search.fit(X, y)
best_params = grid_search.best_params_

# Note:
# The warning that we get is because:
# 1) Shrinkage is not supported with the solver='svd'
# We tested different combinations and this warning doesn't affect finding the optimal model!
# Quadratic Discriminant Analysis (QDA)
from sklearn.model_selection import GridSearchCV

# Implementation:
np.random.seed(1)  # For reproducibility
train = pd.read_csv('train.csv')

# Create X & y
X = train[['Number words female', 'Number of words lead',
           'Difference in words lead and co-lead',
           'Number of male actors', 'Year', 'Number of female actors',
           'Number words male', 'Gross',
           'Mean Age Male', 'Mean Age Female', 'Age Lead', 'Age Co-Lead']]
y = train['Lead']

# Model structuring
QDA_model = skl_da.QuadraticDiscriminantAnalysis()

# Model tuning & fitting
param_grid = {'reg_param': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1],
              'tol': [0.001, 0.0001, 0.00001, 0.000001],
              'priors': [None, [0.25, 0.75], [0.5, 0.5], [0.75, 0.25]],
              'store_covariance': [True, False]}
cv = skl_ms.KFold(n_splits=14, shuffle=True, random_state=None)
grid_search = GridSearchCV(estimator=QDA_model, param_grid=param_grid, scoring='accuracy', verbose=1, cv=cv)
grid_search.fit(X, y)
best_params = grid_search.best_params_
# Logistic Regression
from sklearn.model_selection import RandomizedSearchCV

# Read in the training data (Observe that the file is in this case loaded locally on the computer)
url = 'train.csv'
training = pd.read_csv(url, na_values='?', dtype={'ID': str}).dropna().reset_index()

# Sampling indices for the dataset in a training set and test set
np.random.seed(1)
trainI = np.random.choice(training.shape[0], size=500, replace=False)
trainIndex = training.index.isin(trainI)
train = training.iloc[trainIndex]    # Training set
test = training.iloc[~trainIndex]    # Test set

# Set the model using sklearn to solve with logistic regression
LogReg_model = skl_lm.LogisticRegression()

# Model tuning & fitting
param_grid = {'penalty': ['l1', 'l2', 'elasticnet', None],
              'C': [0.0001, 0.001, 0.01, 0.1, 1.0],
              'fit_intercept': [True, False],
              'solver': ['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga'],
              'multi_class': ['auto', 'ovr', 'multinomial'],
              'tol': [0.001, 0.0001, 0.00001, 0.000001],
              'random_state': [None, 0, 1234, 4321],
              'max_iter': [100, 1000, 10000, 10000]}
random_search = RandomizedSearchCV(estimator=LogReg_model, param_distributions=param_grid,
                                   n_iter=14, scoring='accuracy', n_jobs=-1,
                                   random_state=1, verbose=1)
random_search.fit(X_train, Y_train)
best_params = random_search.best_params_

model.fit(X_train, Y_train)
# print('model summary:')
# print(model)

# Calculate predicted values, with the first 10 results.
predict_prob = model.predict_proba(X_test)
# print('Classes')
# print(model.classes_)
# print(predict_prob[0:10])

# Predict the results based on the parameters before; the first 20 are chosen here
# simply to check if the algorithm works properly, since the first 10 result in 'Male'.
prediction = np.empty(len(X_test), dtype=object)
prediction = np.where(predict_prob[:, 0] >= 0.5, 'Female', 'Male')
# print(prediction[0:20])

# n-fold Cross validation with logistic regression, using n=14, random_state=42 and
# 1400 max iterations to exclude problems running the code
log_cv = skl_lm.LogisticRegressionCV(cv=14, random_state=42, max_iter=1400)
# Random Forest & AdaBoost
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold, GridSearchCV
from sklearn import tree
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

# Load the dataset
df = pd.read_csv('train.csv')

# If refit=True then models are run even if there already exist results for them.
# However, if refit=False then models are not refitted once stored results exist.
refit = True
results = {}

# Use all data samples for cross validation since the final test data is stored in a separate file.
train = df_prep
# Train with all features except the lead ('Lead_Male' and 'Lead_Female')
X_train = train.drop(columns=['Total words', 'Lead_Male', 'Lead_Female'])
# Training label is 'Lead_Female' (1 or 0)
y_train = train['Lead_Female']

# Define jobs to (re-)run
job_list = [
    'RF',
    'AB-T',
]

# Define models and parameters for the grid search parameter optimization
full_names = {
    'RF': 'Random Forest',
    'AB-T': 'AdaBoosted tree',
}
base_models = {
    'RF': RandomForestClassifier(),
    'AB-T': AdaBoostClassifier(base_estimator=tree.DecisionTreeClassifier()),
}
model_parameters = {
    'RF': {
        'max_depth': [None, 5, 15],
        'min_samples_split': [5, 10, 20],
        'n_estimators': [500, 1000, 2000],
        'max_samples': [None, 0.5, 0.8],
        'criterion': ['gini', 'entropy'],
        'min_samples_leaf': [5, 15, 25],
        'bootstrap': [True, False],
        'random_state': [None, 0, 1234]},
    'AB-T': {
        'base_estimator__max_depth': [None, 5, 10, 15],
        'base_estimator__min_samples_split': [1, 5, 20, 200],
        'n_estimators': [500, 1000, 2000],
        'learning_rate': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
        'random_state': [None, 0, 1234]}
}

# Cross validation scheme
n_splits = 14
cv = KFold(n_splits=n_splits, random_state=None, shuffle=True)

# Fit models and print respective best parameter values and accuracy
n_jobs = len(job_list)
for i, job_name in enumerate(job_list):
    model_name = full_names[job_name]
    if refit == False and model_name in results.keys(): continue  # Don't refit models if refit==False
    # Prepare model
    base_model = base_models[job_name]
    param_grid = model_parameters[job_name]
    clf = GridSearchCV(base_model, param_grid, scoring='accuracy', cv=cv, n_jobs=-1, verbose=1)
    # Fit model and search for best parameters
    # print(f'\n\nFitting {job_name} (Job {i+1}/{n_jobs})\nModel: {model_name}\n----------')
    clf.fit(X_train, y_train)
# k-Nearest-Neighbor
def perform_kNN_cross_validation_to_get_k(K, columns, n_folds, data, test):
    missclassification = np.zeros(len(K))
    cv = skl_ms.KFold(n_splits=n_folds, random_state=2, shuffle=True)
    data_X = data.iloc[:, columns]
    data_Y = data.iloc[:, -1]
    test_X = test.iloc[:, columns]
    for j, k in enumerate(K):
        model = skl_nb.KNeighborsClassifier(n_neighbors=k)
        scores = skl_ms.cross_val_score(model, data_X, data_Y, scoring='accuracy', cv=cv)
        missclassification[j] = 1 - np.mean(scores)
    plt.plot(K, missclassification)
    plt.title('Cross validation error for kNN (n fold)')
    plt.xlabel('k')
    plt.ylabel('Validation error')
    plt.show()
    min_index = np.argmin(missclassification)
    min_value = missclassification[min_index]
    return min_value, min_index + 1

def perform_kNN_10_val_to_get_k(data, columns, n_runs, K):
    missclassification = np.zeros((n_runs, len(K)))
    for i in range(n_runs):
        X_train, X_val, Y_train, Y_val = skl_ms.train_test_split(data.iloc[:, columns],
                                                                 data.iloc[:, -1], test_size=0.3)
        for j, k in enumerate(K):
            model = skl_nb.KNeighborsClassifier(n_neighbors=k)
            model.fit(X_train, Y_train)
            prediction = model.predict(X_val)
            missclassification[i, j] = np.mean(prediction != Y_val)
    average_mis = np.mean(missclassification, axis=0)
    min_error = np.min(average_mis)
    index = np.argmin(average_mis)
    k_value = K[index]
    plt.plot(K, average_mis)
    plt.title('Cross validation error for kNN (n run average)')
    plt.xlabel('k')
    plt.ylabel('Validation error')
    plt.show()
    return min_error, k_value  # Return the minimum error and the corresponding k (used by the call below)

def prediction(k, columns, data, test):
    data_X = data.iloc[:, columns]
    data_Y = data.iloc[:, -1]
    test_X = test.iloc[:, columns]
    model = skl_nb.KNeighborsClassifier(n_neighbors=k)
    model.fit(data_X, data_Y)
    prediction = model.predict(test_X)
    return prediction

# Set random seed
np.random.seed(1)
# Get data from train.csv
data = pd.read_csv('train.csv')
# Get test data from test.csv
test = pd.read_csv('test.csv')

# General parameters
K = np.arange(1, 199)  # k = 1, 2, ..., 198
columns = [0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]  # better accuracy without 1

# Get best k for K-NN using fold cross validation
n_folds = 14
min_value, k_fold = perform_kNN_cross_validation_to_get_k(K, columns, n_folds, data, test)
# print(k)
# print(f'The best k value according to n fold cross validation: {k_fold}')
print(f"The minimum missclassification value is {min_value} at k = {k_fold} "
      f"according to n fold cross validation")

# Get best k for K-NN by doing 10 runs with different validation sets
n_runs = 10
min_error, k_10_val = perform_kNN_10_val_to_get_k(data, columns, n_runs, K)
print(f"The minimum missclassification value is {min_error} at k = {k_10_val} "
      f"according to average from n runs")
Which inputs to consider
# Implementation:
np.random.seed(1)  # For reproducibility
train = pd.read_csv('train.csv')
X = train.copy().drop(columns='Lead')
plt.figure(5)
pd.plotting.scatter_matrix(X, figsize=(40, 40))
plt.show()
Naive Classifier
# Implementation:
train = pd.read_csv('train.csv')

# Naive Classifier that always predicts lead gender as male
def naive_classifier(train):
    male_prediction = []
    for i in train['Lead']:
        male_prediction.append('Male')
    return male_prediction

# Function to calculate accuracy
def accuracy(y_prediction, y_validation):
    accuracy = np.mean(y_prediction == y_validation)
    return accuracy
Comparing the methods
# Performing k-fold cross-validation (with k=14) on all the chosen models to choose the best method for predicting Lead.
# Implementation:
from sklearn import tree
np.random.seed(1)  # For reproducibility
train = pd.read_csv('train.csv')
X = train[['Number words female', 'Number of words lead', 'Difference in words lead and co-lead',
           'Number of male actors', 'Year', 'Number of female actors', 'Number words male', 'Gross',
           'Mean Age Male', 'Mean Age Female', 'Age Lead', 'Age Co-Lead']]
y = train['Lead']
cross_validation = skl_ms.KFold(n_splits=14, shuffle=True, random_state=1)

# Linear Discriminant Analysis (LDA)
# Evaluate optimal LDA using k-fold cross-validation where k=14
LDA_model = skl_da.LinearDiscriminantAnalysis(solver='svd', tol=0.001, shrinkage=None)
Accuracy = skl_ms.cross_val_score(LDA_model, X, y, scoring='accuracy', cv=cross_validation)
print(f"Accuracy with LDA using k-fold cross-validation with k=14: {(np.mean(Accuracy)) * 100:.1f}%")

# Quadratic Discriminant Analysis (QDA)
# Evaluate optimal QDA using k-fold cross-validation where k=14
QDA_model = skl_da.QuadraticDiscriminantAnalysis(reg_param=0.2, tol=0.001,
                                                 store_covariance=True, priors=[0.25, 0.75])
Accuracy = skl_ms.cross_val_score(QDA_model, X, y, scoring='accuracy', cv=cross_validation)
print(f"Accuracy with QDA using k-fold cross-validation with k=14: {(np.mean(Accuracy)) * 100:.1f}%")

# Logistic Regression
# Evaluate optimal Logistic Regression using k-fold cross-validation where k=14
LogReg_model = skl_lm.LogisticRegression(solver='newton-cholesky', tol=0.0001, penalty='l2',
                                         multi_class='ovr', fit_intercept=False, C=1.0,
                                         max_iter=10000, random_state=0)
Accuracy = skl_ms.cross_val_score(LogReg_model, X, y, scoring='accuracy', cv=cross_validation)
print(f"Accuracy with Logistic Regression using k-fold cross-validation with k=14: {(np.mean(Accuracy)) * 100:.1f}%")

# Tree-based methods: Random Forest
# Evaluate optimal Random Forest using k-fold cross-validation where k=14
forest_model = skl_en.RandomForestClassifier(n_estimators=1000, min_samples_split=5,
                                             max_depth=None, max_samples=None, bootstrap=False,
                                             criterion='entropy', min_samples_leaf=5, random_state=0)
Accuracy = skl_ms.cross_val_score(forest_model, X, y, scoring='accuracy', cv=cross_validation)
print(f"Accuracy using Random Forest & k-fold cross-validation with k=14: {(np.mean(Accuracy)) * 100:.1f}%")

# Boosting: AdaBoost tree
# Evaluate optimal AdaBoost tree model using k-fold cross-validation where k=14
estimator = tree.DecisionTreeClassifier(min_samples_split=200, max_depth=5)
adaboost_model = skl_en.AdaBoostClassifier(estimator=estimator, n_estimators=1000,
                                           random_state=None, learning_rate=1.0)
Accuracy = skl_ms.cross_val_score(adaboost_model, X, y, scoring='accuracy', cv=cross_validation)
print(f"Accuracy using AdaBoost Tree & k-fold cross-validation with k=14: {(np.mean(Accuracy)) * 100:.1f}%")

# k-Nearest-Neighbor (kNN)
# Evaluate optimal kNN with k=5 using k-fold cross-validation where k=14
kNN_model = skl_nb.KNeighborsClassifier(n_neighbors=5)
Accuracy = skl_ms.cross_val_score(kNN_model, X, y, scoring='accuracy', cv=cross_validation)
print(f"Accuracy using kNN with k=5 & k-fold cross-validation with k=14: {(np.mean(Accuracy)) * 100:.1f}%")

# Naive Classifier that always predicts lead gender as male
y_validation = y
Accuracy = []
y_prediction = naive_classifier(train)
Accuracy.append(accuracy(y_prediction, y_validation))
Accuracy_percentage = np.mean(Accuracy) * 100
print(f"Accuracy of the Naive Classifier that predicts lead gender as male: {Accuracy_percentage:.1f}%")
Plotting misclassifications for implemented methods
from sklearn.preprocessing import LabelEncoder
from sklearn import tree

# Implementation:
estimator = tree.DecisionTreeClassifier(min_samples_split=200, max_depth=100)

# k-fold cross-validation for all chosen methods (except Naive Classifier) using k=14
models = [skl_da.LinearDiscriminantAnalysis(solver='svd', tol=0.001, shrinkage=None),
          skl_da.QuadraticDiscriminantAnalysis(reg_param=0.2, tol=0.001,
                                               store_covariance=True, priors=[0.25, 0.75]),
          skl_lm.LogisticRegression(solver='newton-cholesky', tol=0.0001, penalty='l2',
                                    multi_class='ovr', fit_intercept=False, C=1.0,
                                    max_iter=10000, random_state=0),
          skl_en.RandomForestClassifier(n_estimators=1000, min_samples_split=5,
                                        max_depth=None, max_samples=None, bootstrap=False,
                                        criterion='entropy', min_samples_leaf=5, random_state=0),
          skl_en.AdaBoostClassifier(estimator=estimator, n_estimators=1000,
                                    random_state=None, learning_rate=1.0),
          skl_nb.KNeighborsClassifier(n_neighbors=5)]

missclassification = np.zeros((14, len(models)))
for i, [train_index, validation_index] in enumerate(cross_validation.split(X)):
    X_train, X_validation = X.iloc[train_index], X.iloc[validation_index]
    y_train, y_validation = y.iloc[train_index], y.iloc[validation_index]
    for m in np.arange(0, 6):
        model = models[m]
        model.fit(X_train, y_train)
        prediction = model.predict(X_validation)
        missclassification[i, m] = np.mean(prediction != y_validation)
Predicting Lead using chosen model
# Implementation:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
X_train = train.copy().drop(columns=['Total words', 'Lead'])
X_test = test.copy().drop(columns=['Total words'])
y = train['Lead']
QDA_model = skl_da.QuadraticDiscriminantAnalysis(reg_param=0.2, tol=0.001,
                                                 store_covariance=True, priors=[0.25, 0.75])
QDA_model.fit(X_train, y)
QDA_predictions_categorical = QDA_model.predict(X_test)