Statistical Machine Learning
Daniel Fransén
Andreas Hertzberg
Alhassan Jawad
Erik Magnusson
Abstract
The project aimed to answer whether Machine Learning (ML) methods can be used to
solve the classification problem of predicting the gender of a movie’s lead character.
Out of the five ML methods used: Logistic regression, Discriminant Analysis, k-Nearest
Neighbor, Random Forest, and AdaBoost, the project demonstrated that Quadratic
Discriminant Analysis was the optimal method for this classification problem.
1 Introduction
In this study, a variety of machine learning techniques were employed to determine the gender of the
lead actor in a movie. Using information from 1037 movies containing 13 inputs & 1 output, multiple
supervised classification models were trained individually, and their performance was compared to
determine the most effective model for the particular purpose of determining whether the lead actor is
male or female. The evaluation process involved splitting the data into training and validation sets through
k-fold cross-validation. The models tested were
1. Logistic regression
2. Linear Discriminant Analysis (LDA)
3. Quadratic Discriminant Analysis (QDA)
4. k-Nearest Neighbor
5. Tree-based methods: Random Forest
6. Boosting: AdaBoost
The optimal model will be proposed for implementation and will be further evaluated using a separate
test set of 387 films.
2 Pre-data Analysis
The following questions were answered by pre-analysing the given dataset & mainly printing out
different columns from it. The code can be seen in the appendix (section: Pre-data analysis).
Do men or women dominate speaking roles in Hollywood movies? To answer this question we can
consider two points of discussion:
1. Figure 1, which shows how many words were spoken by males vs females throughout the years.
2. The percentage of movies in which men spoke more than women, and vice versa.
Analysing Figure 1 shows that, in general, men speak more than women, at some points at roughly
double the rate. There exist some exceptions, but the conclusion is still that men speak more than
women. The reason Figure 1 is scaled the way it is, is that it shows the total number of words
(lead + non-lead) per gender per year.
Printing the percentage of movies where men spoke more than women, compared with movies where
women spoke more, shows that men spoke more in roughly 76.6% of the movies while women spoke
more in the remaining 23.4%.
So an answer to the question "Do men or women dominate speaking roles in Hollywood movies?"
is that men still dominate the speaking roles, speaking more in well over half of the movies.
Concerning the question about gender balance, it can be understood in two different ways:
1. Has the gender balance for a specific role changed over time (i.e. over the years)? (Figure 2a)
2. Has the gender balance (male/female) for speaking roles (mainly lead) changed over time? (Figure 2b)
Figure 2a answers the question regarding gender balance for a specific role by showing that the
percentage of female actors saw a slow increase around the 1970s but then stagnated at around 30%.
The percentage of female leads, on the other hand, has varied somewhat but is still under 50% of the
total leading roles.
Figure 2b shows two subplots: the upper plot compares the percentage of male and female leads,
while the lower plot compares the percentage of male and female actors in the industry. Both
subplots show two curves that appear to be reflections of each other, which brings us to the answer
that the gender balance has changed over time.
Figure 2: (a) Female leads vs female actors (percentage); (b) percentage of male/female leads and male/female actors.
2.3 Do films in which men do more speaking make a lot more money than films in which
women speak more?
Which movies make more money, the movies where men speak more or those in which women speak
more?
The solution can simply be obtained by computing the average gross for movies in which men
spoke more frequently than women and vice-versa. The calculations revealed that movies with a
predominant male dialogue had an average revenue of $118.6 million, while those with a female-led
dialogue had an average gross income of $86.6 million. Additionally, movies with a male lead actor
generated an average income of $115.2 million, compared to $98.7 million for movies with a female
lead. This indicates that movies centered around males tend to generate a higher gross than those
centered around females.
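As a minimal sketch of this calculation (using the dataset's column names as elsewhere in the report, and assuming the training data is loaded in the DataFrame train):

# Average gross depending on which gender speaks more (sketch)
men_speak_more = train[train['Number words male'] > train['Number words female']]
women_speak_more = train[train['Number words female'] > train['Number words male']]
print(f"Average gross, male-dominated dialogue: {men_speak_more['Gross'].mean():.1f}")
print(f"Average gross, female-dominated dialogue: {women_speak_more['Gross'].mean():.1f}")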
3 Implementation of Methods
Of the 13 known inputs, 12 (all except Total words) were used as a quantitative input subset.
The input Total words was excluded because it is collinear with other inputs, specifically
Number words female & Number words male. Furthermore, the models (all except kNN) were
tuned using grid search or random search as a methodological approach, and performance/evaluation
is determined using k-fold cross-validation. We chose accuracy as the performance metric
because it is intuitive to interpret. Mathematically, accuracy is the fraction of correctly classified
samples and is calculated using the cross_val_score function. The hyper parameter tuning is briefly
described below; for more detail, see the appendix (section: Choice of hyper parameters). Additionally,
the hyper parameter optimization results are summarized in Table 1.
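As a minimal sketch of how this metric is obtained (mirroring the appendix code; X and y are assumed to hold the 12 inputs and the Lead labels, and any of the tuned models can stand in for the estimator):

from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression  # placeholder estimator for illustration

cv = KFold(n_splits=14, shuffle=True, random_state=1)
scores = cross_val_score(LogisticRegression(max_iter=10000), X, y, scoring='accuracy', cv=cv)
print(f"Mean accuracy: {scores.mean() * 100:.1f}%")  # fraction of correctly classified samples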
Female=1. This limit can be adjusted according to the needs of individual situations and calculations.
In our case, this threshold is set to 0.5, meaning that if the probability of the class Male is 0.5 or
higher, the logistic regression method will classify that particular data point as Male. Finding a
good threshold to optimize this method is not always an easy task; finding the optimum value
usually means tuning it manually, and it could also require cross-referencing against other projects.
0.5 is the standard threshold [2], and given the time frame and the complexity of this project it
seems like a decent threshold to use.
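As a small illustration of how the threshold is applied (a sketch with hypothetical variable names; the column order of predict_proba follows the fitted model's classes_ attribute):

import numpy as np

# Classify as Male when P(Male) >= 0.5, otherwise Female (threshold = 0.5 as discussed above)
probabilities = LogReg_model.predict_proba(X_test)
p_male = probabilities[:, list(LogReg_model.classes_).index('Male')]
prediction = np.where(p_male >= 0.5, 'Male', 'Female')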
Linear Discriminant Analysis (LDA) is a Machine Learning model that is commonly used
for classification problems, in particular when there are more than two classes. For
our classification problem, the categorical variable Lead, which can be either Male or Female, is
transformed into the binary numbers 1 & 0 so that we have numerical variables to which LDA can be
applied. The method works by constructing an axis that 1) maximizes the distance between the two
classes' mean values while 2) keeping the two classes' internal variance minimized. The data points of
each class are then projected onto the shared axis, which creates the linear decision boundary that
helps in predicting, in our case, whether the Lead gender is Male=1 or Female=0. Worth knowing is that in
LDA, we assume that both classes share a covariance matrix, which is what results in a linear
decision boundary.
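For reference, the shared covariance assumption leads to the standard linear discriminant function (see e.g. [4]): a point $x$ is assigned to the class $k$ that maximizes
$$\delta_k(x) = x^{\mathsf{T}}\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^{\mathsf{T}}\Sigma^{-1}\mu_k + \ln \pi_k,$$
where $\mu_k$ is the class mean, $\Sigma$ the shared covariance matrix and $\pi_k$ the class prior. Since $\delta_k(x)$ is linear in $x$, the boundary between the two classes is linear.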
Quadratic Discriminant Analysis (QDA) is a Machine Learning model similar to LDA, but with
the key difference that each class is assumed to have its own covariance matrix, which results in a
quadratic decision boundary. Another thing of note is that QDA is a model made for classification
problems, meaning that there is no need to transform the categorical output variable into a numerical
one. There are some other differences in how QDA & LDA behave for different data types
(which model is more prone to overfit, etc.), and it is up to the programmer to choose how and which
model to implement.
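Correspondingly, with a class-specific covariance matrix $\Sigma_k$ the standard discriminant function (again see e.g. [4]) picks up a quadratic term in $x$,
$$\delta_k(x) = -\tfrac{1}{2}\ln|\Sigma_k| - \tfrac{1}{2}(x-\mu_k)^{\mathsf{T}}\Sigma_k^{-1}(x-\mu_k) + \ln \pi_k,$$
which is why the resulting decision boundary is quadratic.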
Both LDA & QDA were implemented using the sklearn.discriminant_analysis module and then fitted
using the package's GridSearchCV. As a pre-processing step, we normalized the dataset for LDA,
mainly to improve the performance metrics; the StandardScaler() function was used to scale the
inputs, while the outputs were transformed into binary numbers. We also tried to normalize the data
for QDA, but it did not improve the performance metrics in any noticeable way.
The more values that are passed into the param_grid variable, the more computationally demanding
and complex the grid search becomes, and this does not necessarily mean that better parameter options
will be found. After grid searching through 2016 & 4480 candidate fits for LDA & QDA, respectively,
the optimal tuning for each model was found, giving accuracies of 85.8% & 89.7%.
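These fit counts follow directly from the sizes of the parameter grids listed in the appendix: for LDA, 3 solver values × 4 tol values × 12 shrinkage values give $3 \cdot 4 \cdot 12 = 144$ combinations, and with 14-fold cross-validation this means $144 \cdot 14 = 2016$ fits; for QDA, $10 \cdot 4 \cdot 4 \cdot 2 = 320$ combinations of reg_param, tol, priors and store_covariance give $320 \cdot 14 = 4480$ fits.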
In this project, kNN works by setting a value for the number of data points used to make the
classification, often labeled k. As a simplification, if k is set to 5, the 5 nearest data points are
compared and the majority label "wins" and becomes the predicted value for x∗. Optimizing kNN
therefore means finding the k value that results in the minimum validation error.
According to the documentation [7], there are 8 hyper parameters in the
sklearn.neighbors.KNeighborsClassifier function. However, because n_neighbors was the
only parameter covered in this course, we deemed it better to exclude the others during model tuning.
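A minimal sketch of this k-selection, paralleling the function perform_kNN_cross_validation_to_get_k in the appendix (X and y are assumed to hold the inputs and labels):

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

cv = KFold(n_splits=14, shuffle=True, random_state=2)
K = range(1, 199)
# Validation error (1 - accuracy) for every candidate k
errors = [1 - cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y,
                              scoring='accuracy', cv=cv).mean() for k in K]
best_k = K[int(np.argmin(errors))]  # k with the minimum validation error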
For a large ensemble, the variance of the ensemble average is approximately ρσ². This is smaller than
the variance of the individual decision trees (σ²), and there is a similar result for classification.
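This refers to the standard expression (see e.g. [4]) for the variance of an average of $N$ identically distributed trees, each with variance $\sigma^2$ and pairwise correlation $\rho$:
$$\operatorname{Var}\!\left(\frac{1}{N}\sum_{n=1}^{N} z_n\right) = \rho\sigma^2 + \frac{1-\rho}{N}\,\sigma^2 \;\approx\; \rho\sigma^2 \quad \text{for large } N.$$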
The most obvious way of reducing the ensemble variance is to use a large ensemble (many
trees, large N), but this also implies a large computational cost. Apart from the ensemble size, the
remaining variables ρ and σ are both affected by the bagging of samples and features. For both
sample and feature bagging, there is an optimum in the number of samples/features drawn during the
bootstrapping, since training with a large number of samples/features makes the variance σ² of the
individual trees small, but increases their correlation (ρ).
for data $(x_i, y_i)$. After normalization of $w^{(b)}$, the remaining step in each iteration is to calculate the
optimal coefficient $\alpha^{(b)}$. It can be shown that the error of the boosted classifier formed so far by the
weak models can be minimized by setting $\alpha^{(b)} = \frac{1}{2}\ln\frac{1 - E_{\mathrm{train}}^{(b)}}{E_{\mathrm{train}}^{(b)}}$.
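As a rough sketch of one such boosting iteration under these definitions (the labels are assumed to be encoded as -1/+1 in y_pm, and a decision tree is used as the weak model; the appendix instead relies on sklearn's AdaBoostClassifier):

import numpy as np
from sklearn import tree

# One AdaBoost iteration (sketch); X holds the inputs, y_pm the -1/+1 labels (assumed)
w = np.ones(len(X)) / len(X)                   # initial, normalized sample weights
weak_model = tree.DecisionTreeClassifier(max_depth=5, min_samples_split=200)
weak_model.fit(X, y_pm, sample_weight=w)
pred = weak_model.predict(X)

E_train = np.sum(w[pred != y_pm])              # weighted training error of the new weak model
alpha = 0.5 * np.log((1 - E_train) / E_train)  # the coefficient alpha from the expression above

w = w * np.exp(-alpha * y_pm * pred)           # increase the weight of misclassified samples
w = w / np.sum(w)                              # re-normalize before the next iteration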
RandomizedSearchCV was not implemented for Random Forest or AdaBoost because it would have
decreased the number of searched fits/combinations to 196, which lowered the optimal model's
accuracy, even though RandomizedSearchCV would have decreased the computational time by more
than 97.5%.
Table 1: Some of the most important hyper parameters for each model, along with their respective
optimized values. The hyper parameters were optimized through grid or random search. The names
are from scikit-learn, and are largely self-explanatory, but see refs. [3], [5]–[9] for more details.
Method              | Parameter 1         | Parameter 2           | Parameter 3             | Parameter 4      | Parameter 5
LDA                 | shrinkage = None    |                       |                         |                  |
QDA                 | reg_param = 0.2     | priors = [0.25, 0.75] |                         |                  |
Logistic regression | penalty = 'l2'      | C = 1.0               | fit_intercept = False   |                  |
Random forest       | n_estimators = 1000 | bootstrap = False     | min_samples_split = 5   | max_depth = None | min_samples_leaf = 5
AdaBoost tree       | n_estimators = 1000 | learning_rate = 1.0   | min_samples_split = 200 | max_depth = 5    |
kNN                 | n_neighbors = 5     |                       |                         |                  |
3.6 Model Selection
For a good comparison with the implemented methods (sections 3.1-3.5), a naive classifier that
always predicts the gender of Lead as Male was created. This classifier's accuracy reached 75.6%,
and methods with accuracy higher than the naive classifier's will be deemed useful for the given
classification problem.
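The same baseline figure can be obtained directly as the majority-class share of the training labels, for example:

# Accuracy of always predicting 'Male' equals the fraction of male leads in the training data
naive_accuracy = (train['Lead'] == 'Male').mean() * 100
print(f"Naive classifier accuracy: {naive_accuracy:.1f}%")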
We compared the implemented methods using k-fold cross-validation with k=14, and the results are
shown in the box plot in Figure 3 below.
Looking at Figure 3, kNN can be eliminated as the chosen method because it has the worst
validation error of the examined model families. Random Forest, Logistic Regression & LDA
show a smaller validation error than kNN, but they too are eliminated because both AdaBoost &
QDA perform better. This leaves one of two choices, and we choose to put QDA "in production"
because it performs better than AdaBoost by a good margin.
4 Conclusion
The chosen method/model to be put into production for this classification problem is Quadratic
Discriminant Analysis (QDA), with the hyper parameters shown in the appendix (section: Choice of
hyper parameters). The script "Comparing the methods" shows that the implemented methods'
accuracies are all around 80%, with the QDA model performing better than the others with an
impressive accuracy of 89.7%.
Overall, this study provides valuable insights into the usage of machine learning methods and shows
that even tasks such as gender classification based on movie statistics can be solved using machine learning.
References
[1] I. B. Machines. "What is logistic regression?" IBM. (2023), [Online]. Available: https://www.ibm.com/topics/logistic-regression (visited on 02/25/2023).
[2] G. Harrison. "Calculating and setting thresholds to optimise logistic regression performance," Towards Data Science. (2023), [Online]. Available: https://towardsdatascience.com/calculating-and-setting-thresholds-to-optimise-logistic-regression-performance-c77e6d112d7e (visited on 02/22/2023).
[3] F. Pedregosa, G. Varoquaux, A. Gramfort, et al. "sklearn.linear_model.LogisticRegression," Scikit-learn. (2023), [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html (visited on 02/22/2023).
[4] A. Lindholm, N. Wahlström, F. Lindsten, and T. B. Schön, Machine Learning: A First Course for Engineers and Scientists, New Edition. Cambridge: Cambridge University Press, 2022, ISBN: 9781108843607.
[5] F. Pedregosa, G. Varoquaux, A. Gramfort, et al. "sklearn.discriminant_analysis.LinearDiscriminantAnalysis," Scikit-learn. (2023), [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html (visited on 02/22/2023).
[6] F. Pedregosa, G. Varoquaux, A. Gramfort, et al. "sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis," Scikit-learn. (2023), [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis.html (visited on 02/22/2023).
[7] F. Pedregosa, G. Varoquaux, A. Gramfort, et al. "sklearn.neighbors.KNeighborsClassifier," Scikit-learn. (2023), [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html (visited on 02/22/2023).
[8] F. Pedregosa, G. Varoquaux, A. Gramfort, et al. "sklearn.ensemble.RandomForestClassifier," Scikit-learn. (2023), [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html (visited on 02/22/2023).
[9] F. Pedregosa, G. Varoquaux, A. Gramfort, et al. "sklearn.ensemble.AdaBoostClassifier," Scikit-learn. (2023), [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html (visited on 02/22/2023).
Appendix
Choice of hyper parameters
The parameters dual, intercept_scaling, verbose, warm_start, l1_ratio & class_weight are not
something that has been covered in the course SML. Thus, they will not be used when tuning logistic
regression.
n_jobs controls how many CPU cores are used when working with Logistic Regression. According to
the documentation [3], n_jobs is "used when parallelizing over classes if multi_class=’ovr’", which
means that n_jobs will be set ONLY if the logistic regression code takes a long time to run and/or
loops through many parameter fits.
Linear Discriminant Analysis has 7 hyper parameters. The parameter solver specifies which
solver method is used for the described problem and is thus worth looping over; in addition, this
parameter was discussed when LDA was introduced in the course.
Concerning the parameter shrinkage, it controls the strength of the model's regularization term
(regularization applied to the covariance matrix of the input data), where smaller/larger values
may dictate whether the model overfits or underfits. Thus this parameter is important to include when
looping through different fits.
The parameters priors, n_components, store_covariance and covariance_estimator are not something
that has been covered in the course and will not be taken into account.
tol will be set to different values and looped through because of its importance for finding the optimal
LDA model. Observe that tol is the tolerance used in the solver's stopping/thresholding criterion.
Quadratic Discriminant Analysis has 4 hyper parameters, and all of them will be included in the
grid search even though two of them were not covered in the course. The four hyper parameters are
priors, reg_param, store_covariance & tol.
kNN has 8 hyper parameters, and of them only the most well-known one (namely n_neighbors) will be
used, because this is the only parameter discussed in the course. The other 7 parameters are:
1. weights
2. algorithm
3. leaf_size
4. p
5. metric
6. metric_params
7. n_jobs
Observe that n_jobs decides how many CPU cores are used for the calculations, but it will not be needed
as no more than one hyper parameter is used for kNN.
Random Forest has 18 different optional and required parameters, and of them only 7 will be
used, due to the code taking a long time to run. The used parameters are:
1. n_estimators
2. criterion
3. max_depth
4. min_samples_split
5. min_samples_leaf
6. bootstrap
7. random_state
Note that n_jobs for the GridSearchCV function will be manually set to -1 so that all CPU
processors/cores are used.
AdaBoost has 5 parameters, and besides the base_estimator (which will be passed in as a pre-made
parameter), the parameters:
1. n_estimators
2. learning_rate
3. random_state
will be included, as they are parameters that were covered in the course and/or feel important for
model tuning.
Due to time constraints, and because including 7 & 3 hyper parameters for Random Forest
& AdaBoost gives between 56000 and 700000 fits/combinations to search through (up to 320
minutes for one run), it was decided that each parameter would be limited to a maximum of 3
& 10 values for Random Forest & AdaBoost, respectively. This is done to conserve computational
power and computational time per run.
Main Code
The code is attached as a zip file and should be opened as a Jupyter Notebook. Some basic
comments that are not shown in the appendix can be found in the actual code. Observe that after the
individual models were tuned, their accuracy was calculated in the "Compare the methods" script so
that all methods use the same datasets and no method gets an advantage or disadvantage.
Imports
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
# sklearn modules under the aliases used throughout the notebook
import sklearn.discriminant_analysis as skl_da, sklearn.ensemble as skl_en
import sklearn.linear_model as skl_lm, sklearn.model_selection as skl_ms
import sklearn.neighbors as skl_nb
Pre-data steps
# Read the training data and save it into variable "train"
train = pd.read_csv('train.csv')
Pre-data Analysis
# First Question: Do men or women dominate speaking roles in Hollywood movies?
# Create sorted list of years with at least one movie
Years = train['Year'].unique()  # .unique() returns the distinct values in an unsorted array
Years.sort()

# Determines & calculates the percentage of when females/males spoke more and then prints the results
more_female_words = train.loc[(train['Number words male'] < train['Number words female'])]
more_male_words = train.loc[(train['Number words female'] < train['Number words male'])]
print(f"The percentage of movies where females speak more than males: {(len(more_female_words) / len(train)) * 100:.1f}%")
print(f"The percentage of movies where males speak more than females: {(len(more_male_words) / len(train)) * 100:.1f}%")

total_words_female = []
total_words_male = []
for year in Years:
    current_movies = train[train['Year'] == year]

plt.xlabel('Year')
plt.ylabel('Total number of words\nper gender')
plt.legend()
plt.show()
# Second Question: Has gender balance in speaking roles changed over time (i.e. years)?
Years = train['Year'].unique()  # .unique() returns the distinct values in an unsorted array
Years.sort()

female_lead_percentage = []
female_actors_percentage = []
male_lead_percentage = []
male_actors_percentage = []

for i in Years:
    data = train.loc[(train['Year'] == i)]
    # Calculates how many female leads there have been each year (given in percentage)
    lead_precentage_female = (data[data['Lead'] == 'Female'].shape[0]) / len(data) * 100
    female_lead_percentage.append(lead_precentage_female)
    # Calculates how many female actors there have been each year (given in percentage)
    actors_percentage_female = data['Number of female actors'].sum() \
        / (data['Number of female actors'].sum() + data['Number of male actors'].sum()) * 100
    female_actors_percentage.append(actors_percentage_female)

# Plot the results
# Figure 2: Female leads vs Female actors (percentage)
plt.figure(2)
plt.plot(Years, female_lead_percentage, 'r.', label='Percentage of female leads')
plt.plot(Years, female_actors_percentage, 'b.', label='Percentage of female actors')
plt.xlabel('Year')
plt.ylabel('Percentage $\%$')
plt.legend()
plt.show()

# Female leads vs Male leads (percentage)
ax1.plot(Years, female_lead_percentage, 'r.', label='Percentage of female leads')
ax1.plot(Years, male_lead_percentage, 'g.', label='Percentage of male leads')
ax1.set_ylabel('Percentage $\%$')
ax1.set_title('Percentage of female vs male leads')
ax1.legend()

# Female actors vs Male actors
ax2.plot(Years, female_actors_percentage, 'r.', label='Percentage of female actors')
ax2.plot(Years, male_actors_percentage, 'g.', label='Percentage of male actors')
ax2.set_xlabel('Year')
ax2.set_ylabel('Percentage $\%$')
ax2.set_title('Percentage of female vs male actors')
ax2.legend()
plt.show()
# Third Question: Do films in which men do more speaking make a lot more money than films in which women speak more?
Implementation of methods
# Used all 13 inputs except 'Total words', which is collinear with 'Number words female' & 'Number words male' and was excluded because of warnings
# Linear Discriminant Analysis (LDA)
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

# Implementation:
np.random.seed(1)  # For reproducibility
train = pd.read_csv('train.csv')
normalizer = StandardScaler()

# Create X & y
X = train[['Number words female', 'Number of words lead',
           'Difference in words lead and co-lead',
           'Number of male actors', 'Year', 'Number of female actors',
           'Number words male', 'Gross',
           'Mean Age Male', 'Mean Age Female', 'Age Lead', 'Age Co-Lead']]
X = normalizer.fit_transform(X)  # To normalize the data
y = train['Lead']

# Model structuring
LDA_model = skl_da.LinearDiscriminantAnalysis()

# Model tuning & fitting
cv = skl_ms.KFold(n_splits=14, shuffle=True, random_state=None)
param_grid = {'solver': ['svd', 'lsqr', 'eigen'],
              'tol': [0.001, 0.0001, 0.00001, 0.000001],
              'shrinkage': [None, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]}
grid_search = GridSearchCV(estimator=LDA_model, param_grid=param_grid, scoring='accuracy', verbose=1, cv=cv)
grid_search.fit(X, y)
best_params = grid_search.best_params_

# Note:
# The warning that we get is because:
# 1) Shrinkage is not supported with the solver='svd'
# We tested different combinations and this warning doesn't affect finding the optimal model!
# Quadratic Discriminant Analysis (QDA)
from sklearn.model_selection import GridSearchCV

# Implementation:
np.random.seed(1)  # For reproducibility
train = pd.read_csv('train.csv')

# Create X & y
X = train[['Number words female', 'Number of words lead',
           'Difference in words lead and co-lead',
           'Number of male actors', 'Year', 'Number of female actors',
           'Number words male', 'Gross',
           'Mean Age Male', 'Mean Age Female', 'Age Lead', 'Age Co-Lead']]
y = train['Lead']

# Model structuring
QDA_model = skl_da.QuadraticDiscriminantAnalysis()

# Model tuning & fitting
param_grid = {'reg_param': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1],
              'tol': [0.001, 0.0001, 0.00001, 0.000001],
              'priors': [None, [0.25, 0.75], [0.5, 0.5], [0.75, 0.25]],
              'store_covariance': [True, False]}
cv = skl_ms.KFold(n_splits=14, shuffle=True, random_state=None)
grid_search = GridSearchCV(estimator=QDA_model, param_grid=param_grid, scoring='accuracy', verbose=1, cv=cv)
grid_search.fit(X, y)
best_params = grid_search.best_params_
# Logistic Regression
from sklearn.model_selection import RandomizedSearchCV

# Read in the training data (Observe that the file is in this case loaded locally on the computer)
url = 'train.csv'
training = pd.read_csv(url, na_values='?', dtype={'ID': str}).dropna().reset_index()

# Sampling indices for the dataset in a training set and test set
np.random.seed(1)
trainI = np.random.choice(training.shape[0], size=500, replace=False)
trainIndex = training.index.isin(trainI)
train = training.iloc[trainIndex]    # Training set
test = training.iloc[~trainIndex]    # Test set

# Set the model using sklearn to solve with logistic regression
LogReg_model = skl_lm.LogisticRegression()

# Model tuning & fitting
param_grid = {'penalty': ['l1', 'l2', 'elasticnet', None],
              'C': [0.0001, 0.001, 0.01, 0.1, 1.0],
              'fit_intercept': [True, False],
              'solver': ['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga'],
              'multi_class': ['auto', 'ovr', 'multinomial'],
              'tol': [0.001, 0.0001, 0.00001, 0.000001],
              'random_state': [None, 0, 1234, 4321],
              'max_iter': [100, 1000, 10000, 10000]}
random_search = RandomizedSearchCV(estimator=LogReg_model, param_distributions=param_grid,
                                   n_iter=14, scoring='accuracy', n_jobs=-1,
                                   random_state=1, verbose=1)
random_search.fit(X_train, Y_train)
best_params = random_search.best_params_

model.fit(X_train, Y_train)
# print('model summary:')
# print(model)

# Calculate predicted values, with the first 10 results.
predict_prob = model.predict_proba(X_test)
# print('Classes')
# print(model.classes_)
# print(predict_prob[0:10])

# Predict the results based on the parameters before; the first 20 are chosen here
# simply to check if the algorithm works properly, since the first 10 result in 'Male'.
prediction = np.empty(len(X_test), dtype=object)
prediction = np.where(predict_prob[:, 0] >= 0.5, 'Female', 'Male')
# print(prediction[0:20])

# n-fold Cross validation with logistic regression, using n=14, random_state=42 and
# 1400 max iterations to exclude problems running the code
log_cv = skl_lm.LogisticRegressionCV(cv=14, random_state=42, max_iter=1400)
# Random Forest & AdaBoost
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import KFold, GridSearchCV
from sklearn import tree
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

# Load the dataset
df = pd.read_csv('train.csv')

# If refit=True then models are run even if there already exist results for them.
# However, if refit=False then models are not refitted once stored results exist.
refit = True
results = {}

# Use all data samples for cross validation since the final test data is stored in a separate file.
train = df_prep
# Train with all features except the lead ('Lead_Male' and 'Lead_Female')
X_train = train.drop(columns=['Total words', 'Lead_Male', 'Lead_Female'])
# Training label is 'Lead_Female' (1 or 0)
y_train = train['Lead_Female']

# Define jobs to (re-)run
job_list = [
    'RF',
    'AB-T',
]

# Define models and parameters for the grid search parameter optimization
full_names = {
    'RF': 'Random Forest',
    'AB-T': 'AdaBoosted tree',
}
base_models = {
    'RF': RandomForestClassifier(),
    'AB-T': AdaBoostClassifier(base_estimator=tree.DecisionTreeClassifier()),
}
model_parameters = {
    'RF': {
        'max_depth': [None, 5, 15],
        'min_samples_split': [5, 10, 20],
        'n_estimators': [500, 1000, 2000],
        'max_samples': [None, 0.5, 0.8],
        'criterion': ['gini', 'entropy'],
        'min_samples_leaf': [5, 15, 25],
        'bootstrap': [True, False],
        'random_state': [None, 0, 1234]},
    'AB-T': {
        'base_estimator__max_depth': [None, 5, 10, 15],
        'base_estimator__min_samples_split': [1, 5, 20, 200],
        'n_estimators': [500, 1000, 2000],
        'learning_rate': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
        'random_state': [None, 0, 1234]}
}

# Cross validation scheme
n_splits = 14
cv = KFold(n_splits=n_splits, random_state=None, shuffle=True)

# Fit models and print respective best parameter values and accuracy
n_jobs = len(job_list)
for i, job_name in enumerate(job_list):
    model_name = full_names[job_name]
    if refit == False and model_name in results.keys(): continue  # Don't refit models if refit==False
    # Prepare model
    base_model = base_models[job_name]
    param_grid = model_parameters[job_name]
    clf = GridSearchCV(base_model, param_grid, scoring='accuracy', cv=cv, n_jobs=-1, verbose=1)
    # Fit model and search for best parameters
    # print(f'\n\nFitting {job_name} (Job {i+1}/{n_jobs})\nModel: {model_name}\n----------')
    clf.fit(X_train, y_train)
# k-Nearest-Neighbor
def perform_kNN_cross_validation_to_get_k(K, columns, n_folds, data, test):
    missclassification = np.zeros(len(K))
    cv = skl_ms.KFold(n_splits=n_folds, random_state=2, shuffle=True)
    data_X = data.iloc[:, columns]
    data_Y = data.iloc[:, -1]
    test_X = test.iloc[:, columns]
    for j, k in enumerate(K):
        model = skl_nb.KNeighborsClassifier(n_neighbors=k)
        scores = skl_ms.cross_val_score(model, data_X, data_Y, scoring='accuracy', cv=cv)
        missclassification[j] = 1 - np.mean(scores)
    plt.plot(K, missclassification)
    plt.title('Cross validation error for kNN (n fold)')
    plt.xlabel('k')
    plt.ylabel('Validation error')
    plt.show()
    min_index = np.argmin(missclassification)
    min_value = missclassification[min_index]
    return min_value, min_index + 1

def perform_kNN_10_val_to_get_k(data, columns, n_runs, K):
    missclassification = np.zeros((n_runs, len(K)))
    for i in range(n_runs):
        X_train, X_val, Y_train, Y_val = skl_ms.train_test_split(data.iloc[:, columns],
                                                                 data.iloc[:, -1], test_size=0.3)
        for j, k in enumerate(K):
            model = skl_nb.KNeighborsClassifier(n_neighbors=k)
            model.fit(X_train, Y_train)
            prediction = model.predict(X_val)
            missclassification[i, j] = np.mean(prediction != Y_val)
    average_mis = np.mean(missclassification, axis=0)
    min_error = np.min(average_mis)
    index = np.argmin(average_mis)
    k_value = K[index]
    plt.plot(K, average_mis)
    plt.title('Cross validation error for kNN (n run average)')
    plt.xlabel('k')
    plt.ylabel('Validation error')
    plt.show()
    return min_error, k_value  # Return the minimum error and the corresponding k (used by the call below)

def prediction(k, columns, data, test):
    data_X = data.iloc[:, columns]
    data_Y = data.iloc[:, -1]
    test_X = test.iloc[:, columns]
    model = skl_nb.KNeighborsClassifier(n_neighbors=k)
    model.fit(data_X, data_Y)
    prediction = model.predict(test_X)
    return prediction

# Set random seed
np.random.seed(1)
# Get data from train.csv
data = pd.read_csv('train.csv')
# Get test data from test.csv
test = pd.read_csv('test.csv')

# General parameters
K = np.arange(1, 199)  # k = 1, 2, ..., 198
columns = [0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]  # better accuracy without 1

# Get best k for K-NN using fold cross validation
n_folds = 14
min_value, k_fold = perform_kNN_cross_validation_to_get_k(K, columns, n_folds, data, test)
# print(k)
# print(f'The best k value according to n fold cross validation: {k_fold}')
print(f"The minimum missclassification value is {min_value} at k = {k_fold} "
      f"according to n fold cross validation")

# Get best k for K-NN by doing 10 runs with different validation sets
n_runs = 10
min_error, k_10_val = perform_kNN_10_val_to_get_k(data, columns, n_runs, K)
print(f"The minimum missclassification value is {min_error} at k = {k_10_val} "
      f"according to average from n runs")
Which inputs to consider
# Implementation:
np.random.seed(1)  # For reproducibility
train = pd.read_csv('train.csv')
X = train.copy().drop(columns='Lead')
plt.figure(5)
pd.plotting.scatter_matrix(X, figsize=(40, 40))
plt.show()
Naive Classifier
# Implementation:
train = pd.read_csv('train.csv')

# Naive Classifier that always predicts lead gender as male
def naive_classifier(train):
    male_prediction = []
    for i in train['Lead']:
        male_prediction.append('Male')
    return male_prediction

# Function to calculate accuracy
def accuracy(y_prediction, y_validation):
    accuracy = np.mean(y_prediction == y_validation)
    return accuracy
Comparing the methods
# Performing k-fold cross-validation (with k=14) on all the chosen models to choose the best method for predicting Lead.
# Implementation:
from sklearn import tree
np.random.seed(1)  # For reproducibility
train = pd.read_csv('train.csv')
X = train[['Number words female', 'Number of words lead', 'Difference in words lead and co-lead',
           'Number of male actors', 'Year', 'Number of female actors', 'Number words male', 'Gross',
           'Mean Age Male', 'Mean Age Female', 'Age Lead', 'Age Co-Lead']]
y = train['Lead']
cross_validation = skl_ms.KFold(n_splits=14, shuffle=True, random_state=1)

# Linear Discriminant Analysis (LDA)
# Evaluate optimal LDA using k-fold cross-validation where k=14
LDA_model = skl_da.LinearDiscriminantAnalysis(solver='svd', tol=0.001, shrinkage=None)
Accuracy = skl_ms.cross_val_score(LDA_model, X, y, scoring='accuracy', cv=cross_validation)
print(f"Accuracy with LDA using k-fold cross-validation with k=14: {(np.mean(Accuracy)) * 100:.1f}%")

# Quadratic Discriminant Analysis (QDA)
# Evaluate optimal QDA using k-fold cross-validation where k=14
QDA_model = skl_da.QuadraticDiscriminantAnalysis(reg_param=0.2, tol=0.001,
                                                 store_covariance=True, priors=[0.25, 0.75])
Accuracy = skl_ms.cross_val_score(QDA_model, X, y, scoring='accuracy', cv=cross_validation)
print(f"Accuracy with QDA using k-fold cross-validation with k=14: {(np.mean(Accuracy)) * 100:.1f}%")

# Logistic Regression
# Evaluate optimal Logistic Regression using k-fold cross-validation where k=14
LogReg_model = skl_lm.LogisticRegression(solver='newton-cholesky', tol=0.0001, penalty='l2',
                                         multi_class='ovr', fit_intercept=False, C=1.0,
                                         max_iter=10000, random_state=0)
Accuracy = skl_ms.cross_val_score(LogReg_model, X, y, scoring='accuracy', cv=cross_validation)
print(f"Accuracy with Logistic Regression using k-fold cross-validation with k=14: {(np.mean(Accuracy)) * 100:.1f}%")

# Tree-based methods: Random Forest
# Evaluate optimal Random Forest using k-fold cross-validation where k=14
forest_model = skl_en.RandomForestClassifier(n_estimators=1000, min_samples_split=5,
                                             max_depth=None, max_samples=None, bootstrap=False,
                                             criterion='entropy', min_samples_leaf=5, random_state=0)
Accuracy = skl_ms.cross_val_score(forest_model, X, y, scoring='accuracy', cv=cross_validation)
print(f"Accuracy using Random Forest & k-fold cross-validation with k=14: {(np.mean(Accuracy)) * 100:.1f}%")

# Boosting: AdaBoost tree
# Evaluate optimal AdaBoost tree model using k-fold cross-validation where k=14
estimator = tree.DecisionTreeClassifier(min_samples_split=200, max_depth=5)
adaboost_model = skl_en.AdaBoostClassifier(estimator=estimator, n_estimators=1000,
                                           random_state=None, learning_rate=1.0)
Accuracy = skl_ms.cross_val_score(adaboost_model, X, y, scoring='accuracy', cv=cross_validation)
print(f"Accuracy using AdaBoost Tree & k-fold cross-validation with k=14: {(np.mean(Accuracy)) * 100:.1f}%")

# k-Nearest-Neighbor (kNN)
# Evaluate optimal kNN with k=5 using k-fold cross-validation where k=14
kNN_model = skl_nb.KNeighborsClassifier(n_neighbors=5)
Accuracy = skl_ms.cross_val_score(kNN_model, X, y, scoring='accuracy', cv=cross_validation)
print(f"Accuracy using kNN with k=5 & k-fold cross-validation with k=14: {(np.mean(Accuracy)) * 100:.1f}%")

# Naive Classifier that always predicts lead gender as male
y_validation = y
Accuracy = []
y_prediction = naive_classifier(train)
Accuracy.append(accuracy(y_prediction, y_validation))
Accuracy_percentage = np.mean(Accuracy) * 100
print(f"Accuracy of the Naive Classifier that predicts lead gender as male: {Accuracy_percentage:.1f}%")
Plotting misclassifications for implemented methods
from sklearn.preprocessing import LabelEncoder
from sklearn import tree

# Implementation:
estimator = tree.DecisionTreeClassifier(min_samples_split=200, max_depth=100)

# k-fold cross-validation for all chosen methods (except Naive Classifier) using k=14
models = [skl_da.LinearDiscriminantAnalysis(solver='svd', tol=0.001, shrinkage=None),
          skl_da.QuadraticDiscriminantAnalysis(reg_param=0.2, tol=0.001,
                                               store_covariance=True, priors=[0.25, 0.75]),
          skl_lm.LogisticRegression(solver='newton-cholesky', tol=0.0001, penalty='l2',
                                    multi_class='ovr', fit_intercept=False, C=1.0,
                                    max_iter=10000, random_state=0),
          skl_en.RandomForestClassifier(n_estimators=1000, min_samples_split=5,
                                        max_depth=None, max_samples=None, bootstrap=False,
                                        criterion='entropy', min_samples_leaf=5, random_state=0),
          skl_en.AdaBoostClassifier(estimator=estimator, n_estimators=1000,
                                    random_state=None, learning_rate=1.0),
          skl_nb.KNeighborsClassifier(n_neighbors=5)]

missclassification = np.zeros((14, len(models)))
for i, [train_index, validation_index] in enumerate(cross_validation.split(X)):
    X_train, X_validation = X.iloc[train_index], X.iloc[validation_index]
    y_train, y_validation = y.iloc[train_index], y.iloc[validation_index]
    for m in np.arange(0, 6):
        model = models[m]
        model.fit(X_train, y_train)
        prediction = model.predict(X_validation)
        missclassification[i, m] = np.mean(prediction != y_validation)
Predicting Lead using chosen model
# Implementation:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
X_train = train.copy().drop(columns=['Total words', 'Lead'])
X_test = test.copy().drop(columns=['Total words'])
y = train['Lead']
QDA_model = skl_da.QuadraticDiscriminantAnalysis(reg_param=0.2, tol=0.001,
                                                 store_covariance=True, priors=[0.25, 0.75])
QDA_model.fit(X_train, y)
QDA_predictions_categorical = QDA_model.predict(X_test)