Assignment 4

This document describes an assignment involving building a logistic regression model to predict extramarital affairs using demographic and survey data. Students are asked to perform variable selection, make predictions on new data, and evaluate a research paper involving predictive modeling.


WEEK 11 Activity (Assignment 4) – 14 marks

Group members: Serag Elganga, Mustafa Haider, Cole Lavoie

1) You previously used the dataset called “affairs”, which contains cross-sectional data from a survey conducted by
Psychology Today in 1969. The variables include:

Variable       Description
affairs        number of extramarital affairs within the past year
gender         factor indicating gender
age            age in years
yearsmarried   number of years in current marriage
children       whether the person has children in the current marriage
religiousness  religiousness scale: 1 = anti, 2 = not at all, 3 = slightly, 4 = somewhat, 5 = very
education      education scale: 9 = grade school, 12 = high school graduate, 14 = some college, 16 = college graduate, 17 = some graduate work, 18 = master’s degree, 20 = Ph.D., M.D., or other advanced degree
occupation     occupation coded according to Hollingshead classification
rating         self-rating of marriage: 1 = very unhappy, 2 = somewhat unhappy, 3 = average, 4 = happier than average, 5 = very happy

You used the glm() function and the “affairs” dataset to make a logistic regression model with a new variable
(binaryaffair) as the outcome and age, yearsmarried, religiousness, and rating as the significant predictors. You started
with a model that used all of the predictors except for occupation (Week 7 activity).

a) Start again with all of the predictors except for occupation. Use backward selection to identify the important
predictors, using p<0.05 for the significance level. Which predictors are retained in your final model? (2 marks)

ANSWER:

1. Gender
2. Age
3. Years married
4. Religiousness
5. Rating

b) Paste the script for your final model here. (1 mark)

ANSWER:

Assuming that the affairs dataset and the dplyr package have been loaded, this is the script for the final model:

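# Full model: every predictor except occupation
initial_model <- glm(binaryaffair ~ gender + age + yearsmarried + children +
                       religiousness + education + rating,
                     family = binomial, data = affairs)

# Backward selection. Note that step() removes terms by AIC, not by a
# p < 0.05 threshold, so check the p-values in the summary as well.
final_model <- step(initial_model, direction = "backward")

summary(final_model)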
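
Because step() selects by AIC, a strict p < 0.05 rule needs a different loop. The following is only a minimal sketch, not the group's script. It assumes a synthetic stand-in data frame (sim, simulated here and not the real survey data) and a hypothetical helper, backward_by_p(), that repeatedly drops the term with the largest likelihood-ratio p-value reported by drop1() until every remaining term is below the threshold.

```r
# Synthetic stand-in for the affairs data (NOT the real survey data).
set.seed(1)
n <- 600
sim <- data.frame(
  gender        = factor(sample(c("female", "male"), n, replace = TRUE)),
  age           = round(runif(n, 18, 57)),
  yearsmarried  = round(runif(n, 0, 15)),
  children      = factor(sample(c("no", "yes"), n, replace = TRUE)),
  religiousness = sample(1:5, n, replace = TRUE),
  education     = sample(c(9, 12, 14, 16, 17, 18, 20), n, replace = TRUE),
  rating        = sample(1:5, n, replace = TRUE)
)
# Assumed data-generating process, chosen only so that some predictors matter.
eta <- 1.5 - 0.04 * sim$age + 0.10 * sim$yearsmarried -
  0.30 * sim$religiousness - 0.45 * sim$rating
sim$binaryaffair <- rbinom(n, 1, plogis(eta))

# Hypothetical helper: backward elimination by p-value.
backward_by_p <- function(model, alpha = 0.05) {
  repeat {
    # Likelihood-ratio test for dropping each term in turn.
    tests    <- drop1(model, test = "Chisq")
    pvals    <- tests[["Pr(>Chi)"]][-1]   # first row is "<none>"
    terms_in <- rownames(tests)[-1]
    # Stop when nothing is left or every term is significant.
    if (length(pvals) == 0 || max(pvals) < alpha) break
    worst <- terms_in[which.max(pvals)]
    model <- update(model, as.formula(paste(". ~ . -", worst)))
  }
  model
}

full <- glm(binaryaffair ~ gender + age + yearsmarried + children +
              religiousness + education + rating,
            family = binomial, data = sim)
reduced <- backward_by_p(full)
print(formula(reduced))
```

With the real data, replacing sim with affairs would apply the same rule. The terms it keeps may differ from what step() keeps, because the two criteria are not equivalent.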

2) Can we use a model to make predictions in R? Can we find out the likelihood that someone has had an affair? An example
of how to do this (with the affairs dataset) can be found at the following website. Scroll down to the section called
“Predict the outcome using new data”. Your prediction script will not be the same, but you can use the code presented
to figure out how to do it. (Note that fit.reduced is the name of the model used in the example.)

https://towardsdatascience.com/how-to-do-logistic-regression-in-r-456e9cfec7cd

a) Let’s use sample data to make a prediction:

gender male
age 41
yearsmarried 7
children yes
religiousness 2
education 18
occupation (missing data – the coding system was confusing)
rating 5

Based on your model, what is the probability that the sample had an affair within the past year? (1 mark)

ANSWER:
The predicted probability is 0.1448248, or about 14%.

b) Paste your script here. If I already have the affairs dataset with the binaryaffair variable, I should be able to
paste your script into RStudio and get the same results that you have. (3 marks)

ANSWER:
Note: the model used here is the one created in question 3 e) of the Week 7 activity, where only the significant
variables were included. In the script, it is called “logistic_model_2”.

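# Model from the Week 7 activity: significant predictors only
logistic_model_2 <- glm(binaryaffair ~ age + yearsmarried + religiousness + rating,
                        family = binomial, data = affairs)

# Sample case; columns that are not in the model are ignored by predict()
newdataset <- data.frame(
  gender = "male",
  age = 41,
  yearsmarried = 7,
  children = "yes",
  religiousness = 2,
  education = 18,
  rating = 5
)

# type = "response" returns a probability rather than log-odds
newdataset$prob <- predict(logistic_model_2, newdata = newdataset,
                           type = "response")

print(newdataset$prob)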
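
To show where a predicted probability comes from, the sketch below reproduces predict(..., type = "response") by hand: it adds up the intercept and each coefficient times its predictor value to get the log-odds, then applies the inverse logit, 1 / (1 + exp(-log_odds)). It is an illustration under assumptions. The data frame sim is simulated (not the real survey data), so its coefficients and the resulting probability will not match the 0.1448248 reported above.

```r
# Synthetic stand-in for the affairs data (NOT the real survey data).
set.seed(2)
n <- 500
sim <- data.frame(
  age           = round(runif(n, 18, 57)),
  yearsmarried  = round(runif(n, 0, 15)),
  religiousness = sample(1:5, n, replace = TRUE),
  rating        = sample(1:5, n, replace = TRUE)
)
# Assumed data-generating process, for illustration only.
eta <- 1.5 - 0.04 * sim$age + 0.10 * sim$yearsmarried -
  0.30 * sim$religiousness - 0.45 * sim$rating
sim$binaryaffair <- rbinom(n, 1, plogis(eta))

fit <- glm(binaryaffair ~ age + yearsmarried + religiousness + rating,
           family = binomial, data = sim)

# Same predictor values as the sample case in the assignment.
new_case <- data.frame(age = 41, yearsmarried = 7,
                       religiousness = 2, rating = 5)

# Linear predictor (log-odds) computed by hand from the coefficients.
b <- coef(fit)
log_odds <- b[["(Intercept)"]] +
  b[["age"]] * 41 +
  b[["yearsmarried"]] * 7 +
  b[["religiousness"]] * 2 +
  b[["rating"]] * 5

# Inverse logit turns log-odds into a probability.
p_manual  <- 1 / (1 + exp(-log_odds))
p_predict <- unname(predict(fit, newdata = new_case, type = "response"))

print(c(manual = p_manual, predict = p_predict))
```

The two numbers agree because predict() performs the same calculation internally.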

c) Do you think that this prediction is valid? Do you have any concerns about the data used to make the model or
about the sample? Explain. (2 marks)

ANSWER:

The validity of this prediction depends on various factors, including the representativeness and quality of the data
used to train the model, potential biases in the sample, and the assumptions underlying the logistic regression
framework. I would have concerns about the data/sample if the original dataset is not sufficiently diverse or if there
are unaccounted-for variables that could influence the outcome but are not included in the model. Additionally, the
model's predictive accuracy should be assessed using independent validation data to ensure its generalizability to new
observations.
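
One way to check predictive accuracy on independent data is a simple hold-out split. The sketch below is an assumption-laden illustration rather than part of the assignment: it uses simulated data (sim), a 70/30 split, and a 0.5 classification cutoff, all chosen for illustration. It reports accuracy and a rank-based AUC computed on rows the model never saw during fitting.

```r
# Synthetic stand-in for the affairs data (NOT the real survey data).
set.seed(3)
n <- 800
sim <- data.frame(
  age           = round(runif(n, 18, 57)),
  yearsmarried  = round(runif(n, 0, 15)),
  religiousness = sample(1:5, n, replace = TRUE),
  rating        = sample(1:5, n, replace = TRUE)
)
# Assumed data-generating process, for illustration only.
eta <- 1.5 - 0.04 * sim$age + 0.10 * sim$yearsmarried -
  0.30 * sim$religiousness - 0.45 * sim$rating
sim$binaryaffair <- rbinom(n, 1, plogis(eta))

# Hold out 30% of the rows; the model is fitted on the other 70%.
test_idx <- sample(n, size = round(0.3 * n))
train <- sim[-test_idx, ]
test  <- sim[test_idx, ]

fit <- glm(binaryaffair ~ age + yearsmarried + religiousness + rating,
           family = binomial, data = train)
test$prob <- predict(fit, newdata = test, type = "response")

# Accuracy at an assumed cutoff of 0.5.
pred_class <- as.integer(test$prob > 0.5)
accuracy   <- mean(pred_class == test$binaryaffair)

# AUC as the share of (positive, negative) pairs in which the positive
# case receives the higher predicted probability; ties count as half.
pos <- test$prob[test$binaryaffair == 1]
neg <- test$prob[test$binaryaffair == 0]
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

print(c(accuracy = accuracy, auc = auc))
```

When the outcome is uncommon, accuracy alone can look good even for a model that predicts "no affair" for everyone, which is why the AUC is reported alongside it.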

3) For this last question, each group is required to go and find a new research paper that involves building a model with
multiple predictors. Each group’s paper must be unique; any group that selects the same paper as another group will be
given a 0. Papers that have been used for presentations are not eligible. If your paper does not allow you to answer the
following questions, I suggest finding another paper. Attach a PDF of the paper with your assignment submission.
For your model, answer:

a) If you had to remove one predictor/feature/variable from the model, which one would it be and why? (1 mark)

ANSWER:

Based on the research paper, if I had to remove one predictor/variable from the model, I would consider removing
"marijuana use" as a predictive feature for both generalized anxiety disorder (GAD) and major depressive disorder
(MDD). The inclusion of marijuana use as a predictor could introduce bias into the model and affect the validity of the
predictions. Factors such as varying frequencies of use, different strains of marijuana, and individual differences in
response to marijuana could complicate the interpretation of this variable. Additionally, the legality and social
acceptance of marijuana use may vary across different populations, further complicating its predictive value.

b) If you could replace one predictor/feature/variable in the model, which one would it be and why? (1 mark)

ANSWER:

If I could replace one predictor/feature/variable in the model described in the research paper, I would consider
replacing "satisfaction with living conditions" with a more objective measure of socioeconomic status or household
environment. While satisfaction with living conditions may capture some aspects of well-being, it is inherently
subjective and may not fully reflect the broader socioeconomic and environmental factors that can significantly impact
mental health outcomes such as GAD and MDD.

c) Critically evaluate the predictors in the model; what issues do you see with data collection, accuracy, careless
responding, or any other issues with data integrity or missingness? This could include biases in the sampled
population, for example. (2 marks)

ANSWER:

The predictive model described in the research paper exhibits several notable issues that warrant critical evaluation.
The reliance on a sample of undergraduate students from a single university introduces sampling bias, potentially
limiting the generalizability of the findings to the broader population. Self-reported measures, particularly for sensitive
variables like lifestyle behaviors, may be prone to response bias and inaccuracies. Missing data, ranging from <1% to
36%, raises concerns about data completeness and the potential impact of imputation methods on the results.
Feature engineering, while beneficial for enhancing model performance, introduces the risk of overfitting and spurious
correlations if not carefully validated. Moreover, the limited sensitivity and specificity of the screening questionnaire
for MDD and GAD may affect the accuracy of the predictive model. Confounding variables and the complexity of
machine learning models further complicate interpretation and generalizability.
