ST201 Project Report 2023 Mark72

The document describes a study that uses linear regression to analyze factors related to anxiety levels during the COVID-19 pandemic. Survey responses are used as data to build models identifying key predictors of variance in anxiety. Stepwise selection identified that older age and higher COVID risk relate to anxiety levels, though the effect of age is ambiguous. Models with 10 variables minimized criteria for model selection.


ST201 Project Coversheet

Project Title: Studying Generalized Anxiety Levels During COVID-19 Using Linear
Regression Analysis

Word count: 2483


Candidate IDs: 45448 46201 44320
Date: 4 May 2023

Permission to use as an example


We give permission for our assignment to be used as an example in ST201 (in electronic
form).
Abstract
Concerns have recently been raised about COVID-19-related mental health
issues. This study uses responses to a European survey to build linear
regression models that identify the factors driving variance in anxiety levels during the
pandemic. Stepwise feature selection was used to identify key features among the many
available predictors. Despite increased vulnerability to the disease, older age is found to
be related to lower anxiety levels during the pandemic. The perceived health and life risk of
COVID-19 is positively related to anxiety level. Whether this effect varies between age groups is
ambiguous.

Introduction

The COVID-19 pandemic and the large-scale mandates and restriction policies that came with it took
a heavy toll on the minds of many individuals. Saeed et al. (2022) found that pandemic-related
anxiety has become especially prevalent among healthcare workers, students, parents, and teachers.
A 2020 study showed that groups and professions at high COVID risk are more prone to anxiety and
even mental illness due to prolonged exposure and high perceived risks (Walton et al.). Research
published in 2021 also showed that older people face much greater pandemic-related difficulties,
not only because of the increased risk of severe symptoms of the virus, but also
because they are less adaptive in their lifestyle and can face increased difficulties when travel
restrictions are imposed (Lebrasseur et al.).

This study aims to identify 1) the extent to which age is associated with pandemic-related
anxiety, 2) the extent to which COVID risk is related to people’s anxiety, and whether
this relationship varies across age groups, and 3) the key factors that influence people’s anxiety
during the pandemic in general. This is done by building linear regression models and
using model-building techniques such as feature selection to identify key predictors of
generalized anxiety scores. Inference is then drawn from the linear regression models on
whether the predictors of interest are statistically significant.
Data and Exploratory Analysis

Dataset
The dataset contains 1115 observations of participants in a European survey on the COVID-19
pandemic. It records demographic information such as the sex, age, and education of the
participants, and their survey responses on the extent to which the COVID-19 pandemic has posed
difficulties in different aspects of their lives. (See Appendix A for the difficulties corresponding
to the pandemic difficulties variables.)

Data type         Variable
Numeric           Age, Pandemic_Difficulties_* (all), Covid_risk, Social_support
Factor            Sex, Education, IncomeContinuity, HealthStatus, Unemployed, Student
Outcome variable  Gad_score (numeric)

Descriptive statistics

Table 1 | Descriptive statistics for numeric variables. The youngest participant is 18 years old and
the oldest is 85 years old. Pandemic_Difficulties_* have been renamed to PD*. There are no missing
data. The Gad_score variable is standardized, with mean 0 and standard deviation around 1.

Table 2 | Descriptive statistics for categorical/dummy variables. Sex0 = Male, Sex1 = Female.
Dummy variables are generated for each category of Education and HealthStatus, both originally stored as
integers. Education2 = Vocational education, Education3 = Secondary education, Education4 = Post-secondary
education, Education5 = University education and above. The omitted group for Education is primary
education. HealthStatus3 = No pre-existing health condition, HealthStatus2 = Don’t know; the omitted
HealthStatus group is "Has pre-existing health condition". IncomeContinuity has 391 missing values; these
are coded as NA and treated as a separate category.

Exploratory Data Analysis

A correlation matrix of the numeric variables was constructed (Appendix
A) to check for perfect or near-perfect collinearity in the data. In particular, high
correlations were expected between the different types of pandemic difficulty, as an
individual who faces more difficulties in one aspect of the pandemic is expected to be more
prone to other pandemic difficulties. The correlation matrix showed positive pairwise
correlations between all pandemic difficulty levels, as expected, but no pairwise correlation
between the numeric variables exceeded 0.50. As a result, multicollinearity was not of particular
concern in the regression analysis.
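
The collinearity check described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used for the report; the function names and the toy data are hypothetical, and only the 0.50 threshold comes from the text.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def high_correlation_pairs(columns, threshold=0.5):
    """Return (name_a, name_b, r) for every pair whose |r| exceeds the threshold."""
    flagged = []
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > threshold:
            flagged.append((a, b, r))
    return flagged

# Hypothetical toy columns, not values from the survey.
toy = {
    "PD1": [1, 2, 3, 4, 2, 3],
    "PD2": [1, 2, 3, 4, 3, 2],
    "Age": [30, 60, 25, 40, 55, 35],
}
flagged = high_correlation_pairs(toy, threshold=0.5)
```

An empty result at the chosen threshold would correspond to the situation reported here, where no pair of numeric variables exceeds 0.50.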

Another concern was that the pandemic difficulties variables record discrete responses, so the true
functional form might not be linear in the pandemic difficulty value. In that case, generating
dummies for each level of difficulty (e.g., PD1==1, PD1==2, etc.) might better capture the true
functional form. Plotting Gad_score against the different pandemic difficulties (Appendix D)
showed that the increment by which Gad_score increases with each discrete jump (from 1 to 2,
from 2 to 3, etc.) in pandemic difficulties is mostly constant throughout, which supports treating
these variables as linear.

Continuous variables were plotted against Gad_score (Figure 1) to check functional form
assumptions. For COVID risk, age, and social support (Appendix C), the relationships with
Gad_score appear to be linear. The signs (positive for COVID risk, negative for age) were as expected.
There may be outliers among participants of very high age (Figure 1, b).
However, the high variance there can be explained by the small number of observations for individuals
beyond 70 years of age, combined with the high variance of anxiety scores within any age group.

Figure 1 | Relationship between Covid_risk/Age and anxiety score.

a, Mean anxiety score (Gad_score) plotted against Covid_risk. For easier visualization, the mean
of all Gad_score values in each Covid_risk group is plotted, computed in R by grouping the data
(group_by) and taking the mean.
b, Mean anxiety score plotted against Age. Similarly, the mean Gad_score is computed for each
age and plotted.
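
The group means plotted in Figure 1 can be sketched as below. This is an illustrative Python equivalent of a group-then-average step, not the authors' R code; the helper name and the toy values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def group_means(keys, values):
    """Mean of `values` within each distinct key, returned in sorted key order."""
    groups = defaultdict(list)
    for k, v in zip(keys, values):
        groups[k].append(v)
    return {k: mean(vs) for k, vs in sorted(groups.items())}

# Hypothetical toy data: Covid_risk level and standardized Gad_score.
covid_risk = [1, 1, 2, 2, 2, 3]
gad_score = [-0.5, -0.1, 0.0, 0.2, 0.4, 0.9]
means = group_means(covid_risk, gad_score)
```

Plotting the keys of `means` against its values gives one point per group, which is easier to read than a scatterplot of all individual responses.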
There is not enough reason to believe that higher-degree polynomial functional forms
exist in the population. Therefore, only linear terms of those variables are included in the first
stage of model building. The possibility that the true function contains higher-order terms is not
dismissed, as the pairwise plots in Figure 1 do not control for other features. If residual plots
during model building suggest otherwise, adding polynomial terms will be
considered.
Regression Analysis and Results

Initializing the model


The model building process began with an initial linear regression model (Appendix E)
containing all available features in the dataset. All variables were included because, for each
independent variable in the dataset, there are arguments that it should affect the generalized
anxiety level. The initial linear regression model yielded an $R^2$ value of 0.36, so 36% of the
variation in generalized anxiety level was explained by the model. However, the $R^2$ value is
likely inflated, as 30 regressors were included in this initial model. With OLS,
adding more regressors to a linear regression never lowers $R^2$. The adjusted $R^2$ value,
0.336, is more indicative of the model’s performance.
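
As a reminder of why adjusted $R^2$ is lower than $R^2$, the standard adjustment can be sketched as below. This is the textbook formula, written here as a small hypothetical helper; the inputs in the usage line are the rounded figures from the text, so the result need not reproduce the reported 0.336 exactly (the reported $R^2$ is rounded, and the exact regressor count after dummy expansion is an assumption).

```python
def adjusted_r2(r2, n_obs, n_regressors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1), with p regressors plus an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_regressors - 1)

# Rounded inputs taken from the text: R^2 = 0.36, n = 1115, 30 regressors (assumed count).
approx = adjusted_r2(0.36, 1115, 30)
```

The penalty grows with the number of regressors, so a model with many weak predictors is marked down relative to its raw $R^2$.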
Most coefficients in the initial regression were not statistically significant at the 5% level.
However, it is unwise to simply remove all variables that are not statistically significant, as 1) a set of
variables can appear statistically insignificant due to multicollinearity, and 2) some variables
may appear statistically significant by chance. Therefore, stepwise feature selection was used to
reduce the number of features and make the model more interpretable. Feature selection can also
reduce overfitting and allow better predictions for new data.
Feature Selection
Forward and backward selection were used to reduce the number of features (Appendix
E). Both are stepwise selection methods that aim to produce the
model that best fits the data with a reduced number of features.
Figure 2 | Adjusted $R^2$ and BIC of models with different numbers of features in stepwise selection.
a, Forward stepwise selection. The red line represents the BIC score and the blue line the adjusted $R^2$ at
each number of features. As shown by the respective dotted lines, n=10 minimizes BIC and n=18 maximizes
adjusted $R^2$.
b, Backward stepwise selection. As with forward selection, n=10 minimizes the BIC score, while n=17
maximizes adjusted $R^2$.

Stepwise feature selection gives, for every number of features, an approximation to the best set
of features (best subset selection would be required for the strictly best set). The BIC (Bayesian
Information Criterion; see footnote 1) score is used to choose the best model among the 30
models returned by each of forward and backward selection. The BIC penalizes
both poor fit and a high number of features. For both forward and backward
selection, the BIC was calculated for each number of features (Figure 2). $n = 10$ minimizes the BIC
for both forward and backward selection. Using adjusted $R^2$ as the criterion instead
would result in a model with more parameters and reduced interpretability, as shown in
Figure 2. At $n = 10$, forward and backward stepwise selection give the same linear
regression model, which also removes the need to choose between the two. (See Figure 3
for the model up to this stage.)
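
The procedure described above, forward stepwise selection followed by choosing the model size that minimizes BIC, can be sketched as follows. This is a self-contained Python illustration on synthetic data, not the R code used for the report. The BIC follows the formula in the report's footnote; estimating the error variance from the full model is an assumption, and all function names and data are hypothetical.

```python
import math
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def rss(features, y):
    """Residual sum of squares of an OLS fit of y on an intercept plus the given columns."""
    cols = [[1.0] * len(y)] + features
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)
    fitted = [sum(b * c[i] for b, c in zip(beta, cols)) for i in range(len(y))]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def forward_stepwise(columns, y):
    """Greedily add the feature that lowers RSS most; return (selected names, RSS) per size."""
    selected, path = [], []
    remaining = sorted(columns)
    while remaining:
        best = min(remaining, key=lambda name: rss([columns[s] for s in selected + [name]], y))
        selected.append(best)
        remaining.remove(best)
        path.append((list(selected), rss([columns[s] for s in selected], y)))
    return path

def bic(rss_value, n, d, sigma2):
    """BIC = (RSS + log(n) * d * sigma2) / n, as in the report's footnote."""
    return (rss_value + math.log(n) * d * sigma2) / n

# Synthetic data: y depends on x0 and x2 only; x1 and x3 are pure noise.
random.seed(0)
n = 200
columns = {f"x{j}": [random.gauss(0, 1) for _ in range(n)] for j in range(4)}
y = [3 * a - 2 * b + random.gauss(0, 0.1) for a, b in zip(columns["x0"], columns["x2"])]

path = forward_stepwise(columns, y)
sigma2 = path[-1][1] / (n - len(columns) - 1)  # assumption: variance estimated from the full model
scores = [bic(r, n, len(names), sigma2) for names, r in path]
best_names = path[scores.index(min(scores))][0]
```

Backward selection works the same way in reverse, starting from the full model and dropping the feature whose removal raises RSS least at each step.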

Figure 3 | Residual plots of linear regression model after feature selection.


The residual plots show that the linear functional form is suitable for most variables, except for
Pandemic_Difficulties_10 and Pandemic_Difficulties_11. Second-degree polynomial terms can be used
to better capture the functional form for these two variables.

Examining effects of covid risk on different age groups: adding interactions


By imposing a functional form with interactions on the linear regression model,
the association of a regressor with generalized anxiety levels is allowed to depend on other
regressors. In particular, whether the association between COVID risk and generalized anxiety levels
depends on a person’s age is of particular interest. Nonetheless, interactions of age with all other
variables are added to begin with. After this, feature selection is performed again with
the same methods to filter out interaction terms that do not significantly improve model fit.

Footnote 1: $\mathrm{BIC} = \frac{1}{n}\left(\mathrm{RSS} + \log(n)\, d\, \hat{\sigma}^2\right)$, where $\hat{\sigma}^2$ is the
estimate of the variance of the error term, $n$ is the number of observations, and $d$ is the number
of features.
Checking residual plots and adding second degree polynomial terms
Residual plots were constructed to ensure that the most suitable functional form is used in the
linear regression model (Appendix F). The linear functional form seemed suitable for all features
except Pandemic_Difficulties_10 and Pandemic_Difficulties_11, for which a
quadratic functional form seemed more suitable (Figure 3). Residual plots for the continuous
variables (Age, Social_support, and Covid_risk) support the earlier assumption that a linear
functional form for those variables is sufficient.
Since both Pandemic_Difficulties_10 and Pandemic_Difficulties_11 take discrete values
(survey responses of 1, 2, 3, or 4), arguments can be made for transforming those
variables into categorical/dummy variables. However, after a dummy variable is created for each
response, the less statistically significant dummies would be eliminated during
stepwise feature selection, leaving an incomplete set of response levels in the final
model. Such a model may be better at generating predictions but would be harder to
interpret.
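
The two codings discussed above for a 1 to 4 survey response can be contrasted with a small sketch. The helper names are hypothetical, and using response level 1 as the omitted reference category is an assumption made for illustration.

```python
def quadratic_terms(response):
    """Linear and squared term for a discrete response (two columns regardless of level count)."""
    return [response, response ** 2]

def dummy_terms(response, levels=(2, 3, 4)):
    """One indicator column per non-reference level; level 1 is the assumed reference group."""
    return [1 if response == level else 0 for level in levels]

# A response of 3 under each coding.
quad = quadratic_terms(3)
dummies = dummy_terms(3)
```

The quadratic coding always keeps the variable as a unit, whereas stepwise selection on the dummy coding can drop individual indicator columns, which is the interpretability concern raised in the text.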
Final model presentation

Table 3 | Models reached after stepwise selection at different stages.


The coefficients (standard error in parenthesis) suggest the direction and strength of the association
between the variables and Gad_score, the dependent variable. Asterisks indicate statistical significance,
with more asterisks indicating greater significance. For models with all features see Appendix G.

In Table 3, model (1) is the model obtained after stepwise feature selection from the OLS model
with linear terms for all variables and no interaction terms. After adding
interaction terms of Age with all other variables, models (2) and (3) are obtained through forward
and backward feature selection respectively; the two methods yielded very different choices of
features. Model (2) has relatively more intuitive coefficient interpretations, so second-degree terms of
Pandemic_Difficulties_10 and Pandemic_Difficulties_11 are further added to arrive at model
(4).
Assessing model accuracy with cross validation

*Final linear model for inference


Table 4 | Cross-validation mean squared errors for tested linear models.

Cross-validation was carried out to examine the prediction accuracy of the linear
models in the model building process. The mean squared error on the cross-validation test sets was
computed at $k = 5$, $k = 10$, and $k = 1115$ (LOOCV) (Table 4). The initial linear
model with all predictors (OLS) has marginally higher cross-validation MSEs across the board
than the feature-selected model based on the same set of variables. When
interaction terms with the Age predictor were added (‘OLS+interaction terms’, Table 4), the CV MSE
increased, suggesting overfitting to the training data. Once feature selection was applied again,
there was a larger drop in MSE, which suggests that some of the added interactions are useful and
that there may be collinearity among the interaction terms.
Figure 4 provides a visual comparison of the cross-validation performance of the linear
models. Models with fewer predictors generally outperformed those with more predictors. The
feature-selected model with interaction and polynomial terms (model (4) in Table 3) has the best
test performance.
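
The k-fold procedure can be sketched as below for a one-predictor linear model. This is a simplified Python illustration, not the code used for the report: the fold assignment is a deterministic interleaved split (an assumption; in practice observations are usually shuffled first), the error is pooled over all held-out observations, and the function names and data are hypothetical. Setting k equal to the number of observations gives leave-one-out cross-validation.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope

def kfold_mse(x, y, k):
    """Mean squared prediction error pooled over all held-out observations in k folds."""
    n = len(x)
    squared_errors = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th observation forms one fold
        train = [(x[i], y[i]) for i in range(n) if i not in test_idx]
        intercept, slope = fit_line([a for a, _ in train], [b for _, b in train])
        squared_errors += [(y[i] - (intercept + slope * x[i])) ** 2 for i in sorted(test_idx)]
    return sum(squared_errors) / n

# Hypothetical toy data: an exactly linear response and a perturbed one.
x = list(range(20))
y_exact = [2 * a + 1 for a in x]
y_noisy = [2 * a + 1 + (1 if a % 2 else -1) for a in x]

mse_exact_5fold = kfold_mse(x, y_exact, 5)
mse_exact_loocv = kfold_mse(x, y_exact, len(x))
mse_noisy_5fold = kfold_mse(x, y_noisy, 5)
```

Because each observation is predicted by a model that never saw it, a model that merely memorizes noise in its training folds shows a higher CV MSE, which is the pattern described for the model with all interaction terms.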

Figure 4 | Cross-validation mean squared errors for tested linear models at k=5, k=10, and k=1115.
From left to right are the CV MSEs of successive iterations of the linear regression model. The vertical axis
begins at 0.6 rather than zero to make the comparison between models clearer. The true differences in MSE
between the models are smaller than the figure may suggest.
Discussion and Limitations

Potential omitted confounding variables

The goal of the linear model in this paper is to explain the relationship between
generalized anxiety during COVID-19 and the available predictors. One sign that there might be
omitted confounding variables is that all pandemic difficulty levels are positively
correlated pairwise (see Appendix A). An omitted confounding variable could be driving
variation in different aspects of pandemic difficulties at the same time. For instance, a person
with low income could be subject to increased difficulties in both 1) Pandemic_Difficulties_3:
increased number of daily duties and 2) Pandemic_Difficulties_10: feeling of uncertainty,
unpredictability of the situation.

Missing data / data structure limitations

The only missing data in the raw dataset were 391 missing values for IncomeContinuity.
For the regression, a separate category of IncomeContinuity was created to account for the
missing values. Note that IncomeContinuity was filtered out by feature selection in
the final models, which could be due to its missing values.
The dataset offers a range of numeric predictors, yet most of those predictors are discrete
integers. In a linear model, for each of those discrete predictors (16 of which are
Pandemic_Difficulties responses), the strong assumption had to be made that each discrete jump has
the same marginal association with anxiety level, which likely does not hold exactly. For this study it is
assumed that the relationship is linear, apart from quadratic associations for two of these
predictors.

Trade-off between prediction and interpretation

The linear regression model in this paper is built primarily for inference and for giving policy
suggestions. Therefore, a degree of predictive performance was sacrificed in
favor of better interpretability. A model with better test performance would most likely feature
dummy variables for each discrete response to Pandemic_Difficulties. A better prediction model
would also have more, if not all, available interaction terms. Adding those extra variables was
considered for this paper; however, using this approach along with feature selection would leave
behind some predictors (whether arbitrary interaction terms or an isolated dummy such as
Pandemic_Difficulties_1==4) that have no good interpretation. Nonetheless, the model in this
paper achieves better test performance than the initial model while having a reduced number of
features and relatively easy inference.
Interpretation & Conclusions

Interpretation of the results

Association between people’s generalized anxiety and their age:

Models (3) and (4) from Table 3 can be used to explain the association between age and
people’s generalized anxiety. Both models identify a negative relationship, with coefficients of
-0.028 in model (3) and -0.018 in model (4). The coefficients are statistically significant at the
1% level, meaning that if there were no association between age and generalized anxiety, results
at least this extreme would occur less than 1% of the time. The coefficient of -0.018 for age in model (4)
suggests that each additional year of age is associated with a 0.018 standard
deviation decrease in predicted anxiety score. Additionally, the association between age and
anxiety levels depends strongly on how much of the different pandemic difficulties the
person experiences. For instance, for people who report very high Pandemic_Difficulties_11 (feeling of
danger, anxiety associated with the spread of the virus), an additional year of age is not
associated with as large a decrease in generalized anxiety score. Linear model (3) features even
more interaction terms between age and pandemic difficulties. It can nevertheless be said from
model (3) that higher COVID risk and more pandemic difficulty seem to weaken the degree to
which additional age is associated with lower anxiety scores.

Relationship between perceived health and life risks of COVID-19 and people’s mental health
status

Looking at the coefficient of Age:Covid_risk in model (3) and of Covid_risk in model (4) in
Table 3, the linear regression models suggest that higher perceived health and life risks of
COVID-19 are generally associated with higher anxiety, and thus worse mental health status on
average.

Comparing model (3) to model (4), it is ambiguous whether age plays a role in how COVID
risk is associated with anxiety scores. In model (3), the coefficient on the interaction term
between COVID risk and age (Age:Covid_risk) is 0.003 and statistically significant. Thus, the
association between COVID risk and anxiety scores seems to depend on age, with each additional
year of age contributing a 0.003 standard deviation increase in how the anxiety score responds to
one additional point of COVID risk. For instance, one additional point of COVID risk score is
associated with an increase in anxiety score of 0.003 × 48 = 0.144 standard deviations for a
48-year-old, while for an 18-year-old the increase is only 0.003 × 18 = 0.054 standard deviations. (The base
Covid_risk term is not present in model (3).) Unfortunately, it cannot be said for certain whether
age does or does not play a role in the relationship between COVID risk and people’s anxiety, as
the linear regression models support either assumption.
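
The worked figures above can be written as a marginal effect. This is a sketch under the assumption, implied by the arithmetic in the text, that Age:Covid_risk is the only term in model (3) that involves Covid_risk; the symbol $\hat{\beta}_{\text{Age:Covid\_risk}}$ is notation introduced here for that coefficient.

```latex
\frac{\partial\, \widehat{\text{Gad\_score}}}{\partial\, \text{Covid\_risk}}
  = \hat{\beta}_{\text{Age:Covid\_risk}} \times \text{Age}
  = 0.003 \times \text{Age}

\text{Age} = 48:\quad 0.003 \times 48 = 0.144
\qquad
\text{Age} = 18:\quad 0.003 \times 18 = 0.054
```

Under this form the association between COVID risk and anxiety is zero at age zero and grows linearly with age, which is why the absence of a base Covid_risk term matters for the interpretation.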

Policy suggestions

Model (4) in Table 3 shows that among the most significant pandemic difficulties are
Pandemic_Difficulties_5 (Feeling that freedom as a human is restricted),
Pandemic_Difficulties_10 (Feeling of uncertainty, unpredictability of the situation),
Pandemic_Difficulties_11 (Feeling of danger, anxiety associated with the spread of the virus),
Pandemic_Difficulties_13 (Boredom, monotony), and Pandemic_Difficulties_16 (Feeling lost in
the recommendations and restrictions), out of which Pandemic_Difficulties_10 and
Pandemic_Difficulties_11 are negatively associated with anxiety levels. One likely explanation is
that, ceteris paribus (holding COVID risk constant), greater attention to the spread of the virus
might be associated with increased vigilance about the situation, which correlates with lower anxiety
overall.
If policies were to be implemented to reduce anxiety levels during the pandemic,
they should aim at 1) reducing restrictions on freedom, 2) subsidizing entertainment to reduce
boredom, and 3) giving clearer and more accessible recommendations, restrictions, and instructions.
The age groups that a policy affects should also be considered, as the model shows that
the effect of Pandemic_Difficulties_13 (Boredom, monotony), for instance, is stronger at
younger ages. Therefore, policy makers could choose to subsidize a form of entertainment that
particularly appeals to younger people, for example, the video game industry.
While COVID mandates (mask mandates, travel restrictions) might correlate with higher
pandemic difficulties as well, policy makers can aim to enforce penalties for spreading
misinformation and fund public infographics to reduce unnecessary fear of the virus, so that
perceived COVID risks are lower.
References

Lebrasseur, Audrey, et al. “Impact of the COVID-19 Pandemic on Older Adults: Rapid Review.”
JMIR Aging, vol. 4, no. 2, 2021, doi:10.2196/26474.

Saeed, Hafsah, et al. “Anxiety Linked to Covid-19: A Systematic Review Comparing Anxiety
Rates in Different Populations.” International Journal of Environmental Research and
Public Health, vol. 19, no. 4, 2022, p. 2189, doi:10.3390/ijerph19042189.

Walton, Matthew, et al. “Mental Health Care for Medical Staff and Affiliated Healthcare
Workers during the COVID-19 Pandemic.” European Heart Journal: Acute
Cardiovascular Care, vol. 9, no. 3, 2020, pp. 241–247, doi:10.1177/2048872620922795.
Appendix A. Correlation matrix of continuous variables in the raw dataset
Appendix B. Distribution histogram of Gad_score
Appendix C. Mean anxiety score level plotted against each level of social support received.
Appendix D. Mean anxiety score at different levels of pandemic difficulties
Appendix E. Linear regression model building with models featuring all predictors included.
