Videos and Tutorials On Data Analysis in The Psychometrics Lab
tidyverse
psyntur
car
lm.beta
Check your package listing to see if these are all installed. If not, install these packages using the “Install” button in the “Packages” tab in RStudio, or with the install.packages command at the console. To check which version of an installed package you have, use packageVersion, for example:
packageVersion("psyntur")
## [1] '0.1.0'
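If installing from the console, a sketch of the install.packages call for these four packages would be as follows (all four are available on CRAN):

```r
# Install the required packages from CRAN (only needed once per machine)
install.packages(c("tidyverse", "psyntur", "car", "lm.beta"))
```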
You should put all your analysis code in one R script. This script should contain all and only the code for the
analysis. In other words, everything you need to do every step of the analysis, including the reading of the data,
should be there, and there should be no unnecessary code. Keep your code clean and well organized. Use
“sections” in the script to organize your code into regions of similar code. Sections can be inserted using the “Insert
Section” item in RStudio’s “Code” menu.
Once installed, you must then load the required packages using the library function as follows:
library(tidyverse)
library(psyntur)
library(car)
library(lm.beta)
> getwd()
[1] "/home/andrews/psychometrics"
If I had a file named psychometrics_lab_data.csv and had moved it into this folder, then in my R script, I could do the
following:
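The read_csv call itself is not reproduced in the original; given the file name and data frame name mentioned in the text, it would presumably look like this:

```r
library(tidyverse)  # provides read_csv

# Read the csv file from the working directory into a data frame named lab_data
lab_data <- read_csv("psychometrics_lab_data.csv")
```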
In general, whenever we read a csv file into R using the command read_csv, the data is returned as an R data
frame. In the above example, I named this data frame lab_data. If I type lab_data, I will then see the data.
Alternatively, if I were to type glimpse(lab_data), I would get a more useful view of it (see next example).
For the purposes of this guide, I will read in an example .csv file from a URL rather than from a file on my
local computer, and I will give the data frame that is returned the name psymetr_df.
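The original URL is not shown here; a sketch with a placeholder address (the URL below is hypothetical):

```r
library(tidyverse)

# read_csv accepts a URL as well as a local file path;
# the address below is a placeholder, not the real data location
psymetr_df <- read_csv("https://example.com/psymetr_data.csv")
```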
glimpse(psymetr_df)
## Rows: 44
## Columns: 52
## $ gender <dbl> 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 1, 2, 2, 2, 1, 1, 1, 1, 2…
## $ age <dbl> 19, 22, 20, 20, 19, 21, 19, 18, 22, 23, 19, 18, 19, 21,…
## $ anxiety_1 <dbl> 1, 2, 2, 2, 0, 1, 2, 2, 1, 1, 2, 2, 3, 3, 2, 2, 1, 2, 1…
## $ anxiety_2 <dbl> 3, 3, 1, 2, 1, 2, 2, 2, 2, 2, 3, 1, 2, 2, 2, 1, 2, 2, 3…
## $ anxiety_3 <dbl> 1, 3, 2, 3, 1, 1, 3, 1, 1, 2, 2, 1, 2, 2, 1, 2, 3, 2, 2…
## $ anxiety_4 <dbl> 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 2, 3, 4…
## $ anxiety_5 <dbl> 2, 1, 2, 1, 1, 2, 2, 1, 1, 2, 3, 1, 3, 2, 2, 1, 3, 2, 1…
## $ anxiety_6 <dbl> 2, 3, 2, 2, 3, 3, 1, 2, 2, 3, 2, 2, 1, 2, 3, 3, 4, 2, 2…
## $ anxiety_7 <dbl> 2, 2, 2, 2, 3, 3, 0, 2, 2, 3, 2, 3, 1, 3, 2, 2, 2, 1, 2…
## $ anxiety_8 <dbl> 2, 3, 1, 2, 2, 2, 1, 2, 3, 3, 2, 2, 1, 3, 1, 1, 1, 1, 2…
## $ anxiety_9 <dbl> 2, 3, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 3, 2, 2, 1…
## $ anxiety_10 <dbl> 1, 2, 1, 3, 3, 2, 2, 2, 3, 2, 2, 2, 1, 2, 1, 1, 2, 1, 1…
## $ depression_1 <dbl> 2, 2, 2, 3, 2, 1, 3, 4, 1, 2, 2, 1, 3, 3, 4, 2, 2, 4, 3…
## $ depression_2 <dbl> 1, 2, 2, 2, 2, 1, 3, 2, 2, 1, 1, 2, 1, 2, 1, 1, 2, 1, 2…
## $ depression_3 <dbl> 3, 3, 2, 3, 2, 2, 3, 2, 3, 3, 2, 1, 1, 2, 3, 2, 3, 2, 2…
## $ depression_4 <dbl> 1, 1, 4, 2, 1, 3, 3, 2, 2, 1, 1, 2, 3, 2, 1, 1, 2, 2, 3…
## $ depression_5 <dbl> 1, 3, 2, 3, 2, 1, 4, 2, 2, 2, 2, 1, 2, 2, 2, 2, 4, 2, 3…
## $ depression_6 <dbl> 1, 2, 3, 2, 1, 2, 2, 2, 1, 3, 3, 2, 1, 2, 3, 2, 3, 4, 3…
## $ depression_7 <dbl> 1, 1, 3, 2, 2, 3, 5, 2, 1, 1, 4, 1, 3, 2, 2, 4, 2, 4, 3…
## $ depression_8 <dbl> 3, 5, 2, 4, 4, 2, 2, 4, 3, 4, 2, 4, 2, 4, 4, 4, 3, 3, 4…
## $ depression_9 <dbl> 3, 4, 4, 4, 5, 4, 3, 4, 3, 4, 4, 4, 3, 3, 4, 2, 4, 1, 5…
## $ depression_10 <dbl> 2, 5, 2, 4, 5, 2, 4, 3, 2, 5, 3, 4, 2, 4, 4, 4, 1, 1, 5…
## $ efficacy_1 <dbl> 2, 1, 1, 2, 4, 3, 2, 3, 2, 3, 1, 2, 2, 2, 3, 3, 2, 3, 2…
## $ efficacy_2 <dbl> 2, 1, 3, 3, 4, 2, 2, 1, 3, 3, 2, 1, 2, 2, 2, 2, 2, 2, 3…
## $ efficacy_3 <dbl> 3, 1, 2, 3, 4, 2, 1, 1, 3, 1, 2, 2, 2, 1, 2, 2, 1, 1, 1…
## $ efficacy_4 <dbl> 3, 2, 2, 5, 3, 3, 1, 1, 3, 2, 1, 2, 2, 2, 3, 2, 2, 3, 2…
## $ efficacy_5 <dbl> 1, 1, 3, 4, 3, 3, 2, 1, 3, 3, 3, 2, 2, 1, 2, 2, 4, 3, 2…
## $ efficacy_6 <dbl> 2, 2, 2, 4, 3, 2, 2, 2, 3, 1, 1, 3, 3, 2, 2, 2, 2, 2, 1…
## $ efficacy_7 <dbl> 3, 4, 4, 4, 3, 2, 5, 5, 2, 4, 5, 4, 2, 5, 3, 3, 3, 5, 4…
## $ efficacy_8 <dbl> 3, 4, 5, 4, 1, 2, 3, 5, 4, 4, 5, 4, 2, 4, 3, 3, 3, 4, 3…
## $ efficacy_9 <dbl> 5, 2, 4, 4, 3, 3, 4, 4, 2, 3, 4, 4, 2, 3, 4, 4, 4, 2, 5…
## $ efficacy_10 <dbl> 4, 5, 4, 3, 5, 5, 3, 2, 3, 3, 5, 3, 4, 4, 5, 3, 4, 4, 4…
## $ sociability_1 <dbl> 1, 2, 4, 4, 4, 1, 2, 4, 5, 5, 4, 3, 3, 4, 2, 3, 3, 2, 5…
## $ sociability_2 <dbl> 1, 2, 1, 2, 4, 1, 2, 4, 4, 5, 5, 1, 3, 2, 5, 3, 3, 1, 5…
## $ sociability_3 <dbl> 4, 3, 1, 2, 3, 5, 3, 2, 2, 2, 3, 4, 4, 4, 2, 3, 1, 3, 4…
## $ sociability_4 <dbl> 1, 3, 2, 1, 3, 2, 3, 3, 3, 1, 4, 3, 2, 3, 1, 3, 2, 4, 3…
## $ sociability_5 <dbl> 3, 5, 5, 2, 2, 5, 3, 4, 2, 1, 4, 2, 5, 2, 2, 4, 3, 2, 4…
## $ sociability_6 <dbl> 2, 3, 4, 3, 2, 2, 4, 3, 1, 1, 1, 2, 2, 2, 1, 1, 3, 5, 3…
## $ sociability_7 <dbl> 4, 5, 5, 4, 3, 1, 5, 3, 4, 1, 4, 4, 5, 2, 2, 1, 2, 4, 1…
## $ sociability_8 <dbl> 1, 3, 4, 1, 1, 2, 2, 2, 4, 4, 1, 5, 5, 1, 2, 1, 2, 5, 3…
## $ sociability_9 <dbl> 2, 5, 5, 1, 3, 5, 5, 2, 3, 3, 1, 4, 4, 2, 4, 2, 2, 5, 4…
## $ sociability_10 <dbl> 3, 2, 4, 3, 3, 2, 4, 4, 1, 2, 3, 2, 4, 1, 3, 1, 4, 3, 1…
## $ stress_1 <dbl> 1, 4, 2, 1, 1, 0, 3, 2, 2, 2, 4, 2, 3, 1, 1, 0, 3, 4, 1…
## $ stress_2 <dbl> 2, 4, 1, 0, 1, 2, 3, 3, 2, 1, 3, 3, 2, 3, 3, 1, 1, 2, 4…
## $ stress_3 <dbl> 1, 2, 3, 1, 2, 2, 3, 1, 3, 2, 4, 3, 2, 1, 2, 0, 0, 4, 3…
## $ stress_4 <dbl> 4, 1, 0, 4, 4, 1, 2, 1, 4, 2, 1, 1, 0, 2, 3, 3, 3, 2, 0…
## $ stress_5 <dbl> 0, 2, 0, 3, 3, 4, 1, 0, 3, 3, 1, 0, 0, 0, 1, 4, 0, 0, 3…
## $ stress_6 <dbl> 0, 4, 3, 3, 1, 1, 3, 1, 2, 0, 4, 2, 3, 1, 3, 2, 1, 2, 2…
## $ stress_7 <dbl> 1, 0, 3, 2, 4, 2, 0, 0, 4, 2, 0, 4, 0, 1, 1, 2, 1, 1, 2…
## $ stress_8 <dbl> 0, 2, 3, 1, 2, 4, 2, 3, 3, 2, 0, 1, 1, 0, 1, 3, 0, 0, 3…
## $ stress_9 <dbl> 1, 2, 2, 1, 2, 3, 3, 1, 3, 0, 4, 0, 4, 3, 0, 1, 2, 3, 4…
## $ stress_10 <dbl> 0, 2, 2, 0, 0, 1, 4, 4, 1, 1, 3, 4, 3, 3, 3, 1, 1, 4, 2…
Note: As we can see, in this data, there is a consistent naming pattern for each item on each scale. For example, the
“anxiety” scale items are anxiety_1, anxiety_2, and so on, the “depression” scale items are depression_1,
depression_2, and so on. It is necessary for you to use a consistent naming scheme like this for your data.
In the following code, for each item that needs to be reverse coded, we have one line of code that names the item
and gives its original values and the new values to which they are mapped.
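The recoding code itself is not reproduced in the original. A sketch of the general pattern, using mutate and recode from the tidyverse (the item names and value mappings below are purely illustrative; substitute your own reverse-coded items and scale values):

```r
library(tidyverse)

# One line per reverse-coded item: name the item and map each
# original value to its new value. Items and mappings here are
# hypothetical examples only.
psymetr_df_fix <- mutate(psymetr_df,
  efficacy_7 = recode(efficacy_7, `1` = 5, `2` = 4, `3` = 3, `4` = 2, `5` = 1),
  efficacy_8 = recode(efficacy_8, `1` = 5, `2` = 4, `3` = 3, `4` = 2, `5` = 1)
)
```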
Be careful with this code. Check every item to make sure the item name is correct and the original and new values
are correct.
Remember to assign the result to a new data frame. In the code above, after the recoding is done, the new data frame
produced is named psymetr_df_fix. We use this data frame from now on.
For each scale, we want to calculate Cronbach’s alpha measure of internal consistency. We do this using the
cronbach function in the psyntur package.
Remember to use the data frame where the items have been recoded. For example, in my case, this is
psymetr_df_fix.
In the following calculations, we select the items for each scale using the starts_with function. This assumes that
all items for each scale begin with a common prefix, which they do in my case, as mentioned above. For example,
all the items on the stress scale begin with stress_, and all the items on the depression scale begin with
depression_, and so on. For each set of items that is selected, the cronbach function will return the estimate of the
\(\alpha\) coefficient and its 95% confidence interval.
cronbach(psymetr_df_fix,
anxiety = starts_with('anxiety_'),
depression = starts_with('depression_'),
efficacy = starts_with('efficacy_'),
sociability = starts_with('sociability_'),
stress = starts_with('stress_')
)
## # A tibble: 5 × 4
## scale alpha ci_lo ci_hi
## <chr> <dbl> <dbl> <dbl>
## 1 anxiety 0.620 0.452 0.788
## 2 depression 0.734 0.620 0.848
## 3 efficacy 0.706 0.577 0.835
## 4 sociability 0.634 0.473 0.794
## 5 stress 0.834 0.761 0.907
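The table of per-scale scores shown next appears to have been produced by the total_scores function with mean aggregation, which averages the items within each scale for each participant; a sketch:

```r
# Calculate each participant's mean score on each scale
total_scores(psymetr_df_fix,
  anxiety = starts_with('anxiety_'),
  depression = starts_with('depression_'),
  efficacy = starts_with('efficacy_'),
  sociability = starts_with('sociability_'),
  stress = starts_with('stress_'),
  .method = 'mean'
)
```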
## # A tibble: 44 × 5
## anxiety depression efficacy sociability stress
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 2 2.2 3.2 1.6
## 2 1.8 1.8 1.7 2.3 2.9
## 3 2.1 2.8 2 1.9 2.3
## 4 1.8 2.3 3 3.5 1.2
## 5 1.1 1.6 3.3 3.6 1
## 6 1.6 2.3 2.7 3 1.4
## 7 2.5 3.2 1.9 2.3 3
## 8 1.8 2.3 1.7 3.1 2.4
## 9 1.5 2.2 3 3.5 1.5
## 10 1.5 1.8 2.3 4.1 1.3
## # … with 34 more rows
Should we calculate the mean or the sum of all the items’ values? If we have missing values, the sum can be
misleading. For example, if we have 10 items on a 5 point scale, the maximum total score is 50. If a person
answered only 8 of the 10 items, but gave the maximum response of 5 on each, their sum would be 40 and their
mean would be 5.0. The mean shows that they have on average the maximum score, but this is not apparent from
the sum. However, sometimes people want to report the sum, though don’t want it affected by any missing values.
A way of doing this is to multiply the mean, calculated after the missing values have been removed, by the number
of items. For example, in the example just mentioned, we could multiply the mean of 5.0 by 10 to get 50. This can
be done in the total_scores function by setting .method = 'sum_like' as follows.
total_scores(psymetr_df_fix,
anxiety = starts_with('anxiety_'),
depression = starts_with('depression_'),
efficacy = starts_with('efficacy_'),
sociability = starts_with('sociability_'),
stress = starts_with('stress_'),
.method = 'sum_like'
)
## # A tibble: 44 × 5
## anxiety depression efficacy sociability stress
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 20 20 22 32 16
## 2 18 18 17 23 29
## 3 21 28 20 19 23
## 4 18 23 30 35 12
## 5 11 16 33 36 10
## 6 16 23 27 30 14
## 7 25 32 19 23 30
## 8 18 23 17 31 24
## 9 15 22 30 35 15
## 10 15 18 23 41 13
## # … with 34 more rows
The total_scores function uses the same aggregation method for all the variables. Sometimes, however, you
might like to calculate, for example, the mean for some variables and the sum (or sum_like) for other variables. To
do this, you must use the total_scores function twice, once for one set of variables, and then a second time for
another set of variables. The resulting two data frames can be bound together using bind_cols. In the following
example, we calculate the mean for the anxiety and depression scores, and the sum (sum_like) for the remaining
three variables, and then we bind them together with bind_cols.
bind_cols(
total_scores(psymetr_df_fix,
anxiety = starts_with('anxiety_'),
depression = starts_with('depression_'),
.method = 'mean'),
total_scores(psymetr_df_fix,
efficacy = starts_with('efficacy_'),
sociability = starts_with('sociability_'),
stress = starts_with('stress_'),
.method = 'sum_like')
)
## # A tibble: 44 × 5
## anxiety depression efficacy sociability stress
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 2 22 32 16
## 2 1.8 1.8 17 23 29
## 3 2.1 2.8 20 19 23
## 4 1.8 2.3 30 35 12
## 5 1.1 1.6 33 36 10
## 6 1.6 2.3 27 30 14
## 7 2.5 3.2 19 23 30
## 8 1.8 2.3 17 31 24
## 9 1.5 2.2 30 35 15
## 10 1.5 1.8 23 41 13
## # … with 34 more rows
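The sections that follow operate on a data frame named psymetr_df_total, whose creation is not shown in the original. Given the descriptives reported below (all on the item scale rather than a summed scale), it was presumably created by assigning the mean-based total scores, along these lines:

```r
# Assign per-participant mean scale scores to a named data frame
# for use in the descriptive and regression analyses that follow
psymetr_df_total <- total_scores(psymetr_df_fix,
  anxiety = starts_with('anxiety_'),
  depression = starts_with('depression_'),
  efficacy = starts_with('efficacy_'),
  sociability = starts_with('sociability_'),
  stress = starts_with('stress_'),
  .method = 'mean'
)
```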
Calculating descriptives
For each variable, we can get back the mean, standard deviation, or any other descriptive statistic, as follows:
describe_across(psymetr_df_total,
variables = c(stress, anxiety, depression, efficacy, sociability),
functions = list(avg = mean, stdev = sd),
pivot = TRUE)
## # A tibble: 5 × 3
## variable avg stdev
## <chr> <dbl> <dbl>
## 1 stress 2.16 0.826
## 2 anxiety 1.85 0.458
## 3 depression 2.21 0.506
## 4 efficacy 2.25 0.488
## 5 sociability 3.08 0.615
We can make the above code a little simpler by using the everything() function as the value of the variables
argument. In the code above, we individually selected each of the variables in the data set. If instead we use
everything(), all the variables are selected automatically.
describe_across(psymetr_df_total,
variables = everything(),
functions = list(avg = mean, stdev = sd),
pivot = TRUE)
## # A tibble: 5 × 3
## variable avg stdev
## <chr> <dbl> <dbl>
## 1 anxiety 1.85 0.458
## 2 depression 2.21 0.506
## 3 efficacy 2.25 0.488
## 4 sociability 3.08 0.615
## 5 stress 2.16 0.826
If there are any missing values in psymetr_df_total, we will get NA values in the table of results from
describe_across. To avoid this, we can use counterparts of mean and sd that remove missing values before they
calculate the results. These are mean_xna and sd_xna, respectively. The following code uses these, but in this case,
because there were no missing values in the data, nothing changes in the table.
describe_across(psymetr_df_total,
variables = everything(),
functions = list(avg = mean_xna, stdev = sd_xna),
pivot = TRUE)
## # A tibble: 5 × 3
## variable avg stdev
## <chr> <dbl> <dbl>
## 1 anxiety 1.85 0.458
## 2 depression 2.21 0.506
## 3 efficacy 2.25 0.488
## 4 sociability 3.08 0.615
## 5 stress 2.16 0.826
We can calculate the correlation matrix of the scale scores using the cor function:
cor(psymetr_df_total)
Note. If we had NA values in psymetr_df_total, we would have to remove these first before we calculate the
correlation matrix. We would do this with the following version of the cor command.
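That version of the command is not reproduced in the original; the standard way to handle missing values in cor is its use argument, for example:

```r
# Compute the correlation matrix using only those rows that have
# no missing values on any of the variables
cor(psymetr_df_total, use = 'complete.obs')
```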
We can make a scatterplot matrix using the command scatterplot_matrix from psyntur.
scatterplot_matrix(psymetr_df_total,
anxiety,
depression,
efficacy,
sociability,
stress)
Regression analysis
Regression
We do the multiple regression by indicating the outcome variable, which in this case is stress, and the predictor
variables, which are anxiety, depression, efficacy, and sociability.
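The model-fitting code is not shown above, but the Call line in the summary output below confirms the formula, so the fit was:

```r
# Regress stress on the four predictor scale scores
model <- lm(stress ~ anxiety + depression + efficacy + sociability,
            data = psymetr_df_total)
```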
summary(model)
##
## Call:
## lm(formula = stress ~ anxiety + depression + efficacy + sociability,
## data = psymetr_df_total)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8980 -0.3043 0.0488 0.2591 0.9700
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.5410 0.7505 2.053 0.04678 *
## anxiety 1.0748 0.3203 3.355 0.00178 **
## depression 0.1250 0.2583 0.484 0.63129
## efficacy -0.4512 0.1776 -2.541 0.01513 *
## sociability -0.2045 0.1143 -1.790 0.08127 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.455 on 39 degrees of freedom
## Multiple R-squared: 0.7247, Adjusted R-squared: 0.6965
## F-statistic: 25.67 on 4 and 39 DF, p-value: 1.803e-10
As we can see, the \(R^2\) value is 0.725, the adjusted \(R^2\) value is 0.696, and the F statistic is \(F(4, 39) = 25.67\).
Confidence intervals
We can get the confidence intervals for the coefficients as follows.
confint(model)
## 2.5 % 97.5 %
## (Intercept) 0.02304224 3.05905333
## anxiety 0.42688263 1.72274893
## depression -0.39752733 0.64742803
## efficacy -0.81037120 -0.09208644
## sociability -0.43558962 0.02661865
Multicollinearity
We can assess multicollinearity using the variance inflation factor, calculated with the vif function from the car package.
vif(model)
Standardized coefficients
The standardized coefficients can be obtained using the lm.beta function from the lm.beta package. We pass the
model to lm.beta to get a new standardized model, and then we can use summary, etc., with this new model.
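The call itself is not shown in the original; following the description above, it would be:

```r
library(lm.beta)

# Augment the fitted model with standardized coefficient estimates
model_standardized <- lm.beta(model)
```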
summary(model_standardized)
##
## Call:
## lm(formula = stress ~ anxiety + depression + efficacy + sociability,
## data = psymetr_df_total)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8980 -0.3043 0.0488 0.2591 0.9700
##
## Coefficients:
## Estimate Standardized Std. Error t value Pr(>|t|)
## (Intercept) 1.54105 0.00000 0.75049 2.053 0.04678 *
## anxiety 1.07482 0.59642 0.32033 3.355 0.00178 **
## depression 0.12495 0.07648 0.25831 0.484 0.63129
## efficacy -0.45123 -0.26675 0.17756 -2.541 0.01513 *
## sociability -0.20449 -0.15237 0.11426 -1.790 0.08127 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.455 on 39 degrees of freedom
## Multiple R-squared: 0.7247, Adjusted R-squared: 0.6965
## F-statistic: 25.67 on 4 and 39 DF, p-value: 1.803e-10