SEM Boot Camp Day 1 Morning: Basics & Data Screening: James Gaskin James - Gaskin@byu - Edu
James Gaskin
Quick survey of participants
Field
Student/faculty/other
General Stats savvy
SEM savvy
PLS savvy
Quant vs Qual focus
Who are you dealing (putting up) with…?
Disclaimer…
The Plan: cover three semesters of statistics in three days
Schedule
1. Stats Basics (Data Screening, means tests, correlation etc.)
2. Factor analysis (EFA/CFA)
3. Testing causal models in AMOS & PLS, maybe show Mplus if there is time
Protocol
1. Demonstrate and discuss. Follow along if you want, but we can’t spend a ton of
time on troubleshooting individual technical problems during the sessions. I can
help you between or after sessions if you like.
2. Work with your own data if you’d like.
3. Ask questions at any time; stop me as needed. Remote folks: either interrupt or post
in the chat window and I’ll check it periodically. This is a ‘how-to’ boot camp more
than a deep explanatory lecture series.
4. Feel free to help others.
Resources
StatWiki: http://statwiki.kolobkreations.com
Wiki with all sorts of Stats help for SEM stuff
Pay particular attention to the updated Order of Operations, and References
Gaskination: http://www.youtube.com/gaskination
My YouTube channel with 200+ video tutorials
Particularly pay attention to the SEM Series (2016) Playlist and the SEM Speed
Run (2017)
Our Google Drive folder
Where I’m putting slides and data and stuff from this Boot Camp.
Printer on 5th floor east side
Wifi – eduroam or byuwifi
Today: Basics
• Normal distribution
• Mean
• Median
• Standard deviation
• Non-normal distribution
• Skewness
• Kurtosis
• Data Issues
• Missing Data
• Unengaged Respondents
• Outliers
• Significance testing
• Hypothesis (also wording)
• Global and Local tests
• Regression vs. ANOVA
Data and Model
[Figure: model diagram with the labels Experience, Usefulness, Information, Playfulness, Acquisition]
Missing Data: Statistical Problems
Missing much of your data can cause several problems;
e.g., the estimated model cannot be calculated.
SEM requires a certain minimum number of data points in order to
compute estimates – each missing data point reduces your valid n by 1.
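A minimal sketch of the point above: counting missing answers per respondent and seeing how the valid n shrinks when incomplete cases are dropped (listwise deletion, which is an assumption about how missing data is handled). The rows, the variable names, and the use of `None` for a missing answer are all made up for illustration.

```python
# Hypothetical survey rows; None marks a missing answer (an assumption
# about how missing values are coded in your own data).
rows = [
    {"usefulness": 4, "playfulness": 5, "experience": 3},
    {"usefulness": None, "playfulness": 4, "experience": 2},
    {"usefulness": 5, "playfulness": None, "experience": None},
    {"usefulness": 3, "playfulness": 3, "experience": 4},
]

def missing_per_row(rows):
    """Return the number of missing answers for each respondent."""
    return [sum(value is None for value in row.values()) for row in rows]

def listwise_n(rows):
    """Valid n if every respondent with any missing answer is dropped."""
    return sum(count == 0 for count in missing_per_row(rows))

print(missing_per_row(rows))
print(listwise_n(rows), "of", len(rows), "respondents are complete")
```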
Distribution
Let’s take a look and try it out
Outliers and Influentials
Outliers can influence your results, pulling the mean away from the median.
Outliers also affect distributional assumptions and often reflect false or
mistaken responses
Two types of outliers:
outliers for individual variables (univariate)
Extreme values for a single variable
outliers for the model (multivariate) – do these later
Extreme (uncommon) values for a correlation
A cool plotter for visualizing the problems with multivariate outliers:
https://www.nctm.org/Classroom-Resources/Illuminations/Interactives/Line-of-Best-Fit/
Detecting Univariate Outliers
[Figure: boxplot annotated “Mean should fall within the box”, “68%”, and “95% should fall within this range”, with “Outliers!” marked]
In SPSS: Analyze > Descriptive Statistics > Explore
Let’s take a look and try it out
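Outside SPSS, a boxplot-style check can be sketched with the common 1.5 × IQR fence rule. This is an assumption about the rule being applied, not necessarily what the Explore procedure computes: quartile conventions differ between packages. The `ages` data and the function name are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical ages with one suspicious entry.
ages = [21, 23, 24, 25, 25, 26, 27, 28, 30, 95]
print(iqr_outliers(ages))
```

A flagged value is only a candidate for inspection; whether to remove it is still a case-by-case decision.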
Handling Univariate Outliers
Univariate outliers should be examined on a case-by-case basis.
If the outlier is truly abnormal, and not representative of your
population, then it is okay to remove. But this requires careful
examination of the data points.
e.g., you are studying dogs, but somehow a cat got ahold of your survey
e.g., someone answered “3” for all 45 questions on the survey
However, just because a datapoint doesn’t fit comfortably with the
distribution does not mean it should be removed.
Open question: extreme outliers on short ordinal scales (e.g., 5-point Likert)?
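The “answered 3 for all 45 questions” case can be screened for by computing the standard deviation of each respondent's answers: zero (or near-zero) spread suggests an unengaged respondent. A minimal sketch follows; the response data and the `min_sd` cutoff are illustrative assumptions, not a recommended threshold.

```python
import statistics

def unengaged(responses, min_sd=0.0):
    """Return indices of respondents whose answers have SD <= min_sd."""
    return [
        index
        for index, answers in enumerate(responses)
        if statistics.pstdev(answers) <= min_sd
    ]

# Hypothetical 45-item survey: respondent 0 answered "3" every time.
responses = [
    [3] * 45,
    [1, 2, 5, 4, 3] * 9,
]
print(unengaged(responses))
```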
Tests for Skewness and Kurtosis
Standard rule:
Statistic > 1 = positive (right) skewed
Statistic < -1 = negative (left) skewed
Statistic between -1 and 1 is fine
Strict rule:
Abs(Statistic) > 3*Std. error = Skewed (Hair)
Practical purposes: problems arise outside of (+/-) 2.2
Sposito, V. A., Hand, M. L., & Skarpness, B. (1983). On the efficiency of using the sample kurtosis in
Let’s take a look and try it out
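The rules above can be sketched in code. The formulas below are the bias-adjusted sample skewness and excess kurtosis with their usual standard errors, as commonly reported by statistics packages; treat the match to any particular package's output as an assumption and check its documentation. The function names and the 2.2 flag applied to kurtosis are illustrative.

```python
import math
import statistics

def skew_and_kurtosis(values):
    """Return (skewness, SE of skewness, excess kurtosis, SE of kurtosis)."""
    n = len(values)
    if n < 4:
        raise ValueError("need at least 4 values")
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 denominator)
    z = [(v - mean) / sd for v in values]
    skew = n / ((n - 1) * (n - 2)) * sum(t ** 3 for t in z)
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(t ** 4 for t in z)
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2 * se_skew * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))
    return skew, se_skew, kurt, se_kurt

def normality_flags(values):
    """Apply the standard, strict, and practical rules from the slide."""
    skew, se_skew, kurt, _ = skew_and_kurtosis(values)
    return {
        "standard": abs(skew) > 1,            # |statistic| > 1
        "strict": abs(skew) > 3 * se_skew,    # |statistic| > 3 * SE
        "kurtosis_2_2": abs(kurt) > 2.2,      # outside (+/-) 2.2
    }

print(normality_flags([1, 1, 1, 2, 10]))
```

With very small samples the standard error is large, so the strict rule can flag less than the standard rule; the two rules only line up as expected with realistic sample sizes.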
The Null Hypothesis
Denoted as H0
Hypothesizes “no effect” (or, effect is no different from zero)
We (usually) seek to reject H0 by providing evidence that there is an effect
The alternative hypothesis (denoted as Hn) suggests that there is some
effect different from zero
We seek to support this alternative hypothesis by rejecting the null
hypothesis
When we cannot support Hn, we might say that we have “failed to reject
the null hypothesis”
Samples Significantly Different (t-test / ANOVA)
[Figure: two distributions of Age at Death on an axis from 50 to 100, with means µ1 and µ2 (µ = mean)]
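To make the comparison of two sample means concrete, here is a minimal sketch of Welch's t statistic, a variant of the independent-samples t-test that does not assume equal variances (named here as a stand-in, not necessarily the test run in the session). The two groups are made-up ages at death. Turning t and df into a p-value needs the t distribution, which the Python standard library does not provide, so that step is left out.

```python
import math
import statistics

def welch_t(sample1, sample2):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n1, n2 = len(sample1), len(sample2)
    m1, m2 = statistics.fmean(sample1), statistics.fmean(sample2)
    v1, v2 = statistics.variance(sample1), statistics.variance(sample2)
    a, b = v1 / n1, v2 / n2            # squared standard errors of each mean
    t = (m1 - m2) / math.sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, df

# Hypothetical ages at death for two groups.
group1 = [70, 72, 74, 76, 78]
group2 = [80, 82, 84, 86, 88]
t, df = welch_t(group1, group2)
print(round(t, 3), round(df, 1))
```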
Variables vary or change (“move”)
IV: Autonomy
DV: Job Satisfaction
How much of the movement (or variance) in Job Satisfaction can be explained by the movement (variance) of Autonomy?
[Figure: Autonomy (IV) and Job Satisfaction (DV) shown as variable values ranging from Low to High, with the variance marked]
Types of variance
Measurement
Response
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
Y = Job Satisfaction (1 = Low, 5 = High)
X = Autonomy (1 = Low, 5 = High)
The Linear Regression Model
Relationship between X and Y is described by a linear function
[Figure: regression line of Y on X; x-axis: Xi (Autonomy)]
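A minimal sketch of fitting that linear function by ordinary least squares, returning the intercept, the slope, and R² (the share of variance in Y explained by X). The Autonomy and Job Satisfaction scores below are invented for illustration.

```python
import statistics

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1, r2)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # slope
    b0 = my - b1 * mx                   # intercept
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return b0, b1, 1 - ss_res / ss_tot  # R-square

# Hypothetical 1-5 scale scores.
autonomy = [1, 2, 3, 4, 5, 2, 4]
job_satisfaction = [2, 2, 3, 4, 5, 3, 4]
b0, b1, r2 = fit_line(autonomy, job_satisfaction)
print(round(b0, 3), round(b1, 3), round(r2, 3))
```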
Confidence in findings
Statistical significance means that we have a certain degree of confidence that the effect is
different from zero
99% confidence in medical studies
95% confidence is the standard in Social Sciences
90% confidence OK when exploratory
About 99.9999% confidence in Physics (five sigma)
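For reference, each confidence level corresponds to a significance threshold (alpha = 1 − confidence) and a critical z value. The sketch below assumes a two-sided test on a standard normal distribution, which is an illustrative simplification; the function names are hypothetical.

```python
from statistics import NormalDist

def critical_z(confidence):
    """Two-sided critical z value for a given confidence level."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

def two_sided_confidence(sigmas):
    """Confidence level implied by a threshold of this many sigmas."""
    return 1 - 2 * (1 - NormalDist().cdf(sigmas))

for level in (0.90, 0.95, 0.99):
    print(level, round(critical_z(level), 3))
print("five sigma:", two_sided_confidence(5))
```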
Reporting Significance
Using p-value (probability value) that results from statistical tests:
Hypothesis testing in regression
Global test through R2
H0 = no relationship between any IVs and DV (R2=0)
[Figure: example model annotated with the values .33, .42, and .19]
Local test checks path by path effects
H0 = no relationship between a specific IV and DV
Evidence to reject null is direction, amplitude, and significance of effect
MUST meet Global test to consider Local test
If R2 is meaningless (e.g., <0.10), then a “significant” relationship between IV
and DV is not meaningful
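The global-before-local logic can be sketched as a simple gate. The F statistic below is the standard overall F for a regression with k predictors and n cases; the 0.10 cutoff is the example value from the slide, and the n and k values are hypothetical. Converting F to a p-value needs the F distribution, which is not in the Python standard library, so only the statistic is computed.

```python
def global_f(r2, n, k):
    """Overall F statistic for a regression with k predictors and n cases."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

def local_tests_meaningful(r2, threshold=0.10):
    """Only read path-by-path results when R-square clears the cutoff."""
    return r2 >= threshold

# Hypothetical model: R-square of .33 with 2 predictors and 100 cases.
print(round(global_f(0.33, n=100, k=2), 2))
print(local_tests_meaningful(0.33), local_tests_meaningful(0.05))
```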
Global
  Data Validity
  Factor Validity
  Model Fit
  R-square
  Effect Size
  P-Value
Local
Interrelatedness of measurements
Sample size is the great “lever” of statistical power
This leads to narrower confidence intervals and, therefore, greater confidence
We use the critical value to determine if an effect is different from zero (i.e., can we reject the null hypothesis?)
P-values can be misleading and result in false positives (or negatives) if effect sizes are ignored
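The link between sample size and confidence-interval width can be sketched as follows, using a normal approximation for the interval around a mean (an illustrative assumption; small samples would use the t distribution). The standard deviation and sample sizes are made up.

```python
import math
from statistics import NormalDist

def ci_half_width(sd, n, confidence=0.95):
    """Half-width of a normal-approximation confidence interval for a mean."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sd / math.sqrt(n)

# Quadrupling the sample size halves the interval width.
for n in (25, 100, 400):
    print(n, round(ci_half_width(sd=1.0, n=n), 3))
```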
Day 1 Morning Exercises (if there’s time)
Using your own data or the Boot Camp data:
1. Address missing values and unengaged responses
2. Check normality of several variables
   • Distribution, skewness, kurtosis
3. Detect and fix outliers if any
4. Run some ANOVAs using a grouping variable (e.g., child order)
5. Run independent samples t-test using a binary variable (e.g., gender)
6. Run some correlations and regressions using ordinal or continuous
variables (e.g., enjoyment, decision quality, experience)
Extra Time?
Random sampling in SPSS
Automatic linear regression for predictive analytics
Stepwise (forward/backward) regression
Forward: add one variable at a time until R-square doesn’t change
Backward: remove one variable at a time until R-square significantly drops
Linear regression, ANOVA, and Correlation in Excel
Control variables in simple linear regression
Categorical Dummy variables in linear regression
Including auto-recoding in SPSS
Logistic Regression
One-on-One assistance