
SEM Boot Camp Day 1 Morning: Basics & Data Screening: James Gaskin James - Gaskin@byu - Edu

This document provides an overview of the schedule and content for a three-day SEM boot camp. Day 1 will cover basics of statistics including the normal distribution, non-normal distributions, data issues like missing data and outliers, and significance testing. Participants will analyze a dataset on Excel usage. The document reviews key statistical concepts like missing data imputation methods, detecting univariate outliers, and the null hypothesis. Participants are provided resources and instructions for the boot camp.

SEM Boot Camp

Day 1 Morning: Basics & Data Screening

James Gaskin
[email protected]

Updated 7 May 2019


My Favorite Local Food Joints
Cannon Center BYU Commons (we’ll go here for lunch if you like)
Creamery on 9th for ice cream (maybe we’ll go here if we have time)
La Jolla Groves in Riverwoods (good for dinner; classy, but not pricey)
Pita Pitt (walking distance)
Bombay House
Thai House Cuisine

Quick survey of participants
Field
Student/faculty/other
General Stats savvy
SEM savvy
PLS savvy
Quant vs Qual focus

Who are you dealing (putting up) with…?
Silver16

Disclaimer…
The Plan: cover three semesters of statistics in three days
Schedule
1. Stats Basics (Data Screening, means tests, correlation etc.)
2. Factor analysis (EFA/CFA)
3. Testing causal models in AMOS & PLS, maybe show Mplus if there is time
Protocol
4. Demonstrate and discuss. Follow along if you want, but we can’t spend a ton of time on troubleshooting individual technical problems during the sessions. I can help you between or after sessions if you like.
5. Work with your own data if you’d like.
6. Ask questions at any time; stop me as needed. Remote folks: either interrupt or post in the chat window and I’ll check it periodically. This is a ‘how-to’ boot camp more than a deep explanatory lecture series.
7. Feel free to help others.
Resources
StatWiki: http://statwiki.kolobkreations.com
Wiki with all sorts of Stats help for SEM stuff
Pay particular attention to the updated Order of Operations, and References
Gaskination: http://www.youtube.com/gaskination
My YouTube channel with 200+ video tutorials
Particularly pay attention to the SEM Series (2016) Playlist and the SEM Speed Run (2017)
Our Google Drive folder
Where I’m putting slides and data and stuff from this Boot Camp.
Printer on 5th floor east side
Wifi – eduroam or byuwifi
Today: Basics
• Normal distribution (brief)
  • Mean
  • Median
  • Standard deviation
• Non-normal distribution
  • Skewness
  • Kurtosis
• Data Issues
  • Missing Data
  • Unengaged Respondents
  • Outliers
• Significance testing
  • Hypothesis (also wording)
  • Global and Local tests
  • Regression vs. ANOVA
Data and Model
381 Excel users
• How they use Excel
• Outcomes of usage
• A bit about the user
[Figure: research model with the constructs Experience, Usefulness, Information Acquisition, Playfulness, Decision Quality, Anxiety, and CompUse]
Multi Group: Gender (1=Male, 2=Female)
Controls: Age, Frequency, Social Desirability
Missing Data: Statistical Problems
If much of your data is missing, this can cause several problems; e.g., the model cannot be estimated.
SEM requires a certain minimum number of data points in order to compute estimates; each missing data point reduces your valid n by 1.
Let’s open the data first and then review slides if needed.
Missing Data: Logical Problems
Systematic missing data may indicate systematic bias (poor item
formulation, sensitivity, etc.).
If females are less likely to report gender than males, you will have “male-biased” data.
e.g., only 50% of the females report their gender, but 95% of the males report their gender.
What then if you use gender as a moderator (or in some other critical
role)?
Handling Missing Data
Missing less than 10% from a variable or respondent is typically not
problematic (I prefer less than 5%).
Method for handling missing data:
>10% - Just don't use that variable/respondent
<10% - Impute if not categorical
Warning: If you remove too many respondents (or impute too much), you will
introduce response bias and also dilute effects
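The 10% guideline above can be checked mechanically before deciding what to drop or impute. The sketch below is only an illustration: the rows, variable names, and the use of None for a missing answer are all made up, and the threshold is simply the slide's 10%.

```python
# Hypothetical survey rows; None marks a missing answer.
rows = [
    {"age": 34, "use1": 4, "use2": 5},
    {"age": None, "use1": 3, "use2": None},
    {"age": 51, "use1": None, "use2": 2},
    {"age": 28, "use1": 5, "use2": 4},
]

def missing_share(values):
    """Fraction of entries that are missing (None)."""
    values = list(values)
    return sum(v is None for v in values) / len(values)

# Share missing per variable (column) and per respondent (row).
by_variable = {name: missing_share(r[name] for r in rows) for name in rows[0]}
by_respondent = [missing_share(r.values()) for r in rows]

THRESHOLD = 0.10  # the 10% guideline from the slide
drop_variables = [name for name, share in by_variable.items() if share > THRESHOLD]
drop_respondents = [i for i, share in enumerate(by_respondent) if share > THRESHOLD]

print(by_variable)
print(drop_respondents)
```

With only four toy rows every variable exceeds the threshold, which is why real decisions need the full sample and a look at why the values are missing.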
Imputation Methods (Hair, table 2-2)
Option 1: Use only complete and valid data
No imputation, just use valid cases or variables
In SPSS: Exclude Pairwise (variable), Listwise (case)
Option 2: Use known replacement values
Match missing value with similar case’s value
Option 3: Use calculated replacement values
Use variable mean, median, or mode
Predicted based on known relationships (e.g., regression y = mx + b; solve for the missing value)
If on an item for a reflective latent factor, use values from other items
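Two of the Option 3 ideas can be sketched in a few lines: filling with the variable's median, and filling a missing item from the respondent's other items on the same reflective factor. The function names and data are hypothetical, and using the mean of the other items is one assumed way to "use values from other items", not necessarily the one intended on the slide.

```python
from statistics import mean, median

def impute_median(values):
    """Replace None with the median of the observed values of one variable."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def impute_from_items(respondent_items):
    """Replace None with the mean of the same respondent's other items
    on the same reflective factor (an assumed rule for illustration)."""
    observed = [v for v in respondent_items if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in respondent_items]

item_across_respondents = [4, 5, None, 3, 4, None, 2]  # hypothetical ordinal item
print(impute_median(item_across_respondents))

one_respondent_factor = [4, None, 5]  # hypothetical items of one factor
print(impute_from_items(one_respondent_factor))
```

The median keeps an ordinal item on its original scale points more often than the mean does, which is one reason to prefer it for Likert-type items.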
Best Method – Prevention!
Shorter surveys (pre-testing critical!)
Easy to understand and to answer survey items (pre-testing critical)
Force completion
Bribe/motivate (iPad drawing)
Digital surveys (rather than paper)
Put DVs at the beginning of the survey.
Put sensitive items at the end of the survey.
Unengaged responses
Symptoms
B-liners: same answer: 3, 3, 3, 3…
Patterned responses: 1, 2, 3, 4, 1, 2, 3… or 1, 1, 1, 1, 2, 2, 2...
Reverse-coded questions same as regular
Detection
Standard deviation of ordinal scales
Visual inspection
Prevention
Traps in survey: “I regularly eat car tires for breakfast.”
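Detection by standard deviation can be sketched as follows: compute each respondent's standard deviation across their scale items and flag very low values for visual inspection. The respondent IDs, answers, and the 0.5 cut-off are assumptions for illustration only. Patterned responses such as 1, 2, 3, 4, 1, 2, 3 will not be caught this way and still need a visual check.

```python
from statistics import pstdev

# Hypothetical answers on 5-point items, one list per respondent.
responses = {
    "r1": [3, 3, 3, 3, 3, 3],  # same answer throughout
    "r2": [4, 2, 5, 3, 4, 1],
    "r3": [5, 5, 5, 5, 4, 5],  # almost no variation
}

MIN_SD = 0.5  # assumed cut-off; choose one that suits your scales

# Standard deviation of each respondent's answers.
sd = {rid: pstdev(answers) for rid, answers in responses.items()}

# Low variation is a symptom, not proof; inspect flagged cases before removing.
flagged = [rid for rid, value in sd.items() if value < MIN_SD]
print(flagged)
```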

Distribution
Let’s take a look and try it out
To check distribution in SPSS:
1. Analyze
2. Explore
3. Plots: Histogram with normality plot

Outliers and Influentials
Outliers can influence your results, pulling the mean away from the median.
Outliers also affect distributional assumptions and often reflect false or
mistaken responses
Two types of outliers:
outliers for individual variables (univariate)
• Extreme values for a single variable
outliers for the model (multivariate) – do these later
• Extreme (uncommon) values for a correlation
• A cool plotter for visualizing the problems with multivariate outliers: https://www.nctm.org/Classroom-Resources/Illuminations/Interactives/Line-of-Best-Fit/
Detecting Univariate Outliers
[Figure: boxplot alongside a normal curve. 68%: the mean should fall within the box. 95%: values should fall within this range. Points beyond it are marked "Outliers!"]
In SPSS: Analyze, Descriptives, Explore
Let’s take a look and try it out
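A boxplot-style check can also be done by hand. The sketch below uses the common 1.5 × IQR fence convention, which is an assumption here rather than something stated on the slide, and the ages are made up. Quartile definitions differ slightly between tools, so the fences may not match SPSS exactly.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles (default 'exclusive' method)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

ages = [22, 25, 27, 29, 31, 33, 35, 38, 41, 95]  # hypothetical
print(iqr_outliers(ages))
```

Flagged values are candidates for the case-by-case examination described next, not automatic removals.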
Handling Univariate Outliers
Univariate outliers should be examined on a case by case basis.
If the outlier is truly abnormal, and not representative of your
population, then it is okay to remove. But this requires careful
examination of the data points
e.g., you are studying dogs, but somehow a cat got ahold of your survey
e.g., someone answered “3” for all 45 questions on the survey
However, just because a datapoint doesn’t fit comfortably with the distribution does not justify removing it
Open question: can there be extreme outliers on short ordinal scales (e.g., 5-point Likert)?
Tests for Skewness and Kurtosis
Standard rule:
Statistic > 1 = positive (right) skewed
Statistic < -1 = negative (left) skewed
Statistic between -1 and 1 is fine
Strict rule:
Abs(Statistic) > 3*Std. error = Skewed (Hair)
Practical purposes:
Problems arise outside of (+/-) 2.2
• Sposito, V. A., Hand, M. L., & Skarpness, B. (1983). On the efficiency of using the sample kurtosis in selecting optimal lp estimators. Communications in Statistics - Simulation and Computation, 12(3), 265-272.
Loose rule: >10 (Kline, 2005)
• Kline, R. B. (2005). Principles and practice of structural equation modeling, 2nd ed. Guilford Press, New York, NY.
Let’s take a look and try it out
Need to transform continuous variables: https://youtu.be/twwT6FgwlAo
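The rules of thumb above can be applied to a statistic and its standard error once both are computed. The sketch below uses the bias-adjusted sample formulas common in statistical packages (assumed here to match what SPSS reports); the function names and the income data are hypothetical, and at least four observations are needed.

```python
from math import sqrt

def skew_kurtosis(values):
    """Return (skewness, SE of skewness, excess kurtosis, SE of kurtosis)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n  # central moments
    m3 = sum((v - m) ** 3 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    skew = (m3 / m2 ** 1.5) * sqrt(n * (n - 1)) / (n - 2)
    kurt = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * (m4 / m2 ** 2 - 3) + 6)
    se_skew = sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2 * se_skew * sqrt((n * n - 1) / ((n - 3) * (n + 5)))
    return skew, se_skew, kurt, se_kurt

def classify(stat, se):
    """Apply the slide's rules of thumb to one statistic."""
    return {
        "standard_rule_ok": abs(stat) <= 1,      # between -1 and 1
        "strict_rule_ok": abs(stat) <= 3 * se,   # within 3 standard errors
        "practical_ok": abs(stat) <= 2.2,        # within +/- 2.2
    }

incomes = [20, 22, 23, 25, 26, 28, 30, 31, 35, 120]  # hypothetical, right-skewed
skew, se_skew, kurt, se_kurt = skew_kurtosis(incomes)
print(classify(skew, se_skew))
print(classify(kurt, se_kurt))
```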
Testing for significance
1. Sample means are significantly different (difference of means tests)
• T-test: two samples or same sample at two times
• ANOVA: multiple samples or multiple times
• H1. Australians live longer than Canadians.

2. Variables move together (covary) significantly


• Correlation (not causation)
• Regression analysis (implies causation)
• H1. Autonomy has a positive effect on job satisfaction.

The Null Hypothesis
Denoted as H0
Hypothesizes “no effect” (or, effect is no different from zero)
We (usually) seek to reject H0 by providing evidence that there is an effect
The alternative hypothesis (denoted as Hn) suggests that there is some
effect different from zero
We seek to support this alternative hypothesis by rejecting the null
hypothesis
When we cannot support Hn, we might say that we have “failed to reject
the null hypothesis”

Samples Significantly Different (t-test / ANOVA)
[Figure: two overlapping distributions of Age at Death (axis from 50 to 100), one for Canadians and one for Australians, with means µ1 and µ2; µ = Mean]
Let’s take a look and try it out

What is the null hypothesis?
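For a difference of means test, the null hypothesis is that the two population means are equal (µ1 = µ2). A minimal sketch of the test statistic follows, using Welch's unequal-variance form of the t-test. The ages at death are invented, and the function name is hypothetical. The p-value would then come from a t distribution with the returned degrees of freedom, which a statistics package reports for you.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    va = variance(a) / len(a)  # squared standard error of each sample mean
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

australians = [82, 79, 85, 88, 81, 84]  # hypothetical ages at death
canadians = [78, 80, 76, 83, 79, 77]    # hypothetical ages at death

t, df = welch_t(australians, canadians)
print(round(t, 2), round(df, 1))
```

A large absolute t relative to its degrees of freedom is evidence against H0; with three or more groups the same question is asked with ANOVA.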

Variables vary or change (“move”)
[Figure: Autonomy (IV) and Job Satisfaction (DV), each shown as a variable value that varies between Low and High]
How much of the movement (or variance) in Job Satisfaction can be explained by the movement (variance) of Autonomy?
Types of variance

What we measure: Trait, Measurement, Response
Trait: variance due to the “true” construct value
Measurement: variance due to the method of measurement (e.g., scale granularity)
Response: variance due to respondent reaction to measure (e.g., offensive or personal)
Let’s take a look and try it out
The Linear Regression Model
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
$Y_i$: Dependent Variable
$\beta_0$: Population Y intercept
$\beta_1$: Population Slope Coefficient
$X_i$: Independent Variable (data points)
$\varepsilon_i$: Error term
Y = Job Satisfaction (1 = Low, 5 = High)
X = Autonomy (1 = Low, 5 = High)
The Linear Regression Model
Relationship between X and Y is described by a linear function: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
[Figure: scatterplot of Job Satisfaction (Y) against Autonomy (X) with a fitted line. Points are labeled Pat, Tom, Jane, Tina, Bob, Dave, and Joe. Intercept = β0, Slope = β1. For a given Xi, the error εi is the gap between the observed value of Y and the predicted value of Y.]
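The intercept and slope of the linear model can be estimated by ordinary least squares. The sketch below uses the closed-form solution for one predictor. The function name and the seven pairs of Autonomy and Job Satisfaction scores are hypothetical.

```python
def fit_line(x, y):
    """Least-squares estimates (b0, b1) for y = b0 + b1*x + error."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx              # slope
    b0 = mean_y - b1 * mean_x   # intercept
    return b0, b1

autonomy = [1, 2, 2, 3, 4, 4, 5]      # hypothetical X, 1 = Low, 5 = High
satisfaction = [2, 2, 3, 3, 4, 5, 5]  # hypothetical Y, 1 = Low, 5 = High

b0, b1 = fit_line(autonomy, satisfaction)

# Residuals: observed Y minus predicted Y, one per respondent.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(autonomy, satisfaction)]
print(round(b0, 3), round(b1, 3))
```

Each residual is the vertical distance between a respondent's point and the fitted line.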
Confidence in findings
• Significance means that we have a certain degree of confidence that the effect is different from zero
• 99% confidence in medical studies
• 95% confidence is the standard in Social Sciences
• 90% confidence OK when exploratory
• About 99.9999% confidence in Physics (five sigma)

Reporting Significance
Using p-value (probability value) that results from statistical tests:

p < 0.01 --- 99% confidence that effect is not zero


p < 0.05 --- 95% confidence that effect is not zero
p < 0.10 --- 90% confidence that effect is not zero

The way of reporting significance is changing.


Some journals have banned the reporting of p-value thresholds

Hypothesis testing in regression
Global test through R²
H0 = no relationship between any IVs and DV (R² = 0)
[Figure: example model showing R² values of .33, .42, and .19]
Local test checks path-by-path effects
H0 = no relationship between a specific IV and DV
Evidence to reject null is direction, magnitude, and significance of effect
MUST meet Global test to consider Local test
If R² is meaningless (e.g., <0.10), then a “significant” relationship between IV and DV is not meaningful
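The global check can be sketched for the one-predictor case: fit the line, compute R² as the share of variance in the DV explained by the IV, and only look at the path if R² clears the threshold. The function name and data are hypothetical, and 0.10 is simply the example cut-off from the slide.

```python
def r_squared(x, y):
    """R-squared of a one-predictor least-squares regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
          / sum((xi - mean_x) ** 2 for xi in x))
    b0 = mean_y - b1 * mean_x
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)                      # total
    return 1 - ss_res / ss_tot

autonomy = [1, 2, 2, 3, 4, 4, 5]      # hypothetical IV
satisfaction = [2, 2, 3, 3, 4, 5, 5]  # hypothetical DV

r2 = r_squared(autonomy, satisfaction)
MIN_R2 = 0.10  # example cut-off from the slide
passes_global_test = r2 >= MIN_R2
print(round(r2, 3), passes_global_test)
```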

[Figure: hierarchy of checks running from Global to Local: Data Validity, Factor Validity, Model Fit, R-square, Effect Size, P-Value]

Interrelatedness of measurements

Sample size: Sample size is the great “lever” of statistical power
Error: This leads to narrower confidence intervals, and therefore, greater confidence
Critical Value: We use the critical value to determine if an effect is different from zero (i.e., can we reject the null hypothesis?)
p-value: P-values can be misleading and result in false positives (or negatives) if effect sizes are ignored
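The link between sample size and interval width can be made concrete with the normal-approximation confidence interval for a mean, whose half-width is z × sd / √n. This is a sketch under that approximation only; the standard deviation of 1.0 and the sample sizes are arbitrary.

```python
from math import sqrt
from statistics import NormalDist

def ci_half_width(sd, n, confidence=0.95):
    """Half-width of a normal-approximation confidence interval for a mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    return z * sd / sqrt(n)

# Same variability, growing samples: the interval narrows with the square root of n.
for n in (25, 100, 400, 1600):
    print(n, round(ci_half_width(1.0, n), 3))
```

Quadrupling the sample halves the interval, which is also why very large samples can make tiny effects "significant" and why effect sizes should be read alongside p-values.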

Day 1 Morning Exercises if there’s time
Using your own data or the Boot Camp data:
1. Address missing values and unengaged responses
2. Check normality of several variables (distribution, skewness, kurtosis)
3. Detect and fix outliers if any
4. Run some ANOVAs using a grouping variable (e.g., child order)
5. Run independent samples t-test using a binary variable (e.g., gender)
6. Run some correlations and regressions using ordinal or continuous variables (e.g., enjoyment, decision quality, experience)

Extra Time?
Random sampling in SPSS
Automatic linear regression for predictive analytics
Stepwise (forward/backward) regression
Forward: add one variable at a time until R-square doesn’t change
Backward: remove one variable at a time until R-square significantly drops
Linear regression, ANOVA, and Correlation in Excel
Control variables in simple linear regression
Categorical Dummy variables in linear regression
Including auto-recoding in SPSS
Logistic Regression
One-on-One assistance
