2802 Key Points

The document discusses key concepts in experimental design and statistics, including types of variables, reliability, validity, and statistical tests. Parametric tests like t-tests and ANOVA make assumptions about populations, while non-parametric tests do not. Within-subjects and between-subjects designs are covered, along with reliability, validity, and hypothesis-testing procedures.

Use the Z-distribution (normal) for sample sizes > 30; otherwise use the t-distribution. As the degrees of freedom approach 30 and beyond, the t-distribution becomes closer to z.

Parametric Tests (T-Test and ANOVA) – require Interval (Rank & Distance, e.g. Celsius) or Ratio (True Zero, e.g. Kelvin) data;
Make Strong Assumptions about the Population
Non-Parametric – for Nominal (Labels & Categories) or Ordinal (Rank, e.g. Likert items, Podium finishes) data, or data that Deviate from
Normality
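The distinction above can be sketched with scipy (the scores below are made up for illustration): a parametric independent t-test next to its non-parametric counterpart, the Mann-Whitney U, which ranks the data instead of assuming normality.

```python
from scipy import stats

# Hypothetical interval-scale scores for two independent groups
group_a = [4.1, 5.3, 6.2, 5.8, 4.9]
group_b = [6.5, 7.1, 6.8, 7.4, 6.0]

# Parametric: independent t-test (assumes normality, homogeneity of variance)
t_stat, t_p = stats.ttest_ind(group_a, group_b)

# Non-parametric counterpart: Mann-Whitney U (rank-based, no normality assumption)
u_stat, u_p = stats.mannwhitneyu(group_a, group_b)
```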

o Reliability is the proportion of the total variance that is systematic variance
associated with true scores
o No reliability = 0.00
o Perfect reliability = 1.00
o Reliability = True Score Variance / Total Variance
- So if reliability = 0, then none of the total variance is true-score variance.
- Generally, a measure with reliability of 0.70 or above is considered acceptable.
o Often reported as “Cronbach’s Alpha”
Reliability is at its peak when all variance reflects true scores, with no contribution from external factors (measurement error).
A measure can be highly reliable, but not valid!
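Since Cronbach's alpha is just a ratio of variances, it can be computed directly. A minimal sketch (the function name and item scores are hypothetical):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha: rows = respondents, columns = scale items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the summed scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 5-respondent, 3-item scale
scores = [[3, 4, 3], [5, 5, 4], [2, 2, 3], [4, 5, 5], [1, 2, 2]]
alpha = cronbach_alpha(scores)  # ≈ 0.95, above the 0.70 acceptability cutoff
```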

- Purpose of descriptive statistics (mean, SD, mode)
  o To summarize and describe behavior (sample)
- Purpose of inferential statistics (ANOVA, assumption checks)
  o To draw conclusions about the reliability and generalizability of the findings (population)

Variance
  o Measures how far a set of (random) numbers is spread out from its average value.
  o So, the larger the variance, the larger the spread in a data set.
  o Variance emphasizes outliers more than standard deviation does.
  o Because of the squaring operation, variance is no longer in the same measuring unit as its original data set; SD returns to the original units.
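A quick numpy illustration of the units point (the data values are made up): variance is the squared SD, so only the SD is in the original measurement units.

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical scores

variance = data.var(ddof=1)  # sample variance: squared units, weights outliers heavily
sd = data.std(ddof=1)        # sample SD: back in the original units
```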
The experimenter exercises control by manipulating the independent variable and
holding all other factors constant to determine whether the dependent variable changes
in response.
o Independent variable (IV): A variable manipulated by the experimenter to observe
its effect on a dependent variable.
o Dependent variable (DV): The factor of interest being measured in the experiment
- Environmental manipulation: Alters the participants’ physical or social context for each
level of the independent variable.
- Instructional manipulation: Provides different directions for each level of the independent
variable.
- Stimulus manipulation: Uses different stimuli for each level of the independent variable.
- Invasive manipulation: Uses the administration of drugs or surgery to create physical
changes within the participant for each level of the independent variable
- Experimenters use a manipulation check to test whether the manipulation of the independent
variable was effective and elicited the expected differences between conditions.
Between Subjects Design
- Assumptions: Interval/Ratio data, random sampling, independence of cases, normality, homogeneity of variance
- Independent T-test (comparing two samples; null hypothesis that the means are the same; df = n - 2; subjects split into treatment A and B)
Within Subjects Design
- Assumptions: Interval/Ratio data, random sampling, independence of cases, normality
- Dependent T-test (comparing two means from one sample tested under two conditions; null is mean difference = 0; df = n - 1; subjects go through both A and B. Requires fewer participants; must use the same participants for each treatment; can create carryover effects)
- The between-subjects design is conceptually simpler, avoids order/carryover effects, and
minimizes the time and effort of each participant. The within-subjects design is more efficient for
the researcher and controls extraneous participant variables.
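Both designs can be sketched with scipy (all scores are hypothetical): the independent t-test for a between-subjects split, and the dependent (paired) t-test for the same participants measured twice.

```python
from scipy import stats

# Between-subjects: two separate groups, df = n_total - 2
group_a = [12.1, 11.4, 13.0, 12.7, 11.9]
group_b = [14.2, 13.8, 14.9, 13.5, 14.4]
ind = stats.ttest_ind(group_a, group_b)
df_between = len(group_a) + len(group_b) - 2  # 8

# Within-subjects: the same participants under two conditions, df = n - 1
before = [12.1, 11.4, 13.0, 12.7, 11.9]
after = [12.9, 12.3, 13.8, 13.1, 12.6]
dep = stats.ttest_rel(before, after)
df_within = len(before) - 1  # 4
```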
Latin Squares Counterbalancing
- Each condition occurs once in each column and once in each row
- The number of possible orders will always equal the number of experimental conditions
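The cyclic construction below is one standard way to build such a square; a sketch (for orders fully balanced against sequence effects, a Williams design would be used instead):

```python
def latin_square(conditions):
    """Cyclic Latin square: each condition appears once per row and once per column."""
    k = len(conditions)
    return [[conditions[(row + col) % k] for col in range(k)] for row in range(k)]

# Four conditions -> four orders (rows), matching the rule above
orders = latin_square(["A", "B", "C", "D"])
```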
Matched Group Design
- Experimenters assign separate groups of participants to each condition and “twin” a participant in one group with a participant in another group.
- Advantages of matched-group designs:
- They reduce a lot of the unwanted variability caused by individual differences.
- Order effects are not a concern.
- Disadvantages of matched-group designs:
- The process of matching can be difficult.
- It can be hard to know which dimensions to match participants on.
- Matching will be ineffective if dimensions are not correctly identified.
- Recruiting matched samples may be difficult and expensive
Demand Characteristics
- Features of the experimental design itself that lead participants to make certain conclusions about the
purpose of the experiment and then adjust their behaviour accordingly, either consciously or
unconsciously.
Single and Double-Blind Designs
- Single-blind: Either the experimenters or the participants are unaware of the experimental
condition they are in.
- Double-blind: Both the experimenters and the participants are unaware of the experimental
condition they are in.
Ceiling effect: Occurs when scores on a measure cluster at the upper end (the ceiling) of the measure.
Floor effect: Occurs when scores on a measure cluster at the lower end (the floor) of the measure.
These effects often occur when developmental experimenters use measures that are not appropriate
for the given population.
Quasi-experimental design: A design that is similar to a true experiment but does not randomly assign
participants to conditions or randomly assign the order of the conditions.
- Flexible in terms of how experimenters set up their methods. Possible to study topics that
pose practical or ethical constraints for true experimental designs.
o A study can be quasi-experimental when it:
 Manipulates an independent variable but does not have a full set of other experimental
controls.
 Has some experimental controls but does not fully manipulate an independent variable.
o Non-equivalent groups design: A between-subjects quasi-experimental design in which participants
are not randomly assigned to conditions.
 It is not possible to randomly assign participants to some conditions (e.g., smoker/non-
smoker).
o Pretest-posttest design: A within-subjects quasiexperimental design in which participants are tested
before and after an experimental treatment or condition is introduced.
 The order of conditions is fixed.
Steps in Hypothesis Testing
- 1. State hypothesis
- 2. Choose Alpha (usually .05)
- 3. If calculating by hand, look up critical value
o Based on tables
- 4. Determine test statistic
o Observed value of t, Z, F etc.
- 5. If calculating by hand:
  o determine whether the observed test statistic exceeds the critical value
  If using computer analysis:
  o Look up the p-value… is it less than .05?
- 6. Decision: Reject H0, or fail to reject H0
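The steps above can be sketched end to end with scipy (the sample data and the hypothesized mean of 100 are made up); for a two-tailed test, the critical-value route and the p-value route always yield the same decision.

```python
from scipy import stats

# Step 1: H0: population mean = 100; H1: population mean != 100 (hypothetical)
sample = [104, 98, 110, 105, 99, 107, 103, 101]
alpha = 0.05  # Step 2: choose alpha

# Step 3: critical value from the t-distribution (two-tailed, df = n - 1)
t_crit = stats.t.ppf(1 - alpha / 2, len(sample) - 1)

# Step 4: observed test statistic (and its p-value)
t_obs, p_value = stats.ttest_1samp(sample, 100)

# Steps 5-6: decision by either route
reject_by_critical = abs(t_obs) > t_crit
reject_by_p = p_value < alpha
```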
T-test
- Used for determining the difference between 2 means/groups.
- 3 Types:
o The one sample t-test: comparing a sample mean with a hypothetical population mean.
o The dependent samples t-test: comparing two means for the same sample tested at two
different times or two different conditions. Within subjects
o The independent samples t-test: comparing two different samples tested under different
conditions. Between subjects
- When the null hypothesis is true, the test statistic follows the t-distribution, which is symmetric and approaches the normal distribution as degrees of freedom increase.
- T-test is only for two groups. When you find a significant difference in a t-test, you know the
groups are different. ANOVA is with more than two samples, so you do not know where the
differences lie, but you know there is a significant difference between at least two of the
samples. You can use ANOVA for two groups, but you would find the same results as a t-test.
Four Types of Validity
- An empirical study is said to be high in internal validity if the way it was conducted supports the
conclusion that the independent variable caused any observed differences in the dependent variable.
- An empirical study is high in external validity if the way it was conducted supports generalizing the
results to people and situations beyond those actually studied.
- The quality of the experiment’s manipulations represents its construct validity. Do the manipulations
relate to the hypothesis being tested?
- Statistical validity concerns the proper statistical treatment of data and the soundness of the
researchers’ statistical conclusions
Any statistical relationship in a sample can be interpreted in two ways:
o There is a relationship in the population, and the relationship in the sample reflects
this.
o There is no relationship in the population, and the relationship in the sample reflects
only sampling error
 We conduct null hypothesis testing to help us decide between these two
interpretations.

- In the top example, lower variance between groups and higher variance within groups
- In the bottom example, higher variance between groups and lower variance within groups
One-Way ANOVA
- An experiment in which only one independent variable is manipulated
- The IV must have at least two levels (though with exactly two, a t-test is typically used). An independent
variable is a factor; the levels of the IV are its conditions/treatments
- Usually used when there is 1 IV with 3+ levels
- Different Types
o Randomized groups
o Matched-subjects
o Within Subjects/Repeated measures
T-test can only compare 2 groups, so with 3+ groups, we use ANOVA
Technically, we could run a bunch of t-tests if we wanted…what issue do we run into with this?
- How can we determine whether there are any significant differences between these
means? We could run 10 t-tests, comparing each mean to every other in the group. The more
t-tests we run, the higher the chance of an error: each test carries its own alpha, so the
family-wise opportunity for a Type I error grows with every comparison.
Two-way ANOVA is for 2 IVs
The Bonferroni Adjustment
- Divide the desired alpha level by the number of tests you plan to conduct
- So in the previous example, we would divide an alpha of 0.05 by 10, making a more
conservative test using alpha = .005
- This reduces the probability of making a Type I error, BUT can increase risk of making a
Type II error, by reducing statistical power. This is an acceptable tradeoff in some conditions.
- Because of the risk of increasing Type II error, researchers typically use the Bonferroni
adjustment when they plan to conduct only a few statistical tests. If a large number of
comparisons between means is planned, it makes more sense to use an ANOVA. ANOVA
compares all means simultaneously to determine if any differences are present… this allows
us to hold alpha at 0.05 (or whichever alpha is chosen)
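The adjustment itself is one line; a sketch using the numbers from the text (the p-values are hypothetical):

```python
alpha = 0.05
n_tests = 10  # e.g. all pairwise comparisons among five group means

adjusted_alpha = alpha / n_tests  # 0.005: a more conservative criterion

# Hypothetical p-values from four of the ten comparisons
p_values = [0.001, 0.004, 0.020, 0.300]
significant = [p < adjusted_alpha for p in p_values]
```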
Type 1 Error:
o Alpha
o Probability of accepting HA when H0 is true
o Ie: Deciding there is an effect when there really isn’t
Type 2 Error:
o Beta
o Probability of accepting H0 when HA is true.
o Ie: Deciding there is not an effect when there actually is.
o A Type 1 Error is much worse to commit than a Type 2 Error, because a Type 2 Error
is ignoring an effect, which is an easier fix, whereas Type 1 is finding a false positive.
- Power:
o 1-Beta
o Probability of finding an effect, given that effect exists (Inverse of a Type 2 Error)
o If your test is less sensitive, your test will not recognize the effect
 Sensitivity measures the ability of a test to correctly identify true positives,
particularly in diagnostic testing. Power gauges the likelihood that a
statistical test will detect a true effect when one exists, mainly in hypothesis
testing and experimental design
If the variance between experimental conditions is markedly greater than the variance within the
conditions, it suggests the independent variable is causing the difference
- F-test: Ratio of the variance among conditions (between-groups variance) to the variance
within conditions (within-groups, or error, variance)
- ***Because all data points in a single group have been treated with the same IV, it is
impossible for them to contribute to systematic variance…within group variance is therefore
treated as error.***
We expect lower WG variance, and higher BG variance. If the proportion of BG variance is high enough,
results are significant, and we can reject the null hypothesis

One-way ANOVA partition
- SS_Total = SS_Treatment + SS_Error
- MS_Treatment (MST) = SS_Treatment / (k - 1), where the number of degrees of freedom for the k treatments is (k - 1)
- MS_Error (MSE) = SS_Error / (N - k), where N is the total sample size
- F = MST / MSE
 The numerator of the F-statistic (MST) must be large relative to its denominator (MSE) in order for it to reach significance
 Values near 1 indicate that variation between treatment and error is approximately equal.
 This is a ratio of treatment variance to error variance (Signal to Noise)
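A sketch of the partition with made-up data, checking the hand computation against scipy.stats.f_oneway:

```python
import numpy as np
from scipy import stats

# Hypothetical scores for k = 3 treatment groups
groups = [np.array([3.0, 4.0, 5.0]),
          np.array([6.0, 7.0, 8.0]),
          np.array([9.0, 10.0, 11.0])]
k = len(groups)
n_total = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

ss_treat = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)  # between groups
ss_error = sum(((g - g.mean()) ** 2).sum() for g in groups)            # within groups
mst = ss_treat / (k - 1)        # "signal"
mse = ss_error / (n_total - k)  # "noise"
f_manual = mst / mse

f_scipy, p = stats.f_oneway(*groups)  # should match the hand computation
```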
The F-Distribution
- The F-test seeks to find a difference among the means (note that it does not specify
exactly which means differ significantly)
- Regardless, the F-test will always be a one-tailed, upper-bound test.
- Knowing the distribution of F when the null hypothesis is true allows us to find the p-value.
Raw effects (or unstandardized effects): Straightforward measures of effect size, such as the
difference in the means for two samples.
o E.g., Two groups of participants have mean IQ scores of 101 and 113, respectively.
The raw effect of the difference of means would be 12.
- Standardized effects: Adjust the raw effect based on the amount of variability in the data.
- Cohen’s d: A common standardized effect size that is defined as the raw effect divided by its
standard deviation. Indicator of the practical value of results
o E.g., If we continue with the IQ example and find that the standard deviation for our
data values is 15, then Cohen’s d would be the raw effect of 12 divided by the
standard deviation of 15 to give a standardized effect size of 0.8.
- Standardized effects have two main advantages over raw effects:
o 1) Can readily be compared across studies even when the specific measures used are
on different scales.
 E.g., Cohen’s d computed on data using a measurement scale ranging from 0
to 20 can be meaningfully compared to Cohen’s d computed on data from a 0
to 100 scale.
o 2) Can be interpreted with respect to the standard deviation.
 E.g. Cohen’s D of 0.8 tells us that the two groups differ by just under 1 SD
- Correlation-like effects: Measure the association between two variables.
o Ex: Pearson’s r, R2
- Correlation-like measures such as eta squared and omega squared are appropriate for analysis
of variance designs. They are all standardized
- If we square a correlation coefficient (R2) , we obtain the proportion of the total variance in
one set of scores that is systematic variance related to another set of scores.
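The IQ example above as code (the group means and SD come straight from the text):

```python
# Raw (unstandardized) effect: the difference between the two group means
mean_a, mean_b = 101.0, 113.0
raw_effect = mean_b - mean_a  # 12 IQ points

# Standardized effect: Cohen's d = raw effect / standard deviation
sd = 15.0
cohens_d = raw_effect / sd  # 0.8 -> the groups differ by just under 1 SD
```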
Eta-Squared
- Effect size commonly reported with ANOVA
o Sums of squares treatment/Sums of Squares Total (or just look at your output!)
o Value always between 0-1
o Larger values = higher proportion of variance attributed to IV
- .01: Small effect size
- .06: Medium effect size
- .14 or higher: Large effect size
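Eta-squared drops out of the same sums of squares an ANOVA already produces; a sketch with hypothetical values:

```python
# Hypothetical sums of squares from a one-way ANOVA output
ss_treatment = 54.0
ss_error = 6.0
ss_total = ss_treatment + ss_error

eta_squared = ss_treatment / ss_total  # proportion of variance attributed to the IV

# Benchmarks from the text: .01 small, .06 medium, .14+ large
is_large = eta_squared >= 0.14
```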
Confidence Intervals
- Confidence interval: A range of values around the effect size obtained in your sample that is
likely to contain the true population effect size with a given level of plausibility or
confidence.
o A 95% confidence interval is commonly used in behavioural research
 If we were able to identify the true population value of our measure of interest
(e.g., a difference of two group means), there is a 95% chance that our
confidence interval would contain that true value.
- The size of a confidence interval depends on both the variability in your data and your sample
size.
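A t-based 95% CI for a sample mean, sketched with scipy (the data are made up); the interval is built from the standard error, so it narrows as sample size grows and variability shrinks.

```python
import numpy as np
from scipy import stats

sample = np.array([5.2, 4.8, 6.1, 5.5, 5.9, 4.6, 5.3, 5.7])  # hypothetical scores
mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean

# 95% confidence interval around the sample mean (t-distribution, df = n - 1)
low, high = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=sem)
```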
Basic Factorial Designs: The 2 x 2
- The most basic factorial design is the 2 x 2.
- Each number refers to a factor, or independent variable.
o 2 x 2 = 2 IV, 2 levels each
o 2 x 2 x 2 = 3 IV, 2 levels each
o 2 x 2 x 3 = 3 IV, two with 2 levels and one with 3 levels
- The number of conditions in a factorial design – or all possible combinations of the
independent variables – can be computed by multiplying the numbers of levels of your
different factors.
o A 2 x 2 design has four possible conditions.
o A 2 x 3 design has six possible conditions.
o A 2 x 2 x 2 design has eight possible conditions
- Each condition – referred to as a cell – represents a unique combination of the levels of the
independent variables.

- In a 2x2 design, there are 2 possible main effects and one interaction: 3 tests.
- In a 2x2x2, there are 3 possible main effects, three 2-way interactions, and one 3-way
interaction: 7 tests.
- Spreading Interactions
  o IV B has an effect at level 1 of IV A, but not at level 2; or
  o IV B has a stronger effect at level 1 of IV A than at level 2
  o One line represents one half of the table (one set of red and blue)
- Cross-Over Interactions
  o IV B has an effect at both levels of IV A, but in different directions (positive at
level 1, negative at level 2).
An interaction simply informs us that the effects of at least one independent variable depend on the level
of another independent variable. Whenever an interaction is detected, researchers need to conduct
additional analyses to determine where that interaction is coming from.
Examine each IV at each level of the other IV, looking for effects.
It is only necessary to look for simple effects when an interaction is present.
You look for simple effects for each condition; a 2x2 design would have 4 potential simple effects.
[Example plots omitted: main effect of weather; main effect of task; null effect; interaction]

- Alpha:
o The criterion we set (often .05)
o The amount of Type 1 error risk we are willing to accept
- p
o how likely you are to have found a particular set of observations if the null
hypothesis were true
o The actual probability of committing a Type I error associated with a given test
statistic
- Test statistic
o t, F, Z etc. calculated based on your data
- Critical Value
o The value (of t, Z, F etc…) that corresponds to the selected alpha value.
o The test statistic must exceed this value in order to reach significance.
o “Cut off” for rejection region
- Weak relationships based on medium or small samples are never statistically significant and
strong relationships based on medium or larger samples are always statistically significant.
Example: The Fairness of a Coin
- Assess whether a coin is fair by flipping it 20 times.
o If a coin is fair, we would expect to get roughly 10 heads and 10 tails.
- We anticipate fluctuations from the expected value just by chance. This chance deviation is
known as sampling error – differences between the measurements observed in a data set and
what would be expected from the population values.
- Our null hypothesis (H0) is that the coin is fair – the probability of getting a head (or tail) on
any given flip is 0.5.
o The H1 is that the coin is not fair (the probability of getting a head ≠ 0.5).
Possible Decisions
- Correct decision: The coin is fair, and we do not reject the H0.
- Correct decision: The coin is unfair, and we reject the H0.
- Type I error: We conclude the coin is unfair, when in fact the H0 is true. The probability of a
type I error is the same as the value of α.
- Type II error: We fail to reject the null hypothesis of a fair coin when the coin is actually
unfair.
o The probability of a type II error is denoted by beta (β).
o 1 - β = the power of the test (how likely you are to detect a real effect of a certain
magnitude, given your sample size).

Law of Large Numbers
- As we observe more results, the average gets closer to our theoretical mean
o In a normal distribution, the standard deviation is meaningful.
o 68% of values fall within 1 SD of the mean
o 95% of values fall within 2 SD
o 99.7% fall within 3 SD
- Sampling distribution of the mean: The pattern of mean values obtained when drawing
many random samples of a given size from a population and computing the mean for each
sample.
- An important property of sampling distributions is that as sample size increases, the
variability (variance, standard deviation) of the sampling distribution decreases.
- Standard error: The standard deviation of a sampling distribution. This value is calculated
by dividing the standard deviation by the square root of the sample size.
o As a result, the larger the sample size, the greater the precision in our estimates and
the smaller our p value will be.
- The standard deviation (SD) measures the amount of variability, or dispersion, from the
individual data values to the mean, while the standard error of the mean (SEM) measures
how far the sample mean (average) of the data is likely to be from the true population mean.
- Central limit theorem: A theorem that says with a large sample size, the sampling
distribution of the mean will be normal or nearly normal in shape.
o Even with populations having dramatically non-normal distributions, the sampling
distribution of the mean will be increasingly normal in shape as sample sizes
increase.
o This allows us to make use of the many attractive properties of the normal
distribution
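These properties can be seen directly by simulation; a sketch using a deliberately skewed (exponential) population, showing that the spread of sample means shrinks toward SD/√n as n grows:

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)  # dramatically non-normal

def sampling_sd(n, draws=2000):
    """SD of the sampling distribution of the mean for samples of size n."""
    means = [rng.choice(population, size=n).mean() for _ in range(draws)]
    return float(np.std(means))

sd_n10 = sampling_sd(10)    # wider sampling distribution
sd_n100 = sampling_sd(100)  # narrower: roughly population SD / sqrt(100)
```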
- Statistical assumptions are made based on our understanding of how this works:
o Normality
o Homogeneity of variance
 Ex: Levene’s test, Mauchley’s test
Criticisms of Hypothesis Testing
- Three of the common issues surrounding the use of NHST are:
o 1) The overreliance on p as an indicator of effect size or importance.
o 2) The arbitrary nature of a reject/fail-to-reject decision based on p.
o 3) The overemphasis on α and type I errors, leading to underpowered research
studies.

- To achieve a power of 0.8 in the presence of a large effect, 25 participants are required for
each group
- For a small effect, 393 participants are required for each group!
- To find a small effect, with high power, a large sample size is required. To find a large effect,
with high power, a small sample size is required.
- Small effect + High power = Large sample size.
- Medium effect + High power = Medium sample size
- Large effect + High power = Small sample size.
Effect size is the magnitude of the effect, power is your ability to recognize it.
The more conditions you add, the less ability to capture an effect in each condition. Power must be
distributed. Because of this, within-subjects design maximizes power.
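The sample sizes quoted above can be reproduced with the standard normal-approximation formula for a two-sample comparison, n per group ≈ 2(z_alpha/2 + z_power)² / d² (a sketch; exact t-based calculations give values a participant or two higher):

```python
from math import ceil
from scipy.stats import norm

def n_per_group(d, power=0.8, alpha=0.05):
    """Approximate per-group n for a two-sample test (normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)  # criterion for alpha (two-tailed)
    z_power = norm.ppf(power)          # criterion for the desired power
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)

n_large = n_per_group(0.8)  # large effect -> small sample (25 per group)
n_small = n_per_group(0.2)  # small effect -> large sample (393 per group)
```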

- Between Groups:
o Two groups = independent t-test
o 3+ groups = One-way ANOVA
- Within Groups:
o Two sets of observations = dependent/paired t-test (could use anova)
o 3+ sets of observations = single factor repeated measures
o Can also be Randomized Blocks
- Factorial Designs:
o factorial designs can be within subjects, or mixed (both between and within subjects
factors)

- One-way ANOVA is not appropriate for within-subjects designs in which the means being
compared come from the same participants tested under different conditions or at different
times.
- The main difference is that measuring the dependent variable multiple times for each
participant allows for a more refined measure of MSE
- In a between-subjects design, these stable individual differences would simply add to the
variability within the groups and increase the value of MSE (which would, in turn, decrease
the value of F). In a within-subjects design, however, these stable individual differences can
be measured and subtracted from the value of MSE. This lower value of MSE means a higher
value of F and a more sensitive test.
Repeated Measures
- sometimes also called “within-subjects” or “within-participants”
- the same participant is measured on the same dependent variable multiple times (more than 2)
- (if only 2 measurements just use a paired samples t-test)
- e.g. the same participant is measured on their mood (1) before and (2) after a treatment and
then (3) again after a week
- effects of placebo vs treatment A vs treatment B on blood pressure can be studied in the same
participants, each participant can serve as their own control
- behaviour of subjects can be studied over multiple time points
Assumptions
- Independent random sampling
- Normality
- Circularity of the covariance matrix (sphericity)
- Null hypothesis
One Way ANOVA

- Since the error term is reduced, the denominator of the F ratio is reduced, increasing F.
- Because big F values usually let us reject the idea that differences in our means are due to
chance, the repeated-measures ANOVA becomes a more sensitive test of the differences
(its F-values are usually larger).
- The repeated measures ANOVA uses different degrees of freedom for the error term, and
these are typically a smaller number of degrees of freedom. So, the F-distributions for the
repeated measures and between-subjects designs are actually different F-distributions,
because they have different degrees of freedom.
Repeated Measures:
o the same individuals are measured multiple times, as one group
o Basic within subjects design
Pros/Cons of Repeated Measures
- Advantages
o Each subject serves as their own control
o Increased Power due to partitioning of error
o More efficient
- Disadvantages
o Memory/fatigue effects
o Carry Over Effects
 Ex: alcohol accumulated in blood stream
o Order Effects
 0, 2, 4, 6 oz of alcohol vs. 6, 4, 2, 0 oz

Between Subjects Factorial ANOVA


- When more than one independent variable is included in a factorial design, the appropriate
approach is the factorial ANOVA.
- The main difference is that it produces an F ratio and p value for each main effect and for
each interaction
- Appropriate modifications must be made depending on whether the design is between-
subjects, within-subjects, or mixed.
Randomized Blocks:
  o Uses “blocks” of subjects matched on some relevant feature
  o Controls for within-group error

Assumptions
- Independent Random Sampling
- Normality
- Homogeneity of Variance
o Between groups:
 homogeneity of variance
 Equivalence of covariance matrices (Box’s test)
o Within groups:
 assumption of circularity
 Use Greenhouse Geisser Correction if there are more than 2 levels!
- Null hypotheses
o Two sets: interaction and main effects
 Ex: A (between subjects main effect), B (within subjects main effect), AB (interaction)
o Error term for between subject is distinct from error term used within subjects
o Between subjects x within subjects interaction is considered a within subjects effect
 So A is a between subjects effect, but B and AB are within subjects
- Example A shows a between subjects design. Each subject only uses 1 brand.
o Uses 40 sample size, while B uses 10.
- In example B, each golfer is assigned to a block, but still experiences all 4 brands.
o If you want to stay strictly within subjects, you will use a randomized block design.

Split Plots
- 2 or more factors (IVs)
o At least one is independent/between groups
o At least one is repeated measures/within groups
- We’ll stick with the simplest option:
  o 2-factor split-plot design (1 between, 1 within)
- Again… NOT the same as 2 one-way ANOVAs
o Looking for interaction between the 2 factors
o Main effects of secondary interest and must be interpreted in light of the interaction
o Post Hoc on main effects or simple main effects. (The differences in one variable at
every level of the other)
Example
- A personality researcher hypothesized that women are more worried than men about what people
think about them. To test this, the researcher examined male and female participants’ nose-picking
behaviour as a function of how many people were potentially watching. The experiment used a
doctor’s office waiting room: each subject (who was actually going to see the doctor) was left in the
waiting room for 40 minutes while other people came and went. The experimenter engineered it so
that, out of the 40 minutes, there were 0, 1, 2, or 3 other people in the room for 10 minutes each.
Through a two-way mirror, an independent judge recorded the number of times the subject inserted
a finger into either nostril.

Greenhouse-Geisser
- Makes test more conservative
- Lowers DFs (does not change F)
o Lowering DFs makes associated p-value higher (ie: closer to 1.0), because you are
getting p-value from F-distribution based on lower DFs
- Report Greenhouse-Geisser for interaction and main effect of within-subjects factor if within
subjects factor has more than 2 levels
o If 2 levels, no need for correction (no risk of inflation)
o Greenhouse Geisser and “Sphericity Assumed” will be identical if only two levels.

- In ANOVA (or t-tests):
  o IVs are categorical
  o DVs are continuous
- In linear regression:
  o IVs are continuous
  o DVs are continuous
Single-Subject Research Designs
- A primary reason to use single-subject research is that large sample sizes may hide individual
differences of participants.
o For example, a treatment that has a positive effect for half the people exposed to it
but a negative effect for the other half would, on average, appear to have no effect at
all. Single-subject research, however, would likely reveal these individual
differences.
- A secondary assumption of single-subject research is that it is important to discover causal
relationships through the manipulation of an independent variable, the careful measurement
of a dependent variable, and the control of extraneous variables.
o For this reason single-subject research is considered to have good internal validity.
- A third assumption of single-subject research is that it is important to study strong and
consistent effects that have biological or social importance
o Applied researchers, in particular, are interested in treatments that have substantial
effects on important behaviors and that can be implemented reliably in the real-world
contexts in which they occur. This is sometimes referred to as social validity
- This type of research is most often used in behavioural studies, where conditions can differ so
extremely, and it is far more important to be able to see individual differences.

- In the generic design, the individual is exposed to different levels of the independent variable over
time. This is called a reversal design, following an A-B-A format.
o The change from one condition to the next does not usually occur after a fixed
amount of time or number of observations. Instead, it depends on the participant’s
behavior. Specifically, the researcher waits until the participant’s behavior in one
condition becomes fairly consistent from observation to observation before changing
conditions. This is sometimes referred to as the steady state strategy. This is under
the notion that when the participant’s behaviour becomes steady in one condition, it
will be easier to recognize a change in another condition.
o The effect of an independent variable is easier to detect when the “noise” in the data
is minimized.
Reversal Design
- In a basic reversal design, a baseline for the dependent variable is established before
treatment introduction, serving as a control condition (phase A). Once steady state responding
is reached, phase B begins with treatment introduction. The researcher waits for the
dependent variable to stabilize to assess changes. The design can include treatment
reintroduction (ABAB) or further baseline phases. In such designs, responding in the second A phase may differ from the first because of residual effects of the treatment, emphasizing the need for B to reach stability before returning to A. Reversal increases internal validity by demonstrating that changes in
the dependent variable coincide with treatment introduction and removal, suggesting causal
relationships and minimizing the influence of extraneous variables.
- In a multiple-treatment reversal design, a baseline phase is followed by separate phases in
which different treatments are introduced. (ABCACBA). The participant could then be
returned to a baseline phase before reintroducing each treatment—perhaps in the reverse
order as a way of controlling for carryover effects.
- In an alternating treatments design, two or more treatments are alternated relatively quickly
on a regular schedule. (ABCBC)
Correlational Research
- A measure of the strength of the relationship between two variables (behaviours, beliefs, etc.). Variables are “things that can change.”
- Asks the question: “Do people with high (or low) scores on X also tend to have high (or low)
scores on Y?”
- Note that while ANOVA and t-tests use categorical IVs, correlation uses continuous variables.
- Correlations establish an association, not causation
o So…why not skip this and just do experimental research?
- May be starting point for future research
- Topic may be unethical or impractical to manipulate
o Amount of smoking and work productivity
 Categorical variables represent distinct categories or groups and can only
take on a limited number of values. Examples include gender (male/female),
color (red/blue/green), and type of car (sedan/SUV/truck).
 Continuous variables can take on any value within a range and are often
measured on a scale. Examples include height, weight, temperature, and age.
They can have infinite possible values within a given range.
- Inferential statistics: tests the probability that the observed relationship occurred by chance (i.e., was a fluke). Assuming that alpha is .05, a significant result means there is less than a 5% chance it was a fluke (i.e., a Type I error).
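The correlation coefficient itself can be computed directly from its definition: the covariance of X and Y divided by the product of their standard deviations. A minimal sketch with made-up data:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance of X and Y over the product
    of their standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: daily smartphone use (hours) vs. a mood score
use   = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]
print(round(pearson_r(use, score), 2))  # 0.77
```

Libraries such as SciPy (`scipy.stats.pearsonr`) return both r and the p-value used for the inferential test.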
The Causation Issue
- Problems interpreting causation from correlation:
o Spurious Correlations
 Nick Cage Movies vs. # Drownings
 Correlations with no causal connection between the variables; they can appear even in whole populations, not just small samples.
o Third Variable
 Shark Attacks vs. Ice Cream Sales
 Third variable? Sunny Weather!
o Chance Correlations, or due to Random Sampling
 For any given sample, we will have “fluke” correlations
 This will happen less as sample size (n) gets larger
 Small samples are more likely to yield a correlation that is unrepresentative of the population
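The point that flukes shrink as n grows can be demonstrated by simulation: correlate two variables that are, by construction, completely unrelated, and watch the average |r| fall as the sample gets larger. A small sketch (all numbers are simulated):

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mean_abs_r(n, trials, rng):
    """Average |r| between two *independent* random variables: any
    nonzero correlation observed here is pure sampling fluke."""
    total = 0.0
    for _ in range(trials):
        xs = [rng.random() for _ in range(n)]
        ys = [rng.random() for _ in range(n)]
        total += abs(pearson_r(xs, ys))
    return total / trials

rng = random.Random(0)             # fixed seed so the demo is repeatable
small = mean_abs_r(10, 200, rng)   # 200 samples of n = 10
large = mean_abs_r(200, 200, rng)  # 200 samples of n = 200
print(f"mean |r| with n=10:  {small:.2f}")
print(f"mean |r| with n=200: {large:.2f}")
# Fluke correlations are much larger, on average, in small samples
```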
Quasi-Experimental Designs
- Different reasons for “Quasi” status
o Participant variable
o ex post facto
- Lack of randomization (of participants or order of group)
o Changes Across time
 Longitudinal
 One group tested at ages 2, 4, & 6 years
 Cross-Sectional
 Groups of 2, 4, & 6 year olds tested simultaneously
 Pretest-Posttest
 Ex: depression inventory taken before and after treatment
(counterbalancing not possible)
 Regression to the mean and maturation can cause changes in DV
 Interrupted time-series: multiple measurements across time
One-Group Designs
- In a one-group posttest only design, a treatment is implemented (or an independent variable
is manipulated) and then a dependent variable is measured once after the treatment is
implemented.
o This is the weakest type of quasi-experimental design. A major limitation to this
design is the lack of a control or comparison group. There is no way to determine
what the attitudes of these students would have been if they hadn’t undergone the
treatment.
- In a one-group pretest-posttest design, the dependent variable is measured once before the
treatment is implemented and once after it is implemented.
 One alternative explanation goes under the name of history. Other things
might have happened between the pretest and the posttest that caused a
change from pretest to posttest.
 Another alternative explanation goes under the name of maturation.
Participants might have changed between the pretest and the posttest in ways
that they were going to anyway because they are growing and learning.
 Similarly, instrumentation can be a threat to the internal validity of studies
using this design. Instrumentation refers to when the basic characteristics of
the measuring instrument change over time.
 Another alternative explanation for a change in the dependent variable in a
pretest-posttest design is regression to the mean. This refers to the statistical
fact that an individual who scores extremely high or extremely low on a
variable on one occasion will tend to score less extremely on the next
occasion.
 A closely related concept—and an extremely important one in psychological
research—is spontaneous remission. This is the tendency for many medical
and psychological problems to improve over time without any form of
treatment.
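Regression to the mean is easy to see in simulation: give everyone a stable true score plus occasion-specific noise, select the extreme scorers at time 1, and their time 2 average drifts back toward the group mean with no treatment at all. A sketch with simulated data:

```python
import random

rng = random.Random(42)

# Each person's observed score = stable true score + occasion-specific noise
true_scores = [rng.gauss(100, 10) for _ in range(1000)]
time1 = [t + rng.gauss(0, 10) for t in true_scores]
time2 = [t + rng.gauss(0, 10) for t in true_scores]

# Select the 50 highest scorers at time 1 (an "extreme" group)
top = sorted(range(1000), key=lambda i: time1[i], reverse=True)[:50]

m1 = sum(time1[i] for i in top) / len(top)
m2 = sum(time2[i] for i in top) / len(top)
print(f"time 1 mean of top scorers: {m1:.1f}")  # well above 100
print(f"time 2 mean of same people: {m2:.1f}")  # closer to 100, no treatment
```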
- A common approach to ruling out the threats to internal validity described above is to revise the research design to include a control group, one that does not receive the treatment.
Interrupted Time-Series Design
- A variant of the pretest-posttest design is the interrupted time-series design. A time series is
a set of measurements taken at intervals over a period of time.
Non-equivalent Groups Design
- A nonequivalent groups design, is a between-subjects design in which participants have not
been randomly assigned to conditions.
o In the posttest only nonequivalent groups design, participants in one group are
exposed to a treatment, a nonequivalent group is not exposed to the treatment, and
then the two groups are compared.
o In the pretest-posttest nonequivalent groups design there is a treatment group that
is given a pretest, receives a treatment, and then is given a posttest. But at the same
time there is a nonequivalent control group that is given a pretest, does not receive
the treatment, and then is given a posttest.
Qualitative Research
- collects large amounts of data from a small number of participants, often exploratory
- analyses the data nonstatistically.
Complex Correlational Designs
- When you can’t run an experiment, but you still want to understand the (probable) cause
- Two improvements we could make:
o Track things over time (longitudinal research)
o Control for confounding variables (multiple predictors)
Longitudinal Designs
- Allow you to see if changes in X precede changes in Y
o Establish directionality
o E.g., “smartphone use is associated with depression”
 Which comes first?
Longitudinal Designs vs Correlational Designs
- The advantage of longitudinal designs is that they give you a better sense of the directionality between two variables.
o Which one changes first?
o Which comes before the other?
- With a simple cross-sectional design, you can see that two variables are related, but you don’t
know which might have caused which.
o When you measure them both repeatedly, the time course of the variables becomes
clearer.
Alternative Explanations
- Longitudinal designs can’t eliminate the third-variable problem.
- What if some other, third variable is causing both?
o Physical Activity Level
o Family involvement?
o Stress?
- Solution? Maybe using multiple predictors! (i.e., multiple IVs)
- (Still no experimental manipulation, only measured variables)
Correlation Matrix
- The numerical value in each box represents the correlation (r)
o Ranges from -1.0 to +1.0
o Value of .2 is modest, .5 is quite large
Significance is indicated with asterisks
- * means p < .05
- ** means p < .01
- *** means p < .001
Linear Regression
- Attempts to “predict” Y using X, by fitting a regression line
o Y = a + b(X) + error (intercept a, slope b)
- 1 Predictor and 1 DV
- Both continuous in nature
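For a single predictor, the least-squares line has a closed-form solution: the slope is cov(X, Y) / var(X), and the line passes through the point of means. A minimal sketch with hypothetical data:

```python
def fit_line(xs, ys):
    """Least-squares fit of Y = a + b*X: slope b = cov(X, Y) / var(X),
    intercept a chosen so the line passes through the point of means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# Hypothetical: hours studied (X) predicting exam score (Y)
hours  = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]
a, b = fit_line(hours, scores)
print(f"score ≈ {a:.1f} + {b:.1f} * hours")  # score ≈ 47.7 + 4.1 * hours
```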
Multiple Regression
- With just one predictor, regression is just like correlation
- Advantage of regression: you can have multiple predictors
o Y = b0 + b1(X1) + b2(X2) + b3(X3) + error
 Each b makes its own independent contribution to the model, over and above
other predictors
- DV must be continuous
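With several predictors there is no single-formula slope, but the coefficients can still be obtained by solving the normal equations (X'X)b = X'y. A self-contained sketch using hypothetical noise-free data generated from known coefficients, which the fit should recover:

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_multiple(rows, y):
    """Least squares via the normal equations (X'X) b = X'y,
    with a column of 1s prepended for the intercept b0."""
    X = [[1.0] + list(row) for row in rows]
    n, p = len(X), len(X[0])
    XtX = [[sum(X[i][a] * X[i][c] for i in range(n)) for c in range(p)]
           for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    return solve(XtX, Xty)

# Hypothetical noise-free data from y = 1 + 2*x1 + 3*x2, so the fit
# should recover exactly those coefficients
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
b0, b1, b2 = fit_multiple(rows, y)
print(round(b0, 6), round(b1, 6), round(b2, 6))  # 1.0 2.0 3.0
```

In real analyses this is done with a library (e.g., `numpy.linalg.lstsq` or a stats package); the sketch just shows that each b is estimated jointly with the others, which is what lets each predictor contribute "over and above" the rest.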
IV’s as statistical control
- What is the effect of X1 on Y, over and above the effects of X2 and X3 on Y?
- The variables you control for are called covariates
- They’re expected to co-vary with your main IV!
- In our current example, we might want to consider the effects of screen time on depression,
while controlling for physical activity
o covariates are additional variables that are taken into account in statistical models to
make sure that the effect of the primary independent variable(s) on the dependent
variable is accurately estimated, considering the potential influence of other relevant
factors.
o For example, in a study examining the effect of a new teaching method (independent
variable) on student performance (dependent variable), factors such as prior academic
achievement, socioeconomic status, or student motivation might be considered as
covariates to control for their potential influence on student performance.
Survey Research
- Quantitative and Qualitative method with two important characteristics
o Measured using self-reports
o Lots of emphasis put on sampling, often random.
- Most survey research is non-experimental. It is used to describe single variables, without any
manipulation.
Context Effects on Survey Responses
- Complexity can lead to unintended influences on respondents’ answers. These are often
referred to as context effects because they are not related to the content of the item but to the
context in which the item appears.
o For example, there is an item-order effect when the order in which the items are
presented affects people’s responses. One item can change how participants interpret
a later item or change the information that they retrieve to respond to later items
o For example, researcher Fritz Strack and his colleagues asked college students about
both their general life satisfaction and their dating frequency. When the life
satisfaction item came first, the correlation between the two was only −.12,
suggesting that the two variables are only weakly related. But when the dating
frequency item came first, the correlation between the two was +.66, suggesting that
those who date more have a strong tendency to be more satisfied with their lives.
Reporting the dating frequency first made that information more accessible in
memory so that they were more likely to base their life satisfaction rating on it.
o The response options provided can also have unintended effects on people’s
responses. For example, when people are asked how often they are “really irritated”
and given response options ranging from “less than once a year” to “more than once a
month,” they tend to think of major irritations and report being irritated infrequently.
But when they are given response options ranging from “less than once a day” to
“several times a month,” they tend to think of minor irritations and report being
irritated frequently. People also tend to assume that middle response options
represent what is normal or typical. So if they think of themselves as normal or
typical, they tend to choose middle response options. For example, people are likely
to report watching more television when the response options are centered on a
middle option of 4 hours than when centered on a middle option of 2 hours. To
mitigate order effects, rotate questions and response options when there is no natural order. Counterbalancing or randomizing the order of presentation of the questions in online surveys is good practice and can reduce response-order effects; for example, among undecided voters, the first candidate listed on a ballot receives a 2.5% boost simply by virtue of being listed first.
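Randomizing item order per respondent, as suggested above, is built into most survey platforms; conceptually it looks like this (the question wordings are just illustrative):

```python
import random

questions = [
    "How satisfied are you with your life overall?",
    "How often do you go on dates?",
    "How often are you really irritated?",
]

def randomized_order(items, rng):
    """Give each respondent their own random question order so that
    item-order effects average out across the sample."""
    order = items[:]      # copy: leave the master list untouched
    rng.shuffle(order)
    return order

rng = random.Random()     # unseeded: a fresh order per respondent
survey_for_one_respondent = randomized_order(questions, rng)
print(survey_for_one_respondent)
```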
Types of Survey Items
- Questionnaire items can be either open-ended or closed-ended.
o Open-ended: Fill in a unique response (qualitative)
o Closed-ended: multiple choice (quantitative)
 Likert Scale: Present people with a statement. They respond on a scale of
strongly disagree to strongly agree
 For closed-ended items, it is also important to create an appropriate response
scale. For categorical variables, the categories presented should generally be
mutually exclusive and exhaustive.
- Effective items: the acronym BRUSO stands for “brief,” “relevant,” “unambiguous,” “specific,” and “objective.”
Formatting a Survey
- Every survey should have a written or spoken introduction that serves two basic functions
o One is to encourage respondents to participate in the survey
o The second function of the introduction is to establish informed consent. Remember
that this involves describing to respondents everything that might affect their decision
to participate.
Sampling
- Once the population has been specified, probability sampling requires a sampling frame.
This sampling frame is essentially a list of all the members of the population from which to
select the respondents.
- There are a variety of different probability sampling methods. Simple random sampling is
done in such a way that each individual in the population has an equal probability of being
selected for the sample.
- A common alternative to simple random sampling is stratified random sampling, in which
the population is divided into different subgroups or “strata” (usually based on demographic
characteristics) and then a random sample is taken from each “stratum.”
- Proportionate stratified random sampling can be used to select a sample in which the
proportion of respondents in each of various subgroups matches the proportion in the
population.
- Disproportionate stratified random sampling can also be used to sample extra respondents
from particularly small subgroups—allowing valid conclusions to be drawn about those
subgroups.
- Yet another type of probability sampling is cluster sampling, in which larger clusters of
individuals are randomly sampled and then individuals within each cluster are randomly
sampled. This is the only probability sampling method that does not require a sampling
frame.
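Proportionate stratified random sampling can be sketched in a few lines: split the population into strata, then draw a simple random sample from each, sized by the stratum's share of the population (the student population here is made up):

```python
import random

def proportionate_stratified_sample(population, strata_key, n, rng):
    """Draw a simple random sample from each stratum, sized in
    proportion to that stratum's share of the population."""
    strata = {}
    for person in population:
        strata.setdefault(strata_key(person), []).append(person)
    sample = []
    for members in strata.values():
        # Rounding can make stratum sizes sum slightly off n in general;
        # this sketch ignores that detail.
        k = round(n * len(members) / len(population))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 800 undergrads and 200 grad students
population = ([("undergrad", i) for i in range(800)] +
              [("grad", i) for i in range(200)])
rng = random.Random(1)
sample = proportionate_stratified_sample(population, lambda p: p[0], 100, rng)
print(len(sample))  # 100 total: 80 undergrads, 20 grads
```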
Sample Size and Population Size
- Confidence intervals depend only on the size of the sample, not on the size of the population. So a sample of 1,000 in which 50% favour an option would produce a 95% confidence interval of roughly 47% to 53% regardless of whether the population size was a hundred thousand, a million, or a hundred million.
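This can be checked with the standard margin-of-error formula for a sample proportion, z * sqrt(p(1 - p) / n): the sample size n appears, the population size does not.

```python
from math import sqrt

def margin_of_error(p_hat, n, z=1.96):
    """95% margin of error for a sample proportion; note that the
    population size appears nowhere in the formula."""
    return z * sqrt(p_hat * (1 - p_hat) / n)

# A sample of 1,000 in which 50% say "yes":
m = margin_of_error(0.50, 1000)
print(f"±{m * 100:.1f} percentage points")  # ±3.1, i.e., roughly 47% to 53%
```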
Bias
- Sampling bias occurs when a sample is selected in such a way that it is not representative of
the entire population and therefore produces inaccurate results.
- If these survey non-responders differ from survey responders in systematic ways, then this
difference can produce non-response bias
Time Variables
- Confounds are variables that might explain your effect
o X predicts Y, and that’s because of Z
- Moderators are variables that qualify your effect
o X predicts Y, but only under Z circumstances