Stats Unit 2
Correlational analysis is a statistical technique used to examine the relationship between two
variables, which is often quantified using the correlation coefficient (r). This coefficient
gives insight into both the direction and magnitude of the relationship between the
variables.
The magnitude, or strength of the relationship, is indicated by how close the correlation
coefficient is to +1 or -1:
r = +1 or -1: Indicates a perfect linear relationship. All data points lie exactly on a
straight line.
r close to ±1: Indicates a strong relationship, where changes in one variable reliably
predict changes in the other.
r close to 0: Indicates a weak or no relationship. Changes in one variable do not
provide much information about changes in the other.
Simply put, the correlation coefficient describes the extent to which two variables are
related to each other. There are several ways to calculate the correlation coefficient, and
they are grouped according to whether they are used in parametric or non-parametric tests.
Correlational analysis can be conducted using both parametric and non-parametric tests,
depending on the nature of the data and the assumptions that can be made about its
distribution.
Parametric tests assume that the data is normally distributed and that the relationship between
the variables is linear. The most common parametric test used for correlation is Pearson's
product-moment correlation (Pearson's r).
Non-parametric tests do not assume a normal distribution of data and are used when data is
ordinal, not normally distributed, or has outliers that would distort parametric analysis. The
two most common non-parametric correlation tests are Spearman's rank-order correlation
(Spearman's ρ) and Kendall's tau (τ).
In summary, Pearson’s correlation is the most widely used parametric test for correlational
analysis, while Spearman’s rho and Kendall’s tau are non-parametric alternatives that are
more suitable for ranked or non-normally distributed data.
Pearson's r measures the strength and direction of the linear association between two
variables, ranging from -1 (perfect negative correlation) to +1 (perfect positive correlation),
with 0 indicating no linear relationship.
Another way to conceptualize Pearson’s r is as the average product of the paired z-scores,
which means each variable's values are standardized (converted into z-scores), and the
correlation coefficient is derived from the average of their products.
Methods of Calculation
Both methods yield the same correlation coefficient but are used depending on the context
and the size of the dataset.
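As a quick illustration, the sketch below (with made-up data, and assuming the two methods meant here are the z-score formula and the raw-score formula) computes Pearson's r both as the average product of paired z-scores and with the raw-score formula, then cross-checks the result with NumPy.

```python
# A minimal sketch of computing Pearson's r two ways. The data are
# hypothetical values used only for illustration.
import numpy as np

x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])   # hypothetical scores on variable X
y = np.array([1.0, 3.0, 4.0, 8.0, 9.0])   # hypothetical scores on variable Y
n = len(x)

# Method 1: average product of paired z-scores (population SDs, ddof=0).
zx = (x - x.mean()) / x.std(ddof=0)
zy = (y - y.mean()) / y.std(ddof=0)
r_zscore = np.mean(zx * zy)

# Method 2: raw-score (computational) formula.
r_raw = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / np.sqrt(
    (n * np.sum(x**2) - np.sum(x)**2) * (n * np.sum(y**2) - np.sum(y)**2)
)

print(round(r_zscore, 4), round(r_raw, 4))   # both routes give the same r
print(round(np.corrcoef(x, y)[0, 1], 4))     # cross-check with NumPy
```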
The Product Moment Correlation (Pearson's r) is based on several key assumptions that
need to be met to ensure valid and reliable results. While it is a robust measure of linear
association between two continuous variables, these assumptions ensure that Pearson's r is
accurately interpreted. The key assumptions are that both variables are measured on an
interval or ratio (continuous) scale, the relationship between them is linear, both variables
are approximately normally distributed, the variability of scores is similar across the range
of values (homoscedasticity), and there are no extreme outliers distorting the relationship.
The correlation coefficient (r) is a statistical measure that quantifies the strength and
direction of a linear relationship between two variables. Pearson’s correlation coefficient,
commonly denoted as r, has several important properties:
1. Range of r
The value of r always lies between −1 and +1 (that is, −1 ≤ r ≤ +1). The sign indicates the
direction of the relationship and the absolute value indicates its strength.
2. Correlation Does Not Imply Causation
A high correlation between two variables does not mean that changes in one variable cause
changes in the other. There could be a third variable affecting both, or the relationship
could be coincidental.
Example: A high correlation between ice cream sales and drowning incidents may
exist, but this doesn’t mean ice cream causes drowning. Instead, both are influenced
by a third factor (hot weather).
3. r Remains Constant
The value of r is unaffected by changes of origin or scale: adding a constant to every score,
or multiplying every score by a positive constant, leaves r unchanged.
The coefficient of alienation (k) is the square root of the coefficient of non-determination
(k² = 1 − r²):
k = √(1 − r²)
Partial correlation measures the strength and direction of the linear relationship between
two variables (X and Y) while controlling for the influence of a third variable (Z). This
method allows researchers to isolate the direct relationship between two variables,
eliminating the confounding effect of one or more other variables that might be influencing
both.
Why use partial correlation? In many cases, the relationship between two variables may
appear strong, but this relationship could be influenced by a third variable. By calculating
partial correlation, you can assess the true relationship between the two variables of interest.
Example:
Let’s say we are investigating the relationship between exercise (X) and weight loss (Y), but
we know that diet (Z) also affects both variables.
Raw correlation (rₓy) between exercise and weight loss may appear strong.
But when we calculate the partial correlation by controlling for diet (Z), we might
find that the relationship between exercise and weight loss weakens, indicating that
diet plays a significant role in explaining weight loss alongside exercise.
Partial correlation provides a powerful method for isolating the direct relationship between
two variables while accounting for the influence of a third. It is widely used in research to
refine interpretations and avoid confounding biases, making it a reliable way to measure
independent linear relationships.
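The sketch below illustrates this with the exercise/diet example, using the standard first-order partial correlation formula; the variable names and data are hypothetical.

```python
# A minimal sketch of first-order partial correlation, computed from the three
# pairwise Pearson correlations. The data are made up for illustration.
import numpy as np

exercise    = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)   # X
diet        = np.array([1, 1, 2, 3, 3, 4, 5, 5], dtype=float)   # Z (confounder)
weight_loss = 0.5 * exercise + 1.5 * diet + np.array(
    [0.2, -0.1, 0.3, 0.0, -0.2, 0.1, -0.3, 0.2])                 # Y

r_xy = np.corrcoef(exercise, weight_loss)[0, 1]
r_xz = np.corrcoef(exercise, diet)[0, 1]
r_yz = np.corrcoef(weight_loss, diet)[0, 1]

# Partial correlation of X and Y controlling for Z:
# r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
r_xy_z = (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

print(f"raw r_xy = {r_xy:.3f}, partial r_xy.z = {r_xy_z:.3f}")
```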
Semi-Partial Correlation
Semi-partial (part) correlation measures the relationship between two variables (X and Y)
while controlling for the influence of a third variable (Z) on only one of them (usually the
predictor X), rather than on both as in partial correlation.
Example:
Suppose we want to examine the relationship between job performance (Y) and job
satisfaction (X), while controlling for the effect of work experience (Z) on job
satisfaction. Semi-partial correlation would adjust job satisfaction to account for
work experience, and then measure the correlation between job satisfaction (with
work experience controlled) and job performance.
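A minimal sketch of the semi-partial idea, assuming the usual residualization approach (remove Z from X only, then correlate the residual with the unadjusted Y); the variable names and data are hypothetical.

```python
# A minimal sketch of semi-partial (part) correlation via residualization.
# Names (experience, job_satisfaction, performance) follow the example above
# but the data are invented.
import numpy as np

experience       = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)             # Z
job_satisfaction = 0.6 * experience + np.array(
    [0.3, -0.2, 0.1, 0.4, -0.3, 0.2, -0.1, 0.0])                                # X
performance      = 0.5 * job_satisfaction + 0.3 * experience + np.array(
    [0.1, 0.0, -0.2, 0.3, -0.1, 0.2, 0.0, -0.3])                                # Y

# Regress X on Z and keep the residuals (the part of X not explained by Z).
slope, intercept = np.polyfit(experience, job_satisfaction, 1)
x_resid = job_satisfaction - (intercept + slope * experience)

# Semi-partial correlation: residualized X correlated with the unadjusted Y.
r_semi_partial = np.corrcoef(x_resid, performance)[0, 1]
print(f"semi-partial r = {r_semi_partial:.3f}")
```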
Types of Dichotomy:
1. Natural Dichotomy:
o A natural dichotomy occurs when the division into two categories is inherent
in the variable itself. These variables naturally exist in only two possible
states.
o Characteristics:
The division is inherent to the variable.
The underlying assumption is that the variable is naturally categorical,
meaning there are only two distinct categories with no in-between
states.
o Examples:
Alive or dead: There is no middle state between being alive or dead.
Normal vision or color-blind: Individuals either have normal vision
or are color-blind.
Head or tails: In a coin toss, the result can only be one of the two
outcomes.
2. Artificial Dichotomy:
o An artificial dichotomy is created when a variable that is originally
continuous or has multiple categories is forced into two groups based on
specific criteria set by the researcher.
o Characteristics:
The division is created by the researcher and is not naturally present
in the variable.
The underlying assumption is that the variable could be continuous,
but for the sake of convenience or analysis, it is divided into two
categories.
o Examples:
Fail or pass: A student’s grade could be a continuous variable (e.g.,
percentages), but it is dichotomized into fail or pass based on a
predetermined cutoff point (e.g., 50%).
Socially adjusted or maladjusted: Social adjustment could be
measured on a spectrum, but the researcher divides it into two
categories for analysis.
Poor or not poor: Income levels are continuous, but poverty is often
dichotomized into poor or not poor based on a threshold.
Special Correlations
In correlational analysis, special correlations are used when dealing with variables that are
not both continuous, such as when variables are dichotomous (either artificially or naturally).
Here’s an overview of the key special correlations, their assumptions, and when to use each
one.
3. Biserial Correlation (rb) (Continuous and Artificial Dichotomous Variable)
Used for: When one variable is continuous and the other is an artificial dichotomy.
Example:
o Test Score: A continuous variable ranging from 0 to 100.
o Selection Status: Artificially dichotomized into "Selected" and "Not Selected"
based on a cutoff score (e.g., 60%).
Assumptions:
o Requires that the dichotomous variable reflects an underlying continuous
distribution.
o Assumes normality in the distribution of the continuous variable.
o Requires a large sample size and a balanced split (where the number of cases
in each group is roughly equal).
Key Points:
o The biserial correlation coefficient is not restricted between -1 and +1.
o It cannot be compared directly to other correlation coefficients because it
assumes that the artificial dichotomy stems from a continuous underlying
variable.
Standard Error: Cannot be computed directly, as assumptions about the distribution
of the underlying continuous variable introduce uncertainties.
4. Point Biserial Correlation (rpb) (Continuous and Natural Dichotomous Variable)
Used for: When one variable is continuous and the other is a naturally dichotomous
variable.
Example:
o Salary: A continuous variable representing income.
o Biological Sex: Naturally dichotomous with categories "Male" and "Female."
Assumptions:
o Makes no assumptions about normality or distribution of the continuous
variable.
o Does not require continuity in the underlying dichotomous variable.
Key Points:
o rpb is restricted between -1 and +1, making it comparable to Pearson’s r.
o Standard Error can be computed directly, allowing for significance testing.
The biserial correlation and point biserial correlation are both used to measure the
relationship between a continuous variable and a dichotomous variable, but they differ
in their assumptions and applications. The biserial correlation is applied when the
dichotomous variable is artificial, meaning it is derived from an underlying
continuous distribution that has been divided into two categories (e.g., "pass" and
"fail" based on a cutoff score). It assumes that the underlying variable follows a
normal distribution and requires a large sample size with a balanced split for
accurate results. Additionally, the biserial correlation coefficient is not restricted
between -1 and +1, making it harder to compare with other correlation coefficients,
and its standard error cannot be easily computed due to uncertainties about the
distribution.
In contrast, the point biserial correlation is used when the dichotomous variable is
naturally occurring (e.g., male/female), without assuming any underlying continuity.
It makes no assumptions about the distribution of either variable and is simpler to
apply in a wider range of situations. The point biserial correlation is restricted
between -1 and +1, making it comparable to Pearson’s r, and its standard error can
be computed, allowing for significance testing. While both methods measure the
strength of association between a continuous and dichotomous variable, the choice
between them depends on whether the dichotomous variable is naturally occurring or
artificially created.
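As a rough illustration of the two coefficients, the sketch below computes the point-biserial correlation with scipy and then converts it to a biserial estimate using the standard conversion factor; the data (hours studied and a pass/fail outcome) are hypothetical.

```python
# A minimal sketch contrasting point-biserial and biserial correlation.
# scipy's pointbiserialr handles the naturally dichotomous case; a biserial
# estimate is then obtained from the conversion r_b = r_pb * sqrt(p*q) / h,
# where h is the ordinate of the unit normal curve at the point dividing p and q.
import numpy as np
from scipy.stats import pointbiserialr, norm

hours_studied = np.array([2, 3, 4, 5, 5, 6, 7, 8, 9, 10], dtype=float)  # continuous
passed        = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 1])                # dichotomy

# Point-biserial correlation (appropriate for a natural dichotomy).
r_pb, p_value = pointbiserialr(passed, hours_studied)

# Biserial correlation (appropriate when the dichotomy is artificial).
p = passed.mean()          # proportion in the "pass" group
q = 1 - p
h = norm.pdf(norm.ppf(p))  # normal ordinate at the dividing point
r_b = r_pb * np.sqrt(p * q) / h

print(f"point-biserial r = {r_pb:.3f}, biserial r = {r_b:.3f}")  # r_b can exceed 1
```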
The Spearman Rank-Order Correlation (denoted by the Greek letter ρ, pronounced "rho")
is a non-parametric measure of correlation, named after Charles Spearman. It is used to
assess the strength and direction of the monotonic relationship between two variables.
Unlike Pearson’s correlation, which assumes normally distributed data and linear
relationships, Spearman’s ρ does not make such assumptions, making it useful in specific
situations.
When to Use Spearman Rank-Order Correlation:
1. Ordinal Data: It is ideal for data that is measured on an ordinal scale, where the
variables are ranked, but the intervals between them are not necessarily equal.
o Example: Ranking students based on their class performance (1st, 2nd, 3rd,
etc.).
2. Highly Skewed Distribution: Spearman’s correlation is suitable when the
distribution of the data is skewed or does not meet the normality assumption required
by parametric tests.
o Example: Income levels, which are often skewed, can be analyzed using
Spearman’s ρ.
3. Interval and Ratio Data Converted into Ranks: When continuous data on interval
or ratio scales (e.g., test scores or height) are converted into ranks, Spearman’s ρ can
be applied to measure the correlation between the ranks.
o Example: Ranking athletes based on their performance scores in a
competition.
4. Small Sample Sizes: Spearman’s correlation is also useful when the sample size is
very small, as it does not rely on the assumption of normality or the robustness of
large samples.
o Example: A small study with fewer than 30 participants may benefit from
Spearman’s correlation for reliable results.
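A small sketch of Spearman's ρ on hypothetical rank data, using scipy and cross-checking against the classic difference-of-ranks formula (which applies when there are no tied ranks).

```python
# A minimal sketch of Spearman's rho; the ranks (two judges ranking eight
# contestants) are hypothetical.
from scipy.stats import spearmanr

judge_a = [1, 2, 3, 4, 5, 6, 7, 8]   # ranks given by judge A
judge_b = [2, 1, 4, 3, 6, 5, 8, 7]   # ranks given by judge B

rho, p_value = spearmanr(judge_a, judge_b)
print(f"Spearman's rho = {rho:.3f}, p = {p_value:.3f}")

# Cross-check with the formula rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)),
# valid when there are no ties.
n = len(judge_a)
d_squared = sum((a - b) ** 2 for a, b in zip(judge_a, judge_b))
rho_manual = 1 - 6 * d_squared / (n * (n**2 - 1))
print(f"formula rho = {rho_manual:.3f}")
```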
Kendall's Tau (𝜏) is a non-parametric measure of the association between two variables. It
is based on the relationship between the ranks of the two variables, rather than their actual
values, and is commonly used when data is ordinal or when parametric assumptions (such as
normality) cannot be met. Kendall's Tau is particularly useful for smaller datasets and can
handle ties in the rankings.
1. Ordinal Data: It is ideal when the data is ranked rather than continuous.
2. Non-parametric: Useful when the assumptions of parametric correlation tests
(like Pearson’s r) are not satisfied.
3. Continuous Data with Outliers: Kendall's Tau can be applied to continuous
variables when there are outliers, as it is less sensitive to extreme values.
4. Two Variables: It is only applicable when analyzing the relationship between two
variables.
5. Sample Size (N > 10): Kendall’s Tau works best when the sample size is greater than
10, which allows it to provide a meaningful measure.
Kendall’s Tau compares the ranks of two variables by evaluating concordant and discordant
pairs:
𝜏 = +1: Perfect positive relationship (higher ranks in one set correspond to higher
ranks in the other).
𝜏 = -1: Perfect negative relationship (higher ranks in one set correspond to lower
ranks in the other).
𝜏 = 0: No correlation, indicating independence or randomness in the ranks.
In summary, Kendall’s Tau is a versatile and robust measure of correlation, especially useful
for ranked data and when the assumptions of other correlation measures cannot be met. Its
focus on concordant and discordant pairs allows it to provide insight into the monotonic
relationship between two variables.
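The sketch below counts concordant and discordant pairs by hand on hypothetical ranks and compares the result with scipy's kendalltau (which computes the tie-corrected tau-b).

```python
# A minimal sketch of Kendall's tau from concordant and discordant pairs.
# The rank data are hypothetical and contain no ties.
from itertools import combinations
from scipy.stats import kendalltau

x_ranks = [1, 2, 3, 4, 5, 6]
y_ranks = [2, 1, 4, 3, 6, 5]

concordant = discordant = 0
for i, j in combinations(range(len(x_ranks)), 2):
    direction = (x_ranks[i] - x_ranks[j]) * (y_ranks[i] - y_ranks[j])
    if direction > 0:
        concordant += 1
    elif direction < 0:
        discordant += 1

n_pairs = len(x_ranks) * (len(x_ranks) - 1) / 2
tau_a = (concordant - discordant) / n_pairs   # tau-a (no tie correction needed here)

tau_b, p_value = kendalltau(x_ranks, y_ranks)
print(f"tau-a = {tau_a:.3f}, scipy tau-b = {tau_b:.3f}")
```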
Concept of Regression
Regression analysis goes beyond the simple association measured by correlation and allows
researchers to examine how one variable can predict or explain the variation in another
variable. It is a powerful statistical tool that models the relationship between two or more
variables, enabling predictions about one variable based on information from others.
1. Predictor (Independent) Variable: The variable that is used to predict or explain the
other variable. This is also called the explanatory variable.
2. Outcome (Dependent) Variable: The variable that is being predicted or explained,
also known as the response variable.
For example, if a researcher is studying the relationship between test scores and self-esteem,
test scores might be the predictor variable, while self-esteem is the outcome variable.
Similarly, in a study on job satisfaction and employee retention, job satisfaction could be
the predictor, and employee retention the outcome variable.
Regression Line
The regression line is a line that best fits the data in a scatterplot, representing the
relationship between the predictor (independent) variable and the outcome (dependent)
variable. It is the line that minimizes the error, or the distance between the observed data
points and the predicted values on the line. The purpose of the regression line is to help
predict the outcome variable based on the predictor variable.
Prediction: The regression line is used to predict the value of the outcome variable
(Y) based on the values of the predictor variable (X).
Best Fit: The line is calculated in a way that minimizes the sum of the squared
differences (errors) between the observed data points and the values predicted by the
line. This method is known as least squares regression.
Thus, the regression line offers a mathematical model for understanding and predicting
the relationship between variables.
Types of Regression:
1. Simple Linear Regression: Involves one predictor and one outcome variable, fitting
a straight line to the data to make predictions.
o Equation: Y = a + bX
Y = predicted outcome
a = intercept
b = slope (indicating the relationship between X and Y)
X= predictor variable
2. Multiple Regression: Involves two or more predictor variables used to predict an
outcome.
o Equation: Y = a + b₁X₁ + b₂X₂ + … + bₙXₙ
Overall, regression offers a more detailed and predictive approach than correlation,
allowing for an in-depth analysis of relationships between variables.
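A minimal sketch of a simple linear regression fitted by least squares, using NumPy; the study-hours data are made up for illustration.

```python
# A minimal sketch of simple linear regression (Y = a + bX) by least squares.
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)          # X, predictor
score = np.array([52, 55, 61, 64, 70, 72, 78, 83], dtype=float)  # Y, outcome

# np.polyfit with degree 1 returns the least-squares slope and intercept.
b, a = np.polyfit(hours, score, 1)
print(f"regression line: Y = {a:.2f} + {b:.2f} * X")

# Predict the outcome for a new value of the predictor.
new_hours = 5.5
print(f"predicted score for {new_hours} hours: {a + b * new_hours:.1f}")
```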
When performing linear regression, several key assumptions need to be met for the results to
be valid. These assumptions ensure that the linear model provides an accurate representation
of the data and that the statistical tests (such as hypothesis tests) produce reliable results.
1. Linearity
Definition: The relationship between the independent variable (X) and the dependent
variable (Y) should be linear.
Explanation: The change in the dependent variable is proportional to the change in
the independent variable, meaning the relationship can be represented with a straight
line.
Implication: If the relationship is non-linear, the linear regression model may not
capture the true association, and predictions will be inaccurate.
Check: A scatterplot of the variables can help determine whether the relationship is
linear.
2. Independence
Definition: The residuals (errors) of the model should be independent of each other.
Explanation: This means that the error term for one observation should not be
correlated with the error term for another observation. In time series data, for
example, errors may be correlated, which would violate this assumption (a condition
known as autocorrelation).
Implication: Violations of independence can lead to biased estimates and incorrect
statistical inferences.
Check: For time series data, a Durbin-Watson test can be used to check for
autocorrelation.
3. Homoscedasticity
Definition: The variance of the residuals should be constant across all levels of the
independent variable (X).
Explanation: The spread of the residuals should not vary with the value of the
predictor. If the variance of the residuals changes, the data are said to be
heteroscedastic.
Implication: If heteroscedasticity is present, it can lead to inefficient estimates and
affect the validity of hypothesis tests.
Check: A residual vs. fitted values plot can help identify heteroscedasticity, as non-
constant variance will appear as a funnel or fan-shaped pattern.
4. Normality of Residuals
Definition: The residuals (errors) of the model should follow a normal distribution.
Explanation: While linear regression does not require the independent or dependent
variables to be normally distributed, the residuals themselves should be normally
distributed. This assumption is crucial for valid statistical inference, particularly when
calculating confidence intervals and p-values.
Implication: If the residuals are not normally distributed, hypothesis tests may be
unreliable, and predictions may not be accurate.
Check: Residual plots, histograms, or Q-Q plots can be used to assess the normality
of residuals. Additionally, statistical tests like the Shapiro-Wilk test or Kolmogorov-
Smirnov test can test the normality assumption.
The assumptions of linear regression — linearity, independence, homoscedasticity, and
normality of residuals — ensure the reliability of the regression model. Violations of these
assumptions can lead to inaccurate predictions and invalid statistical conclusions, so it is
essential to check and address any violations before interpreting the results of a regression
analysis.
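As an illustration of how some of these checks might be run in practice, the sketch below fits a simple regression to hypothetical data and applies the Shapiro-Wilk and Durbin-Watson checks mentioned above.

```python
# A minimal sketch of residual diagnostics for a fitted regression:
# normality of residuals (Shapiro-Wilk) and independence (Durbin-Watson).
# The data are hypothetical.
import numpy as np
from scipy.stats import shapiro
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1, 14.2, 15.8, 18.1, 20.2])

model = sm.OLS(y, sm.add_constant(x)).fit()
residuals = model.resid

# Normality of residuals: p > .05 suggests no evidence against normality.
w_stat, p_normal = shapiro(residuals)

# Independence of residuals: values near 2 suggest no autocorrelation.
dw = durbin_watson(residuals)

print(f"Shapiro-Wilk p = {p_normal:.3f}, Durbin-Watson = {dw:.2f}")
# Homoscedasticity is usually checked visually with a residuals-vs-fitted plot,
# e.g. plt.scatter(model.fittedvalues, residuals).
```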
Estimation methods in linear regression are techniques used to calculate the parameters
(coefficients) of the regression equation that best fit the data. These methods estimate the
relationship between the independent and dependent variables, producing a linear model for
predictions or analysis.
1. Ordinary Least Squares (OLS)
Definition: OLS is the most commonly used method to estimate the coefficients in
linear regression.
How It Works: OLS minimizes the sum of the squared differences (errors) between
the observed values (actual data points) and the predicted values (fitted line). This
method ensures that the line of best fit captures the overall trend of the data.
Objective: Find the values of the coefficients (slope and intercept) that minimize the
residual sum of squares (RSS).
Advantages:
o Simple and widely used.
o Produces unbiased and efficient estimates when the errors are independent,
homoscedastic, and (for exact inference) normally distributed.
Limitations:
o OLS is sensitive to outliers and assumes that the residuals are homoscedastic
and normally distributed.
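A minimal sketch of the OLS idea itself: solving the normal equations gives the intercept and slope that minimize the residual sum of squares. The data are hypothetical.

```python
# A minimal sketch of the OLS estimator: beta = (X'X)^(-1) X'y, obtained here
# by solving the normal equations directly.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.0, 2.9, 4.1, 5.2, 5.9])

X = np.column_stack([np.ones_like(x), x])      # design matrix with intercept
beta = np.linalg.solve(X.T @ X, X.T @ y)       # solves the normal equations
intercept, slope = beta

rss = np.sum((y - X @ beta) ** 2)              # residual sum of squares
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}, RSS = {rss:.4f}")
```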
2. Maximum Likelihood Estimation (MLE)
Definition: MLE is a method for estimating the parameters of a statistical model that
maximizes the likelihood function, which measures how well the model explains the
observed data.
Multiple Regression
Multiple regression is an extension of simple linear regression that models the linear
relationship between a single outcome variable (dependent variable) and two or more
predictor variables (independent variables). It helps to determine how each predictor
influences the outcome while controlling for the other variables, allowing researchers to
analyze more complex relationships between variables.
Key Components:
Mathematical Equation: Y = a + b₁X₁ + b₂X₂ + … + bₙXₙ
Interpretation:
Intercept (a): Represents the expected value of Y when all predictor variables are
zero.
Slopes (b₁, b₂, …, bₙ): Indicate the individual contribution of each predictor variable.
For example, b₁ shows the effect of X₁ on Y, controlling for all other predictors.
Each coefficient (b) quantifies how much the dependent variable Y changes for a
one-unit change in the corresponding predictor variable X, while keeping other
variables constant.
Uses:
Example:
In a study exploring the relationship between salary (Y) and years of experience (X₁),
education level (X₂), and location (X₃), a multiple regression model would estimate how
much each factor (experience, education, location) influences salary.
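A minimal sketch of how such a model might be fitted with statsmodels; the salary, experience, education, and location values are invented, and location is coded as a simple 0/1 indicator for convenience.

```python
# A minimal sketch of a multiple regression like the salary example above.
import pandas as pd
import statsmodels.api as sm

data = pd.DataFrame({
    "experience": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],           # X1, years
    "education":  [12, 12, 14, 14, 16, 16, 16, 18, 18, 18],  # X2, years of schooling
    "location":   [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],            # X3, 0 = town, 1 = city
    "salary":     [30, 34, 38, 43, 47, 52, 55, 61, 64, 70],  # Y, in thousands
})

X = sm.add_constant(data[["experience", "education", "location"]])
model = sm.OLS(data["salary"], X).fit()

# Each slope is the expected change in salary for a one-unit change in that
# predictor, holding the other predictors constant.
print(model.params)
print(model.summary())
```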
Assumptions of Multiple Regression
Multiple regression analysis relies on several key assumptions to ensure the validity and
reliability of the results. These assumptions must be met to avoid biased or misleading
estimates of the relationships between the dependent and independent variables.
1. Linear Relationship
Definition: There must be a linear relationship between the dependent variable (Y)
and each of the independent variables (X₁, X₂, …, Xₙ).
Implication: The changes in the outcome variable (Y) should correspond
proportionally to changes in the predictor variables.
How to check: A scatterplot or residual plot can be used to visually assess if the
relationship appears linear.
2. Multivariate Normality
Definition: The residuals (errors) of the regression model should follow a normal
distribution.
Implication: This assumption is particularly important for hypothesis testing (e.g., t-
tests for coefficients) and for making reliable confidence intervals.
How to check:
o Use a histogram or Q-Q plot of residuals to visually inspect normality.
o Formal tests like the Shapiro-Wilk test can also be applied.
3. No Multicollinearity
Definition: The independent variables should not be too highly correlated with each
other, a phenomenon known as multicollinearity.
Implication: If multicollinearity exists, it becomes difficult to isolate the individual
effect of each predictor, leading to inflated standard errors and unstable coefficient
estimates.
How to check:
o Calculate the Variance Inflation Factor (VIF) for each predictor. A VIF
value above 10 typically indicates problematic multicollinearity.
o Tolerance values can also be used (a tolerance below 0.1 suggests high
multicollinearity).
4. Homoscedasticity
Definition: The variance of the residuals (errors) should be constant across all levels
of the independent variables (X₁, X₂, …, Xₙ). This is known as homoscedasticity.
Implication: If the variance of the errors changes across levels of the independent
variables, a condition called heteroscedasticity exists, which can affect the efficiency
of the coefficient estimates and the accuracy of predictions.
How to check:
o Use a scatterplot of residuals versus predicted values. If the plot shows a
funnel shape (wide variance at one end), it indicates heteroscedasticity.
o Formal tests like the Breusch-Pagan test can be used to detect
heteroscedasticity.
5. Independence of Errors
Definition: The residuals (errors) should be independent of each other, meaning that
the error for one observation should not be correlated with the error for another.
Implication: Violations of this assumption (often seen in time series data) can lead to
biased standard errors, which affects significance tests.
How to check:
o Durbin-Watson statistic is commonly used to detect autocorrelation in the
residuals.
6. No Perfect Collinearity
Definition: No independent variable should be an exact linear combination of the other
independent variables (for example, a predictor that is simply the sum of two other
predictors).
Implication: If perfect collinearity is present, the model cannot estimate a unique
coefficient for each predictor.
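To illustrate how a couple of these checks might look in practice, the sketch below reuses the hypothetical salary model and computes Variance Inflation Factors and the Breusch-Pagan test with statsmodels.

```python
# A minimal sketch of multicollinearity and heteroscedasticity checks on the
# hypothetical salary model from above.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.diagnostic import het_breuschpagan

data = pd.DataFrame({
    "experience": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "education":  [12, 12, 14, 14, 16, 16, 16, 18, 18, 18],
    "location":   [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "salary":     [30, 34, 38, 43, 47, 52, 55, 61, 64, 70],
})
X = sm.add_constant(data[["experience", "education", "location"]])
model = sm.OLS(data["salary"], X).fit()

# VIF for each predictor (the constant in column 0 is skipped); values above
# roughly 10 are usually taken to signal problematic multicollinearity.
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, round(variance_inflation_factor(X.values, i), 2))

# Breusch-Pagan test: a small p-value suggests heteroscedasticity.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, X)
print(f"Breusch-Pagan p = {lm_pvalue:.3f}")
```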
Logistic Regression
Logistic regression is a statistical method used to examine the relationship between one or
more independent variables (categorical or continuous) and a dichotomous dependent
variable (i.e., an outcome variable with two possible values, such as success/failure or
yes/no). Unlike linear regression, logistic regression is used when the outcome variable is
binary.
Key Concepts:
1. Outcome Variable:
o The outcome variable is dichotomous, meaning it has two categories (e.g.,
success/failure, yes/no, present/absent).
o It models the probability of the event of interest occurring, often coded as 1
for "success" or the event happening, and 0 for "failure" or the event not
happening.
2. Predictor Variables:
o Logistic regression can include one or more continuous or categorical
predictor variables.
o These predictors are used to estimate the probability of the occurrence of the
outcome.
3. Logistic Function:
o Logistic regression uses the logistic function (also called the sigmoid
function) to model the probability of the binary outcome:
P(Y = 1) = 1 / (1 + e^−(a + b₁X₁ + … + bₙXₙ))
4. Odds and Odds Ratio:
o Odds: The odds are defined as the ratio of the probability that an event will
occur to the probability that it will not occur: Odds = P / (1 − P).
o Odds Ratio (OR): Logistic regression outputs odds ratios, which represent
the ratio of the odds of the event occurring for different values of the predictor
variables.
1. OR > 1: The event is more likely to occur as the predictor variable
increases.
2. OR < 1: The event is less likely to occur as the predictor variable
increases.
Example:
In the case of a study investigating whether a history of suicide attempts is associated with an
increased risk of subsequent suicide attempts:
The logistic regression model would predict the odds of a future suicide attempt based on
whether a person has a history of prior attempts. The odds ratio would compare the odds of a
future attempt in those with a history of attempts versus those without.
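A minimal sketch of such a model using statsmodels, with invented 0/1 data for prior and future attempts; exponentiating the fitted coefficient gives the odds ratio described above.

```python
# A minimal sketch of logistic regression with a single binary predictor.
# The data are hypothetical.
import numpy as np
import statsmodels.api as sm

prior_attempt  = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])  # predictor
future_attempt = np.array([0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1])  # outcome

X = sm.add_constant(prior_attempt)
model = sm.Logit(future_attempt, X).fit(disp=False)

# Exponentiating the coefficient gives the odds ratio: the odds of a future
# attempt for those with a prior attempt relative to those without.
odds_ratio = np.exp(model.params[1])
print(f"coefficient = {model.params[1]:.3f}, odds ratio = {odds_ratio:.2f}")
```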
Assumptions:
The key assumptions of logistic regression ensure the model works effectively and provides
valid results: the outcome variable is binary, the observations are independent of one
another, each continuous predictor has a linear relationship with the log-odds (logit) of the
outcome, there is no severe multicollinearity among the predictors, and the sample size is
sufficiently large.
Mediation Analysis
Mediation analysis is a statistical approach used to explore how and why a predictor variable
(X) affects an outcome variable (Y) by introducing a mediating variable (M). This third
variable represents the process or mechanism through which the predictor influences the
outcome, helping to clarify the underlying relationship.
In essence, mediation analysis breaks down the total effect of X on Y into direct and
indirect effects: the direct effect is the part of X’s influence on Y that does not pass
through M, while the indirect effect is the part transmitted through the mediator (X → M → Y).
Mediation involves estimating three key regression equations to assess the relationships
between the variables:
1. X → Y (Total Effect): This step assesses the direct relationship between the predictor
(X) and the outcome (Y). It represents the total effect without accounting for the
mediator.
2. X → M (Effect of X on M): This step examines the relationship between the predictor
(X) and the mediator (M). It tells us whether the predictor variable is associated with
the mediator, indicating a potential pathway.
3. X + M → Y (Effect of X and M on Y): In this final step, both X (the predictor) and M
(the mediator) are entered into a regression model predicting Y. The effect of X on Y
while controlling for M is the direct effect. The extent to which M explains part of
the X → Y relationship is the indirect effect.
Interpretation
If the relationship between X and Y weakens or disappears once the mediator (M) is included
in the model, mediation is indicated; the indirect effect (X → M → Y) accounts for the
reduction in the direct effect.
Types of Mediation
1. Full Mediation: The effect of X on Y disappears when the mediator (M) is included
in the model, meaning that X affects Y entirely through M.
2. Partial Mediation: X still has a direct effect on Y, but part of the effect is mediated
by M. In this case, both the direct and indirect effects are significant.
Example:
Consider a study where job satisfaction (X) influences employee retention (Y). The
researcher hypothesizes that organizational commitment (M) mediates this relationship.
Mediation analysis would assess: (1) whether job satisfaction predicts retention (the total
effect), (2) whether job satisfaction predicts organizational commitment, and (3) whether
organizational commitment predicts retention when both are entered together, so that the
direct and indirect effects of job satisfaction can be separated.
Mediation analysis provides insights into how and why relationships between variables
occur, helping to identify potential mechanisms or processes that explain observed
associations. It is commonly used in fields such as psychology, social sciences, and health
research to deepen understanding beyond direct relationships.
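A minimal sketch of the three regression steps applied to the hypothetical job-satisfaction example, using statsmodels with simulated data.

```python
# A minimal sketch of the three mediation regressions described above.
# The data are simulated purely for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
satisfaction = rng.normal(size=n)                                        # X
commitment   = 0.6 * satisfaction + rng.normal(scale=0.8, size=n)        # M
retention    = 0.5 * commitment + 0.1 * satisfaction + rng.normal(scale=0.8, size=n)  # Y

# Step 1: total effect of X on Y.
total = sm.OLS(retention, sm.add_constant(satisfaction)).fit()

# Step 2: effect of X on the mediator M.
x_to_m = sm.OLS(commitment, sm.add_constant(satisfaction)).fit()

# Step 3: X and M together predicting Y; the coefficient on X is the direct
# effect, and the drop from step 1 reflects the indirect (mediated) effect.
both = sm.OLS(retention,
              sm.add_constant(np.column_stack([satisfaction, commitment]))).fit()

print(f"total effect of X:  {total.params[1]:.3f}")
print(f"effect of X on M:   {x_to_m.params[1]:.3f}")
print(f"direct effect of X: {both.params[1]:.3f} (controlling for M)")
print(f"indirect effect:    {x_to_m.params[1] * both.params[2]:.3f} (a * b)")
```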