
Module 2: Measures of Association and Prediction

2.1 Correlation: product moment, partial correlation, special correlations

Correlations
Francis Galton, an early pioneer of the study of correlation, described it as a measure of co-
relationship between variables, helping to show the direction and strength of their association
without assuming a cause-and-effect link.

Correlational analysis is a statistical technique used to examine the relationship between two
variables, which is often quantified using the correlation coefficient (r). This coefficient
gives insight into both the direction and magnitude of the relationship between the
variables.

1. Direction of the Relationship

The direction tells us how one variable changes in relation to another:

 Positive Correlation (Same direction):


o When two variables move in the same direction, meaning as one increases, the
other also increases, and as one decreases, the other decreases.
o For example, height and weight might have a positive correlation, where taller
people generally weigh more.
o Mathematically, the correlation coefficient (r) for positive correlation lies
between 0 and +1. An r value closer to +1 indicates a stronger positive
relationship.
 Negative Correlation (Opposite direction):
o In a negative correlation, as one variable increases, the other decreases, and
vice versa.
o For example, the time spent studying and the number of errors on a test may
have a negative correlation, where more study time leads to fewer errors.
o The correlation coefficient (r) for negative correlation ranges between 0 and -
1, with an r value closer to -1 indicating a stronger negative relationship.
 No Correlation (Nil correlation):
o If there is no discernible relationship between the two variables, changes in
one do not predict changes in the other.
o For instance, the amount of tea a person drinks and their IQ level likely have
no correlation.
o In this case, the correlation coefficient (r) will be close to 0, indicating no
relationship.

2. Magnitude of the Relationship

The magnitude, or strength of the relationship, is indicated by how close the correlation
coefficient is to +1 or -1:

 r = +1 or -1: Indicates a perfect linear relationship. All data points lie exactly on a
straight line.
 r close to ±1: Indicates a strong relationship, where changes in one variable reliably
predict changes in the other.
 r close to 0: Indicates a weak or no relationship. Changes in one variable do not
provide much information about changes in the other.

Simply put, the correlation coefficient describes the extent to which two variables are
related to each other. There are several ways to calculate a correlation coefficient, and
they can be grouped according to whether they are used in parametric or non-parametric analyses.

Correlational analysis can be conducted using both parametric and non-parametric tests,
depending on the nature of the data and the assumptions that can be made about its
distribution.

Parametric Tests for Correlation

Parametric tests assume that the data is normally distributed and that the relationship between
the variables is linear. The most common parametric test used for correlation is:

1. Pearson’s Correlation Coefficient (r):


o Purpose: It measures the strength and direction of the linear relationship
between two continuous variables.
o Assumptions:
 Both variables are normally distributed.
 The relationship is linear.
 The variables are measured on interval or ratio scales.
o Example: The relationship between height and weight in a normally
distributed population.
o Values:
 r ranges from -1 to +1, where:
 +1 indicates a perfect positive linear relationship.
 -1 indicates a perfect negative linear relationship.
 0 indicates no linear relationship.

Non-Parametric Tests for Correlation

Non-parametric tests do not assume a normal distribution of data and are used when data is
ordinal, not normally distributed, or has outliers that would distort parametric analysis. The
two most common non-parametric correlation tests are:

1. Spearman’s Rank Correlation Coefficient (ρ or Spearman's rho):


o Purpose: It assesses the strength and direction of a monotonic relationship
(whether linear or not) between two variables, often used for ordinal data or
non-normally distributed continuous data.
o Assumptions:
 Variables do not need to be normally distributed.
 The relationship is monotonic (i.e., consistently increasing or
decreasing but not necessarily linear).
o Example: The correlation between ranks in a competition and the hours spent
practicing.
o Values:
Ranges from -1 to +1, with the same interpretation as Pearson’s
correlation (i.e., perfect positive, perfect negative, or no correlation).
2. Kendall’s Tau (τ):
o Purpose: Like Spearman’s rho, Kendall’s tau measures the strength and
direction of a monotonic relationship between two ordinal or continuous
variables.
o Assumptions:
 It is used when data contains many tied ranks or when the sample size
is small.
o Example: Assessing the agreement between two rankings in a preference
survey.
o Values:
 Ranges from -1 to +1, with the same interpretation as other correlation
coefficients.

When to Use Parametric vs. Non-Parametric Tests:

 Use parametric tests (Pearson’s r) if:


o Both variables are normally distributed.
o The relationship is linear.
o The data is continuous and meets other parametric assumptions.
 Use non-parametric tests (Spearman’s rho, Kendall’s tau) if:
o Data is ordinal or ranked.
o The relationship is non-linear but monotonic.
o Data is not normally distributed or has outliers.

In summary, Pearson’s correlation is the most widely used parametric test for correlational
analysis, while Spearman’s rho and Kendall’s tau are non-parametric alternatives that are
more suitable for ranked or non-normally distributed data.

PEARSON’S PRODUCT-MOMENT CORRELATION.

Pearson’s Correlation Coefficient (r), often referred to as Pearson’s r, is a widely used
measure of the linear relationship between two variables. It was developed by Karl Pearson
in the 1890s, building upon earlier work by Francis Galton, who introduced the idea of
correlation in the 1880s.

Pearson's r measures the strength and direction of the linear association between two
variables, ranging from -1 (perfect negative correlation) to +1 (perfect positive correlation),
with 0 indicating no linear relationship.

Another way to conceptualize Pearson’s r is as the average product of the paired z-scores,
which means each variable's values are standardized (converted into z-scores), and the
correlation coefficient is derived from the average of their products.

Methods of Calculation

There are two common methods to calculate Pearson’s r:

1. Raw Score Method:


o In this method, you use the actual raw scores (X and Y) for each variable
directly in the formula.
o It is more straightforward for smaller datasets, but as the number of data points
increases, the calculations can become more cumbersome.

The raw score formula is:

r = [N ΣXY − (ΣX)(ΣY)] / √{ [N ΣX² − (ΣX)²] [N ΣY² − (ΣY)²] }

where N is the number of paired scores.

2. Deviation Score Method:


o In this method, you first calculate the deviations of each score from their
respective means and then use these deviations to compute the correlation.
o This method is useful because it centers the data, making it easier to see the
relationship between the variables.

The deviation score formula is:

r = Σxy / √(Σx² · Σy²)

where x = X − X̄ and y = Y − Ȳ are the deviations of each score from its respective mean.

Both methods yield the same correlation coefficient but are used depending on the context
and the size of the dataset.
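
As a minimal illustration (not from the original text), here is a short Python sketch that computes Pearson’s r with both the raw score and deviation score methods on a small made-up dataset, then cross-checks the result with NumPy’s built-in correlation; all numbers are hypothetical.

```python
import numpy as np

# Hypothetical paired observations (e.g., hours studied X and test score Y)
X = np.array([2, 4, 5, 7, 8], dtype=float)
Y = np.array([50, 60, 62, 70, 78], dtype=float)
N = len(X)

# Raw score method: r = [N*SumXY - SumX*SumY] / sqrt([N*SumX^2 - (SumX)^2][N*SumY^2 - (SumY)^2])
num = N * np.sum(X * Y) - np.sum(X) * np.sum(Y)
den = np.sqrt((N * np.sum(X**2) - np.sum(X)**2) * (N * np.sum(Y**2) - np.sum(Y)**2))
r_raw = num / den

# Deviation score method: r = Sum(xy) / sqrt(Sum(x^2) * Sum(y^2)), with x = X - mean(X), y = Y - mean(Y)
x, y = X - X.mean(), Y - Y.mean()
r_dev = np.sum(x * y) / np.sqrt(np.sum(x**2) * np.sum(y**2))

print(r_raw, r_dev)                 # both methods give the same value
print(np.corrcoef(X, Y)[0, 1])      # cross-check with NumPy's built-in correlation
```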

Assumptions Underlying Pearson’s Product Moment Correlation:

The Product Moment Correlation (Pearson’s r) is based on several key assumptions that
need to be met to ensure valid and reliable results. While it is a robust measure of linear
association between two continuous variables, these assumptions ensure that Pearson’s r is
accurately interpreted. Here’s an elaboration on these assumptions:

1. Data is Continuous (Interval or Ratio Scale):


o Pearson’s r assumes that both variables are measured on a continuous scale
(either interval or ratio). This means that the data must have meaningful
numerical distances between values (interval) or a true zero point (ratio).
o Example: Heights, weights, test scores, temperatures, etc., are continuous data
types.
2. Linearity of Regression:
o The relationship between the two variables should be linear, meaning the
trend of the relationship forms a straight line when plotted on a scatter plot.
Pearson's r measures the strength and direction of this linear relationship.
o If the relationship is non-linear or curvilinear, Pearson's r will not
appropriately capture the relationship. Non-linear relationships might follow a
curve, indicating a non-constant ratio between the variables.
o Scatter Plot Analysis: The linearity assumption can be visually assessed
using a scatter plot. If the plot shows a straight-line pattern, the relationship is
linear. If the points form a curve, the relationship may be non-linear, and
Pearson’s r might not be suitable.
3. No Collinearity:
o Collinearity refers to a situation where one predictor variable in a regression
model can be linearly predicted from the others with a substantial degree of
accuracy. In correlation, this would imply a perfect or near-perfect relationship
between the two variables. Pearson’s r is designed to measure relationships
that are rectilinear, not collinear.
o When a relationship is collinear, one variable might be redundant because it
contributes no new information beyond what is provided by another variable.
4. Symmetrical or Unimodal Distribution:
o While Pearson’s r does not require exact normality for the variables, it is
assumed that the variables have a roughly symmetrical or unimodal
distribution. This means the data should have a single peak and should not be
severely skewed.
o Why is this important? A highly skewed distribution can distort the
correlation results, even though Pearson’s r is relatively robust to minor
deviations from normality. The symmetry ensures that the linear relationship
is not distorted by extreme values or skewness.

Properties of the Correlation Coefficient (r)

The correlation coefficient (r) is a statistical measure that quantifies the strength and
direction of a linear relationship between two variables. Pearson’s correlation coefficient,
commonly denoted as r, has several important properties:

1. Range of r

 The value of r always falls between -1 and +1:


o r = +1: Indicates a perfect positive linear relationship between the two
variables. As one variable increases, the other also increases proportionally.
o r = -1: Indicates a perfect negative linear relationship between the two
variables. As one variable increases, the other decreases proportionally.
o r = 0: Indicates no linear relationship between the two variables.
 The closer r is to ±1, the stronger the linear relationship. The closer it is to 0, the
weaker the linear relationship.

2. Does Not Assume Causality

 Correlation does not imply causation. A high correlation between two variables
does not mean that changes in one variable cause changes in the other. There could be
a third variable affecting both, or the relationship could be coincidental.
 Example: A high correlation between ice cream sales and drowning incidents may
exist, but this doesn’t mean ice cream causes drowning. Instead, both are influenced
by a third factor (hot weather).

3. r Remains Constant

 The correlation coefficient remains unchanged by changes in the origin or scale of
measurement. This means that if you:
o Add or subtract a constant to all values of a variable, or
o Multiply or divide all values of a variable by a positive constant, the value of r will
remain the same (multiplying by a negative constant reverses only the sign of r).
 Example: Converting a dataset from inches to centimeters will not change the
correlation between height and weight.

4. Coefficient of Determination (r²)

 The coefficient of determination (r²) is a key property derived from r. It represents
the proportion of variance in one variable that can be explained by the variance in
the other variable.
 r² is always between 0 and 1 and is interpreted as a percentage.
o r² = 0.64 (or 64%) indicates that 64% of the variation in one variable is
explained by the variation in the other variable.
o r² = 0 indicates that none of the variation in one variable is explained by the
other.
 Example: If the correlation between hours studied and exam scores is 0.8, then r² =
0.64. This means 64% of the variance in exam scores is explained by the number of
hours studied.

5. Coefficient of Non-Determination (k² = 1 - r²)

 The coefficient of non-determination (k²) represents the proportion of variance that
is not explained by the relationship between the two variables.
 Mathematically, k² = 1 - r².
 It shows the remaining unexplained variability in one variable after accounting for
the linear relationship with the other variable.
 Example: If r² = 0.64, then k² = 0.36. This means that 36% of the variance in the
dependent variable is not explained by the independent variable and may be due to
other factors.
6. Coefficient of Alienation (k)

 The coefficient of alienation (k) is the square root of the coefficient of non-determination (k²):

k = √k² = √(1 − r²)

 It represents the degree of non-association or residual error in the relationship between
the two variables. In simpler terms, it tells us how much of the relationship remains
unexplained by the linear relationship.
 Example: If r² = 0.64, then k² = 0.36, and k = √0.36 = 0.6. Thus 36% of the variance (k²)
is left unexplained, and k = 0.6 expresses this lack of association on the same scale as the
correlation coefficient itself.
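
To make the arithmetic above concrete, a tiny Python sketch using the r = 0.8 example from the text:

```python
import math

r = 0.8                      # correlation from the example in the text
r2 = r ** 2                  # coefficient of determination: 0.64 -> 64% variance explained
k2 = 1 - r2                  # coefficient of non-determination: 0.36 -> 36% unexplained
k = math.sqrt(k2)            # coefficient of alienation: 0.6

print(r2, k2, k)             # approximately 0.64, 0.36, 0.6
```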

Partial Correlation (rXY.Z)

Partial correlation measures the strength and direction of the linear relationship between
two variables (X and Y) while controlling for the influence of a third variable (Z). This
method allows researchers to isolate the direct relationship between two variables,
eliminating the confounding effect of one or more other variables that might be influencing
both.

Why use partial correlation? In many cases, the relationship between two variables may
appear strong, but this relationship could be influenced by a third variable. By calculating
partial correlation, you can assess the true relationship between the two variables of interest.

Example:

Let’s say we are investigating the relationship between exercise (X) and weight loss (Y), but
we know that diet (Z) also affects both variables.

 Raw correlation (rₓy) between exercise and weight loss may appear strong.
 But when we calculate the partial correlation by controlling for diet (Z), we might
find that the relationship between exercise and weight loss weakens, indicating that
diet plays a significant role in explaining weight loss alongside exercise.

Partial correlation provides a powerful method for isolating the direct relationship between
two variables while accounting for the influence of a third. It is widely used in research to
refine interpretations and avoid confounding biases, making it a reliable way to measure
independent linear relationships.
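
A minimal sketch of how a first-order partial correlation can be computed from the three pairwise Pearson correlations, using the standard formula rXY.Z = (rXY − rXZ·rYZ) / √[(1 − rXZ²)(1 − rYZ²)]; the data below are made up purely for illustration.

```python
import numpy as np

# Hypothetical data: exercise (X), weight loss (Y), diet quality (Z)
X = np.array([1, 2, 3, 4, 5, 6], dtype=float)
Z = np.array([2, 1, 4, 3, 6, 5], dtype=float)
Y = 0.5 * X + 0.8 * Z + np.array([0.1, -0.2, 0.0, 0.3, -0.1, 0.2])

r_xy = np.corrcoef(X, Y)[0, 1]
r_xz = np.corrcoef(X, Z)[0, 1]
r_yz = np.corrcoef(Y, Z)[0, 1]

# First-order partial correlation of X and Y, controlling for Z
r_xy_z = (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

print(round(r_xy, 3), round(r_xy_z, 3))  # raw vs. partial: controlling for Z changes the estimate
```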

Uses of Partial Correlation:


 Eliminating Confounding Effects: In research, variables are often influenced by
external factors (confounders). Partial correlation helps disentangle these influences,
revealing the independent relationship between the primary variables.
 Better Interpretation: It allows researchers to avoid overestimating the strength of a
relationship by taking into account other variables that might be artificially inflating
the correlation.

Limitations of Partial Correlation:

 Only linear relationships: Partial correlation only examines linear relationships
between variables. If the relationships are non-linear, partial correlation may not
provide an accurate picture.
 Assumes other variables are correctly identified: If the third variable (Z) is not
properly identified or measured, the partial correlation might still be misleading.

Semi-Partial Correlation

Semi-partial correlation (also called part correlation) is a statistical method used to
measure the strength and direction of the linear relationship between two variables, while
controlling for the influence of other variables on only one of the two variables involved in
the relationship.

How Semi-Partial Correlation Works:

 In semi-partial correlation, we control for the effect of an additional variable (or
variables) on only one of the two primary variables.
 It differs from partial correlation, where the influence of the control variable(s) is
removed from both variables.

Example:

 Suppose we want to examine the relationship between job performance (Y) and job
satisfaction (X), while controlling for the effect of work experience (Z) on job
satisfaction. Semi-partial correlation would adjust job satisfaction to account for
work experience, and then measure the correlation between job satisfaction (with
work experience controlled) and job performance.
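
Continuing the same logic, a semi-partial (part) correlation removes the control variable from only one of the two variables; here is a hedged sketch using the standard formula, with entirely made-up correlation values standing in for the job satisfaction / performance / experience example.

```python
import numpy as np

# Hypothetical correlations: X = job satisfaction, Y = job performance, Z = work experience
r_xy, r_xz, r_yz = 0.50, 0.40, 0.30   # assumed values for illustration only

# Semi-partial correlation of Y with X, where Z is partialled out of X only
sr_y_x_z = (r_xy - r_yz * r_xz) / np.sqrt(1 - r_xz**2)

print(round(sr_y_x_z, 3))
```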

2.2 Nonparametric correlations: Kendall's tau, Spearman’s rho, other measures.

Dichotomy in Variables
A dichotomy refers to the division of variables into two distinct categories or groups. This
binary categorization simplifies the analysis by dividing a variable into two mutually
exclusive groups, such as Yes/No, True/False, or Male/Female. Dichotomies can occur
naturally or be created artificially by the researcher, depending on the nature of the variable
being studied.
Dichotomy in variables is connected to special correlations through the way binary (two-
category) variables are handled in correlation analysis. When one or both variables in a
correlation are dichotomous (natural or artificial), different special correlation techniques
are applied because traditional correlation methods, like Pearson’s r, may not be appropriate
for analyzing relationships involving dichotomous variables.

Types of Dichotomy:

1. Natural Dichotomy:
o A natural dichotomy occurs when the division into two categories is inherent
in the variable itself. These variables naturally exist in only two possible
states.
o Characteristics:
 The division is inherent to the variable.
 The underlying assumption is that the variable is naturally categorical,
meaning there are only two distinct categories with no in-between
states.
o Examples:
 Alive or dead: There is no middle state between being alive or dead.
 Normal vision or color-blind: Individuals either have normal vision
or are color-blind.
 Head or tails: In a coin toss, the result can only be one of the two
outcomes.
2. Artificial Dichotomy:
o An artificial dichotomy is created when a variable that is originally
continuous or has multiple categories is forced into two groups based on
specific criteria set by the researcher.
o Characteristics:
 The division is created by the researcher and is not naturally present
in the variable.
 The underlying assumption is that the variable could be continuous,
but for the sake of convenience or analysis, it is divided into two
categories.
o Examples:
 Fail or pass: A student’s grade could be a continuous variable (e.g.,
percentages), but it is dichotomized into fail or pass based on a
predetermined cutoff point (e.g., 50%).
 Socially adjusted or maladjusted: Social adjustment could be
measured on a spectrum, but the researcher divides it into two
categories for analysis.
 Poor or not poor: Income levels are continuous, but poverty is often
dichotomized into poor or not poor based on a threshold.

Special Correlations

In correlational analysis, special correlations are used when dealing with variables that are
not both continuous, such as when variables are dichotomous (either artificially or naturally).
Here’s an overview of the key special correlations, their assumptions, and when to use each
one.

1. Tetrachoric Correlation (2 Artificially Dichotomous Variables)


 Used for: Two artificially dichotomous variables derived from continuous underlying
distributions.
 Example:
o Understanding: Dichotomized into "High Understanding" and "Low
Understanding" based on a continuous test score.
o Performance: Dichotomized into "High Performance" and "Low
Performance" based on another continuous test score.
 Assumptions:
o Assumes that the two dichotomous variables were originally continuous but
were divided into categories artificially.
o Assumes the underlying distribution is normal.
o Requires a large sample size to produce reliable results.
 Key Point: Tetrachoric correlation estimates what the correlation would be if both
variables were continuous, treating the dichotomization as an artificial construct.

2. Phi Correlation (Φ) (2 Naturally Dichotomous Variables)

 Used for: Two naturally dichotomous variables.


 Example:
o Biological Sex: Naturally dichotomous with categories "Male" and "Female."
o Genetic Trait: Naturally dichotomous with categories "Present" and
"Absent."
 Assumptions:
o No assumptions about the distribution.
o The phi coefficient is essentially Pearson’s r applied to two dichotomous
variables.
 Key Point: Phi correlation measures the strength and direction of the association
between two naturally occurring binary categories. It is restricted between -1 and +1.

3. Biserial Correlation (rb) (Continuous and Artificial Dichotomous Variable)

 Used for: When one variable is continuous and the other is an artificial dichotomy.
 Example:
o Test Score: A continuous variable ranging from 0 to 100.
o Selection Status: Artificially dichotomized into "Selected" and "Not Selected"
based on a cutoff score (e.g., 60%).
 Assumptions:
o Requires that the dichotomous variable reflects an underlying continuous
distribution.
o Assumes normality in the distribution of the continuous variable.
o Requires a large sample size and a balanced split (where the number of cases
in each group is roughly equal).
 Key Points:
o The biserial correlation coefficient is not restricted between -1 and +1.
o It cannot be compared directly to other correlation coefficients because it
assumes that the artificial dichotomy stems from a continuous underlying
variable.
 Standard Error: Cannot be computed directly, as assumptions about the distribution
of the underlying continuous variable introduce uncertainties.
4. Point Biserial Correlation (rpb) (Continuous and Natural Dichotomous Variable)

 Used for: When one variable is continuous and the other is a naturally dichotomous
variable.
 Example:
o Salary: A continuous variable representing income.
o Biological Sex: Naturally dichotomous with categories "Male" and "Female."
 Assumptions:
o Makes no assumptions about normality or distribution of the continuous
variable.
o Does not require continuity in the underlying dichotomous variable.
 Key Points:
o rpb is restricted between -1 and +1, making it comparable to Pearson’s r.
o Standard Error can be computed directly, allowing for significance testing.

Comparison between Biserial and Point Biserial Correlation

The biserial correlation and point biserial correlation are both used to measure the
relationship between a continuous variable and a dichotomous variable, but they differ
in their assumptions and applications. The biserial correlation is applied when the
dichotomous variable is artificial, meaning it is derived from an underlying
continuous distribution that has been divided into two categories (e.g., "pass" and
"fail" based on a cutoff score). It assumes that the underlying variable follows a
normal distribution and requires a large sample size with a balanced split for
accurate results. Additionally, the biserial correlation coefficient is not restricted
between -1 and +1, making it harder to compare with other correlation coefficients,
and its standard error cannot be easily computed due to uncertainties about the
distribution.

In contrast, the point biserial correlation is used when the dichotomous variable is
naturally occurring (e.g., male/female), without assuming any underlying continuity.
It makes no assumptions about the distribution of either variable and is simpler to
apply in a wider range of situations. The point biserial correlation is restricted
between -1 and +1, making it comparable to Pearson’s r, and its standard error can
be computed, allowing for significance testing. While both methods measure the
strength of association between a continuous and dichotomous variable, the choice
between them depends on whether the dichotomous variable is naturally occurring or
artificially created.
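
A short sketch of two of these special correlations, assuming SciPy is available: scipy.stats.pointbiserialr for a continuous variable paired with a natural dichotomy, and the phi coefficient computed directly from a 2×2 table with the standard formula φ = (ad − bc) / √[(a+b)(c+d)(a+c)(b+d)]. All data below are invented for illustration.

```python
import numpy as np
from scipy import stats

# Point-biserial: continuous salary vs. a naturally dichotomous group code (0/1)
salary = np.array([42, 55, 48, 61, 39, 58, 52, 63], dtype=float)
group = np.array([0, 1, 0, 1, 0, 1, 1, 1])
r_pb, p_value = stats.pointbiserialr(group, salary)
print(round(r_pb, 3), round(p_value, 3))

# Phi coefficient from a 2x2 table of two natural dichotomies
#            trait present  trait absent
# group A          a=20          b=10
# group B          c=5           d=25
a, b, c, d = 20, 10, 5, 25
phi = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))
print(round(phi, 3))
```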

Spearman Rank-Order Correlation

The Spearman Rank-Order Correlation (denoted by the Greek letter ρ, pronounced "rho")
is a non-parametric measure of correlation, named after Charles Spearman. It is used to
assess the strength and direction of the monotonic relationship between two variables.
Unlike Pearson’s correlation, which assumes normally distributed data and linear
relationships, Spearman’s ρ does not make such assumptions, making it useful in specific
situations.
When to Use Spearman Rank-Order Correlation:

1. Ordinal Data: It is ideal for data that is measured on an ordinal scale, where the
variables are ranked, but the intervals between them are not necessarily equal.
o Example: Ranking students based on their class performance (1st, 2nd, 3rd,
etc.).
2. Highly Skewed Distribution: Spearman’s correlation is suitable when the
distribution of the data is skewed or does not meet the normality assumption required
by parametric tests.
o Example: Income levels, which are often skewed, can be analyzed using
Spearman’s ρ.
3. Interval and Ratio Data Converted into Ranks: When continuous data on interval
or ratio scales (e.g., test scores or height) are converted into ranks, Spearman’s ρ can
be applied to measure the correlation between the ranks.
o Example: Ranking athletes based on their performance scores in a
competition.
4. Small Sample Sizes: Spearman’s correlation is also useful when the sample size is
very small, as it does not rely on the assumption of normality or the robustness of
large samples.
o Example: A small study with fewer than 30 participants may benefit from
Spearman’s correlation for reliable results.

The formula for the Spearman Rank-Order Correlation (ρ) is:

ρ = 1 − (6 Σd²) / [n(n² − 1)]

where d is the difference between the two ranks of each observation and n is the number of pairs.
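
A minimal sketch of Spearman’s rho, computed both by the rank-difference formula above (valid when there are no tied ranks) and via scipy.stats.spearmanr; the scores are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical scores for 6 participants on two measures
x = np.array([86, 97, 99, 100, 101, 103], dtype=float)
y = np.array([2, 20, 28, 27, 50, 29], dtype=float)

# Rank-difference formula: rho = 1 - 6*sum(d^2) / (n*(n^2 - 1))
rx = stats.rankdata(x)
ry = stats.rankdata(y)
d = rx - ry
n = len(x)
rho_formula = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

rho_scipy, p = stats.spearmanr(x, y)
print(round(rho_formula, 3), round(rho_scipy, 3))   # agree when there are no tied ranks
```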

Kendall’s Tau (𝜏) Correlation

Kendall's Tau (𝜏) is a non-parametric measure of the association between two variables. It
is based on the relationship between the ranks of the two variables, rather than their actual
values, and is commonly used when data is ordinal or when parametric assumptions (such as
normality) cannot be met. Kendall's Tau is particularly useful for smaller datasets and can
handle ties in the rankings.

When to Use Kendall’s Tau:

1. Ordinal Data: It is ideal when the data is ranked rather than continuous.
2. Non-parametric: Useful when the assumptions of parametric correlation tests
(like Pearson’s r) are not satisfied.
3. Continuous Data with Outliers: Kendall's Tau can be applied to continuous
variables when there are outliers, as it is less sensitive to extreme values.
4. Two Variables: It is only applicable when analyzing the relationship between two
variables.
5. Sample Size (N > 10): Kendall’s Tau works best when the sample size is greater
than 10, so that it provides a meaningful measure.

How Kendall’s Tau Works:

Kendall’s Tau compares the ranks of two variables by evaluating concordant and discordant
pairs:

 Concordant Pairs: A pair of observations (A and B) is concordant if the ranks for
both variables (X and Y) are in the same order. For example, if A ranks higher than B
in both X and Y, the pair is concordant. This indicates that the two variables have a
positive association.
o Example: If for subject A, the rank in variable X is 1 and for B, it’s 2, and the
rank in variable Y for A is 1 and for B it’s 3, then both ranks follow the same
pattern, so the pair is concordant.
 Discordant Pairs: A pair is discordant if the ranks for the two variables are in the
opposite order. If A ranks higher than B in X but lower in Y, the pair is discordant,
indicating a negative association between the variables.
o Example: For subject A, the rank in X is 2 and for B it’s 3, while in Y, A’s
rank is 3 and B’s is 2. The signs are opposite, making this a discordant pair.

Formula for Kendall's Tau:

τ = (C − D) / [n(n − 1)/2]

where C is the number of concordant pairs, D is the number of discordant pairs, and n is the number of observations (this basic form, tau-a, applies when there are no tied ranks).

Interpretation of Kendall’s Tau:

 𝜏 = +1: Perfect positive relationship (higher ranks in one set correspond to higher
ranks in the other).
 𝜏 = -1: Perfect negative relationship (higher ranks in one set correspond to lower
ranks in the other).
 𝜏 = 0: No correlation, indicating independence or randomness in the ranks.
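
A brief sketch contrasting a hand count of concordant and discordant pairs with scipy.stats.kendalltau (SciPy’s default is the tau-b variant, which also corrects for ties); the ranks below are hypothetical.

```python
from itertools import combinations
import numpy as np
from scipy import stats

# Hypothetical ranks on two variables for 5 subjects
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 1, 4, 3, 5])

# Count concordant (same order on both variables) and discordant (opposite order) pairs
C = D = 0
for i, j in combinations(range(len(x)), 2):
    s = np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
    if s > 0:
        C += 1
    elif s < 0:
        D += 1

tau_a = (C - D) / (len(x) * (len(x) - 1) / 2)   # tau-a: fine here, since there are no ties
tau_scipy, p = stats.kendalltau(x, y)
print(round(tau_a, 3), round(tau_scipy, 3))     # identical here because there are no ties
```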

Features of Kendall’s Tau:


1. Rank-Based: It operates on ranks rather than actual values, making it suitable for
ordinal data.
2. Robust to Outliers: It is less affected by outliers compared to Pearson’s r, which
assumes linearity and normality.
3. Non-parametric: Kendall’s Tau does not require any assumptions about the
distribution of the data.
4. Only Two Variables: It can only be applied to assess the correlation between two
variables.
5. Small Samples: Kendall’s Tau works well with small sample sizes (though N should
be greater than 10 for accuracy).

In summary, Kendall’s Tau is a versatile and robust measure of correlation, especially useful
for ranked data and when the assumptions of other correlation measures cannot be met. Its
focus on concordant and discordant pairs allows it to provide insight into the monotonic
relationship between two variables

2.3 Linear Regression (OLS), Multiple Regression, Logistic Regression

Concept of Regression

Regression analysis goes beyond the simple association measured by correlation and allows
researchers to examine how one variable can predict or explain the variation in another
variable. It is a powerful statistical tool that models the relationship between two or more
variables, enabling predictions about one variable based on information from others.

In regression, the variables are divided into two categories:

1. Predictor (Independent) Variable: The variable that is used to predict or explain the
other variable. This is also called the explanatory variable.
2. Outcome (Dependent) Variable: The variable that is being predicted or explained,
also known as the response variable.

For example, if a researcher is studying the relationship between test scores and self-esteem,
test scores might be the predictor variable, while self-esteem is the outcome variable.
Similarly, in a study on job satisfaction and employee retention, job satisfaction could be
the predictor, and employee retention the outcome variable.

Regression Line

The regression line is a line that best fits the data in a scatterplot, representing the
relationship between the predictor (independent) variable and the outcome (dependent)
variable. It is the line that minimizes the error, or the distance between the observed data
points and the predicted values on the line. The purpose of the regression line is to help
predict the outcome variable based on the predictor variable.

Key Elements of the Regression Line:


1. Slope (b): This represents the rate of change of the outcome variable (Y) for every
one-unit change in the predictor variable (X). It indicates the direction and strength of
the relationship.
o A positive slope means that as X increases, Y also increases.
o A negative slope means that as X increases, Y decreases.
2. Intercept (a): This is the point where the regression line crosses the Y-axis,
representing the predicted value of Y when X is 0. It gives the baseline value of the
outcome variable when there is no influence from the predictor.

Purpose of the Regression Line:

 Prediction: The regression line is used to predict the value of the outcome variable
(Y) based on the values of the predictor variable (X).
 Best Fit: The line is calculated in a way that minimizes the sum of the squared
differences (errors) between the observed data points and the values predicted by the
line. This method is known as least squares regression.

Thus, the regression line offers a mathematical model for understanding and predicting
the relationship between variables.
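
A minimal sketch of fitting the regression line by least squares, using the closed-form slope and intercept described above; the data are invented, and np.polyfit serves only as a cross-check.

```python
import numpy as np

# Hypothetical predictor (X) and outcome (Y)
X = np.array([1, 2, 3, 4, 5, 6], dtype=float)
Y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])

# Least-squares estimates: b = Sum[(X - mean(X))(Y - mean(Y))] / Sum[(X - mean(X))^2], a = mean(Y) - b*mean(X)
b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean())**2)
a = Y.mean() - b * X.mean()

Y_hat = a + b * X                      # predicted values on the regression line
residuals = Y - Y_hat                  # errors whose squared sum the line minimizes

print(round(a, 3), round(b, 3))
print(np.polyfit(X, Y, 1))             # cross-check: returns [slope, intercept]
```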

Types of Regression:

1. Simple Linear Regression: Involves one predictor and one outcome variable, fitting
a straight line to the data to make predictions.
o Equation: Y = a + bX
 Y = predicted outcome
 a = intercept
 b = slope (indicating the relationship between X and Y)
 X= predictor variable
2. Multiple Regression: Involves two or more predictor variables used to predict an
outcome.
o Equation: Y = a + b₁X₁ + b₂X₂ + … + bₙXₙ

Why Regression Is Useful:

 It helps researchers move from just describing relationships (correlation) to
understanding how much one variable can explain the variation in another.
 It is applicable in various fields, such as education (test scores predicting success),
psychology (stress levels predicting health outcomes), and business (job satisfaction
predicting employee retention).

Overall, regression offers a more detailed and predictive approach than correlation,
allowing for an in-depth analysis of relationships between variables.

Assumptions of Linear Regression

When performing linear regression, several key assumptions need to be met for the results to
be valid. These assumptions ensure that the linear model provides an accurate representation
of the data and that the statistical tests (such as hypothesis tests) produce reliable results.
1. Linearity

 Definition: The relationship between the independent variable (X) and the dependent
variable (Y) should be linear.
 Explanation: The change in the dependent variable is proportional to the change in
the independent variable, meaning the relationship can be represented with a straight
line.
 Implication: If the relationship is non-linear, the linear regression model may not
capture the true association, and predictions will be inaccurate.
 Check: A scatterplot of the variables can help determine whether the relationship is
linear.

2. Independence of Errors (Residuals)

 Definition: The residuals (errors) of the model should be independent of each other.
 Explanation: This means that the error term for one observation should not be
correlated with the error term for another observation. In time series data, for
example, errors may be correlated, which would violate this assumption (a condition
known as autocorrelation).
 Implication: Violations of independence can lead to biased estimates and incorrect
statistical inferences.
 Check: For time series data, a Durbin-Watson test can be used to check for
autocorrelation.

3. Homoscedasticity

 Definition: The variance of the residuals should be constant across all levels of the
independent variable (X).
 Explanation: The spread of the residuals should not vary with the value of the
predictor. If the variance of the residuals changes, the data are said to be
heteroscedastic.
 Implication: If heteroscedasticity is present, it can lead to inefficient estimates and
affect the validity of hypothesis tests.
 Check: A residual vs. fitted values plot can help identify heteroscedasticity, as non-
constant variance will appear as a funnel or fan-shaped pattern.

4. Normality of Residuals

 Definition: The residuals (errors) of the model should follow a normal distribution.
 Explanation: While linear regression does not require the independent or dependent
variables to be normally distributed, the residuals themselves should be normally
distributed. This assumption is crucial for valid statistical inference, particularly when
calculating confidence intervals and p-values.
 Implication: If the residuals are not normally distributed, hypothesis tests may be
unreliable, and predictions may not be accurate.
 Check: Residual plots, histograms, or Q-Q plots can be used to assess the normality
of residuals. Additionally, statistical tests like the Shapiro-Wilk test or Kolmogorov-
Smirnov test can test the normality assumption.
The assumptions of linear regression — linearity, independence, homoscedasticity, and
normality of residuals — ensure the reliability of the regression model. Violations of these
assumptions can lead to inaccurate predictions and invalid statistical conclusions, so it is
essential to check and address any violations before interpreting the results of a regression
analysis.

Different Methods of Estimation for Linear Regression

Estimation methods in linear regression are techniques used to calculate the parameters
(coefficients) of the regression equation that best fit the data. These methods estimate the
relationship between the independent and dependent variables, producing a linear model for
predictions or analysis.

Here are the key methods of estimation:

1. Ordinary Least Squares (OLS)

 Definition: OLS is the most commonly used method to estimate the coefficients in
linear regression.
 How It Works: OLS minimizes the sum of the squared differences (errors) between
the observed values (actual data points) and the predicted values (fitted line). This
method ensures that the line of best fit captures the overall trend of the data.
 Objective: Find the values of the coefficients (slope and intercept) that minimize the
residual sum of squares (RSS).
 Advantages:
o Simple and widely used.
o Produces unbiased estimates under the assumption that the errors are normally
distributed and homoscedastic.
 Limitations:
o OLS is sensitive to outliers and assumes that the residuals are homoscedastic
and normally distributed.

2. Method of Moments

 Definition: The method of moments is a more general statistical technique for
parameter estimation, which can also be applied to linear regression.

3. Maximum Likelihood Estimation (MLE)

 Definition: MLE is a method for estimating the parameters of a statistical model that
maximizes the likelihood function, which measures how well the model explains the
observed data.

Multiple Regression

Multiple regression is an extension of simple linear regression that models the linear
relationship between a single outcome variable (dependent variable) and two or more
predictor variables (independent variables). It helps to determine how each predictor
influences the outcome while controlling for the other variables, allowing researchers to
analyze more complex relationships between variables.
Key Components:

 Outcome Variable (Y): The variable being predicted or explained (dependent variable).
 Predictor Variables (X₁, X₂, …, Xₙ): The independent variables used to predict the
outcome variable.

Mathematical Equation:

The general form of a multiple regression equation is:

Y = a + b₁X₁ + b₂X₂ + … + bₙXₙ

where a is the intercept and b₁, b₂, …, bₙ are the regression coefficients (slopes) for the predictors X₁, X₂, …, Xₙ.

Interpretation:

 Intercept (a): Represents the expected value of Y when all predictor variables are
zero.
 Slopes (b₁, b₂, …, bₙ): Indicate the individual contribution of each predictor variable.
For example, b₁ shows the effect of X₁ on Y, controlling for all other predictors.
 Each coefficient (b) quantifies how much the dependent variable Y changes for a
one-unit change in the corresponding predictor variable X, while keeping other
variables constant.

Uses:

 Understanding relationships: Multiple regression helps in understanding how
different factors simultaneously influence an outcome.
 Prediction: It can be used to predict outcomes based on multiple predictors.
 Controlling for confounding variables: Multiple regression can isolate the effect of
each predictor variable by controlling for the presence of others.

Example:

In a study exploring the relationship between salary (Y) and years of experience (X₁),
education level (X₂), and location (X₃), a multiple regression model would estimate how
much each factor (experience, education, location) influences salary.
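
A hedged sketch of estimating a multiple regression with ordinary least squares via NumPy’s lstsq; the salary / experience / education / location values are entirely made up, and location is coded numerically only for illustration.

```python
import numpy as np

# Hypothetical data: salary (Y) predicted from experience (X1), education (X2), location code (X3)
X1 = np.array([1, 3, 5, 7, 9, 11, 13, 15], dtype=float)
X2 = np.array([12, 12, 16, 16, 18, 16, 18, 20], dtype=float)
X3 = np.array([0, 1, 0, 1, 1, 0, 1, 1], dtype=float)
Y = 20 + 2.0 * X1 + 1.5 * X2 + 5.0 * X3 + np.random.default_rng(0).normal(0, 1, 8)

# Design matrix with a leading column of ones for the intercept a
X = np.column_stack([np.ones_like(X1), X1, X2, X3])

# OLS solution: coefficients minimizing the residual sum of squares
coefs, rss, rank, sv = np.linalg.lstsq(X, Y, rcond=None)
a, b1, b2, b3 = coefs
print([round(c, 2) for c in coefs])    # intercept and slopes, each holding the others constant
```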
Assumptions of Multiple Regression

Multiple regression analysis relies on several key assumptions to ensure the validity and
reliability of the results. These assumptions must be met to avoid biased or misleading
estimates of the relationships between the dependent and independent variables.

1. Linear Relationship

 Definition: There must be a linear relationship between the dependent variable (Y)
and each of the independent variables (X₁, X₂, …, Xₙ).
 Implication: The changes in the outcome variable (Y) should correspond
proportionally to changes in the predictor variables.
 How to check: A scatterplot or residual plot can be used to visually assess if the
relationship appears linear.

2. Multivariate Normality

 Definition: The residuals (errors) of the regression model should follow a normal
distribution.
 Implication: This assumption is particularly important for hypothesis testing (e.g., t-
tests for coefficients) and for making reliable confidence intervals.
 How to check:
o Use a histogram or Q-Q plot of residuals to visually inspect normality.
o Formal tests like the Shapiro-Wilk test can also be applied.

3. No Multicollinearity

 Definition: The independent variables should not be too highly correlated with each
other, a phenomenon known as multicollinearity.
 Implication: If multicollinearity exists, it becomes difficult to isolate the individual
effect of each predictor, leading to inflated standard errors and unstable coefficient
estimates.
 How to check:
o Calculate the Variance Inflation Factor (VIF) for each predictor. A VIF
value above 10 typically indicates problematic multicollinearity (a worked
sketch of this check follows this list of assumptions).
o Tolerance values can also be used (a tolerance below 0.1 suggests high
multicollinearity).

4. Homoscedasticity

 Definition: The variance of the residuals (errors) should be constant across all levels
of the independent variables (X₁, X₂, …, Xₙ). This is known as homoscedasticity.
 Implication: If the variance of the errors changes across levels of the independent
variables, a condition called heteroscedasticity exists, which can affect the efficiency
of the coefficient estimates and the accuracy of predictions.
 How to check:
o Use a scatterplot of residuals versus predicted values. If the plot shows a
funnel shape (wide variance at one end), it indicates heteroscedasticity.
o Formal tests like the Breusch-Pagan test can be used to detect
heteroscedasticity.

5. Independence of Errors

 Definition: The residuals (errors) should be independent of each other, meaning that
the error for one observation should not be correlated with the error for another.
 Implication: Violations of this assumption (often seen in time series data) can lead to
biased standard errors, which affects significance tests.
 How to check:
o Durbin-Watson statistic is commonly used to detect autocorrelation in the
residuals.

6. No Perfect Collinearity

 Definition: There should be no perfect linear relationship between the independent
variables. Perfect collinearity occurs when one predictor is an exact linear
combination of others.
 Implication: This leads to an inability to compute regression coefficients because the
model cannot uniquely estimate the effect of each independent variable.
 How to check: This is typically checked as part of the multicollinearity assessment
(e.g., through VIF or tolerance).
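
As referenced under the multicollinearity assumption above, here is a minimal sketch of computing VIF by hand: regress each predictor on the remaining predictors and take VIF = 1 / (1 − R²). The data are simulated for illustration only.

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of the predictor matrix X (no intercept column)."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]                                    # predictor being checked
        others = np.delete(X, j, axis=1)               # all remaining predictors
        A = np.column_stack([np.ones(len(y)), others]) # add an intercept column
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1 - resid.var() / y.var()                 # R^2 of predictor j on the others
        out.append(1.0 / (1.0 - r2))
    return out

# Hypothetical predictors: X2 is strongly related to X1, so both should show elevated VIFs
rng = np.random.default_rng(1)
X1 = rng.normal(size=50)
X2 = 0.95 * X1 + rng.normal(scale=0.2, size=50)
X3 = rng.normal(size=50)
print([round(v, 2) for v in vif(np.column_stack([X1, X2, X3]))])
```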

Logistic Regression

Logistic regression is a statistical method used to examine the relationship between one or
more independent variables (categorical or continuous) and a dichotomous dependent
variable (i.e., an outcome variable with two possible values, such as success/failure or
yes/no). Unlike linear regression, logistic regression is used when the outcome variable is
binary.

Key Concepts:

1. Outcome Variable:
o The outcome variable is dichotomous, meaning it has two categories (e.g.,
success/failure, yes/no, present/absent).
o It models the probability of the event of interest occurring, often coded as 1
for "success" or the event happening, and 0 for "failure" or the event not
happening.
2. Predictor Variables:
o Logistic regression can include one or more continuous or categorical
predictor variables.
o These predictors are used to estimate the probability of the occurrence of the
outcome.
3. Logistic Function:
o Logistic regression uses the logistic function (also called the sigmoid
function) to model the probability of the binary outcome.
4. Odds and Odds Ratio:
o Odds: The odds are defined as the ratio of the probability that an event will
occur to the probability that it will not occur: Odds = P / (1 − P).
o Odds Ratio (OR): Logistic regression outputs odds ratios, which represent
the ratio of the odds of the event occurring for different values of the predictor
variables.
1. OR > 1: The event is more likely to occur as the predictor variable
increases.
2. OR < 1: The event is less likely to occur as the predictor variable
increases.

Example:

In the case of a study investigating whether a history of suicide attempts is associated with an
increased risk of subsequent suicide attempts:

 Outcome (dependent variable): Whether or not a person makes a future suicide
attempt (dichotomous: yes/no).
 Predictor variable: History of past suicide attempts (yes/no).

The logistic regression model would predict the odds of a future suicide attempt based on
whether a person has a history of prior attempts. The odds ratio would compare the odds of a
future attempt in those with a history of attempts versus those without.

Threshold for Classification:

 Logistic regression outputs a probability value (between 0 and 1) that represents
the likelihood of the event occurring. To make a binary decision (whether the event
occurs or not), a threshold is set.
 Threshold (typically 0.5):
o If the predicted probability is greater than or equal to 0.5, the outcome is
classified as 1 (event happens).
o If the predicted probability is less than 0.5, the outcome is classified as 0
(event does not happen).
This threshold can be adjusted based on the problem's context. For example, in medical
diagnosis, you might set a lower threshold (e.g., 0.3) to err on the side of caution when
predicting a disease.
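
A hedged sketch of the pieces described above: the logistic (sigmoid) function turning a linear predictor into a probability, the usual 0.5 classification threshold, and the odds ratio obtained as exp(coefficient). The coefficients are assumed values for illustration, not estimates from real data.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps the linear predictor to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Assumed (illustrative) fitted coefficients: intercept a and slope b for one binary predictor
a, b = -1.5, 0.9
history = np.array([0, 1, 0, 1])          # hypothetical predictor: prior attempt yes(1)/no(0)

p = sigmoid(a + b * history)              # predicted probability of the outcome
predicted_class = (p >= 0.5).astype(int)  # classification using the usual 0.5 threshold

odds = p / (1 - p)                        # odds = P / (1 - P)
odds_ratio = np.exp(b)                    # OR for a one-unit increase in the predictor
print(np.round(p, 3), predicted_class, round(odds_ratio, 3))
```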

Assumptions:

The key assumptions of logistic regression ensure the model works effectively and provides
valid results. Here's an overview:

1. Binary Outcome Variable:


o Logistic regression is designed for binary dependent variables, meaning the
outcome must have only two possible categories (e.g., 0/1, yes/no,
success/failure).
2. Linearity of Logits:
o While logistic regression doesn't require a linear relationship between the
independent and dependent variables, it assumes that the log-odds (logit
transformation of the probability of the outcome) have a linear relationship
with the predictor variables.
o This is crucial to the interpretation of the coefficients in terms of the odds of
the event happening.
3. Independence of Observations:
o The observations must be independent of one another, meaning that the
outcome for one observation should not influence the outcome of any other
observation. Violating this assumption, such as in time-series or clustered
data, can lead to biased results.
4. No Multicollinearity:
o The predictor variables should not be highly correlated with each other.
Multicollinearity can inflate standard errors and make the coefficients
unreliable or difficult to interpret. Techniques like variance inflation factor
(VIF) can be used to check for multicollinearity.
5. Large Sample Size:
o Logistic regression works best with larger sample sizes, especially when the
event of interest (i.e., the positive outcome) is rare. Small sample sizes can
lead to unstable coefficient estimates and poor model performance. Ideally,
there should be sufficient cases for both outcomes.
6. Correct Specification of the Model:
o The model should include all relevant predictors and exclude irrelevant ones.
Omitting important predictors or including too many unnecessary ones can
introduce bias or overfitting, leading to incorrect conclusions.
7. No Outliers in Predictors:
o Significant outliers in the predictor variables can disproportionately affect the
results, causing incorrect estimation of coefficients and unreliable predictions.
Outlier detection and treatment methods should be applied to minimize their
impact.
2.4 Mediation Analysis – Concepts only

Mediation Analysis

Mediation analysis is a statistical approach used to explore how and why a predictor variable
(X) affects an outcome variable (Y) by introducing a mediating variable (M). This third
variable represents the process or mechanism through which the predictor influences the
outcome, helping to clarify the underlying relationship.

In essence, mediation analysis breaks down the total effect of X on Y into direct and
indirect effects:

 Direct Effect: The effect of X on Y that is not explained by the mediator.


 Indirect Effect: The effect of X on Y that occurs through the mediator (M). This is
the "mediated" part of the relationship.

Steps in Mediation Analysis

Mediation involves estimating three key regression equations to assess the relationships
between the variables:

1. X → Y (Total Effect): This step assesses the direct relationship between the predictor
(X) and the outcome (Y). It represents the total effect without accounting for the
mediator.
2. X → M (Effect of X on M): This step examines the relationship between the predictor
(X) and the mediator (M). It tells us whether the predictor variable is associated with
the mediator, indicating a potential pathway.
3. X + M → Y (Effect of X and M on Y): In this final step, both X (the predictor) and M
(the mediator) are entered into a regression model predicting Y. The effect of X on Y
while controlling for M is the direct effect. The extent to which M explains part of
the X → Y relationship is the indirect effect.
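
A minimal sketch of the three regressions described above, using simple least-squares fits and the product-of-coefficients estimate of the indirect effect (a × b); the data are simulated purely for illustration, and this is not a full mediation test (no significance testing or bootstrapping).

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
X = rng.normal(size=n)                       # predictor (e.g., job satisfaction)
M = 0.6 * X + rng.normal(size=n)             # mediator (e.g., organizational commitment)
Y = 0.4 * M + 0.2 * X + rng.normal(size=n)   # outcome (e.g., retention)

def ols(design, y):
    """Return least-squares coefficients for a design matrix that already includes an intercept."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

ones = np.ones(n)
c_total = ols(np.column_stack([ones, X]), Y)[1]        # Step 1: X -> Y (total effect c)
a_path = ols(np.column_stack([ones, X]), M)[1]         # Step 2: X -> M (path a)
coefs = ols(np.column_stack([ones, X, M]), Y)          # Step 3: X + M -> Y
c_direct, b_path = coefs[1], coefs[2]                  # direct effect c' and path b

indirect = a_path * b_path                             # indirect (mediated) effect
print(round(c_total, 3), round(c_direct, 3), round(indirect, 3))
print(round(c_direct + indirect, 3))                   # with OLS, c = c' + a*b (up to rounding)
```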

Interpretation

 Total Effect: The overall impact of X on Y without considering M.


 Direct Effect: The impact of X on Y after accounting for the mediator (M). If the
direct effect is significantly reduced compared to the total effect, it suggests that M
partially mediates the relationship between X and Y.
 Indirect Effect: The portion of the relationship between X and Y that occurs through
the mediator (M). A significant indirect effect suggests mediation.

Types of Mediation

1. Full Mediation: The effect of X on Y disappears when the mediator (M) is included
in the model, meaning that X affects Y entirely through M.
2. Partial Mediation: X still has a direct effect on Y, but part of the effect is mediated
by M. In this case, both the direct and indirect effects are significant.

Example:
Consider a study where job satisfaction (X) influences employee retention (Y). The
researcher hypothesizes that organizational commitment (M) mediates this relationship.
Mediation analysis would assess:

1. The total effect of job satisfaction on retention (X → Y).


2. The effect of job satisfaction on organizational commitment (X → M).
3. The effect of both job satisfaction and organizational commitment on retention (X +
M → Y).

If organizational commitment significantly mediates the relationship between job satisfaction
and retention, it suggests that job satisfaction influences retention indirectly by increasing
commitment.

Mediation analysis provides insights into how and why relationships between variables
occur, helping to identify potential mechanisms or processes that explain observed
associations. It is commonly used in fields such as psychology, social sciences, and health
research to deepen understanding beyond direct relationships.
