Correlation
Correlation: Overview
Correlation is a statistical measure that describes the degree to which two variables move in
relation to each other. It quantifies the strength and direction of a linear relationship
between two variables. A correlation coefficient is a numerical value that can range from -1
to +1:
A correlation of +1 means a perfect positive linear relationship: as one variable
increases, the other increases in an exactly linear fashion.
A correlation of -1 means a perfect negative linear relationship: as one variable
increases, the other decreases in an exactly linear fashion.
A correlation of 0 means no linear relationship between the two variables.
Types of Correlation
1. Positive Correlation:
When the value of one variable increases as the value of the other increases, the
variables are said to have a positive correlation. For example, the relationship between
height and weight.
2. Negative Correlation:
When the value of one variable increases while the value of the other decreases, the
variables have a negative correlation. For example, the relationship between the speed of a
car and the time it takes to reach a destination.
3. Zero or No Correlation:
If there is no predictable relationship between two variables, they are said to have no
correlation. For instance, the relationship between a person’s shoe size and their
intelligence level.
Methods of Measuring Correlation
1. Pearson's Correlation Coefficient (r):
Measures the strength and direction of the linear relationship between two
continuous variables.
Formula:
r = Σ(X_i − X̄)(Y_i − Ȳ) / √( Σ(X_i − X̄)² · Σ(Y_i − Ȳ)² )
where X_i and Y_i are the individual data points, and X̄ and Ȳ are the means of the
respective variables.
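The Pearson formula can be checked with a short Python sketch computed directly from the definition (an illustration, not part of the original notes):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient computed from the definition."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of products of deviations from the means
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator: square roots of the sums of squared deviations
    sx = math.sqrt(sum((xi - mean_x) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - mean_y) ** 2 for yi in y))
    return cov / (sx * sy)

# Perfect positive linear relationship -> r = +1
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
# Perfect negative linear relationship -> r = -1
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))
```

The two calls reproduce the boundary cases described in the overview: exactly linear data gives r = ±1.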
2. Spearman’s Rank Correlation Coefficient (ρ or rₛ):
Measures the strength and direction of the monotonic relationship between two
ranked variables.
Used when data is ordinal or not normally distributed.
It evaluates how well the relationship between two variables can be described using
a monotonic function.
Formula:
r_s = 1 − (6 Σ d_i²) / (n(n² − 1))
where d_i is the difference between the ranks of corresponding observations, and n is the
number of observations.
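As an illustration (not from the original notes), the rank-difference formula can be sketched in Python; note the formula is exact only when there are no ties, which is why the ranking helper below assigns average ranks:

```python
def rank(values):
    """Assign 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's r_s via the rank-difference formula (exact without ties)."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# A monotonic but non-linear relationship still gives r_s = 1
print(spearman_rho([1, 2, 3, 4, 5], [1, 4, 9, 16, 25]))  # 1.0
```

The example shows the key property of Spearman's coefficient: it measures monotonicity, so a perfectly monotonic curve scores 1 even though Pearson's r would not.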
3. Kendall's Tau (τ):
Measures the association between two ordinal variables.
It is used for smaller datasets and when dealing with ties in data ranks.
More robust to outliers than Spearman’s coefficient.
Formula:
τ = (C − D) / (½ n(n − 1))
where C is the number of concordant pairs and D is the number of discordant pairs.
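A minimal Python sketch of the pair-counting definition (this is the simple tau-a variant, which ignores tied pairs; the tie-adjusted tau-b is what most statistics software reports):

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a: (C - D) / (n(n-1)/2) over all pairs of observations."""
    n = len(x)
    c = d = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            c += 1  # concordant: both variables order the pair the same way
        elif s < 0:
            d += 1  # discordant: the two variables order the pair oppositely
        # s == 0 is a tie and counts toward neither (tau-a ignores ties)
    return (c - d) / (n * (n - 1) / 2)

print(kendall_tau([1, 2, 3, 4], [1, 2, 4, 3]))  # (5 - 1) / 6 ≈ 0.667
```

With four observations there are 6 pairs; swapping the last two y-values makes exactly one pair discordant, giving τ = 4/6.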
4. Point-Biserial Correlation:
Used to measure the relationship between a continuous variable and a binary
variable (i.e., a variable that takes only two values, like 0 or 1).
Similar to Pearson’s correlation but adapted for binary data.
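Because the point-biserial coefficient is just Pearson's r with the binary variable coded 0/1, it can also be computed from group means. A short sketch using the standard mean-difference form (the data values are made up for illustration):

```python
import math

def point_biserial(binary, continuous):
    """Point-biserial correlation: r_pb = (M1 - M0) / s * sqrt(p * q),
    where M1, M0 are the group means of the continuous variable,
    s is its population standard deviation, and p, q are the group proportions."""
    n = len(binary)
    g1 = [c for b, c in zip(binary, continuous) if b == 1]
    g0 = [c for b, c in zip(binary, continuous) if b == 0]
    m1, m0 = sum(g1) / len(g1), sum(g0) / len(g0)
    mean = sum(continuous) / n
    s = math.sqrt(sum((c - mean) ** 2 for c in continuous) / n)
    p, q = len(g1) / n, len(g0) / n
    return (m1 - m0) / s * math.sqrt(p * q)

# Same result as Pearson's r on the 0/1-coded data
print(point_biserial([0, 0, 1, 1], [1, 2, 3, 4]))
```

For this data the value equals 2/√5 ≈ 0.894, matching Pearson's r applied directly to the 0/1 codes.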
5. Phi Coefficient (φ):
Used when both variables are binary.
For example, it could measure the correlation between gender (male/female) and
voting behavior (yes/no).
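For two binary variables the data reduce to a 2x2 contingency table, and φ has a closed form, φ = (ad − bc) / √((a+b)(c+d)(a+c)(b+d)). A minimal sketch (the cell counts are invented for illustration):

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi coefficient for a 2x2 contingency table:
              Y=1   Y=0
        X=1    a     b
        X=0    c     d
    """
    num = a * d - b * c
    den = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return num / den

# Perfect association between the two binary variables -> phi = 1
print(phi_coefficient(5, 0, 0, 5))  # 1.0
# Perfect inverse association -> phi = -1
print(phi_coefficient(0, 5, 5, 0))  # -1.0
```

Like the other coefficients, φ ranges from -1 to +1, with 0 indicating no association between the two binary variables.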
Conclusion
Correlation analysis is a fundamental tool in statistics for understanding the relationships
between variables. It is important to select the appropriate type of correlation coefficient
based on the data type and distribution, to interpret the results carefully, and to remember
that correlation does not imply causation. Statistical tests for significance help determine
whether the observed correlation is meaningful or simply due to random chance.
Multiple Correlation
Multiple correlation measures the strength of the relationship between one dependent
(criterion) variable and two or more independent (predictor) variables taken together. It
is used when you want to predict or explain one variable based on several other
variables.
1. Multiple Correlation Coefficient (R):
Denoted as R, the multiple correlation coefficient shows how well the set of
independent variables collectively predict or explain the dependent variable.
The value of R ranges from 0 to 1, where:
o R = 1: Indicates a perfect linear relationship between the dependent variable
and the independent variables.
o R = 0: Indicates no linear relationship.
2. Multiple Correlation Formula:
The formula for R in terms of the correlations between a dependent variable Y and two
independent variables X1 and X2 can be written as:
R = √[ (r_{Y,X1}² + r_{Y,X2}² − 2 r_{Y,X1} r_{Y,X2} r_{X1,X2}) / (1 − r_{X1,X2}²) ]
where:
r_{Y,X1} and r_{Y,X2} are the simple correlation coefficients between the dependent
variable Y and the independent variables X1 and X2.
r_{X1,X2} is the correlation between the two independent variables.
This formula can be extended to more than two independent variables.
3. Interpretation of R:
The closer R is to 1, the stronger the relationship between the independent variables
and the dependent variable.
However, R does not indicate whether the relationship is positive or negative; it only
measures the strength of the relationship.
R² (also called the coefficient of determination) represents the proportion of
variance in the dependent variable explained by the independent variables
combined. For example, if R² = 0.75, it means 75% of the variation in the dependent
variable can be explained by the independent variables.
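The two-predictor case can be sketched in Python using the standard formula R = √[(r_{Y,X1}² + r_{Y,X2}² − 2 r_{Y,X1} r_{Y,X2} r_{X1,X2}) / (1 − r_{X1,X2}²)] (the correlation values below are made up for illustration):

```python
import math

def multiple_R(r_y1, r_y2, r_12):
    """Multiple correlation R of Y on two predictors X1 and X2,
    computed from the three pairwise correlations (requires |r_12| < 1)."""
    num = r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12
    return math.sqrt(num / (1 - r_12 ** 2))

# If the predictors are uncorrelated (r_12 = 0),
# R^2 is simply the sum of the squared simple correlations
R = multiple_R(0.6, 0.5, 0.0)
print(R, R ** 2)  # R ≈ 0.781, R² ≈ 0.61
```

The uncorrelated-predictor case makes the interpretation of R² concrete: each predictor contributes its own squared correlation to the explained variance.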
Partial Correlation
Partial correlation measures the strength and direction of the relationship between two
variables while controlling for the effect of one or more additional variables. In other
words, it assesses the direct association between two variables, removing the influence of
the control variable(s).
1. Purpose of Partial Correlation:
Partial correlation helps to isolate the relationship between two variables by
"partialing out" or controlling for the effects of other variables.
It is useful when you want to know whether the relationship between two variables
is spurious (i.e., falsely attributed to a direct relationship but actually due to a third
variable).
2. Partial Correlation Coefficient:
The partial correlation coefficient is denoted as r_{XY·Z}, which measures the
correlation between variables X and Y while controlling for Z.
r_{XY·Z} ranges from -1 to +1:
o r_{XY·Z} = 0: No direct relationship between X and Y after controlling for Z.
o r_{XY·Z} > 0: A positive relationship between X and Y after controlling for Z.
o r_{XY·Z} < 0: A negative relationship between X and Y after controlling for Z.
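The spuriousness idea in item 1 can be made concrete with a short Python sketch using the standard first-order partial-correlation formula, r_{XY·Z} = (r_{XY} − r_{XZ} r_{YZ}) / √((1 − r_{XZ}²)(1 − r_{YZ}²)) (the correlation values below are invented for illustration):

```python
import math

def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation r_{XY.Z} from the three pairwise
    correlations (assumes |r_xz| < 1 and |r_yz| < 1)."""
    num = r_xy - r_xz * r_yz
    den = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return num / den

# If X and Y are related only through Z (so r_xy = r_xz * r_yz),
# the partial correlation is 0: the apparent X-Y link is spurious
print(partial_corr(0.35, 0.7, 0.5))  # 0.0
```

Here X and Y correlate at 0.35, yet after controlling for Z the direct association vanishes, which is exactly the spurious-relationship scenario described above.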
3. Partial Correlation Formula:
For two variables X and Y while controlling for Z, the partial correlation is given by: