
EDA REVIEWER

NON-PARAMETRIC TEST FOR CORRELATION


1. PHI-COEFFICIENT
Introduction
- The phi-coefficient is a statistical measure used to assess the association
between two binary (dichotomous) variables.
- It was designed for the comparison of truly dichotomous distributions, i.e.,
distributions that have only two points on their scale which indicate some
unmeasurable attribute. It is also sometimes known as the Yule φ coefficient.
When to use?
- When you want to know the relationship between two variables
- Variables of interest are binary
Assumptions
Binary Variables
For this test, your two variables must be binary. Binary means that each variable
is a category with only two possible values.
Examples:
- Gender (Male or Female)
- True or False
- Yes or No

How to Interpret Phi Results


The analysis yields a Phi coefficient and a p-value. Phi values range from -1 to 1,
where a negative Phi indicates an inverse relationship between variables (one
increases while the other decreases), and a positive Phi indicates a direct
relationship (both increase together).
Phi Coefficient Value   Interpretation
0                       No relationship
1                       Perfect positive relationship: most of your data falls in the diagonal cells.
-1                      Perfect negative relationship: most of your data falls off the diagonal cells.
-1 to -0.7              Strong negative association
-0.7 to -0.3            Weak negative association
-0.3 to 0.3             Little or no association
0.3 to 0.7              Weak positive association
0.7 to 1                Strong positive association
Hypothesis Testing
1. Formulate the hypotheses
2. Create a contingency table
3. Calculate the Phi-coefficient
4. Calculate the Chi-square test statistic
5. Determine the p-value
6. Make a decision & interpretation
Example: A study was conducted to determine whether individuals prefer tea or
coffee based on their gender among 100 participants. Test at the 0.05 level of
significance.
α = 0.05
1. State the null and alternative hypothesis
H0: φ = 0
H1: φ ≠ 0
2. Contingency Table
              COFFEE   TEA      ROW TOTAL
FEMALE        40 (a)   10 (b)   50
MALE          20 (c)   30 (d)   50
COLUMN TOTAL  60       40       100
3. Calculate the Phi-Coefficient
φ = (ad − bc) / √((a + b)(c + d)(a + c)(b + d))
  = ((40)(30) − (10)(20)) / √((40 + 10)(20 + 30)(40 + 20)(10 + 30))
  = (1200 − 200) / √((50)(50)(60)(40))
  = 1000 / √6,000,000
  = 1000 / 2449.489743
φ = 0.4082
The Phi coefficient analysis indicates a weak positive correlation between gender
and tea or coffee preference. To assess the significance of this correlation, the chi-
square statistic needs to be calculated.
χ² = nφ²
Where:
χ² is the chi-square statistic
n is the total number of observations (the grand total in your contingency table)
φ is the phi-coefficient
4. Calculate the Chi-Square Statistic
χ² = nφ²
   = 100(0.4082)²
χ² = 16.6627
5. Find the Critical Value or P-value:
α = 0.05
For a 2x2 contingency table:
df = (number of rows -1)(number of columns-1)
= (2-1)(2-1)
df = 1
C.V. = 3.841
6. Decision & Conclusion
Reject the null hypothesis if χ² > C.V.
Since 16.66 > 3.841, we reject the null hypothesis.

At the 0.05 level of significance, there is a significant, weak positive
association between gender and preference for tea or coffee.

Application Using SPSS


- Import the data into SPSS, then click Analyze.
- Click Descriptive Statistics, then Crosstabs.
- Under Statistics, check Chi-square and Phi and Cramér's V, then click Continue.
- The results will appear on your screen.
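To double-check the worked example outside SPSS, here is a minimal Python sketch (assuming numpy and scipy are available) that reproduces the phi coefficient and chi-square statistic from the contingency table above:

```python
import numpy as np
from scipy.stats import chi2_contingency

# 2x2 contingency table from the example: rows = gender, columns = preference
#                  coffee  tea
table = np.array([[40, 10],   # female
                  [20, 30]])  # male

a, b, c, d = table.ravel()
n = table.sum()

# phi = (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d))
phi = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Chi-square via chi2 = n * phi^2; correction=False matches the manual,
# uncorrected computation in the example
chi2_manual = n * phi**2
chi2, p, df, expected = chi2_contingency(table, correction=False)

print(f"phi = {phi:.4f}")   # ~0.4082
print(f"chi2 = {chi2_manual:.4f} (scipy: {chi2:.4f}), df = {df}, p = {p:.5f}")
```

The p-value scipy reports is far below 0.05, which matches the critical-value decision above.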

2. POINT BISERIAL CORRELATION


Introduction

- Point-biserial correlation is also called the point-biserial correlation
coefficient (rpb).
- It measures the strength and direction of the association between a continuous
variable and a binary (dichotomous) variable.
- It is a special case of the Pearson product-moment correlation.
When to Use?
You should use Point-Biserial Correlation in the following scenario:
1. You want to know the relationship between two variables
2. Your variables of interest include one continuous and one dichotomous variable
3. You only have two variables
Assumption
There are five assumptions necessary for a valid point-biserial correlation:
1. One variable must be continuous.
2. The other variable must be dichotomous.
3. There should be no outliers in the continuous variable for each dichotomous
category, which can be checked using boxplots.
4. The continuous variable should be approximately normally distributed
within each category, assessed using the Shapiro-Wilk Test.
5. The continuous variable must have equal variances across categories, verified
by Levene’s test for equality of variance.
Hypothesis Testing
H0: rpb = 0
H1: rpb > 0 or rpb < 0 (one-tailed test)
H1: rpb ≠ 0 (two-tailed test)
Formula:
A common form of the point-biserial formula is
rpb = ((M1 − M0) / s) √(pq)
where M1 and M0 are the means of the continuous variable in the two groups, s is
the standard deviation of all the continuous scores (computed with n in the
denominator), and p and q are the proportions of cases in each group.
Application to SPSS
- Import the data into SPSS, then click Analyze.
- Click Correlate and proceed to Bivariate.
- Enter the variables, then click Options. After that, check Means and standard
deviations and Exclude cases pairwise.
- SPSS does not have a specific point-biserial procedure; instead, you can use
Pearson's r as a substitute, which gives the identical coefficient when one
variable is coded 0/1.
- The results will appear on your screen.
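As a cross-check outside SPSS, scipy exposes the point-biserial directly. The sketch below uses small hypothetical arrays (not data from this reviewer) and verifies the formula form given above:

```python
import numpy as np
from scipy.stats import pointbiserialr, pearsonr

# Hypothetical data: group membership (0/1) and a continuous score
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
score = np.array([52.0, 55.5, 60.0, 58.0, 70.0, 66.5, 72.0, 69.0])

r_pb, p = pointbiserialr(group, score)

# Manual check with rpb = ((M1 - M0) / s) * sqrt(p*q)
m1 = score[group == 1].mean()
m0 = score[group == 0].mean()
s = score.std()                  # population sd (ddof=0) matches this form
p1 = (group == 1).mean()
manual = (m1 - m0) / s * np.sqrt(p1 * (1 - p1))

print(f"r_pb = {r_pb:.4f}, manual = {manual:.4f}, p = {p:.4f}")

# Point-biserial is a special case of Pearson's r, so pearsonr agrees
r, _ = pearsonr(group, score)
print(f"pearson r = {r:.4f}")    # identical to r_pb
```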

3. SPEARMAN RHO
Introduction
- Spearman’s rank correlation assesses the strength and direction of association
between two ranked variables, measuring the monotonicity of their relationship.
The calculations for the rank correlation coefficient are simpler than those for
the Pearson coefficient, involving ranking the data and computing differences in
ranks. The resulting coefficient, rs, ranges from +1 (perfect positive rank
correlation) to -1 (perfect negative rank correlation), with values near 0
indicating no relationship. It is appropriate to use Spearman’s rank correlation
when analyzing the covariation of two ranked variables.
Formula:
rs = 1 − (6 Σd²) / (n(n² − 1))
Where:
rs = Spearman's rank correlation coefficient
d = difference between the ranks of each pair of observations
n = number of observations
The Spearman Rank correlation can take a value from +1 to -1 where,
- A value of +1 means a perfect positive association of rank
- A value of 0 means that there is no association between ranks
- A value of -1 means a perfect negative association of rank
Assumptions
There are two main assumptions for Spearman’s rank correlation coefficient:
1. The data must be measured on an ordinal or continuous scale.
2. The two variables should have a monotonic relationship, meaning that as one
variable increases (or decreases), the other variable also increases (or
decreases), allowing for the formation of ranks.
Hypothesis Testing
H0: rs = 0 (no association between the ranks)
H1: rs ≠ 0 (the ranks are associated)
Application Using Excel
Excel does not have a built-in procedure to compute the Spearman rank correlation
coefficient. However, you may compute this statistic by using the MegaStat add-in
available on your CD. If you have not installed this add-in, do so, following the
instructions.
1. Enter the rating scores from the raw data you have into columns A and B of a
new worksheet.
2. From the toolbar, select Add-ins, MegaStat>Nonparametric
Tests>Spearman Coefficient of Rank Correlation. Note: You may need to open
MegaStat from the MegaStat.xls file on your computer's hard drive.
3. Select the cells of your data for Input Range.
4. Check the Correct for ties option.
5. Click [OK] and the result will appear on a different sheet named “Output”
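If MegaStat is unavailable, the same statistic can be computed in Python. This sketch uses hypothetical rating scores for the two columns and checks scipy's result against the rs formula above (the simple formula is exact when there are no ties):

```python
import numpy as np
from scipy.stats import spearmanr, rankdata

# Hypothetical rating scores from two judges (stand-ins for columns A and B)
judge_a = [8, 6, 9, 5, 7, 10, 4]
judge_b = [7, 5, 10, 6, 8, 9, 3]

# spearmanr ranks the data internally and handles ties
rho, p = spearmanr(judge_a, judge_b)

# Manual check with rs = 1 - 6*sum(d^2) / (n(n^2 - 1)), valid with no ties
ra, rb = rankdata(judge_a), rankdata(judge_b)
d = ra - rb
n = len(d)
rs = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

print(f"rho = {rho:.4f}, manual rs = {rs:.4f}, p = {p:.4f}")
```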

4. KENDALL TAU
Introduction
- Kendall's Tau is a non-parametric measure of the relationship between columns
of ranked data. In magnitude, the Tau correlation coefficient returns a value of
0 to 1, where:
Tau Correlation   Interpretation
0                 No relationship
1                 Perfect relationship
- Kendall's Tau can also produce values from -1 to 0; however, a negative value in
ranked data may simply indicate a column switch, so the negative sign can be
disregarded in interpretations.
- There are different versions of Tau, including Tau-A, Tau-B, and Tau-C, with
Tau-B being the most widely available in statistical software.
- Kendall's Tau can be calculated manually using the formula
Tau = (C − D) / (C + D), where C is the number of concordant pairs and D is the
number of discordant pairs.
- Non-parametric methods like Kendall's Tau and Spearman's rank-order correlation
are recommended for non-normal data, while Pearson's product-moment correlation
is suited for normally distributed data.
When To Use?
Kendall's Tau is a non-parametric test used to assess the strength and direction of
association between two ranked variables. It is recommended to use Kendall's Tau
under the following conditions:

1. Ordinal Data: The variables are ranked, such as survey responses.


2. Non-Normal Distributions: The data does not need to follow a normal
distribution, making it suitable for non-parametric data.
3. Tied Ranks: It effectively handles tied ranks in data.
4. Small Sample Sizes: It often yields more reliable results than Pearson's
correlation in small samples.
5. Monotonic Relationships: It is useful when a monotonic relationship is
suspected, where one variable consistently changes with another, though not
necessarily at a constant rate.
Types of Kendall's Tau
There are various versions of Kendall’s Tau available. These are the following:
TYPE    DESCRIPTION                                 TIED RANK    SUITABLE FOR
                                                    ADJUSTMENT
Tau-A   The basic version of Kendall's tau.         No           Square tables
Tau-B   Adjusts for tied ranks, dividing the        Yes          Square tables
        numbers of concordant and discordant
        pairs by the total number of possible
        pairs.
Tau-C   Adjusts for tied ranks using a different    Yes          Rectangular tables
        formula to accommodate rectangular
        tables.
Formula
For Tau-B, you can calculate it manually using the formula:
Kendall's Tau = (C − D) / (C + D)

For Tau-C, the formula is as follows:


Tau-C = (C − D) / (T − t)

Where:
C = the number of concordant pairs
D = the number of discordant pairs
T = total number of possible pairs
t = number of tied ranks

Tau-C provides a more accurate measure of the association between two rankings
than Tau-A or Tau-B when there are tied ranks. However, it is also
computationally more intensive.
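To make the concordant/discordant counting concrete, here is a small brute-force Python sketch with hypothetical ranks that counts C and D and applies the basic (C − D) / (C + D) form:

```python
from itertools import combinations

# Hypothetical ranked data with no ties
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

C = D = 0
for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
    s = (x1 - x2) * (y1 - y2)
    if s > 0:
        C += 1   # concordant: the pair is ordered the same way in both lists
    elif s < 0:
        D += 1   # discordant: the pair is ordered oppositely (ties count in neither)

tau = (C - D) / (C + D)
print(f"C = {C}, D = {D}, tau = {tau:.2f}")   # C = 8, D = 2, tau = 0.60
```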

Application Using SPSS


To apply Kendall's Tau using SPSS, follow these steps:

1. Load Data into SPSS: Start SPSS and import your dataset, which should include
the relevant variables, by navigating to File > Open > Data.
2. Access the Analyze Menu: Click on “Analyze” in the top menu, then select
“Correlate” and choose “Bivariate.”
3. Choose Variables: In the “Bivariate Correlations” dialogue box, select the
variables to analyze and move them to the “Variables” box, ensuring to check
Kendall for the correlation coefficient.
4. Generate SPSS Output: Click "OK" to perform the analysis, and SPSS will
produce an output that includes the correlation table for your dataset.
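Outside SPSS, scipy computes Tau-B directly; the survey-style responses below are hypothetical:

```python
from scipy.stats import kendalltau

# Hypothetical survey responses on a 1-5 scale (ties are handled by tau-b)
q1 = [1, 2, 3, 4, 5, 3, 2, 4]
q2 = [2, 1, 3, 5, 4, 3, 1, 5]

tau, p = kendalltau(q1, q2)   # scipy's default variant is tau-b
print(f"tau-b = {tau:.4f}, p = {p:.4f}")
```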

5. CHI-SQUARE
Introduction
The Chi-Square test is a statistical method used to determine whether there is a
significant association between two categorical variables. It evaluates how closely
the observed frequencies of a dataset align with the expected frequencies under
the null hypothesis.

When to Use
The Chi-square test for correlation is appropriate under specific circumstances,
particularly when you aim to evaluate the relationship between two categorical
variables. Here are some scenarios when you should consider using the Chi-square
test:
1. Examine Relationships Between Categorical Variables: Use the Chi-square
test to determine if there is a significant association between two categorical
variables (e.g., gender and voting preference, or smoking status and lung disease).
2. Contingency Tables: It is particularly applicable when the data can be
summarized in a contingency table (also known as a cross-tabulation), which
displays the frequency distribution of the variables.
3. Nominal or Ordinal Data: The test is most commonly applied to nominal data
(categories without intrinsic order). However, it can also be used for ordinal data
(categories with a specific order) as long as the assumptions about independence
and expected frequency are respected.
4. Large Samples: It is best used for larger sample sizes where the expected
frequency in each cell of the contingency table is adequate (typically at least 5). If
you have small sample sizes, consider using Fisher's Exact Test instead.
5. Testing Hypotheses About Independence: The Chi-square test can be used
when you want to test hypotheses about whether two categorical variables are
independent (i.e., no association) or dependent (i.e., some association).
6. Exploratory Analysis: The Chi-square test is often used in exploratory data
analysis to identify potential associations or patterns that may warrant further
investigation.

When Not to Use


1. Continuous Data: The test is not suitable for continuous variables unless they
have been categorized.
2. Small Sample Sizes: For small samples, the assumptions may be violated,
making the Chi-square test less reliable.
3. Dependent Samples: This test should not be used for paired or dependent
samples; you would need to use alternative tests such as McNemar's test in such
situations.

Assumptions
The Chi-square test for correlation, commonly used in the context of contingency
tables to assess the relationship between two categorical variables, relies on
several assumptions:
1. Categorical Variables: The data must be in the form of categorical variables.
The test is not suitable for continuous data unless it is categorized.
2. Independence: The observations must be independent of each other. This
means that the occurrence of one observation should not influence another.
3. Expected Frequency: For the Chi-square test to be valid, the expected
frequency in each cell of the contingency table should be sufficiently large. A
common rule of thumb is that the expected frequency in each cell should be at
least 5. If this assumption is violated (for example, if the table has many cells with
low expected counts), the Chi-square test may not provide reliable results.
4. Random Sampling: The samples should be randomly selected; this ensures
that the data is representative of the population being studied.
5. Data Type: The data should be nominal (categories without a specific order) or
ordinal (categories with a specific order), but the Chi-square test is primarily used
for nominal data.

By ensuring these assumptions are met, the validity of the conclusions drawn from
the Chi-square test for correlation can be enhanced. If these assumptions are
violated, alternative statistical methods or corrections might be necessary.
Assumptions for the Chi-Square Test of Independence
1. The data are obtained from a random sample.
2. The expected value in each cell must be 5 or more.
Hypothesis Testing

Null Hypothesis (H0)

The null hypothesis (H0) states that there is no association between the two
categorical variables. In other words, the variables are independent of each other.

H0: The two categorical variables are independent.

Alternative Hypothesis (H1)

The alternative hypothesis (H1) asserts that there is an association between the
two categorical variables, meaning they are not independent.

H1: The two categorical variables are not independent (i.e., they are
associated).

Application Using Excel

1. Enter the location variable labels in column A, beginning at cell A2.


2. Enter the Categories for the number of years of college in cells B1, C1, and D1,
respectively.
3. Enter the Observed values in the appropriate block (cell).
4. From the toolbar, select Add-ins, MegaStat>Chi-square/Crosstab>Contingency
Table. Note: You may need to open MegaStat from the MegaStat.xls file on your
computer's hard drive.
5. The results, including the chi-square value, will appear in a new sheet named
"Output" by default.
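The same contingency-table analysis can be run in Python with scipy. The observed counts below are hypothetical placeholders for the location-by-years-of-college table described above:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts: rows = location, columns = years of college
observed = np.array([[10, 20, 30],
                     [25, 15, 10]])

chi2, p, df, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.4f}, df = {df}, p = {p:.4f}")
print("expected counts:\n", expected)  # rule of thumb: all should be >= 5
```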

PARAMETRIC TEST FOR CORRELATION


1. PEARSON'S PRODUCT-MOMENT CORRELATION
Introduction
The Pearson Product-Moment Correlation Coefficient, denoted as r, quantifies the
strength of a linear relationship between two variables. It ranges from -1 to +1,
where values closer to 0 indicate a greater divergence from a perfect linear
relationship.
When to Use?
This test for correlation can only be used to measure the relationship between two
continuous variables which are both normally distributed.

Assumptions
The assumptions for using the Pearson Product-Moment Correlation Coefficient
include:
1. Both variables must be measured on a continuous scale.
2. Each case must have paired values for the two variables.
3. Observations for each case should be independent of each other.
4. A linear relationship must exist between the two continuous variables.
5. Ideally, both variables should follow a bivariate normal distribution, though
univariate normality is often deemed sufficient in practice.
6. Homoscedasticity should be present, meaning variances along the line of best
fit should remain consistent; varying variances indicate heteroscedasticity.
7. There should be no univariate or multivariate outliers present in the data.
Hypothesis Testing
1. State the null and alternative hypothesis
2. Find the critical values
3. Calculate the value r
4. Calculate the Test Statistic
5. Make the Decision
6. Summarize the result
Application Using SPSS
1. Measuring and Setting Variables
2. Input Data
3. Analyze Data
4. Set Options
5. Run the Analysis
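For completeness, here is a minimal Python sketch with hypothetical paired measurements that computes Pearson's r along with the t statistic behind its significance test:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical paired continuous measurements
x = np.array([1.2, 2.4, 3.1, 4.8, 5.0, 6.3, 7.7])
y = np.array([2.0, 2.9, 3.5, 5.1, 4.8, 6.8, 8.1])

r, p = pearsonr(x, y)

# Test statistic behind the p-value: t = r * sqrt(n - 2) / sqrt(1 - r^2),
# with df = n - 2
n = len(x)
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)

print(f"r = {r:.4f}, t = {t:.4f}, p = {p:.4f}")
```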
