Chapter6 Tests Relation Variables
CAMPUS BRUSSELS
Statistical Modelling
Tests for relation between 2 variables
Context
Chi-square test of independence
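Goal: Evaluate whether there is a statistical relation between two
qualitative variables.
o $H_0$: the two variables are independent
o $H_A$: the two variables are dependent
Method: The chi-square test statistic is based on counts in the cross-table
of two variables. It measures the distance between
o observed counts $O_{ij}$
o expected counts $E_{ij}$ if the two variables are statistically independent
$\chi^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$
with $R$ = number of rows in cross-table, $C$ = number of columns in cross-table.
Under $H_0$, $\chi^2$ approximately follows a chi-square distribution with
$(R-1)(C-1)$ degrees of freedom.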
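The method above can be sketched in a few lines of Python. This is only an illustration: the function name chi_square_independence and the 2 x 3 table of counts are hypothetical and do not come from the course data. The expected count of a cell is computed as row total times column total divided by the sample size, and the statistic is compared with 5.991, the 5% critical value of the chi-square distribution with 2 degrees of freedom.

```python
def chi_square_independence(observed):
    """Return (statistic, degrees of freedom, expected counts) for a cross-table.

    `observed` is a list of rows, each row a list of observed counts.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    # Expected count under independence: row total * column total / n
    expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

    # Sum of (observed - expected)^2 / expected over all cells
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected


# Hypothetical 2 x 3 cross-table (illustration only, not course data)
observed = [[30, 20, 10],
            [20, 30, 40]]
stat, df, expected = chi_square_independence(observed)

# 5% critical value of the chi-square distribution with 2 degrees of freedom
CRITICAL_5PCT_DF2 = 5.991
reject = stat > CRITICAL_5PCT_DF2
print(stat, df, reject)
```

The returned expected counts can also be used to check the assumptions of the test before interpreting the result.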
Chi-square test: approach
Example: Are the categorical variables education level and income
category related?
Chi-square test: approach
As we are at the boundary of violating the assumptions, we join the
categories college degree and post-undergraduate degree.
Pearson correlation test
Goal: Evaluate whether two quantitative variables have a linear relation.
We also aim to assess the direction and strength of the linear relation.
We distinguish
o The population correlation coefficient $\rho$
o The sample correlation coefficient $r$
A correlation coefficient takes values between -1 and 1, i.e. $-1 \le \rho \le 1$
o $\rho = 0$ means the variables are not linearly related
o $\rho$ close to 0 means the variables have a weak linear relation
o $\rho = 1$ means the variables have a perfect positive linear relation
o $\rho = -1$ means the variables have a perfect negative linear relation
Pearson sample correlation
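Suppose we have a simple random sample (SRS) of the variables $X$ and $Y$:
$(x_1, y_1), \dots, (x_n, y_n)$.
The sample correlation between quantitative variables $X$ and $Y$ is defined
as:
$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$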
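As an illustration, the sample correlation can be computed directly from its definition: the sum of cross-products of deviations from the means, divided by the square root of the product of the two sums of squared deviations. The sketch below uses only the Python standard library. The function name pearson_r and the small data set (weekly hours and monthly wage for five persons) are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson sample correlation between two equally long lists of numbers."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of the deviations
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data (illustration only)
hours = [20, 30, 35, 40, 45]
wage = [1500, 2100, 2600, 2900, 3400]
r = pearson_r(hours, wage)
print(round(r, 3))
```

The function fails with a division by zero if one of the variables is constant, because the correlation is not defined in that case.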
Pearson sample correlation
$r$ measures the size and direction of the linear relation between two
variables
Pearson sample correlation
$r$ measures the size and direction of the linear relation between $X$ and $Y$.
In this example $r$ is close to 0, but there is a strong non-linear (i.e., quadratic)
relation between $X$ and $Y$.
Pearson sample correlation
Outliers can have a very big effect on the sample correlation coefficient.
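Both caveats (a non-linear relation and the effect of outliers) can be illustrated with a small sketch. All data below are hypothetical and chosen only to make the effect visible. The helper pearson_r is an assumption of this example, not a function from SPSS.

```python
import math


def pearson_r(x, y):
    """Pearson sample correlation, computed from its definition."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# 1. Perfect quadratic relation y = x^2, yet no linear relation
x_quad = [-2, -1, 0, 1, 2]
y_quad = [xi ** 2 for xi in x_quad]
r_quadratic = pearson_r(x_quad, y_quad)

# 2. Five points without linear relation, then one added outlier (20, 20)
x = [1, 2, 3, 4, 5]
y = [2, 1, 2, 1, 2]
r_without_outlier = pearson_r(x, y)
r_with_outlier = pearson_r(x + [20], y + [20])

print(r_quadratic, r_without_outlier, round(r_with_outlier, 3))
```

In this sketch a single extra observation moves the sample correlation from about 0 to a value close to 1, which is why a scatterplot should be inspected before interpreting $r$.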
Sample Pearson correlation in SPSS
We compute correlations between monthly wage, weekly working hours, and
age for a sample of observations.
In SPSS: Analyze > Correlate > Bivariate
Correlation between age and monthly wage: $r = .302$
Test
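$H_0$: there is no linear relation between $X$ and $Y$: $\rho = 0$.
$H_A$: there is a linear relation between $X$ and $Y$: $\rho \neq 0$.
If $H_0$ is true, and if $(X, Y)$ has a bivariate Normal distribution (or if
the sample size $n$ is large), then the test statistic is $t$-distributed with
$n - 2$ degrees of freedom:
$t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}}$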
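A minimal sketch of this test, assuming the usual test statistic $t = r\sqrt{n-2}/\sqrt{1-r^2}$ with $n-2$ degrees of freedom. The correlation .302 is the value from the SPSS example, but the sample size $n = 100$ is a hypothetical value, because the sample size of that example is not given here. The critical value 1.98 is the approximate two-sided 5% critical value of the t-distribution with 98 degrees of freedom.

```python
import math


def correlation_t_statistic(r, n):
    """t-statistic for H0: rho = 0, given sample correlation r and sample size n."""
    if n < 3:
        raise ValueError("need at least 3 observations")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


r = 0.302   # sample correlation from the SPSS example
n = 100     # hypothetical sample size (assumption for this illustration)
t = correlation_t_statistic(r, n)

# Approximate two-sided 5% critical value of the t-distribution with 98 df
CRITICAL_T = 1.98
reject = abs(t) > CRITICAL_T
print(round(t, 3), reject)
```

For a fixed $r$ the statistic grows with $\sqrt{n-2}$, so even a weak correlation becomes statistically significant in a very large sample.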
Test
If $X$ and $Y$ have a bivariate Normal distribution, the scatterplot has the
shape of an ellipse.
[Figure: two scatterplots, one labelled "bivariate normal distribution" and one labelled "no bivariate normal distribution"]
Spearman correlation-test
The non-parametric Spearman correlation test can be used
o to measure and test the relation between two ordinal qualitative
variables
o to measure and test the relation between two quantitative variables if
the assumptions of the Pearson correlation test are violated (i.e., the
sample is small and $(X, Y)$ does not have a bivariate Normal distribution).
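The Spearman correlation can be obtained by replacing each observation by its rank and then computing the Pearson correlation between the ranks. The sketch below follows that idea, with tied values receiving the average of their ranks. The function names and the data are hypothetical, and the example only computes the coefficient, not the test.

```python
import math


def average_ranks(values):
    """Ranks 1..n of the values; tied values get the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Find the end j of the group of values tied with position i
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Pearson sample correlation, computed from its definition."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation between the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical data: y increases with x, but not linearly (y = x^3)
x = [1, 2, 3, 4, 5]
y = [xi ** 3 for xi in x]
rho = spearman_rho(x, y)
r = pearson_r(x, y)
print(rho, round(r, 3))
```

Because only the ordering of the observations is used, the coefficient equals 1 for any strictly increasing relation, whereas the Pearson correlation of the same data is below 1.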
Overview testing the relation between variables
(Parametric // non-parametric) test
2 quantitative variables:
Pearson correlation // Spearman correlation
2 qualitative variables:
chi-square test
Exercise 1
Suppose we have a sample of 4000 observations for the following
variables:
o Trust of a respondent in the government measured on a scale from 0 to 100.
o Country with categories 1=Belgium, 2=France, 3= the Netherlands
o Age measured in years
o Gender: nominal variable with categories 0=male, 1=female
Which test can you use to test whether there is a relation between
o Country and trust
o Gender and trust
o Country and gender
o Trust and age
Formulate the null and alternative hypotheses for each test. Discuss
whether/when the proposed test is valid in the present context.
Exercise 2
Consider the cross-table between two qualitative variables, education level and
type of company, for a sample of observations. The table contains
observed counts and expected counts if the variables are assumed to be
independent.
Compute the expected counts for the first row of the table, compute the chi-square
test statistic and test (using $\alpha = 0.05$) the null hypothesis that education
level and type of company are statistically independent. Formulate a conclusion
about the result of the test.
Exercise 3
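We compute the Pearson sample correlation between household income
and years with current employer in an SRS of employees.
Correlations
                                                   Household income   Years with current employer
Household income              Pearson correlation  1                  .625
                              N                    850                850
Years with current employer   Pearson correlation  .625               1
                              N                    850                850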
Solution Exercise 1
The sample is very large ($n = 4000$) and hence the t-statistic has an
approximate t-distribution if population variances for males and females
are equal. If the null hypothesis of equal population variances for
males/females is rejected, a Welch correction to the t-statistic can be used.
Country and Gender
$H_0$: country and gender are statistically independent
$H_A$: country and gender are statistically dependent
To test $H_0$ we can use a Pearson chi-square test on the cross-table country
x gender. The assumptions are (1) that all expected counts are larger than or
equal to 1, (2) that not more than 20% of the cells in the cross-table have an
expected count smaller than 5.
Stated otherwise, the chi-square test tests the null hypothesis that the
proportion of males is the same in the three countries:
$H_0: p_1 = p_2 = p_3$ versus $H_A$: $H_0$ is wrong,
where $p_k$ is the population proportion of males in country $k$.
Solution Exercise 1
Trust and age
To test $H_0: \rho = 0$, i.e. that there is no linear relation between age and
trust in the population, we can use a Pearson correlation test. As the sample
size is large ($n = 4000$), the test statistic $t = r\sqrt{n-2}/\sqrt{1-r^2}$
has an approximate t-distribution and hence the test is valid.
Solution Exercise 2
Let $\alpha = 0.05$. The chi-square test statistic exceeds the critical value
at this level and hence we reject $H_0$. We conclude with 95% confidence that
education level and type of company are statistically related.
The assumptions of the test are satisfied:
o All expected counts are larger than or equal to 1
o There are no cells with an expected count smaller than 5, hence the
proportion of cells with an expected count smaller than 5 is smaller than 20%.
Solution Exercise 3
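We test $H_0: \rho = 0$ against $H_A: \rho \neq 0$ with $\alpha = 0.05$.
With $r = .625$ and $n = 850$ the test statistic is
$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.625\sqrt{848}}{\sqrt{1-0.625^2}} \approx 23.3$,
which is far above the critical value of about 1.96 for 848 degrees of freedom,
and hence we reject $H_0$. We conclude with 95% confidence that the
population correlation between household income and years with current
employer is positive.
The scatterplot shows that the assumption of a bivariate Normal
distribution for the two variables is doubtful. However, as the sample size
is large ($n = 850$), the test statistic will have an approximate t-distribution
and hence the test is valid.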
Remark: to reduce the influence of outliers it is recommended to apply a
natural log transformation to household income.