3 Bivariate Data
ioc.pdf
Descriptive statistics provide simple summaries about the observations that have been made.
Such summaries may be either quantitative, i.e. summary statistics, or visual, i.e.
simple-to-understand graphs.
In health studies, variables such as age, sex, height, weight, blood pressure, and total
cholesterol are often measured on each individual.
Economic studies may be interested in, among other things, personal income and years of
education.
One way to address the question of whether husbands and wives tend to be of similar age is
to look at pairs of ages for a sample of married couples (the dataset consists of 282 pairs
of spousal ages).
We see that, yes, husbands and wives tend to be of about the same age, with men having
a tendency to be slightly older than their wives.
Not all husbands are older than their wives; this fact is lost when we separate the variables.
For example, based on the means alone, we cannot say what percentage of couples have
husbands younger than their wives.
Another example of information not available from the separate descriptions: what is the
average age of husbands with 45-year-old wives?
Finally, we do not know the relationship between the husband's age and the wife's age.
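A small sketch of why the pairing matters. The ages below are invented purely for illustration (they are not taken from the 282-couple dataset), and all variable names are hypothetical. The two means use each column on its own, while the last two quantities can only be computed when each husband's age stays attached to his wife's age.

```python
from statistics import mean

# Hypothetical (husband_age, wife_age) pairs, invented for illustration only.
couples = [(25, 22), (30, 31), (47, 45), (43, 45),
           (52, 50), (38, 38), (61, 64), (50, 45)]

husbands = [h for h, _ in couples]
wives = [w for _, w in couples]

# Separate (univariate) summaries: available without the pairing.
mean_husband = mean(husbands)
mean_wife = mean(wives)

# Questions that need the pairs themselves.
share_younger_husband = sum(h < w for h, w in couples) / len(couples)
mean_husband_given_wife_45 = mean(h for h, w in couples if w == 45)

print(mean_husband, mean_wife)
print(share_younger_husband, mean_husband_given_wife_45)
```

In this toy sample the husbands are older on average, yet a sizeable share of couples have a younger husband, which the two means alone could not reveal.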
Scatter plots that show linear relationships between variables can differ in several ways,
including the slope of the line about which they cluster and how tightly the points cluster
about the line.
Two observations:
1 There is a strong relationship between the husband's age and the wife's age: the older the
husband, the older the wife.
  - When one variable (Y) increases with the second variable (X), we say that X and Y have a
    positive association.
2 The points cluster along a straight line. When this occurs, the relationship is called a
linear relationship.
  - There is a perfect linear relationship between two variables if all the points of the
    scatterplot fall on a straight line.
  - The relationship is linear even if the points diverge from the line, as long as the
    divergence is random rather than systematic.
Non-linear relationship
The symbol for Pearson's correlation is ρ when it is measured in the population and r
when it is measured in a sample. (From here on, we are dealing with samples and will use r.)
Definition
Let X = {x_1, ..., x_N} and Y = {y_1, ..., y_N} be two datasets (two samples) with means M_X and
M_Y and standard deviations σ_X and σ_Y (computed with divisor N) respectively. Then the sample
Pearson correlation coefficient (or simply correlation coefficient) is defined by the formula
$$ r = \frac{\sum (X - M_X)(Y - M_Y)}{N \, \sigma_X \sigma_Y}. $$
Considering the formula for the standard deviation, we obtain the formula for computing:
$$ r = \frac{\sum XY - N M_X M_Y}{\sqrt{\left(\sum X^2 - N M_X^2\right)\left(\sum Y^2 - N M_Y^2\right)}}. $$
The closer the value of r is to −1 or 1, the stronger the linear correlation.
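The definition and the computing formula can be checked against each other numerically. The sketch below assumes the standard deviations are taken with divisor N, so that the sum of products in the definition is divided by N σ_X σ_Y; this is what makes the two forms agree. The function names and the sample values are hypothetical.

```python
import math

def pearson_definition(xs, ys):
    """r = sum((x - Mx)(y - My)) / (N * sigma_x * sigma_y), sigmas with divisor N."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sigma_x = math.sqrt(sum((x - mx) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - my) ** 2 for y in ys) / n)
    cross = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cross / (n * sigma_x * sigma_y)

def pearson_computational(xs, ys):
    """r = (sum(xy) - N Mx My) / sqrt((sum(x^2) - N Mx^2)(sum(y^2) - N My^2))."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    numerator = sum(x * y for x, y in zip(xs, ys)) - n * mx * my
    denominator = math.sqrt(
        (sum(x * x for x in xs) - n * mx ** 2)
        * (sum(y * y for y in ys) - n * my ** 2)
    )
    return numerator / denominator

# Made-up sample values, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r_def = pearson_definition(xs, ys)
r_comp = pearson_computational(xs, ys)
print(r_def, r_comp)  # both approximately 0.7746
```

For points lying exactly on a rising line (for example y = 2x + 1) both functions return 1 up to rounding, and for a falling line they return −1.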
There are five assumptions that are made with respect to Pearson's correlation:
2 The variables must be approximately normally distributed (we will discuss this later).
4 Outliers are either kept to a minimum or are removed entirely. (Use a scatter plot to
determine outliers.)
5 There is homoscedasticity of the data. (All random variables in the sequence or vector
have the same finite variance. Homoscedasticity basically means that the variances along
the line of best fit remain similar as you move along the line. Use a scatter plot to
determine homo- or heteroscedasticity.)
1 The existence of a strong correlation does not imply a causal link between the variables.
2 We need to perform a significance test to decide whether, based on a given sample, there is
evidence to suggest that linear correlation is present in the population. (We will discuss
significance tests later in our course.)
4 Computing r by R
Built into the base distribution of R are three routines: for Pearson, Kendall,
and Spearman rank correlations.
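In R, all three are selected through the `method` argument of `cor(x, y, method = ...)`, with the values "pearson", "kendall" and "spearman". To show what each coefficient measures, here is a minimal Python sketch rather than R's own implementation. It assumes there are no tied values (so ranks are unambiguous and Kendall's tau needs no tie correction), and the function names and data are hypothetical.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    # Linear correlation computed from centred values.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cross = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return cross / math.sqrt(ssx * ssy)

def ranks(values):
    # Rank 1 = smallest value; assumes no ties.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    # Spearman's coefficient is Pearson's r applied to the ranks.
    return pearson(ranks(xs), ranks(ys))

def kendall(xs, ys):
    # (concordant pairs - discordant pairs) / total pairs; assumes no ties.
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Made-up data: y = x^2, a monotone but non-linear relationship.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]

r_pearson = pearson(xs, ys)
r_spearman = spearman(xs, ys)
tau = kendall(xs, ys)
print(r_pearson, r_spearman, tau)
```

On this monotone but curved data the two rank-based coefficients equal 1, while Pearson's r stays below 1 because the points do not lie on a straight line.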