
BIVARIATE DATA

David M. Lane et al., Introduction to Statistics, pp. 172-194

ioc.pdf

[email protected] ICY0006: Lecture 3 1 / 24


Descriptive statistics

Descriptive statistics is the practice of quantitatively describing the main features of a collection of information.

It provides simple summaries about the observations that have been made. Such summaries may be either quantitative, i.e. summary statistics, or visual, i.e. simple-to-understand graphs.

It involves two kinds of analysis:

Univariate analysis: describing the distribution of a single variable, including
- central tendency (mean, median, and mode)
- dispersion (range and quantiles of the data set, measures of spread such as the variance and standard deviation)
- shape of the distribution (skewness and kurtosis)

Bivariate analysis: more than one variable is involved, and we describe the relationship between pairs of variables. In this case, descriptive statistics include:
- cross-tabulations and contingency tables
- graphical representation via scatterplots
- quantitative measures of dependence
- descriptions of conditional distributions






Contents

1 Introduction to Bivariate Data

2 Pearson product-moment correlation

3 Variance Sum Law II

4 Computing r by R





Bivariate Data: more than one variable

Often, more than one variable is collected on each individual.

In health studies, variables such as age, sex, height, weight, blood pressure, and total cholesterol are often measured on each individual.

Economic studies may be interested in, among other things, personal income and years of education.

Bivariate data consist of data on two variables.

Usually we are interested in the relationship between the variables.





Example: Do people tend to marry other people of about the same age?

Our experience tells us yes, but how good is the correspondence?

One way to address the question is to look at pairs of ages for a sample of married couples (an excerpt from a dataset consisting of 282 pairs of spousal ages).

We see that, yes, husbands and wives tend to be of about the same age, with men having a tendency to be slightly older than their wives.



Example: Histograms and means of spousal ages

Each distribution is fairly skewed with a long right tail.

Not all husbands are older than their wives; this fact is lost when we separate the variables.

The pairing within couples is also lost by separating the variables.

For example, based on the means alone, we cannot say what percentage of couples have husbands younger than their wives.

Another example of information not available from the separate descriptions: what is the average age of husbands with 45-year-old wives?

Finally, we do not know the relationship between the husband's age and the wife's age.
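Questions like these can only be answered while the pairing is intact. A minimal Python sketch (the ages are invented, not taken from the 282-couple dataset):

```python
# Invented spousal ages, kept as (husband, wife) pairs
couples = [(25, 22), (30, 33), (38, 35), (45, 45), (52, 48)]

# What fraction of couples have a husband younger than his wife?
# (Unanswerable from two separate histograms.)
frac_younger = sum(1 for h, w in couples if h < w) / len(couples)
print(frac_younger)  # 0.2

# Mean husband age among couples whose wife is aged 30-40
# (conditional information, also lost when the variables are separated)
husbands = [h for h, w in couples if 30 <= w <= 40]
print(sum(husbands) / len(husbands))  # 34.0
```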




Visualization of Bivariate Data

A scatter plot displays bivariate data in a graphical form that maintains the pairing.

Scatter plots that show linear relationships between variables can differ in several ways, including the slope of the line about which they cluster and how tightly the points cluster about the line.

This is a scatter plot of the paired ages (all 282 pairs):



Scatter plot

Two observations:

1 There is a strong relationship between the husband's age and the wife's age: the older the husband, the older the wife.
- When one variable (Y) increases with the second variable (X), we say that X and Y have a positive association.
- When Y decreases as X increases, we say that they have a negative association.

2 The points cluster along a straight line. When this occurs, the relationship is called a linear relationship.
- There is a perfect linear relationship between two variables if a scatterplot of the points falls on a straight line.
- The relationship is still linear even if the points diverge from the line, as long as the divergence is random rather than systematic.



Linear and non-linear relationships

(Figure panels: a perfect negative relationship; a non-linear relationship.)



Weak relationship and no relationship

(Figure panels: a weak positive relationship; no relationship.)





Pearson's correlation coefficient

Pearson's correlation coefficient is a statistical measure of the strength of a linear relationship between paired data.

The symbol for Pearson's correlation is ρ when it is measured in the population and r when it is measured in a sample. (Further on, we are dealing with samples and will use r.)

Definition
Let X = {x_1, ..., x_N} and Y = {y_1, ..., y_N} be two datasets (two samples) with means M_X and M_Y and standard deviations σ_X and σ_Y (computed with divisor N), respectively. Then the sample Pearson correlation coefficient (or simply correlation coefficient) is defined by the formula

r = \frac{\sum (X - M_X)(Y - M_Y)}{N \sigma_X \sigma_Y}.

Considering the formula for the standard deviation, we obtain the computational formula:

r = \frac{\sum (X - M_X)(Y - M_Y)}{\sqrt{\sum (X - M_X)^2 \sum (Y - M_Y)^2}} = \frac{\sum XY - N M_X M_Y}{\sqrt{\left(\sum X^2 - N M_X^2\right)\left(\sum Y^2 - N M_Y^2\right)}}

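The two forms of the formula can be checked against each other numerically; a minimal Python sketch (the sample values are invented):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Definitional form: sum of deviation products over N * sigma_x * sigma_y."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = sqrt(sum((x - mx) ** 2 for x in xs) / n)  # divisor-N standard deviation
    sy = sqrt(sum((y - my) ** 2 for y in ys) / n)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n * sx * sy)

def pearson_r_sums(xs, ys):
    """Computational form using raw sums."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (sxy - n * mx * my) / sqrt((sxx - n * mx * mx) * (syy - n * my * my))

xs = [1.0, 2.0, 3.0, 4.0, 6.0]
ys = [2.1, 3.9, 6.2, 8.0, 11.9]
print(pearson_r(xs, ys), pearson_r_sums(xs, ys))  # the two values agree
```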


Computing Pearson's r

Example
For a sample of N = 5 pairs with M_X = 4, M_Y = 9, ∑XY = 210, ∑X² = 96, and ∑Y² = 465:

r = \frac{\sum XY - N M_X M_Y}{\sqrt{\left(\sum X^2 - N M_X^2\right)\left(\sum Y^2 - N M_Y^2\right)}} = \frac{210 - 5 \cdot 4 \cdot 9}{\sqrt{(96 - 5 \cdot 4^2)(465 - 5 \cdot 9^2)}} = \frac{30}{\sqrt{16 \cdot 60}} = \frac{30}{\sqrt{960}} \approx 0.9682458

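The arithmetic can be reproduced directly from the quoted summary statistics:

```python
from math import sqrt

# Summary statistics from the worked example
N, MX, MY = 5, 4, 9
sum_xy, sum_x2, sum_y2 = 210, 96, 465

r = (sum_xy - N * MX * MY) / sqrt((sum_x2 - N * MX**2) * (sum_y2 - N * MY**2))
print(r)  # 0.9682458... (i.e. 30 / sqrt(960))
```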


Correlation coefficients



Properties of Pearson's r

- The Pearson correlation coefficient is symmetric: r = cor(X, Y) = cor(Y, X).
- r is restricted to the range −1 ≤ r ≤ 1.
- The Pearson correlation coefficient is invariant to separate changes in location and scale in the two variables. That is, we may transform X to a + bX and transform Y to c + dY, where a, b, c, and d are constants with b, d > 0, without changing the correlation coefficient. (If b and d have opposite signs, the magnitude of r is unchanged but its sign flips.)
- Positive values denote positive linear correlation.
- Negative values denote negative linear correlation.
- A value of 0 denotes no linear correlation.
- The closer the value is to −1 or 1, the stronger the linear correlation.
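The invariance and sign-flip properties are easy to verify numerically; a Python sketch with invented data:

```python
def corr(xs, ys):
    # Sample Pearson r from deviations (any common divisor cancels out)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = (sum((a - mx) ** 2 for a in xs) * sum((b - my) ** 2 for b in ys)) ** 0.5
    return num / den

x = [2.0, 4.0, 5.0, 7.0, 9.0]
y = [1.0, 3.0, 4.0, 8.0, 9.0]

r = corr(x, y)

# Location/scale changes with positive multipliers: r is unchanged
r_shifted = corr([10 + 3 * xi for xi in x], [-5 + 0.5 * yi for yi in y])
print(r, r_shifted)

# A negative multiplier on one variable flips the sign of r
print(corr(x, [-yi for yi in y]))
```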





Assumptions

There are five assumptions that are made with respect to Pearson's correlation:

1 The variables must be either interval or ratio measurements.

2 The variables must be approximately normally distributed (we will discuss this later).

3 There is a linear relationship between the two variables.

4 Outliers are either kept to a minimum or are removed entirely. (Use a scatter plot to detect outliers.)

5 There is homoscedasticity of the data: all random variables in the sequence or vector have the same finite variance. Homoscedasticity basically means that the variances along the line of best fit remain similar as you move along the line. (Use a scatter plot to assess homo- or heteroscedasticity.)



Removal of outliers



Homo- and heteroscedasticity



Caution!

1 The existence of a strong correlation does not imply a causal link between the variables!

2 We need to perform a significance test to decide whether, based upon a given sample, there is any evidence to suggest that linear correlation is present in the population. (We will discuss significance tests later in our course.)





Recall:

Pearson correlation is a measure of the strength of a relationship between two variables.

But any relationship should be assessed for its significance as well as its strength.

If your data do not meet the above assumptions, then use Spearman's rank correlation (ρ) or the Kendall rank correlation (τ).




Variance Sum Law II

Variance Sum Law I

If X and Y are independent (uncorrelated) variables, then

\sigma^2_{X \pm Y} = \sigma^2_X + \sigma^2_Y

Variance Sum Law II

When X and Y are correlated variables, the following is valid:

\sigma^2_{X \pm Y} = \sigma^2_X + \sigma^2_Y \pm 2\rho\,\sigma_X \sigma_Y

where ρ is the correlation between X and Y in the population.
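The law can be verified numerically with population-style (divisor-N) moments; a Python sketch with invented data:

```python
def var(zs):
    # Population-style variance (divisor N)
    m = sum(zs) / len(zs)
    return sum((z - m) ** 2 for z in zs) / len(zs)

def corr(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = (sum((a - mx) ** 2 for a in xs) * sum((b - my) ** 2 for b in ys)) ** 0.5
    return num / den

x = [1.0, 3.0, 4.0, 7.0, 10.0]
y = [2.0, 2.5, 5.0, 6.0, 9.5]

rho = corr(x, y)
lhs = var([a + b for a, b in zip(x, y)])               # Var(X + Y)
rhs = var(x) + var(y) + 2 * rho * var(x) ** 0.5 * var(y) ** 0.5
print(lhs, rhs)  # equal up to floating-point rounding
```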





Correlations in R

R can perform correlation with the cor() function.

Built into the base distribution of the program are three routines: for Pearson, Kendall, and Spearman rank correlations.

Simplified formats of the function call are:

1 cor(x, y): the default correlation returns the Pearson correlation coefficient;

2 cor(dataset): if you pass a dataset instead of separate variables, you get a matrix of all the pairwise correlation coefficients;

3 cor(x, y, method = "spearman"): if you specify "spearman", you get the Spearman correlation coefficient;

4 cor(x, y, use = "complete.obs"): the parameter use specifies the handling of missing data. Options are "all.obs" (assumes no missing data; missing data will produce an error), "complete.obs" (listwise deletion), and "pairwise.complete.obs" (pairwise deletion).
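The effect of use = "complete.obs" (listwise deletion) can be mimicked outside R as well; a Python sketch in which None stands in for R's NA (the values are invented):

```python
def corr(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = (sum((a - mx) ** 2 for a in xs) * sum((b - my) ** 2 for b in ys)) ** 0.5
    return num / den

# None plays the role of R's NA in this sketch
x = [1.0, 2.0, None, 4.0, 5.0]
y = [1.1, 2.3, 3.0, None, 5.2]

# "complete.obs": drop every pair with a missing value in either variable
pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
xc = [a for a, _ in pairs]
yc = [b for _, b in pairs]
r = corr(xc, yc)
print(r)
```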

