
BIVARIATE DATA

David M. Lane et al., Introduction to Statistics, pp. 172-194

ioc.pdf

[email protected] ICY0006: Lecture 3 1 / 24


Descriptive statistics

Descriptive statistics is the practice of quantitatively describing the main features of a collection of information.

It provides simple summaries about the observations that have been made. Such summaries may be either quantitative, i.e. summary statistics, or visual, i.e. simple-to-understand graphs.

It involves two kinds of analysis:

Univariate analysis: describing the distribution of a single variable, including
- central tendency (mean, median, and mode)
- dispersion (range and quantiles of the data set, measures of spread such as the variance and standard deviation)
- shape of the distribution (skewness and kurtosis)

Bivariate analysis: more than one variable is involved, and we describe the relationship between pairs of variables. In this case, descriptive statistics include:
- cross-tabulations and contingency tables
- graphical representation via scatterplots
- quantitative measures of dependence
- descriptions of conditional distributions






Contents

1 Introduction to Bivariate Data

2 Pearson product-moment correlation

3 Variance Sum Law II

4 Computing r by R





Bivariate Data: more than one variable

Often, more than one variable is collected on each individual.

In health studies, variables such as age, sex, height, weight, blood pressure, and total cholesterol are often measured on each individual.

Economic studies may be interested in, among other things, personal income and years of education.

Bivariate data consist of data on two variables.

Usually we are interested in the relationship between the variables.





Example: Do people tend to marry other people of about the same age?

Our experience tells us yes, but how good is the correspondence?

One way to address the question is to look at pairs of ages for a sample of married couples (an excerpt from a dataset consisting of 282 pairs of spousal ages).

We see that, yes, husbands and wives tend to be of about the same age, with men having a tendency to be slightly older than their wives.



Example: Histograms and means of spousal ages

Each distribution is fairly skewed with a long right tail.

Not all husbands are older than their wives; this fact is lost when we separate the variables.

The pairing within couples is also lost by separating the variables.

For example, based on the means alone, we cannot say what percentage of couples have husbands younger than their wives.

Another example of information not available from the separate descriptions: what is the average age of husbands with 45-year-old wives?

Finally, we do not know the relationship between the husband's age and the wife's age.
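Questions like these can only be answered while the pairing is intact. A minimal Python sketch (the ages are invented, not taken from the 282-couple dataset):

```python
# Invented spousal ages, kept as (husband, wife) pairs
couples = [(25, 22), (30, 33), (38, 35), (45, 45), (52, 48)]

# What fraction of couples have a husband younger than his wife?
# (Unanswerable from two separate histograms.)
frac_younger = sum(1 for h, w in couples if h < w) / len(couples)
print(frac_younger)  # 0.2

# Mean husband age among couples whose wife is aged 30-40
# (conditional information, also lost when the variables are separated)
husbands = [h for h, w in couples if 30 <= w <= 40]
print(sum(husbands) / len(husbands))  # 34.0
```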




Visualization of Bivariate Data

A scatter plot displays bivariate data in a graphical form that maintains the pairing.

Scatter plots that show linear relationships between variables can differ in several ways, including the slope of the line about which they cluster and how tightly the points cluster about the line.

This is a scatter plot of the paired ages (all 282 pairs):



Scatter plot

Two observations:

1 There is a strong relationship between the husband's age and the wife's age: the older the husband, the older the wife.
- When one variable (Y) increases with the second variable (X), we say that X and Y have a positive association.
- When Y decreases as X increases, we say that they have a negative association.

2 The points cluster along a straight line. When this occurs, the relationship is called a linear relationship.
- There is a perfect linear relationship between two variables if a scatterplot of the points falls on a straight line.
- The relationship is still linear even if the points diverge from the line, as long as the divergence is random rather than systematic.



Linear and non-linear relationships

(Figure panels: a perfect negative relationship; a non-linear relationship.)



Weak relationship and no relationship

(Figure panels: a weak positive relationship; no relationship.)





Pearson's correlation coefficient

Pearson's correlation coefficient is a statistical measure of the strength of a linear relationship between paired data.

The symbol for Pearson's correlation is ρ when it is measured in the population and r when it is measured in a sample. (Further on, we are dealing with samples and will use r.)

Definition
Let X = {x_1, ..., x_N} and Y = {y_1, ..., y_N} be two datasets (two samples) with means M_X and M_Y and standard deviations σ_X and σ_Y (computed with divisor N), respectively. Then the sample Pearson correlation coefficient (or simply correlation coefficient) is defined by the formula

r = \frac{\sum (X - M_X)(Y - M_Y)}{N \sigma_X \sigma_Y}.

Considering the formula for the standard deviation, we obtain the computational formula:

r = \frac{\sum (X - M_X)(Y - M_Y)}{\sqrt{\sum (X - M_X)^2 \sum (Y - M_Y)^2}} = \frac{\sum XY - N M_X M_Y}{\sqrt{\left(\sum X^2 - N M_X^2\right)\left(\sum Y^2 - N M_Y^2\right)}}

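The two forms of the formula can be checked against each other numerically; a minimal Python sketch (the sample values are invented):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Definitional form: sum of deviation products over N * sigma_x * sigma_y."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = sqrt(sum((x - mx) ** 2 for x in xs) / n)  # divisor-N standard deviation
    sy = sqrt(sum((y - my) ** 2 for y in ys) / n)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n * sx * sy)

def pearson_r_sums(xs, ys):
    """Computational form using raw sums."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (sxy - n * mx * my) / sqrt((sxx - n * mx * mx) * (syy - n * my * my))

xs = [1.0, 2.0, 3.0, 4.0, 6.0]
ys = [2.1, 3.9, 6.2, 8.0, 11.9]
print(pearson_r(xs, ys), pearson_r_sums(xs, ys))  # the two values agree
```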


Computing Pearson's r

Example
For a sample of N = 5 pairs with M_X = 4, M_Y = 9, ∑XY = 210, ∑X² = 96, and ∑Y² = 465:

r = \frac{\sum XY - N M_X M_Y}{\sqrt{\left(\sum X^2 - N M_X^2\right)\left(\sum Y^2 - N M_Y^2\right)}} = \frac{210 - 5 \cdot 4 \cdot 9}{\sqrt{(96 - 5 \cdot 4^2)(465 - 5 \cdot 9^2)}} = \frac{30}{\sqrt{16 \cdot 60}} = \frac{30}{\sqrt{960}} \approx 0.9682458

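The arithmetic can be reproduced directly from the quoted summary statistics:

```python
from math import sqrt

# Summary statistics from the worked example
N, MX, MY = 5, 4, 9
sum_xy, sum_x2, sum_y2 = 210, 96, 465

r = (sum_xy - N * MX * MY) / sqrt((sum_x2 - N * MX**2) * (sum_y2 - N * MY**2))
print(r)  # 0.9682458... (i.e. 30 / sqrt(960))
```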


Correlation coefficients



Properties of Pearson's r

- The Pearson correlation coefficient is symmetric: r = cor(X, Y) = cor(Y, X).
- r is restricted to the range −1 ≤ r ≤ 1.
- The Pearson correlation coefficient is invariant to separate changes in location and scale in the two variables. That is, we may transform X to a + bX and transform Y to c + dY, where a, b, c, and d are constants with b, d > 0, without changing the correlation coefficient. (If b and d have opposite signs, the magnitude of r is unchanged but its sign flips.)
- Positive values denote positive linear correlation.
- Negative values denote negative linear correlation.
- A value of 0 denotes no linear correlation.
- The closer the value is to −1 or 1, the stronger the linear correlation.
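The invariance and sign-flip properties are easy to verify numerically; a Python sketch with invented data:

```python
def corr(xs, ys):
    # Sample Pearson r from deviations (any common divisor cancels out)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = (sum((a - mx) ** 2 for a in xs) * sum((b - my) ** 2 for b in ys)) ** 0.5
    return num / den

x = [2.0, 4.0, 5.0, 7.0, 9.0]
y = [1.0, 3.0, 4.0, 8.0, 9.0]

r = corr(x, y)

# Location/scale changes with positive multipliers: r is unchanged
r_shifted = corr([10 + 3 * xi for xi in x], [-5 + 0.5 * yi for yi in y])
print(r, r_shifted)

# A negative multiplier on one variable flips the sign of r
print(corr(x, [-yi for yi in y]))
```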





Assumptions

There are five assumptions that are made with respect to Pearson's correlation:

1 The variables must be either interval or ratio measurements.

2 The variables must be approximately normally distributed (we will discuss this later).

3 There is a linear relationship between the two variables.

4 Outliers are either kept to a minimum or are removed entirely. (Use a scatter plot to detect outliers.)

5 There is homoscedasticity of the data: all random variables in the sequence or vector have the same finite variance. Homoscedasticity basically means that the variances along the line of best fit remain similar as you move along the line. (Use a scatter plot to assess homo- or heteroscedasticity.)



Removal of outliers



Homo- and heteroscedasticity



Caution!

1 The existence of a strong correlation does not imply a causal link between the variables!

2 We need to perform a significance test to decide whether, based upon a given sample, there is any evidence to suggest that linear correlation is present in the population. (We will discuss significance tests later in our course.)





Recall:

Pearson correlation is a measure of the strength of a relationship between two variables.

But any relationship should be assessed for its significance as well as its strength.

If your data do not meet the above assumptions, then use Spearman's rank correlation (ρ) or the Kendall rank correlation (τ).




Variance Sum Law II

Variance Sum Law I

If X and Y are independent (uncorrelated) variables, then

\sigma^2_{X \pm Y} = \sigma^2_X + \sigma^2_Y

Variance Sum Law II

When X and Y are correlated variables, the following is valid:

\sigma^2_{X \pm Y} = \sigma^2_X + \sigma^2_Y \pm 2\rho\,\sigma_X \sigma_Y

where ρ is the correlation between X and Y in the population.
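The law can be verified numerically with population-style (divisor-N) moments; a Python sketch with invented data:

```python
def var(zs):
    # Population-style variance (divisor N)
    m = sum(zs) / len(zs)
    return sum((z - m) ** 2 for z in zs) / len(zs)

def corr(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = (sum((a - mx) ** 2 for a in xs) * sum((b - my) ** 2 for b in ys)) ** 0.5
    return num / den

x = [1.0, 3.0, 4.0, 7.0, 10.0]
y = [2.0, 2.5, 5.0, 6.0, 9.5]

rho = corr(x, y)
lhs = var([a + b for a, b in zip(x, y)])               # Var(X + Y)
rhs = var(x) + var(y) + 2 * rho * var(x) ** 0.5 * var(y) ** 0.5
print(lhs, rhs)  # equal up to floating-point rounding
```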





Correlations in R

R can perform correlation with the cor() function.

Built into the base distribution of the program are three routines: for Pearson, Kendall, and Spearman rank correlations.

Simplified formats of the function call are:

1 cor(x, y): the default correlation returns the Pearson correlation coefficient;

2 cor(dataset): if you pass a dataset instead of separate variables, you get a matrix of all the pairwise correlation coefficients;

3 cor(x, y, method = "spearman"): if you specify "spearman", you get the Spearman correlation coefficient;

4 cor(x, y, use = "complete.obs"): the parameter use specifies the handling of missing data. Options are "all.obs" (assumes no missing data; missing data will produce an error), "complete.obs" (listwise deletion), and "pairwise.complete.obs" (pairwise deletion).
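The effect of use = "complete.obs" (listwise deletion) can be mimicked outside R as well; a Python sketch in which None stands in for R's NA (the values are invented):

```python
def corr(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = (sum((a - mx) ** 2 for a in xs) * sum((b - my) ** 2 for b in ys)) ** 0.5
    return num / den

# None plays the role of R's NA in this sketch
x = [1.0, 2.0, None, 4.0, 5.0]
y = [1.1, 2.3, 3.0, None, 5.2]

# "complete.obs": drop every pair with a missing value in either variable
pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
xc = [a for a, _ in pairs]
yc = [b for _, b in pairs]
r = corr(xc, yc)
print(r)
```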

