Lecture 8 Bivariate Data
Harold C Banda
Paired Data
Is there a relationship?
If so, what is the equation?
Use the equation for prediction.
Assumptions of Correlation
The sample of paired data (x, y) is a random sample.
The pairs of (x, y) data have a bivariate normal distribution.
Bivariate Frequency Distribution: Correlation and Simple Regression Analysis
Introduction
Correlation measures the strength of the relationship between
two variables, while regression is a way of representing that
relationship.
Thus, correlation is the extent to which the two
variables vary directly (positive correlation) or inversely
(negative correlation).
The degree of relationship is expressed as a numeric index
called the coefficient of correlation, denoted by r.
Properties of Correlation coefficient
Exercise
Consider the paired data:
(x, y): (2, 1.4), (4, 1.8), (8, 2.1), (8, 2.3), (9, 2.6).
Draw a scatter diagram and comment on the relationship.
Techniques for determining correlation...Cont’d
Example
The bivariate data set below gives the number of hours
(X) six students studied for a final exam and their final exam
scores (Y).
Hours of study (X)    Exam score (Y)
3                     86
5                     95
4                     92
4                     83
2                     78
3                     82
Calculate the correlation coefficient between hours studied
and exam score and interpret your results.
From the table above, we have the following results: $n = 6$;
$\sum x = 21$; $\sum y = 516$; $\sum xy = 1{,}835$; $\sum x^2 = 79$;
$\sum y^2 = 44{,}582$.
Example
Example...cont’d
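Substituting these results in the computational formula
$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$,
we get $r = \frac{174}{\sqrt{33 \times 1236}} = 0.862$.
Interpretation: There is a strong positive correlation
between hours of study and exam score. In this sample, students
who studied for more hours tended to score higher.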
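The calculation can be checked with a short Python sketch. It assumes that the formula used in the example is the usual computational (sums-based) form of Pearson's r, which is consistent with the sums listed above; the variable names are illustrative only.

```python
from math import sqrt

# Data from the example: hours of study (x) and exam score (y).
x = [3, 5, 4, 4, 2, 3]
y = [86, 95, 92, 83, 78, 82]

# The six summary quantities listed in the example.
n = len(x)
sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)

# Computational form of Pearson's r (assumed to be the formula intended).
numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2)
print(round(r, 3))  # 0.862
```

Working from the sums rather than from deviations mirrors the hand calculation, so each intermediate value can be compared with the figures in the example.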
Example
Example...cont’d
Note: It is important to understand the limitations of
correlation as a measure.
While we have seen in the previous example a high correlation
between hours of study and test score, is there a causal
connection?
External evidence may lead us to think that studying for more
hours causes a higher score, but it is quite possible that some
gifted students achieve high scores without spending more
hours on their studies.
We have to look at the other evidence; unless we are
carrying out an experiment, correlation alone tells us nothing
about the causal connection between the two variables.
Techniques for determining correlation...Cont’d
Example...cont’d
STUDENTS    MATHEMATICS    COSTING
John        98             77
Annie       72             84
Peter       52             50
Chikondi    65             64
Mary        45             49
George      50             20
Solution:
STUDENTS    Rank in MATHS    Rank in COSTING    d     d^2
John        1                2                  -1    1
Annie       2                1                  1     1
Peter       4                4                  0     0
Chikondi    3                3                  0     0
Mary        6                5                  1     1
George      5                6                  -1    1
With $\sum d^2 = 4$ and $n = 6$:
$R = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 4}{6(6^2 - 1)} = 0.886.$
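The ranking and the rank correlation coefficient can be reproduced with the sketch below. The helper `rank_descending` is a hypothetical name introduced for illustration, and it assumes there are no tied marks, which holds for this data set.

```python
# Marks from the example table.
maths = {"John": 98, "Annie": 72, "Peter": 52, "Chikondi": 65, "Mary": 45, "George": 50}
costing = {"John": 77, "Annie": 84, "Peter": 50, "Chikondi": 64, "Mary": 49, "George": 20}

def rank_descending(scores):
    """Give rank 1 to the highest mark. Assumes no tied marks."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}

rank_m = rank_descending(maths)
rank_c = rank_descending(costing)

# d = rank in MATHS minus rank in COSTING, as in the solution table.
d = {name: rank_m[name] - rank_c[name] for name in maths}
sum_d2 = sum(value ** 2 for value in d.values())

n = len(maths)
R = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(sum_d2)       # 4
print(round(R, 3))  # 0.886
```

If marks were tied, the tied students would normally share the average of the ranks they occupy, and this simple helper would need to be adapted.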
Coefficient of Determination
$r^2$ is called the coefficient of determination and it gives the
proportion of the total variation in the dependent variable
which is explained by the variation in the independent variable.
From the hours-of-study example above, r = 0.862 (3 d.p.).
So, $r^2 = 0.862 \times 0.862 = 0.743$.
Thus 74.3% of the variation in the exam scores is explained by the
variation in hours of study (x).
Covariance
Covariance extends the idea of the variance of one variable
(how spread out its values are) to two variables: it measures
how the two variables vary together.
It is calculated as follows:
$s_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{n - 1}.$
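As an illustration, the sketch below applies the covariance formula to the hours-of-study data from the earlier example. The last step, dividing the covariance by the product of the two sample standard deviations to recover r, is a standard identity that is not stated in these notes and is included only as a cross-check.

```python
from statistics import stdev

# Hours of study (x) and exam score (y) from the earlier example.
x = [3, 5, 4, 4, 2, 3]
y = [86, 95, 92, 83, 78, 82]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample covariance: sum of products of deviations, divided by n - 1.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# Cross-check (assumed identity): r = s_xy / (s_x * s_y).
r = s_xy / (stdev(x) * stdev(y))

print(round(s_xy, 3))  # 5.8
print(round(r, 3))     # 0.862
```

The positive covariance agrees in sign with the positive correlation found earlier; unlike r, its size depends on the units of x and y.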
Regression Equation
Given a collection of paired data, the regression equation
y = a + bx algebraically describes the relationship between
the two variables (x, y).
The regression line is the line of best fit (or least-squares
line) relating the two variables.
Given a value $x_i$ with its corresponding observed value $y_i$,
substituting $x_i$ into the equation $\hat{y} = a + bx$ yields
$\hat{y}_i$ as an estimate of $y_i$.
The error in the estimate is $y_i - \hat{y}_i$.
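A minimal sketch of fitting the regression equation to the hours-of-study data follows. The notes do not state how a and b are obtained, so the code assumes the standard least-squares estimates, b = S_xy / S_xx and a = y-bar - b * x-bar; the function name `predict` is hypothetical.

```python
# Hours of study (x) and exam score (y) from the earlier example.
x = [3, 5, 4, 4, 2, 3]
y = [86, 95, 92, 83, 78, 82]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Standard least-squares estimates (an assumption, not given in the notes).
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b = s_xy / s_xx        # slope
a = y_bar - b * x_bar  # intercept

def predict(x_value):
    """Return the estimate y-hat = a + b * x_value."""
    return a + b * x_value

# Estimates and estimation errors for the observed x values.
y_hat = [predict(xi) for xi in x]
errors = [yi - yh for yi, yh in zip(y, y_hat)]

print(f"y-hat = {a:.3f} + {b:.3f} x")
print([round(e, 2) for e in errors])
```

For these data the fitted line is roughly y-hat = 67.5 + 5.27x, so each extra hour of study is associated with about 5.3 more marks on average.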
Linear Model or structure
Assumptions
We are investigating only linear relationships.
For each x value, y is a random variable having a normal
(bell-shaped) distribution. All of these y distributions have the
same variance. Also, for a given value of x, the distribution of
y-values has a mean that lies on the regression line.
Some Definitions
Marginal Change: the amount a variable changes when the
other variable changes by exactly one unit.
Outlier: a point lying far away from the other data points.
Influential Points: points which strongly affect the graph of
the regression line.
Some Definitions...cont’d
Residual: for a sample of paired (x, y) data, the difference
$(y - \hat{y})$ between an observed sample y-value and the value of
$\hat{y}$, which is the value of y that is predicted by using the
regression equation.
Least-Squares Property: A straight line satisfies this
property if the sum of the squares of the residuals is the
smallest sum possible.
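The least-squares property can be illustrated numerically. The sketch below fits a line to the hours-of-study data using the standard least-squares estimates (an assumption, since the notes do not give them), then compares its sum of squared residuals with that of a few arbitrarily shifted lines. The helper names are hypothetical, and checking a handful of alternatives illustrates the property but does not prove it.

```python
# Hours of study (x) and exam score (y) from the earlier example.
x = [3, 5, 4, 4, 2, 3]
y = [86, 95, 92, 83, 78, 82]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Standard least-squares slope and intercept (assumed formulas).
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

def sum_squared_residuals(intercept, slope):
    """Sum of squared residuals (y - y-hat) for the line intercept + slope * x."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sse_fit = sum_squared_residuals(a, b)

# A few arbitrary alternative lines near the fitted one.
alternatives = [(a + 1, b), (a - 1, b), (a, b + 0.5), (a, b - 0.5)]
sse_alternatives = [sum_squared_residuals(ai, bi) for ai, bi in alternatives]

print([round(e, 2) for e in residuals])
print(round(sse_fit, 2))
print(all(sse_fit < s for s in sse_alternatives))  # True
```

Every shifted line gives a larger sum of squared residuals than the fitted line, which is what the least-squares property asserts for all possible straight lines.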
Contingency and Association Tables