
Lecture 8 Bivariate Data

The document discusses bivariate frequency distribution, focusing on correlation and simple regression analysis. It explains the concepts of correlation, regression equations, and techniques for determining correlation, including scatter diagrams and various correlation coefficients. Additionally, it highlights common errors in correlation interpretation and provides examples to illustrate the application of these statistical methods.


Bivariate Frequency Distribution: Correlation and Simple Regression Analysis


Harold C Banda

Phone : +265 9997-733-78/8893-733-57.


Email : [email protected]

Paired Data
Is there a relationship?
If so, what is the equation?
Use the equation for prediction.

Assumptions of Correlation
The sample of paired data (x, y) is a random sample.
The pairs of (x, y) data have a bivariate normal distribution.

Introduction
Correlation measures the strength of the relationship between two
variables, while regression is a way of representing that
relationship with an equation.
Thus, correlation is the extent to which the two
variables vary directly (positive correlation) or inversely
(negative correlation).
The degree of relationship is expressed as a numeric index
called the coefficient of correlation, denoted by r.

Properties/Interpretation of Correlation coefficient


−1 ≤ r ≤ 1
A value of r=1 means perfect positive correlation.
A value of r=-1 means perfect negative correlation.
0 < r < 1 means positive partial correlation.
−1 < r < 0 means negative partial correlation.
r=0 means no correlation (absence of a linear relationship
between the two variables).
r is not affected by which variable is labelled x and which is
labelled y (the variables can be interchanged).
r measures the strength of a linear relationship only.
The value of r does not change if all values of either variable are
converted to a different scale (for example, a change of units).
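
The properties above can be checked numerically. The sketch below is only an illustration: the helper name `pearson_r` and the small data set are assumptions made for the demonstration, and r is computed from deviations about the means, a standard form of the coefficient.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustrative data (not taken from the lecture).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(xs, ys)

# Interchanging the variables leaves r unchanged.
r_swapped = pearson_r(ys, xs)

# Converting x to another scale (positive factor plus a shift) leaves r unchanged.
xs_rescaled = [100.0 * x + 5.0 for x in xs]
r_rescaled = pearson_r(xs_rescaled, ys)

# Points lying exactly on a downward-sloping line give r = -1.
r_perfect_negative = pearson_r(xs, [-2.0 * x + 1.0 for x in xs])

print(r, r_swapped, r_rescaled, r_perfect_negative)
```

Note that the rescaling used here has a positive factor; multiplying one variable by a negative number would reverse the sign of r.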

Techniques for determining correlation


1. Inspection of a scatter diagram: a graph in which the paired
(x, y) sample data are plotted with a horizontal x axis and a
vertical y axis. Each individual (x, y) pair is plotted as a
single point.

Exercise
Consider the paired data:
(x, y): (2, 1.4), (4, 1.8), (8, 2.1), (8, 2.3), (9, 2.6).
Draw a scatter diagram and comment on the relationship.

Techniques for determining correlation...Cont’d


1. Inspection of a scatter diagram.

2. Pearson's product-moment correlation coefficient, which is
found by using the formula
\[ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}} \]
where x = the values of the independent variable,
y = the values of the dependent variable,
n = the number of paired data points in the sample.
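
As a minimal sketch, the computational formula can be written as a short Python function. The name `pearson_r_from_sums` is an illustrative choice, and the data are the five pairs from the scatter-diagram exercise earlier in the lecture.

```python
import math

def pearson_r_from_sums(xs, ys):
    """Pearson's r using the sums that appear in the computational formula."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Paired data from the scatter-diagram exercise.
xs = [2, 4, 8, 8, 9]
ys = [1.4, 1.8, 2.1, 2.3, 2.6]

r_exercise = pearson_r_from_sums(xs, ys)
print(f"r = {r_exercise:.3f}")
```

The function divides by zero if all x-values (or all y-values) are identical, since r is undefined when a variable has no variation; a production version would need to handle that case.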

Example
Refer to the bivariate data set below, the number of hours
(X ) six students studied for a final exam and their final exam
scores (Y ).
Hours of study (X) Exam score (Y)
3 86
5 95
4 92
4 83
2 78
3 82
Calculate the correlation coefficient between hours studied
and exam score and interpret your results.
From the table above, we have the following results: n = 6;
$\sum x = 21$; $\sum y = 516$; $\sum xy = 1{,}835$; $\sum x^2 = 79$;
$\sum y^2 = 44{,}582$.

Example...cont’d
Substituting these results into the formula, we get
r = 0.862 (3 d.p.).
Interpretation: There is a strong positive correlation
between hours of study and exam score. Students who studied
for more hours tended to obtain higher scores.
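
For reference, the substitution can be written out step by step. This is only a worked check, using n = 6 and the sums 21, 516, 1,835, 79 and 44,582 obtained from the table of hours and scores:

```latex
\[
r = \frac{6(1835) - (21)(516)}{\sqrt{\left[6(79) - 21^2\right]\left[6(44582) - 516^2\right]}}
  = \frac{11010 - 10836}{\sqrt{(474 - 441)(267492 - 266256)}}
  = \frac{174}{\sqrt{33 \times 1236}}
  = \frac{174}{\sqrt{40788}}
  \approx 0.862
\]
```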

Example...cont’d
Note: It is important to understand the limitations of
correlation as a measure.
While we have seen in the previous example a high correlation
between hours of study and test score, is there a causal
connection?
External evidence may lead us to think that studying for more
hours causes a higher score, but it is quite
possible that some gifted students would also achieve
high scores without spending more hours studying.
We have to look at other evidence; unless we are
carrying out a controlled experiment, the correlation alone does
not tell us what the causal connection between the two variables is.

Techniques for determining correlation...Cont’d


1. Inspection of a scatter diagram.

2. Pearson's product-moment correlation coefficient.

3. Spearman's rank correlation coefficient, which is found by
using the formula
\[ R = 1 - \frac{6\sum d^2}{n(n^2 - 1)} \]
where d is the difference between the two ranks for any one
item, and n is the number of items involved.
Example: The mid-semester results in Mathematics and Costing
of a sample of 6 students were as follows:

Example...cont’d
STUDENTS MATHEMATICS COSTING
John 98 77
Annie 72 84
Peter 52 50
Chikondi 65 64
Mary 45 49
George 50 20

Use Spearman's rank correlation coefficient to investigate
whether there is a relationship between ability in Mathematics and
Costing.

Solution:
STUDENTS    Rank in MATHS    Rank in COSTING     d    d²
John              1                 2           -1     1
Annie             2                 1            1     1
Peter             4                 4            0     0
Chikondi          3                 3            0     0
Mary              6                 5            1     1
George            5                 6           -1     1

Here $\sum d^2 = 4$ and n = 6, so
\[ R = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 4}{6(6^2 - 1)} = 0.886. \]

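This means that ability in Mathematics and ability in Costing are strongly positively related: students who rank highly in one subject tend to rank highly in the other.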
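
The ranking and the formula can be reproduced with a short sketch. The helper names `ranks_descending` and `spearman_r` are illustrative, and the ranking helper assumes there are no tied scores, as is the case in this example.

```python
def ranks_descending(scores):
    """Rank scores so that the highest score gets rank 1 (assumes no ties)."""
    ordered = sorted(scores, reverse=True)
    return [ordered.index(score) + 1 for score in scores]

def spearman_r(xs, ys):
    """Spearman's rank correlation via R = 1 - 6*sum(d^2) / (n(n^2 - 1))."""
    rank_x = ranks_descending(xs)
    rank_y = ranks_descending(ys)
    n = len(xs)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Marks in the order John, Annie, Peter, Chikondi, Mary, George.
maths = [98, 72, 52, 65, 45, 50]
costing = [77, 84, 50, 64, 49, 20]

maths_ranks = ranks_descending(maths)
costing_ranks = ranks_descending(costing)
R = spearman_r(maths, costing)
print(maths_ranks, costing_ranks, round(R, 3))  # R rounds to 0.886
```

With tied scores, the usual convention is to give each tied item the average of the ranks it would otherwise occupy, which this simple helper does not do.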



Common Errors Involving Correlation


Causation: It is wrong to conclude that correlation implies
causality.
Averages: Averages suppress individual variation and may
inflate the correlation coefficient.
Linearity: There may be some relationship between x and y
even when there is no significant linear correlation.

Coefficient of Determination
$r^2$ is called the coefficient of determination and it gives the
proportion of the total variation in the dependent variable
which is explained by the variation in the independent variable.
From the hours-of-study example above, r = 0.862 (3 d.p.).
So, $r^2 = 0.862 \times 0.862 = 0.743$.
Thus about 74.3% of the variation in exam scores is explained by the
variation in hours of study (x).

Covariance
Covariance extends the idea of the variance of one variable (how
spread out its values are) to two variables: it measures how x and y
vary together about their means.
It is calculated as follows:
\[ s_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{n - 1}. \]
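
A minimal sketch of the covariance calculation, applied to the hours-of-study data from the earlier correlation example. The function name `sample_covariance` is illustrative. The last step uses the standard identity r = s_xy / (s_x s_y), which is not stated on the slide but links covariance to the correlation coefficient.

```python
import math

def sample_covariance(xs, ys):
    """Sample covariance with the n - 1 divisor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# Hours of study (x) and exam score (y) for the six students.
hours = [3, 5, 4, 4, 2, 3]
scores = [86, 95, 92, 83, 78, 82]

s_xy = sample_covariance(hours, scores)

# The covariance of a variable with itself is its sample variance.
s_x = math.sqrt(sample_covariance(hours, hours))
s_y = math.sqrt(sample_covariance(scores, scores))

# Dividing the covariance by both standard deviations gives Pearson's r.
r = s_xy / (s_x * s_y)
print(f"s_xy = {s_xy:.1f}, r = {r:.3f}")
```

Unlike r, the covariance depends on the units of x and y, which is why it is usually standardised in this way before being interpreted.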

Regression Equation
Given a collection of paired data, the regression equation
y = a + bx algebraically describes the relationship between
the two variables (x, y).
The regression line is the line of best fit, or least-squares line,
which connects the two variables.
Given a value $x_i$ with its corresponding observed value $y_i$,
plugging $x_i$ into the equation (y = a + bx) yields $\hat{y}_i$ as an
estimate of $y_i$.
The error in the estimate is $y_i - \hat{y}_i$.

Linear Model or structure


Independent/predictor variable (x): A single numerical
variable assumed to measure a cause.
Dependent/response variable (y): A single numerical
variable assumed to measure an effect.
Equation model: $\hat{y} = \beta_1 x + \beta_0$.
Note: In a regression context, making a prediction means
taking an x-value that is not found in our sample, and
calculating a y -value for that individual. The ability to make
these sorts of predictions is very valuable in business, simply
because measurement costs money. If we can measure just
some of the variables and then calculate the rest, we can save
money, time, and resources.

Least squares Regression line


We want a line y = bx + a which makes the differences $y_i - \hat{y}_i$ as small as possible.
To do this we find the b and a which minimise $\sum (y_i - \hat{y}_i)^2$.
Such a line is called the Least Squares Regression Line.
Since y is assumed to be dependent on x, we call this line the
Least Squares Regression Line of y on x.
From calculus we obtain
\[ b = \frac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n\bar{x}^2} \]
and $a = \bar{y} - b\bar{x}$ (this
follows from the fact that the point $(\bar{x}, \bar{y})$ lies on the line).
Hence we end up with the line y = bx + a.
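
The calculus step can be sketched as follows. This is the standard derivation rather than one given on the slides; the symbol S for the sum of squared differences is introduced here only for the derivation.

```latex
\[
S(a, b) = \sum_i (y_i - a - b x_i)^2
\]
\[
\frac{\partial S}{\partial a} = -2 \sum_i (y_i - a - b x_i) = 0
\;\Longrightarrow\; \sum y = n a + b \sum x
\;\Longrightarrow\; a = \bar{y} - b \bar{x}
\]
\[
\frac{\partial S}{\partial b} = -2 \sum_i x_i (y_i - a - b x_i) = 0
\;\Longrightarrow\; \sum xy = a \sum x + b \sum x^2
\]
\[
\text{Substituting } a = \bar{y} - b\bar{x} \text{ and } \sum x = n\bar{x}:
\quad \sum xy = n\bar{x}\bar{y} - b n \bar{x}^2 + b \sum x^2
\;\Longrightarrow\; b = \frac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n\bar{x}^2}
\]
```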

Assumptions
We are investigating only linear relationships.
For each x value, y is a random variable having a normal
(bell-shaped) distribution. All of these y distributions have the
same variance. Also, for a given value of x, the distribution of
y -values has a mean that lies on the regression line.

Guidelines for Using The Regression Equation


If there is no significant linear correlation, don’t use the
regression equation to make predictions.
When using the regression equation for predictions, stay
within the scope of the available sample data.
A regression equation based on old data is not necessarily
valid now.
Don’t make predictions about a population that is different
from the population from which the sample data was drawn.

Least squares Regression line-Example


The following table shows the amount of time (in hours) that
students spend preparing for an exam and the grade they get
in the exam:
Hours of study (X) Exam score (Y)
10 51
7 48
12 52
15 58
6 48
14 53
2 23
Find the equation of the least squares regression line of grade on
time (y on x) and hence estimate the grade obtained by a
student who spent 3 hours preparing for the exam.

Least squares Regression line-Example...cont’d


We have the following results from the table above: n = 7,
$\sum x = 66$, $\sum y = 333$, $\sum xy = 3416$, $\sum x^2 = 754$,
$\bar{x} = 9.429$ (3 d.p.) and $\bar{y} = 47.571$ (3 d.p.).
Now b = 2.098 (3 d.p.) and a = 27.794 (3 d.p.).
Hence the line is y = 2.098x + 27.794.
So for x = 3 the corresponding y is
y = 2.098 × 3 + 27.794 ≈ 34.09 (2 d.p.).
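
The example can be checked with a few lines of Python. This is a sketch that applies the least squares formulas for b and a directly to the table; the variable and function names are illustrative. Because it does not round intermediate results, its output may differ in the last decimal place from a hand calculation.

```python
# Data from the example table: hours of study (x) and exam score (y).
hours = [10, 7, 12, 15, 6, 14, 2]
scores = [51, 48, 52, 58, 48, 53, 23]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(scores) / n
sum_xy = sum(x * y for x, y in zip(hours, scores))
sum_x2 = sum(x * x for x in hours)

# Slope and intercept from the least squares formulas.
b = (sum_xy - n * mean_x * mean_y) / (sum_x2 - n * mean_x ** 2)
a = mean_y - b * mean_x

def predict(x):
    """Estimated exam score for x hours of study."""
    return a + b * x

estimate = predict(3)
print(f"b = {b:.3f}, a = {a:.3f}, estimate at x = 3: {estimate:.2f}")
```

The value x = 3 lies between the smallest (2) and largest (15) observed study times, so the prediction stays within the scope of the sample data.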

Some Definitions
Marginal Change: the amount a variable changes when the
other variable changes by exactly one unit.
Outlier: a point lying far away from the other data points.
Influential Points: points which strongly affect the graph of
the regression line.

Some Definitions...cont’d
Residual: for a sample of paired (x, y) data, the difference
$(y - \hat{y})$ between an observed sample y-value and the value of
$\hat{y}$, which is the value of y that is predicted by using the
regression equation.
Least-Squares Property: A straight line satisfies this
property if the sum of the squares of the residuals is the
smallest sum possible.
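
The two definitions above can be illustrated with the study-time data from the regression example. The sketch below uses hypothetical helper names (`fit_line`, `sum_squared_residuals`) and computes the slope from deviations about the means, which is algebraically equivalent to the formula for b given earlier. It compares the fitted line with two arbitrarily perturbed lines.

```python
hours = [10, 7, 12, 15, 6, 14, 2]
scores = [51, 48, 52, 58, 48, 53, 23]

def fit_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

def sum_squared_residuals(a, b, xs, ys):
    """Sum of squared differences between observed and predicted y."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

a, b = fit_line(hours, scores)

# One residual (observed minus predicted) per data point.
residuals = [y - (a + b * x) for x, y in zip(hours, scores)]

sse_fit = sum_squared_residuals(a, b, hours, scores)
# Any other line, e.g. shifted up by 1 or with a steeper slope, does worse.
sse_shifted = sum_squared_residuals(a + 1.0, b, hours, scores)
sse_tilted = sum_squared_residuals(a, b + 0.1, hours, scores)

print([round(e, 2) for e in residuals])
print(sse_fit, sse_shifted, sse_tilted)
```

The residuals of a least squares line with an intercept always sum to zero (up to rounding), which is a quick sanity check on a fitted line.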

Contingency and Association Tables


We have so far discussed the relationships between two
quantitative variables—the strength, direction and form of the
linear relationship with the correlation.
What about qualitative (categorical) variables?
Suppose a class of 82 students is asked this question: “Do you
enjoy Statistics?” The following table shows the responses:

           Strongly Agree   Agree   Neutral   Disagree   Strongly Disagree
Males             9           13       5         2               1
Females          12           18      11         6               5

Contingency and Association Tables


A contingency table relates two categories of data.
In the example above, the relationship is between the gender
of the student and his/her response to the question.
A marginal distribution of a variable is a frequency or a
relative frequency distribution of either the row or the column
variable in the contingency table (Totals).
If each of the totals above is divided by n (n=82), then the
result is called relative frequency marginal distribution.
A conditional distribution lists the relative frequency of each
category of one variable, given a specific value of the other
variable in the contingency table.
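
As a sketch, the marginal and conditional distributions for the class survey can be computed as follows. The variable names are illustrative, and the last column label is assumed to read "Strongly Disagree".

```python
responses = ["Strongly Agree", "Agree", "Neutral", "Disagree", "Strongly Disagree"]
counts = {
    "Males": [9, 13, 5, 2, 1],
    "Females": [12, 18, 11, 6, 5],
}

# Grand total n and the row/column totals (the marginal frequencies).
n = sum(sum(row) for row in counts.values())
row_totals = {gender: sum(row) for gender, row in counts.items()}
col_totals = [sum(col) for col in zip(*counts.values())]

# Relative frequency marginal distributions: each total divided by n.
gender_marginal = {gender: total / n for gender, total in row_totals.items()}
response_marginal = {resp: total / n for resp, total in zip(responses, col_totals)}

# Conditional distribution of response given gender: each row divided by its total.
response_given_gender = {
    gender: {resp: c / row_totals[gender] for resp, c in zip(responses, row)}
    for gender, row in counts.items()
}

print(n, row_totals, col_totals)
print(response_given_gender["Males"]["Strongly Agree"])
```

Comparing the conditional distributions for males and females row by row is one informal way to judge whether response appears to be associated with gender.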
