Stat Chapter 6

This document discusses simple linear regression and correlation. It defines key terms like regression, dependent and independent variables. It explains that regression attempts to determine the relationship between one dependent variable and one or more independent variables. Correlation describes the strength and direction of the linear relationship between two variables. The coefficient of correlation r ranges from -1 to 1, where values closer to these extremes indicate a stronger relationship. An example calculates r between household income and consumption as 0.973, showing a strong positive correlation.

Unit Six (6)

SIMPLE LINEAR REGRESSION AND CORRELATION

Unit Objectives
After completing this unit, you will be able to:
• Describe the meaning of regression and correlation.
• Demonstrate the procedures for computing descriptive measures of the strength of the linear relationship between two variables.
• Explain how to find a ‘best fitting’ line relating two variables.
• Outline the computation of rank correlation, which is a measure of association between two rankings.
• Demonstrate the procedure of test statistics for analyzing analytical data.
6.1 Definition of key terms

• Regression is a statistical technique that attempts to determine the relationship between one dependent variable and one or more independent variables.
• Regression analysis is the estimation or prediction of the unknown values of one variable from known values of the other variable.
• In regression analysis there are two types of variables:
• The variable whose value is influenced or is to be predicted is called the dependent (regressed or explained) variable.
• The variable that influences the values or is used for prediction is called the independent (regressor, predictor, or explanatory) variable.
• The mathematical equation (or mathematical model) relating the dependent variable and the independent variable(s) is called a regression model.
• There are two types of regression: simple and multiple.
• Regression with only one independent variable is called simple regression; regression with two or more independent variables is called multiple regression.
• In simple regression, the relationship between the dependent variable (Y) and the independent variable (X) may take various forms:
1. Linear relationship: $Y = a + bX$
2. Exponential relationship: $Y = a b^{X}$
3. Quadratic relationship: $Y = aX^{2} + bX + c$
6.2 Correlation

• Correlation describes the degree of relationship (association or interdependence) between two variables.
• The relationship between the two variables may be:
• positive (direct),
• negative (inverse), or
• absent (no relationship between them).
We can identify which of these holds by plotting a scatter diagram.
a) Positive or direct linear relationship
• The points cluster around a line that runs from the lower left to the upper right of the graph area.
• An increase in the value of X is likely to be associated with an increase in the value of Y, and vice versa.
• The closer the points are to the line, the stronger the relationship.
Graphically,

Figure 1: Positive (direct) linear relationship between variables


b) Negative or inverse linear relationship
• The points cluster around a line that runs from the upper left to the lower right of the graph area.
• An increase in the value of X is likely to be associated with a decrease in the value of Y, and vice versa.
• The closer the points are to the line, the stronger the relationship.
Graphically,

Figure 2: Negative (inverse) linear relationship between variables


c) No linear relationship
• If the data points are randomly scattered, there is no linear relationship between the two variables; the correlation between them is low or zero.
Graphically,

Figure 3: No linear relationship between variables

• A measure of the strength and direction of the linear relationship between two variables X and Y is called the coefficient of correlation (r), which is defined as:
$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^{2} - (\sum x)^{2}\right]\left[n\sum y^{2} - (\sum y)^{2}\right]}}$$
Properties of the Correlation Coefficient
• The coefficient of correlation lies between -1 and 1, i.e., -1 ≤ r ≤ 1.
• The sign of r indicates the direction of the relationship.
• The magnitude of r indicates the strength of the linear relationship between the two variables X and Y.
• r = 0 indicates that there is no linear relationship between the two variables.
• r = -1 indicates that there is a perfect negative (inverse) linear relationship between the two variables.
• r = 1 indicates that there is a perfect positive (direct) linear relationship between the two variables.
• A value of r close to zero shows that the linear relationship is quite weak.
• A value of r close to +1 or -1 shows that the linear relationship is strong.
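The properties above can be checked numerically. The following is a minimal sketch in Python; the function name `pearson_r` and the small data sets are hypothetical, chosen only so that the points lie exactly on a rising line, exactly on a falling line, or show no linear trend.

```python
import math


def pearson_r(x, y):
    """Coefficient of correlation r, using the computational formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


# Hypothetical data (not from the unit), used only to illustrate the properties.
x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2 * v + 1 for v in x])     # points on a rising line
r_down = pearson_r(x, [10 - 3 * v for v in x])  # points on a falling line
r_flat = pearson_r(x, [2, 1, 3, 1, 2])          # no linear trend

print(r_up, r_down, r_flat)
```

The first value should come out as +1, the second as -1, and the third as 0, matching the perfect positive, perfect negative, and no-linear-relationship cases.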
The following table shows the summary of these relationships.
What happens to variable X | What happens to variable Y | Type of correlation | Value | Example
X increases in value | Y increases in value | Direct or positive | Positive, ranging from 0 to +1 | The more time you spend studying, the higher your test score will be.
X decreases in value | Y decreases in value | Direct or positive | Positive, ranging from 0 to +1 | The less money you put in the bank, the less interest you will earn.
X increases in value | Y decreases in value | Indirect or negative | Negative, ranging from -1 to 0 | The more you exercise, the less you will weigh.
X decreases in value | Y increases in value | Indirect or negative | Negative, ranging from -1 to 0 | The less time you take to complete the exam, the more you will get wrong.
Example 1:
A researcher who is concerned about the consumption rate of households took a sample of 10 households and observed their consumption and income (both in tens of Birr) for one month. The results are given in Table 1 below.
household income (x) consumption (y)

1 15 15
2 35 30
3 42 30
4 60 50
5 72 48
6 128 100
7 98 93
8 35 33
9 15 14
10 50 50
Calculate the coefficient of correlation and interpret it.
Solution:
Table 2: Calculation of the necessary summary statistics

income (x)   consumption (y)   xy                   x²                  y²
15           15                15(15) = 225         (15)² = 225         (15)² = 225
35           30                35(30) = 1050        (35)² = 1225        (30)² = 900
42           30                42(30) = 1260        (42)² = 1764        (30)² = 900
60           50                60(50) = 3000        (60)² = 3600        (50)² = 2500
72           48                72(48) = 3456        (72)² = 5184        (48)² = 2304
128          100               128(100) = 12800     (128)² = 16384      (100)² = 10000
98           93                98(93) = 9114        (98)² = 9604        (93)² = 8649
35           33                35(33) = 1155        (35)² = 1225        (33)² = 1089
15           14                15(14) = 210         (15)² = 225         (14)² = 196
50           50                50(50) = 2500        (50)² = 2500        (50)² = 2500
Sums: Σx = 550, Σy = 463, Σxy = 34770, Σx² = 41936, Σy² = 29263
• The coefficient of correlation is then computed as:
$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^{2} - (\sum x)^{2}\right]\left[n\sum y^{2} - (\sum y)^{2}\right]}} = \frac{10(34770) - (550)(463)}{\sqrt{\left[10(41936) - (550)^{2}\right]\left[10(29263) - (463)^{2}\right]}} = 0.973$$

Here, since the value of r is very close to 1, we can conclude that there is a strong direct (positive) linear relationship between income and consumption.
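The calculation in Example 1 can be reproduced with a short script. This is a minimal sketch in Python using the ten (income, consumption) pairs from Table 1; the variable names are illustrative and not part of the original unit.

```python
import math

# Table 1 data (both in tens of Birr).
income = [15, 35, 42, 60, 72, 128, 98, 35, 15, 50]
consumption = [15, 30, 30, 50, 48, 100, 93, 33, 14, 50]

# Summary statistics, as in Table 2.
n = len(income)
sum_x = sum(income)
sum_y = sum(consumption)
sum_xy = sum(x * y for x, y in zip(income, consumption))
sum_x2 = sum(x * x for x in income)
sum_y2 = sum(y * y for y in consumption)

# Coefficient of correlation from the computational formula.
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
r = numerator / denominator

print(f"sums: {sum_x}, {sum_y}, {sum_xy}, {sum_x2}, {sum_y2}")
print(f"r = {r:.3f}")
```

The printed sums should agree with the totals in Table 2, and r should round to 0.973.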
6.3 Coefficient of Determination
• Another measure of the goodness of fit of the regression line is the coefficient of determination, which is the square of the coefficient of correlation; i.e., coefficient of determination = r².
• The coefficient of determination tells us how much of the variability in one variable can be explained by its linear relationship with another variable.
• The value of the coefficient of determination (r²) lies between 0 and 1, inclusive.
• If r² is close to 1, the dependent variable is well predicted by the independent variable, while a value of r² close to 0 indicates that the dependent variable is poorly predicted by the independent variable.
• The total variation in the dependent variable (Y) can be divided into two parts:
1. Explained variation is the variation in the dependent variable (Y) that is explained by changes (or variation) in the independent variable (X). The proportion of explained variation is r² × 100%.
2. Unexplained variation is the variation in the dependent variable (Y) that is caused by factors other than X (such as chance, excluded variables, etc.). The proportion of unexplained variation is (1 - r²) × 100%.
Example 2: Consider the data on consumption expenditure and income of households in Table 1. Find:
1. the coefficient of determination,
2. the proportion of explained variation, and
3. the proportion of unexplained variation,
and interpret each result.
Solution: Using the summary statistics in Table 2, the coefficient of correlation was computed as r = 0.973.
• Coefficient of determination = r² = (0.973)² ≈ 0.95
• The proportion of explained variation is r² × 100% = 0.95 × 100% = 95%. Thus, about 95% of the variation in the monthly consumption expenditure of households is explained by variation in their income.
• The proportion of unexplained variation is (1 - r²) × 100% = 0.05 × 100% = 5%. Thus, about 5% of the variation in the monthly consumption expenditure of households is due to factors other than income.
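As a quick check on Example 2, r² can also be computed directly from the summary statistics, without first rounding r. The sketch below is a minimal Python illustration; the variable names are assumptions, and the sums are the totals from Table 2.

```python
# Totals from Table 2 (n = 10 households).
n = 10
sum_x, sum_y = 550, 463
sum_xy, sum_x2, sum_y2 = 34770, 41936, 29263

# r^2 is the squared numerator of r divided by the product under the square root.
numerator = n * sum_xy - sum_x * sum_y
r_squared = numerator ** 2 / (
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

explained_pct = 100 * r_squared          # variation in Y explained by X
unexplained_pct = 100 * (1 - r_squared)  # variation due to other factors

print(f"r^2 = {r_squared:.2f}")
print(f"explained = {explained_pct:.0f}%, unexplained = {unexplained_pct:.0f}%")
```

Rounded to two decimal places, this agrees with the 0.95, 95% and 5% figures quoted in the solution.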
6.4 Regression and the method of least squares

• Once we have a clear understanding of the strength of the linear relationship between the dependent and independent variables, the next step is to determine a mathematical model (a linear equation) relating the two.
• The most common technique for obtaining such an equation is the method of least squares.
• Linear least squares fitting is the simplest and most commonly applied form of linear regression; it provides a solution to the problem of finding the best-fitting straight line through a set of points.
• If the relationship between two variables X and Y is linear, we express this as:
$$Y = \alpha + \beta X$$
where Y is the dependent variable, X is the independent variable, α is the y-intercept, and β is the slope.
• Here y represents the individual values of the actual observed points.
• We use ŷ to symbolize the individual values of the estimated points, i.e., those points that lie on the estimating line. Accordingly, we write the equation of the estimating line as:
$$\hat{y} = a + bx$$
The sum of squares of the errors (SSE) is:
$$SSE = \sum (y - \hat{y})^{2} = \sum (y - a - bx)^{2}$$
The estimating line has a ‘good fit’ if it minimizes the error between the estimated points on the line and the actual observed points that were used to draw it.
The ‘best’ fitting line is the line for which the SSE is the minimum. By applying differential calculus to the SSE, the slope and intercept of the best-fitting line become:
$$b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^{2} - (\sum x)^{2}}, \qquad a = \bar{y} - b\bar{x}$$
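The calculus step can be sketched as follows. This is the standard derivation under the assumption that the line is fitted by minimizing the SSE with respect to both the intercept a and the slope b; the sums run over all n observed pairs.

```latex
\begin{align*}
SSE(a, b) &= \sum (y - a - bx)^{2} \\
\frac{\partial\, SSE}{\partial a} &= -2 \sum (y - a - bx) = 0
  \;\Rightarrow\; \sum y = na + b \sum x \\
\frac{\partial\, SSE}{\partial b} &= -2 \sum x\,(y - a - bx) = 0
  \;\Rightarrow\; \sum xy = a \sum x + b \sum x^{2} \\
\text{From the first equation:}\quad a &= \bar{y} - b\bar{x} \\
\text{Substituting into the second:}\quad
b &= \frac{n \sum xy - (\sum x)(\sum y)}{n \sum x^{2} - (\sum x)^{2}}
\end{align*}
```

The two equations obtained by setting the partial derivatives to zero are usually called the normal equations.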
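Example: Table 5 shows the number of items produced (X) and the cost (Y) incurred in producing them (in Birr) at a certain factory.

$$b = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^{2} - (\sum x)^{2}} = \frac{5(616) - (32)(93)}{5(222) - (32)^{2}} = \frac{104}{86} \approx 1.21$$

$$a = \bar{y} - b\bar{x} = \frac{93}{5} - 1.21\left(\frac{32}{5}\right) \approx 10.86$$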
Therefore, the equation of the least squares line is:
$$\hat{y} = a + bx \;\Rightarrow\; \hat{y} = 10.86 + 1.21x$$
• The y-intercept is a = 10.86. This value tells us that, even if no item is produced, there will be a fixed cost of 10.86 Birr (such as insurance cost, maintenance cost, etc.).
• The slope is b = 1.21. This figure indicates that for a unit increase (decrease) in the number of items produced, the cost increases (decreases) by 1.21 Birr.
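The slope and intercept in this example can be recomputed from the summary statistics that appear in the worked formula (n = 5, Σx = 32, Σy = 93, Σxy = 616, Σx² = 222). The following is a minimal Python sketch; the helper `predict` is a hypothetical name introduced only for illustration.

```python
# Summary statistics taken from the worked formula in the example.
n = 5
sum_x, sum_y = 32, 93
sum_xy, sum_x2 = 616, 222

# Least squares slope and intercept.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = sum_y / n - b * (sum_x / n)


def predict(items):
    """Estimated cost (in Birr) for a given number of items produced."""
    return a + b * items


print(f"b = {b:.2f}, a = {a:.2f}")
print(f"estimated fixed cost at zero items: {predict(0):.2f} Birr")
```

Rounded to two decimal places, b and a agree with the slope 1.21 and intercept 10.86 of the fitted line.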
6.5 Rank correlation
Rank correlation is used to measure the strength of the linear association between two ranked variables. It is denoted by r_s and given by:
$$r_s = 1 - \frac{6\sum d^{2}}{n(n^{2} - 1)}$$
where n = number of paired observations
d = difference between the ranks for each pair of observations
• The steps involved in computing Spearman’s rank correlation coefficient are as follows:
Step 1: Rank the x’s among themselves, giving rank 1 to the largest (or smallest) observation, rank 2 to the second largest (or second smallest) observation, and so on. Rank the y’s similarly.
Step 2: Find d = rank of x - rank of y for each pair of observations.
Step 3: Find Σd² (the sum of squares of the differences between each pair of ranks).
Step 4: Compute the rank correlation coefficient using the above formula.
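The four steps can be sketched in Python as follows. This is a minimal illustration that assumes there are no tied values (the formula is applied as stated, with no correction for ties); the function names and the two score lists are hypothetical, not data from the unit.

```python
def ranks(values):
    """Step 1: rank 1 goes to the largest value. Assumes no ties."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman_rs(x, y):
    """Spearman's rank correlation coefficient for paired observations."""
    n = len(x)
    rank_x, rank_y = ranks(x), ranks(y)
    # Step 2: rank differences d; Step 3: sum of squared differences.
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    # Step 4: apply the formula.
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Hypothetical scores given by two judges to the same five contestants.
judge_a = [86, 72, 91, 65, 78]
judge_b = [80, 75, 88, 60, 70]

print(ranks(judge_a))  # [2, 4, 1, 5, 3]
print(ranks(judge_b))  # [2, 3, 1, 5, 4]
print(spearman_rs(judge_a, judge_b))
```

Here the rank differences are 0, 1, 0, 0 and -1, so Σd² = 2 and r_s = 1 - 12/120 = 0.9, indicating strong agreement between the two rankings.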