
LINEAR REGRESSION AND CORRELATION

GE 104 - MATHEMATICS IN THE MODERN WORLD


WHAT ARE LINEAR REGRESSION AND CORRELATION?

• The most commonly used techniques for investigating the relationship between two quantitative variables are correlation and linear regression. Correlation quantifies the strength of the linear relationship between a pair of variables, whereas regression expresses the relationship in the form of an equation.
OBJECTIVES

•Understand the basics of linear regression.
•Learn how correlation quantifies relationships.
•Explore how these concepts are applied in real-world scenarios.
UNDERSTANDING CORRELATION

• On a scatter diagram, the closer the points lie to a straight line, the stronger the linear relationship between two variables. To quantify the strength of the relationship, we can calculate the correlation coefficient. In algebraic notation, if we have two variables x and y, and the data take the form of n pairs (i.e. [x1, y1], [x2, y2], [x3, y3] ... [xn, yn]), then the correlation coefficient is given by the following equation:

r = Σ(xi − x̄)(yi − ȳ) / √[ Σ(xi − x̄)² × Σ(yi − ȳ)² ]

• where x̄ is the mean of the x values, and ȳ is the mean of the y values.
• This is the product moment correlation coefficient (or Pearson
correlation coefficient). The value of r always lies between -1 and
+1. A value of the correlation coefficient close to +1 indicates a
strong positive linear relationship (i.e. one variable increases with
the other; Fig. 2). A value close to -1 indicates a strong negative
linear relationship (i.e. one variable decreases as the other
increases; Fig. 3). A value close to 0 indicates no linear relationship
(Fig. 4); however, there could be a nonlinear relationship between
the variables (Fig. 5).
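As an illustration, the sketch below computes r directly from the formula above. It assumes NumPy is available; the x and y arrays are hypothetical placeholder values (the A&E data are not reproduced here), so the printed value is only illustrative.

```python
import numpy as np

# Hypothetical placeholder data (not the A&E data), e.g. age and ln urea pairs
x = np.array([20, 25, 31, 38, 42, 47, 53, 60, 66, 72], dtype=float)
y = np.array([0.9, 1.1, 1.0, 1.3, 1.2, 1.5, 1.6, 1.7, 1.9, 2.0])

# Pearson product moment correlation coefficient, straight from the formula
r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)

print(round(float(r), 3))                         # hand-rolled r
print(round(float(np.corrcoef(x, y)[0, 1]), 3))   # same value from NumPy's built-in
```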
FIGURE 2;
CORRELATION COEFFICIENT (R) = +0.9. POSITIVE LINEAR RELATIONSHIP
FIGURE 3;
CORRELATION COEFFICIENT (R) = -0.9. NEGATIVE LINEAR RELATIONSHIP.
FIGURE 4;
CORRELATION COEFFICIENT (R) = 0.04. NO RELATIONSHIP.
FIGURE 5;
CORRELATION COEFFICIENT (R) = -0.03. NONLINEAR RELATIONSHIP.
HYPOTHESIS TEST OF CORRELATION

• We can use the correlation coefficient to test whether there is a linear relationship between the variables in the population as a whole. The null hypothesis is that the population correlation coefficient equals 0. The value of r can be compared with those given in Table 2, or alternatively exact P values can be obtained from most statistical packages. For the A&E data, r = 0.62 with a sample size of 20 is greater than the value highlighted bold in Table 2 for P = 0.01, indicating a P value of less than 0.01. Therefore, there is sufficient evidence to suggest that the true population correlation coefficient is not 0 and that there is a linear relationship between ln urea and age.
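As a sketch of how an exact P value can be obtained, the snippet below applies the standard t test for a correlation coefficient, t = r√(n − 2)/√(1 − r²) with n − 2 degrees of freedom, assuming SciPy is available. Only r = 0.62 and n = 20 are taken from the slide; the printed P value is an illustration rather than a quoted result.

```python
import math
from scipy import stats

r, n = 0.62, 20   # correlation and sample size from the A&E example

# Under H0 (population correlation = 0), this statistic follows a
# t distribution with n - 2 degrees of freedom.
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
p = 2 * stats.t.sf(abs(t), df=n - 2)   # two-tailed P value

print(f"t = {t:.2f}, P = {p:.4f}")     # P comes out well below 0.01

# Given the raw data, scipy.stats.pearsonr(x, y) returns r and the exact
# two-tailed P value directly.
```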
TABLE 2.
5% AND 1% POINTS FOR THE DISTRIBUTION OF THE CORRELATION COEFFICIENT
UNDER THE NULL HYPOTHESIS THAT THE POPULATION CORRELATION IS 0 IN A TWO-
TAILED TEST
UNDERSTANDING LINEAR REGRESSION

• In the A&E example we are interested in the effect of age (the predictor or x variable) on ln urea (the response or y variable). We want to estimate the underlying linear relationship so that we can predict ln urea (and hence urea) for a given age. Regression can be used to find the equation of this line. This line is usually referred to as the regression line.
EQUATION OF A STRAIGHT LINE

• The equation of a straight line is given by y = a + bx, where the coefficients a and b are the intercept of the line on the y axis and the gradient, respectively. The equation of the regression line for the A&E data (Fig. 7) is as follows: ln urea = 0.72 + (0.017 × age) (calculated using the method of least squares, which is described below). The gradient of this line is 0.017, which indicates that for an increase of 1 year in age the expected increase in ln urea is 0.017 units (and hence urea is expected to be multiplied by a factor of e^0.017 ≈ 1.02).
• The predicted ln urea of a patient aged 60 years, for example, is 0.72 + (0.017 × 60) = 1.74 units. This transforms to a urea level of e^1.74 = 5.70 mmol/l. The y intercept is 0.72, meaning that if the line were projected back to age = 0, then the ln urea value would be 0.72. However, this is not a meaningful value because age = 0 is a long way outside the range of the data and therefore there is no reason to believe that the straight line would still be appropriate.
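A short worked check of these numbers, assuming only the fitted equation ln urea = 0.72 + (0.017 × age) quoted on this slide:

```python
import math

a, b = 0.72, 0.017             # intercept and gradient of the regression line

age = 60
ln_urea = a + b * age           # 0.72 + (0.017 * 60) = 1.74
urea = math.exp(ln_urea)        # back-transform: e^1.74 ≈ 5.70 mmol/l

print(f"predicted ln urea = {ln_urea:.2f}")
print(f"predicted urea    = {urea:.2f} mmol/l")

# Each extra year adds 0.017 to ln urea, i.e. multiplies urea by e^0.017 ≈ 1.02
print(f"multiplicative change per year = {math.exp(b):.3f}")
```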
FIGURE 7;
REGRESSION LINE FOR LN UREA AND AGE: LN UREA = 0.72 + (0.017 × AGE).
METHOD OF LEAST SQUARES

• The regression line is obtained using the method of least squares. Any line y = a + bx that we draw through the points gives a predicted or fitted value of y for each value of x in the data set. For a particular value of x the vertical difference between the observed and fitted value of y is known as the deviation, or residual (Fig. 8). The method of least squares finds the values of a and b that minimise the sum of the squares of all the deviations. This gives the following formulae for calculating a and b:

b = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²   and   a = ȳ − (b × x̄)
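A minimal sketch of these formulae in code, assuming NumPy; the x and y arrays are the same hypothetical placeholders used earlier, not the A&E data.

```python
import numpy as np

# Hypothetical placeholder data again (not the A&E data)
x = np.array([20, 25, 31, 38, 42, 47, 53, 60, 66, 72], dtype=float)
y = np.array([0.9, 1.1, 1.0, 1.3, 1.2, 1.5, 1.6, 1.7, 1.9, 2.0])

# Least squares estimates of the gradient (b) and intercept (a)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

print(f"fitted line: y = {a:.3f} + {b:.3f}x")
print(np.polyfit(x, y, deg=1))   # NumPy's built-in fit: [gradient, intercept]
```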
FIGURE 8;
REGRESSION LINE OBTAINED BY MINIMIZING THE SUMS OF SQUARES OF ALL OF
THE DEVIATIONS.
ASSUMPTIONS AND LIMITATIONS

• The use of correlation and regression depends on some underlying assumptions. The observations are assumed to be independent. For correlation both variables should be random variables, but for regression only the response variable y must be random. In carrying out hypothesis tests or calculating confidence intervals for the regression parameters, the response variable should have a Normal distribution and the variability of y should be the same for each value of the predictor variable.
• A scatter diagram of the data provides an initial check of the
assumptions for regression. The assumptions can be assessed in
more detail by looking at plots of the residuals [4,7]. Commonly, the
residuals are plotted against the fitted values. If the relationship is
linear and the variability constant, then the residuals should be
evenly scattered around 0 along the range of fitted values (Fig. 11).
• (a) Scatter diagram of y against x suggests that the relationship is
nonlinear. (b) Plot of residuals against fitted values in panel a; the
curvature of the relationship is shown more clearly. (c) Scatter
diagram of y against x suggests that the variability in y increases
with x. (d) Plot of residuals against fitted values for panel c; the
increasing variability in y with x is shown more clearly.
• In addition, a Normal plot of residuals can be produced. This is a plot
of the residuals against the values they would be expected to take if
they came from a standard Normal distribution (Normal scores). If
the residuals are Normally distributed, then this plot will show a
straight line. (A standard Normal distribution is a Normal distribution
with mean = 0 and standard deviation = 1.) Normal plots are
usually available in statistical packages.
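A sketch of these two residual checks, assuming NumPy, SciPy and Matplotlib are available; the data and fit are the hypothetical placeholders from the earlier sketch, not the A&E data.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical placeholder data and its least squares fit
x = np.array([20, 25, 31, 38, 42, 47, 53, 60, 66, 72], dtype=float)
y = np.array([0.9, 1.1, 1.0, 1.3, 1.2, 1.5, 1.6, 1.7, 1.9, 2.0])
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

fitted = a + b * x
residuals = y - fitted

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))

# Residuals against fitted values: should scatter evenly around 0 if the
# relationship is linear and the variability is constant
ax1.scatter(fitted, residuals)
ax1.axhline(0, linestyle="--")
ax1.set_xlabel("Fitted values")
ax1.set_ylabel("Residuals")

# Normal plot of residuals: an approximately straight line supports Normality
stats.probplot(residuals, dist="norm", plot=ax2)

plt.tight_layout()
plt.show()
```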
CONCLUSION

• Both correlation and simple linear regression can be used to examine the presence of a linear relationship between two variables providing certain assumptions about the data are satisfied. The results of the analysis, however, need to be interpreted with care, particularly when looking for a causal relationship or when using the regression equation for prediction. Multiple and logistic regression will be the subject of future reviews.
