ANALYTICAL TECHNIQUES LU4 Lecture Notes

This learning unit covers correlation and simple linear regression, focusing on identifying dependent and independent variables, interpreting scatterplots, calculating Pearson's correlation coefficient, and using regression models for predictions. Students will learn to perform these analyses using computational formulas and calculators, and understand the significance of the coefficient of determination. The unit emphasizes the importance of correctly identifying variable roles and the implications of linear relationships in data analysis.

Uploaded by

chisangachama18

LEARNING UNIT 4: Correlation and Simple Linear Regression

Learning objectives

• Distinguish between the dependent and independent variables

• Identify the type of relationship in a bivariate dataset by using a scatterplot

• Calculate and interpret Pearson’s correlation coefficient and the coefficient of determination

• Calculate and interpret regression coefficients of the regression model

• Use the regression model to predict the value of the dependent variable for valid values of the
independent variable

Textbook reference
• Chapter 12
o §12.1 – §12.4
o Exclude §12.5 – §12.7

ATE01A1 – LU 4 1
INTRODUCTION

Researchers often investigate the relationship between numerical variables to determine
whether a relationship exists, what form it takes, and how strong it is. This relationship can be
modelled mathematically and used for prediction purposes. This is done through correlation
analysis, linear regression analysis, and the coefficient of determination.

Students must be able to:

• perform and interpret these analyses from raw data, using the computational formulae and
the calculator,

• interpret computer output of these analyses.

THE ROLES OF THE VARIABLES

For prediction purposes, it is important to correctly identify the roles that the variables have
in the relationship.

• Variable Y is called the dependent variable, as its value depends on the values of one or
more other variables.

• Variable X is called the independent variable as it impacts the value of the dependent
variable.

For example, a company’s sales for a particular product may depend on how much the
company spends on advertising. Therefore, sales is the dependent variable, and
advertising expenditure is the independent variable.

Exercise 4.1

Give two more examples of dependent and independent variables.

THE SCATTERPLOT

The scatterplot is a visual representation of the relationship between two numerical
variables. The independent variable is plotted on the x-axis and the dependent variable on
the y-axis. The nature of the relationship could be linear (positive or negative), non-linear or
non-existent.

[Figure: four example scatterplots showing positive linear, negative linear, non-linear, and non-existent relationships]

CORRELATION ANALYSIS

The correlation coefficient is a numerical measure that quantifies the strength and direction of the
linear relationship between two variables.

A number of different correlation coefficients exist; which one is appropriate depends on the nature of the
variables. The most commonly used correlation coefficient is Pearson’s Product Moment
Correlation Coefficient, denoted by r, which is a value between −1 and +1 (inclusive).

The computational formula for Pearson’s correlation is:

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$
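The computational formula translates directly into a few sums. The following is a minimal Python sketch, not a prescribed method; the function name `pearson_r` and the variable names are illustrative assumptions, and the data are the raw values from Exercise 4.2 later in this unit.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r via the computational formula (hypothetical helper)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Raw data from Exercise 4.2: fitness score and marathon time (hours).
fitness = [2.4, 1.7, 2.8, 2.8, 3.5, 2, 1.8, 2.5, 2.2, 1]
time_h = [5.7, 7.4, 5.6, 5.2, 3.2, 9, 9.3, 6.5, 6.5, 10.6]

r = pearson_r(fitness, time_h)
print(round(r, 2))  # -0.94
```

The result can be compared with the calculator value and with the "Multiple R" entry in the computer output of Exercise 4.2.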

• The roles of the variables are interchangeable and do not influence the value of r.
Therefore, if X is correlated with Y, then Y is correlated with X.

• Note that r is very sensitive to outliers.

• Two variables are positively correlated when large values of the one variable are
associated with large values of the other variable and small values of the one are
associated with small values of the other.

• Two variables are negatively correlated when large values of the one are associated
with small values of the other and vice versa.

• The closer the value of r is to 0, the weaker the linear relationship between the variables.

• When r = 0, there is no linear relationship between two variables, but it is still possible
that a non-linear relationship between the variables exists.
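The last point can be illustrated with a small sketch. The five data points below are made up purely for illustration (they follow y = x² exactly), and `pearson_r` is a hypothetical helper implementing the computational formula for r.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r via the computational formula (hypothetical helper)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    return numerator / sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))

# Illustrative values only: y depends perfectly on x, but not linearly.
x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]  # [4, 1, 0, 1, 4]

r_xy = pearson_r(x, y)
r_yx = pearson_r(y, x)  # swapping the roles of the variables
print(r_xy, r_yx)  # 0.0 0.0
```

Here r = 0 even though Y is completely determined by X, which is why a scatterplot should be inspected before r is interpreted. Swapping the roles of the variables leaves r unchanged.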

The value of r is interpreted as follows:

r = −1                                        Perfect negative linear relationship
r close to −1                                 Strong negative linear relationship
r negative, far from −1 and far from 0        Moderate negative linear relationship
r negative and relatively close to 0          Weak negative linear relationship
r negative and very close to 0                No linear relationship
r = 0                                         No linear relationship
r positive and very close to 0                No linear relationship
r positive and relatively close to 0          Weak positive linear relationship
r positive, far from 0 and far from +1        Moderate positive linear relationship
r close to +1                                 Strong positive linear relationship
r = +1                                        Perfect positive linear relationship

Steps to find the correlation coefficient using the calculator

1) SETUP → down arrow → 3:STAT → 2:OFF

2) MODE → 2:STAT → 2:A + BX

3) Enter the independent variable values in column X and the dependent variable values in
column Y

4) AC

5) SHIFT STAT → 5:REG → 3:r → =

SIMPLE LINEAR REGRESSION ANALYSIS

Linear regression analysis is a technique used to mathematically model the relationship
between two variables for the purpose of prediction.

• Specifically, one or more independent variables are used to predict the value of a single
numerical dependent variable.

• Simple linear regression has only one numerical independent variable.

• Multiple linear regression has multiple independent variables (categorical or numerical)
(this will be covered in ATE B).

• The main idea of simple linear regression is to fit a straight line through the data to
capture or describe the relationship with a simplistic mathematical model.

If a linear relationship exists, then each pairwise observation (x, y) can be written in the
form 𝑦 = 𝑎 + 𝑏𝑥 + 𝑒, i.e., the value of the straight line plus the deviation from the line (e),
also referred to as the error term.

• The equation obtained to describe the linear relationship between the variables X and Y
is called the least squares regression equation/model of Y on X.

• The least squares method ensures that the sum of the squared vertical distances between the
points on the scatterplot and the straight line is a minimum, i.e., the combined squared error is a
minimum.

• The resulting line is referred to as the line of best fit. This is the straight line that is
closest to all points simultaneously. This does not imply that the line is necessarily good;
it simply means it is the best of all possible lines.

• The equation of a straight line is $\hat{y} = a + bx$, where b denotes the slope (gradient) of the
line, a denotes the point on the y-axis through which the line passes (y-intercept), and
$\hat{y}$ is the predicted value of Y on the straight line.

• The values a and b are the regression coefficients, estimated through the method of
least squares, and are calculated as follows:

$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} \qquad a = \frac{\sum y - b\sum x}{n}$$

• Note that the roles of the variables are not interchangeable in these formulae, so it is
very important to correctly identify the roles of the variables for regression analysis.
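The two formulae, and the fact that the roles of the variables matter, can be sketched in Python as follows. The function name `least_squares` is an illustrative assumption, and the data are the raw values from Exercise 4.2 later in this unit.

```python
def least_squares(x, y):
    """Return (a, b) for the line y-hat = a + b*x (hypothetical helper)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
    a = (sum_y - b * sum_x) / n                                   # intercept
    return a, b

# Raw data from Exercise 4.2: fitness score (X) and marathon time in hours (Y).
fitness = [2.4, 1.7, 2.8, 2.8, 3.5, 2, 1.8, 2.5, 2.2, 1]
time_h = [5.7, 7.4, 5.6, 5.2, 3.2, 9, 9.3, 6.5, 6.5, 10.6]

a, b = least_squares(fitness, time_h)      # regression of time on fitness
print(round(a, 2), round(b, 2))            # 13.66 -2.98

a_swapped, b_swapped = least_squares(time_h, fitness)  # roles swapped
print(round(a_swapped, 2), round(b_swapped, 2))        # 4.31 -0.3
```

Swapping the roles gives a different line, not merely the first equation rearranged, so the dependent variable must be chosen before the coefficients are calculated.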

• The regression equation/model is used to predict the value of the dependent variable for
a given value of the independent variable.

• The x-values used to make predictions should fall within the range of the observed
values of X. Predicting within this range is referred to as interpolation and yields a valid
prediction.

• Predicting for x-values outside the observed range of X is referred to as extrapolation. Such
predictions are not valid, as there is no guarantee that the regression model still applies
outside the observed range.
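One way to make the range restriction explicit is to check it before predicting. The sketch below is only an illustration: the function `predict` and the choice to raise an error on extrapolation are assumptions, while the coefficients (13.66 and −2.98) and the observed fitness range (1 to 3.5) are taken from Exercise 4.2 later in this unit.

```python
def predict(x, a, b, x_min, x_max):
    """Predict y-hat = a + b*x, refusing to extrapolate (hypothetical helper)."""
    if not x_min <= x <= x_max:
        raise ValueError(
            f"x = {x} lies outside the observed range [{x_min}, {x_max}]: extrapolation"
        )
    return a + b * x

# Values from Exercise 4.2: rounded coefficients and the observed range of X.
A, B = 13.66, -2.98
X_MIN, X_MAX = 1.0, 3.5

print(round(predict(3.0, A, B, X_MIN, X_MAX), 2))  # 4.72 (interpolation)

try:
    predict(4.0, A, B, X_MIN, X_MAX)
except ValueError as err:
    print("Not a valid prediction:", err)
```

Raising an error is a strict design choice; a warning would work as well, as long as the extrapolated value is not reported as a valid prediction.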

• A positive value for b means that, on average, the values of Y will increase as the values of
X increase, indicating a positive (direct) relationship between X and Y, e.g. $\hat{y} = 3 + 2x$.

• A negative value for b means that, on average, the values of Y will decrease as the values of
X increase, indicating a negative (inverse) relationship between X and Y, e.g. $\hat{y} = 3 - 2x$.

• It therefore follows that the sign of the slope corresponds to the sign of the correlation
coefficient, although the numerical values of r and b generally differ, since b also depends on the
units in which X and Y are measured.

• The slope of the regression line shows the predicted change in the dependent variable for a
1-unit change in the independent variable.
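The agreement in sign between b and r can be read off the two computational formulae, which share the same numerator. A short derivation follows; the symbols $s_x$ and $s_y$ (the sample standard deviations of X and Y) are notation introduced here for illustration and are not used elsewhere in this unit.

```latex
\[
b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}
  = r\,\sqrt{\frac{n\sum y^2 - \left(\sum y\right)^2}{n\sum x^2 - \left(\sum x\right)^2}}
  = r\,\frac{s_y}{s_x}
\]
```

The square-root factor is always positive, so b and r always have the same sign, while the magnitude of b is scaled by the spread of Y relative to the spread of X.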

Steps to find the regression coefficients using the calculator

1) Same steps as for the correlation coefficient (steps 1-4)

2) SHIFT STAT → 5:REG → 1:A (intercept) or 2:B (slope) → =

COEFFICIENT OF DETERMINATION

The coefficient of determination provides a measure of how well the regression line fits the
data, i.e. a goodness-of-fit measure.

• It is the square of the correlation coefficient, denoted by $r^2$.

• It is typically expressed as a percentage and interpreted as the percentage of variation
in the dependent variable that can be explained by the regression model.

• If all data points lie exactly on a regression line with a positive slope, there is no
unexplained variation. For such data, the correlation coefficient r = 1; therefore
$r^2 \times 100 = 1^2 \times 100 = 100\%$. That is, 100% of the variation in Y is accounted for by the
variation in X in the regression model.

• On the other hand, if the points are so scattered that none of the variation can be
explained by the regression model, in other words r = 0, it follows that
$r^2 \times 100 = 0^2 \times 100 = 0\%$, indicating that none of the variation in Y is accounted for by
the variation in X in the regression model.

• The value of $r^2$ will always be between 0 and 1 inclusive (or between 0% and 100% when
expressed as a percentage). It does not indicate the direction of the linear relationship,
only the goodness-of-fit of the regression model.

• Also, if $r^2$ = 80%, then the remaining 20% of the variation in Y is unexplained by the regression model.
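The "explained variation" reading of the coefficient of determination can be checked numerically. The sketch below is an illustration under stated assumptions: the names `sst` (total variation in Y around its mean) and `sse` (variation left unexplained by the line) are introduced here and are not part of this unit's notation, and the data are the raw values from Exercise 4.2 later in this unit.

```python
# Raw data from Exercise 4.2: fitness score (x) and marathon time in hours (y).
x = [2.4, 1.7, 2.8, 2.8, 3.5, 2, 1.8, 2.5, 2.2, 1]
y = [5.7, 7.4, 5.6, 5.2, 3.2, 9, 9.3, 6.5, 6.5, 10.6]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)

# Regression coefficients and Pearson's r from the computational formulae.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n
r = (n * sum_xy - sum_x * sum_y) / (
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
) ** 0.5

mean_y = sum_y / n
sst = sum((yi - mean_y) ** 2 for yi in y)                    # total variation in Y
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained variation
explained_share = 1 - sse / sst

print(round(r ** 2, 2), round(explained_share, 2))  # 0.88 0.88
```

Both routes give the same value for a simple linear regression fitted by least squares, which is why the square of r can be interpreted as the share of variation in Y explained by the model.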

Exercise 4.2

A sample of 10 athletes took a series of tests to measure their fitness. Their overall fitness
scores ranged from 1 to 3.5. After these tests, the athletes ran a standard marathon, and
their times were measured in hours.

1) Identify the dependent and the independent variables:

Dependent =

Independent =

2) Describe the nature of the relationship between the athletes’ fitness and their marathon
times based on the following scatterplot.

3) The following table shows the raw data for the 10 athletes:

Fitness score           2.4   1.7   2.8   2.8   3.5   2.0   1.8   2.5   2.2   1.0
Marathon time (hours)   5.7   7.4   5.6   5.2   3.2   9.0   9.3   6.5   6.5  10.6

Use your calculator to find the following values:

Correlation =

Coefficient of determination =

Regression equation =

4) Use the sums below in the computational formulae to calculate the following values:

$$\sum x = 22.7 \qquad \sum x^2 = 55.91 \qquad \sum y = 69.0 \qquad \sum y^2 = 520.24 \qquad \sum xy = 143.59$$

Correlation:

$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$

Coefficient of determination =

Regression equation:

$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} \qquad a = \frac{\sum y - b\sum x}{n}$$

5) The following computer output gives the results of a least squares regression analysis
of marathon time on fitness:

Regression Statistics
Multiple R           0.94
R Square             0.88
Adjusted R Square    0.86
Standard Error       0.82
Observations         10

                 Coefficients   Standard Error   t Stat   P-value
Intercept               13.66             0.92    14.82      0.00
Fitness score           −2.98             0.39    −7.64      0.00
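As a reading aid for output of this kind, the Coefficients column supplies the intercept a and the slope b, and R Square is the square of Multiple R. This is a sketch under one assumption: that the output follows the common spreadsheet layout, in which Multiple R is reported without a sign, so the sign of r has to be taken from the slope.

```latex
\[
\hat{y} = 13.66 - 2.98x, \qquad r = -0.94, \qquad r^2 = (-0.94)^2 \approx 0.88
\]
```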

a) Identify and interpret the strength of the linear relationship between fitness and
marathon time.

b) What proportion of variation in marathon time is explained by variation in fitness?

c) What proportion of variation in marathon time is not explained by variation in fitness?

d) Interpret the slope of the regression model.

e) Predict an athlete’s marathon time if his/her fitness score is equal to 2. Is this a valid
prediction?

f) Is it possible to predict an athlete’s marathon time if his/her fitness score is equal to 5?

g) Is it possible to predict an athlete’s fitness score if his/her marathon time is equal to 3
hours?
