ANALYTICAL TECHNIQUES LU4 Lecture Notes
Learning objectives
• Calculate and interpret Pearson’s correlation coefficient and the coefficient of determination
• Use the regression model to predict the value of the dependent variable for valid values of the
independent variable
Textbook reference
• Chapter 12
o §12.1 – §12.4
o Exclude §12.5 – §12.7
ATE01A1 – LU 4 1
INTRODUCTION
Researchers often investigate the relationship between numerical variables to establish
whether a relationship exists, what form it takes, and how strong it is. This relationship can be
modelled mathematically and used for prediction purposes. This is done through correlation
analysis, linear regression analysis, and the coefficient of determination.
You should be able to perform and interpret these analyses from raw data, both with the
computational formulae and with the calculator.
THE ROLES OF THE VARIABLES
For prediction purposes, it is important to correctly identify the roles that the variables have
in the relationship.
• Variable Y is called the dependent variable, as its value depends on the values of one or
more other variables.
• Variable X is called the independent variable as it impacts the value of the dependent
variable.
For example, a company’s sales for a particular product may depend on how much the
company spends on advertising. Therefore, sales is the dependent variable, and
advertising expenditure is the independent variable.
Exercise 4.1
THE SCATTERPLOT
[Figure: example scatterplots, including panels titled "Non-linear" and "Non-existent"]
CORRELATION ANALYSIS
The correlation coefficient is a numerical measure that quantifies the strength of the
relationship between two variables.
Several correlation coefficients exist, and the appropriate one depends on the nature of the
variables. The most commonly used correlation coefficient is Pearson’s Product Moment
Correlation Coefficient, denoted by r, which is a value between −1 and +1 (inclusive).
$$
r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}
$$
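The computational formula for r can be sketched in code as follows. This is a minimal illustration, not a prescribed method: the function name `pearson_r` and the advertising/sales figures are assumptions invented for the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r computed from the sums in the computational formula."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    if denominator == 0:
        raise ValueError("r is undefined when X or Y has no variation")
    return numerator / denominator


# Hypothetical data (invented for illustration only).
advertising = [1, 2, 3, 4, 5]
sales = [2, 4, 5, 4, 5]

r = pearson_r(advertising, sales)
print(round(r, 4))  # 0.7746
```

Passing the two lists in the opposite order returns the same value, because the formula treats X and Y symmetrically.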
• The roles of the variables are interchangeable and do not influence the value of r.
Therefore, if X is correlated with Y, then Y is correlated with X.
• Two variables are positively correlated when large values of the one variable are
associated with large values of the other variable and small values of the one are
associated with small values of the other.
• Two variables are negatively correlated when large values of the one are associated
with small values of the other and vice versa.
• The closer the value of r is to 0, the weaker the linear relationship between the variables.
• When r = 0, there is no linear relationship between two variables, but it is still possible
that a non-linear relationship between the variables exists.
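The last point can be checked numerically. In the sketch below (the data and the helper name `pearson_r` are assumptions made for illustration), Y is determined exactly by X through y = x², yet r is 0 because the relationship is not linear. A perfectly linear data set is included for contrast.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r computed from the sums in the computational formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


# Hypothetical x-values, symmetric around zero.
x = [-3, -2, -1, 0, 1, 2, 3]

# Perfect non-linear (quadratic) relationship: r is 0.
y_quadratic = [v ** 2 for v in x]
r_quadratic = pearson_r(x, y_quadratic)

# Perfect positive linear relationship: r is 1.
y_linear = [2 * v + 1 for v in x]
r_linear = pearson_r(x, y_linear)

print(r_quadratic, r_linear)
```

This is why a scatterplot should be inspected alongside r: a value near 0 rules out a linear relationship only.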
The value of r is interpreted as follows:
Steps to find the correlation coefficient using the calculator
3) Enter the independent variable values in column X and the dependent variable values in
column Y
4) AC
SIMPLE LINEAR REGRESSION ANALYSIS
• In regression analysis, one or more independent variables are used to predict the value of a
single numerical dependent variable; simple linear regression uses exactly one independent variable.
• The main idea of simple linear regression is to fit a straight line through the data to
capture or describe the relationship with a simple mathematical model.
If a linear relationship exists, then each pairwise observation (x, y) can be written in the
form 𝑦 = 𝑎 + 𝑏𝑥 + 𝑒, i.e., the value of the straight line plus the deviation from the line (e),
also referred to as the error term.
• The equation obtained to describe the linear relationship between the variables X and Y
is called the least squares regression equation/model of Y on X.
• The least squares method ensures that the sum of the squared vertical distances between the
points on the scatterplot and the straight line is a minimum, i.e., the combined squared error
is a minimum.
• The resulting line is referred to as the line of best fit. This is the straight line that is
closest to all points simultaneously. This does not imply that the line is necessarily good;
it simply means it is the best of all possible lines.
• The equation of a straight line is ŷ = a + bx, where b denotes the slope (gradient) of the
line, a denotes the point on the y-axis through which the line passes (y-intercept), and
ŷ is the predicted value of Y on the straight line.
• The values a and b are the regression coefficients, estimated through the method of
least squares, and are calculated as follows:
$$
b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}
\qquad
a = \frac{\sum y - b\sum x}{n}
$$
• Note that the roles of the variables are not interchangeable in these formulae, so it is
very important to correctly identify the roles of the variables for regression analysis.
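The formulae for b and a can be sketched as follows. The function names and the data are assumptions invented for illustration. The sketch also checks two claims made in these notes: nudging a or b away from the fitted values increases the sum of squared errors, and swapping the roles of X and Y gives a different line.

```python
def least_squares(xs, ys):
    """Return (a, b) for the line y_hat = a + b*x (regression of Y on X)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


def sum_squared_errors(a, b, xs, ys):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data (invented for illustration only).
x_values = [1, 2, 3, 4, 5]
y_values = [2, 4, 5, 4, 5]

a, b = least_squares(x_values, y_values)
best_sse = sum_squared_errors(a, b, x_values, y_values)
print(round(a, 4), round(b, 4), round(best_sse, 4))  # 2.2 0.6 2.4

# Any small change to the intercept or slope gives a larger squared error.
nudges = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]
nudged_sses = [
    sum_squared_errors(a + da, b + db, x_values, y_values) for da, db in nudges
]

# Swapping the roles of the variables gives a different regression line.
a_swapped, b_swapped = least_squares(y_values, x_values)
print(round(a_swapped, 4), round(b_swapped, 4))  # -1.0 1.0
```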
• The regression equation/model is used to predict the value of the dependent variable for
a given value of the independent variable.
• The x-values used to make predictions should fall within the range of the observed
values of X, as there is no guarantee that the regression model still applies to x-values
outside of the observed range. Predicting within the observed range is referred to as
interpolation and yields a valid prediction.
• Predicting for x-values outside the observed range of X is referred to as extrapolation and
does not yield a valid prediction.
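A prediction routine can guard against extrapolation by checking the requested x-value against the observed range. The sketch below is one possible way to do this. The function name `predict`, the coefficients, and the observed x-values are hypothetical.

```python
def predict(a, b, x, observed_xs):
    """Return (y_hat, is_valid), where is_valid is True only for interpolation."""
    y_hat = a + b * x
    is_valid = min(observed_xs) <= x <= max(observed_xs)
    return y_hat, is_valid


# Hypothetical fitted model y_hat = 2.2 + 0.6x, with X observed from 1 to 5.
observed_xs = [1, 2, 3, 4, 5]
a, b = 2.2, 0.6

inside = predict(a, b, 3, observed_xs)   # x within the observed range
outside = predict(a, b, 8, observed_xs)  # x beyond the observed range

print(round(inside[0], 2), inside[1])
print(round(outside[0], 2), outside[1])
```

The formula still returns a number for x = 8, but the flag marks it as an extrapolation, which should not be reported as a valid prediction.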
• A positive value for b means that, on average, the values of Y will increase as the values of
X increase, indicating a positive (direct) relationship between X and Y, e.g. ŷ = 3 + 2x.
• A negative value for b means that, on average, the values of Y will decrease as the values of
X increase, indicating a negative (inverse) relationship between X and Y, e.g. ŷ = 3 − 2x.
• It therefore follows that the sign of the slope corresponds to the sign of the correlation
coefficient, although the numerical values of r and b generally differ, since b depends on the
units of X and Y whereas r does not.
• The slope of the regression line shows the predicted change in the dependent variable for a
1-unit change in the independent variable.
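As a brief worked illustration of the slope interpretation, take the example line ŷ = 3 + 2x used in these notes and compare the predictions at two x-values one unit apart. The values x = 4 and x = 5 are chosen arbitrarily for the illustration.

$$
\hat{y}(4) = 3 + 2(4) = 11, \qquad \hat{y}(5) = 3 + 2(5) = 13, \qquad \hat{y}(5) - \hat{y}(4) = 2 = b
$$

The predicted value of Y rises by exactly b = 2 for each 1-unit increase in X, whichever starting value of x is used.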
COEFFICIENT OF DETERMINATION
The coefficient of determination provides a measure of how well the regression line fits the
data, i.e. a goodness-of-fit measure.
• If all data points lie directly on the regression line in the positive direction, there is no
unexplained variation. For such data, the correlation coefficient r = 1; therefore
r² × 100 = 1² × 100 = 100%. That is, 100% of the variation in Y is accounted for by the
variation in X in the regression model.
• On the other hand, if the points are so scattered that none of the variation can be
explained by the regression model, in other words r = 0, it follows that
r² × 100 = 0² × 100 = 0%, indicating that none of the variation in Y is accounted for by
the variation in X in the regression model.
• The value of r² will always be between 0 and 1 (or between 0% and 100% when
expressed as a percentage) and does not indicate the direction of the linear relationship,
but rather the goodness-of-fit of the regression model.
• Also, if r² = 80%, then 20% of the variation in Y is unexplained by the regression model.
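The split into explained and unexplained variation can be sketched numerically. The code below assumes the standard identity r² = 1 − SSE/SST, which is not stated in these notes: SSE is the sum of squared deviations from the regression line and SST is the sum of squared deviations of Y from its mean. The data and names are hypothetical.

```python
def r_squared(xs, ys):
    """Coefficient of determination for the least squares line of Y on X."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Regression coefficients from the computational formulae.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    mean_y = sum_y / n
    # Total variation in Y, and the part left unexplained by the line.
    sst = sum((y - mean_y) ** 2 for y in ys)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - sse / sst


# Hypothetical data (invented for illustration only).
x_values = [1, 2, 3, 4, 5]
y_values = [2, 4, 5, 4, 5]

r2 = r_squared(x_values, y_values)
explained_pct = r2 * 100
unexplained_pct = 100 - explained_pct
print(round(explained_pct, 1), round(unexplained_pct, 1))  # 60.0 40.0
```

For these data, 60% of the variation in Y is accounted for by the regression model and the remaining 40% is unexplained. For simple linear regression this value equals the square of Pearson's r.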
Exercise 4.2
A sample of 10 athletes took a series of tests to measure their fitness. Their overall fitness
scores ranged from 1 to 3.5. After these tests, the athletes ran a standard marathon, and
their times were measured in hours.
1) Identify the dependent and independent variables.
Dependent =
Independent =
2) Describe the nature of the relationship between the athletes’ fitness and their marathon
times based on the following scatterplot.
3) The following table shows the raw data for the 10 athletes:
Fitness score           2.4   1.7   2.8   2.8   3.5   2.0   1.8   2.5   2.2   1.0
Marathon time (hours)   5.7   7.4   5.6   5.2   3.2   9.0   9.3   6.5   6.5   10.6
Correlation =
Coefficient of determination =
Regression equation =
4) Use the sums below in the computational formulae to calculate the following values:
Σx = 22.7    Σx² = 55.91
Correlation:
$$
r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}
$$
Σx = 22.7    Σx² = 55.91
Coefficient of determination =
Regression equation:
$$
b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}
\qquad
a = \frac{\sum y - b\sum x}{n}
$$
5) The following computer output gives the results of a least squares regression analysis
of marathon time on fitness:
Regression Statistics
Multiple R 0.94
R Square 0.88
Adjusted R Square 0.86
Standard Error 0.82
Observations 10
a) Identify and interpret the strength of the linear relationship between fitness and
marathon time.
d) Interpret the slope of the regression model.
e) Predict an athlete’s marathon time if his/her fitness score is equal to 2. Is this a valid
prediction?
f) Is it possible to predict an athlete’s marathon time if his/her fitness score is equal to 5?