Chapter 14 - Correlation and Regression - 2012 - Practical Biostatistics
14.1 CORRELATION
Correlation has two goals: (1) to quantify the degree of connection between
a pair of variables and (2) to determine the direction of this relationship.
Biological plausibility demands that, in the clinical setting, the “strongest”
variable influences the “more susceptible” one. This proposition implies two
opposing concepts:
• Independent variable: The independent variable is the variable that has
influence on the dependent variable but in turn is not influenced by the
latter. In didactic terms, it is the “dominant” variable. For example, body
temperature (independent variable) influences heart rate (dependent
variable) rather than the opposite. By convention, it is identified as x and
graphically represented by the horizontal axis (x-axis).
• Dependent variable: The dependent variable is the variable that is
influenced by the independent variable but in turn has no influence on the
latter. In didactic terms, it is the “submissive” variable. For example, heart
rate (dependent variable) is influenced by body temperature (independent
variable) rather than the opposite. By convention, it is identified as y and
graphically represented by the vertical axis (y-axis).
Take, for example, a cohort of 26 overweight middle-aged men whose
body weight and mean arterial blood pressure data are collected and tabulated
(Table 14.1). Biological plausibility implies that body weight is the indepen-
dent variable (x) and mean arterial blood pressure is the dependent variable (y).
We want to verify whether these results can be used to predict mean arterial
blood pressure from body weight in this cohort and possibly in other similar
cohorts. First, however, we must use a graph to determine whether there is a
hint of a relationship between the two variables, and then quantify it
(Figure 14.1).
[Figure 14.1: scatterplot; x-axis: Body weight (kg)]
14.2 REGRESSION
Regression (linear regression) takes correlation one step further by predicting
the value of a dependent variable based on the value of an independent vari-
able, as it quantifies the strength and direction of this prediction (the term
“regression” does not imply a temporal dimension to the problem; it only
represents a historical aspect of the development of this tool). Linear regression
assumes that its variables have a linear relationship. This method is based
on the regression line, which is determined by the linear regression formula:
y′ = 4 + 2x
where y′ is the value of the dependent variable to be predicted, and x is the
independent variable. This means that if x = 2, then y′ = 8. Graphically, this
line is represented as shown in Figure 14.2.
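As a minimal sketch, the example formula can be evaluated for a range of x values. The function name `predict` and its default arguments are illustrative assumptions; only the intercept 4 and the slope 2 come from the formula above.

```python
def predict(x, intercept=4.0, slope=2.0):
    """Return the predicted value y' = intercept + slope * x."""
    return intercept + slope * x


# Evaluate the example line over the x range used in the example (0 to 7).
for x in range(0, 8):
    print(x, predict(x))

# The case discussed in the text: x = 2 gives y' = 8.
print(predict(2))  # 8.0
```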
The regression line passes as close as possible to all the data points of a
graph. By doing so, it is expected to represent the trend of the pooled
data, as well as its direction (ascending, flat, or descending line). The
following are important elements of regression lines:
• Slope: The slope represents the steepness of the line. It indicates the size
of the influence x has over y; that is, the steeper the line, the greater the
influence.
• y-intercept: The y-intercept is the point where the regression line crosses
the y-axis. It gives the value of the dependent variable when the
independent variable equals 0.
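The slope and y-intercept can be estimated from paired observations. The text does not name a fitting method, so the sketch below assumes ordinary least squares, the usual choice for a linear trend line. The function name `fit_line` and the sample points are hypothetical; the points are generated from the example line y′ = 4 + 2x so that the result is easy to check.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-products of deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points lying exactly on the example line y' = 4 + 2x.
xs = [0, 1, 2, 3, 4, 5]
ys = [4 + 2 * x for x in xs]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

With real, scattered data the fitted slope and intercept describe the line closest to the points in the least-squares sense rather than a line through every point.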
To predict a dependent variable, it is first necessary to build a linear
regression graph by adding a regression line to the correlation scatterplot of
the studied population. Applying this to the example from Figure 14.1 gives
Figure 14.3.
FIGURE 14.2 Regression line traced according to the intersection of independent and depen-
dent variables.
[Figure 14.3: y-axis: Mean arterial blood pressure (mmHg); x-axis: Body weight (kg)]
FIGURE 14.3 Regression line traced on the scatterplot from Figure 14.1.
172 PART | IV Additional Concepts in Biostatistics
FIGURE 14.4 Regression lines for two independent variables, x1 and x2, traced on a scatterplot
built for Table 14.2 (the continuous line represents x1, and the dashed line represents x2).
Step 2
Select the Vertical (Value) Axis Major Gridlines and the Series 1 box (both not
illustrated). Press the Delete key.
Step 3
Click on the Layout tab and then on the Axis Titles button in the Labels
area. Click on the Primary Horizontal Axis Title drop-down menu button
and then on the Title Below Axis secondary drop-down menu button for the
horizontal axis title box. Add the desired title.
Click on the Primary Vertical Axis Title drop-down menu button and then on
the Rotated Title secondary drop-down menu button for the vertical axis
title box. Add the desired title.
Step 4
Select the chart you have created, click on the Layout tab, and then click on
the Trendline button in the Analysis area. Click on the Linear Trendline
button in the drop-down menu. The trend line is shown on the chart.
Step 2
Pearson’s formula is shown in the C28 cell. Put your cursor in the Array 1
field and then select the independent variables array (B2:B27) on the sheet
itself. The array is shown in the Array 1 field. Put your cursor in the Array
2 field and then select the dependent variables array (C2:C27) on the sheet
itself. The array is shown in the Array 2 field. Pearson’s correlation coeffi-
cient is shown in the result space.
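For readers who want to check the spreadsheet result outside Excel, Pearson's correlation coefficient can be computed directly from its definition. This is a minimal sketch: the function name `pearson_r` is an assumption, and the sample values are hypothetical stand-ins for the B2:B27 and C2:C27 arrays, not the data from Table 14.1.

```python
import math


def pearson_r(xs, ys):
    """Return Pearson's correlation coefficient for paired observations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-products of deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("a variable with no variation has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values only (NOT the data from Table 14.1).
body_weight_kg = [82, 90, 95, 101, 110, 124]      # independent variable (x)
mean_pressure_mmhg = [92, 96, 99, 104, 108, 118]  # dependent variable (y)
print(round(pearson_r(body_weight_kg, mean_pressure_mmhg), 3))
```

The sign of the result gives the direction of the relationship, and its absolute value, between 0 and 1, gives its strength.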
Step 2
The TREND formula is shown in the C28 cell. Put your cursor in the
Known_y’s field and then select the dependent variables array (C2:C27) on
the sheet itself. The array is shown in the Known_y’s field. Put your cursor
in the Known_x’s field and then select the independent variables array
(B2:B27) on the sheet itself. The array is shown in the Known_x’s field. Leave
the Const field blank. The dependent variable is shown in the result space.
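The prediction step can also be sketched in code. The function below is a hypothetical stand-in that mirrors the argument order of the TREND dialog (known y values first, then known x values, then the new x values to predict for). It assumes a least-squares line with the intercept calculated normally, which is what leaving the Const field blank requests. The sample values are illustrative only.

```python
def trend(known_ys, known_xs, new_xs):
    """Predict y for each value in new_xs from a least-squares line."""
    n = len(known_xs)
    if n != len(known_ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(known_xs) / n
    mean_y = sum(known_ys) / n
    sxx = sum((x - mean_x) ** 2 for x in known_xs)
    if sxx == 0:
        raise ValueError("all known x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(known_xs, known_ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Apply y' = intercept + slope * x to every requested x value.
    return [intercept + slope * x for x in new_xs]


# Illustrative values lying on the line y' = 4 + 2x; predict y' at x = 5.
print(trend([4, 6, 8, 10], [0, 1, 2, 3], [5]))  # [14.0]
```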