
Chapter 14

Correlation and Regression


Sometimes, the investigator’s hypothesis does not involve searching for statistically significant differences between groups but, rather, how well two different study variables relate to each other. If they do significantly relate to each other, then predicting the value of one variable based on the value of the other one may be feasible. Two mathematical tools are available to accomplish these objectives: correlation and regression.

14.1 CORRELATION
Correlation has two goals: (1) to quantify the degree of association between
a pair of variables and (2) to determine the direction of this relationship.
Biological plausibility demands that, in the clinical setting, the “strongest”
variable influences the “more susceptible” one. This proposition implies two
opposing concepts:
- Independent variable: The independent variable is the variable that has influence on the dependent variable but in turn is not influenced by the latter. In didactic terms, it is the “dominant” variable. For example, body temperature (independent variable) influences heart rate (dependent variable) rather than the opposite. By convention, it is identified as x and graphically represented by the horizontal axis (x-axis).
- Dependent variable: The dependent variable is the variable that is influenced by the independent variable but in turn has no influence on the latter. In didactic terms, it is the “submissive” variable. For example, heart rate (dependent variable) is influenced by body temperature (independent variable) rather than the opposite. By convention, it is identified as y and graphically represented by the vertical axis (y-axis).
Take, for example, a cohort of 26 overweight middle-aged men whose body weight and mean arterial blood pressure data are collected and tabulated (Table 14.1). Biological plausibility implies that body weight is the independent variable (x) and mean arterial blood pressure is the dependent variable (y). We want to verify whether these results might be useful to predict mean arterial blood pressure based on body weight in this cohort, and possibly in other similar cohorts. First, however, we must determine whether there is a hint of a relationship between the two variables using a graph, and then quantify it (Figure 14.1).

M. Suchmacher & M. Geller: Practical Biostatistics. DOI: 10.1016/B978-0-12-415794-1.00014-8


© 2012 Elsevier Inc. All rights reserved. 167
168 PART | IV Additional Concepts in Biostatistics

TABLE 14.1 Data from a Cohort of 26 Overweight Middle-Aged Men*

Patient No.   Body Weight (kg) (x)   Mean Arterial Blood Pressure (mmHg) (y)
1             20                     80
2             30                     78
3             40                     90
4             50                     92
5             60                     76
6             70                     78
7             80                     86
8             90                     76
9             100                    108
10            110                    74
11            120                    85
12            130                    108
13            140                    110
14            150                    88
15            160                    90
16            170                    80
17            180                    118
18            20                     150
19            30                     89
20            40                     90
21            50                     75
22            60                     78
23            70                     108
24            80                     145
25            90                     198
26            100                    149
Mean          142.8                  99.9

*The purpose of the presented data (body weight) is merely to demonstrate the proposed concepts in a didactic manner, even though they are clinically unrealistic.

FIGURE 14.1 Scatterplot based on the data from Table 14.1 (x-axis: body weight, kg; y-axis: mean arterial blood pressure, mmHg).

Based on a simple visual analysis, it is possible to notice a correlation between the body weight and mean arterial blood pressure variables; that is, the higher the body weight, the higher the mean arterial blood pressure. In order to better quantify this relationship, an index called the correlation coefficient must be determined. One of the most commonly used is Pearson’s correlation coefficient (r). This coefficient is determined by the following formula:

r = Σ(x − x̄)(y − ȳ) / √[Σ(x − x̄)² · Σ(y − ȳ)²]

where x̄ is the mean of the x values, and ȳ is the mean of the y values.


r = [(20 − 142.8)(80 − 99.9) + (30 − 142.8)(78 − 99.9) + ⋯ + (90 − 142.8)(198 − 99.9) + (100 − 142.8)(149 − 99.9)]
    / √{[(20 − 142.8)² + (30 − 142.8)² + ⋯ + (100 − 142.8)²] × [(80 − 99.9)² + ⋯ + (149 − 99.9)²]}
  = +0.57
r ranges from −1 to +1 (− denotes a negative direction and + a positive direction). The following are possible inferences based on these data:
- The closer to 0, the weaker the correlation (the dependent variable is “indifferent” to independent variable changes).
- The closer to 1 (positive or negative), the stronger the correlation (the dependent variable changes as much as the independent variable does).
- The closer to −1, the more the dependent variable distances itself from the independent variable (if the latter increases, then the former decreases, and vice versa).
- The closer to +1, the more both variables change in parallel (if the independent variable increases, the dependent variable also increases).

In the previous example, r = +0.57. The positive direction implies two corresponding behaviors: (1) mean arterial blood pressure is expected to increase as body weight increases, and (2) mean arterial blood pressure is expected to decrease as body weight decreases.
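As a minimal sketch of this calculation, Pearson’s r can be computed directly from the formula above (the function name and the tiny illustrative data set are our own, not from the chapter):

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient: the sum of cross-deviations
    divided by the square root of the product of squared deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - x_bar) ** 2 for x in xs)
                    * sum((y - y_bar) ** 2 for y in ys))
    return num / den

# Illustrative data: y rises perfectly in step with x, so r = +1.0
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0
```

Reversing the second list would give r = −1.0, illustrating the negative direction described above.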

14.2 REGRESSION
Regression (linear regression) takes correlation one step further by predicting the value of a dependent variable based on the value of an independent variable, as it quantifies the strength and direction of this prediction (the term “regression” does not imply a temporal dimension to the problem; it only reflects a historical aspect of the development of this tool). Linear regression assumes that the variables present a linear relationship. This method is based on the regression line, which is determined by the linear regression formula, shown here with example coefficients:

y′ = 4 + 2x

where y′ is the value of the dependent variable to be predicted, and x is the independent variable. This means that if x = 2, then y′ = 8. Graphically, this line is represented as shown in Figure 14.2.
The regression line passes as closely as possible to all the plotted data points of a graph. By doing so, it is expected to represent the trend of the pooled data, as well as its direction (ascending, flat, or descending line). The following are important elements of regression lines:
- Slope: Slope represents the steepness of the line. It informs on the size of the influence x has over y; that is, the steeper the line, the greater the influence.
- y-intercept: The y-intercept is the point where the regression line touches the y-axis. It informs on the value of the dependent variable when the independent variable equals 0.
To predict a dependent variable, it is first necessary to build a linear regression graph by adding a regression line to the correlation scatterplot of the studied population. Let us use the example from Figure 14.1 (Figure 14.3).

FIGURE 14.2 Regression line traced according to the intersection of independent and dependent variables (the line y′ = 4 + 2x plotted for x from 0 to 7).

Interpretation of a linear regression graph may depend on visual analysis (the closer the line lies to the data points, the greater the predictability) and biological plausibility. One can also use the linear regression formula to predict mean arterial blood pressure (y) based on any value of body weight (x), replacing the 4 above with a, representing the y-intercept, and the 2 with b or −b, representing the slope:

y′ = a + bx

or

y′ = a + (−b)x

where y′ is the value to be predicted, a is the y-intercept, b is the ascending slope (positive direction), −b is the descending slope (negative direction), and x is the independent variable.
a and b are in fact regression coefficients, which must be calculated before finding y′. Their formulas are, respectively,

a = ȳ − b·x̄

b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²

where x̄ is the mean of the x values, and ȳ is the mean of the y values. By applying the data from Table 14.1, we have the following values:

b = [(20 − 142.8)(80 − 99.9) + (30 − 142.8)(78 − 99.9) + ⋯ + (90 − 142.8)(198 − 99.9) + (100 − 142.8)(149 − 99.9)]
    / [(20 − 142.8)² + (30 − 142.8)² + ⋯ + (90 − 142.8)² + (100 − 142.8)²]
  = 0.235

a = 99.9 − (0.235 × 142.8) = 66.2
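These two formulas translate directly into code; a minimal sketch (the function name and the illustrative data are our own, chosen so that the points lie exactly on a known line):

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares regression line y' = a + b*x:
    b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2),  a = y_bar - b*x_bar."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b

# Points lying exactly on y = 4 + 2x recover a = 4, b = 2
a, b = fit_line([0, 1, 2, 3], [4, 6, 8, 10])
print(a, b)   # 4.0 2.0
```

Applied to the x and y columns of Table 14.1, the same function would reproduce the chapter’s a and b (up to rounding).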

FIGURE 14.3 Regression line traced on the scatterplot from Figure 14.1 (x-axis: body weight, kg; y-axis: mean arterial blood pressure, mmHg).

Suppose we want to predict mean arterial blood pressure for a particular subject from our cohort. His weight is 129 kg:

y′ = a + bx
y′ = 66.2 + 0.235 × 129 = 96.5 mmHg

Nevertheless, it is also necessary to determine how much of this mean arterial blood pressure (y′) is explainable by his body weight (x), according to this model. This is achievable through the coefficient of determination (R², read as “R squared”):

R² = r²

where r is Pearson’s correlation coefficient.

R² = 0.57² = 0.32, or 32%

This figure means that, of all possible reasons for the patient to present a 96.5 mmHg mean arterial blood pressure (y′), 32% can be explained by his 129-kg body weight (x), according to the linear regression model.
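Using the coefficients from the worked example above (a = 66.2, b = 0.235, r = 0.57), the prediction and its coefficient of determination can be sketched as:

```python
# Coefficients taken from the chapter's worked example (Table 14.1 data)
a, b = 66.2, 0.235
r = 0.57

weight = 129                      # kg, the subject's body weight (x)
y_pred = a + b * weight           # predicted mean arterial pressure (y')
r_squared = r ** 2                # share of y' explained by x

print(round(y_pred, 1))           # 96.5 (mmHg)
print(round(r_squared, 2))        # 0.32, i.e., 32%
```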

14.3 MULTIPLE LINEAR REGRESSION


Sometimes, two or more independent variables are available for predicting the dependent variable. These independent variables can be considered in combination for the dependent variable estimate through multivariable analysis (Chapter 3). Plenty of tools are available for performing multivariable analysis, and we focus on multiple linear regression to demonstrate this type of resource. Let us detail multiple linear regression by extending the example used previously (Table 14.2).
By scatterplotting these data together and adding a trend line for each independent variable, we can verify that x1 and x2 have different effects on y, as represented in Figure 14.4.
Notice how the second independent variable, diastolic blood pressure (x2), generates by itself a steeper slope toward greater mean arterial blood pressure values. Hence, if we want to predict mean arterial blood pressure in a more “realistic” way, it would be advisable to consider both independent variables combined. If the investigator decides to do so, he or she will have two options, according to the presence or absence of interaction between (or among) the independent variables:
- Independent variables do not interact.
  In this case, the multiple linear regression formula can be directly applied. With this resource, the dependent variable tends to change linearly (i.e., proportionally) with the weighted sum of the independent variables. In the multiple linear regression formula, partial regression coefficients (b1, b2, …, bk) are used instead of regression coefficients:

  y′ = a + b1·x1 + b2·x2 + ⋯ + bk·xk

TABLE 14.2 Data from a Cohort of 26 Overweight Middle-Aged Men, Considering Diastolic Blood Pressure as a Second Independent Variable

Patient No.   Body Weight (kg) (x1)   Diastolic Blood Pressure (mmHg) (x2)   Mean Arterial Blood Pressure (mmHg) (y)
1             20                      90                                      80
2             30                      82                                      78
3             40                      99                                      90
4             50                      99                                      92
5             60                      80                                      76
6             70                      82                                      78
7             80                      95                                      86
8             90                      80                                      76
9             100                     110                                     108
10            110                     80                                      74
11            120                     89                                      85
12            130                     115                                     108
13            140                     120                                     110
14            150                     100                                     88
15            160                     100                                     90
16            170                     90                                      80
17            180                     129                                     118
18            20                      160                                     150
19            30                      99                                      89
20            40                      100                                     90
21            50                      85                                      75
22            60                      88                                      78
23            70                      118                                     108
24            80                      153                                     145
25            90                      202                                     198
26            100                     180                                     149
Mean          142.8                   108.6                                   99.9

FIGURE 14.4 Regression lines for two independent variables, x1 and x2, traced on a scatterplot built for Table 14.2 (the continuous line represents x1, and the dashed line represents x2; x-axis: body weight, kg; y-axis: mean arterial blood pressure, mmHg).

where y′ is the value to be predicted (dependent variable), a is the y-intercept, b1 is the partial regression coefficient for x1 (independent variable 1), b2 is the partial regression coefficient for x2 (independent variable 2), and bk is the partial regression coefficient for xk, a given independent variable.
  In this setting, changes in one independent variable, such as x1, change y′ on their own and not through indirect influence on other independent variables. In the medical sciences, this situation represents a minority of cases, and the current example is no exception.
- Independent variables interact.
  In the previous example, body weight and diastolic blood pressure (our so-called “independent” variables) are in fact mutually influenced. It is known that patients with an elevated body weight are prone to present a higher diastolic blood pressure. Therefore, body weight would be an “independent variable” to diastolic blood pressure, now the “dependent variable.” In this scenario, the multiple linear regression formula cannot be applied directly. Partial regression coefficients must be mathematically adjusted according to the strength of the influence that the so-called independent variables exert on the other independent variables these coefficients represent (a method detailed elsewhere). This strength must be statistically validated before including the adjusted partial regression coefficients in the analysis. In the medical sciences, this situation represents the majority of cases.
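For the no-interaction case, the coefficients a, b1, and b2 can be obtained by ordinary least squares. A minimal sketch (function names and synthetic data are our own, not the cohort’s) solves the normal equations (XᵀX)β = Xᵀy for two predictors:

```python
def solve(A, v):
    """Solve the small square linear system A @ x = v by Gauss-Jordan
    elimination with partial pivoting (A given as nested lists)."""
    n = len(A)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def multiple_regression(x1s, x2s, ys):
    """Return (a, b1, b2) for y' = a + b1*x1 + b2*x2 via the
    normal equations (X^T X) beta = X^T y."""
    X = [[1.0, x1, x2] for x1, x2 in zip(x1s, x2s)]
    XtX = [[sum(X[i][p] * X[i][q] for i in range(len(X))) for q in range(3)]
           for p in range(3)]
    Xty = [sum(X[i][p] * ys[i] for i in range(len(X))) for p in range(3)]
    return solve(XtX, Xty)

# Synthetic data generated exactly as y = 5 + 2*x1 + 0.5*x2
x1s = [1, 2, 3, 4, 5, 6]
x2s = [2, 1, 4, 3, 6, 5]
ys = [5 + 2 * x1 + 0.5 * x2 for x1, x2 in zip(x1s, x2s)]
a, b1, b2 = multiple_regression(x1s, x2s, ys)
print(round(a, 6), round(b1, 6), round(b2, 6))   # 5.0 2.0 0.5
```

Note that this sketch treats the predictors as non-interacting; the coefficient adjustment for interacting predictors described above is not performed here.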
Chapter | 14 Correlation and Regression 175

APPENDIX 14.1: HOW TO BUILD A SCATTERPLOT AND TO ADD A TREND LINE USING MICROSOFT EXCEL

Step 1
Tabulate your data, and select them, disregarding subject identification and column names. Click on the Insert tab and then on the Scatter button in the Charts area. Select Scatter with only Markers.

Step 2
Select the Vertical (Value) Axis Major Gridlines and the Series 1 box (both not illustrated). Hit the delete key.

Step 3
Click on the Layout tab and then on the Axis Titles button in the Labels area. Click on the Primary Horizontal Axis Title drop-down menu button and then on the Title Below Axis secondary drop-down menu button for the horizontal axis title box. Add the desired title.

Click on the Primary Vertical Axis Title drop-down menu button and then on the Rotated Title secondary drop-down menu button for the vertical axis title box. Add the desired title.

Step 4
Select the chart you have created, click on the Layout tab, and then click on
the Trendline button in the Analysis area. Click on the Linear Trendline
button in the drop-down menu. The trend line is shown on the chart.

APPENDIX 14.2: HOW TO CALCULATE PEARSON’S CORRELATION COEFFICIENT USING MICROSOFT EXCEL

Step 1
Tabulate your data and select the cell immediately below the columns of arguments for which you want to calculate Pearson’s correlation coefficient. Click on the Formulas tab and then on the More Functions button in the Function Library area. Click on the Statistical drop-down menu button and then on the PEARSON secondary drop-down menu button for the Function Arguments dialog box.

Step 2
Pearson’s formula is shown in the C28 cell. Put your cursor in the Array 1
field and then select the independent variables array (B2:B27) on the sheet
itself. The array is shown in the Array 1 field. Put your cursor in the Array
2 field and then select the dependent variables array (C2:C27) on the sheet
itself. The array is shown in the Array 2 field. Pearson’s correlation coeffi-
cient is shown in the result space.

APPENDIX 14.3: HOW TO PREDICT A DEPENDENT VARIABLE USING MICROSOFT EXCEL

Step 1
Tabulate your data and select the cell immediately below the dependent variable column. Click on the Formulas tab and then on the More Functions button in the Function Library area. Click on the Statistical drop-down menu button and then on the TREND secondary drop-down menu button for the Function Arguments dialog box.

Step 2
The TREND formula is shown in the C28 cell. Put your cursor in the Known_y’s field and then select the dependent variables array (B2:B27) on the sheet itself. The array is shown in the Known_y’s field. Put your cursor in the Known_x’s field and then select the independent variables array (C2:C27) on the sheet itself. The array is shown in the Known_x’s field. Leave the Const field blank. The dependent variable is shown in the result space.
