Unit 3 - Notes

FUNDAMENTALS OF HEALTHCARE ANALYTICS REGULATION 2021

Unit 3 - REGRESSION AND CORRELATION ANALYSIS

Regression

Regression analysis is an important statistical method for the analysis of medical data. It enables
the identification and characterization of relationships among multiple factors. It also enables the
identification of prognostically relevant risk factors and the calculation of risk scores for
individual prognostication.

Methods

This article is based on selected textbooks of statistics, a selective review of the literature, and
our own experience.

Results

After a brief introduction of the uni- and multivariable regression models, illustrative examples
are given to explain what the important considerations are before a regression analysis is
performed, and how the results should be interpreted. The reader should then be able to judge
whether the method has been used correctly and interpret the results appropriately.

Conclusion

The performance and interpretation of linear regression analysis are subject to a variety of
pitfalls, which are discussed here in detail. The reader is made aware of common errors of
interpretation through practical examples. Both the opportunities for applying linear regression
analysis and its limitations are presented.

The purpose of statistical evaluation of medical data is often to describe relationships between
two variables or among several variables. For example, one would like to know not just whether
patients have high blood pressure, but also whether the likelihood of having high blood pressure
is influenced by factors such as age and weight. The variable to be explained (blood pressure) is
called the dependent variable, or, alternatively, the response variable; the variables that explain it
(age, weight) are called independent variables or predictor variables. Measures of association
provide an initial impression of the extent of statistical dependence between variables. If the
dependent and independent variables are continuous, as is the case for blood pressure and
weight, then a correlation coefficient can be calculated as a measure of the strength of the
relationship between them (box 1).
Interpretation of the correlation coefficient (r)
Spearman’s coefficient:

Describes a monotone relationship

A monotone relationship is one in which the dependent variable either rises or falls consistently
as the independent variable rises.

Pearson’s correlation coefficient:

Describes a linear relationship

Interpretation/meaning:

Correlation coefficients provide information about the strength and direction of a relationship
between two continuous variables. No distinction between the explaining variable and the
variable to be explained is necessary:

 r = ± 1: perfect linear and monotone relationship. The closer r is to 1 or –1, the stronger
the relationship.
 r = 0: no linear or monotone relationship
 r < 0: negative, inverse relationship (high values of one variable tend to occur together
with low values of the other variable)
 r > 0: positive relationship (high values of one variable tend to occur together with high
values of the other variable)
Graphical representation of a linear relationship:

Scatter plot with regression line

A negative relationship is represented by a falling regression line (regression coefficient b < 0), a
positive one by a rising regression line (b > 0).

Regression analysis is a type of statistical evaluation that enables three things:

 Description: Relationships among the dependent variables and the independent variables
can be statistically described by means of regression analysis.
 Estimation: The values of the dependent variables can be estimated from the observed
values of the independent variables.
 Prognostication: Risk factors that influence the outcome can be identified, and individual
prognoses can be determined.
Regression analysis employs a model that describes the relationships between the dependent
variables and the independent variables in a simplified mathematical form. There may be
biological reasons to expect a priori that a certain type of mathematical function will best
describe such a relationship, or simple assumptions have to be made that this is the case (e.g.,
that blood pressure rises linearly with age). The best-known types of regression analysis are the
following (table 1)

Table 1
Regression models

Model | Application | Dependent variable | Independent variables
Linear regression | Description of a linear relationship | Continuous (weight, blood pressure) | Continuous and/or categorical
Logistic regression | Prediction of the probability of belonging to groups (outcome: yes/no) | Dichotomous (success of treatment: yes/no) | Continuous and/or categorical
Proportional hazard regression (Cox regression) | Modeling of survival data | Survival time (time from diagnosis to event) | Continuous and/or categorical
Poisson regression | Modeling of counting processes | Counting data: whole numbers representing events in temporal sequence (e.g., the number of times a woman gave birth over a certain period of time) | Continuous and/or categorical

 Linear regression,
 Logistic regression, and
 Cox regression.

The goal of this article is to introduce the reader to linear regression. The theory is briefly
explained, and the interpretation of statistical parameters is illustrated with examples. The
methods of regression analysis are comprehensively discussed in many standard textbooks
(1– 3).
Methods

Linear regression is used to study the linear relationship between a dependent variable Y (blood
pressure) and one or more independent variables X (age, weight, sex).

The dependent variable Y must be continuous, while the independent variables may be either
continuous (age), binary (sex), or categorical (social status). The initial judgment of a possible
relationship between two continuous variables should always be made on the basis of a scatter
plot (scatter graph). This type of plot will show whether the relationship is linear (figure 1) or
nonlinear (figure 2).

Figure 1
A scatter plot showing a linear relationship
Figure 2
A scatter plot showing an exponential relationship. In this case, it would not be appropriate to
compute a coefficient of determination or a regression line

Performing a linear regression makes sense only if the relationship is linear. Other methods must
be used to study nonlinear relationships. The variable transformations and other, more complex
techniques that can be used for this purpose will not be discussed in this article.


Univariable linear regression

Univariable linear regression studies the linear relationship between the dependent variable Y
and a single independent variable X. The linear regression model describes the dependent
variable with a straight line that is defined by the equation Y = a + b × X, where a is the y-
intersect of the line, and b is its slope. First, the parameters a and b of the regression line are
estimated from the values of the dependent variable Y and the independent variable X with the
aid of statistical methods. The regression line enables one to predict the value of the dependent
variable Y from that of the independent variable X. Thus, for example, after a linear regression
has been performed, one would be able to estimate a person’s weight (dependent variable) from
his or her height (independent variable) (figure 3).
Figure 3
A scatter plot and the corresponding regression line and regression equation for the relationship
between the dependent variable body weight (kg) and the independent variable height (m).
r = Pearson’s correlation coefficient
R-squared linear = coefficient of determination

The slope b of the regression line is called the regression coefficient. It provides a measure of the
contribution of the independent variable X toward explaining the dependent variable Y. If the
independent variable is continuous (e.g., body height in centimeters), then the regression
coefficient represents the change in the dependent variable (body weight in kilograms) per unit
of change in the independent variable (body height in centimeters). The proper interpretation of
the regression coefficient thus requires attention to the units of measurement. The following
example should make this relationship clear:

In a fictitious study, data were obtained from 135 women and men aged 18 to 27. Their height
ranged from 1.59 to 1.93 meters. The relationship between height and weight was studied:
weight in kilograms was the dependent variable that was to be estimated from the independent
variable, height in centimeters. On the basis of the data, the following regression line was
determined: Y= –133.18 + 1.16 × X, where X is height in centimeters and Y is weight in
kilograms. The y-intersect a = –133.18 is the value of the dependent variable when X = 0, but X
cannot possibly take on the value 0 in this study (one obviously cannot expect a person of height
0 centimeters to weigh negative 133.18 kilograms). Therefore, interpretation of the constant is
often not useful. In general, only values within the range of observations of the independent
variables should be used in a linear regression model; prediction of the value of the dependent
variable becomes increasingly inaccurate the further one goes outside this range.

The regression coefficient of 1.16 means that, in this model, a person’s weight increases by 1.16
kg with each additional centimeter of height. If height had been measured in meters, rather than
in centimeters, the regression coefficient b would have been 115.91 instead. The constant a, in
contrast, is independent of the unit chosen to express the independent variables. Proper
interpretation thus requires that the regression coefficient should be considered together with the
units of all of the involved variables. Special attention to this issue is needed when publications
from different countries use different units to express the same variables (e.g., feet and inches vs.
centimeters, or pounds vs. kilograms).

Figure 3 shows the regression line that represents the linear relationship between height and
weight.

For a person whose height is 1.74 m, the predicted weight is 68.50 kg (y = –133.18 + 115.91 ×
1.74 m). The data set contains 6 persons whose height is 1.74 m, and their weights vary from 63
to 75 kg.
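As a quick check, the prediction for a height of 1.74 m can be reproduced directly from the equation reported above. The following minimal Python sketch uses only the rounded coefficients quoted in the text, so the result is approximate:

# Regression equation reported above: weight = -133.18 + 115.91 * height (height in metres)
a = -133.18   # intercept (kg)
b = 115.91    # slope (kg per metre of height)

height_m = 1.74
predicted_weight = a + b * height_m
print(f"Predicted weight for {height_m} m: {predicted_weight:.2f} kg")
# -> approximately 68.50 kg, within the 63-75 kg observed for persons of this height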

Linear regression can be used to estimate the weight of any persons whose height lies within the
observed range (1.59 m to 1.93 m). The data set need not include any person with this precise
height. Mathematically it is possible to estimate the weight of a person whose height is outside
the range of values observed in the study. However, such an extrapolation is generally not useful.

If the independent variables are categorical or binary, then the regression coefficient must be
interpreted in reference to the numerical encoding of these variables. Binary variables should
generally be encoded with two consecutive whole numbers (usually 0/1 or 1/2). In interpreting
the regression coefficient, one should recall which category of the independent variable is
represented by the higher number (e.g., 2, when the encoding is 1/2). The regression coefficient
reflects the change in the dependent variable that corresponds to a change in the independent
variable from 1 to 2.

For example, if one studies the relationship between sex and weight, one obtains the regression
line Y = 47.64 + 14.93 × X, where X = sex (1 = female, 2 = male). The regression coefficient of
14.93 reflects the fact that men are an average of 14.93 kg heavier than women.

When categorical variables are used, the reference category should be defined first, and all other
categories are to be considered in relation to this category.
The coefficient of determination, r2, is a measure of how well the regression model describes the
observed data (Box 2). In univariable regression analysis, r2 is simply the square of Pearson’s
correlation coefficient. In the particular fictitious case that is described above, the coefficient of
determination for the relationship between height and weight is 0.785. This means that 78.5% of
the variance in weight is due to height. The remaining 21.5% is due to individual variation and
might be explained by other factors that were not taken into account in the analysis, such as
eating habits, exercise, sex, or age.

Box 2

Coefficient of determination (R-squared)


Definition:

Let

 n be the number of observations (e.g., subjects in the study)


 ŷi be the estimated value of the dependent variable for the ith observation, as computed
with the regression equation
 yi be the observed value of the dependent variable for the ith observation
 ȳ be the mean of all n observations of the dependent variable

The coefficient of determination is then defined as:

    r² = Σ (ŷi − ȳ)² / Σ (yi − ȳ)²

that is, the variation explained by the regression divided by the total variation of the observed values.

In formal terms, the null hypothesis that b = 0 (no relationship between the variables, so the
regression coefficient is 0) can be tested with a t-test. One can also
compute the 95% confidence interval for the regression coefficient.
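The quantities in Box 2, the t-test of the null hypothesis b = 0, and the 95% confidence interval for b can be illustrated with a short sketch. The paired data below are hypothetical; scipy.stats.linregress reports the two-sided p value of exactly this t-test:

import numpy as np
from scipy import stats

# Hypothetical paired observations (independent variable x, dependent variable y)
x = np.array([1.60, 1.65, 1.70, 1.75, 1.80, 1.85, 1.90])
y = np.array([55.0, 58.0, 64.0, 67.0, 72.0, 74.0, 80.0])

res = stats.linregress(x, y)

# Coefficient of determination as defined in Box 2
y_hat = res.intercept + res.slope * x
r_squared = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2)

# t-test of b = 0 (p value from linregress) and 95% confidence interval for b
t_crit = stats.t.ppf(0.975, df=len(x) - 2)
ci_low, ci_high = res.slope - t_crit * res.stderr, res.slope + t_crit * res.stderr

print(f"b = {res.slope:.2f}, p = {res.pvalue:.4f}, r^2 = {r_squared:.3f}")
print(f"95% CI for b: [{ci_low:.2f}, {ci_high:.2f}]")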

Multivariable linear regression

In many cases, the contribution of a single independent variable does not alone suffice to explain
the dependent variable Y. If this is so, one can perform a multivariable linear regression to study
the effect of multiple variables on the dependent variable.

In the multivariable regression model, the dependent variable is described as a linear function of
the independent variables Xi, as follows: Y = a + b1 × X1 + b2 × X2 +…+ bn × Xn. The model
permits the computation of a regression coefficient bi for each independent variable Xi.
Box 3

Regression line for a multivariable regression


Y= a + b1 × X1 + b2 × X2+ …+ bn × Xn,

where

Y = dependent variable

Xi = independent variables

a = constant (y-intersect)

bi= regression coefficient of the variable Xi


Example: regression line for a multivariable regression Y = –120.07 + 100.81 × X1 + 0.38 ×
X2 + 3.41 × X3,

where

X1 = height (meters)
X2 = age (years)
X3 = sex (1 = female, 2 = male)

Y = the weight to be estimated (kg)
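For illustration, the multivariable equation above can be evaluated for a hypothetical person. The coefficients are those reported in Box 3; the person's height, age, and sex are invented:

# Multivariable regression equation from Box 3 (coefficients as reported above)
def predict_weight(height_m, age_years, sex):   # sex: 1 = female, 2 = male
    return -120.07 + 100.81 * height_m + 0.38 * age_years + 3.41 * sex

# Hypothetical example: a 25-year-old man, 1.80 m tall
print(round(predict_weight(1.80, 25, 2), 1), "kg")   # roughly 77.7 kg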

Just as in univariable regression, the coefficient of determination describes the overall


relationship between the independent variables Xi (weight, age, body-mass index) and the
dependent variable Y (blood pressure). It corresponds to the square of the multiple correlation
coefficient, which is the correlation between Y and b1 × X1 + … + bn × Xn.

It is better practice, however, to give the corrected coefficient of determination, as discussed


in Box 2. Each of the coefficients bi reflects the effect of the corresponding individual
independent variable Xi on Y, where the potential influences of the remaining independent
variables on Xi have been taken into account, i.e., eliminated by an additional computation. Thus,
in a multiple regression analysis with age and sex as independent variables and weight as the
dependent variable, the adjusted regression coefficient for sex represents the amount of variation
in weight that is due to sex alone, after age has been taken into account. This is done by a
computation that adjusts for age, so that the effect of sex is not confounded by a simultaneously
operative age effect.
Box 4

Two important terms


 Confounder (in non-randomized studies): an independent variable that is associated, not
only with the dependent variable, but also with other independent variables. The presence
of confounders can distort the effect of the other independent variables. Age and sex are
frequent confounders.
 Adjustment: a statistical technique to eliminate the influence of one or more
confounders on the treatment effect. Example: Suppose that age is a confounding variable
in a study of the effect of treatment on a certain dependent variable. Adjustment for age
involves a computational procedure to mimic a situation in which the men and women in
the data set were of the same age. This computation eliminates the influence of age on the
treatment effect.

In this way, multivariable regression analysis permits the study of multiple independent variables
at the same time, with adjustment of their regression coefficients for possible confounding
effects between variables.

Multivariable analysis does more than describe a statistical relationship; it also permits
individual prognostication and the evaluation of the state of health of a given patient. A linear
regression model can be used, for instance, to determine the optimal values for respiratory
function tests depending on a person’s age, body-mass index (BMI), and sex. Comparing a
patient’s measured respiratory function with these computed optimal values yields a measure of
his or her state of health.

Medical questions often involve the effect of a very large number of factors (independent
variables). The goal of statistical analysis is to find out which of these factors truly have an effect
on the dependent variable. The art of statistical evaluation lies in finding the variables that best
explain the dependent variable.

One way to carry out a multivariable regression is to include all potentially relevant independent
variables in the model (complete model). The problem with this method is that the number of
observations that can practically be made is often less than the model requires. In general, the
number of observations should be at least 20 times greater than the number of variables under
study.

Moreover, if too many irrelevant variables are included in the model, overadjustment is likely to
be the result: that is, some of the irrelevant independent variables will be found to have an
apparent effect, purely by chance. The inclusion of irrelevant independent variables in the model
will indeed allow a better fit with the data set under study, but, because of random effects, the
findings will not generally be applicable outside of this data set (1). The inclusion of irrelevant
independent variables also strongly distorts the determination coefficient, so that it no longer
provides a useful index of the quality of fit between the model and the data (Box 2).
In the following sections, we will discuss how these problems can be circumvented.

The selection of variables

For the regression model to be robust and to explain Y as well as possible, it should include only
independent variables that explain a large portion of the variance in Y. Variable selection can be
performed so that only such independent variables are included (1).

Variable selection should be carried out on the basis of medical expert knowledge and a good
understanding of biometrics. This is optimally done as a collaborative effort of the physician-
researcher and the statistician. There are various methods of selecting variables:

Forward selection

Forward selection is a stepwise procedure that includes variables in the model as long as they
make an additional contribution toward explaining Y. This is done iteratively until there are no
variables left that make any appreciable contribution to Y.

Backward selection

Backward selection, on the other hand, starts with a model that contains all potentially relevant
independent variables. The variable whose removal worsens the prediction of the dependent
variable by the overall set of independent variables to the least extent is then removed from the
model. This procedure is iterated until no independent variables are left that can be removed
without markedly worsening the prediction of the dependent variable.

Stepwise selection

Stepwise selection combines certain aspects of forward and backward selection. Like forward
selection, it begins with a null model, adds the single independent variable that makes the
greatest contribution toward explaining the dependent variable, and then iterates the process.
Additionally, a check is performed after each such step to see whether one of the variables has
now become irrelevant because of its relationship to the other variables. If so, this variable is
removed.

Block inclusion

There are often variables that should be included in the model in any case—for example, the
effect of a certain form of treatment, or independent variables that have already been found to be
relevant in prior studies. One way of taking such variables into account is their block inclusion
into the model. In this way, one can combine the forced inclusion of some variables with the
selective inclusion of further independent variables that turn out to be relevant to the explanation
of variation in the dependent variable.

The evaluation of a regression model requires the performance of both forward and backward
selection of variables. If these two procedures result in the selection of the same set of variables,
then the model can be considered robust. If not, a statistician should be consulted for further
advice.
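As a sketch of how forward and backward selection might be run in practice, the example below uses scikit-learn's SequentialFeatureSelector (assuming a recent scikit-learn version). The data set, the variable names, and the fixed number of variables to keep are all hypothetical choices made here for illustration; fixing the number of variables is a simplification of the contribution criterion described above:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SequentialFeatureSelector

# Hypothetical data: weight depends on height, age, and sex, but not on the noise column
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "height": rng.normal(1.75, 0.08, n),
    "age": rng.integers(18, 28, n),
    "sex": rng.integers(1, 3, n),      # 1 = female, 2 = male
    "noise": rng.normal(0, 1, n),      # irrelevant variable
})
df["weight"] = -120 + 100 * df["height"] + 0.4 * df["age"] + 3.5 * df["sex"] + rng.normal(0, 4, n)

X, y = df[["height", "age", "sex", "noise"]], df["weight"]
model = LinearRegression()

forward = SequentialFeatureSelector(model, direction="forward", n_features_to_select=3).fit(X, y)
backward = SequentialFeatureSelector(model, direction="backward", n_features_to_select=3).fit(X, y)

print("Forward selection kept: ", list(X.columns[forward.get_support()]))
print("Backward selection kept:", list(X.columns[backward.get_support()]))
# If the two procedures keep the same variables, the model can be considered robust (see above).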

The study of relationships between variables and the generation of risk scores are very important
elements of medical research. The proper performance of regression analysis requires that a
number of important factors should be considered and tested:

1. Causality

Before a regression analysis is performed, the causal relationships among the variables to be
considered must be examined from the point of view of their content and/or temporal
relationship. The fact that an independent variable turns out to be significant says nothing about
causality. This is an especially relevant point with respect to observational studies.

2. Planning of sample size

The number of cases needed for a regression analysis depends on the number of independent
variables and of their expected effects (strength of relationships). If the sample is too small, only
very strong relationships will be demonstrable. The sample size can be planned in the light of the
researchers’ expectations regarding the coefficient of determination (r2) and the regression
coefficient (b). Furthermore, at least 20 times as many observations should be made as there are
independent variables to be studied; thus, if one wants to study 2 independent variables, one
should make at least 40 observations.

3. Missing values

Missing values are a common problem in medical data. Whenever the value of either a
dependent or an independent variable is missing, this particular observation has to be excluded
from the regression analysis. If many values are missing from the dataset, the effective sample
size will be appreciably diminished, and the sample may then turn out to be too small to yield
significant findings, despite seemingly adequate advance planning. If this happens, real
relationships can be overlooked, and the study findings may not be generally applicable.
Moreover, selection effects can be expected in such cases. There are a number of ways to deal
with the problem of missing values.
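A minimal sketch of the simplest approach, listwise deletion of incomplete observations with pandas, is shown below. The data are hypothetical; other approaches, such as imputation, are not shown:

import numpy as np
import pandas as pd

# Hypothetical data set with missing values (np.nan) in both variables
df = pd.DataFrame({
    "height": [1.62, 1.75, np.nan, 1.80, 1.69, 1.91],
    "weight": [58.0, 72.0, 66.0, np.nan, 61.0, 88.0],
})

# Listwise deletion: any observation with a missing dependent or independent value
# is excluded from the regression analysis
complete = df.dropna()
print(f"Effective sample size: {len(complete)} of {len(df)} observations")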
4. The data sample

A further important point to be considered is the composition of the study population. If there are
subpopulations within it that behave differently with respect to the independent variables in
question, then a real effect (or the lack of an effect) may be masked from the analysis and remain
undetected. Suppose, for instance, that one wishes to study the effect of sex on weight, in a study
population consisting half of children under age 8 and half of adults. Linear regression analysis
over the entire population reveals an effect of sex on weight. If, however, a subgroup analysis is
performed in which children and adults are considered separately, an effect of sex on weight is
seen only in adults, and not in children. Subgroup analysis should only be performed if the
subgroups have been predefined, and the questions already formulated, before the data analysis
begins; furthermore, multiple testing should be taken into account.

5. The selection of variables

If multiple independent variables are considered in a multivariable regression, some of these may
turn out to be interdependent. An independent variable that would be found to have a strong
effect in a univariable regression model might not turn out to have any appreciable effect in a
multivariable regression with variable selection. This will happen if this particular variable itself
depends so strongly on the other independent variables that it makes no additional contribution
toward explaining the dependent variable. For related reasons, when the independent variables
are mutually dependent, different independent variables might end up being included in the
model depending on the particular technique that is used for variable selection.

Overview

Linear regression is an important tool for statistical analysis. Its broad spectrum of uses includes
relationship description, estimation, and prognostication. The technique has many applications,
but it also has prerequisites and limitations that must always be considered in the interpretation
of findings (Box 5).

Box 5

What special points require attention in the interpretation of a regression analysis?


1. How big is the study sample?
2. Is causality demonstrable or plausible, in view of the content or temporal relationship of
the variables?
3. Has there been adjustment for potential confounding effects?
4. Is the inclusion of the independent variables that were used justified, in view of their
content?
5. What is the corrected coefficient of determination (R-squared)?
6. Is the study sample homogeneous?
7. In what units were the potentially relevant independent variables reported?
8. Was a selection of the independent variables (potentially relevant independent variables)
performed, and, if so, what kind of selection?
9. If a selection of variables was performed, was its result confirmed by a second selection
of variables that was performed by a different procedure?
10. Are predictions of the dependent variable made on the basis of extrapolated data?

r² is the fraction of the overall variance that is explained. The closer the regression model’s
estimated values ŷi lie to the observed values yi, the nearer the coefficient of determination is to 1
and the more accurate the regression model is.
Meaning: In practice, the coefficient of determination is often taken as a measure of the validity
of a regression model or a regression estimate. It reflects the fraction of variation in the Y-values
that is explained by the regression line.
Problem: The coefficient of determination can easily be made artificially high by including a
large number of independent variables in the model. The more independent variables one
includes, the higher the coefficient of determination becomes. This, however, lowers the
precision of the estimate (estimation of the regression coefficients bi).
Solution: Instead of the raw (uncorrected) coefficient of determination, the corrected coefficient
of determination should be given: the latter takes the number of explanatory variables in the
model into account. Unlike the uncorrected coefficient of determination, the corrected one is
high only if the independent variables have a sufficiently large effect.
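A common form of the corrected (adjusted) coefficient of determination, for a model with n observations and k independent variables, is:

    adjusted r² = 1 − (1 − r²) · (n − 1) / (n − k − 1)

Unlike the raw r², this quantity can decrease when a variable that explains little additional variance is added to the model.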

In many studies, we measure more than one variable for each individual. For example, we measure precipitation
and plant growth, or number of young with nesting habitat, or soil erosion and volume of water.
We collect pairs of data and instead of examining each variable separately (univariate data), we
want to find ways to describe bivariate data, in which two variables are measured on each
subject in our sample. Given such data, we begin by determining if there is a relationship
between these two variables. As the values of one variable change, do we see corresponding
changes in the other variable?

We can describe the relationship between these two variables graphically and numerically. We
begin by considering the concept of correlation.
Correlation is defined as the statistical association between two variables.
A correlation exists between two variables when one of them is related to the other in some way.
A scatterplot is the best place to start. A scatterplot (or scatter diagram) is a graph of the paired
(x, y) sample data with a horizontal x-axis and a vertical y-axis. Each individual (x, y) pair is
plotted as a single point.

Figure 1. Scatterplot of chest girth versus length.

In this example, we plot bear chest girth (y) against bear length (x). When examining a
scatterplot, we should study the overall pattern of the plotted points. In this example, we see that
the value for chest girth does tend to increase as the value of length increases. We can see an
upward slope and a straight-line pattern in the plotted data points.

A scatterplot can identify several different types of relationships between two variables.

 A relationship has no correlation when the points on a scatterplot do not show any
pattern.

 A relationship is non-linear when the points on a scatterplot follow a pattern but not a
straight line.

 A relationship is linear when the points on a scatterplot follow a somewhat straight line
pattern. This is the relationship that we will examine.

Linear relationships can be either positive or negative. Positive relationships have points that
incline upwards to the right. As x values increase, y values increase. As x values
decrease, y values decrease. For example, when studying plants, height typically increases as
diameter increases.
Figure 2. Scatterplot of height versus diameter.

Negative relationships have points that decline downward to the right. As x values
increase, y values decrease. As x values decrease, y values increase. For example, as wind speed
increases, wind chill temperature decreases.

Figure 3. Scatterplot of temperature versus wind speed.

Non-linear relationships have an apparent pattern, just not linear. For example, as age increases
height increases up to a point then levels off after reaching a maximum height.

Figure 4. Scatterplot of height versus age.


When two variables have no relationship, there is no straight-line relationship or non-linear
relationship. When one variable changes, it does not influence the other variable.

Figure 5. Scatterplot of growth versus area.

Linear Correlation Coefficient

Because visual examinations are largely subjective, we need a more precise and objective
measure to define the correlation between the two variables. To quantify the strength and
direction of the relationship between two variables, we use the linear correlation coefficient:

    r = (1 / (n − 1)) Σ [(xi − x̄) / sx] · [(yi − ȳ) / sy]

where x̄ and sx are the sample mean and sample standard deviation of the x’s, ȳ and sy are
the mean and standard deviation of the y’s, and n is the sample size.

An equivalent computational form of the correlation coefficient is:

    r = [n Σ xi yi − (Σ xi)(Σ yi)] / √{[n Σ xi² − (Σ xi)²] · [n Σ yi² − (Σ yi)²]}
The linear correlation coefficient is also referred to as Pearson’s product moment correlation
coefficient in honor of Karl Pearson, who originally developed it. This statistic numerically
describes how strong the straight-line or linear relationship is between the two variables and the
direction, positive or negative.

The properties of “r”:

 It is always between -1 and +1.

 It is a unitless measure so “r” would be the same value whether you measured the two
variables in pounds and inches or in grams and centimeters.

 Positive values of “r” are associated with positive relationships.


 Negative values of “r” are associated with negative relationships.
Examples of Positive Correlation

Figure 6. Examples of positive


correlation.
Examples of Negative Correlation

Figure 7. Examples of
negative correlation.
Correlation is not causation!!! Just because two variables are correlated does not mean
that one variable causes another variable to change.

Examine these next two scatterplots. Both of these data sets have an r = 0.01, but they are very
different. Plot 1 shows little linear relationship between x and y variables. Plot 2 shows a strong
non-linear relationship. Pearson’s linear correlation coefficient only measures the strength and
direction of a linear relationship. Ignoring the scatterplot could result in a serious mistake when
describing the relationship between two variables.

Figure 8. Comparison of scatterplots.

When you investigate the relationship between two variables, always begin with a scatterplot.
This graph allows you to look for patterns (both linear and non-linear). The next step is to
quantitatively describe the strength and direction of the linear relationship using “r”. Once you
have established that a linear relationship exists, you can take the next step in model building.

Simple Linear Regression

Once we have identified two variables that are correlated, we would like to model this
relationship. We want to use one variable as a predictor or explanatory variable to explain the
other variable, the response or dependent variable. In order to do this, we need a good
relationship between our two variables. The model can then be used to predict changes in our
response variable. A strong relationship between the predictor variable and the response variable
leads to a good model.

Figure 9. Scatterplot with regression


model.
A simple linear regression model is a mathematical equation that allows us to predict a
response for a given predictor value.

Our model will take the form of ŷ = b0 + b1x, where b0 is the y-intercept, b1 is the slope, x is
the predictor variable, and ŷ is an estimate of the mean value of the response variable for any value
of the predictor variable.

The y-intercept is the predicted value for the response (y) when x = 0. The slope describes the
change in y for each one unit change in x. Let’s look at this example to clarify the interpretation
of the slope and intercept.
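The slope and intercept of the least-squares line are tied directly to the correlation coefficient: b1 = r · (sy / sx) and b0 = ȳ − b1 · x̄. A minimal sketch with hypothetical data:

import numpy as np

# Hypothetical paired data (predictor x, response y)
x = np.array([4.0, 5.5, 6.1, 7.2, 8.0, 9.4])
y = np.array([10.1, 12.0, 13.2, 15.0, 16.1, 18.9])

r = np.corrcoef(x, y)[0, 1]
b1 = r * y.std(ddof=1) / x.std(ddof=1)   # slope
b0 = y.mean() - b1 * x.mean()            # y-intercept

print(f"y-hat = {b0:.2f} + {b1:.2f} * x")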

What Is Correlation?
In general terms, correlation is a measure of the degree to which two variables change
simultaneously. The correlation coefficient describes the strength and direction of the
relationship.

You can calculate the correlation between two variables in a pandas DataFrame with the corr()
function. Let's look at the correlation between the height and the weight in the school dataset.
import pandas as pd
# Load the School dataset
df = pd.read_csv('school.csv')
# Apply the corr() function to the dataframe
# and select the weight and height correlation
r = df.corr()['height']['weight']
print("Correlation between height and weight is {:.2f}".format( r ))
Correlation between height and weight is 0.77

The result is a strong positive correlation, r = 0.77.


If you look at the relation between fuel efficiency and the weight of the car from the auto-mpg
dataset, you see that as cars become heavier, fuel efficiency decreases. To calculate the
correlation between these two variables, use the same code as above, and the result is a negative
correlation of r = -0.83.
df = pd.read_csv('auto-mpg.csv')
r = df.corr()['mpg']['weight']
print("Correlation between mpg and weight is {:.2f}".format( r ))
Correlation between mpg and weight is -0.83

To sum up:
 The correlation coefficient is always a number between -1 and 1.
 Positive correlation (0 < r ≤ 1) means that as one variable increases, the other also
increases. They move in the same direction.
 Negative correlation (−1 ≤ r < 0) means that as one variable increases, the other
decreases. They move in opposite directions.
 A zero correlation (r = 0) means that there is no linear relationship between the two
variables.

And two properties:

 Correlation of a variable with itself is always 1: corr(x, x) = 1


 Correlation is symmetric. For two variables x and
y: corr(x, y) = corr(y, x)

Use a Correlation Matrix


To calculate the correlation of two of the variables, we loaded the dataset into a pandas
DataFrame, and applied the corr() method to the df DataFrame itself. The corr() method
returns the correlation coefficients for all the numerical variables. The correlation matrix is
symmetric with diagonal 1.
Let's look at the whole correlation matrix for the school dataset.

df = pd.read_csv('school.csv')
# Apply the corr() function to the dataframe
print(df.corr())
age height weight
age 1.000000 0.648857 0.634636
height 0.648857 1.000000 0.774876
weight 0.634636 0.774876 1.000000
You can see that all the variables (age, weight, and height) are positively correlated; as one
increases, the other two variables also increase. You also see that weight is more correlated with
height than it is with age (r = 0.77 vs r = 0.63), which makes sense, as physical growth is a more
important factor of weight gain than age.

The Auto-mpg dataset shows a different type of result. Let's look at which variables are
correlated with the mpg fuel efficiency variable.
df = pd.read_csv('auto-mpg.csv')
print(df.corr()['mpg'])
mpg 1.000000
cylinders -0.777618
displacement -0.805127
horsepower -0.778427
weight -0.832244
acceleration 0.423329
year 0.580541
origin 0.565209
As you can see, there are four variables negatively correlated with mpg, and three positively
correlated to a lesser degree. Also, notice that the weight of the car has the strongest negative correlation with
its fuel efficiency (r = -0.83), and that fuel efficiency increased over the years from 1970 to 1982
(r = 0.58).

Use the Pearson Correlation


There are several ways to calculate the correlation between two variables. So far, we have used
the Pearson correlation, invented in the 1880s by Karl Pearson. It is the most frequent type of
correlation, and often the default one in scripting libraries such as pandas or NumPy.

Math Time!
Mathematically speaking, and just to illustrate how the correlation is calculated, the Pearson
correlation between two variables x = {x0, x1, …, xN−1} and y = {y0, y1, …, yN−1} of N samples each is defined as:

    r_xy = Σ_{i=0}^{N−1} (xi − x̄)(yi − ȳ) / [ √(Σ (xi − x̄)²) · √(Σ (yi − ȳ)²) ]

where x̄ (resp. ȳ) is the mean of x (resp. y): x̄ = (1/N) Σ_{i=0}^{N−1} xi
The formula above can be interpreted as the ratio between:

 How the variables vary together (the covariance between x𝑥 and y𝑦 ).


 How each variable varies individually (the variance of x multiplied by the variance of
y).
Assumptions
To be precise, the Pearson correlation is a measure of the linear correlation between two
continuous variables, and it relies on the two following assumptions:

 There is a linear relationship between the two variables i.e. y=ax+b.


 x and y both follow a normal distribution.
We will come back to the normal distribution in the next chapter. For now, just remember it as
being the bell curve.

Bell curves with expected value μ and variance σ2


It is often the case that you must calculate the correlation of variables that do not follow these
assumptions. So you need other ways of calculating the correlation of two variables.

Different Types of Correlations (And How to Choose!)


If you look at the pandas documentation for the corr() method, you can see that there are three
types of correlation available: Pearson (the default), Spearman, and Kendall.
 df.corr(method='pearson')
 df.corr(method='spearman')
 df.corr(method='kendall')
We won't go into the detail of how each of these correlations is defined. Suffice it to say that,
depending on the type of data you are working with, choosing Spearman over Pearson may be more
appropriate. The Kendall correlation can be seen as a variant of Spearman, and we'll leave it
aside.

Pearson is the default correlation. It is relevant when there is a linear relationship between the
variables, and the variables follow a normal distribution. The good news is that the Pearson
correlation is very robust concerning these two assumptions. If your variables are not strictly
linearly related or if their distribution does not exactly follow a normal distribution, the Pearson
correlation coefficient is still very often reliable.
However, if your data is nonlinear, as the variables in the two plots below, then it makes more
sense to use the Spearman correlation.

Pearson vs. Spearman on variables with nonlinear relations


The Spearman correlation doesn't make any assumptions on the distribution of the variables or
their relative linearity. It is calculated solely based on the order (rank) of the values of the two
variables. We won't go into the details of its computation here.
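A small sketch of this difference, using a hypothetical monotonic but nonlinear data set (the exponential relationship below is invented purely for illustration):

import numpy as np
import pandas as pd

# Hypothetical monotonic but nonlinear relationship: y grows exponentially with x
x = np.linspace(0, 5, 50)
df = pd.DataFrame({"x": x, "y": np.exp(x)})

print("Pearson :", round(df["x"].corr(df["y"], method="pearson"), 3))
print("Spearman:", round(df["x"].corr(df["y"], method="spearman"), 3))
# Spearman is exactly 1 (the relationship is perfectly monotonic), while Pearson is
# noticeably lower because the relationship is not linear.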

Correlation analysis is used to quantify the degree to which two variables are related, based on a set
of paired observations. The correlation coefficient tells you how much one
variable changes when the other variable changes. Correlation analysis can show you a linear
relationship between variables.

The possible values of the correlation coefficient or r can range from -1 (when there is a perfect
negative correlation) to + 1 (when there is a perfect positive correlation). The closer the r values
are to 0, the weaker the correlation (either negative or positive).

As an example, we can consider weight and diastolic BP. We can document weight in kilograms
as a continuous variable and diastolic BP in mm Hg as a continuous variable. We can explore if
weight and diastolic BP are correlated. If diastolic BP decreases as weight increases, we might
find a perfect negative correlation (higher weight associated with lower diastolic BP). If there is a unit
increase in diastolic BP as weight increases, we might find a perfect positive correlation (higher
weights associated with higher diastolic BP). Pragmatically, perfect negative or perfect positive
correlations are rare; what we usually get is somewhere in between. If weight and diastolic
BP show no pattern of relation, then we may find an r value closer to 0. Correlation does not imply
causation.
Pearson’s r is probably one of the most frequently used measures of agreement for continuous
variables in the biomedical literature, yet it is also one of the least appropriate tests for that purpose,
because it measures association rather than agreement.

Pearson’s r

The variables that are considered for Pearson’s correlation analysis preferably have a continuous
structure.
It is an index of linear association but does not necessarily indicate good agreement. It is insensitive
to systematic differences between two observers or readings.
The value of r is sensitive to the range of values and is usually higher when the spread of values
is larger. Pearson’s r is very sensitive to extreme values (outliers), which can change the r value
significantly.
The r can give you an idea of the strength (weak, strong) and direction (positive, negative, none)
of the relationship between the two variables.
To do a Pearson’s r, the following assumptions must be met

Each variable must be continuous


Both variables must be normally distributed
The two variables are assumed to have a linear relationship
The observations are paired observations
There are no significant outliers

Spearman’s Correlation, or Spearman’s ρ (pronounced rho), or rs

This is a nonparametric measure of rank correlation or correlation between the ranking or


ordering of two variables. It explores how well the relationship between two variables can be
described using a monotonic function. The variables can be continuous or ordinal, and the
relationship is assessed based on the ranked values for each variable rather than the raw data.

What is a monotonic relationship?

As the value of one variable increases, the value of the other variable increases
As the value of one variable increases, the other variable value decreases
However, not necessarily at a constant rate, whereas in a linear relationship the rate of
increase/decrease is constant.

Spearman’s ρ can thus be used for both linear and non-linear (monotonic) relationships.

Spearman’s ρ can be used with data that are normally distributed and with data that are not
normally distributed.

Spearman’s ρ works with rank-ordered values rather than raw data values, and hence it
measures the strength and direction of the monotonic relationship between the two ranked or
ordered variables.
Spearman’s ρ is less affected by outliers and hence can be used even in the presence of outlier
values.

Assumptions for Spearman’s ρ

The two variables must be measured on an ordinal, interval or ratio scale.


The variables represent paired observations.
There is a monotonic relationship between the variables (can be checked using a scatterplot).
Caution is needed in interpreting p values alongside r values. Even when the relationship is weak (r = 0.3 or
0.4, for example), the corresponding p value may be significant if the sample size is reasonably
large, and this may be misinterpreted as showing an important relationship.
It is more meaningful to look at and interpret the confidence limits of r rather than
the p value associated with the r value.

Types of correlation coefficients

There are two main types of correlation coefficients: Pearson's product moment correlation
coefficient and Spearman's rank correlation coefficient. The correct usage of correlation
coefficient type depends on the types of variables being studied. We will focus on these two
correlation types; other types are based on these and are often used when multiple variables are
being considered.

Pearson's product moment correlation coefficient

Pearson's product moment correlation coefficient is denoted as ϱ for a population parameter and
as r for a sample statistic. It is used when both variables being studied are normally distributed.
This coefficient is affected by extreme values, which may exaggerate or dampen the strength of
relationship, and is therefore inappropriate when either or both variables are not normally
distributed. For a correlation between variables x and y, the formula for calculating the sample
Pearson's correlation coefficient is given by

    r = Σ (xi − x̄)(yi − ȳ) / √[Σ (xi − x̄)² · Σ (yi − ȳ)²]

where xi and yi are the values of x and y for the ith individual, and x̄ and ȳ are the respective sample means.
Spearman's rank correlation coefficient

Spearman's rank correlation coefficient is denoted as ϱs for a population parameter and as rs for
a sample statistic. It is appropriate when one or both variables are skewed or ordinal and is
robust when extreme values are present. For a correlation between variables x and y, the formula
for calculating the sample Spearman's correlation coefficient is given by

    rs = 1 − (6 Σ di²) / [n(n² − 1)]

where di is the difference in ranks of x and y for the ith individual, and n is the number of paired observations.

The distinction between Pearson's and Spearman's correlation coefficients in applications will be
discussed using examples below.
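A minimal sketch comparing the rank-difference formula with scipy.stats.spearmanr on a small hypothetical data set (the formula in this simple form assumes there are no tied ranks):

import numpy as np
from scipy import stats

# Hypothetical paired observations
x = np.array([10, 20, 30, 40, 50])
y = np.array([1.2, 2.9, 2.7, 4.8, 5.1])

# Rank-difference formula (valid when there are no ties)
d = stats.rankdata(x) - stats.rankdata(y)
n = len(x)
rs_formula = 1 - 6 * np.sum(d**2) / (n * (n**2 - 1))

rs_scipy, p_value = stats.spearmanr(x, y)
print(rs_formula, rs_scipy)   # the two values agree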

Relationship between correlation coefficient and scatterplots using statistical simulations

The data depicted in figures 1–4 were simulated from a bivariate normal distribution of 500
observations with means 2 and 3 for the variables x and y respectively. The standard deviations
were 0.5 for x and 0.7 for y. Scatter plots were generated for the correlations 0.2, 0.5, 0.8 and
−0.8.
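A sketch of how such data could be simulated with NumPy; the random seed and the use of the multivariate normal generator are choices made here for illustration, using the means, standard deviations, and one of the correlations described above:

import numpy as np

# 500 observations, means 2 and 3, standard deviations 0.5 and 0.7, correlation 0.8
rng = np.random.default_rng(42)
r, sx, sy = 0.8, 0.5, 0.7
cov = [[sx**2, r * sx * sy],
       [r * sx * sy, sy**2]]
xy = rng.multivariate_normal(mean=[2, 3], cov=cov, size=500)

sample_r = np.corrcoef(xy[:, 0], xy[:, 1])[0, 1]
print(round(sample_r, 3))   # close to 0.8, apart from sampling variation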

Fig. 1
Scatterplot of x and y: Pearson's correlation=0.2
Fig. 4
Scatterplot of x and y: Pearson's correlation=−0.80

In Fig. 1, the scatter plot shows some linear trend, but the trend is not as clear as that of Fig. 2.
The trend in Fig. 3 is clearly seen, and the points are not as scattered as those of Figs. 1 and
2. That is, the higher the correlation in either direction (positive or negative), the more
linear the association between the two variables and the more obvious the trend in the scatter plot.
For Figures 3 and 4, the strength of the linear relationship is the same for the variables in
question, but the direction is different. In Figure 3, the values of y increase as the values of x
increase, while in Figure 4 the values of y decrease as the values of x increase.

Fig. 2
Scatterplot of x and y: Pearson's correlation=0.50
Fig. 3
Scatterplot of x and y: Pearson's correlation=0.80

Python program for Linear Regression of data:


import matplotlib.pyplot as plt
from scipy import stats

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

slope, intercept, r, p, std_err = stats.linregress(x, y)

def myfunc(x):
    return slope * x + intercept

mymodel = list(map(myfunc, x))

plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()

Output: a scatter plot of the data points with the fitted regression line drawn through them.

Explanation:
Import the modules you need.
import matplotlib.pyplot as plt
from scipy import stats
Create the arrays that represent the values of the x and y axis:
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
Execute a method that returns some important key values of Linear Regression:
slope, intercept, r, p, std_err = stats.linregress(x, y)
Create a function that uses the slope and intercept values to return a new value. This new value
represents where on the y-axis the corresponding x value will be placed:
def myfunc(x):
    return slope * x + intercept
Run each value of the x array through the function. This will result in a new array with new
values for the y-axis:
mymodel = list(map(myfunc, x))
Draw the original scatter plot:
plt.scatter(x, y)
Draw the line of linear regression:
plt.plot(x, mymodel)
Display the diagram:
plt.show()
R for Relationship
It is important to know what the relationship between the values of the x-axis and the values of
the y-axis is; if there is no relationship, linear regression cannot be used to predict anything.
This relationship - the coefficient of correlation - is called r.
The r value ranges from -1 to 1, where 0 means no relationship, and 1 (or -1) means a perfect
relationship.
Python and the Scipy module will compute this value for you, all you have to do is feed it with
the x and y values.
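Using the same data as in the program above, the r returned by linregress can be inspected directly; the printed value of about -0.76 indicates a fairly strong negative linear relationship:

from scipy import stats

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

slope, intercept, r, p, std_err = stats.linregress(x, y)
print(round(r, 2))   # approximately -0.76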

Python program for Logistic Regression (predicting a yes/no outcome):
import numpy
from sklearn import linear_model

# X represents tumor sizes; reshaped into a column vector because scikit-learn
# expects a 2-D feature array (hence reshape(-1, 1)).
X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1, 1)
# y represents whether the tumor is cancerous (0 = no, 1 = yes)
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

logr = linear_model.LogisticRegression()
logr.fit(X,y)

#predict if tumor is cancerous where the size is 3.46mm:


predicted = logr.predict(numpy.array([3.46]).reshape(-1,1))
print(predicted)

A second example: computing Pearson's correlation coefficient directly from its formula.

Explanation:
Import the NumPy library and define a custom dataset x and y of equal length:
# Import the numpy library
import numpy as np

# Define the dataset


x = np.array([1,3,5,7,8,9, 10, 15])
y = np.array([10, 20, 30, 40, 50, 60, 70, 80])
Define the correlation by applying the above formula:
def Pearson_correlation(X, Y):
    if len(X) == len(Y):
        Sum_xy = sum((X - X.mean()) * (Y - Y.mean()))
        Sum_x_squared = sum((X - X.mean())**2)
        Sum_y_squared = sum((Y - Y.mean())**2)
        corr = Sum_xy / np.sqrt(Sum_x_squared * Sum_y_squared)
        return corr

print(Pearson_correlation(x,y))
print(Pearson_correlation(x,x))
OUTPUT
0.974894414261588
1.0
The above output shows that the correlation between x and y is approximately 0.975, and the
correlation of x with itself is 1.0.
