Unit 3 - Notes
Regression
Regression analysis is an important statistical method for the analysis of medical data. It enables
the identification and characterization of relationships among multiple factors. It also enables the
identification of prognostically relevant risk factors and the calculation of risk scores for
individual prognostication.
Methods
This article is based on selected textbooks of statistics, a selective review of the literature, and
our own experience.
Results
After a brief introduction of the uni- and multivariable regression models, illustrative examples
are given to explain what the important considerations are before a regression analysis is
performed, and how the results should be interpreted. The reader should then be able to judge
whether the method has been used correctly and interpret the results appropriately.
Conclusion
The performance and interpretation of linear regression analysis are subject to a variety of
pitfalls, which are discussed here in detail. The reader is made aware of common errors of
interpretation through practical examples. Both the opportunities for applying linear regression
analysis and its limitations are presented.
The purpose of statistical evaluation of medical data is often to describe relationships between
two variables or among several variables. For example, one would like to know not just whether
patients have high blood pressure, but also whether the likelihood of having high blood pressure
is influenced by factors such as age and weight. The variable to be explained (blood pressure) is
called the dependent variable, or, alternatively, the response variable; the variables that explain it
(age, weight) are called independent variables or predictor variables. Measures of association
provide an initial impression of the extent of statistical dependence between variables. If the
dependent and independent variables are continuous, as is the case for blood pressure and
weight, then a correlation coefficient can be calculated as a measure of the strength of the
relationship between them (box 1).
Interpretation of the correlation coefficient (r)
Spearman’s coefficient:
A monotone relationship is one in which the dependent variable either rises or falls consistently
as the independent variable rises.
Interpretation/meaning:
Correlation coefficients provide information about the strength and direction of a relationship
between two continuous variables. No distinction between the explaining variable and the
variable to be explained is necessary:
r = ± 1: perfect linear and monotone relationship. The closer r is to 1 or –1, the stronger
the relationship.
r = 0: no linear or monotone relationship
r < 0: negative, inverse relationship (high values of one variable tend to occur together
with low values of the other variable)
r > 0: positive relationship (high values of one variable tend to occur together with high
values of the other variable)
Graphical representation of a linear relationship:
A negative relationship is represented by a falling regression line (regression coefficient b < 0), a
positive one by a rising regression line (b > 0).
Description: Relationships among the dependent variables and the independent variables
can be statistically described by means of regression analysis.
Estimation: The values of the dependent variables can be estimated from the observed
values of the independent variables.
Prognostication: Risk factors that influence the outcome can be identified, and individual
prognoses can be determined.
Regression analysis employs a model that describes the relationships between the dependent
variables and the independent variables in a simplified mathematical form. There may be
biological reasons to expect a priori that a certain type of mathematical function will best
describe such a relationship, or simple assumptions have to be made that this is the case (e.g.,
that blood pressure rises linearly with age). The best-known types of regression analysis are the
following (table 1)
Table 1
Regression models
Linear regression,
Logistic regression, and
Cox regression.
The goal of this article is to introduce the reader to linear regression. The theory is briefly
explained, and the interpretation of statistical parameters is illustrated with examples. The
methods of regression analysis are comprehensively discussed in many standard textbooks
(1– 3).
Methods
Linear regression is used to study the linear relationship between a dependent variable Y (blood
pressure) and one or more independent variables X (age, weight, sex).
The dependent variable Y must be continuous, while the independent variables may be either
continuous (age), binary (sex), or categorical (social status). The initial judgment of a possible
relationship between two continuous variables should always be made on the basis of a scatter
plot (scatter graph). This type of plot will show whether the relationship is linear (figure 1) or
nonlinear (figure 2).
Figure 1
A scatter plot showing a linear relationship
Figure 2
A scatter plot showing an exponential relationship. In this case, it would not be appropriate to
compute a coefficient of determination or a regression line
Performing a linear regression makes sense only if the relationship is linear. Other methods must
be used to study nonlinear relationships. The variable transformations and other, more complex
techniques that can be used for this purpose will not be discussed in this article.
Univariable linear regression studies the linear relationship between the dependent variable Y
and a single independent variable X. The linear regression model describes the dependent
variable with a straight line that is defined by the equation Y = a + b × X, where a is the y-
intersect of the line, and b is its slope. First, the parameters a and b of the regression line are
estimated from the values of the dependent variable Y and the independent variable X with the
aid of statistical methods. The regression line enables one to predict the value of the dependent
variable Y from that of the independent variable X. Thus, for example, after a linear regression
has been performed, one would be able to estimate a person’s weight (dependent variable) from
his or her height (independent variable) (figure 3).
Figure 3
A scatter plot and the corresponding regression line and regression equation for the relationship
between the dependent variable body weight (kg) and the independent variable height (m).
r = Pearson’s correlation coefficient
R-squared linear = coefficient of determination
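As an illustration of how the parameters a and b can be estimated in practice, here is a minimal sketch with made-up height/weight pairs (not the study data), using scipy's stats.linregress:

from scipy import stats

# Made-up paired observations: height in metres (X) and weight in kg (Y)
height = [1.62, 1.70, 1.74, 1.78, 1.85, 1.91]
weight = [58.0, 64.5, 68.0, 73.5, 80.0, 88.0]

res = stats.linregress(height, weight)   # least-squares estimate of the line Y = a + b * X
a, b = res.intercept, res.slope
print("a =", a, "b =", b)
print("predicted weight at 1.74 m:", a + b * 1.74)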
The slope b of the regression line is called the regression coefficient. It provides a measure of the
contribution of the independent variable X toward explaining the dependent variable Y. If the
independent variable is continuous (e.g., body height in centimeters), then the regression
coefficient represents the change in the dependent variable (body weight in kilograms) per unit
of change in the independent variable (body height in centimeters). The proper interpretation of
the regression coefficient thus requires attention to the units of measurement. The following
example should make this relationship clear:
In a fictitious study, data were obtained from 135 women and men aged 18 to 27. Their height
ranged from 1.59 to 1.93 meters. The relationship between height and weight was studied:
weight in kilograms was the dependent variable that was to be estimated from the independent
variable, height in centimeters. On the basis of the data, the following regression line was
determined: Y= –133.18 + 1.16 × X, where X is height in centimeters and Y is weight in
kilograms. The y-intersect a = –133.18 is the value of the dependent variable when X = 0, but X
cannot possibly take on the value 0 in this study (one obviously cannot expect a person of height
0 centimeters to weigh negative 133.18 kilograms). Therefore, interpretation of the constant is
often not useful. In general, only values within the range of observations of the independent
variables should be used in a linear regression model; prediction of the value of the dependent
variable becomes increasingly inaccurate the further one goes outside this range.
The regression coefficient of 1.16 means that, in this model, a person’s weight increases by 1.16
kg with each additional centimeter of height. If height had been measured in meters, rather than
in centimeters, the regression coefficient b would have been 115.91 instead. The constant a, in
contrast, is independent of the unit chosen to express the independent variables. Proper
interpretation thus requires that the regression coefficient should be considered together with the
units of all of the involved variables. Special attention to this issue is needed when publications
from different countries use different units to express the same variables (e.g., feet and inches vs.
centimeters, or pounds vs. kilograms).
Figure 3 shows the regression line that represents the linear relationship between height and
weight.
For a person whose height is 1.74 m, the predicted weight is 68.50 kg (y = –133.18 + 115.91 ×
1.74 m). The data set contains 6 persons whose height is 1.74 m, and their weights vary from 63
to 75 kg.
Linear regression can be used to estimate the weight of any persons whose height lies within the
observed range (1.59 m to 1.93 m). The data set need not include any person with this precise
height. Mathematically it is possible to estimate the weight of a person whose height is outside
the range of values observed in the study. However, such an extrapolation is generally not useful.
If the independent variables are categorical or binary, then the regression coefficient must be
interpreted in reference to the numerical encoding of these variables. Binary variables should
generally be encoded with two consecutive whole numbers (usually 0/1 or 1/2). In interpreting
the regression coefficient, one should recall which category of the independent variable is
represented by the higher number (e.g., 2, when the encoding is 1/2). The regression coefficient
reflects the change in the dependent variable that corresponds to a change in the independent
variable from 1 to 2.
For example, if one studies the relationship between sex and weight, one obtains the regression
line Y = 47.64 + 14.93 × X, where X = sex (1 = female, 2 = male). The regression coefficient of
14.93 reflects the fact that men are an average of 14.93 kg heavier than women.
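Plugging the two codes into this equation makes the interpretation concrete: for women (X = 1) the predicted weight is 47.64 + 14.93 × 1 = 62.57 kg, and for men (X = 2) it is 47.64 + 14.93 × 2 = 77.50 kg, so the difference between the two predicted values is exactly the regression coefficient of 14.93 kg.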
When categorical variables are used, the reference category should be defined first, and all other
categories are to be considered in relation to this category.
The coefficient of determination, r2, is a measure of how well the regression model describes the
observed data (Box 2). In univariable regression analysis, r2 is simply the square of Pearson’s
correlation coefficient. In the particular fictitious case that is described above, the coefficient of
determination for the relationship between height and weight is 0.785. This means that 78.5% of
the variance in weight is explained by height. The remaining 21.5% is due to individual variation and
might be explained by other factors that were not taken into account in the analysis, such as
eating habits, exercise, sex, or age.
Box 2
Let yi denote the observed values of the dependent variable, ŷi the values estimated by the
regression model, and ȳ the mean of the observed values. The coefficient of determination is
then defined as follows:
r2 = Σ (ŷi − ȳ)² / Σ (yi − ȳ)²
In formal terms, the null hypothesis, which is the hypothesis that b = 0 (no relationship between
variables, the regression coefficient is therefore 0), can be tested with a t-test. One can also
compute the 95% confidence interval for the regression coefficient.
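As an illustration, a minimal sketch with the statsmodels library (the made-up x and y arrays stand in for the observed independent and dependent values):

import numpy as np
import statsmodels.api as sm

x = np.array([1.62, 1.70, 1.74, 1.78, 1.85, 1.91])   # made-up independent values
y = np.array([58.0, 64.5, 68.0, 73.5, 80.0, 88.0])   # made-up dependent values

X = sm.add_constant(x)              # adds the constant a to the model
model = sm.OLS(y, X).fit()          # ordinary least squares fit
print(model.pvalues)                # t-test of the null hypothesis that each coefficient is 0
print(model.conf_int(alpha=0.05))   # 95% confidence intervals for a and b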
In many cases, the contribution of a single independent variable does not alone suffice to explain
the dependent variable Y. If this is so, one can perform a multivariable linear regression to study
the effect of multiple variables on the dependent variable.
In the multivariable regression model, the dependent variable is described as a linear function of
the independent variables Xi, as follows: Y = a + b1 × X1 + b2 × X2 +…+ bn × Xn. The model
permits the computation of a regression coefficient bi for each independent variable Xi.
Box 3
The multivariable regression model is
Y = a + b1 × X1 + b2 × X2 + … + bn × Xn
where
Y = dependent variable
Xi = independent variables
a = constant (y-intersect)
bi = regression coefficients
For example, a model estimating weight from several variables might take the form
Y = a + b1 × X1 + b2 × X2 + b3 × X3
where
X1 = height (meters)
X2 = age (years)
X3 = sex (1 = female, 2 = male)
In this way, multivariable regression analysis permits the study of multiple independent variables
at the same time, with adjustment of their regression coefficients for possible confounding
effects between variables.
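A minimal sketch of such a multivariable fit, assuming a hypothetical file study_data.csv with one row per person and the columns weight, height, age, and sex coded as in Box 3:

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv('study_data.csv')                 # hypothetical dataset
X = sm.add_constant(df[['height', 'age', 'sex']])  # independent variables X1..X3 plus the constant a
model = sm.OLS(df['weight'], X).fit()              # dependent variable Y = weight
print(model.params)                                # a and the adjusted coefficients b1, b2, b3
print(model.rsquared)                              # coefficient of determination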
Multivariable analysis does more than describe a statistical relationship; it also permits
individual prognostication and the evaluation of the state of health of a given patient. A linear
regression model can be used, for instance, to determine the optimal values for respiratory
function tests depending on a person’s age, body-mass index (BMI), and sex. Comparing a
patient’s measured respiratory function with these computed optimal values yields a measure of
his or her state of health.
Medical questions often involve the effect of a very large number of factors (independent
variables). The goal of statistical analysis is to find out which of these factors truly have an effect
on the dependent variable. The art of statistical evaluation lies in finding the variables that best
explain the dependent variable.
One way to carry out a multivariable regression is to include all potentially relevant independent
variables in the model (complete model). The problem with this method is that the number of
observations that can practically be made is often less than the model requires. In general, the
number of observations should be at least 20 times greater than the number of variables under
study.
Moreover, if too many irrelevant variables are included in the model, overadjustment is likely to
be the result: that is, some of the irrelevant independent variables will be found to have an
apparent effect, purely by chance. The inclusion of irrelevant independent variables in the model
will indeed allow a better fit with the data set under study, but, because of random effects, the
findings will not generally be applicable outside of this data set (1). The inclusion of irrelevant
independent variables also strongly distorts the coefficient of determination, so that it no longer
provides a useful index of the quality of fit between the model and the data (Box 2).
In the following sections, we will discuss how these problems can be circumvented.
For the regression model to be robust and to explain Y as well as possible, it should include only
independent variables that explain a large portion of the variance in Y. Variable selection can be
performed so that only such independent variables are included (1).
Variable selection should be carried out on the basis of medical expert knowledge and a good
understanding of biometrics. This is optimally done as a collaborative effort of the physician-
researcher and the statistician. There are various methods of selecting variables:
Forward selection
Forward selection is a stepwise procedure that includes variables in the model as long as they
make an additional contribution toward explaining Y. This is done iteratively until there are no
variables left that make any appreciable contribution to Y.
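One rough way to sketch this procedure in code is to add, at each step, the candidate whose coefficient has the smallest p-value, stopping when no remaining candidate reaches a chosen threshold. The data frame df, the response column name, and the candidate column names below are illustrative assumptions, and real selection criteria vary (p-values, adjusted r2, AIC):

import statsmodels.api as sm

def forward_selection(df, response, candidates, threshold=0.05):
    # Greedy forward selection based on the p-value of each candidate's coefficient
    selected, remaining = [], list(candidates)
    while remaining:
        pvals = {}
        for var in remaining:
            X = sm.add_constant(df[selected + [var]])
            pvals[var] = sm.OLS(df[response], X).fit().pvalues[var]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= threshold:
            break                      # no remaining variable contributes appreciably
        selected.append(best)
        remaining.remove(best)
    return selected

# e.g. forward_selection(df, 'weight', ['height', 'age', 'sex'])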
Backward selection
Backward selection, on the other hand, starts with a model that contains all potentially relevant
independent variables. The variable whose removal least worsens the prediction of the dependent
variable by the overall set of independent variables is then removed from the model. This
procedure is iterated until no independent variables are left that can be removed without
markedly worsening the prediction of the dependent variable.
Stepwise selection
Stepwise selection combines certain aspects of forward and backward selection. Like forward
selection, it begins with a null model, adds the single independent variable that makes the
greatest contribution toward explaining the dependent variable, and then iterates the process.
Additionally, a check is performed after each such step to see whether one of the variables has
now become irrelevant because of its relationship to the other variables. If so, this variable is
removed.
Block inclusion
There are often variables that should be included in the model in any case—for example, the
effect of a certain form of treatment, or independent variables that have already been found to be
relevant in prior studies. One way of taking such variables into account is their block inclusion
into the model. In this way, one can combine the forced inclusion of some variables with the
selective inclusion of further independent variables that turn out to be relevant to the explanation
of variation in the dependent variable.
The evaluation of a regression model requires the performance of both forward and backward
selection of variables. If these two procedures result in the selection of the same set of variables,
then the model can be considered robust. If not, a statistician should be consulted for further
advice.
The study of relationships between variables and the generation of risk scores are very important
elements of medical research. The proper performance of regression analysis requires that a
number of important factors should be considered and tested:
1. Causality
Before a regression analysis is performed, the causal relationships among the variables to be
considered must be examined from the point of view of their content and/or temporal
relationship. The fact that an independent variable turns out to be significant says nothing about
causality. This is an especially relevant point with respect to observational studies.
2. Number of cases
The number of cases needed for a regression analysis depends on the number of independent
variables and of their expected effects (strength of relationships). If the sample is too small, only
very strong relationships will be demonstrable. The sample size can be planned in the light of the
researchers’ expectations regarding the coefficient of determination (r2) and the regression
coefficient (b). Furthermore, at least 20 times as many observations should be made as there are
independent variables to be studied; thus, if one wants to study 2 independent variables, one
should make at least 40 observations.
3. Missing values
Missing values are a common problem in medical data. Whenever the value of either a
dependent or an independent variable is missing, this particular observation has to be excluded
from the regression analysis. If many values are missing from the dataset, the effective sample
size will be appreciably diminished, and the sample may then turn out to be too small to yield
significant findings, despite seemingly adequate advance planning. If this happens, real
relationships can be overlooked, and the study findings may not be generally applicable.
Moreover, selection effects can be expected in such cases. There are a number of ways to deal
with the problem of missing values.
4. The data sample
A further important point to be considered is the composition of the study population. If there are
subpopulations within it that behave differently with respect to the independent variables in
question, then a real effect (or the lack of an effect) may be masked from the analysis and remain
undetected. Suppose, for instance, that one wishes to study the effect of sex on weight, in a study
population consisting half of children under age 8 and half of adults. Linear regression analysis
over the entire population reveals an effect of sex on weight. If, however, a subgroup analysis is
performed in which children and adults are considered separately, an effect of sex on weight is
seen only in adults, and not in children. Subgroup analysis should only be performed if the
subgroups have been predefined, and the questions already formulated, before the data analysis
begins; furthermore, multiple testing should be taken into account.
If multiple independent variables are considered in a multivariable regression, some of these may
turn out to be interdependent. An independent variable that would be found to have a strong
effect in a univariable regression model might not turn out to have any appreciable effect in a
multivariable regression with variable selection. This will happen if this particular variable itself
depends so strongly on the other independent variables that it makes no additional contribution
toward explaining the dependent variable. For related reasons, when the independent variables
are mutually dependent, different independent variables might end up being included in the
model depending on the particular technique that is used for variable selection.
Overview
Linear regression is an important tool for statistical analysis. Its broad spectrum of uses includes
relationship description, estimation, and prognostication. The technique has many applications,
but it also has prerequisites and limitations that must always be considered in the interpretation
of findings (Box 5).
Box 5
r2 is the fraction of the overall variance that is explained. The closer the regression model’s
estimated values ŷi lie to the observed values yi, the nearer the coefficient of determination is to 1
and the more accurate the regression model is.
Meaning: In practice, the coefficient of determination is often taken as a measure of the validity
of a regression model or a regression estimate. It reflects the fraction of variation in the Y-values
that is explained by the regression line.
Problem: The coefficient of determination can easily be made artificially high by including a
large number of independent variables in the model. The more independent variables one
includes, the higher the coefficient of determination becomes. This, however, lowers the
precision of the estimate (estimation of the regression coefficients bi).
Solution: Instead of the raw (uncorrected) coefficient of determination, the corrected coefficient
of determination should be given: the latter takes the number of explanatory variables in the
model into account. Unlike the uncorrected coefficient of determination, the corrected one is
high only if the independent variables have a sufficiently large effect.
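For reference, a commonly used form of the corrected (adjusted) coefficient of determination, stated here as a sketch of the standard definition (n = number of observations, p = number of independent variables in the model):
r2_adj = 1 − (1 − r2) × (n − 1) / (n − p − 1)
Adding an irrelevant variable raises the raw r2 slightly, but it also increases p, so r2_adj rises only when the new variable improves the fit by more than would be expected by chance.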
We often measure more than one variable for each individual. For example, we measure precipitation
and plant growth, or number of young with nesting habitat, or soil erosion and volume of water.
We collect pairs of data and instead of examining each variable separately (univariate data), we
want to find ways to describe bivariate data, in which two variables are measured on each
subject in our sample. Given such data, we begin by determining if there is a relationship
between these two variables. As the values of one variable change, do we see corresponding
changes in the other variable?
We can describe the relationship between these two variables graphically and numerically. We
begin by considering the concept of correlation.
Correlation is defined as the statistical association between two variables.
A correlation exists between two variables when one of them is related to the other in some way.
A scatterplot is the best place to start. A scatterplot (or scatter diagram) is a graph of the paired
(x, y) sample data with a horizontal x-axis and a vertical y-axis. Each individual (x, y) pair is
plotted as a single point.
In this example, we plot bear chest girth (y) against bear length (x). When examining a
scatterplot, we should study the overall pattern of the plotted points. In this example, we see that
the value for chest girth does tend to increase as the value of length increases. We can see an
upward slope and a straight-line pattern in the plotted data points.
A scatterplot can identify several different types of relationships between two variables.
A relationship has no correlation when the points on a scatterplot do not show any
pattern.
A relationship is non-linear when the points on a scatterplot follow a pattern but not a
straight line.
A relationship is linear when the points on a scatterplot follow a somewhat straight line
pattern. This is the relationship that we will examine.
Linear relationships can be either positive or negative. Positive relationships have points that
incline upwards to the right. As x values increase, y values increase. As x values
decrease, y values decrease. For example, when studying plants, height typically increases as
diameter increases.
Figure 2. Scatterplot of height versus diameter.
Negative relationships have points that decline downward to the right. As x values
increase, y values decrease. As x values decrease, y values increase. For example, as wind speed
increases, wind chill temperature decreases.
Non-linear relationships have an apparent pattern, just not linear. For example, as age increases
height increases up to a point then levels off after reaching a maximum height.
Because visual examinations are largely subjective, we need a more precise and objective
measure to define the correlation between the two variables. To quantify the strength and
direction of the relationship between two variables, we use the linear correlation coefficient:
r = [1 / (n − 1)] × Σ [ (xi − x̄) / sx ] × [ (yi − ȳ) / sy ]
where x̄ and sx are the sample mean and sample standard deviation of the x’s, and ȳ and sy are
the mean and standard deviation of the y’s. The sample size is n.
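A minimal numpy sketch of this formula with made-up data, standardizing each variable and averaging the products of the z-scores:

import numpy as np

# Made-up paired sample
x = np.array([4.0, 5.0, 6.5, 7.0, 8.5])
y = np.array([2.1, 2.4, 3.0, 3.2, 3.9])

n = len(x)
zx = (x - x.mean()) / x.std(ddof=1)   # (xi - x̄) / sx
zy = (y - y.mean()) / y.std(ddof=1)   # (yi - ȳ) / sy
r = np.sum(zx * zy) / (n - 1)
print(r, np.corrcoef(x, y)[0, 1])     # same value from numpy's built-in corrcoef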
The linear correlation coefficient is also referred to as Pearson’s product moment correlation
coefficient in honor of Karl Pearson, who originally developed it. This statistic numerically
describes how strong the straight-line or linear relationship is between the two variables and the
direction, positive or negative.
It is a unitless measure so “r” would be the same value whether you measured the two
variables in pounds and inches or in grams and centimeters.
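A quick numerical check of this unit invariance with made-up measurements, converting pounds and inches to kilograms and centimeters:

import numpy as np

height_in = np.array([63.0, 65.5, 68.0, 70.0, 74.0])        # inches
weight_lb = np.array([120.0, 135.0, 150.0, 160.0, 190.0])   # pounds

r_imperial = np.corrcoef(height_in, weight_lb)[0, 1]
r_metric = np.corrcoef(height_in * 2.54, weight_lb * 0.4536)[0, 1]   # centimeters, kilograms
print(r_imperial, r_metric)   # identical values: r does not depend on the units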
Figure 7. Examples of negative correlation.
Correlation is not causation!!! Just because two variables are correlated does not mean
that one variable causes another variable to change.
Examine these next two scatterplots. Both of these data sets have an r = 0.01, but they are very
different. Plot 1 shows little linear relationship between x and y variables. Plot 2 shows a strong
non-linear relationship. Pearson’s linear correlation coefficient only measures the strength and
direction of a linear relationship. Ignoring the scatterplot could result in a serious mistake when
describing the relationship between two variables.
When you investigate the relationship between two variables, always begin with a scatterplot.
This graph allows you to look for patterns (both linear and non-linear). The next step is to
quantitatively describe the strength and direction of the linear relationship using “r”. Once you
have established that a linear relationship exists, you can take the next step in model building.
Once we have identified two variables that are correlated, we would like to model this
relationship. We want to use one variable as a predictor or explanatory variable to explain the
other variable, the response or dependent variable. In order to do this, we need a good
relationship between our two variables. The model can then be used to predict changes in our
response variable. A strong relationship between the predictor variable and the response variable
leads to a good model.
Our model will take the form of ŷ = b0 + b1x, where b0 is the y-intercept, b1 is the slope, x is
the predictor variable, and ŷ is an estimate of the mean value of the response variable for any
value of the predictor variable.
The y-intercept is the predicted value for the response (y) when x = 0. The slope describes the
change in y for each one-unit change in x. Let’s look at an example to clarify the interpretation
of the slope and intercept.
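As a small made-up illustration in the spirit of the plant example above: suppose the fitted line for predicting plant height (cm) from stem diameter (cm) is ŷ = 5.0 + 3.2x. The intercept 5.0 is the predicted height when the diameter is 0 (a value that may lie outside the observed range and should not be over-interpreted), and the slope 3.2 means that each additional centimeter of diameter is associated with a predicted increase of 3.2 cm in height.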
What Is Correlation?
In general terms, correlation is a measure of the degree to which two variables change
simultaneously. The correlation coefficient describes the strength and direction of the
relationship.
You can calculate the correlation between two variables in a pandas DataFrame with the corr()
function. Let's look at the correlation between the height and the weight in the school dataset.
import pandas as pd
# Load the School dataset
df = pd.read_csv('school.csv')
# Apply the corr() function to the dataframe
# and select the weight and height correlation
r = df.corr()['height']['weight']
print("Correlation between height and weight is {:.2f}".format( r ))
Correlation between height and weight is 0.77
To sum up:
The correlation coefficient is always a number between -1 and 1.
Positive correlation (0 < r ≤ 1) means that as one variable increases, the other also
increases. They move in the same direction.
Negative correlation (−1 ≤ r < 0) means that as one variable increases, the other
decreases. They move in opposite directions.
A zero correlation (r = 0) means that there is no linear relationship between the two
variables.
df = pd.read_csv('school.csv')
# Apply the corr() function to the dataframe
print(df.corr())
age height weight
age 1.000000 0.648857 0.634636
height 0.648857 1.000000 0.774876
weight 0.634636 0.774876 1.000000
You can see that all the variables (age, weight, and height) are positively correlated; as one
increases, the other two variables also increase. You also see that weight is more correlated with
height than it is with age (r = 0.77 vs r = 0.63), which makes sense, as physical growth is a more
important factor of weight gain than age.
The Auto-mpg dataset shows a different type of result. Let's look at which variables are
correlated with the mpg fuel efficiency variable.
df = pd.read_csv('auto-mpg.csv')
print(df.corr()['mpg'])
mpg 1.000000
cylinders -0.777618
displacement -0.805127
horsepower -0.778427
weight -0.832244
acceleration 0.423329
year 0.580541
origin 0.565209
As you can see, four variables are negatively correlated with mpg, and three are positively
correlated to a lesser degree. Also, notice that weight is the variable most strongly (negatively)
associated with fuel efficiency (r = -0.83), and that fuel efficiency increased over the years from
1970 to 1982 (r = 0.58).
Math Time!
Mathematically speaking, and just to illustrate how the correlation is calculated, the Pearson
correlation between two variables x = {x0, x1, …, xN−1} and y = {y0, y1, …, yN−1} of N samples
each is defined as:
rxy = Σ (xi − x̄)(yi − ȳ) / [ √(Σ (xi − x̄)²) × √(Σ (yi − ȳ)²) ], with all sums running over i = 0, …, N−1
where x̄ (resp. ȳ) is the mean of x: x̄ = (1/N) Σ xi
The formula above can be interpreted as the ratio between the covariance of x and y (the numerator) and the product of their individual spreads (the denominator).
Pearson is the default correlation. It is relevant when there is a linear relationship between the
variables, and the variables follow a normal distribution. The good news is that the Pearson
correlation is very robust concerning these two assumptions. If your variables are not strictly
linearly related or if their distribution does not exactly follow a normal distribution, the Pearson
correlation coefficient is still very often reliable.
However, if your data is nonlinear, as the variables in the two plots below, then it makes more
sense to use the Spearman correlation.
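In pandas this only requires passing a different method to corr(); a minimal sketch, reusing the school dataset and column names from above:

import pandas as pd

df = pd.read_csv('school.csv')
# Rank-based Spearman correlation instead of the default Pearson
print(df.corr(method='spearman')['height']['weight'])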
Correlation analysis is used to quantify the degree to which two variables are related, based on a
set of paired observations. You evaluate the correlation coefficient, which tells you how much one
variable changes when the other variable changes. Correlation analysis can show you a linear
relationship between variables.
The possible values of the correlation coefficient or r can range from -1 (when there is a perfect
negative correlation) to + 1 (when there is a perfect positive correlation). The closer the r values
are to 0, the weaker the correlation (either negative or positive).
As an example, we can consider weight and diastolic BP. We can document weight in kilograms
as a continuous variable and diastolic BP in mm Hg as a continuous variable. We can explore if
weight and diastolic BP are correlated. If diastolic BP decreases as weight increases, we might
find a perfect negative correlation: higher weight leading to lower diastolic BP. If there is a unit
increase in diastolic BP as weight increases, we might find a perfect positive correlation: higher
weights leading to higher diastolic BP. Pragmatically, perfect negative or perfect positive
correlations are rare; what we usually get is somewhere in between. If weight and diastolic
BP show no pattern of relation, then we may find an r value closer to 0. Correlation does not imply
causation.
Pearson’s r is probably one of the most frequently used measures of agreement for continuous
variables in the biomedical literature and is also one of the least appropriate tests to do.
Pearson’s r
The variables that are considered for Pearson’s correlation analysis preferably have a continuous
structure.
It is an index of linear association but does not necessarily mean good agreement. It is insensitive
to systematic differences between two observers or readings.
The value of r is sensitive to the range of values and is usually higher when the spread of values
is higher. Pearson’s r is very sensitive to extreme values (outliers) which can change the r values
significantly.
The r can give you an idea of the strength (weak, strong) and direction (positive, negative, none)
of the relationship between the two variables.
To use Pearson’s r, the variables should be continuous, approximately normally distributed, and
linearly related.
Spearman’s coefficient (Spearman’s ρ), by contrast, requires only a monotonic relationship: as the
value of one variable increases, the value of the other variable consistently increases (or
consistently decreases), though not necessarily at a constant rate, whereas in a linear relationship
the rate of increase/decrease is constant.
Spearman’s ρ can thus be used both for a linear and for a non-linear relationship.
Spearman’s ρ can be used with data that is normally distributed and with data that is not
normally distributed.
Spearman’s ρ works with rank-ordered variables rather than raw data values, and hence it
measures the strength and direction of the monotonic relationship between the two ranked or
ordered variables.
Spearman’s ρ is less affected by outliers and hence can be used even in the presence of outlier
values.
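A small sketch with made-up data showing this difference, using scipy's pearsonr and spearmanr:

import numpy as np
from scipy import stats

# Made-up data: a clear increasing pattern plus one extreme outlier at the end
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
y = np.array([2, 4, 5, 7, 8, 11, 12, 14, 15, 5])

print(stats.pearsonr(x, y))    # Pearson's r is pulled strongly by the single outlier
print(stats.spearmanr(x, y))   # the rank-based coefficient is affected much less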
There are two main types of correlation coefficients: Pearson's product moment correlation
coefficient and Spearman's rank correlation coefficient. The correct usage of correlation
coefficient type depends on the types of variables being studied. We will focus on these two
correlation types; other types are based on these and are often used when multiple variables are
being considered.
Pearson's product moment correlation coefficient is denoted as ϱ for a population parameter and
as r for a sample statistic. It is used when both variables being studied are normally distributed.
This coefficient is affected by extreme values, which may exaggerate or dampen the strength of
relationship, and is therefore inappropriate when either or both variables are not normally
distributed. For a correlation between variables x and y, the formula for calculating the sample
Pearson's correlation coefficient is given by3:
r = Σ (xi − x̄)(yi − ȳ) / √[ Σ (xi − x̄)² × Σ (yi − ȳ)² ]
Spearman's rank correlation coefficient is denoted as ϱs for a population parameter and as rs for
a sample statistic. It is appropriate when one or both variables are skewed or ordinal1 and is
robust when extreme values are present. For a correlation between variables x and y, the formula
for calculating the sample Spearman's correlation coefficient is given by:
rs = 1 − 6 Σ di² / [ n (n² − 1) ], where di is the difference between the ranks of xi and yi
The distinction between Pearson's and Spearman's correlation coefficients in applications will be
discussed using examples below.
The data depicted in figures 1–4 were simulated from a bivariate normal distribution of 500
observations with means 2 and 3 for the variables x and y respectively. The standard deviations
were 0.5 for x and 0.7 for y. Scatter plots were generated for the correlations 0.2, 0.5, 0.8 and
−0.8.
Fig. 1
Scatterplot of x and y: Pearson's correlation=0.2
Fig. 4
Scatterplot of x and y: Pearson's correlation=−0.80
In Fig. 1, the scatter plot shows some linear trend, but the trend is not as clear as that of Fig. 2.
The trend in Fig. 3 is clearly seen and the points are not as scattered as those of Figs. 1 and 2.
That is, the higher the correlation in either direction (positive or negative), the more linear the
association between two variables and the more obvious the trend in a scatter plot.
For Figures 3 and 4, the strength of the linear relationship is the same for the variables in
question but the direction is different. In Figure 3, the values of y increase as the values of x
increase, while in Figure 4 the values of y decrease as the values of x increase.
Fig. 2
Scatterplot of x and y: Pearson's correlation=0.50
Fig. 3
Scatterplot of x and y: Pearson's correlation=0.80
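A minimal sketch of how data like those in Fig. 3 could be simulated with numpy, here for the case of correlation 0.8 (means 2 and 3, standard deviations 0.5 and 0.7, 500 observations):

import numpy as np

rng = np.random.default_rng(1)
sd_x, sd_y, rho = 0.5, 0.7, 0.8
cov = [[sd_x**2, rho * sd_x * sd_y],
       [rho * sd_x * sd_y, sd_y**2]]        # covariance matrix implied by the correlation
xy = rng.multivariate_normal(mean=[2, 3], cov=cov, size=500)
print(np.corrcoef(xy[:, 0], xy[:, 1])[0, 1])   # sample correlation, close to 0.8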
import matplotlib.pyplot as plt
from scipy import stats
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
# Fit a line and compute the predicted y-value for every x
slope, intercept, r, p, std_err = stats.linregress(x, y)
def myfunc(x):
    return slope * x + intercept
mymodel = list(map(myfunc, x))
plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()
O/p
Explanation:
Import the modules you need.
import matplotlib.pyplot as plt
from scipy import stats
Create the arrays that represent the values of the x and y axis:
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
Execute a method that returns some important key values of Linear Regression:
slope, intercept, r, p, std_err = stats.linregress(x, y)
Create a function that uses the slope and intercept values to return a new value. This new value
represents where on the y-axis the corresponding x value will be placed:
def myfunc(x):
return slope * x + intercept
Run each value of the x array through the function. This will result in a new array with new
values for the y-axis:
mymodel = list(map(myfunc, x))
Draw the original scatter plot:
plt.scatter(x, y)
Draw the line of linear regression:
plt.plot(x, mymodel)
Display the diagram:
plt.show()
R for Relationship
It is important to know how strong the relationship between the values of the x-axis and the
values of the y-axis is; if there is no relationship, linear regression cannot be used to predict
anything.
This relationship - the coefficient of correlation - is called r.
The r value ranges from -1 to 1, where 0 means no relationship, and 1 (and -1) means a perfect
(100%) linear relationship.
Python and the Scipy module will compute this value for you, all you have to do is feed it with
the x and y values.
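Continuing the linregress example above, r is the third value returned by stats.linregress, so it can simply be printed:
print(r)   # correlation coefficient computed by stats.linregress(x, y) above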
Correlation:
Explanation:
Import the numpy library and define a custom dataset x and y of equal length. The correlation can
then be computed with a small helper function; the Pearson_correlation function below is a sketch
built from the Pearson formula given earlier:
# Import the numpy library
import numpy as np
# x and y are assumed to be two numpy arrays of equal length holding the paired
# observations (the dataset definition is omitted here)
def Pearson_correlation(X, Y):
    # numerator: sum of products of deviations from the means
    num = np.sum((X - X.mean()) * (Y - Y.mean()))
    # denominator: product of the spreads of X and Y
    den = np.sqrt(np.sum((X - X.mean())**2) * np.sum((Y - Y.mean())**2))
    return num / den
print(Pearson_correlation(x, y))
print(Pearson_correlation(x, x))
OUTPUT
0.974894414261588
1.0
The above output shows that the correlation between x and y is 0.974894414261588, while the
correlation of x with itself is exactly 1.0.